Skip to content

Demo Datasets - Comprehensive Research Analytics

Explore Citation Compass capabilities using curated academic datasets. Demo mode provides realistic research data with offline functionality, useful for learning, testing, and demonstrating analytics features.

🎯 Demo Datasets Overview

Demo datasets offer immediate access to powerful research analytics without requiring database setup, API keys, or data imports. Experience the complete platform with realistic academic data spanning multiple research fields.

Home dashboard (demo entry points)

Why Use Demo Datasets?

  • Zero configuration - Works immediately out of the box
  • Full offline functionality - No internet required after initial setup
  • Complete feature access - All platform capabilities available
  • Realistic performance - Response times match production systems
  • Learn ML concepts - See how citation prediction models work
  • Understand networks - Explore real academic collaboration patterns
  • Practice analysis - Master workflows with guided examples
  • Benchmark performance - Compare different analysis approaches
  • Validate workflows - Test analysis pipelines before using your data
  • Prototype studies - Design research with realistic constraints
  • Train users - Onboard team members with safe practice environment
  • Demonstrate capabilities - Show stakeholders platform potential

πŸ“Š Available Datasets

Complete Demo Dataset

The flagship demonstration dataset with comprehensive academic research data:

πŸ“ˆ Dataset Statistics: - 13 high-impact papers carefully selected across multiple domains - 34 citation relationships showing realistic academic networks - 47 researchers demonstrating collaboration patterns - 7 research fields with cross-disciplinary connections - 16-year timespan (2009-2024) showing research evolution

πŸ”¬ Research Domains: - πŸ€– Machine Learning: Foundational papers including "Attention Is All You Need" - 🧠 Neuroscience: Brain imaging, neural networks, and cognitive studies - βš›οΈ Physics: Quantum computing and computational physics advances - πŸ₯ Medical Informatics: Healthcare AI and medical imaging research
- πŸ‘οΈ Computer Vision: Image recognition and deep learning breakthroughs - πŸ€– Robotics: Autonomous systems and intelligent control - 🧠 Psychology: Cognitive science and behavioral research

πŸ’« Network Characteristics: - Cross-field citations between related domains (ML ↔ Computer Vision) - Temporal patterns showing how newer papers build on foundational work - Collaboration networks revealing author research connections - Impact distributions from high-cited foundational papers to emerging research

Minimal Demo Dataset

Quick testing dataset for rapid exploration and development:

πŸ“Š Compact Statistics: - 5 essential papers covering key research areas - 5 citation relationships demonstrating basic network structure - 22 researchers showing collaboration patterns - 3 research fields with focused domain coverage - Perfect for: Quick demos, testing, feature validation

Quick Fixtures

Specialized mini-datasets for targeted testing scenarios:

Available Fixtures: - minimal_network: 3 papers, 2 citations - Basic network structure - collaboration_network: Focus on author collaboration patterns - temporal_network: Time-based citation evolution examples - cross_field_network: Inter-disciplinary research connections

🎭 Accessing Demo Datasets

Via Streamlit Interface

Step 1: Navigate to Demo Datasets

streamlit run app.py
1. Open sidebar menu (click hamburger icon) 2. Select "Demo Datasets" from navigation 3. View available datasets with detailed statistics

Step 2: Explore Dataset Information

Each dataset shows: - πŸ“Š Statistics: Papers, citations, authors, venues count - πŸ“… Time Range: Publication year span - 🏷️ Fields: Research domains included - ⏱️ Load Time: Expected loading duration - πŸ’Ύ Memory: Estimated memory usage

Step 3: Load Dataset

  1. Click "Load Dataset" button for your chosen dataset
  2. Monitor loading progress (typically 2-3 seconds)
  3. Confirm successful load with status indicator
  4. Begin exploring all platform features

Via Python API

Direct Dataset Loading

from src.data.demo_loader import DemoDataLoader

# Load complete demo dataset
loader = DemoDataLoader()
demo_data = loader.load_complete_demo()

# Access loaded data
papers = demo_data.get_papers()
citations = demo_data.get_citations()
authors = demo_data.get_authors()

print(f"Loaded {len(papers)} papers with {len(citations)} citations")

Custom Dataset Selection

from src.data.fixtures import get_fixture_data

# Load specific fixture
minimal_data = get_fixture_data('minimal_network')
temporal_data = get_fixture_data('temporal_network')

# Load complete demo with configuration
complete_demo = get_fixture_data('complete_demo')

Via Command Line

# Test demo dataset loading
python -c "from src.data.demo_loader import DemoDataLoader; loader = DemoDataLoader(); print('Demo datasets available:', loader.list_available_datasets())"

# Load and validate demo data
python -c "from src.data.demo_loader import DemoDataLoader; loader = DemoDataLoader(); data = loader.load_complete_demo(); print(f'Loaded {len(data.papers)} papers successfully')"

πŸ€– Demo ML Capabilities

Demo datasets include a sophisticated ML service that works entirely offline while providing realistic results:

Synthetic Embeddings

Realistic vector representations that demonstrate ML concepts:

  • Field-aware clustering - Papers cluster by research domain
  • Semantic similarity - Related papers have similar embeddings
  • Dimensional structure - High-dimensional spaces with meaningful patterns
  • Compatible with TransE - Works with existing ML infrastructure

Example Embedding Exploration:

from src.services.demo_service import get_demo_ml_service

ml_service = get_demo_ml_service()

# Get paper embeddings
paper_id = "649def34f8be52c8b66281af98ae884c09aef38f9"  # Attention Is All You Need
embedding = ml_service.get_paper_embedding(paper_id)

# Find similar papers
similar_papers = ml_service.find_similar_papers(paper_id, top_k=5)
print(f"Papers similar to Attention paper: {similar_papers}")

Intelligent Citation Predictions

Realistic prediction algorithms following academic patterns:

Temporal Intelligence: - Newer papers more likely to cite foundational work - Recent papers cite contemporary research - Classic papers continue being referenced over time

Field Relationships: - ML papers frequently cite other ML research - Cross-field citations between related domains (ML ↔ Computer Vision) - Interdisciplinary connections reflect real research patterns

Impact Weighting: - Highly-cited papers receive more predictions - Foundational papers cited across multiple fields - Quality scores influence citation probability

Example Prediction Usage:

from src.services.demo_service import get_demo_ml_service

ml_service = get_demo_ml_service()

# Predict citations for a paper
paper_id = "204e3073870fae3d05bcbc2f6a8e263d9b72e776"  # BERT paper
predictions = ml_service.predict_citations(paper_id, top_k=10)

for pred in predictions:
    print(f"Paper: {pred['target_id']}, Confidence: {pred['confidence']:.3f}")

Confidence Scoring

Realistic confidence metrics for prediction reliability:

  • Range: 0.1-0.9 matching real-world ML model outputs
  • Distribution: Higher confidence for same-field predictions
  • Uncertainty: Lower confidence for cross-field or novel connections
  • Interpretability: Scores correlate with prediction likelihood

πŸ•ΈοΈ Network Analysis Features

Demo datasets enable comprehensive network analysis with realistic academic patterns:

Community Detection

Identify research clusters within the citation network:

Available Algorithms: - Louvain method - Optimize modularity for community structure - Label propagation - Fast community discovery - Girvan-Newman - Hierarchical community detection - Leiden algorithm - High-quality community partitions

Real-world Patterns: - Research fields form natural communities - Cross-field bridges between related domains - Author collaboration clusters - Temporal community evolution

Example Analysis:

from src.services.analytics_service import get_analytics_service

analytics = get_analytics_service()

# Detect communities in demo network
communities = analytics.detect_communities()
print(f"Found {len(communities)} research communities")

# Analyze community characteristics
for i, community in enumerate(communities):
    print(f"Community {i}: {len(community)} papers")
    # Show dominant research fields in each community

Centrality Analysis

Identify influential papers and authors using network metrics:

Centrality Measures: - Degree centrality - Direct citation connections - Betweenness centrality - Bridge papers connecting different areas
- Eigenvector centrality - Citations from other high-impact papers - PageRank - Academic influence propagation

Research Insights: - Foundational papers show high centrality across all measures - Bridge papers connect different research communities - Recent breakthrough papers gain centrality over time - Author centrality reveals research leaders

Temporal Dynamics

Analyze research evolution over the 16-year dataset timespan:

Temporal Patterns: - Citation accumulation - How papers gain citations over time - Field emergence - New research areas in the network - Knowledge flow - How ideas spread between domains - Collaboration evolution - Changing author network patterns

Time-based Analysis:

from src.services.analytics_service import get_analytics_service

analytics = get_analytics_service()

# Analyze temporal citation patterns
temporal_stats = analytics.get_temporal_citation_patterns()
print(f"Citation patterns over {temporal_stats['years_span']} years")

# Track field emergence
field_evolution = analytics.analyze_field_evolution()
print("Research field development over time")

🎨 Visualization Capabilities

Demo datasets support rich interactive visualizations for exploration:

Network Graphs

Interactive citation networks with full functionality:

Features: - Clickable nodes - Explore individual papers - Dynamic filtering - Filter by field, year, citation count - Zoom and pan - Navigate large networks smoothly - Highlighting - Trace citation paths and relationships

Customization Options: - Node sizing - Scale by citation count or impact - Color coding - Research fields, publication years, or communities - Edge styling - Citation relationships with directional arrows - Layout algorithms - Force-directed, hierarchical, circular layouts

Embedding Visualizations

Explore paper relationships in high-dimensional embedding space:

Visualization Types: - 2D projections - t-SNE and UMAP dimensionality reduction - 3D explorations - Interactive 3D embedding spaces - Cluster highlighting - Research field boundaries - Similarity mapping - Distance-based relationship exploration

Statistical Charts

Comprehensive analytics dashboards with interactive plots:

Chart Types: - Citation distributions - Histograms and box plots - Temporal trends - Time series of research activity - Field comparisons - Cross-domain analytics - Network metrics - Centrality and clustering visualizations

πŸ“ˆ Analytics & Export

Demo datasets enable complete analytics workflows with publication-ready outputs:

Statistical Analysis

Comprehensive network statistics for research insights:

Network Metrics: - Global statistics - Density, clustering coefficient, diameter - Node-level metrics - Individual paper/author importance - Community analysis - Research cluster characteristics - Temporal dynamics - Evolution patterns over time

Export Capabilities

Publication-ready outputs in multiple formats:

Academic Exports: - LaTeX tables - Camera-ready for academic publications - Citation networks - GraphML and DOT formats for further analysis - Statistical summaries - CSV and JSON for data analysis - Visualization exports - High-resolution PNG/SVG/PDF formats

Report Generation:

from src.services.analytics_service import get_analytics_service

analytics = get_analytics_service()

# Generate comprehensive network report
report = analytics.generate_network_report()

# Export to LaTeX table
latex_table = analytics.export_latex_table(report)
print("LaTeX table ready for publication")

# Save visualizations
analytics.save_network_visualization("demo_network.pdf")

Research Insights

AI-powered analysis with academic context:

Insight Types: - Performance benchmarking - Compare against academic standards - Traffic light indicators - Quick quality assessment
- Research recommendations - Suggested analysis directions - Cross-field discoveries - Unexpected connection identification

πŸ”„ Transitioning to Production

Demo datasets provide seamless transition to production use:

From Demo to Real Data

When ready for your own research data:

  1. Master workflows using demo datasets first
  2. Upload your data using file upload
  3. Apply learned techniques to your research domain
  4. Scale analysis methods to larger datasets
  5. Train custom models with your domain-specific data

Workflow Preservation

Consistent interface ensures smooth transition: - Same API calls work with demo and production data - Identical analysis methods apply to any dataset size - Export formats remain consistent across modes - Visualization tools scale to larger networks

Performance Expectations

Realistic performance helps plan production deployments: - Response times similar to production database queries - Memory usage scales predictably with dataset size - Analysis complexity matches computational requirements - Export speeds reflect real-world processing times

πŸ’‘ Best Practices

Learning Strategies

  1. Start with complete_demo - Full feature exploration
  2. Try all analysis types - Network, ML, temporal analysis
  3. Practice exports - Learn report generation workflows
  4. Test edge cases - Understand limitations and error handling
  5. Experiment with parameters - See how settings affect results
  1. Map to your domain - Find parallels with your research area
  2. Practice workflows - Master analysis pipelines before using real data
  3. Understand metrics - Learn interpretation of network statistics
  4. Test export formats - Ensure compatibility with your publication workflows
  5. Validate assumptions - Confirm analysis approaches with demo results
  1. Study API patterns - Learn efficient usage of platform APIs
  2. Test integrations - Practice connecting to external systems
  3. Monitor performance - Understand computational requirements
  4. Debug workflows - Practice troubleshooting with known data
  5. Extend functionality - Use demo data for testing new features

Advanced Usage

Power user techniques for maximum demo value:

Custom Analysis Scripts

from src.data.demo_loader import DemoDataLoader
from src.services.analytics_service import get_analytics_service

# Load demo data
loader = DemoDataLoader() 
demo_data = loader.load_complete_demo()

# Perform custom analysis
analytics = get_analytics_service()

# Analyze cross-field citations
cross_field_analysis = analytics.analyze_cross_field_patterns()

# Export results
analytics.export_analysis_report(cross_field_analysis, "cross_field_report.json")

Batch Processing Demo

from src.data.fixtures import get_all_fixtures

# Test analysis on all demo datasets
fixtures = get_all_fixtures()

for name, data in fixtures.items():
    print(f"Analyzing {name}...")
    # Run analysis pipeline on each fixture
    results = run_analysis_pipeline(data)
    print(f"Results: {results}")

🚨 Demo Limitations

Understanding demo constraints helps set realistic expectations:

Data Scope Limitations

  • Small networks - 13 papers vs thousands in production
  • Limited timespan - 2009-2024 vs longer historical periods
  • Focused domains - AI/ML emphasis vs broader academic coverage
  • Synthetic elements - Some relationships constructed for demonstration

Scalability Considerations

  • Network size - Cannot test large-scale analysis (10k+ papers)
  • Community complexity - Fewer research clusters than real networks
  • Memory patterns - Won't reveal large dataset memory requirements
  • Performance limits - Cannot test high-volume processing scenarios

ML Model Constraints

  • Training limitations - Cannot train new models with demo data
  • Fixed embeddings - Synthetic embeddings don't improve with training
  • Parameter testing - Limited hyperparameter exploration capabilities
  • Validation scope - Cannot test model performance improvements

πŸ” Troubleshooting Demo Mode

Common Issues

Dataset won't load: - βœ… Refresh the Streamlit page - βœ… Check browser console for JavaScript errors - βœ… Ensure adequate browser memory - βœ… Try loading smaller minimal_demo dataset first

Slow loading performance: - βœ… Close other browser tabs and applications - βœ… Clear browser cache and cookies - βœ… Try in private/incognito browser mode - βœ… Check system memory availability

ML predictions failing: - βœ… Ensure demo dataset is fully loaded - βœ… Check that demo ML service is active - βœ… Try restarting the Streamlit application - βœ… Test with different paper IDs from the dataset

Network analysis errors: - βœ… Verify dataset has citation relationships - βœ… Check that analytics service is initialized - βœ… Try different analysis parameters - βœ… Monitor memory usage during computation

Export generation failing: - βœ… Ensure analysis completed successfully - βœ… Check file permissions in output directory - βœ… Try different export formats - βœ… Monitor disk space availability

Performance Optimization

  • Use minimal_demo for initial testing
  • Close unused browser tabs to free memory
  • Restart Streamlit if performance degrades
  • Monitor system resources during analysis

πŸŽ‰ Success Stories

Demo datasets have enabled users to:

  • Master platform workflows before importing 10,000+ research papers
  • Prototype research studies with realistic academic network patterns
  • Train research teams on advanced analytics techniques
  • Demonstrate capabilities to institutional stakeholders
  • Develop custom integrations with known-good test data
  • Validate analysis approaches before applying to sensitive datasets

πŸ”— Next Steps

After mastering demo datasets:

  1. Upload your research data using file import
  2. Configure production database for large-scale analysis
  3. Train custom ML models with your domain data
  4. Generate research reports for publications
  5. Explore advanced analytics with larger networks

Getting Started: - Demo Mode - Quick Start - Immediate exploration guide - Interactive Features - Complete web interface guide
- Quick Start Guide - Demo-first workflow

Next Steps: - File Upload - Import your research collections - Data Import - Advanced import pipeline features - ML Predictions - Citation prediction capabilities - Network Analysis - Advanced graph analysis

Ready to explore? Load the complete demo dataset and discover the power of academic research analytics! πŸš€

The demo datasets provide a comprehensive introduction to citation network analysis, ML-powered research discovery, and publication-ready research insights - all without any setup requirements.