Demo Datasets - Comprehensive Research Analytics¶

Explore Citation Compass capabilities using curated academic datasets. Demo mode provides realistic research data with offline functionality, useful for learning, testing, and demonstrating analytics features.

🎯 Demo Datasets Overview¶

Demo datasets offer immediate access to powerful research analytics without requiring database setup, API keys, or data imports. Experience the complete platform with realistic academic data spanning multiple research fields.

Home dashboard (demo entry points)

Why Use Demo Datasets?¶

🚀 Instant Access📚 Educational Value🔬 Research Applications

Zero configuration - Works immediately out of the box
Full offline functionality - No internet required after initial setup
Complete feature access - All platform capabilities available
Realistic performance - Response times match production systems

Learn ML concepts - See how citation prediction models work
Understand networks - Explore real academic collaboration patterns
Practice analysis - Master workflows with guided examples
Benchmark performance - Compare different analysis approaches

Validate workflows - Test analysis pipelines before using your data
Prototype studies - Design research with realistic constraints
Train users - Onboard team members with safe practice environment
Demonstrate capabilities - Show stakeholders platform potential

📊 Available Datasets¶

Complete Demo Dataset¶

The flagship demonstration dataset with comprehensive academic research data:

📈 Dataset Statistics: - 13 high-impact papers carefully selected across multiple domains - 34 citation relationships showing realistic academic networks - 47 researchers demonstrating collaboration patterns - 7 research fields with cross-disciplinary connections - 16-year timespan (2009-2024) showing research evolution

🔬 Research Domains: - 🤖 Machine Learning: Foundational papers including "Attention Is All You Need" - 🧠 Neuroscience: Brain imaging, neural networks, and cognitive studies - ⚛️ Physics: Quantum computing and computational physics advances - 🏥 Medical Informatics: Healthcare AI and medical imaging research
- 👁️ Computer Vision: Image recognition and deep learning breakthroughs - 🤖 Robotics: Autonomous systems and intelligent control - 🧠 Psychology: Cognitive science and behavioral research

💫 Network Characteristics: - Cross-field citations between related domains (ML ↔ Computer Vision) - Temporal patterns showing how newer papers build on foundational work - Collaboration networks revealing author research connections - Impact distributions from high-cited foundational papers to emerging research

Minimal Demo Dataset¶

Quick testing dataset for rapid exploration and development:

📊 Compact Statistics: - 5 essential papers covering key research areas - 5 citation relationships demonstrating basic network structure - 22 researchers showing collaboration patterns - 3 research fields with focused domain coverage - Perfect for: Quick demos, testing, feature validation

Quick Fixtures¶

Specialized mini-datasets for targeted testing scenarios:

Available Fixtures: - minimal_network: 3 papers, 2 citations - Basic network structure - collaboration_network: Focus on author collaboration patterns - temporal_network: Time-based citation evolution examples - cross_field_network: Inter-disciplinary research connections

🎭 Accessing Demo Datasets¶

Via Streamlit Interface¶

Step 1: Navigate to Demo Datasets¶

streamlit run app.py

1. Open sidebar menu (click hamburger icon) 2. Select "Demo Datasets" from navigation 3. View available datasets with detailed statistics

Step 2: Explore Dataset Information¶

Each dataset shows: - 📊 Statistics: Papers, citations, authors, venues count - 📅 Time Range: Publication year span - 🏷️ Fields: Research domains included - ⏱️ Load Time: Expected loading duration - 💾 Memory: Estimated memory usage

Step 3: Load Dataset¶

Click "Load Dataset" button for your chosen dataset
Monitor loading progress (typically 2-3 seconds)
Confirm successful load with status indicator
Begin exploring all platform features

Via Python API¶

Direct Dataset Loading¶

from src.data.demo_loader import DemoDataLoader

# Load complete demo dataset
loader = DemoDataLoader()
demo_data = loader.load_complete_demo()

# Access loaded data
papers = demo_data.get_papers()
citations = demo_data.get_citations()
authors = demo_data.get_authors()

print(f"Loaded {len(papers)} papers with {len(citations)} citations")

Custom Dataset Selection¶

from src.data.fixtures import get_fixture_data

# Load specific fixture
minimal_data = get_fixture_data('minimal_network')
temporal_data = get_fixture_data('temporal_network')

# Load complete demo with configuration
complete_demo = get_fixture_data('complete_demo')

Via Command Line¶

# Test demo dataset loading
python -c "from src.data.demo_loader import DemoDataLoader; loader = DemoDataLoader(); print('Demo datasets available:', loader.list_available_datasets())"

# Load and validate demo data
python -c "from src.data.demo_loader import DemoDataLoader; loader = DemoDataLoader(); data = loader.load_complete_demo(); print(f'Loaded {len(data.papers)} papers successfully')"

🤖 Demo ML Capabilities¶

Demo datasets include a sophisticated ML service that works entirely offline while providing realistic results:

Synthetic Embeddings¶

Realistic vector representations that demonstrate ML concepts:

Field-aware clustering - Papers cluster by research domain
Semantic similarity - Related papers have similar embeddings
Dimensional structure - High-dimensional spaces with meaningful patterns
Compatible with TransE - Works with existing ML infrastructure

Example Embedding Exploration:

from src.services.demo_service import get_demo_ml_service

ml_service = get_demo_ml_service()

# Get paper embeddings
paper_id = "649def34f8be52c8b66281af98ae884c09aef38f9"  # Attention Is All You Need
embedding = ml_service.get_paper_embedding(paper_id)

# Find similar papers
similar_papers = ml_service.find_similar_papers(paper_id, top_k=5)
print(f"Papers similar to Attention paper: {similar_papers}")

Intelligent Citation Predictions¶

Realistic prediction algorithms following academic patterns:

Temporal Intelligence: - Newer papers more likely to cite foundational work - Recent papers cite contemporary research - Classic papers continue being referenced over time

Field Relationships: - ML papers frequently cite other ML research - Cross-field citations between related domains (ML ↔ Computer Vision) - Interdisciplinary connections reflect real research patterns

Impact Weighting: - Highly-cited papers receive more predictions - Foundational papers cited across multiple fields - Quality scores influence citation probability

Example Prediction Usage:

from src.services.demo_service import get_demo_ml_service

ml_service = get_demo_ml_service()

# Predict citations for a paper
paper_id = "204e3073870fae3d05bcbc2f6a8e263d9b72e776"  # BERT paper
predictions = ml_service.predict_citations(paper_id, top_k=10)

for pred in predictions:
    print(f"Paper: {pred['target_id']}, Confidence: {pred['confidence']:.3f}")

Confidence Scoring¶

Realistic confidence metrics for prediction reliability:

Range: 0.1-0.9 matching real-world ML model outputs
Distribution: Higher confidence for same-field predictions
Uncertainty: Lower confidence for cross-field or novel connections
Interpretability: Scores correlate with prediction likelihood

🕸️ Network Analysis Features¶

Demo datasets enable comprehensive network analysis with realistic academic patterns:

Community Detection¶

Identify research clusters within the citation network:

Available Algorithms: - Louvain method - Optimize modularity for community structure - Label propagation - Fast community discovery - Girvan-Newman - Hierarchical community detection - Leiden algorithm - High-quality community partitions

Real-world Patterns: - Research fields form natural communities - Cross-field bridges between related domains - Author collaboration clusters - Temporal community evolution

Example Analysis:

from src.services.analytics_service import get_analytics_service

analytics = get_analytics_service()

# Detect communities in demo network
communities = analytics.detect_communities()
print(f"Found {len(communities)} research communities")

# Analyze community characteristics
for i, community in enumerate(communities):
    print(f"Community {i}: {len(community)} papers")
    # Show dominant research fields in each community

Centrality Analysis¶

Identify influential papers and authors using network metrics:

Centrality Measures: - Degree centrality - Direct citation connections - Betweenness centrality - Bridge papers connecting different areas
- Eigenvector centrality - Citations from other high-impact papers - PageRank - Academic influence propagation

Research Insights: - Foundational papers show high centrality across all measures - Bridge papers connect different research communities - Recent breakthrough papers gain centrality over time - Author centrality reveals research leaders

Temporal Dynamics¶

Analyze research evolution over the 16-year dataset timespan:

Temporal Patterns: - Citation accumulation - How papers gain citations over time - Field emergence - New research areas in the network - Knowledge flow - How ideas spread between domains - Collaboration evolution - Changing author network patterns

Time-based Analysis:

from src.services.analytics_service import get_analytics_service

analytics = get_analytics_service()

# Analyze temporal citation patterns
temporal_stats = analytics.get_temporal_citation_patterns()
print(f"Citation patterns over {temporal_stats['years_span']} years")

# Track field emergence
field_evolution = analytics.analyze_field_evolution()
print("Research field development over time")

🎨 Visualization Capabilities¶

Demo datasets support rich interactive visualizations for exploration:

Network Graphs¶

Interactive citation networks with full functionality:

Features: - Clickable nodes - Explore individual papers - Dynamic filtering - Filter by field, year, citation count - Zoom and pan - Navigate large networks smoothly - Highlighting - Trace citation paths and relationships

Customization Options: - Node sizing - Scale by citation count or impact - Color coding - Research fields, publication years, or communities - Edge styling - Citation relationships with directional arrows - Layout algorithms - Force-directed, hierarchical, circular layouts

Embedding Visualizations¶

Explore paper relationships in high-dimensional embedding space:

Visualization Types: - 2D projections - t-SNE and UMAP dimensionality reduction - 3D explorations - Interactive 3D embedding spaces - Cluster highlighting - Research field boundaries - Similarity mapping - Distance-based relationship exploration

Statistical Charts¶

Comprehensive analytics dashboards with interactive plots:

Chart Types: - Citation distributions - Histograms and box plots - Temporal trends - Time series of research activity - Field comparisons - Cross-domain analytics - Network metrics - Centrality and clustering visualizations

📈 Analytics & Export¶

Demo datasets enable complete analytics workflows with publication-ready outputs:

Statistical Analysis¶

Comprehensive network statistics for research insights:

Network Metrics: - Global statistics - Density, clustering coefficient, diameter - Node-level metrics - Individual paper/author importance - Community analysis - Research cluster characteristics - Temporal dynamics - Evolution patterns over time

Export Capabilities¶

Publication-ready outputs in multiple formats:

Academic Exports: - LaTeX tables - Camera-ready for academic publications - Citation networks - GraphML and DOT formats for further analysis - Statistical summaries - CSV and JSON for data analysis - Visualization exports - High-resolution PNG/SVG/PDF formats

Report Generation:

from src.services.analytics_service import get_analytics_service

analytics = get_analytics_service()

# Generate comprehensive network report
report = analytics.generate_network_report()

# Export to LaTeX table
latex_table = analytics.export_latex_table(report)
print("LaTeX table ready for publication")

# Save visualizations
analytics.save_network_visualization("demo_network.pdf")

Research Insights¶

AI-powered analysis with academic context:

Insight Types: - Performance benchmarking - Compare against academic standards - Traffic light indicators - Quick quality assessment
- Research recommendations - Suggested analysis directions - Cross-field discoveries - Unexpected connection identification

🔄 Transitioning to Production¶

Demo datasets provide seamless transition to production use:

From Demo to Real Data¶

When ready for your own research data:

Master workflows using demo datasets first
Upload your data using file upload
Apply learned techniques to your research domain
Scale analysis methods to larger datasets
Train custom models with your domain-specific data

Workflow Preservation¶

Consistent interface ensures smooth transition: - Same API calls work with demo and production data - Identical analysis methods apply to any dataset size - Export formats remain consistent across modes - Visualization tools scale to larger networks

Performance Expectations¶

Realistic performance helps plan production deployments: - Response times similar to production database queries - Memory usage scales predictably with dataset size - Analysis complexity matches computational requirements - Export speeds reflect real-world processing times

💡 Best Practices¶

Learning Strategies¶

🎓 For New Users🔬 For Researchers👨‍💻 For Developers

Start with complete_demo - Full feature exploration
Try all analysis types - Network, ML, temporal analysis
Practice exports - Learn report generation workflows
Test edge cases - Understand limitations and error handling
Experiment with parameters - See how settings affect results

Map to your domain - Find parallels with your research area
Practice workflows - Master analysis pipelines before using real data
Understand metrics - Learn interpretation of network statistics
Test export formats - Ensure compatibility with your publication workflows
Validate assumptions - Confirm analysis approaches with demo results

Study API patterns - Learn efficient usage of platform APIs
Test integrations - Practice connecting to external systems
Monitor performance - Understand computational requirements
Debug workflows - Practice troubleshooting with known data
Extend functionality - Use demo data for testing new features

Advanced Usage¶

Power user techniques for maximum demo value:

Custom Analysis Scripts¶

from src.data.demo_loader import DemoDataLoader
from src.services.analytics_service import get_analytics_service

# Load demo data
loader = DemoDataLoader() 
demo_data = loader.load_complete_demo()

# Perform custom analysis
analytics = get_analytics_service()

# Analyze cross-field citations
cross_field_analysis = analytics.analyze_cross_field_patterns()

# Export results
analytics.export_analysis_report(cross_field_analysis, "cross_field_report.json")

Batch Processing Demo¶

from src.data.fixtures import get_all_fixtures

# Test analysis on all demo datasets
fixtures = get_all_fixtures()

for name, data in fixtures.items():
    print(f"Analyzing {name}...")
    # Run analysis pipeline on each fixture
    results = run_analysis_pipeline(data)
    print(f"Results: {results}")

🚨 Demo Limitations¶

Understanding demo constraints helps set realistic expectations:

Data Scope Limitations¶

Small networks - 13 papers vs thousands in production
Limited timespan - 2009-2024 vs longer historical periods
Focused domains - AI/ML emphasis vs broader academic coverage
Synthetic elements - Some relationships constructed for demonstration

Scalability Considerations¶

Network size - Cannot test large-scale analysis (10k+ papers)
Community complexity - Fewer research clusters than real networks
Memory patterns - Won't reveal large dataset memory requirements
Performance limits - Cannot test high-volume processing scenarios

ML Model Constraints¶

Training limitations - Cannot train new models with demo data
Fixed embeddings - Synthetic embeddings don't improve with training
Parameter testing - Limited hyperparameter exploration capabilities
Validation scope - Cannot test model performance improvements

🔍 Troubleshooting Demo Mode¶

Common Issues¶

Loading ProblemsAnalysis ErrorsExport Issues

Dataset won't load: - ✅ Refresh the Streamlit page - ✅ Check browser console for JavaScript errors - ✅ Ensure adequate browser memory - ✅ Try loading smaller minimal_demo dataset first

Slow loading performance: - ✅ Close other browser tabs and applications - ✅ Clear browser cache and cookies - ✅ Try in private/incognito browser mode - ✅ Check system memory availability

ML predictions failing: - ✅ Ensure demo dataset is fully loaded - ✅ Check that demo ML service is active - ✅ Try restarting the Streamlit application - ✅ Test with different paper IDs from the dataset

Network analysis errors: - ✅ Verify dataset has citation relationships - ✅ Check that analytics service is initialized - ✅ Try different analysis parameters - ✅ Monitor memory usage during computation

Export generation failing: - ✅ Ensure analysis completed successfully - ✅ Check file permissions in output directory - ✅ Try different export formats - ✅ Monitor disk space availability

Performance Optimization¶

Use minimal_demo for initial testing
Close unused browser tabs to free memory
Restart Streamlit if performance degrades
Monitor system resources during analysis

🎉 Success Stories¶

Demo datasets have enabled users to:

Master platform workflows before importing 10,000+ research papers
Prototype research studies with realistic academic network patterns
Train research teams on advanced analytics techniques
Demonstrate capabilities to institutional stakeholders
Develop custom integrations with known-good test data
Validate analysis approaches before applying to sensitive datasets

🔗 Next Steps¶

After mastering demo datasets:

Upload your research data using file import
Configure production database for large-scale analysis
Train custom ML models with your domain data
Generate research reports for publications
Explore advanced analytics with larger networks

Getting Started: - Demo Mode - Quick Start - Immediate exploration guide - Interactive Features - Complete web interface guide
- Quick Start Guide - Demo-first workflow

Next Steps: - File Upload - Import your research collections - Data Import - Advanced import pipeline features - ML Predictions - Citation prediction capabilities - Network Analysis - Advanced graph analysis

Ready to explore? Load the complete demo dataset and discover the power of academic research analytics! 🚀

The demo datasets provide a comprehensive introduction to citation network analysis, ML-powered research discovery, and publication-ready research insights - all without any setup requirements.