Demo Datasets - Comprehensive Research Analytics
Explore Citation Compass capabilities using curated academic datasets. Demo mode provides realistic research data with offline functionality, useful for learning, testing, and demonstrating analytics features.
Demo Datasets Overview
Demo datasets offer immediate access to powerful research analytics without requiring database setup, API keys, or data imports. Experience the complete platform with realistic academic data spanning multiple research fields.

Why Use Demo Datasets?
- Zero configuration - Works immediately out of the box
- Full offline functionality - No internet required after initial setup
- Complete feature access - All platform capabilities available
- Realistic performance - Response times match production systems
- Learn ML concepts - See how citation prediction models work
- Understand networks - Explore real academic collaboration patterns
- Practice analysis - Master workflows with guided examples
- Benchmark performance - Compare different analysis approaches
- Validate workflows - Test analysis pipelines before using your data
- Prototype studies - Design research with realistic constraints
- Train users - Onboard team members with safe practice environment
- Demonstrate capabilities - Show stakeholders platform potential
Available Datasets
Complete Demo Dataset
The flagship demonstration dataset with comprehensive academic research data:
Dataset Statistics:
- 13 high-impact papers carefully selected across multiple domains
- 34 citation relationships showing realistic academic networks
- 47 researchers demonstrating collaboration patterns
- 7 research fields with cross-disciplinary connections
- 16-year timespan (2009-2024) showing research evolution
Research Domains:
- Machine Learning: Foundational papers including "Attention Is All You Need"
- Neuroscience: Brain imaging, neural networks, and cognitive studies
- Physics: Quantum computing and computational physics advances
- Medical Informatics: Healthcare AI and medical imaging research
- Computer Vision: Image recognition and deep learning breakthroughs
- Robotics: Autonomous systems and intelligent control
- Psychology: Cognitive science and behavioral research
Network Characteristics:
- Cross-field citations between related domains (ML ↔ Computer Vision)
- Temporal patterns showing how newer papers build on foundational work
- Collaboration networks revealing author research connections
- Impact distributions from high-cited foundational papers to emerging research
Minimal Demo Dataset
Quick testing dataset for rapid exploration and development:
Compact Statistics:
- 5 essential papers covering key research areas
- 5 citation relationships demonstrating basic network structure
- 22 researchers showing collaboration patterns
- 3 research fields with focused domain coverage
- Perfect for: Quick demos, testing, feature validation
Quick Fixtures
Specialized mini-datasets for targeted testing scenarios:
Available Fixtures:
- minimal_network: Basic network structure (3 papers, 2 citations)
- collaboration_network: Focus on author collaboration patterns
- temporal_network: Time-based citation evolution examples
- cross_field_network: Inter-disciplinary research connections
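To give a sense of scale, a fixture the size of minimal_network (3 papers, 2 citations) could be written as plain Python data. This is only a sketch: the `Paper` class, its field names, and the placeholder IDs are assumptions for illustration, not the project's actual fixture schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Paper:
    # Hypothetical schema: the real fixture fields may differ.
    paper_id: str
    title: str
    year: int
    field: str


# Three placeholder papers and two citation edges (citing -> cited).
papers = [
    Paper("p1", "Foundational paper", 2012, "Machine Learning"),
    Paper("p2", "Follow-up paper", 2017, "Machine Learning"),
    Paper("p3", "Applied paper", 2020, "Computer Vision"),
]
citations = [("p2", "p1"), ("p3", "p2")]

# Basic integrity check: every edge must reference a known paper,
# and a paper should only cite work published no later than itself.
by_id = {p.paper_id: p for p in papers}
for citing, cited in citations:
    assert citing in by_id and cited in by_id
    assert by_id[cited].year <= by_id[citing].year

print(f"{len(papers)} papers, {len(citations)} citations")
```

Keeping citations as (citing, cited) pairs makes the edge direction explicit, which matters for the temporal and centrality analyses described later.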
Accessing Demo Datasets
Via Streamlit Interface
Step 1: Navigate to Demo Datasets
1. Open sidebar menu (click hamburger icon)
2. Select "Demo Datasets" from navigation
3. View available datasets with detailed statistics
Step 2: Explore Dataset Information
Each dataset shows:
- Statistics: Counts of papers, citations, authors, and venues
- Time Range: Publication year span
- Fields: Research domains included
- Load Time: Expected loading duration
- Memory: Estimated memory usage
Step 3: Load Dataset
- Click "Load Dataset" button for your chosen dataset
- Monitor loading progress (typically 2-3 seconds)
- Confirm successful load with status indicator
- Begin exploring all platform features
Via Python API
Direct Dataset Loading
from src.data.demo_loader import DemoDataLoader
# Load complete demo dataset
loader = DemoDataLoader()
demo_data = loader.load_complete_demo()
# Access loaded data
papers = demo_data.get_papers()
citations = demo_data.get_citations()
authors = demo_data.get_authors()
print(f"Loaded {len(papers)} papers with {len(citations)} citations")
Custom Dataset Selection
from src.data.fixtures import get_fixture_data
# Load specific fixture
minimal_data = get_fixture_data('minimal_network')
temporal_data = get_fixture_data('temporal_network')
# Load complete demo with configuration
complete_demo = get_fixture_data('complete_demo')
Via Command Line
# Test demo dataset loading
python -c "from src.data.demo_loader import DemoDataLoader; loader = DemoDataLoader(); print('Demo datasets available:', loader.list_available_datasets())"
# Load and validate demo data
python -c "from src.data.demo_loader import DemoDataLoader; loader = DemoDataLoader(); data = loader.load_complete_demo(); print(f'Loaded {len(data.get_papers())} papers successfully')"
Demo ML Capabilities
Demo datasets include a sophisticated ML service that works entirely offline while providing realistic results:
Synthetic Embeddings
Realistic vector representations that demonstrate ML concepts:
- Field-aware clustering - Papers cluster by research domain
- Semantic similarity - Related papers have similar embeddings
- Dimensional structure - High-dimensional spaces with meaningful patterns
- Compatible with TransE - Works with existing ML infrastructure
Example Embedding Exploration:
from src.services.demo_service import get_demo_ml_service
ml_service = get_demo_ml_service()
# Get paper embeddings
paper_id = "204e3073870fae3d05bcbc2f6a8e263d9b72e776"  # Attention Is All You Need
embedding = ml_service.get_paper_embedding(paper_id)
# Find similar papers
similar_papers = ml_service.find_similar_papers(paper_id, top_k=5)
print(f"Papers similar to Attention paper: {similar_papers}")
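The idea of field-aware synthetic embeddings can be illustrated without the demo service. The sketch below assumes each field is given its own center vector and each paper is that center plus a little Gaussian noise. The centers, dimension, and noise level are invented for illustration and are not the values the demo ML service uses.

```python
import math
import random


def make_embedding(field_center, noise, rng):
    """Return the field's center vector plus small Gaussian noise."""
    return [c + rng.gauss(0.0, noise) for c in field_center]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Assumed layout: each field occupies its own block of 8 dimensions.
centers = {
    "Machine Learning": [1.0] * 8 + [0.0] * 8,
    "Physics": [0.0] * 8 + [1.0] * 8,
}

ml_a = make_embedding(centers["Machine Learning"], 0.1, rng)
ml_b = make_embedding(centers["Machine Learning"], 0.1, rng)
phys = make_embedding(centers["Physics"], 0.1, rng)

same_field = cosine(ml_a, ml_b)
cross_field = cosine(ml_a, phys)
print(f"same field: {same_field:.3f}, cross field: {cross_field:.3f}")
```

Because the two ML papers share a center, their cosine similarity comes out far higher than the ML-to-Physics pair, which is the clustering behavior the list above describes.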
Intelligent Citation Predictions
Realistic prediction algorithms following academic patterns:
Temporal Intelligence:
- Newer papers are more likely to cite foundational work
- Recent papers cite contemporary research
- Classic papers continue being referenced over time
Field Relationships:
- ML papers frequently cite other ML research
- Cross-field citations between related domains (ML ↔ Computer Vision)
- Interdisciplinary connections reflect real research patterns
Impact Weighting:
- Highly cited papers receive more predictions
- Foundational papers are cited across multiple fields
- Quality scores influence citation probability
Example Prediction Usage:
from src.services.demo_service import get_demo_ml_service
ml_service = get_demo_ml_service()
# Predict citations for a paper
paper_id = "204e3073870fae3d05bcbc2f6a8e263d9b72e776"  # Attention Is All You Need (Semantic Scholar ID)
predictions = ml_service.predict_citations(paper_id, top_k=10)
for pred in predictions:
    print(f"Paper: {pred['target_id']}, Confidence: {pred['confidence']:.3f}")
Confidence Scoring
Realistic confidence metrics for prediction reliability:
- Range: 0.1-0.9 matching real-world ML model outputs
- Distribution: Higher confidence for same-field predictions
- Uncertainty: Lower confidence for cross-field or novel connections
- Interpretability: Scores correlate with prediction likelihood
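One way to picture how temporal, field, and impact signals could combine into a score within the stated 0.1-0.9 range is a small additive heuristic. The function below is a toy sketch: `citation_confidence`, its weights, and its thresholds are assumptions made for illustration, not the demo service's actual scoring logic.

```python
def citation_confidence(citing_year, cited_year, same_field, cited_count):
    """Toy confidence score in [0.1, 0.9] built from three heuristics."""
    if cited_year > citing_year:
        return 0.1  # a paper cannot cite future work; use the floor

    score = 0.3                                  # base rate (assumed)
    score += 0.3 if same_field else 0.1          # field relationship
    score += 0.2 * min(cited_count / 1000, 1.0)  # impact weighting, capped
    age = citing_year - cited_year
    score += 0.1 if age <= 5 else 0.0            # recency bonus
    return max(0.1, min(0.9, score))             # clamp to the stated range


same = citation_confidence(2020, 2017, same_field=True, cited_count=5000)
cross = citation_confidence(2020, 2017, same_field=False, cited_count=50)
print(f"same-field: {same:.3f}, cross-field: {cross:.3f}")
```

The clamp at the end keeps every score inside the documented range, and the same-field, highly cited pair scores higher than the cross-field, rarely cited pair.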
Network Analysis Features
Demo datasets enable comprehensive network analysis with realistic academic patterns:
Community Detection
Identify research clusters within the citation network:
Available Algorithms:
- Louvain method - Optimize modularity for community structure
- Label propagation - Fast community discovery
- Girvan-Newman - Hierarchical community detection
- Leiden algorithm - High-quality community partitions
Real-world Patterns:
- Research fields form natural communities
- Cross-field bridges between related domains
- Author collaboration clusters
- Temporal community evolution
Example Analysis:
from src.services.analytics_service import get_analytics_service
analytics = get_analytics_service()
# Detect communities in demo network
communities = analytics.detect_communities()
print(f"Found {len(communities)} research communities")
# Analyze community characteristics
for i, community in enumerate(communities):
    print(f"Community {i}: {len(community)} papers")
    # Show dominant research fields in each community
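For readers curious about what a community detection pass does internally, here is a minimal, self-contained sketch of label propagation, one of the algorithms listed above. It is not the analytics service's implementation: the `label_propagation` function, the deterministic tie-breaking rule, and the toy graph are all assumptions chosen so the run is reproducible.

```python
from collections import Counter


def label_propagation(adjacency, max_iters=20):
    """Deterministic label propagation on an undirected graph.

    Each node repeatedly adopts the label most common among its neighbours.
    Ties keep the current label when possible, otherwise take the largest
    label, so results are reproducible (library implementations usually
    break ties at random).
    """
    labels = {node: node for node in adjacency}  # start: one label per node
    for _ in range(max_iters):
        changed = False
        for node in sorted(adjacency):
            counts = Counter(labels[n] for n in adjacency[node])
            if not counts:
                continue  # isolated node keeps its own label
            best = max(counts.values())
            candidates = {lab for lab, c in counts.items() if c == best}
            new_label = labels[node] if labels[node] in candidates else max(candidates)
            if new_label != labels[node]:
                labels[node] = new_label
                changed = True
        if not changed:
            break
    # Group nodes by their final label.
    communities = {}
    for node, label in labels.items():
        communities.setdefault(label, set()).add(node)
    return list(communities.values())


# Two tightly connected triangles joined by a single bridge edge (a3-b1).
edges = [("a1", "a2"), ("a1", "a3"), ("a2", "a3"),
         ("b1", "b2"), ("b1", "b3"), ("b2", "b3"),
         ("a3", "b1")]
adjacency = {}
for u, v in edges:
    adjacency.setdefault(u, set()).add(v)
    adjacency.setdefault(v, set()).add(u)

communities = label_propagation(adjacency)
print(f"Found {len(communities)} communities: {communities}")
```

On this toy graph the two triangles end up as separate communities, with the bridge edge playing the role of a cross-field connection. The outcome of label propagation depends on update order and tie-breaking, so other rules can merge the groups.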
Centrality Analysis
Identify influential papers and authors using network metrics:
Centrality Measures:
- Degree centrality - Direct citation connections
- Betweenness centrality - Bridge papers connecting different areas
- Eigenvector centrality - Citations from other high-impact papers
- PageRank - Academic influence propagation
Research Insights:
- Foundational papers show high centrality across all measures
- Bridge papers connect different research communities
- Recent breakthrough papers gain centrality over time
- Author centrality reveals research leaders
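Two of these measures, degree centrality (here, in-degree) and PageRank, are simple enough to sketch directly. The code below is an illustrative power-iteration PageRank on a placeholder graph. The paper names, damping factor, and iteration count are assumptions, and the platform may compute these metrics differently.

```python
def pagerank(edges, damping=0.85, iters=50):
    """Power-iteration PageRank on a directed citation graph.

    `edges` holds (citing, cited) pairs; rank flows from citing to cited.
    """
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for citing, cited in edges:
        out_links[citing].append(cited)

    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        # Papers that cite nothing spread their rank evenly over all nodes.
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        base = (1 - damping) / len(nodes) + damping * dangling / len(nodes)
        new_rank = {n: base for n in nodes}
        for n in nodes:
            for target in out_links[n]:
                new_rank[target] += damping * rank[n] / len(out_links[n])
        rank = new_rank
    return rank


# Placeholder graph: three papers cite "foundation"; "survey" also cites "method".
edges = [("method", "foundation"), ("survey", "foundation"),
         ("survey", "method"), ("application", "foundation")]

# In-degree: number of citations each paper receives.
in_degree = {}
for _, cited in edges:
    in_degree[cited] = in_degree.get(cited, 0) + 1

ranks = pagerank(edges)
top = max(ranks, key=ranks.get)
print(f"in-degree: {in_degree}")
print(f"highest PageRank: {top} ({ranks[top]:.3f})")
```

The paper cited by everything else ranks highest on both measures, matching the observation that foundational papers show high centrality across measures.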
Temporal Dynamics
Analyze research evolution over the 16-year dataset timespan:
Temporal Patterns:
- Citation accumulation - How papers gain citations over time
- Field emergence - New research areas in the network
- Knowledge flow - How ideas spread between domains
- Collaboration evolution - Changing author network patterns
Time-based Analysis:
from src.services.analytics_service import get_analytics_service
analytics = get_analytics_service()
# Analyze temporal citation patterns
temporal_stats = analytics.get_temporal_citation_patterns()
print(f"Citation patterns over {temporal_stats['years_span']} years")
# Track field emergence
field_evolution = analytics.analyze_field_evolution()
print("Research field development over time")
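Citation accumulation and citation age can both be computed from nothing more than publication years and citation pairs. The following sketch uses invented placeholder papers that span 2009-2024, and it does not reflect the return format of `get_temporal_citation_patterns()`.

```python
from collections import Counter

# Placeholder data: publication year per paper and (citing, cited) pairs.
pub_year = {"p1": 2009, "p2": 2015, "p3": 2018, "p4": 2024}
citations = [("p2", "p1"), ("p3", "p1"), ("p3", "p2"), ("p4", "p1"), ("p4", "p3")]

# Citations received by p1, bucketed by the year of the citing paper.
received_by_year = Counter(
    pub_year[citing] for citing, cited in citations if cited == "p1"
)

# Running total shows how citations accumulate over time.
cumulative = {}
total = 0
for year in sorted(received_by_year):
    total += received_by_year[year]
    cumulative[year] = total

# Citation age: gap in years between the citing and the cited paper.
ages = [pub_year[citing] - pub_year[cited] for citing, cited in citations]
mean_age = sum(ages) / len(ages)

# Inclusive span, so 2009-2024 counts as 16 years.
years_span = max(pub_year.values()) - min(pub_year.values()) + 1
print(f"p1 cumulative citations: {cumulative}")
print(f"mean citation age: {mean_age:.1f} years over a {years_span}-year span")
```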
Visualization Capabilities
Demo datasets support rich interactive visualizations for exploration:
Network Graphs
Interactive citation networks with full functionality:
Features:
- Clickable nodes - Explore individual papers
- Dynamic filtering - Filter by field, year, citation count
- Zoom and pan - Navigate large networks smoothly
- Highlighting - Trace citation paths and relationships
Customization Options:
- Node sizing - Scale by citation count or impact
- Color coding - Research fields, publication years, or communities
- Edge styling - Citation relationships with directional arrows
- Layout algorithms - Force-directed, hierarchical, circular layouts
Embedding Visualizations
Explore paper relationships in high-dimensional embedding space:
Visualization Types:
- 2D projections - t-SNE and UMAP dimensionality reduction
- 3D explorations - Interactive 3D embedding spaces
- Cluster highlighting - Research field boundaries
- Similarity mapping - Distance-based relationship exploration
Statistical Charts
Comprehensive analytics dashboards with interactive plots:
Chart Types:
- Citation distributions - Histograms and box plots
- Temporal trends - Time series of research activity
- Field comparisons - Cross-domain analytics
- Network metrics - Centrality and clustering visualizations
Analytics & Export
Demo datasets enable complete analytics workflows with publication-ready outputs:
Statistical Analysis
Comprehensive network statistics for research insights:
Network Metrics:
- Global statistics - Density, clustering coefficient, diameter
- Node-level metrics - Individual paper/author importance
- Community analysis - Research cluster characteristics
- Temporal dynamics - Evolution patterns over time
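As a worked example of one global statistic, density can be computed from the figures quoted for the complete demo dataset (13 papers, 34 citations). The sketch assumes the citation network is treated as a directed graph without self-loops, so the maximum number of edges is n(n-1). The `directed_density` helper is illustrative, not a platform function.

```python
def directed_density(num_nodes, num_edges):
    """Fraction of possible directed edges (no self-loops) that are present."""
    if num_nodes < 2:
        return 0.0
    return num_edges / (num_nodes * (num_nodes - 1))


# Figures quoted for the complete demo dataset: 13 papers, 34 citations.
density = directed_density(13, 34)
print(f"density = {density:.3f}")  # 34 / (13 * 12) = 34 / 156
```

A density of about 0.218 is far higher than a large real-world citation network would typically show, which is worth keeping in mind when interpreting metrics on a 13-paper graph.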
Export Capabilities
Publication-ready outputs in multiple formats:
Academic Exports:
- LaTeX tables - Camera-ready for academic publications
- Citation networks - GraphML and DOT formats for further analysis
- Statistical summaries - CSV and JSON for data analysis
- Visualization exports - High-resolution PNG/SVG/PDF formats
Report Generation:
from src.services.analytics_service import get_analytics_service
analytics = get_analytics_service()
# Generate comprehensive network report
report = analytics.generate_network_report()
# Export to LaTeX table
latex_table = analytics.export_latex_table(report)
print("LaTeX table ready for publication")
# Save visualizations
analytics.save_network_visualization("demo_network.pdf")
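To show roughly what a LaTeX table export involves, here is a minimal sketch that renders rows as a `tabular` environment. The `to_latex_table` helper is hypothetical, and the output of `export_latex_table` may be structured differently. The row values reuse the dataset statistics listed earlier on this page.

```python
def to_latex_table(headers, rows):
    """Render rows as a LaTeX tabular, escaping a few special characters."""
    def escape(value):
        text = str(value)
        for char in "&%_#":
            text = text.replace(char, "\\" + char)
        return text

    lines = ["\\begin{tabular}{" + "l" * len(headers) + "}", "\\hline"]
    lines.append(" & ".join(escape(h) for h in headers) + " \\\\")
    lines.append("\\hline")
    for row in rows:
        lines.append(" & ".join(escape(cell) for cell in row) + " \\\\")
    lines += ["\\hline", "\\end{tabular}"]
    return "\n".join(lines)


table = to_latex_table(
    ["Dataset", "Papers", "Citations"],
    [["complete_demo", 13, 34], ["minimal_demo", 5, 5]],
)
print(table)
```

Escaping matters in practice: an identifier such as `complete_demo` contains an underscore, which LaTeX would otherwise reject outside math mode.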
Research Insights
AI-powered analysis with academic context:
Insight Types:
- Performance benchmarking - Compare against academic standards
- Traffic light indicators - Quick quality assessment
- Research recommendations - Suggested analysis directions
- Cross-field discoveries - Unexpected connection identification
Transitioning to Production
Demo datasets provide a seamless transition to production use:
From Demo to Real Data
When ready for your own research data:
- Master workflows using demo datasets first
- Upload your data using file upload
- Apply learned techniques to your research domain
- Scale analysis methods to larger datasets
- Train custom models with your domain-specific data
Workflow Preservation
A consistent interface ensures a smooth transition:
- Same API calls work with demo and production data
- Identical analysis methods apply to any dataset size
- Export formats remain consistent across modes
- Visualization tools scale to larger networks
Performance Expectations
Realistic performance helps plan production deployments:
- Response times similar to production database queries
- Memory usage scales predictably with dataset size
- Analysis complexity matches computational requirements
- Export speeds reflect real-world processing times
Best Practices
Learning Strategies
- Start with complete_demo - Full feature exploration
- Try all analysis types - Network, ML, temporal analysis
- Practice exports - Learn report generation workflows
- Test edge cases - Understand limitations and error handling
- Experiment with parameters - See how settings affect results
- Map to your domain - Find parallels with your research area
- Practice workflows - Master analysis pipelines before using real data
- Understand metrics - Learn interpretation of network statistics
- Test export formats - Ensure compatibility with your publication workflows
- Validate assumptions - Confirm analysis approaches with demo results
- Study API patterns - Learn efficient usage of platform APIs
- Test integrations - Practice connecting to external systems
- Monitor performance - Understand computational requirements
- Debug workflows - Practice troubleshooting with known data
- Extend functionality - Use demo data for testing new features
Advanced Usage
Power user techniques for maximum demo value:
Custom Analysis Scripts
from src.data.demo_loader import DemoDataLoader
from src.services.analytics_service import get_analytics_service
# Load demo data
loader = DemoDataLoader()
demo_data = loader.load_complete_demo()
# Perform custom analysis
analytics = get_analytics_service()
# Analyze cross-field citations
cross_field_analysis = analytics.analyze_cross_field_patterns()
# Export results
analytics.export_analysis_report(cross_field_analysis, "cross_field_report.json")
Batch Processing Demo
from src.data.fixtures import get_all_fixtures
# Test analysis on all demo datasets
fixtures = get_all_fixtures()
for name, data in fixtures.items():
    print(f"Analyzing {name}...")
    # Run analysis pipeline on each fixture
    # (run_analysis_pipeline is a placeholder for your own analysis function)
    results = run_analysis_pipeline(data)
    print(f"Results: {results}")
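The loop above calls `run_analysis_pipeline`, which is not defined in the snippet. A minimal stand-in might look like the following. It assumes each fixture is a dict with "papers" and "citations" keys, which is a guess at the format rather than the documented return type of `get_all_fixtures()`, and the fixture contents are placeholders.

```python
def run_analysis_pipeline(data):
    """Hypothetical stand-in: summarize a fixture given as a dict.

    Assumes `data` has "papers" (a list) and "citations" (a list of
    (citing, cited) pairs); the real fixture format may differ.
    """
    papers = data.get("papers", [])
    citations = data.get("citations", [])
    cited_counts = {}
    for _, cited in citations:
        cited_counts[cited] = cited_counts.get(cited, 0) + 1
    most_cited = max(cited_counts, key=cited_counts.get) if cited_counts else None
    return {
        "papers": len(papers),
        "citations": len(citations),
        "most_cited": most_cited,
    }


# Placeholder fixtures keyed by name, mirroring the loop above.
fixtures = {
    "minimal_network": {
        "papers": ["p1", "p2", "p3"],
        "citations": [("p2", "p1"), ("p3", "p1")],
    },
    "empty_network": {"papers": [], "citations": []},
}

results = {name: run_analysis_pipeline(data) for name, data in fixtures.items()}
for name, summary in results.items():
    print(f"{name}: {summary}")
```

Including an empty fixture in the batch is a cheap way to confirm the pipeline handles edge cases before it is pointed at real data.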
Demo Limitations
Understanding demo constraints helps set realistic expectations:
Data Scope Limitations
- Small networks - 13 papers vs thousands in production
- Limited timespan - 2009-2024 vs longer historical periods
- Focused domains - AI/ML emphasis vs broader academic coverage
- Synthetic elements - Some relationships constructed for demonstration
Scalability Considerations
- Network size - Cannot test large-scale analysis (10k+ papers)
- Community complexity - Fewer research clusters than real networks
- Memory patterns - Won't reveal large dataset memory requirements
- Performance limits - Cannot test high-volume processing scenarios
ML Model Constraints
- Training limitations - Cannot train new models with demo data
- Fixed embeddings - Synthetic embeddings don't improve with training
- Parameter testing - Limited hyperparameter exploration capabilities
- Validation scope - Cannot test model performance improvements
Troubleshooting Demo Mode
Common Issues
Dataset won't load:
- Refresh the Streamlit page
- Check browser console for JavaScript errors
- Ensure adequate browser memory
- Try loading smaller minimal_demo dataset first
Slow loading performance:
- Close other browser tabs and applications
- Clear browser cache and cookies
- Try in private/incognito browser mode
- Check system memory availability
ML predictions failing:
- Ensure demo dataset is fully loaded
- Check that demo ML service is active
- Try restarting the Streamlit application
- Test with different paper IDs from the dataset
Network analysis errors:
- Verify dataset has citation relationships
- Check that analytics service is initialized
- Try different analysis parameters
- Monitor memory usage during computation
Export generation failing:
- Ensure analysis completed successfully
- Check file permissions in output directory
- Try different export formats
- Monitor disk space availability
Performance Optimization
- Use minimal_demo for initial testing
- Close unused browser tabs to free memory
- Restart Streamlit if performance degrades
- Monitor system resources during analysis
Success Stories
Demo datasets have enabled users to:
- Master platform workflows before importing 10,000+ research papers
- Prototype research studies with realistic academic network patterns
- Train research teams on advanced analytics techniques
- Demonstrate capabilities to institutional stakeholders
- Develop custom integrations with known-good test data
- Validate analysis approaches before applying to sensitive datasets
Next Steps
After mastering demo datasets:
- Upload your research data using file import
- Configure production database for large-scale analysis
- Train custom ML models with your domain data
- Generate research reports for publications
- Explore advanced analytics with larger networks
Related Guides
Getting Started:
- Demo Mode - Quick Start - Immediate exploration guide
- Interactive Features - Complete web interface guide
- Quick Start Guide - Demo-first workflow
Next Steps:
- File Upload - Import your research collections
- Data Import - Advanced import pipeline features
- ML Predictions - Citation prediction capabilities
- Network Analysis - Advanced graph analysis
Ready to explore? Load the complete demo dataset and discover the power of academic research analytics!
The demo datasets provide a comprehensive introduction to citation network analysis, ML-powered research discovery, and publication-ready research insights - all without any setup requirements.