Citation Compass - Streamlit App¶
A comprehensive web application for academic citation analysis powered by machine learning.
Features¶
ML Predictions¶
- Citation Prediction: Use our trained TransE model to predict which papers are most likely to cite a given paper
- Confidence Scoring: Get probability-like confidence scores for each prediction
- Interactive Results: Explore predicted papers with detailed information and export results
- Paper Search: Find papers by title, author, or direct paper ID
Embedding Explorer¶
- Vector Space Exploration: Dive deep into learned paper embeddings
- Similarity Analysis: Compare papers and find semantically similar research
- Dimensionality Reduction: Visualize embeddings in 2D/3D space using PCA and t-SNE
- Embedding Statistics: Analyze embedding properties and distributions
Enhanced Visualizations¶
- Network Visualization: Interactive citation network graphs with prediction overlays
- Advanced Charts: Multi-dimensional analysis with customizable visualizations
- Export Capabilities: High-quality outputs in multiple formats (PNG, SVG, PDF)
- Real-time Updates: Dynamic visualization updates based on ML predictions
Interactive Analytics Pipeline¶
- Interactive Analysis: Jupyter-style notebook execution within Streamlit
- Advanced Analytics: Network analysis, community detection, temporal trends
- Batch Processing: Large-scale citation analysis and reporting
- Custom Workflows: User-defined analytical pipelines with export capabilities
Advanced Analytics (New)¶
- Network Analysis: Centrality measures, community detection, path analysis
- Temporal Analysis: Citation trends, growth patterns, impact over time
- Author Analytics: Collaboration networks, influence metrics, career trajectories
- Performance Metrics: System health, prediction accuracy, cache efficiency
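As a rough illustration of the temporal analysis item above, citation trends can be derived from nothing more than the publication years of citing papers. This is a minimal sketch using only the standard library; the function names are hypothetical and are not taken from the app's analytics module.

```python
from collections import Counter
from typing import Dict, Iterable


def citations_per_year(citation_years: Iterable[int]) -> Dict[int, int]:
    """Count citations per year, filling gap years with zero."""
    counts = Counter(citation_years)
    if not counts:
        return {}
    first, last = min(counts), max(counts)
    return {year: counts.get(year, 0) for year in range(first, last + 1)}


def cumulative_citations(per_year: Dict[int, int]) -> Dict[int, int]:
    """Running total of citations, ordered by year."""
    total = 0
    running = {}
    for year in sorted(per_year):
        total += per_year[year]
        running[year] = total
    return running


# Toy input: years in which a paper was cited (hypothetical values).
per_year = citations_per_year([2019, 2021, 2021])
growth = cumulative_citations(per_year)
```

Filling gap years with zero keeps a plotted time series continuous, which matters when a paper goes uncited for a year.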
Installation & Setup¶
Prerequisites¶
- Python 3.8+
- PyTorch (for ML models)
- Streamlit
- Required Python packages (see requirements)
Quick Start¶
- Install Dependencies: install the required Python packages (see requirements).
- Run the Application: start app.py with Streamlit.
- Open Browser: Navigate to http://localhost:8501
Configuration¶
The app automatically detects and loads:
- TransE Model: Locally trained model from the models/ directory
- Entity Mapping: Paper ID to model entity mappings
- API Configuration: Semantic Scholar API settings
How to Use¶
ML Predictions Page¶
- Input Paper:
  - Enter a paper ID directly
  - Search by title or keywords
  - Browse search results and select
- Configure Predictions:
  - Set number of predictions (1-50)
  - Adjust confidence threshold
  - Check model health status
- View Results:
  - Interactive results table with confidence scores
  - Confidence distribution charts
  - Detailed paper information
  - Export results as CSV
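The configure-and-export steps above amount to a filter, a sort, and a CSV dump. The sketch below shows one way this could look, assuming each prediction is a dictionary with `paper_id` and `confidence` keys; that record shape and both function names are assumptions for illustration, not the app's actual data model.

```python
import csv
import io
from typing import Dict, List, Union

# Hypothetical record shape: {"paper_id": str, "confidence": float}
Prediction = Dict[str, Union[str, float]]


def select_predictions(
    predictions: List[Prediction], top_k: int = 10, min_confidence: float = 0.0
) -> List[Prediction]:
    """Keep predictions at or above min_confidence, best first, capped at top_k."""
    if not 1 <= top_k <= 50:  # the page allows 1-50 predictions
        raise ValueError("top_k must be between 1 and 50")
    kept = [p for p in predictions if p["confidence"] >= min_confidence]
    kept.sort(key=lambda p: p["confidence"], reverse=True)
    return kept[:top_k]


def predictions_to_csv(predictions: List[Prediction]) -> str:
    """Serialize predictions to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["paper_id", "confidence"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(predictions)
    return buffer.getvalue()


# Toy predictions (hypothetical values).
sample = [
    {"paper_id": "p1", "confidence": 0.4},
    {"paper_id": "p2", "confidence": 0.9},
    {"paper_id": "p3", "confidence": 0.7},
]
top = select_predictions(sample, top_k=2, min_confidence=0.5)
csv_text = predictions_to_csv(top)
```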
Embedding Explorer Page¶
- Individual Embeddings:
  - Enter paper ID to get embedding vector
  - View embedding statistics and distributions
  - Visualize embedding dimensions
- Compare Papers:
  - Enter multiple paper IDs (one per line)
  - View cosine similarity matrix
  - Analyze pairwise relationships
- Visualization:
  - Plot 3+ papers in reduced dimensional space
  - Choose PCA or t-SNE reduction
  - Explore in 2D or 3D
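For readers unfamiliar with the cosine similarity matrix mentioned under Compare Papers, here is a minimal pure-Python sketch. It assumes embeddings are plain lists of floats. The function names and the toy 2-dimensional vectors are hypothetical, and the app may well compute this with a numerical library instead.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_matrix(embeddings: List[Sequence[float]]) -> List[List[float]]:
    """Pairwise cosine similarities; entry [i][j] compares paper i with paper j."""
    return [[cosine_similarity(u, v) for v in embeddings] for u in embeddings]


# Toy embeddings for three papers (hypothetical values).
matrix = similarity_matrix([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
```

Values near 1 indicate papers whose embeddings point in the same direction, and values near 0 indicate unrelated directions.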
Enhanced Visualizations Page¶
- Network Graphs:
  - Interactive citation network visualization
  - Overlay ML predictions on network structure
  - Customize node sizes, colors, and layout algorithms
  - Export high-quality visualizations
- Advanced Charts:
  - Multi-dimensional scatter plots with prediction confidence
  - Time-series analysis of citation patterns
  - Distribution analyses and statistical summaries
Interactive Analytics Pipeline¶
- Interactive Analysis:
  - Execute pre-built analytical notebooks
  - Customize parameters and data ranges
  - Real-time results with progress indicators
- Custom Workflows:
  - Create custom analytical pipelines
  - Combine multiple analysis types
  - Export comprehensive reports
- Advanced Analytics:
  - Network centrality analysis
  - Community detection in citation networks
  - Temporal trend analysis
  - Performance benchmarking
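To make network centrality analysis concrete, the sketch below computes two common measures on a citation edge list: normalized in-degree and PageRank by power iteration. Which measures the app actually reports is not stated here, so treat the choice of metrics, the function names, and the `(citing, cited)` edge format as assumptions.

```python
from typing import Dict, Iterable, Tuple

Edge = Tuple[str, str]  # (citing_paper, cited_paper); hypothetical format


def in_degree_centrality(edges: Iterable[Edge]) -> Dict[str, float]:
    """Fraction of the other papers that cite each paper."""
    edge_list = list(edges)
    nodes = {node for edge in edge_list for node in edge}
    counts = {node: 0 for node in nodes}
    for _, cited in edge_list:
        counts[cited] += 1
    denom = max(len(nodes) - 1, 1)
    return {node: count / denom for node, count in counts.items()}


def pagerank(
    edges: Iterable[Edge], damping: float = 0.85, iterations: int = 50
) -> Dict[str, float]:
    """PageRank via power iteration; rank flows from citing to cited papers."""
    edge_list = list(edges)
    nodes = sorted({node for edge in edge_list for node in edge})
    if not nodes:
        return {}
    out_links = {node: [] for node in nodes}
    for citing, cited in edge_list:
        out_links[citing].append(cited)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Papers with no outgoing citations spread their rank uniformly.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        new_rank = {
            node: (1.0 - damping) / n + damping * dangling / n for node in nodes
        }
        for node in nodes:
            for target in out_links[node]:
                new_rank[target] += damping * rank[node] / len(out_links[node])
        rank = new_rank
    return rank


# Toy network: b and c cite a, and c also cites b.
toy_edges = [("b", "a"), ("c", "a"), ("c", "b")]
degree = in_degree_centrality(toy_edges)
ranks = pagerank(toy_edges)
```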
About the ML Model¶
TransE Architecture¶
- Model Type: Translating Embeddings for Knowledge Graphs
- Embedding Dimension: 128
- Training Data: Academic citation networks
- Entities: 10,000+ computer science papers
- Prediction Logic: source + relation ≈ target
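The TransE translation rule can be sketched in a few lines: a candidate is plausible when the source embedding plus the relation vector lands close to the target embedding. This is a minimal illustration, assuming a single "cites" relation vector and Euclidean distance for ranking. The function names and the toy 2-dimensional vectors are hypothetical (the model described here uses 128 dimensions), and the app's own scoring code may differ.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def transe_distance(source: Vector, relation: Vector, target: Vector) -> float:
    """Distance between (source + relation) and target; lower is more plausible."""
    translated = [s + r for s, r in zip(source, relation)]
    return math.dist(translated, target)


def rank_targets(
    source: Vector,
    relation: Vector,
    candidates: Dict[str, Vector],
    top_k: int = 10,
) -> List[Tuple[str, float]]:
    """Return up to top_k (paper_id, distance) pairs, closest first."""
    scored = [
        (paper_id, transe_distance(source, relation, embedding))
        for paper_id, embedding in candidates.items()
    ]
    scored.sort(key=lambda item: item[1])
    return scored[:top_k]


# Toy 2-D embeddings (hypothetical values).
source_paper = [0.0, 0.0]
cites_relation = [1.0, 0.0]
candidate_papers = {
    "paper_a": [1.0, 0.1],
    "paper_b": [-1.0, 0.0],
    "paper_c": [0.5, 0.5],
}
ranking = rank_targets(source_paper, cites_relation, candidate_papers, top_k=2)
```

Here `paper_a` ranks first because it sits almost exactly at `source_paper + cites_relation`.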
Performance Metrics¶
- Training Loss: ~0.15
- Prediction Speed: <100ms per query
- Cache Hit Rate: 90%+ for repeated queries
- Confidence Calibration: Probability-like scores from distance metrics
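How distances become probability-like scores is not specified in this document. One simple possibility, shown purely as an assumption, is the monotone mapping 1 / (1 + d), which sends a distance of zero to a confidence of 1.0 and larger distances toward 0. The function name is hypothetical.

```python
def distance_to_confidence(distance: float) -> float:
    """Map a non-negative distance into (0, 1]; smaller distance, higher score.

    ASSUMPTION: the 1 / (1 + d) form is illustrative only and is not
    confirmed to be the calibration the app uses.
    """
    if distance < 0:
        raise ValueError("distance must be non-negative")
    return 1.0 / (1.0 + distance)


# Toy distances (hypothetical values), ordered closest first.
confidences = [distance_to_confidence(d) for d in (0.0, 1.0, 3.0)]
```

Scores produced this way are useful for ranking and thresholding, but they are not calibrated probabilities.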
Architecture¶
Service Layer¶
├── ML Service (src/services/ml_service.py)
│   ├── TransE Model Loading
│   ├── Prediction Generation
│   ├── Embedding Extraction
│   └── Intelligent Caching
│
├── API Client (src/data/unified_api_client.py)
│   ├── Semantic Scholar Integration
│   ├── Rate Limiting
│   ├── Response Caching
│   └── Error Handling
│
└── Data Models (src/models/)
    ├── ML Models (PaperEmbedding, CitationPrediction)
    ├── Network Models (NetworkNode, NetworkEdge)
    └── API Models (APIResponse, SearchRequest)
Streamlit Pages¶
├── app.py (Main Application)
├── src/streamlit_app/pages/
│   ├── ML_Predictions.py            # Citation prediction interface
│   ├── Embedding_Explorer.py        # Vector space exploration
│   ├── Enhanced_Visualizations.py   # Network graphs & charts
│   └── Notebook_Pipeline.py         # Interactive analytics pipeline
└── .streamlit/
    └── config.toml
Advanced Analytics Architecture¶
├── src/analytics/ (New)
│   ├── __init__.py
│   ├── network_analysis.py          # Graph metrics & community detection
│   ├── temporal_analysis.py         # Time-series citation analysis
│   ├── performance_metrics.py       # System performance analysis
│   └── export_engine.py             # Multi-format export capabilities
│
├── src/services/
│   ├── ml_service.py                # Existing ML service
│   └── analytics_service.py         # New analytics orchestration
│
└── notebooks/ (New)
    ├── 01_network_exploration.ipynb
    ├── 02_citation_analysis.ipynb
    └── 03_performance_benchmarks.ipynb
Configuration¶
Environment Variables¶
- SEMANTIC_SCHOLAR_API_KEY: Optional API key for higher rate limits
- NEO4J_URI: Neo4j database connection (if using database features)
- NEO4J_USER: Database username
- NEO4J_PASSWORD: Database password
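The sketch below shows one way these variables could be read at startup, treating every one of them as optional. The variable names come from the list above; the `AppSettings` class, the `load_settings` function, and the `database_enabled` property are hypothetical names introduced for illustration.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class AppSettings:
    """Hypothetical container for the environment-driven settings."""

    semantic_scholar_api_key: Optional[str]
    neo4j_uri: Optional[str]
    neo4j_user: Optional[str]
    neo4j_password: Optional[str]

    @property
    def database_enabled(self) -> bool:
        """True only when all three Neo4j values are present."""
        return all([self.neo4j_uri, self.neo4j_user, self.neo4j_password])


def load_settings(env: Mapping[str, str] = os.environ) -> AppSettings:
    """Read settings from env, mapping missing or empty values to None."""
    return AppSettings(
        semantic_scholar_api_key=env.get("SEMANTIC_SCHOLAR_API_KEY") or None,
        neo4j_uri=env.get("NEO4J_URI") or None,
        neo4j_user=env.get("NEO4J_USER") or None,
        neo4j_password=env.get("NEO4J_PASSWORD") or None,
    )
```

Passing a plain dictionary instead of `os.environ` makes the loader easy to exercise in tests without touching the real environment.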
Streamlit Configuration¶
- Port: 8501 (default)
- Theme: Custom academic theme
- Caching: Enabled for ML models and API responses
- Error Handling: Detailed error messages in development
Performance Optimizations¶
Caching Strategy¶
- Model Loading: Models cached on first load
- Predictions: LRU cache with TTL expiration
- API Responses: Response caching with rate limiting
- Embeddings: In-memory caching of frequently accessed embeddings
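An LRU cache with TTL expiration, as listed for predictions, can be built on `collections.OrderedDict`. The sketch below is a minimal single-threaded version. The class name, the default size and TTL, and the injectable `clock` parameter are assumptions made for illustration, not the app's actual cache.

```python
import time
from collections import OrderedDict
from typing import Any, Callable, Hashable


class TTLLRUCache:
    """Least-recently-used cache whose entries also expire after ttl_seconds."""

    def __init__(
        self,
        max_size: int = 128,
        ttl_seconds: float = 300.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._max_size = max_size
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: "OrderedDict[Hashable, tuple]" = OrderedDict()

    def get(self, key: Hashable, default: Any = None) -> Any:
        entry = self._entries.get(key)
        if entry is None:
            return default
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # expired: drop it and report a miss
            return default
        self._entries.move_to_end(key)  # mark as most recently used
        return value

    def put(self, key: Hashable, value: Any) -> None:
        self._entries[key] = (value, self._clock() + self._ttl)
        self._entries.move_to_end(key)
        while len(self._entries) > self._max_size:
            self._entries.popitem(last=False)  # evict least recently used

    def __len__(self) -> int:
        return len(self._entries)
```

Injecting the clock keeps expiry deterministic in tests. A production cache shared across threads would also need a lock.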
Scalability Features¶
- Lazy Loading: Components loaded on-demand
- Batch Processing: Efficient handling of multiple predictions
- Memory Management: Automatic cache eviction
- Error Recovery: Graceful handling of service failures
Troubleshooting¶
Common Issues¶
- Model Not Found:
  - Ensure the models/ directory contains the locally trained model files
  - Check file permissions and paths
- Paper Not in Model:
  - Model trained on specific dataset (computer science papers)
  - Try papers from major CS venues (ICML, NeurIPS, etc.)
- Slow Performance:
  - First prediction takes longer (model loading)
  - Subsequent predictions are cached
  - Consider GPU for large-scale usage
- API Rate Limits:
  - Built-in rate limiting prevents 429 errors
  - Consider API key for higher limits
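Client-side rate limiting of the kind mentioned above usually means spacing requests so the server never sees a burst. The sketch below enforces a minimum interval between calls. The class name and the requests-per-second figure are hypothetical, and the actual limits depend on the Semantic Scholar API and on whether an API key is set.

```python
import time
from typing import Callable


class MinIntervalRateLimiter:
    """Spaces calls at least 1 / requests_per_second seconds apart."""

    def __init__(
        self,
        requests_per_second: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        if requests_per_second <= 0:
            raise ValueError("requests_per_second must be positive")
        self._interval = 1.0 / requests_per_second
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = float("-inf")  # first call never waits

    def acquire(self) -> float:
        """Block until the next request is allowed; return seconds waited."""
        now = self._clock()
        wait = max(0.0, self._next_allowed - now)
        if wait > 0:
            self._sleep(wait)
        self._next_allowed = max(now, self._next_allowed) + self._interval
        return wait
```

A caller would invoke `acquire()` immediately before each API request. The injectable `clock` and `sleep` exist only so the behavior can be checked without real delays.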
Debug Mode¶
Contributing¶
- Fork the repository
- Create feature branch
- Add tests for new functionality
- Submit pull request
License¶
This project is part of Citation Compass and follows the same licensing terms.
Built with ❤️ using Streamlit, PyTorch, and Machine Learning