Setup Guide¶
Setup Overview
Get Citation Compass running in 4 simple steps: Clone → Configure → Database → Launch
flowchart LR
A["Clone Repository"] --> B["Configure Environment"]
B --> C["Setup Database"]
C --> D["Launch Application"]
style A fill:#e3f2fd
style B fill:#fff3e0
style C fill:#e8f5e8
style D fill:#fce4ec
Prerequisites¶
System Requirements
- Python 3.8+ (recommended: Python 3.10+)
- Neo4j Database (local installation or Neo4j AuraDB cloud instance)
- Git version control
- 4GB+ RAM (for ML model operations)
Quick Start¶
1. Clone and Setup Environment¶
# Clone the repository
git clone https://github.com/dagny099/citation-compass.git
cd citation-compass
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -e ".[all]"
2. Configure Environment¶
# Copy environment template
cp .env.example .env
# Edit .env with your database credentials
# NEO4J_URI=neo4j+s://your-database-url
# NEO4J_USER=neo4j
# NEO4J_PASSWORD=your-password
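3. Setup Database¶
# Run the setup script, which also tests the connection
python setup_database.py
4. Verify Models¶
# Run the health check shown under "Model Files Setup" in Detailed Setup
5. Launch Application¶
# Launch the Streamlit app
streamlit run app.py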
Detailed Setup¶
Environment Configuration¶
Required Variables¶
# Neo4j Database (required)
NEO4J_URI=neo4j+s://your-database-url # or neo4j://localhost:7687 for local
NEO4J_USER=neo4j
NEO4J_PASSWORD=your-password
NEO4J_DATABASE=neo4j # optional, defaults to 'neo4j'
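The required variables can be sanity-checked before launching anything. The sketch below is an illustration, not part of Citation Compass: `parse_env` and `check_env` are hypothetical helper names, and the parser is deliberately simple (it treats ` #` as the start of an inline comment, so it would truncate a password containing that sequence). It reads `.env` directly because `os.getenv` only sees variables that are already exported into the current process.

```python
from pathlib import Path
from urllib.parse import urlparse

REQUIRED_KEYS = ("NEO4J_URI", "NEO4J_USER", "NEO4J_PASSWORD")
# URI schemes accepted by the official Neo4j drivers
KNOWN_SCHEMES = {"neo4j", "neo4j+s", "neo4j+ssc", "bolt", "bolt+s", "bolt+ssc"}


def parse_env(text: str) -> dict:
    """Parse KEY=VALUE lines, skipping blank lines and comment lines."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop a trailing inline comment such as "neo4j # optional"
        value = value.split(" #", 1)[0].strip().strip("'\"")
        values[key.strip()] = value
    return values


def check_env(values: dict) -> list:
    """Return a list of human-readable problems (empty list means OK)."""
    problems = [
        f"{key} is missing or empty" for key in REQUIRED_KEYS if not values.get(key)
    ]
    uri = values.get("NEO4J_URI", "")
    if uri and urlparse(uri).scheme not in KNOWN_SCHEMES:
        problems.append(f"NEO4J_URI has an unrecognized scheme: {uri!r}")
    return problems


if __name__ == "__main__":
    env_file = Path(".env")
    if env_file.exists():
        found = check_env(parse_env(env_file.read_text()))
        for problem in found or ["No problems found"]:
            print(problem)
    else:
        print(".env not found; copy .env.example first")
```

Run it from the repository root. An empty problem list only means the file is well-formed; it does not prove the credentials are valid.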
Optional Variables¶
# Semantic Scholar API (optional, improves rate limits)
SEMANTIC_SCHOLAR_API_KEY=your-api-key
# Logging configuration
LOG_LEVEL=INFO # DEBUG, INFO, WARNING, ERROR
LOG_FILE=logs/app.log
# Cache settings
CACHE_ENABLED=true
CACHE_DEFAULT_TTL=300 # seconds
# Development mode
ENVIRONMENT=development # or production
DEBUG=false
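Environment variables are always strings, so values such as `CACHE_ENABLED=true` and `CACHE_DEFAULT_TTL=300` need converting before use. The following is a minimal sketch of one way to do that, not the application's actual configuration loader. The helper names are hypothetical, and the fallback values simply mirror the example values above; they are assumptions, not documented defaults.

```python
import os


def env_bool(name, default, environ=os.environ):
    """Interpret common truthy spellings; fall back when unset or blank."""
    raw = environ.get(name)
    if raw is None or raw.strip() == "":
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def env_int(name, default, environ=os.environ):
    """Parse an integer setting; fall back when unset or not a number."""
    try:
        return int(environ.get(name))
    except (TypeError, ValueError):
        return default


def load_optional_settings(environ=os.environ):
    """Collect the optional settings into a plain dict (assumed fallbacks)."""
    return {
        "log_level": environ.get("LOG_LEVEL", "INFO").upper(),
        "log_file": environ.get("LOG_FILE", "logs/app.log"),
        "cache_enabled": env_bool("CACHE_ENABLED", True, environ),
        "cache_ttl": env_int("CACHE_DEFAULT_TTL", 300, environ),
        "environment": environ.get("ENVIRONMENT", "development"),
        "debug": env_bool("DEBUG", False, environ),
    }
```

Passing a mapping explicitly, as in `load_optional_settings({"DEBUG": "true"})`, makes the function easy to exercise without touching the real environment.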
Database Setup Options¶
Option 1: Neo4j AuraDB (Recommended for beginners)¶
- Go to Neo4j AuraDB
- Create a free account and database instance
- Download connection credentials
- Use the provided URI, username, and password in .env
Option 2: Local Neo4j Installation¶
# Download and install Neo4j Desktop
# Or use Docker:
docker run \
--name neo4j \
-p7474:7474 -p7687:7687 \
-d \
-v $HOME/neo4j/data:/data \
-v $HOME/neo4j/logs:/logs \
-v $HOME/neo4j/import:/var/lib/neo4j/import \
--env NEO4J_AUTH=neo4j/your-password \
neo4j:latest
Set in .env:
NEO4J_URI=neo4j://localhost:7687
NEO4J_USER=neo4j
NEO4J_PASSWORD=your-password
Installation Options¶
Core Installation (minimal)¶
pip install -e .
ML Components Only¶
Web Interface Only¶
Development Environment¶
pip install -e ".[dev]"
All Components¶
pip install -e ".[all]"
Model Files Setup¶
The ML models are trained locally and stored in the models/ directory. Verify they're accessible:
python -c "
from src.services.ml_service import get_ml_service
ml = get_ml_service()
health = ml.health_check()
print('Model Status:', health['status'])
print('Entities:', health.get('num_entities', 0))
"
If models are missing or corrupted, check that the models/ directory contains:
- transe_citation_model.pt (~19MB): the trained TransE citation prediction model
- entity_mapping.pkl (~577KB): maps papers/authors to model embeddings
- training_metadata.pkl (~200B): training configuration and metrics
If these files are missing, you'll need to run the model training pipeline using the notebooks in the notebooks/ directory.
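The file check described above can be scripted. This is a small sketch using only the file names given in this guide; `missing_model_files` is a hypothetical helper, and it only confirms that each file exists and is non-empty, not that its contents are valid.

```python
from pathlib import Path

# File names as listed in this guide
EXPECTED_FILES = (
    "transe_citation_model.pt",
    "entity_mapping.pkl",
    "training_metadata.pkl",
)


def missing_model_files(models_dir="models"):
    """Return the expected files that are absent or empty in models_dir."""
    root = Path(models_dir)
    return [
        name
        for name in EXPECTED_FILES
        if not (root / name).is_file() or (root / name).stat().st_size == 0
    ]


if __name__ == "__main__":
    missing = missing_model_files()
    if missing:
        print("Missing or empty:", ", ".join(missing))
    else:
        print("All expected model files are present")
```

A file that is present but corrupted will still pass this check, so the `health_check()` call shown earlier remains the more meaningful test.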
Testing Setup¶
Run All Tests¶
# Complete test suite
python -m pytest tests/ -v
# With coverage report
python -m pytest tests/ --cov=src --cov-report=html
Test Categories¶
# Data model tests
python -m pytest tests/test_models_simple.py -v
# ML service tests
python -m pytest tests/test_ml_service.py -v
# API client tests
python -m pytest tests/test_unified_api_client.py -v
# Integration tests (requires database)
python -m pytest tests/test_integration.py -v
Manual Application Testing¶
# Launch Streamlit app
streamlit run app.py
# Navigate to pages and test:
# - ML Predictions: Enter paper ID and generate predictions
# - Embedding Explorer: Visualize paper embeddings
# - Enhanced Visualizations: View network analysis
# - Results Interpretation: Explore contextual analysis
Troubleshooting¶
Database Connection Issues¶
Error: "Failed to initialize Neo4j connection"
# Check environment variables
python -c "
import os
print('URI:', os.getenv('NEO4J_URI'))
print('USER:', os.getenv('NEO4J_USER'))
print('PASS:', bool(os.getenv('NEO4J_PASSWORD')))
"
# Test connection manually
python setup_database.py
Error: "Connection test failed"
- Verify Neo4j server is running
- Check firewall settings for port 7687
- Confirm credentials are correct
- For AuraDB, ensure you're using the correct URI format
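To separate network problems from credential problems, it can help to test whether the Bolt port accepts TCP connections at all. The sketch below uses only the Python standard library; `port_is_open` is a hypothetical helper name, and a successful connection says nothing about whether the username and password are correct.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures
        return False
```

For a local installation, `port_is_open("localhost", 7687)` should return `True` once Neo4j is running. A `False` result points to a stopped server or a blocked port rather than bad credentials.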
Model Loading Issues¶
Error: "Model file not found"
- Confirm that models/ contains the files listed under Model Files Setup
Error: "Health check failed"
- Ensure you have sufficient RAM (4GB+)
- Check PyTorch installation: python -c "import torch; print(torch.__version__)"
- Verify CUDA setup if using GPU: python -c "import torch; print(torch.cuda.is_available())"
API Client Issues¶
Error: Rate limit exceeded
- Add Semantic Scholar API key to .env
- Reduce batch sizes in API calls
- Check rate limiter settings in configuration
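When writing your own scripts against a rate-limited API, retrying with exponential backoff is a common complement to the steps above. This is a generic sketch, not the behavior of the project's API client, which may already retry internally. `RateLimitError` and `call_with_backoff` are hypothetical names; substitute whatever exception your client raises on an HTTP 429 response.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for the error raised on HTTP 429."""


def call_with_backoff(func, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call func(), retrying on RateLimitError with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return func()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the error to the caller
            # Delay doubles each attempt (1s, 2s, 4s, ...) plus small jitter
            sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.5))
```

The `sleep` parameter exists so the delay can be replaced in tests; in normal use the default `time.sleep` is fine.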
Error: SSL certificate issues
Streamlit Issues¶
Error: "ModuleNotFoundError"
- Ensure you're in the activated virtual environment
- Reinstall with pip install -e ".[all]"
- Check Python path: python -c "import sys; print(sys.path)"
Error: Page not loading
- Check the browser console for JavaScript errors
- Try a different browser
- Clear the browser cache
- Restart Streamlit: Ctrl+C then streamlit run app.py
Performance Issues¶
Slow predictions:
- Use smaller top_k values
- Enable prediction caching
- Consider using CPU vs GPU based on model size
Memory errors:
- Reduce batch sizes
- Clear caches periodically
- Monitor memory usage: htop or Task Manager
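Prediction caching and periodic cache clearing can be illustrated with a small time-to-live cache. This is a sketch of the general idea, not the application's own cache: `TTLCache` is a hypothetical class, and the 300-second default merely echoes the `CACHE_DEFAULT_TTL` example value from the configuration section.

```python
import time


class TTLCache:
    """Cache values per key and recompute them once they are older than ttl."""

    def __init__(self, ttl_seconds=300.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._store = {}  # key -> (timestamp, value)

    def get_or_compute(self, key, compute):
        """Return a fresh cached value, or call compute() and store the result."""
        now = self._clock()
        hit = self._store.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]
        value = compute()
        self._store[key] = (now, value)
        return value

    def clear(self):
        """Drop all entries, for example to release memory."""
        self._store.clear()
```

A caller might wrap an expensive call as `cache.get_or_compute(paper_id, lambda: predict(paper_id))`, where `predict` stands for whatever prediction function you use. Expired entries are only replaced when their key is requested again, so calling `clear()` periodically keeps memory bounded.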
Development Setup¶
Code Quality Tools¶
# Install development dependencies
pip install -e ".[dev]"
# Run linting
flake8 src/
pylint src/
# Format code
black src/
isort src/
# Type checking
mypy src/
Pre-commit Hooks¶
# Install pre-commit
pip install pre-commit
# Setup hooks
pre-commit install
# Run on all files
pre-commit run --all-files
Adding New Dependencies¶
Production Deployment¶
Environment Preparation¶
# Use production environment
ENVIRONMENT=production
DEBUG=false
LOG_LEVEL=WARNING
# Use managed database
NEO4J_URI=neo4j+s://production-database-url
# Enable monitoring
LOG_FILE=/var/log/citation-compass/app.log
Docker Deployment¶
# Build container (the repository does not include a Dockerfile; write one first)
docker build -t citation-compass .
# Run with environment
docker run -p 8501:8501 --env-file .env citation-compass
Health Monitoring¶
# Run health checks from the command line
python -c "
from src.services.ml_service import get_ml_service
from src.database.connection import Neo4jConnection
# Check all services
ml_health = get_ml_service().health_check()
db = Neo4jConnection()
db_health = db.test_connection()
print('ML Service:', ml_health['status'])
print('Database:', 'healthy' if db_health else 'unhealthy')
"
This setup guide is updated as the system evolves. Last updated: August 2025