Setup Guide

Setup Overview

Get Citation Compass running in a few simple steps: Clone → Configure → Database → Launch

flowchart LR
    A["📥 Clone Repository"] --> B["⚙️ Configure Environment"]
    B --> C["🗄️ Setup Database"]
    C --> D["🚀 Launch Application"]

    style A fill:#e3f2fd
    style B fill:#fff3e0
    style C fill:#e8f5e8
    style D fill:#fce4ec

Prerequisites

System Requirements

  • Python 3.8+ (recommended: Python 3.10+)
  • Neo4j Database (local installation or Neo4j AuraDB cloud instance)
  • Git version control
  • 4GB+ RAM (for ML model operations)

Quick Start

1. Clone and Setup Environment

# Clone the repository
git clone https://github.com/dagny099/citation-compass.git
cd citation-compass

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -e ".[all]"

2. Configure Environment

# Copy environment template
cp .env.example .env

# Edit .env with your database credentials
# NEO4J_URI=neo4j+s://your-database-url
# NEO4J_USER=neo4j
# NEO4J_PASSWORD=your-password

3. Setup Database

# Run database setup script
python setup_database.py

4. Verify Models

# Verify ML model files are accessible
python verify_models.py

5. Launch Application

# Start Streamlit application
streamlit run app.py

Detailed Setup

Environment Configuration

Required Variables

# Neo4j Database (required)
NEO4J_URI=neo4j+s://your-database-url  # or neo4j://localhost:7687 for local
NEO4J_USER=neo4j
NEO4J_PASSWORD=your-password
NEO4J_DATABASE=neo4j  # optional, defaults to 'neo4j'

Optional Variables

# Semantic Scholar API (optional, improves rate limits)
SEMANTIC_SCHOLAR_API_KEY=your-api-key

# Logging configuration
LOG_LEVEL=INFO  # DEBUG, INFO, WARNING, ERROR
LOG_FILE=logs/app.log

# Cache settings
CACHE_ENABLED=true
CACHE_DEFAULT_TTL=300  # seconds

# Development mode
ENVIRONMENT=development  # or production
DEBUG=false
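
CACHE_DEFAULT_TTL is the number of seconds a cached entry stays valid. Conceptually it behaves like this simplified sketch (illustrative only, not the app's actual cache implementation):

```python
import time

class TTLCache:
    """Minimal time-to-live cache illustrating CACHE_DEFAULT_TTL semantics."""

    def __init__(self, ttl=300):
        self.ttl = ttl
        self._store = {}

    def set(self, key, value):
        # Record the value along with its expiry time
        self._store[key] = (value, time.monotonic() + self.ttl)

    def get(self, key, default=None):
        entry = self._store.get(key)
        if entry is None:
            return default
        value, expires = entry
        if time.monotonic() >= expires:
            # Entry has aged out; evict it
            del self._store[key]
            return default
        return value
```

A longer TTL means fewer repeated database/API calls at the cost of staler results; 300 seconds is a reasonable default for slowly changing citation data.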

Database Setup Options

Option 1: Neo4j AuraDB (Cloud)

  1. Go to Neo4j AuraDB
  2. Create a free account and database instance
  3. Download connection credentials
  4. Use the provided URI, username, and password in .env

Option 2: Local Neo4j Installation

# Download and install Neo4j Desktop
# Or use Docker:
docker run \
    --name neo4j \
    -p7474:7474 -p7687:7687 \
    -d \
    -v $HOME/neo4j/data:/data \
    -v $HOME/neo4j/logs:/logs \
    -v $HOME/neo4j/import:/var/lib/neo4j/import \
    --env NEO4J_AUTH=neo4j/your-password \
    neo4j:latest

Set in .env:

NEO4J_URI=neo4j://localhost:7687
NEO4J_USER=neo4j
NEO4J_PASSWORD=your-password

Installation Options

Core Installation (minimal)

pip install -e .

ML Components Only

pip install -e ".[ml]"

Web Interface Only

pip install -e ".[web]"

Development Environment

pip install -e ".[dev,all]"

All Components

pip install -e ".[all]"

Model Files Setup

The ML models are trained locally and stored in the models/ directory. Verify they're accessible:

python -c "
from src.services.ml_service import get_ml_service
ml = get_ml_service()
health = ml.health_check()
print('Model Status:', health['status'])
print('Entities:', health.get('num_entities', 0))
"

If models are missing or corrupted, check that the models/ directory contains:

  • transe_citation_model.pt (~19MB) - the trained TransE citation prediction model
  • entity_mapping.pkl (~577KB) - maps papers/authors to model embeddings
  • training_metadata.pkl (~200B) - training configuration and metrics

If these files are missing, you'll need to run the model training pipeline using the notebooks in the notebooks/ directory.

Testing Setup

Run All Tests

# Complete test suite
python -m pytest tests/ -v

# With coverage report
python -m pytest tests/ --cov=src --cov-report=html

Test Categories

# Data model tests
python -m pytest tests/test_models_simple.py -v

# ML service tests  
python -m pytest tests/test_ml_service.py -v

# API client tests
python -m pytest tests/test_unified_api_client.py -v

# Integration tests (requires database)
python -m pytest tests/test_integration.py -v

Manual Application Testing

# Launch Streamlit app
streamlit run app.py

# Navigate to pages and test:
# - ML Predictions: Enter paper ID and generate predictions
# - Embedding Explorer: Visualize paper embeddings
# - Enhanced Visualizations: View network analysis
# - Results Interpretation: Explore contextual analysis

Troubleshooting

Database Connection Issues

Error: "Failed to initialize Neo4j connection"

# Check environment variables
python -c "
import os
print('URI:', os.getenv('NEO4J_URI'))
print('USER:', os.getenv('NEO4J_USER'))  
print('PASS:', bool(os.getenv('NEO4J_PASSWORD')))
"

# Test connection manually
python setup_database.py

Error: "Connection test failed"

  • Verify the Neo4j server is running
  • Check firewall settings for port 7687
  • Confirm credentials are correct
  • For AuraDB, ensure you're using the correct URI format
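
Many AuraDB failures come down to the URI scheme. A small validator along these lines (a hypothetical helper, not shipped with the repo) can flag the common mistakes before the driver ever connects:

```python
from urllib.parse import urlparse

VALID_SCHEMES = ("neo4j", "neo4j+s", "neo4j+ssc", "bolt", "bolt+s", "bolt+ssc")

def check_neo4j_uri(uri):
    """Return a list of likely problems with a Neo4j connection URI."""
    problems = []
    parsed = urlparse(uri)
    if parsed.scheme not in VALID_SCHEMES:
        problems.append(f"unknown scheme '{parsed.scheme}'")
    # AuraDB endpoints (*.neo4j.io) only accept encrypted connections
    if parsed.hostname and parsed.hostname.endswith("neo4j.io") and "+s" not in parsed.scheme:
        problems.append("AuraDB hosts require an encrypted scheme such as neo4j+s://")
    return problems
```

For example, check_neo4j_uri("neo4j://abc.databases.neo4j.io") reports the missing encryption, while a plain neo4j://localhost:7687 for a local install passes cleanly.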

Model Loading Issues

Error: "Model file not found"

# Check model files exist
ls -la models/

# Verify with script
python verify_models.py

Error: "Health check failed"

  • Ensure you have sufficient RAM (4GB+)
  • Check the PyTorch installation: python -c "import torch; print(torch.__version__)"
  • Verify the CUDA setup if using a GPU: python -c "import torch; print(torch.cuda.is_available())"

API Client Issues

Error: Rate limit exceeded

  • Add a Semantic Scholar API key to .env
  • Reduce batch sizes in API calls
  • Check rate limiter settings in the configuration

Error: SSL certificate issues

# For development/testing only:
export PYTHONHTTPSVERIFY=0

Streamlit Issues

Error: "ModuleNotFoundError"

  • Ensure you're in the activated virtual environment
  • Reinstall with pip install -e ".[all]"
  • Check the Python path: python -c "import sys; print(sys.path)"

Error: Page not loading

  • Check the browser console for JavaScript errors
  • Try a different browser
  • Clear the browser cache
  • Restart Streamlit: Ctrl+C, then streamlit run app.py

Performance Issues

Slow predictions:

  • Use smaller top_k values
  • Enable prediction caching
  • Choose between CPU and GPU based on model size

Memory errors:

  • Reduce batch sizes
  • Clear caches periodically
  • Monitor memory usage with htop or Task Manager
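
To put numbers on memory usage without extra tooling, the standard library can report the process's peak resident set size (a Unix-only sketch; note that ru_maxrss units differ between Linux and macOS):

```python
import resource
import sys

def peak_memory_mb():
    """Peak resident set size of this process, in MB (Unix only)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in KB on Linux but in bytes on macOS
    if sys.platform == "darwin":
        peak /= 1024
    return peak / 1024
```

Printing peak_memory_mb() before and after a batch of predictions gives a quick read on whether batch sizes need to come down.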

Development Setup

Code Quality Tools

# Install development dependencies
pip install -e ".[dev]"

# Run linting
flake8 src/
pylint src/

# Format code
black src/
isort src/

# Type checking
mypy src/

Pre-commit Hooks

# Install pre-commit
pip install pre-commit

# Setup hooks
pre-commit install

# Run on all files
pre-commit run --all-files

Adding New Dependencies

# Edit pyproject.toml dependencies
# Then reinstall in development mode
pip install -e ".[dev,all]"

Production Deployment

Environment Preparation

# Use production environment
ENVIRONMENT=production
DEBUG=false
LOG_LEVEL=WARNING

# Use managed database
NEO4J_URI=neo4j+s://production-database-url

# Enable monitoring
LOG_FILE=/var/log/citation-compass/app.log

Docker Deployment

# Build container (Dockerfile not included, but recommended structure)
docker build -t citation-compass .

# Run with environment
docker run -p 8501:8501 --env-file .env citation-compass
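
Since the repository doesn't ship a Dockerfile, a starting point might look like the following sketch (the base image, extras, and port here are all assumptions to adapt):

```dockerfile
# Hypothetical Dockerfile sketch -- not part of the repo
FROM python:3.10-slim

WORKDIR /app

# Copy the project and install it with all extras
COPY . .
RUN pip install --no-cache-dir -e ".[all]"

# Streamlit's default port
EXPOSE 8501

# Credentials come from --env-file at run time, not baked into the image
CMD ["streamlit", "run", "app.py", "--server.port=8501", "--server.address=0.0.0.0"]
```

Keeping credentials out of the image and supplying them via --env-file (as in the run command above) means the same image can serve development and production.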

Health Monitoring

# Setup health check endpoints
python -c "
from src.services.ml_service import get_ml_service
from src.database.connection import Neo4jConnection

# Check all services
ml_health = get_ml_service().health_check()
db = Neo4jConnection()
db_health = db.test_connection()

print('ML Service:', ml_health['status'])
print('Database:', 'healthy' if db_health else 'unhealthy')
"

This setup guide is updated as the system evolves. Last updated: August 2025