
Data Models API Reference

Comprehensive documentation for data models, schemas, and data structures used throughout Citation Compass.

Core Data Models

Paper Model

The central data structure representing academic papers.

models.paper.Paper

Bases: PaperBase

Complete paper model with all fields.

has_abstract property

Check if paper has an abstract.

from_neo4j_record(record) classmethod

Create Paper instance from Neo4j record.

from_semantic_scholar_response(data) classmethod

Create Paper instance from Semantic Scholar API response.

is_highly_cited(threshold=100)

Check if paper is highly cited.

to_neo4j_dict()

Convert to dictionary for Neo4j storage.
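The exact behavior of these helpers is defined by PaperBase; the sketch below is illustration only, with an assumed, simplified field set (the class name PaperSketch and its fields are not part of the real API):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PaperSketch:
    """Simplified stand-in for models.paper.Paper (illustration only)."""
    paper_id: str
    title: str
    abstract: Optional[str] = None
    citation_count: int = 0

    @property
    def has_abstract(self) -> bool:
        # True only for a non-empty, non-whitespace abstract
        return bool(self.abstract and self.abstract.strip())

    def is_highly_cited(self, threshold: int = 100) -> bool:
        # Compare the stored citation count against the caller's threshold
        return self.citation_count >= threshold

    def to_neo4j_dict(self) -> dict:
        # Neo4j node properties must be flat, primitive values
        return {
            "paper_id": self.paper_id,
            "title": self.title,
            "abstract": self.abstract,
            "citation_count": self.citation_count,
        }

paper = PaperSketch(paper_id="649def34", title="Example Paper", citation_count=250)
```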

Citation Model

Represents citation relationships between papers.

models.citation.Citation

Bases: CitationBase

Complete citation model.

to_neo4j_dict()

Convert to dictionary for Neo4j relationship properties.

Author Model

Represents academic authors and their affiliations.

models.author.Author

Bases: AuthorBase

Complete author model with all fields.

career_span property

Calculate career span in years.

is_highly_cited property

Check if author is highly cited.

is_prolific property

Check if author is prolific (high paper count).

Config

Pydantic configuration.

from_neo4j_record(record) classmethod

Create Author instance from Neo4j record.

from_semantic_scholar_response(data) classmethod

Create Author instance from Semantic Scholar API response.

to_neo4j_dict()

Convert to dictionary for Neo4j storage.
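The derived properties can be sketched as follows; the thresholds and field names here are assumptions for illustration, not the values used by AuthorBase:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AuthorSketch:
    """Simplified stand-in for models.author.Author (illustration only)."""
    author_id: str
    name: str
    paper_count: int = 0
    citation_count: int = 0
    first_publication_year: Optional[int] = None
    last_publication_year: Optional[int] = None

    @property
    def career_span(self) -> Optional[int]:
        # Years between first and last known publication
        if self.first_publication_year is None or self.last_publication_year is None:
            return None
        return self.last_publication_year - self.first_publication_year

    @property
    def is_prolific(self) -> bool:
        # Threshold of 50 papers is an assumed cutoff
        return self.paper_count >= 50

    @property
    def is_highly_cited(self) -> bool:
        # Threshold of 1000 citations is an assumed cutoff
        return self.citation_count >= 1000

author = AuthorSketch("a1", "Ada Example", paper_count=72, citation_count=4300,
                      first_publication_year=1998, last_publication_year=2023)
```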

Venue Model

Represents publication venues (journals, conferences).

models.venue.Venue

Bases: VenueBase

Complete venue model.

to_neo4j_dict()

Convert to dictionary for Neo4j storage.

Field Model

Represents academic fields of study.

models.field.Field module-attribute

Module-level alias of ResearchField.


Network Models

Network Graph

models.network.NetworkGraph

Bases: BaseModel

Complete network graph representation.

Aggregates nodes and edges with metadata for visualization and analysis across different backends.

filter_by_node_type(node_types)

Create subgraph with only specified node types.

get_edges_for_node(node_id)

Get all edges connected to a node.

get_neighbors(node_id)

Get neighbor node IDs for a given node.

get_node_by_id(node_id)

Get node by ID.

get_statistics()

Get comprehensive graph statistics.

to_networkx_format()

Convert to format suitable for NetworkX.

to_pandas_dataframes()

Convert to pandas DataFrames for analysis.

to_pyvis_format()

Convert to format suitable for Pyvis.

update_counts_and_density()

Update node count, edge count, and density when nodes/edges are set.
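The neighbor lookup and statistics can be sketched with a plain node/edge representation. This is an assumption about the internals; the real model wraps nodes and edges in Pydantic objects with metadata:

```python
from collections import defaultdict

# Assumed minimal representation: node IDs plus directed (source, target) edges
nodes = {"p1", "p2", "p3", "p4"}
edges = [("p1", "p2"), ("p1", "p3"), ("p2", "p3")]

# Build an adjacency index once so neighbor lookups are cheap
adjacency = defaultdict(set)
for src, dst in edges:
    adjacency[src].add(dst)
    adjacency[dst].add(src)  # include both directions for neighbor queries

def get_neighbors(node_id: str) -> list:
    return sorted(adjacency[node_id])

def get_statistics(nodes: set, edges: list) -> dict:
    n, e = len(nodes), len(edges)
    return {
        "node_count": n,
        "edge_count": e,
        # Directed-graph density: edges out of n*(n-1) possible
        "density": e / (n * (n - 1)) if n > 1 else 0.0,
    }

stats = get_statistics(nodes, edges)
```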

Network Analysis

models.network.NetworkAnalysis

Bases: BaseModel

Results of network analysis calculations.

Stores computed network metrics and statistics for display and further analysis.

get_summary_statistics()

Get summary of key network statistics.

get_top_nodes_by_metric(metric_name, k=10)

Get top K nodes by specified metric.
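Assuming metrics are stored as per-node score mappings, the top-K query reduces to a sort; the metric values below are made up for illustration:

```python
# Assumed shape: one {node_id: value} mapping per metric (e.g. degree centrality)
centrality = {"p1": 0.91, "p2": 0.40, "p3": 0.73, "p4": 0.11}

def get_top_nodes_by_metric(metric: dict, k: int = 10) -> list:
    """Return (node_id, value) pairs sorted by value, highest first."""
    return sorted(metric.items(), key=lambda item: item[1], reverse=True)[:k]

top_two = get_top_nodes_by_metric(centrality, k=2)
```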


Machine Learning Models

Citation Prediction

models.ml.CitationPrediction

Bases: BaseModel

Model for citation prediction results.

Represents the output of ML models predicting whether one paper should cite another, with confidence scores and metadata.

confidence_level property

Get categorical confidence level.

is_positive_prediction(threshold=0.5)

Check if this is a positive prediction above threshold.

to_dict()

Convert to dictionary for storage or API responses.
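A plausible reading of the two prediction helpers, with band cutoffs that are assumptions rather than the model's actual values:

```python
def confidence_level(confidence: float) -> str:
    """Map a [0, 1] confidence score to a categorical band (cutoffs assumed)."""
    if confidence >= 0.8:
        return "high"
    if confidence >= 0.5:
        return "medium"
    return "low"

def is_positive_prediction(confidence: float, threshold: float = 0.5) -> bool:
    # A prediction counts as positive once it clears the threshold
    return confidence >= threshold
```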

Training Configuration

models.ml.TrainingConfig

Bases: BaseModel

Configuration for model training.

Encapsulates all parameters needed to train a citation prediction model, supporting reproducible training workflows.

get_test_split()

Calculate test split fraction.

validate_splits(v, info) classmethod

Ensure data splits are valid.
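The split arithmetic these two methods imply can be sketched as plain functions (the real versions run inside Pydantic validators):

```python
def validate_splits(train_split: float, val_split: float) -> None:
    """Reject split fractions that cannot leave a non-empty test set."""
    for name, value in (("train_split", train_split), ("val_split", val_split)):
        if not 0.0 < value < 1.0:
            raise ValueError(f"{name} must be in (0, 1), got {value}")
    if train_split + val_split >= 1.0:
        raise ValueError("train_split + val_split must leave room for a test split")

def get_test_split(train_split: float, val_split: float) -> float:
    """The test fraction is whatever train and validation leave over."""
    validate_splits(train_split, val_split)
    return round(1.0 - train_split - val_split, 6)
```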

Evaluation Metrics

models.ml.EvaluationMetrics

Bases: BaseModel

Comprehensive evaluation metrics for citation prediction models.

Provides a standardized way to store and compare model performance across different evaluation runs and model types.

classification_metrics_summary()

Get summary of classification metrics.

get_performance_grade(metric='mean_reciprocal_rank')

Get letter grade for performance.

is_better_than(other, primary_metric='mean_reciprocal_rank')

Compare performance with another evaluation.

ranking_metrics_summary()

Get summary of ranking metrics.
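Comparison and grading can be sketched over plain metric dictionaries; the letter-grade cutoffs are assumptions, and "higher is better" is assumed for the default mean_reciprocal_rank metric:

```python
def is_better_than(metrics_a: dict, metrics_b: dict,
                   primary_metric: str = "mean_reciprocal_rank") -> bool:
    """Higher is assumed better for the ranking metrics used here."""
    return metrics_a[primary_metric] > metrics_b[primary_metric]

def get_performance_grade(metrics: dict, metric: str = "mean_reciprocal_rank") -> str:
    """Map a [0, 1] metric value to a letter grade (cutoffs assumed)."""
    value = metrics[metric]
    for grade, cutoff in (("A", 0.9), ("B", 0.8), ("C", 0.7), ("D", 0.6)):
        if value >= cutoff:
            return grade
    return "F"

run_a = {"mean_reciprocal_rank": 0.82, "hits_at_10": 0.91}
run_b = {"mean_reciprocal_rank": 0.74, "hits_at_10": 0.88}
```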

Paper Embedding

models.ml.PaperEmbedding

Bases: BaseModel

Model for storing and managing paper embeddings from ML models.

This supports the TransE model integration from citation-map-dashboard and provides a unified interface for embedding storage.

cosine_similarity(other)

Calculate cosine similarity with another embedding.

from_numpy(paper_id, embedding, model_name, model_version=None) classmethod

Create PaperEmbedding from numpy array.

to_numpy()

Convert embedding to numpy array.

validate_embedding_dim_positive(v) classmethod

Ensure embedding dimension is positive.

validate_embedding_dimension(v, info) classmethod

Ensure embedding dimension matches declared dimension.
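The similarity computation itself is standard cosine similarity; a self-contained version without numpy (the real method operates on the stored embedding vectors):

```python
import math
from typing import Sequence

def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("Embedding dimensions must match")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # define similarity with a zero vector as 0
    return dot / (norm_a * norm_b)
```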


API Models

API Response Models

models.api.APIResponse

Bases: BaseModel, Generic[T]

Generic API response wrapper.

Provides consistent response structure across all endpoints with status, data, metadata, and error handling.

error(errors, message=None) classmethod

Create error response.

partial(data, errors, message=None) classmethod

Create partial success response.

success(data, message=None, meta=None) classmethod

Create successful response.
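The shape these factory methods produce can be sketched as plain dictionaries; the exact key names ("status", "errors", "meta") are assumptions about the wrapper's fields:

```python
from typing import Any, Optional

def success(data: Any, message: Optional[str] = None,
            meta: Optional[dict] = None) -> dict:
    """Wrap a successful payload with consistent envelope fields."""
    return {"status": "success", "data": data, "message": message,
            "errors": [], "meta": meta or {}}

def error(errors: list, message: Optional[str] = None) -> dict:
    """Wrap one or more errors; no data is returned."""
    return {"status": "error", "data": None, "message": message,
            "errors": errors, "meta": {}}

ok = success({"paper_id": "p1"}, message="found")
bad = error(["paper not found"], message="lookup failed")
```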

API Error

models.api.APIError

Bases: BaseModel

Standardized error response model.

Provides consistent error reporting across all API endpoints with detailed information for debugging and user feedback.

not_found(resource, identifier) classmethod

Create not found error.

rate_limited(retry_after=None) classmethod

Create rate limit error.

server_error(message='Internal server error', details=None) classmethod

Create server error.

validation_error(message, field=None, details=None) classmethod

Create validation error.

Pagination

models.api.PaginatedResponse

Bases: BaseModel, Generic[T]

Paginated API response wrapper.

Extends the standard API response with pagination metadata for endpoints that return multiple items.

success(data, pagination, message=None, meta=None) classmethod

Create successful paginated response.
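The pagination metadata such a wrapper carries is typically derived from three inputs; the key names below are assumptions for illustration:

```python
import math

def pagination_meta(total_items: int, page: int, page_size: int) -> dict:
    """Derive the pagination block returned alongside the data."""
    total_pages = math.ceil(total_items / page_size) if page_size else 0
    return {
        "page": page,
        "page_size": page_size,
        "total_items": total_items,
        "total_pages": total_pages,
        "has_next": page < total_pages,
        "has_previous": page > 1,
    }

meta = pagination_meta(total_items=95, page=2, page_size=20)
```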


Model Schemas and Validation

Pydantic Base Models

All models inherit from a shared Pydantic base class that applies common configuration:

from pydantic import BaseModel, ConfigDict

class CitationPlatformBaseModel(BaseModel):
    """Base model with common configuration."""

    model_config = ConfigDict(
        # Allow construction from ORM/database objects
        from_attributes=True,

        # Allow population by field name or alias
        populate_by_name=True,

        # Re-validate on attribute assignment to ensure data integrity
        validate_assignment=True,

        # Use enum values for serialization
        use_enum_values=True,

        # Extra content for the generated JSON schema
        json_schema_extra={"example": {}},
    )

Field Validation Examples

from typing import Optional
from pydantic import Field, field_validator
import re

class Paper(CitationPlatformBaseModel):
    paper_id: str = Field(..., pattern=r'^[a-zA-Z0-9\-_]+$', description="Unique paper identifier")
    title: str = Field(..., min_length=1, max_length=500, description="Paper title")
    abstract: Optional[str] = Field(None, max_length=5000, description="Paper abstract")
    year: Optional[int] = Field(None, ge=1900, le=2030, description="Publication year")
    citation_count: int = Field(0, ge=0, description="Number of citations")

    @field_validator('title')
    @classmethod
    def validate_title(cls, v: str) -> str:
        """Ensure title is properly formatted."""
        if not v.strip():
            raise ValueError('Title cannot be empty or whitespace only')
        return v.strip()

    @field_validator('abstract')
    @classmethod
    def validate_abstract(cls, v: Optional[str]) -> Optional[str]:
        """Clean and validate abstract."""
        if v:
            # Collapse runs of whitespace
            v = re.sub(r'\s+', ' ', v.strip())
            if len(v) < 10:
                raise ValueError('Abstract too short (minimum 10 characters)')
        return v

Custom Field Types

from pydantic import BaseModel, ConfigDict, Field, field_validator
from typing import NewType

# Custom types for semantic clarity
PaperID = NewType('PaperID', str)
AuthorID = NewType('AuthorID', str)
VenueID = NewType('VenueID', str)
ConfidenceScore = NewType('ConfidenceScore', float)

class Prediction(BaseModel):
    # Allow the "model_version" field name (Pydantic reserves the "model_" namespace by default)
    model_config = ConfigDict(protected_namespaces=())

    source_paper: PaperID = Field(..., description="Source paper ID")
    target_paper: PaperID = Field(..., description="Target paper ID")
    confidence: ConfidenceScore = Field(..., ge=0.0, le=1.0, description="Prediction confidence")
    score: float = Field(..., description="Raw model score")
    model_version: str = Field(..., description="Model version used")

    @field_validator('confidence')
    @classmethod
    def round_confidence(cls, v: float) -> float:
        """Round confidence to 3 decimal places."""
        return round(v, 3)

Model Relationships

Entity Relationships

from typing import List, Optional
from pydantic import BaseModel

# String annotations act as forward references, so models can
# refer to each other before every class is defined.

class Paper(BaseModel):
    paper_id: str
    title: str

    # Relationships
    authors: List['Author'] = []
    venue: Optional['Venue'] = None
    fields: List['ResearchField'] = []
    citations: List['Citation'] = []
    references: List['Citation'] = []

class Author(BaseModel):
    author_id: str
    name: str

    # Relationships
    papers: List['Paper'] = []
    affiliations: List[str] = []

# Resolve forward references once all referenced models are defined
Paper.model_rebuild()
Author.model_rebuild()

Database Integration

from src.database.connection import Neo4jConnection
from typing import Optional

class PaperRepository:
    def __init__(self, connection: Neo4jConnection):
        self.conn = connection

    async def get_by_id(self, paper_id: str) -> Optional[Paper]:
        """Retrieve paper by ID with full relationships."""

        query = """
        MATCH (p:Paper {paper_id: $paper_id})
        OPTIONAL MATCH (p)<-[:AUTHORED]-(a:Author)
        OPTIONAL MATCH (p)-[:PUBLISHED_IN]->(v:Venue)
        OPTIONAL MATCH (p)-[:BELONGS_TO]->(f:Field)
        OPTIONAL MATCH (p)-[:CITES]->(cited:Paper)
        OPTIONAL MATCH (citing:Paper)-[:CITES]->(p)

        RETURN p,
               collect(DISTINCT a) as authors,
               v as venue,
               collect(DISTINCT f) as fields,
               collect(DISTINCT cited) as citations,
               collect(DISTINCT citing) as cited_by
        """

        result = await self.conn.run(query, paper_id=paper_id)
        record = await result.single()

        if not record:
            return None

        return Paper(
            paper_id=record['p']['paper_id'],
            title=record['p']['title'],
            abstract=record['p'].get('abstract'),
            year=record['p'].get('year'),
            authors=[Author(**author) for author in record['authors']],
            venue=Venue(**record['venue']) if record['venue'] else None,
            fields=[ResearchField(**field) for field in record['fields']],
            # Incoming CITES relationships are the citations this paper has received
            citation_count=len(record['cited_by'])
        )

Serialization and Export

JSON Serialization

import json
from typing import Any
from datetime import datetime

from pydantic import BaseModel

class CitationPlatformJSONEncoder(json.JSONEncoder):
    """Custom JSON encoder for platform models."""

    def default(self, obj: Any) -> Any:
        if isinstance(obj, datetime):
            return obj.isoformat()
        if isinstance(obj, BaseModel):
            # Pydantic models serialize via model_dump()
            return obj.model_dump()
        if hasattr(obj, '__dict__'):
            # Other objects with __dict__
            return obj.__dict__
        return super().default(obj)

# Usage
paper = Paper(paper_id="123", title="Example Paper")
json_string = json.dumps(paper, cls=CitationPlatformJSONEncoder, indent=2)

Export Formats

from typing import List
import pandas as pd
from pydantic import BaseModel

class ModelExporter:
    @staticmethod
    def to_dataframe(models: List[BaseModel]) -> pd.DataFrame:
        """Convert a list of models to a pandas DataFrame."""
        data = [model.model_dump() for model in models]
        return pd.DataFrame(data)

    @staticmethod
    def to_csv(models: List[BaseModel], file_path: str):
        """Export models to CSV file."""
        df = ModelExporter.to_dataframe(models)
        df.to_csv(file_path, index=False)

    @staticmethod
    def to_latex(models: List[BaseModel], caption: str = "") -> str:
        """Convert models to LaTeX table."""
        df = ModelExporter.to_dataframe(models)
        return df.to_latex(index=False, caption=caption)

# Usage
papers = [Paper(paper_id=f"{i}", title=f"Paper {i}") for i in range(10)]
ModelExporter.to_csv(papers, "papers.csv")
latex_table = ModelExporter.to_latex(papers, "Sample Papers")

Model Validation and Testing

Unit Tests for Models

import pytest
from pydantic import ValidationError

class TestPaperModel:
    def test_valid_paper_creation(self):
        paper = Paper(
            paper_id="valid123",
            title="A Valid Paper Title",
            abstract="This is a valid abstract with sufficient length.",
            year=2023,
            citation_count=5
        )
        assert paper.paper_id == "valid123"
        assert paper.citation_count == 5

    def test_invalid_paper_id(self):
        with pytest.raises(ValidationError) as exc_info:
            Paper(
                paper_id="invalid@id!",  # Contains invalid characters
                title="Valid Title"
            )
        assert "paper_id" in str(exc_info.value)

    def test_title_validation(self):
        with pytest.raises(ValidationError):
            Paper(paper_id="123", title="   ")  # Whitespace only

        with pytest.raises(ValidationError):
            Paper(paper_id="123", title="")  # Empty string

Model Factory for Testing

import factory
from factory import fuzzy

class PaperFactory(factory.Factory):
    class Meta:
        model = Paper

    paper_id = factory.Sequence(lambda n: f"paper_{n}")
    title = factory.Faker('sentence', nb_words=6)
    abstract = factory.Faker('paragraph')
    year = fuzzy.FuzzyInteger(2000, 2023)
    citation_count = fuzzy.FuzzyInteger(0, 100)

class AuthorFactory(factory.Factory):
    class Meta:
        model = Author

    author_id = factory.Sequence(lambda n: f"author_{n}")
    name = factory.Faker('name')

class CitationFactory(factory.Factory):
    class Meta:
        model = Citation

    source_paper = factory.SubFactory(PaperFactory)
    target_paper = factory.SubFactory(PaperFactory)
    citation_context = factory.Faker('sentence')

# Usage in tests
def test_paper_with_citations():
    paper = PaperFactory()
    citations = CitationFactory.create_batch(5, source_paper=paper)

    assert len(citations) == 5
    assert all(c.source_paper.paper_id == paper.paper_id for c in citations)

Performance Considerations

Model Optimization

from pydantic import BaseModel, ConfigDict
from typing import Optional, List

class OptimizedPaper(BaseModel):
    """Memory-optimized paper model for large-scale processing."""

    # Note: Pydantic models manage their own storage, so __slots__
    # must not be declared on model subclasses.
    model_config = ConfigDict(
        # Restrict fields to built-in types for faster validation
        arbitrary_types_allowed=False,

        # Skip re-validation on attribute assignment
        validate_assignment=False,

        # Ignore unexpected keys instead of validating them
        extra='ignore',
    )

    paper_id: str
    title: str
    year: Optional[int] = None
    citation_count: int = 0

# Batch processing optimization
def process_papers_batch(papers: List[dict]) -> List[OptimizedPaper]:
    """Efficiently process large batches of paper data."""
    # For data already known to be safe, OptimizedPaper.model_construct()
    # can skip validation entirely.
    return [OptimizedPaper.model_validate(paper) for paper in papers]

Lazy Loading

from typing import Any, List, Optional
from pydantic import BaseModel, PrivateAttr

class LazyPaper(BaseModel):
    """Paper model with lazy loading of expensive relationships."""

    paper_id: str
    title: str

    # Lazy-loaded state; private attributes are excluded from
    # validation and serialization.
    _authors: Optional[List[Author]] = PrivateAttr(default=None)
    _citations: Optional[List[Citation]] = PrivateAttr(default=None)
    _loader: Optional[Any] = PrivateAttr(default=None)

    def __init__(self, **data: Any):
        # Private attributes must be set after BaseModel.__init__ runs
        loader = data.pop('loader', None)
        super().__init__(**data)
        self._loader = loader

    @property
    def authors(self) -> List[Author]:
        if self._authors is None and self._loader:
            self._authors = self._loader.load_authors(self.paper_id)
        return self._authors or []

    @property
    def citations(self) -> List[Citation]:
        if self._citations is None and self._loader:
            self._citations = self._loader.load_citations(self.paper_id)
        return self._citations or []

This model system provides type safety, validation, and performance optimization for Citation Compass.