
Data Models API Reference

Comprehensive documentation for data models, schemas, and data structures used throughout Citation Compass.

Core Data Models

Paper Model

The central data structure representing academic papers.

models.paper.Paper

Bases: PaperBase

Complete paper model with all fields.

has_abstract property

Check if paper has an abstract.

from_neo4j_record(record) classmethod

Create Paper instance from Neo4j record.

from_semantic_scholar_response(data) classmethod

Create Paper instance from Semantic Scholar API response.

is_highly_cited(threshold=100)

Check if paper is highly cited.

to_neo4j_dict()

Convert to dictionary for Neo4j storage.
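The exact behavior of these helpers is defined by PaperBase; the sketch below is illustration only, with an assumed, simplified field set (the class name PaperSketch and its fields are not part of the real API):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PaperSketch:
    """Simplified stand-in for models.paper.Paper (illustration only)."""
    paper_id: str
    title: str
    abstract: Optional[str] = None
    citation_count: int = 0

    @property
    def has_abstract(self) -> bool:
        # True only for a non-empty, non-whitespace abstract
        return bool(self.abstract and self.abstract.strip())

    def is_highly_cited(self, threshold: int = 100) -> bool:
        # Compare the stored citation count against the caller's threshold
        return self.citation_count >= threshold

    def to_neo4j_dict(self) -> dict:
        # Neo4j node properties must be flat, primitive values
        return {
            "paper_id": self.paper_id,
            "title": self.title,
            "abstract": self.abstract,
            "citation_count": self.citation_count,
        }

paper = PaperSketch(paper_id="649def34", title="Example Paper", citation_count=250)
```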

Citation Model

Represents citation relationships between papers.

models.citation.Citation

Bases: CitationBase

Complete citation model.

to_neo4j_dict()

Convert to dictionary for Neo4j relationship properties.

Author Model

Represents academic authors and their affiliations.

models.author.Author

Bases: AuthorBase

Complete author model with all fields.

career_span property

Calculate career span in years.

is_highly_cited property

Check if author is highly cited.

is_prolific property

Check if author is prolific (high paper count).

Config

Pydantic configuration.

from_neo4j_record(record) classmethod

Create Author instance from Neo4j record.

from_semantic_scholar_response(data) classmethod

Create Author instance from Semantic Scholar API response.

to_neo4j_dict()

Convert to dictionary for Neo4j storage.
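The derived properties can be sketched as follows; the thresholds and field names here are assumptions for illustration, not the values used by AuthorBase:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AuthorSketch:
    """Simplified stand-in for models.author.Author (illustration only)."""
    author_id: str
    name: str
    paper_count: int = 0
    citation_count: int = 0
    first_publication_year: Optional[int] = None
    last_publication_year: Optional[int] = None

    @property
    def career_span(self) -> Optional[int]:
        # Years between first and last known publication
        if self.first_publication_year is None or self.last_publication_year is None:
            return None
        return self.last_publication_year - self.first_publication_year

    @property
    def is_prolific(self) -> bool:
        # Threshold of 50 papers is an assumed cutoff
        return self.paper_count >= 50

    @property
    def is_highly_cited(self) -> bool:
        # Threshold of 1000 citations is an assumed cutoff
        return self.citation_count >= 1000

author = AuthorSketch("a1", "Ada Example", paper_count=72, citation_count=4300,
                      first_publication_year=1998, last_publication_year=2023)
```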

Venue Model

Represents publication venues (journals, conferences).

models.venue.Venue

Bases: VenueBase

Complete venue model.

to_neo4j_dict()

Convert to dictionary for Neo4j storage.

Field Model

Represents academic fields of study.

models.field.Field module-attribute

Module-level alias of ResearchField.


Network Models

Network Graph

models.network.NetworkGraph

Bases: BaseModel

Complete network graph representation.

Aggregates nodes and edges with metadata for visualization and analysis across different backends.

filter_by_node_type(node_types)

Create subgraph with only specified node types.

get_edges_for_node(node_id)

Get all edges connected to a node.

get_neighbors(node_id)

Get neighbor node IDs for a given node.

get_node_by_id(node_id)

Get node by ID.

get_statistics()

Get comprehensive graph statistics.

to_networkx_format()

Convert to format suitable for NetworkX.

to_pandas_dataframes()

Convert to pandas DataFrames for analysis.

to_pyvis_format()

Convert to format suitable for Pyvis.

update_counts_and_density()

Update node count, edge count, and density when nodes/edges are set.
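The neighbor lookup and statistics can be sketched with a plain node/edge representation. This is an assumption about the internals; the real model wraps nodes and edges in Pydantic objects with metadata:

```python
from collections import defaultdict

# Assumed minimal representation: node IDs plus directed (source, target) edges
nodes = {"p1", "p2", "p3", "p4"}
edges = [("p1", "p2"), ("p1", "p3"), ("p2", "p3")]

# Build an adjacency index once so neighbor lookups are cheap
adjacency = defaultdict(set)
for src, dst in edges:
    adjacency[src].add(dst)
    adjacency[dst].add(src)  # include both directions for neighbor queries

def get_neighbors(node_id: str) -> list:
    return sorted(adjacency[node_id])

def get_statistics(nodes: set, edges: list) -> dict:
    n, e = len(nodes), len(edges)
    return {
        "node_count": n,
        "edge_count": e,
        # Directed-graph density: edges out of n*(n-1) possible
        "density": e / (n * (n - 1)) if n > 1 else 0.0,
    }

stats = get_statistics(nodes, edges)
```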

Network Analysis

models.network.NetworkAnalysis

Bases: BaseModel

Results of network analysis calculations.

Stores computed network metrics and statistics for display and further analysis.

get_summary_statistics()

Get summary of key network statistics.

get_top_nodes_by_metric(metric_name, k=10)

Get top K nodes by specified metric.
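Assuming metrics are stored as per-node score mappings, the top-K query reduces to a sort; the metric values below are made up for illustration:

```python
# Assumed shape: one {node_id: value} mapping per metric (e.g. degree centrality)
centrality = {"p1": 0.91, "p2": 0.40, "p3": 0.73, "p4": 0.11}

def get_top_nodes_by_metric(metric: dict, k: int = 10) -> list:
    """Return (node_id, value) pairs sorted by value, highest first."""
    return sorted(metric.items(), key=lambda item: item[1], reverse=True)[:k]

top_two = get_top_nodes_by_metric(centrality, k=2)
```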


Machine Learning Models

Citation Prediction

models.ml.CitationPrediction

Bases: BaseModel

Model for citation prediction results.

Represents the output of ML models predicting whether one paper should cite another, with confidence scores and metadata.

confidence_level property

Get categorical confidence level.

is_positive_prediction(threshold=0.5)

Check if this is a positive prediction above threshold.

to_dict()

Convert to dictionary for storage or API responses.
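A plausible reading of the two prediction helpers, with band cutoffs that are assumptions rather than the model's actual values:

```python
def confidence_level(confidence: float) -> str:
    """Map a [0, 1] confidence score to a categorical band (cutoffs assumed)."""
    if confidence >= 0.8:
        return "high"
    if confidence >= 0.5:
        return "medium"
    return "low"

def is_positive_prediction(confidence: float, threshold: float = 0.5) -> bool:
    # A prediction counts as positive once it clears the threshold
    return confidence >= threshold
```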

Training Configuration

models.ml.TrainingConfig

Bases: BaseModel

Configuration for model training.

Encapsulates all parameters needed to train a citation prediction model, supporting reproducible training workflows.

get_test_split()

Calculate test split fraction.

validate_splits(v, info) classmethod

Ensure data splits are valid.
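The split arithmetic these two methods imply can be sketched as plain functions (the real versions run inside Pydantic validators):

```python
def validate_splits(train_split: float, val_split: float) -> None:
    """Reject split fractions that cannot leave a non-empty test set."""
    for name, value in (("train_split", train_split), ("val_split", val_split)):
        if not 0.0 < value < 1.0:
            raise ValueError(f"{name} must be in (0, 1), got {value}")
    if train_split + val_split >= 1.0:
        raise ValueError("train_split + val_split must leave room for a test split")

def get_test_split(train_split: float, val_split: float) -> float:
    """The test fraction is whatever train and validation leave over."""
    validate_splits(train_split, val_split)
    return round(1.0 - train_split - val_split, 6)
```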

Evaluation Metrics

models.ml.EvaluationMetrics

Bases: BaseModel

Comprehensive evaluation metrics for citation prediction models.

Provides a standardized way to store and compare model performance across different evaluation runs and model types.

classification_metrics_summary()

Get summary of classification metrics.

get_performance_grade(metric='mean_reciprocal_rank')

Get letter grade for performance.

is_better_than(other, primary_metric='mean_reciprocal_rank')

Compare performance with another evaluation.

ranking_metrics_summary()

Get summary of ranking metrics.
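Comparison and grading can be sketched over plain metric dictionaries; the letter-grade cutoffs are assumptions, and "higher is better" is assumed for the default mean_reciprocal_rank metric:

```python
def is_better_than(metrics_a: dict, metrics_b: dict,
                   primary_metric: str = "mean_reciprocal_rank") -> bool:
    """Higher is assumed better for the ranking metrics used here."""
    return metrics_a[primary_metric] > metrics_b[primary_metric]

def get_performance_grade(metrics: dict, metric: str = "mean_reciprocal_rank") -> str:
    """Map a [0, 1] metric value to a letter grade (cutoffs assumed)."""
    value = metrics[metric]
    for grade, cutoff in (("A", 0.9), ("B", 0.8), ("C", 0.7), ("D", 0.6)):
        if value >= cutoff:
            return grade
    return "F"

run_a = {"mean_reciprocal_rank": 0.82, "hits_at_10": 0.91}
run_b = {"mean_reciprocal_rank": 0.74, "hits_at_10": 0.88}
```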

Paper Embedding

models.ml.PaperEmbedding

Bases: BaseModel

Model for storing and managing paper embeddings from ML models.

This supports the TransE model integration from citation-map-dashboard and provides a unified interface for embedding storage.

cosine_similarity(other)

Calculate cosine similarity with another embedding.

from_numpy(paper_id, embedding, model_name, model_version=None) classmethod

Create PaperEmbedding from numpy array.

to_numpy()

Convert embedding to numpy array.

validate_embedding_dim_positive(v) classmethod

Ensure embedding dimension is positive.

validate_embedding_dimension(v, info) classmethod

Ensure embedding dimension matches declared dimension.
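The similarity computation itself is standard cosine similarity; a self-contained version without numpy (the real method operates on the stored embedding vectors):

```python
import math
from typing import Sequence

def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("Embedding dimensions must match")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # define similarity with a zero vector as 0
    return dot / (norm_a * norm_b)
```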


API Models

API Response Models

models.api.APIResponse

Bases: BaseModel, Generic[T]

Generic API response wrapper.

Provides consistent response structure across all endpoints with status, data, metadata, and error handling.

error(errors, message=None) classmethod

Create error response.

partial(data, errors, message=None) classmethod

Create partial success response.

success(data, message=None, meta=None) classmethod

Create successful response.
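The shape these factory methods produce can be sketched as plain dictionaries; the exact key names ("status", "errors", "meta") are assumptions about the wrapper's fields:

```python
from typing import Any, Optional

def success(data: Any, message: Optional[str] = None,
            meta: Optional[dict] = None) -> dict:
    """Wrap a successful payload with consistent envelope fields."""
    return {"status": "success", "data": data, "message": message,
            "errors": [], "meta": meta or {}}

def error(errors: list, message: Optional[str] = None) -> dict:
    """Wrap one or more errors; no data is returned."""
    return {"status": "error", "data": None, "message": message,
            "errors": errors, "meta": {}}

ok = success({"paper_id": "p1"}, message="found")
bad = error(["paper not found"], message="lookup failed")
```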

API Error

models.api.APIError

Bases: BaseModel

Standardized error response model.

Provides consistent error reporting across all API endpoints with detailed information for debugging and user feedback.

not_found(resource, identifier) classmethod

Create not found error.

rate_limited(retry_after=None) classmethod

Create rate limit error.

server_error(message='Internal server error', details=None) classmethod

Create server error.

validation_error(message, field=None, details=None) classmethod

Create validation error.

Pagination

models.api.PaginatedResponse

Bases: BaseModel, Generic[T]

Paginated API response wrapper.

Extends the standard API response with pagination metadata for endpoints that return multiple items.

success(data, pagination, message=None, meta=None) classmethod

Create successful paginated response.
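The pagination metadata such a wrapper carries is typically derived from three inputs; the key names below are assumptions for illustration:

```python
import math

def pagination_meta(total_items: int, page: int, page_size: int) -> dict:
    """Derive the pagination block returned alongside the data."""
    total_pages = math.ceil(total_items / page_size) if page_size else 0
    return {
        "page": page,
        "page_size": page_size,
        "total_items": total_items,
        "total_pages": total_pages,
        "has_next": page < total_pages,
        "has_previous": page > 1,
    }

meta = pagination_meta(total_items=95, page=2, page_size=20)
```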


Model Schemas and Validation

Pydantic Base Models

All models inherit from a shared Pydantic base class that applies common configuration:

from pydantic import BaseModel, ConfigDict

class CitationPlatformBaseModel(BaseModel):
    """Base model with common configuration."""

    model_config = ConfigDict(
        # Allow construction from ORM/database objects
        from_attributes=True,

        # Allow population by field name or alias
        populate_by_name=True,

        # Re-validate on attribute assignment to ensure data integrity
        validate_assignment=True,

        # Use enum values for serialization
        use_enum_values=True,

        # Extra content for the generated JSON schema
        json_schema_extra={"example": {}},
    )

Field Validation Examples

from typing import Optional
from pydantic import Field, field_validator
import re

class Paper(CitationPlatformBaseModel):
    paper_id: str = Field(..., pattern=r'^[a-zA-Z0-9\-_]+$', description="Unique paper identifier")
    title: str = Field(..., min_length=1, max_length=500, description="Paper title")
    abstract: Optional[str] = Field(None, max_length=5000, description="Paper abstract")
    year: Optional[int] = Field(None, ge=1900, le=2030, description="Publication year")
    citation_count: int = Field(0, ge=0, description="Number of citations")

    @field_validator('title')
    @classmethod
    def validate_title(cls, v: str) -> str:
        """Ensure title is properly formatted."""
        if not v.strip():
            raise ValueError('Title cannot be empty or whitespace only')
        return v.strip()

    @field_validator('abstract')
    @classmethod
    def validate_abstract(cls, v: Optional[str]) -> Optional[str]:
        """Clean and validate abstract."""
        if v:
            # Collapse runs of whitespace
            v = re.sub(r'\s+', ' ', v.strip())
            if len(v) < 10:
                raise ValueError('Abstract too short (minimum 10 characters)')
        return v

Custom Field Types

from pydantic import BaseModel, ConfigDict, Field, field_validator
from typing import NewType

# Custom types for semantic clarity
PaperID = NewType('PaperID', str)
AuthorID = NewType('AuthorID', str)
VenueID = NewType('VenueID', str)
ConfidenceScore = NewType('ConfidenceScore', float)

class Prediction(BaseModel):
    # Allow the "model_version" field name (Pydantic reserves the "model_" namespace by default)
    model_config = ConfigDict(protected_namespaces=())

    source_paper: PaperID = Field(..., description="Source paper ID")
    target_paper: PaperID = Field(..., description="Target paper ID")
    confidence: ConfidenceScore = Field(..., ge=0.0, le=1.0, description="Prediction confidence")
    score: float = Field(..., description="Raw model score")
    model_version: str = Field(..., description="Model version used")

    @field_validator('confidence')
    @classmethod
    def round_confidence(cls, v: float) -> float:
        """Round confidence to 3 decimal places."""
        return round(v, 3)

Model Relationships

Entity Relationships

from typing import List, Optional
from pydantic import BaseModel

# String annotations act as forward references, so models can
# refer to each other before every class is defined.

class Paper(BaseModel):
    paper_id: str
    title: str

    # Relationships
    authors: List['Author'] = []
    venue: Optional['Venue'] = None
    fields: List['ResearchField'] = []
    citations: List['Citation'] = []
    references: List['Citation'] = []

class Author(BaseModel):
    author_id: str
    name: str

    # Relationships
    papers: List['Paper'] = []
    affiliations: List[str] = []

# Resolve forward references once all referenced models are defined
Paper.model_rebuild()
Author.model_rebuild()

Database Integration

from src.database.connection import Neo4jConnection
from typing import Optional

class PaperRepository:
    def __init__(self, connection: Neo4jConnection):
        self.conn = connection

    async def get_by_id(self, paper_id: str) -> Optional[Paper]:
        """Retrieve paper by ID with full relationships."""

        query = """
        MATCH (p:Paper {paper_id: $paper_id})
        OPTIONAL MATCH (p)<-[:AUTHORED]-(a:Author)
        OPTIONAL MATCH (p)-[:PUBLISHED_IN]->(v:Venue)
        OPTIONAL MATCH (p)-[:BELONGS_TO]->(f:Field)
        OPTIONAL MATCH (p)-[:CITES]->(cited:Paper)
        OPTIONAL MATCH (citing:Paper)-[:CITES]->(p)

        RETURN p,
               collect(DISTINCT a) as authors,
               v as venue,
               collect(DISTINCT f) as fields,
               collect(DISTINCT cited) as citations,
               collect(DISTINCT citing) as cited_by
        """

        result = await self.conn.run(query, paper_id=paper_id)
        record = await result.single()

        if not record:
            return None

        return Paper(
            paper_id=record['p']['paper_id'],
            title=record['p']['title'],
            abstract=record['p'].get('abstract'),
            year=record['p'].get('year'),
            authors=[Author(**author) for author in record['authors']],
            venue=Venue(**record['venue']) if record['venue'] else None,
            fields=[ResearchField(**field) for field in record['fields']],
            # Incoming CITES relationships are the citations this paper has received
            citation_count=len(record['cited_by'])
        )

Serialization and Export

JSON Serialization

import json
from typing import Any
from datetime import datetime

from pydantic import BaseModel

class CitationPlatformJSONEncoder(json.JSONEncoder):
    """Custom JSON encoder for platform models."""

    def default(self, obj: Any) -> Any:
        if isinstance(obj, datetime):
            return obj.isoformat()
        if isinstance(obj, BaseModel):
            # Pydantic models serialize via model_dump()
            return obj.model_dump()
        if hasattr(obj, '__dict__'):
            # Other objects with __dict__
            return obj.__dict__
        return super().default(obj)

# Usage
paper = Paper(paper_id="123", title="Example Paper")
json_string = json.dumps(paper, cls=CitationPlatformJSONEncoder, indent=2)

Export Formats

from typing import List
import pandas as pd
from pydantic import BaseModel

class ModelExporter:
    @staticmethod
    def to_dataframe(models: List[BaseModel]) -> pd.DataFrame:
        """Convert a list of models to a pandas DataFrame."""
        data = [model.model_dump() for model in models]
        return pd.DataFrame(data)

    @staticmethod
    def to_csv(models: List[BaseModel], file_path: str):
        """Export models to CSV file."""
        df = ModelExporter.to_dataframe(models)
        df.to_csv(file_path, index=False)

    @staticmethod
    def to_latex(models: List[BaseModel], caption: str = "") -> str:
        """Convert models to LaTeX table."""
        df = ModelExporter.to_dataframe(models)
        return df.to_latex(index=False, caption=caption)

# Usage
papers = [Paper(paper_id=f"{i}", title=f"Paper {i}") for i in range(10)]
ModelExporter.to_csv(papers, "papers.csv")
latex_table = ModelExporter.to_latex(papers, "Sample Papers")

Model Validation and Testing

Unit Tests for Models

import pytest
from pydantic import ValidationError

class TestPaperModel:
    def test_valid_paper_creation(self):
        paper = Paper(
            paper_id="valid123",
            title="A Valid Paper Title",
            abstract="This is a valid abstract with sufficient length.",
            year=2023,
            citation_count=5
        )
        assert paper.paper_id == "valid123"
        assert paper.citation_count == 5

    def test_invalid_paper_id(self):
        with pytest.raises(ValidationError) as exc_info:
            Paper(
                paper_id="invalid@id!",  # Contains invalid characters
                title="Valid Title"
            )
        assert "paper_id" in str(exc_info.value)

    def test_title_validation(self):
        with pytest.raises(ValidationError):
            Paper(paper_id="123", title="   ")  # Whitespace only

        with pytest.raises(ValidationError):
            Paper(paper_id="123", title="")  # Empty string

Model Factory for Testing

import factory
from factory import fuzzy

class PaperFactory(factory.Factory):
    class Meta:
        model = Paper

    paper_id = factory.Sequence(lambda n: f"paper_{n}")
    title = factory.Faker('sentence', nb_words=6)
    abstract = factory.Faker('paragraph')
    year = fuzzy.FuzzyInteger(2000, 2023)
    citation_count = fuzzy.FuzzyInteger(0, 100)

class AuthorFactory(factory.Factory):
    class Meta:
        model = Author

    author_id = factory.Sequence(lambda n: f"author_{n}")
    name = factory.Faker('name')

class CitationFactory(factory.Factory):
    class Meta:
        model = Citation

    source_paper = factory.SubFactory(PaperFactory)
    target_paper = factory.SubFactory(PaperFactory)
    citation_context = factory.Faker('sentence')

# Usage in tests
def test_paper_with_citations():
    paper = PaperFactory()
    citations = CitationFactory.create_batch(5, source_paper=paper)

    assert len(citations) == 5
    assert all(c.source_paper.paper_id == paper.paper_id for c in citations)

Performance Considerations

Model Optimization

from pydantic import BaseModel, ConfigDict
from typing import Optional, List

class OptimizedPaper(BaseModel):
    """Memory-optimized paper model for large-scale processing."""

    # Note: Pydantic models manage their own storage, so __slots__
    # must not be declared on model subclasses.
    model_config = ConfigDict(
        # Restrict fields to built-in types for faster validation
        arbitrary_types_allowed=False,

        # Skip re-validation on attribute assignment
        validate_assignment=False,

        # Ignore unexpected keys instead of validating them
        extra='ignore',
    )

    paper_id: str
    title: str
    year: Optional[int] = None
    citation_count: int = 0

# Batch processing optimization
def process_papers_batch(papers: List[dict]) -> List[OptimizedPaper]:
    """Efficiently process large batches of paper data."""
    # For data already known to be safe, OptimizedPaper.model_construct()
    # can skip validation entirely.
    return [OptimizedPaper.model_validate(paper) for paper in papers]

Lazy Loading

from typing import Any, List, Optional
from pydantic import BaseModel, PrivateAttr

class LazyPaper(BaseModel):
    """Paper model with lazy loading of expensive relationships."""

    paper_id: str
    title: str

    # Lazy-loaded state; private attributes are excluded from
    # validation and serialization.
    _authors: Optional[List[Author]] = PrivateAttr(default=None)
    _citations: Optional[List[Citation]] = PrivateAttr(default=None)
    _loader: Optional[Any] = PrivateAttr(default=None)

    def __init__(self, **data: Any):
        # Private attributes must be set after BaseModel.__init__ runs
        loader = data.pop('loader', None)
        super().__init__(**data)
        self._loader = loader

    @property
    def authors(self) -> List[Author]:
        if self._authors is None and self._loader:
            self._authors = self._loader.load_authors(self.paper_id)
        return self._authors or []

    @property
    def citations(self) -> List[Citation]:
        if self._citations is None and self._loader:
            self._citations = self._loader.load_citations(self.paper_id)
        return self._citations or []

This model system provides type safety, validation, and performance optimization for Citation Compass.