Data Models API Reference¶
Comprehensive documentation for data models, schemas, and data structures used throughout Citation Compass.
Core Data Models¶
Paper Model¶
The central data structure representing academic papers.
models.paper.Paper ¶
Bases: PaperBase
Complete paper model with all fields.
has_abstract property ¶
Check if paper has an abstract.
from_neo4j_record(record) classmethod ¶
Create Paper instance from Neo4j record.
from_semantic_scholar_response(data) classmethod ¶
Create Paper instance from Semantic Scholar API response.
is_highly_cited(threshold=100) ¶
Check if paper is highly cited.
to_neo4j_dict() ¶
Convert to dictionary for Neo4j storage.
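The methods above can be sketched with a minimal stand-in class. This is an illustration only, not the real `models.paper.Paper`; the field names and the default threshold of 100 follow the signatures listed above.

```python
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class PaperSketch:
    """Minimal stand-in for models.paper.Paper (illustration only)."""
    paper_id: str
    title: str
    abstract: Optional[str] = None
    citation_count: int = 0

    @property
    def has_abstract(self) -> bool:
        # True when a non-empty abstract is present
        return bool(self.abstract and self.abstract.strip())

    def is_highly_cited(self, threshold: int = 100) -> bool:
        # Mirrors is_highly_cited(threshold=100) above
        return self.citation_count >= threshold

    def to_neo4j_dict(self) -> dict:
        # Drop None values, since Neo4j properties cannot be null
        return {k: v for k, v in asdict(self).items() if v is not None}

paper = PaperSketch(paper_id="abc123", title="Example", citation_count=150)
```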
Citation Model¶
Represents citation relationships between papers.
models.citation.Citation ¶
Bases: CitationBase
Complete citation model.
to_neo4j_dict() ¶
Convert to dictionary for Neo4j relationship properties.
Author Model¶
Represents academic authors and their affiliations.
models.author.Author ¶
Bases: AuthorBase
Complete author model with all fields.
career_span property ¶
Calculate career span in years.
is_highly_cited property ¶
Check if author is highly cited.
is_prolific property ¶
Check if author is prolific (high paper count).
Config ¶
Pydantic configuration.
from_neo4j_record(record) classmethod ¶
Create Author instance from Neo4j record.
from_semantic_scholar_response(data) classmethod ¶
Create Author instance from Semantic Scholar API response.
to_neo4j_dict() ¶
Convert to dictionary for Neo4j storage.
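The derived properties might behave like the following sketch. The field names (`first_publication_year`, `last_publication_year`, `paper_count`) and the prolific cutoff of 50 papers are assumptions for illustration; the real thresholds are defined in `models.author.Author`.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AuthorSketch:
    """Minimal stand-in for models.author.Author (illustration only)."""
    author_id: str
    name: str
    first_publication_year: Optional[int] = None
    last_publication_year: Optional[int] = None
    paper_count: int = 0

    @property
    def career_span(self) -> Optional[int]:
        # Years between first and last publication, if both are known
        if self.first_publication_year is None or self.last_publication_year is None:
            return None
        return self.last_publication_year - self.first_publication_year

    @property
    def is_prolific(self) -> bool:
        # Hypothetical cutoff; the real threshold lives in the model
        return self.paper_count >= 50

author = AuthorSketch("a1", "Ada Lovelace", 1999, 2024, paper_count=72)
```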
Venue Model¶
Represents publication venues (journals, conferences).
models.venue.Venue ¶
Field Model¶
Represents academic fields of study.
models.field.Field = ResearchField module-attribute ¶
Network Models¶
Network Graph¶
models.network.NetworkGraph ¶
Bases: BaseModel
Complete network graph representation.
Aggregates nodes and edges with metadata for visualization and analysis across different backends.
filter_by_node_type(node_types) ¶
Create subgraph with only specified node types.
get_edges_for_node(node_id) ¶
Get all edges connected to a node.
get_neighbors(node_id) ¶
Get neighbor node IDs for a given node.
get_node_by_id(node_id) ¶
Get node by ID.
get_statistics() ¶
Get comprehensive graph statistics.
to_networkx_format() ¶
Convert to format suitable for NetworkX.
to_pandas_dataframes() ¶
Convert to pandas DataFrames for analysis.
to_pyvis_format() ¶
Convert to format suitable for Pyvis.
update_counts_and_density() ¶
Update node count, edge count, and density when nodes/edges are set.
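The traversal helpers above can be understood from a minimal sketch over a flat edge list; this is an illustration of the idea, not the actual `models.network.NetworkGraph` implementation.

```python
from dataclasses import dataclass, field
from typing import List, Set

@dataclass
class Edge:
    source: str
    target: str

@dataclass
class NetworkGraphSketch:
    """Minimal stand-in for models.network.NetworkGraph (illustration only)."""
    nodes: Set[str] = field(default_factory=set)
    edges: List[Edge] = field(default_factory=list)

    def get_edges_for_node(self, node_id: str) -> List[Edge]:
        # Edges where the node appears as either endpoint
        return [e for e in self.edges if node_id in (e.source, e.target)]

    def get_neighbors(self, node_id: str) -> Set[str]:
        # Opposite endpoint of every incident edge
        return {e.target if e.source == node_id else e.source
                for e in self.get_edges_for_node(node_id)}

    def density(self) -> float:
        # Directed-graph density: edges / possible edges
        n = len(self.nodes)
        return len(self.edges) / (n * (n - 1)) if n > 1 else 0.0

g = NetworkGraphSketch(nodes={"p1", "p2", "p3"},
                       edges=[Edge("p1", "p2"), Edge("p2", "p3")])
```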
Network Analysis¶
models.network.NetworkAnalysis ¶
Machine Learning Models¶
Citation Prediction¶
models.ml.CitationPrediction ¶
Bases: BaseModel
Model for citation prediction results.
Represents the output of ML models predicting whether one paper should cite another, with confidence scores and metadata.
Training Configuration¶
models.ml.TrainingConfig ¶
Evaluation Metrics¶
models.ml.EvaluationMetrics ¶
Bases: BaseModel
Comprehensive evaluation metrics for citation prediction models.
Provides a standardized way to store and compare model performance across different evaluation runs and model types.
classification_metrics_summary() ¶
Get summary of classification metrics.
get_performance_grade(metric='mean_reciprocal_rank') ¶
Get letter grade for performance.
is_better_than(other, primary_metric='mean_reciprocal_rank') ¶
Compare performance with another evaluation.
ranking_metrics_summary() ¶
Get summary of ranking metrics.
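The comparison helpers reduce to simple functions over the stored metrics. The letter-grade bands below are hypothetical; the real thresholds live inside `EvaluationMetrics.get_performance_grade`.

```python
def performance_grade(mrr: float) -> str:
    """Hypothetical letter-grade bands for mean reciprocal rank
    (illustration only; actual thresholds are model-defined)."""
    if mrr >= 0.9:
        return "A"
    if mrr >= 0.8:
        return "B"
    if mrr >= 0.7:
        return "C"
    if mrr >= 0.6:
        return "D"
    return "F"

def is_better_than(a: dict, b: dict,
                   primary_metric: str = "mean_reciprocal_rank") -> bool:
    # Higher is better for ranking metrics such as MRR
    return a[primary_metric] > b[primary_metric]
```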
Paper Embedding¶
models.ml.PaperEmbedding ¶
Bases: BaseModel
Model for storing and managing paper embeddings from ML models.
This supports the TransE model integration from citation-map-dashboard and provides a unified interface for embedding storage.
cosine_similarity(other) ¶
Calculate cosine similarity with another embedding.
from_numpy(paper_id, embedding, model_name, model_version=None) classmethod ¶
Create PaperEmbedding from numpy array.
to_numpy() ¶
Convert embedding to numpy array.
validate_embedding_dim_positive(v) classmethod ¶
Ensure embedding dimension is positive.
validate_embedding_dimension(v, info) classmethod ¶
Ensure embedding dimension matches declared dimension.
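The similarity computation is standard cosine similarity; a pure-Python sketch (the real method operates on the embedding's stored vector, here passed in explicitly):

```python
import math
from typing import Sequence

def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    if len(a) != len(b):
        raise ValueError("Embedding dimensions must match")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Zero vectors have no direction; define similarity as 0
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```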
API Models¶
API Response Models¶
models.api.APIResponse ¶
Bases: BaseModel, Generic[T]
Generic API response wrapper.
Provides consistent response structure across all endpoints with status, data, metadata, and error handling.
API Error¶
models.api.APIError ¶
Bases: BaseModel
Standardized error response model.
Provides consistent error reporting across all API endpoints with detailed information for debugging and user feedback.
not_found(resource, identifier) classmethod ¶
Create not found error.
rate_limited(retry_after=None) classmethod ¶
Create rate limit error.
server_error(message='Internal server error', details=None) classmethod ¶
Create server error.
validation_error(message, field=None, details=None) classmethod ¶
Create validation error.
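The factory classmethods might be shaped like the following sketch; the field names (`error_code`, `message`, `details`) are assumptions, not the model's actual schema.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional

@dataclass
class APIErrorSketch:
    """Minimal stand-in for models.api.APIError (illustration only)."""
    error_code: str
    message: str
    details: Dict[str, Any] = field(default_factory=dict)

    @classmethod
    def not_found(cls, resource: str, identifier: str) -> "APIErrorSketch":
        return cls("NOT_FOUND", f"{resource} '{identifier}' not found")

    @classmethod
    def rate_limited(cls, retry_after: Optional[int] = None) -> "APIErrorSketch":
        details = {"retry_after": retry_after} if retry_after is not None else {}
        return cls("RATE_LIMITED", "Rate limit exceeded", details)

err = APIErrorSketch.not_found("Paper", "abc123")
```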
Pagination¶
models.api.PaginatedResponse ¶
Bases: BaseModel, Generic[T]
Paginated API response wrapper.
Extends the standard API response with pagination metadata for endpoints that return multiple items.
success(data, pagination, message=None, meta=None) classmethod ¶
Create successful paginated response.
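The pagination metadata such a response carries can be derived from three numbers; a sketch (the field names here are illustrative, not the model's actual schema):

```python
import math

def pagination_meta(total_items: int, page: int, page_size: int) -> dict:
    """Derive pagination metadata for a 1-indexed page."""
    total_pages = math.ceil(total_items / page_size) if page_size else 0
    return {
        "page": page,
        "page_size": page_size,
        "total_items": total_items,
        "total_pages": total_pages,
        "has_next": page < total_pages,
        "has_previous": page > 1,
    }
```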
Model Schemas and Validation¶
Pydantic Base Models¶
All models inherit from enhanced Pydantic base classes:
```python
from pydantic import BaseModel, Field, validator
from typing import Optional, List, Dict, Any
from datetime import datetime
import re

class CitationPlatformBaseModel(BaseModel):
    """Base model with common configuration."""

    class Config:
        # Enable ORM mode for database integration
        orm_mode = True
        # Allow population by field name or alias
        allow_population_by_field_name = True
        # Validate assignment to ensure data integrity
        validate_assignment = True
        # Use enum values for serialization
        use_enum_values = True
        # Example payload included in the generated JSON schema
        schema_extra = {
            "example": {}
        }
```
Field Validation Examples¶
```python
class Paper(CitationPlatformBaseModel):
    paper_id: str = Field(..., regex=r'^[a-zA-Z0-9\-_]+$', description="Unique paper identifier")
    title: str = Field(..., min_length=1, max_length=500, description="Paper title")
    abstract: Optional[str] = Field(None, max_length=5000, description="Paper abstract")
    year: Optional[int] = Field(None, ge=1900, le=2030, description="Publication year")
    citation_count: int = Field(0, ge=0, description="Number of citations")

    @validator('title')
    def validate_title(cls, v):
        """Ensure title is properly formatted."""
        if not v.strip():
            raise ValueError('Title cannot be empty or whitespace only')
        return v.strip()

    @validator('abstract')
    def validate_abstract(cls, v):
        """Clean and validate abstract."""
        if v:
            # Remove excessive whitespace
            v = re.sub(r'\s+', ' ', v.strip())
            if len(v) < 10:
                raise ValueError('Abstract too short (minimum 10 characters)')
        return v
```
Custom Field Types¶
```python
from pydantic import BaseModel, Field, validator
from typing import NewType

# Custom types for semantic clarity
PaperID = NewType('PaperID', str)
AuthorID = NewType('AuthorID', str)
VenueID = NewType('VenueID', str)
ConfidenceScore = NewType('ConfidenceScore', float)

class Prediction(BaseModel):
    source_paper: PaperID = Field(..., description="Source paper ID")
    target_paper: PaperID = Field(..., description="Target paper ID")
    confidence: ConfidenceScore = Field(..., ge=0.0, le=1.0, description="Prediction confidence")
    score: float = Field(..., description="Raw model score")
    model_version: str = Field(..., description="Model version used")

    @validator('confidence')
    def round_confidence(cls, v):
        """Round confidence to 3 decimal places."""
        return round(v, 3)
```
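Note that `NewType` is purely a static-typing aid: at runtime the wrapper returns its argument unchanged, so a `PaperID` value is an ordinary `str` with zero overhead; only type checkers distinguish the two.

```python
from typing import NewType

PaperID = NewType('PaperID', str)

pid = PaperID("abc123")
# At runtime the value is the underlying str; only static
# type checkers treat PaperID as distinct from str.
```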
Model Relationships¶
Entity Relationships¶
```python
from typing import List, Optional, ForwardRef
from pydantic import BaseModel

# Forward references for circular dependencies
AuthorRef = ForwardRef('Author')
VenueRef = ForwardRef('Venue')
FieldRef = ForwardRef('Field')

class Paper(BaseModel):
    paper_id: str
    title: str
    # Relationships
    authors: List[AuthorRef] = []
    venue: Optional[VenueRef] = None
    fields: List[FieldRef] = []
    citations: List['Citation'] = []
    references: List['Citation'] = []

class Author(BaseModel):
    author_id: str
    name: str
    # Relationships
    papers: List['Paper'] = []
    affiliations: List[str] = []

# Resolve forward references once all referenced models are defined
Paper.update_forward_refs()
Author.update_forward_refs()
```
Database Integration¶
```python
from typing import Optional

from src.database.connection import Neo4jConnection

class PaperRepository:
    def __init__(self, connection: Neo4jConnection):
        self.conn = connection

    async def get_by_id(self, paper_id: str) -> Optional[Paper]:
        """Retrieve paper by ID with full relationships."""
        query = """
        MATCH (p:Paper {paper_id: $paper_id})
        OPTIONAL MATCH (p)<-[:AUTHORED]-(a:Author)
        OPTIONAL MATCH (p)-[:PUBLISHED_IN]->(v:Venue)
        OPTIONAL MATCH (p)-[:BELONGS_TO]->(f:Field)
        OPTIONAL MATCH (p)-[:CITES]->(cited:Paper)
        OPTIONAL MATCH (citing:Paper)-[:CITES]->(p)
        RETURN p,
               collect(DISTINCT a) as authors,
               v as venue,
               collect(DISTINCT f) as fields,
               collect(DISTINCT cited) as citations,
               collect(DISTINCT citing) as cited_by
        """
        result = await self.conn.run(query, paper_id=paper_id)
        record = await result.single()
        if not record:
            return None
        return Paper(
            paper_id=record['p']['paper_id'],
            title=record['p']['title'],
            abstract=record['p'].get('abstract'),
            year=record['p'].get('year'),
            authors=[Author(**author) for author in record['authors']],
            venue=Venue(**record['venue']) if record['venue'] else None,
            fields=[Field(**field) for field in record['fields']],
            citation_count=len(record['citations'])
        )
```
Serialization and Export¶
JSON Serialization¶
```python
import json
from typing import Any
from datetime import datetime

class CitationPlatformJSONEncoder(json.JSONEncoder):
    """Custom JSON encoder for platform models."""

    def default(self, obj: Any) -> Any:
        if isinstance(obj, datetime):
            return obj.isoformat()
        elif hasattr(obj, 'dict'):
            # Pydantic models
            return obj.dict()
        elif hasattr(obj, '__dict__'):
            # Other objects with __dict__
            return obj.__dict__
        else:
            return super().default(obj)

# Usage
paper = Paper(paper_id="123", title="Example Paper")
json_string = json.dumps(paper, cls=CitationPlatformJSONEncoder, indent=2)
```
Export Formats¶
```python
from typing import List

import pandas as pd
from pydantic import BaseModel

class ModelExporter:
    @staticmethod
    def to_dataframe(models: List[BaseModel]) -> pd.DataFrame:
        """Convert list of models to pandas DataFrame."""
        data = [model.dict() for model in models]
        return pd.DataFrame(data)

    @staticmethod
    def to_csv(models: List[BaseModel], file_path: str):
        """Export models to CSV file."""
        df = ModelExporter.to_dataframe(models)
        df.to_csv(file_path, index=False)

    @staticmethod
    def to_latex(models: List[BaseModel], caption: str = "") -> str:
        """Convert models to LaTeX table."""
        df = ModelExporter.to_dataframe(models)
        return df.to_latex(index=False, caption=caption)

# Usage
papers = [Paper(paper_id=f"{i}", title=f"Paper {i}") for i in range(10)]
ModelExporter.to_csv(papers, "papers.csv")
latex_table = ModelExporter.to_latex(papers, "Sample Papers")
```
Model Validation and Testing¶
Unit Tests for Models¶
```python
import pytest
from pydantic import ValidationError

class TestPaperModel:
    def test_valid_paper_creation(self):
        paper = Paper(
            paper_id="valid123",
            title="A Valid Paper Title",
            abstract="This is a valid abstract with sufficient length.",
            year=2023,
            citation_count=5
        )
        assert paper.paper_id == "valid123"
        assert paper.citation_count == 5

    def test_invalid_paper_id(self):
        with pytest.raises(ValidationError) as exc_info:
            Paper(
                paper_id="invalid@id!",  # Contains invalid characters
                title="Valid Title"
            )
        assert "paper_id" in str(exc_info.value)

    def test_title_validation(self):
        with pytest.raises(ValidationError):
            Paper(paper_id="123", title="   ")  # Whitespace only
        with pytest.raises(ValidationError):
            Paper(paper_id="123", title="")  # Empty string
```
Model Factory for Testing¶
```python
import factory
from factory import fuzzy

class PaperFactory(factory.Factory):
    class Meta:
        model = Paper

    paper_id = factory.Sequence(lambda n: f"paper_{n}")
    title = factory.Faker('sentence', nb_words=6)
    abstract = factory.Faker('paragraph')
    year = fuzzy.FuzzyInteger(2000, 2023)
    citation_count = fuzzy.FuzzyInteger(0, 100)

class AuthorFactory(factory.Factory):
    class Meta:
        model = Author

    author_id = factory.Sequence(lambda n: f"author_{n}")
    name = factory.Faker('name')

class CitationFactory(factory.Factory):
    class Meta:
        model = Citation

    source_paper = factory.SubFactory(PaperFactory)
    target_paper = factory.SubFactory(PaperFactory)
    citation_context = factory.Faker('sentence')

# Usage in tests
def test_paper_with_citations():
    paper = PaperFactory()
    citations = CitationFactory.create_batch(5, source_paper=paper)
    assert len(citations) == 5
    assert all(c.source_paper.paper_id == paper.paper_id for c in citations)
```
Performance Considerations¶
Model Optimization¶
```python
from pydantic import BaseModel
from typing import Optional, List

class OptimizedPaper(BaseModel):
    """Memory-optimized paper model for large-scale processing.

    Keeps only the fields needed for bulk analysis; heavy fields
    such as the abstract are omitted entirely.
    """

    paper_id: str
    title: str
    year: Optional[int] = None
    citation_count: int = 0

    class Config:
        # Disable arbitrary types for performance
        arbitrary_types_allowed = False
        # Skip re-validation of field defaults
        validate_all = False
        # Avoid per-assignment validation overhead in hot loops
        validate_assignment = False

# Batch processing optimization
def process_papers_batch(papers: List[dict]) -> List[OptimizedPaper]:
    """Efficiently process large batches of paper data."""
    return [OptimizedPaper.parse_obj(paper) for paper in papers]
```
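For inputs too large to validate in one pass, the batch can be split into fixed-size chunks so only one chunk of models is held in memory at a time; a sketch with a hypothetical chunk size:

```python
from typing import Iterator, List

def chunked(items: List[dict], size: int) -> Iterator[List[dict]]:
    """Yield fixed-size chunks of a list (the last chunk may be shorter),
    so very large inputs never need to be validated all at once."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Usage sketch: validate 1000 records 500 at a time
raw = [{"paper_id": str(i), "title": f"Paper {i}"} for i in range(1000)]
batches = list(chunked(raw, 500))
```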
Lazy Loading¶
```python
from typing import Any, List, Optional
from pydantic import BaseModel, PrivateAttr

class LazyPaper(BaseModel):
    """Paper model with lazy loading of expensive relationships."""

    paper_id: str
    title: str

    # Lazy-loaded caches (private attributes, excluded from the schema)
    _authors: Optional[List[Author]] = PrivateAttr(default=None)
    _citations: Optional[List[Citation]] = PrivateAttr(default=None)
    _loader: Optional[Any] = PrivateAttr(default=None)

    def __init__(self, **data):
        loader = data.pop('loader', None)
        super().__init__(**data)
        self._loader = loader

    @property
    def authors(self) -> List[Author]:
        if self._authors is None and self._loader:
            self._authors = self._loader.load_authors(self.paper_id)
        return self._authors or []

    @property
    def citations(self) -> List[Citation]:
        if self._citations is None and self._loader:
            self._citations = self._loader.load_citations(self.paper_id)
        return self._citations or []
```
This model system provides type safety, validation, and performance optimization for Citation Compass.