DSPy + MLflow Integration Plan¶
Superseded by Updated Plan
This document has been superseded by the December 9, 2025 snapshot which includes current implementation status and explicit next steps.
See: dspy-mlflow-plan-2025-12-09.md for the current plan.
Historical Version (Pre-December 2025)¶
This plan covers Option C (DSPy pipeline for Q&A with sources) and Option D (tracking DSPy optimization runs in MLflow), with a staged path that can execute DSPy alone or as a combined approach with MLflow tracking and registry support. The goal is to maximize transparency, reproducibility, and operational clarity.
Executive Summary¶
Objectives

- Introduce DSPy programs to formalize retrieval → reasoning → verification flows and enable self-improvement.
- Use MLflow to track DSPy experimentation, store artifacts, and optionally register tuned pipelines for controlled rollout.
- Maintain operational guardrails (observability, rollback, and documentation) across both approaches.
Tasks

- Prototype a DSPy Q&A pipeline using existing retrieval/search and verification components.
- Instrument DSPy training/evaluation runs with MLflow logging of params, metrics, and serialized programs.
- Package the best-performing DSPy program as an artifact (and optionally register it) for API consumption behind a feature flag.
- Document workflows, configuration, and rollback procedures.
- Log dataset identifiers and hashes (and lightweight snapshots when feasible) alongside runs to keep experiments reproducible as datasets evolve.
- Enforce PII/secret scrubbing before uploading misclassified examples or traces to MLflow artifacts to avoid leaking user data.
Dependencies
- Existing retrieval/search services, evaluation datasets, and FastAPI runtime in apps/api/.
- Python tooling (poetry/uv) and access to any LLM/provider credentials already used by the service.
- MLflow backend (local mlruns/ or remote server) for Option D; optional object store/DB if using registry.
Deliverables

- DSPy prototype module(s) with runnable scripts/tests for Q&A evaluation.
- MLflow-integrated training/evaluation scripts and logged artifacts for DSPy runs.
- Feature-flagged API integration path to toggle DSPy pipelines and load MLflow-produced artifacts.
- Updated diagrams and documentation describing architecture, data flows, and ops runbooks.
Updates to Architecture¶
Objectives¶
- Clarify how DSPy modules fit into the existing retrieval and chat API stack.
- Show where MLflow tracks experiments and where the API loads DSPy artifacts from the registry or filesystem.
- Keep rollback and baseline paths explicit.
Tasks¶
- Define DSPy pipeline boundaries (retriever, reasoning, verifier) aligned with current search/vector store utilities.
- Add MLflow tracking and artifact storage touchpoints to the experimentation loop.
- Extend the FastAPI service to resolve DSPy artifacts (local or MLflow registry) behind a feature flag.
- Update diagrams to reflect data flow with and without MLflow involvement.
Dependencies¶
- Current service topology (FastAPI, vector search, LLM providers) and evaluation harness.
- MLflow tracking URI and registry backend if used.
Deliverables¶
- Updated architecture diagram (below) and supporting text in docs.
Architecture & Data Flow Diagram¶
```mermaid
graph TD
    subgraph Client
        U[User]
    end
    subgraph API[FastAPI Service]
        FF[Feature Flag<br/>Baseline vs DSPy]
        ORCH[Query Orchestrator]
        DSPY["Optional DSPy Program<br/>(Retrieve -> Reason -> Verify)"]
        BASE[Baseline Pipeline]
    end
    subgraph Retrieval
        VS[Vector Store / Search]
        IDX[Index Docs]
    end
    subgraph LLMs[LLM Providers]
        GEN[Generator]
        VERIFY[Verifier]
    end
    subgraph Eval[Evaluation Harness]
        DATA[Eval Datasets]
        EvalHarness((CLI/Tests))
    end
    subgraph MLflow
        TRACK[Tracking Server]
        ART[Artifacts / Registry]
    end
    U -->|Query| API
    API --> FF
    FF -->|DSPy enabled| DSPY
    FF -->|DSPy disabled| BASE
    DSPY --> VS
    DSPY --> GEN
    DSPY --> VERIFY
    BASE --> VS
    BASE --> GEN
    BASE --> VERIFY
    VS --> IDX
    ORCH -->|Log runs| TRACK
    DSPY -->|"Log artifacts (program, prompts)"| TRACK
    TRACK --> ART
    ORCH -->|"Load DSPy artifact (local or registry)"| DSPY
    EvalHarness -->|"Run eval + log metrics"| TRACK
    EvalHarness -->|Use DSPy program| DSPY
    EvalHarness -->|Use baseline| BASE
```
Detailed Development Phases¶
Phase 1: DSPy Prototype (Option C foundation)¶
- Objectives
- Build a minimal DSPy pipeline for core Q&A journeys with explicit input/output schemas and source citations.
- Validate functional parity with the baseline pipeline using existing evaluation datasets.
- Tasks
- Add the `dspy-ai` dependency and create `pipelines/dspy/` (or similar) with modules: Retriever (wraps vector search), Reasoner (ReAct-style or chain-of-thought), and Verifier (leveraging existing verification logic).
- Pin `dspy-ai` to a stable 2.x release (e.g., `dspy-ai==2.5.x` once verified) and control random seeds (plus record LLM provider/model versions) so tuned programs are reproducible and rollbackable; lock versions in `pyproject.toml`/`uv.lock` and bump only with re-evals.
- Translate 1–2 canonical flows (e.g., rental income queries, compliance lookup) into DSPy `Program` classes with unit-style tests or a CLI runner.
- Add configuration/feature-flag scaffolding in `apps/api/` to toggle DSPy per-request or per-tenant; ensure observability hooks (structured logging with prompt/model IDs, sources, latencies).
- Run the baseline evaluation harness against the DSPy prototype to collect accuracy/latency metrics; document findings.
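The Retriever → Reasoner → Verifier boundaries above can be sketched as plain callables to make the input/output contract concrete. This is a structural sketch only: in the real prototype each stage would be a `dspy.Module` (e.g., the Reasoner as `dspy.ChainOfThought`), and the names here (`Passage`, `Answer`, `QAPipeline`) are illustrative, not existing code.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical schemas; field names are illustrative, not the real apps/api contracts.
@dataclass
class Passage:
    doc_id: str
    text: str

@dataclass
class Answer:
    text: str
    sources: List[str] = field(default_factory=list)
    verified: bool = False

class QAPipeline:
    """Sketch of the Retrieve -> Reason -> Verify composition.

    Each stage is a plain callable here so the shape of the contract is
    visible without an LLM in the loop; the DSPy version would subclass
    dspy.Module and implement forward().
    """

    def __init__(
        self,
        retrieve: Callable[[str], List[Passage]],
        reason: Callable[[str, List[Passage]], Answer],
        verify: Callable[[Answer, List[Passage]], bool],
    ):
        self.retrieve = retrieve
        self.reason = reason
        self.verify = verify

    def __call__(self, question: str) -> Answer:
        passages = self.retrieve(question)          # vector search wrapper
        answer = self.reason(question, passages)    # LLM reasoning step
        answer.sources = [p.doc_id for p in passages]  # explicit citations
        answer.verified = self.verify(answer, passages)  # existing verification logic
        return answer
```

Keeping the stages as injected callables also makes the unit-style tests mentioned above trivial: each stage can be stubbed independently.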
- Dependencies
- Existing retrieval utilities and evaluation datasets.
- Access to LLM provider credentials used in production.
- Deliverables
- DSPy module(s) checked into the repo with runnable examples/tests.
- Evaluation results comparing baseline vs DSPy prototype, captured in docs.
- Feature-flag wiring allowing safe opt-in for DSPy in the API.
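The feature-flag wiring could resolve in three layers: an explicit per-request override, a per-tenant allow-list, then the global default (baseline). A minimal sketch, assuming hypothetical env var names rather than existing `apps/api/` configuration:

```python
import os
from typing import Optional

# Assumed env vars (illustrative names, not existing config):
# DSPY_PIPELINE_ENABLED=true|false, DSPY_PIPELINE_TENANTS=t1,t2,...
DSPY_ENABLED = os.environ.get("DSPY_PIPELINE_ENABLED", "false").lower() == "true"
DSPY_TENANTS = set(filter(None, os.environ.get("DSPY_PIPELINE_TENANTS", "").split(",")))

def use_dspy(tenant_id: Optional[str] = None, override: Optional[bool] = None) -> bool:
    """Resolve the toggle: per-request override wins, then the per-tenant
    allow-list, then the global flag (baseline pipeline by default)."""
    if override is not None:
        return override
    if tenant_id is not None and tenant_id in DSPY_TENANTS:
        return True
    return DSPY_ENABLED
```

Defaulting to the baseline keeps opt-in safe: an unset or malformed flag always routes to the known-good path.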
Phase 2: MLflow-Tracked DSPy Experiments (Option D core)¶
- Objectives
- Make DSPy experimentation reproducible with MLflow logging and artifact storage.
- Capture metrics (accuracy, latency, hallucination rate) and program artifacts for each optimization run.
- Tasks
- Add the `mlflow` dependency and configure the tracking URI (local `mlruns/` by default); parameterize via env vars for CI/local parity.
- Wrap DSPy training/evaluation scripts with `mlflow.start_run()`; log params (LLM/model IDs, prompt templates, module configs), metrics, and serialized DSPy program artifacts (`mlflow.log_artifact`).
- Standardize the dataset format as JSONL (`id`, `split`, `question`, `context`/`evidence`, `answer`, `metadata`) and log dataset provenance per run: dataset name, semantic/commit version, content hash, and row counts; include a lightweight manifest (`manifest.json`) and a gzip snapshot (`dataset.jsonl.gz`) when size allows (keeps runs reproducible as datasets grow).
- Size guardrail: aim for ≤50–100 MB compressed per snapshot. If larger, log the manifest plus a stratified sample (1–2%, capped at 1k rows) and the full-file hash; note the snapshot omission in the manifest to protect local MLflow store size/performance.
- Emit error-case bundles (misclassified examples, hallucination traces) as artifacts for post-mortem analysis, scrubbed of PII/secrets before upload (protects user data in the MLflow store).
- Document how to launch runs (commands, env vars) and where artifacts live.
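The scrubbing step could start as a simple regex pass over bundle text before anything is uploaded as an artifact. A minimal sketch with illustrative patterns only; a production version should reuse the service's existing redaction utilities and a broader PII taxonomy:

```python
import re

# Illustrative patterns, not an exhaustive PII catalog.
PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "[EMAIL]"),          # email addresses
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),              # US SSN shape
    (re.compile(r"(?i)(api[_-]?key|token)\s*[:=]\s*\S+"), r"\1=[REDACTED]"),  # secrets
]

def scrub(text: str) -> str:
    """Apply each redaction pattern in order and return the cleaned text."""
    for pattern, replacement in PATTERNS:
        text = pattern.sub(replacement, text)
    return text
```

Running `scrub` over every misclassified example and trace before `mlflow.log_artifact` keeps the MLflow store free of raw user data even when error bundles are shared for post-mortems.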
Dataset manifest format and sampling logic (to implement in tooling):
- manifest.json fields (all logged as an artifact):
- dataset_name, version (semantic tag or commit SHA), source_uri (repo path/object store path), created_at.
- total_rows, splits (map of split -> row_count), schema_sample (first k field names/optional types), has_snapshot (bool).
- jsonl_sha256 (full file), gzip_sha256 (if snapshot emitted), sample_path (if sample emitted), notes (reason if snapshot omitted).
- Sampling rules when over size budget:
- If compressed snapshot >100 MB: emit manifest plus stratified sample across splits at 1–2% capped at 1k rows total; store sample as dataset.sample.jsonl.
- Always compute and log full-file hash even if snapshot omitted; record omission reason in notes.
- Keep sampling deterministic per run using a fixed seed so samples are reproducible.
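The manifest and sampling rules above can be sketched roughly as follows. Field coverage is partial (no `source_uri`, `created_at`, or `schema_sample`) and the helper names, seed value, and budget constant are hypothetical tooling choices, not fixed decisions:

```python
import gzip
import hashlib
import json
import random
from collections import Counter
from pathlib import Path

SNAPSHOT_BUDGET = 100 * 1024 * 1024        # ~100 MB compressed, per the guardrail
SAMPLE_FRACTION, SAMPLE_CAP, SEED = 0.02, 1000, 13  # seed is arbitrary but fixed

def build_manifest(jsonl_path: str, dataset_name: str, version: str) -> dict:
    """Compute a partial manifest.json for a JSONL eval dataset."""
    raw = Path(jsonl_path).read_bytes()
    rows = [json.loads(line) for line in raw.splitlines() if line.strip()]
    manifest = {
        "dataset_name": dataset_name,
        "version": version,  # semantic tag or commit SHA
        "total_rows": len(rows),
        "splits": dict(Counter(r.get("split", "unknown") for r in rows)),
        "jsonl_sha256": hashlib.sha256(raw).hexdigest(),  # always logged
        "has_snapshot": len(gzip.compress(raw)) <= SNAPSHOT_BUDGET,
    }
    if not manifest["has_snapshot"]:
        manifest["notes"] = "snapshot omitted: over size budget; stratified sample emitted"
    return manifest

def stratified_sample(rows: list) -> list:
    """Deterministic per-split sample at SAMPLE_FRACTION, capped at SAMPLE_CAP rows."""
    rng = random.Random(SEED)  # fixed seed keeps samples reproducible per run
    by_split: dict = {}
    for r in rows:
        by_split.setdefault(r.get("split", "unknown"), []).append(r)
    sample = []
    for split_rows in by_split.values():
        k = max(1, int(len(split_rows) * SAMPLE_FRACTION))
        sample.extend(rng.sample(split_rows, min(k, len(split_rows))))
    return sample[:SAMPLE_CAP]
```

Note the full-file hash is computed unconditionally, so a run remains verifiable against the source dataset even when the snapshot is omitted.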
- Dependencies
- Phase 1 DSPy prototype runnable end-to-end.
- MLflow accessible locally or remotely (file store is acceptable initially).
- Deliverables
- MLflow-instrumented DSPy training/eval scripts or notebooks.
- Logged runs with metrics and artifacts stored under mlruns/ (or configured backend).
- Documentation for running and inspecting experiments.
Phase 3: Registry & API Integration (Option D extended)¶
- Objectives
- Enable the API to load the best-performing DSPy program from MLflow artifacts/registry with staged promotion.
- Provide rollback to baseline or last-known-good DSPy artifact.
- Tasks
- Decide artifact resolution order: explicit file path (for local dev) → MLflow Registry Production stage (for deployed envs).
- Implement a loader in `apps/api/` that fetches a pickled DSPy program artifact and validates it on load (schema/version check, dry-run); keep an optional path to wrap it in a custom MLflow `pyfunc` flavor later if a generic `predict` is needed.
- Add smoke tests in CI to ensure the API can load the current Production artifact; include fallback to the bundled baseline if the registry is unreachable.
- Document promotion/runbook: how to register a run as Staging, perform eval, and promote to Production; how to rollback.
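The resolution order could look like the sketch below, with the registry lookup injected as a callable so the concrete MLflow client wiring stays out of scope. All names here are illustrative, not existing `apps/api/` code:

```python
from pathlib import Path
from typing import Callable, Optional

def resolve_artifact(
    explicit_path: Optional[str],
    fetch_production: Callable[[], Optional[str]],
    baseline_path: str,
) -> str:
    """Return a path to the DSPy artifact following the resolution order:
    explicit file path (local dev) -> registry Production stage -> bundled baseline."""
    # 1. Explicit path wins, for local development.
    if explicit_path and Path(explicit_path).exists():
        return explicit_path
    # 2. Try the registry; any failure falls through to the baseline.
    try:
        from_registry = fetch_production()
        if from_registry:
            return from_registry
    except Exception:
        pass  # registry unreachable: degrade gracefully
    # 3. Bundled baseline is the last-resort known-good path.
    return baseline_path
```

Injecting `fetch_production` also makes the CI smoke test cheap: a stub that raises proves the unreachable-registry fallback without standing up an MLflow server.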
- Dependencies
- MLflow server/registry backend (SQLite+filesystem for starter or Postgres+S3/MinIO for shared envs).
- Stable DSPy serialization/deserialization path from Phase 2.
- Deliverables
- API code path that conditionally loads DSPy artifacts from MLflow.
- CI smoke test coverage for artifact loading.
- Runbook for promotion and rollback.
Phase 4: Operations & Observability (applies to both C and D)¶
- Objectives
- Ensure production readiness with monitoring, drift detection, and clear documentation.
- Tasks
- Standardize structured logging for both baseline and DSPy paths (prompts, model IDs, sources, latencies, error codes) with PII scrubbing.
- Add evaluation thresholds and alerts: fail builds or block promotions if metrics regress; monitor live metrics for drift vs. MLflow baselines.
- Maintain "last-known-good" DSPy artifact reference and automated rollback switch in feature flag config.
- Consolidate documentation (diagrams, runbooks, config examples) in `docs/` and keep changelog entries for each rollout.
- Dependencies
- Observability stack currently used by the service (logging/metrics backend).
- Feature flag mechanism from earlier phases.
- Deliverables
- Logging/metrics instrumentation parity between DSPy and baseline paths.
- Alerting thresholds and rollback procedures codified.
- Updated documentation and diagrams reflecting operational flows.
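The promotion gate from the tasks above could compare candidate run metrics against the MLflow baseline with per-metric tolerances. A sketch with placeholder metric names and thresholds (not tuned values):

```python
# Each gate: (direction, tolerance). direction +1 means higher is better,
# -1 means lower is better. Tolerances here are placeholders.
GATES = {
    "accuracy": (+1, 0.01),             # tolerate up to a 1-point drop
    "hallucination_rate": (-1, 0.005),  # tolerate up to a 0.5-point rise
}

def regressions(candidate: dict, baseline: dict) -> list:
    """Return the metrics whose candidate-vs-baseline delta moved the wrong
    way by more than the allowed tolerance; empty list means promote."""
    failed = []
    for metric, (direction, tolerance) in GATES.items():
        delta = direction * (candidate[metric] - baseline[metric])
        if delta < -tolerance:  # moved the wrong way past tolerance
            failed.append(metric)
    return failed
```

A CI step can fail the build (or block registry promotion) whenever `regressions(...)` is non-empty, and the same comparison against the logged MLflow baseline can drive live drift alerts.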
Notes on Sequencing (C vs D)¶
- Do Option C first for functional coverage, then layer Option D for reproducibility and lifecycle control. Option D depends on having a runnable DSPy pipeline.
- Parallelizable work: docs/diagram updates and feature-flag scaffolding can start early; MLflow environment setup can proceed while DSPy prototype is built.
- Minimal viable path: Phase 1 (DSPy) → Phase 2 (MLflow logging) → Phase 3 (registry integration) → Phase 4 (ops hardening).