LLM Provider Architecture¶

Implementation Status

Current State: This document describes the planned architecture for multi-provider LLM support. The abstraction layer design is documented, but the current implementation (Phase 2) is tightly coupled to Anthropic Claude.

Status: - 🚧 Provider abstraction layer - Designed (Phase 6-7) - ✅ Anthropic Claude integration - Operational - 🚧 OpenAI provider - Planned (Phase 6-7) - 🚧 Ollama local models - Planned (Phase 6-7)

Migration Plan: See docs/planning/2025-12-03-llm-agnosticism-plan.md for detailed decoupling strategy.

Overview¶

The Poolula Platform chatbot is designed to use a provider-agnostic abstraction layer to support multiple Large Language Model (LLM) backends. This design will allow seamless switching between cloud-based providers (Anthropic, OpenAI) and local models (Ollama) without changing business logic.

Architecture Principles¶

Provider Independence: Core RAG logic doesn't depend on specific LLM APIs
Unified Interface: All providers implement the same LLMProvider base class
Easy Extensibility: Adding new providers requires only implementing the interface
Configuration-Based Switching: Change providers via environment variable (LLM_PROVIDER)

Provider Abstraction Layer¶

Base Interface¶

File: apps/chatbot/llm_providers/base.py

class LLMProvider(ABC):
    """Abstract base class for LLM providers"""

    @abstractmethod
    def generate(
        self,
        messages: List[LLMMessage],
        system_prompt: Optional[str] = None,
        tools: Optional[List[Dict]] = None,
        temperature: float = 0.0,
        max_tokens: int = 800,
        **kwargs
    ) -> LLMResponse:
        """Generate a response from the LLM"""
        pass

    @abstractmethod
    def translate_tool_definition(self, tool_def: Dict[str, Any]) -> Dict[str, Any]:
        """Translate from internal tool format to provider-specific format"""
        pass

    @property
    @abstractmethod
    def provider_name(self) -> str:
        """Return provider name for logging/debugging"""
        pass

    @property
    @abstractmethod
    def supports_native_tool_calling(self) -> bool:
        """Whether provider has native tool calling support"""
        pass

Data Models¶

LLMMessage: Provider-agnostic message format

@dataclass
class LLMMessage:
    role: str  # "user", "assistant", "system"
    content: Any

LLMToolCall: Provider-agnostic tool call representation

@dataclass
class LLMToolCall:
    id: str
    name: str
    parameters: Dict[str, Any]

LLMResponse: Unified response format

@dataclass
class LLMResponse:
    text: Optional[str] = None
    tool_calls: Optional[List[LLMToolCall]] = None
    stop_reason: str = "complete"
    raw_response: Any = None
    content: List[Any] = field(default_factory=list)

Supported Providers¶

1. Anthropic Claude (Default)¶

File: apps/chatbot/llm_providers/anthropic_provider.py

Characteristics: - Native Tool Calling: Yes (function calling API) - Models: claude-sonnet-4-20250514 (default), claude-opus-4-5-20251101 - Cost: ~$3/M input tokens, ~$15/M output tokens - Latency: 1-3s typical - Best For: Production use, highest quality responses

Configuration:

LLM_PROVIDER=anthropic
ANTHROPIC_API_KEY=sk-ant-...
ANTHROPIC_MODEL=claude-sonnet-4-20250514

Key Features: - Direct integration with Anthropic SDK - Multi-round tool calling support - Tool definitions pass through directly (no translation needed) - Excellent at following system prompts

2. OpenAI GPT¶

File: apps/chatbot/llm_providers/openai_provider.py

Characteristics: - Native Tool Calling: Yes (function calling API) - Models: gpt-4o (default), gpt-4o-mini - Cost: ~$2.50/M input tokens, ~$10/M output tokens (17% cheaper than Claude) - Latency: 1-4s typical - Best For: Cost-effective alternative to Claude

Configuration:

LLM_PROVIDER=openai
OPENAI_API_KEY=sk-...
OPENAI_MODEL=gpt-4o

Key Features: - Translates between Anthropic and OpenAI tool formats - Slightly different response phrasing (affects keyword matching scores) - Compatible with Azure OpenAI endpoints (future enhancement)

Tool Format Translation:

# Anthropic format (internal)
{
  "name": "query_database",
  "description": "Query the business database",
  "input_schema": {"type": "object", "properties": {...}}
}

# OpenAI format (translated)
{
  "type": "function",
  "function": {
    "name": "query_database",
    "description": "Query the business database",
    "parameters": {"type": "object", "properties": {...}}
  }
}

3. Ollama (Local Models)¶

File: apps/chatbot/llm_providers/ollama_provider.py

Characteristics: - Native Tool Calling: No (uses prompt-based approach) - Models: llama3.1:8b-instruct-q4_K_M (recommended), mistral:7b-instruct-q4_0, qwen2.5:7b-instruct-q4_K_M - Cost: Free (runs locally) - Latency: 5-15s on CPU, 1-4s on GPU - Best For: Learning, privacy-focused use, offline capability

Configuration:

LLM_PROVIDER=ollama
LOCAL_MODEL_PATH=llama3.1:8b-instruct-q4_K_M
LOCAL_MODEL_URL=http://localhost:11434

Key Features: - Connects to Ollama via HTTP API - Prompt-based tool calling (instructs model to output JSON) - Free and private (no data leaves machine) - Requires Ollama installed separately

Prompt-Based Tool Calling:

Since local models lack native tool calling APIs, we inject tool descriptions into the system prompt:

def _build_tool_prompt(self, tools: List[Dict]) -> str:
    """Build tool usage instructions for prompt engineering"""
    tool_descriptions = [f"- **{tool['name']}**: {tool['description']}" for tool in tools]

    return f"""Available Tools:
{chr(10).join(tool_descriptions)}

To use a tool, respond with ONLY JSON: {{"tool": "tool_name", "parameters": {{...}}}}"""

The model responds with JSON, which we parse back into LLMToolCall objects.

Provider Factory¶

File: apps/chatbot/rag_system.py (line 48)

def _create_llm_provider(self, config) -> LLMProvider:
    """Factory method for creating LLM providers based on configuration"""
    provider_type = config.LLM_PROVIDER.lower()

    if provider_type == "anthropic":
        return AnthropicProvider(
            api_key=config.ANTHROPIC_API_KEY,
            model=config.ANTHROPIC_MODEL
        )
    elif provider_type == "openai":
        from .llm_providers import OpenAIProvider
        if OpenAIProvider is None:
            raise ImportError("Install with: uv sync --group openai")
        return OpenAIProvider(
            api_key=config.OPENAI_API_KEY,
            model=config.OPENAI_MODEL
        )
    elif provider_type == "ollama":
        from .llm_providers import OllamaProvider
        if OllamaProvider is None:
            raise ImportError("Install with: uv sync --group local")
        return OllamaProvider(
            model=config.LOCAL_MODEL_PATH,
            base_url=config.LOCAL_MODEL_URL
        )
    else:
        raise ValueError(f"Unknown LLM provider: {config.LLM_PROVIDER}")

Provider Comparison¶

How to Run Comparison¶

# Compare all providers
python scripts/evaluate_providers.py --providers anthropic openai ollama

# Compare specific providers
python scripts/evaluate_providers.py --providers anthropic openai

# View results
cat data/provider_comparison.json

See docs/evaluation/provider-comparison.md for detailed guide.

Baseline Performance (To Be Established)¶

Run the comparison script to establish baseline scores for each provider. Results will include:

Overall Scores: Average across 15 golden questions
Component Breakdown: Tool usage, response quality, error handling
Category Performance: Scores by question category (property_info, compliance, etc.)
Per-Question Analysis: Where providers diverge

Expected Results:

Based on implementation characteristics, we anticipate:

Provider	Tool Usage	Response Quality	Error Handling	Overall (Expected)
Anthropic	90-95%	85-90%	95-100%	85-90%
OpenAI	90-95%	80-85%	95-100%	82-87%
Ollama	85-90%	60-70%	90-95%	70-75%

Key Factors: - Tool usage is high across all providers (abstraction works well) - Response quality varies due to model size and keyword matching - Local models trade quality for privacy/cost benefits

To populate actual results: 1. Run: python scripts/evaluate_providers.py --providers anthropic openai ollama 2. Update this section with real scores 3. Add insights about provider-specific strengths/weaknesses

Decision Matrix: Choosing a Provider¶

Use Anthropic When:¶

✅ You need the best quality responses
✅ Budget is not the primary concern (~$6/1K queries)
✅ Multi-round tool calling is critical
✅ Production deployment with high-stakes queries

Use OpenAI When:¶

✅ You want cost optimization (~$5/1K queries, 17% savings)
✅ You have existing OpenAI credits or infrastructure
✅ Quality requirements are 80-85% vs Claude's 85-90%
✅ You need comparable performance to Anthropic

Use Ollama When:¶

✅ Privacy is critical (medical, legal, sensitive data)
✅ You want zero ongoing costs (free)
✅ You have decent hardware (16GB+ RAM for CPU, or GPU)
✅ You're learning/experimenting with local LLMs
✅ Offline capability is required
✅ Acceptable quality threshold is 70-75%

Implementation Details¶

Multi-Round Tool Calling¶

The AIGenerator class handles multi-round tool calling in a provider-agnostic way:

def generate_response(self, query, conversation_history, tools, tool_manager):
    """Generate response using provider-agnostic interface"""

    messages = [LLMMessage(role="user", content=query)]

    # Multi-round tool calling loop
    max_rounds = 2
    for round_num in range(max_rounds):
        response = self.provider.generate(
            messages=messages,
            system_prompt=system_prompt,
            tools=provider_tools,
            temperature=0,
            max_tokens=800
        )

        # Check for tool usage
        if response.stop_reason == "tool_use" and response.tool_calls:
            # Execute tools and continue
            tool_results = self._execute_tools(response.tool_calls, tool_manager)
            messages.append(LLMMessage(role="assistant", content=response))
            messages.append(LLMMessage(role="user", content=tool_results))
            continue
        else:
            # Return final response
            return response.text

This works identically across all providers, regardless of native vs prompt-based tool calling.

Error Handling¶

Each provider implements graceful error handling:

Anthropic: API errors wrapped with retry logic
OpenAI: API errors wrapped with retry logic
Ollama: Connection errors if Ollama not running, clear error messages

Testing & Validation¶

Unit Tests¶

File: tests/chatbot/test_ai_generator_integration.py

12 tests for AI generator with provider abstraction
All tests pass with AnthropicProvider
Mock provider responses for deterministic testing

Integration Tests¶

File: tests/chatbot/test_rag_system.py

31/37 tests passing (6 failures unrelated to provider abstraction)
Tests RAG system with provider factory

Evaluation Harness¶

File: scripts/evaluate_chatbot.py

15 golden questions covering 8 categories
3-component scoring: tool usage (40%), response quality (40%), error handling (20%)
Provider-agnostic (tests RAG system, not specific provider)

Multi-Provider Comparison: scripts/evaluate_providers.py

Future Enhancements¶

Short-Term (Next 6 Months)¶

Azure OpenAI Support: Add Azure endpoints to OpenAI provider
Streaming Responses: Add streaming interface to base provider
Token Usage Tracking: Log tokens per query for cost analysis
Provider-Specific Prompt Tuning: Optimize system prompts per provider

Medium-Term (6-12 Months)¶

Local Model Fine-Tuning: Fine-tune Llama on Poolula-specific data
Hybrid Approach: Use cheap provider for simple queries, expensive for complex
Fallback Chain: Try OpenAI if Anthropic fails, Ollama if both fail
Custom Evaluation Metrics: LLM-as-judge scoring, semantic similarity

Long-Term (12+ Months)¶

Provider Auto-Selection: ML model predicts best provider per query
Cost Optimization: Automatically route queries based on cost/quality trade-off
Multi-Provider Ensembling: Combine responses from multiple providers
Local Model Scaling: Support larger local models (70B+) with quantization

Setup Guide: docs/workflows/llm-provider-setup.md - How to configure each provider
Comparison Guide: docs/evaluation/provider-comparison.md - How to run and interpret comparisons
Qualitative Checklist: docs/evaluation/provider_checklist.md - Manual testing guide
Planning Document: docs/planning/2025-12-03-llm-agnosticism-plan.md - Original implementation plan
Project Overview: CLAUDE.md - High-level project context

Revision History¶

2025-12-04: Initial architecture document with provider abstraction layer
Future: Update with actual baseline comparison results after running evaluation

LLM Provider Architecture¶

Overview¶

Architecture Principles¶

Provider Abstraction Layer¶

Base Interface¶

Data Models¶

Supported Providers¶

1. Anthropic Claude (Default)¶

2. OpenAI GPT¶

3. Ollama (Local Models)¶

Provider Factory¶

Provider Comparison¶

How to Run Comparison¶

Baseline Performance (To Be Established)¶

Decision Matrix: Choosing a Provider¶

Use Anthropic When:¶

Use OpenAI When:¶

Use Ollama When:¶

Implementation Details¶

Multi-Round Tool Calling¶

Error Handling¶

Testing & Validation¶

Unit Tests¶

Integration Tests¶

Evaluation Harness¶

Future Enhancements¶

Short-Term (Next 6 Months)¶

Medium-Term (6-12 Months)¶

Long-Term (12+ Months)¶

Related Documentation¶

Revision History¶