Evaluation Harness

The evaluation harness systematically tests the Poolula Platform chatbot against a golden question set, providing objective quality metrics.

Overview

File: scripts/evaluate_chatbot.py

Purpose: Automated testing of chatbot responses using pre-defined questions with expected outcomes.

Output: Detailed scoring report showing performance across question categories.

How It Works

sequenceDiagram
    participant Harness as Evaluation Harness
    participant RAG as RAG System
    participant Tools as Search Tools
    participant Scorer as Scoring Engine

    Harness->>Harness: Load golden questions (JSONL)
    loop For each question
        Harness->>RAG: Send question
        RAG->>Tools: Use database/document tools
        Tools-->>RAG: Return results
        RAG-->>Harness: Response + sources
        Harness->>Scorer: Evaluate response
        Scorer->>Scorer: Check tool usage (40%)
        Scorer->>Scorer: Check keywords (40%)
        Scorer->>Scorer: Check for errors (20%)
        Scorer-->>Harness: Score (0.0-1.0)
    end
    Harness->>Harness: Generate report

Running the Harness

Basic Usage

# Run evaluation with default question set
python scripts/evaluate_chatbot.py

# Output:
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
#  POOLULA CHATBOT EVALUATION RESULTS
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
#  Overall Score: 87.3%
#
#  Component Scores:
#   - Tool Usage: 93.3%
#   - Response Quality: 86.7%
#   - Error Rate: 0.0%
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Verbose Mode

# Show detailed per-question results
python scripts/evaluate_chatbot.py --verbose

# Output shows:
# - Question text
# - AI response
# - Tools used
# - Expected vs actual
# - Individual scores

Custom Question Set

# Use custom evaluation questions
python scripts/evaluate_chatbot.py --eval-set data/my_questions.jsonl

Question File Format

Golden questions are stored in JSONL (JSON Lines) format:

{
  "question": "What is our property's total depreciable basis?",
  "category": "property_financials",
  "expected_tools": ["query_database"],
  "expected_keywords": ["basis", "depreciation", "442300"],
  "context": "Tests ability to query property financial data"
}
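Since the file is JSONL, each line is parsed independently. A minimal sketch of the loading step (the helper name `load_questions` is illustrative, not the harness's actual API):

```python
import json

def load_questions(eval_path):
    """Parse a JSONL golden-question file: one JSON object per line."""
    questions = []
    with open(eval_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # tolerate blank lines between records
                questions.append(json.loads(line))
    return questions
```

Unlike a single JSON array, this format lets you append new golden questions without rewriting the file.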

Required Fields

Field              Type    Description
question           string  The user query to test
category           string  Question category for grouping results
expected_tools     array   Tools the AI should use
expected_keywords  array   Keywords that should appear in response

Optional Fields

Field              Type    Description
context            string  Why this question matters
expected_entities  array   Entities that should be mentioned

Scoring Components

1. Tool Usage (40% of score)

What it checks: Did the AI choose the correct search tools?

Available tools:

  • query_database - Query SQLite database for properties, transactions, obligations
  • search_document_content - Semantic search through ingested documents
  • list_business_documents - List available documents

Scoring:

  • ✅ All expected tools used: 1.0
  • ❌ Missing or wrong tools: 0.0

Example:

# Question: "What was my rental income in August 2024?"
# Expected: ["query_database"]
# AI used: ["query_database"] ✓
# Tool score: 1.0
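The all-or-nothing rule above can be sketched as a subset check (an assumption about how "all expected tools used" is tested; the function name is illustrative):

```python
def score_tool_usage(expected_tools, tools_used):
    """Return 1.0 only when every expected tool was actually invoked."""
    return 1.0 if set(expected_tools) <= set(tools_used) else 0.0
```

Under this reading, extra tool calls beyond the expected set are not penalized; only a missing expected tool zeroes the component.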

2. Response Quality (40% of score)

What it checks: Does the response contain expected information?

Methodology: Keyword matching (case-insensitive)

Scoring:

score = (keywords_found / total_expected_keywords)

Example:

# Question: "What is our property address?"
# Expected keywords: ["900", "9th", "Montrose", "CO"]
# Response: "Your property is located at 900 S 9th St, Montrose, CO 81401"
# Keywords found: 4/4
# Quality score: 1.0
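The keyword check reduces to a short function, shown here as a sketch assuming case-insensitive substring matching (a plausible reading of the methodology above, not the confirmed implementation):

```python
def score_keywords(expected_keywords, response):
    """Fraction of expected keywords found in the response (case-insensitive)."""
    if not expected_keywords:
        return 1.0  # nothing expected, nothing to miss
    text = response.lower()
    found = sum(1 for kw in expected_keywords if kw.lower() in text)
    return found / len(expected_keywords)
```

Note that substring matching can produce false positives (e.g. a short keyword like "CO" also appears inside other words), which is one reason the roadmap lists LLM-as-judge scoring as a future enhancement.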

3. Error Handling (20% of score)

What it checks: Did the query complete without errors?

Scoring:

  • ✅ No errors: 1.0
  • ❌ Exception or error response: 0.0

Common errors caught:

  • Database connection failures
  • Tool execution errors
  • Empty/null responses
  • API exceptions
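Putting the three components together, the 40/40/20 weighting reduces to a simple blend (a sketch; the weights come directly from the breakdown above):

```python
def combined_score(tool_score, quality_score, error_score):
    """Blend per-component scores using the 40/40/20 weighting."""
    return 0.4 * tool_score + 0.4 * quality_score + 0.2 * error_score
```

Each component is already normalized to 0.0-1.0, so the combined score stays in the same range.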

Output Report Structure

Summary Statistics

Overall Score: 87.3%
Total Questions: 15
Passed (≥70%): 13
Failed (<70%): 2

Category Breakdown

Performance by Category:
  property_info:        100.0% (3/3 passed)
  property_financials:   93.3% (3/3 passed)
  transactions:          86.7% (2/3 passed)
  documents:            100.0% (2/2 passed)
  hybrid:                73.3% (2/3 passed)

Individual Results

[✓] Question: What is our EIN number?
    Score: 100.0%
    Tools: query_database ✓
    Keywords: 4/4 found

[✗] Question: Show rental income by month for 2024
    Score: 53.3%
    Tools: query_database ✓
    Keywords: 2/5 found (missing: "August", "breakdown")

Interpreting Results

Score Ranges

Score    Interpretation     Action
90-100%  Excellent          Maintain quality
80-89%   Good               Minor improvements possible
70-79%   Acceptable         Review failed questions
60-69%   Needs improvement  Investigation required
<60%     Poor               Significant issues

Common Issues

Low tool usage score:

  • AI isn't recognizing when to use database vs documents
  • Tool definitions may need clarification
  • System prompt may need adjustment

Low response quality score:

  • Missing expected keywords suggests incomplete answers
  • May need better context in questions
  • Retrieval quality issues

High error rate:

  • Check database connectivity
  • Review tool error handling
  • Verify data exists for questions

Integration with Development

Pre-Commit Workflow

# Before committing chatbot changes
python scripts/evaluate_chatbot.py

# Only commit if score ≥ baseline
# Current baseline: 85%

CI/CD Integration

# .github/workflows/test.yml (future)
- name: Run evaluation harness
  run: python scripts/evaluate_chatbot.py

- name: Check threshold
  run: |
    # Assumes a prior step parsed the harness output and exported SCORE
    if [ "$SCORE" -lt 85 ]; then
      echo "Score below threshold!"
      exit 1
    fi
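Until that workflow exists, the same gate can live in Python so the harness itself fails the build (a hedged sketch; `enforce_threshold` is not an existing harness function, and the 0.85 default mirrors the current baseline):

```python
import sys

def enforce_threshold(overall_score, baseline=0.85):
    """Exit nonzero so CI marks the job failed when quality regresses."""
    if overall_score < baseline:
        print(f"Score {overall_score:.1%} is below the {baseline:.0%} baseline")
        sys.exit(1)
    print(f"Score {overall_score:.1%} meets the {baseline:.0%} baseline")
```

Calling `sys.exit(1)` is what a CI runner observes as a failed step, so no separate shell check is needed.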

Code Structure

Main Class: ChatbotEvaluator

class ChatbotEvaluator:
    """Evaluation harness for chatbot quality assessment"""

    def load_eval_set(self, eval_path: str) -> List[Dict]:
        ...  # Load questions from JSONL

    def evaluate_response(...) -> Tuple[float, Dict]:
        ...  # Score a single response

    def run_evaluation(...) -> Dict:
        ...  # Run full evaluation suite

    def print_results(self, results: Dict) -> None:
        ...  # Generate detailed report

Key Methods

evaluate_response():

  • Takes question, response, sources, expected values
  • Returns score (0.0-1.0) and detailed breakdown
  • Checks tools, keywords, errors independently

run_evaluation():

  • Iterates through all questions
  • Calls RAG system for each
  • Collects and aggregates scores
  • Handles exceptions gracefully
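The collect-and-aggregate step can be sketched as follows (illustrative only; the per-question result shape and helper name are assumptions, not the harness's real internals):

```python
from collections import defaultdict

def aggregate(per_question):
    """Average scores overall and per category.

    per_question: list of {"category": str, "score": float} entries.
    """
    by_cat = defaultdict(list)
    for result in per_question:
        by_cat[result["category"]].append(result["score"])
    overall = sum(r["score"] for r in per_question) / len(per_question)
    per_category = {cat: sum(s) / len(s) for cat, s in by_cat.items()}
    return overall, per_category
```

This mirrors the report structure above: one overall percentage plus a per-category breakdown.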

Advanced Usage

Creating Custom Evaluations

# 1. Create custom question set
cat > data/my_eval.jsonl << EOF
{"question": "Custom question 1", "category": "test", ...}
{"question": "Custom question 2", "category": "test", ...}
EOF

# 2. Run evaluation
python scripts/evaluate_chatbot.py --eval-set data/my_eval.jsonl

Tracking Over Time

# Save results with timestamp
python scripts/evaluate_chatbot.py > results/eval_$(date +%Y%m%d).txt

# Compare to previous runs
diff results/eval_20241114.txt results/eval_20241115.txt

Future Enhancements

See Improvement Roadmap for planned features:

  • LLM-as-judge scoring (more nuanced than keyword matching)
  • Retrieval accuracy metrics (precision/recall)
  • Latency tracking per question
  • User feedback integration
  • A/B testing framework