# Scoring Methodology
The evaluation harness uses a three-component scoring system to assess chatbot response quality.
## Overview
Each response receives a score from 0.0 to 1.0 (0% to 100%) based on three weighted components:
### Component Weights
| Component | Weight | Rationale |
|---|---|---|
| Tool Usage | 40% | Correct tool selection is critical for accuracy |
| Response Quality | 40% | Answer must contain expected information |
| Error Handling | 20% | System must be reliable and not crash |
## Component 1: Tool Usage (40%)

### What It Measures

Whether the AI correctly selected the appropriate search tools for the question type.

### Scoring Logic

```python
from typing import List

def score_tool_usage(expected_tools: List[str], tools_used: List[str]) -> float:
    """Return 1.0 if all expected tools were used, 0.0 otherwise."""
    if not expected_tools:
        return 1.0  # No tools required
    # Check that every expected tool appears among the tools actually used
    all_tools_used = all(tool in tools_used for tool in expected_tools)
    return 1.0 if all_tools_used else 0.0
```
### Tool Categories

Available tools:

- `query_database`
  - For: Properties, transactions, obligations, document metadata
  - When: Structured data queries
  - Example: "What was my rental income in August?"
- `search_document_content`
  - For: Searching within ingested documents
  - When: Semantic search for specific information
  - Example: "What's our business purpose in the operating agreement?"
- `list_business_documents`
  - For: Document discovery
  - When: User wants to see what documents exist
  - Example: "What documents do we have?"
### Examples

**Example 1: Correct Tool Selection**

```
Question: "What is our property address?"
Expected: ["query_database"]
AI used: ["query_database"]
Tool score: 1.0 ✓
```

**Example 2: Missing Tool**

```
Question: "What's in our operating agreement?"
Expected: ["search_document_content"]
AI used: []  # AI tried to answer without searching
Tool score: 0.0 ✗
```

**Example 3: Wrong Tool**

```
Question: "List all our documents"
Expected: ["list_business_documents"]
AI used: ["query_database"]  # Wrong tool
Tool score: 0.0 ✗
```

**Example 4: Hybrid Query**

```
Question: "What properties do we own and what documents mention them?"
Expected: ["query_database", "search_document_content"]
AI used: ["query_database", "search_document_content"]
Tool score: 1.0 ✓
```
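The four examples above can be checked mechanically. The sketch below restates `score_tool_usage` from the Scoring Logic section and applies it to each case:

```python
from typing import List

def score_tool_usage(expected_tools: List[str], tools_used: List[str]) -> float:
    """Minimal restatement of the all-or-nothing tool scoring shown above."""
    if not expected_tools:
        return 1.0
    return 1.0 if all(t in tools_used for t in expected_tools) else 0.0

print(score_tool_usage(["query_database"], ["query_database"]))           # 1.0
print(score_tool_usage(["search_document_content"], []))                  # 0.0
print(score_tool_usage(["list_business_documents"], ["query_database"]))  # 0.0
print(score_tool_usage(
    ["query_database", "search_document_content"],
    ["query_database", "search_document_content"],
))                                                                        # 1.0
```

Because the check is all-or-nothing, using extra tools on top of the expected ones still scores 1.0, while omitting any expected tool scores 0.0.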
### Why Tool Usage Matters
Accuracy: Using the right tool determines answer correctness.
- Database query for "rental income" → precise numbers
- Document search for "business purpose" → exact wording from docs
Efficiency: Correct tool selection minimizes response time.
- Don't search documents for structured data
- Don't query database for unstructured document content
Capability demonstration: Shows the AI understands tool purposes.
## Component 2: Response Quality (40%)

### What It Measures

Whether the response contains the expected information, based on keyword matching.

### Scoring Logic

```python
from typing import List

def score_response_quality(expected_keywords: List[str], response: str) -> float:
    """Return the fraction of expected keywords found in the response (case-insensitive)."""
    if not expected_keywords:
        return 1.0  # No keywords required
    response_lower = response.lower()
    keywords_found = sum(
        1 for keyword in expected_keywords
        if keyword.lower() in response_lower
    )
    return keywords_found / len(expected_keywords)
```
### Keyword Selection Guidelines
Good keywords:
- Specific values: "442300", "83-4567890", "900"
- Key concepts: "depreciation", "basis", "rental income"
- Critical terms: "Montrose", "CO", "August"
Avoid:
- Common words: "the", "and", "is"
- Generic terms: "property", "LLC" (unless specifically testing for these)
- Ambiguous words that could appear in any response
### Examples

**Example 1: Perfect Match**

```
Question: "What is our property address?"
Expected keywords: ["900", "9th", "Montrose", "CO"]
Response: "Your property is located at 900 S 9th St, Montrose, CO 81401"
Keywords found: 4/4
Quality score: 1.0 (100%)
```

**Example 2: Partial Match**

```
Question: "What is our property's total basis?"
Expected keywords: ["basis", "depreciation", "442,300", "land", "building"]
Response: "The total basis for your property is $442,300, which includes the land and building components."
Keywords found: 4/5 (missing "depreciation")
Quality score: 0.8 (80%)
```

**Example 3: Poor Match**

```
Question: "What was my rental income in August 2024?"
Expected keywords: ["August", "2024", "rental", "income", "16144"]
Response: "I found some transaction data for you."
Keywords found: 0/5
Quality score: 0.0 (0%)
```
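To make the mechanics concrete, here is a sketch that restates the keyword-ratio logic and scores Example 1:

```python
from typing import List

def score_response_quality(expected_keywords: List[str], response: str) -> float:
    """Minimal restatement of the keyword-ratio scoring shown above."""
    if not expected_keywords:
        return 1.0
    response_lower = response.lower()
    found = sum(1 for k in expected_keywords if k.lower() in response_lower)
    return found / len(expected_keywords)

response = "Your property is located at 900 S 9th St, Montrose, CO 81401"
print(score_response_quality(["900", "9th", "Montrose", "CO"], response))  # 1.0
```

Because matching is plain substring search, formatting matters: the keyword "442300" would not match the formatted value "$442,300", and very short keywords can match inside unrelated words. Numeric keywords should mirror the formatting the assistant is expected to produce.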
### Limitations of Keyword Matching
Current approach:
- Simple, fast, objective
- Works well for factual questions
- No false positives from irrelevant matches
Limitations:
- Doesn't understand synonyms ("property" vs "real estate")
- Doesn't verify logical correctness
- Can't assess explanation quality
Future enhancement: See Roadmap for planned LLM-as-judge scoring.
## Component 3: Error Handling (20%)

### What It Measures

Whether the query completed successfully, without crashes or error responses.

### Scoring Logic

```python
from typing import Optional

def score_error_handling(error: Optional[str]) -> float:
    """Return 1.0 if no error occurred, 0.0 if one did."""
    return 0.0 if error else 1.0
```
### Error Types Detected
System errors:
- Database connection failures
- Tool execution exceptions
- API timeout errors
- Memory errors
Response errors:
- Empty responses
- Null values
- Error messages in response text
- Exception traces
### Examples

**Example 1: Success**

```
Error: None
Error score: 1.0 ✓
```

**Example 2: Database Error**

```
Error: "Database connection failure"
Error score: 0.0 ✗
```

**Example 3: Tool Error**

```
Response: "I encountered an error while searching..."
Error: "ToolExecutionError: Document not found"
Error score: 0.0 ✗
```
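A sketch restating the check, including one edge case worth noting:

```python
from typing import Optional

def score_error_handling(error: Optional[str]) -> float:
    """Minimal restatement: any truthy error string zeroes this component."""
    return 0.0 if error else 1.0

print(score_error_handling(None))                                      # 1.0
print(score_error_handling("ToolExecutionError: Document not found"))  # 0.0
# Caveat: an empty string is falsy, so it scores 1.0 -- the empty-response
# detection listed under "Response errors" must populate `error` upstream.
print(score_error_handling(""))                                        # 1.0
```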
### Why Error Handling Matters
Reliability: System must be robust for production use.
User experience: Crashes frustrate users and reduce trust.
Production readiness: Error handling is critical for deployed applications.
## Composite Scoring

### Calculation Example

Question: "What is our property's total depreciable basis?"

Results:

- Tool usage: 1.0 (used `query_database` as expected)
- Response quality: 0.8 (4/5 keywords found)
- Error handling: 1.0 (no errors)

Final score:

```
(0.40 × 1.0) + (0.40 × 0.8) + (0.20 × 1.0) = 0.40 + 0.32 + 0.20 = 0.92 (92%)
```
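The weighted sum can be expressed as a small helper. This is a sketch using the 40/40/20 weights from the Component Weights table; the function name is illustrative, not the harness's actual API:

```python
# Weights from the Component Weights table
WEIGHTS = {"tool_usage": 0.40, "response_quality": 0.40, "error_handling": 0.20}

def composite_score(tool_usage: float, response_quality: float, error_handling: float) -> float:
    """Weighted sum of the three component scores (each in 0.0-1.0)."""
    return (WEIGHTS["tool_usage"] * tool_usage
            + WEIGHTS["response_quality"] * response_quality
            + WEIGHTS["error_handling"] * error_handling)

print(round(composite_score(1.0, 0.8, 1.0), 2))  # 0.92
```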
### Score Interpretation
| Score Range | Grade | Interpretation |
|---|---|---|
| 90-100% | A | Excellent response |
| 80-89% | B | Good response, minor issues |
| 70-79% | C | Acceptable, needs improvement |
| 60-69% | D | Below acceptable, investigation needed |
| 0-59% | F | Poor response, significant issues |
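The thresholds in the table map directly onto a grading helper; a sketch (the function name is an assumption, not part of the harness):

```python
def letter_grade(score: float) -> str:
    """Map a composite score in [0.0, 1.0] to the letter grades above."""
    if score >= 0.90:
        return "A"
    if score >= 0.80:
        return "B"
    if score >= 0.70:
        return "C"
    if score >= 0.60:
        return "D"
    return "F"

print(letter_grade(0.92))  # A
print(letter_grade(0.75))  # C
```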
### Aggregate Metrics

Overall performance: the mean composite score across all questions in the evaluation set.

Category performance: the mean composite score per question category, used to pinpoint which question types need improvement.
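A sketch of how these aggregates could be computed from per-question results (the `results` records below are hypothetical, not actual baseline numbers):

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-question results: category plus composite score
results = [
    {"category": "database", "score": 0.92},
    {"category": "database", "score": 1.00},
    {"category": "document_search", "score": 0.60},
]

# Overall performance: mean across all questions
overall = mean(r["score"] for r in results)

# Category performance: mean per question category
by_category = defaultdict(list)
for r in results:
    by_category[r["category"]].append(r["score"])
category_means = {cat: mean(scores) for cat, scores in by_category.items()}

print(round(overall, 2))                            # 0.84
print(round(category_means["document_search"], 2))  # 0.6
```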
## Comparison to Traditional Testing

### Unit Tests

Traditional:

```python
def test_get_property():
    response = api.get_property("property-id")
    assert response.address == "900 S 9th St"  # Exact match
```

Evaluation harness:

```python
# Question: "What is our property address?"
# Expected keywords: ["900", "9th", "Montrose"]
# Score: 0.0-1.0 based on keyword presence
# More flexible, accounts for variation
```
### Why This Approach?
LLM outputs vary: Same query can produce different valid responses.
Multiple correct answers: Various phrasings can be equally correct.
Graceful degradation: Partial answers get partial credit.
Realistic assessment: Reflects actual user experience.
## Limitations and Future Improvements

### Current Limitations

- **Keyword matching is simplistic**
  - Doesn't understand synonyms
  - Can't verify logical consistency
  - Misses semantic equivalence
- **Binary tool scoring**
  - No partial credit for using only some of the expected tools
  - Doesn't account for extra tools (as long as the expected ones are present)
- **No retrieval quality metrics**
  - Doesn't verify that sources are relevant
  - Can't detect hallucinations
  - No precision/recall for search results
### Planned Improvements
See Improvement Roadmap:
- LLM-as-judge scoring for nuanced evaluation
- Retrieval metrics (precision, recall, NDCG)
- Latency tracking per question
- User feedback integration
- A/B testing framework
## Related Documentation
- Evaluation Harness - How to run evaluations
- Question Design - Question selection criteria
- Results & Baselines - Current performance metrics