# Provider Qualitative Testing Checklist
Beyond automated scoring, this checklist helps you manually verify provider quality across dimensions that are difficult to automate.
## Purpose
The automated evaluation (scripts/evaluate_providers.py) tests:
- Tool calling correctness (40% weight)
- Keyword matching (40% weight)
- Error handling (20% weight)
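These weights amount to a straightforward weighted average. A minimal sketch (the dimension keys and score fields here are illustrative; they are not the actual data structures produced by scripts/evaluate_providers.py):

```python
# Illustrative sketch: combine per-dimension scores (each in [0, 1]) using
# the weights listed above. The key names are hypothetical, not the script's.
WEIGHTS = {"tool_calling": 0.40, "keyword_matching": 0.40, "error_handling": 0.20}

def overall_score(scores: dict[str, float]) -> float:
    """Weighted average across the three automated dimensions."""
    return sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)
```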
This checklist tests:
- Hallucinations (AI making up facts)
- Response tone and professionalism
- Edge case handling
- Context management (multi-turn conversations)
## How to Use This Checklist
- Run automated evaluation first: Get quantitative baseline
- Use this checklist for qualitative validation: Catch issues automation misses
- Document findings: Add notes to comparison report
- Prioritize high-stakes queries: Focus on financial/legal questions
Testing Frequency:
- Initial provider adoption: Full checklist
- After provider updates: Spot check critical queries
- Quarterly review: Random sample of 5-10 queries
## Checklist Sections
### 1. Tool Calling Correctness
#### Test 1.1: Database Tool Usage
Query: "What is our property address?"
Expected Behavior:
- ✅ Calls query_database tool
- ✅ SQL query is well-formed
- ✅ Results incorporated into response
Red Flags:
- ❌ Uses document search instead of database
- ❌ Makes up address without calling tools
- ❌ SQL query has syntax errors
Provider Results:
| Provider | Tool Called | SQL Valid | Result Used | Pass/Fail |
|---|---|---|---|---|
| Anthropic | query_database | ✅ | ✅ | ✅ |
| OpenAI | query_database | ✅ | ✅ | ✅ |
| Ollama | query_database | ✅ | ✅ | ✅ |
Notes: Add observations here
#### Test 1.2: Document Search Usage
Query: "What is the business purpose of Poolula LLC?"
Expected Behavior:
- ✅ Calls search_document_content tool
- ✅ Search query is semantically relevant
- ✅ Quotes or paraphrases from documents
Red Flags:
- ❌ Uses database instead of document search
- ❌ Makes up business purpose
- ❌ Search query is malformed or irrelevant
Provider Results:
| Provider | Tool Called | Query Relevant | Quotes Docs | Pass/Fail |
|---|---|---|---|---|
| Anthropic | | | | |
| OpenAI | | | | |
| Ollama | | | | |
Notes: Add observations here
#### Test 1.3: Multi-Round Tool Calling
Query: "Show me documents related to insurance, then tell me the coverage amount"
Expected Behavior:
- ✅ Round 1: Calls list_business_documents or search_document_content
- ✅ Round 2: Uses document results to extract coverage amount
- ✅ Final response synthesizes findings
Red Flags:
- ❌ Only uses one tool (misses multi-round opportunity)
- ❌ Doesn't use results from first tool in second round
- ❌ Hallucinates coverage amount
Provider Results:
| Provider | Tools Used | Multi-Round | Correct Amount | Pass/Fail |
|---|---|---|---|---|
| Anthropic | | | | |
| OpenAI | | | | |
| Ollama | | | | |
Notes: Add observations here
### 2. Response Quality
#### Test 2.1: No Hallucinations
Query: "What is the market value of our property?"
Expected Behavior:
- ✅ Recognizes data not available in sources
- ✅ States "I don't have market value data" or similar
- ✅ May offer alternative data (acquisition price, basis)
Red Flags:
- ❌ Makes up a market value figure
- ❌ Uses acquisition price without clarification
- ❌ Provides outdated valuation without disclaimer
Provider Results:
| Provider | Hallucinated? | Clarified Unavailable Data? | Pass/Fail |
|---|---|---|---|
| Anthropic | | | |
| OpenAI | | | |
| Ollama | | | |
Actual Responses:
Anthropic:
OpenAI:
Ollama:
#### Test 2.2: Professional Tone
Query: "Tell me about our LLC"
Expected Behavior:
- ✅ Professional, businesslike tone
- ✅ Factual, no unnecessary embellishment
- ✅ No informal language or slang
Red Flags:
- ❌ Overly casual ("your cool LLC", "awesome property")
- ❌ Marketing speak ("best-in-class", "premier")
- ❌ Emotional language ("exciting", "unfortunate")
Provider Results:
| Provider | Professional? | Factual? | Appropriate | Pass/Fail |
|---|---|---|---|---|
| Anthropic | | | | |
| OpenAI | | | | |
| Ollama | | | | |
Notes: Add tone observations here
#### Test 2.3: No Meta-Commentary
Query: "What was rental income in August 2024?"
Expected Behavior:
- ✅ Direct answer: "Rental income in August 2024 was $X"
- ✅ May include context (source, calculation)
Red Flags:
- ❌ "Based on my search of the database..."
- ❌ "According to the tool results..."
- ❌ "Let me look that up for you..."
Provider Results:
| Provider | Direct Answer | No Meta-Commentary | Pass/Fail |
|---|---|---|---|
| Anthropic | | | |
| OpenAI | | | |
| Ollama | | | |
Notes: Add observations here
#### Test 2.4: Appropriate Detail Level
Query: "What is our EIN?"
Expected Behavior:
- ✅ Concise: "Your EIN is XX-XXXXXXX"
- ✅ May add brief context if relevant
Red Flags:
- ❌ Overly verbose (paragraph about what an EIN is)
- ❌ Too terse ("12345678" without label)
Provider Results:
| Provider | Concise? | Sufficient? | Appropriate | Pass/Fail |
|---|---|---|---|---|
| Anthropic | | | | |
| OpenAI | | | | |
| Ollama | | | | |
Notes: Add observations here
### 3. Edge Cases
#### Test 3.1: No Results Found
Query: "Show me transactions from 2020" (assuming no 2020 transactions)
Expected Behavior:
- ✅ "I don't see any transactions from 2020"
- ✅ May offer alternative ("earliest transaction is from 2022")
- ✅ Doesn't apologize excessively
Red Flags:
- ❌ Makes up transactions
- ❌ Returns error message unformatted
- ❌ Overly apologetic ("I'm so sorry I couldn't find...")
Provider Results:
| Provider | Graceful Handling | No Fabrication | Pass/Fail |
|---|---|---|---|
| Anthropic | | | |
| OpenAI | | | |
| Ollama | | | |
Notes: Add observations here
#### Test 3.2: Ambiguous Queries
Query: "Tell me about the property"
Expected Behavior:
- ✅ Clarifies if multiple properties exist
- ✅ Returns info about single property if only one
- ✅ Doesn't assume user intent
Red Flags:
- ❌ Returns info about wrong property
- ❌ Combines data from multiple properties without clarification
- ❌ Ignores ambiguity entirely
Provider Results:
| Provider | Clarified Ambiguity | Correct Handling | Pass/Fail |
|---|---|---|---|
| Anthropic | | | |
| OpenAI | | | |
| Ollama | | | |
Notes: Add observations here
#### Test 3.3: Out-of-Scope Queries
Query: "What is the weather in Montrose today?"
Expected Behavior:
- ✅ "I don't have access to weather data"
- ✅ Stays in scope (Poolula business data only)
Red Flags:
- ❌ Attempts to answer (hallucinates weather)
- ❌ Calls inappropriate tools (database, document search)
Provider Results:
| Provider | Stayed in Scope | Clear Boundary | Pass/Fail |
|---|---|---|---|
| Anthropic | | | |
| OpenAI | | | |
| Ollama | | | |
Notes: Add observations here
### 4. Performance
#### Test 4.1: Response Latency
Procedure: Run 10 varied queries and measure response time
Target:
- Cloud providers (Anthropic, OpenAI): <5s typical, <10s worst case
- Local provider (Ollama): <15s typical, <30s worst case
Provider Results:
| Provider | Min (s) | Median (s) | Max (s) | Acceptable? |
|---|---|---|---|---|
| Anthropic | | | | |
| OpenAI | | | | |
| Ollama | | | | |
Notes: Add latency observations here
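The latency measurement can be scripted as a small harness; in this sketch, ask_provider is a placeholder for whichever provider client is under test:

```python
import statistics
import time

def measure_latencies(ask_provider, queries):
    """Time each query and return (min, median, max) in seconds.

    ask_provider is any callable that takes a query string and blocks
    until the provider responds.
    """
    times = []
    for query in queries:
        start = time.perf_counter()
        ask_provider(query)
        times.append(time.perf_counter() - start)
    return min(times), statistics.median(times), max(times)
```

The min/median/max triple maps directly onto the columns of the results table above.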
#### Test 4.2: Timeout Failures
Procedure: Run 20 queries and count timeouts
Target: <5% timeout rate
Provider Results:
| Provider | Total Queries | Timeouts | Rate | Pass/Fail |
|---|---|---|---|---|
| Anthropic | 20 | | | |
| OpenAI | 20 | | | |
| Ollama | 20 | | | |
Notes: Add timeout patterns here
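One way to tally timeouts is to bound each call with a future. In this sketch, run_query is a placeholder for the provider under test, and the time limit is a parameter you would set from the Test 4.1 targets:

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

def timeout_rate(run_query, queries, limit_s):
    """Fraction of queries exceeding limit_s seconds (target: < 0.05)."""
    timeouts = 0
    with ThreadPoolExecutor(max_workers=1) as pool:
        for query in queries:
            future = pool.submit(run_query, query)
            try:
                future.result(timeout=limit_s)
            except FutureTimeout:
                timeouts += 1
    return timeouts / len(queries)
```

Note that future.result(timeout=...) only stops waiting: the underlying call keeps running in its worker thread, so a timed-out query can delay the next one and the pool shutdown.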
### 5. Context Handling (Multi-Turn Conversations)
#### Test 5.1: Follow-Up Questions
Turn 1: "What properties does Poolula LLC own?"
Turn 2: "When was it acquired?"
Expected Behavior:
- ✅ Interprets "it" as referring to the property from Turn 1
- ✅ Provides acquisition date without needing the property name repeated
Red Flags:
- ❌ "What property are you referring to?" (lost context)
- ❌ Returns acquisition dates for all properties
- ❌ Errors out
Provider Results:
| Provider | Maintained Context | Correct Referent | Pass/Fail |
|---|---|---|---|
| Anthropic | | | |
| OpenAI | | | |
| Ollama | | | |
Notes: Add context handling observations here
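When scripting this test, both turns must be sent in one conversation so the model has a chance to resolve "it". A hypothetical sketch, assuming a chat-completions-style message list (the exact request shape depends on the provider client):

```python
# Hypothetical helper: build the two-turn conversation for Test 5.1.
# The role/content message format is an assumption about the provider API.
def follow_up_messages(turn1_answer):
    """Turn 1, the provider's answer, then the follow-up as one history."""
    return [
        {"role": "user", "content": "What properties does Poolula LLC own?"},
        {"role": "assistant", "content": turn1_answer},
        {"role": "user", "content": "When was it acquired?"},
    ]
```

Sending Turn 2 without the first two messages tests nothing: any provider would have to ask "what property?".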
#### Test 5.2: Multi-Turn Tool Usage
Turn 1: "List our business documents"
Turn 2: "Show me the operating agreement"
Turn 3: "What's the effective date?"
Expected Behavior:
- ✅ Turn 1: Calls list_business_documents
- ✅ Turn 2: Calls search_document_content with "operating agreement"
- ✅ Turn 3: Extracts effective date from Turn 2 results
Red Flags:
- ❌ Repeats Turn 1 search in Turn 2
- ❌ Loses document context by Turn 3
- ❌ Fabricates effective date
Provider Results:
| Provider | Turn 1 OK | Turn 2 OK | Turn 3 OK | Pass/Fail |
|---|---|---|---|---|
| Anthropic | | | | |
| OpenAI | | | | |
| Ollama | | | | |
Notes: Add multi-turn observations here
## Summary & Recommendations
### Overall Assessment
| Provider | Tool Calling | Response Quality | Edge Cases | Performance | Context | Overall |
|---|---|---|---|---|---|---|
| Anthropic | | | | | | |
| OpenAI | | | | | | |
| Ollama | | | | | | |
Rating Scale:
- ✅ Excellent (no issues)
- ⚠️ Acceptable (minor issues)
- ❌ Poor (significant issues)
### Key Findings
Strengths by Provider:
Anthropic: List strengths here
OpenAI: List strengths here
Ollama: List strengths here
Weaknesses by Provider:
Anthropic: List weaknesses here
OpenAI: List weaknesses here
Ollama: List weaknesses here
### Recommendations
For Production Use: Which provider(s) are production-ready?
For Experimentation: Which provider(s) are suitable for testing?
Areas for Improvement:
1. What needs fixing?
2. Which providers need prompt tuning?
3. What evaluation gaps exist?
## Next Steps
- Document findings: Add summary to docs/architecture/llm-providers.md
- Address issues: Tune prompts, adjust tools, update system prompt
- Re-test: Run checklist again after changes
- Track over time: Re-run quarterly or after provider updates
## Related Documentation
- docs/evaluation/provider-comparison.md - How to run automated comparisons
- docs/evaluation/harness.md - Evaluation system design
- docs/workflows/llm-provider-setup.md - Provider configuration
- docs/architecture/llm-providers.md - Provider architecture