Results & Baselines¶
Current evaluation metrics and performance baselines for the Poolula Platform chatbot.
Example Results
These are example/target results that demonstrate the evaluation methodology and expected performance levels. The evaluation harness (scripts/evaluate_chatbot.py) and test dataset (apps/evaluator/poolula_eval_set.jsonl) are fully functional and ready to use. Actual results will be added after running the evaluation script against the production chatbot.
To generate real results:
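A hypothetical invocation is sketched below; the script path comes from this page, but the exact CLI flags (`--dataset`, `--output`) are assumptions and may differ from the script's actual interface:

```shell
# Run the evaluation harness against the test dataset (flags are illustrative)
python scripts/evaluate_chatbot.py \
    --dataset apps/evaluator/poolula_eval_set.jsonl \
    --output results.json
```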
Latest Evaluation Results (Example)¶
Run Date: 2024-11-15
Question Set: data/poolula_eval_set.jsonl (15 questions)
Environment: Local development with full document set
Overall Performance¶
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
POOLULA CHATBOT EVALUATION RESULTS
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Overall Score: 87.3%
Component Scores:
- Tool Usage: 93.3%
- Response Quality: 86.7%
- Error Handling: 100.0%
Questions Passed: 13/15 (≥70% threshold)
Questions Failed: 2/15
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
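The summary numbers above can be derived from per-question scores as sketched below. The scores in the example are illustrative, not the real run data, and treating the overall score as a plain mean of per-question scores is an assumption:

```python
# Aggregate per-question scores into the summary stats reported above.
# The 70% pass threshold comes from this page; simple-mean aggregation
# is an assumption about the harness's scoring.
PASS_THRESHOLD = 0.70

def summarize(scores):
    passed = sum(1 for s in scores if s >= PASS_THRESHOLD)
    return {
        "overall": sum(scores) / len(scores),
        "passed": passed,
        "failed": len(scores) - passed,
    }

# Illustrative scores only -- not the actual run data
example_scores = [1.0, 1.0, 1.0, 0.93, 0.95, 0.8, 0.73, 0.67]
print(summarize(example_scores))
```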
Performance by Category¶
| Category | Questions | Avg Score | Pass Rate | Status |
|---|---|---|---|---|
| property_info | 3 | 100.0% | 3/3 | ✅ Excellent |
| property_financials | 3 | 93.3% | 3/3 | ✅ Excellent |
| formation | 1 | 100.0% | 1/1 | ✅ Excellent |
| documents | 2 | 95.0% | 2/2 | ✅ Excellent |
| transactions | 3 | 80.0% | 2/3 | ⚠️ Good |
| aggregations | 1 | 73.3% | 1/1 | ⚠️ Acceptable |
| hybrid | 1 | 73.3% | 1/1 | ⚠️ Acceptable |
| compliance | 1 | 66.7% | 0/1 | ❌ Needs work |
Component Breakdown¶
Tool Usage: 93.3%
- 14/15 questions used correct tools
- 1 question made a suboptimal tool choice
- Strong understanding of when to use database vs documents
Response Quality: 86.7%
- Average 4.2/5 expected keywords found
- Responses are generally complete
- Some missing details in complex queries
Error Handling: 100.0%
- Zero crashes or exceptions
- All queries completed successfully
- Robust error handling
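The crash-tolerant behavior described above implies an evaluation loop where each question runs inside try/except, so one failure cannot abort the run. A minimal sketch, where `ask` is a stand-in for the real chatbot call (the field names are assumptions):

```python
# Run every question even if some raise; the error rate falls out of
# the recorded outcomes rather than crashing the whole evaluation.
def run_eval(questions, ask):
    results, errors = [], 0
    for q in questions:
        try:
            results.append({"question": q, "answer": ask(q), "error": None})
        except Exception as exc:  # record the failure, keep going
            errors += 1
            results.append({"question": q, "answer": None, "error": str(exc)})
    error_rate = errors / len(questions) if questions else 0.0
    return results, error_rate
```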
Detailed Results¶
High-Performing Questions (90-100%)¶
Question: "What is our property address?"
- Score: 100.0%
- Category: property_info
- Tools: ✓ query_database
- Keywords: 4/4 found
- Notes: Perfect retrieval
Question: "What is our EIN number?"
- Score: 100.0%
- Category: formation
- Tools: ✓ query_database
- Keywords: 2/2 found
- Notes: Correct database query
Question: "What documents are in our knowledge base?"
- Score: 100.0%
- Category: documents
- Tools: ✓ list_business_documents
- Keywords: 5/5 found
- Notes: Complete list returned
Medium-Performing Questions (70-89%)¶
Question: "What was my rental income in August 2024?"
- Score: 80.0%
- Category: transactions
- Tools: ✓ query_database
- Keywords: 3/5 found (missing "August", "breakdown")
- Notes: Found total but not monthly breakdown
Question: "Show me total expenses by category for 2024"
- Score: 73.3%
- Category: aggregations
- Tools: ✓ query_database (aggregate function)
- Keywords: 3/5 found
- Notes: Aggregation worked but formatting could be clearer
Low-Performing Questions (<70%)¶
Question: "When is our annual LLC report due in Colorado?"
- Score: 66.7%
- Category: compliance
- Tools: ✗ Used query_database instead of search_document_content
- Keywords: 1/3 found
- Notes: Tool selection error, answer incomplete
Analysis: Compliance deadlines are in documents, not database. AI should use document search.
Trends Over Time¶
Historical Performance¶
| Date | Overall | Tool Usage | Quality | Errors |
|---|---|---|---|---|
| 2024-11-15 | 87.3% | 93.3% | 86.7% | 0.0% |
| 2024-11-14 | 85.1% | 90.0% | 84.4% | 2.2% |
| 2024-11-13 | 82.7% | 86.7% | 82.2% | 4.4% |
Improvement: +4.6% over 2 days
Key changes:
- Improved system prompt clarity
- Added database aggregate functions
- Better tool definitions
Category Trends¶
Improving:
- property_info: 95% → 100% (+5%)
- transactions: 75% → 80% (+5%)
Stable:
- formation: 100% (maintained)
- documents: 95% (maintained)
Needs attention:
- compliance: 60% → 66.7% (+6.7%, but still below threshold)
Known Issues¶
Issue 1: Compliance Questions¶
Problem: AI struggles with compliance deadline questions.
Root cause: Tool selection - tries database instead of document search.
Impact: 1/1 compliance questions failed.
Plan: Enhance system prompt with examples of compliance queries requiring document search.
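One possible mitigation is a lightweight pre-routing hint that nudges compliance-style questions toward document search. The tool names below match those used elsewhere on this page; the keyword list is an assumption, sketched rather than taken from the codebase:

```python
# Nudge compliance-style questions toward document search instead of
# the database. Hint phrases are illustrative assumptions.
COMPLIANCE_HINTS = ("report due", "deadline", "filing", "annual report", "compliance")

def suggest_tool(question: str) -> str:
    q = question.lower()
    if any(hint in q for hint in COMPLIANCE_HINTS):
        return "search_document_content"
    return "query_database"
```

For example, `suggest_tool("When is our annual LLC report due in Colorado?")` routes to `search_document_content`, while a plain property question still routes to the database.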
Issue 2: Monthly Breakdowns¶
Problem: Transaction aggregations don't always group by month correctly.
Root cause: Aggregate function needs better month extraction.
Impact: 1/3 transaction questions partially failed.
Plan: Add explicit month grouping examples to tool documentation.
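The month-grouping fix can be sketched against a SQLite-style transactions table, where `strftime('%Y-%m', ...)` supplies the explicit month bucket the aggregate was missing. The table and column names below are assumptions, not the platform's actual schema:

```python
# Demonstrate explicit month extraction for transaction aggregates.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (date TEXT, category TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?)",
    [("2024-08-01", "rent", 1800.0),
     ("2024-08-15", "rent", 200.0),
     ("2024-09-01", "rent", 1800.0)],
)
# GROUP BY the derived month column so totals come back per month,
# not as one lump sum across the whole date range.
rows = conn.execute(
    "SELECT strftime('%Y-%m', date) AS month, SUM(amount) "
    "FROM transactions WHERE category = 'rent' "
    "GROUP BY month ORDER BY month"
).fetchall()
print(rows)  # [('2024-08', 2000.0), ('2024-09', 1800.0)]
```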
Issue 3: Keyword Matching Limitations¶
Problem: Valid synonyms not recognized ("property" vs "real estate").
Root cause: Simple keyword matching doesn't understand semantics.
Impact: Minor score reductions across multiple questions.
Plan: Implement LLM-as-judge scoring (see Roadmap).
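The synonym problem is easy to see in a minimal sketch of exact-match keyword scoring, in the style described on this page (the exact scoring formula is an assumption): a response saying "real estate" never matches the expected keyword "property".

```python
# Exact substring matching: a correct answer phrased with a synonym
# still loses points, which motivates LLM-as-judge scoring.
def keyword_score(response: str, expected_keywords: list[str]) -> float:
    text = response.lower()
    hits = sum(1 for kw in expected_keywords if kw.lower() in text)
    return hits / len(expected_keywords)

score = keyword_score("The real estate is located at 123 Main St.",
                      ["property", "123 Main St"])
print(score)  # 0.5 -- "property" missed despite the synonym
```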
Baseline Targets¶
Current Baselines¶
| Metric | Baseline | Target | Current |
|---|---|---|---|
| Overall Score | 80% | 90% | 87.3% ✓ |
| Tool Usage | 85% | 95% | 93.3% ✓ |
| Response Quality | 80% | 90% | 86.7% ✓ |
| Error Rate | <5% | <2% | 0.0% ✓ |
Quality Gates¶
For Production Deployment:
- Overall score ≥ 85%
- Tool usage ≥ 90%
- Error rate < 2%
- All categories ≥ 70%
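The gate above can be expressed as a small check. Thresholds are taken from the list; the function and field names are assumptions for illustration:

```python
# Production quality gate: every threshold must hold, including the
# per-category floor of 70%.
def gates_pass(overall, tool_usage, error_rate, category_scores):
    return (
        overall >= 0.85
        and tool_usage >= 0.90
        and error_rate < 0.02
        and all(s >= 0.70 for s in category_scores.values())
    )

# With the current run, the compliance category (66.7%) fails the gate:
current_categories = {"property_info": 1.0, "compliance": 0.667}
print(gates_pass(0.873, 0.933, 0.0, current_categories))  # False
```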
Current Status: ⚠️ Not yet production-ready: the compliance category (66.7%) falls below the 70% per-category gate, although all other gates pass.
Recommendation: Address the compliance tool-selection issue before deploying.
Performance by Question Length¶
| Question Length | Avg Score | Count |
|---|---|---|
| Short (1-10 words) | 95.0% | 6 |
| Medium (11-20 words) | 85.0% | 7 |
| Long (>20 words) | 75.0% | 2 |
Insight: Performance degrades with question length, dropping 20 points from short to long questions.
Performance by Tool Combination¶
| Tool Combination | Avg Score | Count |
|---|---|---|
| Database only | 90.5% | 10 |
| Documents only | 95.0% | 2 |
| Hybrid (both) | 73.3% | 1 |
| List only | 100.0% | 2 |
Insight: Hybrid queries are most challenging, need more test coverage.
Comparison to Benchmarks¶
Industry Baselines¶
Typical RAG system performance (from literature):
- Tool selection accuracy: 70-85%
- Response quality: 60-75%
- Overall user satisfaction: 65-80%
Poolula Platform:
- Tool selection: 93.3% (above benchmark ✓)
- Response quality: 86.7% (above benchmark ✓)
- Overall score: 87.3% (above benchmark ✓)
Competitive Positioning¶
Simple Q&A bots: 60-70% accuracy
Enterprise RAG systems: 75-85% accuracy
Poolula Platform: 87.3% accuracy (competitive with enterprise solutions)
Next Steps¶
Short Term (This Week)¶
- [ ] Fix compliance questions
    - Update system prompt with document search examples
    - Add compliance-specific keywords to tool selection logic
- [ ] Improve monthly aggregations
    - Enhance aggregate function documentation
    - Add month grouping examples
- [ ] Add more hybrid questions
    - Current coverage: 1 question
    - Target: 3-5 questions
Medium Term (This Month)¶
- [ ] Implement retrieval metrics
    - Track precision/recall for document search
    - Measure source relevance
- [ ] Add latency tracking
    - Record response time per question
    - Identify slow queries
- [ ] User feedback integration
    - Add thumbs up/down to responses
    - Track user satisfaction
Long Term (Next Quarter)¶
- [ ] LLM-as-judge scoring
    - Use GPT-4 to evaluate response quality
    - More nuanced than keyword matching
- [ ] A/B testing framework
    - Test prompt variations
    - Compare tool configurations
- [ ] Continuous evaluation
    - Run eval suite nightly
    - Track trends automatically
Related Documentation¶
- Evaluation Harness - How evaluations are run
- Question Design - Question selection and design
- Scoring Methodology - How scores are calculated
- Improvement Roadmap - Planned enhancements