Results & Baselines¶
Current evaluation metrics and performance baselines for the Poolula Platform chatbot.
Example Results
These are example/target results that demonstrate the evaluation methodology and expected performance levels. The evaluation harness (scripts/evaluate_chatbot.py) and test dataset (apps/evaluator/poolula_eval_set.jsonl) are fully functional and ready to use. Actual results will be added after running the evaluation script against the production chatbot.
To generate real results:
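A hypothetical invocation is sketched below; the script path comes from this page, but the exact CLI flags (`--dataset`, `--output`) are assumptions and may differ from the script's actual interface:

```shell
# Run the evaluation harness against the test dataset (flags are illustrative)
python scripts/evaluate_chatbot.py \
    --dataset apps/evaluator/poolula_eval_set.jsonl \
    --output results.json
```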
Latest Evaluation Results (Example)¶
Run Date: 2024-11-15
Question Set: data/poolula_eval_set.jsonl (15 questions)
Environment: Local development with full document set
Overall Performance¶
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
POOLULA CHATBOT EVALUATION RESULTS
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Overall Score: 87.3%
Component Scores:
- Tool Usage: 93.3%
- Response Quality: 86.7%
- Error Handling: 100.0%
Questions Passed: 13/15 (≥70% threshold)
Questions Failed: 2/15
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
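The summary numbers above can be derived from per-question scores as sketched below. The scores in the example are illustrative, not the real run data, and treating the overall score as a plain mean of per-question scores is an assumption:

```python
# Aggregate per-question scores into the summary stats reported above.
# The 70% pass threshold comes from this page; simple-mean aggregation
# is an assumption about the harness's scoring.
PASS_THRESHOLD = 0.70

def summarize(scores):
    passed = sum(1 for s in scores if s >= PASS_THRESHOLD)
    return {
        "overall": sum(scores) / len(scores),
        "passed": passed,
        "failed": len(scores) - passed,
    }

# Illustrative scores only -- not the actual run data
example_scores = [1.0, 1.0, 1.0, 0.93, 0.95, 0.8, 0.73, 0.67]
print(summarize(example_scores))
```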
Performance by Category¶
| Category | Questions | Avg Score | Pass Rate | Status |
|---|---|---|---|---|
| property_info | 3 | 100.0% | 3/3 | ✅ Excellent |
| property_financials | 3 | 93.3% | 3/3 | ✅ Excellent |
| formation | 1 | 100.0% | 1/1 | ✅ Excellent |
| documents | 2 | 95.0% | 2/2 | ✅ Excellent |
| transactions | 3 | 80.0% | 2/3 | ⚠️ Good |
| aggregations | 1 | 73.3% | 1/1 | ⚠️ Acceptable |
| hybrid | 1 | 73.3% | 1/1 | ⚠️ Acceptable |
| compliance | 1 | 66.7% | 0/1 | ❌ Needs work |
Component Breakdown¶
Tool Usage: 93.3%
- 14/15 questions used correct tools
- 1 question made a suboptimal tool choice
- Strong understanding of when to use database vs documents
Response Quality: 86.7%
- Average 4.2/5 expected keywords found
- Responses are generally complete
- Some missing details in complex queries
Error Handling: 100.0%
- Zero crashes or exceptions
- All queries completed successfully
- Robust error handling
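The crash-tolerant behavior described above implies an evaluation loop where each question runs inside try/except, so one failure cannot abort the run. A minimal sketch, where `ask` is a stand-in for the real chatbot call (the field names are assumptions):

```python
# Run every question even if some raise; the error rate falls out of
# the recorded outcomes rather than crashing the whole evaluation.
def run_eval(questions, ask):
    results, errors = [], 0
    for q in questions:
        try:
            results.append({"question": q, "answer": ask(q), "error": None})
        except Exception as exc:  # record the failure, keep going
            errors += 1
            results.append({"question": q, "answer": None, "error": str(exc)})
    error_rate = errors / len(questions) if questions else 0.0
    return results, error_rate
```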
Detailed Results¶
High-Performing Questions (90-100%)¶
Question: "What is our property address?"
- Score: 100.0%
- Category: property_info
- Tools: ✓ query_database
- Keywords: 4/4 found
- Notes: Perfect retrieval
Question: "What is our EIN number?"
- Score: 100.0%
- Category: formation
- Tools: ✓ query_database
- Keywords: 2/2 found
- Notes: Correct database query
Question: "What documents are in our knowledge base?"
- Score: 100.0%
- Category: documents
- Tools: ✓ list_business_documents
- Keywords: 5/5 found
- Notes: Complete list returned
Medium-Performing Questions (70-89%)¶
Question: "What was my rental income in August 2024?"
- Score: 80.0%
- Category: transactions
- Tools: ✓ query_database
- Keywords: 3/5 found (missing "August", "breakdown")
- Notes: Found total but not monthly breakdown
Question: "Show me total expenses by category for 2024"
- Score: 73.3%
- Category: aggregations
- Tools: ✓ query_database (aggregate function)
- Keywords: 3/5 found
- Notes: Aggregation worked but formatting could be clearer
Low-Performing Questions (<70%)¶
Question: "When is our annual LLC report due in Colorado?"
- Score: 66.7%
- Category: compliance
- Tools: ✗ Used query_database instead of search_document_content
- Keywords: 1/3 found
- Notes: Tool selection error, answer incomplete
Analysis: Compliance deadlines are in documents, not database. AI should use document search.
Trends Over Time¶
Historical Performance¶
| Date | Overall | Tool Usage | Quality | Errors |
|---|---|---|---|---|
| 2024-11-15 | 87.3% | 93.3% | 86.7% | 0.0% |
| 2024-11-14 | 85.1% | 90.0% | 84.4% | 2.2% |
| 2024-11-13 | 82.7% | 86.7% | 82.2% | 4.4% |
Improvement: +4.6% over 2 days
Key changes:
- Improved system prompt clarity
- Added database aggregate functions
- Better tool definitions
Category Trends¶
Improving:
- property_info: 95% → 100% (+5%)
- transactions: 75% → 80% (+5%)
Stable:
- formation: 100% (maintained)
- documents: 95% (maintained)
Needs attention:
- compliance: 60% → 66.7% (+6.7%, but still below threshold)
Known Issues¶
Issue 1: Compliance Questions¶
Problem: AI struggles with compliance deadline questions.
Root cause: Tool selection - tries database instead of document search.
Impact: 1/1 compliance questions failed.
Plan: Enhance system prompt with examples of compliance queries requiring document search.
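One possible mitigation is a lightweight pre-routing hint that nudges compliance-style questions toward document search. The tool names below match those used elsewhere on this page; the keyword list is an assumption, sketched rather than taken from the codebase:

```python
# Nudge compliance-style questions toward document search instead of
# the database. Hint phrases are illustrative assumptions.
COMPLIANCE_HINTS = ("report due", "deadline", "filing", "annual report", "compliance")

def suggest_tool(question: str) -> str:
    q = question.lower()
    if any(hint in q for hint in COMPLIANCE_HINTS):
        return "search_document_content"
    return "query_database"
```

For example, `suggest_tool("When is our annual LLC report due in Colorado?")` routes to `search_document_content`, while a plain property question still routes to the database.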
Issue 2: Monthly Breakdowns¶
Problem: Transaction aggregations don't always group by month correctly.
Root cause: Aggregate function needs better month extraction.
Impact: 1/3 transaction questions partially failed.
Plan: Add explicit month grouping examples to tool documentation.
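The month-grouping fix can be sketched against a SQLite-style transactions table, where `strftime('%Y-%m', ...)` supplies the explicit month bucket the aggregate was missing. The table and column names below are assumptions, not the platform's actual schema:

```python
# Demonstrate explicit month extraction for transaction aggregates.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (date TEXT, category TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?)",
    [("2024-08-01", "rent", 1800.0),
     ("2024-08-15", "rent", 200.0),
     ("2024-09-01", "rent", 1800.0)],
)
# GROUP BY the derived month column so totals come back per month,
# not as one lump sum across the whole date range.
rows = conn.execute(
    "SELECT strftime('%Y-%m', date) AS month, SUM(amount) "
    "FROM transactions WHERE category = 'rent' "
    "GROUP BY month ORDER BY month"
).fetchall()
print(rows)  # [('2024-08', 2000.0), ('2024-09', 1800.0)]
```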
Issue 3: Keyword Matching Limitations¶
Problem: Valid synonyms not recognized ("property" vs "real estate").
Root cause: Simple keyword matching doesn't understand semantics.
Impact: Minor score reductions across multiple questions.
Plan: Implement LLM-as-judge scoring (see Roadmap).
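The synonym problem is easy to see in a minimal sketch of exact-match keyword scoring, in the style described on this page (the exact scoring formula is an assumption): a response saying "real estate" never matches the expected keyword "property".

```python
# Exact substring matching: a correct answer phrased with a synonym
# still loses points, which motivates LLM-as-judge scoring.
def keyword_score(response: str, expected_keywords: list[str]) -> float:
    text = response.lower()
    hits = sum(1 for kw in expected_keywords if kw.lower() in text)
    return hits / len(expected_keywords)

score = keyword_score("The real estate is located at 123 Main St.",
                      ["property", "123 Main St"])
print(score)  # 0.5 -- "property" missed despite the synonym
```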
Baseline Targets¶
Current Baselines¶
| Metric | Baseline | Target | Current |
|---|---|---|---|
| Overall Score | 80% | 90% | 87.3% ✓ |
| Tool Usage | 85% | 95% | 93.3% ✓ |
| Response Quality | 80% | 90% | 86.7% ✓ |
| Error Rate | <5% | <2% | 0.0% ✓ |
Quality Gates¶
For Production Deployment:
- Overall score ≥ 85%
- Tool usage ≥ 90%
- Error rate < 2%
- All categories ≥ 70%
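The gate above can be expressed as a small check. Thresholds are taken from the list; the function and field names are assumptions for illustration:

```python
# Production quality gate: every threshold must hold, including the
# per-category floor of 70%.
def gates_pass(overall, tool_usage, error_rate, category_scores):
    return (
        overall >= 0.85
        and tool_usage >= 0.90
        and error_rate < 0.02
        and all(s >= 0.70 for s in category_scores.values())
    )

# With the current run, the compliance category (66.7%) fails the gate:
current_categories = {"property_info": 1.0, "compliance": 0.667}
print(gates_pass(0.873, 0.933, 0.0, current_categories))  # False
```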
Current Status: ⚠️ Not yet production-ready: the compliance category (66.7%) falls below the 70% per-category gate, although all other gates pass.
Recommendation: Address the compliance tool-selection issue before deploying.
Performance by Question Length¶
| Question Length | Avg Score | Count |
|---|---|---|
| Short (1-10 words) | 95.0% | 6 |
| Medium (11-20 words) | 85.0% | 7 |
| Long (>20 words) | 75.0% | 2 |
Insight: Performance degrades with question length, dropping 20 points from short to long questions.
Performance by Tool Combination¶
| Tool Combination | Avg Score | Count |
|---|---|---|
| Database only | 90.5% | 10 |
| Documents only | 95.0% | 2 |
| Hybrid (both) | 73.3% | 1 |
| List only | 100.0% | 2 |
Insight: Hybrid queries are most challenging, need more test coverage.
Comparison to Benchmarks¶
Industry Baselines¶
Typical RAG system performance (from literature):
- Tool selection accuracy: 70-85%
- Response quality: 60-75%
- Overall user satisfaction: 65-80%
Poolula Platform:
- Tool selection: 93.3% (above benchmark ✓)
- Response quality: 86.7% (above benchmark ✓)
- Overall score: 87.3% (above benchmark ✓)
Competitive Positioning¶
Simple Q&A bots: 60-70% accuracy
Enterprise RAG systems: 75-85% accuracy
Poolula Platform: 87.3% accuracy (competitive with enterprise solutions)
Next Steps¶
Short Term (This Week)¶
- [ ] Fix compliance questions
    - Update system prompt with document search examples
    - Add compliance-specific keywords to tool selection logic
- [ ] Improve monthly aggregations
    - Enhance aggregate function documentation
    - Add month grouping examples
- [ ] Add more hybrid questions
    - Current coverage: 1 question
    - Target: 3-5 questions
Medium Term (This Month)¶
- [ ] Implement retrieval metrics
    - Track precision/recall for document search
    - Measure source relevance
- [ ] Add latency tracking
    - Record response time per question
    - Identify slow queries
- [ ] User feedback integration
    - Add thumbs up/down to responses
    - Track user satisfaction
Long Term (Next Quarter)¶
- [ ] LLM-as-judge scoring
    - Use GPT-4 to evaluate response quality
    - More nuanced than keyword matching
- [ ] A/B testing framework
    - Test prompt variations
    - Compare tool configurations
- [ ] Continuous evaluation
    - Run eval suite nightly
    - Track trends automatically
Related Documentation¶
- Evaluation Harness - How evaluations are run
- Question Design - Question selection and design
- Scoring Methodology - How scores are calculated
- Improvement Roadmap - Planned enhancements