Results & Baselines

Current evaluation metrics and performance baselines for the Poolula Platform chatbot.

Example Results

These are example/target results that demonstrate the evaluation methodology and expected performance levels. The evaluation harness (scripts/evaluate_chatbot.py) and test dataset (apps/evaluator/poolula_eval_set.jsonl) are fully functional and ready to use. Actual results will be added after running the evaluation script against the production chatbot.

To generate real results:

# Run the evaluation harness
uv run python scripts/evaluate_chatbot.py --verbose
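
The exact record schema of poolula_eval_set.jsonl is not shown in this document; the sketch below illustrates plausible fields (question, category, expected_tools, expected_keywords — hypothetical names inferred from the metrics reported here) and the one-JSON-object-per-line shape of any .jsonl file:

```python
import json

# Hypothetical eval-set records -- the actual field names in
# apps/evaluator/poolula_eval_set.jsonl may differ.
sample_records = [
    {
        "question": "What is our property address?",
        "category": "property_info",
        "expected_tools": ["query_database"],
        "expected_keywords": ["address", "street", "city", "state"],
    },
    {
        "question": "When is our annual LLC report due in Colorado?",
        "category": "compliance",
        "expected_tools": ["search_document_content"],
        "expected_keywords": ["annual report", "due", "Colorado"],
    },
]

# One JSON object per line, as in any .jsonl file.
jsonl_text = "\n".join(json.dumps(r) for r in sample_records)
parsed = [json.loads(line) for line in jsonl_text.splitlines()]
print(len(parsed), parsed[0]["category"])  # 2 property_info
```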

Latest Evaluation Results (Example)

Run Date: 2024-11-15

Question Set: apps/evaluator/poolula_eval_set.jsonl (15 questions)

Environment: Local development with full document set

Overall Performance

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 POOLULA CHATBOT EVALUATION RESULTS
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 Overall Score: 87.3%

 Component Scores:
  - Tool Usage:       93.3%
  - Response Quality: 86.7%
  - Error Handling:  100.0%

 Questions Passed:   13/15 (≥70% threshold)
 Questions Failed:    2/15
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Performance by Category

Category             Questions  Avg Score  Pass Rate  Status
property_info        3          100.0%     3/3        ✅ Excellent
property_financials  3          93.3%      3/3        ✅ Excellent
formation            1          100.0%     1/1        ✅ Excellent
documents            2          95.0%      2/2        ✅ Excellent
transactions         3          80.0%      2/3        ⚠️ Good
aggregations         1          73.3%      1/1        ⚠️ Acceptable
hybrid               1          73.3%      1/1        ⚠️ Acceptable
compliance           1          66.7%      0/1        ❌ Needs work

Component Breakdown

Tool Usage: 93.3%

  • 14/15 questions used correct tools
  • 1 question used suboptimal tool choice
  • Strong understanding of when to use database vs documents

Response Quality: 86.7%

  • Average 4.2/5 expected keywords found
  • Responses are generally complete
  • Some missing details in complex queries

Error Handling: 100.0%

  • Zero crashes or exceptions
  • All queries completed successfully
  • Robust error handling
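
The precise scoring formula used by scripts/evaluate_chatbot.py is not reproduced in this document; the sketch below is an illustrative per-question score combining the three components above with hypothetical equal weights (the real harness likely weights them differently):

```python
def score_question(correct_tool: bool, keywords_found: int,
                   keywords_expected: int, errored: bool) -> float:
    """Illustrative per-question score; the real harness may weight differently."""
    tool_score = 1.0 if correct_tool else 0.0
    quality_score = keywords_found / keywords_expected if keywords_expected else 1.0
    error_score = 0.0 if errored else 1.0
    # Hypothetical equal weighting of the three components.
    return round(100 * (tool_score + quality_score + error_score) / 3, 1)

# A perfect answer scores 100; wrong tool plus missing keywords drags it down.
print(score_question(True, 4, 4, False))   # 100.0
print(score_question(False, 1, 3, False))  # 44.4
```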

Detailed Results

High-Performing Questions (90-100%)

Question: "What is our property address?"

  • Score: 100.0%
  • Category: property_info
  • Tools: ✓ query_database
  • Keywords: 4/4 found
  • Notes: Perfect retrieval

Question: "What is our EIN number?"

  • Score: 100.0%
  • Category: formation
  • Tools: ✓ query_database
  • Keywords: 2/2 found
  • Notes: Correct database query

Question: "What documents are in our knowledge base?"

  • Score: 100.0%
  • Category: documents
  • Tools: ✓ list_business_documents
  • Keywords: 5/5 found
  • Notes: Complete list returned

Medium-Performing Questions (70-89%)

Question: "What was my rental income in August 2024?"

  • Score: 80.0%
  • Category: transactions
  • Tools: ✓ query_database
  • Keywords: 3/5 found (missing "August", "breakdown")
  • Notes: Found total but not monthly breakdown

Question: "Show me total expenses by category for 2024"

  • Score: 73.3%
  • Category: aggregations
  • Tools: ✓ query_database (aggregate function)
  • Keywords: 3/5 found
  • Notes: Aggregation worked but formatting could be clearer

Low-Performing Questions (<70%)

Question: "When is our annual LLC report due in Colorado?"

  • Score: 66.7%
  • Category: compliance
  • Tools: ✗ Used query_database instead of search_document_content
  • Keywords: 1/3 found
  • Notes: Tool selection error, answer incomplete

Analysis: Compliance deadlines are in documents, not database. AI should use document search.
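
One hedged way to nudge tool selection in the right direction is a keyword-based routing hint ahead of the model's own choice. The tool names below come from this document; the term list and routing rule are hypothetical, and the planned system-prompt fix may take a different form:

```python
# Hypothetical pre-routing hint: terms that suggest document search
# rather than a database query. The production prompt fix may differ.
DOCUMENT_TERMS = {"report due", "deadline", "compliance", "filing",
                  "operating agreement"}

def suggest_tool(question: str) -> str:
    q = question.lower()
    if any(term in q for term in DOCUMENT_TERMS):
        return "search_document_content"
    return "query_database"

print(suggest_tool("When is our annual LLC report due in Colorado?"))
# -> search_document_content ("report due" matches)
print(suggest_tool("What is our property address?"))
# -> query_database (no document-oriented terms)
```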

Historical Performance

Date        Overall  Tool Usage  Quality  Errors
2024-11-15  87.3%    93.3%       86.7%    0.0%
2024-11-14  85.1%    90.0%       84.4%    2.2%
2024-11-13  82.7%    86.7%       82.2%    4.4%

Improvement: +4.6% over 2 days

Key changes:

  • Improved system prompt clarity
  • Added database aggregate functions
  • Better tool definitions

Improving:

  • property_info: 95% → 100% (+5%)
  • transactions: 75% → 80% (+5%)

Stable:

  • formation: 100% (maintained)
  • documents: 95% (maintained)

Needs attention:

  • compliance: 60% → 66.7% (+6.7%, but still below threshold)

Known Issues

Issue 1: Compliance Questions

Problem: AI struggles with compliance deadline questions.

Root cause: Tool selection - tries database instead of document search.

Impact: 1/1 compliance questions failed.

Plan: Enhance system prompt with examples of compliance queries requiring document search.

Issue 2: Monthly Breakdowns

Problem: Transaction aggregations don't always group by month correctly.

Root cause: Aggregate function needs better month extraction.

Impact: 1/3 transaction questions partially failed.

Plan: Add explicit month grouping examples to tool documentation.
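
The fix amounts to grouping on an explicit year-month key rather than relying on the model to bucket dates itself. A minimal sketch, with hypothetical transaction rows (the real schema lives in the database layer):

```python
from collections import defaultdict
from datetime import date

# Hypothetical transaction rows; field names are illustrative.
transactions = [
    {"date": date(2024, 8, 1),  "category": "rent",    "amount": 2000.0},
    {"date": date(2024, 8, 15), "category": "repairs", "amount": -350.0},
    {"date": date(2024, 9, 1),  "category": "rent",    "amount": 2000.0},
]

# Group on an explicit YYYY-MM key so August and September totals
# never collapse into a single bucket.
totals = defaultdict(float)
for tx in transactions:
    totals[tx["date"].strftime("%Y-%m")] += tx["amount"]

print(dict(totals))  # {'2024-08': 1650.0, '2024-09': 2000.0}
```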

Issue 3: Keyword Matching Limitations

Problem: Valid synonyms not recognized ("property" vs "real estate").

Root cause: Simple keyword matching doesn't understand semantics.

Impact: Minor score reductions across multiple questions.

Plan: Implement LLM-as-judge scoring (see Roadmap).
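
Until LLM-as-judge scoring lands, a small synonym table can soften exact matching. The synonym groups below are hypothetical examples (an LLM judge would generalize far beyond a fixed table):

```python
# Hypothetical synonym groups; an LLM judge would not need a fixed table.
SYNONYMS = {
    "property": {"property", "real estate"},
    "income": {"income", "revenue", "earnings"},
}

def keyword_hit(keyword: str, response: str) -> bool:
    """Count a keyword as found if any of its synonyms appears in the response."""
    text = response.lower()
    return any(alt in text for alt in SYNONYMS.get(keyword, {keyword}))

print(keyword_hit("property", "Your real estate is located at ..."))  # True
print(keyword_hit("property", "No match here"))                       # False
```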

Baseline Targets

Current Baselines

Metric            Baseline  Target  Current
Overall Score     80%       90%     87.3% ✓
Tool Usage        85%       95%     93.3% ✓
Response Quality  80%       90%     86.7% ✓
Error Rate        <5%       <2%     0.0% ✓

Quality Gates

For Production Deployment:

  • Overall score ≥ 85%
  • Tool usage ≥ 90%
  • Error rate < 2%
  • All categories ≥ 70%
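
The gate check above can be sketched as a single predicate over the run's metrics (function name and argument shapes are illustrative, not the harness's actual API):

```python
# Gate thresholds from the "Quality Gates" list above.
def gates_pass(overall: float, tool_usage: float, error_rate: float,
               category_scores: dict) -> bool:
    return (overall >= 85.0
            and tool_usage >= 90.0
            and error_rate < 2.0
            and all(score >= 70.0 for score in category_scores.values()))

current = gates_pass(
    overall=87.3, tool_usage=93.3, error_rate=0.0,
    category_scores={"property_info": 100.0, "compliance": 66.7},
)
print(current)  # False -- compliance (66.7%) is below the 70% category gate
```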

Current Status: ⚠️ Borderline — all gates pass except "all categories ≥ 70%": compliance sits at 66.7%.

Recommendation: Address compliance question issue before deploy.

Performance by Question Length

Question Length       Avg Score  Count
Short (1-10 words)    95.0%      6
Medium (11-20 words)  85.0%      7
Long (>20 words)      75.0%      2

Insight: Performance degrades with question length — long questions score 20 points below short ones.

Performance by Tool Combination

Tool Combination  Avg Score  Count
Database only     90.5%      10
Documents only    95.0%      2
Hybrid (both)     73.3%      1
List only         100.0%     2

Insight: Hybrid queries are most challenging, need more test coverage.

Comparison to Benchmarks

Industry Baselines

Typical RAG system performance (from literature):

  • Tool selection accuracy: 70-85%
  • Response quality: 60-75%
  • Overall user satisfaction: 65-80%

Poolula Platform:

  • Tool selection: 93.3% (above benchmark ✓)
  • Response quality: 86.7% (above benchmark ✓)
  • Overall score: 87.3% (above benchmark ✓)

Competitive Positioning

Simple Q&A bots: 60-70% accuracy

Enterprise RAG systems: 75-85% accuracy

Poolula Platform: 87.3% accuracy (competitive with enterprise solutions)

Next Steps

Short Term (This Week)

  1. Fix compliance questions

     • Update system prompt with document search examples
     • Add compliance-specific keywords to tool selection logic

  2. Improve monthly aggregations

     • Enhance aggregate function documentation
     • Add month grouping examples

  3. Add more hybrid questions

     • Current coverage: 1 question
     • Target: 3-5 questions

Medium Term (This Month)

  1. Implement retrieval metrics

     • Track precision/recall for document search
     • Measure source relevance

  2. Add latency tracking

     • Record response time per question
     • Identify slow queries

  3. User feedback integration

     • Add thumbs up/down to responses
     • Track user satisfaction

Long Term (Next Quarter)

  1. LLM-as-judge scoring

     • Use GPT-4 to evaluate response quality
     • More nuanced than keyword matching

  2. A/B testing framework

     • Test prompt variations
     • Compare tool configurations

  3. Continuous evaluation

     • Run eval suite nightly
     • Track trends automatically
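
Nightly trend tracking can be as simple as appending each run's overall score to a history file and reporting the delta since the previous run (the history format below is hypothetical; values mirror the Historical Performance table above):

```python
# Sketch of nightly trend tracking; the on-disk history format is hypothetical.
history = [
    {"date": "2024-11-13", "overall": 82.7},
    {"date": "2024-11-14", "overall": 85.1},
    {"date": "2024-11-15", "overall": 87.3},
]

def latest_delta(runs: list) -> float:
    """Change in overall score between the two most recent runs."""
    if len(runs) < 2:
        return 0.0
    return round(runs[-1]["overall"] - runs[-2]["overall"], 1)

print(latest_delta(history))  # 2.2
```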