Evaluation Harness: Executive Walkthrough

This document gives business leaders a clear, intuitive understanding of how the evaluation harness ensures the Poolula chatbot stays accurate, grounded, and trustworthy. Think of it as the quality-control lane at the end of a production line—except instead of checking physical products, it checks AI answers.


1. What the Chatbot Is Meant to Do

The chatbot acts as a financial and compliance analyst for Poolula LLC. It answers questions about:

  • Property details
  • Rental income, expenses, obligations, and basis
  • Business documents such as insurance policies, filings, and the operating agreement

It draws from two core information sources:

  • Structured Data (database tables: properties, transactions, obligations)
  • Unstructured Documents (PDFs: leases, policies, agreements)

It uses Retrieval-Augmented Generation (RAG)—meaning:
1) retrieve facts, then
2) draft an answer with citations.

graph LR
  A[Business question] --> B{Pick search route?}
  B -->|Financial fact| C[Query database of properties, transactions, obligations]
  B -->|Document detail| D["Search documents (policies, agreements)"]
  C --> E[Combine findings]
  D --> E
  E --> F[Summarize answer with sources]
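The routing decision in the diagram above can be sketched in a few lines. This is an illustrative sketch only: the keyword cues and the `pick_route` helper are assumptions, not the actual implementation; only the two tool names (`query_database`, `search_document_content`) come from this document.

```python
# Illustrative sketch of the "pick search route" step.
# The cue list and function name are hypothetical, not the real system's logic.
def pick_route(question: str) -> str:
    """Send document-flavored questions to document search, everything else to the database."""
    document_cues = ("policy", "agreement", "lease", "filing")
    if any(cue in question.lower() for cue in document_cues):
        return "search_document_content"
    return "query_database"
```

A financial question like "What was total rental income in 2024?" would route to `query_database`, while "What does the insurance policy cover?" would route to `search_document_content`.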

2. What an Evaluation Harness Is

In plain English:

A test-drive loop that runs a fixed set of important business questions through the chatbot and scores how well it answers.

In business analogy terms:

A mystery shopper for the AI. It asks the same questions every week, checks which “shelves” (data sources) the bot uses, and records a scorecard.

It serves as:

  • Scoreboard: tracks accuracy trends
  • Referee: judges routing decisions, factual alignment, and completeness

flowchart LR
    A[Business Question] --> B[RAG: search data + docs]
    B --> C[Draft Answer]
    C --> D{Evaluation Harness}
    D -->|Scores| E[Scorecard / Report]
    D -->|Findings| F[Fixes: data, prompts, tooling]

3. How the Evaluation Loop Works

sequenceDiagram
    participant Exec as Business Lead
    participant Harness as Evaluation Harness
    participant Bot as Chatbot (RAG)
    participant Data as Data & Docs
    Exec->>Harness: Provide gold questions (prioritized business needs)
    loop Each question
        Harness->>Bot: Ask question
        Bot->>Data: Retrieve facts
        Data-->>Bot: Evidence
        Bot-->>Harness: Answer + sources used
        Harness-->>Harness: Score tools, keywords, errors
    end
    Harness-->>Exec: Scoreboard & weak spots
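The loop in the sequence diagram reduces to a simple pattern: ask every gold question, score the reply, flag the weak spots. The sketch below assumes hypothetical `ask_bot` and `score_answer` callables and a 0.7 weak-spot threshold; none of these names or values are from the real harness.

```python
# Minimal sketch of the evaluation loop: run gold questions, collect a scoreboard.
# `ask_bot`, `score_answer`, and the 0.7 threshold are illustrative assumptions.
def run_harness(gold_questions, ask_bot, score_answer):
    scoreboard = []
    for item in gold_questions:
        reply = ask_bot(item["question"])      # answer + sources used
        score = score_answer(reply, item)      # judge tools, keywords, errors
        scoreboard.append({"question": item["question"], "score": score})
    weak_spots = [r for r in scoreboard if r["score"] < 0.7]
    return scoreboard, weak_spots
```

Because the question set is fixed, the same loop run week over week yields comparable scoreboards, which is what makes regression detection possible.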

4. What We Measure Today (The Three Scoring Lenses)

pie showData
    title Score Weighting
    "Tool Choice" : 40
    "Content Match" : 40
    "Completeness" : 20

Tool Choice (40%)

Did the bot choose the correct retrieval method?
(e.g., query_database vs. search_document_content)

Content Match (40%)

Did the answer contain the expected concepts, terminology, or facts?

Completeness (20%)

Did the bot return a non-error, reasonably shaped response?
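Combining the three lenses with the 40/40/20 weights above gives a single composite score per question. The sketch below assumes content match is measured as the fraction of expected keywords found; the function name and inputs are illustrative.

```python
# Sketch of the three-lens composite score using the 40/40/20 weighting above.
# Treating content match as a keyword-hit ratio is an assumption for illustration.
def composite_score(tool_correct: bool, keywords_found: int,
                    keywords_expected: int, completed: bool) -> float:
    tool = 1.0 if tool_correct else 0.0
    content = keywords_found / keywords_expected if keywords_expected else 0.0
    completeness = 1.0 if completed else 0.0
    return 0.4 * tool + 0.4 * content + 0.2 * completeness
```

For example, a response that picked the right tool, hit half the expected keywords, and returned cleanly scores 0.4 + 0.2 + 0.2 = 0.8.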


5. Current Evaluation Set (Coverage Overview)

We currently have 15 questions covering:

  • Property info
  • Basis and depreciation
  • Revenue & expenses
  • Obligations and transactions
  • Governance & insurance documents
  • Hybrid queries

gantt
    title Current Evaluation Coverage (15 questions)
    dateFormat  YYYY-MM-DD
    section Database
    Property Info        :done,    des1, 2025-01-01, 1d
    Financial Basis      :active,  des2, 2025-01-01, 1d
    Transactions/Expenses:active,  des3, 2025-01-01, 1d
    Compliance           :active,  des4, 2025-01-01, 1d
    section Documents
    Insurance & Governance:active, des5, 2025-01-01, 1d
    Formation Docs       :active,  des6, 2025-01-01, 1d
    section Hybrid
    Mixed Queries        :crit,    des7, 2025-01-01, 1d

Interpretation: Coverage is thin. Questions are single-turn, with limited nuance and no seasonal or multi-property scenarios.


6. Strengths of the Current System

  • Fast regression detection
  • Aligned to core business workflows
  • Good at catching routing mistakes
  • Repeatable & lightweight

7. Gaps, Risks, and Where This Can Mislead Leadership

A. Small Question Set → Blind Spots

Many realistic scenarios remain untested (e.g., seasonal trends, partial-year depreciation, insurance renewals).

B. Keyword-Based Scoring Is Shallow

Keyword matching allows partially correct or numerically wrong answers to pass.

C. Tool-Use Detection Is Inferred

A missing citation can mis-score an otherwise correct answer.

D. Retrieval Quality Is Not Verified

We don’t check whether the cited paragraphs actually support the response.

E. Missing Edge Cases

No tests for missing documents, multi-step reasoning, or numerical reconciliation.

F. No Trend Tracking

Leadership cannot yet see improvement or degradation over time at a glance.


8. Straightforward Recommendations

1. Expand the question set to 50–75 prompts

Include multi-step, seasonal, edge-case, and negative tests.

2. Add reference (“gold”) answers

Including acceptable numeric ranges & exact phrases.
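A gold-answer check along these lines could pair a required phrase with an acceptable numeric range, so a numerically wrong answer fails even if the wording matches. The function name, regex, and example values below are all hypothetical.

```python
# Sketch of a gold-answer check: require a key phrase AND a number within a range.
# Everything here (name, regex, thresholds) is illustrative, not the real harness.
import re

def check_gold(answer_text: str, required_phrase: str, low: float, high: float) -> bool:
    """Pass only if the phrase appears and some number in the answer falls in [low, high]."""
    if required_phrase.lower() not in answer_text.lower():
        return False
    # Extract numbers, tolerating thousands separators like "9,818"
    numbers = [float(n.replace(",", "")) for n in re.findall(r"\d[\d,]*\.?\d*", answer_text)]
    return any(low <= n <= high for n in numbers)
```

This is what closes the gap noted in section 7B: an answer that uses the right words but the wrong number no longer passes.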

3. Validate retrieval results

Verify that returned IDs and document passages match expectations.

4. Log actual tool calls

Replace heuristic inference with explicit instrumentation.

5. Track performance across versions/releases

Use dashboards or weekly snapshots.

6. Add a business-owner review lane

For borderline answers (40–70%), collect a human “trust/not trust” decision.
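The review lane amounts to a three-way triage on the composite score: clear passes and clear failures are handled automatically, and the 40-70% band goes to a human owner. The thresholds below mirror the ones named in this recommendation; the function itself is an illustrative sketch.

```python
# Sketch of the review-lane triage from recommendation 6.
# Thresholds 0.7 / 0.4 mirror the 40-70% borderline band named above.
def triage(results):
    """Split scored answers into automatic pass, human review, and fail buckets."""
    buckets = {"pass": [], "review": [], "fail": []}
    for r in results:
        if r["score"] >= 0.7:
            buckets["pass"].append(r)
        elif r["score"] >= 0.4:
            buckets["review"].append(r)   # borderline: collect trust/not-trust decision
        else:
            buckets["fail"].append(r)
    return buckets
```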


9. Leadership Summary (Shareable)

“Every week, we run a fixed set of business questions through the chatbot—like a secret shopper. The harness checks whether it searched the right place, cited the right facts, and avoided errors. Scores roll up into a dashboard. If accuracy drifts, we know before customers ever see a problem.”