# Evaluation Harness: Executive Walkthrough
This document gives business leaders a clear, intuitive understanding of how the evaluation harness ensures the Poolula chatbot stays accurate, grounded, and trustworthy. Think of it as the quality-control lane at the end of a production line—except instead of checking physical products, it checks AI answers.
## 1. What the Chatbot Is Meant to Do
The chatbot acts as a financial and compliance analyst for Poolula LLC. It answers questions about:
- Property details
- Rental income, expenses, obligations, and basis
- Business documents such as insurance policies, filings, and the operating agreement
It draws from two core information sources:
- Structured Data (database tables: properties, transactions, obligations)
- Unstructured Documents (PDFs: leases, policies, agreements)
It uses Retrieval-Augmented Generation (RAG)—meaning:
1) retrieve facts, then
2) draft an answer with citations.
```mermaid
graph LR
    A[Business question] --> B{Pick search route?}
    B -->|Financial fact| C["Query database of properties, transactions, obligations"]
    B -->|Document detail| D["Search documents (policies, agreements)"]
    C --> E[Combine findings]
    D --> E
    E --> F[Summarize answer with sources]
```
## 2. What an Evaluation Harness Is
In plain English:
A test-drive loop that runs a fixed set of important business questions through the chatbot and scores how well it answers.
In business analogy terms:
A mystery shopper for the AI. It asks the same questions every week, checks which “shelves” (data sources) the bot uses, and records a scorecard.
It serves as:
- Scoreboard: tracks accuracy trends
- Referee: judges routing decisions, factual alignment, and completeness
```mermaid
flowchart LR
    A[Business Question] --> B["RAG: search data + docs"]
    B --> C[Draft Answer]
    C --> D{Evaluation Harness}
    D -->|Scores| E[Scorecard / Report]
    D -->|Findings| F["Fixes: data, prompts, tooling"]
```
## 3. How the Evaluation Loop Works
```mermaid
sequenceDiagram
    participant Exec as Business Lead
    participant Harness as Evaluation Harness
    participant Bot as Chatbot (RAG)
    participant Data as Data & Docs
    Exec->>Harness: Provide gold questions (prioritized business needs)
    loop Each question
        Harness->>Bot: Ask question
        Bot->>Data: Retrieve facts
        Data-->>Bot: Evidence
        Bot-->>Harness: Answer + sources used
        Harness-->>Harness: Score tools, keywords, errors
    end
    Harness-->>Exec: Scoreboard & weak spots
```
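The loop above can be sketched in a few lines of Python. All names here (`ask_chatbot`, the gold-question fields) are illustrative assumptions, not the harness's actual code:

```python
# Minimal sketch of the evaluation loop. All names here (ask_chatbot, the
# gold-question fields) are illustrative assumptions, not the harness's code.

def run_harness(gold_questions, ask_chatbot):
    """Run every gold question through the bot and collect one scorecard row each."""
    scorecard = []
    for q in gold_questions:
        answer, sources = ask_chatbot(q["question"])
        scorecard.append({
            "question": q["question"],
            # Routing check: did the bot use the tool we expected?
            "tool_ok": q["expected_tool"] in sources,
            # Content check: how many expected keywords appear in the answer?
            "keywords_hit": sum(k.lower() in answer.lower()
                                for k in q["expected_keywords"]),
            # Completeness check: a non-empty, non-error response.
            "non_error": bool(answer) and "error" not in answer.lower(),
        })
    return scorecard

# Usage with a stubbed chatbot standing in for the real RAG pipeline:
def fake_bot(question):
    return "Net rental income was $42,000 (via query_database).", ["query_database"]

gold = [{"question": "What was 2024 rental income?",
         "expected_tool": "query_database",
         "expected_keywords": ["rental income", "$42,000"]}]
print(run_harness(gold, fake_bot))
```

Each scorecard row feeds the weekly report; nothing about the loop requires more machinery than this.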
## 4. What We Measure Today (The Three Scoring Lenses)
```mermaid
pie showData
    title Score Weighting
    "Tool Choice" : 40
    "Content Match" : 40
    "Completeness" : 20
```
### Tool Choice (40%)
Did the bot choose the correct retrieval method (e.g., `query_database` vs. `search_document_content`)?

### Content Match (40%)
Did the answer contain the expected concepts, terminology, or facts?

### Completeness (20%)
Did the bot return a non-error, reasonably shaped response?
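Assuming each lens is normalized to a 0–1 score, the weighting above rolls up as a simple weighted sum. This is a sketch; the harness's actual formula may differ:

```python
# Roll the three lens scores into one 0-100 score, using the weights stated
# above (tool choice 40%, content match 40%, completeness 20%). Illustrative.

WEIGHTS = {"tool": 0.4, "content": 0.4, "completeness": 0.2}

def overall_score(tool_ok, keywords_hit, keywords_expected, non_error):
    """Combine routing, content, and completeness into a single percentage."""
    # Content lens: fraction of expected keywords found in the answer.
    content = keywords_hit / keywords_expected if keywords_expected else 0.0
    return round(100 * (WEIGHTS["tool"] * tool_ok
                        + WEIGHTS["content"] * content
                        + WEIGHTS["completeness"] * non_error), 1)

print(overall_score(True, 3, 4, True))   # 40 + 30 + 20 = 90.0
```

A question that routes correctly but misses every keyword still earns at most 60, which keeps the content lens from being drowned out by routing.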
## 5. Current Evaluation Set (Coverage Overview)
We currently have 15 questions covering:
- Property info
- Basis and depreciation
- Revenue & expenses
- Obligations and transactions
- Governance & insurance documents
- Hybrid queries
```mermaid
gantt
    title Current Evaluation Coverage (15 questions)
    dateFormat YYYY-MM-DD
    section Database
    Property Info          :done, des1, 2025-01-01, 1d
    Financial Basis        :active, des2, 2025-01-01, 1d
    Transactions/Expenses  :active, des3, 2025-01-01, 1d
    Compliance             :active, des4, 2025-01-01, 1d
    section Documents
    Insurance & Governance :active, des5, 2025-01-01, 1d
    Formation Docs         :active, des6, 2025-01-01, 1d
    section Hybrid
    Mixed Queries          :crit, des7, 2025-01-01, 1d
```
Interpretation: Coverage is thin—single-turn, limited nuance, no seasonal or multi-property scenarios.
## 6. Strengths of the Current System
- Fast regression detection
- Aligned to core business workflows
- Good at catching routing mistakes
- Repeatable & lightweight
## 7. Gaps, Risks, and Where This Can Mislead Leadership

### A. Small Question Set → Blind Spots
Many realistic scenarios remain untested (e.g., seasonal trends, partial-year depreciation, insurance renewals).

### B. Keyword-Based Scoring Is Shallow
Keyword matching lets partially correct, or even numerically wrong, answers pass.

### C. Tool-Use Detection Is Inferred
Tool use is inferred from citations in the answer rather than logged directly, so a missing citation can mis-score an otherwise correct answer.

### D. Retrieval Quality Is Not Verified
We don’t check whether the cited paragraphs actually support the response.

### E. Missing Edge Cases
No tests for missing documents, multi-step reasoning, or numerical reconciliation.

### F. No Trend Tracking
Executives cannot yet see improvement or degradation over time.
## 8. Straightforward Recommendations
### 1. Expand the question set to 50–75 prompts
Include multi-step, seasonal, edge-case, and negative tests.
### 2. Add reference (“gold”) answers
Include acceptable numeric ranges and exact phrases.
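One lightweight way to encode an acceptable numeric range is a regex check over dollar amounts in the answer. A sketch, with a hypothetical helper name:

```python
# Sketch of a gold-answer check with an acceptable numeric range.
# check_numeric is a hypothetical helper, not part of the current harness.
import re

def check_numeric(answer, low, high):
    """Pass if any dollar amount in the answer falls within [low, high]."""
    amounts = [float(m.replace(",", ""))
               for m in re.findall(r"\$([\d,]+(?:\.\d+)?)", answer)]
    return any(low <= a <= high for a in amounts)

print(check_numeric("Depreciation was $12,450 this year.", 12000, 13000))  # True
```

A range check like this catches the "numerically wrong but keyword-correct" failures that pure keyword matching lets through.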
### 3. Validate retrieval results
Verify that returned IDs and document passages match expectations.
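A first-pass support check can be as simple as confirming the claim's key terms appear in the cited passage. This is crude by design; entailment models or exact ID matching would be stronger. Names are illustrative:

```python
# Crude "does the citation support the claim" check: every key term from the
# claim must appear in the cited passage. Illustrative helper, not harness code.

def passage_supports(claim_terms, passage):
    """True if all key terms from the claim appear in the cited passage."""
    return all(t.lower() in passage.lower() for t in claim_terms)

print(passage_supports(
    ["deductible", "$5,000"],
    "The policy carries a $5,000 deductible per occurrence."))  # True
```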
### 4. Log actual tool calls
Replace heuristic inference with explicit instrumentation.
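Explicit instrumentation can be as light as a decorator that appends every tool invocation to a log, so the harness reads real calls instead of guessing from citations. All names here are illustrative:

```python
# Illustrative instrumentation: record each tool call explicitly instead of
# inferring tool use from citations. Names are hypothetical, not harness code.
import functools

TOOL_LOG = []  # the harness inspects this after each question

def logged_tool(fn):
    """Wrap a tool so every call is recorded explicitly."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        TOOL_LOG.append(fn.__name__)   # record the call before running it
        return fn(*args, **kwargs)
    return wrapper

@logged_tool
def query_database(sql):
    # Stubbed result standing in for a real database query.
    return [("123 Main St", 42000)]

query_database("SELECT address, income FROM properties")
print(TOOL_LOG)   # ['query_database']
```

With this in place, the "Tool Choice" lens scores against ground truth rather than inference.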
### 5. Track performance across versions/releases
Use dashboards or weekly snapshots.
### 6. Add a business-owner review lane
For borderline answers (40–70%), collect a human “trust/not trust” decision.
## 9. Leadership Summary (Shareable)
“Every week, we run a fixed set of business questions through the chatbot—like a secret shopper. The harness checks whether it searched the right place, cited the right facts, and avoided errors. Scores roll up into a dashboard. If accuracy drifts, we know before customers ever see a problem.”