Question Design¶
The golden question set is carefully designed to test core chatbot capabilities across different query types and data sources.
Question Selection Criteria¶
1. Representative Coverage¶
Questions cover the main use cases from four user personas:
- New LLC Owner (compliance, getting started)
- Bookkeeper (financial records, accounting)
- Property Manager (operations, maintenance)
- Tax Preparer (basis, depreciation, deductions)
2. Tool Diversity¶
Questions test different tool combinations:
- Database-only queries
- Document-only queries
- Hybrid queries (database + documents)
- List/discovery queries
3. Realistic Scenarios¶
All questions reflect actual business needs:
- Compliance deadlines
- Financial reporting
- Property information
- Document searches
Golden Question Set¶
Location: data/poolula_eval_set.jsonl
Total Questions: 15
Category Breakdown¶
| Category | Count | Description |
|---|---|---|
| property_info | 3 | Basic property information |
| property_financials | 3 | Property basis and depreciation |
| transactions | 3 | Transaction queries and filtering |
| documents | 2 | Document search and discovery |
| formation | 1 | LLC formation details |
| aggregations | 1 | Financial aggregations |
| compliance | 1 | Compliance requirements |
| hybrid | 1 | Multi-source queries |
Question Examples¶
Property Information¶
Question: "What is our property address?"
Category: property_info
Expected Tools:
query_database
Expected Keywords:
- "900"
- "9th"
- "Montrose"
- "CO"
Why this matters: Tests basic database retrieval of property information.
Property Financials¶
Question: "What is our property's total depreciable basis?"
Category: property_financials
Expected Tools:
query_database
Expected Keywords:
- "basis"
- "depreciation"
- "442300"
Why this matters: Tests ability to query financial data and perform calculations.
Transactions¶
Question: "What was my rental income in August 2024?"
Category: transactions
Expected Tools:
query_database
Expected Keywords:
- "August"
- "2024"
- "rental"
- "income"
Why this matters: Tests date-based filtering and category queries.
Documents¶
Question: "What documents are in our knowledge base?"
Category: documents
Expected Tools:
list_business_documents
Expected Keywords:
- "documents"
- "Articles"
- "Operating Agreement"
Why this matters: Tests document discovery functionality.
Hybrid Queries¶
Question: "What is the business purpose stated in our LLC formation documents?"
Category: hybrid
Expected Tools:
search_document_content
query_database (optional)
Expected Keywords:
- "business purpose"
- "rental"
- "property"
Why this matters: Tests ability to search documents for specific information.
JSONL Format¶
Each question is a single-line JSON object:
{
"question": "What is our property address?",
"category": "property_info",
"expected_tools": ["query_database"],
"expected_keywords": ["900", "9th", "Montrose", "CO"],
"context": "Tests basic property information retrieval from database"
}
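Because the format is one JSON object per line, the file can be loaded with nothing but the standard library. A minimal sketch (the path is the one given above; the function name is illustrative, not part of the shipped tooling):

```python
import json

def load_eval_set(path):
    """Load a JSONL eval set; each non-blank line is one question object."""
    questions = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # tolerate blank lines between entries
                questions.append(json.loads(line))
    return questions
```

Note that `json.load` on the whole file would fail here: JSONL is a sequence of independent JSON documents, not one array.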
Field Descriptions¶
| Field | Required | Description | Example |
|---|---|---|---|
| question | Yes | The user query | "What is our EIN?" |
| category | Yes | Question category | "formation" |
| expected_tools | Yes | Tools AI should use | ["query_database"] |
| expected_keywords | Yes | Keywords in response | ["83", "4567890"] |
| context | No | Why question matters | "Tests EIN retrieval" |
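The required/optional split in the table can be enforced before a question enters the set. This is a sketch of such a validator, not part of the existing scripts:

```python
# Required fields and the JSON type each must carry.
REQUIRED_FIELDS = {
    "question": str,
    "category": str,
    "expected_tools": list,
    "expected_keywords": list,
}

def validate_question(obj):
    """Return a list of problems; an empty list means the entry is well-formed."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in obj:
            problems.append(f"missing required field: {field}")
        elif not isinstance(obj[field], expected_type):
            problems.append(f"{field} should be {expected_type.__name__}")
    # 'context' is optional, but must be a string when present
    if "context" in obj and not isinstance(obj["context"], str):
        problems.append("context should be str")
    return problems
```

Running this over every line before committing catches the most common mistakes (a missing field, or a string where a list is expected).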
Question Design Principles¶
1. Specificity¶
Good: "What was my rental income in August 2024?"
- Clear time period
- Specific category
- Measurable answer
Bad: "Tell me about my finances"
- Too vague
- Multiple valid answers
- Hard to score
2. Single Concept¶
Good: "What is our property's land basis?"
- One piece of information
- Clear expected answer
- Single tool needed
Bad: "What is our property's land basis, building basis, and total depreciation schedule?"
- Multiple concepts
- Complex answer
- Hard to score components
3. Realistic Language¶
Use natural phrasing that actual users would employ:
Natural: "What's our EIN number?"
Unnatural: "Query the database for the employer identification number field"
4. Testable Outcomes¶
Ensure questions have verifiable answers:
Verifiable: "What is our property address?"
- Expected: "900 S 9th St, Montrose, CO"
- Can check for keywords: "900", "9th", "Montrose"
Not verifiable: "Is our property in a good location?"
- Subjective answer
- No clear keywords
- Opinion-based
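One way to make "verifiable" concrete is a case-insensitive keyword check. The real scoring lives in the harness (see Scoring Methodology), so treat this as an illustrative sketch only:

```python
def keyword_hits(response, expected_keywords):
    """Count how many expected keywords appear in the response (case-insensitive)."""
    text = response.lower()
    return sum(1 for kw in expected_keywords if kw.lower() in text)

# Example: the property-address question above
response = "Your property is located at 900 S 9th St, Montrose, CO."
hits = keyword_hits(response, ["900", "9th", "Montrose", "CO"])
# hits == 4 -- every keyword matched, so the answer is fully verifiable
```

The subjective question ("Is our property in a good location?") fails this test by construction: there is no keyword list that distinguishes a good answer from a bad one.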
Adding New Questions¶
Step 1: Identify Gap¶
Review current coverage and identify missing scenarios:
# Count questions by category
cat data/poolula_eval_set.jsonl | jq -r '.category' | sort | uniq -c
# Look for underrepresented categories
Step 2: Write Question¶
Follow the design principles above:
{
"question": "When is our annual LLC filing due in Colorado?",
"category": "compliance",
"expected_tools": ["search_document_content"],
"expected_keywords": ["annual", "report", "Colorado", "periodic"],
"context": "Tests ability to find compliance deadlines in documents"
}
Step 3: Validate Answer¶
Test the question manually before adding:
# Start API server
uv run uvicorn apps.api.main:app --reload --port 8082
# Ask the question via frontend
# Verify response is correct
Step 4: Add to Set¶
# Append to JSONL file
echo '{"question": "...", ...}' >> data/poolula_eval_set.jsonl
# Run evaluation to get baseline
python scripts/evaluate_chatbot.py
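Hand-writing JSON inside an `echo` string is easy to get wrong (quoting, escaping). A hedged alternative is to append via Python, so the new line is guaranteed to be valid single-line JSON (the helper name is illustrative):

```python
import json

def append_question(path, question_obj):
    """Serialize to one line and append, keeping the file valid JSONL."""
    line = json.dumps(question_obj, ensure_ascii=False)
    with open(path, "a", encoding="utf-8") as f:
        f.write(line + "\n")
```

`json.dumps` never emits raw newlines, so each appended entry stays on exactly one line.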
Question Maintenance¶
When to Update Questions¶
Add questions when:
- New features are added (new tools, data sources)
- Coverage gaps are identified
- User feedback reveals untested scenarios
Update questions when:
- Expected keywords change due to data updates
- Tool definitions change
- Question wording is unclear
Remove questions when:
- Feature is deprecated
- Question is no longer relevant
- Duplicate coverage exists
Version Control¶
Track changes to the question set:
# Commit changes with context
git add data/poolula_eval_set.jsonl
git commit -m "Add 3 compliance deadline questions
- Test annual report deadline lookup
- Test business license renewal
- Test tax filing deadlines
Increases compliance category coverage from 1 to 4 questions."
Full Question List¶
See the complete set in data/poolula_eval_set.jsonl:
# View all questions
cat data/poolula_eval_set.jsonl | jq -r '.question'
# View by category
cat data/poolula_eval_set.jsonl | jq -r 'select(.category=="property_info") | .question'
Statistical Analysis¶
Category Distribution¶
# Count by category
cat data/poolula_eval_set.jsonl | jq -r '.category' | sort | uniq -c | sort -rn
# Output:
# 3 property_financials
# 3 property_info
# 3 transactions
# 2 documents
# 1 formation
# 1 aggregations
# 1 compliance
# 1 hybrid
Tool Usage Distribution¶
# Count expected tool usage
cat data/poolula_eval_set.jsonl | jq -r '.expected_tools[]' | sort | uniq -c
# Output:
# 12 query_database
# 3 search_document_content
# 1 list_business_documents
Keyword Complexity¶
# Average keywords per question
cat data/poolula_eval_set.jsonl | jq -r '.expected_keywords | length' | \
awk '{sum+=$1} END {print "Average:", sum/NR}'
# Output:
# Average: 4.2
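If `jq` is unavailable, all three statistics can be computed in one pass with Python's `collections.Counter`. A sketch (function name is illustrative):

```python
import json
from collections import Counter

def eval_set_stats(path):
    """Return (category counts, tool counts, average keywords per question)."""
    categories, tools = Counter(), Counter()
    keyword_counts = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            q = json.loads(line)
            categories[q["category"]] += 1
            tools.update(q["expected_tools"])
            keyword_counts.append(len(q["expected_keywords"]))
    avg = sum(keyword_counts) / len(keyword_counts) if keyword_counts else 0.0
    return categories, tools, avg
```

`Counter.most_common()` then reproduces the sorted distributions shown above.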
Related Documentation¶
- Evaluation Harness - How evaluation works
- Scoring Methodology - How questions are scored
- Sample Questions - 133 additional questions for reference