Monitoring AI Quality

How to track AI quality with evaluator scores, feedback signals, and review filters.

Use both operator feedback and automated evaluator signals to monitor AI quality.

Main quality views

AI quality report API (/api/reports/ai-quality)
- acceptance rate
- edit rate
- rejection rate

Live chat quality filters
- low quality sessions
- hallucination flag
- circular response flag
- negative feedback count
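
As a rough illustration of how the three report metrics relate to each other, the sketch below derives them from raw outcome counts. It assumes that each AI draft ends in exactly one operator outcome (accepted as-is, edited, or rejected) and that each rate is that outcome's share of all drafts. The actual definitions and response shape of /api/reports/ai-quality are not specified on this page, so `QualityRates` and `compute_rates` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QualityRates:
    acceptance_rate: float
    edit_rate: float
    rejection_rate: float


def compute_rates(accepted: int, edited: int, rejected: int) -> QualityRates:
    """Return each outcome's share of all reviewed AI drafts (assumed definition)."""
    total = accepted + edited + rejected
    if total == 0:
        # No drafts were reviewed: report zeros instead of dividing by zero.
        return QualityRates(0.0, 0.0, 0.0)
    return QualityRates(accepted / total, edited / total, rejected / total)


# Example counts are made up for illustration only.
rates = compute_rates(accepted=70, edited=20, rejected=10)
print(rates)
```

Under this assumed definition the three rates always sum to 1, so a rise in one necessarily shows up as a drop in at least one of the others.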

Conversation evaluator

Both completed widget sessions and auto-replied email tickets are evaluated automatically each night with the same structured metrics:

- accuracy
- completeness
- resolution
- hallucination flag
- circular flag
- question type key

A composite quality score is stored and surfaced as a badge: on the Live Chat list for widget sessions and on the inbox ticket list for email tickets. Conversations that score below 70%, or that trip the hallucination or circular flag, are routed into the Review queue for triage.
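
The routing rule can be read as a simple predicate. The sketch below is one way to express it, assuming the composite score is stored as a fraction between 0 and 1; the `Evaluation` type and its field names are hypothetical and only mirror the metrics listed above.

```python
from dataclasses import dataclass

# Composite scores below 70% are routed to the Review queue.
REVIEW_THRESHOLD = 0.70


@dataclass(frozen=True)
class Evaluation:
    composite_score: float    # assumed scale: 0.0 to 1.0
    hallucination_flag: bool
    circular_flag: bool


def needs_review(ev: Evaluation) -> bool:
    """True when a conversation should land in the Review queue."""
    return (
        ev.composite_score < REVIEW_THRESHOLD
        or ev.hallucination_flag
        or ev.circular_flag
    )
```

Note that a flag alone is enough: under this rule, a conversation with a high composite score still goes to review if either flag is set.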

Sampling follows the workspace's evaluation mode (full / ramping / spot_check), with forced evaluation for negative-feedback messages, low-confidence auto-sends, and unseen question types.
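
A minimal sketch of how forced evaluation might interact with sampling is shown below. The mode names come from this page, but the sampling rates, the `Conversation` fields, and the `should_evaluate` function are assumptions made for illustration; the real rates are not documented here.

```python
import random
from dataclasses import dataclass

# Hypothetical sampling rates per evaluation mode (not the product's real values).
SAMPLE_RATES = {"full": 1.0, "ramping": 0.5, "spot_check": 0.1}


@dataclass(frozen=True)
class Conversation:
    has_negative_feedback: bool = False
    low_confidence_auto_send: bool = False
    question_type_seen_before: bool = True


def should_evaluate(conv: Conversation, mode: str, rng=random.random) -> bool:
    """Decide whether a conversation is evaluated in the nightly run."""
    forced = (
        conv.has_negative_feedback
        or conv.low_confidence_auto_send
        or not conv.question_type_seen_before
    )
    if forced:
        # Forced cases bypass sampling regardless of the mode.
        return True
    # Otherwise sample at the rate configured for the workspace mode.
    return rng() < SAMPLE_RATES[mode]
```

Passing `rng` as a parameter keeps the sampling decision deterministic in tests while defaulting to `random.random` in normal use.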

What to watch weekly

- rising rejection or edit rates in one category
- repeated hallucination flags for the same question type
- low-quality clusters after KB or policy changes
- high negative-feedback sessions that were not escalated
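
If evaluation results are exported for offline analysis, the second item in the list above can be checked with a small grouping step. The sketch below assumes each result is a dict with `question_type_key` and `hallucination_flag` entries; both the export format and the `min_count` cutoff are hypothetical.

```python
from collections import Counter


def repeated_hallucination_types(evaluations, min_count=2):
    """Return question types with at least `min_count` hallucination flags."""
    counts = Counter(
        ev["question_type_key"] for ev in evaluations if ev["hallucination_flag"]
    )
    return {key: n for key, n in counts.items() if n >= min_count}


# Made-up sample data for illustration only.
week = [
    {"question_type_key": "refund_policy", "hallucination_flag": True},
    {"question_type_key": "refund_policy", "hallucination_flag": True},
    {"question_type_key": "shipping_eta", "hallucination_flag": True},
    {"question_type_key": "shipping_eta", "hallucination_flag": False},
]
repeats = repeated_hallucination_types(week)
print(repeats)
```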

Closing the loop

The fastest path to fixing a bad response pattern is AI → Lessons → Review queue. Every eval-flagged conversation lands there, and a single coaching note becomes a lesson that the AI applies to similar future tickets overnight. See How the AI learns from your team.

For broader pattern fixes:

- Add or refresh KB coverage for that scenario.
- Tighten instructions for risky behavior.
- Increase the confidence threshold for that flow.
- Leave coaching notes on representative flagged tickets.
- Monitor acceptance/edit/rejection deltas over the next week.
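
To make the last step concrete, the deltas can be computed by comparing the three rates before and after a fix. This is a sketch under the assumption that the rates are available as fractions keyed by name; the key names and the `rate_deltas` helper are hypothetical.

```python
RATE_NAMES = ("acceptance_rate", "edit_rate", "rejection_rate")


def rate_deltas(before: dict, after: dict) -> dict:
    """Return after-minus-before for each rate, rounded to 4 decimal places."""
    return {name: round(after[name] - before[name], 4) for name in RATE_NAMES}


# Made-up figures for illustration only.
last_week = {"acceptance_rate": 0.70, "edit_rate": 0.20, "rejection_rate": 0.10}
this_week = {"acceptance_rate": 0.78, "edit_rate": 0.16, "rejection_rate": 0.06}
deltas = rate_deltas(last_week, this_week)
print(deltas)
```

A positive acceptance delta alongside negative edit and rejection deltas suggests the fix is working; movement in the opposite direction is a signal to revisit the change.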

Nightly aggregation jobs keep quality insights fresh, but immediate operator feedback signals are still the fastest indicator of drift.
