Conversation Scoring
AgentBrains automatically scores 100% of your AI agent conversations. Learn how the Aggregate Score, individual test breakdowns, and scoring tests help you build production-grade agents.

> Why Score Every Conversation?
> How Scoring Works: The Aggregate Score
> The Detailed Breakdown
> Choosing the Right Tests for Your Agent
> Scoring Tests Library
> Scoring Across Your Entire Lifecycle
Why Score Every Conversation?
If you're building AI agents for production, you already know that launching is only half the battle. The harder part is knowing whether your agent is actually doing its job — across hundreds or thousands of conversations you'll never have time to read.
That's the problem scoring solves.
AgentBrains automatically evaluates 100% of your agent's conversations — whether they happen with real customers or Synthetic Users during validation. Every single interaction receives a quality score, giving you a definitive answer to the question every builder eventually asks: "Is this thing actually working?"
Here's why this matters for your build:

You can't read every log.
At scale, manual QA is impossible. You might review a handful of transcripts per week, but your agent handles many more. Scoring gives you coverage you can't achieve by hand. Every conversation is evaluated against a consistent rubric — not just the ones you happen to click on.
Scores tell you where to look.
A low score is a signal. Instead of scrolling through an entire inbox hoping to spot a problem, you filter by score and immediately find the conversations that need attention. A 30% score tells you something broke. A 95% tells you to move on.
Scoring connects testing to production.
The same scoring engine runs during Synthetic QA validation and on live conversations. This means the quality bar you set during development carries through to production. If your agent passes validation at 90%, you can track whether it holds that standard with real users — or starts to drift.
Data replaces opinions.
Without scoring, agent quality is a feeling. With it, you have a number. You can track that number over time, compare it across agents, filter your analytics by it, and use it to justify decisions — whether that's rewriting a System Prompt or expanding a Knowledge Base.
How Scoring Works: The Aggregate Score

Every conversation your agent handles receives a single Aggregate Score, expressed as a percentage. This score is calculated by averaging the results of every individual test applied to that conversation. You choose which tests apply to each agent (up to five), and we handle the math.
The score is color-coded directly in the Inbox sidebar:
Green: The agent performed well. No action needed.
Yellow: Mixed results. Worth a look if you're debugging.
Red: Something went wrong. Open this conversation.
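To make the math concrete, here is a minimal sketch of the aggregation and color banding described above. The 80% and 50% cutoffs are illustrative assumptions; this page doesn't specify the exact thresholds behind each color.

```python
# Illustrative sketch of the aggregation described above. Each test is scored
# on a 1-10 scale; the aggregate is their average, expressed as a percentage.
# NOTE: the 80/50 color cutoffs are assumptions, not documented thresholds.

def aggregate_score(test_results: dict[str, int]) -> float:
    """Average the per-test scores (1-10) and express them as a percentage."""
    return sum(test_results.values()) / len(test_results) * 10

def color_band(score: float) -> str:
    if score >= 80:
        return "green"   # performed well, no action needed
    if score >= 50:
        return "yellow"  # mixed results, worth a look
    return "red"         # something went wrong, open this conversation

results = {
    "Problem Solving": 9,
    "Information Completeness": 8,
    "On Task": 7,
    "Customer Mood Change": 4,
    "Human-Free Issue Handling": 3,
}

score = aggregate_score(results)
print(f"{score:.0f}% -> {color_band(score)}")  # 62% -> yellow
```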
The Detailed Breakdown
The Aggregate Score tells you that something needs attention; the per-test breakdown tells you what. For example, a conversation with a 62% Aggregate Score (the five test results below average to 6.2 out of 10) might break down as:
Problem Solving: 9/10 — The agent identified the issue and resolved it.
Information Completeness: 8/10 — All questions were answered with minor gaps.
On Task: 7/10 — Mostly focused, with some unnecessary filler.
Customer Mood Change: 4/10 — The customer's frustration was never addressed.
Human-Free Issue Handling: 3/10 — The agent escalated prematurely.
This tells you something very specific: the agent knows the material (high Problem Solving and Completeness scores), but it's losing the customer emotionally and giving up too early. That's a behavioral issue — not a knowledge gap — and now you know exactly where to focus your next round of tuning.
Choosing the Right Tests for Your Agent
Not every agent has the same job, and not every conversation should be measured the same way. A Sales agent needs to be graded on closing deals and handling objections. A Support agent needs to be graded on solving problems and keeping customers calm. A Compliance bot needs hard pass/fail checks.
When you register an agent in AgentBrains, you select up to 5 scoring tests from our library. These tests run automatically against every conversation that agent handles — in validation and in production. Choose the tests that match the agent's role, and you'll get scores that actually mean something for your use case.
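In practice, test selection is part of the agent's configuration. The sketch below is hypothetical (the `agentbrains` client, the `register_agent` call, and the test identifiers are illustrative assumptions, not a documented API), but it shows the shape of the decision: pick up to five tests that match the agent's role.

```python
# Hypothetical sketch only: the client, function name, and test identifiers
# are illustrative assumptions, not AgentBrains' documented API.
from agentbrains import Client  # assumed SDK entry point

client = Client(api_key="YOUR_API_KEY")

support_agent = client.register_agent(
    name="Refund Support Agent",
    # Up to 5 scoring tests, chosen to match this agent's job:
    scoring_tests=[
        "on_task",
        "problem_solving",
        "information_completeness",
        "customer_mood_change",
        "human_free_issue_handling",
    ],
)
```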
Below is an overview of the tests currently available. For full documentation — including the complete 1–10 scoring scale, examples of high and low scores, and common failure patterns — visit each test's dedicated page in our Docs.
Scoring Tests Library
Test: On Task
What good looks like: The agent stays focused on the customer's actual question and delivers specific, relevant answers — not generic filler.
Test: Objection Handling
What good looks like: The agent acknowledges the concern respectfully, counters with clear reasoning or alternatives, and suggests a concrete next step to keep the conversation moving toward a decision. It doesn't ignore the objection or respond with empty "we're the best" claims.
Test: Problem Solving
What good looks like: The agent identifies the real issue quickly, provides clear and correct steps, and the conversation ends with the customer confirming the fix worked — or with high confidence that the guidance was sound.
Test: Information Completeness
What good looks like: Every question the customer raises gets a clear, detailed answer. If the agent doesn't know something, it explains what it can do next instead of guessing. Nothing is skipped or dodged.
Test: Customer Mood Change
What good looks like: The customer ends the conversation feeling the same or better than when they started. If frustration appears mid-conversation, the agent recovers it quickly. The interaction closes on a positive or calmly resolved note.
Test: Human-Free Issue Handling
What good looks like: The agent handles common requests end-to-end. It only escalates when truly necessary — and when it does, the handoff is smooth and contextual. The customer never has to demand a human because the bot stopped being useful.
Test: Making a Sale
What good looks like: The agent answers pricing and availability questions directly, confirms the customer's intent, and proactively suggests the next action. The conversation doesn't end with a vague "let me know if you need anything" — it ends with progress toward a purchase.
Scoring Across Your Entire Lifecycle
One Scoring Engine. Every Stage.
The same tests that grade your live customer conversations also run during Synthetic QA validation.
During development, you run Synthetic Users against your agent and get scored results before a single real customer ever interacts with it. If your Refund Agent scores 45% against a "Frustrated Customer" synthetic persona, you catch and fix the problem before release.
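That pre-release check fits naturally into a release gate. A hypothetical sketch, reusing the assumed `agentbrains` SDK from above along with an assumed `run_synthetic_validation` call: run your synthetic personas, then block the release if any aggregate falls below the quality bar you want production to hold.

```python
# Hypothetical release-gate sketch. The agentbrains client and the
# run_synthetic_validation call are assumptions, not a documented API.
import sys

from agentbrains import Client  # assumed SDK entry point

QUALITY_BAR = 90.0  # the standard you want production to hold

client = Client(api_key="YOUR_API_KEY")
report = client.run_synthetic_validation(
    agent="refund-agent",
    personas=["Frustrated Customer", "Happy Path Buyer"],
)

for run in report.runs:
    print(f"{run.persona}: {run.aggregate_score:.0f}%")

# Fail the pipeline before a single real customer sees a weak agent.
if any(run.aggregate_score < QUALITY_BAR for run in report.runs):
    sys.exit("Synthetic QA below quality bar: fix the agent before release.")
```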
In production, the scoring engine continues to run on every real interaction. Your Inbox becomes a filterable, score-driven command center — but the real power is what happens when those scores flow into Analytics.
Scores become your analysis layer
Every conversation score feeds directly into your Analytics dashboards, where individual data points turn into trends you can actually act on. Filter by date range, by agent, or by specific test — and within seconds you can see whether your "Problem Solving" scores have been climbing since you updated the Knowledge Base last Tuesday, or whether "Objection Handling" has been quietly sliding downward over the past two weeks.
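If you pull those scores out of Analytics, trend analysis takes only a few lines of pandas. A minimal sketch, assuming a hypothetical CSV export with a timestamp, an agent column, an aggregate column, and one column per test:

```python
# Minimal pandas trend sketch. The CSV layout (timestamp, agent, aggregate,
# plus one column per test) is an assumed export format, not a documented one.
import pandas as pd

scores = pd.read_csv("conversation_scores.csv", parse_dates=["timestamp"])

# Filter to one agent and one test, then average per week to expose the trend.
support = scores[scores["agent"] == "refund-agent"]
weekly = support.set_index("timestamp")["problem_solving"].resample("W").mean()
print(weekly)  # has Problem Solving climbed since the Knowledge Base update?
```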
Track improvement after every change
When you retrain an agent — whether that's adjusting the System Prompt, restructuring a knowledge folder, or refining a policy document — scores give you a before-and-after comparison that's immediate and concrete. You don't have to read fifty transcripts to decide if the change worked. Filter your Analytics to the relevant time window, compare the Aggregate Scores, and drill into the individual test results that matter. The data tells the story.
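Continuing the sketch above, a before-and-after comparison around a change date is a simple filter on the same assumed export:

```python
# Before/after comparison around a retraining date (same assumed export).
import pandas as pd

scores = pd.read_csv("conversation_scores.csv", parse_dates=["timestamp"])
change_date = pd.Timestamp("2024-06-04")  # hypothetical System Prompt update

before = scores.loc[scores["timestamp"] < change_date, "aggregate"].mean()
after = scores.loc[scores["timestamp"] >= change_date, "aggregate"].mean()
print(f"Aggregate before: {before:.1f}%  after: {after:.1f}%")
```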
Spot performance trends before they become problems
A single low-scoring conversation is a data point. A downward trend across your last 200 conversations is a pattern — and Analytics surfaces that pattern for you. Maybe your agent handles weekday volume well but degrades under weekend traffic. Maybe a specific test category has been slowly declining as your product catalog has grown. These are the kinds of insights you can only see when every conversation is scored consistently, and when those scores are aggregated over time in one place.
Scoring isn't a feature you check once and forget. It's the continuous signal that tells you whether your agents are getting better, holding steady, or starting to drift — and it gives you the data to prove it.

