
Prebuilt Scorers

VoltAgent provides prebuilt scorers for common evaluation scenarios. These scorers are production-ready and can be used in both offline and live evaluations.

Evaluating Tool Calls During Agent Execution

Use createToolCallAccuracyScorerCode from @voltagent/scorers to deterministically evaluate which tools an agent called and in what order, in both live and offline runs.

Built-in scorer

import { createToolCallAccuracyScorerCode } from "@voltagent/scorers";

const toolOrderScorer = createToolCallAccuracyScorerCode({
  expectedToolOrder: ["searchProducts", "checkInventory"],
  strictMode: false, // allows extra tools as long as relative order is correct
});

const singleToolScorer = createToolCallAccuracyScorerCode({
  expectedTool: "searchProducts",
  strictMode: true, // requires exactly one tool call and it must match expectedTool
});

Live eval payloads already include messages, toolCalls, and toolResults. The scorer reads toolCalls first, then falls back to extracting tool calls from the message chain.

Custom scorer with toolCalls and toolResults

import { buildScorer } from "@voltagent/core";

const toolExecutionHealthScorer = buildScorer({
  id: "tool-execution-health",
  type: "agent",
})
  .score(({ payload }) => {
    const toolCalls = payload.toolCalls ?? [];
    const toolResults = payload.toolResults ?? [];

    const calledToolNames = toolCalls
      .map((call) => call.toolName)
      .filter((name): name is string => Boolean(name));

    const failedResults = toolResults.filter((result) => result.isError === true || !!result.error);

    const completionRatio =
      toolCalls.length === 0 ? 1 : Math.min(toolResults.length / toolCalls.length, 1);
    const score = Math.max(0, completionRatio - failedResults.length * 0.25);

    return {
      score,
      metadata: {
        calledToolNames,
        toolCallCount: toolCalls.length,
        toolResultCount: toolResults.length,
        failedResultCount: failedResults.length,
        completionRatio,
      },
    };
  })
  .build();

This pattern lets you score both tool selection (toolCalls) and execution quality (toolResults) in one scorer.
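
Both scorers can also run live. Because live evaluation payloads already carry toolCalls and toolResults, they can be attached directly to an agent's eval.scorers map, using the same live-evaluation pattern shown under "Using Scorers" below. A minimal sketch; the agent name, model, and scorer keys are placeholders:

import { Agent } from "@voltagent/core";
import { openai } from "@ai-sdk/openai";

// Sketch: score every live run for tool-call order and execution health.
// The agent name, model, and scorer keys are illustrative placeholders.
const shopAgent = new Agent({
  name: "shop-agent",
  model: openai("gpt-4o-mini"),
  eval: {
    scorers: {
      toolOrder: { scorer: toolOrderScorer },
      toolHealth: { scorer: toolExecutionHealthScorer },
    },
  },
});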

Heuristic Scorers (No LLM Required)

These scorers from AutoEvals perform deterministic evaluations without requiring an LLM or API keys:

Exact Match

Checks if the output exactly matches the expected value.

import { scorers } from "@voltagent/scorers";

// Use in offline evaluation
const experiment = await voltagent.evals.runExperiment({
  dataset: {
    items: [{ input: "What is 2+2?", expected: "4" }],
  },
  runner: async ({ item }) => ({ output: "4" }),
  scorers: [scorers.exactMatch],
});

Parameters (optional):

  • ignoreCase (boolean): Case-insensitive comparison (default: false)

Score: Binary (0 or 1)
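
The optional parameter is passed with the same { scorer, params } shape used for numericDiff below. A minimal sketch of a case-insensitive comparison, assuming ignoreCase is forwarded through params:

import { scorers } from "@voltagent/scorers";

// Sketch: case-insensitive exact match. The dataset item is illustrative.
const experiment = await voltagent.evals.runExperiment({
  dataset: {
    items: [{ input: "Name a season", expected: "Winter" }],
  },
  runner: async ({ item }) => ({ output: "winter" }),
  scorers: [
    {
      scorer: scorers.exactMatch,
      params: { ignoreCase: true },
    },
  ],
});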


Levenshtein Distance

Measures string similarity using Levenshtein distance.

import { scorers } from "@voltagent/scorers";

const experiment = await voltagent.evals.runExperiment({
  dataset: {
    items: [{ input: "Spell 'algorithm'", expected: "algorithm" }],
  },
  runner: async ({ item }) => ({ output: "algoritm" }),
  scorers: [scorers.levenshtein],
});

Parameters (optional):

  • threshold (number): Minimum similarity score (0-1)

Score: Normalized similarity (0-1)


JSON Diff

Compares JSON objects for structural and value differences.

import { scorers } from "@voltagent/scorers";

const experiment = await voltagent.evals.runExperiment({
  dataset: {
    items: [
      {
        input: "Generate user object",
        expected: JSON.stringify({ name: "John", age: 30 }),
      },
    ],
  },
  runner: async ({ item }) => ({
    output: JSON.stringify({ name: "John", age: 30, extra: "field" }),
  }),
  scorers: [scorers.jsonDiff],
});

Parameters: None required (uses expected from dataset)

Score: Similarity score based on structural matching (0-1)


List Contains

Checks if output contains all expected items.

import { scorers } from "@voltagent/scorers";

const experiment = await voltagent.evals.runExperiment({
  dataset: {
    items: [
      {
        input: "List primary colors",
        expected: ["red", "blue", "yellow"],
      },
    ],
  },
  runner: async ({ item }) => ({
    output: ["red", "blue", "yellow", "green"],
  }),
  scorers: [scorers.listContains],
});

Parameters: None required (uses expected from dataset)

Score: Fraction of expected items found (0-1)


Numeric Diff

Evaluates numeric accuracy within a threshold.

import { scorers } from "@voltagent/scorers";

const experiment = await voltagent.evals.runExperiment({
  dataset: {
    items: [
      {
        input: "What is pi to 2 decimal places?",
        expected: 3.14,
      },
    ],
  },
  runner: async ({ item }) => ({ output: 3.1415 }),
  scorers: [
    {
      scorer: scorers.numericDiff,
      params: { threshold: 0.01 },
    },
  ],
});

Parameters (optional):

  • threshold (number): Maximum allowed difference

Score: Binary (1 if within threshold, 0 otherwise)


RAG Scorers (LLM Required)

These native VoltAgent scorers evaluate Retrieval-Augmented Generation systems:

Answer Correctness

Evaluates factual accuracy of answers against expected ground truth.

import { createAnswerCorrectnessScorer } from "@voltagent/scorers";
import { openai } from "@ai-sdk/openai";

const scorer = createAnswerCorrectnessScorer({
  model: openai("gpt-4o-mini"),
  buildPayload: ({ payload, params }) => ({
    input: String(payload.input),
    output: String(payload.output),
    expected: String(params.expectedAnswer),
  }),
});

Payload Fields:

  • input (string): The question
  • output (string): The answer to evaluate
  • expected (string): The ground truth answer

Options:

  • factualityWeight (number): Weight for factual accuracy (default: 1.0)

Score: F1 score based on statement classification (0-1)

Metadata:

{
  classification: {
    TP: string[]; // True positive statements
    FP: string[]; // False positive statements
    FN: string[]; // False negative statements
    f1Score: number; // F1 score
  }
}

import { createAnswerCorrectnessScorer } from "@voltagent/scorers";
import { openai } from "@ai-sdk/openai";

const scorer = createAnswerCorrectnessScorer({
  model: openai("gpt-4o-mini"),
});

const experiment = await voltagent.evals.runExperiment({
  dataset: {
    items: [
      {
        input: "What is the capital of France?",
        expected: "Paris is the capital of France.",
      },
    ],
  },
  runner: async ({ item }) => {
    const result = await agent.generateText(item.input);
    return { output: result.text };
  },
  scorers: [scorer],
});

Answer Relevancy

Evaluates how relevant an answer is to the original question.

import { createAnswerRelevancyScorer } from "@voltagent/scorers";
import { openai } from "@ai-sdk/openai";

const scorer = createAnswerRelevancyScorer({
  model: openai("gpt-4o-mini"),
  embeddingModel: "openai/text-embedding-3-small",
  strictness: 3,
  buildPayload: ({ payload, params }) => ({
    input: String(payload.input),
    output: String(payload.output),
    context: String(params.referenceContext),
  }),
});

Payload Fields:

  • input (string): The original question
  • output (string): The answer to evaluate
  • context (string): Reference context for the answer

Options:

  • strictness (number): Number of questions to generate for evaluation (default: 3)
  • embeddingExpectedMin (number): Minimum expected similarity (default: 0.7)
  • embeddingPrefix (string): Prefix for embeddings

Score: Average similarity score (0-1)

Metadata:

{
  strictness: number;
  questions: Array<{
    question: string;
    noncommittal: boolean;
  }>;
  similarity: Array<{
    question: string;
    score: number;
    rawScore: number;
    usage: number;
  }>;
  noncommittal: boolean;
}
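
A sketch of wiring this scorer into an offline run. Since the buildPayload above reads the reference context from params.referenceContext, the context is supplied via the { scorer, params } shape; the dataset item, runner, and context value are illustrative:

const experiment = await voltagent.evals.runExperiment({
  dataset: {
    items: [{ input: "What does the warranty cover?" }],
  },
  runner: async ({ item }) => {
    const result = await agent.generateText(item.input);
    return { output: result.text };
  },
  scorers: [
    {
      scorer,
      // Forwarded to buildPayload as params.referenceContext
      params: { referenceContext: "The warranty covers manufacturing defects for two years." },
    },
  ],
});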

Context Precision

Evaluates whether the provided context was useful for generating the answer.

import { createContextPrecisionScorer } from "@voltagent/scorers";
import { openai } from "@ai-sdk/openai";

const scorer = createContextPrecisionScorer({
  model: openai("gpt-4o-mini"),
  buildPayload: ({ payload }) => ({
    input: String(payload.input),
    output: String(payload.output),
    context: String(payload.context),
    expected: String(payload.expected),
  }),
});

Payload Fields:

  • input (string): The question
  • output (string): The generated answer
  • context (string): Retrieved context
  • expected (string): Expected answer

Score: Binary (1 if useful, 0 if not)

Metadata:

{
  reason: string; // Explanation for the verdict
  verdict: number; // 1 if useful, 0 if not
}
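
A sketch of an offline run for a retrieval pipeline. It assumes the retrieved context is surfaced on the scorer payload under context while expected comes from the dataset item; adjust buildPayload above to wherever your pipeline actually exposes these fields. retriever here is a hypothetical retrieval client:

const experiment = await voltagent.evals.runExperiment({
  dataset: {
    items: [
      {
        input: "What is the capital of France?",
        expected: "Paris is the capital of France.",
      },
    ],
  },
  runner: async ({ item }) => {
    const docs = await retriever.search(item.input); // hypothetical retrieval step
    const result = await agent.generateText(item.input);
    // Return the retrieved context alongside the answer so the scorer can read it;
    // how this reaches the payload depends on your setup.
    return { output: result.text, context: docs.join("\n") };
  },
  scorers: [scorer],
});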

Context Recall

Measures how well the context covers the expected answer.

import { createContextRecallScorer } from "@voltagent/scorers";
import { openai } from "@ai-sdk/openai";

const scorer = createContextRecallScorer({
  model: openai("gpt-4o-mini"),
  buildPayload: ({ payload }) => ({
    input: String(payload.input),
    expected: String(payload.expected),
    context: payload.context,
  }),
});

Payload Fields:

  • input (string): The question
  • expected (string): The ground truth answer
  • context (string | string[]): Retrieved context

Score: Percentage of statements found in context (0-1)

Metadata:

{
  classifications: Array<{
    statement: string;
    attributed: number; // 1 if found in context, 0 if not
    reason: string;
  }>;
  score: number;
}

Context Relevancy

Evaluates how relevant the retrieved context is to the question.

import { createContextRelevancyScorer } from "@voltagent/scorers";
import { openai } from "@ai-sdk/openai";

const scorer = createContextRelevancyScorer({
  model: openai("gpt-4o-mini"),
  buildPayload: ({ payload }) => ({
    input: String(payload.input),
    context: payload.context,
  }),
});

Payload Fields:

  • input (string): The question
  • context (string | string[]): Retrieved context

Score: Coverage ratio of relevant sentences (0-1)

Metadata:

{
  sentences: Array<{
    sentence: string;
    isRelevant: number;
    reason: string;
  }>;
  coverageRatio: number;
}

Task-Specific Scorers (LLM Required)

Factuality

Verifies factual accuracy against ground truth.

import { createFactualityScorer } from "@voltagent/scorers";
import { openai } from "@ai-sdk/openai";

const scorer = createFactualityScorer({
  model: openai("gpt-4o-mini"),
  buildPayload: ({ payload }) => ({
    input: String(payload.input),
    output: String(payload.output),
    expected: String(payload.expected),
  }),
});

Payload Fields:

  • input (string): The input/question
  • output (string): Generated response
  • expected (string): Expected factual answer

Score: Binary (0 or 1) based on factual accuracy

Metadata:

{
  rationale: string; // Explanation of the verdict
}

import { createFactualityScorer } from "@voltagent/scorers";
import { openai } from "@ai-sdk/openai";

const scorer = createFactualityScorer({
  model: openai("gpt-4o-mini"),
});

const experiment = await voltagent.evals.runExperiment({
  dataset: {
    items: [
      {
        input: "When was the Eiffel Tower built?",
        expected: "1889",
      },
    ],
  },
  runner: async ({ item }) => {
    const result = await agent.generateText(item.input);
    return { output: result.text };
  },
  scorers: [scorer],
});

Summary

Evaluates the quality of generated summaries.

import { createSummaryScorer } from "@voltagent/scorers";
import { openai } from "@ai-sdk/openai";

const scorer = createSummaryScorer({
  model: openai("gpt-4o-mini"),
  buildPayload: ({ payload }) => ({
    input: String(payload.content),
    output: String(payload.summary),
  }),
});

Payload Fields:

  • input (string): Original content to summarize
  • output (string): Generated summary

Score: Quality score (0-1)

Metadata:

{
  coherence: number; // 0-5 rating
  consistency: number; // 0-5 rating
  fluency: number; // 0-5 rating
  relevance: number; // 0-5 rating
  rationale: string; // Detailed explanation
}
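
If the original content is the experiment input and the model's summary is its output, the payload can instead be built from the standard input/output fields. A minimal sketch of that variant in an offline run; the dataset item and prompt are illustrative:

const summaryScorer = createSummaryScorer({
  model: openai("gpt-4o-mini"),
  buildPayload: ({ payload }) => ({
    input: String(payload.input), // original content
    output: String(payload.output), // generated summary
  }),
});

const experiment = await voltagent.evals.runExperiment({
  dataset: {
    items: [{ input: "Long article text to summarize..." }],
  },
  runner: async ({ item }) => {
    const result = await agent.generateText(`Summarize in two sentences: ${item.input}`);
    return { output: result.text };
  },
  scorers: [summaryScorer],
});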

Translation

Evaluates translation quality and accuracy.

import { createTranslationScorer } from "@voltagent/scorers";
import { openai } from "@ai-sdk/openai";

const scorer = createTranslationScorer({
  model: openai("gpt-4o-mini"),
  buildPayload: ({ payload }) => ({
    input: String(payload.source),
    output: String(payload.translation),
    expected: String(payload.reference),
  }),
});

Payload Fields:

  • input (string): Source text
  • output (string): Generated translation
  • expected (string): Reference translation

Score: Translation quality (0-1)

Metadata:

{
  accuracy: number; // Semantic accuracy (0-5)
  fluency: number; // Language fluency (0-5)
  consistency: number; // Term consistency (0-5)
  rationale: string; // Detailed feedback
}

Humor

Evaluates if a response is appropriately humorous.

import { createHumorScorer } from "@voltagent/scorers";
import { openai } from "@ai-sdk/openai";

const scorer = createHumorScorer({
  model: openai("gpt-4o-mini"),
  buildPayload: ({ payload }) => ({
    output: String(payload.response),
  }),
});

Payload Fields:

  • output (string): Response to evaluate

Score: Binary (0 or 1) - 1 if humorous, 0 if not

Metadata:

{
  rationale: string; // Explanation of humor assessment
}

Possible

Tests if a task or scenario is possible/feasible.

import { createPossibleScorer } from "@voltagent/scorers";
import { openai } from "@ai-sdk/openai";

const scorer = createPossibleScorer({
  model: openai("gpt-4o-mini"),
  buildPayload: ({ payload }) => ({
    input: String(payload.task),
    output: String(payload.response),
  }),
});

Payload Fields:

  • input (string): Task or scenario description
  • output (string): Assessment response

Score: Binary (0 or 1) - 1 if possible, 0 if not

Metadata:

{
  rationale: string; // Reasoning about possibility
}

Moderation

Checks content for safety and appropriateness.

import { createModerationScorer } from "@voltagent/scorers";
import { openai } from "@ai-sdk/openai";

const scorer = createModerationScorer({
  model: openai("gpt-4o-mini"),
  threshold: 0.5,
  categories: ["hate", "harassment", "violence", "sexual", "self-harm"],
  buildPayload: ({ payload }) => ({
    output: String(payload.content),
  }),
});

Payload Fields:

  • output (string): Content to moderate

Options:

  • threshold (number): Threshold for flagging content (default: 0.5)
  • categories (string[]): Categories to check

Score: Binary (0 or 1) - 1 if safe, 0 if problematic

Metadata:

{
  categories: {
    hate: boolean;
    violence: boolean;
    sexual: boolean;
    selfHarm: boolean;
    harassment: boolean;
  };
  rationale: string; // Explanation of moderation decision
}

import { createModerationScorer } from "@voltagent/scorers";
import { openai } from "@ai-sdk/openai";

const scorer = createModerationScorer({
  model: openai("gpt-4o-mini"),
  threshold: 0.5,
});

const experiment = await voltagent.evals.runExperiment({
  dataset: {
    items: [{ input: "User generated content to check..." }],
  },
  runner: async ({ item }) => {
    return { output: item.input };
  },
  scorers: [scorer],
});

Using Scorers

In Offline Evaluations

import { createAnswerCorrectnessScorer, scorers } from "@voltagent/scorers";
import { openai } from "@ai-sdk/openai";

const experiment = await voltagent.evals.runExperiment({
  dataset: { name: "my-test-dataset" },
  runner: myAgent,
  scorers: [
    // Heuristic scorer (gets expected from dataset)
    scorers.exactMatch,
    // LLM-based scorer
    createAnswerCorrectnessScorer({
      model: openai("gpt-4o-mini"),
    }),
  ],
});

const results = await experiment.results();

In Live Evaluations

import { Agent } from "@voltagent/core";
import { scorers, createAnswerCorrectnessScorer } from "@voltagent/scorers";
import { openai } from "@ai-sdk/openai";

const agent = new Agent({
  name: "production-agent",
  model: openai("gpt-4o"),
  eval: {
    scorers: {
      // Heuristic scorer
      exact: {
        scorer: scorers.exactMatch,
        params: { expected: "expected value" },
      },
      // LLM-based scorer
      correctness: {
        scorer: createAnswerCorrectnessScorer({
          model: openai("gpt-4o-mini"),
        }),
      },
    },
  },
});

Custom Payload Mapping

All scorers support custom payload mapping:

const scorer = createAnswerCorrectnessScorer({
  model: openai("gpt-4o-mini"),
  buildPayload: ({ payload, params }) => ({
    input: payload.question,
    output: payload.answer,
    expected: params.groundTruth,
  }),
});
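
The params argument is populated from wherever the scorer is registered. A minimal sketch using the offline { scorer, params } shape; the dataset, runner, and groundTruth value are placeholders:

const experiment = await voltagent.evals.runExperiment({
  dataset: { name: "qa-dataset" },
  runner: myAgent,
  scorers: [
    {
      scorer, // the scorer defined above with the custom buildPayload
      // Forwarded to buildPayload as params.groundTruth
      params: { groundTruth: "Paris is the capital of France." },
    },
  ],
});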

Combining Scorer Types

Mix heuristic and LLM-based scorers for comprehensive evaluation:

import {
  scorers,
  createAnswerCorrectnessScorer,
  createAnswerRelevancyScorer,
} from "@voltagent/scorers";
import { openai } from "@ai-sdk/openai";

const allScorers = [
  // Heuristic scorers (no LLM, use expected from dataset)
  scorers.levenshtein,
  {
    scorer: scorers.numericDiff,
    params: { threshold: 0.1 }, // Only threshold param needed
  },
  // LLM-based scorers
  createAnswerCorrectnessScorer({
    model: openai("gpt-4o-mini"),
  }),
  createAnswerRelevancyScorer({
    model: openai("gpt-4o-mini"),
    embeddingModel: "openai/text-embedding-3-small",
  }),
];

const experiment = await voltagent.evals.runExperiment({
  dataset: { name: "qa-dataset" },
  runner: ragPipeline,
  scorers: allScorers,
});

Choosing the Right Scorer

Use Heuristic Scorers When:

  • You need deterministic, reproducible results
  • You want fast evaluation without API costs
  • You're comparing exact values or simple patterns
  • You don't have access to LLM APIs

Use LLM-Based Scorers When:

  • You need semantic understanding
  • You're evaluating natural language quality
  • You want nuanced judgment of correctness
  • You need to evaluate subjective qualities

Performance Considerations:

  • Heuristic scorers: Fast, no API calls, deterministic
  • LLM-based scorers: Slower, require API calls, may vary slightly between runs
