The pytest for AI agents. Run your agent 100 times, get confidence intervals instead of anecdotes.
Your agent passes Monday, fails Wednesday. Same prompt, same model. LLMs show up to 72% variance across runs even at temperature=0.
agentrial runs your agent N times and gives you statistics, not luck.
pip install agentrial
agentrial init
agentrial run
╭───────────────────────────────────────────────────────────────────────╮
│ my-agent - FAILED                                                     │
╰────────────────────────────────────────────────────── Threshold: 85% ─╯
┏━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ Test Case       ┃ Pass Rate ┃ 95% CI         ┃ Avg Cost ┃ Avg Latency ┃
┡━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ easy-multiply   │ 100.0%    │ (72.2%-100.0%) │ $0.0005  │ 320ms       │
│ tool-selection  │ 90.0%     │ (59.6%-98.2%)  │ $0.0006  │ 450ms       │
│ multi-step-task │ 70.0%     │ (39.7%-89.2%)  │ $0.0011  │ 890ms       │
│ ambiguous-query │ 50.0%     │ (23.7%-76.3%)  │ $0.0008  │ 670ms       │
└─────────────────┴───────────┴────────────────┴──────────┴─────────────┘
Failure Attribution:
tool-selection: Step 0 — called 'calculate' instead of 'lookup_country_info' (p=0.003)
multi-step-task: Step 2 — missing second tool call 'calculate' after lookup (p=0.01)
ambiguous-query: Step 0 — tool selection inconsistent across runs
That 100% on easy-multiply? Wilson CI says it’s actually 72-100% with 10 trials. That multi-step-task at 70%? Step 2 is the bottleneck. Now you know what to fix.
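For reference, the Wilson interval is easy to reproduce by hand. A minimal sketch (not agentrial's internal code) that recovers the 72.2% lower bound above:

import math

def wilson_ci(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial proportion."""
    if trials == 0:
        return (0.0, 1.0)
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    margin = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return (max(0.0, center - margin), min(1.0, center + margin))

print(wilson_ci(10, 10))  # ~(0.722, 1.0): 10/10 passes still only proves "72%+" at 95% confidence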
Every agent framework demos 90%+ accuracy. Run those agents 100 times on the same task and pass rates drop to 60-80%, with wide variance. Benchmarks measure one run; production sees thousands.
No existing tool combines trajectory evaluation, multi-trial statistics, and CI/CD integration in a single open-source package. LangSmith requires paid accounts and LangChain lock-in. Promptfoo doesn’t do multi-trial with confidence intervals. DeepEval and Arize don’t do step-level failure attribution.
agentrial fills that gap: open-source, free, local-first, works with any framework.
Statistical rigor by default. Every evaluation runs N trials with Wilson confidence intervals. Bootstrap resampling for cost/latency. Benjamini-Hochberg correction for multiple comparisons. No single-run pass/fail.
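For cost and latency, the interval comes from a percentile bootstrap. A rough sketch with made-up latency samples (the 500-iteration count matches the methods table further down; everything else is illustrative):

import random

def bootstrap_ci(samples: list[float], iterations: int = 500, alpha: float = 0.05) -> tuple[float, float]:
    """Percentile bootstrap CI for the mean of a small sample (e.g. per-trial latency)."""
    means = []
    for _ in range(iterations):
        resample = [random.choice(samples) for _ in samples]
        means.append(sum(resample) / len(resample))
    means.sort()
    lo = means[int(alpha / 2 * iterations)]
    hi = means[int((1 - alpha / 2) * iterations) - 1]
    return (lo, hi)

latencies_ms = [320, 290, 410, 305, 330, 298, 450, 312, 301, 325]  # hypothetical per-trial latencies
print(bootstrap_ci(latencies_ms))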
Step-level failure attribution. When tests fail, agentrial compares trajectories from passing and failing runs. Fisher exact test identifies the specific step where behavior diverges. You see “Step 2 tool selection is the problem” instead of “test failed.”
Real cost tracking. Token usage from API response metadata, not estimates. 45+ models across Anthropic, OpenAI, Google, Mistral, Meta, DeepSeek. Cost-per-correct-answer as a first-class metric — the number that actually matters for production.
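Cost-per-correct-answer is simply spend divided by correct trials, which is why a cheaper-per-call model can still lose. A toy comparison with made-up numbers:

# Hypothetical: model A is cheaper per call but fails far more often.
model_a = {"cost_per_trial": 0.0006, "pass_rate": 0.45}
model_b = {"cost_per_trial": 0.0011, "pass_rate": 0.91}

for name, m in [("A", model_a), ("B", model_b)]:
    cost_per_correct = m["cost_per_trial"] / m["pass_rate"]
    print(f"model {name}: ${cost_per_correct:.4f} per correct answer")
# model A: $0.0013 per correct answer
# model B: $0.0012 per correct answer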
Regression detection. Fisher exact test on pass rates, Mann-Whitney U on cost/latency. Catches statistically significant drops between versions. Exit code 1 blocks your PR in CI.
Agent Reliability Score. A single 0-100 composite metric that combines accuracy (40%), consistency (20%), cost efficiency (10%), latency (10%), trajectory quality (10%), and recovery (10%). One number to track across releases — like Lighthouse for agents.
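The composite is a plain weighted sum of those components. A back-of-the-envelope sketch, assuming each component is already normalized to 0-1 (names and values here are illustrative, not agentrial's internals):

# Hypothetical component scores on a 0-1 scale; weights are the ones listed above.
components = {
    "accuracy": 0.82, "consistency": 0.75, "cost_efficiency": 0.90,
    "latency": 0.70, "trajectory_quality": 0.80, "recovery": 0.60,
}
weights = {
    "accuracy": 0.40, "consistency": 0.20, "cost_efficiency": 0.10,
    "latency": 0.10, "trajectory_quality": 0.10, "recovery": 0.10,
}
ars = 100 * sum(weights[k] * components[k] for k in weights)
print(round(ars, 1))  # 77.8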
Production monitoring. Deploy agentrial monitor as a cron job or sidecar. CUSUM and Page-Hinkley detectors catch drift in pass rate, cost, and latency. Kolmogorov-Smirnov test detects distribution shifts. Alerts before users notice.
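CUSUM itself is a small amount of code: accumulate deviations from a baseline and alarm when the running sum gets too large. A one-sided sketch for latency drift (baseline, slack, and threshold values are illustrative, not agentrial's defaults):

def cusum_alarm(values: list[float], baseline: float, slack: float, threshold: float) -> int | None:
    """Return the index where an upward CUSUM alarm fires, or None."""
    s = 0.0
    for i, x in enumerate(values):
        s = max(0.0, s + (x - baseline - slack))
        if s > threshold:
            return i
    return None

# Hypothetical daily p50 latencies (ms): drift starts around index 5.
latencies = [320, 315, 330, 325, 318, 380, 410, 395, 420, 405]
print(cusum_alarm(latencies, baseline=320, slack=10, threshold=150))  # 7: fires once the drift accumulates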
Local-first. Data never leaves your machine. No accounts, no SaaS, no telemetry.
suite: my-agent-tests
agent: my_module.agent  # Python import path
trials: 10
threshold: 0.85

cases:
  - name: basic-math
    input:
      query: "What is 15 * 37?"
    expected:
      output_contains: ["555"]
      tool_calls:
        - tool: calculate
  - name: capital-lookup
    input:
      query: "What is the capital of Japan?"
    expected:
      output_contains: ["Tokyo"]
  - name: error-handling
    input:
      query: "Divide 10 by 0"
    expected:
      output_contains_any: ["undefined", "cannot", "error"]
      max_cost: 0.05
      max_latency_ms: 5000
For complex assertions, use the fluent Python API:
from agentrial import expect, AgentInput
result = agent(AgentInput(query="Book a flight to Rome"))
expect(result).succeeded() \
    .tool_called("search_flights", params_contain={"destination": "FCO"}) \
    .cost_below(0.15) \
    .latency_below(5000)
All assertion types: output_contains, output_contains_any, exact_match, regex, tool_calls with params_contain, per-step expectations via step_expectations. See full docs.
agentrial needs a callable: AgentInput -> AgentOutput. Native adapters handle the wiring.
# LangGraph
from agentrial.runner.adapters import wrap_langgraph_agent
agent = wrap_langgraph_agent(your_compiled_graph)
# CrewAI
from agentrial.runner.adapters import wrap_crewai_agent
agent = wrap_crewai_agent(crew)
# Custom — implement the protocol directly
from agentrial.types import AgentInput, AgentOutput, AgentMetadata
def agent(input: AgentInput) -> AgentOutput:
    return AgentOutput(
        output="result", steps=[],
        metadata=AgentMetadata(total_tokens=100, cost=0.001, duration_ms=500.0),
        success=True,
    )
| Framework | Adapter | What it captures |
|---|---|---|
| LangGraph | wrap_langgraph_agent | Callbacks, trajectory, tokens, cost |
| CrewAI | wrap_crewai_agent | Task-level trajectory, crew cost |
| AutoGen | wrap_autogen_agent | v0.4+ and legacy pyautogen |
| Pydantic AI | wrap_pydantic_ai_agent | Tool calls, response parts, tokens |
| OpenAI Agents SDK | wrap_openai_agents_agent | Runner integration, tool calls |
| smolagents (HF) | wrap_smolagents_agent | Dict and object log formats |
| Any OTel agent | Automatic | Span capture via OTel SDK |
| Custom | AgentInput -> AgentOutput | Whatever you return |
name: Agent Evaluation
on: [push, pull_request]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install agentrial && pip install -e .
      - run: agentrial run --trials 10 --threshold 0.85 -o results.json
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
Regression detection between runs:
agentrial compare results.json --baseline baseline.json
Fisher exact test flags statistically significant drops in pass rate; a regression exits with code 1 and blocks your PR in CI.
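The comparison boils down to a 2x2 contingency test on pass/fail counts. A hedged sketch with SciPy, using made-up counts:

from scipy.stats import fisher_exact

# Hypothetical pass/fail counts: baseline 18/20 passed, candidate 11/20 passed.
table = [[18, 2], [11, 9]]
_, p_value = fisher_exact(table, alternative="two-sided")
print(f"p = {p_value:.3f}")  # a small p means the drop is unlikely to be run-to-run noise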
agentrial run --flamegraph # Terminal
agentrial run --flamegraph --html flamegraph.html # Interactive HTML
Visualize agent execution paths across trials. See where passing and failing runs diverge, step by step.
A second LLM evaluates response quality with calibrated scoring. Krippendorff’s alpha for inter-rater reliability, t-distribution CI for score estimates. Calibration protocol runs before scoring to ensure consistency.
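The score interval part is standard small-sample statistics. A rough sketch of a t-distribution CI over hypothetical judge scores (this is not the calibration protocol itself):

import statistics
from scipy import stats

scores = [7.5, 8.0, 6.5, 7.0, 8.5, 7.5, 6.0, 8.0]  # hypothetical judge scores for one test case
mean = statistics.mean(scores)
sem = statistics.stdev(scores) / len(scores) ** 0.5
lo, hi = stats.t.interval(0.95, len(scores) - 1, loc=mean, scale=sem)
print(f"{mean:.2f} (95% CI {lo:.2f}-{hi:.2f})")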
agentrial snapshot update # Save current behavior as baseline
agentrial snapshot check # Compare against baseline
Fisher exact test on pass rates, Mann-Whitney U on cost/latency, Benjamini-Hochberg correction across all comparisons.
agentrial security scan --mcp-config servers.json
Audits MCP server configurations for 6 vulnerability classes: prompt injection, tool shadowing, data exfiltration, permission escalation, rug pull, configuration weakness.
agentrial pareto --models claude-3-haiku,gpt-4o-mini,gemini-flash
Find the optimal cost-accuracy trade-off across models. ASCII plot in terminal.
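The frontier is just the set of models no other model beats on both cost and accuracy. A small sketch with made-up numbers:

# Hypothetical (cost per correct answer, pass rate) per model.
results = {
    "claude-3-haiku": (0.0008, 0.86),
    "gpt-4o-mini": (0.0006, 0.81),
    "gemini-flash": (0.0004, 0.74),
}

def pareto_front(points: dict[str, tuple[float, float]]) -> list[str]:
    """Keep models not dominated by another model that is no costlier and no less accurate."""
    front = []
    for name, (cost, acc) in points.items():
        dominated = any(
            c <= cost and a >= acc and (c < cost or a > acc)
            for other, (c, a) in points.items() if other != name
        )
        if not dominated:
            front.append(name)
    return front

print(pareto_front(results))  # all three are on the frontier for these numbers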
agentrial prompt track prompts/v2.txt
agentrial prompt diff v1 v2
Track, diff, and compare prompt versions with statistical significance testing between them.
agentrial publish results.json --agent-name my-agent --agent-version 1.0.0
agentrial verify --agent-name my-agent --agent-version 1.0.0 --suite-name my-suite
Publish evaluation results as verifiable benchmark files with SHA-256 integrity checksums.
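Verification reduces to recomputing the digest and comparing it against the one recorded at publish time. A sketch of the underlying check (file name illustrative):

import hashlib
from pathlib import Path

def sha256_of(path: str) -> str:
    """Hex SHA-256 digest of a results file."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

# Verification: recompute the digest and compare it to the checksum stored in the benchmark file.
print(sha256_of("results.json"))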
Delegation accuracy, handoff fidelity, redundancy rate, cascade failure depth, communication efficiency — five metrics for multi-agent systems.
Local FastAPI dashboard for browsing results, comparing runs, and tracking trends.
Domain-specific evaluation packages via Python entry points. Install a pack, get specialized test templates and evaluators.
Browse test suites, run evaluations, view flame graphs, and compare snapshots from your editor. Install from the VS Code Marketplace or search “agentrial” in extensions.
| Method | Purpose |
|---|---|
| Wilson score interval | Pass rate CI — accurate at 0%, 100%, and small N |
| Bootstrap resampling | Cost/latency CI — non-parametric, 500 iterations |
| Fisher exact test | Regression detection and failure attribution |
| Mann-Whitney U | Cost/latency comparison between versions |
| Benjamini-Hochberg | False discovery rate control for multiple comparisons |
| CUSUM / Page-Hinkley | Sequential change-point detection for production monitoring |
| Kolmogorov-Smirnov | Distribution shift detection |
| Krippendorff’s alpha | Inter-rater reliability for LLM-as-Judge |
Failure attribution works by grouping trials into pass/fail, comparing tool call distributions at each step, and identifying the step with the lowest p-value as the most likely divergence point.
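A hedged sketch of that procedure, simplified to "which tool was called at each step" (trajectories and data model here are illustrative, not agentrial's internals):

from collections import Counter
from scipy.stats import fisher_exact

# Hypothetical trajectories: the tool called at each step, grouped by trial outcome.
passing = [["lookup", "calculate", "respond"]] * 7
failing = [["lookup", "respond", "respond"]] * 3

def divergence_step(passing, failing):
    """Find the step where tool choice differs most between passing and failing trials."""
    best_step, best_p = None, 1.0
    for step in range(len(passing[0])):
        expected_tool = Counter(t[step] for t in passing).most_common(1)[0][0]
        pass_match = sum(t[step] == expected_tool for t in passing)
        fail_match = sum(t[step] == expected_tool for t in failing)
        table = [[pass_match, len(passing) - pass_match],
                 [fail_match, len(failing) - fail_match]]
        _, p = fisher_exact(table)
        if p < best_p:
            best_step, best_p = step, p
    return best_step, best_p

print(divergence_step(passing, failing))  # (1, ~0.008): step 1 is the likely divergence point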
agentrial init # Scaffold sample project
agentrial run # Run all tests
agentrial run tests/ --trials 20 # Custom trials
agentrial run -o results.json # JSON export
agentrial run --flamegraph # Trajectory flame graphs
agentrial run --judge # LLM-as-Judge evaluation
agentrial compare results.json -b base.json # Regression detection
agentrial baseline results.json # Save baseline
agentrial snapshot update / check # Snapshot testing
agentrial security scan --mcp-config c.json # MCP security scan
agentrial pareto --models m1,m2,m3 # Cost-accuracy Pareto frontier
agentrial prompt track/diff/list # Prompt version control
agentrial monitor --baseline snap.json # Production drift detection
agentrial ars results.json # Agent Reliability Score
agentrial publish results.json --agent-name X --agent-version Y
agentrial verify --agent-name X --agent-version Y --suite-name Z
agentrial packs list # Installed eval packs
agentrial dashboard # Local dashboard
| | agentrial | Promptfoo | LangSmith | DeepEval | Arize Phoenix |
|---|---|---|---|---|---|
| Multi-trial with CI | Free | — | $39/mo | — | — |
| Confidence intervals | Wilson CI | — | — | — | — |
| Step-level failure attribution | Fisher exact | — | — | — | Partial |
| Framework-agnostic | 6 adapters + OTel | Yes | LangChain only | Yes | Yes |
| Cost-per-correct-answer | Yes | — | — | — | — |
| LLM-as-Judge with calibration | Krippendorff α | — | Yes | Yes | — |
| Composite reliability score | ARS (0-100) | — | — | — | — |
| MCP security scanning | 6 vuln classes | — | — | — | — |
| Production drift detection | CUSUM + PH + KS | — | — | — | Partial |
| VS Code extension | Yes | — | — | — | — |
| Local-first | Yes | Yes | No | No | Self-host option |
git clone https://github.com/alepot55/agentrial.git
cd agentrial
pip install -e ".[dev]"
pytest # 450 tests
ruff check .
mypy agentrial/
See CONTRIBUTING.md for details.