alepot55/agentrial: Statistical evaluation framework for AI agents


The pytest for AI agents. Run your agent 100 times, get confidence intervals instead of anecdotes.


Your agent passes Monday, fails Wednesday. Same prompt, same model. LLMs show up to 72% variance across runs even at temperature=0.

agentrial runs your agent N times and gives you statistics, not luck.

pip install agentrial
agentrial init
agentrial run
╭──────────────────────────────────────────────────────────────────────────╮
│ my-agent - FAILED                                                        │
╰───────────────────────────────────────────────────────── Threshold: 85% ─╯
┏━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ Test Case            ┃ Pass Rate ┃ 95% CI           ┃ Avg Cost ┃ Avg Latency ┃
┡━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ easy-multiply        │    100.0% │ (72.2%-100.0%)   │  $0.0005 │       320ms │
│ tool-selection       │     90.0% │ (59.6%-98.2%)    │  $0.0006 │       450ms │
│ multi-step-task      │     70.0% │ (39.7%-89.2%)    │  $0.0011 │       890ms │
│ ambiguous-query      │     50.0% │ (23.7%-76.3%)    │  $0.0008 │       670ms │
└──────────────────────┴───────────┴──────────────────┴──────────┴─────────────┘

Failure Attribution:
  tool-selection: Step 0 — called 'calculate' instead of 'lookup_country_info' (p=0.003)
  multi-step-task: Step 2 — missing second tool call 'calculate' after lookup (p=0.01)
  ambiguous-query: Step 0 — tool selection inconsistent across runs

That 100% on easy-multiply? Wilson CI says it’s actually 72-100% with 10 trials. That multi-step-task at 70%? Step 2 is the bottleneck. Now you know what to fix.
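
You can reproduce that lower bound yourself. A minimal sketch of the Wilson score interval (the textbook formula, not agentrial's internals):

import math

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    # Wilson score interval for a binomial proportion (95% CI at z=1.96)
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return max(0.0, center - half), min(1.0, center + half)

print(wilson_ci(10, 10))  # ≈ (0.722, 1.0): 10/10 passes still leaves a 72% lower bound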


Every agent framework demos 90%+ accuracy. Run those agents 100 times on the same task and pass rates drop to 60-80% with wide variance. Benchmarks measure one run; production sees thousands.

No existing tool combines trajectory evaluation, multi-trial statistics, and CI/CD integration in a single open-source package. LangSmith requires paid accounts and LangChain lock-in. Promptfoo doesn’t do multi-trial with confidence intervals. DeepEval and Arize don’t do step-level failure attribution.

agentrial fills that gap: open-source, free, local-first, works with any framework.


Statistical rigor by default. Every evaluation runs N trials with Wilson confidence intervals. Bootstrap resampling for cost/latency. Benjamini-Hochberg correction for multiple comparisons. No single-run pass/fail.
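
The bootstrap half is simple in spirit. A percentile-bootstrap sketch for a latency CI, with made-up samples (agentrial's implementation may differ in details):

import random

def bootstrap_ci(samples: list[float], iters: int = 500, alpha: float = 0.05) -> tuple[float, float]:
    # Percentile bootstrap: resample with replacement, take empirical quantiles of the mean
    means = sorted(
        sum(random.choices(samples, k=len(samples))) / len(samples)
        for _ in range(iters)
    )
    return means[int(alpha / 2 * iters)], means[int((1 - alpha / 2) * iters) - 1]

latencies_ms = [320, 450, 390, 310, 880, 420, 350, 400, 460, 330]
print(bootstrap_ci(latencies_ms))  # roughly (350, 550); varies run to run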

Step-level failure attribution. When tests fail, agentrial compares trajectories from passing and failing runs. Fisher exact test identifies the specific step where behavior diverges. You see “Step 2 tool selection is the problem” instead of “test failed.”

Real cost tracking. Token usage from API response metadata, not estimates. 45+ models across Anthropic, OpenAI, Google, Mistral, Meta, DeepSeek. Cost-per-correct-answer as a first-class metric — the number that actually matters for production.

Regression detection. Fisher exact test on pass rates, Mann-Whitney U on cost/latency. Catches statistically significant drops between versions. Exit code 1 blocks your PR in CI.
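
To see what that check looks like, here is the same comparison done directly with scipy's mannwhitneyu on invented latency samples; illustrative only, not agentrial's code:

from scipy.stats import mannwhitneyu

baseline_ms = [320, 350, 330, 340, 360, 310, 345, 335]
candidate_ms = [420, 450, 430, 460, 440, 455, 425, 470]

stat, p = mannwhitneyu(baseline_ms, candidate_ms, alternative="two-sided")
print(f"latency shift: p={p:.5f}")  # a small p here is the kind of drop that blocks a PR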

Agent Reliability Score. A single 0-100 composite metric that combines accuracy (40%), consistency (20%), cost efficiency (10%), latency (10%), trajectory quality (10%), and recovery (10%). One number to track across releases — like Lighthouse for agents.
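
The composite is just a weighted sum. A sketch using the weights above; how agentrial derives each sub-score internally is not shown here, so the inputs are hypothetical:

WEIGHTS = {
    "accuracy": 0.40, "consistency": 0.20, "cost_efficiency": 0.10,
    "latency": 0.10, "trajectory_quality": 0.10, "recovery": 0.10,
}

def reliability_score(components: dict[str, float]) -> float:
    # Weighted 0-100 composite; each component assumed normalized to [0, 1]
    return 100 * sum(WEIGHTS[k] * components[k] for k in WEIGHTS)

print(reliability_score({
    "accuracy": 0.90, "consistency": 0.80, "cost_efficiency": 0.70,
    "latency": 0.90, "trajectory_quality": 0.85, "recovery": 0.60,
}))  # ≈ 82.5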

Production monitoring. Deploy agentrial monitor as a cron job or sidecar. CUSUM and Page-Hinkley detectors catch drift in pass rate, cost, and latency. Kolmogorov-Smirnov test detects distribution shifts. Alerts before users notice.
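
CUSUM itself is a few lines. A textbook one-sided detector on synthetic latencies (the slack k and threshold h are illustrative, not agentrial defaults):

def cusum_alarm(values, target, k=0.5, h=5.0):
    # One-sided CUSUM: accumulate upward deviations beyond slack k,
    # alarm once the cumulative sum exceeds threshold h
    s = 0.0
    for i, x in enumerate(values):
        s = max(0.0, s + (x - target - k))
        if s > h:
            return i
    return None

# Latency (s) drifts upward from index 5; baseline target is 1.0
print(cusum_alarm([1.0, 0.9, 1.1, 1.0, 1.2, 3.0, 3.2, 3.1, 3.3, 3.4], target=1.0))  # 8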

Local-first. Data never leaves your machine. No accounts, no SaaS, no telemetry.


Tests are declared in a YAML suite file:

suite: my-agent-tests
agent: my_module.agent       # Python import path
trials: 10
threshold: 0.85

cases:
  - name: basic-math
    input:
      query: "What is 15 * 37?"
    expected:
      output_contains: ["555"]
      tool_calls:
        - tool: calculate

  - name: capital-lookup
    input:
      query: "What is the capital of Japan?"
    expected:
      output_contains: ["Tokyo"]

  - name: error-handling
    input:
      query: "Divide 10 by 0"
    expected:
      output_contains_any: ["undefined", "cannot", "error"]
    max_cost: 0.05
    max_latency_ms: 5000

For complex assertions, use the fluent Python API:

from agentrial import expect, AgentInput

result = agent(AgentInput(query="Book a flight to Rome"))

expect(result).succeeded() \
    .tool_called("search_flights", params_contain={"destination": "FCO"}) \
    .cost_below(0.15) \
    .latency_below(5000)

All assertion types: output_contains, output_contains_any, exact_match, regex, tool_calls with params_contain, per-step expectations via step_expectations. See full docs.


agentrial needs a callable: AgentInput -> AgentOutput. Native adapters handle the wiring.

# LangGraph
from agentrial.runner.adapters import wrap_langgraph_agent
agent = wrap_langgraph_agent(your_compiled_graph)

# CrewAI
from agentrial.runner.adapters import wrap_crewai_agent
agent = wrap_crewai_agent(crew)

# Custom — implement the protocol directly
from agentrial.types import AgentInput, AgentOutput, AgentMetadata

def agent(input: AgentInput) -> AgentOutput:
    return AgentOutput(
        output="result", steps=[],
        metadata=AgentMetadata(total_tokens=100, cost=0.001, duration_ms=500.0),
        success=True,
    )

Framework            Adapter                       What it captures
LangGraph            wrap_langgraph_agent          Callbacks, trajectory, tokens, cost
CrewAI               wrap_crewai_agent             Task-level trajectory, crew cost
AutoGen              wrap_autogen_agent            v0.4+ and legacy pyautogen
Pydantic AI          wrap_pydantic_ai_agent        Tool calls, response parts, tokens
OpenAI Agents SDK    wrap_openai_agents_agent      Runner integration, tool calls
smolagents (HF)      wrap_smolagents_agent         Dict and object log formats
Any OTel agent       Automatic                     Span capture via OTel SDK
Custom               AgentInput -> AgentOutput     Whatever you return


A minimal GitHub Actions workflow:

name: Agent Evaluation
on: [push, pull_request]

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install agentrial && pip install -e .
      - run: agentrial run --trials 10 --threshold 0.85 -o results.json
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}

Regression detection between runs:

agentrial compare results.json --baseline baseline.json

A Fisher exact test on pass rates flags statistically significant drops; a regression exits with code 1 and blocks the PR.


agentrial run --flamegraph                         # Terminal
agentrial run --flamegraph --html flamegraph.html   # Interactive HTML

Visualize agent execution paths across trials. See where passing and failing runs diverge, step by step.

LLM-as-Judge (--judge): a second LLM evaluates response quality with calibrated scoring. Krippendorff's alpha measures inter-rater reliability, and a t-distribution CI bounds score estimates. A calibration protocol runs before scoring to ensure consistency.
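
For intuition, the t-distribution CI over judge scores looks like this; made-up ratings and standard scipy calls, not agentrial's judge pipeline:

from scipy import stats

judge_scores = [4.0, 3.5, 4.5, 4.0, 3.0, 4.5, 4.0, 3.5]  # hypothetical 1-5 ratings
mean = sum(judge_scores) / len(judge_scores)
lo, hi = stats.t.interval(
    0.95, df=len(judge_scores) - 1, loc=mean, scale=stats.sem(judge_scores)
)
print(f"judge score {mean:.2f}, 95% CI ({lo:.2f}, {hi:.2f})")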

agentrial snapshot update     # Save current behavior as baseline
agentrial snapshot check      # Compare against baseline

Fisher exact test on pass rates, Mann-Whitney U on cost/latency, Benjamini-Hochberg correction across all comparisons.

agentrial security scan --mcp-config servers.json

Audits MCP server configurations for 6 vulnerability classes: prompt injection, tool shadowing, data exfiltration, permission escalation, rug pull, configuration weakness.

agentrial pareto --models claude-3-haiku,gpt-4o-mini,gemini-flash

Find the optimal cost-accuracy trade-off across models. ASCII plot in terminal.
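
The frontier itself is a simple non-dominated filter. A sketch over invented (cost, accuracy) points:

def pareto_frontier(points):
    # Keep models not dominated by another (cheaper or equal AND at least as accurate)
    return [
        (name, cost, acc) for name, cost, acc in points
        if not any(
            c <= cost and a >= acc and (c, a) != (cost, acc)
            for _, c, a in points
        )
    ]

models = [  # hypothetical (cost per task, pass rate) numbers
    ("claude-3-haiku", 0.0004, 0.82),
    ("gpt-4o-mini", 0.0003, 0.78),
    ("gemini-flash", 0.0005, 0.80),
]
print(pareto_frontier(models))  # gemini-flash drops out: haiku is cheaper and more accurate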

agentrial prompt track prompts/v2.txt
agentrial prompt diff v1 v2

Track, diff, and compare prompt versions with statistical significance testing between them.

agentrial publish results.json --agent-name my-agent --agent-version 1.0.0
agentrial verify --agent-name my-agent --agent-version 1.0.0 --suite-name my-suite

Publish evaluation results as verifiable benchmark files with SHA-256 integrity checksums.
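
The integrity check reduces to a SHA-256 digest of the results file; verifying one by hand is a few lines (the published file layout is agentrial's own, so this only shows the hashing step):

import hashlib

with open("results.json", "rb") as f:
    digest = hashlib.sha256(f.read()).hexdigest()
print(digest)  # must match the checksum recorded in the published benchmark file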

Five metrics for multi-agent systems: delegation accuracy, handoff fidelity, redundancy rate, cascade failure depth, and communication efficiency.

Local FastAPI dashboard for browsing results, comparing runs, and tracking trends.

Domain-specific evaluation packages via Python entry points. Install a pack, get specialized test templates and evaluators.


Browse test suites, run evaluations, view flame graphs, and compare snapshots from your editor. Install from the VS Code Marketplace or search “agentrial” in extensions.


Method                   Purpose
Wilson score interval    Pass rate CI — accurate at 0%, 100%, and small N
Bootstrap resampling     Cost/latency CI — non-parametric, 500 iterations
Fisher exact test        Regression detection and failure attribution
Mann-Whitney U           Cost/latency comparison between versions
Benjamini-Hochberg       False discovery rate control for multiple comparisons
CUSUM / Page-Hinkley     Sequential change-point detection for production monitoring
Kolmogorov-Smirnov       Distribution shift detection
Krippendorff's alpha     Inter-rater reliability for LLM-as-Judge

Failure attribution works by grouping trials into pass/fail, comparing tool call distributions at each step, and identifying the step with the lowest p-value as the most likely divergence point.
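
Concretely, for a single step that amounts to a 2x2 contingency test. A sketch with invented counts:

from scipy.stats import fisher_exact

# Rows: passing vs failing runs; columns: called the expected tool vs not.
table = [[7, 0],   # 7 passing runs called 'calculate' at step 2
         [0, 3]]   # all 3 failing runs skipped it
odds_ratio, p = fisher_exact(table)
print(f"step 2: p={p:.4f}")  # ≈ 0.0083, strong evidence this step drives the failures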


agentrial init                              # Scaffold sample project
agentrial run                               # Run all tests
agentrial run tests/ --trials 20            # Custom trials
agentrial run -o results.json               # JSON export
agentrial run --flamegraph                  # Trajectory flame graphs
agentrial run --judge                       # LLM-as-Judge evaluation
agentrial compare results.json -b base.json # Regression detection
agentrial baseline results.json             # Save baseline
agentrial snapshot update / check           # Snapshot testing
agentrial security scan --mcp-config c.json # MCP security scan
agentrial pareto --models m1,m2,m3          # Cost-accuracy Pareto frontier
agentrial prompt track/diff/list            # Prompt version control
agentrial monitor --baseline snap.json      # Production drift detection
agentrial ars results.json                  # Agent Reliability Score
agentrial publish results.json --agent-name X --agent-version Y
agentrial verify --agent-name X --agent-version Y --suite-name Z
agentrial packs list                        # Installed eval packs
agentrial dashboard                         # Local dashboard

Feature                          agentrial            Promptfoo   LangSmith         DeepEval   Arize Phoenix
Multi-trial with CI              Free                 —           $39/mo            —          —
Confidence intervals             Wilson CI            —           —                 —          —
Step-level failure attribution   Fisher exact         —           Partial           —          —
Framework-agnostic               6 adapters + OTel    Yes         LangChain only    Yes        Yes
Cost-per-correct-answer          Yes                  —           —                 —          —
LLM-as-Judge with calibration    Krippendorff α       —           —                 Yes        Yes
Composite reliability score      ARS (0-100)          —           —                 —          —
MCP security scanning            6 vuln classes       —           —                 —          —
Production drift detection       CUSUM + PH + KS      —           —                 —          Partial
VS Code extension                Yes                  —           —                 —          —
Local-first                      Yes                  Yes         No                No         Self-host option


git clone https://github.com/alepot55/agentrial.git
cd agentrial
pip install -e ".[dev]"
pytest                    # 450 tests
ruff check .
mypy agentrial/

See CONTRIBUTING.md for details.


MIT


