ricardomoratomateos/distill: Automatic LLM agent migration from expensive to cheap models


Migrate LLM agents from expensive to cheap models. Automatically.

Claude Sonnet ($15/MTok) → GPT-4o-mini ($0.15/MTok) = 100x cost reduction

You built an agent that works perfectly with Claude Sonnet. It costs $0.02 per run. At 10,000 runs/month, that’s $200. With GPT-4o-mini, it could be $2.

But migrating manually means 15+ hours of:

  • Rewriting prompts that “just work” on smart models
  • Running hundreds of test cases
  • Debugging subtle failures
  • Repeating it all for every new, cheaper model

Distill automates this entire process.

┌─────────────────────────────────────────────────────────────────────┐
│                                                                     │
│   Your Agent          Distill              Optimized Agent          │
│   (Sonnet)              │                  (GPT-4o-mini)            │
│       │                 │                       │                   │
│       ▼                 ▼                       ▼                   │
│   ┌───────┐      ┌─────────────┐         ┌───────────┐              │
│   │$0.02  │ ───▶ │  Profile    │         │  $0.002   │              │
│   │/run   │      │  Judge      │ ──────▶ │  /run     │              │
│   │       │      │  Optimize   │         │           │              │
│   │  95%  │      │  Validate   │         │   96%     │              │
│   │success│      └─────────────┘         │  success  │              │
│   └───────┘            │                 └───────────┘              │
│                        │                                            │
│                        ▼                                            │
│              Iterates until 95%+ success                            │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
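The loop in the diagram can be sketched roughly as follows. All names here (judgeOutputs, refinePrompt, migrate) are illustrative, not Distill's actual API; the real judge and prompt modifier are LLM calls, stubbed here with plain functions:

```typescript
// Sketch of the profile → judge → modify → validate loop from the diagram.
// Hypothetical names; the real judge and prompt modifier are LLM calls.

type TestCase = { input: string; expected: string };
type Evaluation = { successRate: number; feedback: string };

// Stub judge: exact match. Distill uses an LLM-as-judge instead.
function judgeOutputs(outputs: string[], suite: TestCase[]): Evaluation {
  const passed = outputs.filter((o, i) => o === suite[i].expected).length;
  return {
    successRate: passed / suite.length,
    feedback: passed === suite.length ? "all passing" : "outputs diverge from gold standard",
  };
}

// Stub modifier: folds judge feedback back into the prompt.
function refinePrompt(prompt: string, feedback: string): string {
  return `${prompt}\nNote: ${feedback}`;
}

function migrate(
  runAgent: (prompt: string, input: string) => string,
  initialPrompt: string,
  suite: TestCase[],
  threshold = 0.95,
  maxIterations = 10,
): { prompt: string; successRate: number } {
  let prompt = initialPrompt;
  let best = { prompt, successRate: 0 };
  for (let i = 0; i < maxIterations; i++) {
    const outputs = suite.map((tc) => runAgent(prompt, tc.input));
    const { successRate, feedback } = judgeOutputs(outputs, suite);
    if (successRate > best.successRate) best = { prompt, successRate };
    if (successRate >= threshold) break; // target achieved
    prompt = refinePrompt(prompt, feedback); // otherwise iterate
  }
  return best;
}
```

The key design point: only the prompt changes between iterations; the test suite stays fixed so success rates are comparable across rounds.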
# Clone and setup
git clone https://github.com/your-org/distill.git
cd distill
pnpm install
pnpm build

# Configure API keys
cp .env.example .env
# Add your ANTHROPIC_API_KEY and OPENAI_API_KEY

1. Define your agent (agent.yaml):

name: "Customer Support Bot"
description: "Answers product questions accurately and concisely"

model:
  provider: anthropic
  name: claude-sonnet-4-20250514
  temperature: 0

systemPrompt: |
  You are a helpful customer support agent for TechCorp.
  Answer questions about our products accurately and concisely.
  Be professional and friendly.

objective: "Provide accurate, helpful answers to customer questions"

successCriteria:
  - "Answers are factually correct"
  - "Tone is professional and friendly"
  - "Responses are concise"

2a. Option A: Profile with expensive model (automatic):

Create test inputs (test-inputs.json):

[
  "What is your return policy?",
  "How do I reset my password?"
]

Profile to create gold standard:

pnpm profile -c agent.yaml -i test-inputs.json -o test-suite.json

2b. Option B: Use existing gold standard (manual – no cost):

If you already have gold standard responses from production:

[
  {
    "input": { "message": "What is your return policy?" },
    "expectedOutput": { "response": "Our return policy allows..." }
  },
  {
    "input": { "message": "How do I reset my password?" },
    "expectedOutput": { "response": "To reset your password..." }
  }
]

Create test suite directly:

pnpm create-test-suite -i manual-test-cases.json -o test-suite.json
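Conceptually, create-test-suite just wraps your manual cases into the same shape the profiler would emit. A sketch with assumed field names (the actual test-suite.json schema may differ):

```typescript
// Assumed shapes — the real test-suite.json schema may differ.
type ManualCase = {
  input: { message: string };
  expectedOutput: { response: string };
};

type TestSuite = {
  source: "manual" | "profiled";
  cases: Array<ManualCase & { id: string }>;
};

// Wrap manual gold-standard cases so the migrate/evaluate commands
// can consume them exactly like profiled ones.
function createTestSuite(cases: ManualCase[]): TestSuite {
  return {
    source: "manual",
    cases: cases.map((c, i) => ({ ...c, id: `case-${i + 1}` })),
  };
}
```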

3. Migrate to cheaper model:

pnpm migrate -c agent.yaml -p test-suite.json -t gpt-4o-mini -o agent.optimized.yaml

4. Evaluate the result:

pnpm evaluate -c agent.optimized.yaml -p test-suite.json

Output:

🔧 Loading agent from agent.yaml...
📊 Profiling source model (claude-sonnet-4-20250514)...
   Running 50 test cases...
✅ Baseline established:
   • Cost: $0.018/run
   • Success rate: 94%
   • Avg latency: 2.3s

🎯 Target: gpt-4o-mini

🚀 Starting migration...

   Iteration 1/10
   ├─ Running evaluation... 42% success
   ├─ Judge feedback: "Missing context for product-specific terms"
   └─ Modifier: Adding explicit product glossary to prompt

   Iteration 2/10
   ├─ Running evaluation... 71% success
   ├─ Judge feedback: "Inconsistent formatting in responses"
   └─ Modifier: Adding output format examples

   Iteration 3/10
   ├─ Running evaluation... 88% success
   ├─ Judge feedback: "Edge cases with ambiguous questions"
   └─ Modifier: Adding chain-of-thought for complex queries

   Iteration 4/10
   ├─ Running evaluation... 96% success ✓
   └─ Target achieved!

✨ Migration complete!

   ┌────────────────────────────────────────┐
   │  Metric          Before      After     │
   │  ─────────────────────────────────────│
   │  Model           Sonnet      4o-mini   │
   │  Cost/run        $0.018      $0.002    │
   │  Success rate    94%         96%       │
   │  Latency         2.3s        0.8s      │
   │  ─────────────────────────────────────│
   │  Monthly savings (10k runs): $160      │
   └────────────────────────────────────────┘

📝 Saved: agent.optimized.yaml

Phase 1 (Current – MVP Complete ✅)

  • Automatic Profiling: Run your agent, capture gold standard outputs
  • Manual Test Suites: Use your existing gold standards from production (no profiling cost!)
  • LLM-as-Judge Evaluation: Smart comparison of outputs, not just string matching
  • Iterative Prompt Optimization: Automatically refines prompts based on failures
  • Convergence Strategies: Choose how to stop optimization
    • ThresholdPlusBonusRounds (default): Extra iterations after reaching threshold
    • AlwaysRunMax: Always run all iterations, return best result
    • EarlyStoppingWithPatience: Stop early if no improvement
  • Anti-Overfitting: Modifier learns general strategies, not specific test answers
  • Multi-Provider Support: Anthropic (Claude), OpenAI (GPT)
  • CLI Tools: profile, migrate, evaluate commands
  • LangGraph Orchestration: Flexible graph-based migration flow
  • Cost Tracking: Real execution traces with token counts

Phase 2 (Planned 🔜)

  • Multi-Agent Decomposition: Automatically split complex agents into specialized sub-agents
  • Architecture Optimization: Router + specialized agents for different task types
  • Visual Dashboard: Web UI for monitoring migrations
  • CI/CD Integration: Automated regression testing
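The convergence strategies listed above (ThresholdPlusBonusRounds, AlwaysRunMax, EarlyStoppingWithPatience) could be expressed like this; the interface and constructors are assumptions for illustration, not Distill's actual types:

```typescript
// Illustrative sketches of the three convergence strategies.
// The interface shape is an assumption, not Distill's real API.

interface ConvergenceStrategy {
  // history = success rate per completed iteration; return true to stop.
  shouldStop(history: number[], threshold: number, maxIterations: number): boolean;
}

// Always run all iterations; the caller keeps the best result.
const AlwaysRunMax: ConvergenceStrategy = {
  shouldStop: (history, _threshold, maxIterations) => history.length >= maxIterations,
};

// Keep going for `bonus` extra rounds after first reaching the threshold.
const ThresholdPlusBonusRounds = (bonus: number): ConvergenceStrategy => ({
  shouldStop(history, threshold, maxIterations) {
    if (history.length >= maxIterations) return true;
    const firstHit = history.findIndex((r) => r >= threshold);
    return firstHit !== -1 && history.length - firstHit > bonus;
  },
});

// Stop once `patience` rounds pass without beating the best score so far.
const EarlyStoppingWithPatience = (patience: number): ConvergenceStrategy => ({
  shouldStop(history, _threshold, maxIterations) {
    if (history.length === 0) return false;
    if (history.length >= maxIterations) return true;
    const best = Math.max(...history);
    const sinceBest = history.length - 1 - history.indexOf(best);
    return sinceBest >= patience;
  },
});
```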

Scenario                                        Distill Helps?
──────────────────────────────────────────────  ─────────────────────────────────────────
Agent works on Sonnet, need to cut costs        ✅ Yes
Building new agent, want to start cheap         ⚠️ Build with Sonnet first, then Distill
Agent has consistent failures on cheap model    ✅ Yes
Need to migrate to new model (GPT-5, etc.)      ✅ Yes
Agent requires vision/multimodal                🔜 Coming in Phase 2

Distill is designed to evolve:

Phase 1: SingleAgent optimization
         ┌─────────────────────────┐
         │  Profiler → Judge →     │
         │  PromptModifier →       │
         │  Validator              │
         └─────────────────────────┘

Phase 2: MultiAgent decomposition (same interfaces)
         ┌─────────────────────────┐
         │  Profiler → Judge →     │
         │  ArchModifier →         │  ← New modifier, same flow
         │  Validator              │
         └─────────────────────────┘

The Agent interface abstracts over single- and multi-agent implementations:

interface Agent {
  execute(input: Input): Promise<Output>;
  getCost(): number;
}

// Phase 1: SingleAgent implements Agent
// Phase 2: MultiAgent implements Agent (drop-in replacement)
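A minimal sketch of that drop-in relationship (only the Agent interface comes from the docs above; the class internals here are assumptions):

```typescript
// Only the Agent interface is from Distill's docs; the class
// internals below are illustrative assumptions.

type Input = { message: string };
type Output = { response: string };

interface Agent {
  execute(input: Input): Promise<Output>;
  getCost(): number;
}

// Phase 1: one model, one prompt; tracks accumulated cost.
class SingleAgent implements Agent {
  private cost = 0;
  constructor(
    private respond: (msg: string) => string,
    private costPerRun: number,
  ) {}
  async execute(input: Input): Promise<Output> {
    this.cost += this.costPerRun;
    return { response: this.respond(input.message) };
  }
  getCost(): number {
    return this.cost;
  }
}

// Phase 2: a router picks a specialized sub-agent per input;
// callers see the exact same Agent interface.
class MultiAgent implements Agent {
  constructor(
    private agents: Agent[],
    private route: (input: Input) => number, // index of the sub-agent to use
  ) {}
  async execute(input: Input): Promise<Output> {
    return this.agents[this.route(input)].execute(input);
  }
  getCost(): number {
    return this.agents.reduce((sum, a) => sum + a.getCost(), 0);
  }
}
```

Because both classes satisfy Agent, the profiler, judge, and validator never need to know which one they are driving.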

Full architecture docs

Distill provides convenient npm scripts for all commands:

# Profile an agent (creates gold standard)
pnpm profile -c agent.yaml -i test-inputs.json -o test-suite.json

# Create test suite from manual gold standards (no profiling cost!)
pnpm create-test-suite -i manual-test-cases.json -o test-suite.json

# Migrate to cheaper model
pnpm migrate -c agent.yaml -p test-suite.json -t gpt-4o-mini

# Evaluate results
pnpm evaluate -c agent.optimized.yaml -p test-suite.json

Available scripts:

  • pnpm distill – Run any CLI command
  • pnpm profile – Shortcut for profile command
  • pnpm create-test-suite – Shortcut for create-test-suite
  • pnpm migrate – Shortcut for migrate command
  • pnpm evaluate – Shortcut for evaluate command

All commands automatically load .env file for API keys.

distill/
├── packages/
│   ├── core/           # Main logic (provider-agnostic)
│   │   ├── profiler/   # Captures agent behavior
│   │   ├── judge/      # LLM-based evaluation
│   │   ├── modifier/   # Prompt & architecture optimization
│   │   ├── agents/     # Agent abstractions
│   │   └── validator/  # End-to-end testing
│   ├── cli/            # Command-line interface
│   └── web/            # Dashboard (placeholder)
├── examples/           # Real-world examples
├── docs/               # Documentation
├── turbo.json          # Turborepo task configuration
└── pnpm-workspace.yaml # pnpm workspaces config

This project uses pnpm workspaces + Turborepo for fast, cached builds:

pnpm build      # Build all packages (with caching)
pnpm dev        # Watch mode for all packages
pnpm test       # Run tests in parallel
pnpm lint       # Lint all packages
pnpm typecheck  # TypeScript validation
pnpm clean      # Clean build artifacts and cache

Turborepo caches build outputs – subsequent builds are near-instant if nothing changed.

v0.1 – MVP (Phase 1) ✅ COMPLETE

LLM costs are a universal problem. Every team building with AI faces the same migration pain. By open-sourcing Distill:

  1. Community improvements: More edge cases, more optimizations
  2. Provider support: Community can add Mistral, Llama, etc.
  3. Trust: See exactly how your prompts are being modified
  4. Standards: Help establish best practices for LLM migration

We welcome contributions! See CONTRIBUTING.md for:

  • Development setup
  • Running tests
  • Submitting PRs
  • Code style guidelines

MIT License – see LICENSE


Built with frustration from manually migrating agents too many times.

Documentation · Examples · Discord · Twitter


