Chaos engineering tool that uses agents to break your infrastructure on purpose, then cleans up after itself.
You tell it what to target (a database, a k8s cluster, some servers), pick the skills you want to run, and it handles discovery, fault injection, and rollback. You can also point an LLM at your infra and let it decide what to break.
- Databases (PostgreSQL, MySQL, CockroachDB, YugabyteDB, MongoDB) — Connects to your DB, discovers the schema (or collections for MongoDB), and hammers it with inserts, updates, heavy reads, or config changes. Rolls back everything when done.
- Kubernetes — Finds workloads in your cluster and starts killing pods, cordoning nodes, dropping network policies, or deploying resource hogs. Cleans up on exit.
- Servers — SSHes into hosts, discovers what’s running (services, ports, filesystems), and goes after them: fills disks, stops services, changes permissions, spikes CPU/memory. Restores original state after.
- Discover — Agent connects to the target and figures out what’s there (tables, pods, services, filesystems, etc.)
- Plan — The orchestrator (or an LLM) picks skills and sets parameters
- Execute — Skills run and each one saves what it needs for rollback
- Observe — Events get emitted in real time
- Rollback — When the duration expires (or something fails), everything reverts in LIFO order
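These phases map directly onto the CLI: a dry run validates and discovers but executes nothing, while a normal run goes all the way through rollback (both commands are covered in detail below):

# discovery and validation only; nothing is executed, nothing to roll back
chaos run config/example-db.yaml --dry-run

# full lifecycle: execute, emit events, roll back when the duration expires
chaos run config/example-db.yaml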
curl -fsSL https://raw.githubusercontent.com/system32-ai/chaos-agents/master/install.sh | bash
You can also set a specific version or install directory:
# pin a specific version
VERSION=v0.1.0 curl -fsSL https://raw.githubusercontent.com/system32-ai/chaos-agents/master/install.sh | bash
# custom install location
INSTALL_DIR=~/.local/bin curl -fsSL https://raw.githubusercontent.com/system32-ai/chaos-agents/master/install.sh | bash
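To sanity-check the install, confirm the binary is on your PATH and list the built-in skills (covered in more detail below):

command -v chaos
chaos list-skills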
Run chaos with no arguments to launch the interactive terminal UI:
chaos
The TUI walks you through a guided wizard:
- Select provider — Anthropic, OpenAI, or Ollama (auto-detects API keys from env)
- Configure provider — API key, model, max turns
- Select target — Database, Kubernetes, or Server
- Configure target — Connection URL, namespace, SSH hosts, etc.
- Enter prompt — Describe what chaos to run and set a duration
- Review & confirm — Check settings and press Enter to start
Once running, the dashboard shows four live panels:
| Panel | What it shows |
|---|---|
| Chat | LLM conversation, tool calls, and experiment lifecycle events |
| Resources | Discovered targets (tables, pods, services) |
| Skills | Execution progress for each skill |
| Rollback | Rollback step status |
Keyboard shortcuts during execution:
| Key | Action |
|---|---|
| `Tab` | Switch between panels |
| `Up`/`Down` | Scroll the Chat panel |
| `Ctrl+C` | Cancel experiment (stay in TUI) |
| `Ctrl+W` | Cancel experiment and quit |
| `q` | Quit (after experiment finishes) |
Plan and execute from the command line:
export OPENAI_API_KEY="sk.."
chaos agent "Test cockroachdb resilience at postgres://root@localhost:26257/mydb"
chaos list-skills
chaos list-skills --target database
chaos list-skills --target kubernetes
chaos list-skills --target server
SKILL TARGET DESCRIPTION
----------------------------------------------------------------------
db.insert_load database Bulk INSERT random rows into target tables
db.update_load database Randomly UPDATE existing rows in target tables
db.select_load database Generate heavy SELECT query load against target tables
db.config_change database ALTER database configuration parameters with rollback
db.table_lock database Acquire table-level locks to simulate lock contention
db.row_lock database Acquire row-level locks (SELECT FOR UPDATE) to simulate row contention
mongo.insert_load database Bulk INSERT random documents into MongoDB collections
mongo.update_load database Randomly UPDATE existing documents in MongoDB collections
mongo.find_load database Generate heavy read (find) query load against MongoDB collections
mongo.index_drop database Drop secondary indexes from MongoDB collections
mongo.profiling_change database Change MongoDB profiling level to add overhead
mongo.connection_pool_stress database Open many MongoDB connections to exhaust limits
crdb.zone_config_change database Change CockroachDB zone config (replication, GC TTL)
ysql.follower_reads database Toggle YugabyteDB follower reads for eventual consistency
k8s.pod_kill kubernetes Delete random pods matching label selector
k8s.node_drain kubernetes Cordon a node (mark unschedulable), rollback uncordons it
k8s.network_chaos kubernetes Apply deny-all NetworkPolicy to isolate pods
k8s.resource_stress kubernetes Deploy a stress-ng pod to consume cluster resources
server.disk_fill server Fill disk space with a large file, rollback removes it
server.permission_change server Change file permissions to disrupt services, rollback restores them
server.service_stop server Stop random running services, rollback restarts them
server.cpu_stress server Run stress-ng to load CPU, rollback kills the process
server.memory_stress server Run stress-ng to consume memory, rollback kills the process
chaos run config/example-db.yaml
chaos run config/example-k8s.yaml
chaos run config/example-server.yaml
# dry-run — validates and discovers but doesn't execute anything
chaos run config/example-db.yaml --dry-run
chaos validate config/example-db.yaml
Let an LLM look at your setup and decide what chaos to run. The provider is auto-detected from your API key environment variables:
# Anthropic — auto-detected from ANTHROPIC_API_KEY
export ANTHROPIC_API_KEY="sk-ant-..."
chaos plan "Test our PostgreSQL database resilience under heavy write load"
# OpenAI — auto-detected from OPENAI_API_KEY
export OPENAI_API_KEY="sk-..."
chaos plan "Kill random pods in the staging namespace"
# Ollama (local) — used as fallback when no API key is set
chaos plan "Stress test the web servers" --model llama3.1
# Explicit provider override
chaos plan "Break the database" --provider openai
# With MCP servers for extra context
chaos plan "Run chaos on the entire staging environment" --config config/example-llm.yaml
Plan and execute in one step — the LLM generates experiments, you review and approve:
# Plan and run interactively
chaos agent "Test our PostgreSQL database resilience under heavy write load"
# Target CockroachDB or YugabyteDB — auto-detected from prompt keywords
chaos agent "Test cockroachdb resilience at postgres://root@localhost:26257/mydb"
# MongoDB — auto-detected from mongodb:// URL
chaos agent "Load test mongodb://localhost:27017 collections"
# Preview the generated config without executing
chaos agent "Kill random pods in staging" --dry-run
# Auto-approve (skip confirmation)
chaos agent "Stress test the web servers" -y
# Save the generated config to a file and run
chaos agent "Fill disk on 10.0.1.50" --save plan.yaml
Run experiments on a cron schedule:
chaos daemon config/daemon.yaml
# with a PID file
chaos daemon config/daemon.yaml --pid-file /var/run/chaos.pid
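In production you would normally run the daemon under your init system; a minimal backgrounding sketch, assuming a writable log path:

nohup chaos daemon config/daemon.yaml --pid-file /var/run/chaos.pid >> /var/log/chaos-daemon.log 2>&1 &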
experiments:
  - name: "postgres-load-test"
    target: database
    target_config:
      connection_url: "postgres://user:pass@localhost:5432/mydb"
      db_type: postgres
    skills:
      - skill_name: "db.insert_load"
        params:
          rows_per_table: 10000
          tables: ["users", "orders"]
      - skill_name: "db.config_change"
        params:
          changes:
            - param: "work_mem"
              value: "4MB"
    duration: "5m"
    parallel: false
CockroachDB and YugabyteDB are PostgreSQL wire-compatible, so they use postgres:// connection URLs. The SQL skills (db.insert_load, db.select_load, db.update_load) work as-is. The db.config_change skill uses CockroachDB’s SET CLUSTER SETTING syntax automatically.
experiments:
  - name: "cockroachdb-resilience"
    target: database
    target_config:
      connection_url: "postgres://root@localhost:26257/mydb"
      db_type: cockroach_db
    skills:
      - skill_name: "db.insert_load"
        params:
          rows_per_table: 5000
      - skill_name: "crdb.zone_config_change"
        params:
          target: "DATABASE mydb"
          changes:
            - param: "num_replicas"
              value: "1"
            - param: "gc.ttlseconds"
              value: "600"
    duration: "5m"
experiments:
  - name: "yugabyte-consistency-test"
    target: database
    target_config:
      connection_url: "postgres://yugabyte@localhost:5433/mydb"
      db_type: yugabyte_db
    skills:
      - skill_name: "db.insert_load"
        params:
          rows_per_table: 5000
      - skill_name: "ysql.follower_reads"
        params:
          enable: true
          staleness: "60000ms"
    duration: "5m"
experiments:
  - name: "mongodb-load-test"
    target: database
    target_config:
      connection_url: "mongodb://localhost:27017"
      db_type: mongo_d_b
      databases: ["myapp"]
    skills:
      - skill_name: "mongo.insert_load"
        params:
          database: "myapp"
          docs_per_collection: 5000
      - skill_name: "mongo.update_load"
        params:
          database: "myapp"
          docs: 200
      - skill_name: "mongo.find_load"
        params:
          database: "myapp"
          query_count: 1000
      - skill_name: "mongo.index_drop"
        params:
          database: "myapp"
          max_per_collection: 2
      - skill_name: "mongo.profiling_change"
        params:
          database: "myapp"
          level: 2
    duration: "5m"
experiments:
  - name: "k8s-pod-chaos"
    target: kubernetes
    target_config:
      namespace: "staging"
      label_selector: "app=web"
    skills:
      - skill_name: "k8s.pod_kill"
        params:
          namespace: "staging"
          label_selector: "app=web"
          count: 2
      - skill_name: "k8s.network_chaos"
        params:
          namespace: "staging"
          pod_selector:
            app: "web"
    duration: "5m"
The server agent auto-discovers running services and picks targets based on what it finds:
experiments:
  - name: "server-chaos"
    target: server
    target_config:
      hosts:
        - host: "10.0.1.50"
          port: 22
          username: "chaos-agent"
          auth:
            type: key
            private_key_path: "~/.ssh/id_ed25519"
      discovery:
        enabled: true
        exclude_services: ["docker", "containerd"]
    skills:
      - skill_name: "server.service_stop"
        params:
          max_services: 2
      - skill_name: "server.disk_fill"
        params:
          size: "5GB"
          target_mount: "/tmp"
    duration: "10m"
    resource_filters:
      - "nginx.*"
      - "postgres.*"
settings:
  max_concurrent: 2
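The server target assumes the configured user can SSH in non-interactively with the referenced key; a minimal key setup sketch for the example host above:

# generate the key referenced in the config (skip if it already exists)
ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519 -N ""

# install it for the chaos user on the target host
ssh-copy-id -i ~/.ssh/id_ed25519.pub chaos-agent@10.0.1.50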
experiments:
  - experiment:
      name: "nightly-db-chaos"
      target: database
      target_config:
        connection_url: "postgres://chaos:pw@db:5432/staging"
        db_type: postgres
      skills:
        - skill_name: "db.insert_load"
          params:
            rows_per_table: 5000
      duration: "15m"
    schedule: "0 0 2 * * *"
    enabled: true
llm:
  provider: anthropic
  api_key: "${ANTHROPIC_API_KEY}"
  model: "claude-sonnet-4-5-20250929"
  max_tokens: 4096
  mcp_servers:
    - name: "prometheus-mcp"
      transport:
        type: stdio
        command: "npx"
        args: ["-y", "@modelcontextprotocol/server-prometheus"]
      env:
        PROMETHEUS_URL: "http://prometheus:9090"
  max_turns: 10
Every skill saves the original state before doing anything. Rollback happens in LIFO order — last thing changed gets reverted first.
| Skill | What it does | Rollback |
|---|---|---|
| `db.insert_load` | INSERT rows | DELETE by stored IDs |
| `db.update_load` | UPDATE rows | Restore original values |
| `db.select_load` | Heavy SELECT queries | No-op (read-only) |
| `db.config_change` | ALTER SYSTEM SET / SET CLUSTER SETTING | Restore original value |
| `db.table_lock` | Acquire table-level locks | Release locks on transaction end |
| `db.row_lock` | SELECT FOR UPDATE on rows | Release locks on transaction end |
| `mongo.insert_load` | INSERT documents | DELETE by stored ObjectIds |
| `mongo.update_load` | UPDATE documents | Replace with original documents |
| `mongo.find_load` | Heavy find/aggregate queries | No-op (read-only) |
| `mongo.index_drop` | Drop secondary indexes | Recreate indexes with original key/options |
| `mongo.profiling_change` | Set profiling level to 2 (all ops) | Restore original profiling level |
| `mongo.connection_pool_stress` | Open many connections | Connections drain on process exit |
| `crdb.zone_config_change` | ALTER zone config (replication, GC) | Re-apply original zone config |
| `ysql.follower_reads` | Enable follower reads + staleness | Restore original follower read settings |
| `k8s.pod_kill` | Delete pod | Verify replacement pod is running |
| `k8s.node_drain` | Cordon node | Uncordon node |
| `k8s.network_chaos` | Create deny-all NetworkPolicy | Delete the policy |
| `k8s.resource_stress` | Deploy stress-ng pod | Delete the pod |
| `server.disk_fill` | Allocate large file | Remove the file |
| `server.permission_change` | chmod to 000 | Restore original permissions |
| `server.service_stop` | systemctl stop | systemctl start |
| `server.cpu_stress` | Run stress-ng CPU | Kill the process |
| `server.memory_stress` | Run stress-ng memory | Kill the process |
If the process crashes mid-experiment, the rollback log is serializable so it can be replayed on restart.
- Adaptive chaos — agents that learn from past runs and escalate intensity on their own
- Multi-target experiments — coordinated chaos across DB + k8s + server in one go
- Observability integrations — Prometheus, Grafana, Datadog, PagerDuty
- Steady-state assertions — define what “healthy” looks like and let the agent check
- Cloud targets — AWS, GCP, Azure fault injection (Lambda throttling, S3 latency, IAM revocation)
- Distributed agent mesh — agents across regions for cascading failure scenarios
Join us on Discord for questions, feedback, and discussion.
MIT