Your AI Copilot for Incident Response
Investigate incidents, find root causes, and suggest fixes — automatically
Try Free in Slack · 5-Min Docker Setup · Deploy for Your Team
IncidentFox is an open-source AI SRE that integrates with your observability stack, infrastructure, and collaboration tools. It automatically forms hypotheses, collects data from your systems, and reasons through to find root causes — all while you focus on the fix.
Built for production on-call — handles log sampling, alert correlation, anomaly detection, and dependency mapping so you don’t have to.
IncidentFox helps find root causes and propose mitigations for production on-call issues, drawing on your infrastructure, observability tools, and code to reason its way to an answer.
Slack-first (see screenshot above), but it also works through the web UI, GitHub, PagerDuty, and the API.
Highly customizable — set up in minutes, and it self-improves by automatically learning and persisting your team’s context.
AI SRE is not a new idea. The problem? Most AI SREs don’t actually work — they lack the context to debug your specific systems.
Other tools ask you to manually configure integrations, write runbooks, and hope the AI figures it out. IncidentFox does the opposite.
On setup, we analyze your codebase, Slack history, and past incidents to understand how your org actually works. Internal CI/CD system with weird quirks? Custom deployment tooling? We learn it automatically and build integrations that work out of the box.
No weeks of integration work. No building your own MCP servers. We connect to the tools that actually matter for root cause — so you can skip straight to debugging.
We’re opinionated: you shouldn’t leave Slack during an incident.
- Upload a Grafana screenshot → we analyze it
- Attach a log file → we parse and correlate
- All tool outputs, evidence, and reasoning → visible as Slack attachments
- No new tabs. No context switching. Debug where you already work.
Our agents run in sandboxed environments with filesystem access — enabling code generation, script execution, and deep analysis. Security guardrails keep them focused on the task.
The result: Higher accuracy, faster resolution, less time wasted on integration work.
IncidentFox is open source (Apache 2.0). You can try it instantly in Slack, or deploy it yourself for full control. Pick the option that fits your needs:
| Option | Best For | Setup Time | Cost | Privacy | Guide |
|---|---|---|---|---|---|
| Try Free | See it in action | Instant | Free | Our playground environment | |
| Local Docker | Evaluate with your infra | 5 minutes | Free | Everything local | Setup Guide → |
| Managed (premium features) | Production, we handle ops | 30 minutes | Contact us (7-day free trial) | SaaS or on-prem, SOC 2 | |
| Self-Host (Open Core) | Production, full control | 30 minutes | Free | Everything local | Deployment Guide → |
New to IncidentFox? We recommend trying it in our Slack first: no setup is required, and you can see how it works instantly.
For Engineering Leaders: What this means for your team.
| Outcome | Impact |
|---|---|
| Faster Incident Resolution | Hours → minutes. Auto-correlates alerts, analyzes logs, traces dependencies. |
| 85-95% Less Alert Noise | Smart correlation finds root cause. Engineers focus on real problems. |
| Knowledge Retention | Learns your systems and runbooks. Knowledge stays when people leave. |
| Works on Day One | 300+ integrations. No months of setup — connect and go. |
| No Vendor Lock-In | Open source, bring your own LLM keys, deploy anywhere. |
| Gets Smarter Over Time | Learns from every investigation. Your expertise compounds. |
The bottom line: Less time firefighting, more time building.
IncidentFox connects to your existing tools and infrastructure. No manual setup required — configure once and it works everywhere.
| Category | Integrations |
|---|---|
| Logs & Metrics | Coralogix · Grafana · Elasticsearch · Datadog · Prometheus · Jaeger |
| Incidents | incident.io |
| Cloud & Infra | Kubernetes |
| Dev Tools | GitHub · Confluence |
| Category | Integrations |
|---|---|
| Logs & Metrics | CloudWatch · Splunk · OpenSearch · New Relic · Honeycomb · Dynatrace · Chronosphere · VictoriaMetrics · Kloudfuse · Sentry · Snowflake |
| Incidents | PagerDuty · Opsgenie · ServiceNow |
| Cloud & Infra | AWS · GCP · Azure · Temporal |
| Dev Tools | Jira · Linear · Notion · Glean |
Need an integration? Contact us or contribute one via MCP (Model Context Protocol); new integrations can be added in minutes.

    ┌───────────────────────────────────┐     ┌──────────────────────┐
    │ Slack / GitHub / PagerDuty / API  │     │        Web UI        │
    └─────────────────┬─────────────────┘     │  (dashboard, team    │
                      │ webhooks              │   management)        │
    ┌─────────────────▼─────────────────┐     └──────────┬───────────┘
    │           Orchestrator            │                │
    │  (routes webhooks, team lookup,   │                │
    │   token auth, audit logging)      │                │
    └────────┬─────────────────┬────────┘                │
             │                 │                         │
    ┌────────▼────────┐   ┌────▼─────────────────────────▼───┐
    │      Agent      │   │          Config Service          │
    │ (Claude/OpenAI, │   │  (multi-tenant cfg, RBAC,        │
    │  300+ tools,    │   │   routing, team hierarchy)       │
    │  multi-agent)   │   └─────────────────┬────────────────┘
    └────┬───────┬────┘                     │
         │       │                          ▼
         │       │              ┌───────────────────────┐
         │       │              │      PostgreSQL       │
         │       │              │  (config, audit,      │
         │       │              │   investigations)     │
         │       │              └───────────────────────┘
         │       │
         ▼       ▼
    ┌──────────┐ ┌─────────────────────────┐
    │ Knowledge│ │      External APIs      │
    │   Base   │ │  (K8s, AWS, Datadog,    │
    │ (RAPTOR) │ │   Grafana, etc.)        │
    └──────────┘ └─────────────────────────┘

Web Console — Easiest way to view and customize agents
The engineering that makes IncidentFox actually work in production:
| Capability | What It Does | Why It Matters |
|---|---|---|
| RAPTOR Knowledge Base | Hierarchical tree structure (ICLR 2024) — clusters → summarizes → abstracts | Standard RAG fails on 100-page runbooks. RAPTOR maintains context across long documents. |
| Smart Log Sampling | Statistics first → sample errors → drill down on anomalies | Other tools load 100K lines and hit context limits. We sample intelligently to stay useful. |
| Alert Correlation Engine | 3-layer analysis: temporal + topology + semantic | Groups alerts AND finds root cause. Reduces noise by 85-95%. |
| Prophet Anomaly Detection | Meta’s Prophet algorithm with seasonality-aware forecasting | Detects anomalies that account for daily/weekly patterns, not just static thresholds. |
| Dependency Discovery | Automatic service topology mapping with blast radius analysis | Know what’s affected before you start investigating. No manual service maps needed. |
| 300+ Built-in Tools | Kubernetes, AWS, Azure, GCP, Grafana, Datadog, Prometheus, GitHub, and more | No “bring your own tools” setup. Works out of the box with your stack. |
| MCP Protocol Support | Connect to any MCP server for unlimited integrations | Add new tools in minutes via config, not code. |
| Multi-Agent Orchestration | Planner routes to specialist agents (K8s, AWS, Metrics, Code, etc.) | Complex investigations get handled by the right expert, not a generic agent. |
| Model Flexibility | Supports OpenAI and Claude SDKs — use the model that fits your needs | No vendor lock-in. Switch models or use different models for different tasks. |
| Continuous Self-Improvement | Learns from investigations, persists patterns, builds team context | Gets smarter over time. Your past incidents inform future investigations. |
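
To make the "statistics first → sample errors" idea from the Smart Log Sampling row concrete, here is a minimal Python sketch. It is an illustration only, not IncidentFox's actual implementation: the function name `summarize_and_sample`, the `max_errors` parameter, and the level-matching regex are all assumptions made for this example.

```python
import random
import re
from collections import Counter

# Assumed log format: each line contains a conventional level keyword somewhere.
LEVEL_RE = re.compile(r"\b(DEBUG|INFO|WARN|ERROR|FATAL)\b")


def summarize_and_sample(lines, max_errors=5, seed=0):
    """Count lines per level, then return a bounded sample of error lines.

    Hypothetical helper: the point is that the caller sees cheap statistics
    for the full log and only a handful of raw lines.
    """
    counts = Counter()
    errors = []
    for line in lines:
        match = LEVEL_RE.search(line)
        level = match.group(1) if match else "UNKNOWN"
        counts[level] += 1
        if level in ("ERROR", "FATAL"):
            errors.append(line)

    # Sample instead of returning every error, so output size stays bounded.
    rng = random.Random(seed)
    if len(errors) > max_errors:
        sample = rng.sample(errors, max_errors)
    else:
        sample = errors
    return {"total": len(lines), "by_level": dict(counts), "error_sample": sample}


if __name__ == "__main__":
    logs = ["2024-01-01 INFO request ok"] * 95
    logs += [f"2024-01-01 ERROR db timeout {i}" for i in range(5)]
    print(summarize_and_sample(logs, max_errors=3))
```

The summary stays the same size whether the input has a hundred lines or a hundred thousand, which is what keeps it within a model's context limits. A drill-down step would then fetch more lines only around the sampled errors.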

RAPTOR knowledge base storing 50K+ docs as your proprietary knowledge
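
The Dependency Discovery row in the capability table mentions blast radius analysis. One common way to compute a blast radius, sketched here under stated assumptions, is a breadth-first walk over reversed dependency edges. The service names and the `depends_on` mapping are hypothetical, and this is not presented as IncidentFox's own algorithm.

```python
from collections import deque


def blast_radius(depends_on, failed):
    """Return every service that directly or transitively depends on `failed`.

    `depends_on` maps a service name to the services it calls
    (a hypothetical input shape chosen for this sketch).
    """
    # Invert the edges: for each service, record who calls it.
    dependents = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(service)

    affected = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for caller in dependents.get(current, ()):
            # Skip services already seen; this also makes cycles safe.
            if caller not in affected and caller != failed:
                affected.add(caller)
                queue.append(caller)
    return affected


if __name__ == "__main__":
    topology = {
        "checkout": ["payments", "cart"],
        "payments": ["postgres"],
        "cart": ["redis"],
        "search": ["elasticsearch"],
    }
    print(sorted(blast_radius(topology, "postgres")))
```

In this toy topology, a `postgres` failure reaches `payments` and then `checkout`, while `cart` and `search` are unaffected.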
Security, compliance, and deep customization for production deployments.
Every team is different — different tech stacks, observability tools, incident patterns, and services. Enterprise unlocks deep specialization:
| Feature | Description |
|---|---|
| Auto-Learn Your Org | We analyze your codebase, Slack history, and past incidents to identify which internal tools matter most for debugging. Then we auto-build integrations. |
| Team-Specific Agents | Each team gets agents tuned to their stack. Your payments team and your infra team have different needs — their agents reflect that. |
| Custom Prompts & Tools | Auto-learned defaults, with full control to tune. Engineers can adjust prompts, add tools, and configure agents per team. |
| Context Compounds | Every investigation makes IncidentFox smarter about your systems. Tribal knowledge gets captured, not lost. |
| Feature | Description |
|---|---|
| SOC 2 Compliant | Audited security controls, data handling, and access management |
| Sandboxed Execution | Isolated Kubernetes sandboxes for agent execution — no shared state between runs |
| Secrets Proxy | Credentials never touch the agent. Envoy proxy injects secrets at request time. |
| Approval Workflows | Critical changes (prompts, tools, configs) require review before deployment |
| SSO/OIDC | Google, Azure AD, Okta — per-organization configuration |
| Hierarchical Config | Org → Business Unit → Team inheritance with override capabilities |
| Audit Logging | Full trail of all agent actions, config changes, and investigations |
| On-Premises | Deploy entirely in your environment, with air-gapped support available |
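
The Hierarchical Config row describes Org → Business Unit → Team inheritance with overrides. A minimal sketch of how such layered resolution can work, assuming plain nested dictionaries and a "most specific level wins" rule, is shown below. The config keys (`model`, `tools`) and the function names are illustrative assumptions, not the real IncidentFox schema.

```python
def merge_config(parent, child):
    """Return a new dict where `child` values override `parent` values.

    Nested dicts are merged key by key. Any other value type is replaced outright.
    """
    merged = dict(parent)
    for key, value in child.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = value
    return merged


def resolve(org, business_unit, team):
    """Apply the layers in order so the most specific level wins."""
    return merge_config(merge_config(org, business_unit), team)


if __name__ == "__main__":
    # Hypothetical settings at each level of the hierarchy.
    org = {"model": "claude", "tools": {"kubernetes": True, "aws": True}}
    business_unit = {"tools": {"aws": False}}
    team = {"model": "openai", "tools": {"datadog": True}}
    print(resolve(org, business_unit, team))
```

Here the team inherits `kubernetes` from the org, picks up the business unit's decision to disable `aws`, and adds its own `datadog` tool and model choice.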
We welcome contributions! See issues labeled "good first issue" to get started.
For bugs or feature requests, open an issue on GitHub.
Claude Code Plugin — Standalone SRE tools for individual developers using Claude Code CLI. Not connected to the IncidentFox platform above.
Built with ❤️ by the IncidentFox team





