Dataset
Our benchmark uses a curated dataset containing both AI-generated images and real photographs. AI images are generated with popular models including Midjourney, Stable Diffusion (SDXL, SD 3.5), DALL-E 3, Flux, Adobe Firefly, Leonardo.ai, Runway, Google Imagen, and Ideogram. Real images are sourced from photography databases to test for false positives. The dataset is designed to represent realistic use cases — not cherry-picked easy examples.
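As a rough illustration, each benchmark image can be thought of as a labeled record pairing a file with its ground truth. The sketch below assumes a simple in-memory representation; the field names (`path`, `label`, `generator`) are illustrative and are not the benchmark's actual schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class BenchmarkImage:
    path: str                 # location of the image file
    label: str                # ground truth: "ai" or "real"
    generator: Optional[str]  # e.g. "Midjourney", "SDXL"; None for real photos

examples = [
    BenchmarkImage("images/0001.png", "ai", "Midjourney"),
    BenchmarkImage("images/0002.jpg", "real", None),
]
```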
How Detectors Are Tested
Each image in the dataset is submitted to every detector under the same conditions. We record whether the detector classified the image as AI-generated or real, along with any confidence scores provided. Results are aggregated to compute accuracy, false positive rate, and false negative rate for each detector.
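A minimal sketch of that evaluation loop is shown below, using the `BenchmarkImage` records sketched above. It assumes each detector is exposed as a callable returning a verdict ("ai" or "real") and a confidence score; that interface is an assumption for illustration, not a description of any detector's API.

```python
def run_benchmark(images, detectors):
    """Submit every image to every detector and record one row per pair."""
    results = []
    for name, detect in detectors.items():
        for img in images:
            verdict, confidence = detect(img.path)  # same input for every detector
            results.append({
                "detector": name,
                "image": img.path,
                "truth": img.label,
                "verdict": verdict,
                "confidence": confidence,
            })
    return results
```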
Metrics Explained
- Accuracy — Percentage of correct predictions across all images (both AI and real)
- False Positive Rate (FP) — Percentage of real images incorrectly flagged as AI-generated. A high FP rate means the detector is too aggressive
- False Negative Rate (FN) — Percentage of AI images the detector missed (classified as real). A high FN rate means the detector is too lenient
A detector with high accuracy but a high false positive rate may be unsuitable for contexts where wrongly accusing someone of using AI has consequences. Conversely, a low false negative rate matters most when catching AI content is the priority.
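The sketch below shows how these three metrics follow from the recorded verdicts (rows as produced by the `run_benchmark` sketch above). It is a minimal illustration of the definitions, not the benchmark's actual scoring code.

```python
def compute_metrics(rows):
    """Compute accuracy, FP rate, and FN rate from recorded verdicts."""
    real = [r for r in rows if r["truth"] == "real"]
    ai = [r for r in rows if r["truth"] == "ai"]

    accuracy = sum(r["verdict"] == r["truth"] for r in rows) / len(rows)
    fp_rate = sum(r["verdict"] == "ai" for r in real) / len(real)   # real images flagged as AI
    fn_rate = sum(r["verdict"] == "real" for r in ai) / len(ai)     # AI images missed
    return {"accuracy": accuracy, "fp_rate": fp_rate, "fn_rate": fn_rate}
```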
Arena Elo Rankings
In addition to the static benchmark, the Arena provides live rankings using an Elo rating system. Users are shown an image alongside two detectors' verdicts and vote for the one they judge more accurate. Over thousands of votes, detectors that consistently give correct answers rise in the rankings. Elo ratings complement the benchmark by incorporating real-world user judgment.
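For reference, a single vote updates two ratings with the standard Elo formula: the detector whose verdict the user preferred is treated as the winner. The sketch below uses an illustrative K-factor of 32 and a starting rating of 1000; these are common defaults, not the Arena's actual parameters.

```python
def elo_update(winner_rating, loser_rating, k=32):
    """Apply one pairwise Elo update after a vote."""
    expected_win = 1 / (1 + 10 ** ((loser_rating - winner_rating) / 400))
    winner_rating += k * (1 - expected_win)   # winner gains more when the upset is bigger
    loser_rating -= k * (1 - expected_win)
    return winner_rating, loser_rating

ratings = {"detector_a": 1000.0, "detector_b": 1000.0}
# A user preferred detector_a's verdict on this image:
ratings["detector_a"], ratings["detector_b"] = elo_update(
    ratings["detector_a"], ratings["detector_b"]
)
```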