Reader

Open-source, production-grade web scraping engine built for LLMs.

Scrape and crawl the entire web, clean markdown, ready for your agents.


Docs · Examples · Discord

Reader demo — scrape any URL to clean markdown

Building agents that need web access is frustrating. You piece together Puppeteer, add stealth plugins, fight Cloudflare, manage proxies, and it still breaks in production.

That's because production-grade web scraping isn't about rendering a page and converting HTML to markdown. It's about everything underneath:

| Layer | What it actually takes |
| --- | --- |
| Browser architecture | Managing browser instances at scale, not one-off scripts |
| Anti-bot bypass | Cloudflare, Turnstile, and JS challenges all block naive scrapers |
| TLS fingerprinting | Real browsers have fingerprints. Puppeteer doesn't. Sites know. |
| Proxy infrastructure | Datacenter vs. residential, rotation strategies, sticky sessions |
| Resource management | Browser pooling, memory limits, graceful recycling |
| Reliability | Rate limiting, retries, timeouts, caching, graceful degradation |

I built Reader, a production-grade web scraping engine on top of Ulixee Hero, a headless browser designed for exactly this.

Two primitives. That’s it.

import { ReaderClient } from "@vakra-dev/reader";

const reader = new ReaderClient();

// Scrape URLs → clean markdown
const result = await reader.scrape({ urls: ["https://example.com"] });
console.log(result.data[0].markdown);

// Crawl a site → discover + scrape pages
const pages = await reader.crawl({
  url: "https://example.com",
  depth: 2,
  scrape: true,
});
console.log(`Found ${pages.urls.length} pages`);

All the hard stuff (browser pooling, challenge detection, proxy rotation, retries) happens under the hood. You get clean markdown. Your agents get the web.

Tip

If Reader is useful to you, a star on GitHub helps others discover the project.

  • Cloudflare Bypass – TLS fingerprinting, DNS over TLS, WebRTC masking
  • Clean Output – Markdown and HTML with automatic main content extraction
  • Smart Content Cleaning – Removes nav, headers, footers, popups, cookie banners
  • CLI & API – Use from command line or programmatically
  • Browser Pool – Auto-recycling, health monitoring, queue management
  • Concurrent Scraping – Parallel URL processing with progress tracking
  • Website Crawling – BFS link discovery with depth/page limits
  • Proxy Support – Datacenter and residential with sticky sessions
npm install @vakra-dev/reader

Requirements: Node.js >= 18

import { ReaderClient } from "@vakra-dev/reader";

const reader = new ReaderClient();

const result = await reader.scrape({
  urls: ["https://example.com"],
  formats: ["markdown", "html"],
});

console.log(result.data[0].markdown);
console.log(result.data[0].html);

await reader.close();

Batch Scraping with Concurrency

import { ReaderClient } from "@vakra-dev/reader";

const reader = new ReaderClient();

const result = await reader.scrape({
  urls: ["https://example.com", "https://example.org", "https://example.net"],
  formats: ["markdown"],
  batchConcurrency: 3,
  onProgress: (progress) => {
    console.log(`${progress.completed}/${progress.total}: ${progress.currentUrl}`);
  },
});

console.log(`Scraped ${result.batchMetadata.successfulUrls} URLs`);

await reader.close();

Crawl a Website

import { ReaderClient } from "@vakra-dev/reader";

const reader = new ReaderClient();

const result = await reader.crawl({
  url: "https://example.com",
  depth: 2,
  maxPages: 20,
  scrape: true,
});

console.log(`Discovered ${result.urls.length} URLs`);
console.log(`Scraped ${result.scraped?.batchMetadata.successfulUrls} pages`);

await reader.close();

With a Proxy

import { ReaderClient } from "@vakra-dev/reader";

const reader = new ReaderClient();

const result = await reader.scrape({
  urls: ["https://example.com"],
  formats: ["markdown"],
  proxy: {
    type: "residential",
    host: "proxy.example.com",
    port: 8080,
    username: "username",
    password: "password",
    country: "us",
  },
});

await reader.close();

With Proxy Rotation

import { ReaderClient } from "@vakra-dev/reader";

const reader = new ReaderClient({
  proxies: [
    { host: "proxy1.example.com", port: 8080, username: "user", password: "pass" },
    { host: "proxy2.example.com", port: 8080, username: "user", password: "pass" },
  ],
  proxyRotation: "round-robin", // or "random"
});

const result = await reader.scrape({
  urls: ["https://example.com", "https://example.org"],
  formats: ["markdown"],
  batchConcurrency: 2,
});

await reader.close();

With Browser Pool Configuration

import { ReaderClient } from "@vakra-dev/reader";

const reader = new ReaderClient({
  browserPool: {
    size: 5, // 5 browser instances
    retireAfterPages: 50, // Recycle after 50 pages
    retireAfterMinutes: 15, // Recycle after 15 minutes
  },
  verbose: true,
});

const result = await reader.scrape({
  urls: manyUrls,
  batchConcurrency: 5,
});

await reader.close();

For multiple requests, start a daemon to keep the browser pool warm:

# Start daemon with browser pool
npx reader start --pool-size 5

# All subsequent commands auto-connect to daemon
npx reader scrape https://example.com
npx reader crawl https://example.com -d 2

# Check daemon status
npx reader status

# Stop daemon
npx reader stop

# Force standalone mode (bypass daemon)
npx reader scrape https://example.com --standalone

The scrape command scrapes one or more URLs.

# Scrape a single URL
npx reader scrape https://example.com

# Scrape with multiple formats
npx reader scrape https://example.com -f markdown,html

# Scrape multiple URLs concurrently
npx reader scrape https://example.com https://example.org -c 2

# Save to file
npx reader scrape https://example.com -o output.md

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| -f, --format | string | "markdown" | Output formats (comma-separated: markdown,html) |
| -o, --output | string | stdout | Output file path |
| -c, --concurrency | number | 1 | Parallel requests |
| -t, --timeout | number | 30000 | Request timeout in milliseconds |
| --batch-timeout | number | 300000 | Total timeout for entire batch operation |
| --proxy | string | | Proxy URL (e.g., http://user:pass@host:port) |
| --user-agent | string | | Custom user agent string |
| --show-chrome | flag | | Show browser window for debugging |
| --no-main-content | flag | | Disable main content extraction (include full page) |
| --include-tags | string | | CSS selectors for elements to include (comma-separated) |
| --exclude-tags | string | | CSS selectors for elements to exclude (comma-separated) |
| -v, --verbose | flag | | Enable verbose logging |

The crawl command crawls a website to discover pages.

# Crawl with default settings
npx reader crawl https://example.com

# Crawl deeper with more pages
npx reader crawl https://example.com -d 3 -m 50

# Crawl and scrape content
npx reader crawl https://example.com -d 2 --scrape

# Filter URLs with patterns
npx reader crawl https://example.com --include "blog/*" --exclude "admin/*"

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| -d, --depth | number | 1 | Maximum crawl depth |
| -m, --max-pages | number | 20 | Maximum pages to discover |
| -s, --scrape | flag | | Also scrape content of discovered pages |
| -f, --format | string | "markdown" | Output formats when scraping (comma-separated) |
| -o, --output | string | stdout | Output file path |
| --delay | number | 1000 | Delay between requests in milliseconds |
| -t, --timeout | number | | Total timeout for crawl operation |
| --include | string | | URL patterns to include (comma-separated regex) |
| --exclude | string | | URL patterns to exclude (comma-separated regex) |
| --proxy | string | | Proxy URL (e.g., http://user:pass@host:port) |
| --user-agent | string | | Custom user agent string |
| --show-chrome | flag | | Show browser window for debugging |
| -v, --verbose | flag | | Enable verbose logging |

ReaderClient is the recommended way to use Reader. It manages the HeroCore lifecycle automatically.

import { ReaderClient } from "@vakra-dev/reader";

const reader = new ReaderClient({ verbose: true });

// Scrape
const result = await reader.scrape({ urls: ["https://example.com"] });

// Crawl
const crawlResult = await reader.crawl({ url: "https://example.com", depth: 2 });

// Close when done (optional - auto-closes on exit)
await reader.close();

ReaderClient constructor options:

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| verbose | boolean | false | Enable verbose logging |
| showChrome | boolean | false | Show browser window for debugging |
| browserPool | BrowserPoolConfig | undefined | Browser pool configuration (size, recycling) |
| proxies | ProxyConfig[] | undefined | Array of proxies for rotation |
| proxyRotation | string | "round-robin" | Rotation strategy: "round-robin" or "random" |

BrowserPoolConfig options (passed as browserPool):

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| size | number | 2 | Number of browser instances in pool |
| retireAfterPages | number | 100 | Recycle browser after N page loads |
| retireAfterMinutes | number | 30 | Recycle browser after N minutes |
| maxQueueSize | number | 100 | Max pending requests in queue |
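
A short sketch of how these pool options fit together; the specific values are illustrative, not recommendations:

import { ReaderClient } from "@vakra-dev/reader";

// Illustrative settings: small pool, aggressive recycling, bounded queue.
const reader = new ReaderClient({
  browserPool: {
    size: 3,
    retireAfterPages: 50,
    retireAfterMinutes: 10,
    maxQueueSize: 50,
  },
});

const result = await reader.scrape({
  urls: ["https://example.com"],
  batchConcurrency: 3,
});

await reader.close();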

| Method | Description |
| --- | --- |
| scrape(options) | Scrape one or more URLs |
| crawl(options) | Crawl a website to discover pages |
| start() | Pre-initialize HeroCore (optional) |
| isReady() | Check if client is initialized |
| close() | Close client and release resources |
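
A minimal sketch of the lifecycle methods from the table above; the exact return shape of isReady() isn't shown in this document, so treat the log line as illustrative:

import { ReaderClient } from "@vakra-dev/reader";

const reader = new ReaderClient();

// Optional: pre-initialize HeroCore so the first scrape doesn't pay startup cost.
await reader.start();
console.log("ready:", await reader.isReady()); // illustrative readiness check

const result = await reader.scrape({ urls: ["https://example.com"] });
console.log(result.data[0].markdown);

// Release browsers and connections when done.
await reader.close();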

The scrape() function scrapes one or more URLs. It can be used directly or via ReaderClient.

| Option | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| urls | string[] | Yes | | Array of URLs to scrape |
| formats | Array | No | ["markdown"] | Output formats |
| onlyMainContent | boolean | No | true | Extract only main content (removes nav/header/footer) |
| includeTags | string[] | No | [] | CSS selectors for elements to keep |
| excludeTags | string[] | No | [] | CSS selectors for elements to remove |
| userAgent | string | No | | Custom user agent string |
| timeoutMs | number | No | 30000 | Request timeout in milliseconds |
| includePatterns | string[] | No | [] | URL patterns to include (regex strings) |
| excludePatterns | string[] | No | [] | URL patterns to exclude (regex strings) |
| batchConcurrency | number | No | 1 | Number of URLs to process in parallel |
| batchTimeoutMs | number | No | 300000 | Total timeout for entire batch operation |
| maxRetries | number | No | 2 | Maximum retry attempts for failed URLs |
| onProgress | function | No | | Progress callback: ({ completed, total, currentUrl }) => void |
| proxy | ProxyConfig | No | | Proxy configuration object |
| waitForSelector | string | No | | CSS selector to wait for before page is loaded |
| verbose | boolean | No | false | Enable verbose logging |
| showChrome | boolean | No | false | Show Chrome window for debugging |
| connectionToCore | any | No | | Connection to shared Hero Core (for production) |

Returns: Promise&lt;ScrapeResult&gt;

interface ScrapeResult {
  data: WebsiteScrapeResult[];
  batchMetadata: BatchMetadata;
}

interface WebsiteScrapeResult {
  markdown?: string;
  html?: string;
  metadata: {
    baseUrl: string;
    totalPages: number;
    scrapedAt: string;
    duration: number;
    website: WebsiteMetadata;
  };
}

interface BatchMetadata {
  totalUrls: number;
  successfulUrls: number;
  failedUrls: number;
  scrapedAt: string;
  totalDuration: number;
  errors?: Array<{ url: string; error: string }>;
}
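
For example, a sketch that combines the content-selection options above; the URL and CSS selectors are placeholders:

import { scrape } from "@vakra-dev/reader";

// Wait for the article node, keep only the article body, and drop
// sidebars/comments before converting to markdown.
const result = await scrape({
  urls: ["https://example.com/blog/post"],
  formats: ["markdown"],
  onlyMainContent: true,
  includeTags: ["article"],
  excludeTags: [".sidebar", "#comments"],
  waitForSelector: "article",
  maxRetries: 3,
});

console.log(result.data[0].markdown);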

The crawl() function crawls a website to discover pages.

| Option | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| url | string | Yes | | Single seed URL to start crawling from |
| depth | number | No | 1 | Maximum depth to crawl |
| maxPages | number | No | 20 | Maximum pages to discover |
| scrape | boolean | No | false | Also scrape full content of discovered pages |
| delayMs | number | No | 1000 | Delay between requests in milliseconds |
| timeoutMs | number | No | | Total timeout for entire crawl operation |
| includePatterns | string[] | No | | URL patterns to include (regex strings) |
| excludePatterns | string[] | No | | URL patterns to exclude (regex strings) |
| formats | Array | No | ["markdown", "html"] | Output formats for scraped content |
| scrapeConcurrency | number | No | 2 | Number of URLs to scrape in parallel |
| proxy | ProxyConfig | No | | Proxy configuration object |
| userAgent | string | No | | Custom user agent string |
| verbose | boolean | No | false | Enable verbose logging |
| showChrome | boolean | No | false | Show Chrome window for debugging |
| connectionToCore | any | No | | Connection to shared Hero Core (for production) |

Returns: Promise&lt;CrawlResult&gt;

interface CrawlResult {
  urls: CrawlUrl[];
  scraped?: ScrapeResult;
  metadata: CrawlMetadata;
}

interface CrawlUrl {
  url: string;
  title: string;
  description: string | null;
}

interface CrawlMetadata {
  totalUrls: number;
  maxDepth: number;
  totalDuration: number;
  seedUrl: string;
}
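
A sketch using the pattern filters above via ReaderClient; the regex strings are placeholders:

import { ReaderClient } from "@vakra-dev/reader";

const reader = new ReaderClient();

// Follow only blog URLs, skip admin pages, and scrape what gets discovered.
const result = await reader.crawl({
  url: "https://example.com",
  depth: 2,
  maxPages: 50,
  includePatterns: ["blog/.*"],
  excludePatterns: ["admin/.*"],
  scrape: true,
  scrapeConcurrency: 2,
});

for (const page of result.urls) {
  console.log(page.url, page.title);
}

await reader.close();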

ProxyConfig options:

| Option | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| url | string | No | | Full proxy URL (takes precedence over other fields) |
| type | "datacenter" \| "residential" | No | | Proxy type |
| host | string | No | | Proxy host |
| port | number | No | | Proxy port |
| username | string | No | | Proxy username |
| password | string | No | | Proxy password |
| country | string | No | | Country code for residential proxies (e.g., "us", "uk") |
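
For instance, the url field lets you pass a whole proxy URL instead of separate host/port/credential fields; the values below are placeholders:

import { ReaderClient } from "@vakra-dev/reader";

const reader = new ReaderClient();

// Single proxy URL; takes precedence over host/port/username/password.
const result = await reader.scrape({
  urls: ["https://example.com"],
  formats: ["markdown"],
  proxy: { url: "http://user:pass@proxy.example.com:8080" },
});

await reader.close();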

For high-volume scraping, use the browser pool directly:

import { BrowserPool } from "@vakra-dev/reader";

const pool = new BrowserPool({ size: 5 });
await pool.initialize();

// Use withBrowser for automatic acquire/release
const title = await pool.withBrowser(async (hero) => {
  await hero.goto("https://example.com");
  return await hero.document.title;
});

// Check pool health
const health = await pool.healthCheck();
console.log(`Pool healthy: ${health.healthy}`);

await pool.shutdown();

Shared Hero Core (Production)

For production servers, use a shared Hero Core to avoid spawning new Chrome for each request:

import HeroCore from "@ulixee/hero-core";
import { TransportBridge } from "@ulixee/net";
import { ConnectionToHeroCore } from "@ulixee/hero";
import { scrape } from "@vakra-dev/reader";

// Initialize once at startup
const heroCore = new HeroCore();
await heroCore.start();

// Create connection for each request
function createConnection() {
  const bridge = new TransportBridge();
  heroCore.addConnection(bridge.transportToClient);
  return new ConnectionToHeroCore(bridge.transportToCore);
}

// Use in requests
const result = await scrape({
  urls: ["https://example.com"],
  connectionToCore: createConnection(),
});

// Shutdown on exit
await heroCore.close();
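
To show where the shared core fits in a long-running process, here is a sketch (not from the upstream docs) that wires the same pattern into a plain Node HTTP endpoint; the route shape and port are arbitrary:

import http from "node:http";
import HeroCore from "@ulixee/hero-core";
import { TransportBridge } from "@ulixee/net";
import { ConnectionToHeroCore } from "@ulixee/hero";
import { scrape } from "@vakra-dev/reader";

// One shared core for the whole process.
const heroCore = new HeroCore();
await heroCore.start();

function createConnection() {
  const bridge = new TransportBridge();
  heroCore.addConnection(bridge.transportToClient);
  return new ConnectionToHeroCore(bridge.transportToCore);
}

// GET /?url=https://example.com returns the page as markdown.
http
  .createServer(async (req, res) => {
    const target = new URL(req.url ?? "/", "http://localhost").searchParams.get("url");
    if (!target) {
      res.writeHead(400).end("missing ?url= parameter");
      return;
    }
    try {
      const result = await scrape({
        urls: [target],
        formats: ["markdown"],
        connectionToCore: createConnection(),
      });
      res.writeHead(200, { "Content-Type": "text/markdown" }).end(result.data[0].markdown);
    } catch (err) {
      res.writeHead(500).end(String(err));
    }
  })
  .listen(3000);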

Cloudflare Challenge Detection

import { detectChallenge, waitForChallengeResolution } from "@vakra-dev/reader";

const detection = await detectChallenge(hero);

if (detection.isChallenge) {
  console.log(`Challenge detected: ${detection.type}`);

  const result = await waitForChallengeResolution(hero, {
    maxWaitMs: 45000,
    pollIntervalMs: 500,
    verbose: true,
    initialUrl: await hero.url,
  });

  if (result.resolved) {
    console.log(`Challenge resolved via ${result.method} in ${result.waitedMs}ms`);
  }
}

Output Formatters

import { formatToMarkdown, formatToText, formatToHTML, formatToJson } from "@vakra-dev/reader";

// Format pages to different outputs
const markdown = formatToMarkdown(pages, baseUrl, scrapedAt, duration, metadata);
const text = formatToText(pages, baseUrl, scrapedAt, duration, metadata);
const html = formatToHTML(pages, baseUrl, scrapedAt, duration, metadata);
const json = formatToJson(pages, baseUrl, scrapedAt, duration, metadata);

Reader uses Ulixee Hero, a headless browser with advanced anti-detection:

  1. TLS Fingerprinting – Emulates real Chrome browser fingerprints
  2. DNS over TLS – Uses Cloudflare DNS (1.1.1.1) to mimic Chrome behavior
  3. WebRTC IP Masking – Prevents IP leaks
  4. Multi-Signal Detection – Detects challenges using DOM elements and text patterns
  5. Dynamic Waiting – Polls for challenge resolution with URL redirect detection

Browser pool management:

  • Auto-Recycling – Browsers recycled after 100 requests or 30 minutes
  • Health Monitoring – Background health checks every 5 minutes
  • Request Queuing – Queues requests when pool is full (max 100)

HTML to Markdown: supermarkdown

Reader uses supermarkdown for HTML to Markdown conversion – a sister project we built from scratch specifically for web scraping and LLM pipelines.

Why we built it:

When you’re scraping the web, you encounter messy, malformed HTML that breaks most converters. And when you’re feeding content to LLMs, you need clean output without artifacts or noise. We needed a converter that handles real-world HTML reliably while producing high-quality markdown.

What supermarkdown offers:

| Feature | Benefit |
| --- | --- |
| Written in Rust | Native performance with Node.js bindings via napi-rs |
| Full GFM support | Tables, task lists, strikethrough, autolinks |
| LLM-optimized | Clean output designed for AI consumption |
| Battle-tested | Handles malformed HTML from real web pages |
| CSS selectors | Include/exclude elements during conversion |

supermarkdown is open source and available as both a Rust crate and npm package:

# npm
npm install @vakra-dev/supermarkdown

# Rust
cargo add supermarkdown

Check out the supermarkdown repository for examples and documentation.

Reader uses a real Chromium browser under the hood. On headless Linux servers (VPS, EC2, etc.), you need to install Chrome’s system dependencies:

# Debian/Ubuntu
sudo apt-get install -y libnspr4 libnss3 libatk1.0-0 libatk-bridge2.0-0 \
  libcups2 libxcb1 libatspi2.0-0 libx11-6 libxcomposite1 libxdamage1 \
  libxext6 libxfixes3 libxrandr2 libgbm1 libcairo2 libpango-1.0-0 libasound2

This is the same requirement that Puppeteer and Playwright have on headless Linux. macOS, Windows, and Linux desktops already have these libraries.

For Docker and production deployment guides, see the deployment documentation.

Full documentation is available at docs.reader.dev, including guides for scraping, crawling, proxy configuration, browser pool management, and deployment.

# Install dependencies
npm install

# Run linting
npm run lint

# Format code
npm run format

# Type check
npm run typecheck

# Find TODOs
npm run todo

Contributions welcome! Please see CONTRIBUTING.md for guidelines.

Apache 2.0 – See LICENSE for details.

If you use Reader in your research or project, please cite it:

@software{reader.dev,
  author = {Kaul, Nihal},
  title = {Reader: Open-source, production-grade web scraping engine built for LLMs},
  year = {2026},
  publisher = {GitHub},
  url = {https://github.com/vakra-dev/reader}
}


