Web-reading tools for a sovereign agent

Zoya runs three tools for the web (web_fetch, mcp_lightpanda, mcp_scrapling) plus a sovereign-first search chain. The other eighty-odd projects in this space were evaluated and either rejected or deferred. This is the decision record: what runs, what was cut, and the dimension that decided each call. Four facts about Zoya did most of the filtering before quality was assessed: it is sovereign, single-user, MCP-native, and it is already the reasoning agent rather than a tool that needs one bolted on.

Zoya's web ladder: start cheap, escalate only when the cheaper tier fails.

The constraints did most of the work

Four facts about what Zoya is remove most of the market before quality is assessed:

Sovereign / self-hosted. No managed cloud services. That deletes the entire managed-browser and managed-extraction tier on contact: Browserbase, Bright Data, ScrapingBee, Diffbot, and friends.
Single-user personal agent, not a scraping farm or a RAG-ingestion pipeline. "Run 100+ concurrent headless" and "crawl an entire docs site" aren't the workload; reading a URL inside a conversation is.
MCP-native. Zoya is an MCP client; new tools plug in over stdio through a glibc sidecar. Anything that ships an MCP server drops in cleanly; anything that needs wrapping pays a tax.
Zoya is already the agent. Its runtime wraps an LLM and runs the loop itself, so it already reasons. It does not need a second LLM to drive a browser, and it turns fetched content into typed JSON natively. This single fact rules out two whole categories (agent harnesses and LLM extractors) as redundant.
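
To make "plugs in over stdio" concrete, here is a minimal sketch of the message an MCP client writes to a server's stdin to invoke a tool. The `tools/call` method and its `name` / `arguments` params follow the MCP specification's JSON-RPC 2.0 framing. The tool name `goto` is taken from the Lightpanda tool list in this document; the `url` argument name and the helper `make_tool_call` are assumptions for illustration, not Zoya's actual code.

```python
import json
from itertools import count

_next_id = count(1)  # JSON-RPC request ids; any unique value per request works


def make_tool_call(tool: str, arguments: dict) -> str:
    """Serialize one MCP `tools/call` request as a single line of JSON.

    The stdio transport is newline-delimited, so each message must fit on
    one line. json.dumps escapes newlines inside strings, which keeps the
    output on a single line.
    """
    message = {
        "jsonrpc": "2.0",
        "id": next(_next_id),
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }
    return json.dumps(message, separators=(",", ":"))


# Hypothetical call: the `url` argument name is an assumption.
line = make_tool_call("goto", {"url": "https://example.com"})
print(line)
```

A tool that already ships an MCP server accepts this message as-is; a tool that does not needs an adapter that translates it into the tool's own API, which is the wrapping tax described above.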

What we run

A three-tier escalation ladder for fetching, plus a search backend:

web_fetch: an in-process HTTP GET (curl-shaped), no JavaScript. The default for static pages, JSON APIs, RSS, downloads. Cheapest and fastest; tried first.
mcp_lightpanda_*: Lightpanda, a Zig headless browser, for JS-rendered pages and interaction (goto, markdown, fill, click, detectForms, evaluate). Used when web_fetch comes back with an empty shell or a "please enable JavaScript" page.
mcp_scrapling_*: Scrapling, for bot-protected sites (TLS fingerprinting, Cloudflare/Turnstile) and adaptive extraction. The tier above a plain browser.
Search: web_search runs a sovereign-first chain, SearXNG (self-hosted meta-search) → Brave Search API → DuckDuckGo, falling through on empty results. The agent can also force a specific engine per call (backend), so it can cross-check two engines on a hard lookup.
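
The search fall-through can be sketched as follows. This is an illustration under stated assumptions, not Zoya's implementation: the engines are stub callables, the `web_search` and `backend` names mirror the description above, and treating an engine error like an empty result is an assumption the text does not confirm.

```python
from typing import Callable, Optional

# A search engine is any callable mapping a query to a list of result URLs.
Engine = Callable[[str], list]


def web_search(query: str, engines: dict, backend: Optional[str] = None):
    """Return (engine_name, results) from the first engine that has results.

    `engines` is ordered (dicts preserve insertion order), so the sovereign
    engine goes first. Passing `backend` forces one engine and disables
    fall-through.
    """
    if backend is not None:
        return backend, engines[backend](query)
    for name, engine in engines.items():
        try:
            results = engine(query)
        except Exception:
            results = []  # assumption: an engine error counts as "empty"
        if results:
            return name, results
    return "none", []


# Stub engines standing in for the real backends.
chain = {
    "searxng": lambda q: [],  # self-hosted, came up empty this time
    "brave": lambda q: ["https://example.com/?q=" + q],
    "duckduckgo": lambda q: ["https://example.org/"],
}

print(web_search("zig headless browser", chain))
print(web_search("zig headless browser", chain, backend="duckduckgo"))
```

Forcing `backend` on two separate calls is how the agent would cross-check two engines on a hard lookup.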

The agent-facing rule is one line: try cheap, escalate only when the cheaper tier fails.
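
A minimal sketch of that rule, assuming each tier is exposed as a plain callable from URL to page text. The tier names come from this document; the marker strings in `looks_unusable` and the function names are hypothetical heuristics chosen for illustration.

```python
from typing import Callable

Fetcher = Callable[[str], str]  # url -> page text (HTML or markdown)

# Assumed markers for "this tier did not really get the page".
JS_MARKERS = ("enable javascript", "javascript is required")
BLOCK_MARKERS = ("just a moment", "verify you are human")


def looks_unusable(body: str) -> bool:
    """Heuristic: empty body, JS-required notice, or bot challenge page."""
    text = body.strip().lower()
    if not text:
        return True
    return any(marker in text for marker in JS_MARKERS + BLOCK_MARKERS)


def fetch_with_ladder(url: str, tiers: list):
    """Try tiers cheapest-first; return (tier_name, body) from the first usable one."""
    last_error = None
    for name, fetch in tiers:
        try:
            body = fetch(url)
        except Exception as exc:
            last_error = exc  # remember the failure, then escalate
            continue
        if not looks_unusable(body):
            return name, body
    raise RuntimeError("all tiers failed for " + url) from last_error


# Stub fetchers standing in for the real tools.
tiers = [
    ("web_fetch", lambda u: "<noscript>Please enable JavaScript</noscript>"),
    ("lightpanda", lambda u: "# Rendered page\n\nActual content."),
    ("scrapling", lambda u: "never reached in this example"),
]

print(fetch_with_ladder("https://example.com", tiers)[0])
```

The expensive tiers never run when a cheaper one returns usable content, so the common case costs a single HTTP GET.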

Why Lightpanda

Zig, Apache-2.0, self-hostable, CDP-compatible, ~9× less RAM than headless Chrome. Headless Chrome costs ~200 MB per instance, which makes it the wrong tool for a single-host agent that only occasionally needs a DOM. Lightpanda is beta: it has coverage gaps, some sites crash it, and a few block it. That is acceptable, because the tier below it (web_fetch) and the tier above it (Scrapling) cover the misses.

Why Scrapling

BSD-3, built-in MCP server, three fetcher tiers in one package: HTTP (curl-impersonate-style TLS fingerprinting, no browser), Stealthy (Cloudflare bypass), Dynamic (full browser). Its structured extraction is deterministic (element fingerprinting, not an LLM), so it costs nothing per call and survives layout drift. It closes the two gaps above Lightpanda (anti-bot fetch and selector extraction) in one sovereign tool that installs into the sidecar like the document-to-markdown server.
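
To show why fingerprint-based extraction can survive layout drift without an LLM, here is a toy sketch of the general idea: save a description of the element you wanted, and when the old selector stops matching, pick the most similar element on the new page. This is not Scrapling's algorithm or API. The `Element` type, the equal weights, and the 0.4 threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Optional


@dataclass(frozen=True)
class Element:
    tag: str
    classes: frozenset
    text: str
    path: tuple  # ancestor tags, root first


def similarity(a: Element, b: Element) -> float:
    """Score in [0, 1]: equal-weight blend of tag, class, text, and path similarity."""
    tag = 1.0 if a.tag == b.tag else 0.0
    union = a.classes | b.classes
    cls = len(a.classes & b.classes) / len(union) if union else 1.0
    text = SequenceMatcher(None, a.text, b.text).ratio()
    path = SequenceMatcher(None, a.path, b.path).ratio()
    return (tag + cls + text + path) / 4


def relocate(saved: Element, candidates: list, threshold: float = 0.4) -> Optional[Element]:
    """Return the candidate most similar to the saved fingerprint, or None."""
    best = max(candidates, key=lambda c: similarity(saved, c), default=None)
    if best is None or similarity(saved, best) < threshold:
        return None
    return best


# Fingerprint saved before a redesign.
saved = Element("span", frozenset({"price", "bold"}), "$19.99",
                ("html", "body", "div", "span"))

# Elements found on the redesigned page.
candidates = [
    Element("span", frozenset({"price", "product-price"}), "$21.49",
            ("html", "body", "main", "section", "span")),
    Element("a", frozenset({"nav-link"}), "Home",
            ("html", "body", "nav", "a")),
]

match = relocate(saved, candidates)
print(match.text if match else "no match")
```

The comparison is plain string and set arithmetic, so a lookup costs CPU time only and the same inputs always give the same answer.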

The dimensions we scored on

Every candidate was weighed on the same axes. For Zoya, the first four behaved as near-vetoes; the rest were tie-breakers.

Sovereignty: self-hostable vs managed-only. (veto: managed-only is out.)
License: permissive (BSD/MIT/Apache) vs copyleft (AGPL) vs commercial. (AGPL tolerated for a true service, avoided for a dependency.)
LLM dependency: deterministic vs needs an LLM at runtime. (veto-ish: a second runtime LLM is redundant, Zoya already drives one.)
MCP-native: ready to plug in vs needs wrapping.
Anti-bot capability: none / TLS-level / full stealth / managed-bypass.
Adaptivity: breaks on layout change vs adapts.
Scale shape: single-page vs spider vs concurrent pool. (deprioritized: single-user.)
Language / container fit: does it slot into the glibc sidecar; does it ship a usable image.
Maintenance: release cadence, star velocity, sponsor health.
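
One way to encode "vetoes first, tie-breakers after" is sketched below. It is a simplification: sovereignty and LLM dependency are treated as hard vetoes, while license and MCP-readiness are used as ranking keys. The attribute values are read off the appendix tables in this document; the `Candidate` type and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

PERMISSIVE = {"MIT", "BSD-3", "Apache-2.0"}


@dataclass(frozen=True)
class Candidate:
    name: str
    self_hostable: bool
    license: str
    needs_runtime_llm: bool  # "Optional" in the tables is recorded as False
    mcp_native: bool


def veto_reason(c: Candidate) -> Optional[str]:
    """Return why a candidate is vetoed, or None if it survives."""
    if not c.self_hostable:
        return "managed-only breaks sovereignty"
    if c.needs_runtime_llm:
        return "second runtime LLM is redundant"
    return None


def shortlist(candidates: list) -> list:
    """Drop vetoed candidates, then rank permissive and MCP-native ones first."""
    survivors = [c for c in candidates if veto_reason(c) is None]
    return sorted(
        survivors,
        key=lambda c: (c.license in PERMISSIVE, c.mcp_native),
        reverse=True,
    )


CANDIDATES = [
    Candidate("Firecrawl", True, "AGPL-3.0", False, True),
    Candidate("Diffbot", False, "Commercial", False, True),
    Candidate("ScrapegraphAI", True, "MIT", True, True),
    Candidate("Scrapling", True, "BSD-3", False, True),
]

for c in CANDIDATES:
    print(c.name, "->", veto_reason(c) or "survives vetoes")
print([c.name for c in shortlist(CANDIDATES)])
```

Checking vetoes before ranking means quality never has to be assessed for candidates that were out on a constraint.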

What we skipped, and why

All managed / cloud (Browserbase, Bright Data, Firecrawl-cloud, Tavily/Exa-managed, ScrapingBee, Apify, ZenRows, Diffbot, AgentQL, Kadoa): breaks sovereignty. Sovereignty is the whole point of the project, so this is a hard no, not a trade-off.
Agent harnesses (Browser Use, Stagehand, Skyvern, Magnitude, Anthropic computer-use): Zoya is the agent that drives tools. Bolting on another LLM-driven browser is an architectural inversion (and a second model bill). Hermes Agent stays a reference for how a memory-first multi-surface agent is built, not a dependency.
Browser drivers (Playwright/Patchright/Puppeteer/Selenium/chromedp/Rod): only needed if you script a browser. Zoya drives Lightpanda and Scrapling through MCP instead, so a driver layer would be dead weight.
Crawl4AI / Firecrawl: genuinely good, but single-URL→markdown is already covered three ways (web_fetch / lightpanda / scrapling). Their unique value is recursive multi-page crawl and RAG ingestion, which isn't a current use case (our knowledge base is populated out-of-band). Firecrawl is also AGPL and heavy. Deferred, with a written revisit trigger.
Dedicated structured extractors (ScrapegraphAI, Firecrawl-Extract): redundant. Fetch a page through the ladder, and Zoya's own model emits the typed JSON; Scrapling covers the deterministic/selector case. A separate extractor is a second LLM doing what the agent already does.
Other engines: Obscura (Rust, stealthier CDP) is the one to watch as a future swap for Lightpanda if the beta gaps bite; Servo and Ladybird aren't production-ready headless.
Unbrowse / API-discovery: clever (call a site's hidden API directly), but niche; revisit per-site if it ever matters.

Adjacent: document parsers

This stack is web-reading; the sibling problem is converting documents (PDF, DOCX, PPTX) into clean Markdown for the agent to read. Zoya runs markitdown as an MCP sidecar for that today. Docling (IBM Research, MIT, MCP-capable) is the more capable peer: stronger on table structure, layout reconstruction, and embedded figures, and the obvious swap if markitdown's output ever costs us fidelity on tables or scientific PDFs. Both are parsing tools, not retrieval ones. Even a clean, structurally perfect Markdown conversion is still untrusted content until it passes the memory store's trust gate (see Agent memory is an attack surface).

What would change our mind

Skips aren't permanent. They're recorded with the condition that would reopen them:

Exa: the moment semantic / research-discovery search (find by meaning, find-similar) becomes a real workload. It's the one search API that adds something the keyword stack can't; until then its SaaS key + cost aren't worth it. (Tavily was evaluated and skipped: it's roughly Brave with cleaner snippets, i.e. redundant; Firecrawl is a crawler, not search, covered below.)
Crawl4AI: the moment Zoya needs to bulk-ingest whole sites/docs into memory (crawl → chunk → store). That's the one job the read-one-URL loop doesn't scale to.
Obscura: if Lightpanda's beta coverage gaps start costing us real pages.

These live in a decision log where every entry carries an explicit "revisit when" trigger. A deferral you can't reopen is just a guess you got attached to.
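
A decision-log entry with a mandatory trigger might look like the sketch below. The three entries paraphrase the deferrals listed above; the `Decision` type and its field names are assumptions, not the project's actual log format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    project: str
    status: str        # "adopted", "deferred", or "rejected"
    reason: str
    revisit_when: str  # required: every entry carries an explicit trigger

    def __post_init__(self):
        # Refuse entries that cannot be reopened.
        if not self.revisit_when.strip():
            raise ValueError(self.project + ": entry needs a 'revisit when' trigger")


LOG = [
    Decision("Exa", "deferred", "SaaS key and cost not yet justified",
             "semantic / find-similar search becomes a real workload"),
    Decision("Crawl4AI", "deferred", "single-URL reading already covered",
             "Zoya needs to bulk-ingest whole sites or docs into memory"),
    Decision("Obscura", "deferred", "Lightpanda is good enough today",
             "Lightpanda's beta coverage gaps start costing real pages"),
]

for entry in LOG:
    print(entry.project + ": revisit when " + entry.revisit_when)
```

Validating in the constructor means an entry without a trigger cannot be written to the log at all.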

Appendix: how to choose

This is the full reference the decision was made against, sorted by layer; each row carries the dimensions above.

Decision tree

Need to extract data from web pages?
│
├─ Just need clean text/markdown for an LLM to read?
│    → Jina Reader (cheapest), Firecrawl, Crawl4AI (self-hosted)
│
├─ Need typed JSON matching a schema?
│    ├─ Schema known, layout stable  → Scrapling (selectors + adaptive)
│    ├─ Schema known, layout volatile → ScrapegraphAI (LLM extraction)
│    └─ Schema unknown, exploratory  → Diffbot, AgentQL
│
├─ Need to click, fill forms, log in?
│    ├─ Deterministic script → Playwright + Patchright (anti-detect)
│    ├─ LLM-driven actions   → Browser Use, Stagehand
│    └─ Heavy anti-bot       → Browserbase, or Scrapling StealthyFetcher
│
├─ Need to crawl an entire site?
│    ├─ Self-hosted → Scrapling Spiders, Scrapy, Crawl4AI
│    └─ Managed     → Firecrawl, Apify
│
├─ Avoid the browser (use the site's hidden API)?
│    → Unbrowse, or DIY with curl-impersonate
│
└─ Just need search results, not page scraping?
     → Tavily, Exa, SearXNG (self-hosted), Brave Search API

1. Browser engines

The thing that parses HTML and runs JavaScript. Drop-in CDP-compatible engines.

Name	Lang	License	Self-host	MCP	Status	Notes
Lightpanda	Zig	Apache-2.0	Yes	No (CDP)	Beta	9× less RAM, 11× faster than Chrome headless. No CSS layout/image decode/GPU; web-API coverage incomplete.
Obscura	Rust	MIT	Yes	No (CDP)	Beta	Stealth mode, 3,520-domain tracker block, CDP-compatible.
Chromium / Chrome	C++	BSD	Yes	No (CDP)	Stable	Default everywhere. Heavy: ~200 MB+ RAM per instance.
Firefox (headless)	C++/Rust	MPL-2.0	Yes	via mediator	Stable	Different fingerprint than Chrome, useful for diversity.
WebKit (headless)	C++	LGPL/BSD	Yes	via mediator	Stable	Smaller footprint than Chromium; fewer sites optimise for it.
Servo	Rust	MPL-2.0	Yes	No	Research	Not production-ready for headless automation.
Ladybird	C++	BSD	Partial	No	Pre-alpha	Independent engine, not yet usable headless.

Pick rule: Chromium unless you have a strong reason. Lightpanda/Obscura when you'll spawn >50 concurrent instances and memory matters. Firefox/WebKit for fingerprint diversity.

2. Browser drivers

Libraries that control engines via CDP or WebDriver.

Name	Lang	License	Anti-detect	MCP	Notes
Playwright	TS/Py/Java/.NET	Apache-2.0	No	Yes (MS)	The standard. Multi-engine (Chromium/Firefox/WebKit).
Puppeteer	JS/TS	Apache-2.0	No	Yes	Chrome-only, simpler API than Playwright.
Selenium	Multi	Apache-2.0	No	Community	Legacy WebDriver. Worse perf, broader site support.
chromedp	Go	MIT	No	Community	Go-native CDP driver, no Node dependency.
Rod	Go	MIT	No	Community	More ergonomic Go CDP driver than chromedp.
Patchright	Py/Node	Apache-2.0	Yes	No	Playwright fork with anti-detection patches.
rebrowser-playwright	Node	Apache-2.0	Yes	No	Different anti-detection patches than Patchright.
curl-impersonate	C	MIT	TLS-level	No	HTTP only, no JS. TLS fingerprint matching for Chrome/Firefox/Safari.

Pick rule: Playwright + Patchright if anti-detection matters; plain Playwright otherwise; curl-impersonate if you can skip JS entirely.

Appendix: extraction and frameworks

3. URL → clean markdown / text

Name	Lang	License	Self-host	MCP	LLM?	Notes
Jina Reader	TS	Apache-2.0	Yes	Yes	No	Lightest. `r.jina.ai/<url>` managed. Weak on heavy SPAs.
Firecrawl	TS/Py	AGPL-3.0	Yes (Compose)	Yes	Optional	Full Playwright rendering. Recursive crawl, sitemap mapping.
Crawl4AI	Python	Apache-2.0	Yes	Yes	Optional	LLM-aware chunking. Best self-hosted Firecrawl peer.
Trafilatura	Python	Apache-2.0	library	No	No	Pure text extraction from HTML, no rendering.
Readability.js	JS	Apache-2.0	library	No	No	Mozilla's Reader View algorithm.
readabilipy	Python	MIT	library	No	No	Python wrapper around Readability.js.

Pick rule: Crawl4AI for self-hosted at any scale; Jina Reader for prototyping; Firecrawl for managed recursive crawl; Trafilatura when you already have the HTML.

4. URL → typed JSON (structured extraction)

Name	Lang	License	Self-host	MCP	LLM?	Notes
ScrapegraphAI	Python	MIT	library	Yes	Yes	Plain-English prompts → typed JSON. Graph pipeline.
Firecrawl Extract	n/a	AGPL-3.0	Yes	Yes	Yes	Schema or prompt → JSON. Managed LLM.
Diffbot	n/a	Commercial	No	Yes	No (ML)	Knowledge-graph extraction. 10K/mo free tier.
AgentQL	Py/JS	Commercial	No	Yes	No (ML)	Natural-language query language for selectors.
Kadoa	n/a	Commercial	No	Yes	Yes	Auto-adapts to layout changes.
Scrapling (adaptive)	Python	BSD-3	Yes	Yes	No	Element fingerprinting, not LLM-based. See §5.
AutoScraper	Python	MIT	library	No	No	Example-based: give it a page + sample data, it learns.

Pick rule: ScrapegraphAI for volatile layouts with LLM budget; Scrapling for stable layouts, free; Diffbot/AgentQL for commercial reliability. For an LLM agent: it already extracts, so you mostly need a good fetch.

5. Full scraping frameworks

Name	Lang	License	MCP	Anti-bot	Adaptive	Notes
Scrapling	Python	BSD-3	Built-in	Cloudflare/Turnstile	Yes	3 fetcher tiers (HTTP/Stealthy/Dynamic). Spiders with pause-resume. Docker image with all browsers.
Scrapy	Python	BSD-3	Community	No	No	The OG. Mature, large ecosystem, no built-in stealth.
Crawlee	TS/Py	Apache-2.0	Community	Some	No	Apify's OSS framework. Good Playwright + queue integration.
Botasaurus	Python	MIT	No	Yes	No	Stealth-first scraping framework.
Katana	Go	MIT	No	No	No	Crawling-focused, fast, from ProjectDiscovery (security tooling).

Pick rule: Scrapling for new projects; Scrapy for legacy/team familiarity; Katana for security recon, not data extraction.

Appendix: harnesses, cloud, and search

6. Agent harnesses (LLM drives the browser)

Name	Lang	License	Self-host	MCP	Notes
Browser Use	Python	MIT	Yes	Yes	89.1% WebVoyager. Leading OSS agent-browser framework.
Stagehand	TS	MIT	lib, cloud-coupled	Yes	Browserbase's framework; v3 dropped Playwright for modular drivers.
Skyvern	Python	AGPL-3.0	Yes	Yes	Form-filling and workflow focus.
Hermes Agent	Python	MIT	Yes	Yes (self)	Persistent memory, multi-surface agent. Browser is one tool.
Anthropic computer-use	n/a	Commercial	No	N/A	Drives a full desktop. Most general, most expensive.
Magnitude	TS	MIT	Yes	Yes	Newer LLM-driven browser automation.

Pick rule: Browser Use for OSS production; Stagehand if you're already on Browserbase; Hermes Agent as an architectural reference for Zoya rather than a competitor.

7. Cloud browser infrastructure (managed)

Headless Chrome as a service. Skipped wholesale on sovereignty.

Name	Anti-bot	MCP	Notes
Browserbase	Strong (residential proxies, CAPTCHA)	Yes	Most common pick for agent infra.
Browserless	Yes	Yes	Self-hostable Docker option also exists.
Steel	Yes	Yes	Newer entrant, agent-native API.
Kernel	Yes	Yes	Agent-focused, includes session replay.
Bright Data	Strongest	Yes	Enterprise-grade, expensive.
Apify	Yes	Yes	Marketplace of pre-built scrapers + infra.
ScrapingBee	Yes	Yes	Credit-based, JS-rendering optional.
ZenRows	Yes	No	Scraping API with anti-bot rotation.

8 to 10. Self-hosted agentic browsers, API-discovery, search

Category	Name	License / Self-host	Notes
Agentic browser	BrowserOS	AGPL-3.0	Open-source Comet/Dia alternative, runs agents locally.
Agentic browser	Steel Browser	Apache-2.0	Self-hostable Steel infrastructure.
Agentic browser	Open Operator	MIT	OSS clone of the Operator pattern.
API-discovery	Unbrowse	Managed (MCP)	Discovers a site's internal APIs and calls them directly. Sub-second where it works.
Search	SearXNG	Self-host, MCP	Meta-search aggregator, privacy-first. Zoya's primary search (sovereign default).
Search	Brave Search API	Managed	Independent index. Reliable fallback below SearXNG; only fires when SearXNG comes up empty.
Search	Tavily	Managed	Skipped: agent-search API is roughly Brave with cleaner snippets; redundant.
Search	Exa	Managed	Neural/semantic (find-by-meaning). Gated: add if semantic-research becomes a need.
Search	Perplexity API	Managed	Skipped: search + LLM synthesis bundled; Zoya already does the synthesis.

11. MCP servers (cross-category index)

Self-hostable (sovereign-friendly): Scrapling, Crawl4AI, Firecrawl (self-host backend), Jina Reader (self-host backend), ScrapegraphAI, Playwright (MS), Chrome DevTools, Puppeteer, SearXNG, Browser Use, Hermes Agent, Unbrowse (self-host mode).

Managed-only: Browserbase, Browserless (cloud), Bright Data, Apify, Tavily, Exa, Linkup, ScrapingBee, Firecrawl (managed), Jina (managed).

Appendix: pickers and combinations

Picker logic for common tasks

Task	Sovereign pick	Managed pick
Read a single URL for LLM context	Crawl4AI / Jina Reader (self-host)	Jina Reader (`r.jina.ai`)
Typed JSON from a known site	Scrapling	ScrapegraphAI cloud
Crawl a docs site for RAG	Crawl4AI	Firecrawl
Click through an auth flow	Playwright + Patchright	Browserbase + Stagehand
Bypass Cloudflare	Scrapling StealthyFetcher	Browserbase / Bright Data
LLM-driven multi-step task	Browser Use	Browserbase + Stagehand v3
Search the web	SearXNG	Brave (Tavily/Exa for agent-tuned/semantic)
100+ concurrent headless	Lightpanda / Obscura	Browserbase pool
Scrape without rendering JS	curl-impersonate	ScrapingBee (no-JS)
Call a site's hidden API	DIY with mitmproxy	Unbrowse

Notable combinations

Crawl4AI + ScrapegraphAI: cheap markdown for the 90%, LLM extraction on layout-volatile pages.
Lightpanda + Playwright: drop Lightpanda in as the CDP endpoint for an existing Playwright script (beta, expect gaps).
Patchright + Scrapling DynamicFetcher: stack stealth patches under Scrapling's stealth layer.
SearXNG + Crawl4AI: search, then fetch the top N. Fully sovereign pipeline.
Scrapling MCP + Zoya approval gates: the sovereign agent stack. MCP exposes the fetchers, and the agent wraps risky calls in human-in-the-loop approval. This is the one we shipped.
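
The approval-gate pattern in the last combination can be sketched as a thin wrapper around tool dispatch. Which tools count as risky is an assumption here (the page-mutating Lightpanda tools named earlier), and `gated_call`, `run`, and `approve` are hypothetical names, not Zoya's API.

```python
from typing import Callable

# Assumption: tools that mutate a page or run script need human approval.
RISKY_TOOLS = {"fill", "click", "evaluate"}


def gated_call(tool: str, args: dict,
               run: Callable[[str, dict], str],
               approve: Callable[[str, dict], bool]) -> str:
    """Run `tool` via `run`, asking `approve` first when the tool is risky."""
    if tool in RISKY_TOOLS and not approve(tool, args):
        return "denied: " + tool + " was not approved"
    return run(tool, args)


# Stubs: a fake dispatcher, and a human who declines every request.
def fake_run(tool: str, args: dict) -> str:
    return "ran " + tool


def always_decline(tool: str, args: dict) -> bool:
    return False


print(gated_call("markdown", {}, fake_run, always_decline))
print(gated_call("click", {"selector": "#buy"}, fake_run, always_decline))
```

Read-only tools pass straight through, so the gate adds friction only where a call can change state.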

Zoya is a sovereign, single-user agent that already wraps an LLM and plugs in tools over MCP. Those four facts decide most of the web-reading stack before anyone scores quality: no managed cloud, no scraping farm, nothing that needs wrapping, and no second LLM to drive a browser. What survives is small.

What runs: a three-tier fetch ladder plus a search chain. Each tier is tried only when the cheaper one fails.

web_fetch: an in-process HTTP GET, no JavaScript. Static pages, JSON, RSS. Tried first.
Lightpanda: a Zig headless browser, about 9x lighter than headless Chrome. Used when a page needs JS or a form.
Scrapling: bot-protected sites (Cloudflare, TLS fingerprinting) and deterministic selector extraction. The top tier.
Search: SearXNG (self-hosted), then Brave, then DuckDuckGo, falling through on empty.

Try cheap, escalate only when the cheaper tier fails. That one rule is the whole design.

Why these and not the other eighty

Sovereignty is a veto. Browserbase, Bright Data, ScrapingBee and every managed-cloud option are out on contact. Agent harnesses are redundant: Zoya is already the agent that drives tools, so bolting on a second LLM-driven browser is an architectural inversion and a second model bill. Browser drivers and dedicated extractors are dead weight too: Zoya talks to the fetchers over MCP and emits typed JSON itself, so it needs a good fetch, not a driver layer or a separate extractor.

The skips are logged, not closed

Crawl4AI, Exa and Obscura are deferred, each with a written "revisit when" trigger: bulk-ingesting whole sites, real semantic search, or Lightpanda's beta gaps starting to cost real pages. A deferral you cannot reopen is just a guess you got attached to. The appendix above is the full landscape the call was made against.

Sources

Lightpanda: github.com/lightpanda-io/browser
Scrapling: github.com/D4Vinci/Scrapling (v0.4.8, May 2026)
Crawl4AI: github.com/unclecode/crawl4ai
Firecrawl: firecrawl.dev · ScrapegraphAI: scrapegraphai.com
Browser Use: github.com/browser-use/browser-use · Playwright MCP: github.com/microsoft/playwright-mcp