Zoya runs three tools for the web — web_fetch, mcp_lightpanda, mcp_scrapling — plus a sovereign-first search chain. The other eighty-odd projects in this space were evaluated and rejected. This is the decision record: what runs, what was cut, and the dimension that decided each call. Four facts about Zoya did most of the filtering before quality was assessed: it is sovereign, single-user, MCP-native, and it is already the reasoning agent rather than a tool that needs one bolted on.
The constraints did most of the work
Four facts about Zoya remove most of the market before quality is even assessed:
- Sovereign / self-hosted. No managed cloud services. That deletes the entire managed-browser and managed-extraction tier on contact — Browserbase, Bright Data, ScrapingBee, Diffbot, and friends.
- Single-user personal agent, not a scraping farm or a RAG-ingestion pipeline. "Run 100+ concurrent headless" and "crawl an entire docs site" aren't the workload; reading a URL inside a conversation is.
- MCP-native. Zoya is an MCP client; new tools plug in over stdio through a glibc sidecar. Anything that ships an MCP server drops in cleanly; anything that needs wrapping pays a tax.
- Zoya is already the agent. Its runtime wraps an LLM and runs the loop itself, so it already reasons. It does not need a second LLM to drive a browser, and it turns fetched content into typed JSON natively. This single fact rules out two whole categories (agent harnesses and LLM extractors) as redundant.
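To make the "plugs in over stdio" point concrete: MCP messages are JSON-RPC 2.0, and the stdio transport carries each message as a single newline-terminated line of JSON. The sketch below shows how a client might frame one `tools/call` request. It skips the initialize handshake a real session needs first, and the tool name `goto` and its `url` argument are illustrative assumptions, not the confirmed schema of any server Zoya runs.

```python
import json
from itertools import count

_ids = count(1)  # JSON-RPC request ids; any value unique per request works


def frame_tool_call(tool: str, arguments: dict) -> bytes:
    """Encode one MCP `tools/call` request as a single stdio line."""
    message = {
        "jsonrpc": "2.0",
        "id": next(_ids),
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }
    # json.dumps escapes newlines inside strings, so the frame stays one line.
    return (json.dumps(message) + "\n").encode("utf-8")


# Hypothetical tool name and argument, for illustration only.
line = frame_tool_call("goto", {"url": "https://example.com"})
```

A server that already speaks this protocol needs no adapter; anything else needs a wrapper that translates to and from frames like this one, which is the "tax" mentioned above.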
What we run
A three-tier escalation ladder for fetching, plus a search backend:
- web_fetch: an in-process HTTP GET (curl-shaped), no JavaScript. The default for static pages, JSON APIs, RSS, downloads. Cheapest and fastest; tried first.
- mcp_lightpanda_*: Lightpanda, a Zig headless browser, for JS-rendered pages and interaction (goto, markdown, fill, click, detectForms, evaluate). Used when web_fetch comes back with an empty shell or a "please enable JavaScript" page.
- mcp_scrapling_*: Scrapling, for bot-protected sites (TLS fingerprinting, Cloudflare/Turnstile) and adaptive extraction. The tier above a plain browser.
- Search: web_search runs a sovereign-first chain, SearXNG (self-hosted meta-search) → Brave Search API → DuckDuckGo, falling through on empty results. The agent can also force a specific engine per call (backend), so it can cross-check two engines on a hard lookup.
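The search chain can be sketched as a fall-through loop with an optional override. This is a minimal illustration, not Zoya's implementation: the engine callables are stand-ins for real clients, the dictionary keys are made-up labels, and treating an engine error the same as an empty result is an assumption the text does not state.

```python
from typing import Callable, Optional

SearchFn = Callable[[str], list[str]]

# Order mirrors the chain described above; the keys are illustrative labels.
CHAIN_ORDER = ["searxng", "brave", "duckduckgo"]


def web_search(query: str, engines: dict[str, SearchFn],
               backend: Optional[str] = None) -> tuple[str, list[str]]:
    """Return (engine_used, results), falling through on empty results."""
    if backend is not None:
        # Forced engine: no fallthrough, so two calls can be cross-checked.
        return backend, engines[backend](query)
    for name in CHAIN_ORDER:
        try:
            results = engines[name](query)
        except Exception:
            results = []  # assumption: an engine error counts as "empty"
        if results:
            return name, results
    return CHAIN_ORDER[-1], []


# Stub engines standing in for real clients.
stubs: dict[str, SearchFn] = {
    "searxng": lambda q: [],
    "brave": lambda q: [f"brave result for {q}"],
    "duckduckgo": lambda q: [f"ddg result for {q}"],
}
used, hits = web_search("zig headless browser", stubs)
forced, _ = web_search("zig headless browser", stubs, backend="duckduckgo")
```

Returning the engine name alongside the results lets the agent see which tier answered, which is what makes a deliberate cross-check between two engines meaningful.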
The agent-facing rule is one line: try cheap, escalate only when the cheaper tier fails.
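As a sketch of that rule, the ladder can be an ordered list of fetchers plus a check for "the cheaper tier failed". Everything here is hypothetical scaffolding: the fetchers are stubs, and `looks_unusable` with its 200-character threshold is a rough heuristic chosen for illustration, not the test Zoya actually applies.

```python
import re
from typing import Callable, Optional

Fetcher = Callable[[str], Optional[str]]  # page body, or None on failure

_JS_WALL = re.compile(r"enable\s+javascript", re.IGNORECASE)
_SCRIPTS = re.compile(r"<script\b.*?</script>", re.IGNORECASE | re.DOTALL)
_TAGS = re.compile(r"<[^>]+>")


def looks_unusable(body: Optional[str], min_text_chars: int = 200) -> bool:
    """Heuristic: failed fetch, a JavaScript wall, or an almost-empty HTML shell."""
    if not body or _JS_WALL.search(body):
        return True
    if "<html" not in body.lower():
        return False  # JSON, RSS, plain text: nothing to render
    visible = " ".join(_TAGS.sub(" ", _SCRIPTS.sub(" ", body)).split())
    return len(visible) < min_text_chars


def fetch_with_escalation(url: str,
                          ladder: list[tuple[str, Fetcher]]) -> tuple[str, str]:
    """Try each tier in order; return (tier_name, body) from the first usable one."""
    for name, fetch in ladder:
        try:
            body = fetch(url)
        except Exception:
            body = None  # a crash in one tier is just another reason to escalate
        if not looks_unusable(body):
            return name, body
    raise RuntimeError(f"all tiers failed for {url}")


# Stub tiers: the cheap one returns a JS shell, the browser tier returns content.
shell = "<html><body><div id='root'></div><noscript>Please enable JavaScript</noscript></body></html>"
rendered = "<html><body>" + "real content " * 30 + "</body></html>"
ladder = [("web_fetch", lambda u: shell), ("mcp_lightpanda", lambda u: rendered)]
tier, body = fetch_with_escalation("https://example.com", ladder)
```

Keeping the ladder as data means a fourth tier, or a swap of one engine for another, is a one-line change rather than a rewrite of the control flow.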
Why Lightpanda
Zig, Apache-2.0, self-hostable, CDP-compatible, ~9× less RAM than headless Chrome. Headless Chrome costs ~200 MB per instance — wrong tool for a single-host agent that needs a DOM occasionally. Lightpanda is beta: coverage gaps, some sites crash it, a few block it. Acceptable, because the tier below it (web_fetch) and above it (scrapling) cover the misses.
Why Scrapling
BSD-3, built-in MCP server, three fetcher tiers in one package: HTTP (curl-impersonate-style TLS fingerprinting, no browser), Stealthy (Cloudflare bypass), Dynamic (full browser). Its structured extraction is deterministic — element fingerprinting, not an LLM — so it costs nothing per call and survives layout drift. It closes the two gaps above Lightpanda — anti-bot fetch and selector extraction — in one sovereign tool that installs into the sidecar like the document-to-markdown server.
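The idea behind "element fingerprinting, not an LLM" can be shown with a toy: remember what the target element looked like when the selector last worked, then pick the most similar element after the layout changes. This is a standard-library illustration of the general technique only. It is not Scrapling's algorithm or API, and the equal weighting of tag, attributes, and text is an arbitrary assumption.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from html.parser import HTMLParser


@dataclass(frozen=True)
class Fingerprint:
    tag: str
    attrs: tuple[tuple[str, str], ...]
    text: str


class _Collector(HTMLParser):
    """Record a Fingerprint for every element as it closes."""

    def __init__(self) -> None:
        super().__init__()
        self._stack: list[list] = []  # open elements: [tag, attrs, text_parts]
        self.elements: list[Fingerprint] = []

    def handle_starttag(self, tag, attrs):
        self._stack.append([tag, tuple(sorted((k, v or "") for k, v in attrs)), []])

    def handle_data(self, data):
        if self._stack and data.strip():
            self._stack[-1][2].append(data.strip())

    def handle_endtag(self, tag):
        if tag not in (entry[0] for entry in self._stack):
            return  # stray closing tag
        while self._stack:  # also closes void/unclosed elements on the way
            t, a, parts = self._stack.pop()
            self.elements.append(Fingerprint(t, a, " ".join(parts)))
            if t == tag:
                break


def fingerprints(html: str) -> list[Fingerprint]:
    collector = _Collector()
    collector.feed(html)
    collector.close()
    return collector.elements


def similarity(a: Fingerprint, b: Fingerprint) -> float:
    tag = 1.0 if a.tag == b.tag else 0.0
    attrs = SequenceMatcher(None, str(a.attrs), str(b.attrs)).ratio()
    text = SequenceMatcher(None, a.text, b.text).ratio()
    return (tag + attrs + text) / 3  # equal weights: an arbitrary choice


def relocate(saved: Fingerprint, html: str) -> Fingerprint:
    """Deterministically pick the element that best matches the saved fingerprint."""
    return max(fingerprints(html), key=lambda fp: similarity(saved, fp))


old_page = '<div class="price-box"><span class="price" id="p1">$19.99</span></div>'
new_page = '<section><p class="title">Widget</p><b class="price-v2" id="p1">$21.50</b></section>'
saved = next(fp for fp in fingerprints(old_page) if dict(fp.attrs).get("id") == "p1")
found = relocate(saved, new_page)
```

In this example the tag, the class name, and the price all changed between the two pages, yet the match is still found without any model call, which is why this style of extraction costs nothing per request.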
The dimensions we scored on
Every candidate was weighed on the same axes. For Zoya, the first four behaved as near-vetoes; the rest were tie-breakers.
- Sovereignty — self-hostable vs managed-only. (veto: managed-only is out.)
- License — permissive (BSD/MIT/Apache) vs copyleft (AGPL) vs commercial. (AGPL tolerated for a true service, avoided for a dependency.)
- LLM dependency — deterministic vs needs an LLM at runtime. (veto-ish: a second runtime LLM is redundant — Zoya already drives one.)
- MCP-native — ready to plug in vs needs wrapping.
- Anti-bot capability — none / TLS-level / full stealth / managed-bypass.
- Adaptivity — breaks on layout change vs adapts.
- Scale shape — single-page vs spider vs concurrent pool. (deprioritized: single-user.)
- Language / container fit — does it slot into the glibc sidecar; does it ship a usable image.
- Maintenance — release cadence, star velocity, sponsor health.
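One way to read "near-vetoes first, tie-breakers after" is as a filter followed by a sort. The sketch below is an interpretation, not the scoring sheet actually used: managed-only is treated as the single hard veto, the other three leading axes count as penalties, and `tie_breaker_score` is a hypothetical roll-up of the remaining axes. The candidate attributes are taken from the appendix tables.

```python
from dataclasses import dataclass

PERMISSIVE = {"MIT", "BSD-3", "Apache-2.0"}


@dataclass
class Candidate:
    name: str
    self_hostable: bool
    license: str
    needs_runtime_llm: bool
    mcp_native: bool
    tie_breaker_score: int = 0  # hypothetical roll-up of the lower-priority axes


def near_vetoes(c: Candidate) -> list[str]:
    """Reasons that count heavily against a candidate without excluding it."""
    reasons = []
    if c.license not in PERMISSIVE:
        reasons.append(f"non-permissive license ({c.license})")
    if c.needs_runtime_llm:
        reasons.append("needs a second runtime LLM")
    if not c.mcp_native:
        reasons.append("needs wrapping for MCP")
    return reasons


def shortlist(candidates: list[Candidate]) -> list[Candidate]:
    survivors = [c for c in candidates if c.self_hostable]  # hard veto
    # Fewest near-vetoes first; tie-breakers only separate otherwise equal rows.
    return sorted(survivors, key=lambda c: (len(near_vetoes(c)), -c.tie_breaker_score))


candidates = [
    Candidate("Browserbase", False, "Commercial", False, True),
    Candidate("ScrapegraphAI", True, "MIT", True, True),
    Candidate("Firecrawl", True, "AGPL-3.0", False, True),
    Candidate("Scrapling", True, "BSD-3", False, True),
]
ranked = [c.name for c in shortlist(candidates)]
```

The ordering of the sort key is the whole policy: no tie-breaker score, however high, can lift a candidate above one with fewer near-vetoes.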
What we skipped, and why
- All managed / cloud (Browserbase, Bright Data, Firecrawl-cloud, Tavily/Exa-managed, ScrapingBee, Apify, ZenRows, Diffbot, AgentQL, Kadoa) — breaks sovereignty. This is the whole point of the project, so it's a hard no, not a trade-off.
- Agent harnesses (Browser Use, Stagehand, Skyvern, Magnitude, Anthropic computer-use) — Zoya is the agent that drives tools. Bolting on another LLM-driven browser is an architectural inversion (and a second model bill). Hermes Agent stays a reference for how a memory-first multi-surface agent is built, not a dependency.
- Browser drivers (Playwright/Patchright/Puppeteer/Selenium/chromedp/Rod) — only needed if you script a browser. Zoya drives Lightpanda and Scrapling through MCP instead, so a driver layer would be dead weight.
- Crawl4AI / Firecrawl — genuinely good, but single-URL→markdown is already covered three ways (web_fetch / lightpanda / scrapling). Their unique value is recursive multi-page crawl and RAG ingestion, which isn't a current use case (our knowledge base is populated out-of-band). Firecrawl is also AGPL and heavy. Deferred, with a written revisit trigger.
- Dedicated structured extractors (ScrapegraphAI, Firecrawl-Extract) — redundant. Fetch a page through the ladder, and Zoya's own model emits the typed JSON; Scrapling covers the deterministic/selector case. A separate extractor is a second LLM doing what the agent already does.
- Other engines — Obscura (Rust, stealthier CDP) is the one to watch as a future swap for Lightpanda if the beta gaps bite; Servo and Ladybird aren't production-ready headless.
- Unbrowse / API-discovery — clever (call a site's hidden API directly), but niche; revisit per-site if it ever matters.
Adjacent: document parsers
This stack is web-reading; the sibling problem is converting documents (PDF, DOCX, PPTX) into clean Markdown for the agent to read. Zoya runs markitdown as an MCP sidecar for that today. Docling (IBM Research, MIT, MCP-capable) is the more capable peer: stronger on table structure, layout reconstruction, and embedded figures, and the obvious swap if markitdown's output ever costs us fidelity on tables or scientific PDFs. Both are parsing tools, not retrieval ones. Even a clean, structurally perfect Markdown conversion is still untrusted content until it passes the memory store's trust gate; see Agent memory is an attack surface.
What would change our mind
Skips aren't permanent — they're recorded with the condition that would reopen them:
- Exa — the moment semantic / research-discovery search (find by meaning, find-similar) becomes a real workload. It's the one search API that adds something the keyword stack can't; until then its SaaS key + cost aren't worth it. (Tavily was evaluated and skipped — it's ≈ Brave with cleaner snippets, i.e. redundant; Firecrawl is a crawler, not search — see below.)
- Crawl4AI — the moment Zoya needs to bulk-ingest whole sites/docs into memory (crawl → chunk → store). That's the one job the read-one-URL loop doesn't scale to.
- Obscura — if Lightpanda's beta coverage gaps start costing us real pages.
These live in a decision log where every entry carries an explicit "revisit when" trigger. A deferral you can't reopen is just a guess you got attached to.
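A decision log with mandatory triggers can be as small as a record type that refuses to exist without one. The field names below are hypothetical and the real log's format is not described here; the three entries paraphrase the bullets above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deferral:
    tool: str
    reason: str
    revisit_when: str  # the observable condition that reopens the decision

    def __post_init__(self) -> None:
        if not self.revisit_when.strip():
            raise ValueError(f"{self.tool}: a deferral needs a revisit trigger")


LOG = [
    Deferral("Exa", "SaaS key and cost not justified by current workload",
             "semantic / find-similar search becomes a real workload"),
    Deferral("Crawl4AI", "single-URL reading is already covered",
             "Zoya needs to bulk-ingest whole sites or docs into memory"),
    Deferral("Obscura", "Lightpanda is good enough today",
             "Lightpanda's beta coverage gaps start costing real pages"),
]
```

Enforcing the trigger at construction time means an entry without a reopening condition fails loudly when it is written, not months later when someone tries to revisit it.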
Appendix: the landscape we evaluated
The full reference the decision was made against. Sorted by layer; each row carries the dimensions above.
Decision tree
Need to extract data from web pages?
│
├─ Just need clean text/markdown for an LLM to read?
│ → Jina Reader (cheapest), Firecrawl, Crawl4AI (self-hosted)
│
├─ Need typed JSON matching a schema?
│ ├─ Schema known, layout stable → Scrapling (selectors + adaptive)
│ ├─ Schema known, layout volatile → ScrapegraphAI (LLM extraction)
│ └─ Schema unknown, exploratory → Diffbot, AgentQL
│
├─ Need to click, fill forms, log in?
│ ├─ Deterministic script → Playwright + Patchright (anti-detect)
│ ├─ LLM-driven actions → Browser Use, Stagehand
│ └─ Heavy anti-bot → Browserbase, or Scrapling StealthyFetcher
│
├─ Need to crawl an entire site?
│ ├─ Self-hosted → Scrapling Spiders, Scrapy, Crawl4AI
│ └─ Managed → Firecrawl, Apify
│
├─ Avoid the browser (use the site's hidden API)?
│ → Unbrowse, or DIY with curl-impersonate
│
└─ Just need search results, not page scraping?
  → Tavily, Exa, SearXNG (self-hosted), Brave Search API
1. Browser engines
The thing that parses HTML and runs JavaScript. Drop-in CDP-compatible engines.
| Name | Lang | License | Self-host | MCP | Status | Notes |
|---|---|---|---|---|---|---|
| Lightpanda | Zig | Apache-2.0 | Yes | No (CDP) | Beta | 9× less RAM, 11× faster than Chrome headless. No CSS layout/image decode/GPU; web-API coverage incomplete. |
| Obscura | Rust | MIT | Yes | No (CDP) | Beta | Stealth mode, 3,520-domain tracker block, CDP-compatible. |
| Chromium / Chrome | C++ | BSD | Yes | No (CDP) | Stable | Default everywhere. Heavy: ~200 MB+ RAM per instance. |
| Firefox (headless) | C++/Rust | MPL-2.0 | Yes | via mediator | Stable | Different fingerprint than Chrome — useful for diversity. |
| WebKit (headless) | C++ | LGPL/BSD | Yes | via mediator | Stable | Smaller footprint than Chromium; fewer sites optimize for it. |
| Servo | Rust | MPL-2.0 | Yes | No | Research | Not production-ready for headless automation. |
| Ladybird | C++ | BSD | Partial | No | Pre-alpha | Independent engine, not yet usable headless. |
Pick rule: Chromium unless you have a strong reason. Lightpanda/Obscura when you'll spawn >50 concurrent and memory matters. Firefox/WebKit for fingerprint diversity.
2. Browser drivers
Libraries that control engines via CDP or WebDriver.
| Name | Lang | License | Anti-detect | MCP | Notes |
|---|---|---|---|---|---|
| Playwright | TS/Py/Java/.NET | Apache-2.0 | No | Yes (MS) | The standard. Multi-engine (Chromium/Firefox/WebKit). |
| Puppeteer | JS/TS | Apache-2.0 | No | Yes | Chrome-only, simpler API than Playwright. |
| Selenium | Multi | Apache-2.0 | No | Community | Legacy WebDriver. Worse perf, broader site support. |
| chromedp | Go | MIT | No | Community | Go-native CDP driver, no Node dependency. |
| Rod | Go | MIT | No | Community | More ergonomic Go CDP driver than chromedp. |
| Patchright | Py/Node | Apache-2.0 | Yes | No | Playwright fork with anti-detection patches. |
| rebrowser-playwright | Node | Apache-2.0 | Yes | No | Different anti-detection patches than Patchright. |
| curl-impersonate | C | MIT | TLS-level | No | HTTP only, no JS. TLS fingerprint matching for Chrome/Firefox/Safari. |
Pick rule: Playwright + Patchright if anti-detection matters; plain Playwright otherwise; curl-impersonate if you can skip JS entirely.
3. URL → clean markdown / text
| Name | Lang | License | Self-host | MCP | LLM? | Notes |
|---|---|---|---|---|---|---|
| Jina Reader | TS | Apache-2.0 | Yes | Yes | No | Lightest. r.jina.ai/<url> managed. Weak on heavy SPAs. |
| Firecrawl | TS/Py | AGPL-3.0 | Yes (Compose) | Yes | Optional | Full Playwright rendering. Recursive crawl, sitemap mapping. |
| Crawl4AI | Python | Apache-2.0 | Yes | Yes | Optional | LLM-aware chunking. Best self-hosted Firecrawl peer. |
| Trafilatura | Python | Apache-2.0 | library | No | No | Pure text extraction from HTML, no rendering. |
| Readability.js | JS | Apache-2.0 | library | No | No | Mozilla's Reader View algorithm. |
| readabilipy | Python | MIT | library | No | No | Python wrapper around Readability.js. |
Pick rule: Crawl4AI for self-hosted at any scale; Jina Reader for prototyping; Firecrawl for managed recursive crawl; Trafilatura when you already have the HTML.
4. URL → typed JSON (structured extraction)
| Name | Lang | License | Self-host | MCP | LLM? | Notes |
|---|---|---|---|---|---|---|
| ScrapegraphAI | Python | MIT | library | Yes | Yes | Plain-English prompts → typed JSON. Graph pipeline. |
| Firecrawl Extract | — | AGPL | Yes | Yes | Yes | Schema or prompt → JSON. Managed LLM. |
| Diffbot | — | Commercial | No | Yes | No (ML) | Knowledge-graph extraction. 10K/mo free tier. |
| AgentQL | Py/JS | Commercial | No | Yes | No (ML) | Natural-language query language for selectors. |
| Kadoa | — | Commercial | No | Yes | Yes | Auto-adapts to layout changes. |
| Scrapling (adaptive) | Python | BSD-3 | Yes | Yes | No | Element fingerprinting, not LLM-based. See §5. |
| AutoScraper | Python | MIT | library | No | No | Example-based: give it a page + sample data, it learns. |
Pick rule: ScrapegraphAI for volatile layouts with LLM budget; Scrapling for stable layouts, free; Diffbot/AgentQL for commercial reliability. For an LLM agent: it already extracts — you mostly need a good fetch.
5. Full scraping frameworks
| Name | Lang | License | MCP | Anti-bot | Adaptive | Notes |
|---|---|---|---|---|---|---|
| Scrapling | Python | BSD-3 | Built-in | Cloudflare/Turnstile | Yes | 3 fetcher tiers (HTTP/Stealthy/Dynamic). Spiders with pause-resume. Docker image with all browsers. |
| Scrapy | Python | BSD-3 | Community | No | No | The OG. Mature, large ecosystem, no built-in stealth. |
| Crawlee | TS/Py | Apache-2.0 | Community | Some | No | Apify's OSS framework. Good Playwright + queue integration. |
| Botasaurus | Python | MIT | No | Yes | No | Stealth-first scraping framework. |
| Katana | Go | MIT | No | No | No | Crawling-focused, fast, from ProjectDiscovery (security tooling). |
Pick rule: Scrapling for new projects; Scrapy for legacy/team familiarity; Katana for security recon, not data extraction.
6. Agent harnesses (LLM drives the browser)
| Name | Lang | License | Self-host | MCP | Notes |
|---|---|---|---|---|---|
| Browser Use | Python | MIT | Yes | Yes | 89.1% WebVoyager. Leading OSS agent-browser framework. |
| Stagehand | TS | MIT | lib, cloud-coupled | Yes | Browserbase's framework; v3 dropped Playwright for modular drivers. |
| Skyvern | Python | AGPL-3.0 | Yes | Yes | Form-filling and workflow focus. |
| Hermes Agent | Python | MIT | Yes | Yes (self) | Persistent memory, multi-surface agent. Browser is one tool. |
| Anthropic computer-use | — | Commercial | No | N/A | Drives a full desktop. Most general, most expensive. |
| Magnitude | TS | MIT | Yes | Yes | Newer LLM-driven browser automation. |
Pick rule: Browser Use for OSS production; Stagehand if you're already on Browserbase; Hermes Agent as an architectural reference for Zoya rather than a competitor.
7. Cloud browser infrastructure (managed)
Headless Chrome as a service. Skipped wholesale — sovereignty.
| Name | Anti-bot | MCP | Notes |
|---|---|---|---|
| Browserbase | Strong (residential proxies, CAPTCHA) | Yes | Most common pick for agent infra. |
| Browserless | Yes | Yes | Self-hostable Docker option also exists. |
| Steel | Yes | Yes | Newer entrant, agent-native API. |
| Kernel | Yes | Yes | Agent-focused, includes session replay. |
| Bright Data | Strongest | Yes | Enterprise-grade, expensive. |
| Apify | Yes | Yes | Marketplace of pre-built scrapers + infra. |
| ScrapingBee | Yes | Yes | Credit-based, JS-rendering optional. |
| ZenRows | Yes | No | Scraping API with anti-bot rotation. |
8 – 10. Self-hosted agentic browsers, API-discovery, search
| Category | Name | License / Self-host | Notes |
|---|---|---|---|
| Agentic browser | BrowserOS | AGPL-3.0 | Open-source Comet/Dia alternative, runs agents locally. |
| Agentic browser | Steel Browser | Apache-2.0 | Self-hostable Steel infrastructure. |
| Agentic browser | Open Operator | MIT | OSS clone of the Operator pattern. |
| API-discovery | Unbrowse | Managed (MCP) | Discovers a site's internal APIs and calls them directly. Sub-second where it works. |
| Search | SearXNG | Self-host, MCP | Meta-search aggregator, privacy-first. Zoya's primary search (sovereign default). |
| Search | Brave Search API | Managed | Independent index. Reliable fallback below SearXNG; only fires when SearXNG comes up empty. |
| Search | Tavily | Managed | Skipped — agent-search API ≈ Brave with cleaner snippets; redundant. |
| Search | Exa | Managed | Neural/semantic (find-by-meaning). Gated: add if semantic-research becomes a need. |
| Search | Perplexity API | Managed | Skipped — search + LLM synthesis bundled; Zoya already does the synthesis. |
11. MCP servers (cross-category index)
Self-hostable (sovereign-friendly): Scrapling, Crawl4AI, Firecrawl (self-host backend), Jina Reader (self-host backend), ScrapegraphAI, Playwright (MS), Chrome DevTools, Puppeteer, SearXNG, Browser Use, Hermes Agent, Unbrowse (self-host mode).
Managed-only: Browserbase, Browserless (cloud), Bright Data, Apify, Tavily, Exa, Linkup, ScrapingBee, Firecrawl (managed), Jina (managed).
Picker logic for common tasks
| Task | Sovereign pick | Managed pick |
|---|---|---|
| Read a single URL for LLM context | Crawl4AI / Jina Reader (self-host) | Jina Reader (r.jina.ai) |
| Typed JSON from a known site | Scrapling | ScrapegraphAI cloud |
| Crawl a docs site for RAG | Crawl4AI | Firecrawl |
| Click through an auth flow | Playwright + Patchright | Browserbase + Stagehand |
| Bypass Cloudflare | Scrapling StealthyFetcher | Browserbase / Bright Data |
| LLM-driven multi-step task | Browser Use | Browserbase + Stagehand v3 |
| Search the web | SearXNG | Brave (Tavily/Exa for agent-tuned/semantic) |
| 100+ concurrent headless | Lightpanda / Obscura | Browserbase pool |
| Scrape without rendering JS | curl-impersonate | ScrapingBee (no-JS) |
| Call a site's hidden API | DIY with mitmproxy | Unbrowse |
Notable combinations
- Crawl4AI + ScrapegraphAI — cheap markdown for the 90%, LLM extraction on layout-volatile pages.
- Lightpanda + Playwright — drop Lightpanda in as the CDP endpoint for an existing Playwright script (beta — expect gaps).
- Patchright + Scrapling DynamicFetcher — stack stealth patches under Scrapling's stealth layer.
- SearXNG + Crawl4AI — search, then fetch the top N. Fully sovereign pipeline.
- Scrapling MCP + Zoya approval gates — the sovereign agent stack: MCP exposes the fetchers, the agent wraps risky calls in human-in-the-loop. This is the one we shipped.
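The last combination, wrapping risky calls in human-in-the-loop approval, can be sketched as a thin wrapper around a tool function. All names here are hypothetical: how Zoya decides which calls are risky and how it asks the operator are not specified, so the `risky` flag and the `approve` callback are placeholders for that policy.

```python
from typing import Any, Callable

ToolFn = Callable[..., Any]
Approver = Callable[[str, dict], bool]  # (tool_name, arguments) -> allow?


def gated(tool_name: str, call: ToolFn, approve: Approver, risky: bool) -> ToolFn:
    """Wrap a tool so risky invocations require an explicit operator decision."""

    def wrapper(**arguments: Any) -> Any:
        if risky and not approve(tool_name, arguments):
            raise PermissionError(f"{tool_name} denied by operator")
        return call(**arguments)

    return wrapper


def deny_all(tool_name: str, arguments: dict) -> bool:
    return False  # stand-in for a real prompt shown to the operator


def fake_fetch(url: str) -> str:
    return f"fetched {url}"  # stand-in for a call to an MCP fetcher


stealth = gated("stealthy_fetch", fake_fetch, deny_all, risky=True)
plain = gated("web_fetch", fake_fetch, deny_all, risky=False)
result = plain(url="https://example.com")
```

Because the gate sits outside the tool, the MCP server itself stays unmodified and the policy can change without touching the fetchers.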
Sources
- Lightpanda — github.com/lightpanda-io/browser
- Scrapling — github.com/D4Vinci/Scrapling (v0.4.8, May 2026)
- Crawl4AI — github.com/unclecode/crawl4ai
- Firecrawl — firecrawl.dev · ScrapegraphAI — scrapegraphai.com
- Browser Use — github.com/browser-use/browser-use · Playwright MCP — github.com/microsoft/playwright-mcp