Zoya runs three tools for the web — web_fetch, mcp_lightpanda, mcp_scrapling — plus a sovereign-first search chain. The other eighty-odd projects in this space were evaluated and rejected. This is the decision record: what runs, what was cut, and the dimension that decided each call. Four facts about Zoya did most of the filtering before quality was assessed: it is sovereign, single-user, MCP-native, and it is already the reasoning agent rather than a tool that needs one bolted on.

[Figure: three-tier ladder. web_fetch (curl, no JS; default) → mcp_lightpanda (JS render + forms) when the page needs JS → mcp_scrapling (anti-bot + extraction) when the page is bot-protected.]
Zoya's web ladder: start cheap, escalate only when the cheaper tier fails.

The constraints did most of the work

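Four facts about what Zoya is remove most of the market before quality is assessed:

  1. Sovereign: managed-only services are out.
  2. Single-user: scale features such as spiders and concurrent pools are deprioritized.
  3. MCP-native: tools that plug in directly beat tools that need wrapping.
  4. Already the reasoning agent: a tool that brings its own runtime LLM is redundant.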

What we run

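A three-tier escalation ladder for fetching, plus a search backend:

  1. web_fetch: curl, no JS. The default.
  2. mcp_lightpanda: JS rendering and forms, for pages that need JS.
  3. mcp_scrapling: anti-bot fetch and extraction, for bot-protected pages.

Search runs through SearXNG (self-hosted) first, with the Brave Search API as a fallback that fires only when SearXNG comes up empty.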

The agent-facing rule is one line: try cheap, escalate only when the cheaper tier fails.
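
The one-line rule can be sketched as a loop over tiers, cheapest first. This is a minimal illustration and not Zoya's actual dispatcher: `fetch_with_escalation`, `FetchResult`, and the three stand-in fetchers are hypothetical names, and the real tools are MCP calls rather than local functions.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class FetchResult:
    tier: str  # which tier produced the content
    body: str  # page content as returned by that tier


# A tier is a name plus a callable that returns content, or None when the
# tier cannot handle the page (needs JS, bot-protected, ...).
Tier = tuple[str, Callable[[str], Optional[str]]]


def fetch_with_escalation(url: str, ladder: list[Tier]) -> FetchResult:
    """Try each tier in order, cheapest first; stop at the first success."""
    for name, fetch in ladder:
        try:
            body = fetch(url)
        except Exception:
            body = None  # a crash counts as a miss, so escalate
        if body:
            return FetchResult(tier=name, body=body)
    raise RuntimeError(f"all tiers failed for {url}")


# Hypothetical stand-ins for the three real tools, for illustration only.
def plain_fetch(url: str) -> Optional[str]:
    return None if "spa" in url or "protected" in url else "<html>static</html>"


def js_render(url: str) -> Optional[str]:
    return None if "protected" in url else "<html>rendered</html>"


def stealth_fetch(url: str) -> Optional[str]:
    return "<html>stealth</html>"


LADDER: list[Tier] = [
    ("web_fetch", plain_fetch),
    ("mcp_lightpanda", js_render),
    ("mcp_scrapling", stealth_fetch),
]

print(fetch_with_escalation("https://example.com/spa", LADDER).tier)
```

Treating an exception as a miss is a design assumption here: it lets a crash in a middle tier fall through to the next one instead of failing the whole fetch.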

Why Lightpanda

Zig, Apache-2.0, self-hostable, CDP-compatible, ~9× less RAM than headless Chrome. Headless Chrome costs ~200 MB per instance — wrong tool for a single-host agent that needs a DOM occasionally. Lightpanda is beta: coverage gaps, some sites crash it, a few block it. Acceptable, because the tier below it (web_fetch) and above it (scrapling) cover the misses.

Why Scrapling

BSD-3, built-in MCP server, three fetcher tiers in one package: HTTP (curl-impersonate-style TLS fingerprinting, no browser), Stealthy (Cloudflare bypass), Dynamic (full browser). Its structured extraction is deterministic — element fingerprinting, not an LLM — so it costs nothing per call and survives layout drift. It closes the two gaps above Lightpanda — anti-bot fetch and selector extraction — in one sovereign tool that installs into the sidecar like the document-to-markdown server.
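
To make "element fingerprinting, not an LLM" concrete, here is a toy sketch of the general idea: save a description of an element once, then find the closest match after the layout changes. It is not Scrapling's algorithm or API. The fingerprint format, the similarity measure (`difflib.SequenceMatcher`), and every function name below are assumptions chosen for illustration, and the parser assumes well-formed markup with no void elements.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class ElementCollector(HTMLParser):
    """Collect tag, attributes, and direct text for every element."""

    def __init__(self) -> None:
        super().__init__()
        self.elements: list[dict] = []
        self._stack: list[dict] = []

    def handle_starttag(self, tag, attrs):
        element = {"tag": tag, "attrs": dict(attrs), "text": ""}
        self.elements.append(element)
        self._stack.append(element)

    def handle_endtag(self, tag):
        if self._stack:
            self._stack.pop()

    def handle_data(self, data):
        if self._stack and data.strip():
            self._stack[-1]["text"] += data.strip()


def collect(html: str) -> list[dict]:
    parser = ElementCollector()
    parser.feed(html)
    return parser.elements


def fingerprint(element: dict) -> str:
    """Flatten an element into a comparable string (hypothetical format)."""
    attrs = " ".join(f"{k}={v}" for k, v in sorted(element["attrs"].items()))
    return f"{element['tag']}|{attrs}|{element['text']}"


def relocate(saved: str, html: str) -> dict:
    """Return the element whose fingerprint is most similar to the saved one."""
    return max(
        collect(html),
        key=lambda e: SequenceMatcher(None, saved, fingerprint(e)).ratio(),
    )


OLD = '<div><span class="price" id="p1">19.99</span><span class="name">Widget</span></div>'
NEW = '<section><p class="name title">Widget</p><b class="price-now" id="p1">18.49</b></section>'

# Save the fingerprint once, against the old layout.
saved = fingerprint(next(e for e in collect(OLD) if e["attrs"].get("id") == "p1"))

# After the redesign the tag, class, and position all changed, but the
# closest fingerprint still points at the price element.
print(relocate(saved, NEW)["text"])
```

No model is called at any point, which is the property the paragraph above relies on: the per-call cost is string comparison only.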

The dimensions we scored on

Every candidate was weighed on the same axes. For Zoya, the first four behaved as near-vetoes; the rest were tie-breakers.

  1. Sovereignty — self-hostable vs managed-only. (veto: managed-only is out.)
  2. License — permissive (BSD/MIT/Apache) vs copyleft (AGPL) vs commercial. (AGPL tolerated for a true service, avoided for a dependency.)
  3. LLM dependency — deterministic vs needs an LLM at runtime. (veto-ish: a second runtime LLM is redundant — Zoya already drives one.)
  4. MCP-native — ready to plug in vs needs wrapping.
  5. Anti-bot capability — none / TLS-level / full stealth / managed-bypass.
  6. Adaptivity — breaks on layout change vs adapts.
  7. Scale shape — single-page vs spider vs concurrent pool. (deprioritized: single-user.)
  8. Language / container fit — does it slot into the glibc sidecar; does it ship a usable image.
  9. Maintenance — release cadence, star velocity, sponsor health.
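
The split between near-vetoes and tie-breakers can be sketched as a filter followed by a sort. The attribute values below are taken from the appendix tables; the `tie_breaker` numbers are made up for illustration and are not the scores used in the evaluation. Treating all four leading dimensions as hard filters is also a simplification, since the text calls some of them only "veto-ish".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    self_hostable: bool      # dimension 1
    license: str             # dimension 2: "permissive", "copyleft", "commercial"
    needs_runtime_llm: bool  # dimension 3
    mcp_native: bool         # dimension 4
    tie_breaker: int         # dimensions 5-9 collapsed into one illustrative number


def passes_vetoes(c: Candidate) -> bool:
    """Apply the first four dimensions as filters (a simplification)."""
    return (
        c.self_hostable
        and c.license == "permissive"
        and not c.needs_runtime_llm
        and c.mcp_native
    )


def shortlist(candidates: list[Candidate]) -> list[Candidate]:
    """Drop vetoed candidates, then rank the survivors by tie-breaker."""
    survivors = [c for c in candidates if passes_vetoes(c)]
    return sorted(survivors, key=lambda c: c.tie_breaker, reverse=True)


CANDIDATES = [
    Candidate("Scrapling", True, "permissive", False, True, 3),
    Candidate("Crawl4AI", True, "permissive", False, True, 2),
    Candidate("ScrapegraphAI", True, "permissive", True, True, 3),  # runtime LLM
    Candidate("Diffbot", False, "commercial", False, True, 2),      # managed-only
    Candidate("AutoScraper", True, "permissive", False, False, 1),  # no MCP server
]

for c in shortlist(CANDIDATES):
    print(c.name)
```

The point of the shape is that the tie-breaker score never rescues a vetoed candidate: ScrapegraphAI ties Scrapling on the made-up score and is still dropped.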

What we skipped, and why

Adjacent: document parsers

This stack is web-reading; the sibling problem is converting documents (PDF, DOCX, PPTX) into clean Markdown for the agent to read. Zoya runs markitdown as an MCP sidecar for that today. Docling (IBM Research, MIT, MCP-capable) is the more capable peer: stronger on table structure, layout reconstruction, and embedded figures, and the obvious swap if markitdown's output ever costs us fidelity on tables or scientific PDFs. Both are parsing tools, not retrieval ones. Even a clean, structurally perfect Markdown conversion is still untrusted content until it passes the memory store's trust gate; see "Agent memory is an attack surface".

What would change our mind

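Skips aren't permanent: each is recorded with the condition that would reopen it. Exa, for example, is gated on semantic research becoming a need, and Docling on markitdown's output losing fidelity on tables or scientific PDFs.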

These live in a decision log where every entry carries an explicit "revisit when" trigger. A deferral you can't reopen is just a guess you got attached to.
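
One way to enforce the "revisit when" rule is to make the trigger a required field. This is a sketch of a possible record shape, not the actual log format: the `Decision` class and its field names are hypothetical, and the two entries paraphrase conditions stated elsewhere in this document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    subject: str       # the tool the decision is about
    verdict: str       # for example "run", "skip", "gated", "deferred"
    reason: str        # what decided the call
    revisit_when: str  # the condition that would reopen the decision

    def __post_init__(self) -> None:
        # An entry without a trigger cannot be reopened, so refuse to create it.
        if not self.revisit_when.strip():
            raise ValueError(f"{self.subject}: every entry needs a revisit trigger")


LOG = [
    Decision(
        subject="Exa",
        verdict="gated",
        reason="managed search backend, not needed today",
        revisit_when="semantic research becomes a need",
    ),
    Decision(
        subject="Docling",
        verdict="deferred",
        reason="markitdown covers document conversion today",
        revisit_when="markitdown output costs fidelity on tables or scientific PDFs",
    ),
]

for entry in LOG:
    print(f"{entry.subject}: {entry.verdict}, revisit when {entry.revisit_when}")
```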


Appendix: the landscape we evaluated

The full reference the decision was made against. Sorted by layer; each row carries the dimensions above.

Decision tree

Need to extract data from web pages?
│
├─ Just need clean text/markdown for an LLM to read?
│    → Jina Reader (cheapest), Firecrawl, Crawl4AI (self-hosted)
│
├─ Need typed JSON matching a schema?
│    ├─ Schema known, layout stable  → Scrapling (selectors + adaptive)
│    ├─ Schema known, layout volatile → ScrapegraphAI (LLM extraction)
│    └─ Schema unknown, exploratory  → Diffbot, AgentQL
│
├─ Need to click, fill forms, log in?
│    ├─ Deterministic script → Playwright + Patchright (anti-detect)
│    ├─ LLM-driven actions   → Browser Use, Stagehand
│    └─ Heavy anti-bot       → Browserbase, or Scrapling StealthyFetcher
│
├─ Need to crawl an entire site?
│    ├─ Self-hosted → Scrapling Spiders, Scrapy, Crawl4AI
│    └─ Managed     → Firecrawl, Apify
│
├─ Avoid the browser (use the site's hidden API)?
│    → Unbrowse, or DIY with curl-impersonate
│
└─ Just need search results, not page scraping?
     → Tavily, Exa, SearXNG (self-hosted), Brave Search API

1. Browser engines

The thing that parses HTML and runs JavaScript. Drop-in CDP-compatible engines.

| Name | Lang | License | Self-host | MCP | Status | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| Lightpanda | Zig | Apache-2.0 | Yes | No (CDP) | Beta | 9× less RAM, 11× faster than Chrome headless. No CSS layout/image decode/GPU; web-API coverage incomplete. |
| Obscura | Rust | MIT | Yes | No (CDP) | Beta | Stealth mode, 3,520-domain tracker block, CDP-compatible. |
| Chromium / Chrome | C++ | BSD | Yes | No (CDP) | Stable | Default everywhere. Heavy: ~200 MB+ RAM per instance. |
| Firefox (headless) | C++/Rust | MPL-2.0 | Yes | via mediator | Stable | Different fingerprint than Chrome; useful for diversity. |
| WebKit (headless) | C++ | LGPL/BSD | Yes | via mediator | Stable | Smaller footprint than Chromium; fewer sites optimize for it. |
| Servo | Rust | MPL-2.0 | Yes | No | Research | Not production-ready for headless automation. |
| Ladybird | C++ | BSD | Partial | No | Pre-alpha | Independent engine, not yet usable headless. |

Pick rule: Chromium unless you have a strong reason. Lightpanda/Obscura when you'll spawn >50 concurrent and memory matters. Firefox/WebKit for fingerprint diversity.
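
The ">50 concurrent" threshold follows from simple arithmetic on the two figures quoted in this document (about 200 MB per headless Chrome instance, and a 9× RAM reduction for Lightpanda). The sketch below takes those figures at face value; real per-instance memory depends on the pages loaded, and the helper name is hypothetical.

```python
CHROME_MB_PER_INSTANCE = 200  # "~200 MB" figure quoted in this document
RAM_REDUCTION_FACTOR = 9      # "9× less RAM" figure quoted in this document


def fleet_ram_mb(instances: int, mb_per_instance: float) -> float:
    """Total RAM for a pool of identical browser instances."""
    return instances * mb_per_instance


# Implied Lightpanda footprint under the quoted ratio: about 22 MB.
lightpanda_mb = CHROME_MB_PER_INSTANCE / RAM_REDUCTION_FACTOR

for n in (1, 50):
    chrome = fleet_ram_mb(n, CHROME_MB_PER_INSTANCE)
    light = fleet_ram_mb(n, lightpanda_mb)
    print(f"{n:>3} instances: Chrome ~{chrome:,.0f} MB, Lightpanda ~{light:,.0f} MB")
```

At one instance the difference is under 200 MB; at fifty it is roughly 10 GB against roughly 1.1 GB, which is where the engine choice starts to decide how large a host is needed.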

2. Browser drivers

Libraries that control engines via CDP or WebDriver.

| Name | Lang | License | Anti-detect | MCP | Notes |
| --- | --- | --- | --- | --- | --- |
| Playwright | TS/Py/Java/.NET | Apache-2.0 | No | Yes (MS) | The standard. Multi-engine (Chromium/Firefox/WebKit). |
| Puppeteer | JS/TS | Apache-2.0 | No | Yes | Chrome-only, simpler API than Playwright. |
| Selenium | Multi | Apache-2.0 | No | Community | Legacy WebDriver. Worse perf, broader site support. |
| chromedp | Go | MIT | No | Community | Go-native CDP driver, no Node dependency. |
| Rod | Go | MIT | No | Community | More ergonomic Go CDP driver than chromedp. |
| Patchright | Py/Node | Apache-2.0 | Yes | No | Playwright fork with anti-detection patches. |
| rebrowser-playwright | Node | Apache-2.0 | Yes | No | Different anti-detection patches than Patchright. |
| curl-impersonate | C | MIT | TLS-level | No | HTTP only, no JS. TLS fingerprint matching for Chrome/Firefox/Safari. |

Pick rule: Playwright + Patchright if anti-detection matters; plain Playwright otherwise; curl-impersonate if you can skip JS entirely.

3. URL → clean markdown / text

| Name | Lang | License | Self-host | MCP | LLM? | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| Jina Reader | TS | Apache-2.0 | Yes | Yes | No | Lightest. r.jina.ai/<url> managed. Weak on heavy SPAs. |
| Firecrawl | TS/Py | AGPL-3.0 | Yes (Compose) | Yes | Optional | Full Playwright rendering. Recursive crawl, sitemap mapping. |
| Crawl4AI | Python | Apache-2.0 | Yes | Yes | Optional | LLM-aware chunking. Best self-hosted Firecrawl peer. |
| Trafilatura | Python | Apache-2.0 | library | No | No | Pure text extraction from HTML, no rendering. |
| Readability.js | JS | Apache-2.0 | library | No | No | Mozilla's Reader View algorithm. |
| readabilipy | Python | MIT | library | No | No | Python wrapper around Readability.js. |

Pick rule: Crawl4AI for self-hosted at any scale; Jina Reader for prototyping; Firecrawl for managed recursive crawl; Trafilatura when you already have the HTML.

4. URL → typed JSON (structured extraction)

| Name | Lang | License | Self-host | MCP | LLM? | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| ScrapegraphAI | Python | MIT | library | Yes | Yes | Plain-English prompts → typed JSON. Graph pipeline. |
| Firecrawl Extract | | AGPL | Yes | Yes | Yes | Schema or prompt → JSON. Managed LLM. |
| Diffbot | | Commercial | No | Yes | No (ML) | Knowledge-graph extraction. 10K/mo free tier. |
| AgentQL | Py/JS | Commercial | No | Yes | No (ML) | Natural-language query language for selectors. |
| Kadoa | | Commercial | No | Yes | Yes | Auto-adapts to layout changes. |
| Scrapling (adaptive) | Python | BSD-3 | Yes | Yes | No | Element fingerprinting, not LLM-based. See §5. |
| AutoScraper | Python | MIT | library | No | No | Example-based: give it a page + sample data, it learns. |

Pick rule: ScrapegraphAI for volatile layouts with LLM budget; Scrapling for stable layouts, free; Diffbot/AgentQL for commercial reliability. For an LLM agent: it already extracts — you mostly need a good fetch.

5. Full scraping frameworks

| Name | Lang | License | MCP | Anti-bot | Adaptive | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| Scrapling | Python | BSD-3 | Built-in | Cloudflare/Turnstile | Yes | 3 fetcher tiers (HTTP/Stealthy/Dynamic). Spiders with pause-resume. Docker image with all browsers. |
| Scrapy | Python | BSD-3 | Community | No | No | The OG. Mature, large ecosystem, no built-in stealth. |
| Crawlee | TS/Py | Apache-2.0 | Community | Some | No | Apify's OSS framework. Good Playwright + queue integration. |
| Botasaurus | Python | MIT | No | Yes | No | Stealth-first scraping framework. |
| Katana | Go | MIT | No | No | No | Crawling-focused, fast, from ProjectDiscovery (security tooling). |

Pick rule: Scrapling for new projects; Scrapy for legacy/team familiarity; Katana for security recon, not data extraction.

6. Agent harnesses (LLM drives the browser)

| Name | Lang | License | Self-host | MCP | Notes |
| --- | --- | --- | --- | --- | --- |
| Browser Use | Python | MIT | Yes | Yes | 89.1% WebVoyager. Leading OSS agent-browser framework. |
| Stagehand | TS | MIT | lib, cloud-coupled | Yes | Browserbase's framework; v3 dropped Playwright for modular drivers. |
| Skyvern | Python | AGPL-3.0 | Yes | Yes | Form-filling and workflow focus. |
| Hermes Agent | Python | MIT | Yes | Yes (self) | Persistent memory, multi-surface agent. Browser is one tool. |
| Anthropic computer-use | | Commercial | No | N/A | Drives a full desktop. Most general, most expensive. |
| Magnitude | TS | MIT | Yes | Yes | Newer LLM-driven browser automation. |

Pick rule: Browser Use for OSS production; Stagehand if you're already on Browserbase; Hermes Agent as an architectural reference for Zoya rather than a competitor.

7. Cloud browser infrastructure (managed)

Headless Chrome as a service. Skipped wholesale — sovereignty.

| Name | Anti-bot | MCP | Notes |
| --- | --- | --- | --- |
| Browserbase | Strong (residential proxies, CAPTCHA) | Yes | Most common pick for agent infra. |
| Browserless | Yes | Yes | Self-hostable Docker option also exists. |
| Steel | Yes | Yes | Newer entrant, agent-native API. |
| Kernel | Yes | Yes | Agent-focused, includes session replay. |
| Bright Data | Strongest | Yes | Enterprise-grade, expensive. |
| Apify | Yes | Yes | Marketplace of pre-built scrapers + infra. |
| ScrapingBee | Yes | Yes | Credit-based, JS-rendering optional. |
| ZenRows | Yes | No | Scraping API with anti-bot rotation. |

8 – 10. Self-hosted agentic browsers, API-discovery, search

| Category | Name | License / Self-host | Notes |
| --- | --- | --- | --- |
| Agentic browser | BrowserOS | AGPL-3.0 | Open-source Comet/Dia alternative, runs agents locally. |
| Agentic browser | Steel Browser | Apache-2.0 | Self-hostable Steel infrastructure. |
| Agentic browser | Open Operator | MIT | OSS clone of the Operator pattern. |
| API-discovery | Unbrowse | Managed (MCP) | Discovers a site's internal APIs and calls them directly. Sub-second where it works. |
| Search | SearXNG | Self-host, MCP | Meta-search aggregator, privacy-first. Zoya's primary search (sovereign default). |
| Search | Brave Search API | Managed | Independent index. Reliable fallback below SearXNG; only fires when SearXNG comes up empty. |
| Search | Tavily | Managed | Skipped: agent-search API ≈ Brave with cleaner snippets; redundant. |
| Search | Exa | Managed | Neural/semantic (find-by-meaning). Gated: add if semantic-research becomes a need. |
| Search | Perplexity API | Managed | Skipped: search + LLM synthesis bundled; Zoya already does the synthesis. |
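
The search chain described in the SearXNG and Brave rows (sovereign backend first, managed fallback only on an empty result) can be sketched as below. The backends are passed in as plain callables so that no real API is assumed; `search_chain` and both stub functions are hypothetical names, and the stubs stand in for the real SearXNG and Brave clients.

```python
from typing import Callable

# A backend takes a query and returns a list of result URLs.
SearchBackend = Callable[[str], list[str]]


def search_chain(
    query: str, primary: SearchBackend, fallback: SearchBackend
) -> tuple[str, list[str]]:
    """Ask the primary backend first; call the fallback only if it is empty."""
    results = primary(query)
    if results:
        return "primary", results
    return "fallback", fallback(query)


# Hypothetical stubs standing in for the real SearXNG and Brave clients.
def searxng_stub(query: str) -> list[str]:
    return [] if "obscure" in query else ["https://example.org/a"]


def brave_stub(query: str) -> list[str]:
    return ["https://example.org/b"]


print(search_chain("common topic", searxng_stub, brave_stub))
print(search_chain("obscure topic", searxng_stub, brave_stub))
```

Because the fallback is only invoked on an empty primary result, a query that the self-hosted backend can answer never leaves the host.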

11. MCP servers (cross-category index)

Self-hostable (sovereign-friendly): Scrapling, Crawl4AI, Firecrawl (self-host backend), Jina Reader (self-host backend), ScrapegraphAI, Playwright (MS), Chrome DevTools, Puppeteer, SearXNG, Browser Use, Hermes Agent, Unbrowse (self-host mode).

Managed-only: Browserbase, Browserless (cloud), Bright Data, Apify, Tavily, Exa, Linkup, ScrapingBee, Firecrawl (managed), Jina (managed).

Picker logic for common tasks

| Task | Sovereign pick | Managed pick |
| --- | --- | --- |
| Read a single URL for LLM context | Crawl4AI / Jina Reader (self-host) | Jina Reader (r.jina.ai) |
| Typed JSON from a known site | Scrapling | ScrapegraphAI cloud |
| Crawl a docs site for RAG | Crawl4AI | Firecrawl |
| Click through an auth flow | Playwright + Patchright | Browserbase + Stagehand |
| Bypass Cloudflare | Scrapling StealthyFetcher | Browserbase / Bright Data |
| LLM-driven multi-step task | Browser Use | Browserbase + Stagehand v3 |
| Search the web | SearXNG | Brave (Tavily/Exa for agent-tuned/semantic) |
| 100+ concurrent headless | Lightpanda / Obscura | Browserbase pool |
| Scrape without rendering JS | curl-impersonate | ScrapingBee (no-JS) |
| Call a site's hidden API | DIY with mitmproxy | Unbrowse |

Notable combinations

Sources