The SOAR anti-pattern tax

SOAR is awesome if you sell it. Not so much if you're paying for it, or using it. Not the way it works today, anyway.

The category sells three things that aren't true. Cortex XSOAR, Splunk SOAR, Swimlane Turbine, Sentinel Playbooks, ServiceNow SecOps, Chronicle SOAR, Tines, Torq. Different pitch decks, same architecture, same pricing, same downstream mess. Here's what each piece of the marketing actually means after you've lived inside it.

There is an exit, and it works beautifully. It runs at under 10% of the vendor bill, doesn't need a steering committee or a multi-quarter program, and the vendors don't have to be told until the renewal goes a different direction than they expected. Diagnosis first, then the exit.

Myth #1: "Low-code, no engineers needed"

The pitch: analysts click playbooks together in a visual editor. No engineering team required.

What happens: the first ten playbooks click together fine. Playbook eleven needs a regex. Twelve needs date math. The hundredth needs a retry loop and a state machine. The analysts file tickets to the engineers. The engineers, who would happily write Python, are now writing programs in JSON through a clicky web interface. A single conditional, one that overrides severity for a sensitive subnet, looks like this on disk:

{
  "transformations": {
    "severity": {
      "$:if": {
        "$:eq": ["$inputs.alert.source_ip_prefix", "10.239.3."],
        "then": "critical",
        "else": "$inputs.alert.severity"
      }
    }
  }
}

That's an if statement. The fact that it's nested JSON instead of if alert.source_ip.startswith("10.239.3."): severity = "critical" doesn't make it less of a program. It makes it a worse one. A mature SOAR deployment contains tens of thousands of these. None grep. None test. None diff. Code review involves mentally translating JSON back into English so you can see what changed.
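
For comparison, here is roughly what the same rule looks like as plain Python. This is a sketch: the function name and the shape of the alert dictionary are assumptions that mirror the field names in the JSON above, not anything a vendor ships.

```python
def resolve_severity(alert: dict) -> str:
    """Escalate alerts from the sensitive subnet; otherwise keep the reported severity."""
    # Mirrors the JSON rule: an equality check on the precomputed prefix field.
    if alert.get("source_ip_prefix") == "10.239.3.":
        return "critical"
    return alert["severity"]


# The rule can now be exercised directly, with no platform in the loop.
assert resolve_severity({"source_ip_prefix": "10.239.3.", "severity": "low"}) == "critical"
assert resolve_severity({"source_ip_prefix": "192.168.1.", "severity": "low"}) == "low"
```

Two assertions are already more test coverage than the JSON version ever gets.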

The labor math is upside down. Python engineers are easy to hire. Engineers fluent in a specific vendor's expression dialect are not. The product whose stated purpose is reducing your engineering bill instead pins your engineers to a worse-than-VS-Code editor where they work three to five times slower than they would in a normal language. Low-code as sold is click-ops as delivered.

Myth #2: "The visual editor is the source of truth"

The source of truth is the JSON. The visual editor renders it. The platform canonicalises the JSON on every save, so commits arrive as diffs of reordered keys when nothing changed. The expression dialect is the vendor's own, so what you learn doesn't transfer to anything else. The graph view is auto-laid-out, which means it mutates whenever you edit, which means it's not documentation.
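
Teams that keep playbook exports in git tend to bolt on something like the sketch below to filter out the reordered-key noise. It uses only the standard library. The two sample documents are made up, and it assumes the export is plain JSON where key order carries no meaning.

```python
import json


def semantically_equal(old_text: str, new_text: str) -> bool:
    """True when two JSON documents differ only in key order or whitespace."""
    # Parsed dicts compare by content, so key order and formatting drop out.
    return json.loads(old_text) == json.loads(new_text)


def canonical(text: str) -> str:
    """Stable serialisation for committing, so only real changes show up as diffs."""
    return json.dumps(json.loads(text), sort_keys=True, indent=2)


before = '{"name": "Process Alert", "steps": [{"id": 1, "action": "enrich"}]}'
after = '{"steps": [{"action": "enrich", "id": 1}], "name": "Process Alert"}'

assert semantically_equal(before, after)
assert canonical(before) == canonical(after)
```

List order is still treated as significant, which is usually what you want for an ordered sequence of steps. None of this fixes the underlying problem. It only makes the diffs quieter.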

The vendor ships a custom IDE because the JSON is unreadable. The custom IDE is then worse than VS Code at every task an engineer cares about: search, refactor, format, blame, jump-to-definition. There is no language server because the code is a string inside JSON. There is no formatter because the JSON gets canonicalised after you save. Nothing is defined in a place you can navigate to.

Myth #3: "Per-run pricing aligns with usage"

Per-run pricing aligns with consolidation, not usage. This is the myth that bends the architecture.

The shape you'd pick on a whiteboard is one playbook per customer per application. Twenty customers, four SIEM integrations, you get eighty named playbooks, the way you'd get eighty Python modules. Per-run pricing makes that bill impossible. So buyers merge.

The pattern is the same in every multi-tenant deployment:

Hundreds of per-customer scripts collapse into a few dozen shared playbooks.
Customer identity stops being a property of the playbook. It becomes a field in the alert.
"Process Alert" grows to fifteen actions, each branching on customer via embedded ternaries.
Per-customer behaviour moves into nested JSON expressions that no one can read back.
Typos in playbook names survive forever, because they became canonical identifiers for downstream tooling.

The bill went down. The cognitive load went up. The vendor doesn't care which way that trade lands.

The hidden tax: priority inversion

The merged-dispatcher architecture has a worse downstream effect than the cost cascade. It breaks priority.

Under per-customer playbooks, a low-priority storm from one client lands on that client's queue. The other nineteen customers are unaffected. Critical alerts on their queues sail through.

Under one shared dispatcher, every customer's traffic feeds the same playbook. One client's SIEM misfires and pushes ten thousand low-priority alerts in five minutes. The vendor's queue is FIFO with maybe one coarse priority lane. A critical alert from a different customer arriving during the storm now sits behind several thousand routine ones. The architecture optimised for billing, not routing.

You discover this in production, mid-incident, when a page about a real attack arrives forty minutes late because someone else's noisy detector flooded the pipe. The vendor's response is to add another lane. A lane is a knob. Architecture is what the per-run pricing forbade in the first place: per-customer pipelines, separate queues, isolated blast radius. Free in the right pricing model, structurally unavailable in the wrong one.
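
The inversion is easy to reproduce on a laptop. The sketch below is a toy model, not any vendor's scheduler: customer names and alert IDs are invented, and it counts only how many alerts sit ahead of the critical one under each layout.

```python
from collections import defaultdict, deque


def alerts_ahead_shared_fifo(queue, target_id):
    """Number of alerts processed before the target in one shared FIFO queue."""
    for position, alert in enumerate(queue):
        if alert["id"] == target_id:
            return position
    raise ValueError("target not queued")


def alerts_ahead_per_customer(queue, target_id):
    """Same question when each customer has its own queue and its own consumer."""
    per_customer = defaultdict(deque)
    for alert in queue:
        per_customer[alert["customer"]].append(alert)
    for customer_queue in per_customer.values():
        for position, alert in enumerate(customer_queue):
            if alert["id"] == target_id:
                return position
    raise ValueError("target not queued")


# Customer "acme" misfires: 10,000 low-priority alerts, then one critical from "globex".
storm = [{"id": i, "customer": "acme", "priority": "low"} for i in range(10_000)]
critical = {"id": "real-attack", "customer": "globex", "priority": "critical"}
shared = deque(storm + [critical])

print(alerts_ahead_shared_fifo(shared, "real-attack"))   # 10000
print(alerts_ahead_per_customer(shared, "real-attack"))  # 0
```

Same traffic, same alert. The only variable is whether the queue boundary follows the customer or the invoice.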

The hidden tax: lost attribution

Once everyone's traffic flows through the same shared playbook, the platform loses customer identity from its own logs. The playbook_id on a failed run points at the dispatcher, not the customer. To find out whose alert failed, you walk the parent chain looking for an ancestor whose title still mentions the customer.

Every mature SOAR deployment writes this code. Two hundred lines of Python that reconstruct information the platform discarded. It is the dark matter of SOC engineering. It exists because the billing model forces an architecture that destroys the platform's own observability, and the platform refuses to fix what it broke.
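
The core of that code usually looks something like this. It is a minimal sketch under stated assumptions: the run records, their `parent` and `title` fields, and the customer list are all hypothetical stand-ins for whatever the platform's run-history API actually returns.

```python
import re
from typing import Optional

# Hypothetical run records keyed by run ID.
RUNS = {
    "run-3": {"parent": "run-2", "title": "Process Alert"},
    "run-2": {"parent": "run-1", "title": "Dispatch"},
    "run-1": {"parent": None, "title": "Ingest - Acme Corp - Sentinel"},
}
KNOWN_CUSTOMERS = ["Acme Corp", "Globex"]


def find_customer(run_id: str, runs: dict, customers: list[str]) -> Optional[str]:
    """Walk the parent chain until some ancestor's title names a known customer."""
    seen = set()
    current = run_id
    while current is not None and current not in seen:
        seen.add(current)  # guard against cycles in malformed data
        run = runs.get(current)
        if run is None:
            return None
        for customer in customers:
            if re.search(re.escape(customer), run["title"], re.IGNORECASE):
                return customer
        current = run["parent"]
    return None


assert find_customer("run-3", RUNS, KNOWN_CUSTOMERS) == "Acme Corp"
```

The real versions grow to two hundred lines because titles get renamed, chains break, and someone has to maintain the customer list.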

One more wrinkle: the per-run bill doesn't go to zero after the merge. It goes down by a factor proportional to how aggressively you consolidated. You're still paying per run. The runs are just fatter now. The merge bought you a discount, not an escape.

What gets pushed out of the platform

The other thing every mature SOC does is bleed work out of the platform to dodge the per-run rate.

Enrichment moves to standalone Python services the SIEM calls directly.
Asset-monitoring loops move to cron on whatever VM the SOC operates.
Sidecar scripts proliferate, doing the work that would have lived in a per-customer playbook if per-customer playbooks were affordable.
Observability moves to Prometheus and Grafana, because the platform's dashboards don't see anything that ran outside the platform.

Each move is a sensible local optimisation. The aggregate result is that the platform is the database for orchestration but no longer where orchestration lives. The vendor still gets paid for the playbooks that remain. The system still works, mostly, with attribution tooling layered on top. The architecture is an awkward equilibrium between billing and reality.

What the tax actually costs

Public per-run rates run $0.05 to $0.50 per dispatch, depending on vendor and contract. A SOC running 5,000 alerts per day through a four-to-six-playbook fan-out generates 20,000 to 30,000 billable runs per day, which at those rates is $1,000 to $15,000 per day in orchestration billing. Call it six figures a year at the low end, seven at the high.
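
The multiplication is worth rerunning against your own contract. The sketch below plugs in the rate range, alert volume, and fan-out quoted above. The function name is mine, and it assumes every alert triggers the full fan-out at list price, which a negotiated contract may not.

```python
def daily_run_cost(alerts_per_day: int, runs_per_alert: int, rate_per_run: float) -> float:
    """Per-run billing for one day, assuming every alert triggers the whole fan-out."""
    return alerts_per_day * runs_per_alert * rate_per_run


low = daily_run_cost(5_000, 4, 0.05)   # cheapest rate, narrowest fan-out
high = daily_run_cost(5_000, 6, 0.50)  # priciest rate, widest fan-out

print(f"per day:  ${low:,.0f} to ${high:,.0f}")
print(f"per year: ${low * 365:,.0f} to ${high * 365:,.0f}")
```

Swap in your own volume, fan-out, and negotiated rate. The order of magnitude rarely moves much.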

The same workload on a durable-execution engine running on a single VM is dominated by the database bill plus one Python process. Low hundreds of dollars per month. The ratio sits between 10× and 50×.

The vendor's cost basis doesn't justify the spread. The spread is the price of the lock-in: the integration library, the analyst UX, the case-management bridge, and most of all the assumption that you can't build your own.

Where the volume actually lives

The volume sits in the SIEM, not the SOAR. Millions of raw events per day, sometimes billions, across logs, network traces, EDR telemetry, identity events. The SIEM's job is to compress that into the things worth looking at. By the time signal exits the SIEM as an alert, the volume has collapsed by three to five orders of magnitude.

SOAR sits below the SIEM in the funnel. Its inputs are the SIEM's outputs. A SOC processing a billion SIEM events per day might see five thousand SOAR alerts per day, generate two hundred ServiceNow tickets, and page a human on twenty of them. Each stage compresses the volume further and demands more attention per surviving item.

The vendor's pricing model is calibrated as if SOAR sat near the top of this funnel. Per-run pricing makes some sense for a system processing the raw event stream. It makes much less sense for a system processing the already-condensed alerts. You're paying SaaS rates for a workload that comfortably fits on a small VM, on the bottom step of a funnel where signal is rare and attention is the expensive resource.

The cost gap, the performance gap, the architectural-quality gap, all follow from this miscalibration. The system was priced for a position in the pipeline it doesn't occupy.

The performance gap nobody talks about

The cost gap is the headline. The performance gap is the part that surprised me.

The same workflow that runs on the vendor's distributed Kubernetes fleet, with all the queueing and per-action serialization and cross-pod hops and JSON parse-and-rebuild at every step, runs roughly 50× faster in straight Python on a single modest desktop. The desktop isn't fast. The vendor's runtime is doing tremendous amounts of work that has nothing to do with the user's logic.

This isn't an argument against distributed systems. SOC traffic is small. Five thousand alerts a day is one every seventeen seconds. The vendor's runtime is sized for a global multi-tenant SaaS. Each tenant's actual load fits on a laptop with capacity left over. The Kubernetes overhead is paying for properties the buyer isn't using.

And the thing engineers forget exists, after enough years inside a vendor IDE: complete visibility. Breakpoint anywhere. Read any state. Walk any call stack. Replay a run against mocked integrations. Profile a function. Diff two versions of any logic. None of these are exotic. They're the daily working conditions of writing in a real language. The vendor IDE removes them and convinces you the loss is normal.

Myth #4: "Building your own would cost years"

This is the myth that protects all the others.

Every "should we just build it" conversation in security automation has died at the same estimate. Hand-porting hundreds of vendor-flavored expressions. Reimplementing dozens of integrations. Building case management, observability, a UI. Multi-quarter death march. The vendor bill, by comparison, looks like a bargain.

The estimate assumed hand-porting. It also assumed the comparable benchmark was fast. It wasn't.

Moving from Swimlane 10x to Swimlane Turbine, the same vendor's own next-generation product, takes six months minimum and an endless rotation of conference calls. Same shape for Demisto to XSOAR's newer tier. Same for Phantom upgrades, Sentinel rewires, every vendor's internal version transitions. Vendor-to-vendor migrations run on the vendor's professional-services calendar, not yours, and consistently take longer than building from scratch on your own calendar. The benchmark that made "build your own" look slow was itself running at a pace nobody should have accepted.

SOAR playbooks are a constrained input format. The DAG is regular. The action types enumerate at 50-100 per vendor. The expressions are short. Each node has known inputs and outputs. This is the cleanest possible target for AI-assisted transpilation, and current coding agents handle it in days, not quarters. You parse the source, dispatch a model per node, review the output against a known input/output pair. The reviewer has something to diff against. The model isn't writing new logic. It's restating existing logic in a better language.
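
The review step is the part worth automating first. A minimal sketch, assuming you can pull input/output pairs for each node out of the vendor's run history. The harness name, the sample pairs, and the ported function are all illustrative.

```python
from typing import Any, Callable


def check_port(ported: Callable[[dict], Any],
               recorded_pairs: list[tuple[dict, Any]]) -> list[dict]:
    """Run a ported node against recorded vendor inputs; return every mismatch."""
    mismatches = []
    for inputs, expected in recorded_pairs:
        actual = ported(inputs)
        if actual != expected:
            mismatches.append({"inputs": inputs, "expected": expected, "actual": actual})
    return mismatches


# Candidate port of the severity rule shown earlier, as a coding agent might hand it back.
def severity_node(inputs: dict) -> str:
    alert = inputs["alert"]
    return "critical" if alert["source_ip_prefix"] == "10.239.3." else alert["severity"]


# Hypothetical pairs captured from the vendor's run history.
pairs = [
    ({"alert": {"source_ip_prefix": "10.239.3.", "severity": "low"}}, "critical"),
    ({"alert": {"source_ip_prefix": "172.16.0.", "severity": "medium"}}, "medium"),
]

assert check_port(severity_node, pairs) == []
```

An empty mismatch list is the acceptance criterion for the node. A non-empty one is a precise bug report to hand back to the model.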

The other half of the work, the platform layer (durable engine, integration libraries, observability, case-management bridge), is mostly already built at any SOC that's been operating long enough. It's the cron scripts, Prometheus exporters, parity-checks, and sidecar services accumulated over the years to make the vendor situation bearable. Pointing it at an in-house engine instead of the vendor's is wiring.

The migration tax that protected the category collapsed sometime in the last eighteen months. The vendors haven't noticed.

What the right shape looks like

Workflow as code, engine as runtime. Two flavors of this work today:

Code-as-workflow: Temporal, Prefect, Dagster, Restate, Inngest. Workflows are functions. The engine snapshots state. No DSL.
Thin orchestrator + real code: Step Functions calling Lambdas, Argo calling containers. The YAML/JSON is metadata only. The logic is in a codebase.

The rule that keeps either pattern healthy: no conditionals in workflow definitions beyond one on-success/on-failure split per action. Everything else moves to code. A code-review policy enforces it. The death spiral starts when "just one more if" creeps into the playbook because it's faster than touching the external service.
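
In the code-as-workflow flavor, the rule looks something like the sketch below. It is plain Python with no engine attached. The step functions are hypothetical stand-ins, and a real durable-execution engine would add its own decorators and checkpointing around them.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    customer: str
    source_ip: str
    severity: str


def enrich(alert: Alert) -> dict:
    """Stand-in for a real enrichment call (hypothetical)."""
    return {"sensitive_subnet": alert.source_ip.startswith("10.239.3.")}


def decide_severity(alert: Alert, enrichment: dict) -> str:
    # All branching lives here, in ordinary code with ordinary tests.
    if enrichment["sensitive_subnet"]:
        return "critical"
    return alert.severity


def open_ticket(alert: Alert, severity: str) -> dict:
    """Stand-in for a case-management call (hypothetical)."""
    return {"customer": alert.customer, "severity": severity}


def process_alert(alert: Alert) -> dict:
    """The workflow itself: a straight sequence of steps with no conditionals of its own."""
    enrichment = enrich(alert)
    severity = decide_severity(alert, enrichment)
    return open_ticket(alert, severity)


ticket = process_alert(Alert("acme", "10.239.3.7", "low"))
assert ticket == {"customer": "acme", "severity": "critical"}
```

The workflow reads top to bottom. The decision sits in one function that a reviewer can grep for and a test can call directly.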

The ingestion side

Alerts still have to get from the SIEM to the orchestration engine, often across firewalls and across clouds. Every major cloud ships the primitive: Azure Service Bus, AWS SQS, GCP Pub/Sub. The SIEM pushes from inside whatever network segment it lives in. The orchestrator pulls. At SOC volumes, the queue bill is double-digit dollars per month. You're paying for capacity that vastly outstrips what any single tenant ever uses.

The things that fall out of this setup, none of which the SOAR vendor's architecture gave you for free:

High availability by default. The queue survives a node failure on either side.
Horizontal scale by adding consumers. Headroom into the millions of messages.
Priority queues that actually work. Critical to one queue, routine to another, drained independently. The priority-inversion pathology from earlier stops being a pathology.
Serverless processing if you want zero ops. Lambda or Functions reads the queue, runs the workflow, exits. At SOC volume the per-invocation bill is rounding error.
Audit logs from the queue by default. Trace ID per message, retries recorded, dead-letters with a paper trail.
Tests against recorded message batches. Replay a day of production traffic through the workflow on a laptop. Diff against the vendor's outputs.
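
Managed queues handle redelivery and dead-lettering for you. The semantics are simple enough to show with an in-memory stand-in, though. This sketch uses only the standard library. The message envelope, the `MAX_ATTEMPTS` value, and the handler are assumptions for illustration, not any cloud provider's API.

```python
import queue
import uuid

MAX_ATTEMPTS = 3  # assumed redelivery limit


def enqueue(inbox: queue.Queue, body: dict) -> None:
    """Wrap a payload in an envelope with a trace ID and an attempt counter."""
    inbox.put({"trace_id": str(uuid.uuid4()), "attempts": 0, "body": body})


def drain(inbox: queue.Queue, handler, dead_letters: list, audit: list) -> None:
    """Pull until empty, retrying failures and dead-lettering poison messages."""
    while True:
        try:
            message = inbox.get_nowait()
        except queue.Empty:
            return
        message["attempts"] += 1
        try:
            handler(message["body"])
            audit.append((message["trace_id"], message["attempts"], "ok"))
        except Exception as exc:
            audit.append((message["trace_id"], message["attempts"], f"error: {exc}"))
            if message["attempts"] >= MAX_ATTEMPTS:
                dead_letters.append(message)  # keep the evidence, stop retrying
            else:
                inbox.put(message)  # redeliver


def handler(body: dict) -> None:
    if body.get("malformed"):
        raise ValueError("cannot parse alert")


inbox, dead, audit = queue.Queue(), [], []
enqueue(inbox, {"alert": "ok"})
enqueue(inbox, {"malformed": True})
drain(inbox, handler, dead, audit)

assert len(dead) == 1 and dead[0]["attempts"] == 3
assert len(audit) == 4  # one success plus three failed attempts
```

Every attempt lands in the audit list with its trace ID, and the poison message ends up somewhere you can inspect it instead of vanishing inside a runtime.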

Nothing is ever quite this simple in practice. Cross-cloud networking gets weird. Schema drift on the SIEM side surprises you. A queue is not magic. But compared to click-ops programming inside a buggy web IDE, the failure modes are normal engineering problems with normal engineering answers.

Audit works. Tests work. Automation works. AI analysis on the message archive works. It really is insane how well it works.

The unbundling

SOAR sells four things. Three are commodity:

Integrations: typed Python libraries on PyPI, version-pinned, with mocks.
Analyst UX and case management: ServiceNow Security Operations, Jira Service Management, a half-dozen specialists. Already in production at most SOCs.
Observability: Prometheus, Grafana, Thanos. Already deployed.

The fourth, orchestration, is a Python function and a queue.

The bundle is the price. The unbundle is three weeks of work plus two days of validation per SIEM path. A quarter, with one engineer.

Who is actually buying this

SOAR is a procurement category, not a technology category. That distinction explains most of what's wrong with it.

The buyer is the CISO, the SOC director, the head of security operations. Title-level role, budget authority, no recent time spent inside a code editor. The technical evaluator, the engineer who would actually use the product day to day, is not in the conversation. They get told what was bought, after.

The simplest test for any tool is: would a senior engineer who tested it choose it for their own workload? For SOAR, the answer is uniformly no. The same engineer would write the equivalent logic in Python in an afternoon and have it under test by the next morning. They know this. They aren't asked.

The vendors know it too. The sales motion is built around it. Demos lead with the drag-and-drop editor and not the JSON underneath. RFPs are filled out by sales engineers who know which compliance boxes to check. The marketing speaks to the buyer's anxieties (MTTR, analyst burnout, audit coverage) in language the buyer can repeat upward. None of the claims can be verified without actually using the product, and the buyer never does.

What gets bought, then, is a tool optimised to be procurable. Long feature lists. Integration counts in the high three digits. Visual editors that look productive in a five-minute demo. The criteria that would matter if an engineer were involved (diffability, testability, latency, cost per workflow at scale, observability) appear in no RFP and on no scoresheet. The engineer arrives later to find what was procured, suffers, and starts moving work out of the platform to make their day tolerable. The cycle continues because the buyer never sees the engineer's experience and the engineer has no path back to the buyer.

This isn't unique to SOAR. It's the standard pathology of enterprise procurement applied to a category where the cost of the mismatch is unusually visible. SIEM has the same shape. So does most identity tooling. SOAR is just the one where the per-run pricing keeps the wound open longest.

Why no vendor will ship the right thing

Three structural reasons the code-first SOAR doesn't exist as a product:

Buyer model. Code-first requires the buyer to also be the engineering judge. The buyer is not. See above.
Acquirer model. Demisto to Palo Alto. Phantom to Splunk to Cisco. Siemplify to Google. Architectures freeze on acquisition day. The acquirer optimises existing revenue. Nobody redesigns the category after the cheque clears.
Margin model. Per-run pricing is too lucrative to give up first. Whoever moves first sacrifices margin for differentiation. Classic prisoner's dilemma.

The replacement comes from a startup, from a durable-execution vendor (Temporal, Inngest, Restate) extending into security as a wedge, or from a high-volume engineering-led SOC that builds it for itself and quietly open-sources the platform layer. None of these candidates has a per-run model to protect.

The build itself

The mechanics are mechanical. Export the workflows from the vendor as JSON. Most vendors will let you, even if reluctantly. Hand the export to a coding agent with the target architecture spec, the integration libraries, and the team's style guide. Get back a transpile plan, review it, then execute it. Add tests against the original system's outputs. Add tests for the failure modes the vendor's runtime quietly swallowed. Add monitoring at every step the original platform was opaque about. Add more tests.

The timing is what surprises people who haven't tried it. Realistic per-action use-case tests in days. Full workflow ports in weeks. A couple of weeks of running both systems in parallel against the live alert stream, watching for drift. It works surprisingly fast.

The thing the build gives you, beyond the cost win, is visibility. Every step has a log line you wrote. Every transformation has a test. Every integration call has a mock. Every failure has a stack trace pointing at a file on disk. The vendor's runtime made all of this invisible. The replacement makes all of it boring, which is the right state for production code.
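
What "every integration call has a mock" means in practice is something like the sketch below. The `TicketingClient` interface, its method signature, and the fake are hypothetical. A real integration library would define its own.

```python
from typing import Protocol


class TicketingClient(Protocol):
    """The interface the workflow depends on, not a concrete vendor SDK."""

    def create_ticket(self, customer: str, severity: str, summary: str) -> str: ...


class FakeTicketing:
    """In-memory mock that records calls instead of hitting a real API."""

    def __init__(self) -> None:
        self.created: list[dict] = []

    def create_ticket(self, customer: str, severity: str, summary: str) -> str:
        self.created.append({"customer": customer, "severity": severity, "summary": summary})
        return f"TCK-{len(self.created)}"


def escalate(client: TicketingClient, customer: str, summary: str) -> str:
    """Workflow step under test: opens a critical ticket and returns its ID."""
    return client.create_ticket(customer, "critical", summary)


fake = FakeTicketing()
assert escalate(fake, "acme", "beaconing from 10.239.3.7") == "TCK-1"
assert fake.created[0]["severity"] == "critical"
```

The step takes the client as an argument, so the test needs no network, no credentials, and no platform.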

The parallel-track exit

The migration that nobody wants to undertake isn't a migration. It's a parallel build with a quiet handover. Four steps:

Build alongside. Stand up the in-house pipeline next to the vendor's. Same inputs, same outputs, no production dependency. The vendor keeps running. Nothing in the SOC's day changes.
Validate against the live system. Run both pipelines on the same alert stream. Diff the tickets, severities, enrichments. Fix the in-house side until parity holds. Most teams already have the diff tooling, because they built it during the last migration.
Operate at under 10% of vendor cost. The validation phase pays for itself before it finishes.
Flip the switch. When parity holds across weeks, not days, promote in-house to primary. The vendor's playbooks become reference. After another clean window, pause them.
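
Steps two and four reduce to a diff and a gate. A minimal sketch, assuming both pipelines emit one output record per alert ID. The field names, the sample records, and the fourteen-day window are illustrative choices, not a recommendation.

```python
def diff_outputs(vendor: dict[str, dict], in_house: dict[str, dict],
                 fields=("severity", "ticket_queue")) -> list[tuple]:
    """Compare per-alert outputs; return (alert_id, field, vendor_value, in_house_value)."""
    drift = []
    for alert_id, expected in vendor.items():
        actual = in_house.get(alert_id)
        if actual is None:
            drift.append((alert_id, "<missing>", expected, None))
            continue
        for field in fields:
            if expected.get(field) != actual.get(field):
                drift.append((alert_id, field, expected.get(field), actual.get(field)))
    return drift


def ready_to_promote(daily_drift_counts: list[int], clean_days_required: int = 14) -> bool:
    """Parity across weeks, not days: the most recent N days must all be drift-free."""
    recent = daily_drift_counts[-clean_days_required:]
    return len(recent) == clean_days_required and all(count == 0 for count in recent)


vendor = {"a1": {"severity": "critical", "ticket_queue": "tier2"},
          "a2": {"severity": "low", "ticket_queue": "tier1"}}
in_house = {"a1": {"severity": "critical", "ticket_queue": "tier2"},
            "a2": {"severity": "medium", "ticket_queue": "tier1"}}

assert diff_outputs(vendor, in_house) == [("a2", "severity", "low", "medium")]
assert not ready_to_promote([0] * 10)       # clean, but not for long enough
assert ready_to_promote([3, 1] + [0] * 14)  # early drift, then two clean weeks
```

This version treats the vendor's output as ground truth and ignores alerts that only the in-house side produced. Whether those count as drift is a call for the team running the comparison.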

The path doesn't need a steering committee. It doesn't need executive sponsorship for a multi-quarter project. One engineer, building in parallel, validated against the vendor's outputs as ground truth. The vendor never has to be told. They find out when the renewal goes a different direction than expected.

The category math

The vendors aren't telling you any of this. The transpilation pipelines are too new to be in the trade press. But once the math is clear, it propagates.

A category that survived a decade on the unaffordability of building your own does not survive a year in which building your own becomes affordable.

SOAR platforms (Cortex XSOAR, Splunk SOAR, Swimlane, Sentinel, Tines, Torq) sell security teams a low-code orchestration layer billed per playbook run. Three of its promises do not survive contact with the product, and a fourth used to protect them. That fourth has now collapsed: building your own is affordable, and the exit runs at under 10% of the vendor bill.

The three broken promises:

Low-code is click-ops. The first ten playbooks click together; the hundredth needs a retry loop. Engineers write programs in JSON that do not grep, test, or diff, three to five times slower than plain Python.
The JSON is the source of truth, not the editor. The platform canonicalises on every save, so commits arrive as reordered-key noise, and the custom IDE is worse than VS Code at every task that matters.
Per-run pricing aligns with consolidation, not usage. Teams merge per-customer scripts into shared dispatchers, burying customer identity in nested ternaries.

The hidden taxes

Merging into one dispatcher creates two costs the saving hides. Priority inversion: one client's misfiring SIEM floods a shared FIFO queue, and a critical alert from another customer pages forty minutes late, mid-incident. Lost attribution: a failed run points at the dispatcher, not the customer, so every SOC writes two hundred lines of Python to reconstruct what billing discarded.

The numbers

Public per-run rates run $0.05 to $0.50. A SOC pushing 5,000 alerts a day pays six to seven figures a year; the same workload on a durable-execution engine on one VM costs low hundreds a month, a 10× to 50× ratio.

It also runs 50× faster in plain Python, because SOAR sits at the bottom of the funnel, on a load that fits on a laptop.

Why the exit is now open

SOAR playbooks are a clean transpilation target: a regular DAG, 50 to 100 action types, known inputs and outputs. Coding agents restate that logic in days, and the platform layer is mostly built from the sidecar scripts every SOC already accumulated. The right shape is workflow as code over a durable engine, fed by a cloud queue.

No vendor will ship it: the CISO buyer is not the engineering judge, acquirers freeze architectures, and per-run margin is too lucrative to drop first.

The exit

A parallel build, not a migration. Stand up the in-house pipeline next to the vendor's on the same alert stream, diff outputs until parity holds across weeks, then promote it to primary. One engineer, no steering committee, under 10% of vendor cost. The vendor finds out at renewal.
