Monitoring is not understanding

My father had a valve replacement a few years ago and was put on the standard antiplatelet regimen. Over the next year he had a small string of strokes. Each time the medication was adjusted; aspirin came off at one point on the explanation that "it does the same thing as the Plavix." More strokes. After the most recent one, and ten days in hospital, he was discharged with a single new test recommended: a platelet function test for the drug he had been taking the whole time.

The result was unambiguous. In the lab's own words: the therapeutic effect was not obtained. He was on the drug. The drug was not doing anything. The test had existed for over a decade.

I'm writing about that, but mostly about the shape of it, because it repeats everywhere with different names on the labels.

The same playbook, twice

In both medicine and IT, the work runs on the same surface playbook:

Continuous metrics (vitals and lab panels; p99 latency and error rates).
Threshold alerts (a blood-pressure spike, a troponin elevation; pages, on-call SMS escalation).
Specialists who own slices (cardio, neuro, endo; the database team, the network team, the application team).
Treatment applied by symptom (a stronger antihypertensive, a restart of the worker).
Recurrence (another stroke, another outage two weeks later).

And in both, the place where it most often fails is the same: between the third item and the fourth, where the specialist's slice meets the treatment applied by symptom. The treatment goes in, the alert quiets, the team moves on. The mechanism that produced the alert in the first place is never named, so when the same condition reasserts itself there is no record that it is the same condition. It is logged as a fresh event.

A dashboard going green is not a problem understood

Instrumentation tells you what is happening, not why. The "why" lives one layer deeper, in tests and analyses that are not on the dashboard and are run only when someone decides to run them.

The medical version of the gap:

A patient on the protocol-correct medication has a stroke.
He is on the medication. He has another stroke.
The protocol-correct reading is "the medication is right." The reading the system rarely makes is "the medication may not be working in him." The first reading is much cheaper.
Months pass. The second stroke is closed as a fresh acute event. The first one is part of his history; it is not flagged as a recurrence to investigate.
Until somebody asks the mechanistic question: is the medication doing what it is supposed to be doing, in this specific patient? The platelet function test that answers it has existed for years.

The IT version of the gap:

A service has a p99 latency spike at 03:14. Someone restarts the worker. The graph recovers.
The retro is "we restarted, recovery time was X minutes." Root cause logged as "investigating."
Three weeks later, another p99 spike at 04:02. Same recovery.
Two months later, another. The pattern is in the data; nobody is reading the data as a pattern. Each occurrence is closed as a unique incident.
Until somebody asks the mechanistic question: is the worker hitting a specific code path, a specific query, a specific lock that the rolled-up metrics do not show? The flame graph, the trace, the packet capture exist. They take a morning. They have been sitting in the catalogue.
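
The IT story above turns on incidents being closed one at a time, so the repetition never becomes visible. A minimal sketch of reading the same data as a pattern, assuming a hypothetical incident record with a service, a symptom, a remediation, and an opening time (the field names, the "checkout" and "search" services, and the dates are all made up for illustration, not taken from any real tracker):

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Incident:
    # Hypothetical record shape; a real incident tracker will differ.
    service: str
    symptom: str        # e.g. "p99 latency spike"
    remediation: str    # e.g. "restarted worker"
    opened_at: datetime


def find_recurrences(incidents):
    """Group incidents by (service, symptom, remediation); keep groups seen more than once."""
    groups = defaultdict(list)
    for inc in incidents:
        groups[(inc.service, inc.symptom, inc.remediation)].append(inc)
    return {
        key: sorted(group, key=lambda i: i.opened_at)
        for key, group in groups.items()
        if len(group) > 1
    }


# Illustrative data only.
incidents = [
    Incident("checkout", "p99 latency spike", "restarted worker", datetime(2024, 1, 8, 3, 14)),
    Incident("checkout", "p99 latency spike", "restarted worker", datetime(2024, 1, 29, 4, 2)),
    Incident("search", "error rate spike", "rolled back deploy", datetime(2024, 2, 2, 11, 30)),
]

for (service, symptom, remediation), group in find_recurrences(incidents).items():
    print(f"{service}: '{symptom}' seen {len(group)} times, each closed by '{remediation}'")
```

Grouping on the remediation as well as the symptom is a deliberate choice in this sketch: the same symptom quieted by the same fix twice is the strongest hint that the fix is treating the alert rather than the mechanism.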

Both stories end the same way. The test that resolves the question is available, cheap, and routine. It is just not routine enough to be ordered automatically when a symptom recurs. The recurrence is the question, not the fact.

An alert fires, a treatment is applied, the alert fires again two months later. In both fields, the test that would distinguish "the treatment is right" from "the treatment is right but not effective here" is sitting in the catalogue, usually not ordered until the third occurrence.

What IT has begun to learn that medicine could borrow

This is the part that should travel. IT, fitfully and unevenly, has built a culture around the gap, and the elements of that culture do the work that "more metrics" does not:

Blameless postmortems that are actually written down and read, and that name a mechanism, not just a recovery time.
Recurrence as its own category. Tooling and team practices that mark "this is the second time we have seen this," because the second time is a fundamentally different conversation from the first.
Observability as standard, not heroic. Distributed tracing, flame graphs, profiling, packet captures available routinely, not reserved for catastrophe.
Five whys used as a discipline rather than a quoted cliché.
A single picture the operator owns that crosses subsystems. The DB team and the app team look at the same dashboard, with the same correlation IDs.
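
The last item, a shared picture keyed on the same correlation IDs, can be sketched in a few lines. This is an illustration under assumptions: the record fields (correlation_id, ts_ms, team, event) and the sample log lines are invented here, and real logging pipelines will carry different shapes.

```python
from collections import defaultdict


def merge_by_correlation_id(*sources):
    """Merge log records from several teams into one time-ordered timeline per correlation ID."""
    timelines = defaultdict(list)
    for records in sources:
        for rec in records:
            timelines[rec["correlation_id"]].append(rec)
    for events in timelines.values():
        events.sort(key=lambda r: r["ts_ms"])
    return dict(timelines)


# Made-up records; field names are assumptions for this sketch.
app_logs = [
    {"correlation_id": "req-1", "ts_ms": 0, "team": "app", "event": "request received"},
    {"correlation_id": "req-1", "ts_ms": 950, "team": "app", "event": "response sent"},
]
db_logs = [
    {"correlation_id": "req-1", "ts_ms": 20, "team": "db", "event": "query started"},
    {"correlation_id": "req-1", "ts_ms": 930, "team": "db", "event": "lock acquired after wait"},
]

for rec in merge_by_correlation_id(app_logs, db_logs)["req-1"]:
    print(f'{rec["ts_ms"]:>5} ms  {rec["team"]:<3}  {rec["event"]}')
```

Read separately, the app log shows a slow request and the DB log shows a query that eventually ran. Only the merged timeline shows that nearly all of the request's time was spent waiting inside the database.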

None of this is novel; most of it was borrowed from aviation and surgery, which had to build the habits in the second half of the twentieth century to stop killing people. Medicine has the language for it (the M&M conference, the differential diagnosis), but in the outpatient world the practice has thinned. The patient moves between specialists, each treating their slice, and nobody is named owner of the integration. There is no "incident #3 is a recurrence of incident #1, order the test." The protocol is run again because the protocol is what the system has.

The end state of that, when nothing changes, is a dead patient with healthy pieces. Each specialist's slice looked fine on its own chart.

The patient has no app

I've written about this from a different angle in Follow the rent. The patient's records are fragmented across providers' EHRs. The patient himself doesn't have a unified view. The only person in the loop with the whole story is the one without the medical training, without the time, and without the tools.

Each specialist sees their slice: cardiology the heart, neurology the brain, pharmacy the script. Each is doing their job. Nobody is doing the integration job.

That's why nobody noticed that the second stroke came right after aspirin was withdrawn, on the assumption that Plavix was covering. The test that would have checked that assumption sat in the catalogue. Asking for it wasn't anyone's job.

There's a deeper version of this that bothers me more. We're trained to optimise at the provider end. Every screen is local:

The cardiologist's dashboard is a cardiology dashboard.
The surgical team's KPI is the operation.
The pharmacist's screen shows drugs and doses, not outcomes a month out.

At the provider end every case can close as a success: stroke treated acutely, valve replaced cleanly, script filled correctly. Every slice was a success.

Nothing in the workflow models the whole. The whole was supposed to be modelled by the person at the centre of it, who is also the one least equipped to do it. He's sick. That's why he's in the system in the first place.

He's not a cardiologist, not a neurologist, not a pharmacist. So when one of them tells him a drug does the same thing as another, he has no way of knowing whether that's true. Even on the treatments he does understand, he may miss a dose, mix up two pills, or stop a course because the side effect feels worse than the disease. The integration job has landed on the person least able to do it.

He was supposed to be the centre of the story. He's instead the seam between several stories that all close in success.

Same shape as the user with a hundred provider accounts. The patient is the only thread, and the patient has no app.

And the patient isn't somebody else. It's us, sooner or later. Our families. The doctor who wrote this morning's prescription is somebody's patient on a different morning. The integration job is everyone's, because it's everyone's turn.

The two jobs the system gets backwards

Two roles decide what happens next whenever the picture is unclear: triage and integration. Both are quietly understaffed almost everywhere. They get handed to whoever happens to be on-call or on-shift, on the assumption that the hard parts will be escalated up the chain. Those two roles are the hard parts.

In IT it looks like triage handed to whoever paged in, with integration treated as everyone's responsibility, which always means no-one's. In medicine it looks like a long case drifting between specialists, with the patient as the only carrier. Both jobs want the most capable person in the room, not the least. Treating them as if they were easy is wrong, and it's where most of the damage comes from.

One rule, for both

The thing I want to take from all of this is one operational rule.

When a problem recurs, do not ask whether the alert fired. Ask whether the test that would distinguish mechanisms has been run.

In IT, on the second occurrence of an incident, stop the war room and go to the profiler. By the third, the question isn't "do we have monitoring?" but "do we have a flame graph, a trace, a packet capture from the moment it was happening?"
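
One way to make the IT half of the rule mechanical is a gate on closing the incident. The sketch below is hypothetical: the IncidentReport shape, the can_close function, and the artifact names are assumptions for illustration, with the artifact kinds taken from the examples in this post.

```python
from dataclasses import dataclass, field

# Assumed artifact kinds that can distinguish mechanisms (from the examples in the text).
MECHANISM_ARTIFACTS = {"flame_graph", "trace", "packet_capture"}


@dataclass
class IncidentReport:
    # Hypothetical report shape for this sketch.
    occurrence: int                      # 1 the first time this condition is seen
    mechanism: str = ""                  # named cause; empty while still "investigating"
    artifacts: set = field(default_factory=set)


def can_close(report):
    """A first occurrence may close on recovery; a recurrence needs a named mechanism and evidence."""
    if report.occurrence < 2:
        return True
    has_evidence = bool(report.artifacts & MECHANISM_ARTIFACTS)
    return bool(report.mechanism.strip()) and has_evidence


first = IncidentReport(occurrence=1)
restart_only = IncidentReport(occurrence=2, artifacts={"dashboard_screenshot"})
explained = IncidentReport(
    occurrence=2,
    mechanism="lock contention on one query",
    artifacts={"trace"},
)

print(can_close(first))         # True
print(can_close(restart_only))  # False
print(can_close(explained))     # True
```

The gate does not check that the named mechanism is right. It only refuses to let "restarted, recovered" stand as the whole record once the condition has come back.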

In medicine, on the second occurrence of the same condition under the same treatment, stop running the protocol-by-symptom and ask which test would tell "the treatment is correct" apart from "the treatment is correct but not effective in this patient." Most of the time the test exists. Usually it's cheaper than the next ambulance ride.

A dashboard going green isn't the goal. It's the prompt to ask the next question, the one that costs more to answer and tells you why.

Coda

I don't know yet what the next regimen for my father will look like. The conversation with the cardiologist is still open. I'm writing this now, before it's finished, because the part that matters for the rest of us is already settled. The test that resolved the question was in the catalogue the whole time. The recurrence was the question. It just took a long time to be heard as one.

Monitoring tells you what is happening, not why. The test that shows whether a treatment is actually working, not merely correct, usually sits in the catalogue: cheap, routine, unordered. When a problem recurs, the recurrence is the question, not the fact. True in medicine and IT alike.

My father had a valve replacement and went on a standard antiplatelet drug. Over a year, a string of strokes. Aspirin was dropped because "it does the same thing as the Plavix." After the last stroke he was finally tested, and the result was unambiguous: the drug was doing nothing. The test had existed for over a decade.

The same playbook, twice

Medicine and IT run the same surface playbook:

continuous metrics (labs; p99 latency);
threshold alerts (a BP spike; pages);
specialists owning slices (cardio, neuro; DB and network teams);
treatment by symptom (a stronger drug, a worker restart);
recurrence (another stroke, another outage).

It fails between owning the slice and treating the symptom. The mechanism is never named, so when the condition returns there is no record that it is the same condition. It is logged as a fresh event.

A green dashboard isn't understanding

Instrumentation tells you what, not why. The why lives one layer deeper: the flame graph, the trace, the platelet function test, all in the catalogue, unordered until a symptom recurs.

The recurrence is the question, not the fact. A dashboard going green is the prompt to ask it, not the answer.

What IT learned that medicine could borrow

IT has built a culture around the gap: blameless postmortems that name a mechanism, recurrence as its own category, observability as standard, not heroic. Medicine has the language (M&M, the differential) but outpatient practice has thinned: the patient drifts between specialists, with nobody owning the integration.

The patient has no app

The records are fragmented across providers. The only person with the whole story is the one without the training, time, or tools. Each specialist closes their slice as a success; the patient is the seam between stories that all close well.

The two jobs that matter most when the picture is unclear, triage and integration, go to whoever is on-call, as if they were easy. They are the hard parts. And the patient isn't somebody else: it is us, sooner or later.

One rule, for both

When a problem recurs, don't ask whether the alert fired. Ask whether the test that would distinguish mechanisms has been run. In IT, on the second occurrence, leave the war room for the profiler. In medicine, ask which test separates a correct treatment from one that isn't working in this patient. Usually it exists, and costs less than the next ambulance ride.