The Data Canary

Catalog metadata is the connective tissue of a media platform: titles, credits, artwork, runtimes, language tracks, rights windows. When any of it goes wrong, even subtly, real users notice. Schema validation will not catch most of it, because the field that breaks the experience is usually well-formed and still wrong. The Data Canary is the system we built to catch those defects before they ship.

Canary, not contract

Contract validation is necessary and insufficient. Schema enforcement and required-field checks catch the structural failures: a missing field, the wrong type, a malformed date. They pass everything that is the right shape, which is most of what hurts us.

The defects that reach users are semantic. A runtime stored in the wrong units. A typo in a country code that quietly drops a title from a region. A poster image that validates cleanly and is wildly off-brand. Each one is a perfectly legal value in the wrong place, so the contract waves it through.

The field that breaks the experience is almost always well-formed. The contract checks the shape; the canary checks the shape of the data.

A canary works the way the bird in the mine did: it watches for a change in the air, not a specific poison. Instead of asking "is this value legal," it asks "does today's catalog look like yesterday's." It compares the distribution of each column against a known-good baseline and pages when the shape shifts in a way that chance does not explain.

Defect	Schema verdict	Canary verdict
Missing required field	fails (right shape absent)	fails (null rate spikes)
Runtime in wrong units	passes (valid integer)	fails (quantiles slide)
Typo in country code	passes (valid string)	fails (top-k reshuffles)
Off-brand poster image	passes (valid URL)	fails (distribution shifts)

How it works

Each catalog snapshot is summarised into a set of distributional fingerprints, one per column. A fingerprint is a compact statistical summary, cheap to compute and cheap to store:

def fingerprint(column):
    return {
        "null_rate": column.nulls / len(column),
        "distinct":  column.approx_distinct(),
        "quantiles": column.quantiles([0.01, 0.5, 0.99]),
        "top_k":     column.top_k(20),
    }

The canary compares today's fingerprint against a rolling baseline of recent snapshots and flags any column whose shape has shifted more than the baseline's own variance would predict. The pipeline has three moving parts:

Summarise. Reduce each snapshot to per-column fingerprints, so the comparison is over kilobytes, not the full catalog.
Compare. Score each column against the rolling baseline; a null rate that jumps, a top-k that reshuffles, a quantile that slides all read as drift.
Route. Send each flagged column to the team whose pipeline most likely caused it, using lineage, so the page lands on the owner, not the on-call generalist.

A column's null rate held steady for weeks, then broke on a deploy, and the canary paged.

Tuning the alarm

A canary that cries wolf gets muted, so the threshold is the whole game. Set it tight and a normal weekend dip in fresh titles pages someone at 3am; set it loose and the real regression slips under the bar. We tie the threshold to each column's own historical variance rather than a global constant, so a naturally noisy column needs a bigger move to alarm than a stable one does.

Seasonality is the other trap. Catalog shape breathes on a weekly and release cadence, so the baseline is a rolling window that already contains those rhythms. The canary asks whether today is unusual against recent comparable days, not against a flat line.

What it catches

The defects it catches share a signature: every value is legal, and the distribution moved anyway. A vendor double-encodes a batch of subtitle files and the encoding distribution skews before any of them render as mojibake. A rights rollout flips the wrong region bit on a subset of titles and the country-code distribution shifts while the change is still in staging. The schema sees nothing wrong; the shape gives it away.

None of these are schema failures. Every value is legal. The catalog just stops looking like itself.

The lesson generalises past catalogs. Wherever data flows through pipelines you do not fully control, the contract guards the structure and a canary guards the meaning. Validate the shape with the schema, watch the shape of the data with a baseline, and the defects that are technically valid and practically broken stop reaching the people who would have noticed.

Catalog metadata (titles, credits, artwork, runtimes, rights windows) is the connective tissue of a media platform. When it breaks, users notice. Schema validation misses most of it, because the field that breaks the experience is usually well-formed and still wrong. The Data Canary catches those defects before they ship.

Contracts check structure; the failures are semantic. Schema and required-field checks catch a missing field or the wrong type. They pass a runtime in the wrong units, a typo in a country code, an off-brand poster that validates cleanly. Each is a legal value in the wrong place, so the contract waves it through.

A canary watches the shape of the data, not the shape of the schema. It does not ask whether a value is legal. It asks whether today's catalog looks like yesterday's, comparing each column's distribution against a known-good baseline and paging when the shape shifts more than chance explains.

The contract checks the shape. The canary checks the shape of the data.

How it works

Each snapshot becomes a per-column fingerprint: null rate, distinct count, a few quantiles, the top values. The pipeline runs in three steps:

Summarise: reduce each snapshot to fingerprints, so comparison is over kilobytes, not the full catalog;
Compare: score each column against a rolling baseline; a jumped null rate or a reshuffled top-k reads as drift;
Route: send each flag to the owning team by lineage, so the page lands on the right person.

Two tuning rules keep it usable. The alarm threshold tracks each column's own historical variance, so a noisy column needs a bigger move to fire than a stable one. The baseline is a rolling window, so weekly and release cadences are built in and a normal weekend dip does not page anyone.

What it catches

The defects share a signature: a vendor double-encodes subtitle files, a rights rollout flips the wrong region bit on some titles. None are schema failures. Every value is legal; the catalog just stops looking like itself. Validate the shape with the schema, watch the shape of the data with a baseline.

Sources

Internal: the Data Canary pipeline in production, the source of the defect patterns above.
Schema-on-read and contract validation practice: the structural baseline a canary complements, not replaces.

Canary, not contract

How it works

Tuning the alarm

What it catches

How it works

What it catches

Sources