Catalog metadata is the connective tissue of a media platform: titles, credits, artwork, runtimes, language tracks, rights windows. When any of it goes wrong, even subtly, real users notice. The Data Canary is our system for catching catalog defects before they ship.

Canary, not contract

Contract validation, schema enforcement, required-field checks, is necessary but insufficient. Most catalog bugs are semantic, not structural: a runtime in the wrong units, a typo in a country code, a poster image that's technically valid but wildly off-brand. A canary watches for "shape" changes against a known-good baseline.

How it works

A fingerprint is just a compact summary per column:

def fingerprint(column):
    return {
        "null_rate": column.nulls / len(column),
        "distinct":  column.approx_distinct(),
        "quantiles": column.quantiles([0.01, 0.5, 0.99]),
        "top_k":     column.top_k(20),
    }
drift detected null_rate over 60 days
A column's null rate held steady for weeks, then broke on a deploy, the canary paged

What it caught

In the first six months the canary surfaced a dozen-plus issues that would have reached production, ranging from a vendor accidentally double-encoding subtitle files to a rights rollout pushing the wrong region bit on a subset of titles.