Catalog metadata is the connective tissue of a media platform: titles, credits, artwork, runtimes, language tracks, rights windows. When any of it goes wrong, even subtly, real users notice. The Data Canary is our system for catching catalog defects before they ship.
Canary, not contract
Contract validation (schema enforcement, required-field checks) is necessary but insufficient. Most catalog bugs are semantic, not structural: a runtime in the wrong units, a typo in a country code, a poster image that's technically valid but wildly off-brand. A canary instead watches for changes in the "shape" of the data, meaning its distributions, against a known-good baseline.
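To make the distinction concrete, here is a minimal sketch. The `runtime_minutes` field and the `passes_contract` check are hypothetical names chosen for illustration, not part of our actual pipeline. Every record in both batches passes the structural check, yet the median runtime moves by a factor of 60 when values arrive in seconds instead of minutes.

```python
from statistics import median

def passes_contract(record: dict) -> bool:
    """Structural check only: field present, an integer, and positive."""
    runtime = record.get("runtime_minutes")  # hypothetical field name
    return isinstance(runtime, int) and not isinstance(runtime, bool) and runtime > 0

# Known-good batch: runtimes in minutes.
baseline = [{"runtime_minutes": m} for m in (88, 95, 102, 110, 124)]
# Defective batch: same titles, but the values were sent in seconds.
today = [{"runtime_minutes": r["runtime_minutes"] * 60} for r in baseline]

# Both batches satisfy the contract...
all_valid = all(passes_contract(r) for r in baseline + today)

# ...but the distribution has moved far away from the baseline.
baseline_median = median(r["runtime_minutes"] for r in baseline)
today_median = median(r["runtime_minutes"] for r in today)
shift_ratio = today_median / baseline_median
```

A schema validator has nothing to complain about here; only a comparison against what runtimes usually look like reveals the defect.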
How it works
- Each catalog snapshot is summarized into a set of distributional fingerprints.
- The canary compares today's fingerprint against a rolling baseline and flags statistically unusual shifts.
- Drift is routed to the team whose pipeline most likely caused it, based on lineage.
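The comparison step in the list above could be sketched roughly as follows. This is an illustration under stated assumptions, not our production detector: it tracks a single metric (a column's null rate), uses a plain z-score against a rolling window as the test for "statistically unusual", and the `RollingBaseline` class, window size, and threshold are all hypothetical.

```python
from collections import deque
from statistics import mean, stdev

class RollingBaseline:
    """Keeps the last `window` values of one metric and scores new values."""

    def __init__(self, window: int = 28, threshold: float = 4.0):
        self.history = deque(maxlen=window)  # oldest values fall off automatically
        self.threshold = threshold           # illustrative z-score cutoff, not a tuned value

    def is_drift(self, value: float) -> bool:
        """Return True if `value` is unusually far from the rolling baseline."""
        if len(self.history) < 2:
            return False  # not enough history to estimate spread
        mu = mean(self.history)
        sigma = stdev(self.history)
        # Floor sigma so a perfectly flat history does not divide by zero.
        z = abs(value - mu) / max(sigma, 1e-9)
        return z > self.threshold

    def observe(self, value: float) -> bool:
        """Score `value`, then add it to the baseline only if it looks normal."""
        drifted = self.is_drift(value)
        if not drifted:
            self.history.append(value)
        return drifted

# One week of ordinary null rates, followed by a sudden jump in missing values.
null_rate = RollingBaseline(window=7)
for rate in (0.010, 0.012, 0.011, 0.009, 0.010, 0.011, 0.012):
    null_rate.observe(rate)
flagged = null_rate.observe(0.35)
```

Keeping flagged values out of the history is a deliberate choice in this sketch: otherwise a defect that persists for a few days would quietly become the new baseline.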
A fingerprint is just a compact summary per column:
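    def fingerprint(column):
        """Summarize one column as a small dict of distribution statistics."""
        n = len(column)
        return {
            # Fraction of missing values; guard against an empty column.
            "null_rate": column.nulls / n if n else 0.0,
            # Approximate count of distinct values.
            "distinct": column.approx_distinct(),
            # Lower tail, median, and upper tail: p01, p50, p99.
            "quantiles": column.quantiles([0.01, 0.5, 0.99]),
            # The 20 most frequent values.
            "top_k": column.top_k(20),
        }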
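For readers who want something they can run, here is a self-contained variant over a plain Python list. It is a sketch under assumptions: `fingerprint_values` is a hypothetical name, `None` stands in for a missing value, the distinct count is exact rather than approximate, and quantiles are computed only for numeric columns using the standard library.

```python
from collections import Counter
from statistics import quantiles

def fingerprint_values(values: list, k: int = 20) -> dict:
    """Fingerprint a column given as a plain list; None marks a missing value."""
    present = [v for v in values if v is not None]
    total = len(values)
    fp = {
        "null_rate": (total - len(present)) / total if total else 0.0,
        # Exact count; at catalog scale a sketch such as HyperLogLog would approximate it.
        "distinct": len(set(present)),
        "top_k": Counter(present).most_common(k),
    }
    # Quantiles only make sense for numeric columns with at least two values.
    numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
    )
    if numeric and len(present) >= 2:
        cuts = quantiles(present, n=100, method="inclusive")  # 99 percentile cut points
        fp["quantiles"] = [cuts[0], cuts[49], cuts[98]]       # p01, p50, p99
    return fp

# A categorical column (country codes) and a numeric one (runtimes).
country_fp = fingerprint_values(["US", "US", "GB", None, "DE", "US"], k=2)
runtime_fp = fingerprint_values([90, 100, 110])
```

The categorical fingerprint has no `"quantiles"` key at all, which keeps downstream comparisons from treating country codes as numbers.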
What it caught
In the first six months the canary surfaced more than a dozen issues that would otherwise have reached production, ranging from a vendor accidentally double-encoding subtitle files to a rights rollout setting the wrong region bit on a subset of titles.