The modern data stack has settled into a comfortable shape: ingestion, warehouse, transformation, BI. What's still missing is a shared semantic layer, a place where a metric like "monthly active users" is defined once and consumed consistently by every downstream tool. DataJunction is our take on that layer, building on the open-source project of the same name that originated at Airbnb.
Problems we kept hitting
- The same metric defined differently across dashboards, causing endless reconciliation meetings.
- Expensive recomputation because each tool built its own aggregates from raw tables.
- Lineage breaking at the BI boundary, so governance and privacy reviews stopped at the warehouse.
What DataJunction does
DataJunction hosts a DAG of nodes: sources, transforms, dimensions, and metrics. Metrics are expressed declaratively. Consumers (dashboards, notebooks, ML features) request metrics by name, and DataJunction picks the cheapest valid query plan across the available materializations.
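To make "cheapest valid plan" concrete, here is a minimal sketch of the selection step. It is not DataJunction's actual planner: the `Materialization` record, the `pick_cheapest` function, the aggregate table names, and the cost numbers are all hypothetical. Validity is simplified to "the table covers every requested dimension"; a real planner would also have to check that the metric can be correctly re-aggregated from the table.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Materialization:
    """Hypothetical record of a table that can serve a metric."""

    table: str
    metric: str
    dimensions: frozenset[str]  # dimensions the table can group by
    estimated_cost: float       # arbitrary units, e.g. bytes scanned


def pick_cheapest(
    metric: str,
    requested_dims: Iterable[str],
    candidates: Iterable[Materialization],
) -> Optional[Materialization]:
    """Return the lowest-cost candidate covering every requested dimension."""
    requested = frozenset(requested_dims)
    valid = [
        m for m in candidates
        if m.metric == metric and requested <= m.dimensions
    ]
    # default=None avoids a ValueError when no candidate is valid.
    return min(valid, key=lambda m: m.estimated_cost, default=None)


# Illustrative catalog: one raw source and two made-up rollups.
candidates = [
    Materialization(
        "events.stream_started", "monthly_active_users",
        frozenset({"country", "plan", "device_class"}), 1000.0,
    ),
    Materialization(
        "agg.mau_by_country_plan", "monthly_active_users",
        frozenset({"country", "plan"}), 40.0,
    ),
    Materialization(
        "agg.mau_by_country", "monthly_active_users",
        frozenset({"country"}), 5.0,
    ),
]

by_country = pick_cheapest("monthly_active_users", ["country"], candidates)
by_device = pick_cheapest(
    "monthly_active_users", ["country", "device_class"], candidates
)
print(by_country.table)  # agg.mau_by_country
print(by_device.table)   # events.stream_started
```

A request by country alone lands on the smallest rollup, while adding device_class forces a fall back to the raw source, because no cheaper table carries that dimension.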
A metric definition is a small declarative block wrapped around a SQL expression:
    metric monthly_active_users {
      description: "Distinct users with >=1 stream in the last 30 days"
      source: events.stream_started
      expression: approx_count_distinct(user_id)
      window: 30d
      dimensions: [country, plan, device_class]
    }
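As a rough illustration of how a definition like the one above could turn into a query, the sketch below mirrors its fields in a Python dataclass and renders SQL for a requested set of dimensions. The `MetricDefinition` class, the `render_sql` function, and the `event_ts` timestamp column are assumptions made for this example, not part of DataJunction. The interval syntax also varies between engines, so treat the output as schematic.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class MetricDefinition:
    """Hypothetical in-memory form of the metric block shown above."""

    name: str
    source: str
    expression: str
    window_days: int
    dimensions: tuple[str, ...]


MAU = MetricDefinition(
    name="monthly_active_users",
    source="events.stream_started",
    expression="approx_count_distinct(user_id)",
    window_days=30,
    dimensions=("country", "plan", "device_class"),
)


def render_sql(
    metric: MetricDefinition,
    group_by: Sequence[str],
    ts_column: str = "event_ts",  # assumed column name, not from the source
) -> str:
    """Render a query for the metric, grouped by the requested dimensions."""
    unknown = [d for d in group_by if d not in metric.dimensions]
    if unknown:
        raise ValueError(f"dimensions not declared on {metric.name}: {unknown}")

    select_list = ", ".join([*group_by, f"{metric.expression} AS {metric.name}"])
    sql = (
        f"SELECT {select_list}\n"
        f"FROM {metric.source}\n"
        # Interval syntax is engine-specific; this form is only illustrative.
        f"WHERE {ts_column} >= CURRENT_DATE - INTERVAL '{metric.window_days}' DAY"
    )
    if group_by:
        sql += "\nGROUP BY " + ", ".join(group_by)
    return sql


print(render_sql(MAU, ["country", "plan"]))
```

Rejecting undeclared dimensions at render time is one way to keep every consumer on the same definition: a dashboard cannot slice the metric in a way the definition never sanctioned.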
Why a service, not a library
A library forces every consumer into one language and one execution engine. As a service, DataJunction speaks SQL, GraphQL, and Python, and engine selection becomes an implementation detail.