The modern data stack has settled into a comfortable shape: ingestion, warehouse, transformation, BI. What's still missing is a shared semantic layer, a place where a metric like "monthly active users" is defined once and consumed consistently by every downstream tool. DataJunction is our take on that layer, building on the open-source project of the same name that originated at Airbnb.

Problems we kept hitting

What DataJunction does

DataJunction hosts a DAG of nodes: sources, transforms, dimensions, and metrics. Metrics are expressed declaratively. Consumers (dashboards, notebooks, ML features) request metrics by name, and DataJunction picks the cheapest valid query plan across the available materializations.
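To make the node graph concrete, here is a minimal sketch of how the four node types could be modeled and ordered so that every node is resolved after its upstream dependencies. It is an illustration only, not DataJunction's actual data model: the `Node` class, the `build_order` helper, and the node names `dim.country` and `streams_enriched` are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from graphlib import TopologicalSorter


class NodeType(Enum):
    SOURCE = "source"
    TRANSFORM = "transform"
    DIMENSION = "dimension"
    METRIC = "metric"


@dataclass(frozen=True)
class Node:
    # Hypothetical shape of a node; not taken from DataJunction itself.
    name: str
    type: NodeType
    parents: tuple[str, ...] = ()  # upstream nodes this node reads from


def build_order(nodes: list[Node]) -> list[str]:
    """Return node names so that every node comes after its parents."""
    graph = {n.name: set(n.parents) for n in nodes}
    # static_order() raises graphlib.CycleError if the graph is not a DAG.
    return list(TopologicalSorter(graph).static_order())


nodes = [
    Node("events.stream_started", NodeType.SOURCE),
    Node("dim.country", NodeType.DIMENSION),
    Node("streams_enriched", NodeType.TRANSFORM,
         ("events.stream_started", "dim.country")),
    Node("monthly_active_users", NodeType.METRIC, ("streams_enriched",)),
]

order = build_order(nodes)
```

Because the metric sits downstream of everything else in this toy graph, it always comes last in `order`. The relative order of the two root nodes is not guaranteed.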

A metric definition is a small piece of declarative SQL:

metric monthly_active_users {
  description: "Distinct users with >=1 stream in the last 30 days"
  source: events.stream_started
  expression: approx_count_distinct(user_id)
  window: 30d
  dimensions: [country, plan, device_class]
}

[Figure: consumers (Dashboards, Notebooks, ML features), DataJunction, and backing stores (Warehouse, Iceberg, OLAP). One semantic layer between every consumer and every backing store.]
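The "cheapest valid query plan" idea can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the planner itself: the `Materialization` class, the `pick_plan` function, the table names, and the cost figures are all hypothetical. "Valid" here means only that a materialization covers every requested dimension.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Materialization:
    # Hypothetical description of one precomputed table.
    name: str
    metric: str
    dimensions: frozenset[str]  # dimensions this table can group by
    cost: float                 # assumed relative scan cost, lower is cheaper


def pick_plan(metric, requested_dims, candidates):
    """Return the cheapest materialization that can answer the request, or None."""
    requested = frozenset(requested_dims)
    valid = [
        m for m in candidates
        if m.metric == metric and requested <= m.dimensions
    ]
    return min(valid, key=lambda m: m.cost, default=None)


candidates = [
    Materialization("olap_cube_country", "monthly_active_users",
                    frozenset({"country"}), 1.0),
    Materialization("iceberg_country_plan", "monthly_active_users",
                    frozenset({"country", "plan"}), 5.0),
    Materialization("warehouse_raw", "monthly_active_users",
                    frozenset({"country", "plan", "device_class"}), 50.0),
]

by_country = pick_plan("monthly_active_users", ["country"], candidates)
by_device = pick_plan("monthly_active_users", ["device_class"], candidates)
unknown = pick_plan("monthly_active_users", ["age_band"], candidates)
```

A request grouped by `country` lands on the small cube, while a request grouped by `device_class` falls back to the widest and most expensive table. One caveat the sketch ignores: a distinct count cannot be re-aggregated by summing precomputed counts, so a real validity check for a metric like `monthly_active_users` would also need to confirm that the stored form is mergeable (for example, a sketch structure rather than a final number).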

Why a service, not a library

A library forces every consumer into one language and one execution engine. As a service, DataJunction speaks SQL, GraphQL, and Python, and engine selection becomes an implementation detail.