DataJunction · H2 Labs

The modern data stack has settled into a comfortable shape: ingestion, warehouse, transformation, BI. What's still missing is a shared semantic layer, a place where a metric like "monthly active users" is defined once and consumed consistently by every downstream tool. DataJunction is our take on that layer, building on the open-source project of the same name that originated at Airbnb.

Problems we kept hitting

The same metric defined differently across dashboards, causing endless reconciliation meetings.
Expensive recomputation because each tool built its own aggregates from raw tables.
Lineage breaking at the BI boundary, so governance and privacy reviews stopped at the warehouse.

What DataJunction does

It hosts a DAG of nodes: sources, transforms, dimensions, and metrics. Metrics are expressed declaratively. Consumers (dashboards, notebooks, ML features) request metrics by name, and DataJunction picks the cheapest valid query plan across the available materialisations.
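To make "cheapest valid plan" concrete, here is a minimal sketch of the selection step. It is an illustration only, not DataJunction's planner: the `Materialisation` class, the `pick_cheapest` function, the table names, and the cost numbers are all hypothetical, and "valid" is reduced to "covers the requested dimensions". A real planner would also have to check freshness and whether the stored aggregate can be rolled up at all (distinct counts, for instance, only roll up if mergeable sketches are stored).

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Optional


@dataclass(frozen=True)
class Materialisation:
    """A hypothetical pre-computed table; not DataJunction's real data model."""
    name: str
    metric: str
    dimensions: FrozenSet[str]  # dimensions the table can be grouped by
    estimated_cost: float       # illustrative unit, e.g. relative scan size


def pick_cheapest(
    metric: str,
    requested_dims: Iterable[str],
    candidates: Iterable[Materialisation],
) -> Optional[Materialisation]:
    """Return the lowest-cost candidate that covers the request, or None."""
    wanted = frozenset(requested_dims)
    # Simplified validity rule: same metric, and every requested
    # dimension is available in the candidate table.
    valid = [
        m for m in candidates
        if m.metric == metric and wanted <= m.dimensions
    ]
    return min(valid, key=lambda m: m.estimated_cost, default=None)


# Hypothetical candidates for the same metric at different grains.
candidates = [
    Materialisation("raw_events", "monthly_active_users",
                    frozenset({"country", "plan", "device_class"}), 1000.0),
    Materialisation("mau_by_country_plan", "monthly_active_users",
                    frozenset({"country", "plan"}), 20.0),
    Materialisation("mau_by_country", "monthly_active_users",
                    frozenset({"country"}), 5.0),
]

best = pick_cheapest("monthly_active_users", ["country", "plan"], candidates)
print(best.name)  # mau_by_country_plan
```

The cheapest table overall (`mau_by_country`) is skipped because it cannot answer a request that groups by `plan`; the raw table is valid but costs more than the matching aggregate.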

A metric definition is a small declarative block wrapped around a SQL expression:

metric monthly_active_users {
  description: "Distinct users with >=1 stream in the last 30 days"
  source: events.stream_started
  expression: approx_count_distinct(user_id)
  window: 30d
  dimensions: [country, plan, device_class]
}
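As a rough illustration of how a definition like this could turn into a query, the sketch below renders one SQL statement from the same fields. Everything here is an assumption made for the example: the `MetricDef` class, the `render_sql` function, the `event_ts` timestamp column (the definition above does not name one), and the interval syntax, which varies by SQL dialect. It is not how DataJunction compiles metrics.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class MetricDef:
    """Hypothetical in-memory mirror of the metric block above."""
    name: str
    source: str
    expression: str
    window_days: int
    dimensions: Tuple[str, ...]


def render_sql(metric: MetricDef, group_by: Sequence[str],
               ts_column: str = "event_ts") -> str:
    """Render a SQL string for the metric, grouped by declared dimensions.

    ts_column is an assumed timestamp column on the source table.
    """
    unknown = [d for d in group_by if d not in metric.dimensions]
    if unknown:
        raise ValueError(f"dimensions not declared on {metric.name}: {unknown}")

    select_cols = ", ".join(
        list(group_by) + [f"{metric.expression} AS {metric.name}"]
    )
    sql = (
        f"SELECT {select_cols}\n"
        f"FROM {metric.source}\n"
        # Interval syntax is dialect-specific; adjust for the target engine.
        f"WHERE {ts_column} >= CURRENT_DATE - INTERVAL '{metric.window_days}' DAY"
    )
    if group_by:
        sql += "\nGROUP BY " + ", ".join(group_by)
    return sql


mau = MetricDef(
    name="monthly_active_users",
    source="events.stream_started",
    expression="approx_count_distinct(user_id)",
    window_days=30,
    dimensions=("country", "plan", "device_class"),
)

print(render_sql(mau, ["country", "plan"]))
```

Rejecting undeclared dimensions up front is the useful part of the sketch: the definition, not the consumer, decides which groupings are allowed.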

Figure: One semantic layer between every consumer and every backing store.

Why a service, not a library

A library forces every consumer into one language and one execution engine. As a service, DataJunction speaks SQL, GraphQL, and Python, and engine selection becomes an implementation detail.

