Data Governance with DataHub and dbt: A Practical End-to-End Blueprint

December 15, 2025 at 03:30 PM | Est. read time: 14 min

By Valentina Vianna

Community manager and producer of specialized marketing content

If your teams are building more dashboards than they can explain, changing data models without warning, or struggling to trust metrics, you don’t have a tooling problem—you have a governance problem. The good news: pairing DataHub (as your active metadata platform) with dbt (as your transformation and testing workhorse) gives you a simple, scalable way to make data discoverable, explainable, and trustworthy—without slowing teams down.

This guide shows how to design a modern data governance program around DataHub and dbt, including architecture patterns, implementation steps, quality controls, access policies, and the KPIs that prove it’s working.

What DataHub and dbt each do—and why they’re powerful together

  • DataHub at a glance
      • Centralizes metadata across warehouses, lakes, pipelines, and BI tools
      • Delivers end-to-end lineage (table- and column-level), ownership, tags, and a business glossary
      • Enables “active metadata” automations: trigger actions when quality fails or changes occur
      • Gives everyone a single place to search, understand, and govern data
  • dbt at a glance
      • Transforms data in your lakehouse/warehouse with SQL-first, version-controlled models
      • Bakes in documentation, testing, and model contracts
      • Produces artifacts (manifest.json, run_results.json) that describe models, lineage, tests, and exposures
  • Why they’re better together
      • dbt “documents while it builds”; DataHub ingests that context to produce rich, searchable documentation, lineage, and data quality views automatically.
      • dbt tests and contracts surface in DataHub as assertions, creating a unified quality and governance hub.
      • Exposures in dbt (downstream dashboards/reports) connect to lineage in DataHub, so you can instantly see the blast radius of any upstream change.

For a quick refresher on the difference between a “data catalog” and broader metadata needs, see this overview of data catalogs vs. metadata management.

The problems this combo actually solves

  • “What does this dashboard metric really mean?” Map metrics to upstream models, tests, and owners.
  • “Can I change this column safely?” Visualize lineage across dbt models, jobs, and BI assets.
  • “Who owns this table?” Set and enforce ownership, domains, and stewardship right in DataHub.
  • “Are we allowed to use this data?” Classify PII and sensitive attributes, apply policies, and track usage.
  • “Why do tests keep failing?” Centralize dbt test results as assertions; trigger notifications and tickets automatically.
  • “Where is the documentation?” Publish dbt docs to DataHub—searchable, versioned, and linked to lineage.

Reference architecture: Governance in the modern stack

  • Ingestion: Airbyte/Fivetran/custom streams into your warehouse/lakehouse (Snowflake, BigQuery, Databricks).
  • Transformations: dbt Core or dbt Cloud runs build/test/docs on a schedule or via CI.
  • Orchestration: Airflow, Dagster, or dbt Cloud triggers dbt jobs and governance workflows.
  • Catalog and governance: DataHub ingests warehouse schemas, dbt artifacts, BI assets, owners, tags, glossary, and assertions.
  • Observability: Optional integration with tools like Great Expectations or Soda; results appear in DataHub via assertions.
  • The active metadata loop: DataHub emits events on quality failures or schema changes, and your orchestrator, Slack, or Jira automations act on them (see the sketch below).
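
To make that loop concrete, here is a minimal DataHub Actions pipeline sketch. The overall shape (a Kafka event source, an event filter, and an action) follows the Actions framework; the pipeline name, environment variable defaults, and the choice of the built-in hello_world action are illustrative placeholders, so consult the DataHub Actions docs before swapping in a Slack, Teams, or custom action.

```yaml
# governance_alerts.yaml - minimal DataHub Actions pipeline (illustrative)
# Typically run with the Actions CLI once acryl-datahub-actions is installed,
# e.g.: datahub actions -c governance_alerts.yaml
name: "governance_alerts"

# Consume change events from DataHub's Kafka topics
source:
  type: "kafka"
  config:
    connection:
      bootstrap: ${KAFKA_BOOTSTRAP_SERVER:-localhost:9092}
      schema_registry_url: ${SCHEMA_REGISTRY_URL:-http://localhost:8081}

# Only react to entity change events (schema changes, assertion runs, tag updates)
filter:
  event_type: "EntityChangeEvent_v1"

# Start with the built-in hello_world action, which just logs matching events;
# replace it with a notification or ticketing action for real alerting
action:
  type: "hello_world"
```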

Want to go deeper on the “automate, not document-only” mindset? Explore active metadata management.

Step-by-step implementation plan

1) Set the foundation: scope, ownership, and naming

  • Pick one domain for your pilot (e.g., Sales or Finance).
  • Define data owners and stewards; use a simple RACI so it’s clear who approves schema changes.
  • Establish naming conventions for layers (raw/staging/marts), models, and columns.

If you’re still solidifying your platform standards, this guide to developing a solid data architecture can help.

2) Stand up DataHub and connect core sources

  • Deploy via Docker quickstart, Kubernetes, or a managed offering.
  • Ingest your warehouse metadata first (Snowflake/BigQuery/Databricks connectors).
  • Add BI sources (e.g., Looker, Power BI) so downstream assets appear in lineage.
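
The warehouse connection itself is a short YAML recipe run by the DataHub CLI. A minimal sketch follows, assuming Snowflake and a locally hosted DataHub; the credentials, role, and server address are placeholders, and exact config keys vary slightly by connector and version, so check the connector docs for your warehouse.

```yaml
# snowflake_recipe.yaml - warehouse metadata ingestion sketch (illustrative)
# Run with: datahub ingest -c snowflake_recipe.yaml
source:
  type: snowflake
  config:
    account_id: "my_snowflake_account"    # placeholder
    warehouse: "COMPUTE_WH"
    username: "${SNOWFLAKE_USER}"
    password: "${SNOWFLAKE_PASSWORD}"
    role: "DATAHUB_READER"                # a read-only role is enough
    include_table_lineage: true           # derive lineage from query history where supported

sink:
  type: datahub-rest
  config:
    server: "http://localhost:8080"       # your DataHub GMS endpoint
```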

3) Level up dbt for governance

  • Add model tests (unique, not_null, relationships) and create a tests/ directory for custom checks.
  • Use descriptions, tags, and meta blocks in models and sources.
  • Define exposures to represent critical dashboards and reports.
  • Use model contracts to enforce schemas on your gold models.
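
Most of this is ordinary dbt YAML. Here is a minimal sketch, assuming a gold model named fct_revenue; the model, columns, owner email, and tier meta key are illustrative, not a prescribed standard.

```yaml
# models/marts/finance/_finance__models.yml (illustrative names)
version: 2

models:
  - name: fct_revenue
    description: "Revenue facts, one row per invoice line."
    meta:
      owner: "finance-data@example.com"
      tier: "gold"
    config:
      contract:
        enforced: true        # enforce the declared schema at build time
    columns:                  # list every column with a data_type when the contract is enforced
      - name: order_id
        data_type: string
        tests:
          - not_null
      - name: invoice_id
        data_type: string
        tests:
          - unique
      - name: customer_id
        data_type: string
        tests:
          - relationships:
              to: ref('dim_customer')
              field: customer_id
```

Exposures for downstream dashboards follow the same pattern; a sketch appears in the concrete example further down.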

4) Ingest dbt metadata into DataHub

Configure a recurring DataHub ingestion job that reads dbt artifacts and pushes metadata.

Key recipe fields to set:

  • manifest_path: Path to dbt’s manifest.json
  • run_results_path: Path to dbt’s run_results.json
  • target_platform: bigquery | snowflake | databricks | redshift (matches your warehouse)
  • include_column_lineage: true (when available)
  • profiling: enabled (optional, based on permissions)
  • incremental lineage: recommended for large estates

Result: DataHub will show dbt models, rich lineage, documentation, owners, and the results of dbt tests as assertions on each asset.
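
Putting those fields together, a recipe might look like the sketch below. The paths, platform, and sink address are placeholders, and some field names have shifted between DataHub releases (for example, newer versions also read catalog.json and may accept a list of run_results files), so verify against the dbt source docs for your version.

```yaml
# dbt_recipe.yaml - dbt artifact ingestion sketch (illustrative paths)
# Run with: datahub ingest -c dbt_recipe.yaml
source:
  type: dbt
  config:
    manifest_path: "target/manifest.json"
    catalog_path: "target/catalog.json"         # from dbt docs generate; adds column docs/types
    run_results_path: "target/run_results.json" # test outcomes become DataHub assertions
    target_platform: snowflake                  # match your warehouse
    include_column_lineage: true                # when supported by your version

sink:
  type: datahub-rest
  config:
    server: "http://localhost:8080"
```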

5) Configure quality and contracts as guardrails

  • Treat dbt tests as the first line of data quality.
  • Map test results to DataHub assertions and define severity (warn vs. error).
  • Add alerts: send Slack/Jira notifications on failed assertions for tier-1 assets.
  • Enforce dbt model contracts for gold-layer stability and safer downstream changes.
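
Severity is declared in dbt itself: a test configured as warn records failures without blocking the run, while error (the default) fails the job, and both outcomes land in run_results.json for DataHub to surface as assertions. A minimal sketch on the illustrative fct_revenue model:

```yaml
# Severity sketch: warn vs. error on dbt tests (illustrative)
version: 2

models:
  - name: fct_revenue
    columns:
      - name: invoice_id
        tests:
          - unique:
              config:
                severity: error     # hard failure: blocks the run
      - name: discount_code
        tests:
          - not_null:
              config:
                severity: warn      # soft failure: logged, but the run continues
                warn_if: ">100"     # optional threshold before warning at all
```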

6) Build your business glossary and classification

  • Create glossary terms (e.g., “ARR,” “Active Customer”) and link them to datasets and columns.
  • Tag PII/sensitive fields; apply policies for who can view, query, or export.
  • Group assets into domains (e.g., Marketing, Finance) for clearer ownership and access.
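
DataHub can also load the glossary from a version-controlled YAML file through its business glossary ingestion source, which keeps definitions reviewable in git. A minimal sketch; the node, terms, and steward username are illustrative, and the exact file schema should be checked against the business glossary source docs for your version.

```yaml
# business_glossary.yaml - glossary-as-code sketch (illustrative)
# Ingested by a recipe whose source type is the DataHub business glossary connector
version: 1
source: DataHub
owners:
  users:
    - finance.steward                # placeholder steward username
nodes:
  - name: Finance
    description: "Terms owned by the Finance domain."
    terms:
      - name: ARR
        description: "Annual recurring revenue, normalized to a 12-month contract value."
      - name: Active Customer
        description: "A customer with at least one paid order in the trailing 90 days."
```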

7) Shift-left with CI checks

  • Run dbt build in PRs; fail when tests or contracts would break production.
  • Optionally, use DataHub Actions to comment lineage impact or owner notifications on PRs.
  • Require description/tags for any new gold models (policy-as-code mindset).
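
Here is a minimal sketch of those checks as a GitHub Actions workflow; the adapter, Python version, and secret names are illustrative, and any CI system that can run dbt commands works the same way.

```yaml
# .github/workflows/dbt-ci.yml - shift-left checks on pull requests (illustrative)
name: dbt-ci

on:
  pull_request:
    branches: [main]

jobs:
  build-and-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - name: Install dbt
        run: pip install dbt-snowflake       # swap for your warehouse adapter
      - name: Build and test changed models
        env:
          SNOWFLAKE_PASSWORD: ${{ secrets.SNOWFLAKE_PASSWORD }}
        run: |
          dbt deps
          dbt build --fail-fast
          # With production artifacts available, prefer slim CI:
          # dbt build --select state:modified+ --state path/to/prod-artifacts
```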

8) Roll out to data consumers

  • Promote DataHub as the default starting point: search, glossary, lineage, owners, and “how to use this data” all in one place.
  • Add exposures for key dashboards; verify ownership, tags, and glossary links.
  • Track adoption (searches, popular assets, coverage).

A concrete example: From raw to gold (with lineage and quality)

  • Raw: src_orders (ingested nightly)
  • Staging: stg_orders (cleaned types, standardized statuses)
  • Mart: fct_revenue (joins orders with customer and currency data)
  • Exposures: “Finance Executive Dashboard”

What you see in DataHub:

  • End-to-end lineage: src_orders → stg_orders → fct_revenue → Finance Executive Dashboard
  • Owners: Data product owner for the Finance domain
  • Glossary: “Revenue” linked to business definition
  • Assertions: dbt not_null on order_id, unique on invoice_id, and a relationships test against dim_customer
  • Policies: PII tag on customer_email enforces restricted access
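
The exposure that closes this lineage is a few lines of dbt YAML; once ingested, DataHub renders the dashboard as the final node in the graph. A minimal sketch with illustrative names, owner, and URL:

```yaml
# models/marts/finance/_finance__exposures.yml (illustrative)
version: 2

exposures:
  - name: finance_executive_dashboard
    label: "Finance Executive Dashboard"
    type: dashboard
    maturity: high
    url: "https://bi.example.com/dashboards/finance-exec"   # placeholder
    owner:
      name: "Finance Data Product Owner"
      email: "finance-data@example.com"
    depends_on:
      - ref('fct_revenue')
    description: "Executive revenue overview used by finance leadership."
```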

Data access and policy patterns that work

  • Classify first, then enforce: label PII/sensitive fields at the column level; apply tag-based policies in DataHub.
  • Separate duties: Owners approve schema changes; stewards manage definitions and tags; platform admins manage connections and ingestion.
  • Least privilege: Grant read access to non-sensitive marts by default; require approvals for sensitive domains.
  • Auditability: Use DataHub’s access logs and lineage to demonstrate control and trace decisions.

Operationalizing governance with workflows

  • Schedule DataHub ingestion for your warehouse, BI tools, and dbt artifacts (e.g., after each dbt Cloud job).
  • Use Airflow/Dagster to orchestrate “build → test → ingest → notify” pipelines.
  • Create runbooks: what happens when a test fails, who’s paged, how SLAs are measured.
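
One way to wire that sequence is a scheduled CI workflow; an Airflow or Dagster DAG would run the same ordered steps as tasks. The sketch below assumes GitHub Actions, a Snowflake adapter, and the dbt recipe from step 4, all of which are illustrative choices.

```yaml
# .github/workflows/governance-nightly.yml - build -> test -> ingest (illustrative)
name: governance-nightly

on:
  schedule:
    - cron: "0 5 * * *"      # nightly, after upstream loads finish

jobs:
  build-test-ingest:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - name: Install dbt and the DataHub CLI
        run: pip install dbt-snowflake 'acryl-datahub[dbt]'   # swap adapter/plugins as needed
      - name: Build and test dbt models
        env:
          SNOWFLAKE_PASSWORD: ${{ secrets.SNOWFLAKE_PASSWORD }}
        run: |
          dbt deps
          dbt build
          dbt docs generate                      # writes catalog.json next to manifest.json
      - name: Push dbt metadata to DataHub
        run: datahub ingest -c dbt_recipe.yaml   # recipe's sink points at your DataHub endpoint
```

The "notify" step then comes from the assertion alerts and Actions pipeline configured earlier, rather than from the workflow itself.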

KPIs to prove your governance is working

  • Coverage KPIs
      • % assets with owners and stewards
      • % assets with descriptions and glossary terms
      • % gold models with contracts enabled
  • Quality KPIs
      • Test coverage (tests per model/column)
      • Assertion pass rate by tier
      • MTTR for broken pipelines
  • Adoption KPIs
      • DataHub monthly active users
      • Top search terms and click-through
      • Access request SLAs
  • Risk KPIs
      • % PII columns with active policies
      • Number of unowned/undocumented assets (trending down)

Common pitfalls—and how to avoid them

  • Boiling the ocean: Start with one domain and 10–20 high-value assets; expand gradually.
  • Treating governance as “extra documentation”: Automate with dbt artifacts and DataHub ingestion; make governance part of the build.
  • No ownership: Governance without named owners stalls. Assign them early.
  • Inconsistent naming: Enforce conventions in CI/PR; make it a checklist, not a suggestion.
  • Catalog drift: Schedule ingestion and add alerts for new assets without owners or docs.

Security, privacy, and compliance

  • Use tags to flag PII and sensitive fields; apply access policies accordingly.
  • Keep an audit trail of changes and approvals via DataHub, git history in dbt, and your ticketing system.
  • Link glossary terms and policies to data retention and minimization standards.

A simple 30/60/90-day roadmap

  • 30 days: Deploy DataHub, ingest warehouse metadata, pilot dbt ingestion, add owners/descriptions for 10–20 assets.
  • 60 days: Enable tests and contracts on gold models, classify PII, wire Slack/Jira alerts for failed assertions, publish glossary.
  • 90 days: Enforce PR checks, expand to 2–3 more domains, implement policy-based access, and report governance KPIs to leadership.

Frequently Asked Questions

1) Do I need dbt Cloud, or can I use dbt Core?

Either works. dbt Cloud simplifies scheduling and artifact storage, while dbt Core gives you full control in your own CI/CD (e.g., GitHub Actions, Airflow). DataHub ingests artifacts from both.

2) How does DataHub ingest dbt metadata?

You schedule a DataHub ingestion recipe that points to dbt’s manifest.json and run_results.json. DataHub then maps models, exposures, tests, owners, tags, and lineage into a unified, searchable graph.

3) Can DataHub show column-level lineage from dbt?

Yes, when available. Enable column-level parsing in the dbt ingestion recipe and ensure your dbt project maintains clear source-to-target mappings. You’ll see field-to-field impact through your pipelines and BI.

4) What’s the difference between a data catalog and active metadata?

A traditional catalog centralizes documentation and search. An active metadata platform (like DataHub) also triggers automations—alerts on quality failures, approvals for schema changes, or policy enforcement. Learn more about active metadata management.

5) We already have a catalog. Why switch or add DataHub?

If your current catalog is static, hard to keep up-to-date, or doesn’t unify dbt tests, lineage, and policies in one place, you’ll struggle with trust and adoption. DataHub can coexist initially—start by ingesting dbt and warehouse metadata and compare coverage, lineage depth, and automation.

6) How do we handle PII and sensitive data?

Tag sensitive columns in DataHub (e.g., “PII.Email” or “Sensitive.Health”). Apply tag-based policies that restrict who can see, query, or export those fields. Keep a record of policy changes and approvals for compliance.

7) Can we enforce quality in CI/CD before merges?

Yes. Run dbt build in pull requests; fail the PR if tests or contracts would break. Optionally, add a step that checks for required descriptions/tags on new gold models. This “shift-left” approach prevents issues from reaching production.

8) Does this replace data observability tools?

Not necessarily. dbt tests are great for contract-level checks. Specialized observability tools add anomaly detection, freshness SLOs, and profiling. The key is to centralize all results as assertions in DataHub so teams have one place to see health and lineage.

9) How do we measure success?

Track coverage (owners, docs, glossary), quality (assertion pass rate, MTTR), adoption (searches, MAUs), and risk (PII policy coverage). Report trends monthly and tie them to real outcomes (fewer dashboard breakages, faster incident resolution).

10) Where should we start if our data architecture is still evolving?

Start by defining domains, ownership, and naming conventions. In parallel, get basics right: warehouse metadata ingestion, dbt artifacts ingestion, and one business glossary. For the platform itself, here’s a practical guide to building a solid data architecture. If you’re weighing tooling boundaries, this explainer on data catalogs vs. metadata management will help.


Final thoughts

Governance that slows down developers won’t last. Governance that lives in code and automates itself will. DataHub and dbt give you exactly that: a living, active view of your data—its lineage, quality, ownership, and meaning—kept fresh by the very pipelines that produce it.

Start small. Automate early. Publish definitions where people actually search. Then let your governance mature from a documentation project into a true operating system for trusted, explainable data.
