Data Modeling and Transformation with dbt: A Practical, End-to-End Guide

November 17, 2025 at 01:32 PM | Est. read time: 13 min

By Valentina Vianna

Community manager and producer of specialized marketing content

If you’ve ever wished your data transformations were as simple to version, test, and review as application code, dbt (data build tool) is probably already on your radar. dbt lets analytics engineers and data teams transform raw data into clean, reliable models inside your data warehouse or lakehouse using SQL and software engineering best practices.

In this guide, you’ll learn what dbt is, how it works, and how to design a maintainable dbt project—from naming conventions and layers to testing, documentation, incremental models, and orchestration. You’ll also see how dbt fits into modern architectures like the lakehouse and the Medallion approach, with practical tips you can apply right away.

For a focused overview, see this deep dive into dbt (Data Build Tool).

What is dbt and why it matters

dbt is a transformation framework for the ELT world. Instead of transforming data outside the warehouse before loading it (traditional ETL), dbt embraces ELT: load raw data first, then transform it inside the warehouse with SQL. dbt provides:

  • A modular modeling framework using SQL + Jinja
  • Dependency management and DAGs via the ref() function
  • Built-in testing (schema and data tests)
  • Documentation and lineage generated from your project
  • Environment-aware builds and CI/CD integration
  • Support for modern platforms like Snowflake, BigQuery, Redshift, Databricks, and Spark

dbt helps bridge data engineering and analytics—a core goal of modern analytics engineering. If you’re establishing the backbone of your analytics stack, understanding the broader discipline is useful. Explore the role of data engineering in modern analytics in this guide: What is Data Engineering—The Backbone of Modern Business Intelligence.

How dbt works in plain English

  • You write models as plain select statements; dbt compiles the SQL and Jinja into executable queries for your warehouse (see the sketch after this list).
  • Model dependencies form a DAG. dbt figures out build order automatically.
  • Materializations define how models persist (view, table, incremental, ephemeral).
  • Tests verify constraints and data quality before models reach stakeholders.
  • Docs and lineage get generated from your project to help people trust and understand your data.
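
As a minimal sketch (the model, source, and column names here are assumptions for illustration), a model file is just a SELECT; ref() declares the upstream dependency and config() picks the materialization:

```sql
-- models/orders_per_customer.sql (illustrative; names are assumptions)
{{ config(materialized='view') }}

select
    customer_id,
    count(*) as order_count
from {{ ref('stg_shop__orders') }}  -- ref() wires this model into the DAG
group by customer_id
```

dbt resolves ref() to the right schema for the target environment and builds models in dependency order.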

When to use dbt

dbt is a great fit when:

  • Your data lives in a warehouse or lakehouse and SQL is your main transformation language.
  • You want repeatable, testable, version-controlled pipelines.
  • You’re adopting analytics engineering practices and need modular, reusable transformations.
  • You need governance: lineage, ownership, documentation, and data contracts.

A maintainable dbt project structure

A clear structure is the difference between a project that scales and one that slows you down. A common pattern, with a configuration sketch after the list:

  • staging/ (stg_): One-to-one models with sources; light cleaning, type casting, and de-duplication.
  • intermediate/ (int_): Business logic joins, aggregations, and shaping for marts.
  • marts/ (dim_/fct_): Star-schema outputs for analytics—dimensions and fact tables.
  • snapshots/: Slowly Changing Dimensions (SCD) history tracking.
  • seeds/: Static reference data in CSVs (e.g., country codes).
  • tests/: Custom and reusable tests.
  • macros/: Helper functions to DRY up logic.
  • analyses/ and exposures/: Ad hoc SQL and downstream dependencies (dashboards, ML).
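
One way to reinforce these layers is folder-level configuration in dbt_project.yml. A sketch, assuming a project named analytics and the folder names above:

```yaml
# dbt_project.yml (excerpt; project name and defaults are assumptions)
name: analytics
profile: analytics

models:
  analytics:
    staging:
      +materialized: view        # cheap, always fresh
    intermediate:
      +materialized: ephemeral   # inlined into downstream models
    marts:
      +materialized: table       # fast reads for BI
```

Folder-level defaults keep individual models clean; any model can still override its materialization with its own config() block.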

Use consistent, descriptive naming:

  • stg_<source>__<entity>
  • int_<entity>__<transformation>
  • dim_<entity>
  • fct_<event>

This makes intent and lineage obvious.

Core dbt concepts you’ll use daily

  • ref(): Declares dependencies between models and builds the DAG.
  • source(): References raw tables and enables freshness checks.
  • Materializations:
      • view: Lightweight, always up-to-date, higher runtime cost.
      • table: Physicalized, faster reads, refreshed on rebuild.
      • incremental: Only processes new or changed data—great for large tables.
      • ephemeral: Inlined at compile time, no physical object created.
  • Tests:
      • Generic schema tests (unique, not_null, accepted_values, relationships)
      • Custom SQL-based tests for business rules
  • Seeds: Small CSVs turned into warehouse tables.
  • Snapshots: Track historical changes for SCD Type 2.
  • Macros and packages: Reuse logic across models and projects (e.g., dbt-utils).
  • Documentation and lineage: Auto-generated site with node-by-node lineage.

Designing the modeling layers

1) Staging (stg_)

  • One model per source table
  • Type cast, rename columns, standardize timestamps and IDs
  • De-duplicate and apply lightweight data quality checks
  • Keep transformations minimal and predictable
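
A typical staging model, sketched with assumed source and column names: read from source(), cast and rename, then keep only the latest version of each row:

```sql
-- models/staging/stg_shop__orders.sql (illustrative names)
with source as (

    select * from {{ source('shop', 'orders') }}

),

deduplicated as (

    select
        *,
        row_number() over (
            partition by id
            order by updated_at desc
        ) as row_num
    from source

)

select
    cast(id as bigint)            as order_id,
    cast(customer_id as bigint)   as customer_id,
    lower(status)                 as order_status,
    cast(total_amount as numeric) as order_total,
    cast(created_at as timestamp) as ordered_at
from deduplicated
where row_num = 1  -- keep only the most recent version of each order
```

This assumes the shop source is declared in a sources .yml file, which is also where freshness checks are configured.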

2) Intermediate (int_)

  • Join across staged sources to create business-ready entities
  • Apply business rules and surrogate keys
  • Implement pre-aggregation and standard calculations reused by marts

3) Marts (dim_, fct_)

  • Dimensional models optimized for analytics and BI
  • Dimensions store attributes (customer, product, calendar)
  • Facts store events or transactions (orders, pageviews)
  • Keep marts tidy and well-tested; they are your “contract” to analysts
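
A fact model then assembles staged and intermediate inputs into the analyst-facing contract. The sketch below uses hypothetical upstream models and the dbt_utils surrogate key macro covered later:

```sql
-- models/marts/fct_orders.sql (illustrative; upstream model names are assumptions)
{{ config(materialized='table') }}

select
    {{ dbt_utils.generate_surrogate_key(['o.order_id']) }} as order_key,
    c.customer_key,
    o.order_id,
    o.customer_id,
    o.order_status,
    o.order_total,
    o.ordered_at
from {{ ref('int_orders__enriched') }} as o
left join {{ ref('dim_customers') }} as c
    on o.customer_id = c.customer_id
```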

Explore how this maps into the modern lakehouse using the Bronze/Silver/Gold approach in this primer: Medallion Architecture Explained.

Testing and data contracts

Reliable analytics requires confidence in your data; a sample test configuration follows the list below.

  • Schema tests:
      • not_null on primary keys
      • unique on natural or surrogate keys
      • relationships to validate foreign keys
      • accepted_values for enums and statuses
  • Custom tests:
      • Business rules (e.g., revenue >= 0, dates within expected ranges)
  • Source freshness:
      • Detect stale pipelines before stakeholders feel it
  • Data contracts (via tests and constraints):
      • Treat marts like APIs—breaking changes must be deliberate and reviewed
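
Most of these checks live in YAML next to the models. A sketch, reusing the hypothetical fct_orders model and assumed status values:

```yaml
# models/marts/_marts__models.yml (illustrative; names and statuses are assumptions)
version: 2

models:
  - name: fct_orders
    columns:
      - name: order_key
        tests:
          - not_null
          - unique
      - name: customer_id
        tests:
          - relationships:
              to: ref('dim_customers')
              field: customer_id
      - name: order_status
        tests:
          - accepted_values:
              values: ['placed', 'shipped', 'completed', 'returned']
```

Custom business-rule tests are plain SELECT statements saved under tests/; they fail if the query returns any rows (for example, orders with negative totals).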

Incremental models, performance, and cost control

Incremental models are essential for large tables (a sketch follows the list below):

  • Use unique_key to identify rows to update/merge.
  • Partition and cluster where supported:
      • BigQuery: partition by date, cluster by high-cardinality fields
      • Snowflake: consider clustering keys for very large tables
      • Databricks/Spark: leverage partitioned storage for pruning
  • Prune columns to what you need; avoid SELECT *
  • Push computation to the warehouse; minimize Python UDFs unless necessary
  • Use dbt’s state comparison for faster CI:
      • Build only modified models and their children (state:modified+)
      • Defer to production artifacts for dependency resolution
  • Prefer dbt build in CI to run and test in one command
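
A sketch of an incremental model, assuming BigQuery-style partitioning and hypothetical event columns; other adapters take slightly different partition options:

```sql
-- models/marts/fct_events.sql (illustrative; columns and config are assumptions)
{{
    config(
        materialized='incremental',
        unique_key='event_id',
        incremental_strategy='merge',
        partition_by={'field': 'event_date', 'data_type': 'date'}  -- BigQuery-flavored
    )
}}

select
    event_id,
    user_id,
    event_type,
    cast(event_timestamp as date) as event_date,
    event_timestamp
from {{ ref('stg_app__events') }}

{% if is_incremental() %}
  -- on incremental runs, only process rows newer than what the target already holds
  where event_timestamp > (select max(event_timestamp) from {{ this }})
{% endif %}
```

On the first run dbt builds the full table; later runs only merge new rows keyed by event_id.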

Snapshots for history (SCD2)

When you need to track how a record changes over time (e.g., customer tier), use a snapshot (sketched after this list):

  • Define a snapshot with a unique key and updated_at or check columns
  • dbt will maintain dbt_valid_from/dbt_valid_to columns that define each record version’s validity window
  • Join snapshots with facts to analyze historical state at event time
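
A snapshot is defined in its own file. The sketch below uses the timestamp strategy and assumes a customers source with an updated_at column:

```sql
-- snapshots/customers_snapshot.sql (illustrative; source and columns are assumptions)
{% snapshot customers_snapshot %}

{{
    config(
        target_schema='snapshots',
        unique_key='customer_id',
        strategy='timestamp',
        updated_at='updated_at'
    )
}}

select * from {{ source('shop', 'customers') }}

{% endsnapshot %}
```

Run it with dbt snapshot; subsequent runs close out changed rows and insert the new versions.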

Seeds and reference data

  • Store small, stable lookup tables in seeds
  • Version-control CSVs so changes are reviewed
  • Add tests to enforce completeness and consistency

Macros and packages

  • Macros let you encapsulate repetitive SQL/Jinja logic (see the sketch after this list)
  • Packages like dbt-utils offer proven macros (surrogate_key, date_spine, etc.)
  • Keep macros generalized; keep model-specific logic inside models
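
A macro is just reusable Jinja. The helper below is a hypothetical example, not a dbt or dbt-utils built-in:

```sql
-- macros/cents_to_dollars.sql (illustrative helper)
{% macro cents_to_dollars(column_name, decimal_places=2) %}
    round(cast({{ column_name }} as numeric) / 100, {{ decimal_places }})
{% endmacro %}
```

A model can then call {{ cents_to_dollars('amount_cents') }} instead of repeating the expression everywhere.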

Documentation and lineage

  • Add descriptions to models, columns, and sources (YAML)
  • Generate docs to visualize lineage and searchable metadata
  • Use exposures to document downstream reports, dashboards, and ML jobs
  • Tag owners and SLAs for operational clarity
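
Descriptions and exposures both live in YAML. A sketch, assuming the hypothetical fct_orders model and an example dashboard owner:

```yaml
# models/marts/_marts__docs.yml (illustrative; names, owners, and emails are assumptions)
version: 2

models:
  - name: fct_orders
    description: "One row per order; the analytics-facing contract for order reporting."

exposures:
  - name: revenue_dashboard
    type: dashboard
    maturity: high
    owner:
      name: Analytics Team
      email: analytics@example.com
    depends_on:
      - ref('fct_orders')
```

Running dbt docs generate builds the documentation site and lineage graph from these files plus the project manifest.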

Orchestration and environments

  • dbt Core vs dbt Cloud:
      • Core: open-source CLI, integrate with your own orchestrator
      • Cloud: hosted IDE, jobs, scheduling, and easy documentation hosting
  • Orchestrate with Airflow, Dagster, Prefect, or warehouse-native schedulers
  • Promote changes through dev → staging → prod
  • Use environment variables for secrets—never commit credentials
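
With dbt Core, connection settings live in profiles.yml outside the repository, and env_var() keeps secrets out of the file itself. A sketch assuming a Snowflake target; every value here is an assumption:

```yaml
# ~/.dbt/profiles.yml (illustrative Snowflake profile; all values are assumptions)
analytics:
  target: dev
  outputs:
    dev:
      type: snowflake
      account: "{{ env_var('SNOWFLAKE_ACCOUNT') }}"
      user: "{{ env_var('SNOWFLAKE_USER') }}"
      password: "{{ env_var('SNOWFLAKE_PASSWORD') }}"
      role: TRANSFORMER
      database: ANALYTICS_DEV
      warehouse: TRANSFORMING
      schema: dbt_dev
      threads: 4
```

Additional outputs (staging, prod) in the same profile map cleanly to the dev → staging → prod promotion above.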

Governance, code review, and CI/CD

  • Pull requests with linters and automated tests
  • Enforce naming conventions and folder structure
  • Keep models small and composable; avoid “god queries”
  • Create an onboarding playbook for new contributors
  • Monitor run artifacts for timing, failures, and data tests

Common pitfalls and how to avoid them

  • Overloading staging with business logic: keep it light and predictable
  • Giant, monolithic marts: split into smaller, testable units
  • Lack of tests: treat marts like APIs, test them
  • SELECT * in production models: control columns explicitly
  • No incremental strategy on large tables: costs and runtimes will hurt
  • Missing ownership metadata: add owner tags and documentation
  • Ignoring source freshness: catch upstream issues early

Example end-to-end flow

  • Load raw data into your warehouse
  • Define sources with freshness checks
  • Build stg_ models to standardize schemas and IDs
  • Create int_ models to join and shape business entities
  • Publish dim_ and fct_ marts with comprehensive tests
  • Snapshot entities that need history
  • Document everything and generate lineage
  • Schedule runs and monitor performance and test results

Where dbt shines in modern architectures

dbt amplifies the value of your warehouse or lakehouse by standardizing how teams transform and govern data. It complements ingestion frameworks and orchestration tools while keeping the transformation logic versioned, testable, and discoverable.

To connect the dots with your broader platform choices, read this lakehouse-aligned overview: Medallion Architecture Explained. And if you’re formalizing the data engineering function, this primer outlines the foundations and responsibilities: What is Data Engineering—The Backbone of Modern Business Intelligence.

Implementation checklist

  • Define modeling layers and naming conventions
  • Choose materializations for each model
  • Add schema tests and critical custom tests
  • Establish source freshness checks
  • Implement incremental strategies for large tables
  • Add snapshots where history matters
  • Document models, columns, owners, and SLAs
  • Wire CI/CD to run dbt build on PRs
  • Schedule production jobs and monitor artifacts
  • Review costs and optimize partitions/clustering periodically

FAQ: dbt Data Modeling and Transformation

1) What is dbt, in one sentence?

dbt is a framework that lets you model, test, and document SQL transformations in your warehouse using software engineering best practices.

2) How is dbt different from ETL tools?

Traditional ETL transforms data outside the warehouse; dbt embraces ELT—load first, then transform inside the warehouse. dbt focuses on the T in ELT, with version control, tests, and documentation built in.

3) When should I use views vs tables vs incremental models?

  • Views: small or frequently changing logic; pay compute at query time.
  • Tables: stable datasets you want fast reads on; rebuild on schedule.
  • Incremental: large datasets where you only process new/changed rows for speed and cost control.

4) What tests should every model have?

At minimum, not_null and unique on primary keys; relationships for foreign keys; accepted_values for enumerations. Then add custom business-rule tests (e.g., non-negative amounts, realistic dates).

5) How do I track historical changes with dbt?

Use snapshots (SCD Type 2). They maintain dbt_valid_from/dbt_valid_to columns so you can analyze facts based on the entity's state at the time of the event.

6) What are seeds and when should I use them?

Seeds are small CSVs turned into warehouse tables—perfect for static reference data like country codes or mappings. Version them in Git and test them like any other model.

7) Do I need an orchestrator if I use dbt?

dbt Cloud includes scheduling. With dbt Core, you’ll typically use Airflow, Dagster, Prefect, or warehouse-native schedulers to run jobs and manage dependencies.

8) How do I keep costs under control with dbt?

Use incremental models with partitions/clustering, avoid SELECT *, process only changed models in CI (state:modified+), and periodically review heavy models for pruning and refactoring.

9) How does dbt help with documentation and lineage?

dbt generates rich documentation sites with model and column descriptions, tests, and a clickable lineage graph. Exposures document downstream dashboards and ML jobs.

10) Where does dbt fit in a lakehouse or Medallion architecture?

dbt aligns naturally with Bronze/Silver/Gold layers: stage raw data (Bronze), refine and conform (Silver), and publish analytics-ready marts (Gold). Its testing and documentation reinforce quality and trust at each step.

For another perspective and further reading on dbt’s role in modern teams, check out this overview: dbt (Data Build Tool).
