Study Notes 4.3.1 - Build the First dbt Models
1. Introduction to dbt Models dbt (Data Build Tool) sits on top of data platforms like BigQuery or Postgres. Objective: Transform raw data (e.g., green and yellow taxi trip data) into structured, usable formats for BI tools. Approach: Modular data modeling using fact tables and dimensional tables. Process: Start with raw data (sources). Build SQL scripts (models) to clean, deduplicate, recast, and rename data. Apply business logic to create fact and dimensional tables. Aggregate data for stakeholders. 2. Anatomy of a dbt Model Models are SQL files that contain SELECT statements. Materialization: Determines how the model is stored in the database. Types of Materialization: Ephemeral: Not physically stored; exists only within other models (like a CTE). View: Materialized as a view in the database. Table: Materialized as a physical table. Incremental: Updates only new data in the table (avoids recreating the entire table). File Structure: Each model is a .sql file. Example: model_name.sql contains a SELECT statement. dbt compiles the SQL into DDL/DML (e.g., CREATE TABLE AS SELECT). 3. Defining Sources Sources: Raw data tables loaded into the database (e.g., green and yellow taxi trip data). YAML Configuration: Define sources in a schema.yml file. Specify database, schema, and table names. Example: version: 2 sources: - name: taxi_trips database: my_database schema: trip_data tables: - name: green_trip_data - name: yellow_trip_data Benefits: Abstracts the complexity of source locations. Allows for freshness testing to ensure data is up-to-date. 4. Using Seeds Seeds: CSV files stored in the dbt repository (e.g., taxi_zone_lookup.csv). Purpose: Used for small, static datasets (e.g., master data tables). Benefits: Version-controlled. Can be documented and tested like other dbt models. Example: Load taxi_zone_lookup.csv as a seed. Use it in transformations to enrich trip data with zone information. 5. Modular Data Modeling Layers: Staging: Clean and prepare raw data. Example: stg_green_trip_data.sql and stg_yellow_trip_data.sql. Core: Create fact and dimensional tables. Example: fact_trips.sql and dim_zones.sql. Ref Function: Use ref('model_name') to reference other dbt models. Automatically builds dependencies between models. Example: SELECT * FROM {{ ref('stg_green_trip_data') }} 6. Macros in dbt Macros: Reusable SQL code snippets (similar to functions in Python). Purpose: Simplify repetitive tasks and dynamically generate SQL. Example: Macro to generate payment type descriptions: {% macro get_payment_type_description(payment_type) %} CASE CAST({{ payment_type }} AS INTEGER) WHEN 1 THEN 'Credit Card' WHEN 2 THEN 'Cash' ELSE 'Unknown' END {% endmacro %} Usage: Call the macro in a model: SELECT {{ get_payment_type_description('payment_type') }} AS payment_description 7. Packages in dbt Packages: Libraries of reusable macros and models. Example: dbt_utils package provides common utilities (e.g., surrogate_key). Installation: Add to packages.yml: packages: - package: dbt-labs/dbt_utils version: 0.8.0 Run dbt deps to install. Usage: Example: Generate a surrogate key: SELECT {{ dbt_utils.surrogate_key(['vendor_id', 'pickup_datetime']) }} AS trip_key 8. Variables in dbt Variables: Dynamic values that can be defined at the project level or passed during execution. Usage: Define in dbt_project.yml: vars: payment_type_values: [1, 2, 3, 4, 5, 6] Use in a model: {% if var('is_test_run', true) %} LIMIT 100 {% endif %} Execution: Pass variables via CLI: dbt run --vars '{"is_test_run": false}' 9. Building Fact and Dimensional Tables Fact Table: Aggregates data from multiple sources. Example: fact_trips.sql combines green and yellow trip data. Union both datasets and join with dim_zones for zone information. Dimensional Table: Provides context for fact tables. Example: dim_zones.sql contains zone details (e.g., borough, service zone). 10. Testing and Deployment Testing: Use dbt's built-in tests (e.g., uniqueness, not null). Example: tests: - unique - not_null Deployment: Run dbt run to compile and execute models. Use dbt build to run models and tests together. Example: dbt build --vars '{"is_test_run": false}' Best Practice to implement dev limit which can be toggle off when in production by simply setting default=false: -- dbt build --select --vars '{'is_test_run': 'false'}' {% if var('is_test_run', default=true) %} limit 100 {% endif %} → faster, cheaper and can be made as macro to follow DRY ( Don’t Repeat Y

1. Introduction to dbt Models
- dbt (Data Build Tool) sits on top of data platforms like BigQuery or Postgres.
- Objective: Transform raw data (e.g., green and yellow taxi trip data) into structured, usable formats for BI tools.
- Approach: Modular data modeling using fact tables and dimensional tables.
-
Process:
- Start with raw data (sources).
- Build SQL scripts (models) to clean, deduplicate, recast, and rename data.
- Apply business logic to create fact and dimensional tables.
- Aggregate data for stakeholders.
2. Anatomy of a dbt Model
-
Models are SQL files that contain
SELECT
statements. -
Materialization: Determines how the model is stored in the database.
-
Types of Materialization:
- Ephemeral: Not physically stored; exists only within other models (like a CTE).
- View: Materialized as a view in the database.
- Table: Materialized as a physical table.
- Incremental: Updates only new data in the table (avoids recreating the entire table).
-
Types of Materialization:
-
File Structure:
- Each model is a
.sql
file. - Example:
model_name.sql
contains aSELECT
statement. - dbt compiles the SQL into DDL/DML (e.g.,
CREATE TABLE AS SELECT
).
- Each model is a
3. Defining Sources
- Sources: Raw data tables loaded into the database (e.g., green and yellow taxi trip data).
-
YAML Configuration:
- Define sources in a
schema.yml
file. - Specify database, schema, and table names.
-
Example:
version: 2 sources: - name: taxi_trips database: my_database schema: trip_data tables: - name: green_trip_data - name: yellow_trip_data
- Define sources in a
-
Benefits:
- Abstracts the complexity of source locations.
- Allows for freshness testing to ensure data is up-to-date.
4. Using Seeds
-
Seeds: CSV files stored in the dbt repository (e.g.,
taxi_zone_lookup.csv
). - Purpose: Used for small, static datasets (e.g., master data tables).
-
Benefits:
- Version-controlled.
- Can be documented and tested like other dbt models.
-
Example:
- Load
taxi_zone_lookup.csv
as a seed. - Use it in transformations to enrich trip data with zone information.
- Load
5. Modular Data Modeling
-
Layers:
-
Staging: Clean and prepare raw data.
- Example:
stg_green_trip_data.sql
andstg_yellow_trip_data.sql
.
- Example:
-
Core: Create fact and dimensional tables.
- Example:
fact_trips.sql
anddim_zones.sql
.
- Example:
-
Staging: Clean and prepare raw data.
-
Ref Function:
- Use
ref('model_name')
to reference other dbt models. - Automatically builds dependencies between models.
-
Example:
SELECT * FROM {{ ref('stg_green_trip_data') }}
- Use
6. Macros in dbt
- Macros: Reusable SQL code snippets (similar to functions in Python).
- Purpose: Simplify repetitive tasks and dynamically generate SQL.
-
Example:
-
Macro to generate payment type descriptions:
{% macro get_payment_type_description(payment_type) %} CASE CAST({{ payment_type }} AS INTEGER) WHEN 1 THEN 'Credit Card' WHEN 2 THEN 'Cash' ELSE 'Unknown' END {% endmacro %}
-
-
Usage:
-
Call the macro in a model:
SELECT {{ get_payment_type_description('payment_type') }} AS payment_description
-
7. Packages in dbt
- Packages: Libraries of reusable macros and models.
-
Example:
dbt_utils
package provides common utilities (e.g.,surrogate_key
). -
Installation:
-
Add to
packages.yml
:
packages: - package: dbt-labs/dbt_utils version: 0.8.0
Run
dbt deps
to install.
-
-
Usage:
-
Example: Generate a surrogate key:
SELECT {{ dbt_utils.surrogate_key(['vendor_id', 'pickup_datetime']) }} AS trip_key
-
8. Variables in dbt
- Variables: Dynamic values that can be defined at the project level or passed during execution.
-
Usage:
-
Define in
dbt_project.yml
:
vars: payment_type_values: [1, 2, 3, 4, 5, 6]
-
Use in a model:
{% if var('is_test_run', true) %} LIMIT 100 {% endif %}
-
-
Execution:
-
Pass variables via CLI:
dbt run --vars '{"is_test_run": false}'
-
9. Building Fact and Dimensional Tables
-
Fact Table: Aggregates data from multiple sources.
- Example:
fact_trips.sql
combines green and yellow trip data. - Union both datasets and join with
dim_zones
for zone information.
- Example:
-
Dimensional Table: Provides context for fact tables.
- Example:
dim_zones.sql
contains zone details (e.g., borough, service zone).
- Example:
10. Testing and Deployment
-
Testing:
- Use dbt's built-in tests (e.g., uniqueness, not null).
-
Example:
tests: - unique - not_null
-
Deployment:
- Run
dbt run
to compile and execute models. - Use
dbt build
to run models and tests together. -
Example:
dbt build --vars '{"is_test_run": false}'
-
Best Practice to implement
dev limit
which can be toggle off when in production by simply settingdefault=false
:
-- dbt build --select <model_name> --vars '{'is_test_run': 'false'}' {% if var('is_test_run', default=true) %} limit 100 {% endif %}
→ faster, cheaper and can be made as macro to follow DRY ( Don’t Repeat Yourself) methodology.
- Run
11. Lineage and Dependency Management
- Lineage: Visual representation of dependencies between models.
-
Benefits:
- Understand how data flows through the pipeline.
- Identify upstream and downstream impacts of changes.
-
Example:
-
fact_trips
depends onstg_green_trip_data
,stg_yellow_trip_data
, anddim_zones
.
-
12. Best Practices
- Modularity: Break down transformations into reusable models.
- Documentation: Use YAML files to document sources, models, and tests.
- Version Control: Store dbt projects in Git for collaboration and history.
- Performance: Materialize large tables as tables (not views) for better performance.
13. Key Commands
-
dbt run
: Compiles and runs models. -
dbt test
: Runs tests on models. -
dbt build
: Runs models and tests together. -
dbt deps
: Installs packages. -
dbt compile
: Compiles SQL without executing it.
14. Example Project Structure
dbt_project/
├── models/
│ ├── staging/
│ │ ├── stg_green_trip_data.sql
│ │ ├── stg_yellow_trip_data.sql
│ ├── core/
│ │ ├── fact_trips.sql
│ │ ├── dim_zones.sql
├── seeds/
│ ├── taxi_zone_lookup.csv
├── macros/
│ ├── get_payment_type_description.sql
├── schema.yml
├── dbt_project.yml
├── packages.yml