dev.to

Study Notes 4.3.1 - Build the First dbt Models

1. Introduction to dbt Models dbt (Data Build Tool) sits on top of data platforms like BigQuery or Postgres. Objective: Transform raw data (e.g., green and yellow taxi trip data) into structured, usable formats for BI tools. Approach: Modular data modeling using fact tables and dimensional tables. Process: Start with raw data (sources). Build SQL scripts (models) to clean, deduplicate, recast, and rename data. Apply business logic to create fact and dimensional tables. Aggregate data for stakeholders. 2. Anatomy of a dbt Model Models are SQL files that contain SELECT statements. Materialization: Determines how the model is stored in the database. Types of Materialization: Ephemeral: Not physically stored; exists only within other models (like a CTE). View: Materialized as a view in the database. Table: Materialized as a physical table. Incremental: Updates only new data in the table (avoids recreating the entire table). File Structure: Each model is a .sql file. Example: model_name.sql contains a SELECT statement. dbt compiles the SQL into DDL/DML (e.g., CREATE TABLE AS SELECT). 3. Defining Sources Sources: Raw data tables loaded into the database (e.g., green and yellow taxi trip data). YAML Configuration: Define sources in a schema.yml file. Specify database, schema, and table names. Example: version: 2 sources: - name: taxi_trips database: my_database schema: trip_data tables: - name: green_trip_data - name: yellow_trip_data Benefits: Abstracts the complexity of source locations. Allows for freshness testing to ensure data is up-to-date. 4. Using Seeds Seeds: CSV files stored in the dbt repository (e.g., taxi_zone_lookup.csv). Purpose: Used for small, static datasets (e.g., master data tables). Benefits: Version-controlled. Can be documented and tested like other dbt models. Example: Load taxi_zone_lookup.csv as a seed. Use it in transformations to enrich trip data with zone information. 5. Modular Data Modeling Layers: Staging: Clean and prepare raw data. Example: stg_green_trip_data.sql and stg_yellow_trip_data.sql. Core: Create fact and dimensional tables. Example: fact_trips.sql and dim_zones.sql. Ref Function: Use ref('model_name') to reference other dbt models. Automatically builds dependencies between models. Example: SELECT * FROM {{ ref('stg_green_trip_data') }} 6. Macros in dbt Macros: Reusable SQL code snippets (similar to functions in Python). Purpose: Simplify repetitive tasks and dynamically generate SQL. Example: Macro to generate payment type descriptions: {% macro get_payment_type_description(payment_type) %} CASE CAST({{ payment_type }} AS INTEGER) WHEN 1 THEN 'Credit Card' WHEN 2 THEN 'Cash' ELSE 'Unknown' END {% endmacro %} Usage: Call the macro in a model: SELECT {{ get_payment_type_description('payment_type') }} AS payment_description 7. Packages in dbt Packages: Libraries of reusable macros and models. Example: dbt_utils package provides common utilities (e.g., surrogate_key). Installation: Add to packages.yml: packages: - package: dbt-labs/dbt_utils version: 0.8.0 Run dbt deps to install. Usage: Example: Generate a surrogate key: SELECT {{ dbt_utils.surrogate_key(['vendor_id', 'pickup_datetime']) }} AS trip_key 8. Variables in dbt Variables: Dynamic values that can be defined at the project level or passed during execution. Usage: Define in dbt_project.yml: vars: payment_type_values: [1, 2, 3, 4, 5, 6] Use in a model: {% if var('is_test_run', true) %} LIMIT 100 {% endif %} Execution: Pass variables via CLI: dbt run --vars '{"is_test_run": false}' 9. Building Fact and Dimensional Tables Fact Table: Aggregates data from multiple sources. Example: fact_trips.sql combines green and yellow trip data. Union both datasets and join with dim_zones for zone information. Dimensional Table: Provides context for fact tables. Example: dim_zones.sql contains zone details (e.g., borough, service zone). 10. Testing and Deployment Testing: Use dbt's built-in tests (e.g., uniqueness, not null). Example: tests: - unique - not_null Deployment: Run dbt run to compile and execute models. Use dbt build to run models and tests together. Example: dbt build --vars '{"is_test_run": false}' Best Practice to implement dev limit which can be toggle off when in production by simply setting default=false: -- dbt build --select --vars '{'is_test_run': 'false'}' {% if var('is_test_run', default=true) %} limit 100 {% endif %} → faster, cheaper and can be made as macro to follow DRY ( Don’t Repeat Y

Feb 25, 2025 - 22:06

Study Notes 4.3.1 - Build the First dbt Models

1. Introduction to dbt Models

dbt (Data Build Tool) sits on top of data platforms like BigQuery or Postgres.
Objective: Transform raw data (e.g., green and yellow taxi trip data) into structured, usable formats for BI tools.
Approach: Modular data modeling using fact tables and dimensional tables.

Process:
- Start with raw data (sources).
- Build SQL scripts (models) to clean, deduplicate, recast, and rename data.
- Apply business logic to create fact and dimensional tables.
- Aggregate data for stakeholders.

2. Anatomy of a dbt Model

Models are SQL files that contain SELECT statements.
Materialization: Determines how the model is stored in the database.
- Types of Materialization:
  1. Ephemeral: Not physically stored; exists only within other models (like a CTE).
  2. View: Materialized as a view in the database.
  3. Table: Materialized as a physical table.
  4. Incremental: Updates only new data in the table (avoids recreating the entire table).

File Structure:
- Each model is a .sql file.
- Example: model_name.sql contains a SELECT statement.
- dbt compiles the SQL into DDL/DML (e.g., CREATE TABLE AS SELECT).

3. Defining Sources

Sources: Raw data tables loaded into the database (e.g., green and yellow taxi trip data).

YAML Configuration:

Define sources in a schema.yml file.
Specify database, schema, and table names.

Example:

version: 2
sources:
  - name: taxi_trips
    database: my_database
    schema: trip_data
    tables:
      - name: green_trip_data
      - name: yellow_trip_data

Benefits:
- Abstracts the complexity of source locations.
- Allows for freshness testing to ensure data is up-to-date.

4. Using Seeds

Seeds: CSV files stored in the dbt repository (e.g., taxi_zone_lookup.csv).
Purpose: Used for small, static datasets (e.g., master data tables).
Benefits:
- Version-controlled.
- Can be documented and tested like other dbt models.
Example:
- Load taxi_zone_lookup.csv as a seed.
- Use it in transformations to enrich trip data with zone information.

5. Modular Data Modeling

Layers:
1. Staging: Clean and prepare raw data.
  - Example: stg_green_trip_data.sql and stg_yellow_trip_data.sql.
2. Core: Create fact and dimensional tables.
  - Example: fact_trips.sql and dim_zones.sql.

Ref Function:
- Use ref('model_name') to reference other dbt models.
- Automatically builds dependencies between models.
- Example:
```
SELECT * FROM {{ ref('stg_green_trip_data') }}
```

6. Macros in dbt

Macros: Reusable SQL code snippets (similar to functions in Python).
Purpose: Simplify repetitive tasks and dynamically generate SQL.

Example:

Macro to generate payment type descriptions:

{% macro get_payment_type_description(payment_type) %}
    CASE CAST({{ payment_type }} AS INTEGER)
        WHEN 1 THEN 'Credit Card'
        WHEN 2 THEN 'Cash'
        ELSE 'Unknown'
    END
{% endmacro %}

Usage:

Call the macro in a model:

SELECT {{ get_payment_type_description('payment_type') }} AS payment_description

7. Packages in dbt

Packages: Libraries of reusable macros and models.
Example: dbt_utils package provides common utilities (e.g., surrogate_key).

Installation:

Add to packages.yml:

packages:
  - package: dbt-labs/dbt_utils
    version: 0.8.0

Run dbt deps to install.

Usage:

Example: Generate a surrogate key:

SELECT {{ dbt_utils.surrogate_key(['vendor_id', 'pickup_datetime']) }} AS trip_key

8. Variables in dbt

Variables: Dynamic values that can be defined at the project level or passed during execution.

Usage:

Define in dbt_project.yml:

vars:
  payment_type_values: [1, 2, 3, 4, 5, 6]

Use in a model:

{% if var('is_test_run', true) %}
    LIMIT 100
{% endif %}

Execution:

Pass variables via CLI:

dbt run --vars '{"is_test_run": false}'

9. Building Fact and Dimensional Tables

Fact Table: Aggregates data from multiple sources.
- Example: fact_trips.sql combines green and yellow trip data.
- Union both datasets and join with dim_zones for zone information.
Dimensional Table: Provides context for fact tables.
- Example: dim_zones.sql contains zone details (e.g., borough, service zone).

10. Testing and Deployment

Testing:
- Use dbt's built-in tests (e.g., uniqueness, not null).
- Example:
```
tests:
  - unique
  - not_null
```
Deployment:
- Run dbt run to compile and execute models.
- Use dbt build to run models and tests together.
- Example:
```
dbt build --vars '{"is_test_run": false}'
```
- Best Practice to implement dev limit which can be toggle off when in production by simply setting default=false:
```
-- dbt build --select <model_name> --vars '{'is_test_run': 'false'}'
{% if var('is_test_run', default=true) %}

  limit 100

{% endif %}
```
  → faster, cheaper and can be made as macro to follow DRY ( Don’t Repeat Yourself) methodology.

11. Lineage and Dependency Management

Lineage: Visual representation of dependencies between models.
Benefits:
- Understand how data flows through the pipeline.
- Identify upstream and downstream impacts of changes.
Example:
- fact_trips depends on stg_green_trip_data, stg_yellow_trip_data, and dim_zones.

12. Best Practices

Modularity: Break down transformations into reusable models.
Documentation: Use YAML files to document sources, models, and tests.
Version Control: Store dbt projects in Git for collaboration and history.
Performance: Materialize large tables as tables (not views) for better performance.

13. Key Commands

dbt run: Compiles and runs models.
dbt test: Runs tests on models.
dbt build: Runs models and tests together.
dbt deps: Installs packages.
dbt compile: Compiles SQL without executing it.

14. Example Project Structure

dbt_project/
├── models/
│   ├── staging/
│   │   ├── stg_green_trip_data.sql
│   │   ├── stg_yellow_trip_data.sql
│   ├── core/
│   │   ├── fact_trips.sql
│   │   ├── dim_zones.sql
├── seeds/
│   ├── taxi_zone_lookup.csv
├── macros/
│   ├── get_payment_type_description.sql
├── schema.yml
├── dbt_project.yml
├── packages.yml