Study Notes 2.2.7: Managing Schedules and Backfills with BigQuery in Kestra

Overview

This study note summarizes the video "DE Zoomcamp 2.2.7 - Manage Schedules and Backfills with BigQuery in Kestra". The video covers adding schedules to a BigQuery workflow and backfilling data from 2019 and 2020 using the workflow orchestration tool Kestra.

Key Concepts

  1. Workflow Orchestration:
    • Automating tasks and data pipelines to handle data workflows efficiently.
  2. Kestra:
    • An open-source orchestration tool used to manage complex workflows.
  3. BigQuery:
    • A fully-managed, serverless data warehouse that enables scalable analysis over large datasets.
  4. Backfills:
    • The process of filling in missing data retrospectively.

Workflow Overview

  • The workflow automatically schedules data extraction and loading into BigQuery.
  • Manual date inputs are replaced with schedule triggers that run the tasks at specified times (see the snippet below).
  • Both the "green" and "yellow" taxi datasets are supported.
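
As a rough illustration of the change, the month that was previously chosen by hand through flow inputs can instead come from the schedule trigger's date. The input names and filename pattern below are illustrative, not necessarily the exact ones used in the course flow:

```yaml
# Before: the month to process was selected manually through flow inputs
# (input names here are illustrative).
file: "{{ inputs.taxi }}_tripdata_{{ inputs.year }}-{{ inputs.month }}.csv"

# After: the Schedule trigger supplies the date, so the same expression becomes:
file: "{{ inputs.taxi }}_tripdata_{{ trigger.date | date('yyyy-MM') }}.csv"
```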

Steps to Implement Schedules and Backfills

  1. Setup Triggers:
    • Triggers are set to run at different times for different datasets.
    • Example: "Green" dataset at 9:00 AM and "Yellow" dataset at 10:00 AM on the 1st of each month.
  2. Workflow Modifications:
    • Replace the manual date inputs with automatic triggers.
    • Modify the flow so tasks read the date from the trigger (see the example flow after this list).
  3. Handling Data Storage:
    • Extract data from GitHub.
    • Upload the data to Google Cloud Storage (GCS) for BigQuery to use.
  4. Executing Backfills:
    • Use the schedule trigger's backfill option in Kestra to re-run the flow for previous months.
    • Example: Backfill for the entire year of 2019.
  5. Managing BigQuery Tables:
    • Delete existing tables in BigQuery to start with a fresh dataset.
    • Set backfill date range and execute backfill in Kestra.
    • Monitor execution and check for completion.
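
A minimal sketch of what such a scheduled flow can look like, assuming the DataTalksClub GitHub releases as the source and placeholder namespace, bucket, and file names (the real course flow includes additional steps, such as building the BigQuery tables and merging the data):

```yaml
id: gcp_taxi_scheduled
namespace: zoomcamp            # placeholder namespace

inputs:
  - id: taxi
    type: SELECT
    values: [green, yellow]
    defaults: green

tasks:
  # Download one month of trip data for the chosen dataset.
  # trigger.date is supplied by the Schedule trigger (and by backfill executions).
  - id: extract
    type: io.kestra.plugin.scripts.shell.Commands
    taskRunner:
      type: io.kestra.plugin.core.runner.Process
    outputFiles:
      - data.csv
    commands:
      - wget -qO- "https://github.com/DataTalksClub/nyc-tlc-data/releases/download/{{ inputs.taxi }}/{{ inputs.taxi }}_tripdata_{{ trigger.date | date('yyyy-MM') }}.csv.gz" | gunzip > data.csv

  # Upload the CSV to Google Cloud Storage so BigQuery can read it from there.
  - id: upload_to_gcs
    type: io.kestra.plugin.gcp.gcs.Upload
    from: "{{ outputs.extract.outputFiles['data.csv'] }}"
    to: "gs://my-zoomcamp-bucket/{{ inputs.taxi }}_tripdata_{{ trigger.date | date('yyyy-MM') }}.csv"   # placeholder bucket

triggers:
  # Green runs at 09:00 on the 1st of each month ...
  - id: green_schedule
    type: io.kestra.plugin.core.trigger.Schedule
    cron: "0 9 1 * *"
    inputs:
      taxi: green

  # ... and yellow follows an hour later at 10:00.
  - id: yellow_schedule
    type: io.kestra.plugin.core.trigger.Schedule
    cron: "0 10 1 * *"
    inputs:
      taxi: yellow
```

The backfill itself is not defined in YAML: it is started from the Kestra UI on the flow's schedule trigger, where you pick a start and end date (e.g., 2019-01-01 to 2019-12-31) and Kestra creates one execution per scheduled interval, with trigger.date set to the corresponding month.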

Practical Steps

  1. Trigger Configuration:
    • Set up triggers in Kestra to run at the specified times.
    • Choose trigger times so the flow runs only after the source data is available (e.g., 9:00 AM on the 1st of each month).
  2. Workflow Execution:
    • Add labels to track executions and to distinguish backfills from normal scheduled runs (see the sketch after this list).
    • Monitor the progress in the execution logs.
  3. Data Management:
    • Ensure the BigQuery dataset is empty before starting backfills.
    • Check the Google Cloud Storage bucket for uploaded CSV files.
    • Verify the data in BigQuery tables.
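
One way to attach such labels, sketched here with Kestra's core Labels task (the exact task and label keys used in the course flow may differ); these tasks would sit inside the flow's tasks list:

```yaml
  # Attach labels to the running execution so it is easy to filter in the UI
  # and to see which file and dataset a given run processed.
  - id: set_label
    type: io.kestra.plugin.core.execution.Labels
    labels:
      file: "{{ inputs.taxi }}_tripdata_{{ trigger.date | date('yyyy-MM') }}.csv"
      taxi: "{{ inputs.taxi }}"
```

When starting a backfill from the UI, Kestra also lets you attach an extra label (for example backfill: true) to those executions, which is how backfill runs are told apart from regular scheduled runs.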

Summary of Execution

  • Labels are used to mark backfill executions.
  • Data for all months is processed and stored in BigQuery.
  • Example: 12 CSV files, each around 600-700 MB, uploaded to GCS.
  • The BigQuery table contains about 18.9 GB of data; unique row IDs prevent duplicates when a month is re-processed (see the merge sketch below).
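
A rough sketch of the deduplication idea, assuming a BigQuery Query task and placeholder project, dataset, and table names; in the course flow the unique row ID is an MD5 hash of identifying columns computed in a staging step:

```yaml
  # (This task would sit inside the flow's tasks list.)
  - id: bq_merge
    type: io.kestra.plugin.gcp.bigquery.Query
    sql: |
      -- Only rows whose unique_row_id is not already in the final table are inserted,
      -- so re-running a month (e.g., during a backfill) does not create duplicates.
      MERGE INTO `my-project.zoomcamp.green_tripdata` T
      USING `my-project.zoomcamp.green_tripdata_staging` S
      ON T.unique_row_id = S.unique_row_id
      WHEN NOT MATCHED THEN
        INSERT ROW
```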

Conclusion

  • The video demonstrates how to manage schedules and backfills effectively using Kestra and BigQuery.
  • Automating workflows helps in efficient data processing and handling large datasets.
  • Future videos will cover running workflows with dbt and deploying to the cloud.

Additional Resources

  • Kestra Documentation: Kestra.io
  • BigQuery Documentation: Google Cloud BigQuery
  • DE Zoomcamp Resources: Data Engineering Zoomcamp

Notes

  • Ensure your system has sufficient storage and processing power when running large datasets locally.
  • Always verify the scheduled times and data availability to avoid execution failures.
  • Utilize labels and logs to monitor and debug workflows efficiently.