How does it work?
airflow-dbt-python’s main goal is to elevate dbt to first-class citizen status in Airflow. By this we mean that users of dbt can leverage as many Airflow features as possible, without breaking the assumptions that Airflow expects from any workflows it orchestrates. Perhaps more importantly, Airflow should enhance a dbt user’s experience and not simply emulate the way they would run dbt in the command line. This is what separates airflow-dbt-python from alternatives like airflow-dbt, which simply wrap dbt CLI commands in a BashOperator.
To achieve this goal, airflow-dbt-python provides Airflow operators, hooks, and other utilities. Hooks in particular come in two flavors:
A DbtHook that abstracts all interaction with dbt internals.
Subclasses of DbtRemoteHook that expose an interface to interact with dbt remote storages where project files are located (like AWS S3 buckets or git repositories).
dbt as a library
A lot of the code in airflow-dbt-python is required to provide a wrapper for dbt, as dbt only offers a CLI interface. There are ongoing efforts to provide a dbt library, which would significantly simplify our codebase; at the time of writing, these efforts are not yet in a state we can rely on, but we keep an eye on them for the future.
Most of the code used to adapt dbt can be found in the utilities module, as some of our features require breaking a few assumptions dbt makes during initialization. For example, we need to set up dbt to access project files stored remotely, or to initialize all profile settings from an Airflow Connection.
dbt operators
airflow-dbt-python provides one operator per dbt task: for example, DbtRunOperator can be used to execute a dbt run command, as if running dbt run ... in the CLI.
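As an illustration, a minimal DAG using DbtRunOperator could look like the following sketch. The remote URLs, model selector, and target name are placeholders; consult the operator reference for the full set of accepted arguments.

from datetime import datetime

from airflow import DAG
from airflow_dbt_python.operators.dbt import DbtRunOperator

with DAG(
    dag_id="example_dbt_run",
    start_date=datetime(2023, 1, 1),
    schedule=None,  # Airflow 2.4+; use schedule_interval on older versions.
) as dag:
    # Runs the equivalent of: dbt run --select +my_model
    dbt_run = DbtRunOperator(
        task_id="dbt_run",
        # project_dir and profiles_dir may point to a dbt remote;
        # the S3 bucket shown here is a placeholder.
        project_dir="s3://my-bucket/dbt/project/",
        profiles_dir="s3://my-bucket/dbt/",
        select=["+my_model"],
        target="production",
    )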
dbt hooks
airflow-dbt-python provides a DbtHook to abstract all interactions with dbt. The main method operators should call is DbtHook.run_dbt_task, which takes any dbt command as its first argument and any configuration parameters as keyword arguments. This hook abstracts interactions with dbt, including:
Setting up a temporary directory for dbt execution.
Potentially downloading files from a dbt remote into this directory.
Using Airflow connections to configure dbt connections (“targets”, as these connections are known in profiles.yml).
Initializing a configuration for dbt with the parameters provided, including configuring dbt logging.
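A minimal usage sketch, based on the description above, follows; the keyword arguments shown are illustrative configuration parameters mirroring the flags of the chosen dbt command.

from airflow_dbt_python.hooks.dbt import DbtHook

# Run the equivalent of: dbt run --select my_model
hook = DbtHook()
hook.run_dbt_task(
    "run",
    project_dir="/path/to/project",
    profiles_dir="/path/to/profiles",
    select=["my_model"],
)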
Temporary directories ensure task independence
Airflow executes tasks independently of one another: even though downstream and upstream dependencies between tasks exist, the execution of an individual task happens entirely independently of any other task execution (see: Task Relationships).
In order to respect this constraint, airflow-dbt-python hooks run each dbt command in a temporary and isolated directory:
Before execution, all the relevant dbt files are downloaded from supported remotes.
After execution, any resulting artifacts are uploaded back to supported remotes (if configured).
This ensures dbt can work with any Airflow deployment, including most production deployments running Remote Executors that do not guarantee any files will be shared between tasks, since each task may run on a completely different worker.
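Conceptually, the execution flow resembles the following sketch. This is not the library’s actual internals: the remote_hook, remote_url, and execute_dbt names are hypothetical placeholders for the pieces described above.

from tempfile import TemporaryDirectory

def run_in_isolation(remote_hook, remote_url, execute_dbt, command, **kwargs):
    """Sketch of the download-execute-upload pattern described above."""
    # Each dbt command gets its own scratch directory, so no files are
    # shared between tasks, even when they run on different workers.
    with TemporaryDirectory() as tmp_dir:
        # Before execution: pull the relevant dbt files from the remote.
        remote_hook.download(remote_url, tmp_dir)
        # execute_dbt is a hypothetical callable standing in for the
        # actual dbt invocation.
        results = execute_dbt(command, project_dir=tmp_dir, **kwargs)
        # After execution: push resulting artifacts back, if configured.
        remote_hook.upload(tmp_dir, remote_url)
    return results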
dbt remote hooks
dbt remote hooks implement a simple interface to communicate with dbt remotes. A dbt remote can be any external storage that contains a dbt project, and potentially also a profiles.yml file: for example, an AWS S3 bucket or a GitHub repository. See the reference for a list of currently supported remotes.
Implementing the DbtRemoteHook interface
Supporting a new remote to store dbt files requires implementing the DbtRemoteHook interface. There are only two methods in the interface: DbtRemoteHook.download and DbtRemoteHook.upload.
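For illustration, a new remote could be supported with a subclass along these lines. The Google Cloud Storage backend, the import path, and the exact method signatures are assumptions made for this sketch.

from airflow_dbt_python.hooks.remote import DbtRemoteHook

class DbtGCSRemoteHook(DbtRemoteHook):
    """Hypothetical remote backed by Google Cloud Storage."""

    def download(self, source, destination):
        # Copy the dbt project from the gs:// remote into the local
        # directory where this task will execute.
        raise NotImplementedError("Sketch only: fetch files from GCS here.")

    def upload(self, source, destination):
        # Push local files (for example, dbt artifacts produced by the
        # run) back to the gs:// remote.
        raise NotImplementedError("Sketch only: send files to GCS here.")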