How does it work?

airflow-dbt-python’s main goal is to elevate dbt to first-class citizen status in Airflow. By this we mean that users of dbt can leverage as many Airflow features as possible, without breaking the assumptions that Airflow expects from any workflows it orchestrates. Perhaps more importantly, Airflow should enhance a dbt user’s experience, not simply emulate the way they would run dbt in the command line. This is what separates airflow-dbt-python from alternatives like airflow-dbt, which simply wrap dbt CLI commands in a BashOperator.

To achieve this goal, airflow-dbt-python provides Airflow operators, hooks, and other utilities. Hooks in particular come in two flavors:

  • A DbtHook that abstracts all interaction with dbt internals.

  • Subclasses of DbtRemoteHook that expose an interface to interact with dbt remote storages where project files are located (like AWS S3 buckets or git repositories).

The diagram below (graphviz source) summarizes how these pieces interact:

    digraph HowDoesItWork {
        graph [fontname="Hack", splines=ortho];
        node [fontname="Hack", shape=box];
        edge [fontname="Hack", labelfontsize=12.0, fontsize=12.0];
        rankdir = "TB";
        newrank = true;
        nodesep = 0.8;

        "Airflow DAG" [style=filled, fillcolor="#CBCBCB", color="#00C7D4"];
        XCom [style=filled, fillcolor="#CBCBCB", color="#00C7D4"];
        "Other tasks" [style=filled, fillcolor="#CBCBCB", color="#00C7D4"];
        "Airflow DAG" -> DbtOperator [label="orchestrates"];

        subgraph cluster_0 {
            color = "#00AD46";
            label = "airflow-dbt-python";
            labelloc = "b";
            DbtHook;
            DbtOperator -> DbtHook [label="run_dbt_task"];
            DbtRemoteHooks -> DbtHook [label="download"];
            DbtHook -> DbtRemoteHooks [label="upload", labelfloat=true];
        }

        "dbt-core" [style=filled, fillcolor="#CBCBCB", color="#FF7557"];
        DbtHook -> "dbt-core" [headlabel="executes", labeldistance=4.5];
        {rank=same; "dbt-core"; DbtHook;}

        split [shape=point, label=""];
        DbtOperator -> split [arrowhead=none];
        split -> "Other tasks" [label="return"];
        split -> XCom [label="push", labelfloat=true];
        XCom -> DbtHook [style=invis, arrowhead=none];
        {rank=same; split; DbtOperator;}

        "Remote storage" [style=filled, fillcolor="#CBCBCB", color="#FF7557"];
        DbtRemoteHooks -> "Remote storage" [headlabel="interacts", labeldistance=4.0];
        {rank=same; "Remote storage"; DbtRemoteHooks;}
    }

dbt as a library

A lot of the code in airflow-dbt-python is required to provide a wrapper for dbt, as dbt only provides a CLI interface. There are ongoing efforts to provide a dbt library, which would significantly simplify our codebase. At the time of writing, these efforts are not yet in a state we can rely on, but we keep an eye on them for the future.

Most of the code used to adapt dbt can be found in the utilities module, as some of our features require breaking assumptions dbt makes when initializing. For example, we need to set up dbt to access project files stored remotely, or to initialize all profile settings from an Airflow Connection.
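As an illustration of the second point, a dbt target can be assembled from an Airflow Connection along these lines. This is only a sketch: the connection id and the exact field mapping are assumptions for illustration, not the library’s actual adaptation code.

    from airflow.hooks.base import BaseHook

    # Fetch an Airflow Connection and map its fields onto the keys a
    # dbt "target" in profiles.yml would contain. The mapping below is
    # illustrative; the real mapping depends on the dbt adapter type.
    conn = BaseHook.get_connection("my_db_connection")  # hypothetical conn id
    target = {
        "type": conn.conn_type,  # e.g. "postgres"
        "host": conn.host,
        "user": conn.login,
        "password": conn.password,
        "port": conn.port,
        "dbname": conn.schema,
    }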

dbt operators

airflow-dbt-python provides one operator per dbt task: for example, DbtRunOperator can be used to execute a dbt run command, as if running dbt run ... in the CLI.
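For instance, a DAG could use DbtRunOperator as in the sketch below. The remote project_dir value is a hypothetical placeholder, and the parameters shown mirror dbt run flags; this is a minimal sketch, not a complete production DAG.

    from datetime import datetime

    from airflow import DAG
    from airflow_dbt_python.operators.dbt import DbtRunOperator

    with DAG(
        dag_id="example_dbt_run",
        start_date=datetime(2023, 1, 1),
        schedule=None,
    ) as dag:
        # Roughly equivalent to `dbt run --select my_model+ --target prod`.
        dbt_run = DbtRunOperator(
            task_id="dbt_run",
            project_dir="s3://my-bucket/dbt/project/",  # hypothetical remote
            select=["my_model+"],
            target="prod",
        )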

dbt hooks

airflow-dbt-python provides a DbtHook to abstract all interactions with dbt. The main method operators should call is DbtHook.run_dbt_task, which takes a dbt command as its first argument and any configuration parameters as keyword arguments (see the sketch after this list). This hook abstracts interactions with dbt, including:

  • Setting up a temporary directory for dbt execution.
      ◦ Potentially downloading files from a dbt remote into this directory.

  • Using Airflow connections to configure dbt connections (known as “targets” in profiles.yml).

  • Initializing a configuration for dbt with the parameters provided.
      ◦ Includes configuring dbt logging.
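A minimal sketch of calling the hook directly follows. The keyword arguments mirror dbt’s own CLI flags; the remote project_dir value and the shape of the return value are assumptions for illustration.

    from airflow_dbt_python.hooks.dbt import DbtHook

    hook = DbtHook()
    # The dbt command comes first; configuration parameters follow as
    # keyword arguments, mirroring dbt's CLI flags.
    result = hook.run_dbt_task(
        "run",
        project_dir="s3://my-bucket/dbt/project/",  # hypothetical remote
        select=["my_model"],
        target="prod",
    )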

Temporary directories ensure task independence

Airflow executes tasks independent of one another: even though downstream and upstream dependencies between tasks exist, the execution of an individual task happens entirely independently of any other task execution (see: Tasks Relationships).

In order to respect this constraint, airflow-dbt-python hooks run each dbt command in a temporary and isolated directory (sketched in code after this list):

  1. Before execution, all the relevant dbt files are downloaded from supported remotes.

  2. After execution, any resulting artifacts are uploaded back to supported remotes (if configured).
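In pseudocode, the pattern looks roughly like this. All names here (run_isolated, the execute callable, the remote object) are hypothetical placeholders used to illustrate the flow, not the library’s actual internals.

    import tempfile
    from pathlib import Path
    from typing import Callable

    def run_isolated(remote, execute: Callable, command: str, project_url: str, **config):
        """Sketch of the download -> execute -> upload pattern."""
        # Work inside a fresh temporary directory for every task run.
        with tempfile.TemporaryDirectory() as tmp_dir:
            project_dir = Path(tmp_dir)
            # Before execution: pull the dbt files from the remote.
            remote.download(project_url, project_dir)
            # Run the dbt command against the isolated local copy.
            result = execute(command, project_dir=project_dir, **config)
            # After execution: push resulting artifacts back (if configured).
            remote.upload(project_dir, project_url)
        return result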

This ensures dbt can work with any Airflow deployment, including most production deployments running remote executors, which do not guarantee that any files are shared between tasks, since each task may run on a completely different worker.

dbt remote hooks

dbt remote hooks implement a simple interface to communicate with dbt remotes. A dbt remote can be any external storage that contains a dbt project and, potentially, a profiles.yml file: for example, an AWS S3 bucket or a GitHub repository. See the reference for a list of currently supported remotes.

Implementing the DbtRemoteHook interface

Supporting a new remote to store dbt files requires implementing the DbtRemoteHook interface. There are only two methods in the interface: DbtRemoteHook.download and DbtRemoteHook.upload.
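A hypothetical remote backed by a local directory could look like the sketch below. The import path and the simplified download/upload signatures are assumptions made for illustration; consult the DbtRemoteHook source for the exact interface.

    import shutil
    from pathlib import Path

    from airflow_dbt_python.hooks.remote import DbtRemoteHook  # assumed path

    class DbtLocalDirRemoteHook(DbtRemoteHook):
        """Hypothetical remote that 'downloads' from a local directory."""

        def download(self, source, destination):
            # Copy the dbt project files from the remote location into
            # the temporary execution directory.
            shutil.copytree(source, Path(destination), dirs_exist_ok=True)

        def upload(self, source, destination):
            # Copy artifacts produced during execution back to the remote.
            shutil.copytree(Path(source), destination, dirs_exist_ok=True)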