Data processing within information systems generally follows three stages: extraction, transformation, and loading (ETL). Big Data solutions often require the transformation of raw data into information suitable for business analysis using ETL processes. Apache Airflow is a tool designed to manage these ETL processes, especially in the context of Cloud-Native applications. To effectively utilize this tool, one should understand what Airflow is and its underlying architectural principles.
Key Concepts in Apache Airflow
Data processes, or pipelines, in Airflow are expressed as Directed Acyclic Graphs (DAGs). A DAG, as its name implies, is a graph whose edges have a direction and which contains no cyclic dependencies: following the edges from any node never leads back to that node.
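To make the "no cycles" property concrete, here is a minimal sketch in plain Python. It does not use Airflow; it relies on the standard library's graphlib module, and the task names are illustrative placeholders. It shows that a valid DAG always yields a run order, while a graph with a cycle is rejected.

```python
from graphlib import CycleError, TopologicalSorter

# Map each task to the set of tasks it depends on (its upstream tasks).
# The task names are illustrative placeholders.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}


def execution_order(deps):
    """Return one valid run order, or raise CycleError if the graph has a cycle."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)

# Making "extract" depend on "load" closes a loop,
# so the graph is no longer a valid DAG.
cyclic = {**dependencies, "extract": {"load"}}
try:
    execution_order(cyclic)
    has_cycle = False
except CycleError:
    has_cycle = True
```

Because every task appears after all of its upstream tasks, a scheduler can always decide what to run next. With a cycle, no such order exists.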
Tasks act as DAG nodes and represent operations on data, such as data loading, aggregation, indexing, deduplication, and other ETL processes. In code, these tasks can be Python functions or Bash scripts. Operators typically handle Task execution. While Tasks describe what actions to perform on the data, Operators determine how to carry out those actions. There's a special set of Operators known as Sensors, which wait for specific events such as a point in time, the arrival of a new file, or the completion of another DAG.
Airflow provides numerous built-in operators. Additionally, numerous specialized operators are available through community-supported provider packages. Custom operators can also be added by extending the BaseOperator class. When repetitive code based on these operators emerges in a project, it's often beneficial to transform it into a custom operator.
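The idea of folding repetitive logic into a custom operator can be sketched in plain Python. In Airflow itself, a custom operator subclasses BaseOperator and overrides its execute(context) method. The BaseOp class below is only a hypothetical stand-in with the same shape, so the example runs without Airflow installed. The DeduplicateOp name and the "rows" context key are also assumptions made for illustration.

```python
from abc import ABC, abstractmethod


class BaseOp(ABC):
    """Hypothetical stand-in for a base operator class; not Airflow's BaseOperator."""

    def __init__(self, task_id):
        self.task_id = task_id

    @abstractmethod
    def execute(self, context):
        """Carry out the work for one task instance."""


class DeduplicateOp(BaseOp):
    """Reusable operator that drops duplicate rows, keeping first occurrences."""

    def __init__(self, task_id, key):
        super().__init__(task_id)
        self.key = key  # name of the field that identifies a row

    def execute(self, context):
        seen = set()
        unique = []
        for row in context["rows"]:
            if row[self.key] not in seen:
                seen.add(row[self.key])
                unique.append(row)
        return unique


dedupe = DeduplicateOp(task_id="dedupe_users", key="id")
result = dedupe.execute({"rows": [{"id": 1}, {"id": 2}, {"id": 1}]})
```

The "how" (the deduplication loop) lives in the operator class once, and each task that needs it only supplies parameters such as task_id and key.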
Since Airflow 2.0, tasks can also be written with the TaskFlow API, which lets tasks be chained together so that the output of one task is passed to subsequent tasks and operators.
To better understand the distinctions between DAG, Task, and Operator, consider this example: We have a database where we're monitoring for specific data entries in a table. When such data is identified, it's first aggregated and stored in a repository, followed by the sending of an email alert.
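According to the example, the DAG includes three nodes: checking the table for the required data, aggregating the data and storing it in the repository, and sending the email alert.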
Each component corresponds to a Task, with an Operator overseeing its execution.
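The example pipeline (detect the data, aggregate and store it, send an alert) can be modeled in a few lines of plain Python. This is a sketch under stated assumptions, not Airflow code: TABLE and OUTBOX are hypothetical in-memory stand-ins for the database table and the mail system, and the function names are invented for illustration. Each function plays the role of a Task, and the small runner plays the role of the machinery that executes tasks in dependency order.

```python
from graphlib import TopologicalSorter

# Hypothetical in-memory stand-ins for the monitored table and the mail system.
TABLE = [
    {"region": "eu", "amount": 10},
    {"region": "eu", "amount": 5},
    {"region": "us", "amount": 7},
]
OUTBOX = []


def wait_for_rows(_results):
    """Sensor-like task: succeed only when the table has matching rows."""
    rows = [r for r in TABLE if r["amount"] > 0]
    if not rows:
        raise RuntimeError("no matching rows yet")
    return rows


def aggregate(results):
    """Sum amounts per region from the rows found by the sensor."""
    totals = {}
    for row in results["wait_for_rows"]:
        totals[row["region"]] = totals.get(row["region"], 0) + row["amount"]
    return totals


def send_alert(results):
    """Record an alert message instead of sending a real email."""
    OUTBOX.append(f"Aggregated totals: {results['aggregate']}")
    return len(OUTBOX)


TASKS = {"wait_for_rows": wait_for_rows, "aggregate": aggregate, "send_alert": send_alert}
UPSTREAM = {
    "wait_for_rows": set(),
    "aggregate": {"wait_for_rows"},
    "send_alert": {"aggregate"},
}


def run_pipeline():
    """Run every task after its upstream tasks, passing earlier outputs along."""
    results = {}
    for name in TopologicalSorter(UPSTREAM).static_order():
        results[name] = TASKS[name](results)
    return results


results = run_pipeline()
```

In a real deployment the first node would be a Sensor, and the other two would be handled by operators suited to the repository and the mail service in use.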
Another key principle in Airflow is that it stores information about every scheduled launch of a DAG. Each launch is identified by a timestamp called "execution_date" (renamed "logical_date" in Airflow 2.2 and later). The DAG instance created for that timestamp is called a "DAG Run", and the task executions that belong to a DAG Run are called "Task Instances".
The concept of "execution_date" is essential for ensuring idempotence: the result of executing or re-executing a task for a given past date does not depend on when the task actually runs. This allows for consistent replication of prior results. Moreover, it's possible to concurrently run tasks of the same DAG for various timestamps (multiple DAG Runs).
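A minimal sketch of what idempotence with respect to the execution date means in practice. It uses plain Python; EVENTS and DAILY_TOTALS are hypothetical stand-ins for a source table and an output store. The task reads only the data that belongs to its date and overwrites that date's output slot, so a re-run produces the same state instead of a duplicate.

```python
from datetime import date

# Hypothetical source data and output store, keyed by the run's date.
EVENTS = [
    {"day": date(2024, 1, 1), "amount": 10},
    {"day": date(2024, 1, 1), "amount": 5},
    {"day": date(2024, 1, 2), "amount": 7},
]
DAILY_TOTALS = {}


def aggregate_day(execution_date):
    """Compute the total for one date and overwrite that date's slot."""
    total = sum(e["amount"] for e in EVENTS if e["day"] == execution_date)
    DAILY_TOTALS[execution_date] = total  # overwrite, never append
    return total


first = aggregate_day(date(2024, 1, 1))
again = aggregate_day(date(2024, 1, 1))  # re-run for the same date
other = aggregate_day(date(2024, 1, 2))  # independent run for another date
```

A task that instead used the current wall-clock time to select its input, or appended to its output, would give different results on each re-run and could not be safely backfilled.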
The Airflow architecture includes the following components:
- Web Server is responsible for the user interface.
- Metadata database is a dedicated metadata repository built on the SQLAlchemy library, used for storing global variables, Task Instance and DAG Run statuses, and so on.
- Scheduler is a service responsible for task planning. By monitoring all created Tasks and DAGs, the Scheduler initiates Task Instances.
- Worker is a separate process that executes tasks. Its placement is determined by the selected Executor type and can be either local or remote.
- Executor is the mechanism responsible for launching task instances. It operates in tandem with the Scheduler within a single process.
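How these components divide the work can be illustrated with a toy model in plain Python. This is not how Airflow is implemented. The metadata dictionary stands in for the metadata database, a thread pool stands in for the Executor and its Workers, and all names are hypothetical. Task dependencies are left out to keep the sketch short.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the metadata database: one state per task instance.
metadata = {}


def run_task(task_id, run_id):
    """Worker side: perform the task and report its final state."""
    metadata[(task_id, run_id)] = "running"
    # ... real work would happen here ...
    return "success"


def schedule(task_ids, run_id, executor):
    """Scheduler side: create task instances and hand them to the executor."""
    futures = {}
    for task_id in task_ids:
        metadata[(task_id, run_id)] = "queued"
        futures[(task_id, run_id)] = executor.submit(run_task, task_id, run_id)
    # Record the final state of each task instance once its worker finishes.
    for key, future in futures.items():
        metadata[key] = future.result()


with ThreadPoolExecutor(max_workers=2) as pool:
    schedule(["load_a", "load_b"], "run_2024_01_01", pool)
```

Swapping the thread pool for a pool of remote machines changes where the workers run but not the scheduler logic, which is the role that the choice of Executor type plays.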
The interaction of Airflow components can be described by the following simplified scheme. Additional components may be used depending on the type of Executor selected.
Advantages and Disadvantages of Airflow
The main advantages are:
- Open Source platform with detailed documentation.
- Python-based. Python is considered a relatively simple language, and defining pipelines in Python code eliminates the need for JSON or XML configuration files.
- Extensive tooling: a CLI, a REST API, and a user-friendly web interface built on the Flask Python framework.
- Integration with multiple data sources and services, including well-known databases and S3 object storage.
- Highly customizable.
- Allows an unlimited number of DAGs.
- Features monitoring, alerting, role-based access, and testing.
There are also some disadvantages to be aware of before adopting Airflow:
- Ensuring idempotence when designing tasks is imperative.
- A deep understanding of how "execution_date" is processed is required.
- DAGs cannot be designed graphically; they have to be written in code.
Who Should Consider Using Airflow?
Airflow is certainly not the only ETL tool available. There are many other paid and open-source options. For basic needs, the standard Cron scheduler suffices. However, Airflow stands out when:
- The requirements go beyond Cron's scheduling capabilities.
- The team is well-versed in Python.
- Projects lean towards batch rather than streaming data.
- Task dependencies can be visualized as a DAG.
The product is gaining popularity and continues to evolve. You can count on consistent support when required, bolstered by a large technical community.