One of the trending open-source workflow management systems among developers, Apache Airflow is a platform to programmatically author, schedule and monitor workflows. Recently, the team at Airflow unveiled the new version of this platform, Apache Airflow 2.0. Last year, the Apache Software Foundation (ASF) announced Apache Airflow as a Top-Level Project (TLP).
The 2.0 release brings substantial changes over the previous version and amounts to a significant upgrade. Before adopting Airflow 2.0, users must meet a few prerequisites: those still on Python 2.7 need to migrate to Python 3.6+, and users on the latest Airflow 1.10 release can run the airflow upgrade_check command to see whether they are ready to move to the new version.
Before diving into the significant upgrades, let us take you through the basics of Airflow first.
Behind the Basics
Created by Airbnb, Apache Airflow is an open-source platform to programmatically author, schedule, and monitor workflows. The platform is a flexible, scalable workflow automation and scheduling system for authoring and managing Big Data processing pipelines that handle hundreds of petabytes of data.
It is a workflow engine that performs several tasks, such as scheduling and running jobs and data pipelines, managing the allocation of scarce resources, providing mechanisms for tracking the state of jobs and recovering from failure, and more.
There are four key components of Airflow, which are:
- Web server: The GUI, served under the hood by a Flask app, where you can track the status of your jobs and read logs from a remote file store
- Scheduler: The scheduler is a multithreaded Python process responsible for scheduling jobs. It uses the DAG object to decide what tasks need to be run, when and where.
- Executor: The executor is the mechanism by which tasks actually get run.
- Metadata database: The metadata database stores Airflow's state and powers how the other components interact; all the processes read from and write to it.
Some of the intuitive features of Airflow are mentioned below:
- One of the main advantages of using a workflow system like Airflow is that everything is code, which makes the workflows maintainable, versionable, testable, and collaborative.
- Airflow is versatile in nature and can be used across various domains, including growth analytics, data warehousing, engagement analytics, anomaly detection, email targeting, among others.
- Airflow has built-in support for scheduling workflows.
- Through smart scheduling, database and dependency management, error handling and logging, Airflow automates resource management, from single servers to large-scale clusters.
- Written in Python, the project is highly extensible and able to run tasks written in other languages, allowing integration with commonly used architectures and projects such as AWS S3, Docker, Apache Hadoop HDFS, Apache Hive, Kubernetes, MySQL, Postgres, among others.
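To make the workflows-as-code idea concrete, here is a minimal sketch of a DAG definition, assuming an Airflow 2.0 installation; the DAG ID, task IDs and bash commands are illustrative placeholders:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# A minimal two-task pipeline: "load" runs only after "extract" succeeds.
with DAG(
    dag_id="example_pipeline",        # placeholder name
    start_date=datetime(2020, 12, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extracting")
    load = BashOperator(task_id="load", bash_command="echo loading")

    extract >> load  # declare the dependency in code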
Coming to the major release, the developers announced that Airflow 2.0 is in the alpha testing stage and is scheduled to be generally available in December 2020. According to its developers, Airflow 2.0 includes hundreds of features and bug fixes, both large and small; most of the significant updates were influenced by feedback from Airflow's 2019 Community Survey.
Some of the significant updates are mentioned below:
A New Scheduler: Low-Latency + High-Availability
According to the developers, scheduler performance was the most-requested improvement in the Community Survey. With version 2.0, the team introduced a new, refactored scheduler. The most impactful Airflow 2.0 change in this area is support for running multiple schedulers concurrently in an active/active model. The new functionality brings horizontal scalability, lower task latency, zero recovery time after a scheduler failure, and easier maintenance.
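Adding an active scheduler is, by design, just a matter of starting another scheduler process against the same metadata database; a sketch, assuming a database that supports the required row-level locking (the Airflow docs call for PostgreSQL 9.6+ or MySQL 8+):

```shell
# Run this on each scheduler host. All hosts must point at the same
# metadata DB (the sql_alchemy_conn setting in airflow.cfg); no extra
# HA configuration flags are needed.
airflow scheduler
```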
Full REST API
Airflow 2.0 introduces a new, comprehensive REST API that sets a strong foundation for a new Airflow UI and CLI in the future. The new API includes authorisation capabilities and makes it easier for third parties to access Airflow programmatically.
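As a sketch of what third-party access looks like, the snippet below builds (without sending) an authenticated request to the stable API's list-DAGs endpoint; the base URL and basic-auth credentials are placeholders, and the accepted auth scheme depends on the deployment's configured backend:

```python
import base64
import urllib.request

def build_list_dags_request(base_url, username, password):
    """Build (but do not send) an authenticated GET request to the
    stable REST API's /api/v1/dags endpoint."""
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return urllib.request.Request(
        f"{base_url}/api/v1/dags",  # stable API root in Airflow 2.0
        headers={"Authorization": f"Basic {token}"},
        method="GET",
    )

# localhost:8080 is the webserver's default address; credentials are
# placeholders for whatever the deployment actually uses.
req = build_list_dags_request("http://localhost:8080", "admin", "admin")
print(req.full_url)  # http://localhost:8080/api/v1/dags
```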
Smart Sensors
Sensors are a special type of Airflow Operator whose purpose is to wait on a particular trigger. Version 2.0 introduces Smart Sensors, which can check the status of a batch of Sensor tasks at once and store sensor status information in Airflow's metadata DB, among other things.
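Smart Sensors are enabled through the airflow.cfg configuration file rather than through DAG code; a hedged sketch of the relevant section (the shard count and sensor class list here are illustrative values, not recommendations):

```ini
[smart_sensor]
# Opt in to consolidating eligible sensor tasks into smart sensor jobs
use_smart_sensor = True
# Number of smart sensor jobs that share the batched sensor work
shards = 5
# Sensor classes eligible for smart-sensor handling (illustrative)
sensors_enabled = NamedHivePartitionSensor
```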
TaskFlow API
Airflow 2.0 introduces the TaskFlow API and task decorator to simplify scheduling and running idempotent tasks. The TaskFlow API makes DAGs significantly easier to write by abstracting the task and dependency management layer from users. It automatically creates PythonOperator tasks from plain Python functions, handles variable passing between tasks, and supports custom XCom backends, among other features.
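A short sketch of the decorator style, again assuming an Airflow 2.0 installation; the DAG and function names are illustrative:

```python
from datetime import datetime

from airflow.decorators import dag, task

@dag(schedule_interval=None, start_date=datetime(2020, 12, 1), catchup=False)
def example_taskflow():
    @task
    def extract():
        # Each @task function becomes a PythonOperator task automatically
        return {"a": 1, "b": 2}

    @task
    def total(data: dict):
        return sum(data.values())

    # Passing the return value wires up both the dependency and the
    # XCom handoff; no explicit set_upstream/XCom push-pull is needed.
    total(extract())

example_dag = example_taskflow()
```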