Apache Airflow
Started by Maxime Beauchemin at Airbnb in 2014, Apache Airflow is an open-source workflow management platform. Apache Airflow, or simply Airflow, is used to author, schedule and monitor workflows. Airflow was officially announced and brought under the Airbnb GitHub organisation in 2015.
Defining workflows in code makes them more maintainable, testable and collaborative. For example, Airflow pipelines are defined in Python to enable dynamic pipeline generation, which also lets developers use standard Python features, such as loops and conditionals, to build pipelines while maintaining flexibility.
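The idea behind dynamic pipeline generation can be sketched in plain Python. This is not Airflow's API, just an illustration of how ordinary loops can build a task graph; the task and table names are hypothetical:

```python
# Plain-Python sketch (not Airflow's API): a code-defined workflow
# lets an ordinary loop generate tasks and their dependencies.
def build_pipeline(tables):
    tasks = {}  # task name -> list of upstream task names
    for table in tables:                      # dynamic generation via a loop
        tasks[f"extract_{table}"] = []                    # no upstream deps
        tasks[f"load_{table}"] = [f"extract_{table}"]     # load after extract
    tasks["report"] = [f"load_{t}" for t in tables]       # fan-in at the end
    return tasks

pipeline = build_pipeline(["users", "orders"])
```

Adding a table to the input list adds its extract and load tasks automatically, which is the kind of flexibility a code-defined workflow provides over a static configuration file.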
Luigi
Luigi is a Python package used to build Hadoop jobs, dump data to or from databases, and run ML algorithms. It handles much of the plumbing associated with long-running batch processes, including dependency resolution, workflow management, visualisation and command-line integration, among other things.
Luigi is used to stitch tasks together, whether a Hadoop job in Java, a Spark job in Scala or Python, or a Hive query. Additionally, it comes with a toolbox of common task templates. Luigi is used internally at Spotify and Deloitte. Learn more about Luigi’s features here.
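Luigi's core model, in which each task declares what it requires and what it outputs, can be sketched in plain Python. The class and method names mirror Luigi's Task interface, but this is a conceptual sketch, not runnable Luigi code; the file names are placeholders:

```python
# Plain-Python sketch of Luigi's task model (not Luigi itself):
# each task declares requires() and output(), and a runner walks
# dependencies so upstream tasks run before downstream ones.
class Task:
    def requires(self):
        return []                    # upstream tasks, none by default

    def output(self):
        raise NotImplementedError    # the target this task produces

    def run(self, done):
        done.add(self.output())      # stand-in for real work

class FetchData(Task):
    def output(self):
        return "data.csv"

class MakeReport(Task):
    def requires(self):
        return [FetchData()]

    def output(self):
        return "report.txt"

def build(task, done):
    """Run a task after its dependencies, Luigi-style."""
    for dep in task.requires():
        if dep.output() not in done:
            build(dep, done)
    task.run(done)

completed = set()
build(MakeReport(), completed)
```

As in Luigi, a task whose output already exists is skipped, which is what makes re-running a half-finished workflow cheap.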
Kedro
Kedro is an open-source Python framework for creating easy-to-maintain, reproducible, modular data science code. According to its website:
‘Kedro borrows concepts from software engineering best-practice and applies them to machine-learning code.’
Kedro offers the following features:
- Easy-to-use Cookiecutter Data Science project templates
- Data connectors to save and load data across file formats and systems
- Pipeline abstraction
- Test your code with pytest, document it with Sphinx, and keep it consistent with black, flake8 and isort
- Support for deployment on Kubeflow, AWS Batch, Databricks, Prefect and Argo.
Know how to get started with Kedro here.
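Kedro's pipeline abstraction, in which a pipeline is a list of nodes and each node is a function with named inputs and outputs resolved from a data catalog, can be sketched in plain Python. This mirrors the concept only, not Kedro's actual API, and the runner below naively assumes the nodes are already in dependency order:

```python
# Plain-Python sketch of a pipeline abstraction in the spirit of
# Kedro's node/catalog model (not Kedro's API).
def node(func, inputs, outputs):
    """A node is just a function plus named inputs and one output."""
    return {"func": func, "inputs": inputs, "outputs": outputs}

def run_pipeline(nodes, catalog):
    # Naive runner: assumes nodes are listed in dependency order.
    for n in nodes:
        args = [catalog[name] for name in n["inputs"]]
        catalog[n["outputs"]] = n["func"](*args)
    return catalog

pipeline = [
    node(lambda raw: [x * 2 for x in raw], ["raw"], "doubled"),
    node(sum, ["doubled"], "total"),
]
result = run_pipeline(pipeline, {"raw": [1, 2, 3]})
```

Because nodes name their data rather than call each other directly, the same pipeline can be rerun against a different catalog, which is what makes this style of abstraction reproducible.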
Pinball
Built by Pinterest, Pinball is an open-source, scalable workflow manager, although the project is no longer actively maintained by Pinterest. Its component-based design is easy to grasp, and components can be upgraded without aborting running workflows. The four critical components of Pinball are:
- Master: the front end to a state repository, supporting atomic job-token updates
- UI: a service that reads from the same storage layer the Master uses
- Scheduler: responsible for running workflows on schedule
- Worker: a client of the Master
Pinball runs on Python 2.
RPA
Robotic Process Automation, or RPA, helps businesses automate monotonous, repetitive tasks, reducing manual effort. RPA tools work in Windows and Linux environments and use diagrams based on BPMN (Business Process Model and Notation). Usually, BPMN-based diagrams are executed by a workflow engine.
AWS Step Functions
Amazon Web Services’ Step Functions is a fully managed, serverless, low-code visual workflow service. AWS Step Functions is used to prepare data for machine learning, build serverless applications, automate ETL processes and orchestrate microservices.
AWS Step Functions allows one to compose AWS resources, including Lambda, Fargate, SNS, SQS, SageMaker and EMR, into business workflows, data pipelines and applications. Additionally, it offers two types of workflows: Standard, for long-running workloads, and Express, for high-volume event-processing workloads; users and businesses can opt for either depending on their use case.
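Step Functions workflows are defined in the Amazon States Language, a JSON format. The sketch below writes a minimal two-state definition as a Python dict for readability; the Lambda ARNs are placeholders, not real resources, and the Standard/Express choice is made when the state machine is created, not inside the definition:

```python
# Minimal Amazon States Language definition as a Python dict.
# The Resource ARNs below are illustrative placeholders.
state_machine = {
    "Comment": "Prepare data, then train a model",
    "StartAt": "PrepareData",
    "States": {
        "PrepareData": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:prepare",
            "Next": "Train",            # transition to the next state
        },
        "Train": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:train",
            "End": True,                # terminal state of the workflow
        },
    },
}
```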
Read about the pricing details for AWS Step Functions here.
StackStorm
StackStorm is targeted at developer teams who want to automate DevOps processes; it is used by Cisco, Netflix and Pearson. Its features include:
- Sensors: Python plugins for inbound and outbound integration
- Triggers for external events
- Actions for outbound integrations
- Rules to map triggers to actions or workflows
- Packs for content deployment
- Audits for executions, manual and automated
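How rules tie these pieces together can be sketched as follows. The dict below follows the general shape of a StackStorm rule definition (trigger, criteria, action), written as a Python dict rather than StackStorm's YAML; the pack name, webhook and command are illustrative placeholders, and exact field names should be checked against the StackStorm documentation:

```python
# Sketch of a StackStorm-style rule: when a trigger fires and the
# criteria match, the referenced action runs. Field values here are
# illustrative placeholders, not a tested StackStorm configuration.
rule = {
    "name": "restart_on_failure",
    "pack": "examples",                          # hypothetical pack
    "trigger": {"type": "core.st2.webhook",
                "parameters": {"url": "deploys"}},
    "criteria": {
        # only fire when the webhook payload reports a failure
        "trigger.body.status": {"type": "equals", "pattern": "failed"}
    },
    "action": {"ref": "core.local",
               "parameters": {"cmd": "echo restarting"}},
}
```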
Learn more about the features of StackStorm here.
All data-driven companies depend on workflow management systems, and selecting one that fits your business needs can be challenging and often overwhelming. Businesses should opt for a system that suits their size and use case and is affordable.