Is Apache Airflow Due for Replacement? The First Impression Of mage-ai

Photo by Enis Yavuz on Unsplash

Apache Airflow became a top-level project of the Apache Software Foundation in January 2019. It is one of the most popular tools among data engineers, and many companies list Airflow as a hiring requirement. Data projects iterate quickly, and some tools get replaced rapidly. Airflow has been widespread for years, so is it due for a replacement? mage-ai is a new ETL tool that data engineers may want to check out as a substitute. I have taken a first look at mage-ai and will share my thoughts.

The Data Orchestration War

Data orchestration is highly competitive, partly because old technology has yet to be fully retired and partly because new open-source players keep arriving. Spotify’s Luigi vs. Apache Airflow is a perennial debate in the data engineering world. Apache Hop and Apache NiFi offer a web-based user interface to address this problem for users who prefer to rely more on a UI for data pipeline development.

Prefect open-sourced its project under an Apache 2.0 license in 2019, after two years of development. Prefect is newer to the data engineering stack, but it draws on years of experience working with Airflow.

mage-ai is a unique project here. It doesn’t follow its predecessors in emphasizing data orchestration as a more scalable or reliable solution; such claims are hard to verify, since performing scale testing on a data orchestration project is difficult.

To get more users to move to mage-ai, it focuses on usability and astonishes you at first impression with its UI. mage-ai integrates closely with data professionals’ daily development experience and tries to improve it. Compared with other data orchestration projects, it strikes a good balance between coding and drag and drop as a hybrid approach.

What is mage-ai?

mage-ai is built by Mage, a company founded in 2021 and based in Santa Clara, CA. According to Crunchbase, the company has already raised $6.3M across two seed funding rounds. The mage-ai project has attracted a lot of attention recently, with 2.5k stars on GitHub as of 1/10/2023. mage-ai markets itself as an alternative to Apache Airflow, so we need to review what Airflow lacks and what mage-ai brings as an improvement.

What is challenging in developing Airflow DAGs?

Long feedback loop

Airflow is Python-code oriented. While writing the code, engineers need instant feedback on how the DAGs look. To see the graph view, which mainly visualizes the dependencies within a DAG, your code needs to be in a folder where an Airflow scheduler can pick it up. The scheduler also takes time to parse and render your DAG before it shows up.

This might not bother you if the DAG dependencies can be conceptualized in your head, but not every engineer can do so. Instant feedback becomes crucial when you want a sanity check before your code drifts further than expected.

The long feedback loop in the Airflow DAG development cycle makes debugging difficult. What does an engineer do when the DAG takes multiple steps to verify? Write more lines of code and test them all together. Once the code no longer fits on one screen, you might only vaguely remember what to validate and which dependencies to check.

Difficult local development

Checking an Airflow DAG file for syntax errors is simple. All you need to do is run the following where Airflow is installed:

```shell
python my_fancy_dag.py
```

Nowadays, you can install a local Airflow environment or build one fresh from Docker. However, your local environment might differ from production, primarily when you use Airflow Variables, Pools, Connections, and other production-dependent features.
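One common mitigation is to give production-dependent lookups safe local defaults. The sketch below is illustrative, not an official pattern: the `s3_bucket` variable name is made up, and the stub class exists solely so the snippet runs without Airflow installed.

```python
try:
    from airflow.models import Variable
except ImportError:
    # Local stand-in so this sketch runs without Airflow installed.
    class Variable:
        _store = {}

        @classmethod
        def get(cls, key, default_var=None):
            return cls._store.get(key, default_var)

# Resolve production-dependent values with local defaults, so the same
# DAG file parses both locally and in production.
bucket = Variable.get("s3_bucket", default_var="local-test-bucket")
```

Airflow’s `Variable.get` accepts a `default_var`, so the DAG file still parses locally even when the variable only exists in the production metadata database.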

Some teams set up dev, staging, and production environments for running Airflow. However, those environments are not isolated to your local machine, so there is still a chance of stepping on others’ feet.

I used to build a Docker image that injected as much production-related information as possible. But it’s still not an exact copy, and it takes tremendous effort to develop and maintain that image.

What new features does mage-ai bring to the table?

Jupyter-Notebook-like development with DAG visualization

mage-ai edit mode | Image by author

Although the web-based IDE lacks the fancy features you’re used to in VS Code or PyCharm, it serves the purpose of getting development going in mage-ai. Since it’s web-based, you can work from different devices, and sharing becomes more straightforward.

My initial impression is that the UI layout feels like RStudio: the screen is divided into several areas. One of them is the DAG visualization, which gives the user instant feedback on task relationships.

The pipeline, or DAG, is constructed from modular blocks; a block maps to a single file. For each block, you can perform the following:

  • Execute with upstream blocks: triggers all upstream blocks to get the data ready for the current block to run.
  • Execute and run tests defined in the current block: focuses on the current block and performs its tests.
  • Set block as dynamic: changes the block type to a dynamic block, which is better suited to creating multiple downstream blocks at runtime.
Block Execution Options | Image by author

DAG dependencies defined in the UI instead of code

The DAG graph view is shown directly in the UI. It doesn’t draw arrows for direction, so you might need to rely on color coding if the DAG gets complex. By default, blue is the data loader, purple the transformer, and yellow the exporter.

mage-ai DAG view | Image by author

The mage-ai UI is more interactive than the Airflow UI. In Airflow, you cannot change the dependencies directly in the webserver; to modify them, you need to write and define them in code:

```python
# Airflow DAG dependencies
Task1 >> Task2 >> [Task3, Task4]
```

In mage-ai, this part is drag and drop. For example, when I click on a transformer block, it shows two circles indicating the input and output dependencies. I can click the output circle at the bottom and drag it onto the exporter.
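As an aside, Airflow’s `>>` syntax works because its operators overload Python’s right-shift operator. A toy stand-in (purely illustrative, not Airflow’s actual implementation) shows the mechanics:

```python
class Task:
    """Toy stand-in for an Airflow operator, recording dependencies via >>."""

    def __init__(self, name):
        self.name = name
        self.downstream = []

    def __rshift__(self, other):
        # Support both `t1 >> t2` and `t1 >> [t2, t3]`, like Airflow does.
        targets = other if isinstance(other, list) else [other]
        self.downstream.extend(targets)
        return other  # returning `other` makes chains like t1 >> t2 >> t3 work


task1, task2, task3, task4 = (Task(f"Task{i}") for i in range(1, 5))
task1 >> task2 >> [task3, task4]
```

After this runs, `task1` records `task2` as downstream, and `task2` records both `task3` and `task4`.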

mage-ai drag & drop | Image by author

Where does mage-ai manage the dependencies? Under ./pipelines/{your_awesome_pipeline_name}/metadata.yaml. In that YAML file, you will see something like the following:

```yaml
- all_upstream_blocks_executed: true
  configuration:
    dynamic: false
  downstream_blocks:
  - export_titanic_clean
  executor_config: null
  executor_type: local_python
  language: python
  name: cool rain
  status: not_executed
  type: transformer
  upstream_blocks:
  - fill_in_missing_values
  uuid: cool_rain
```

mage-ai keeps track of the changes the user makes in the UI and automatically writes the dependency DAG into this YAML file.
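mage-ai resolves these dependencies itself at run time, but the YAML already contains everything needed to derive an execution order. Here is a standalone sketch (not mage-ai code) that mirrors the example above; the first block’s upstream list is assumed empty for illustration:

```python
from collections import deque

# Abridged blocks mirroring the metadata.yaml example above.
blocks = [
    {"uuid": "fill_in_missing_values", "upstream_blocks": []},
    {"uuid": "cool_rain", "upstream_blocks": ["fill_in_missing_values"]},
    {"uuid": "export_titanic_clean", "upstream_blocks": ["cool_rain"]},
]


def execution_order(blocks):
    """Topologically sort block uuids by upstream dependencies (Kahn's algorithm)."""
    pending = {b["uuid"]: set(b["upstream_blocks"]) for b in blocks}
    ready = deque(uuid for uuid, deps in pending.items() if not deps)
    order = []
    while ready:
        uuid = ready.popleft()
        order.append(uuid)
        for other, deps in pending.items():
            deps.discard(uuid)
            if not deps and other not in order and other not in ready:
                ready.append(other)
    return order


print(execution_order(blocks))
# → ['fill_in_missing_values', 'cool_rain', 'export_titanic_clean']
```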

Data Visualization is integrated into the UI

Another cool feature mage-ai integrates into the UI is the ability to visualize the dataset at each block. This is especially helpful for inspecting your input data and validating transformations.

For example, in the following, you can choose the group-by dimension, the type of aggregation you’d like to perform, and the chart you’d like to visualize.

Data Visualization with transformation | Image by author

Once the chart has been created, it is also attached to the current block as one of its downstream_blocks:

```yaml
downstream_blocks:
  - cool_rain
  - bar_chart_for_fill_in_missing_values_1673500169993
```

Final Thoughts

mage-ai has some unique features with the potential to win over more data engineers and scientists. The project is still early, and many key features and its documentation are evolving.

Unlike Airflow, which requires in-depth knowledge of its core scheduler to set up properly, mage-ai is lightweight enough for users with little data-infra experience who want a simple data orchestration pipeline running. Developing in mage-ai is also more interactive than working with Airflow, and more engagement with the data pipeline can improve the data products users build.

I hope my first impression of mage-ai is helpful to you. I plan to use mage-ai for some time and write more in-depth reviews and tutorials about it. Please let me know if there is anything you’d like to see next. 

About Me

I hope my stories are helpful to you. 

For data engineering posts, you can subscribe to my new articles or become a referred Medium member, which also gets you full access to stories on Medium.

In case of questions or comments, do not hesitate to write in the comments of this story or reach me directly through LinkedIn or Twitter.
