
Apache Airflow became a top-level project of the Apache Software Foundation in January 2019. It is one of the most popular projects among data engineers, and many companies list Airflow as a hiring requirement. Data projects iterate quickly, and some tools have been replaced rapidly, yet Airflow has stayed widespread for years. Is Apache Airflow due for a replacement? mage-ai is a new ETL tool that data engineers should check out as a potential substitute. I have taken a first look at mage-ai and will share my thoughts.
The Data Orchestration War
Data orchestration is highly competitive, primarily because old technology has yet to be fully retired while new open-source players keep arriving. Spotify’s Luigi vs. Apache Airflow is a never-ending debate in the data engineering world. Apache Hop and Apache NiFi tackle this problem with a web-based user interface for users who prefer to rely more on the UI for data pipeline development.
Prefect open-sourced its project under an Apache 2.0 license in 2019, after two years of development. Prefect is newer to the data engineering stack, but it learned from years of experience working with Airflow.
mage-ai is a unique project here. It doesn’t follow its predecessors in emphasizing data orchestration as a more scalable or reliable solution; such claims are much harder to verify with scale testing for a data orchestration project.
To get more users to move to mage-ai, it focuses on usability and astonishes you with its first impression: the UI. mage-ai integrates closely with data professionals’ daily development experience and tries to improve it. Compared with other data orchestration projects, it strikes a good balance between coding and drag-and-drop as a hybrid approach.
What is mage-ai?
mage-ai is built by Mage, a company founded in 2021 and based in Santa Clara, CA. According to Crunchbase, the company has already raised $6.3M across two seed rounds. The mage-ai project has attracted a lot of attention recently, with 2.5k stars on GitHub as of 1/10/2023. mage-ai markets itself as an alternative to Apache Airflow, so we need to review what Airflow lacks and what mage-ai brings as an improvement.
What is challenging in developing Airflow DAGs?
Long feedback loop
Airflow is Python-code oriented. While writing the code, engineers need instant feedback on how the DAGs look. To see the graph view, which mainly visualizes the dependencies in a DAG, your code needs to be in a folder where an Airflow scheduler can pick it up. The Airflow scheduler also takes time to parse and render your DAG before it shows up.
This might not bother you if the DAG dependencies can be conceptualized in your head. However, not every engineer can do so. Instant feedback becomes crucial when you want a sanity check before your code drifts further than expected.
The long feedback loop in the Airflow DAG development cycle makes debugging difficult. What does an engineer do when verifying the DAG takes multiple steps? Write more lines of code and test them all together. Once the code no longer fits on one screen, you might only vaguely remember what to validate and which dependencies to check.
Difficulty with local development
Compiling and debugging the code of an Airflow DAG is simple. All you need to do is run the following on a machine where Airflow is installed:
python my_fancy_dag.py
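Running that file executes the module top to bottom, so syntax and import errors surface right away. Many teams extend this check into a unit test with Airflow’s DagBag; the sketch below assumes your DAG files live in a dags/ folder:
# Minimal DAG integrity test (a sketch, assuming a dags/ folder layout)
from airflow.models import DagBag

def test_no_import_errors():
    # DagBag parses every file in the folder, just like the scheduler does
    dag_bag = DagBag(dag_folder="dags/", include_examples=False)
    assert not dag_bag.import_errors, dag_bag.import_errors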
You can install a local Airflow environment or build a fresh one from Docker nowadays. However, your local environment might differ from the production one, primarily when you use Airflow variables, pools, connections, and other production-dependent features.
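For example, a DAG that reads an Airflow variable works in production but fails on a fresh local install unless you replicate that variable locally; the variable name below is hypothetical:
# Sketch of production-dependent code (variable name is hypothetical)
from airflow.models import Variable

# Raises KeyError locally if "prod_data_bucket" exists only in the prod metadata DB
bucket = Variable.get("prod_data_bucket")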
Some teams set up dev, staging, and production environments for running Airflow. However, those environments are shared rather than isolated to your local machine, and there is still a chance of stepping on others’ feet.
I used to build a Docker image that injected as much production-related information as possible. But it’s still not a 100% copy, and it takes tremendous effort to develop and maintain that image.
What new features does mage-ai bring to the table?
Jupyter-Notebook-like development with DAG visualization

Although the web-based IDE lacks the fancy features you are used to in VS Code or PyCharm, it serves the purpose of getting development going in mage-ai. Since it’s a web-based IDE, you can work from different devices, and sharing becomes more straightforward.
My initial impression is that the UI layout feels like RStudio. It is divided into several areas; one of them is the DAG visualization, which gives the user instant feedback on the task relationships.
The pipeline, or DAG, is constructed from modular blocks; a block maps to a single file. For each block, you can perform the following:
- Execute with upstream blocks: this triggers all upstream blocks to get the data ready for the current block to run.
- Execute and run tests defined in the current block: this focuses on the current block and performs its tests (see the block sketch after this list).
- Set block as dynamic: this changes the block type into a dynamic block, which fits better when creating multiple downstream blocks at runtime.
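For context, each block is a plain Python file. Below is a minimal sketch of roughly the scaffold mage-ai generates for a transformer block, with a @transformer function and a @test assertion; the function bodies are illustrative:
# Sketch of a mage-ai transformer block file (bodies are illustrative)
if 'transformer' not in globals():
    from mage_ai.data_preparation.decorators import transformer
if 'test' not in globals():
    from mage_ai.data_preparation.decorators import test

@transformer
def fill_in_missing_values(df, *args, **kwargs):
    # Illustrative transformation: replace missing values with zero
    return df.fillna(0)

@test
def test_output(df) -> None:
    # Runs when you choose "Execute and run tests" on this block
    assert df is not None, 'The output is undefined'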

DAG Dependencies defined in UI instead of Code
The DAG graph view is shown directly in the UI. It doesn’t draw arrows for direction, so you might need to rely on the color coding if the DAG gets complex. With the default settings, blue is the data loader, purple is the transformer, and yellow is the exporter.

The mage-ai UI is more interactive than the Airflow UI. In Airflow, you cannot change the dependencies directly in the Airflow webserver. To modify the dependencies, you’d need to write and define them in code:
# Airflow DAG dependencies
task1 >> task2 >> [task3, task4]
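In a complete DAG file, that line sits alongside the operator definitions. Here is a minimal runnable sketch, assuming Airflow 2.3+ (where EmptyOperator is available); the DAG and task ids are illustrative:
# Minimal Airflow DAG showing the dependency line in context (a sketch)
from datetime import datetime
from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(dag_id="example_deps", start_date=datetime(2023, 1, 1), schedule_interval=None) as dag:
    task1 = EmptyOperator(task_id="task1")
    task2 = EmptyOperator(task_id="task2")
    task3 = EmptyOperator(task_id="task3")
    task4 = EmptyOperator(task_id="task4")

    task1 >> task2 >> [task3, task4]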
In mage-ai, this part can be done with drag and drop. For example, when I click on the transformer, it shows two circles indicating the input and output dependencies. I can then click the output circle at the bottom and drag it onto the exporter.

Where does mage-ai manage the dependencies? It is under ./pipelines/{your_awesome_pipeline_name}/metadata.yaml. In the YAML file, you will see something like the following:
- all_upstream_blocks_executed: true
  configuration:
    dynamic: false
  downstream_blocks:
  - export_titanic_clean
  executor_config: null
  executor_type: local_python
  language: python
  name: cool rain
  status: not_executed
  type: transformer
  upstream_blocks:
  - fill_in_missing_values
  uuid: cool_rain
mage-ai keeps track of the changes the user makes in the UI and automatically writes the dependency DAG into the YAML file.
Data Visualization is integrated into the UI
Another cool feature mage-ai integrates into the UI is the ability to visualize the dataset at each block. This is especially helpful for inspecting your input data and further validating the transformation.
For example, in the following, you can choose the group-by dimension, the type of aggregation you’d like to perform, and which chart you’d like to use for visualization.

Once the chart has been created, it will also be attached to the current block as one of its downstream_blocks:
downstream_blocks:
- cool_rain
- bar_chart_for_fill_in_missing_values_1673500169993
Final Thoughts
mage-ai has some unique features that have the potential to get more data engineers and scientists to adopt it. The project is still early, and many key features and documentation are evolving.
Unlike Airflow, which requires some in-depth knowledge of its core scheduler to set up properly, mage-ai is lightweight for users with less data infra experience who want a simple data orchestration pipeline up and running. Developing in mage-ai is also more interactive than working with Airflow, and more engagement with the data pipeline from the user’s perspective can improve the data product being built.
I hope my first impression of mage-ai is helpful to you. I plan to use mage-ai for some time and write more in-depth reviews and tutorials about it. Please let me know if there is anything you’d like to see next.
About Me
I hope my stories are helpful to you.
For data engineering posts, you can subscribe to my new articles or become a referred Medium member, which also gets you full access to stories on Medium.
More Articles

Airflow Schedule Interval 101
The Airflow schedule interval can be a challenging concept to comprehend; even developers who have worked with Airflow for a while find it difficult to grasp. A confusing question that arises every once in a while on StackOverflow is “Why is my DAG not running as expected?”. This problem usually indicates a misunderstanding of the Airflow schedule interval.

Bidding War on Housing Market? Let’s Use R For Exploratory Data Analysis
Exploratory Data Analysis (EDA) is a data science methodology used as an initial approach to gain insights by visualizing and summarizing data. We will use some exploratory data analysis techniques to find the reason behind the bidding war in the housing market.

Visualizing Data with ggridges: Techniques to Eliminate Density Plot Overlaps in ggplot2
When it comes to visualizing data with a histogram across multiple groups, things can get quite challenging. I recently came across a useful ggplot2 extension called ggridges that has been helpful for my exploratory data tasks.