This post has over 120k views and 1.2k claps on Medium. Here is the original story on Medium: Airflow Schedule Interval 101

The Airflow schedule interval can be a challenging concept to comprehend; even developers who have worked with Airflow for a while find it difficult to grasp. A question that comes up every once in a while on StackOverflow is “Why is my DAG not running as expected?”. This problem usually indicates a misunderstanding of the Airflow schedule interval. In this article, we will talk about how to set up the Airflow schedule interval, what results you should expect when scheduling your Airflow DAGs, and how to debug Airflow schedule interval issues with examples.
How Does Airflow Schedule DAGs?
First of all, Airflow is not a streaming solution; people usually use it as an ETL tool or a replacement for cron. Airflow has its own scheduler, and since it adopts the schedule interval syntax from cron, the smallest date and time interval in the Airflow scheduler world is the minute. Inside Airflow, the only thing that runs continuously is the scheduler itself.
However, as a non-streaming solution, Airflow won’t watch and trigger your DAGs all the time, to avoid hammering your system resources. Instead, it checks the DAGs at an interval controlled by a configurable setting called scheduler_heartbeat_sec, and it is suggested you provide a number larger than 60 seconds to avoid unexpected results in production. The reason is that Airflow still needs a backend database to keep track of all the progress in case of a crash. Setting fewer heartbeat seconds means the Airflow scheduler has to check more frequently whether it needs to trigger any new tasks, placing more pressure on the Airflow scheduler as well as its backend database.
Finally, at each heartbeat, the Airflow scheduler iterates through all the DAGs, calculates each DAG’s next schedule time, and compares it with the wall clock time to examine whether the DAG should be triggered.
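To make that loop concrete, below is a deliberately simplified sketch of a scheduler heartbeat in Python. This is not Airflow’s actual implementation: the dags dictionary, the last_run bookkeeping, and trigger_dag_run are illustrative stand-ins (real Airflow persists all of this state in its backend database), while croniter is the same library Airflow relies on for crontab parsing.

from datetime import datetime
from time import sleep

from croniter import croniter

# Illustrative stand-ins: two DAGs keyed by id, each with a crontab schedule
dags = {
    'daily_etl': '0 2 * * *',
    'weekend_report': '0 2 * * 4,5,6',
}
last_run = {dag_id: datetime(2020, 4, 1) for dag_id in dags}
scheduler_heartbeat_sec = 60  # the configurable heartbeat discussed above


def trigger_dag_run(dag_id, run_time):
    # Placeholder: Airflow would create a DAG run and persist it to the backend DB
    print(f'triggering {dag_id} at {run_time}')


while True:
    now = datetime.utcnow()
    for dag_id, schedule in dags.items():
        # Next schedule time strictly after the last run we triggered
        next_run = croniter(schedule, last_run[dag_id]).get_next(datetime)
        if next_run <= now:  # compare with the wall clock
            trigger_dag_run(dag_id, next_run)
            last_run[dag_id] = next_run
    sleep(scheduler_heartbeat_sec)  # wait for the next heartbeat instead of busy-looping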
Why do you need a start_date?
Every DAG has its schedule, and start_date is simply the date from which a DAG is considered by the Airflow scheduler. It also helps developers release a DAG before its production date. You could set up start_date more dynamically before Airflow 1.8; however, it is recommended that you set a fixed date, and more detail can be found under “Less forgiving scheduler on dynamic start_date”.
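As a quick illustration of that recommendation, compare a fixed start_date with a dynamic one; the DAG ids and owner below are hypothetical.

from datetime import datetime
from airflow import DAG

# Recommended: a fixed, static start_date that never changes between parses
fixed_dag = DAG(
    'fixed_start_dag',
    default_args={'owner': 'XYZ', 'start_date': datetime(2020, 4, 1)},
)

# Discouraged: a dynamic start_date is re-evaluated every time the file is parsed,
# so the scheduler can get confused about which intervals have already run
# dynamic_dag = DAG(
#     'dynamic_start_dag',
#     default_args={'owner': 'XYZ', 'start_date': datetime.now()},
# )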
Which timezone should we use?
Airflow infrastructure initially supported only UTC. Although you can now configure Airflow to run on your local time, most deployments still run under UTC. Setting up Airflow under UTC makes it easy for businesses operating across multiple time zones and makes your life easier during occasional events such as daylight saving time changes. The schedule interval that you set up follows the same time zone as your Airflow infrastructure.
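If you do need a local time zone, more recent Airflow versions (1.10 and later) accept a timezone-aware start_date built with pendulum. A minimal sketch, with a hypothetical DAG id; Airflow still stores everything as UTC internally.

import pendulum
from datetime import datetime
from airflow import DAG

# Pin the DAG to a local time zone instead of the UTC default
local_tz = pendulum.timezone('America/New_York')

dag = DAG(
    'local_time_dag',
    schedule_interval='0 2 * * *',
    default_args={'owner': 'XYZ', 'start_date': datetime(2020, 4, 1, tzinfo=local_tz)},
)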
How to set the Airflow schedule interval?
You are probably familiar with the syntax of defining a DAG: you usually set start_date under default_args and pass schedule_interval as an argument of the DAG class.
from airflow import DAG
from datetime import datetime

default_args = {
    'owner': 'XYZ',
    'start_date': datetime(2020, 4, 1),
}

dag = DAG('tutorial', schedule_interval='@daily', catchup=False, default_args=default_args)
What values to provide for schedule_interval?
The Airflow Scheduler section of the documentation provides more detail on what values you can provide. Essentially, you’d provide a crontab expression for schedule_interval. If you find yourself lost in crontab’s definitions, try crontab guru, and it will explain whatever you put there. Airflow also gives you some user-friendly names like @daily or @weekly. I find those names less clear and expressive than crontab. They are also limited to a few intervals, and the underlying implementation is still a crontab, so you might as well learn crontab and live with it. Moreover, if you only want to trigger your DAG manually, use schedule_interval=None.
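For reference, here is how a few preset names line up with their crontab equivalents, plus the manual-only option; the DAG ids below are hypothetical.

from datetime import datetime
from airflow import DAG

args = {'owner': 'XYZ', 'start_date': datetime(2020, 4, 1)}

daily_dag = DAG('daily_dag', schedule_interval='@daily', default_args=args)       # same as '0 0 * * *'
weekly_dag = DAG('weekly_dag', schedule_interval='@weekly', default_args=args)    # same as '0 0 * * 0'
hourly_dag = DAG('hourly_dag', schedule_interval='0 * * * *', default_args=args)  # explicit crontab
manual_dag = DAG('manual_dag', schedule_interval=None, default_args=args)         # manual triggers only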
What is the difference between execution_date and start_date?
For a scheduler, date and time are essential components. In Airflow, there are two dates you’d need to put extra effort into digesting: execution_date and start_date. Note that this start_date is not the same as the start_date you defined in the previous DAG.
- execution_date is the logical date and time of a DAG run: the start of the schedule interval it covers.
- start_date is the date and time when a DAG run was actually triggered, and this time usually refers to the wall clock.
A frequently asked question is, “Why is execution_date not the same as start_date?” To get an answer, let’s take a look at one DAG execution that uses the schedule interval 0 2 * * *; this helps us understand the Airflow schedule interval better. Please refer to the following code as an example.
from airflow.models import DAG
from datetime import datetime
from airflow.operators.bash_operator import BashOperator

args = {
    'owner': 'Airflow',
    'start_date': datetime(2020, 4, 1),
    'depends_on_past': True,
}

dag = DAG(
    dag_id='scheduler_interval_101',
    schedule_interval='0 2 * * *',
    default_args=args,
    tags=['example'],
)

hello_my_task = BashOperator(
    task_id='hello_task',
    bash_command='echo "hello_world"',
    dag=dag,
)
0 2 * * * means Airflow will start a new job at 2:00 a.m. every day. We can keep a DAG with this interval running for multiple days. If you click Browse → Task Instances, you’d see both execution_date and start_date.
I started this new DAG at 04-10 00:05:21 (UTC). The first thing that usually happens to any new Airflow DAG is a backfill, which is enabled by default. As you can see in the snapshot below, execution_date is perfectly incremented by day as expected, and the time is as anticipated as well. On the other hand, start_date is when the Airflow scheduler actually started the task.
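As an aside, if you don’t want that backfill for a DAG, you can switch it off with catchup=False, just like the first tutorial example did; a sketch with a hypothetical DAG id:

from airflow.models import DAG
from datetime import datetime

args = {'owner': 'Airflow', 'start_date': datetime(2020, 4, 1)}

dag_no_backfill = DAG(
    dag_id='scheduler_interval_101_no_backfill',
    schedule_interval='0 2 * * *',
    default_args=args,
    catchup=False,  # only schedule from the latest interval onward, no backfill
)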

After backfilling all the previous executions, you’d probably notice that 04-09 is not there, even though it is already 04-10 by wall clock. What went wrong here?
The answer is: NOTHING IS WRONG.

First, Airflow is built with an ETL mindset, usually batch processing that runs over a 24-hour window. Think about an ETL job within that 24-hour window: you’d trigger the job only after those 24 hours have finished. The same rule applies here, and we don’t see an execution_date of 04-09 because that 24-hour window has not been closed yet. From execution_date, we know the last successful run was for 04-08T02:00:00 (remember that execution_date is the start time of the 24-hour window), and its window ends at 04-09T02:00:00 (exclusive). So what would the 24-hour window for the 04-09 run be? It runs from 04-09T02:00:00 to 04-10T02:00:00, which has not been reached yet.

When does the Airflow scheduler run the 04-09 execution? It waits until 04-10 02:00:00 (wall clock). Once the 04-09 execution has been triggered, you’d see an execution_date of 04-09T02:00:00, and start_date would be something like 04-10T02:01:15 (this varies, as Airflow decides when to trigger the task; we’ll cover more in the next section).
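You can reproduce this arithmetic yourself with croniter, the crontab library Airflow relies on: the execution_date marks the start of a window, and the run is only triggered at the next schedule point, when that window closes. A small sketch:

from datetime import datetime
from croniter import croniter

schedule = '0 2 * * *'
execution_date = datetime(2020, 4, 9, 2, 0)  # start of the 24-hour window

# The run is triggered at the next schedule point after execution_date,
# i.e. when the window closes
trigger_time = croniter(schedule, execution_date).get_next(datetime)

print(execution_date)  # 2020-04-09 02:00:00
print(trigger_time)    # 2020-04-10 02:00:00 (wall clock when the run actually starts)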
Given the context above, you can easily see why execution_date is not the same as start_date. Understanding the difference between them is very helpful when you write code that depends on execution_date and uses a macro like {{ ds }}.
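For example, the {{ ds }} macro renders as the execution_date in YYYY-MM-DD form, so a templated task always processes the data window it logically belongs to. A sketch that attaches a task to the dag object from the earlier example:

from airflow.operators.bash_operator import BashOperator

# For the run triggered on 04-10, this echoes 2020-04-09, not 2020-04-10,
# because {{ ds }} is the execution_date: the start of the data window
print_window = BashOperator(
    task_id='print_window',
    bash_command='echo "processing data for {{ ds }}"',
    dag=dag,
)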
Another way to think about this: the execution_date is close to the previous start_date. Let’s use a more complex example: 0 2 * * 4,5,6, a crontab that means “run at 02:00 on Thursday, Friday, and Saturday”.
Below is the calendar by wall clock (start_date), and the red text marks the expected execution_date. With a schedule interval like this, you shouldn’t be shocked that Airflow triggers the 04-04 DAG execution on 04-09.
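A quick croniter check confirms the calendar: stepping backwards from just after the Thursday 04-09 trigger, the slot that fired is 04-09T02:00:00, and the slot before that, the execution_date, is Saturday 04-04T02:00:00.

from datetime import datetime
from croniter import croniter

schedule = '0 2 * * 4,5,6'  # at 02:00 on Thursday, Friday, and Saturday

# Wall clock just after the scheduler fires on Thursday 04-09
now = datetime(2020, 4, 9, 2, 1)

it = croniter(schedule, now)
trigger_slot = it.get_prev(datetime)    # 2020-04-09 02:00:00, the slot that just fired
execution_date = it.get_prev(datetime)  # 2020-04-04 02:00:00, the previous Saturday

print(trigger_slot, execution_date)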

Why is there a short delay in triggering the DAGs?
From the example above, we figured out that the dates differ, but the times are slightly different as well. For example, with the daily interval, execution_date is 04-09T02:00:00 and start_date is 04-10T02:01:15. Where does that 1.25-minute delay come from?
An analogy for this would be a meeting scenario. You probably won’t start a meeting at the exact time it states on your calendar. For example, you have a virtual meeting invitation every Monday at 10:00:00 a.m. (schedule_interval). On this Monday at 10:00:00 a.m. (execution_date), you receive a notification to join the meeting from your calendar reminder, then you click the meeting link and start your virtual meeting. By the time you’ve entered and the meeting starts, it is 10:01:15 a.m. (start_date).

You probably already noticed the small delay between execution_date and start_date. Ideally they should be the same, but in reality they are not. The question is why Airflow won’t trigger the DAG exactly on time and instead delays its actual run. As we discussed before, the Airflow scheduler won’t monitor the DAGs all the time: it waits for its next heartbeat to trigger new DAGs, and this process causes delays. Also, even when the scheduler is ready to trigger at the exact same time, you need to account for the code execution and database update time as well. All of the above causes a short delay in scheduling.
Final Thought
I hope this article demystifies how the Airflow schedule interval works. Airflow is a complicated system internally but straightforward to work with for users. Given its ETL mindset, it could take some time to understand how the Airflow scheduler handles time intervals. Once you understand the Airflow schedule interval better, creating a DAG with the desired interval should be an unobstructed process.
About Me
I hope my stories are helpful to you.
For data engineering posts, you can also subscribe to my new articles or become a referred Medium member, which also gets you full access to stories on Medium.