Airflow Schedule Interval 101

The Airflow schedule interval can be a challenging concept to grasp; even developers who have worked with Airflow for a while find it confusing. A question that comes up every once in a while on StackOverflow is “Why is my DAG not running as expected?”. This problem usually indicates a misunderstanding of the Airflow schedule interval. In this article, we will talk about how to set up the Airflow schedule interval, what results you should expect when scheduling your Airflow DAGs, and how to debug schedule interval issues, with examples.

How Does Airflow Schedule DAGs?

Why do you need a start_date?

Every DAG has its schedule, and start_date is simply the date from which a DAG becomes visible to the Airflow scheduler. It also lets developers release a DAG before its production date. Before Airflow 1.8, you could set up start_date dynamically. However, it is now recommended to use a fixed date, and more detail can be found in “Less forgiving scheduler on dynamic start_date”.
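As a minimal sketch of the difference (the DAG id below is made up for illustration):

from airflow import DAG
from datetime import datetime

# Recommended: a fixed start_date in the past
dag = DAG(
    'my_example_dag',
    start_date=datetime(2020, 4, 1),
    schedule_interval='@daily',
)

# Discouraged: a dynamic start_date such as datetime.now() keeps moving,
# so the scheduler may never consider a schedule interval to be complete
# dag = DAG('my_example_dag', start_date=datetime.now(), schedule_interval='@daily')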

Which timezone should we use?

Airflow infrastructure initially ran only in UTC. Although you can configure Airflow to run on your local time now, most deployments still run under UTC. Setting up Airflow under UTC makes it easier for businesses operating across multiple time zones, and makes your life easier during occasional events such as daylight saving transitions. The schedule interval that you set up follows the same timezone as your Airflow infrastructure.
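If you do need to schedule in local time, Airflow 1.9 and later accept a timezone-aware start_date via pendulum. A minimal sketch, with a hypothetical DAG id:

import pendulum
from airflow import DAG
from datetime import datetime

# Attach an explicit timezone to start_date; Airflow stores it as UTC internally
local_tz = pendulum.timezone('Europe/Amsterdam')

dag = DAG(
    'local_time_dag',
    start_date=datetime(2020, 4, 1, tzinfo=local_tz),
    schedule_interval='@daily',
)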

How to set the Airflow schedule interval?

You are probably familiar with the syntax of defining a DAG: start_date usually goes into default_args, while schedule_interval belongs on the DAG object itself (placing it inside default_args has no effect and is silently ignored).

from airflow import DAG
from datetime import datetime

default_args = {
    'owner': 'XYZ',
    'start_date': datetime(2020, 4, 1),
}

# schedule_interval is passed to the DAG itself, not via default_args
dag = DAG(
    'tutorial',
    schedule_interval='@daily',
    catchup=False,
    default_args=default_args,
)

What values to provide for schedule_interval?

The Airflow Scheduler section of the documentation provides more detail on what values you can provide. Essentially, you need a crontab expression for schedule_interval. If you find yourself lost in crontab syntax, try crontab guru, and it will explain what you put there. Airflow also gives you some user-friendly presets like @daily or @weekly. I find those names less clear and expressive than crontab. They are also limited to a few intervals, and the underlying implementation is still a crontab, so you might as well learn crontab and live with it. Moreover, if you just want to trigger your DAG manually, use schedule_interval=None.
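A quick sketch of the options (the DAG ids here are hypothetical):

from airflow import DAG
from datetime import datetime

# A crontab expression: 04:30 UTC, Monday through Friday
dag_cron = DAG(
    'cron_example',
    start_date=datetime(2020, 4, 1),
    schedule_interval='30 4 * * 1-5',
)

# A preset: '@daily' is equivalent to the crontab '0 0 * * *'
dag_preset = DAG(
    'preset_example',
    start_date=datetime(2020, 4, 1),
    schedule_interval='@daily',
)

# None: the DAG only runs when triggered manually
dag_manual = DAG(
    'manual_example',
    start_date=datetime(2020, 4, 1),
    schedule_interval=None,
)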

What is the difference between execution_date and start_date?

  • execution_date is the start date and time of the schedule interval a DAG run covers; the run itself is only triggered once that interval has closed.
  • start_date (on the task instance) is the date and time when a DAG run has actually been triggered, and this time usually refers to the wall clock.
from airflow.models import DAG
from datetime import datetime
from airflow.operators.bash_operator import BashOperator

args = {
    'owner': 'Airflow',
    'start_date': datetime(2020, 4, 1),
    'depends_on_past': True,
}

# Runs daily at 02:00 UTC; each run covers the preceding 24-hour window
dag = DAG(
    dag_id='scheduler_interval_101',
    schedule_interval='0 2 * * *',
    default_args=args,
    tags=['example'],
)

hello_my_task = BashOperator(
    task_id='hello_task',
    bash_command='echo "hello_world"',
    dag=dag,
)
Example of Task Instance (Image by Author)

First, Airflow is built with an ETL mindset, usually batch processing over a 24-hour window. Think about such an ETL job: you would trigger it only after the full 24 hours of data have arrived. The same rule applies here. We don't see an execution_date of 04-09 because its 24-hour window has not been closed yet. From the execution_date, we know the last successful run was for 04-08T02:00:00 (remember, the execution_date is the start time of the 24-hour window), and that window ends at 04-09T02:00:00 (exclusive). So what would be the 24-hour window for the 04-09 run? It runs from 04-09T02:00:00 to 04-10T02:00:00, which has not been reached yet.
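The window arithmetic can be sketched in plain Python, using the timestamps from the example above:

from datetime import datetime, timedelta

interval = timedelta(hours=24)
execution_date = datetime(2020, 4, 8, 2, 0)   # start of the data window

window_end = execution_date + interval        # 2020-04-09 02:00, exclusive
trigger_time = window_end                     # the run fires once the window closes

print(f'Run for {execution_date} covers up to {window_end}, triggered at {trigger_time}')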

Execution Date and Start Date Relationship (Image by Author)
Example of Schedule Interval (Image by Author)

Why is there a short delay in triggering the DAGs?

You have probably noticed the small delay between execution_date and start_date. Ideally, they should be the same, but in reality they are not. So why doesn't Airflow trigger the DAG exactly on time? As we discussed before, the Airflow scheduler does not monitor the DAGs continuously: it waits for its next heartbeat to trigger new DAG runs, and this process causes delays. Even when the scheduler is ready to trigger at exactly the right moment, you also need to account for code execution and database update time. All of the above adds up to a short delay in scheduling.
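To make the delay concrete, here is a toy calculation; the seven seconds below are a made-up, purely illustrative wall-clock start:

from datetime import datetime, timedelta

execution_date = datetime(2020, 4, 8, 2, 0)
interval = timedelta(hours=24)

ideal_trigger = execution_date + interval      # 2020-04-09 02:00:00
actual_start = datetime(2020, 4, 9, 2, 0, 7)   # hypothetical observed start_date

print(actual_start - ideal_trigger)            # 0:00:07 of scheduling delay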

Final Thought

I hope this article demystifies how the Airflow schedule interval works. Airflow is a complicated system internally but straightforward to work with for users. Because of its batch-ETL roots, it can take some time to understand how the Airflow scheduler handles time intervals. Once you understand the schedule interval better, creating a DAG with the desired interval should be an unobstructed process.

About Me

I hope my stories are helpful to you. 

For more data engineering posts, you can subscribe to my new articles or become a referred Medium member, which also gives you full access to stories on Medium.

In case of questions/comments, do not hesitate to write in the comments of this story or reach me directly through LinkedIn or Twitter.
