Introduction
This post offers a cost-effective blueprint for deploying Apache Airflow for small to medium workloads, an alternative to pricier solutions like Astronomer. It's tailored for those familiar with AWS, CloudFormation, Airflow, and related DevOps concepts. If that's not you, share this guide with your tech team or reach out to Three Ventures for professional help.
Ensure our solution's assumptions align with your needs before deployment:
- An environment can be made available in AWS for these resources.
- You’re using GitHub to manage source code.
This article details an automated method for deploying Airflow on AWS, allowing for consistent multi-environment setups (dev, stage, production). The focus is on Airflow's core architecture components (Redis, PostgreSQL, Scheduler, Worker, Web) and its service features, such as connections and DAGs. We've shared a public repository with code for a standard Airflow setup, influenced by the Airflow Autoscaling ECS repository, but with enhanced security and added utilities.
The Production Problem
Airflow's local setup is straightforward, but production deployment, especially outside Kubernetes, is more complex. Most online solutions are incomplete or focus on Airflow v1, despite v2 being the latest version.
Airflow can get costly based on DAG configurations and runtimes. We've developed a method for cost-effective, consistent production deployments using AWS Fargate.
The Production Solution
We utilized CloudFormation for a turnkey solution to deploy consistent Airflow environments in AWS, ensuring identical infrastructure across production, staging, and development. Rather than adding external tools like Terraform or Pulumi, we chose CloudFormation, AWS's native IaC solution. The Airflow stack handles everything from initial setup to DAG updates through an integrated CI pipeline. Refer to the architecture diagrams below for a snapshot of the Airflow application layers.
Using CloudFormation, we've integrated a CodePipeline for automatic DAG updates. Committing code to a specific branch triggers an automatic release to your Airflow environment. The CloudFormation YAML files detail:
- Source: Triggers the pipeline upon branch changes matching the environment name.
- Build: Executes the steps defined in the buildspec file, which is where any CI/CD customizations would occur. You can learn more about buildspec files for CodePipeline in the AWS documentation.
A separate diagram is provided to depict the CI/CD configuration, as it's managed alongside the main infrastructure.
CodePipeline automatically updates and restarts the application with new images. Based on our tests, this can take about five minutes, so changes may not appear instantly!
Preparing Your Machine
To deploy Airflow, you'll first need up-to-date installations of Docker and Python on your machine. Installation varies by operating system, so we won't detail it here. Once both are set up, proceed to the next steps.
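As a quick sanity check, you can confirm both tools are installed and on your PATH before moving on:
# Confirm Docker and Python are available
docker --version
python3 --version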
Python Libraries
With those programs installed, you can configure the libraries and settings for our repository. The following commands install the AWS CLI and the libraries required by the utility scripts.
# Install the AWS CLI
pip install awscli
# Create and activate a virtual environment for the utility scripts
virtualenv venv
source venv/bin/activate
# Install the libraries the utility scripts depend on
pip install -r requirements.txt
Preparing For Deployments
Several cloud services must be set up or updated prior to starting any deployment.
IAM Authentication (Required)
First, create or use an existing IAM user with CLI permissions for deployment. An example IAM policy is provided in the repository in the file /airflow.cli-user.policy.json. This IAM policy is very permissive, so use caution when granting access. Next, set up the AWS CLI to use the credentials associated with that IAM user. You can create a custom profile in ~/.aws/config (example below) or run aws configure in a terminal. Multiple authentication methods exist for the CLI, so pick whichever aligns with your company's access policy.
[profile airflow]
aws_access_key_id = <id>
aws_secret_access_key = <key>
region = us-east-1
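To confirm the CLI is picking up the right credentials, you can run a quick identity check. The profile name below comes from the example above; adjust it to whatever you configured.
# Verify the CLI can authenticate with the chosen profile
aws sts get-caller-identity --profile airflow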
Continuous Integration (Required)
For continuous integration via GitHub, first create a dedicated branch in your repository, ensuring only updates to this branch initiate environment changes. Then, establish a connection in CodeStar, granting CodePipeline the necessary access to the repository.
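The connection can be created in the console, or sketched from the CLI as shown below; the connection name is just an example, and a connection created this way stays in a pending state until you complete the GitHub authorization handshake in the console.
# Create a CodeStar connection to GitHub (finish authorization in the console)
aws codestar-connections create-connection \
  --provider-type GitHub \
  --connection-name airflow-github \
  --profile airflow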
SSL Certificate (Optional)
For HTTPS, generate a certificate with ACM and add its ARN to your service.yml file. After deploying the stack, create a DNS record targeting the new load balancer. Ensure the certificate covers the DNS record you intend to create.
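If you go this route, a DNS-validated certificate can be requested along these lines; the domain name is a placeholder, and the certificate should live in the same region as your load balancer.
# Request a DNS-validated certificate for the Airflow hostname (domain is an example)
aws acm request-certificate \
  --domain-name airflow.example.com \
  --validation-method DNS \
  --region us-east-1 \
  --profile airflow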
Configuration Files
The CloudFormation stack relies on a configuration file to build environments. We've provided a service.yml file as a starting point. Review it thoroughly and decide on potential changes. Each environment has its own configuration within the service.yml file. To get started, you must modify certain settings in the configuration file; otherwise, the CloudFormation stack won't deploy. The following keys must be updated:
- serviceName
- owner
- www
- github
For CI/CD with CodePipeline, each environment requires a buildspec file. We offer a default buildspec file that can be used as a starting point. Each buildspec file must match the naming convention buildspec.environment.yaml. For example, the buildspec file for a stage branch would be buildspec.stage.yaml.
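For instance, assuming the repository's default buildspec is named buildspec.yaml (check the repo for the actual filename), creating one per environment is just a copy that you then tailor as needed:
# Create per-environment buildspec files from the default (source filename assumed)
cp buildspec.yaml buildspec.stage.yaml
cp buildspec.yaml buildspec.production.yaml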
Managing Environments
After setting up your local machine and configuring AWS assets, you're ready to deploy using the CloudFormation stack.
CloudFormation doesn't store a stack's state in your local files; stack configuration relies on environment variables set in your current shell. Before executing any commands, always set the proper shell variables, since our CloudFormation commands depend on them. The ENVIRONMENT variable is what differentiates between environments.
# set environment variables when applicable
export AWS_REGION=us-east-1
export AWS_PROFILE=default
export ENVIRONMENT=default
Initial Deployment Per Environment
To deploy your environment for the first time, use the following command:
# Initialize the Environment
make airflow-init
The initial deployment requires numerous resources, typically taking about 30 minutes to set up. Even after the terminal indicates completion, the web interface may take a few extra minutes to become accessible via a browser.
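If you want to keep an eye on progress outside the terminal output, the CloudFormation console works, or you can poll the stack status from the CLI. The stack name below is a placeholder; use whatever name the init command reports for your environment.
# Check the status of the CloudFormation stack (stack name is an example)
aws cloudformation describe-stacks \
  --stack-name airflow-dev \
  --query "Stacks[0].StackStatus" \
  --profile airflow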
Accessing Airflow
To access the interface, find the AirflowWebServerEndpoint value in your terminal output.
"cfn-airflow-webserver": [
{
"OutputKey": "AirflowWebServerEndpoint",
"OutputValue": "airflow-dev-webserver-alb-1234567890.us-east-1.elb.amazonaws.com"
}
]
Use the AirflowWebServerEndpoint value to create your DNS records in Route 53.
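As a rough sketch, a CNAME pointing at the load balancer can be created with the CLI like this; the hosted zone ID and record name are placeholders, and the record value is the endpoint from your own stack outputs.
# Point a CNAME at the Airflow load balancer (zone ID and names are examples)
aws route53 change-resource-record-sets \
  --hosted-zone-id Z0000000EXAMPLE \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "airflow.example.com",
        "Type": "CNAME",
        "TTL": 300,
        "ResourceRecords": [{"Value": "airflow-dev-webserver-alb-1234567890.us-east-1.elb.amazonaws.com"}]
      }
    }]
  }' \
  --profile airflow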
The username and password for the Airflow service can be set in your configuration file under the wwwAdmin and wwwPassword keys.
Multiple Deployments
It’s possible to have multiple configurations for different environments. If you look at the service.yml file, you’ll see that the tree starts with a default key. That is the configuration the app falls back to whenever it can’t find a configuration whose name matches the ENVIRONMENT shell variable. If you wish to set your own config, copy the default config, paste it at the bottom of the file, and change the default key to match your environment name. Then modify the settings under your environment name to suit your needs.
Adding DAGs
Eventually you’ll need to add your own DAGs, and you can do so by adding your DAG files to the /dags directory. We recognize that some DAGs might require other changes, such as configuring a connection or adding an environment variable. We’ve given you a way to do this using the extra_commands.sh file (see the sketch after the command below): just add your bash commands there and update your environment. Any shell commands in the file will be run against your environment with the following command.
# Run Commands From extra_commands.sh Script
make airflow-run-commands
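As a rough idea of what extra_commands.sh might contain, the snippet below uses the standard Airflow 2 CLI to add a connection and a variable; the names and values are placeholders for whatever your DAGs expect.
# Example contents of extra_commands.sh (connection and variable names are placeholders)
airflow connections add 'my_postgres' \
  --conn-uri 'postgresql://user:password@db-host:5432/mydb'
airflow variables set MY_ENVIRONMENT_FLAG 'true'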
Other Commands
We've provided additional commands for managing environments. Just like the deployment commands, they require the correct shell variables, so ensure your terminal is set up properly before executing them. See the Makefile for all available commands.
# Update the Environment
make airflow-deploy
# Update Environment Variables Only
make airflow-update-stack
# Push a New Image and Restart the Container
make airflow-push-image
Release Automation
Once Airflow is deployed, the continuous integration infrastructure should function. Pushing changes to the designated GitHub branch initiates a code update, rebuilds the Docker image, and deploys your code. After the pipeline completes and the containers are in Fargate, your updates will be live in the environment.
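If you want to confirm a release went through without opening the console, the pipeline state can also be checked from the CLI; the pipeline name below is a placeholder, so use the name shown in your CodePipeline console.
# Check the latest state of each pipeline stage (pipeline name is an example)
aws codepipeline get-pipeline-state \
  --name airflow-dev-pipeline \
  --query "stageStates[].{stage:stageName,status:latestExecution.status}" \
  --profile airflow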
Destroying Your Environment
We've added a command to dismantle your environment. This takes time, so consider it a coffee break. The beauty of infrastructure as code is it runs in the background so you can focus on other priorities!
# Destroy an Environment
make airflow-destroy
Updating the Environment
We've updated the Python package versions for Airflow and its dependencies. If you're creating a new environment using the blueprint from our repository, there's no action required; the updated versions will be applied automatically. However, if you want to update an existing environment, simply modify the requirements.txt file (you can refer to the updated versions in our repository) and run the command:
make airflow-push-image
Remember to set the correct shell variables in your terminal before executing this command, as is required for all make commands in the Makefile.
Important: Updating an existing environment may require changes to your DAGs. Be sure to test your setup locally before deploying, as failure to do so could result in broken DAGs in production.
Closing Thoughts
We aimed to provide a foundation that can be used as a baseline for production installs of Apache Airflow for those looking to be cost conscious. There are a couple of options provided by AWS, like Amazon Managed Workflows for Apache Airflow (MWAA), or services like Astronomer, but those tend to come with higher price tags. So we’ve released this CloudFormation stack as a way for those with technical skills and limited budgets to get something as turnkey as possible.
If you’re looking for AWS experts to help you or your teams with a project, feel free to reach out to us. We’re an AWS Consulting Partner after all!