
1๏ธโƒฃ What is Apache Airflow?

Apache Airflow is a workflow orchestration platform.

Think of it as:

“A scheduler + executor that runs your jobs in the right order, at the right time, with retries and visibility.”

🧠 What You’ll Learn (Roadmap)

  1. What Apache Airflow is (mental model)
  2. Core Airflow components (with flow)
  3. Executors overview (why Celery)
  4. Architecture: Airflow + Celery + Redis
  5. Local setup (Docker-based, production-like)
  6. Writing DAGs (basic → real-world)
  7. Task distribution with Celery workers
  8. Monitoring, retries, backfills
  9. Common commands & debugging
  10. Best practices

What Airflow is GOOD at

✅ ETL pipelines ✅ Data engineering workflows ✅ Batch jobs ✅ ML pipelines ✅ Cron replacement with observability

What Airflow is NOT

โŒ Real-time streaming โŒ Event-driven APIs โŒ Long-running services


2๏ธโƒฃ Airflow Core Concepts (Very Important)

🔹 DAG

  • Directed Acyclic Graph
  • A Python file that defines a workflow
  • Nodes = tasks, edges = dependencies

🔹 Task

  • A single unit of work
  • Examples: run a bash command, execute a Python function, call an API

🔹 Operator

  • Template for a task
  • Examples:
    • PythonOperator
    • BashOperator
    • DockerOperator

🔹 Scheduler

  • Decides when tasks should run

🔹 Executor

  • Decides how tasks are executed (see the sketch below)
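
To tie these concepts together, here is a minimal sketch; the dag_id, task_ids, and bash commands are illustrative and not part of the setup later in this guide. Each operator instance is a task (a node), and >> declares a dependency (an edge).

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="concepts_demo",          # illustrative name
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,          # run only when triggered manually
    catchup=False,
) as dag:

    # Two tasks (nodes), each created from an operator template
    extract = BashOperator(task_id="extract", bash_command="echo extract")
    load = BashOperator(task_id="load", bash_command="echo load")

    # One dependency (edge): extract must finish before load starts
    extract >> load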

3๏ธโƒฃ Executors Explained (Why Celery?)

Executor             Use Case
SequentialExecutor   Learning only
LocalExecutor        Single machine
CeleryExecutor       Distributed, scalable
KubernetesExecutor   Cloud-native

Why CeleryExecutor?

  • Multiple workers
  • Horizontal scaling
  • Reliable for production
  • Industry standard

4๏ธโƒฃ Architecture: Airflow + Celery + Redis


Flow (Understand this clearly 👇)

DAG → Scheduler → Redis (Queue) → Celery Workers → Result → Metadata DB

Components

Component        Role
Webserver        UI
Scheduler        Schedules tasks
Redis            Message broker
Celery Workers   Execute tasks
Postgres         Metadata DB

5๏ธโƒฃ Install Airflow with Celery + Redis (Docker)

๐Ÿ“ Project Structure

airflow/
├── dags/
│   └── example_dag.py
├── docker-compose.yml
└── .env

📄 docker-compose.yml

version: "3.8"

x-airflow-common: &airflow-common
  image: apache/airflow:2.8.1
  environment:
    AIRFLOW__CORE__EXECUTOR: CeleryExecutor
    AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@postgres/airflow
    AIRFLOW__CELERY__RESULT_BACKEND: db+postgresql://airflow:airflow@postgres/airflow
    AIRFLOW__CELERY__BROKER_URL: redis://redis:6379/0
    AIRFLOW__CORE__LOAD_EXAMPLES: "false"
  volumes:
    - ./dags:/opt/airflow/dags
  depends_on:
    - redis
    - postgres

services:

  postgres:
    image: postgres:15
    environment:
      POSTGRES_USER: airflow
      POSTGRES_PASSWORD: airflow
      POSTGRES_DB: airflow

  redis:
    image: redis:7

  airflow-webserver:
    <<: *airflow-common
    command: webserver
    ports:
      - "8080:8080"

  airflow-scheduler:
    <<: *airflow-common
    command: scheduler

  airflow-worker:
    <<: *airflow-common
    command: celery worker

  airflow-init:
    <<: *airflow-common
    command: >
      bash -c "airflow db init &&
      airflow users create
      --username admin
      --password admin
      --firstname Admin
      --lastname User
      --role Admin
      --email admin@example.com"

โ–ถ๏ธ Start Airflow

docker compose up -d

Access UI:

http://localhost:8080
username: admin
password: admin

6๏ธโƒฃ Your First DAG (Hands-On)

📄 dags/example_dag.py

from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime

def hello():
    print("Hello from Airflow Celery Worker!")

with DAG(
    dag_id="hello_celery",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:

    task_1 = PythonOperator(
        task_id="hello_task",
        python_callable=hello
    )

    task_1  # single task: no dependencies to declare

What happens?

  • Scheduler queues task
  • Redis stores message
  • Celery worker picks it up
  • Worker executes task

7๏ธโƒฃ Verify Celery Workers Are Working

Check worker logs

docker compose logs airflow-worker -f

You should see a log entry indicating the task succeeded, similar to:

Task hello_celery.hello_task succeeded

🎉 You are now running distributed Airflow tasks


8๏ธโƒฃ Parallel Tasks Example (Real Practice)

from airflow import DAG
from airflow.operators.bash import BashOperator
from datetime import datetime

with DAG(
    "parallel_tasks",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:

    t1 = BashOperator(
        task_id="task_1",
        bash_command="sleep 10"
    )

    t2 = BashOperator(
        task_id="task_2",
        bash_command="sleep 10"
    )

    t3 = BashOperator(
        task_id="task_3",
        bash_command="sleep 10"
    )

    [t1, t2] >> t3

👉 Trigger the DAG and watch task_1 and task_2 execute in parallel before task_3 starts.


9๏ธโƒฃ Important Airflow CLI Commands

# In this Docker setup, run CLI commands inside a container, e.g.:
# docker compose exec airflow-webserver airflow dags list

# List DAGs
airflow dags list

# Trigger DAG
airflow dags trigger hello_celery

# Test task locally
airflow tasks test hello_celery hello_task 2024-01-01

# Monitor Celery workers with Flower
airflow celery flower

🔟 Common Problems & Fixes

โŒ Tasks stuck in "Queued"

✔ Redis not reachable ✔ Worker not running

Check:

docker compose ps

โŒ ImportError in DAG

✔ Missing Python library ✔ Add via custom image or requirements.txt


๐Ÿ” Best Practices (Production Mindset)

✅ One DAG = one business workflow ✅ Use retries & SLAs (sketch below) ✅ Avoid heavy logic inside the DAG file ✅ Keep tasks idempotent ✅ Version control DAGs ✅ Monitor with Prometheus / StatsD
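
To make the retries, SLA, and idempotency points concrete, here is a hedged sketch; the dag_id, timings, and callable are illustrative assumptions, not taken from the earlier examples.

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

# Illustrative defaults: each task retries twice, 5 minutes apart,
# and reports an SLA miss if it is still running after 30 minutes.
default_args = {
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
    "sla": timedelta(minutes=30),
}

def load_data():
    # Keep this idempotent: re-running it must not duplicate data
    # (e.g. upsert by key instead of blindly inserting).
    print("loading...")

with DAG(
    dag_id="daily_load",             # illustrative name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args=default_args,
) as dag:

    PythonOperator(task_id="load_task", python_callable=load_data)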


🧪 Practice Ideas (Do These Next)

  1. ETL pipeline (API → Transform → DB)
  2. ML training DAG
  3. File processing pipeline
  4. Trigger a DAG from the REST API (see the sketch after this list)
  5. Add Flower for Celery monitoring
  6. Add Airflow variables & connections
  7. Use Redis CLI to inspect queues
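
For idea 4, here is a minimal sketch against Airflow's stable REST API. It assumes the API accepts basic auth (for example, AIRFLOW__API__AUTH_BACKENDS includes airflow.api.auth.backend.basic_auth) and reuses the admin/admin user created by the airflow-init service; adjust to your own setup.

import requests  # third-party HTTP client: pip install requests

AIRFLOW_API = "http://localhost:8080/api/v1"

# Trigger a new run of the hello_celery DAG defined earlier.
resp = requests.post(
    f"{AIRFLOW_API}/dags/hello_celery/dagRuns",
    auth=("admin", "admin"),   # user created by airflow-init
    json={"conf": {}},         # optional run-level configuration
    timeout=10,
)
resp.raise_for_status()
print("Triggered run:", resp.json()["dag_run_id"])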

🎯 Final Mental Model

Airflow = Brain
Scheduler = Planner
Redis = Queue
Celery Workers = Muscle
Postgres = Memory