1️⃣ What is Apache Airflow?¶
Apache Airflow is a workflow orchestration platform.
Think of it as:
"A scheduler + executor that runs your jobs in the right order, at the right time, with retries and visibility."
🧠 What You'll Learn (Roadmap)¶
- What Apache Airflow is (mental model)
- Core Airflow components (with flow)
- Executors overview (why Celery)
- Architecture: Airflow + Celery + Redis
- Local setup (Docker-based, production-like)
- Writing DAGs (basic → real-world)
- Task distribution with Celery workers
- Monitoring, retries, backfills
- Common commands & debugging
- Best practices
What Airflow is GOOD at¶
✅ ETL pipelines
✅ Data engineering workflows
✅ Batch jobs
✅ ML pipelines
✅ Cron replacement with observability
What Airflow is NOT¶
❌ Real-time streaming
❌ Event-driven APIs
❌ Long-running services
2️⃣ Airflow Core Concepts (Very Important)¶
🔹 DAG¶
- Directed Acyclic Graph
- Python file that defines workflow
- Nodes = tasks, edges = dependencies
🔹 Task¶
- A single unit of work
- Example: run a bash command, execute Python, call API
🔹 Operator¶
- Template for a task
- Examples: PythonOperator, BashOperator, DockerOperator
🔹 Scheduler¶
- Decides when tasks should run
🔹 Executor¶
- Decides how tasks are executed
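Tying these concepts together, here is a minimal sketch of a two-task DAG (the dag_id, task names and echo commands are illustrative, not from a real pipeline):

from airflow import DAG
from airflow.operators.bash import BashOperator
from datetime import datetime

with DAG(
    dag_id="concepts_demo",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    # two tasks, each built from an operator template
    extract = BashOperator(task_id="extract", bash_command="echo extract")
    load = BashOperator(task_id="load", bash_command="echo load")

    # an edge in the graph: load runs only after extract succeeds
    extract >> load

The scheduler decides when this DAG's runs are due; the executor decides where each task actually runs.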
3️⃣ Executors Explained (Why Celery?)¶
| Executor | Use Case |
|---|---|
| SequentialExecutor | Learning only |
| LocalExecutor | Single machine |
| CeleryExecutor | Distributed, scalable |
| KubernetesExecutor | Cloud-native |
Why CeleryExecutor?¶
- Multiple workers
- Horizontal scaling
- Reliable for production
- Industry standard
4️⃣ Architecture: Airflow + Celery + Redis¶
Flow (Understand this clearly 👇)¶
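The picture to keep in mind (it matches the components table below):

1. The Scheduler parses DAG files and pushes runnable tasks onto the queue.
2. Redis, the message broker, holds those task messages.
3. A Celery worker pulls a message and executes the task.
4. Task state and results are written to Postgres, the metadata DB.
5. The Webserver reads Postgres to render everything you see in the UI.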
Components¶
| Component | Role |
|---|---|
| Webserver | UI |
| Scheduler | Schedules tasks |
| Redis | Message broker |
| Celery Workers | Execute tasks |
| Postgres | Metadata DB |
5️⃣ Install Airflow with Celery + Redis (Docker)¶
📂 Project Structure¶
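A minimal layout that matches the volume mount in the compose file below (the root folder can be named anything):

.
├── docker-compose.yml
└── dags/
    └── example_dag.py   # mounted into the containers at /opt/airflow/dags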
📄 docker-compose.yml¶
version: "3.8"

x-airflow-common: &airflow-common
  image: apache/airflow:2.8.1
  environment:
    AIRFLOW__CORE__EXECUTOR: CeleryExecutor
    AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@postgres/airflow
    AIRFLOW__CELERY__RESULT_BACKEND: db+postgresql://airflow:airflow@postgres/airflow
    AIRFLOW__CELERY__BROKER_URL: redis://redis:6379/0
    AIRFLOW__CORE__LOAD_EXAMPLES: "false"
  volumes:
    - ./dags:/opt/airflow/dags
  depends_on:
    - redis
    - postgres

services:
  postgres:
    image: postgres:15
    environment:
      POSTGRES_USER: airflow
      POSTGRES_PASSWORD: airflow
      POSTGRES_DB: airflow

  redis:
    image: redis:7

  airflow-webserver:
    <<: *airflow-common
    command: webserver
    ports:
      - "8080:8080"

  airflow-scheduler:
    <<: *airflow-common
    command: scheduler

  airflow-worker:
    <<: *airflow-common
    command: celery worker

  airflow-init:
    <<: *airflow-common
    command: bash -c "airflow db init && airflow users create --username admin --password admin --firstname Admin --lastname User --role Admin --email admin@example.com"
▶️ Start Airflow¶
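With the docker-compose.yml above in place:

# one-time: initialize the metadata DB and create the admin user
docker compose up airflow-init

# start webserver, scheduler, worker, redis and postgres in the background
docker compose up -d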
Access UI:
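http://localhost:8080 — log in with admin / admin (the user created by airflow-init).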
6️⃣ Your First DAG (Hands-On)¶
📄 dags/example_dag.py¶
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime

def hello():
    print("Hello from Airflow Celery Worker!")

with DAG(
    dag_id="hello_celery",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    task_1 = PythonOperator(
        task_id="hello_task",
        python_callable=hello,
    )

    task_1
What happens?¶
- Scheduler queues task
- Redis stores message
- Celery worker picks it up
- Worker executes task
7️⃣ Verify Celery Workers Are Working¶
Check worker logs¶
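In this Docker setup, tail the worker service:

docker compose logs -f airflow-worker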
You should see:
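Output similar to the following once the worker has connected to Redis (exact lines vary by Airflow/Celery version):

[INFO/MainProcess] Connected to redis://redis:6379/0
[INFO/MainProcess] celery@<container-id> ready.

and, once you trigger the DAG, task log lines for hello_task on that worker.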
🎉 You are now running distributed Airflow tasks
8️⃣ Parallel Tasks Example (Real Practice)¶
from airflow import DAG
from airflow.operators.bash import BashOperator
from datetime import datetime

with DAG(
    "parallel_tasks",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,  # manual trigger only
    catchup=False,
) as dag:
    t1 = BashOperator(
        task_id="task_1",
        bash_command="sleep 10",
    )
    t2 = BashOperator(
        task_id="task_2",
        bash_command="sleep 10",
    )
    t3 = BashOperator(
        task_id="task_3",
        bash_command="sleep 10",
    )

    # t1 and t2 have no dependency on each other, so they run in parallel;
    # t3 starts only after both succeed
    [t1, t2] >> t3
👀 Watch multiple workers executing in parallel.
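To make the parallelism more visible, scale the worker service (standard docker compose flag, works with the setup above):

# run three Celery workers instead of one
docker compose up -d --scale airflow-worker=3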
9️⃣ Important Airflow CLI Commands¶
# List DAGs
airflow dags list
# Trigger DAG
airflow dags trigger hello_celery
# Test task locally
airflow tasks test hello_celery hello_task 2024-01-01
# Monitor Celery workers with Flower
airflow celery flower
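In this Docker setup the Airflow CLI lives inside the containers, so prefix commands accordingly, e.g.:

docker compose exec airflow-webserver airflow dags list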
🔧 Common Problems & Fixes¶
❌ Tasks stuck in "Queued"¶
❌ Redis not reachable
❌ Worker not running
Check:
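Assuming the compose services above:

# are all services running?
docker compose ps

# can Redis answer? (should reply PONG)
docker compose exec redis redis-cli ping

# is the worker alive and picking up tasks?
docker compose logs airflow-worker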
❌ ImportError in DAG¶
❌ Missing Python library → add it via a custom image or requirements.txt
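For quick local experiments, the official apache/airflow image can also install packages at container start via an environment variable (a development convenience, too slow and fragile for production — build a custom image instead). A sketch against the x-airflow-common block above; the package names are examples:

x-airflow-common: &airflow-common
  environment:
    # installed on every container start
    _PIP_ADDITIONAL_REQUIREMENTS: "pandas requests"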
📌 Best Practices (Production Mindset)¶
✅ One DAG = one business workflow
✅ Use retries & SLAs
✅ Avoid heavy logic inside the DAG file
✅ Keep tasks idempotent
✅ Version control DAGs
✅ Monitor with Prometheus / StatsD
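A minimal sketch of retries and an SLA, applied to the hello_celery DAG from earlier (the dag_id and timedelta values are illustrative):

from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime, timedelta

default_args = {
    "retries": 2,                         # re-run a failed task up to 2 times
    "retry_delay": timedelta(minutes=5),  # wait 5 minutes between attempts
}

with DAG(
    dag_id="hello_celery_robust",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args=default_args,
) as dag:
    PythonOperator(
        task_id="hello_task",
        python_callable=lambda: print("Hello from Airflow Celery Worker!"),
        sla=timedelta(minutes=30),  # flag the run if this task isn't done 30 minutes in
    )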
🧪 Practice Ideas (Do These Next)¶
- ETL pipeline (API → Transform → DB)
- ML training DAG
- File processing pipeline
- Trigger DAG from REST API
- Add Flower for Celery monitoring
- Add Airflow variables & connections
- Use Redis CLI to inspect queues (see the sketch below)
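For the last idea, a starting point (assumes tasks sit on Airflow's default Celery queue, named default — Celery's Redis broker keeps each queue as a Redis list keyed by the queue name):

# open a Redis shell inside the broker container
docker compose exec redis redis-cli

# then, inside redis-cli:
LLEN default        # number of queued task messages
LRANGE default 0 0  # peek at the first queued message (JSON)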