Glue
What is AWS Glue?
AWS Glue is a fully managed serverless ETL (Extract, Transform, Load) service that helps you prepare, transform, and move data between data sources (e.g., S3, RDS, Redshift, DynamoDB).
It includes a data catalog, ETL engine, data preparation UI, event-driven workflows, and Python/Scala-based jobs.

Core Components

| Component | Description |
|---|---|
| Glue Data Catalog | Central metadata repository (like Hive Metastore) for databases/tables |
| Glue Crawler | Auto-detects the schema of structured/semi-structured data |
| Glue Job | Code that performs ETL in PySpark or Scala |
| Glue Workflow | Manages job dependencies and triggers |
| Glue Trigger | Schedule- or event-based invocation of jobs/workflows |
| Glue Studio | Visual ETL job authoring (drag & drop GUI) |
| Glue DataBrew | No-code UI for data cleaning and preparation |

How Glue Works

    ┌──────────────┐
    │   Crawler    │ ◄──── Scans S3/RDS/Redshift
    └──────┬───────┘
           ▼
    ┌───────────────────┐
    │ Glue Data Catalog │ ◄──── Queried by Athena, Redshift Spectrum
    └──────┬────────────┘
           ▼
    ┌──────────────┐
    │     Jobs     │ ◄──── Transforms (PySpark / Scala)
    └──────┬───────┘
           ▼
    ┌──────────────┐
    │    Target    │ ◄──── S3 / RDS / Redshift / DynamoDB
    └──────────────┘
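
As a rough sketch of the first stage in this flow, a crawler that scans an S3 prefix and writes table definitions into the Data Catalog could be declared in Terraform along these lines. The crawler name, schedule, and bucket path are illustrative assumptions, and `aws_iam_role.glue_role` is assumed to be defined elsewhere in the configuration.

```hcl
# Hypothetical crawler: scans an S3 prefix and registers the tables it finds
# in the "analytics_db" catalog database.
resource "aws_glue_crawler" "sales" {
  name          = "sales-crawler"              # assumed name
  database_name = "analytics_db"               # catalog database to populate
  role          = aws_iam_role.glue_role.arn   # assumed to exist elsewhere

  s3_target {
    path = "s3://my-bucket/sales/"             # assumed data location
  }

  # Optional: run daily at 02:00 UTC. Omit to run the crawler on demand only.
  schedule = "cron(0 2 * * ? *)"
}
```

Once the crawler has run, the tables it creates can be queried from Athena or Redshift Spectrum without any hand-written DDL.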

Use Cases

| Use Case | Why Glue? |
|---|---|
| Data lake ETL pipelines | Process and move data between S3, Redshift, RDS, etc. |
| Centralized metadata catalog | Shared schema catalog for Athena and Redshift Spectrum |
| Data cleansing and wrangling | Glue Studio and DataBrew for visual/ML-powered data cleaning |
| Scheduled batch ETL | Run Spark jobs on a schedule or on event triggers |
| ML feature engineering | Preprocess data for SageMaker models |

Glue Data Catalog vs Crawlers

| Feature | Glue Data Catalog | Glue Crawlers |
|---|---|---|
| Purpose | Stores metadata (tables, databases) | Auto-generates metadata |
| Queryable by | Athena, Redshift Spectrum, EMR | No direct querying |
| Interface | SQL DDL | Pointed at S3, RDS, or DynamoDB sources |
| Format Support | CSV, JSON, Parquet, ORC, Avro, XML | Same |

Security

| Layer | Description |
|---|---|
| IAM Roles | Used for job execution and access to S3, Redshift |
| Encryption | Supports KMS for job scripts, logs, and data |
| VPC Access | Jobs can run in private subnets (Glue version 2+) |
| Column-level access | With Lake Formation integration |
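
To illustrate the encryption layer, the following is a minimal Terraform sketch of a Glue security configuration that applies one KMS key to CloudWatch logs, job bookmarks, and S3 output. The resource names, the key description, and the choice of a single shared key are assumptions made for the example.

```hcl
# Hypothetical KMS key used for all Glue encryption settings below.
resource "aws_kms_key" "glue" {
  description = "Key for Glue job logs, bookmarks, and S3 output (example)"
}

resource "aws_glue_security_configuration" "example" {
  name = "glue-encryption" # assumed name

  encryption_configuration {
    # Encrypt job logs written to CloudWatch.
    cloudwatch_encryption {
      cloudwatch_encryption_mode = "SSE-KMS"
      kms_key_arn                = aws_kms_key.glue.arn
    }

    # Encrypt job bookmark state.
    job_bookmarks_encryption {
      job_bookmarks_encryption_mode = "CSE-KMS"
      kms_key_arn                   = aws_kms_key.glue.arn
    }

    # Encrypt data the job writes to S3.
    s3_encryption {
      s3_encryption_mode = "SSE-KMS"
      kms_key_arn        = aws_kms_key.glue.arn
    }
  }
}
```

A job opts in by setting `security_configuration = aws_glue_security_configuration.example.name`; the job's IAM role also needs permission to use the key.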

Pricing (as of 2024)

| Item | Cost (approx.) |
|---|---|
| Job (per DPU-hour) | ~$0.44 per DPU-hour (1 DPU = 4 vCPU + 16 GB RAM) |
| Glue Studio Notebooks | ~$0.44 per DPU-hour |
| Glue Crawlers | ~$0.44 per DPU-hour |
| Data Catalog storage | First 1M objects free, then ~$1 per 100k objects per month |

Tip: Use Glue 2.0 or later. These versions have faster job startup times and per-second billing (with a 1-minute minimum).
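
A worked example may help make the DPU-hour rate concrete. Assuming the approximate rate of 0.44 USD per DPU-hour and a hypothetical job that runs on 2 DPUs for 15 minutes (0.25 hours):

$$
\text{cost} \approx \text{DPUs} \times \text{hours} \times \text{rate} = 2 \times 0.25 \times 0.44 = 0.22\ \text{USD}
$$

Actual rates vary by region, so treat the figure as an estimate rather than a quote.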
1. Glue Catalog Database

    resource "aws_glue_catalog_database" "example" {
      name = "analytics_db"
    }

2. Glue Catalog Table

    resource "aws_glue_catalog_table" "example" {
      name          = "sales_data"
      database_name = aws_glue_catalog_database.example.name
      table_type    = "EXTERNAL_TABLE"

      storage_descriptor {
        location      = "s3://my-bucket/sales/"
        input_format  = "org.apache.hadoop.mapred.TextInputFormat"
        output_format = "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"

        columns {
          name = "order_id"
          type = "string"
        }

        columns {
          name = "amount"
          type = "double"
        }

        serde_info {
          serialization_library = "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"
          parameters = {
            "field.delim" = ","
          }
        }
      }
    }

3. Glue Job

    resource "aws_glue_job" "example" {
      name     = "etl-job"
      role_arn = aws_iam_role.glue_role.arn # IAM role defined elsewhere

      command {
        name            = "glueetl" # Spark ETL job type
        script_location = "s3://my-bucket/scripts/transform.py"
        python_version  = "3"
      }

      default_arguments = {
        "--TempDir"      = "s3://my-bucket/temp/"
        "--job-language" = "python"
      }

      glue_version = "3.0"

      # Capacity: 2 G.1X workers (1 DPU each). max_capacity is omitted because
      # it cannot be combined with number_of_workers / worker_type.
      number_of_workers = 2
      worker_type       = "G.1X"
    }

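
To run the job on a schedule, a trigger can be attached to it. The following is a minimal sketch, not a complete setup: the trigger name and cron expression are hypothetical, and it assumes the `aws_glue_job.example` resource from step 3.

```hcl
# Hypothetical scheduled trigger: starts the ETL job every day at 03:00 UTC.
resource "aws_glue_trigger" "nightly" {
  name     = "nightly-etl"          # assumed name
  type     = "SCHEDULED"
  schedule = "cron(0 3 * * ? *)"    # assumed schedule

  actions {
    job_name = aws_glue_job.example.name
  }
}
```

Other trigger types (`ON_DEMAND`, `CONDITIONAL`) cover the manual and dependency-driven cases.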

Glue vs Other Services

| Goal | Use Glue? | Alternatives |
|---|---|---|
| Query S3 with SQL | No (use Athena) | Amazon Athena |
| Real-time stream ETL | No (use Kinesis) | Amazon Kinesis + Lambda |
| Batch ETL on big data | Yes | EMR, Lambda (lightweight) |
| No-code cleaning | Yes (via DataBrew) | Third-party tools (Trifacta) |

TL;DR Summary

| Feature | AWS Glue |
|---|---|
| Serverless? | Yes |
| Languages | Python (PySpark), Scala |
| Catalog Support | Yes (Glue Data Catalog) |
| Visual Authoring | Yes (Glue Studio, Glue DataBrew) |
| Triggering | On schedule, on event, or manually |
| Pricing | Per second, per DPU |
| Integrations | S3, RDS, Redshift, Athena, DynamoDB |