Glue

🧠 What is AWS Glue?

AWS Glue is a fully managed, serverless ETL (Extract, Transform, Load) service that helps you prepare, transform, and move data between data stores (e.g., S3, RDS, Redshift, DynamoDB).

✅ It includes a data catalog, an ETL engine, a data preparation UI, event-driven workflows, and Python/Scala-based jobs.


📦 Core Components

| Component | Description |
|---|---|
| Glue Data Catalog | Central metadata repository (like a Hive Metastore) for databases/tables |
| Glue Crawler | Auto-detects the schema of structured/semi-structured data |
| Glue Job | Code that performs ETL in PySpark or Scala |
| Glue Workflow | Manages job dependencies and triggers |
| Glue Trigger | Scheduled or event-based invocation of jobs/workflows |
| Glue Studio | Visual ETL job authoring (drag & drop GUI) |
| Glue DataBrew | No-code UI for data cleaning and preparation |

🔁 How Glue Works

             ┌──────────────┐
             │   Crawler    │ ───── Scans S3 / RDS / Redshift
             └──────┬───────┘
                    ↓
         ┌─────────────────────┐
         │  Glue Data Catalog  │ ←── Queried by Athena, Redshift Spectrum
         └──────────┬──────────┘
                    ↓
             ┌──────────────┐
             │     Jobs     │ ←──── Transforms (PySpark / Scala)
             └──────┬───────┘
                    ↓
             ┌──────────────┐
             │    Target    │ ←──── S3 / RDS / Redshift / DynamoDB
             └──────────────┘
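
In code, that flow is usually driven through the Glue API. Here is a minimal boto3 sketch (the crawler and job names are placeholders): start a crawler so the Data Catalog reflects the current schema, wait for it to finish, then kick off the ETL job.

import time
import boto3

glue = boto3.client("glue")

CRAWLER_NAME = "sales_crawler"  # hypothetical crawler name
JOB_NAME = "etl-job"            # matches the Terraform job example below

# 1. Refresh the Data Catalog by crawling the source.
glue.start_crawler(Name=CRAWLER_NAME)
while True:
    time.sleep(30)
    if glue.get_crawler(Name=CRAWLER_NAME)["Crawler"]["State"] == "READY":
        break  # crawl finished

# 2. Run the ETL job that reads from the catalog and writes to the target.
run = glue.start_job_run(JobName=JOB_NAME)
print("Started job run:", run["JobRunId"])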

🧰 Use Cases

| Use Case | Why Glue? |
|---|---|
| 🧪 Data lake ETL pipelines | Process and move data between S3, Redshift, RDS, etc. |
| 🗂️ Centralized metadata catalog | Federated schema registry for Athena, Redshift Spectrum |
| 💧 Data cleansing and wrangling | Glue Studio and DataBrew for visual/ML-powered data cleaning |
| ⏱️ Scheduled batch ETL | Run Spark jobs on a schedule or via event triggers |
| 🧠 ML feature engineering | Preprocess data for SageMaker models |

๐Ÿ—๏ธ Glue Data Catalog vs Crawlers

| Feature | Glue Data Catalog | Glue Crawlers |
|---|---|---|
| Purpose | Stores metadata (databases, tables) | Auto-generates metadata |
| Queryable by | Athena, Redshift Spectrum, EMR | Not directly queryable |
| Interaction | SQL/DDL against catalog tables | Pointed at S3, RDS, DynamoDB, etc. |
| Format support | CSV, JSON, Parquet, ORC, Avro, XML | Same |
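
Once a crawler (or Terraform) has registered tables, the catalog metadata can also be read programmatically. A small boto3 sketch, reusing the analytics_db database name from the Terraform example further down:

import boto3

glue = boto3.client("glue")

# List the tables registered in a catalog database and their column names.
resp = glue.get_tables(DatabaseName="analytics_db")
for table in resp["TableList"]:
    columns = [c["Name"] for c in table["StorageDescriptor"]["Columns"]]
    print(table["Name"], columns)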

🔐 Security

| Layer | Description |
|---|---|
| IAM roles | Used for job execution and access to S3, Redshift, etc. |
| Encryption | Supports KMS for job scripts, logs, and data |
| VPC access | Jobs can reach resources in private subnets via Glue connections |
| Column-level access | Via Lake Formation integration |
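
As a concrete example of the IAM layer, a minimal boto3 sketch (the role name is hypothetical) that creates a role the Glue service can assume and attaches AWS's managed baseline policy; add least-privilege S3/KMS permissions for your own buckets separately.

import json
import boto3

iam = boto3.client("iam")

# Trust policy that lets the Glue service assume the role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "glue.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

iam.create_role(
    RoleName="glue-etl-role",  # hypothetical role name
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)

# AWS-managed baseline policy for Glue; scope S3/KMS access per job on top of it.
iam.attach_role_policy(
    RoleName="glue-etl-role",
    PolicyArn="arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole",
)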

💰 Pricing (as of 2024)

| Item | Cost (approx.) |
|---|---|
| Glue jobs | ~$0.44 per DPU-hour (1 DPU = 4 vCPU + 16 GB RAM) |
| Glue Studio notebooks | ~$0.44 per DPU-hour |
| Glue crawlers | ~$0.44 per DPU-hour |
| Data Catalog storage | First 1M objects free, then ~$1 per 100k objects per month |

🧠 Tip: Use Glue 2.0 or later, which has much faster startup times and per-second billing (1-minute minimum).
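
A quick worked example of the DPU-hour math, using the list price above (illustrative numbers only):

# Rough cost of a single job run at the ~$0.44/DPU-hour list price.
dpu_hour_rate = 0.44      # USD per DPU-hour (check current regional pricing)
workers = 10              # e.g. 10 G.1X workers = 10 DPUs
runtime_hours = 15 / 60   # a 15-minute run

cost = workers * runtime_hours * dpu_hour_rate
print(f"~${cost:.2f} per run")  # ~$1.10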


🛠️ Terraform Examples

1. Glue Catalog Database

resource "aws_glue_catalog_database" "example" {
  name = "analytics_db"
}

2. Glue Catalog Table

resource "aws_glue_catalog_table" "example" {
  name          = "sales_data"
  database_name = aws_glue_catalog_database.example.name
  table_type    = "EXTERNAL_TABLE"

  storage_descriptor {
    location      = "s3://my-bucket/sales/"
    input_format  = "org.apache.hadoop.mapred.TextInputFormat"
    output_format = "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"

    columns {
      name = "order_id"
      type = "string"
    }

    columns {
      name = "amount"
      type = "double"
    }

    serde_info {
      serialization_library = "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"
      parameters = {
        "field.delim" = ","
      }
    }
  }
}

3. Glue Job

resource "aws_glue_job" "example" {
  name     = "etl-job"
  role_arn = aws_iam_role.glue_role.arn

  command {
    name            = "glueetl"
    script_location = "s3://my-bucket/scripts/transform.py"
    python_version  = "3"
  }

  default_arguments = {
    "--TempDir"      = "s3://my-bucket/temp/"
    "--job-language" = "python"
  }

  glue_version      = "3.0"
  worker_type       = "G.1X"  # 1 DPU per worker
  number_of_workers = 2       # note: max_capacity cannot be combined with worker_type/number_of_workers
}
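
For context, a minimal sketch of what the transform.py script referenced by script_location above might contain, reading the sales_data table from the catalog and writing Parquet back to S3 (the output path is a hypothetical placeholder):

import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])

glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read via the Data Catalog table defined earlier (analytics_db.sales_data).
frame = glue_context.create_dynamic_frame.from_catalog(
    database="analytics_db", table_name="sales_data"
)

# Example transform: keep only the columns we care about.
frame = frame.select_fields(["order_id", "amount"])

# Write the result back to S3 as Parquet (output path is hypothetical).
glue_context.write_dynamic_frame.from_options(
    frame=frame,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/processed/sales/"},
    format="parquet",
)

job.commit()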

🔄 Glue vs Other Services

| Goal | Use Glue? | Alternatives |
|---|---|---|
| Query S3 with SQL | No (use Athena) | Amazon Athena |
| Real-time stream ETL | No (use Kinesis) | Amazon Kinesis + Lambda |
| Batch ETL on big data | ✅ Yes | EMR, Lambda (lightweight) |
| No-code cleaning | ✅ Yes (via DataBrew) | Third-party tools (Trifacta) |

✅ TL;DR Summary

| Feature | AWS Glue |
|---|---|
| Serverless? | ✅ Yes |
| Languages | Python (PySpark), Scala |
| Catalog support | ✅ Glue Data Catalog |
| Visual authoring | ✅ Glue Studio, Glue DataBrew |
| Triggering | On schedule, on event, or manually |
| Pricing | Per second, per DPU |
| Integrations | S3, RDS, Redshift, Athena, DynamoDB |