EMR

🧠 What is Amazon EMR?

Amazon EMR (Elastic MapReduce) is a managed big data platform that simplifies running Apache Hadoop, Spark, Hive, HBase, Flink, Presto, and other frameworks on AWS.

✅ It lets you process and analyze massive datasets cost-effectively using open-source tools in a scalable, cluster-based environment.


📦 Core Use Cases

| Use Case | Why Use EMR? |
|---|---|
| 🧪 Big Data Processing (ETL) | Ingest, clean, and transform TB–PB of data |
| 🔍 Data Lake Analytics | Analyze S3-stored data using Spark, Hive, Presto |
| 🧠 Machine Learning Pipelines | Distributed ML training using PySpark, TensorFlow |
| 📊 Data Warehousing | SQL-like analytics using Hive/Presto |
| 📂 Log & Clickstream Analysis | Real-time and batch log processing |

🏗️ Architecture Overview

       +--------------------+
       |   S3 (raw data)    |
       +--------------------+
                 ↓
        +----------------+
        |  EMR Cluster   |
        |  - Master Node |
        |  - Core Nodes  |
        |  - Task Nodes  |
        +----------------+
                 ↓
     Processed data → S3 / RDS / Redshift / DynamoDB

Node Types:

  • Master Node: Manages the cluster and schedules jobs (called the primary node in current AWS documentation)

  • Core Nodes: Store data using HDFS and process tasks

  • Task Nodes: Only perform processing (no HDFS storage)
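
To make the core/task split concrete, here is a minimal sketch that attaches a task instance group to the cluster defined in the Terraform example later in this post. It assumes the AWS provider's `aws_emr_instance_group` resource; the group name, instance count, and bid price are placeholder values, not recommendations.

```hcl
# Hypothetical task group: adds processing capacity only, no HDFS storage.
resource "aws_emr_instance_group" "task" {
  name           = "example-task-group"
  cluster_id     = aws_emr_cluster.example.id
  instance_type  = "m5.xlarge"
  instance_count = 2

  # Setting bid_price requests Spot Instances; omit it to use On-Demand.
  bid_price = "0.10"
}
```

Because task nodes hold no HDFS blocks, losing one to a Spot interruption costs re-run work rather than stored data, which is why Spot capacity is usually placed here first.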


🧰 Supported Frameworks

| Framework | Purpose |
|---|---|
| Apache Spark | Fast, in-memory big data engine |
| Apache Hive | SQL-on-Hadoop for batch processing |
| Apache HBase | NoSQL database over HDFS |
| Presto/Trino | Distributed SQL query engine |
| Flink | Real-time stream processing |
| Hue | Web UI for Hadoop ecosystem |

🧪 Cluster Types

| Type | Description |
|---|---|
| Transient Cluster | Runs its submitted steps, then shuts down automatically |
| Long-running Cluster | Always-on for streaming, multi-job workloads |
| EMR on EKS | Run Spark on Kubernetes |
| EMR Serverless | No cluster management; pay for the vCPU, memory, and storage your jobs use |
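
As a rough illustration of the transient pattern, the sketch below defines a cluster that runs a single Spark step and then terminates. It assumes the default EMR roles already exist in the account; the cluster name, subnet ID, and script path are hypothetical placeholders.

```hcl
# Sketch of a transient cluster: runs one Spark step, then shuts down.
resource "aws_emr_cluster" "transient" {
  name          = "example-transient-cluster"
  release_label = "emr-6.13.0"
  applications  = ["Spark"]
  service_role  = "EMR_DefaultRole"

  # Terminate once all steps have finished instead of staying up.
  keep_job_flow_alive_when_no_steps = false

  ec2_attributes {
    subnet_id        = "subnet-0abc123456"
    instance_profile = "EMR_EC2_DefaultRole"
  }

  master_instance_group {
    instance_type = "m5.xlarge"
  }

  core_instance_group {
    instance_type  = "m5.xlarge"
    instance_count = 2
  }

  step {
    name              = "example-spark-job"
    action_on_failure = "TERMINATE_CLUSTER"

    hadoop_jar_step {
      jar = "command-runner.jar"
      # Hypothetical script location.
      args = ["spark-submit", "s3://my-bucket/jobs/etl.py"]
    }
  }
}
```

For a long-running cluster, the same resource would set `keep_job_flow_alive_when_no_steps = true`; an `auto_termination_policy` block with an `idle_timeout` (in seconds) is one way to keep an idle cluster from running indefinitely.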

🔐 Security

| Feature | Description |
|---|---|
| IAM Roles | Control access to EMR and underlying services |
| Kerberos Authentication | Secure access to cluster services |
| TLS In-Transit Encryption | Secures communication between nodes |
| At-Rest Encryption | Uses S3 SSE, EBS encryption, and HDFS encryption |
| Private Subnets / VPC | Full network isolation with security groups |
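
The encryption rows above are typically switched on through an EMR security configuration. The following is a minimal sketch using the AWS provider's `aws_emr_security_configuration` resource; the configuration name and KMS key ARN are placeholders, and in-transit encryption is left off here because enabling it also requires TLS certificate settings that are out of scope for this example.

```hcl
resource "aws_emr_security_configuration" "example" {
  name = "example-emr-security-config"

  configuration = jsonencode({
    EncryptionConfiguration = {
      EnableAtRestEncryption    = true
      EnableInTransitEncryption = false

      AtRestEncryptionConfiguration = {
        # Encrypts data that EMR writes to S3.
        S3EncryptionConfiguration = {
          EncryptionMode = "SSE-S3"
        }
        # Encrypts local volumes on the cluster nodes.
        LocalDiskEncryptionConfiguration = {
          EncryptionKeyProviderType = "AwsKms"
          # Placeholder key ARN; substitute a real KMS key.
          AwsKmsKey = "arn:aws:kms:us-east-1:123456789012:key/EXAMPLE-KEY-ID"
        }
      }
    }
  })
}
```

A cluster would pick this up by setting `security_configuration = aws_emr_security_configuration.example.name` on its `aws_emr_cluster` resource.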

📊 Monitoring & Debugging

| Tool | What It Does |
|---|---|
| Amazon CloudWatch | Logs, metrics for cluster and apps |
| Ganglia | Real-time performance monitoring |
| Spark UI | Visual interface for Spark job DAGs |
| YARN Resource Manager UI | View Hadoop resource usage |
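
As one example of putting the CloudWatch metrics to work, the sketch below raises an alarm when a cluster has reported itself idle for 30 minutes. It assumes the `IsIdle` metric in the `AWS/ElasticMapReduce` namespace and a cluster resource named `aws_emr_cluster.example`; the alarm name and the SNS topic are hypothetical and assumed to be defined elsewhere.

```hcl
resource "aws_cloudwatch_metric_alarm" "emr_idle" {
  alarm_name          = "example-emr-cluster-idle"
  alarm_description   = "EMR cluster has been idle for 30 minutes"
  namespace           = "AWS/ElasticMapReduce"
  metric_name         = "IsIdle"
  statistic           = "Minimum"
  comparison_operator = "GreaterThanOrEqualToThreshold"
  threshold           = 1

  # Six consecutive 5-minute periods = 30 minutes of idleness.
  period             = 300
  evaluation_periods = 6

  dimensions = {
    JobFlowId = aws_emr_cluster.example.id
  }

  # Hypothetical SNS topic for notifications, assumed to exist.
  alarm_actions = [aws_sns_topic.alerts.arn]
}
```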

💰 Pricing

| Component | Billed By |
|---|---|
| EC2 Instances | Per instance-hour |
| EMR Cost | Per node, per hour, on top of the EC2 charge (~$0.015–$0.27/hour) |
| S3 Storage | Billed separately |
| Spot Support | ✅ Supported (save up to 90% compared with On-Demand) |

🧠 Tip: Use EMR Instance Fleets or a Spot + On-Demand mix for cost optimization.
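
The tip above can be sketched with instance fleets, where each fleet declares separate On-Demand and Spot capacity targets. This is an illustrative configuration only: the cluster name, subnet, capacity targets, instance types, and bid percentage are assumptions chosen for the example.

```hcl
resource "aws_emr_cluster" "fleet_example" {
  name          = "example-fleet-cluster"
  release_label = "emr-6.13.0"
  applications  = ["Spark"]
  service_role  = "EMR_DefaultRole"

  ec2_attributes {
    subnet_id        = "subnet-0abc123456"
    instance_profile = "EMR_EC2_DefaultRole"
  }

  # Keep the master node On-Demand so the cluster itself is not interrupted.
  master_instance_fleet {
    target_on_demand_capacity = 1

    instance_type_configs {
      instance_type = "m5.xlarge"
    }
  }

  core_instance_fleet {
    name                      = "core-fleet"
    target_on_demand_capacity = 2
    target_spot_capacity      = 2

    instance_type_configs {
      instance_type     = "m5.xlarge"
      weighted_capacity = 1
    }

    instance_type_configs {
      instance_type     = "m5.2xlarge"
      weighted_capacity = 2

      # Cap the Spot price at 80% of the On-Demand price.
      bid_price_as_percentage_of_on_demand_price = 80
    }

    launch_specifications {
      spot_specification {
        allocation_strategy      = "capacity-optimized"
        timeout_action           = "SWITCH_TO_ON_DEMAND"
        timeout_duration_minutes = 10
      }
    }
  }
}
```

Listing more than one instance type gives EMR room to fill the Spot target from whichever pool has capacity, and `SWITCH_TO_ON_DEMAND` keeps provisioning from stalling if no Spot capacity is available within the timeout.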


🛠️ Terraform Example: EMR Cluster with Spark

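resource "aws_emr_cluster" "example" {
  name          = "example-emr-cluster"
  release_label = "emr-6.13.0"
  applications  = ["Spark", "Hive"]

  # IAM role that the EMR service itself assumes.
  service_role = "EMR_DefaultRole"

  # S3 location for cluster and step logs.
  log_uri = "s3://my-emr-logs/"

  ec2_attributes {
    key_name  = "my-key"
    subnet_id = "subnet-0abc123456"

    # These security groups are assumed to be defined elsewhere in the configuration.
    emr_managed_master_security_group = aws_security_group.master.id
    emr_managed_slave_security_group  = aws_security_group.core.id

    # Instance profile for the cluster's EC2 instances (the "job flow role").
    # Terraform sets it here; aws_emr_cluster has no job_flow_role argument.
    instance_profile = "EMR_EC2_DefaultRole"
  }

  master_instance_group {
    instance_type  = "m5.xlarge"
    instance_count = 1
  }

  core_instance_group {
    instance_type  = "m5.xlarge"
    instance_count = 2
  }

  # Runs on every node before the applications are installed.
  # The provider requires a name for each bootstrap action.
  bootstrap_action {
    name = "install-custom-libs"
    path = "s3://my-bucket/scripts/install-custom-libs.sh"
  }

  # Spark defaults applied cluster-wide.
  configurations_json = <<EOF
[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.executor.memory": "4g"
    }
  }
]
EOF
}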

✅ TL;DR Summary

| Feature | Amazon EMR |
|---|---|
| Managed Frameworks | Spark, Hive, HBase, Flink, Presto, etc. |
| Storage | HDFS, Amazon S3 |
| Compute | EC2, EKS, or EMR Serverless |
| Cost | EC2 + EMR fee (by hour) |
| Security | IAM, VPC, TLS, Kerberos |
| Monitoring | CloudWatch, Spark UI, YARN, Ganglia |
| Terraform Support | ✅ Yes |

🔄 Variants

| Variant | Description |
|---|---|
| EMR on EC2 | Traditional cluster-based EMR |
| EMR on EKS | Spark jobs on Kubernetes (more control) |
| EMR Serverless | No need to manage clusters (pay for the resources your jobs use) |
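
For comparison with the EC2-based cluster, here is a hedged sketch of how the other two variants might be declared in Terraform. The application name, EKS cluster name, and namespace are hypothetical; the EKS cluster and namespace are assumed to exist already.

```hcl
# EMR Serverless: an application that jobs are submitted to; no nodes to size.
resource "aws_emrserverless_application" "example" {
  name          = "example-serverless-app"
  release_label = "emr-6.13.0"
  type          = "spark"

  # Release workers after 15 idle minutes so nothing keeps running between jobs.
  auto_stop_configuration {
    enabled              = true
    idle_timeout_minutes = 15
  }
}

# EMR on EKS: a virtual cluster mapped to a namespace in an existing EKS cluster.
resource "aws_emrcontainers_virtual_cluster" "example" {
  name = "example-virtual-cluster"

  container_provider {
    id   = "my-eks-cluster" # hypothetical EKS cluster name
    type = "EKS"

    info {
      eks_info {
        namespace = "emr-jobs" # hypothetical Kubernetes namespace
      }
    }
  }
}
```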