EMR
What is Amazon EMR?

Amazon EMR (Elastic MapReduce) is a managed big data platform that simplifies running Apache Hadoop, Spark, Hive, HBase, Flink, Presto, and other frameworks on AWS. It lets you process and analyze massive datasets cost-effectively using open-source tools in a scalable, cluster-based environment.

Core Use Cases

| Use Case | Why Use EMR? |
|---|---|
| Big Data Processing (ETL) | Ingest, clean, and transform terabytes to petabytes of data |
| Data Lake Analytics | Analyze S3-stored data using Spark, Hive, Presto |
| Machine Learning Pipelines | Distributed ML training using PySpark, TensorFlow |
| Data Warehousing | SQL-like analytics using Hive/Presto |
| Log & Clickstream Analysis | Real-time and batch log processing |

Architecture Overview

    +--------------------+
    |   S3 (raw data)    |
    +--------------------+
              ↓
      +---------------+
      |  EMR Cluster  |
      | - Master Node |
      | - Core Nodes  |
      | - Task Nodes  |
      +---------------+
              ↓
    Processed data → S3 / RDS / Redshift / DynamoDB

Node Types:

- Master Node: Manages the cluster and schedules jobs
- Core Nodes: Store data in HDFS and run tasks
- Task Nodes: Run tasks only (no HDFS storage)

Supported Frameworks

| Framework | Purpose |
|---|---|
| Apache Spark | Fast, in-memory big data engine |
| Apache Hive | SQL-on-Hadoop for batch processing |
| Apache HBase | NoSQL database over HDFS |
| Presto/Trino | Distributed SQL query engine |
| Apache Flink | Real-time stream processing |
| Hue | Web UI for the Hadoop ecosystem |

Cluster Types

| Type | Description |
|---|---|
| Transient Cluster | Runs its submitted steps, then terminates automatically |
| Long-running Cluster | Always-on for streaming and multi-job workloads |
| EMR on EKS | Run Spark on Kubernetes |
| EMR Serverless | No cluster management; pay for the vCPU, memory, and storage your jobs use |

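To make the transient pattern concrete, here is a minimal Python sketch that builds the request body for the EMR `RunJobFlow` API. The function name, step name, bucket paths, and subnet ID are hypothetical placeholders, and the role names assume the default EMR roles already exist in the account. The key setting is `KeepJobFlowAliveWhenNoSteps`: when it is `False`, the cluster terminates once its steps finish.

```python
def transient_cluster_request(name, log_uri, script_s3_uri, subnet_id):
    """Build RunJobFlow parameters for a cluster that runs one Spark step, then terminates.

    All argument values are supplied by the caller; nothing here talks to AWS.
    """
    return {
        "Name": name,
        "ReleaseLabel": "emr-6.13.0",
        "Applications": [{"Name": "Spark"}],
        "LogUri": log_uri,
        "ServiceRole": "EMR_DefaultRole",      # assumed default service role
        "JobFlowRole": "EMR_EC2_DefaultRole",  # assumed default EC2 instance profile
        "Instances": {
            "Ec2SubnetId": subnet_id,
            # False makes the cluster transient: it shuts down when no steps remain.
            "KeepJobFlowAliveWhenNoSteps": False,
            "InstanceGroups": [
                {"Name": "Master", "InstanceRole": "MASTER",
                 "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"Name": "Core", "InstanceRole": "CORE",
                 "InstanceType": "m5.xlarge", "InstanceCount": 2},
            ],
        },
        "Steps": [
            {
                "Name": "spark-job",  # hypothetical step name
                "ActionOnFailure": "TERMINATE_CLUSTER",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": ["spark-submit", "--deploy-mode", "cluster", script_s3_uri],
                },
            }
        ],
    }


request = transient_cluster_request(
    name="nightly-etl",
    log_uri="s3://my-emr-logs/",
    script_s3_uri="s3://my-bucket/jobs/etl.py",
    subnet_id="subnet-0abc123456",
)
print(request["Instances"]["KeepJobFlowAliveWhenNoSteps"])  # False
```

With AWS credentials configured, the dictionary could be submitted with `boto3.client("emr").run_job_flow(**request)`. Setting `KeepJobFlowAliveWhenNoSteps` to `True` instead would give a long-running cluster that waits for further steps.
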
Security

| Feature | Description |
|---|---|
| IAM Roles | Control access to EMR and underlying services |
| Kerberos Authentication | Secure access to cluster services |
| TLS In-Transit Encryption | Secures communication between nodes |
| At-Rest Encryption | Uses S3 SSE, EBS encryption, and HDFS encryption |
| Private Subnets / VPC | Network isolation using private subnets and security groups |

Monitoring & Debugging

| Tool | What It Does |
|---|---|
| Amazon CloudWatch | Metrics and logs for the cluster and its applications |
| Ganglia | Real-time performance monitoring (EMR 6.x and earlier releases) |
| Spark UI | Visual interface for Spark job DAGs |
| YARN Resource Manager UI | View Hadoop resource usage |

Pricing

| Component | Billed By |
|---|---|
| EC2 Instances | Per instance-hour |
| EMR Cost | Per node, per hour, on top of the EC2 price (~$0.015 to $0.27 per hour, depending on instance type) |
| S3 Storage | Billed separately |
| Spot Support | Supported (save up to 90% on the EC2 price) |

Tip: Use EMR instance fleets or a mix of Spot and On-Demand instances to optimize cost.

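The pricing model above adds up as the EC2 price plus the EMR fee, per node, per hour. The Python sketch below works through that arithmetic for a cluster that mixes On-Demand and Spot nodes. Every rate and the 60% Spot discount are made-up placeholders rather than real AWS prices, and `NodeGroup`, `hourly_cost`, and `cluster_cost` are illustrative names. It assumes the Spot discount applies only to the EC2 portion, with the EMR fee charged in full.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class NodeGroup:
    """A group of identical instances; all rates are hypothetical placeholders."""
    count: int
    ec2_hourly_usd: float       # On-Demand EC2 price per instance-hour
    emr_hourly_usd: float       # EMR fee per instance-hour
    spot_discount: float = 0.0  # fraction saved on the EC2 price when using Spot


def hourly_cost(group: NodeGroup) -> float:
    """EC2 charge (after any Spot discount) plus the EMR fee for the whole group."""
    ec2 = group.ec2_hourly_usd * (1.0 - group.spot_discount)
    return group.count * (ec2 + group.emr_hourly_usd)


def cluster_cost(groups: Iterable[NodeGroup], hours: float) -> float:
    """Total cost in USD for running every group for the given number of hours."""
    return round(sum(hourly_cost(g) for g in groups) * hours, 2)


groups = [
    NodeGroup(count=1, ec2_hourly_usd=0.20, emr_hourly_usd=0.05),  # master, On-Demand
    NodeGroup(count=2, ec2_hourly_usd=0.20, emr_hourly_usd=0.05),  # core, On-Demand
    NodeGroup(count=4, ec2_hourly_usd=0.20, emr_hourly_usd=0.05,
              spot_discount=0.60),                                  # task, Spot
]
total = cluster_cost(groups, hours=10)
print(total)  # 12.7
```

With these placeholder numbers, the four Spot task nodes cost about as much per hour as the two On-Demand core nodes. Task nodes hold no HDFS data, which makes them the natural candidates for Spot capacity.
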
Example Terraform configuration for a small Spark and Hive cluster:

    # Assumes aws_security_group.master and aws_security_group.core are defined elsewhere.
    resource "aws_emr_cluster" "example" {
      name          = "example-emr-cluster"
      release_label = "emr-6.13.0"
      applications  = ["Spark", "Hive"]
      service_role  = "EMR_DefaultRole"
      log_uri       = "s3://my-emr-logs/"

      ec2_attributes {
        key_name                          = "my-key"
        subnet_id                         = "subnet-0abc123456"
        emr_managed_master_security_group = aws_security_group.master.id
        emr_managed_slave_security_group  = aws_security_group.core.id
        # EC2 instance profile assumed by the cluster nodes (the job flow role).
        instance_profile                  = "EMR_EC2_DefaultRole"
      }

      master_instance_group {
        instance_type  = "m5.xlarge"
        instance_count = 1
      }

      core_instance_group {
        instance_type  = "m5.xlarge"
        instance_count = 2
      }

      # Script run on each node while the cluster is provisioned.
      bootstrap_action {
        name = "install-custom-libs"
        path = "s3://my-bucket/scripts/install-custom-libs.sh"
      }

      # Cluster-wide Spark defaults.
      configurations_json = <<-EOF
        [
          {
            "Classification": "spark-defaults",
            "Properties": {
              "spark.executor.memory": "4g"
            }
          }
        ]
      EOF
    }

TL;DR Summary

| Feature | Amazon EMR |
|---|---|
| Managed Frameworks | Spark, Hive, HBase, Flink, Presto, etc. |
| Storage | HDFS, Amazon S3 |
| Compute | EC2, EKS, or EMR Serverless |
| Cost | EC2 price + EMR fee (per hour) |
| Security | IAM, VPC, TLS, Kerberos |
| Monitoring | CloudWatch, Spark UI, YARN, Ganglia |
| Terraform Support | Yes |

Variants

| Variant | Description |
|---|---|
| EMR on EC2 | Traditional cluster-based EMR |
| EMR on EKS | Spark jobs on Kubernetes (more control) |
| EMR Serverless | No clusters to manage; pay for the resources your jobs use |