Redshift

🧠 What is Amazon Redshift?¶

Amazon Redshift is a fully managed, petabyte-scale data warehouse service designed for OLAP (analytical) workloads. It enables you to run complex queries across large datasets stored in Redshift, S3 (via Redshift Spectrum), and other sources using standard SQL.

✅ It's optimized for high-performance querying, data warehousing, and business intelligence (BI) reporting.

🧰 Common Use Cases¶

Use Case	Why Redshift?
📊 Business Intelligence	Fast queries on structured datasets
🛢️ Centralized Data Lake	Combine S3 + Redshift via Redshift Spectrum
🔁 ETL Processing	Load/transform data from operational DBs
🧪 Analytics over semi-structured data	Supports JSON, Parquet, Avro, etc.
🔗 Integration with BI tools	QuickSight, Tableau, Power BI

🧱 Core Components¶

Component	Description
Cluster	A collection of nodes managed as a single data warehouse
Leader Node	Receives queries, plans execution, and aggregates results
Compute Nodes	Store data and perform query processing
Node Types	Dense Storage (DS2) and Dense Compute (DC2), RA3 (scalable managed storage)
Data Distribution	EVEN, KEY, or ALL to manage data placement

🏗️ Architecture¶

           +--------------------------+
           |     BI Tool / SQL       |
           +------------+------------+
                        ↓
                 +------+------+
                 | Leader Node |
                 +------+------+
                        ↓
           +--------------------------+
           |    Compute Nodes (RA3)   |
           +--------------------------+
                     ↓   ↓   ↓
               Data Blocks in Managed Storage

Leader Node handles SQL parsing and query planning.
Compute Nodes execute the query.
RA3 nodes separate storage and compute, allowing you to scale independently.

📦 Redshift Spectrum¶

Redshift Spectrum lets you query data in Amazon S3 without loading it into Redshift.

🔥 Great for querying raw log files or cold storage without duplicating data.

Supports Parquet, ORC, CSV, JSON
Uses Glue Data Catalog for schema
Ideal for data lake + warehouse hybrid

🔐 Security¶

Feature	Description
IAM Policies	Control Redshift cluster access
VPC Security Groups	Define inbound/outbound traffic
KMS Encryption	Encrypt data at rest (built-in or custom key)
SSL in Transit	Protect client connections to Redshift
Audit Logging	Enable logging to S3 for access/query events
Row-level Security (RLS)	Fine-grained access control for rows (preview)

🧪 Performance Features¶

Feature	Description
Columnar Storage	Fast scanning, great compression
Data Compression	Built-in encoding for performance & cost savings
Query Optimization	Advanced planner, result caching, materialized views
Concurrency Scaling	Auto-spawns temporary clusters for burst workloads
Workload Management (WLM)	Assign queue priorities to workloads (ETL vs dashboard queries)
Sort Keys + Distribution	Improves join/filter performance

🛠️ Terraform Example – Redshift Cluster (RA3)¶

resource "aws_redshift_subnet_group" "example" {
  name       = "example-subnet-group"
  subnet_ids = ["subnet-12345", "subnet-67890"]
}

resource "aws_redshift_cluster" "example" {
  cluster_identifier = "my-redshift-cluster"
  node_type          = "ra3.4xlarge"
  number_of_nodes    = 2

  database_name      = "analytics"
  master_username    = "admin"
  master_password    = "StrongPassword123"

  cluster_subnet_group_name = aws_redshift_subnet_group.example.name
  publicly_accessible       = false
  encrypted                 = true

  tags = {
    Environment = "prod"
  }
}

💰 Pricing (2024 Overview)¶

Type	Cost (approx)
RA3.4xlarge (on-demand)	~$3.26/hour/node
Managed Storage (RA3)	~$0.024/GB-month (billed separately)
Concurrency Scaling	Free for 1 hour/day (then billed)
Spectrum (S3 query)	$5 per TB scanned
Reserved Instances	Save up to 75% on 1–3 year term

🧠 Tip: Use compression + Spectrum to keep storage low and only query active data.

🎯 Comparison with Other Services¶

Use Case	Use This
Petabyte-scale analytics	✅ Amazon Redshift
Real-time streaming analytics	Amazon Kinesis + Redshift
Operational DB (OLTP)	Amazon RDS / Aurora
S3 SQL-like ad hoc queries	Amazon Athena
Massive joins with data lake	Redshift Spectrum

✅ TL;DR Summary¶

Feature	Amazon Redshift
Query Engine	PostgreSQL-compatible SQL
Storage Format	Columnar, compressed
Serverless Mode	✅ Available (since 2022)
Best For	Large-scale analytics, BI workloads
Integrations	QuickSight, Glue, S3, Kinesis, Lambda
Terraform Support	✅ Fully supported

🔄 Optional: Redshift Serverless (2024)¶

Benefit	Description
Serverless	No cluster to manage
Pay-per-query	You pay for Redshift Processing Units (RPUs)
Easy to scale	Great for unpredictable workloads
Terraform Support	✅ Supported via `aws_redshiftserverless_*` resources