Lake Formation
🧠 What is AWS Lake Formation?
AWS Lake Formation is a fully managed service that makes it easier to set up, secure, and manage a data lake. It builds on the AWS Glue Data Catalog and adds centralized, fine-grained access control and governance on top of it.
✅ Lake Formation enables you to ingest, catalog, secure, and share structured and unstructured data at scale using a central metadata catalog and security model.
🏗️ Lake Formation Architecture Overview
                       +----------------------------+
                       |    Centralized Catalog     |
                       |  (Glue + Lake Formation)   |
                       +-------------+--------------+
                                     |
              +----------------------+----------------------+
              |                                             |
    +--------------------+                       +----------------------+
    |    S3 Data Lake    |                       | Registered Locations |
    +--------------------+                       +----------------------+
              |                                             |
              +----------------------+----------------------+
                                     |
                        Fine-Grained Access Control
                                     ↓
        +------------+------------+-------------+----------------+
        |   Athena   |  Redshift  |  EMR/Spark  | LakeHouse Apps |
        +------------+------------+-------------+----------------+
🧰 Key Use Cases
| Use Case | Why Use Lake Formation? |
|---|---|
| 🛡️ Secure S3 data lake | Fine-grained column/row-level access |
| 🗂️ Unified data catalog | Glue-backed catalog with permissions |
| 👥 Cross-account data sharing | Share cataloged tables securely |
| 📊 SQL analytics | Query governed data with Athena or Redshift Spectrum |
| 💾 Data ingestion and transformation | Works with Glue ETL, Glue Studio, and other tools |
🔐 Key Security Features
| Feature | Description |
|---|---|
| Table-level access | Control access to specific tables |
| Column-level access | Restrict access to specific columns (e.g., hiding PII columns from analysts) |
| Row-level filtering | Provide filtered views of datasets (e.g., region='India') |
| Tag-based access control (LF-TBAC) | Apply access based on data classification tags |
| Cross-account access | Share Glue tables securely across accounts |
| Auditing | Records data access and permission changes in AWS CloudTrail |
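
Tag-based access control from the table above can be sketched with boto3 as follows. This is a minimal illustration, not a definitive setup: the tag key `classification`, its values, and the helper names `build_tag_grant` and `apply_tag_based_access` are assumptions made for the example, and `client` is expected to be `boto3.client("lakeformation")`.

```python
def build_tag_grant(principal_arn, tag_key, tag_values,
                    resource_type="TABLE", permissions=("SELECT",)):
    """Build keyword arguments for grant_permissions using an LF-Tag expression.

    Hypothetical helper: the principal gets the permissions on every resource
    of resource_type whose LF-Tags match the expression.
    """
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "LFTagPolicy": {
                "ResourceType": resource_type,
                "Expression": [
                    {"TagKey": tag_key, "TagValues": list(tag_values)}
                ],
            }
        },
        "Permissions": list(permissions),
    }


def apply_tag_based_access(client, database, table, principal_arn):
    """Create a tag, attach it to one table, and grant access by tag."""
    # Assumed classification scheme; pick keys and values that fit your data.
    client.create_lf_tag(TagKey="classification", TagValues=["public", "pii"])
    client.add_lf_tags_to_resource(
        Resource={"Table": {"DatabaseName": database, "Name": table}},
        LFTags=[{"TagKey": "classification", "TagValues": ["public"]}],
    )
    client.grant_permissions(
        **build_tag_grant(principal_arn, "classification", ["public"])
    )
```

Because the grant targets a tag expression rather than a named table, any table tagged `classification=public` later is covered without a new grant.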
📁 Data Lake Locations
Before Lake Formation can manage data in S3, you must register the S3 bucket (or a prefix within it) as a data lake location.
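
Registration can also be done from Python. The sketch below assumes boto3 and a bucket named `my-data-lake-bucket`; the helpers `build_register_request` and `register_location` are hypothetical names introduced here, and `client` is expected to be `boto3.client("lakeformation")`.

```python
def build_register_request(bucket, prefix="", role_arn=None):
    """Build keyword arguments for the Lake Formation register_resource call.

    With no role_arn, the service-linked role is requested; otherwise the
    given IAM role is used to access the location.
    """
    arn = f"arn:aws:s3:::{bucket}"
    if prefix:
        # Register only a prefix ("folder") instead of the whole bucket.
        arn += "/" + prefix.strip("/")
    request = {"ResourceArn": arn}
    if role_arn:
        request["RoleArn"] = role_arn
    else:
        request["UseServiceLinkedRole"] = True
    return request


def register_location(client, bucket, prefix="", role_arn=None):
    """Register the location using a boto3 Lake Formation client."""
    return client.register_resource(
        **build_register_request(bucket, prefix, role_arn)
    )
```

Keeping the request construction separate from the API call makes the ARN logic easy to check without AWS credentials.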
🧠 How Lake Formation Works (Simplified Workflow)
- Register S3 location (data lake location)
- Use Glue Crawlers to catalog the data
- Apply Lake Formation permissions on cataloged databases/tables
- Query with Athena/Redshift/EMR using governed access
- Audit and monitor access
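
The five steps above can be sketched end to end with boto3. This is an illustrative outline under stated assumptions, not a production script: every value in `cfg` is a placeholder, `run_workflow` and the two `build_*` helpers are hypothetical names, and the four arguments are expected to be boto3 clients for `lakeformation`, `glue`, `athena`, and `cloudtrail`.

```python
def build_crawler_request(name, role_arn, database, s3_path):
    """Keyword arguments for glue.create_crawler (step 2)."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


def build_query_request(sql, database, output_location):
    """Keyword arguments for athena.start_query_execution (step 4)."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def run_workflow(lakeformation, glue, athena, cloudtrail, cfg):
    # Step 1: register the S3 location with Lake Formation.
    lakeformation.register_resource(
        ResourceArn=cfg["location_arn"], UseServiceLinkedRole=True
    )
    # Step 2: catalog the data. The crawler role is assumed to already hold
    # the Lake Formation permissions it needs on the database and location.
    glue.create_crawler(**build_crawler_request(
        cfg["crawler"], cfg["crawler_role_arn"], cfg["database"], cfg["s3_path"]
    ))
    glue.start_crawler(Name=cfg["crawler"])
    # A real script would wait here until the crawler run has finished.
    # Step 3: grant table-level read access to the analyst principal.
    lakeformation.grant_permissions(
        Principal={"DataLakePrincipalIdentifier": cfg["analyst_arn"]},
        Resource={"Table": {"DatabaseName": cfg["database"],
                            "Name": cfg["table"]}},
        Permissions=["SELECT"],
    )
    # Step 4: query the governed table through Athena.
    sql = "SELECT * FROM {} LIMIT 10".format(cfg["table"])
    query = athena.start_query_execution(**build_query_request(
        sql, cfg["database"], cfg["athena_output"]
    ))
    # Step 5: review recent Lake Formation events recorded by CloudTrail.
    events = cloudtrail.lookup_events(
        LookupAttributes=[{"AttributeKey": "EventSource",
                           "AttributeValue": "lakeformation.amazonaws.com"}],
        MaxResults=20,
    )
    return query, events
```

Passing the clients in as arguments keeps the sketch free of credentials and region settings, which differ per environment.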
⚙️ Integration with AWS Services
| Service | Integration Description |
|---|---|
| AWS Glue | Shares Glue catalog and works with Glue ETL jobs |
| Amazon S3 | Stores the actual raw and transformed data |
| Amazon Athena | Supports row-level/column-level access via Lake Formation |
| Amazon Redshift Spectrum | Accesses governed S3 data via Lake Formation |
| Amazon EMR (Spark/Hive) | Can run governed queries with Lake Formation credentials |
🔁 Comparison: Lake Formation vs Glue Catalog Only
| Feature | AWS Glue Catalog Only | Lake Formation |
|---|---|---|
| Table/DB catalog | ✅ Yes | ✅ Yes |
| Column-level access control | ❌ No | ✅ Yes |
| Row-level security | ❌ No | ✅ Yes |
| Cross-account data sharing | ❌ Manual | ✅ Simplified |
| Auditing | ⚠️ Catalog API calls only (CloudTrail) | ✅ Yes (data access and permission events in CloudTrail) |
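
To make the cross-account row concrete: with Lake Formation, sharing a table comes down to a grant whose principal is the consumer's AWS account ID. The sketch below only builds the arguments for the boto3 `grant_permissions` call; the helper name `build_cross_account_grant`, the account IDs, and the database and table names are placeholders assumed for illustration.

```python
def build_cross_account_grant(owner_account_id, consumer_account_id,
                              database, table):
    """Build grant_permissions arguments that share one table with another account.

    The grant option lets the consumer account's data lake administrator
    pass the permissions on to its own users and roles.
    """
    return {
        "Principal": {"DataLakePrincipalIdentifier": consumer_account_id},
        "Resource": {
            "Table": {
                "CatalogId": owner_account_id,
                "DatabaseName": database,
                "Name": table,
            }
        },
        "Permissions": ["SELECT", "DESCRIBE"],
        "PermissionsWithGrantOption": ["SELECT", "DESCRIBE"],
    }


# Usage, assuming client = boto3.client("lakeformation") in the owner account:
# client.grant_permissions(
#     **build_cross_account_grant("111122223333", "444455556666",
#                                 "sales_db", "customers"))
```

On the consumer side, the shared table typically still has to be made visible in the local catalog (for example through a resource link) before it can be queried.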
🔐 Example Use Case: Column-Level Filtering
Only allow analysts to query non-PII columns:
Table: customers
Columns: customer_id, name, email, phone_number, country
Admin:
- Full access to all columns
Analyst:
- Read access to customer_id, country
- No access to name, email, phone_number
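This can be enforced by granting the analyst `SELECT` on only the allowed columns, or by attaching a Lake Formation data filter that lists those columns. Note that restricted columns are hidden from the analyst entirely rather than returned with masked values.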
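
A minimal boto3-oriented sketch of the analyst setup for the `customers` table follows. The column names come from the example above; the database name `sales_db`, the filter name, the catalog ID, and the helper names are assumptions introduced for illustration. The first builder produces arguments for `grant_permissions`, the second for `create_data_cells_filter`.

```python
ALL_COLUMNS = ["customer_id", "name", "email", "phone_number", "country"]
PII_COLUMNS = {"name", "email", "phone_number"}


def analyst_columns(all_columns, pii_columns):
    """Return the columns an analyst may read, preserving table order."""
    return [c for c in all_columns if c not in pii_columns]


def build_column_grant(principal_arn, database, table, columns):
    """grant_permissions arguments: SELECT on the listed columns only."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": list(columns),
            }
        },
        "Permissions": ["SELECT"],
    }


def build_data_filter(catalog_id, database, table, name, row_expression, columns):
    """create_data_cells_filter arguments combining row and column limits."""
    return {
        "TableData": {
            "TableCatalogId": catalog_id,
            "DatabaseName": database,
            "TableName": table,
            "Name": name,
            "RowFilter": {"FilterExpression": row_expression},
            "ColumnNames": list(columns),
        }
    }


allowed = analyst_columns(ALL_COLUMNS, PII_COLUMNS)
# Usage, assuming client = boto3.client("lakeformation"):
# client.grant_permissions(**build_column_grant(
#     "arn:aws:iam::111122223333:role/Analyst", "sales_db", "customers", allowed))
# client.create_data_cells_filter(**build_data_filter(
#     "111122223333", "sales_db", "customers", "analyst_india",
#     "country = 'India'", allowed))
```

The column grant alone covers this use case; the data filter variant is only needed when rows must be restricted as well.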
💰 Pricing (2024)
| Feature | Pricing |
|---|---|
| Permissions & governance | Free |
| Glue Data Catalog storage | First 1M objects free, then ~$1 per 100,000 objects per month |
| Cross-account sharing | Free |
| Athena / Redshift Spectrum / EMR | Standard charges apply (based on services used) |
🛠️ Terraform Support?
Lake Formation is supported by the official Terraform AWS provider (`hashicorp/aws`); no community or custom provider is required:
- Data locations can be registered with `aws_lakeformation_resource`
- Permissions can be applied with `aws_lakeformation_permissions`
- Data lake administrators and LF-Tags can be managed with `aws_lakeformation_data_lake_settings` and `aws_lakeformation_lf_tag`
- The Glue catalog resources (database/table/crawler/job) are supported and integrate with Lake Formation.
📌 Example to register a data location (via CLI):
    aws lakeformation register-resource \
        --resource-arn arn:aws:s3:::my-data-lake-bucket \
        --use-service-linked-role
✅ TL;DR Summary
| Feature | AWS Lake Formation |
|---|---|
| Metadata Catalog | ✅ Built on top of Glue Catalog |
| Column-Level Security | ✅ Yes |
| Row-Level Filtering | ✅ Yes |
| Integration | Athena, Redshift Spectrum, EMR, Glue |
| Cross-Account Sharing | ✅ Simplified sharing of tables |
| Terraform Support | ✅ Yes (`aws_lakeformation_*` and Glue resources in the official AWS provider) |
| Ideal For | Secure, governed, scalable data lake |