Lake Formation

🧠 What is AWS Lake Formation?

AWS Lake Formation is a fully managed service that makes it easier to set up, secure, and manage a data lake. It builds on the AWS Glue Data Catalog and adds centralized, fine-grained access control and governance on top of it.

✅ Lake Formation enables you to ingest, catalog, secure, and share structured and unstructured data at scale using a central metadata catalog and security model.


🏗️ Lake Formation Architecture Overview

                      +----------------------------+
                      |     Centralized Catalog    |
                      |  (Glue + Lake Formation)   |
                      +-------------+--------------+
                                    |
             +----------------------+----------------------+
             |                                             |
   +--------------------+                      +----------------------+
   |  S3 Data Lake      |                      | Registered Locations |
   +--------------------+                      +----------------------+
             |                                             |
             +----------------------+----------------------+
                                    |
                          Fine-Grained Access Control
       +------------+------------+-------------+----------------+
       | Athena     | Redshift   | EMR/Spark   | LakeHouse Apps |
       +------------+------------+-------------+----------------+

🧰 Key Use Cases

| Use Case | Why Use Lake Formation? |
| --- | --- |
| 🛡️ Secure S3 data lake | Fine-grained column/row-level access |
| 🗂️ Unified data catalog | Glue-backed catalog with permissions |
| 👥 Cross-account data sharing | Share cataloged tables securely |
| 📊 SQL analytics | Query governed data with Athena or Redshift Spectrum |
| 💾 Data ingestion and transformation | Works with Glue ETL, Glue Studio, and other tools |

🔐 Key Security Features

| Feature | Description |
| --- | --- |
| Table-level access | Control access to specific tables |
| Column-level access | Restrict access to specific columns (e.g., hide PII columns from analysts) |
| Row-level filtering | Provide filtered views of datasets (e.g., region='India') |
| Tag-based access control (LF-TBAC) | Apply access based on data classification tags |
| Cross-account access | Share Glue tables securely across accounts |
| Auditing | Logs access through AWS CloudTrail, including Lake Formation events |
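
As a rough illustration of the row-level filtering and tag-based access entries above, the following Python sketch builds the request parameters that the boto3 Lake Formation client accepts for create_data_cells_filter and grant_permissions. The account ID, role ARN, database, table, filter name, and tag values are hypothetical placeholders, and the helper function names are invented for this example. Nothing is sent to AWS; the dictionaries are only built and printed.

```python
import json


def build_row_filter(catalog_id, database, table, filter_name, expression):
    """Build keyword arguments for the create_data_cells_filter call."""
    return {
        "TableData": {
            "TableCatalogId": catalog_id,
            "DatabaseName": database,
            "TableName": table,
            "Name": filter_name,
            # Only rows matching this expression are visible through the filter.
            "RowFilter": {"FilterExpression": expression},
            # No excluded columns: every column stays visible, only rows are filtered.
            "ColumnWildcard": {"ExcludedColumnNames": []},
        }
    }


def build_tag_policy_grant(principal_arn, tag_key, tag_values):
    """Build keyword arguments for grant_permissions using an LF-tag expression."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "LFTagPolicy": {
                "ResourceType": "TABLE",
                "Expression": [{"TagKey": tag_key, "TagValues": list(tag_values)}],
            }
        },
        "Permissions": ["SELECT"],
    }


if __name__ == "__main__":
    # All identifiers below are hypothetical placeholders.
    row_filter = build_row_filter(
        "111122223333", "sales", "orders", "india_only", "region='India'"
    )
    tag_grant = build_tag_policy_grant(
        "arn:aws:iam::111122223333:role/Analyst", "classification", ["public"]
    )
    print(json.dumps(row_filter, indent=2))
    print(json.dumps(tag_grant, indent=2))
    # To apply for real (assumes boto3 is installed and credentials are configured):
    #   import boto3
    #   lf = boto3.client("lakeformation")
    #   lf.create_data_cells_filter(**row_filter)
    #   lf.grant_permissions(**tag_grant)
```

Keeping the payload construction separate from the API call makes it possible to review or unit-test the intended permissions before anything changes in the account.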

📁 Data Lake Locations

Before Lake Formation can manage data in S3, you must register the S3 bucket (or a folder path) as a data lake location.

S3 bucket/folder → Registered → Catalog table created → Access permissions applied
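
The registration step can be sketched in Python as follows. This is a minimal example, assuming boto3's register_resource parameters (ResourceArn, UseServiceLinkedRole, RoleArn); the helper names and the bucket name are hypothetical. The code only builds the request and does not call AWS.

```python
def s3_uri_to_arn(s3_uri):
    """Convert an s3://bucket/prefix URI to the ARN form used for registration."""
    prefix = "s3://"
    if not s3_uri.startswith(prefix):
        raise ValueError(f"not an S3 URI: {s3_uri!r}")
    path = s3_uri[len(prefix):].rstrip("/")
    if not path:
        raise ValueError("S3 URI has no bucket name")
    return "arn:aws:s3:::" + path


def build_register_request(s3_uri, role_arn=None):
    """Build keyword arguments for the Lake Formation register_resource call.

    Without a role ARN, the request asks Lake Formation to use its
    service-linked role, mirroring the --use-service-linked-role CLI flag.
    """
    request = {"ResourceArn": s3_uri_to_arn(s3_uri)}
    if role_arn is None:
        request["UseServiceLinkedRole"] = True
    else:
        request["RoleArn"] = role_arn
    return request


if __name__ == "__main__":
    # Hypothetical bucket and folder path.
    print(build_register_request("s3://my-data-lake-bucket/raw/"))
    # To apply for real (assumes boto3 and configured credentials):
    #   import boto3
    #   boto3.client("lakeformation").register_resource(
    #       **build_register_request("s3://my-data-lake-bucket/raw/"))
```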

🧠 How Lake Formation Works (Simplified Workflow)

  1. Register S3 location (data lake location)

  2. Use Glue Crawlers to catalog the data

  3. Apply Lake Formation permissions on cataloged databases/tables

  4. Query with Athena/Redshift/EMR using governed access

  5. Audit and monitor access
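
The five steps above can be sketched as an ordered plan of boto3 calls. This is an illustrative outline, not a complete pipeline: every name (bucket, database, crawler, roles, table, results location) is a hypothetical placeholder, and build_workflow is a helper invented for this example. A real run would also have to wait for the crawler to finish before granting permissions on the table it creates.

```python
def build_workflow(bucket, database, crawler_name, crawler_role_arn,
                   analyst_arn, table, results_uri):
    """Return (service, operation, parameters) tuples in workflow order."""
    return [
        # 1. Register the S3 location with Lake Formation.
        ("lakeformation", "register_resource", {
            "ResourceArn": f"arn:aws:s3:::{bucket}",
            "UseServiceLinkedRole": True,
        }),
        # 2. Catalog the data with a Glue crawler.
        ("glue", "create_crawler", {
            "Name": crawler_name,
            "Role": crawler_role_arn,
            "DatabaseName": database,
            "Targets": {"S3Targets": [{"Path": f"s3://{bucket}/"}]},
        }),
        ("glue", "start_crawler", {"Name": crawler_name}),
        # 3. Grant Lake Formation permissions on the cataloged table.
        ("lakeformation", "grant_permissions", {
            "Principal": {"DataLakePrincipalIdentifier": analyst_arn},
            "Resource": {"Table": {"DatabaseName": database, "Name": table}},
            "Permissions": ["SELECT"],
        }),
        # 4. Query the governed table with Athena.
        ("athena", "start_query_execution", {
            "QueryString": f"SELECT * FROM {table} LIMIT 10",
            "QueryExecutionContext": {"Database": database},
            "ResultConfiguration": {"OutputLocation": results_uri},
        }),
        # 5. Audit Lake Formation activity through CloudTrail.
        ("cloudtrail", "lookup_events", {
            "LookupAttributes": [{
                "AttributeKey": "EventSource",
                "AttributeValue": "lakeformation.amazonaws.com",
            }],
        }),
    ]


if __name__ == "__main__":
    steps = build_workflow(
        bucket="my-data-lake-bucket",
        database="sales",
        crawler_name="sales-crawler",
        crawler_role_arn="arn:aws:iam::111122223333:role/GlueCrawlerRole",
        analyst_arn="arn:aws:iam::111122223333:role/Analyst",
        table="customers",
        results_uri="s3://my-athena-results/",
    )
    # Dry run: print the plan instead of calling AWS.
    for number, (service, operation, _params) in enumerate(steps, start=1):
        print(f"{number}. {service}.{operation}")
```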


⚙️ Integration with AWS Services

| Service | Integration Description |
| --- | --- |
| AWS Glue | Shares Glue catalog and works with Glue ETL jobs |
| Amazon S3 | Stores the actual raw and transformed data |
| Amazon Athena | Supports row-level/column-level access via Lake Formation |
| Amazon Redshift Spectrum | Accesses governed S3 data via Lake Formation |
| Amazon EMR (Spark/Hive) | Can run governed queries with Lake Formation credentials |

🔁 Comparison: Lake Formation vs Glue Catalog Only

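| Feature | AWS Glue Catalog Only | Lake Formation |
| --- | --- | --- |
| Table/DB catalog | ✅ Yes | ✅ Yes |
| Column-level access control | ❌ No | ✅ Yes |
| Row-level security | ❌ No | ✅ Yes |
| Cross-account data sharing | ❌ Manual | ✅ Simplified |
| Auditing | ⚠️ Glue API calls only (CloudTrail) | ✅ Yes (CloudTrail, including Lake Formation events) |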

🔐 Example Use Case: Column-Level Access (Hiding PII)

Only allow analysts to query non-PII columns:

Table: customers
Columns: customer_id, name, email, phone_number, country

Admin:
- Full access to all columns

Analyst:
- Read access to customer_id, country
- No access to name, email, phone_number

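This can be enforced with a column-level SELECT grant in Lake Formation permissions, or with a data filter that lists only the allowed columns. In both cases the excluded columns are hidden from the analyst entirely rather than masked in place.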
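
The analyst grant for the customers table can be sketched in Python as follows. It is a minimal example, assuming boto3's grant_permissions request shape with a TableWithColumns resource; the role ARN, the database name "crm", and the helper names are hypothetical. The code derives the allowed columns and builds the request without calling AWS.

```python
ALL_COLUMNS = ["customer_id", "name", "email", "phone_number", "country"]
PII_COLUMNS = {"name", "email", "phone_number"}


def analyst_columns(all_columns, pii_columns):
    """Return the columns an analyst may read, preserving table order."""
    return [column for column in all_columns if column not in pii_columns]


def build_column_grant(principal_arn, database, table, columns):
    """Build keyword arguments for a column-level SELECT grant."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                # Only these columns are readable; all others stay hidden.
                "ColumnNames": list(columns),
            }
        },
        "Permissions": ["SELECT"],
    }


if __name__ == "__main__":
    allowed = analyst_columns(ALL_COLUMNS, PII_COLUMNS)
    grant = build_column_grant(
        "arn:aws:iam::111122223333:role/Analyst",  # hypothetical role
        "crm",                                     # hypothetical database
        "customers",
        allowed,
    )
    print(grant)
    # To apply for real (assumes boto3 and configured credentials):
    #   import boto3
    #   boto3.client("lakeformation").grant_permissions(**grant)
```

Listing the allowed columns explicitly means a column added to the table later stays hidden from analysts until someone grants it.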


💰 Pricing (2024)

| Feature | Pricing |
| --- | --- |
| Permissions & governance | Free |
| Data Catalog | Free up to 1M objects, then ~$1 per 100,000 objects/month |
| Cross-account sharing | Free |
| Athena / Redshift Spectrum / EMR | Standard charges apply (based on services used) |
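
To make the Data Catalog figure concrete, here is a small Python sketch of the storage arithmetic. It uses only the numbers quoted above (first 1M objects free, then about $1 per 100,000 objects per month) and assumes the charge scales linearly with the object count. Actual billing may round differently and does not cover other charges such as catalog requests, so treat the result as an estimate.

```python
FREE_OBJECTS = 1_000_000   # objects stored free of charge each month
BLOCK_SIZE = 100_000       # pricing unit above the free tier
PRICE_PER_BLOCK = 1.00     # approximate USD per block per month


def monthly_catalog_storage_cost(object_count):
    """Estimate monthly Data Catalog storage cost in USD.

    Assumes linear proration above the free tier (an assumption, not
    a statement of how AWS rounds the bill).
    """
    if object_count < 0:
        raise ValueError("object_count must be non-negative")
    billable = max(0, object_count - FREE_OBJECTS)
    return billable / BLOCK_SIZE * PRICE_PER_BLOCK


if __name__ == "__main__":
    for count in (250_000, 1_000_000, 1_500_000, 3_000_000):
        print(f"{count:>9,} objects: ${monthly_catalog_storage_cost(count):.2f}/month")
```

For example, a catalog with 1.5 million objects has 500,000 billable objects, which is five blocks, or roughly $5 per month under these assumptions.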

🛠️ Terraform Support?

The official Terraform AWS provider (hashicorp/aws) includes Lake Formation resources, although coverage of newer features can lag behind the AWS API:

  • Data locations can be registered with aws_lakeformation_resource, or with the AWS CLI

  • Permissions can be applied with aws_lakeformation_permissions

  • The Glue catalog resources (database/table/crawler/job) are supported and integrate with Lake Formation.

📌 Example to register a data location (via CLI):

aws lakeformation register-resource \
  --resource-arn arn:aws:s3:::my-data-lake-bucket \
  --use-service-linked-role

✅ TL;DR Summary

| Feature | AWS Lake Formation |
| --- | --- |
| Metadata Catalog | ✅ Built on top of Glue Catalog |
| Column-Level Security | ✅ Yes |
| Row-Level Filtering | ✅ Yes |
| Integration | Athena, Redshift Spectrum, EMR, Glue |
| Cross-Account Sharing | ✅ Simplified sharing of tables |
| Terraform Support | ✅ Lake Formation and Glue resources in the official AWS provider |
| Ideal For | Secure, governed, scalable data lake |