Skip to content

Athena

🧠 What is AWS Athena?

Amazon Athena is a serverless, interactive query service that lets you analyze data directly in Amazon S3 using standard SQL.

βœ… You don’t need to manage infrastructure β€” just point Athena to your data in S3, define a schema, and start querying.


πŸ” Key Highlights

Feature Description
Query Engine Presto (open-source distributed SQL engine)
Query Language ANSI SQL (standard SQL)
Serverless? βœ… Yes
Data Source Amazon S3 (default)
Schema Management AWS Glue Data Catalog (or inline DDLs)
Output Location Query results stored in S3 (CSV/Parquet)

🧰 Common Use Cases

Scenario Why Use Athena?
πŸ“Š Ad-hoc log analysis Run SQL queries on S3-stored CloudTrail or ELB logs
πŸ§ͺ Data lake exploration Query large data lakes stored in Parquet, ORC, JSON, CSV
🧼 ETL pipeline staging Lightweight pre-processing for EMR, Redshift, or Glue
πŸ•΅οΈ Security investigation Query S3-stored VPC flow logs, GuardDuty alerts, etc.
πŸ’° Cost monitoring Analyze CUR (Cost and Usage Report) files in S3

🧱 Architecture

        +----------------+                 +------------------------+
        |  Data in S3    | <----------+    | AWS Glue Data Catalog  |
        +----------------+           |     +------------------------+
               ↑                     |            ↑
         Partitioned /              |        Table definitions
         Columnar formats           |
               ↑                    |
        +------------------+        |     +-------------------------+
        |   Amazon Athena   |  SQL  +---> | Results stored in S3    |
        +------------------+              +-------------------------+
  • Athena reads data from S3 directly

  • Uses Glue Data Catalog for schema definitions

  • Results are stored in S3 for download or further processing


πŸ“š Supported Data Formats

Format Compression Supported Columnar?
CSV GZIP ❌
JSON GZIP ❌
Parquet GZIP, Snappy βœ…
ORC ZLIB, Snappy βœ…
Avro Deflate, Snappy βœ…
TSV GZIP ❌

βœ… Use Parquet or ORC for best performance (columnar, compressed)


πŸ” Security Features

Layer Description
IAM Policies Restrict Athena API access and S3 read/write
S3 Bucket Policy Restrict access to query and result locations
Encryption at Rest S3-SSE / SSE-KMS for input and output
In-Transit HTTPS (secure transmission)
Workgroups Isolate workloads and enforce limits/quota
Access Control Fine-grained access to Glue catalog/tables

πŸ’° Pricing (as of 2024)

Metric Cost
Data scanned per query $5.00 per TB
Data catalog (Glue) Free for <1 million objects/month
Workgroup control/monitoring Free

🧠 Tip: Use compressed columnar formats + partitioning to reduce scan size!


πŸ› οΈ Terraform Example: Athena Table in Glue Catalog

resource "aws_glue_catalog_database" "example_db" {
  name = "example_database"
}

resource "aws_glue_catalog_table" "example_table" {
  name          = "access_logs"
  database_name = aws_glue_catalog_database.example_db.name
  table_type    = "EXTERNAL_TABLE"

  storage_descriptor {
    location      = "s3://my-bucket/logs/"
    input_format  = "org.apache.hadoop.mapred.TextInputFormat"
    output_format = "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"
    serde_info {
      serialization_library = "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"
      parameters = {
        "field.delim" = "\t"
      }
    }
    columns {
      name = "ip"
      type = "string"
    }
    columns {
      name = "timestamp"
      type = "string"
    }
    columns {
      name = "method"
      type = "string"
    }
  }
}

Now you can query this in Athena:

SELECT * FROM example_database.access_logs WHERE method = 'POST';

πŸ§ͺ Best Practices

Best Practice Why it Matters
Partition your data (e.g., by date) Reduces data scanned and improves performance
Use columnar formats (Parquet/ORC) Faster queries and lower costs
Compress files Smaller scan size = lower cost
Use Workgroups Enforce usage limits and control access
Use Glue Catalog Central schema management for Athena & ETL

βœ… TL;DR Summary

Feature AWS Athena
Query Engine Presto
Serverless? βœ… Yes
Data Source Amazon S3 (also federated sources)
Schema Catalog AWS Glue Data Catalog
Pricing $5/TB scanned
Use Cases Ad hoc analytics, logs, billing
Performance Tip Partitioned + columnar format
Terraform Support βœ… Via Glue Catalog + IAM

If You Need... Use This
Query S3 with SQL βœ… Amazon Athena
Managed Data Warehouse Amazon Redshift
ETL Jobs AWS Glue
Run SQL over streaming data Amazon Kinesis Analytics
Scheduled Athena queries EventBridge + Lambda