Textract
Amazon Textract is a machine learning-based OCR service that automatically extracts printed and handwritten text, tables, forms, and key-value pairs from scanned documents, PDFs, and images, without needing templates or custom code.

✅ It goes beyond basic OCR by capturing the structure and relationships in your documents, such as which text is a form label, which text is its value, and which cells belong to the same table row.

🎯 Key Use Cases

| Industry | Use Case |
|---|---|
| Banking | Extract KYC data from ID documents |
| Healthcare | Digitize handwritten prescriptions or records |
| Insurance | Read claim forms and policy documents |
| HR | Process resumes, forms, or offer letters |
| Legal | Extract clauses from contracts |
| Logistics | Read invoices, receipts, and bills of lading |

| Feature | Description |
|---|---|
| Raw Text | Plain text from PDFs, images, and scans |
| Forms | Key-value pairs (e.g., "Name: John") |
| Tables | Structured rows and columns |
| Handwriting | Handwritten text, including cursive, alongside print |
| Checkboxes | Detects selection state (selected or not selected) |
| Identity Documents | Structured fields from ID cards and licenses |
| Expense Analysis | Extracts vendors, line items, and totals from receipts and invoices |

| API Method | Purpose |
|---|---|
| `DetectDocumentText` | Extract raw lines and words |
| `AnalyzeDocument` | Extract forms, tables, checkboxes |
| `AnalyzeExpense` | Specialized for receipts, invoices |
| `AnalyzeID` | Specialized for identity documents |
| `StartDocumentAnalysis` | Start an asynchronous job for large or multi-page documents |
| `GetDocumentAnalysis` | Retrieve results from asynchronous jobs |

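
As a rough illustration of the simplest method in the table above, the sketch below pulls the text of each `LINE` block out of a `DetectDocumentText`-style response. The helper name `extract_lines` and the hand-built `sample_response` are illustrative assumptions; with real credentials the response would come from `boto3.client('textract').detect_document_text(...)`.

```python
from typing import Any, Dict, List


def extract_lines(response: Dict[str, Any]) -> List[str]:
    """Return the text of every LINE block, in the order Textract listed them."""
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE" and "Text" in block
    ]


# Hand-built stand-in for a real response (assumption: only the fields used here).
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE"},
        {"BlockType": "LINE", "Text": "Invoice 1234"},
        {"BlockType": "WORD", "Text": "Invoice"},
        {"BlockType": "WORD", "Text": "1234"},
        {"BlockType": "LINE", "Text": "Total: 42.00"},
    ]
}

print(extract_lines(sample_response))
```

Filtering on `LINE` rather than `WORD` keeps the reading order Textract already assembled, which is usually what you want when only plain text is needed.
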
🧪 Sample Python Code (Boto3)
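
import boto3

textract = boto3.client('textract')

# Synchronous call: send the bytes of a single-page image inline
with open("form.png", "rb") as document:
    image_bytes = document.read()

response = textract.analyze_document(
    Document={'Bytes': image_bytes},
    FeatureTypes=["FORMS", "TABLES"]
)

# KEY_VALUE_SET blocks have no 'Text' field of their own; their text lives
# in the WORD blocks referenced by their CHILD relationships.
blocks_by_id = {block['Id']: block for block in response['Blocks']}
for block in response['Blocks']:
    if block['BlockType'] == 'KEY_VALUE_SET' and 'KEY' in block.get('EntityTypes', []):
        child_ids = [i for rel in block.get('Relationships', [])
                     if rel['Type'] == 'CHILD' for i in rel['Ids']]
        print(' '.join(blocks_by_id[i]['Text'] for i in child_ids
                       if blocks_by_id[i]['BlockType'] == 'WORD'))

# Asynchronous call: start a job for a multi-page document stored in S3
response = textract.start_document_analysis(
    DocumentLocation={'S3Object': {'Bucket': 'my-bucket', 'Name': 'invoice.pdf'}},
    FeatureTypes=["TABLES", "FORMS"]
)
job_id = response['JobId']
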
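
Starting a job only returns a `JobId`; the results still have to be fetched with `GetDocumentAnalysis`. The following is a minimal polling sketch, assuming a boto3 Textract client is passed in. The function name `wait_for_blocks` and the `poll_seconds` and `max_polls` defaults are hypothetical choices, not values from the Textract documentation.

```python
import time
from typing import Any, Dict, List


def wait_for_blocks(client: Any, job_id: str, poll_seconds: float = 5.0,
                    max_polls: int = 120) -> List[Dict[str, Any]]:
    """Poll GetDocumentAnalysis until the job finishes, then collect every page of blocks."""
    for _ in range(max_polls):
        result = client.get_document_analysis(JobId=job_id)
        status = result["JobStatus"]
        if status == "IN_PROGRESS":
            time.sleep(poll_seconds)
            continue
        if status not in ("SUCCEEDED", "PARTIAL_SUCCESS"):
            raise RuntimeError(f"Textract job {job_id} ended with status {status}")
        blocks = list(result.get("Blocks", []))
        # Large results are paginated; follow NextToken until it disappears.
        while "NextToken" in result:
            result = client.get_document_analysis(
                JobId=job_id, NextToken=result["NextToken"]
            )
            blocks.extend(result.get("Blocks", []))
        return blocks
    raise TimeoutError(f"Textract job {job_id} still running after {max_polls} polls")


# Usage (requires AWS credentials, so it is left commented out):
# import boto3
# blocks = wait_for_blocks(boto3.client("textract"), job_id)
```

Polling keeps the example short. For production pipelines, a completion notification through the `NotificationChannel` parameter of `start_document_analysis` is likely a better fit than a sleep loop.
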
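Textract returns a flat list of blocks that are linked to each other through relationships. A simplified response (real IDs are UUIDs, and each block also carries geometry and confidence data):

{
  "Blocks": [
    {
      "BlockType": "KEY_VALUE_SET",
      "EntityTypes": ["KEY"],
      "Id": "key-1",
      "Relationships": [
        {"Type": "VALUE", "Ids": ["value-1"]},
        {"Type": "CHILD", "Ids": ["word-1"]}
      ]
    },
    {
      "BlockType": "KEY_VALUE_SET",
      "EntityTypes": ["VALUE"],
      "Id": "value-1",
      "Relationships": [{"Type": "CHILD", "Ids": ["word-2", "word-3"]}]
    },
    {"BlockType": "WORD", "Id": "word-1", "Text": "Name"},
    {"BlockType": "WORD", "Id": "word-2", "Text": "John"},
    {"BlockType": "WORD", "Id": "word-3", "Text": "Smith"}
  ]
}
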
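
To turn blocks like these into plain key-value pairs, the relationships have to be followed by ID: a KEY block points at its VALUE block, and both point at the `WORD` (or `SELECTION_ELEMENT`) blocks that hold the actual text. The sketch below shows one way to do that. The helper names and the hand-built `sample_blocks` list are illustrative assumptions, with short made-up IDs standing in for the UUIDs a real response uses.

```python
from typing import Any, Dict, List


def _child_text(block: Dict[str, Any], by_id: Dict[str, Dict[str, Any]]) -> str:
    """Join the text of a block's CHILD blocks (words and selection marks)."""
    parts: List[str] = []
    for rel in block.get("Relationships", []):
        if rel["Type"] != "CHILD":
            continue
        for child_id in rel["Ids"]:
            child = by_id[child_id]
            if child["BlockType"] == "WORD":
                parts.append(child["Text"])
            elif child["BlockType"] == "SELECTION_ELEMENT":
                parts.append(child["SelectionStatus"])  # SELECTED / NOT_SELECTED
    return " ".join(parts)


def extract_key_values(blocks: List[Dict[str, Any]]) -> Dict[str, str]:
    """Map each form key's text to the text of the value block it points to."""
    by_id = {block["Id"]: block for block in blocks}
    pairs: Dict[str, str] = {}
    for block in blocks:
        if block["BlockType"] != "KEY_VALUE_SET" or "KEY" not in block.get("EntityTypes", []):
            continue
        value_text = ""
        for rel in block.get("Relationships", []):
            if rel["Type"] == "VALUE":
                value_text = " ".join(_child_text(by_id[v], by_id) for v in rel["Ids"])
        pairs[_child_text(block, by_id)] = value_text
    return pairs


# Hand-built blocks (assumption: trimmed to the fields this sketch reads).
sample_blocks = [
    {"Id": "k1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"],
     "Relationships": [{"Type": "VALUE", "Ids": ["v1"]},
                       {"Type": "CHILD", "Ids": ["w1"]}]},
    {"Id": "v1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"],
     "Relationships": [{"Type": "CHILD", "Ids": ["w2", "w3"]}]},
    {"Id": "w1", "BlockType": "WORD", "Text": "Name"},
    {"Id": "w2", "BlockType": "WORD", "Text": "John"},
    {"Id": "w3", "BlockType": "WORD", "Text": "Smith"},
]

print(extract_key_values(sample_blocks))  # {'Name': 'John Smith'}
```

Note that duplicate key labels on a form would overwrite each other in this dictionary; a list of pairs would be safer if that can happen in your documents.
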

| Feature | Textract | Simple OCR (e.g., Tesseract) |
|---|---|---|
| Text | ✅ Yes | ✅ Yes |
| Tables | ✅ Yes | ❌ No |
| Forms/Key-Value | ✅ Yes | ❌ No |
| Handwriting | ✅ Yes | Limited |
| Identity Docs | ✅ Prebuilt parser | ❌ Manual |
| Expense Docs | ✅ Yes | ❌ No |

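
To make the Tables row of the comparison concrete: with the `TABLES` feature enabled, each table arrives as a `TABLE` block whose `CELL` children carry 1-based `RowIndex` and `ColumnIndex` fields. The sketch below rebuilds a row-major grid from those fields. The function name `table_to_rows` and the hand-built blocks are illustrative assumptions, and merged cells (row or column spans) are deliberately ignored.

```python
from typing import Any, Dict, List


def table_to_rows(table: Dict[str, Any],
                  by_id: Dict[str, Dict[str, Any]]) -> List[List[str]]:
    """Rebuild a row-major grid of cell text from one TABLE block."""
    cells = [
        by_id[cell_id]
        for rel in table.get("Relationships", []) if rel["Type"] == "CHILD"
        for cell_id in rel["Ids"]
        if by_id[cell_id]["BlockType"] == "CELL"
    ]
    if not cells:
        return []
    n_rows = max(cell["RowIndex"] for cell in cells)
    n_cols = max(cell["ColumnIndex"] for cell in cells)
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for cell in cells:
        words = [
            by_id[word_id]["Text"]
            for rel in cell.get("Relationships", []) if rel["Type"] == "CHILD"
            for word_id in rel["Ids"]
            if by_id[word_id]["BlockType"] == "WORD"
        ]
        # RowIndex and ColumnIndex are 1-based in Textract responses.
        grid[cell["RowIndex"] - 1][cell["ColumnIndex"] - 1] = " ".join(words)
    return grid


def _cell(cell_id: str, row: int, col: int, word_id: str) -> Dict[str, Any]:
    """Build a minimal CELL block for the example data (hypothetical helper)."""
    return {"Id": cell_id, "BlockType": "CELL", "RowIndex": row, "ColumnIndex": col,
            "Relationships": [{"Type": "CHILD", "Ids": [word_id]}]}


sample_blocks = [
    {"Id": "t1", "BlockType": "TABLE",
     "Relationships": [{"Type": "CHILD", "Ids": ["c1", "c2", "c3", "c4"]}]},
    _cell("c1", 1, 1, "w1"), _cell("c2", 1, 2, "w2"),
    _cell("c3", 2, 1, "w3"), _cell("c4", 2, 2, "w4"),
    {"Id": "w1", "BlockType": "WORD", "Text": "Item"},
    {"Id": "w2", "BlockType": "WORD", "Text": "Price"},
    {"Id": "w3", "BlockType": "WORD", "Text": "Pen"},
    {"Id": "w4", "BlockType": "WORD", "Text": "2.50"},
]
by_id = {block["Id"]: block for block in sample_blocks}

print(table_to_rows(by_id["t1"], by_id))  # [['Item', 'Price'], ['Pen', '2.50']]
```
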
🧾 Pricing (2024)
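
| API | Price per Page |
|---|---|
| `DetectDocumentText` | $0.0015/page |
| `AnalyzeDocument` | $0.015/page (tables), $0.05/page (forms) |
| `AnalyzeExpense` | $0.01/page |
| `AnalyzeID` | $0.025/page |

Prices shown are first-tier rates; they vary by region and drop at higher volumes.

🧠 Free Tier: for the first 3 months, 1,000 pages/month with `DetectDocumentText` and 100 pages/month with `AnalyzeDocument`.
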
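
For budgeting, per-page pricing reduces to pages multiplied by rate, summed per API. The sketch below does that arithmetic with the two rates listed above for `DetectDocumentText` ($0.0015) and `AnalyzeID` ($0.025). It assumes flat rates with no volume tiers and no free-tier credit, and the names `PRICE_PER_PAGE` and `estimate_cost` are hypothetical; pass your own rate table for other APIs or regions.

```python
from typing import Dict

# Per-page rates copied from the table above for two of the APIs.
# Assumption: flat rates, no volume tiers, no free-tier credit.
PRICE_PER_PAGE: Dict[str, float] = {
    "DetectDocumentText": 0.0015,
    "AnalyzeID": 0.025,
}


def estimate_cost(pages_by_api: Dict[str, int],
                  prices: Dict[str, float] = PRICE_PER_PAGE) -> float:
    """Sum pages x rate per API; raises KeyError for an API with no known rate."""
    total = sum(prices[api] * pages for api, pages in pages_by_api.items())
    return round(total, 4)


# 10,000 pages of plain OCR plus 2,000 ID documents:
# 10,000 * 0.0015 + 2,000 * 0.025 = 15 + 50
print(estimate_cost({"DetectDocumentText": 10_000, "AnalyzeID": 2_000}))  # 65.0
```
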
Security

| Feature | Support |
|---|---|
| IAM integration | ✅ Yes |
| Encryption | ✅ With KMS |
| S3 Integration | ✅ Yes |
| VPC endpoint support | ✅ Yes (via PrivateLink) |
| HIPAA eligible | ✅ Yes |

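There is no dedicated Terraform resource for Textract, but you can provision the surrounding infrastructure, for example a Lambda function that S3 invokes whenever a PDF is uploaded:

# Lambda function that calls Textract
resource "aws_lambda_function" "textract_handler" {
  ...
}

# Invoke the handler for every new .pdf object in the bucket.
# S3 also needs an aws_lambda_permission that lets it invoke the function.
resource "aws_s3_bucket_notification" "notify" {
  bucket = "my-textract-bucket"

  lambda_function {
    lambda_function_arn = aws_lambda_function.textract_handler.arn
    events              = ["s3:ObjectCreated:*"]
    filter_suffix       = ".pdf"
  }
}
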
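
The body of the Lambda function is elided in the Terraform above. As one possible shape for it, the sketch below reads the bucket and key from each S3 event record and starts an asynchronous analysis job. The function names, the choice of `TABLES` and `FORMS`, and the returned dictionary are assumptions made for illustration, not part of the original configuration.

```python
import urllib.parse
from typing import Any, Dict, List


def start_jobs_for_event(event: Dict[str, Any], textract_client: Any) -> List[str]:
    """Start one async analysis job per uploaded object and return the job IDs."""
    job_ids: List[str] = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # S3 URL-encodes object keys in event payloads (spaces arrive as '+').
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        response = textract_client.start_document_analysis(
            DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}},
            FeatureTypes=["TABLES", "FORMS"],
        )
        job_ids.append(response["JobId"])
    return job_ids


def lambda_handler(event, context):
    """Lambda entry point; builds a real client only when actually invoked."""
    import boto3  # imported lazily so the helper above can be tested without AWS

    return {"job_ids": start_jobs_for_event(event, boto3.client("textract"))}
```

Keeping the Textract call in a helper that takes the client as an argument makes it easy to exercise with a stub client, and the function's execution role would still need permission to call Textract and read the bucket.
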
⚙️ Integration With Other Services

| Service | Purpose |
|---|---|
| S3 | Store source and output files |
| Lambda | Trigger processing logic |
| Comprehend | NLP on extracted text |
| Athena | Query structured output |
| Step Functions | Orchestrate large pipelines |
| OpenSearch | Index documents for search |

✅ TL;DR Summary

| Feature | Amazon Textract |
|---|---|
| OCR for scanned docs | ✅ Yes |
| Forms/key-value pairs | ✅ Yes |
| Tables | ✅ Yes |
| Handwriting | ✅ Yes |
| Identity/receipts | ✅ Specialized APIs |
| Real-time + batch | ✅ Both supported |
| Output format | JSON with structured blocks |
| Free Tier | ✅ 1,000 pages/month for the first 3 months |