AWS ETL Stack with CDK
A complete AWS ETL (Extract, Transform, Load) pipeline built with AWS CDK for Python.
Architecture
This project creates an end-to-end ETL pipeline for processing e-commerce data:
┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│ EventBridge  │───▶│ Step Function│───▶│  Glue Jobs   │
│  (Schedule)  │    │ (Orchestrate)│    │ (Transform)  │
└──────────────┘    └──────┬───────┘    └──────────────┘
                           │
                           ▼
                    ┌──────────────┐    ┌──────────────┐
                    │    Lambda    │───▶│     SNS      │
                    │   (Notify)   │    │   (Alerts)   │
                    └──────────────┘    └──────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ S3 BUCKETS │
├─────────────────┬─────────────────┬─────────────────────────────┤
│ raw-data/ │ processed/ │ scripts/ │
│ (Landing) │ (Curated) │ (Glue ETL scripts) │
└─────────────────┴─────────────────┴─────────────────────────────┘
Components
1. S3 Buckets
- Raw Data Bucket: Landing zone for incoming data files
- Processed Data Bucket: Stores cleaned and transformed data
- Scripts Bucket: Holds Glue ETL scripts and temporary files
2. AWS Glue
- Glue Jobs: PySpark ETL jobs for data transformation
- Sales data cleaning job
- Inventory processing job
- Glue Database: Acts as a namespace for job metadata and bookmarks (no tables)
3. Step Functions
- Orchestrates the ETL workflow
- Runs Glue jobs in parallel
- Handles retries and error states
- Sends notifications via SNS on success or failure
4. Lambda Functions
- Validation Lambda: Validates input files before processing
- Notification Lambda: Sends success/failure alerts
5. EventBridge
- Scheduled trigger (daily at 2 AM UTC)
- Starts the Step Functions workflow
Project Structure
aws_demo/
├── app.py # CDK app entry point
├── requirements.txt # Python dependencies
├── cdk.json # CDK configuration
├── etl_stack/
│ ├── __init__.py
│ └── etl_stack.py # Main stack definition
├── glue_scripts/ # Glue ETL scripts
│ ├── clean_sales.py
│ ├── process_inventory.py
│ └── validate.py
└── lambda_functions/ # Lambda handlers
├── validation/
│ └── handler.py
└── notification/
└── handler.py
Prerequisites
- AWS Account with appropriate permissions
- AWS CLI configured
- Python 3.13
- Node.js 18+ (for CDK CLI)
Setup
- Clone the repository
- Install Node.js dependencies (CDK CLI)
- Create virtual environment
- Install Python dependencies
- Bootstrap CDK (first time only)
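The steps above can be sketched as the following commands (the repository URL is a placeholder):

```shell
# Clone the repository
git clone https://github.com/<your-org>/aws_demo.git
cd aws_demo

# Install the CDK CLI (Node.js dependency)
npm install -g aws-cdk

# Create and activate a Python virtual environment
python3 -m venv .venv
source .venv/bin/activate

# Install Python dependencies
pip install -r requirements.txt

# Bootstrap CDK (first time only, per account/region)
cdk bootstrap
```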
Deployment
- Synthesize CloudFormation template
- Deploy the stack
- View outputs. After deployment, note the following outputs:
- S3 bucket names
- Step Functions ARN
- Glue database name
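A minimal deployment sequence might look like this (the stack name `EtlStack` is an assumption; use whatever name `app.py` registers):

```shell
# Synthesize the CloudFormation template
cdk synth

# Deploy the stack
cdk deploy

# List stack outputs after deployment
aws cloudformation describe-stacks --stack-name EtlStack \
  --query "Stacks[0].Outputs"
```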
Usage
Manual Trigger
Trigger the Step Functions workflow manually:
Upload Test Data
Verify Processed Data
Check the output bucket for generated Parquet files:
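The manual-trigger, upload, and verification steps above can be sketched with the AWS CLI; the ARN and bucket names below are placeholders to be replaced with the stack outputs:

```shell
# Start the Step Functions workflow manually
aws stepfunctions start-execution \
  --state-machine-arn arn:aws:states:REGION:ACCOUNT_ID:stateMachine:EtlStateMachine

# Upload test data to the raw-data bucket
aws s3 cp sample_sales.csv s3://RAW_BUCKET_NAME/raw-data/

# Verify processed Parquet output
aws s3 ls s3://PROCESSED_BUCKET_NAME/processed/ --recursive
```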
Monitoring
- Step Functions: View execution history in AWS Console
- Glue Jobs: Check job runs and logs in Glue Console
- CloudWatch Logs: View Lambda and Glue execution logs
- CloudWatch Metrics: Monitor job duration and success rates
Pure ETL Strategy
This pipeline focuses on data transformation only.
- Input: Raw CSV files in S3.
- Process: Glue jobs transform data using PySpark.
- Output: Cleaned Parquet files in S3.
Since we don't need to query this data immediately via Athena, we skip creating Glue Tables and Partitions. This simplifies the stack and reduces costs. The Glue Database is used solely for job bookmarks and metadata.
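To illustrate the kind of transformation involved, here is the cleaning step sketched in plain Python. The real jobs (e.g. glue_scripts/clean_sales.py) do this with PySpark at scale, and the column names here are assumptions, not taken from the actual data:

```python
import csv
import io

def clean_sales_rows(raw_csv: str) -> list[dict]:
    """Drop rows with missing order IDs and normalize prices.

    Plain-Python sketch of the cleaning logic; the Glue job
    performs the equivalent with PySpark DataFrames.
    """
    cleaned = []
    for row in csv.DictReader(io.StringIO(raw_csv)):
        if not row.get("order_id"):
            continue  # skip rows without an order ID
        row["price"] = round(float(row["price"]), 2)
        cleaned.append(row)
    return cleaned

sample = "order_id,price\nA1,19.999\n,5.00\nA2,3.5\n"
print(clean_sales_rows(sample))
```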
Cost Optimization
- Glue jobs use G.1X workers (cost-effective for small workloads)
- No Glue Crawler runs, saving ~$0.44 per DPU-hour of crawler compute
- Partition API calls are essentially free (vs crawler compute costs)
- S3 lifecycle policies archive old data to Glacier
- Lambda functions use minimal memory allocation
- Step Functions standard workflow (not Express)
Cleanup
To avoid ongoing charges:
Note: S3 buckets with data may need to be emptied manually first.
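A teardown sketch (bucket names are placeholders for the stack outputs):

```shell
# Empty data buckets first, or stack deletion may fail
aws s3 rm s3://RAW_BUCKET_NAME --recursive
aws s3 rm s3://PROCESSED_BUCKET_NAME --recursive

# Tear down the stack
cdk destroy
```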
Security
- All S3 buckets have encryption enabled
- IAM roles follow least-privilege principle
- Secrets managed via AWS Secrets Manager
- VPC endpoints for private Glue connectivity (optional)
Customization
Edit etl_stack/etl_stack.py to:
- Add more Glue jobs
- Modify Step Functions workflow
- Change schedule frequency
- Add SNS notifications
- Configure VPC for database access
- Add Glue Tables if querying is needed
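For example, changing the schedule frequency is a one-line edit to the EventBridge rule inside the stack. This fragment is a sketch meant to live in the stack's `__init__` (the construct ID `EtlSchedule` is an assumption):

```python
from aws_cdk import aws_events as events

# Daily at 2 AM UTC; adjust the cron fields to change frequency,
# e.g. hour="*/6" for every six hours
rule = events.Rule(
    self,
    "EtlSchedule",
    schedule=events.Schedule.cron(minute="0", hour="2"),
)
```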
Troubleshooting
Glue Job Fails
- Check CloudWatch Logs for the job
- Verify S3 paths and permissions
- Ensure input data format matches expectations
Step Functions Timeout
- Increase timeout in etl_stack.py
- Check Glue job performance
Permission Errors
- Review IAM role policies
- Ensure CDK has necessary permissions
License
MIT