Welcome to the "AWS for Web Developers" masterclass. This is not just a tutorial on clicking buttons; it is a comprehensive restructuring of how you approach software engineering. Moving from on-premise or simple VPS (like DigitalOcean/Heroku) to AWS requires a fundamental shift in philosophy.
In this module, we will establish the "Cloud Mindset," configure your local development environment, secure your wallet against unexpected bills, and understand the physical layout of the AWS global network.
If you come from a background of managing a single Linux server (VPS), you likely treat that server like a Pet: you name it, you patch it by hand, and you nurse it back to health when it gets sick. In the cloud, servers are treated as Cattle: identical, numbered, and replaceable.
The Goal: By the end of this course, you will build systems where you can delete your web server, and your application will automatically recover within seconds without human intervention. This is called Disposable Infrastructure.
AWS is an advanced platform. To succeed, you must be comfortable with the following web fundamentals:
AWS requires you to build your own network (VPC). You must understand:
- IP Addresses: public vs. private ranges (e.g., 192.168.1.5).
- CIDR notation: /16 is a larger network than /24.

While the AWS Console (GUI) is great for learning, professionals use the Command Line Interface (CLI). You should know how to:
- SSH into a remote machine (e.g., ssh user@host).
- Manage file permissions (e.g., chmod 400 key.pem).

You cannot deploy code if you don't know where it lives. AWS is divided into three physical layers:
A Region is a physical location in the world where AWS clusters data centers. Example: us-east-1 (Northern Virginia), eu-west-2 (London).
Why it matters:
- Latency: if your users are in Japan, deploy to ap-northeast-1 (Tokyo), not Virginia.
- Compliance: some data must legally stay in a specific region (e.g., eu-central-1).

Inside every Region, there are typically 3 to 6 Availability Zones. An AZ is one or more discrete data centers with redundant power, networking, and connectivity.
The Golden Rule of High Availability: If you only run one server in us-east-1a, and that data center floods, your site is down. To be "High Availability," you must run servers in at least two AZs (e.g., us-east-1a AND us-east-1b).
There are 400+ Edge Locations globally. These are not data centers for servers; they are caching endpoints for CloudFront (CDN). They sit very close to users to serve static content (images/video) quickly.
Who is responsible when a hack happens? It depends.
They protect the physical hardware, the concrete walls, the guards, the power, and the host OS virtualization software.
If you launch a server and leave the password as `admin/admin`, or leave Port 22 open to the world, and you get hacked—that is YOUR FAULT. AWS does not patch your OS or secure your code.
The Horror Story: A student leaves a large EC2 instance running, forgets about it, and wakes up to a $2,000 bill. Follow these steps immediately: enable Billing Alerts in your account preferences, create an AWS Budget with an email threshold (e.g., $10/month), and terminate resources the moment you finish an exercise.
We will interact with AWS programmatically. Install the AWS Command Line Interface (CLI).
Mac:
brew install awscli
Windows: Download the MSI installer from AWS.
Linux:
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install
Once we create credentials in Module 1, you will run:
aws configure
# AWS Access Key ID: [Paste Key]
# AWS Secret Access Key: [Paste Secret]
# Default region name: us-east-1
# Default output format: json
You are now ready to begin. In the next module, we will secure your account using IAM.
Identity and Access Management (IAM) is the backbone of AWS security. It is a global service (not bound to a specific region like EC2). It answers two fundamental questions: Who are you? (Authentication) and What are you allowed to do? (Authorization).
Before we launch servers, we must secure the perimeter. If a hacker gains access to your IAM credentials, they don't need to hack your firewall—they own the firewall.
When you first sign up for AWS with your email and credit card, you are logged in as the Root User. This account has unlimited privileges. It can close your account, change your billing, and delete every server you own. Use it only to create your first IAM admin user, enable MFA on it, and then lock it away.
You must understand the difference between these four objects to architect securely.
Represents a person (e.g., `alice`, `bob`) or a specific legacy application. Users have permanent long-term credentials.
A collection of Users. You should never attach permissions directly to a User. Always attach permissions to a Group, and add Users to that Group.
Example: Create a "Developers" group. Give the group access to S3. Add `alice` to the group. If `alice` leaves the company, simply remove her from the group.
A Role is an identity that you can assume temporarily. It does not have a password or permanent keys.
Use Case: You have a Python script running on an EC2 server that needs to upload files to S3. Instead of hardcoding access keys, you attach an IAM Role to the instance; the script automatically receives temporary, rotating credentials.
Policies are JSON documents that define permissions. They are attached to Users, Groups, or Roles.
You will be writing and reading a lot of JSON. Here is the structure of a policy implementing the Principle of Least Privilege.
{
"Version": "2012-10-17",
"Id": "S3-Restricted-Access",
"Statement": [
{
"Sid": "AllowListBucket",
"Effect": "Allow",
"Action": [
"s3:ListBucket"
],
"Resource": "arn:aws:s3:::my-company-data"
},
{
"Sid": "AllowUploadFiles",
"Effect": "Allow",
"Action": [
"s3:PutObject",
"s3:GetObject"
],
"Resource": "arn:aws:s3:::my-company-data/*"
}
]
}
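To see how these statements behave, here is a minimal, hypothetical policy evaluator in Python. Real IAM also handles explicit Deny, action wildcards, and Condition blocks — this sketch only checks Allow statements against an action/resource pair:

```python
policy = {
    "Statement": [
        {"Effect": "Allow", "Action": ["s3:ListBucket"],
         "Resource": "arn:aws:s3:::my-company-data"},
        {"Effect": "Allow", "Action": ["s3:PutObject", "s3:GetObject"],
         "Resource": "arn:aws:s3:::my-company-data/*"},
    ]
}

def is_allowed(action, resource):
    # Default deny: access is granted only by an explicit Allow statement.
    for stmt in policy["Statement"]:
        res = stmt["Resource"]
        # Exact match, or prefix match when the statement ends in "/*"
        match = (resource == res or
                 (res.endswith("/*") and resource.startswith(res[:-1])))
        if stmt["Effect"] == "Allow" and action in stmt["Action"] and match:
            return True
    return False

print(is_allowed("s3:GetObject", "arn:aws:s3:::my-company-data/report.csv"))     # True
print(is_allowed("s3:DeleteObject", "arn:aws:s3:::my-company-data/report.csv"))  # False
```

Note how `s3:DeleteObject` is denied not because any statement forbids it, but because nothing allows it — that is Least Privilege in action.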
ARN format: arn:partition:service:region:account-id:resource-id
Example: arn:aws:ec2:us-east-1:123456789012:instance/i-0a1b2c3d
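Because ARNs are colon-delimited, splitting one apart is straightforward. A small Python sketch (note the resource segment may itself contain `/` or `:`, hence the maxsplit):

```python
# Split an ARN into its six components:
# arn:partition:service:region:account-id:resource
arn = "arn:aws:ec2:us-east-1:123456789012:instance/i-0a1b2c3d"
parts = arn.split(":", 5)  # maxsplit=5 keeps any ':' inside the resource intact
partition, service, region, account_id, resource = parts[1:6]

print(service)   # ec2
print(region)    # us-east-1
print(resource)  # instance/i-0a1b2c3d
```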
This is the golden rule of IAM. A user/service should only have the minimum permissions necessary to do their job, and nothing more.
Scenario: A Junior Dev needs to restart a specific web server.
- Bad: Attach the AdministratorAccess policy.
- Better: Attach the AmazonEC2FullAccess policy.
- Best: Write a custom policy allowing ec2:RebootInstances only on that specific server ARN.

If you commit an `access_key_id` to a public GitHub repo, AWS bots (and hacker bots) scan GitHub continuously. You will often receive an email from AWS within 5 minutes telling you your account is compromised.
Solution: Use `.env` files and add them to `.gitignore`. Better yet, use IAM Roles.
In IAM settings, enforce a strong password policy: Minimum length 12, require special characters, expire passwords every 90 days.
Before moving to Module 2, perform these actions:
1. Create an IAM User named YourName-Admin.
2. Create a Group named SysAdmins.
3. Attach AdministratorAccess to the Group, and add your user to it.

If IAM is the gatekeeper, the VPC (Virtual Private Cloud) is the castle walls. A VPC is a logically isolated section of the AWS Cloud where you can launch resources in a virtual network that you define.
Many developers skip this and use the "Default VPC" AWS provides. This is fine for testing, but in production, you must build a custom network to separate public-facing servers from sensitive databases.
When you create a VPC, you must assign it an IP address range using CIDR (Classless Inter-Domain Routing) notation.
An IPv4 address is 32 bits (e.g., 192.168.0.1). The CIDR suffix (the /16 part) tells AWS how many bits are fixed.
Recommendation: For your main VPC, always use a large range like 10.0.0.0/16. This gives you plenty of room to grow. You cannot easily change the size of a VPC after creation.
Note: AWS reserves 5 IP addresses in every subnet, so in a /24 subnet (256 IPs), you can only use 251.
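Python's standard `ipaddress` module can do the CIDR arithmetic for you — a quick sanity check of the sizes above:

```python
import ipaddress

vpc = ipaddress.ip_network("10.0.0.0/16")
subnet = ipaddress.ip_network("10.0.1.0/24")

print(vpc.num_addresses)         # 65536 addresses in the /16 VPC
print(subnet.num_addresses)      # 256 addresses in the /24 subnet
print(subnet.num_addresses - 5)  # 251 usable once AWS reserves 5 per subnet
```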
You cannot launch servers directly into a VPC. You must launch them into Subnets. A subnet is a sub-section of your VPC range found within a specific Availability Zone.
To ensure High Availability, we create pairs of subnets across two Availability Zones (e.g., us-east-1a and us-east-1b).
A subnet is not inherently "Public" or "Private". The distinction is defined by its Route Table.
This is a virtual modem that connects your VPC to the real internet.
A subnet is Public if its Route Table has a route to the Internet Gateway.
Destination: 0.0.0.0/0
Target: igw-123456
Use Case: Load Balancers, Bastion Hosts, Web Servers (though Web Servers are safer in private subnets behind a Load Balancer).
A subnet is Private if it does NOT have a route to the IGW.
Use Case: App Servers, Databases, Back-end Logic. No one from the outside world can directly ping these servers.
Problem: Your database is in a Private Subnet (secure). But your database needs to download a security patch from `update.mysql.com`. How does it reach the internet if it has no IGW route?
Solution: The NAT Gateway (Network Address Translation).
In the private subnet's route table, add a route: 0.0.0.0/0 -> nat-123456.

AWS has two layers of firewalls. You must understand the difference.
Before moving to Compute, visualize a user loading your website: Internet → Internet Gateway → Public Subnet (Load Balancer) → Private Subnet (App Server) → Private Data Subnet (Database).
Amazon EC2 is the service that started the cloud revolution. It allows you to rent virtual computers (instances) on demand. While modern development often favors "Serverless" (Lambda), EC2 remains the workhorse of the internet. If you are migrating a legacy app, running a long-lived worker process, or hosting a database, you need EC2.
When you see m5.large or c6g.xlarge, it isn't random. It follows a strict syntax:
[Family][Generation][Attribute].[Size]
- Generation: m5 is newer than m4. Always use the latest generation for the best price/performance.
- Attribute: g = AWS Graviton Processor (ARM-based, cheaper and faster than Intel); a = AMD Processor (slightly cheaper than Intel).
- Size: nano, micro, small, medium, large, xlarge... Doubles in capacity (and price) at each step.

An AMI is the "template" for the root volume (C: Drive) of your instance.
User Data is a script responsible for automating the setup of a server immediately after it launches. This is the first step toward "Infrastructure as Code." You should never SSH into a server to manually install software if you can avoid it.
#!/bin/bash
# Update the OS
yum update -y
# Install Apache
yum install -y httpd
# Start the service
systemctl start httpd
systemctl enable httpd
# Create a landing page
echo "<h1>Hello from Server $(hostname -f)</h1>" > /var/www/html/index.html
The exact same server can cost $100 or $10 depending on how you buy it.
EC2 is the CPU/RAM; EBS is the Disk. EBS volumes are network drives attached to your instance.
How do you log into your Linux server?
Option 1 — Traditional SSH: Requires Port 22 to be open (0.0.0.0/0 is a security risk). Requires managing `.pem` private key files. If you lose the key, you lose the server.
Option 2 — SSM Session Manager: Uses the AWS Systems Manager Agent (pre-installed on Amazon Linux/Ubuntu).
Benefits: No inbound ports to open (not even 22), no SSH keys to manage, and every session is auditable through IAM and CloudTrail.
From inside the EC2 instance, the OS can learn about itself by querying a special local IP address.
curl http://169.254.169.254/latest/meta-data/
This returns the Instance ID, Public IP, Private IP, and IAM Role credentials. This is how scripts running on your server automatically get the permissions assigned to the IAM Role. (Note: on instances with IMDSv2 enforced — the modern default — you must first request a session token with an HTTP PUT and send it as a header with each metadata request.)
In the old days, if a user uploaded a profile picture, you saved it to the /var/www/uploads folder on your server. In the cloud, this is an anti-pattern. If your server crashes or if you autoscale to add a new server, that file is gone or missing from the new server.
The Solution: Amazon S3 (Simple Storage Service). It provides infinite storage, 99.999999999% (11 9's) durability, and serves as the backbone of the modern web.
Bucket names are globally unique across all AWS accounts, so you cannot name your bucket test; you need something like my-company-app-assets-2025.

Not all data needs to be instantly accessible. S3 offers "Lifecycle Policies" to move data between tiers automatically to save money.
| Class | Use Case | Retrieval Time |
|---|---|---|
| S3 Standard | Hot data. User profile pics, static website assets. | Milliseconds |
| S3 Intelligent-Tiering | Unknown access patterns. AWS moves data for you. | Milliseconds |
| S3 Standard-IA | Infrequent Access. Backups accessed once a month. Lower storage cost, higher retrieval fee. | Milliseconds |
| S3 Glacier Deep Archive | Regulatory archives (keep for 7 years). Cheapest storage (~$1/TB). | 12 to 48 Hours |
By default, all S3 buckets are Private. The infamous "S3 Data Leaks" in the news are always due to user error (turning off "Block Public Access").
A JSON document attached to the bucket itself to control access.
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "AllowPublicRead",
"Effect": "Allow",
"Principal": "*",
"Action": "s3:GetObject",
"Resource": "arn:aws:s3:::my-public-website/*"
}
]
}
Note: Only use the policy above for hosting public websites. For private data, use IAM Roles or Presigned URLs.
How do you let a user upload a file directly to a private bucket without giving them AWS credentials?
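The answer is Presigned URLs: your backend signs a short-lived URL and hands it to the browser. In practice you call the SDK's `generate_presigned_url` (which uses AWS SigV4); the toy Python sketch below only illustrates the core idea — an HMAC over the object key and an expiry time, so the URL cannot be tampered with or reused forever:

```python
import hmac, hashlib, time, urllib.parse

def toy_presigned_url(base_url, key, secret, expires_in=300, now=None):
    """Illustrative only — NOT AWS SigV4. Signs (key, expiry) so the URL
    cannot be altered without invalidating the signature."""
    expires = int((now if now is not None else time.time()) + expires_in)
    payload = f"{key}:{expires}".encode()
    sig = hmac.new(secret.encode(), payload, hashlib.sha256).hexdigest()
    qs = urllib.parse.urlencode({"key": key, "expires": expires, "sig": sig})
    return f"{base_url}?{qs}"

# Fixed 'now' for a reproducible example:
url = toy_presigned_url("https://my-bucket.example.com/upload",
                        "photo.jpg", "server-secret", now=0)
print(url)
```

S3 verifies the real signature server-side; if the key, expiry, or method is changed, or the expiry has passed, the request is rejected.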
S3 can act as a web server for HTML/CSS/JS. It supports index documents and error documents. This is the standard way to host React, Vue, and Angular apps.
CloudFront is a Content Delivery Network. It caches your content at 400+ "Edge Locations" around the world.
When User A in Paris requests `logo.png`, CloudFront checks the Paris Edge Location. If it's there (Cache Hit), it returns it instantly. If not (Cache Miss), it fetches it from your S3 bucket in Virginia, returns it to the user, and saves a copy in Paris for the next user.
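The cache-hit/cache-miss flow can be sketched in a few lines of Python, with plain dicts standing in for the origin bucket and the edge cache:

```python
# Edge-cache behavior sketch: first request is a MISS (fetch from origin),
# subsequent requests for the same key are HITs served from the edge.
origin = {"logo.png": b"<png bytes>"}   # stands in for the S3 bucket in Virginia
paris_edge = {}                          # stands in for the Paris Edge Location

def get(key):
    if key in paris_edge:
        return "HIT", paris_edge[key]
    body = origin[key]                   # slow round-trip to the origin
    paris_edge[key] = body               # cache a copy for the next user
    return "MISS", body

print(get("logo.png")[0])  # MISS
print(get("logo.png")[0])  # HIT
```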
To secure your content, use Origin Access Control (OAC): keep the S3 bucket private and add a bucket policy that allows reads only from your CloudFront distribution.
Now, users cannot bypass the CDN to access S3 directly.
If you are a corporate developer, you might see Storage Gateway. This connects on-premise servers to S3.
A "Scalable" system is one that can handle increased load by adding resources. A "High Availability" (HA) system is one that remains operational even if some components fail. On AWS, we achieve both using Load Balancers and Auto Scaling Groups.
Example: Upgrade your EC2 instance from `t2.micro` (1 CPU) to `c5.4xlarge` (16 CPUs).
Example: Add three more `t2.micro` instances.
An ELB is a managed service that distributes incoming traffic across multiple targets (EC2 instances, Containers, IP addresses).
The ASG is the automation engine. It adds or removes EC2 instances from your Target Group based on demand.
Horizontal scaling breaks traditional web apps that store user sessions (login cookies) or uploaded files on the local disk.
Scenario: User John logs in. The ALB sends him to Server A. Server A saves his session to `/tmp/sessions`.
Next request: ALB sends John to Server B. Server B checks `/tmp/sessions`, finds nothing, and kicks John out to the login screen.
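The fix is to externalize session state to a shared store (ElastiCache/Redis or DynamoDB) that every server reads. A minimal Python sketch of the difference, with a dict standing in for the external store:

```python
# Two "servers" sharing one external session store (stands in for Redis/DynamoDB).
shared_sessions = {}

def handle_login(server_local, user, session_id):
    shared_sessions[session_id] = user  # correct: write to the SHARED store
    server_local[session_id] = user     # anti-pattern: local memory/disk

server_a_local, server_b_local = {}, {}
handle_login(server_a_local, "john", "sess-1")

# Next request lands on Server B:
print("sess-1" in server_b_local)   # False -> local state kicks John out
print("sess-1" in shared_sessions)  # True  -> shared store keeps him logged in
```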
Steps to achieve High Availability: move sessions to a shared store (ElastiCache/DynamoDB), move uploads to S3, run instances in at least two AZs behind an ALB, and let an Auto Scaling Group replace failed instances.
In traditional hosting, you install MySQL on the same server as your PHP/Node code. In AWS, we decouple the database. This allows the database to scale independently and survive even if your application servers are wiped out.
RDS is a managed service for SQL databases. It supports: PostgreSQL, MySQL, MariaDB, Oracle, SQL Server, and Amazon Aurora.
"Managed" means AWS handles: OS installation and patching, automated backups and point-in-time recovery, and failover (with Multi-AZ). You still design the schema and tune your queries.
You must understand the difference between these two features:
| Feature | Multi-AZ (Availability Zones) | Read Replicas |
|---|---|---|
| Purpose | Disaster Recovery (High Availability) | Performance (Scalability) |
| Mechanism | Synchronous replication to a standby instance in a different AZ. | Asynchronous replication to up to 5 read-only copies. |
| Access | You cannot connect to the standby. It is passive. | You can connect to replicas for SELECT queries. |
| Failover | Automatic. DNS switches to standby. | Manual (must promote replica to standalone DB). |
Aurora is AWS's proprietary database engine. It is compatible with MySQL and PostgreSQL, delivers up to 5x the throughput of standard MySQL, and is typically cheaper at scale.
RDS is great for complex joins. DynamoDB is great for massive scale and simple Key-Value data.
In SQL, you normalize data. In DynamoDB, you facilitate access patterns. You often use Single Table Design, putting Users, Orders, and Products in one table to fetch everything in a single request.
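To make that concrete, here is the shape of single-table items in plain Python (an in-memory list standing in for the DynamoDB table — the key and attribute names are illustrative):

```python
# Single Table Design: Users and their Orders live in ONE table, keyed by PK/SK.
table = [
    {"PK": "USER#42", "SK": "PROFILE",          "name": "alice"},
    {"PK": "USER#42", "SK": "ORDER#2025-01-03", "total": 30},
    {"PK": "USER#42", "SK": "ORDER#2025-02-14", "total": 55},
    {"PK": "USER#99", "SK": "PROFILE",          "name": "bob"},
]

# A DynamoDB Query on PK="USER#42" returns the profile AND all orders
# in a single request — no joins needed.
items = [i for i in table if i["PK"] == "USER#42"]
print(len(items))  # 3
```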
The fastest database query is the one you never make. ElastiCache manages Redis or Memcached.
// Pseudocode for Lazy Loading
function getUser(userId) {
// 1. Check Cache
record = cache.get(userId);
if (record) {
return record; // Cache Hit
}
// 2. Query DB (Cache Miss)
record = db.query("SELECT * FROM users WHERE id = ?", userId);
// 3. Write to Cache with TTL (Time To Live)
cache.set(userId, record, 3600); // Expire in 1 hour
return record;
}
Databases should NEVER be accessible from the public internet. They belong in Private Subnets.
The Security Group Chain: the ALB's Security Group allows 80/443 from the internet; the App server's Security Group allows traffic only from the ALB's Security Group; the Database's Security Group allows its port (3306/5432) only from the App server's Security Group.
This ensures that even if someone has the database password, they cannot connect unless they have hacked your Application Server first.
"Serverless" is a misnomer. There are still servers, but they are abstracted away. You do not provision, patch, or scale them. You simply upload code, and AWS runs it in response to events.
Benefits: No idle costs (pay for value), automatic scaling, reduced operational overhead.
Drawbacks: Cold starts, execution time limits, vendor lock-in.
Lambda is the core of serverless compute. It supports Node.js, Python, Java, Go, Ruby, and .NET.
You are charged based on: the number of requests, and compute duration in GB-seconds (allocated memory × execution time).
When a Lambda function hasn't been used for a while, AWS spins down the container. The next request must wait for AWS to initialize the environment (adding ~100ms to ~1s latency). Solutions include "Provisioned Concurrency" or using lighter runtimes (like Node.js or Python) over heavy ones (like Java).
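You can see the cold/warm split in any runtime: code at module scope runs once per container, while the handler runs on every invocation. A Python sketch of the pattern:

```python
import time

# Module scope: runs ONCE per container (this is the cold-start cost).
start = time.perf_counter()
DB_CONNECTION = {"connected": True}  # stand-in for an expensive client/SDK init
COLD_INIT_MS = (time.perf_counter() - start) * 1000

def handler(event):
    # Handler scope: runs on EVERY invocation, reusing DB_CONNECTION for free.
    return {"db": DB_CONNECTION["connected"], "event": event}

print(handler({"id": 1}))  # {'db': True, 'event': {'id': 1}}
```

This is also why connection pools and SDK clients belong outside the handler — warm invocations skip that setup entirely.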
Lambda is not a daemon process; it does not sit and wait. It must be triggered by an event source such as API Gateway, an S3 upload, an SQS queue, or an EventBridge schedule.
If you want to build a Serverless API, you need a front door. API Gateway handles the HTTP traffic and invokes Lambda.
To build resilient serverless apps, you must decouple components.
A buffer between producers and consumers.
Scenario: 1000 users order a ticket at the same exact second. Instead of 1000 simultaneous database writes, the web tier drops each order into an SQS queue instantly, and workers consume the queue at a rate the database can sustain — nothing is lost and nothing crashes.
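The buffering behavior can be simulated with Python's standard `queue` module — producers enqueue instantly, the consumer drains at its own pace:

```python
import queue

# SQS-style buffer: enqueueing is cheap, even under a burst.
q = queue.Queue()

# 1000 "orders" arrive in the same second.
for i in range(1000):
    q.put({"order_id": i})

# The worker processes messages at whatever rate the database can handle.
processed = 0
while not q.empty():
    msg = q.get()   # in SQS: ReceiveMessage + DeleteMessage
    processed += 1  # (write to DB here)

print(processed)  # 1000
```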
Pub/Sub messaging. One message -> Many subscribers (Fan-out pattern).
Example: "Order Placed" event published to SNS Topic.
-> Subscriber A (Lambda): Updates Inventory.
-> Subscriber B (SQS): Queues email to user.
-> Subscriber C (HTTPS): Sends webhook to Slack.
This is a classic AWS pattern:
// Example: Lambda Handler (Node.js)
const AWS = require('aws-sdk');
const s3 = new AWS.S3();
const sharp = require('sharp');
exports.handler = async (event) => {
const bucket = event.Records[0].s3.bucket.name;
const key = decodeURIComponent(event.Records[0].s3.object.key.replace(/\+/g, ' '));
// 1. Get Image
const original = await s3.getObject({ Bucket: bucket, Key: key }).promise();
// 2. Resize
const resized = await sharp(original.Body).resize(200).toBuffer();
// 3. Save to new bucket
await s3.putObject({
Bucket: bucket + '-resized',
Key: 'thumb-' + key,
Body: resized
}).promise();
};
You have learned about EC2 (Managing OS + App) and Lambda (Managing Code only). But what if you have a legacy application that takes 30 seconds to start? Lambda will time out. What if you need to install complex system dependencies? EC2 is too much maintenance.
The Solution: Docker Containers on AWS.
Containers package your code and dependencies together. AWS provides the tools to run them at scale.
Containerization on AWS consists of three main parts: ECR (the image registry), ECS (the orchestrator), and the compute that runs the containers (Fargate or EC2).
ECR is a secure, private registry. It is integrated with IAM.
To push an image from your laptop to ECR, you must authenticate the Docker CLI with AWS.
# 1. Login to ECR
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-east-1.amazonaws.com
# 2. Build your image
docker build -t my-app .
# 3. Tag it
docker tag my-app:latest 123456789012.dkr.ecr.us-east-1.amazonaws.com/my-app:latest
# 4. Push
docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/my-app:latest
Understanding ECS requires learning its specific vocabulary: a Task Definition (the blueprint: image, CPU/RAM, ports), a Task (a running copy of that blueprint), a Service (keeps N tasks running and wires them to a load balancer), and a Cluster (the logical group they run in).
This is the most critical architectural decision you will make.
| Feature | EC2 Launch Type | AWS Fargate (Serverless) |
|---|---|---|
| Management | High. You manage the EC2 instances (patching, scaling the cluster). | Zero. No servers to manage. AWS runs the container on their fleet. |
| Pricing | Pay for the EC2 instance uptime (even if empty). Cheaper at massive scale. | Pay for vCPU/RAM used by the running container. Slightly more expensive per hour. |
| Use Case | Machine Learning, Legacy apps requiring specific kernel flags. | Web APIs, Microservices, Batch Jobs. (Recommended for 90% of users). |
How do users reach your container? Attach the ECS Service to an Application Load Balancer Target Group; the ALB routes incoming traffic to each running task's IP and port.
You will hear about Kubernetes. EKS is AWS's managed Kubernetes service.
Should you use it? Usually only if your organization already runs Kubernetes elsewhere or needs multi-cloud portability; it adds significant operational complexity.
For this course, stick to ECS. It is significantly easier to learn.
To deploy a container without servers: push your image to ECR, write a Task Definition (Fargate launch type, CPU/RAM, container port), create an ECS Service with a desired count of 2+, and put an Application Load Balancer in front of it.
Once deployed, grab the DNS name of the Load Balancer. You now have a scalable, containerized application.
Clicking around the AWS Console (ClickOps) is fine for learning, but it is forbidden in production. It is error-prone, unrepeatable, and impossible to audit. The industry standard is Infrastructure as Code (IaC).
IaC allows you to define your VPCs, EC2s, and Databases in text files. You commit these files to Git. You deploy your entire datacenter with one command.
Terraform works in three stages: Write (define resources in `.tf` files), Plan (`terraform plan` shows what will change), and Apply (`terraform apply` makes the changes).
# main.tf
provider "aws" {
region = "us-east-1"
}
resource "aws_vpc" "main_vpc" {
cidr_block = "10.0.0.0/16"
tags = {
Name = "Production-VPC"
}
}
resource "aws_instance" "web_server" {
ami = "ami-0c55b159cbfafe1f0" # Amazon Linux 2
instance_type = "t2.micro"
# Place the instance in a subnet (defined elsewhere) of the VPC created above
subnet_id = aws_subnet.public_subnet.id
tags = {
Name = "MyWebServer"
}
}
Terraform must remember what it created. It stores this mapping in a JSON file called the State File (terraform.tfstate). In teams, keep state in a remote backend (e.g., S3 with state locking) — never commit it to Git, as it can contain secrets.
For developers who hate writing YAML or HCL configuration, AWS created the CDK. It allows you to define infrastructure using real programming languages (TypeScript, Python, Java).
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as cdk from 'aws-cdk-lib';
export class MyStack extends cdk.Stack {
constructor(scope: cdk.App, id: string, props?: cdk.StackProps) {
super(scope, id, props);
// Create a VPC with 2 lines of code
const vpc = new ec2.Vpc(this, 'TheVPC', {
maxAzs: 2
});
// Create a Web Server
const instance = new ec2.Instance(this, 'WebServer', {
vpc,
instanceType: ec2.InstanceType.of(ec2.InstanceClass.T3, ec2.InstanceSize.MICRO),
machineImage: new ec2.AmazonLinuxImage(),
});
}
}
IaC enables the concept of Immutability: instead of patching servers in place, you change the code and replace the infrastructure with a fresh copy.
In a professional environment, developers never deploy from their local machines. Manual deployments are slow, unrepeatable, and prone to human error ("Oops, I deployed the wrong folder").
The Solution: A CI/CD Pipeline.
CI (Continuous Integration): "I push code, the server automatically tests and builds it."
CD (Continuous Deployment): "If the build passes, the server automatically pushes it to Production."
AWS provides a set of tools that mirror the functionality of Jenkins, GitHub, and CircleCI.
A private Git repository hosted by AWS.
Reality Check: Most companies still use GitHub or GitLab. CodeCommit is mostly used in highly regulated industries (banking/government) where code cannot leave the AWS ecosystem.
A fully managed build service. It spins up a temporary Docker container, downloads your code, runs your commands (e.g., `npm install`, `npm run build`, `docker build`), generates artifacts (zip files/images), and then destroys the container.
Pricing: You pay per minute of build time.
Automates the deployment to compute services.
The "Conductor". It visualizes the workflow: Source -> Build -> Staging -> Manual Approval -> Production.
CodeBuild looks for a file named buildspec.yml in the root of your repository to know what to do. This is equivalent to `.github/workflows/main.yml`.
version: 0.2
phases:
install:
runtime-versions:
nodejs: 18
commands:
- echo Installing dependencies...
- npm ci
pre_build:
commands:
- echo Running tests...
- npm test
build:
commands:
- echo Building the React app...
- npm run build
post_build:
commands:
- echo Build completed on `date`
- echo Syncing to S3...
- aws s3 sync build/ s3://my-production-bucket --delete
artifacts:
files:
- '**/*'
base-directory: build
How do you update a running application without crashing it?
| Strategy | Description | Risk |
|---|---|---|
| In-Place | Stop the app on Server A, update code, restart app. | High. Downtime occurs during restart. |
| Rolling | Update Server 1, then Server 2, then Server 3. | Medium. Reduced capacity during update. |
| Blue/Green | Spin up a completely new set of servers (Green) alongside the old ones (Blue). Switch the Load Balancer to Green instantly. | Low. Instant rollback possible. Expensive (double infrastructure for a short time). |
| Canary | Send 10% of traffic to the new version. If no errors after 5 mins, send 100%. | Lowest. Real users test the code. |
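Canary routing is just weighted selection at the load balancer. A tiny Python sketch of the idea (the `route` function and weights are illustrative, not an AWS API):

```python
import random

def route(canary_weight=0.10, rng=random.random):
    """Send roughly canary_weight of requests to the new version."""
    return "v2-canary" if rng() < canary_weight else "v1-stable"

# Deterministic spot-checks with a fixed "random" draw:
print(route(rng=lambda: 0.05))  # v2-canary
print(route(rng=lambda: 0.50))  # v1-stable
```

In a real deployment, a monitoring step watches the canary's error rate and either raises the weight to 100% or rolls it back to 0.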
If you prefer GitHub Actions over AWS CodePipeline, you must secure the connection. Do not create an IAM User with long-term Access Keys and paste them into GitHub Secrets.
OIDC allows GitHub to request a temporary, short-lived IAM Role from AWS only for the duration of the job.
# .github/workflows/deploy.yml
name: Deploy to AWS
on: [push]
permissions:
id-token: write # Required for OIDC
contents: read
jobs:
deploy:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- name: Configure AWS Credentials
uses: aws-actions/configure-aws-credentials@v2
with:
role-to-assume: arn:aws:iam::123456789012:role/GitHubActionRole
aws-region: us-east-1
- name: Deploy
run: |
aws s3 sync ./build s3://my-bucket
The ultimate goal is GitOps. Your pipeline should not just deploy app code; it should deploy the infrastructure.
The Workflow: a pull request triggers `terraform plan`, the diff is posted for review, and merging to main triggers `terraform apply` — infrastructure changes only through Git.
Congratulations on reaching the end. To graduate from "Tutorial Hell" to "Cloud Architect," you must build something real. We will design and architect an Instagram clone called CloudGram.
The Constraint: You cannot click buttons in the console. You must define this entire infrastructure using Terraform or AWS CDK. This ensures your project is reproducible and professional.
We will use a Hybrid Architecture. This demonstrates mastery of both server-based networking (VPC/EC2) and modern serverless event processing (Lambda).
Before writing application code, lay the foundation.
# network.tf
# 1. Create VPC
resource "aws_vpc" "main" { cidr_block = "10.0.0.0/16" }
# 2. Public Subnets (for ALB & NAT Gateway)
resource "aws_subnet" "public_1" { ... }
resource "aws_subnet" "public_2" { ... }
# 3. Private App Subnets (for EC2 API)
resource "aws_subnet" "private_app_1" { ... }
resource "aws_subnet" "private_app_2" { ... }
# 4. Private Data Subnets (for RDS - No Internet Access)
resource "aws_subnet" "private_db_1" { ... }
resource "aws_subnet" "private_db_2" { ... }
Challenge: If a user uploads a 10MB photo to your EC2 server, your server memory spikes, and you pay for bandwidth ingress and egress to S3. This kills scalability.
Solution: Upload directly from the Browser to S3.
1. The browser asks your API for a presigned URL: /api/get-upload-url?filename=photo.jpg.
2. The browser then sends a PUT request directly to S3 using that URL.

Once the file hits S3, we need to create a thumbnail. Do not do this on the EC2 server.
Instead, configure an S3 event notification on s3:ObjectCreated:Put that triggers a Lambda function to generate the thumbnail.

We use the right database for the right job.
Strict schema. Consistency matters.
Table: Users
- id (PK)
- username
- email
- password_hash
- created_at
High speed reads. Flexible schema.
Table: Feed
- PK: UserID
- SK: Timestamp
- Attributes: ImageURL, ThumbnailURL, Caption, LikesCount
Why? To load a user's profile feed, we perform a single DynamoDB Query: PK = UserID AND SK > 0. The cost depends only on the items returned — it stays constant no matter how many users exist.
To ensure the app is fast globally: serve the frontend and the image buckets through CloudFront, so static assets are cached at Edge Locations near each user.
Before you call this project "Done", verify the following:
If you build this project and document it on GitHub with a clear ReadMe and architecture diagram, you are technically qualified for the AWS Certified Solutions Architect Associate exam and junior/mid-level Cloud Engineering roles.
There is a distinct difference between Monitoring and Observability: Monitoring tells you that something is broken (dashboards, threshold alarms); Observability gives you the data—logs, metrics, and traces—to ask why.
In a microservices or serverless architecture, a single user request might touch 10 different services (CloudFront -> ALB -> Fargate -> Lambda -> DynamoDB). If it fails, how do you find the needle in the haystack? We use the Three Pillars of Observability: Logs, Metrics, and Traces.
CloudWatch Logs is the centralized aggregation service. Resources (Lambda, EC2, RDS, API Gateway) send their `stdout` and `stderr` output here.
Browsing logs line-by-line is impossible at scale. Logs Insights provides a SQL-like syntax to query terabytes of logs in seconds.
Scenario: You want to find the top 5 most frequent error messages in the last hour.
# CloudWatch Logs Insights Query Syntax
fields @timestamp, @message, userId
| filter level = "ERROR"
| stats count(*) as errorCount by @message
| sort errorCount desc
| limit 5
Metrics are numerical data points sent over time. They are lightweight and fast.
By default, EC2 sends CPU, Disk, and Network metrics to CloudWatch. It does NOT send RAM (Memory) usage.
Why? The Hypervisor (AWS hardware) can see the CPU load, but it cannot see inside the OS to know how much RAM is free.
Solution: You must install the CloudWatch Unified Agent on your EC2 instances to push memory/disk swap metrics.
Dashboards are for looking; Alarms are for waking you up.
CloudWatch Alarms don't send emails directly. They publish to an SNS Topic.
This is the ultimate tool for microservices. X-Ray traces a request as it travels through your entire distributed application.
When a request hits your Load Balancer, AWS adds a unique header: `X-Amzn-Trace-Id`. As this request passes to EC2, then to DynamoDB, then to S3, that ID is preserved.
X-Ray draws a visual node-graph of your architecture.
Benefit: You can instantly see: "The API is slow because the SQL Query to RDS is taking 3 seconds."
To get detailed traces, you wrap the AWS SDK in your code.
const AWSXRay = require('aws-xray-sdk');
// Wrap the AWS SDK
const AWS = AWSXRay.captureAWS(require('aws-sdk'));
// Now, every call this S3 client makes is automatically traced
const s3 = new AWS.S3();
exports.handler = async (event) => {
// Custom sub-segment for your own logic
const segment = AWSXRay.getSegment();
const subsegment = segment.addNewSubsegment('ImageProcessing');
try {
await processImage();
subsegment.close();
} catch (e) {
subsegment.addError(e);
subsegment.close();
}
};
ServiceLens is the UI that combines CloudWatch Metrics, Logs, and X-Ray Traces into one view.
If you click on a spike in a specific metric graph (e.g., "Latency"), ServiceLens will show you the exact X-Ray traces that contributed to that spike, and the Logs associated with those specific requests.
While often used for scheduling (Cron), EventBridge is also an observability tool for Audit & Compliance.
Pattern: GuardRails — an EventBridge rule watches for risky API calls (e.g., a Security Group opened to 0.0.0.0/0) and triggers an SNS alert or a remediation Lambda.
For your Capstone project, build a Dashboard with these 3 widgets:
You have now covered the entire spectrum of AWS development. From IAM security to VPC networking, from Serverless compute to Distributed Observability.
Next Steps: Build the Capstone project. Break it. Fix it using Logs Insights. That is how you become a Senior Engineer.