If you've been running EC2 instances for more than six months without a deliberate right-sizing exercise, you're almost certainly leaving money on the table. In our experience auditing AWS environments for mid-market companies, the average account is paying for roughly 40–60% more compute capacity than it ever uses. The culprit isn't negligence — it's the default instinct to provision generously and never revisit.

This post walks through a systematic, data-driven approach to right-sizing: where to gather the metrics that matter, how to interpret them, which AWS tools can automate the analysis, and how to execute changes without risking downtime.

Why Over-Provisioning Happens

Teams typically choose instance sizes at launch time based on worst-case estimates or rough rules of thumb ("our old server had 16 cores, so let's get a c5.4xlarge"). That initial choice almost never gets revisited once the service is stable. Meanwhile, AWS keeps releasing new instance generations (m6i delivers up to 15% better compute performance than m5 at the same On-Demand price), and workloads evolve as features are added, traffic patterns shift, or services are split into microservices.

The result is a sprawling fleet where some instances are maxing out their CPUs while others idle at 3% utilization, and the bill reflects the worst of both worlds.

The Four Dimensions of Right-Sizing

Right-sizing isn't just about CPU. A thorough analysis covers four resource axes:

  1. CPU: peak and sustained utilization (look at P99, not just averages).
  2. Memory: not reported to CloudWatch by default; see the next section.
  3. Network: throughput measured against the instance type's baseline and burst bandwidth limits.
  4. Disk I/O: EBS bandwidth and IOPS, both of which scale with instance size.

Getting CloudWatch Memory Metrics

The single highest-leverage action before any right-sizing analysis: install the CloudWatch agent on every instance and enable memory reporting. Without it, you're flying blind on one of the four critical dimensions.

Here's the minimal agent config to capture memory and disk on Amazon Linux 2 / AL2023:

{
  "metrics": {
    "namespace": "CWAgent",
    "metrics_collected": {
      "mem": {
        "measurement": ["mem_used_percent"],
        "metrics_collection_interval": 60
      },
      "disk": {
        "measurement": ["disk_used_percent"],
        "resources": ["/"],
        "metrics_collection_interval": 60
      }
    }
  }
}

Deploy this via SSM Run Command across your fleet:

aws ssm send-command \
  --document-name "AWS-ConfigureAWSPackage" \
  --parameters '{"action":["Install"],"name":["AmazonCloudWatchAgent"]}' \
  --targets "Key=tag:Environment,Values=production" \
  --region us-east-1
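Note that installing the package doesn't start the agent with your config. A common follow-up, assuming the JSON above is saved locally as cloudwatch-agent.json and stored in Parameter Store under a name of your choosing, is to push the config to SSM Parameter Store and apply it fleet-wide with the AmazonCloudWatch-ManageAgent document:

```shell
# Store the agent config in Parameter Store (parameter name is an example)
aws ssm put-parameter \
  --name "/cloudwatch/agent-config" \
  --type String \
  --value file://cloudwatch-agent.json

# Start the agent with that config across the same tagged fleet
aws ssm send-command \
  --document-name "AmazonCloudWatch-ManageAgent" \
  --parameters '{"action":["configure"],"mode":["ec2"],"optionalConfigurationSource":["ssm"],"optionalConfigurationLocation":["/cloudwatch/agent-config"],"optionalRestart":["yes"]}' \
  --targets "Key=tag:Environment,Values=production" \
  --region us-east-1
```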

Give it at least two weeks to collect data before drawing conclusions. Sizing decisions based on a shorter window will miss weekly peaks (e.g., Monday-morning batch jobs or Friday-night reports).

Using AWS Compute Optimizer

Once you have reasonable metric coverage, AWS Compute Optimizer is the fastest path to right-sizing recommendations. Enable it at the organization level (it's free for the base tier, with a paid Enhanced Infrastructure Metrics option that extends the lookback window from 14 days to 93 days).

# Enable Compute Optimizer for your account
aws compute-optimizer update-enrollment-status \
  --status Active \
  --include-member-accounts

# Pull EC2 recommendations to a CSV
aws compute-optimizer get-ec2-instance-recommendations \
  --output json \
  --query 'instanceRecommendations[*].{
    Instance:instanceArn,
    Current:currentInstanceType,
    Recommended:recommendationOptions[0].instanceType,
    PerfRisk:recommendationOptions[0].performanceRisk,
    MonthlySavings:recommendationOptions[0].savingsOpportunity.estimatedMonthlySavings.value
  }' | jq -r '
    ["Instance","Current","Recommended","PerfRisk","MonthlySavings"],
    (.[] | [.Instance,.Current,.Recommended,.PerfRisk,.MonthlySavings])
    | @csv
  ' > compute-optimizer-recs.csv

The output gives you a prioritized list with estimated monthly savings and a performance risk score (0 = lowest risk, 4 = highest risk of degradation). Filter for risk scores of 0 or 1 and you have a set of changes you can confidently schedule.

Key insight: Compute Optimizer's "performance risk" score is based on historical utilization patterns. A score of 0 means the recommended instance type has consistently had headroom given your actual workload — not just that it's a smaller box. Trust the score, but always cross-reference with your own P99 metrics before touching production databases or anything without auto-scaling.
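That P99 cross-check can be done straight from the CLI. A sketch, where the instance ID is a placeholder and the date arithmetic assumes GNU coreutils:

```shell
# P99 CPUUtilization over the last 14 days, one data point per hour
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
  --start-time "$(date -u -d '14 days ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 3600 \
  --extended-statistics p99
```

Memory is the same call against the CWAgent namespace and the mem_used_percent metric, once the agent is reporting.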

The Right-Sizing Decision Matrix

Not all instances should be treated the same way. Here's the framework we use when triaging a fleet:

Tier 1: Safe to change immediately

Stateless, auto-scaled workloads (web tier, API servers behind a load balancer) with P99 CPU below 40% and P99 memory below 60%. These can be resized during a normal deploy cycle with zero risk. Swap the launch template, roll the ASG, done.
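In CLI terms, assuming an ASG that references its launch template at $Latest (the template name, ASG name, and target type here are placeholders), the swap-and-roll looks like:

```shell
# Point the launch template at the smaller instance type
aws ec2 create-launch-template-version \
  --launch-template-name web-tier \
  --source-version '$Latest' \
  --launch-template-data '{"InstanceType":"m6i.large"}'

# Roll the ASG gradually, keeping 90% of capacity in service
aws autoscaling start-instance-refresh \
  --auto-scaling-group-name web-tier-asg \
  --preferences '{"MinHealthyPercentage":90}'
```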

Tier 2: Change with a maintenance window

Stateful single-instance workloads — standalone app servers, dev/staging instances, internal tools. Schedule a 15-minute maintenance window, stop the instance, change the instance type, start it back up. The whole operation takes under 5 minutes of actual downtime.
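The stop/resize/start sequence, sketched for an EBS-backed instance (instance ID and target type are placeholders; anything on instance-store volumes is lost on stop, and a cross-architecture change such as x86 to Graviton also requires a new AMI):

```shell
aws ec2 stop-instances --instance-ids i-0123456789abcdef0
aws ec2 wait instance-stopped --instance-ids i-0123456789abcdef0

# Change the type while the instance is stopped
aws ec2 modify-instance-attribute \
  --instance-id i-0123456789abcdef0 \
  --instance-type '{"Value":"m6i.large"}'

aws ec2 start-instances --instance-ids i-0123456789abcdef0
aws ec2 wait instance-running --instance-ids i-0123456789abcdef0
```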

Tier 3: Requires a workload-aware plan

Databases (RDS and self-managed), Kafka brokers, Elasticsearch nodes. These need careful handling: verify replication is healthy before touching any node, resize one node at a time, and monitor closely for the 15 minutes after each change. For RDS, use the "apply during next maintenance window" option unless urgency demands otherwise.
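For the RDS case, deferring the change to the maintenance window is a single flag (the identifier and class below are placeholders):

```shell
# Queue the class change for the next maintenance window
aws rds modify-db-instance \
  --db-instance-identifier prod-postgres \
  --db-instance-class db.m6g.large \
  --no-apply-immediately
```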

Tier 4: Do not touch without a load test

Instances where Compute Optimizer shows a performance risk of 2 or higher, or where you don't have CloudWatch memory metrics. Gather more data first, or run a shadow load test against a resized instance before committing to production.

Generation Upgrades: Often the Biggest Win

Right-sizing is not only about going smaller — it's also about going newer. Moving from m5 to m7i (Intel) or m7g (Graviton3) at the same size often delivers 20–30% better performance per dollar, meaning you can sometimes downsize and get better performance simultaneously.

The Graviton opportunity is particularly compelling. m7g instances are typically 15–20% cheaper than equivalent m7i instances and perform comparably for most general-purpose workloads. If your application stack compiles cleanly for ARM64 (most modern runtimes — Python, Node.js, Java, Go — do), the migration is usually a drop-in swap at the launch template level.

# Check if your AMI supports ARM64
aws ec2 describe-images \
  --image-ids ami-0abcdef1234567890 \
  --query 'Images[*].Architecture'

# List available m7g instance types in your region
aws ec2 describe-instance-types \
  --filters "Name=instance-type,Values=m7g.*" \
  --query 'InstanceTypes[*].{Type:InstanceType,vCPU:VCpuInfo.DefaultVCpus,MemMiB:MemoryInfo.SizeInMiB}' \
  --output table

Combining Right-Sizing with Reserved Instances and Savings Plans

One common mistake: buying Reserved Instances or Compute Savings Plans before right-sizing. If you commit to a 1-year reservation on a fleet of over-provisioned m5.2xlarge instances and later discover you could run on m7g.large, you've locked in the waste. The correct order is:

  1. Right-size the fleet. Get to a stable baseline you trust.
  2. Run for 30–60 days post-resize to confirm the new sizing holds under production load.
  3. Apply Savings Plans or RIs to the stabilized usage. Even a 1-year Compute Savings Plan saves ~30% over On-Demand; a 3-year no-upfront saves ~50%.

The combination of right-sizing plus Savings Plans is where the 40–60% bill reductions we mentioned at the top of this article actually come from. Neither alone gets you there consistently.

Tracking Progress with a Right-Sizing Dashboard

Once you've made changes, you need a way to measure results. The simplest approach: tag every instance with its pre-resize type using a custom tag (previous-instance-type), then use Cost Explorer's tag-based grouping to measure before/after spend. You can also build a CloudWatch dashboard with a metric math expression that tracks fleet-wide average CPU utilization as a proxy for sizing efficiency, and review it monthly.

# Tag instances before resize (run before changing instance type)
aws ec2 create-tags \
  --resources i-0123456789abcdef0 \
  --tags Key=previous-instance-type,Value=m5.2xlarge \
         Key=resized-date,Value=2025-03-21
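Once the tag is activated as a cost allocation tag in the Billing console, Cost Explorer can break spend down by it. A sketch of the before/after query (the date range is an example):

```shell
# Monthly unblended cost grouped by the previous-instance-type tag
aws ce get-cost-and-usage \
  --time-period Start=2025-03-01,End=2025-05-01 \
  --granularity MONTHLY \
  --metrics UnblendedCost \
  --group-by Type=TAG,Key=previous-instance-type
```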

Putting It All Together

Right-sizing is not a one-time project — it's a quarterly hygiene practice. Workloads change, new instance generations arrive, and usage patterns shift. The companies that keep their AWS bills under control are the ones that have made right-sizing a routine part of their engineering calendar, not a crisis response to a surprise invoice.

Start with Compute Optimizer, layer in the CloudWatch agent for memory visibility, and work through the tier framework above. A focused two-week effort on a typical 50-instance fleet routinely surfaces $8,000–$20,000 per year in savings — for work that doesn't require a single line of application code to change.