De-mystifying AWS CloudWatch metrics for EBS IOPS

If you use AWS EBS extensively, I’m sure you have often scratched your head trying to figure out its performance metrics.

We wrote a couple of blog posts on this a while back (and see this), trying to explain why EBS performance (or any storage performance in general) can be confusing. Customers typically use CloudWatch EBS metrics to analyze the performance of their EBS volumes. AWS provides extensive monitoring capabilities, and the details of the available EBS CloudWatch metrics can be found here.

Are the CloudWatch metrics for EBS IOPS accurate?

If so, how is it possible that at times you see the EBS IOPS consumption reported in CloudWatch exceeding the provisioned capacity? For example, you provision an EBS io1 volume for 1000 IOPS, and at times you see CloudWatch reporting 2000 IOPS! The answer depends on the details of how CloudWatch reports EBS metrics. Read on to find out.

For any storage system, the most interesting metrics are IOPS (I/O operations per second), latency and throughput (explained in the blog here), with IOPS being the most important. AWS provisions storage based on IOPS (io1) and capacity.

It should be noted that CloudWatch collects performance information every 60 seconds for io1 (Provisioned IOPS) volumes, and every 300 seconds for gp2 (General Purpose SSD), sc1 (Cold HDD), st1 (Throughput Optimized HDD) and magnetic volumes. You need to keep this minimum measurement granularity in mind while analyzing the available statistics such as ‘Average’, ‘Maximum’ and ‘Minimum’. Within the minimum measurement granularity (60 or 300 seconds), all of these statistics will be the same because there is only one data point for that period.
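To make the granularity concrete, here is a minimal sketch using boto3 (the region, volume ID and time range are placeholders) that pulls VolumeReadOps for an io1 volume at the 60-second period; because EBS emits a single data point per period, the Sum and Average statistics come back identical.

```python
# Minimal sketch (boto3); region, volume ID and time range are placeholders.
import datetime
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# io1 volumes report every 60 seconds, so Period=60 is the finest usable
# granularity (use 300 for gp2, sc1, st1 and magnetic volumes).
resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/EBS",
    MetricName="VolumeReadOps",
    Dimensions=[{"Name": "VolumeId", "Value": "vol-0123456789abcdef0"}],  # hypothetical
    StartTime=datetime.datetime.utcnow() - datetime.timedelta(hours=1),
    EndTime=datetime.datetime.utcnow(),
    Period=60,
    Statistics=["Sum", "Average"],
)

# One data point per 60-second period, so Sum and Average are the same value.
for dp in sorted(resp["Datapoints"], key=lambda d: d["Timestamp"]):
    print(dp["Timestamp"], dp["Sum"], dp["Average"])
```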

CloudWatch metrics relevant to this blog are the following:

VolumeReadOps and VolumeWriteOps (unit: count) are the total number of read and write I/O operations, respectively, in a specified period of time. To calculate the average I/O operations per second (IOPS) for the period, divide the total operations in the period by the number of seconds in that period. The Minimum and Maximum statistics on these metrics are supported only by volumes attached to a C5 or M5 instance.
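As an illustration, the sketch below (boto3; the volume ID is hypothetical) converts VolumeReadOps and VolumeWriteOps into average IOPS by dividing each period’s Sum by the period length in seconds.

```python
# Sketch: average IOPS = total operations in the period / seconds in the period.
import datetime
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
PERIOD = 60  # seconds; matches the io1 reporting granularity

def average_iops(metric_name, volume_id, start, end):
    """Return (timestamp, average IOPS) pairs for one EBS ops metric."""
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/EBS",
        MetricName=metric_name,
        Dimensions=[{"Name": "VolumeId", "Value": volume_id}],
        StartTime=start,
        EndTime=end,
        Period=PERIOD,
        Statistics=["Sum"],
    )
    return sorted((dp["Timestamp"], dp["Sum"] / PERIOD) for dp in resp["Datapoints"])

end = datetime.datetime.utcnow()
start = end - datetime.timedelta(hours=1)
vol = "vol-0123456789abcdef0"  # hypothetical volume ID
reads = average_iops("VolumeReadOps", vol, start, end)
writes = average_iops("VolumeWriteOps", vol, start, end)
for (ts, r), (_, w) in zip(reads, writes):
    print(ts, f"read: {r:.0f} IOPS, write: {w:.0f} IOPS, total: {r + w:.0f} IOPS")
```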

AWS defines an I/O operation as below:

“I/O operations that are smaller than 256K each count as 1 consumed IOPS. I/O operations that are larger than 256K are counted in 256K capacity units. For example, a 1024K I/O would count as 4 consumed IOPS.”
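A quick worked example of that accounting rule (sizes in KiB; this is just an illustration of the quoted definition, not an AWS API):

```python
import math

def consumed_iops(io_size_kib):
    """Consumed IOPS for one I/O request under the 256K accounting rule."""
    if io_size_kib <= 256:
        return 1
    return math.ceil(io_size_kib / 256)

for size in (4, 128, 256, 512, 1024):
    print(f"{size}K I/O -> {consumed_iops(size)} consumed IOPS")
# 4K, 128K and 256K each count as 1; 512K counts as 2; 1024K counts as 4.
```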

So what is the mystery?

The mystery lies in the following two facts (let’s limit the discussion here to io1 volumes; gp2 volumes offer burst credits of up to 5M I/O at 3000 IOPS):

  1. I/O consolidation: When I/O requests are sequential, the EBS system consolidates them into 256K blocks. For example, if you send 2 x 128K blocks, EBS will write them as a single 256K block and, from an IOPS consumption point of view, count them as a single I/O operation.
  2. CloudWatch reports the pre-merged I/O count: if you send 2 x 128K blocks, CloudWatch will report them as 2 I/O operations! (The sketch after this list illustrates the difference.)
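Here is a small illustrative sketch of the discrepancy (our own model, not an AWS API): it assumes sequential requests smaller than 256K merge cleanly up to the 256K accounting unit.

```python
def reported_vs_consumed(num_requests, request_size_kib, sequential=True):
    """Return (CloudWatch-reported ops, EBS-consumed IOPS) for a burst of requests."""
    reported = num_requests  # CloudWatch counts each request as issued (pre-merge)
    if sequential and request_size_kib < 256:
        per_merge = 256 // request_size_kib        # requests merged into one 256K block
        consumed = -(-num_requests // per_merge)   # ceiling division (post-merge count)
    else:
        consumed = num_requests
    return reported, consumed

# 2000 sequential 128K writes in one second: CloudWatch reports 2000 IOPS,
# but the backend merges pairs into 256K blocks and consumes only 1000 IOPS --
# exactly the "1000 provisioned, 2000 reported" situation described earlier.
print(reported_vs_consumed(2000, 128))  # -> (2000, 1000)
```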

In other words, CloudWatch reports IOPS on a pre-merged basis, while the EBS backend uses the post-merged I/O count to determine consumption. Unfortunately, AWS currently doesn’t provide any mechanism to retrieve the consumed (post-merged) IOPS. We have been told that AWS is looking into it. It does take some effort, but you can estimate it yourself if you know the block size used by your application/file system; a rough estimation sketch follows below. Note that file systems and block-level drivers also consolidate data blocks to improve performance, which further complicates the situation.
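For completeness, here is one way such an estimate might look (boto3; the volume ID is hypothetical). It derives the average pre-merged I/O size from VolumeWriteBytes divided by VolumeWriteOps and then assumes sequential requests merge cleanly up to 256K; file-system and driver-level consolidation means this is only an approximation.

```python
import datetime
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
PERIOD = 60  # io1 reporting granularity

def period_sums(metric, volume_id, start, end):
    """Per-period Sum values for one EBS metric, keyed by timestamp."""
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/EBS",
        MetricName=metric,
        Dimensions=[{"Name": "VolumeId", "Value": volume_id}],
        StartTime=start, EndTime=end, Period=PERIOD, Statistics=["Sum"],
    )
    return {dp["Timestamp"]: dp["Sum"] for dp in resp["Datapoints"]}

end = datetime.datetime.utcnow()
start = end - datetime.timedelta(hours=1)
vol = "vol-0123456789abcdef0"  # hypothetical volume ID
ops = period_sums("VolumeWriteOps", vol, start, end)
byts = period_sums("VolumeWriteBytes", vol, start, end)

for ts in sorted(ops):
    if ops[ts] == 0 or ts not in byts:
        continue
    avg_io_kib = byts[ts] / ops[ts] / 1024                 # average pre-merged I/O size
    merge_factor = max(1, int(256 // max(1, avg_io_kib)))  # requests that fit in 256K
    reported = ops[ts] / PERIOD                            # what CloudWatch shows
    estimated_consumed = reported / merge_factor           # rough post-merge estimate
    print(ts, f"reported {reported:.0f} IOPS, ~{estimated_consumed:.0f} consumed (estimated)")
```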

Clearly this is an issue that AWS needs to address, and we hope they will do so soon.