Machine Learning of CPU Utilization for AWS EC2 Scheduling

Background

The worldwide public cloud services market is projected to reach $204 billion in 2016, up 16.5% from $175 billion in 2015, according to Gartner, Inc. One of the benefits of public cloud is its cost effectiveness: customers pay only for the resources they provision and use. Physical resources such as compute instances are charged for the time they are active ("powered on"). For example, AWS charges $0.958 per hour for an m4.4xlarge compute instance with 16 vCPUs and 64 GB of RAM. If the machine is left running around the clock, one pays 24 * $0.958 = $22.99 per day, or about $8,392 per year. However, many instances are not utilized 24/7. If a compute instance is used for only 8 hours per day, it can be powered down for the remaining 16 hours, which saves about 67% of the spending (roughly $5,595 per year). To capture these savings, one needs to keep an eye on instance utilization patterns and create power management schedules based on them. Next, we provide an example of how CPU utilization patterns can be analyzed.
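The arithmetic above can be checked with a few lines of Python. This is only a sketch: the hourly rate is the one quoted in the text, a 365-day year is assumed, and the function name `annual_cost` is made up for illustration.

```python
HOURLY_RATE = 0.958   # USD per hour for m4.4xlarge, as quoted in the text
DAYS_PER_YEAR = 365   # assumption: non-leap year, instance billed every day


def annual_cost(hours_on_per_day, hourly_rate=HOURLY_RATE):
    """Yearly cost of an instance powered on for a fixed number of hours each day."""
    return hours_on_per_day * hourly_rate * DAYS_PER_YEAR


always_on = annual_cost(24)
eight_hours = annual_cost(8)
savings = always_on - eight_hours
savings_pct = 100 * savings / always_on

print(f"always on:    ${always_on:,.2f} per year")
print(f"8 hours/day:  ${eight_hours:,.2f} per year")
print(f"savings:      ${savings:,.2f} per year ({savings_pct:.0f}%)")
```

Changing the argument to `annual_cost` shows how quickly the bill shrinks as the powered-on window gets shorter.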

CPU Utilization Analysis

The following figure shows the maximum CPU utilization (%) of one instance over 14 days. From the plot alone, we can tell that utilization is high for only a short period each day. This implies that the instance can be powered down for the rest of the time.

[Figure: maximum CPU utilization (%) of the instance over 14 days]

To make the analysis quantitative, we first need a threshold that defines idle time, say CPU max utilization < 1%. We can then check when utilization falls below the threshold; those are the times when the instance can be powered down. However, this is not an easy task. Although there seem to be some periodic patterns, there are also many non-periodic patterns and noise. We therefore need a second threshold: the minimum length of an idle interval. Because of unexpected usage and noise, only idle intervals longer than, say, 5 hours are treated as real idle patterns. By manually checking the whole time series against the two thresholds, we found that weekday idle times are 19:45-00:45 and 07:00-13:45, and that weekend idle times are 06:45-06:00 (running overnight into the next morning).
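The two-threshold check described above can be sketched as follows. This is an illustrative version, not the exact procedure used for the analysis: the function name `find_idle_intervals`, the 15-minute sampling, and the synthetic one-day data are all assumptions, and a minimum length of "5 hours or more" is used.

```python
from datetime import datetime, timedelta


def find_idle_intervals(samples, cpu_threshold=1.0, min_idle=timedelta(hours=5)):
    """Return (start, end) pairs where CPU stays below cpu_threshold for at least min_idle.

    samples: list of (timestamp, cpu_max_percent) tuples sorted by timestamp.
    An idle run ends at the first busy sample, or at the last sample if it never ends.
    """
    intervals = []
    run_start = None
    last_ts = None
    for ts, cpu in samples:
        if cpu < cpu_threshold:
            if run_start is None:
                run_start = ts          # a new idle run begins here
        else:
            if run_start is not None and ts - run_start >= min_idle:
                intervals.append((run_start, ts))
            run_start = None            # a busy sample closes any open run
        last_ts = ts
    # Close a run that is still open when the data ends.
    if run_start is not None and last_ts - run_start >= min_idle:
        intervals.append((run_start, last_ts))
    return intervals


# Synthetic day sampled every 15 minutes: busy 06:00-07:00 and 14:00-20:00.
day = datetime(2016, 1, 4)
samples = []
for i in range(96):
    ts = day + timedelta(minutes=15 * i)
    busy = ts.hour == 6 or 14 <= ts.hour < 20
    samples.append((ts, 50.0 if busy else 0.2))

idle = find_idle_intervals(samples)
for start, end in idle:
    print(start.strftime("%H:%M"), "-", end.strftime("%H:%M"))
```

In this synthetic day, the idle stretch after 20:00 is shorter than five hours, so it is discarded, which is how the second threshold filters out short gaps.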

While this analysis works, it has some limitations. First, it takes a lot of effort to run it continually. Second, it does not scale: with thousands of instances to track, manual analysis becomes impractical, and that is when unnecessary cost adds up quickly.

FittedCloud Solution

We use machine learning techniques to learn the CPU utilization patterns over a period of time and translate the likelihood of idle times into different probability levels. Idle times with high confidence are converted into recommended power management schedules. Customers can choose the confidence level of idle times at which the schedules are executed. We continuously monitor changes in CPU utilization patterns and update the recommendations on a timely basis.
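To make the idea of idle-time likelihoods and confidence levels concrete, here is a deliberately simple stand-in. It is not FittedCloud's model: it just estimates, for each hour of the day on weekdays and weekends, the fraction of observed samples that were idle. The function names, the hourly sampling, and the synthetic 14-day data are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def idle_probability_by_slot(samples, cpu_threshold=1.0):
    """Estimate P(idle) per (is_weekend, hour) slot as the fraction of idle samples."""
    idle = defaultdict(int)
    total = defaultdict(int)
    for ts, cpu in samples:
        slot = (ts.weekday() >= 5, ts.hour)   # Saturday=5, Sunday=6
        total[slot] += 1
        if cpu < cpu_threshold:
            idle[slot] += 1
    return {slot: idle[slot] / total[slot] for slot in total}


def recommended_off_hours(probabilities, confidence=0.9, weekend=False):
    """Hours whose estimated idle probability meets the chosen confidence level."""
    return sorted(
        hour
        for (is_weekend, hour), p in probabilities.items()
        if is_weekend == weekend and p >= confidence
    )


# Synthetic data: 14 days of hourly samples starting on a Monday.
start = datetime(2016, 1, 4)
samples = []
for h in range(14 * 24):
    ts = start + timedelta(hours=h)
    weekend = ts.weekday() >= 5
    busy = ts.hour == 6 or (not weekend and 14 <= ts.hour < 20)
    if ts.date() == datetime(2016, 1, 6).date() and ts.hour == 10:
        busy = True   # a single one-off burst of unexpected usage
    samples.append((ts, 50.0 if busy else 0.2))

probs = idle_probability_by_slot(samples)
print("weekday, confidence 0.80:", recommended_off_hours(probs, 0.80))
print("weekday, confidence 0.95:", recommended_off_hours(probs, 0.95))
print("weekend, confidence 0.95:", recommended_off_hours(probs, 0.95, weekend=True))
```

The one-off burst at 10:00 lowers that weekday slot's idle estimate to 0.9, so the hour is recommended at the 0.80 confidence level but not at 0.95. This mirrors the trade-off a customer makes when choosing a confidence level: a higher level is more conservative and powers the instance down for fewer hours.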

We have learned the CPU utilization patterns for the above example. The predicted likelihoods of power-on are shown below. We can see that while there is usage around 06:30 on both weekdays and weekends, weekdays also have usage at other times. Therefore, the recommended idle times for weekdays are 19:45-00:45 and 07:00-13:45, and those for weekends are 06:45-06:00, which match the results of the manual analysis above.

[Figure: predicted likelihood of power-on by time of day, for weekdays and weekends]

Our solution has several advantages. First, it is completely automatic. This frees customers from the tedious tasks of monitoring and scheduling instances, especially when there are hundreds or thousands of them. Second, continuous monitoring of utilization patterns keeps track of seasonal and unexpected changes, which are difficult to notice and easily overlooked by customers. Third, the power management schedules learned from CPU utilization are statistically grounded, so over the long run they should outperform manual schedules and save more.