Cloud Cost Management – Explainable Anomaly Detection with Cross Resources in Public Clouds

In our previous blog post on contextual anomaly detection, we discussed machine learning algorithms that are used to detect abnormal Cloud activities in real time. Types of abnormal Cloud activity include, but are not limited to:

  1. Abnormal User Accesses to the Cloud Environment, based on information such as Login Location, Credential Usage, etc.
  2. Abnormal User Activities such as Role Changes, API Creations, New User Creation, etc.
  3. Abnormal Cloud Entity Activities such as Machine Communications, Application Communications, API Usage, etc. A Cloud Entity may include an application, a machine, or a process.
  4. Abnormal capacity provisioning, such as creation of resources, deletion of resources, or changes in resource capacities.
  5. Abnormal capacity provisioning related to location or time (e.g., resource provisioning in a region that is not typically used).
  6. Abnormal capacity provisioning related to users, groups, business units (BUs), or AWS accounts.

Usually, an anomaly detection algorithm can tell whether the current Cloud activity or configuration is normal. To Cloud customers or managers, however, this function looks like a “black box”: they may have no idea how the detection decision was made or what information was used to support it. For each reported anomaly, the alert needs to carry meaningful information and interpretation so that Cloud managers can trust the detection result and investigate it more easily. In this blog, we discuss our development of explainable anomaly detection methods that utilize global resources in a Cloud environment.

To understand the root of an abnormal behavior, we first need to know how to represent the activity and monitor its changes over time. Here we introduce a graph-based anomaly detection method that supports contextual analysis of anomalies. This method describes Cloud activity as a graph in which each Cloud element (a Cloud entity, user, login IP, etc.) is a node, and each edge indicates the connection, communication, or correlation between two nodes associated with the activity. A baseline graph is then built and maintained for normal activities in a Cloud environment. For example, one can expect a strong connection between a Cloud user node and an IP address node in the graph if that user always logs into the system from that IP address.
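
As a rough illustration of how such a baseline graph might be represented, the sketch below counts how often pairs of Cloud elements appear together and turns the counts into edge weights. The function name `build_baseline`, the event format, and the weighting rule (the share of a node's observed activity that involves a given neighbor) are assumptions made for this example; they are not the actual FittedCloud implementation.

```python
from collections import Counter, defaultdict


def build_baseline(events):
    """Build a weighted baseline graph from (source, target) activity pairs.

    Edge weight = share of the source node's observed activity that involved
    the target node. This weighting is an illustrative assumption.
    """
    events = list(events)
    pair_counts = Counter(events)
    source_totals = Counter(source for source, _ in events)
    graph = defaultdict(dict)
    for (source, target), count in pair_counts.items():
        graph[source][target] = count / source_totals[source]
    return dict(graph)


# Hypothetical login history: user1 almost always logs in from ip1.
login_events = [("user1", "ip1")] * 9 + [("user1", "ip2")]
baseline = build_baseline(login_events)
print(baseline)
```

With this history, the edge between user1 and ip1 carries most of the weight, which matches the "strong connection" described above.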

This graph-based anomaly detection method can incorporate all types of Cloud resources from different Cloud services in a given Cloud environment. Once deployed, it automatically monitors Cloud activities and performance, builds the initial baseline graph, clusters Cloud activities into groups, and updates the baseline and the groups with new activities over time. A new malicious activity that deviates from the normal activity groups can be detected, and all abnormal connections in the graph can be visualized and summarized for the Cloud administrators or managers.

An example of this graph is shown in Fig. 1. Under normal conditions, user1 is associated with two IP addresses, ip1 and ip2, has normal historical resource performance, such as the efficiency metrics of the EC2 M4 instance, and performs normal activities, including user creation, API creation, and role access, using one IAM. Activities that are not consistent with this baseline activity graph are detected as abnormal. For example, user1 logs into the Cloud from a new IP address, say ip3, or there are significant changes in the CPU efficiency and cost efficiency of the EC2 M4 instance, or the user tries to delete existing users. All of these anomalies are highlighted in the graph for visualization. Moreover, using the graph and the connections among different nodes, we can infer the relationships among different anomalies and then identify the root cause of an abnormal activity. For example, the user deletion anomaly and the efficiency metric change anomaly both derive from the user login from an unknown IP address.

Fig. 1. Example of a Cloud activity graph and abnormal behavior detection. The Cloud activity is modeled by the graph, and abnormal activities that do not conform to the baseline graph can be detected and highlighted.
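
The comparison against the baseline graph in the Fig. 1 scenario can be sketched roughly as follows. This is a simplified illustration, not the production algorithm: the helper names `find_anomalies` and `group_by_source`, the edge labels, and the root-cause heuristic (the earliest anomaly on a node is treated as the candidate cause of later anomalies on the same node) are all assumptions introduced for the example.

```python
def find_anomalies(baseline_edges, observed_edges):
    """Return observed (source, target) edges absent from the baseline, in arrival order."""
    return [edge for edge in observed_edges if edge not in baseline_edges]


def group_by_source(anomalies):
    """Group anomalies by source node; the earliest one is the root-cause candidate.

    Assumption: anomalies arrive in time order, and a later anomaly that shares
    a source node with an earlier one is treated as a possible consequence of it.
    """
    groups = {}
    for source, target in anomalies:
        groups.setdefault(source, []).append((source, target))
    return {
        source: {"root_candidate": edges[0], "related": edges[1:]}
        for source, edges in groups.items()
    }


# Baseline edges for user1, loosely following Fig. 1 (labels are hypothetical).
baseline_edges = {
    ("user1", "ip1"),
    ("user1", "ip2"),
    ("user1", "create_user"),
    ("user1", "create_api"),
    ("user1", "role_access"),
}

# New activity, in time order: unknown IP first, then a user deletion.
observed = [
    ("user1", "ip3"),
    ("user1", "create_api"),
    ("user1", "delete_user"),
]

anomalies = find_anomalies(baseline_edges, observed)
report = group_by_source(anomalies)
print(report)
```

Here the login from ip3 and the user deletion are both flagged, and the ip3 login is reported as the candidate root of the later anomaly. A real system would also need to handle metric-based anomalies, such as efficiency changes, which this sketch leaves out.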

Once an abnormal activity is detected, we report it to the user or manager, depending on the customized policy, who then confirms whether the activity is legitimate. For example, if user1 is allowed to access the Cloud environment from ip3, this behavior will be considered normal, and the graph will be updated with an edge of value 1 connecting the ip3 node and the user1 node.
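
A minimal sketch of this feedback step is shown below, assuming the baseline graph is stored as a nested dictionary of edge weights. The function name `confirm_as_normal` and the data layout are hypothetical and chosen only for illustration.

```python
def is_known(graph, user, ip):
    """Return True if the baseline graph already has an edge between user and ip."""
    return ip in graph.get(user, {})


def confirm_as_normal(graph, user, ip, weight=1.0):
    """Record an administrator's confirmation by adding an edge of the given weight."""
    graph.setdefault(user, {})[ip] = weight
    return graph


# Hypothetical baseline: user1 is known to log in from ip1 and ip2.
graph = {"user1": {"ip1": 1.0, "ip2": 1.0}}

flagged_before = not is_known(graph, "user1", "ip3")   # login from ip3 is flagged
confirm_as_normal(graph, "user1", "ip3")               # manager approves the address
flagged_after = not is_known(graph, "user1", "ip3")    # no longer flagged
print(flagged_before, flagged_after)
```

After the confirmation, later logins from the approved address match the baseline and no longer raise an alert.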

The explainable anomaly detection method built by FittedCloud provides a powerful tool for identifying resource provisioning anomalies caused by malicious attacks or human error. It enhances the security of a Cloud environment by giving customers meaningful interpretation and visualization, so that the decisions made by the AI algorithm can be trusted and analyzed.