Anomalies, in the context of cloud costs, are unexpected and significant spikes in your cloud spending compared to historical patterns. While many cost increases in the cloud are expected and planned (for example, due to scheduled releases), some are completely unexpected, and it is these that we treat as cloud cost anomalies. The usage behind these anomalies is often unnecessary and delivers no business value.
Identifying such anomalies is critical because they can rapidly escalate cloud costs, consuming substantial portions of cloud and overall IT budgets. For Managed Service Providers (MSPs) that manage a customer's cloud, the repercussions extend beyond monetary loss: failure to address anomalies can erode customer confidence and potentially lead to customer attrition.
Overly optimistic workload estimations can lead to excessive resource provisioning. This often results in idle resources, incurring unnecessary costs. For instance, provisioning servers for anticipated peak load that never materializes can lead to significant overspending.
Defective code can inadvertently trigger resource-intensive operations. For example, a Lambda function with an infinite loop can run for a substantial amount of time without producing any useful output. This type of anomaly can rapidly escalate costs.
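As a minimal illustration (hypothetical code, not tied to any real workload), the handler below retries a failing dependency with no limit, so every invocation runs until the Lambda timeout and bills the full duration:

```python
# Hypothetical Lambda handler illustrating how a small defect can burn compute.
# The retry loop never terminates while the downstream call keeps failing, so
# each invocation runs until the configured timeout and is billed for all of it.
import time

def handler(event, context):
    processed = False
    while not processed:                     # bug: no retry limit or backoff cap
        processed = try_downstream(event)    # returns False while the dependency is down
        time.sleep(1)                        # idle seconds are still billed compute time
    return {"status": "ok"}

def try_downstream(event):
    # stand-in for a flaky dependency; imagine this failing for hours
    return False
```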
Implementation of AWS cost-saving measures can sometimes increase expenses rather than reduce them. A prime example is the mismanagement of lifecycle policies involving Amazon S3 Glacier storage.
While Glacier offers significantly lower storage costs than standard S3 storage classes, the lifecycle transition to this tier incurs charges based on the number of objects transferred. If data is moved to the Glacier tier without first being compacted into a smaller number of objects, the lifecycle transition costs can easily outweigh the savings in storage costs. This underscores the importance of understanding AWS pricing structures and the potential consequences of hasty optimization decisions.
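A rough back-of-the-envelope calculation makes the trade-off concrete. The prices below are illustrative placeholders, not current AWS list prices; check the AWS pricing pages for your region before acting on numbers like these:

```python
# Back-of-the-envelope comparison of Glacier lifecycle transition cost vs. storage
# savings. All prices are illustrative assumptions, not current AWS list prices.
S3_STANDARD_PER_GB_MONTH = 0.023
GLACIER_PER_GB_MONTH = 0.004
TRANSITION_PER_1000_OBJECTS = 0.05   # lifecycle transitions are billed per object

def first_month_net_saving(total_gb: float, object_count: int) -> float:
    storage_saving = total_gb * (S3_STANDARD_PER_GB_MONTH - GLACIER_PER_GB_MONTH)
    transition_cost = object_count / 1000 * TRANSITION_PER_1000_OBJECTS
    return storage_saving - transition_cost   # transition cost is a one-off charge

# 1 TB spread across 50 million tiny log objects: transition fees dwarf the saving.
print(first_month_net_saving(1024, 50_000_000))   # ~19.5 - 2500 -> deeply negative
# The same 1 TB compacted into 10,000 archives before transitioning:
print(first_month_net_saving(1024, 10_000))       # ~19.5 - 0.5 -> positive
```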
Misconfigurations, such as enabling unnecessary debug logging for ECS tasks by default or enabling CloudTrail logging for irrelevant trails by default, can quickly escalate costs and impact your cloud spend.
Cyberattacks can exploit cloud resources for malicious purposes, leading to exorbitant costs. Crypto mining is a prevalent example, where attackers hijack compute resources of an organization to generate cryptocurrency, draining system performance and inflating cloud bills.
Anomalous events can be classified into the following types based on how long they last:
These anomalies cause sudden, sharp increases in cloud costs over a short period, such as a few hours in a day. A common case occurs when unexpected traffic to an Application Load Balancer triggers the autoscaling group attached to it to spin up multiple EC2 instances to accommodate the increased demand. While this is necessary for legitimate traffic spikes, the same mechanism can inadvertently incur excessive compute costs when confronted with anomalous traffic patterns, such as those generated by Distributed Denial of Service (DDoS) attacks. In these instances, resources provisioned in response to malicious traffic result in unnecessary expenditure.
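As a rough illustration of how such short-lived spikes can be surfaced, the sketch below flags hours whose cost sits far above a trailing baseline; the window size, threshold, and sample data are assumptions to be tuned to your own spend profile, not part of any specific tool:

```python
# Minimal sketch: flag hours whose cost jumps well above a trailing baseline.
# Window length and z-score threshold are illustrative; tune them to your spend.
from statistics import mean, stdev

def point_anomalies(hourly_costs, window=24, z_threshold=3.0):
    """Return indices of hours whose cost exceeds mean + z*std of the prior window."""
    flagged = []
    for i in range(window, len(hourly_costs)):
        baseline = hourly_costs[i - window:i]
        mu, sigma = mean(baseline), stdev(baseline)
        if sigma > 0 and hourly_costs[i] > mu + z_threshold * sigma:
            flagged.append(i)
    return flagged

# A mildly noisy ~$2/hour profile with one DDoS-style autoscaling burst at hour 30.
costs = [2.0 + 0.1 * (i % 5) for i in range(48)]
costs[30] = 14.0
print(point_anomalies(costs))   # -> [30]
```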
Continuous cost anomalies cause sustained cost increases over a longer period (for example, a few days). Unlike point anomalies, which appear as sudden spikes, these anomalies gradually shift the baseline daily cost upwards over time. Because of this behavior, they tend to go unnoticed when costs are reviewed only at a daily level. Early identification of such anomalies is therefore even more critical, as the accumulated cost over time can escalate significantly, requiring quick intervention to limit the impact.
Some of the scenarios where continuous anomalies occur include:
Within the domain of continuous anomalies, identifying slow-burning anomalies is especially important, as they manifest as gradual cost increases over extended periods, making them difficult to detect through daily or weekly analysis. Analytics techniques such as trend analysis, time series forecasting, and anomaly detection algorithms are essential for uncovering these cost drivers. By establishing baseline costs and comparing them to actual spending over longer time horizons, organizations can increase their chances of identifying these hidden anomalies before they escalate into significant financial burdens. One scenario where a slow-burning anomaly occurs is indefinite retention of RDS storage backups: over time, databases accumulate unnecessary backup data due to excessive retention, and without detection these backups gradually grow into a significant cost burden when viewed over a month or more.
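A minimal sketch of this idea, assuming you already have daily cost figures (for example from Cost Explorer exports), compares a recent average against a longer baseline; the window lengths and the 15% drift threshold are illustrative assumptions:

```python
# Sketch: surface slow-burning cost drift by comparing recent average daily spend
# against a baseline from an earlier period. The percent threshold is an assumption.
def slow_burn_drift(daily_costs, baseline_days=30, recent_days=7, pct_threshold=0.15):
    """Return (drift_ratio, is_anomalous) for a list of daily costs, oldest first."""
    baseline = daily_costs[-(baseline_days + recent_days):-recent_days]
    recent = daily_costs[-recent_days:]
    baseline_avg = sum(baseline) / len(baseline)
    recent_avg = sum(recent) / len(recent)
    drift = (recent_avg - baseline_avg) / baseline_avg
    return drift, drift > pct_threshold

# RDS backup storage creeping up ~1% a day goes unnoticed day-to-day,
# but the 7-day average ends up well above the 30-day baseline.
costs = [100 * (1.01 ** day) for day in range(60)]
print(slow_burn_drift(costs))   # drift of roughly 20% -> flagged
```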
Anomaly management involves the systematic identification, analysis, and resolution of unexpected variations in cloud spending. By effectively managing anomalies, organizations can prevent financial losses, optimize resource utilization, and ensure the overall health of their cloud environment. It typically consists of the following stages:
The following process chart outlines a recommended anomaly management process:
Effective management of cloud cost anomalies requires a proactive and systematic approach. The following problems typically occur when identifying anomalies in the cloud using in-house, third-party, or AWS-provided tools such as AWS Cost Anomaly Detection:
A significant challenge in cloud cost anomaly detection stems from the abundance of "noise" generated by planned, predictable cost fluctuations. These expected variations, such as peak business hours or planned deployments, are typically high in volume and can mask the genuine, more critical anomalies when relying purely on algorithmic detection systems. A hybrid approach that incorporates human input, such as defining common criteria for irrelevant anomalies, is therefore essential to separate relevant from irrelevant cost anomalies, ensuring effective resource allocation and cost optimization.
Though the first step in mitigating a detected anomaly is to find the owner of the resources causing it for validation, determining ownership is hindered by the often-unclear ownership of cloud resources. While AWS tagging offers a potential solution, widespread adoption is impeded by challenges such as inconsistent tagging conventions, a lack of enforcement mechanisms, and resistance to cultural change within organizations (you can read about tagging in our blog here). Furthermore, the priority of running stable workloads for business use cases often overshadows tagging, as implementing and maintaining tagging mechanisms requires significant time and effort. These issues increase the time needed to mitigate anomalies and therefore increase their cost impact.
To resolve the above problems and handle cost anomalies in your cloud effectively, the following key practices should be considered and implemented.
One effective approach to reducing noise from expected cost fluctuations is to incorporate human input through rule-based systems. By empowering the users who manage cloud resources to define rules that identify irrelevant anomalies, organizations can filter them out automatically. This can be achieved by creating custom rules based on AWS tags, exempting certain services, or establishing specific cost thresholds beyond which an anomaly should be highlighted. Such a system lets stakeholders tailor anomaly detection to their organizational context, significantly reducing false positives and insignificant anomalies, thereby improving the overall efficiency of the anomaly management process.
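A minimal sketch of such a rule-based filter is shown below; the anomaly fields, tag keys, exempt services, and threshold are hypothetical placeholders rather than any particular tool's schema:

```python
# Sketch of rule-based suppression of irrelevant anomalies. The anomaly fields,
# tag keys, and thresholds are hypothetical, not a specific tool's schema.
EXEMPT_SERVICES = {"AWSDataTransfer"}          # services whose spikes we accept
EXEMPT_TAGS = {("environment", "load-test")}   # tagged workloads expected to spike
MIN_IMPACT_USD = 50.0                          # ignore anomalies below this daily impact

def is_relevant(anomaly: dict) -> bool:
    if anomaly["service"] in EXEMPT_SERVICES:
        return False
    if any((k, anomaly["tags"].get(k)) in EXEMPT_TAGS for k in anomaly["tags"]):
        return False
    return anomaly["daily_impact_usd"] >= MIN_IMPACT_USD

anomalies = [
    {"service": "AmazonEC2", "tags": {"environment": "prod"}, "daily_impact_usd": 420.0},
    {"service": "AmazonEC2", "tags": {"environment": "load-test"}, "daily_impact_usd": 900.0},
    {"service": "AmazonS3", "tags": {}, "daily_impact_usd": 12.0},
]
print([a for a in anomalies if is_relevant(a)])   # only the prod EC2 spike survives
```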
Teams undergoing rapid development and deployment cycles often experience frequent cost fluctuations due to resource provisioning and scaling. In such cases, generating alerts for every anomaly can create unnecessary noise and hinder productivity. By implementing mechanisms to temporarily mute anomalies for these teams during high-release periods, organizations can focus on more critical cost variations while allowing engineering teams to manage their environment effectively.
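One simple way to implement this, sketched below with hypothetical team names and dates, is to keep a list of mute windows and check them before raising an alert:

```python
# Tiny sketch of temporarily muting a team's anomaly alerts during release windows.
# Team names and the window structure are illustrative assumptions.
from datetime import date

MUTE_WINDOWS = {
    "payments-team": (date(2024, 7, 1), date(2024, 7, 5)),   # planned release week
}

def should_alert(team: str, anomaly_date: date) -> bool:
    window = MUTE_WINDOWS.get(team)
    if window and window[0] <= anomaly_date <= window[1]:
        return False     # suppressed: team is in a planned high-release period
    return True

print(should_alert("payments-team", date(2024, 7, 3)))   # False, muted
print(should_alert("payments-team", date(2024, 7, 9)))   # True, alerts resume
```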
Implementing a robust cost allocation framework is essential for effective anomaly management. Establishing a rule-based system in which cost centre owners define ownership of resources based on tags, services, and accounts streamlines the process of identifying the parties responsible for cost anomalies. This approach facilitates accountability and enables targeted cost optimization efforts.
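A simplified sketch of such ownership resolution is shown below; the rule structure, tag keys, and account IDs are illustrative assumptions, not a prescribed schema:

```python
# Sketch of rule-based ownership resolution for an anomalous resource.
# Rule order and fields (tags, service, account) are illustrative assumptions.
OWNERSHIP_RULES = [
    {"match": {"tag:team": "data-platform"}, "owner": "data-platform-lead"},
    {"match": {"service": "AmazonRDS", "account": "111122223333"}, "owner": "dba-team"},
    {"match": {"account": "111122223333"}, "owner": "platform-cost-centre"},   # fallback
]

def resolve_owner(resource: dict) -> str | None:
    for rule in OWNERSHIP_RULES:                     # first matching rule wins
        if all(resource.get(key) == value for key, value in rule["match"].items()):
            return rule["owner"]
    return None                                      # unowned: escalate manually

resource = {"service": "AmazonRDS", "account": "111122223333", "tag:team": None}
print(resolve_owner(resource))   # -> "dba-team"
```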
Key KPIs such as Cost Avoidance should be tracked to assess the effectiveness of the anomaly management system. Cost Avoidance provides valuable insights into the financial impact of anomaly management efforts and helps justify investments in anomaly detection and remediation strategies.
At a high level, Cost Avoidance quantifies the financial savings realized by preventing an anomaly from persisting.
To calculate cost avoidance, first determine the average daily cost incurred during the anomalous period. Then estimate the total potential cost if the anomaly had continued unchecked by multiplying that daily cost by the total duration it would have persisted. Finally, subtract the actual cost incurred during the anomaly's active period from the total potential cost to arrive at the cost avoidance figure.
The following example illustrates how cost avoidance is calculated:
Let's assume an EC2 instance has run unnecessarily for 30 days due to a configuration error. The instance costs $100 per day to run.
In this case, without the anomaly management system the anomaly would have persisted for another 10 days, for a potential total of $4,000 (40 days × $100). Since only $3,000 was actually incurred before the anomaly was resolved, identifying it 10 days earlier saved the organization $1,000.
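The same arithmetic, expressed as a small helper using the figures from the example above:

```python
# Cost avoidance for the example above: the misconfigured EC2 instance ran for
# 30 days at $100/day and would have run another 10 days without intervention.
def cost_avoidance(daily_cost: float, days_ran: int, days_avoided: int) -> float:
    potential_total = daily_cost * (days_ran + days_avoided)   # cost if left unchecked
    actual_cost = daily_cost * days_ran                        # cost actually incurred
    return potential_total - actual_cost

print(cost_avoidance(daily_cost=100, days_ran=30, days_avoided=10))   # -> 1000.0
```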
Effectively managing cloud cost anomalies is imperative for organizations seeking to optimize their cloud spending and use cloud budgets effectively. By understanding the various types of anomalies, their root causes, and the potential impact on the bottom line, businesses can implement robust strategies to identify, analyze, and remediate these cost drivers. A combination of advanced analytics, human expertise, and automation is essential for building a resilient anomaly management system. By prioritizing relevant anomalies, establishing clear ownership, and tracking key performance indicators like cost avoidance, organizations can significantly reduce cloud costs and improve overall financial performance. Ultimately, a proactive and data-driven approach to anomaly management is crucial for achieving long-term cost optimization and maximizing the return on cloud investments.
Strategic use of SCPs saves more cloud cost than one might imagine. Astuto does that for you!