What is the OOMKilled Error?
OOMKilled is a common error on Linux systems. The Linux kernel includes a facility called the Out of Memory (OOM) Killer, which tracks the memory usage of each process. If the system is in danger of running out of available memory, the OOM Killer steps in and terminates processes to free up memory and prevent a crash. When a process is terminated this way, Linux reports the OOMKilled error.
The OOM Killer's goal is to free up as much memory as possible while killing as few processes as possible. Internally, it assigns a badness score to each running process; the higher the score, the more likely the process is to be terminated.
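On most Linux systems you can inspect this score directly through procfs. A quick sketch (these are standard procfs paths; `$$` refers to the current shell):

```shell
# Show the OOM badness score of the current shell; higher = more likely to be killed.
cat /proc/$$/oom_score

# oom_score_adj (-1000 to 1000) biases the score; -1000 exempts a process entirely.
cat /proc/$$/oom_score_adj
```

Writing to oom_score_adj (as root) is one way to protect a critical process from the OOM Killer.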
How Can You Prevent the OOMKilled Error in Linux?
Using Linux with default settings can result in memory overcommitment. This improves resource utilization, but it is also risky, because applications cannot be certain that the memory they were granted is actually available. Instead of simply fixing the error and restoring terminated processes, it is a good idea to think about the problem holistically and find a strategy for improving memory usage.
Linux can be configured to use different strategies for allocating memory. You can find the current policy by reading files under /proc/sys/vm/ (the proc filesystem is mounted by default on most distributions). The files overcommit_memory and overcommit_ratio govern how much memory the system may commit beyond the available physical memory.
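You can check the current policy without any extra tooling:

```shell
# 0 = heuristic overcommit (the default), 1 = always overcommit, 2 = don't overcommit
cat /proc/sys/vm/overcommit_memory

# Percentage of physical RAM counted toward the commit limit when mode 2 is active
cat /proc/sys/vm/overcommit_ratio
```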
The Out of Memory (OOM) Killer will only run if your system is configured to overcommit memory. If you don't want the system to overcommit, set overcommit_memory to 2 and overcommit_ratio to 0. This prevents overcommitting and, in effect, disables the OOM Killer.
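One way to apply this policy is with sysctl; a sketch (requires root, and note that with overcommit disabled, allocations beyond the commit limit simply fail):

```shell
# Disable overcommit: allocations beyond the commit limit fail with ENOMEM
# instead of triggering the OOM Killer later.
sysctl -w vm.overcommit_memory=2
sysctl -w vm.overcommit_ratio=0

# To persist the settings across reboots, add them to a file under /etc/sysctl.d/
```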
Identifying and Diagnosing OOMKilled Errors
You'll need to rely on logs to find out about OOM events. Linux has many different logging tools; one open source option is journald, part of systemd, which collects kernel messages alongside syslog and service output. It provides a CLI called journalctl that lets you navigate through the logs.
Run the following command to view OOM logs:
$ journalctl --utc -b -X -ke
Here, X is a placeholder for an integer; the minus sign in front of it makes it a negative offset. Negative integers count back from the most recent boot, while positive integers count forward from the earliest boot recorded in the journal.
The -k option restricts output to kernel messages, and -e jumps to the end of the log. From there you can use the pager's search commands: for example, typing ?killed searches backward for entries related to OOMKilled. The matching log includes a virtual memory status summary from the time the OOM Killer ran, with statistics about processes and their memory usage.
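If you prefer non-interactive filtering, you can pipe the kernel messages through grep instead; a sketch (the exact message text varies between kernel versions):

```shell
# Search kernel messages for OOM Killer activity without opening the pager.
journalctl -k --utc --no-pager | grep -iE "out of memory|killed process"
```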
Understanding the OOMKilled Error in Kubernetes
How is OOMKilled related to cluster resource allocation?
Applications deployed in production Kubernetes clusters require appropriate resource allocation settings to run smoothly. Optimal amounts of storage, CPU, and memory should always be available, especially when multiple instances of an application are running. It is a balancing act between providing sufficient hardware resources and minimizing cloud costs.
It is especially important to reserve sufficient resources for applications to scale, based on usage and load, without impacting service availability. When dealing with memory and CPU utilization, this additional buffer of resources is called headroom.
OOMKilled reflects lack of headroom for Kubernetes pods
For applications running as pods in a Kubernetes cluster, you can use the Metrics Server to track resource usage and automatically scale pods. If there is not enough headroom (for example, a container exceeds its memory limit, or the node itself runs low on memory), Kubernetes terminates the affected containers and reports the OOMKilled status.
If this error occurs, you will need to provide more resources to get your application working again. This can be time consuming, costly and cause production downtime. To avoid this, it’s important to accurately estimate how many resources your application will need before deploying it into production.
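You can confirm that a pod was OOM-killed, and raise its limits, with kubectl; a sketch where my-app-pod and my-app are hypothetical placeholder names:

```shell
# Check why the container was last terminated; "OOMKilled" confirms the error.
kubectl get pod my-app-pod \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'

# Raise the memory request and limit on the owning deployment.
kubectl set resources deployment my-app \
  --requests=memory=256Mi --limits=memory=512Mi
```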
Measuring headroom to prevent OOMKilled errors
Headroom primarily reflects memory and CPU utilization. In Kubernetes, CPU utilization is measured in millicores, and memory usage in MiB (mebibytes). Clusters running on-premises have limited available headroom. The computing power a pod requires is dynamic, because it depends on the tasks currently being performed: as pods face more complex tasks, they consume more resources, increasing CPU and memory utilization.
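With the Metrics Server installed, you can see these units directly:

```shell
# CPU is reported in millicores (m) and memory in Mi; requires the Metrics Server.
kubectl top pods
kubectl top nodes
```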
Pods can be scaled horizontally when they reach specific thresholds, to handle increasing loads and ensure availability. This spins up more pods through a process called horizontal pod autoscaling (HPA). When the hardware limit is exhausted, the pod will start throwing OOMKilled errors—this can be caused by traffic load spikes, infinite loops or memory leaks.
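Horizontal pod autoscaling can be enabled imperatively; a minimal sketch, assuming a deployment named my-app already exists:

```shell
# Scale between 2 and 10 replicas, targeting 70% average CPU utilization.
kubectl autoscale deployment my-app --cpu-percent=70 --min=2 --max=10

# Inspect the resulting HorizontalPodAutoscaler.
kubectl get hpa my-app
```

Note that HPA adds replicas; it does not raise the per-container memory limit, so a single request that exceeds the limit can still be OOM-killed.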
An OOMKilled error means that the pod's containers are terminated and unable to process requests. If multiple pods hit this error during traffic peaks, it affects multiple services, degrading availability and end-user experience. Additionally, while a pod is unable to serve requests, you still pay for the underlying infrastructure even though the pod is not actually usable.
Using load testing to simulate loads and anticipate OOMKilled errors
OOMKilled errors are common when a pod experiences unusually high load without sufficient resources having been provisioned for it. With load testing, you can simulate traffic several times higher than usual and monitor how your pods perform under that load; this is how headroom is measured in real-world scenarios. Load testing can also surface many unexpected issues, such as logic errors, memory leaks, dependency-based errors, and concurrency mismatches.
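A simple way to generate such a load is with an HTTP benchmarking tool; a sketch assuming the open source hey tool is installed and using a hypothetical service URL:

```shell
# Drive 200 concurrent connections against the service for 60 seconds.
hey -z 60s -c 200 http://my-service.example.com/

# In another terminal, watch pod resource usage while the test runs.
watch kubectl top pods
```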
Using autoscaling and monitoring to address the problem
Kubernetes provides automated scaling, but human intervention may be required due to the dynamic nature of cloud-native applications. To limit these interventions, it is important to analyze the key metrics of the application and make an informed estimate of the amount of headroom required.
Metrics collection can be complex, and is often managed with specialized tools such as Prometheus and Grafana for metrics, and Jaeger for tracing. These tools typically run inside the cluster and systematically export telemetry to the desired backend.
Many companies are building data lakes to make it easier to analyze metrics and draw conclusions from the data. Key metrics to monitor on Kubernetes clusters include CPU and memory utilization, network traffic, pod health, API server latency, and CrashLoopBackOff errors. In addition to these metrics, you can use custom application metrics to measure errors and unexpected behavior.
In conclusion, the OOMKilled (Out Of Memory Killed) error is a common issue that occurs when a process on a Linux or Kubernetes system consumes too much memory, exhausting the memory available to the system. This can lead to performance problems and even crashes. To resolve it, identify the process consuming the most memory and take steps to reduce its usage, whether by optimizing the application or workload, or by increasing the amount of available memory. By addressing the root cause of the OOMKilled error, organizations can improve the stability and performance of their Linux and Kubernetes systems and prevent future occurrences.