Machine Learning for Anomaly Detection on VM and Host Performance Metrics
IT operations needs an improved approach to warnings and alerts. Currently, most IT monitoring software uses static performance thresholds (e.g., a fixed 80% CPU usage limit). This project explores the use of machine learning algorithms to compute dynamic thresholds based on time series anomaly detection. We developed a procedure that:

1) Determines the periodicity of a metric using the autocorrelation function (ACF).

2) Uses a Kalman filter tuned to that periodicity to learn the behavior of IT performance metrics and forecast values based on time of day, etc. The actual value is then compared to the forecast to detect anomalies, i.e., violations of the dynamic threshold.

3) Applies a second algorithm (DBSCAN) to check for week-to-week degradation or abnormal behavior. Because Kalman filters continue to learn from new data, this check prevents the Kalman filter from learning from "bad performance" data and corrupting the calculation of the dynamic threshold.

The work also included examining time series data for many virtual machine metrics and identifying frequently occurring patterns; the algorithms (1-3) were successfully tested on examples of all the patterns. The research calculates dynamic thresholds for only a single independent performance metric at a time.
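Steps 1 and 2 could be sketched roughly as below. This is a minimal illustration, not the project's actual code: it assumes hourly samples, detects the period as the first strong local maximum of the ACF, and stands in for the full Kalman filter with a bank of simple one-dimensional filters, one per phase of the detected period. All names and parameters (`estimate_period`, `PhaseKalman`, `q`, `r`, `k_sigma`) are illustrative assumptions.

```python
import numpy as np

def estimate_period(series, max_lag=None):
    """Estimate the dominant period via the autocorrelation function (ACF).

    Heuristic: return the first local maximum of the ACF (after lag 0)
    whose correlation is reasonably strong.
    """
    x = np.asarray(series, dtype=float)
    x = x - x.mean()
    n = len(x)
    if max_lag is None:
        max_lag = n // 2
    acf = np.array([np.dot(x[: n - k], x[k:]) for k in range(max_lag)])
    acf = acf / acf[0]  # normalize so acf[0] == 1
    for k in range(1, max_lag - 1):
        if acf[k] > acf[k - 1] and acf[k] >= acf[k + 1] and acf[k] > 0.3:
            return k
    return int(np.argmax(acf[1:])) + 1  # fallback: strongest nonzero lag

class PhaseKalman:
    """One scalar Kalman filter per phase of the period (e.g., per hour of day)."""

    def __init__(self, period, q=0.01, r=1.0):
        self.period = period
        self.q, self.r = q, r            # process / measurement noise variances
        self.x = np.zeros(period)        # state estimate per phase
        self.p = np.full(period, 1e3)    # state variance per phase (high = untrained)

    def step(self, t, z, k_sigma=3.0):
        """Forecast for time t, flag z if it breaks the dynamic threshold, then update."""
        i = t % self.period
        forecast = self.x[i]
        p_pred = self.p[i] + self.q                 # predicted variance
        band = k_sigma * np.sqrt(p_pred + self.r)   # dynamic threshold width
        anomaly = abs(z - forecast) > band
        gain = p_pred / (p_pred + self.r)           # Kalman gain
        self.x[i] = forecast + gain * (z - forecast)
        self.p[i] = (1.0 - gain) * p_pred
        return forecast, band, anomaly
```

On a synthetic metric with a daily cycle (e.g., `50 + 20*sin(2*pi*t/24)` plus noise), `estimate_period` recovers 24 and, after training, a sudden spike falls outside the forecast band while normal values stay inside it. Note that the filter updates on the spike too, which is exactly the corruption problem step 3 addresses.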
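The week-to-week check in step 3 could be sketched with a minimal from-scratch DBSCAN run over per-week summary features; weeks labeled as outliers would then be excluded from Kalman training. The feature choice here (a small vector per week, e.g., weekly mean and spread) and all parameter names are illustrative assumptions, not the project's actual design.

```python
import numpy as np

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN: returns one cluster label per point; -1 marks outliers."""
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    # Pairwise Euclidean distances and eps-neighborhoods (fine for small n).
    dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
    neigh = [np.flatnonzero(dist[i] <= eps) for i in range(n)]
    labels = np.full(n, -1)  # -1 = noise until claimed by a cluster
    cluster = -1
    for i in range(n):
        if labels[i] != -1 or len(neigh[i]) < min_pts:
            continue                  # already clustered, or not a core point
        cluster += 1                  # start a new cluster from core point i
        labels[i] = cluster
        seeds = list(neigh[i])
        while seeds:                  # expand the cluster through core points
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cluster
                if len(neigh[j]) >= min_pts:
                    seeds.extend(neigh[j])
    return labels
```

With, say, twelve normal weeks whose feature vectors sit close together and one degraded week far away, the normal weeks form a single cluster and the degraded week comes back labeled -1, so it can be withheld from the Kalman filter's updates.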