Machine Learning for Anomaly Detection on VM and Host Performance Metrics
July 2018
Topics: Machine Learning, Computing Methodologies, Software (General), Software Testing
IT operations needs an improved approach to warnings and alerts. Currently, most IT monitoring software uses static performance thresholds (e.g., alert when CPU usage exceeds 80%). This project explores the use of machine learning algorithms to compute dynamic thresholds based on time-series anomaly detection. We developed a procedure that:

1) Determines the periodicity of a metric using the autocorrelation function (ACF).

2) Uses a Kalman filter tuned to that periodicity to learn the behavior of IT performance metrics and forecast values based on time of day, etc. The actual value is compared to the forecast to check for anomalies, i.e., violations of the dynamic threshold.

3) Applies a second algorithm (DBSCAN) to check for week-to-week degradation or abnormal behavior. Because Kalman filters continue to learn from new data, this check is needed to prevent the Kalman algorithm from learning from "bad performance" data and corrupting the calculation of the dynamic threshold.

4) The work also included examining time-series data for many virtual-machine metrics and identifying frequently occurring patterns. The algorithms (steps 1-3) were successfully tested on examples of all the patterns.

The research calculates dynamic thresholds for only a single, independent performance metric at a time.
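Step 1 above can be sketched as follows. This is an illustrative implementation, not the project's actual code: it estimates the dominant period of a metric from the sample ACF by skipping the short-lag positive run (everything before the first zero crossing) and taking the lag where the ACF peaks. The function name, the two-week hourly CPU example, and the noise levels are assumptions for demonstration.

```python
import numpy as np

def estimate_period(series, max_lag=None):
    """Estimate the dominant period of a 1-D series from its
    autocorrelation function (ACF)."""
    x = np.asarray(series, dtype=float)
    x = x - x.mean()
    n = len(x)
    if max_lag is None:
        max_lag = n // 2
    # Autocorrelation for lags 0..max_lag, normalized so acf[0] == 1.
    acf = np.correlate(x, x, mode="full")[n - 1:n - 1 + max_lag + 1]
    acf = acf / acf[0]
    # The ACF of a periodic signal peaks at the period. Skip the initial
    # positive run (trivial short-lag correlation) by starting after the
    # first zero crossing, then take the lag with the largest ACF value.
    neg = np.flatnonzero(acf < 0)
    if neg.size == 0:
        return None                      # no clear periodicity found
    start = neg[0]
    return int(start + np.argmax(acf[start:]))

# Hourly CPU samples with a 24-sample (daily) cycle plus noise.
t = np.arange(24 * 14)                   # two weeks of hourly data
cpu = (50 + 20 * np.sin(2 * np.pi * t / 24)
       + np.random.default_rng(0).normal(0, 2, t.size))
print(estimate_period(cpu))              # → 24
```

Searching past the first zero crossing matters: for a smooth daily cycle, the raw ACF at lag 1 is nearly as large as at the true period, so a naive global argmax over all nonzero lags would return 1.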
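Step 2 might look like the following sketch: one scalar Kalman filter per slot of the detected period, where the state is the expected metric value at that time of day, and a point is anomalous when its residual exceeds k standard deviations of the innovation. The class name and the noise/threshold parameters (q, r, k) are assumed tuning values, not from the source; the `learn` flag stands in for the DBSCAN-driven decision to withhold "bad" data from the filter.

```python
import numpy as np

class PeriodicKalman:
    """Per-slot scalar Kalman filter for a periodic metric (illustrative).

    State x[s] is the expected metric value at slot s of the period,
    modeled as a slow random walk. The dynamic threshold is k standard
    deviations of the innovation around the forecast."""

    def __init__(self, period, q=0.5, r=4.0, k=3.0):
        self.period = period
        self.q, self.r, self.k = q, r, k      # process noise, measurement noise, width
        self.x = np.zeros(period)             # per-slot state estimate
        self.P = np.full(period, 1e6)         # per-slot variance (uninformative prior)

    def step(self, t, z, learn=True):
        """Process observation z at time index t; return (forecast, is_anomaly)."""
        s = t % self.period
        P_pred = self.P[s] + self.q           # predict: random walk per slot
        innov = z - self.x[s]                 # innovation (residual vs. forecast)
        S = P_pred + self.r                   # innovation variance
        anomaly = abs(innov) > self.k * np.sqrt(S)
        if learn:                             # caller may withhold "bad" data
            K = P_pred / S                    # Kalman gain
            self.x[s] += K * innov
            self.P[s] = (1 - K) * P_pred
        return self.x[s], anomaly

# Train on two weeks of hourly data with a daily cycle, then test a spike.
kf = PeriodicKalman(period=24)
rng = np.random.default_rng(1)
for t in range(24 * 14):
    z = 50 + 20 * np.sin(2 * np.pi * t / 24) + rng.normal(0, 1)
    kf.step(t, z)
forecast, anomaly = kf.step(24 * 14, 120.0, learn=False)  # sudden spike
print(anomaly)                                             # → True
```

Because the filter keeps one state per slot, the threshold adapts to time of day: a value that is normal at noon can be flagged at 3 a.m.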
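For step 3, the week-over-week check could work as sketched below: summarize each week of a metric as a small feature vector and run DBSCAN, whose noise label (-1) marks weeks that belong to no dense cluster; those weeks would be excluded from Kalman learning. The minimal DBSCAN here, the per-week features (mean and 95th percentile), and the eps/min_samples values are all assumptions for illustration.

```python
import numpy as np

def dbscan(X, eps, min_samples):
    """Minimal DBSCAN (for illustration): returns one label per point,
    with -1 marking outliers that belong to no dense cluster."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    # Pairwise Euclidean distances and eps-neighborhoods (self included).
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    nbrs = [np.flatnonzero(dist[i] <= eps) for i in range(n)]
    labels = np.full(n, -1)
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or len(nbrs[i]) < min_samples:
            continue                      # already clustered, or not a core point
        labels[i] = cluster               # start a new cluster at core point i
        queue = list(nbrs[i])
        while queue:                      # density-reachability expansion
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster
                if len(nbrs[j]) >= min_samples:
                    queue.extend(nbrs[j])  # j is also core: expand through it
        cluster += 1
    return labels

# Each row summarizes one week: [mean CPU, 95th-percentile CPU].
weeks = np.array([
    [50.2, 61.0], [49.7, 60.4], [50.5, 61.8], [49.9, 60.1],
    [50.8, 62.0], [49.4, 60.7], [50.1, 61.2],
    [68.3, 80.5],                         # degraded week: higher mean and p95
])
labels = dbscan(weeks, eps=3.0, min_samples=3)
print(labels)                             # degraded week is labeled -1
```

Weeks labeled -1 are the candidates for "bad performance" data: flagging them before the Kalman update keeps degraded behavior from being absorbed into the dynamic threshold.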