Instead of using that article's exact formulas, I turned to a paper named "Incremental calculation of weighted mean and variance" by Tony Finch which, besides being an interesting read for a math nerd like me, describes how to calculate both the average and the variance of a time series online. By online, I mean that you do not need to hold the whole data set in memory to calculate the values of interest: you can read the points one by one from a stream, discard the old ones, and still get correct results. This not only saves memory but may also improve performance. The price you pay is a small amount of state that must somehow be persisted.
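In a nutshell, if α is the weight given to a new observation x_n, the update rules boil down to the following (this is my reading of the paper's formulas, and it is what the C# code further down implements):

\begin{align*}
\delta     &= x_n - \mu_{n-1} \\
\mu_n      &= \mu_{n-1} + \alpha\,\delta \\
\sigma_n^2 &= (1 - \alpha)\left(\sigma_{n-1}^2 + \alpha\,\delta^2\right)
\end{align*}

Each new point pulls the average towards itself by a fraction α of the gap, and the variance is updated from that same difference, so the whole state to persist per series is just two numbers.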
Data science is about data, so we need some of it to show how this anomaly detection method works. For that purpose I generated cyclical data with a Python script. Each data point corresponds to a five-minute interval sample as above, and I pretend that all days are equal and that there is a very small upwards trend. The data is produced by a simple sine function with some random noise added. Later I will manually plant some extreme values and check whether the code detects them as abnormal. The script I used to generate the simulated data is the following:
#
# Generates the simulated server log
#
import csv
import random

import numpy as np

base = 750       # baseline level of the series
amplitude = 500  # height of the daily cycle
growth = 1e-6    # per-sample linear growth: the very small upwards trend
noise = 20       # uniform noise of +/- 20

with open('server-log.csv', 'w', newline='') as csvfile:
    fieldnames = ['period', 'value']
    writer = csv.DictWriter(csvfile, fieldnames)
    writer.writeheader()
    # Four years of five-minute samples: 2016 per week x 52 weeks x 4.
    for i in range(2016 * 52 * 4):
        row = {'period': i,
               'value': base * (1 + growth * i)
                        + amplitude * np.sin(i * np.pi * 2 / 288 - np.pi / 2)
                        + (random.random() * 2 * noise - noise)}
        writer.writerow(row)
I implemented the detection code as a small console C# project that reads the whole CSV file and tries to detect any possible outliers. The weekly model is represented by the WeeklyLogModel class, which contains an array of 2016 ExponentialMovingModel objects, one for each five-minute period of the week (12 periods per hour × 24 hours × 7 days = 2016).
using System;

namespace Sagaceco.TimeSeries.Patterns.Models
{
    public class ExponentialMovingModel
    {
        private double average = 0.0;
        private double variance = 0.0;

        public ExponentialMovingModel()
        {
            Weight = 0.1;
        }

        public ExponentialMovingModel(double weight)
        {
            Weight = weight;
        }

        // Weight of each new observation: higher values make the model
        // adapt faster but remember less of the past.
        public double Weight { get; set; }

        public double Average
        {
            get { return average; }
        }

        public double Variance
        {
            get { return variance; }
        }

        public void Update(double x)
        {
            // The all-zeroes state is treated as "not yet initialized":
            // the first observation becomes the initial average.
            if( average == 0.0 && variance == 0.0 )
            {
                average = x;
            }
            else
            {
                // Incremental update of mean and variance (Finch's formulas).
                double diff = x - average;
                double incr = Weight * diff;
                average = average + incr;
                variance = (1 - Weight) * (variance + diff * incr);
            }
        }

        public bool IsOutlier( double radius, double x )
        {
            // A value is suspicious when it lies more than radius
            // standard deviations away from the moving average.
            return Math.Abs( x - average ) > radius * Math.Sqrt( variance );
        }
    }
}
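To see the class in action, here is a quick sanity check with made-up numbers: feed it a handful of values hovering around 100, then ask whether a nearby value and a distant one would be flagged.

using System;
using Sagaceco.TimeSeries.Patterns.Models;

class Demo
{
    static void Main()
    {
        var model = new ExponentialMovingModel(0.1);

        // A few "normal" observations around 100.
        foreach (double x in new[] { 100.0, 102.3, 98.7, 101.1, 99.6 })
            model.Update(x);

        Console.WriteLine(model.IsOutlier(3.5, 99.8));  // close to the mean: False
        Console.WriteLine(model.IsOutlier(3.5, 160.0)); // far away: True
    }
}

The WeeklyLogModel class below then simply maps each five-minute period of the week to one of these models.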
using Sagaceco.TimeSeries.Patterns.Models;

namespace Sagaceco.ServerLog
{
    public class WeeklyLogModel
    {
        // One model per five-minute period of the week.
        private ExponentialMovingModel[] models = new ExponentialMovingModel[2016];
        private double radius = 3.0;

        public WeeklyLogModel()
        {
            for(int i = 0; i < models.Length; i++)
                models[i] = new ExponentialMovingModel();
        }

        public WeeklyLogModel(double weight, double radius)
        {
            this.radius = radius;
            for(int i = 0; i < models.Length; i++)
                models[i] = new ExponentialMovingModel(weight);
        }

        public void Update(LogRecord record)
        {
            int index = record.Period % models.Length;
            models[index].Update(record.Value);
        }

        public bool IsOutlier(LogRecord record)
        {
            int index = record.Period % models.Length;
            return models[index].IsOutlier(radius, record.Value);
        }

        public double GetAverage(LogRecord record)
        {
            int index = record.Period % models.Length;
            return models[index].Average;
        }

        public double GetVariance(LogRecord record)
        {
            int index = record.Period % models.Length;
            return models[index].Variance;
        }
    }
}
The first 32 weeks are used to train the model and the remaining ones are evaluated for outliers. I arrived at this figure by trial and error: with a shorter training period you get too many false positives. I also tuned the other parameters, settling on 0.1 for the "memory" weight and 3.5 for the detection "radius". The radius is a distance from the model mean, measured in standard deviations, within which values are considered acceptable; anything outside it is reported as a potential outlier. Smaller radii produce more false positives, while larger ones may let true positives go unreported. This is definitely a parameter to tweak for your own series.
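The driver program itself is in the repo; as a rough sketch of what it does, assuming a simple LogRecord type carrying the two CSV columns (the names here are my stand-ins, not necessarily what the repo uses), the main loop looks something like this:

using System;
using System.Globalization;
using System.IO;
using System.Linq;
using Sagaceco.ServerLog;

namespace Sagaceco.ServerLog
{
    // Hypothetical stand-in for the repo's log record type:
    // one CSV row, i.e. a (period, value) pair.
    public record LogRecord(int Period, double Value);
}

class Program
{
    // 32 weeks of five-minute periods are used for training.
    const int TrainingPeriods = 2016 * 32;

    static void Main()
    {
        var model = new WeeklyLogModel(0.1, 3.5);

        // Skip(1) drops the CSV header line.
        foreach (var line in File.ReadLines("server-log.csv").Skip(1))
        {
            var fields = line.Split(',');
            var record = new LogRecord(
                int.Parse(fields[0]),
                double.Parse(fields[1], CultureInfo.InvariantCulture));

            // Report outliers only after the training phase.
            if (record.Period >= TrainingPeriods && model.IsOutlier(record))
                Console.WriteLine($"Possible outlier at period {record.Period}: {record.Value}");

            model.Update(record);
        }
    }
}

In this sketch the model keeps learning during the evaluation phase as well; whether values already flagged as outliers should be fed back into the model is a design choice worth considering.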
Running the code as it is shows that the original simulated data triggers a single false positive, at period number 65797. The model's average there is 1226.52 with a standard deviation of 7.77, so the largest acceptable value is roughly 1226.52 + 3.5 × 7.77 ≈ 1253.72, just below the observed 1255.28. All other 419,327 log records were considered "normal" or "acceptable".
Now let's try a real outlier. I planted one in a copy of the original file named "server-log-bad.csv" (also in the git repo) at observation number 419000. Just change the file name in the C# project and run it again.
A related approach can be found here: A simple approach to anomaly detection in periodic big data streams
You can find all the source code for this post on GitHub.