Let's start with this post's inspiration, an excellent idea brought to my attention by the weekly O'Reilly Data Newsletter: "Data Science tricks: Simple anomaly detection for metrics with a weekly pattern", by Ilias Flaounas. Instead of using one model for the whole time series, the author proposes creating multiple models, one for each seasonal measurement. For instance, if every 5 minutes you sample a data source that has both a daily and a weekly pattern, you would create 12 * 24 * 7 = 2016 models (there are 12 five-minute intervals in one hour). Other sources, like sales data, might require larger sampling periods and a smaller number of models; e.g. an hourly sampled source with daily and weekly seasonality would require 24 * 7 = 168 models.
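To make the bucketing concrete, here is a small Python helper of my own (not from the article) that maps a timestamp to one of the 2016 weekly slots, assuming Monday 00:00 is slot 0 and samples fall exactly on five-minute boundaries:

```python
from datetime import datetime

SAMPLES_PER_HOUR = 12                     # one sample every 5 minutes
SAMPLES_PER_DAY = 24 * SAMPLES_PER_HOUR   # 288
SAMPLES_PER_WEEK = 7 * SAMPLES_PER_DAY    # 2016 models in total


def slot_index(ts: datetime) -> int:
    """Map a timestamp to one of the 2016 weekly model slots (Monday 00:00 = slot 0)."""
    return ts.weekday() * SAMPLES_PER_DAY + ts.hour * SAMPLES_PER_HOUR + ts.minute // 5


print(slot_index(datetime(2024, 1, 1, 0, 0)))    # Monday 00:00  -> 0
print(slot_index(datetime(2024, 1, 7, 23, 55)))  # Sunday 23:55  -> 2015
```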
Instead of using the article's exact formulas, I turned to a piece named "Incremental calculation of weighted mean and variance" by Tony Finch which, besides being an interesting read for a math nerd like me, describes how to perform the online calculation of both the average and the variance of the time series values. By online, I mean that you do not need to hold the whole data set in memory to calculate the values of interest: you can read the observations one by one from a stream, discard the old ones, and still get the correct results. This not only saves memory but may also improve performance. The price you pay is a small amount of state that must somehow be persisted.
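My detection code (shown later) is written in C#, but Finch's exponentially weighted updates are easy to sketch in Python; the handling of the very first observation below is my own assumption:

```python
class ExponentialMovingModel:
    """Online exponentially weighted mean and variance, after Tony Finch's note."""

    def __init__(self, alpha: float):
        self.alpha = alpha        # weight given to each new observation (the "memory")
        self.mean = 0.0
        self.variance = 0.0
        self.seeded = False

    def update(self, value: float) -> None:
        if not self.seeded:
            self.mean = value     # seed the model with the first observation
            self.seeded = True
            return
        diff = value - self.mean
        incr = self.alpha * diff
        self.mean += incr
        # incremental update of the exponentially weighted variance
        self.variance = (1 - self.alpha) * (self.variance + diff * incr)

    @property
    def std_dev(self) -> float:
        return self.variance ** 0.5
```

The only state that has to be persisted per model is the pair (mean, variance), which is what makes the approach cheap to run over a stream.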
Data science is about data, so we need some of it to show how this anomaly detection method works. For this purpose, I will simulate cyclical data with a Python script. Each data point corresponds to a five-minute interval sample as above, and I am going to pretend that all days are equal and that there is a very small upward trend. The data is generated with a simple sine function plus some random noise. Later I will manually add some extreme values and check whether the code detects them as abnormal. The Python script I used to generate the simulated data is on the git repo.
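A minimal sketch of such a generator follows; the baseline, amplitude, trend slope and noise range here are assumptions of mine, so the numbers it produces will not match the repo's file exactly:

```python
import csv
import math
import random

SAMPLES_PER_DAY = 24 * 12                  # one sample every 5 minutes
TOTAL = 4 * 52 * 7 * SAMPLES_PER_DAY       # four 52-week periods = 419,328 points

BASELINE = 1000.0          # assumed constants, not the repo's exact values
AMPLITUDE = 250.0
TREND_PER_SAMPLE = 0.0001  # very small upward trend
NOISE = 10.0               # uniform noise amplitude

with open("server-log.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for i in range(TOTAL):
        phase = 2 * math.pi * (i % SAMPLES_PER_DAY) / SAMPLES_PER_DAY
        value = (BASELINE
                 + AMPLITUDE * math.sin(phase)       # daily pattern
                 + TREND_PER_SAMPLE * i              # slow drift upwards
                 + random.uniform(-NOISE, NOISE))    # random variation
        writer.writerow([i, round(value, 2)])
```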
The script generates data for approximately four years (four 52-week periods). To simulate random variations, it adds a uniform random value to each generated point. You can either run the code yourself or use the "server-log.csv" file on the git repo.
I implemented the detection code as a small C# console project that reads the whole CSV file and tries to detect any possible outliers. The weekly model is represented by the WeeklyLogModel class, which contains an array of 2016 ExponentialMovingModel objects.
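In the same spirit as the Python sketches above, the weekly model essentially routes each sample to its slot's model; this reuses the ExponentialMovingModel sketched earlier, and the method names are mine rather than the ones in the C# source:

```python
class WeeklyLogModel:
    """One ExponentialMovingModel per five-minute slot of the week."""

    SLOTS = 7 * 24 * 12   # 2016

    def __init__(self, alpha: float, radius: float):
        self.radius = radius
        self.models = [ExponentialMovingModel(alpha) for _ in range(self.SLOTS)]

    def update(self, slot: int, value: float) -> None:
        self.models[slot].update(value)

    def is_outlier(self, slot: int, value: float) -> bool:
        # a value is suspicious when it lies more than `radius` standard
        # deviations away from the slot's moving mean
        model = self.models[slot]
        return abs(value - model.mean) > self.radius * model.std_dev
```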
The first 32 weeks are used to train the model and the remaining ones are evaluated for outliers. I arrived at this figure by trial and error: make it too small and you will get too many false positives. I also tuned the other parameters by hand: 0.1 for the "memory" parameter and 3.5 for the detection "radius". The radius is a distance from the model mean, measured in standard deviations, within which acceptable values must lie; outside of it, values are considered potential outliers. A smaller radius will produce more false positives, while a larger one may let true outliers go unreported. This is definitely a parameter to tweak according to your own series.
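Putting the parameters together, the evaluation pass looks roughly like this (a Python sketch of what the C# console program does; the CSV layout, using the "memory" value directly as the exponential weight, and continuing to update the model after the training period are assumptions on my part):

```python
import csv

SLOTS_PER_WEEK = 7 * 24 * 12
TRAINING_SAMPLES = 32 * SLOTS_PER_WEEK         # first 32 weeks only build the model

model = WeeklyLogModel(alpha=0.1, radius=3.5)  # "memory" assumed to be the exponential weight

with open("server-log.csv", newline="") as f:
    for i, row in enumerate(csv.reader(f)):
        value = float(row[1])                  # assumes the value is in the second column
        slot = i % SLOTS_PER_WEEK
        if i >= TRAINING_SAMPLES and model.is_outlier(slot, value):
            print(f"possible outlier at period {i}: {value}")
        model.update(slot, value)              # keep adapting to the slow upward trend
```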
Running the code as-is shows that the original simulated data has only one false positive, at period number 65797. The value you would expect there is roughly 1226.52 + 3.2 * 7.77 = 1251.38, quite close to the observed 1255.28. All other 419327 log records were considered "normal" or "acceptable".
Now let's try a real outlier. I planted one in a copy of the original file named "server-log-bad.csv" (also on the git repo), at observation number 419000. Just change the file name in the C# project and run it again.
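If you prefer to plant your own outlier instead of using the prepared file, something along these lines will do; the factor used to distort the observation here is arbitrary, not the one used for the repo's file:

```python
import csv

PLANTED_AT = 419000   # observation to distort

with open("server-log.csv", newline="") as src, \
     open("server-log-bad.csv", "w", newline="") as dst:
    writer = csv.writer(dst)
    for i, row in enumerate(csv.reader(src)):
        if i == PLANTED_AT:
            # inflate the value well past the 3.5 standard deviation radius
            row[1] = str(round(float(row[1]) * 1.5, 2))
        writer.writerow(row)
```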
Another approach to this idea can be found here: A simple approach to anomaly detection in periodic big data streams
You can find all the source code for this post on GitHub.