My last post explored an idea for automatically detecting outliers in seasonal time series. These time series have both weekly and daily cycles along with a small trend, like the ones you find in server performance meters or even in some business cycles. Data for these time series is typically sampled at a fixed frequency, such as every five minutes or every hour. To detect unexpected data in such time series, the post followed an idea that instantiates a predictive model for every sampling period of the weekly cycle. The original idea suggested an exponential moving average model, which proved to work very well. I actually tested it with a four-year data set from one of our servers, only to find that outlier detection was spot-on. Now, what else can we try with this idea?
What if we used a different model, say a linear regression? In this case we would have a linear regression model for each one of the sampling periods. As you may recall, the exponential moving average model was implemented as an on-line process, being able to read and discard one sampled measure at a time. Can we do this with a linear regression model? Let's see how to estimate the parameters of a simple linear regression model:
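For reference, these are the standard least-squares estimates for a simple regression of y on x. The symbol names are the usual textbook ones and are an assumption here:

```latex
\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}
```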
Looking at the formula to calculate the slope estimate, it might seem that the whole data set would be needed. After all, the above formula seems to imply that we must keep track of all the x and y values, so the whole data set would have to be kept in memory like a spreadsheet.
Fortunately this is not so. After some algebraic manipulations (see here), we can safely transform the equation into this:
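In the overline notation for sample means (assumed here), the well-known equivalent form of the slope estimate is:

```latex
\hat{\beta}_1 = \frac{\overline{xy} - \bar{x}\,\bar{y}}{\overline{x^2} - \bar{x}^2}
```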
where the variables above are very simple averages:
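Spelled out, and again assuming the usual overline notation for sample means over the n observations:

```latex
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad
\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i, \qquad
\overline{xy} = \frac{1}{n}\sum_{i=1}^{n} x_i y_i, \qquad
\overline{x^2} = \frac{1}{n}\sum_{i=1}^{n} x_i^2
```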
This means that we only need a few variables to store all the information necessary to calculate the regression coefficients. Note that we store the sums rather than the averages themselves; together with the sample count (n), the sums let us compute the averages whenever they are needed. The stored variables are then:
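The names Sx, Sy, Sxx and Sxy below are assumed by analogy with the Syy used in this post. Multiplying numerator and denominator of the slope by n squared gives both coefficients directly in terms of these sums:

```latex
S_x = \sum_{i=1}^{n} x_i, \qquad
S_y = \sum_{i=1}^{n} y_i, \qquad
S_{xx} = \sum_{i=1}^{n} x_i^2, \qquad
S_{xy} = \sum_{i=1}^{n} x_i y_i

\hat{\beta}_1 = \frac{n S_{xy} - S_x S_y}{n S_{xx} - S_x^2},
\qquad
\hat{\beta}_0 = \frac{S_y - \hat{\beta}_1 S_x}{n}
```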
For completeness, we will also store Syy (see below why). To determine whether a given value is unexpected, we will use the same approach as before: first calculate the expected value, then build an acceptability range around it using the expected value's standard error or a similar measure. If the observed value falls outside this range, it is flagged as a potential outlier.
In the case of linear regression there is a standard approach to the outlier detection problem. For each estimated value, calculate its residual (the difference between the expected and the observed value) and divide it by the residual standard error (RSE). The result is what statistics lingo loosely calls the studentized residual; a strict studentization would also correct for each point's leverage. The residual is pretty easy to calculate:
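With the observed value written as y and the model's estimate as y-hat (notation assumed here), the residual is just their difference. The sign convention does not matter for outlier detection, since only the magnitude is tested:

```latex
e_i = y_i - \hat{y}_i
```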
So how do we calculate RSE? Easy:
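This is the standard definition for a simple linear regression, where two parameters are estimated and so n - 2 degrees of freedom remain:

```latex
\mathrm{RSE} = \sqrt{\frac{\mathrm{RSS}}{n - 2}}
```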
where RSS, the residual sum of squares, is defined as:
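In the same assumed notation, this is the sum of the squared residuals over all n observations:

```latex
\mathrm{RSS} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
```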
We can also rearrange this formula using
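The relation used is presumably the fitted line itself, written here with the assumed coefficient names:

```latex
\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i
```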
so that we get
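Substituting the fitted line into the residual sum of squares, in the notation assumed above:

```latex
\mathrm{RSS} = \sum_{i=1}^{n} \left( y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i \right)^2
```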
Expanding the square, we get
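Term by term, with the sum names Sx, Sy, Sxx and Sxy assumed by analogy with Syy, the expansion works out to:

```latex
\mathrm{RSS} = S_{yy} - 2\hat{\beta}_0 S_y - 2\hat{\beta}_1 S_{xy}
             + n\hat{\beta}_0^2 + 2\hat{\beta}_0\hat{\beta}_1 S_x + \hat{\beta}_1^2 S_{xx},
\qquad
S_{yy} = \sum_{i=1}^{n} y_i^2
```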
which is already expressed in terms of the variables we store. Now, it is very easy to implement the new LinearRegressionModel class. The update method is simply:
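A minimal Python sketch of such an update method follows. It is not the original class from the repository: the attribute names (`sx`, `sy`, `sxx`, `sxy`, `syy`) and the `slope` and `intercept` helpers are assumptions made for illustration.

```python
class LinearRegressionModel:
    """Online simple linear regression that keeps only running sums."""

    def __init__(self):
        self.n = 0       # sample count
        self.sx = 0.0    # sum of x
        self.sy = 0.0    # sum of y
        self.sxx = 0.0   # sum of x squared
        self.sxy = 0.0   # sum of x times y
        self.syy = 0.0   # sum of y squared (needed later for the RSS)

    def update(self, x, y):
        """Absorb one sample; the sample itself can then be discarded."""
        self.n += 1
        self.sx += x
        self.sy += y
        self.sxx += x * x
        self.sxy += x * y
        self.syy += y * y

    def slope(self):
        denom = self.n * self.sxx - self.sx ** 2
        if self.n < 2 or denom == 0:
            raise ValueError("need at least two samples with distinct x values")
        return (self.n * self.sxy - self.sx * self.sy) / denom

    def intercept(self):
        return (self.sy - self.slope() * self.sx) / self.n


# Usage: four points lying exactly on y = 2x + 1.
model = LinearRegressionModel()
for x, y in [(0, 1), (1, 3), (2, 5), (3, 7)]:
    model.update(x, y)
print(model.slope(), model.intercept())
```

Each call to `update` costs constant time and memory, which is what makes the model usable as an on-line process.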
Outlier detection is a bit more convoluted:
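The sketch below shows one way the check could look in Python, working only from the stored sums. The `Sums` container, the function names and the threshold of 3 are assumptions for illustration, not the code from the repository.

```python
import math
from dataclasses import dataclass


@dataclass
class Sums:
    """Running sums for one sampling period (hypothetical container)."""
    n: int = 0
    sx: float = 0.0
    sy: float = 0.0
    sxx: float = 0.0
    sxy: float = 0.0
    syy: float = 0.0

    def add(self, x, y):
        self.n += 1
        self.sx += x
        self.sy += y
        self.sxx += x * x
        self.sxy += x * y
        self.syy += y * y


def coefficients(s):
    """Return (intercept, slope) computed from the stored sums."""
    b1 = (s.n * s.sxy - s.sx * s.sy) / (s.n * s.sxx - s.sx ** 2)
    b0 = (s.sy - b1 * s.sx) / s.n
    return b0, b1


def residual_standard_error(s):
    """RSE = sqrt(RSS / (n - 2)), with RSS expanded in terms of the sums."""
    b0, b1 = coefficients(s)
    rss = (s.syy - 2 * b0 * s.sy - 2 * b1 * s.sxy
           + s.n * b0 * b0 + 2 * b0 * b1 * s.sx + b1 * b1 * s.sxx)
    rss = max(rss, 0.0)  # guard against tiny negative values from rounding
    return math.sqrt(rss / (s.n - 2))


def is_outlier(s, x, y, threshold=3.0):
    """Flag y when its residual exceeds `threshold` times the RSE."""
    if s.n < 3 or s.n * s.sxx == s.sx ** 2:
        return False  # not enough information to fit a line yet
    b0, b1 = coefficients(s)
    residual = y - (b0 + b1 * x)
    rse = residual_standard_error(s)
    if rse == 0.0:
        return residual != 0.0  # perfect fit: any deviation is unexpected
    return abs(residual / rse) > threshold


# Usage: y = 2x + 1 with a small alternating disturbance of +/- 0.1.
sums = Sums()
for x in range(10):
    sums.add(x, 2 * x + 1 + (0.1 if x % 2 == 0 else -0.1))
print(is_outlier(sums, 10, 21.0))  # close to the fitted line
print(is_outlier(sums, 10, 30.0))  # far from the fitted line
```

Expanding the RSS this way subtracts large, nearly equal numbers, so the result can lose precision when the values are large compared with their spread. The clamp to zero only covers the most visible symptom.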
As usual, the code can be found on GitHub.
Equations in this post were created using a very nice online equation editor. Besides the Wikipedia article, I also used some knowledge from this wonderful book: An Introduction to Statistical Learning: with Applications in R (Springer Texts in Statistics).