What if we used a different model, say a linear regression? In this case we would have one linear regression model for each of the sampling periods. As you may recall, the exponential moving average model was implemented as an *on-line process*, able to read and discard one sampled measure at a time. Can we do the same with a linear regression model? Let's see how the parameters of a simple linear regression model $y = \alpha + \beta x$ are estimated. The intercept is:

$$\hat{\alpha} = \bar{y} - \hat{\beta}\,\bar{x}$$

and the slope:

$$\hat{\beta} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$

Looking at the formula to calculate the slope estimate, it might seem that the whole data set would be needed. After all, the above formula seems to imply that we must keep track of all the *x* and *y* values, so the whole data set would have to be kept in memory like a spreadsheet.

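Fortunately this is not so. After some algebraic manipulations (see here), we can safely transform the slope equation to this:

$$\hat{\beta} = \frac{\overline{xy} - \bar{x}\,\bar{y}}{\overline{x^2} - \bar{x}^2}$$

where the variables above are very simple averages:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i, \qquad \overline{xy} = \frac{1}{n}\sum_{i=1}^{n} x_i y_i, \qquad \overline{x^2} = \frac{1}{n}\sum_{i=1}^{n} x_i^2$$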
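For readers who want the intermediate steps, the manipulation can be sketched as follows. The notation is an assumption of this sketch: $\bar{x}$ and $\bar{y}$ are the means of the $x_i$ and $y_i$, while $\overline{xy}$ and $\overline{x^2}$ are the means of the products $x_i y_i$ and of the squares $x_i^2$. Expanding the numerator and the denominator of the usual least-squares slope and dividing both by $n$ gives:

$$
\begin{aligned}
\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) &= \sum_{i=1}^{n} x_i y_i - \bar{y}\sum_{i=1}^{n} x_i - \bar{x}\sum_{i=1}^{n} y_i + n\,\bar{x}\,\bar{y} = \sum_{i=1}^{n} x_i y_i - n\,\bar{x}\,\bar{y} \\
\sum_{i=1}^{n} (x_i - \bar{x})^2 &= \sum_{i=1}^{n} x_i^2 - 2\,\bar{x}\sum_{i=1}^{n} x_i + n\,\bar{x}^2 = \sum_{i=1}^{n} x_i^2 - n\,\bar{x}^2 \\
\hat{\beta} &= \frac{\tfrac{1}{n}\sum_{i=1}^{n} x_i y_i - \bar{x}\,\bar{y}}{\tfrac{1}{n}\sum_{i=1}^{n} x_i^2 - \bar{x}^2} = \frac{\overline{xy} - \bar{x}\,\bar{y}}{\overline{x^2} - \bar{x}^2}
\end{aligned}
$$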

This means that we only need a few variables to store all the information necessary to calculate the regression coefficients. Note that instead of the averages themselves, only the sums are stored; together with the sample count (*n*), they allow the averages to be calculated when needed. The stored variables are then:

$$n, \qquad S_x = \sum_{i=1}^{n} x_i, \qquad S_y = \sum_{i=1}^{n} y_i, \qquad S_{xx} = \sum_{i=1}^{n} x_i^2, \qquad S_{xy} = \sum_{i=1}^{n} x_i y_i$$
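To make this concrete, here is a minimal Python sketch of how the intercept and the slope could be recovered from the stored sums alone. The function name `coefficients` and the parameter names `n`, `sx`, `sy`, `sxx`, `sxy` are assumptions made for this illustration, not names taken from the original code.

```python
def coefficients(n, sx, sy, sxx, sxy):
    """Return (intercept, slope) of a simple linear regression.

    Hypothetical helper: the arguments are the sample count and the
    running sums of x, y, x*x and x*y.
    """
    if n < 2:
        raise ValueError("at least two samples are needed")

    # Averages are derived from the sums only when they are needed.
    mean_x = sx / n
    mean_y = sy / n
    mean_xy = sxy / n
    mean_xx = sxx / n

    denominator = mean_xx - mean_x ** 2
    if denominator == 0:
        raise ValueError("all x values are equal; the slope is undefined")

    slope = (mean_xy - mean_x * mean_y) / denominator
    intercept = mean_y - slope * mean_x
    return intercept, slope
```

For the points (1, 3), (2, 5), (3, 7), (4, 9) the sums are n = 4, Sx = 10, Sy = 24, Sxx = 30 and Sxy = 70, and `coefficients(4, 10, 24, 30, 70)` recovers the line y = 1 + 2x without ever looking at the individual points again.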

For completeness, we will also store *Syy*, the sum of the squared *y* values (see below why).

To determine if a given value is unexpected, we will use the same approach as before: first the expected value is calculated, and then an acceptability range is built around it using the expected value's standard error or another similar measure. If the observed value falls outside this range, it is flagged as a potential outlier.

In the case of linear regression there is a standard approach to the outlier detection issue. For each estimated value $\hat{y}_i$, calculate its residual, the difference between the observed and the expected value, and divide it by the residual standard error (RSE). In statistics lingo this ratio is, up to a leverage correction that is ignored here, the *studentized* residual. The residual is pretty easy to calculate:

$$e_i = y_i - \hat{y}_i$$

So how do we calculate RSE? Easy:

$$RSE = \sqrt{\frac{RSS}{n - 2}}$$

where RSS, the residual sum of squares, is defined as:

$$RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

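We can also rearrange this formula using the regression line itself, with intercept estimate $\hat{\alpha}$ and slope estimate $\hat{\beta}$:

$$\hat{y}_i = \hat{\alpha} + \hat{\beta} x_i$$

so that we get

$$RSS = \sum_{i=1}^{n} (y_i - \hat{\alpha} - \hat{\beta} x_i)^2$$

Expanding the square, we get

$$RSS = \sum_{i=1}^{n} \left( y_i^2 + \hat{\alpha}^2 + \hat{\beta}^2 x_i^2 - 2\hat{\alpha} y_i - 2\hat{\beta} x_i y_i + 2\hat{\alpha}\hat{\beta} x_i \right)$$

or

$$RSS = S_{yy} + n\hat{\alpha}^2 + \hat{\beta}^2 S_{xx} - 2\hat{\alpha} S_y - 2\hat{\beta} S_{xy} + 2\hat{\alpha}\hat{\beta} S_x$$

which is already expressed in terms of the variables we store. This is also why *Syy* is kept alongside the other sums.

Now, it is very easy to implement the new LinearRegressionModel class. The update method is simply: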
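The listing below is a minimal Python sketch of such an update step, not the original implementation. Only the class name comes from the text; the attribute names (`n`, `sx`, `sy`, `sxx`, `sxy`, `syy`) and the method signature are assumptions made for illustration.

```python
class LinearRegressionModel:
    """On-line simple linear regression state (illustrative sketch).

    Only the sample count and five running sums are kept, so each
    sample can be discarded as soon as it has been processed.
    """

    def __init__(self):
        self.n = 0      # number of samples seen so far
        self.sx = 0.0   # sum of x
        self.sy = 0.0   # sum of y
        self.sxx = 0.0  # sum of x squared
        self.sxy = 0.0  # sum of x times y
        self.syy = 0.0  # sum of y squared (needed later for RSS)

    def update(self, x, y):
        """Fold one (x, y) sample into the running sums."""
        self.n += 1
        self.sx += x
        self.sy += y
        self.sxx += x * x
        self.sxy += x * y
        self.syy += y * y
```

Each call costs a handful of additions and multiplications, and the memory footprint stays constant no matter how many samples have been processed.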

Outlier detection is a bit more convoluted:
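As an illustration only, the check could be sketched in Python as follows. The function name `is_outlier`, its parameters and the default threshold of 3 are assumptions of this sketch rather than the original code. The observation is compared against the model built from the previous samples, and RSS is computed from the stored sums as RSS = Syy + n·a² + b²·Sxx − 2a·Sy − 2b·Sxy + 2ab·Sx, where a is the intercept and b the slope.

```python
import math


def is_outlier(n, sx, sy, sxx, sxy, syy, x, y, threshold=3.0):
    """Flag (x, y) when its residual exceeds `threshold` times the RSE.

    Hypothetical helper: n, sx, sy, sxx, sxy, syy are the sample count
    and the running sums of x, y, x*x, x*y and y*y seen so far.
    """
    if n < 3:
        return False  # RSE divides by n - 2, so three samples are needed

    mean_x = sx / n
    mean_y = sy / n
    denominator = sxx / n - mean_x ** 2
    if denominator <= 0:
        return False  # all x values equal: the slope is undefined

    slope = (sxy / n - mean_x * mean_y) / denominator
    intercept = mean_y - slope * mean_x

    # Residual sum of squares, expressed through the stored sums only.
    rss = (syy + n * intercept ** 2 + slope ** 2 * sxx
           - 2 * intercept * sy - 2 * slope * sxy
           + 2 * intercept * slope * sx)
    rss = max(rss, 0.0)  # guard against tiny negative rounding errors
    rse = math.sqrt(rss / (n - 2))

    residual = y - (intercept + slope * x)
    if rse == 0.0:
        # Perfect fit so far: any real deviation counts as unexpected.
        return not math.isclose(residual, 0.0, abs_tol=1e-12)
    return abs(residual) / rse > threshold
```

The threshold is a tuning choice: a larger value tolerates more noise before a sample is flagged, while a smaller one makes the detector more sensitive. A sample that passes the check would then typically be folded into the sums so that the model keeps tracking the data.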

As usual, the code can be found on GitHub.

Equations in this post were created using a very nice online equation editor. Besides the Wikipedia article, I also used some knowledge from this wonderful book: An Introduction to Statistical Learning: with Applications in R (Springer Texts in Statistics).