Sunday, February 19, 2012

Welfords One Pass Agorithm

I've continued to be quiet recently as I am back at college so a lot of my time is taken trying to understand the material.

This semester I am taking Simulation and Data Mining. The maths is a lot heavier that last semester. Most of it is covered in the form of proofs in the Simulation course. In addition I am re-learning Java programming, pretty much from scratch so when as I understand the maths bits I then try to code it up in Java or implement it in an assignment solution.

We learned about Welford's One Pass Algorithm during the week and it turns out that it should be implemented in an assignment.The algorithm is an efficient way of calculating mean, variance and standard deviation for a sample on the fly.Normally if we were working on paper or in excel we would store the data first and then run back through it, calculating the individual parts of the equation first and then adding those up to get the final answers. This algorithm allows us to compute mean and variance as we iterate through a loop.
The real value from a computer resource point of view is that we don't have the memory overhead of storing the individual bits of the formula first for calculation after.

 I wasted a lot of time searching the net yesterday in a hungover state trying to find a way to code it up, as I thought that one of the variables was dependent on the the other two and so would not compute properly. Hopefully this blog entry will help anyone searching for it in plain English.

Standard equations
The fomula for the mean is:
The sum of the individual observations divided by n, the number of observations.

The formula for the variance is:
Standard deviation is simply the square root of s above.

As you can imagine in the above you would have to store values first, then send them through a loop to store the individual differences, and then finally do aggregated calculation. Very costly in terms of memory and code.

Welfords One Pass Algorithm
Welfords algorithm gets around this by calculating the minor changes on the fly. This removes the need to store data and simply looks at the delta's.

The algorithm is below; I have used a general notation which you can apply to whatever language you need.
int n = 0;
double xbar = 0.0; //mean
double var = 0.0; //Variance
double x = 0.0; //Next Observation
double d = 0.0; // Difference

For every observation:
x = getNextData();
d = x - xbar;
v = v + (d^2);
xbar = xbar + d/n;

v = v/n ***

*** I made this correction after trying to run this algorithm a second time on a different machine.

One problem I had was that if you are working through an array, the first array index is 0, which caused the entire equation to return NaN; but when I added 1 to it to change the indexing it seemed to work.

You can also calculate the covariance quite easily as well
Covariance is simply (the sum of (x*y) - (the sum of x times the sum of y) / n)/n-1
You can collect x*y as you iterate through the loop
sum of X times sum y can be calculated after the fact!