# Calculating variance and mean with MapReduce (Python)

I’m halfway through Peter Seibel’s *Coders at Work*, and one of the recurring topics is the difficulty of parallel computing. Nearly every developer interviewed in the book so far claims that, in their experience, the bugs in multithreaded and multiprocess applications have been the most difficult to track down and fix. This is such an old topic that you’d think we would have gotten it right by now. But we haven’t, and the recent growth of cloud computing and multi-core processors has revived interest in parallelism. One of the more intuitive approaches in recent years has been MapReduce, a framework introduced by Google researchers. Google uses it in its search engine for tasks such as counting the frequency of particular words in a large number of documents.

### MapReduce

If you’re not familiar with MapReduce, it is quite simply a methodology for standardising the implementation of massively parallel data processing in grid computing. The input data is broken down into processing units, or blocks. Each block is processed with a `map()` function, and then the results of all of the blocks are combined with a `reduce()` function. The name comes from the Python functions `map()` and `reduce()`, but the general concept is implementable in any language with similar functions, such as C#, Java and C++.

In the case of calculating the statistics on a large dataset, such as the frequency, mean and variance of a set of numbers, a MapReduce-based implementation would slice up the data and process the slices in parallel. The bulk of the actual parallel processing occurs in the map step. The reduce step is usually a minimal step to combine these independently calculated results, which can be as simple as adding the results from each of the map outputs.

### A simple example

Let’s say that we wanted to count the total number of characters in an array of strings. Here, the map function will find the length of each string, and then the reduce function will add these together. In Python, one possible implementation would be:

```python
import operator

strings = ['string 1', 'string xyz', 'foobar']
print reduce(operator.add, map(len, strings))
```

This code does not really achieve anything special — counting characters is a trivial task in nearly any language, and is achievable in a variety of different ways. The advantage of this particular solution is that, because the map functions can be executed easily in parallel, it’s extremely scalable. The combining operation (addition) is associative and commutative, meaning that the blocks can be calculated independently of each other; each of the strings could be sent to a separate processor or compute node for calculation of the `len()` function. In Python 2.6, you can use the `multiprocessing.Pool.map()` function as a drop-in replacement for the built-in map function (and there are numerous other parallel implementations of map available). The following modification to our program distributes the task across two processes:

```python
import operator
from multiprocessing import Pool

strings = ['string 1', 'string xyz', 'foobar']
pool = Pool(processes=2)
print reduce(operator.add, pool.map(len, strings))
```

### Parallel statistics

As a real-world example, say that we have a large array of random floats loaded into memory and wish to calculate their total count, mean and variance. Using the MapReduce paradigm, our map function could make use of NumPy’s `size()`, `mean()` and `var()` functions. The reduce function needs to implement a way to combine these statistics in parallel: it must accept, for example, the means of two samples and find their combined mean.

Determining the combined number of samples is, of course, extremely simple:

$$n_{ab} = n_a + n_b$$

Combining two mean values is also fairly trivial. We find the mean of the two means, weighted by the number of samples:

$$\mu_{ab} = \frac{n_a \mu_a + n_b \mu_b}{n_{ab}}$$

The variance is a little trickier. If both samples have a similar mean, a sample size-weighted mean of the two variances would provide a reasonable estimate for the combined variance. However, we can calculate the precise variance as:

$$\sigma^2_{ab} = \frac{n_a \sigma^2_a + n_b \sigma^2_b}{n_{ab}} + n_a n_b \left(\frac{\mu_b - \mu_a}{n_{ab}}\right)^2$$

which is based on a pairwise variance algorithm.
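As a quick sanity check, the combination used in the reduce step can be verified numerically against NumPy’s own functions. The following sketch (in Python 3 syntax) combines the statistics of two samples with deliberately different means and sizes, then compares the result against computing the statistics directly on the concatenated data:

```python
import numpy

# Two samples with different sizes and means.
a = numpy.random.rand(400)
b = numpy.random.rand(600) + 0.5

n_a, n_b = a.size, b.size
n_ab = n_a + n_b

# Combined mean: size-weighted mean of the two means.
mean_ab = (n_a * a.mean() + n_b * b.mean()) / n_ab

# Combined variance via the pairwise formula.
var_ab = ((n_a * a.var() + n_b * b.var()) / n_ab
          + n_a * n_b * ((b.mean() - a.mean()) / n_ab) ** 2)

# Both agree (to floating-point precision) with the statistics
# of the concatenated data.
ab = numpy.concatenate([a, b])
assert abs(mean_ab - ab.mean()) < 1e-12
assert abs(var_ab - ab.var()) < 1e-12
```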

An example implementation in Python is shown below:

```python
import numpy

def mapFunc(row):
    "Calculate the statistics for a single row of data."
    return (numpy.size(row), numpy.mean(row), numpy.var(row))

def reduceFunc(row1, row2):
    "Calculate the combined statistics from two rows of data."
    n_a, mean_a, var_a = row1
    n_b, mean_b, var_b = row2
    n_ab = n_a + n_b
    mean_ab = ((mean_a * n_a) + (mean_b * n_b)) / n_ab
    var_ab = (((n_a * var_a) + (n_b * var_b)) / n_ab) + \
        ((n_a * n_b) * ((mean_b - mean_a) / n_ab)**2)
    return (n_ab, mean_ab, var_ab)

numRows = 100
numSamplesPerRow = 500
x = numpy.random.rand(numRows, numSamplesPerRow)
y = reduce(reduceFunc, map(mapFunc, x))
print "n=%d, mean=%f, var=%f" % y
```

The output after running this program five times was:

```
n=50000, mean=0.497709, var=0.082983
n=50000, mean=0.498162, var=0.082474
n=50000, mean=0.498098, var=0.083814
n=50000, mean=0.498482, var=0.083203
n=50000, mean=0.499027, var=0.083813
```

You can verify these results by running the following from the interactive shell, which should produce the same output:

```python
print "n=%d, mean=%f, var=%f" % (numpy.size(x), numpy.mean(x), numpy.var(x))
```

MapReduce is interesting because a large class of programs can be solved by restating them as MapReduce programs. It’s no silver bullet for writing bug-free parallel software, but you may find that a large number of existing solutions in your code base can be easily parallelised using the MapReduce approach.
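For instance, the word-frequency count mentioned at the start of this post maps naturally onto the same pattern. A minimal single-machine sketch (Python 3 syntax, with illustrative names `map_func` and `reduce_func`) might look like:

```python
from collections import Counter
from functools import reduce

documents = [
    'the quick brown fox',
    'the lazy dog',
    'the quick dog',
]

# Map: count the words in each document independently.
def map_func(doc):
    return Counter(doc.split())

# Reduce: merge two word-count tables by adding their counts.
def reduce_func(c1, c2):
    return c1 + c2

totals = reduce(reduce_func, map(map_func, documents))
print(totals['the'])  # -> 3
```

Because each `map_func` call touches only its own document, the map step could run on separate machines, with only the small per-document count tables shipped to the reduce step.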

I was wondering if there are any good resources for parallel statistics in MapReduce (beyond just mean and variance). I do econometric research and am interested in whatever statistical packages and resources are available on MapReduce.

Thanks!

Not sure if it’s appropriate for econometrics, but the Apache Mahout project has quite a nice collection of data mining and machine learning algorithms implemented using Hadoop MapReduce that you may find useful.
