
Calculating variance and mean with MapReduce (Python)

16 June 2010

I’m halfway through Peter Seibel’s Coders At Work, and one of the recurring topics is the difficulty of parallel computing. Nearly every developer interviewed in the book so far claims that, in their experience, the bugs in multithreaded and multiprocess applications have been the hardest to track down and fix. This is such an old topic that you’d think we would have gotten it right by now. But we haven’t, and the recent growth of cloud computing and multi-core processors has revived interest in parallelism. One of the more intuitive approaches in recent years has been MapReduce, a framework introduced by researchers at Google, who use it in their search infrastructure for tasks such as counting the frequency of particular words across a large number of documents.

MapReduce

If you’re not familiar with MapReduce, it is quite simply a standardised way of structuring massively parallel data processing on a computing grid. The input data is broken down into processing units, or blocks. Each block is processed with a map() function, and then the results of all of the blocks are combined with a reduce() function. The name comes from the map and reduce primitives of functional programming, which Python exposes as the built-ins map() and reduce(), but the general concept can be implemented in any language with similar functions, such as C#, Java and C++.

In the case of calculating statistics for a large dataset, such as the count, mean and variance of a set of numbers, a MapReduce-based implementation would slice up the data and process the slices in parallel. The bulk of the actual computation occurs in the map step. The reduce step is usually a minimal step that combines these independently calculated results, which can be as simple as summing the outputs of each map call.

A simple example

Let’s say that we wanted to count the total number of characters in an array of strings. Here, the map function will find the length of each string, and then the reduce function will add these together. In Python, one possible implementation would be:

import operator

strings = ['string 1', 'string xyz', 'foobar']

# Map: compute the length of each string; reduce: sum the lengths.
print reduce(operator.add, map(len, strings))

This code does not really achieve anything special: counting characters is a trivial task in nearly any language, and is achievable in a variety of different ways. The advantage of this particular solution is that, because the map calls can easily be executed in parallel, it’s extremely scalable. The reduce operation is associative and commutative, and the processing units are independent of one another, so each of the strings could be sent to a separate processor or compute node for calculation of the len() function. In Python 2.6, you can use the multiprocessing.Pool.map() method as a drop-in replacement for the built-in map() function (and there are numerous other parallel implementations of map available). The following modification to our program distributes the task across two worker processes:

import operator
from multiprocessing import Pool

strings = ['string 1', 'string xyz', 'foobar']

# Distribute the map step across a pool of two worker processes.
pool = Pool(processes=2)
print reduce(operator.add, pool.map(len, strings))
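A couple of caveats apply when running this as a script: on platforms where worker processes are spawned rather than forked (Windows in particular), the pool setup needs to sit under an if __name__ == '__main__': guard, and for a task as cheap as len() the inter-process communication overhead will far outweigh any parallel speedup. The pattern only pays off when each map call does substantial work.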

Parallel statistics

As a real-world example, say that we have a large array of random floats loaded into memory and wish to calculate their total count, mean and variance. Using the MapReduce paradigm, our map function could make use of NumPy’s size(), mean() and var() functions. The reduce function then needs a way to combine these statistics: given, for example, the means of two samples, it must find their combined mean.

Determining the combined number of samples is, of course, extremely simple:

$n_{ab} = n_a + n_b$

Combining two mean values is also fairly trivial: we take the mean of the two means, weighted by the number of samples in each:

$\bar{x}_{ab} = \frac{n_a \bar{x}_a + n_b \bar{x}_b}{n_{ab}}$

The variance is a little trickier. If both samples have a similar mean, a sample-size-weighted mean of the two variances would provide a reasonable estimate for the combined variance. However, we can calculate the precise combined variance as:

$\sigma^2_{ab} = \frac{n_a \sigma^2_a + n_b \sigma^2_b}{n_{ab}} + n_a n_b \left( \frac{\bar{x}_b - \bar{x}_a}{n_{ab}} \right)^2$

which is based on a pairwise variance algorithm.
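As a quick sanity check, the following snippet combines the statistics of two small arbitrary arrays using the formulas above, and compares the result against NumPy applied directly to the concatenated data; the two print lines should agree:

import numpy

# Two small sample arrays for checking the combination formulas.
a = numpy.array([1.0, 2.0, 3.0])
b = numpy.array([10.0, 20.0, 30.0, 40.0])

n_a, mean_a, var_a = numpy.size(a), numpy.mean(a), numpy.var(a)
n_b, mean_b, var_b = numpy.size(b), numpy.mean(b), numpy.var(b)

n_ab = n_a + n_b
mean_ab = (mean_a * n_a + mean_b * n_b) / float(n_ab)
var_ab = ((n_a * var_a + n_b * var_b) / float(n_ab)
          + n_a * n_b * ((mean_b - mean_a) / float(n_ab)) ** 2)

# Both lines should print identical values.
print "combined: mean=%f, var=%f" % (mean_ab, var_ab)
ab = numpy.concatenate((a, b))
print "direct:   mean=%f, var=%f" % (numpy.mean(ab), numpy.var(ab))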

An example implementation in Python is shown below:

import numpy

def mapFunc(row):
    "Calculate the statistics for a single row of data."
    return (numpy.size(row), numpy.mean(row), numpy.var(row))

def reduceFunc(row1, row2):
    "Calculate the combined statistics from two rows of data."
    n_a, mean_a, var_a = row1
    n_b, mean_b, var_b = row2
    n_ab = n_a + n_b
    # Sample-size-weighted mean of the two means.
    mean_ab = ((mean_a * n_a) + (mean_b * n_b)) / n_ab
    # Weighted mean of the two variances, plus a correction term
    # accounting for the difference between the two means.
    var_ab = ((((n_a * var_a) + (n_b * var_b)) / n_ab)
              + ((n_a * n_b) * ((mean_b - mean_a) / n_ab)**2))
    return (n_ab, mean_ab, var_ab)

numRows = 100
numSamplesPerRow = 500
x = numpy.random.rand(numRows, numSamplesPerRow)
y = reduce(reduceFunc, map(mapFunc, x))
print "n=%d, mean=%f, var=%f" % y

The output after running this program five times was:

n=50000, mean=0.497709, var=0.082983
n=50000, mean=0.498162, var=0.082474
n=50000, mean=0.498098, var=0.083814
n=50000, mean=0.498482, var=0.083203
n=50000, mean=0.499027, var=0.083813

You can verify the result by computing the statistics directly over the whole array in the same interactive session, which should produce identical output for the same x:

print "n=%d, mean=%f, var=%f" % (numpy.size(x), numpy.mean(x), numpy.var(x))

MapReduce is interesting because a large class of problems can be solved by restating them as MapReduce programs. It’s no silver bullet for writing bug-free parallel software, but you may find that many existing routines in your code base can be easily parallelised using the MapReduce approach.


From → Data mining, Python

3 Comments
  1. I was wondering if there are any good resources for parallel statistics in MapReduce (beyond just mean and variance). I do econometric research and am interested in whatever statistical packages and resources that are available on MapReduce.

    Thanks!

    • Not sure if it’s appropriate for econometrics, but the Apache Mahout project has quite a nice collection of data mining and machine learning algorithms implemented using Hadoop MapReduce that you may find useful.

  2. Hi,
    the figures are not visible, could you please update the post?
