Pleasingly Parallel MCMC: cracked wide open for MapReduce and Hadoop

MCMC methods guarantee an asymptotically exact result (say, parameter estimation for a phylogenetic tree), but they usually deliver it only in the long run, and many burn-in steps may be needed before the chain performs acceptably. Worse, as the data grows, so does the cost of drawing each sample: for most MCMC methods, every step requires O(N) operations in the number of data points N.
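To make that scaling issue concrete, here is a minimal Metropolis sketch in Python (an illustrative toy, not code from either paper): the accept/reject test evaluates the likelihood over every data point, which is exactly why each draw costs O(N).

```python
import numpy as np

def metropolis_step(theta, data, log_likelihood, log_prior, step=0.5, rng=None):
    """One Metropolis accept/reject step. Note the full pass over `data`:
    every proposal is scored against all N observations, which is why
    drawing a single sample is O(N)."""
    rng = rng or np.random.default_rng()
    proposal = theta + step * rng.standard_normal()
    log_alpha = (log_prior(proposal) + log_likelihood(proposal, data)
                 - log_prior(theta) - log_likelihood(theta, data))
    return proposal if np.log(rng.uniform()) < log_alpha else theta

# Toy usage: posterior over the mean of a Gaussian with known variance.
data = np.random.default_rng(0).normal(2.0, 1.0, size=100_000)
loglik = lambda th, x: -0.5 * np.sum((x - th) ** 2)  # touches all N points, every step
logpri = lambda th: -0.5 * th ** 2
theta = 0.0
for _ in range(100):
    theta = metropolis_step(theta, data, loglik, logpri)
```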

Although there have been earlier attempts to express MCMC algorithms in a distributed manner, it was a big question whether MCMC could be turned into an embarrassingly parallel algorithm (let me not discuss here the difference between parallel and distributed). An embarrassingly, or, more positively, pleasingly parallel algorithm can be executed on many separate nodes, each working on a different chunk of the input data, without those tasks communicating with each other and without maintaining a global shared state throughout the process.

These are exactly the problems MapReduce was designed for, so it provides a nearly ideal fit.
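As a rough sketch of that fit (plain Python with hypothetical stand-in samplers, not the Hadoop API): the map phase runs an independent chain on each shard, and the reduce phase is the single combination step at the very end.

```python
import numpy as np

def mcmc_map(shard, n_samples=1000, rng=None):
    # Mapper: one independent chain per shard, no cross-node communication.
    # A stand-in sampler here, just so the pipeline runs end to end.
    rng = rng or np.random.default_rng()
    return rng.standard_normal(n_samples) + shard.mean()

def mcmc_reduce(subposterior_draws):
    # Reducer: the lone combination step. Plain averaging of draws is the
    # crudest possible stand-in; the two papers below do this properly.
    return np.mean(np.stack(subposterior_draws), axis=0)

data = np.random.default_rng(0).standard_normal(10_000)
shards = np.array_split(data, 8)              # arbitrary partition of the data
draws = mcmc_reduce([mcmc_map(s) for s in shards])
```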

Today I discovered two papers from 2013 that have finally come up with efficient-looking pleasingly parallel MCMC designs and prototypes, and the whole reason for this post is to share the joy I felt over this little insight. These two papers finally present the opportunity to write a stable and efficient MapReduce Hadoop library enabling data-intensive bioinformatics applications and opening up this important subspace. So the race is on, dear Hadoop developers, to put another important toolkit into the hands of the bioinformatics crowd.

The two papers:

Asymptotically Exact, Embarrassingly Parallel MCMC

We present a parallel MCMC method with the following advantages: (1) the data can be partitioned arbitrarily, (2) each sampler/machine can make steps (i.e. draw samples and mix) quickly because each only uses a subset of the data, (3) subposterior sampling is performed in an embarrassingly parallel manner without communication between machines (the only communication needed is the final combination of samples), (4) one can use any type of posterior sampling method (including preexisting MCMC packages) on each machine, and (5) our method guarantees asymptotic correctness of the true posterior.
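A minimal sketch of that final combination step, assuming the simple parametric (Gaussian product) variant the paper discusses; the asymptotically exact variants instead use nonparametric density product estimation, and each machine is assumed to have sampled its subposterior under the prior raised to the 1/M power.

```python
import numpy as np

def combine_subposteriors_parametric(subposterior_draws):
    """Combine per-machine subposterior samples via a product of Gaussians.
    `subposterior_draws`: list of 1-D sample arrays, one per machine, each
    drawn against the prior^(1/M) subposterior. Scalar-parameter sketch."""
    precisions = np.array([1.0 / np.var(d) for d in subposterior_draws])
    means = np.array([np.mean(d) for d in subposterior_draws])
    # Product of Gaussians: precisions add, means are precision-weighted.
    combined_var = 1.0 / precisions.sum()
    combined_mean = combined_var * (precisions * means).sum()
    n = min(len(d) for d in subposterior_draws)
    return np.random.default_rng().normal(combined_mean,
                                          np.sqrt(combined_var), size=n)
```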

The other paper:

Bayes and Big Data: The Consensus Monte Carlo Algorithm

The idea behind consensus Monte Carlo is to break the data into groups (called “shards”), give each shard to a worker machine which does a full Monte Carlo simulation from a posterior distribution given its own data, and then combine the posterior simulations from each worker to produce a set of global draws representing the consensus belief among all the workers.
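A minimal sketch of that consensus step for a scalar parameter, assuming the inverse-variance weights the paper suggests for the near-Gaussian case; the equal-length chains and everything else here are illustrative.

```python
import numpy as np

def consensus_draws(worker_draws):
    """Weighted-average combination in the spirit of consensus Monte Carlo.
    `worker_draws`: list of equal-length 1-D draw arrays, one per worker.
    Draw g of the consensus chain is the weighted average of each worker's
    g-th draw, weighted by the inverse sample variance of that worker."""
    draws = np.stack(worker_draws)                      # (workers, n_draws)
    weights = 1.0 / draws.var(axis=1, keepdims=True)    # one weight per worker
    return (weights * draws).sum(axis=0) / weights.sum()
```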