MCMC methods guarantee an accurate enough result (say parameter estimation for a phylogenetic tree). But they give it to you usually in the long-run and many burn-in steps might be necessary before performing ok. And if the data size grows larger, the number of operations to draw a sample grows larger too (N -> O(N) for most MCMC methods.
Although there’s been attempts earlier to express an MCMC algorithm in a distributed manner it was a big question whether it can be turned into an embarrassingly parallel algorithm (let me not discuss here the difference between parallel and distributed). An embarrassingly or, more positively, a pleasingly parallel algorithm can be executed on many separate nodes on different chunks of the input data without requiring those tasks communicating with each other and without the need to maintain a global shared state throughout the process.
Exactly these are the problems MapReduce was designed for and provides a nearly ideal fit.
Today I discovered 2 papers from 2013 that have finally came up with efficient looking pleasingly parallel MCMC designs and prototypes and the whole reason of this post is to share my joy felt over this little insight. These 2 papers finally present the opportunity to write a stable and efficient MapReduce Hadoop library allowing data intensive bioinformatics applications and opening up this important subspace of biological methods. While I’m certainly not the type of bioinformatician who is going to implement these designs in Hadoop, I’m certainly the type of applied scientist who is going to use them. So the race is on, dear Hadoop developers to give another important toolkit in the hands of the bioinformatics crowd.
The 2 papers: