1 day Apache Spark training: randomish insights

Last week I participated in a one-day Apache Spark workshop in London, developed by Databricks and organised by Big Data Partnership.

Databricks Training Resources is the most important link you need in order to get started; it contains the whole training material.

Let me share some short comments:

Spark is the next logical, generalised step building on the map/shuffle/reduce processing paradigm. It is evolution, not revolution.

The main motivation behind Spark is to serve as the ultimate glue of the big data stack.

Do not take the in-memory distributed datasets message for granted, Continue reading

Big list of Markov chain Monte Carlo (MCMC) applications

I have become quite obsessed with Markov chain Monte Carlo methods lately. It is said that MCMC methods form the most frequently used class of algorithms in computer science. However, when I searched for a comprehensive list of MCMC applications across different domains, to my surprise I found none. So I’d like to ask you to help fill in the public Google Sheet called Big list of Markov chain Monte Carlo (MCMC) applications. The columns of the list are:

1. Particular quantity estimated, system simulated, sampled. (mandatory)

2. General class of quantity, system sampled. (mandatory)

3. Reference to paper or link. (strongly recommended or optional for commercial examples)

4. MCMC variant used. (optional)

5. Software implementation. (optional)

I provided a couple of examples, so I think it is quite straightforward to add your own once you have one. I also posted the request on Quora.

Hadoop 101 for bioinformaticians: 1 hour crash course, code and slides

Earlier this year (February-April) I ran 9 short, 1-hour hands-on sessions (5 people per session) called Hadoop 101 for bioinformaticians at the Genome Campus for people from the European Bioinformatics Institute and the Sanger Institute. The participants were bioinformaticians, developers and sysadmins. My idea was to start with a ~20-minute theoretical introduction that gives participants some handles on whether their particular computational problems might fit the MapReduce/Hadoop distributed computing paradigm. This was followed by a ~40-minute practical session where I aimed to provide enough code and examples to get people started with Hadoop development. I set up a GitHub repo for this called Hadoop 101 for bioinformaticians, and here are the slides I used throughout:

Pleasingly Parallel MCMC: cracked wide open for MapReduce and Hadoop

MCMC methods guarantee an accurate enough result (say, parameter estimation for a phylogenetic tree). But they usually deliver it only in the long run, and many burn-in steps might be necessary before the chain performs well. And as the data size N grows, the number of operations needed to draw a single sample grows too: O(N) for most MCMC methods.
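To make the burn-in and the O(N) cost per sample concrete, here is a minimal sketch of a random-walk Metropolis sampler. The model is my own assumption chosen for illustration (estimating the mean of Gaussian data with known standard deviation and a flat prior); the function names and defaults are hypothetical and not taken from any particular library.

```python
import math
import random


def log_likelihood(mu, data, sigma=1.0):
    # Touches every data point: this is the O(N) cost paid at each MCMC step.
    return -sum((x - mu) ** 2 for x in data) / (2.0 * sigma ** 2)


def metropolis_hastings(data, n_steps=5000, burn_in=1000, step_size=0.1, seed=0):
    """Random-walk Metropolis for the mean of Gaussian data.

    Assumptions (illustrative only): known sigma = 1, flat prior on mu,
    symmetric Gaussian proposal.
    """
    rng = random.Random(seed)
    mu = 0.0  # arbitrary starting point, possibly far from the posterior mode
    current_ll = log_likelihood(mu, data)
    samples = []
    for step in range(n_steps):
        proposal = mu + rng.gauss(0.0, step_size)
        proposal_ll = log_likelihood(proposal, data)  # another full pass over the data
        # Accept with probability min(1, likelihood ratio).
        if rng.random() < math.exp(min(0.0, proposal_ll - current_ll)):
            mu, current_ll = proposal, proposal_ll
        # Early steps are discarded as burn-in while the chain finds the mode.
        if step >= burn_in:
            samples.append(mu)
    return samples
```

Calling `metropolis_hastings(data)` on a list of floats returns the post-burn-in draws; their average approximates the posterior mean. Every step calls `log_likelihood` on the full data set, which is why doubling N roughly doubles the cost of each sample.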

Although there have been earlier attempts to express an MCMC algorithm in a distributed manner, it was a big question whether it could be turned into an embarrassingly parallel algorithm (let me not discuss here the difference between parallel and distributed). An embarrassingly or, more positively, a pleasingly parallel algorithm can be executed on many separate nodes on different chunks of the input data, without requiring those tasks to communicate with each other and without the need to maintain a global shared state throughout the process.

These are exactly the problems MapReduce was designed for, and it provides a nearly ideal fit for them.
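As a rough illustration of what such a design could look like, the sketch below splits the data into shards, runs an independent sampler on each shard (the "map" side, with no communication or shared state), and then merges the per-shard draws (the "reduce" side). The toy model (Gaussian mean, known standard deviation, flat prior), the Gaussian-approximation merge and all function names are my own assumptions for illustration; this is not necessarily the combination rule used in the papers mentioned in this post.

```python
import math
import random
import statistics


def sample_subposterior(shard, seed, n_steps=3000, burn_in=500, step_size=0.2, sigma=1.0):
    """Random-walk Metropolis on one shard only; it needs no other shard and no shared state."""
    rng = random.Random(seed)

    def log_lik(mu):
        return -sum((x - mu) ** 2 for x in shard) / (2.0 * sigma ** 2)

    mu = statistics.fmean(shard)  # start near the mode to keep burn-in short
    current = log_lik(mu)
    samples = []
    for step in range(n_steps):
        proposal = mu + rng.gauss(0.0, step_size)
        candidate = log_lik(proposal)
        if rng.random() < math.exp(min(0.0, candidate - current)):
            mu, current = proposal, candidate
        if step >= burn_in:
            samples.append(mu)
    return samples


def combine_gaussian(subposterior_samples):
    """Multiply Gaussian approximations of the shard posteriors.

    For a product of Gaussians the precisions add and the mean is the
    precision-weighted average of the individual means.
    """
    precisions = [1.0 / statistics.variance(s) for s in subposterior_samples]
    means = [statistics.fmean(s) for s in subposterior_samples]
    total = sum(precisions)
    mean = sum(p * m for p, m in zip(precisions, means)) / total
    return mean, 1.0 / total


def parallel_mcmc(data, n_shards=4):
    shards = [data[i::n_shards] for i in range(n_shards)]
    # "Map": each call is independent, so it could run on a separate node.
    # It runs sequentially here only to keep the sketch self-contained.
    draws = [sample_subposterior(shard, seed=i) for i, shard in enumerate(shards)]
    # "Reduce": a single cheap merge of the per-shard results.
    return combine_gaussian(draws)
```

`parallel_mcmc(data)` returns an approximate posterior mean and variance. The flat prior keeps the sketch short: with an informative prior, each shard would have to use only a fraction of it so that the product of the shard posteriors does not count the prior several times.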

Today I discovered 2 papers from 2013 that have finally come up with efficient-looking pleasingly parallel MCMC designs and prototypes, and the whole reason for this post is to share the joy I felt over this little insight. These 2 papers finally present the opportunity to write a stable and efficient MapReduce/Hadoop library, enabling data-intensive bioinformatics applications and opening up this important subspace. So the race is on, dear Hadoop developers, to put another important toolkit into the hands of the bioinformatics crowd.

The 2 papers:

Asymptotically Exact, Embarrassingly Parallel MCMC Continue reading

2 recent Global Alliance for Genomics and Health standard candidates: ADAM and Google Genomics

The Global Alliance for Genomics and Health includes more than 150 health and research organizations working to accelerate the secure and responsible sharing of genomic and clinical data. GA4GH (for short) is something you will hear about more and more in the near future.

In the context of genomics standards, think mainly of data formats and the code to access and process these formats (APIs, if you like this term; well, I don’t).

2 emerging projects in the genomics standards field, one of them is bleeding, the other one is cutting edge:

1. ADAM

“Global Alliance is looking at ADAM as a potential standard”: see slide 12 of this slideshow.

what is it: “ADAM is a genomics processing engine and specialized file format built using Apache Avro, Apache Spark and Parquet. Apache 2 licensed.”

Currently it includes a complete variant calling pipeline, among other things.

main codebase: http://bdgenomics.org/

some folks behind it: @matt_massie @fnothaft and several other people from places like AMPLab at Berkeley, GenomeBridge, The Broad Institute, Mt Sinai.

2. Google Genomics

Continue reading