Hadoop 101 for bioinformaticians: 1 hour crash course, code and slides

Earlier this year (February to April) I ran 9 short, 1-hour hands-on sessions (5 people per session) called Hadoop 101 for bioinformaticians at the Genome Campus for people from the European Bioinformatics Institute and the Sanger Institute. The participants were bioinformaticians, developers and sysadmins. My idea was to start with a ~20-minute theoretical introduction to give participants some handles on whether their particular computational problems might fit the MapReduce/Hadoop distributed computing paradigm. This was followed by a ~40-minute practical session where I aimed to provide enough code and examples to get people started with Hadoop development. I set up a GitHub repo for this called Hadoop 101 for bioinformaticians, and here are the slides I used throughout:

Changing the game: absolute protein quantification by relating histone mass spec signals to DNA amounts and cell numbers

One thing systems biologists want is to have, by and large, absolute protein concentrations or copy numbers per cell available cheaply for their models that leverage all sorts of omics data. It looks like such results can now be delivered easily, based on a study published on the 15th of September by the Mann lab in Molecular & Cellular Proteomics entitled A ‘proteomic ruler’ for protein copy number and concentration estimation without spike-in standards.

This is the best, and certainly the simplest, idea in proteomics I’ve seen in a while. Continue reading

Pleasingly Parallel MCMC: cracked wide open for MapReduce and Hadoop

MCMC methods guarantee an accurate enough result (say, parameter estimation for a phylogenetic tree). But they usually deliver it only in the long run, and many burn-in steps might be necessary before they perform OK. And as the data size grows, the number of operations needed to draw a sample grows too (for data size N, O(N) operations per sample for most MCMC methods).
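To make the cost argument concrete, here is a minimal sketch of a random-walk Metropolis sampler, not taken from any of the papers discussed here. It assumes a toy model (Gaussian data with unit variance, flat prior on the mean), and the names `log_likelihood`, `metropolis` and the evaluation counter are made up for illustration. The point is only that every proposed sample has to touch all N data points, and that early samples are thrown away as burn-in.

```python
import math
import random


def log_likelihood(mu, data, counter):
    """Unit-variance Gaussian log-likelihood (up to a constant).

    Every call visits all N data points, which is where the O(N)
    cost per sample comes from. `counter` records those visits.
    """
    counter[0] += len(data)
    return -0.5 * sum((x - mu) ** 2 for x in data)


def metropolis(data, steps, burn_in, step_size=0.5, seed=0):
    """Random-walk Metropolis for the mean, assuming a flat prior."""
    rng = random.Random(seed)
    counter = [0]  # number of data point evaluations so far
    mu = 0.0       # arbitrary starting point
    current = log_likelihood(mu, data, counter)
    samples = []
    for step in range(steps):
        proposal = mu + rng.gauss(0.0, step_size)
        candidate = log_likelihood(proposal, data, counter)
        # Accept with probability min(1, exp(candidate - current)).
        if rng.random() < math.exp(min(0.0, candidate - current)):
            mu, current = proposal, candidate
        if step >= burn_in:  # discard the burn-in steps
            samples.append(mu)
    return samples, counter[0]


if __name__ == "__main__":
    data = [1.0, 2.0, 3.0, 4.0]
    samples, evaluations = metropolis(data, steps=2000, burn_in=500)
    print("kept samples:", len(samples))
    print("data point evaluations:", evaluations)
```

With N data points and S steps, the sampler above performs N × (S + 1) data point evaluations, so doubling the data doubles the work for the same number of samples.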

Although there have been earlier attempts to express an MCMC algorithm in a distributed manner, it was a big question whether it could be turned into an embarrassingly parallel algorithm (let me not discuss here the difference between parallel and distributed). An embarrassingly or, more positively, a pleasingly parallel algorithm can be executed on many separate nodes on different chunks of the input data without requiring those tasks to communicate with each other and without the need to maintain a global shared state throughout the process.

These are exactly the problems MapReduce was designed for, and it provides a nearly ideal fit for them.
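As a rough illustration of that fit, here is a minimal single-machine sketch of the map/reduce pattern in plain Python (not Hadoop code). The task, counting the GC fraction of DNA sequence chunks, and the function names `map_chunk`, `reduce_counts` and `gc_fraction` are assumptions chosen for the example. Each map call sees only its own chunk and shares no state with the others, and the reduce step merely combines the partial results.

```python
from functools import reduce


def map_chunk(chunk):
    """Map step: process one chunk in isolation.

    Returns (number of G/C bases, chunk length). No communication
    with other chunks and no global state is needed.
    """
    gc = sum(1 for base in chunk if base in "GC")
    return gc, len(chunk)


def reduce_counts(left, right):
    """Reduce step: combine two partial (gc, total) results."""
    return left[0] + right[0], left[1] + right[1]


def gc_fraction(chunks):
    """Run map over every chunk, then fold the partial results."""
    gc, total = reduce(reduce_counts, map(map_chunk, chunks), (0, 0))
    return gc / total if total else 0.0


if __name__ == "__main__":
    chunks = ["GGCC", "AATT", "GATTACA"]
    print(gc_fraction(chunks))
```

Because `reduce_counts` is associative and commutative, the partial results can be combined in any order, which is what lets a framework like Hadoop spread the map calls over many nodes.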

Today I discovered 2 papers from 2013 that have finally come up with efficient-looking pleasingly parallel MCMC designs and prototypes, and the whole reason for this post is to share the joy I felt over this little insight. These 2 papers finally present the opportunity to write a stable and efficient MapReduce Hadoop library, enabling data-intensive bioinformatics applications and opening up this important subspace of biological methods. While I’m certainly not the type of bioinformatician who is going to implement these designs in Hadoop, I’m certainly the type of applied scientist who is going to use them. So the race is on, dear Hadoop developers, to put another important toolkit in the hands of the bioinformatics crowd.

The 2 papers:

Asymptotically Exact, Embarrassingly Parallel MCMC Continue reading

2 recent Global Alliance for Genomics and Health standard candidates: ADAM and Google Genomics

The Global Alliance for Genomics and Health includes more than 150 health and research organizations working to accelerate secure and responsible sharing of genomic and clinical data. GA4GH (for short) is something you will hear about more and more in the near future.

In the context of genomics standards, think mainly of data formats and the code to access and process these formats (APIs, if you like this term; well, I don’t).

2 emerging projects in the genomics standards field; one of them is bleeding edge, the other one is cutting edge:


1. ADAM

“Global Alliance is looking at ADAM as a potential standard”: check slide 12 of this slideshow.

what is it: “ADAM is a genomics processing engine and specialized file format built using Apache Avro, Apache Spark and Parquet. Apache 2 licensed.”

Currently it includes a complete variant calling pipeline amongst others.

main codebase: http://bdgenomics.org/

some folks behind it: @matt_massie @fnothaft and several other people from places like AMPLab at Berkeley, GenomeBridge, The Broad Institute, Mt Sinai.

2. Google Genomics

Continue reading

3 recent Hadoop/MapReduce applications in the life sciences: RNA structure prediction, neuroimaging genetics, EEG signal analysis

3 open access papers, 3 prototypes, source code available only for 1, healthy diversification of topics.

1. Enhancement of accuracy and efficiency for RNA secondary structure prediction by sequence segmentation and MapReduce

code available: haven’t found it referenced in the paper

Our previous research shows that cutting long sequences into shorter chunks, predicting secondary structures of the chunks independently using thermodynamic methods, and reconstructing the entire secondary structure from the predicted chunk structures can yield better accuracy than predicting the secondary structure using the RNA sequence as a whole.

This is quite unexpected, but obviously favourable for the pleasingly parallel implementation that MapReduce can offer. Lots of benchmarking and comparison; quite methodology-focused.
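The chunk, predict, reconstruct idea quoted above can be sketched as follows. This is only a toy illustration and not the paper's method: `toy_fold` is a made-up stand-in for a real thermodynamic predictor, and the fixed-size chunking in `split_into_chunks` and the plain concatenation in `fold_by_chunks` are assumptions, since the actual segmentation and reconstruction strategies are not described in this post.

```python
# Watson-Crick pairs plus the G-U wobble pair.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"),
         ("C", "G"), ("G", "U"), ("U", "G")}


def split_into_chunks(sequence, size):
    """Cut the sequence into consecutive chunks of at most `size` bases."""
    return [sequence[i:i + size] for i in range(0, len(sequence), size)]


def toy_fold(chunk):
    """Hypothetical placeholder predictor (NOT a thermodynamic method).

    Greedily pairs bases from both ends of the chunk inwards, keeping
    at least three unpaired bases in the loop, and returns the result
    in dot-bracket notation.
    """
    structure = ["."] * len(chunk)
    i, j = 0, len(chunk) - 1
    while j - i > 3 and (chunk[i], chunk[j]) in PAIRS:
        structure[i], structure[j] = "(", ")"
        i += 1
        j -= 1
    return "".join(structure)


def fold_by_chunks(sequence, size):
    """Map step: fold each chunk independently.
    Reconstruction step: concatenate the per-chunk structures."""
    return "".join(map(toy_fold, split_into_chunks(sequence, size)))


if __name__ == "__main__":
    print(fold_by_chunks("GGGAAACCCGGGAAACCC", 9))
```

Since no base pair crosses a chunk boundary, each chunk can be folded on a different node, and concatenating the per-chunk dot-bracket strings always gives a well-formed structure.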

2. Random forests on Hadoop for genome-wide association studies of multivariate neuroimaging phenotypes

code and documentation available: http://www2.imperial.ac.uk/~gmontana/parfr.htm but only Java classes, with no project management or automated software build tools used

We have developed a parallel version of the RF algorithm for regression and genetic similarity learning tasks in large-scale population genetic association studies involving multivariate traits, called PaRFR (Parallel Random Forest Regression)… Notable speed-ups are obtained by introducing a distance-based criterion for node splitting in the tree estimation process. PaRFR has been applied to a genome-wide association study on Alzheimer’s disease (AD) in which the quantitative trait consists of a high-dimensional neuroimaging phenotype describing longitudinal changes in the human brain structure. PaRFR provides a ranking of SNPs associated to this trait, and produces pair-wise measures of genetic proximity that can be directly compared to pair-wise measures of phenotypic proximity. Several known AD-related variants have been identified, including APOE4 and TOMM40. We also present experimental evidence supporting the hypothesis of a linear relationship between the number of top-ranked mutated states, or frequent mutation patterns, and an indicator of disease severity.

My first impression is of a well-thought-out, serious study with different types of results.

3. Cloudwave: distributed processing of “big data” from electrophysiological recordings for epilepsy clinical research using hadoop.

code available: I don’t think so Continue reading

Google invests into DNAnexus: aging-driven big data bioinformatics without the Hadoop Ecosystem?

The first time DNAnexus made me think a little about what they can achieve was when they came up with an alternative search and browse interface for the complete Sequence Read Archive (SRA) database. They came to the ‘rescue’ as NCBI discontinued SRA in 2011, although NCBI later changed its mind, so SRA is still up and running there.

But DNAnexus is here to stay, and they also made me think on a personal level: at the same time NCBI discontinued SRA, it also discontinued Peptidome, which was NCBI’s shotgun mass spectrometry proteomics data repository. Much of my work in 2011 and 2012 went into reprocessing the valuable Peptidome datasets and making them available at EBI’s PRIDE database (where I work), a process which was documented in the open access paper From Peptidome to PRIDE: Public proteomics data migration at a large scale. As opposed to SRA and its next-gen sequencing DNA data, Peptidome was shut down permanently.

But back to DNAnexus and the fresh news that Google Ventures “has joined a group of investors seeding DNAnexus’ cloud-computing services for biomedical researchers with a $15 million C round”. What makes this news interesting is that DNAnexus recently announced large-scale projects with Stanford and Baylor College of Medicine, processing tens of thousands of genomes and making the resulting data sets securely available. From the press release:

Through these two projects alone, DNAnexus processed over 17,000 genomes and generated more than 500 terabytes of genomic data. The resulting data sets can be accessed on DNAnexus and are available for follow-on analysis and data sharing to researchers participating in the 1000 Genomes Project and the Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium.

Put this news into the context of Google’s investment in Calico to address aging and healthy lifespan extension, and of the fact that the first scientist hires of Calico, like Cynthia Kenyon and David Botstein, come from a genetics/genomics background. Medium boom, it’s still the opening game on the board!

This all sounds awesome and I hope that the DNAnexus folks are going to deliver and have fun while doing so. But let me offer an angle here that makes this big data bioinformatics news just a little less awesome: DNAnexus already tried MapReduce and Hadoop for their genomics data processing pipelines, and in the end they ditched them for proprietary homegrown software. Continue reading