Pleasingly Parallel MCMC: cracked wide open for MapReduce and Hadoop

MCMC methods guarantee a sufficiently accurate result (say, parameter estimates for a phylogenetic tree), but usually only in the long run, and many burn-in steps might be necessary before the chain performs acceptably. And as the data grow, the number of operations needed to draw a single sample grows too (O(N) in the data size for most MCMC methods).

Although there have been earlier attempts to express MCMC algorithms in a distributed manner, it was a big question whether MCMC could be turned into an embarrassingly parallel algorithm (let me not discuss here the difference between parallel and distributed). An embarrassingly or, more positively, pleasingly parallel algorithm can be executed on many separate nodes, each working on a different chunk of the input data, without those tasks communicating with each other and without maintaining a global shared state throughout the process.

These are exactly the problems MapReduce was designed for, and it provides a nearly ideal fit.
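To make the pattern concrete, here is a toy sketch of one way to run MCMC in pleasingly parallel fashion: independent Metropolis chains on disjoint data shards (the "map" side, no communication needed), combined afterwards by a precision-weighted average of the draws (the "reduce" side). This is my own illustration of the general idea on a Gaussian-mean model, not the exact algorithm of any particular paper, and every name and parameter value below is made up for the demo.

```python
import math
import random
import statistics

def shard_loglik(shard):
    """Log-likelihood of a Gaussian mean with known unit variance (flat prior)."""
    def loglik(mu):
        return -0.5 * sum((x - mu) ** 2 for x in shard)
    return loglik

def metropolis(loglik, x0, n_steps, step=0.3, seed=0):
    """Random-walk Metropolis; returns the second half of the chain (burn-in dropped)."""
    rng = random.Random(seed)
    x, lx = x0, loglik(x0)
    draws = []
    for _ in range(n_steps):
        prop = x + rng.gauss(0.0, step)
        lp = loglik(prop)
        if lp - lx > math.log(rng.random()):  # accept with prob min(1, ratio)
            x, lx = prop, lp
        draws.append(x)
    return draws[n_steps // 2:]

def parallel_mcmc(data, n_shards=4, n_steps=4000):
    """'Map': sample each shard independently; 'reduce': precision-weighted average."""
    shards = [data[i::n_shards] for i in range(n_shards)]
    # Each chain below could run as a separate mapper on a separate node,
    # since no chain ever looks at another chain's state or data.
    chains = [metropolis(shard_loglik(s), x0=0.0, n_steps=n_steps, seed=i)
              for i, s in enumerate(shards)]
    # Combine draw i across shards, weighting each chain by its precision.
    weights = [1.0 / statistics.variance(c) for c in chains]
    total = sum(weights)
    return [sum(w * c[i] for w, c in zip(weights, chains)) / total
            for i in range(len(chains[0]))]
```

Roughly speaking, for Gaussian subposteriors this weighted averaging of draws is exact; the hard part, and the contribution of the papers, is making the combination step valid for general posteriors.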

Today I discovered 2 papers from 2013 that have finally come up with efficient-looking, pleasingly parallel MCMC designs and prototypes, and the whole reason for this post is to share the joy I felt over this little insight. These 2 papers finally present the opportunity to write a stable and efficient MapReduce Hadoop library, enabling data-intensive bioinformatics applications and opening up this important subspace of biological methods. While I’m certainly not the type of bioinformatician who is going to implement these designs in Hadoop, I’m certainly the type of applied scientist who is going to use them. So the race is on, dear Hadoop developers, to put another important toolkit in the hands of the bioinformatics crowd.

The 2 papers:

Asymptotically Exact, Embarrassingly Parallel MCMC

2 recent Global Alliance for Genomics and Health standard candidates: ADAM and Google Genomics

The Global Alliance for Genomics and Health includes more than 150 health and research organizations working to accelerate the secure and responsible sharing of genomic and clinical data. GA4GH (for short) is something you will hear about more and more in the near future.

In the context of genomics standards, think mainly of data formats and of code to access and process these formats (APIs, if you like this term; well, I don’t).

2 emerging projects in the genomics standards field, one of them is bleeding, the other one is cutting edge:

1. ADAM

“Global Alliance is looking at ADAM as a potential standard”: check slide 12 of this slideshow.

what is it: “ADAM is a genomics processing engine and specialized file format built using Apache Avro, Apache Spark and Parquet. Apache 2 licensed.”

Currently it includes, amongst other things, a complete variant calling pipeline.

main codebase:

some folks behind it: @matt_massie @fnothaft and several other people from places like AMPLab at Berkeley, GenomeBridge, The Broad Institute, Mt Sinai.

2. Google Genomics


3 recent Hadoop/MapReduce applications in the life sciences: RNA structure prediction, neuroimaging genetics, EEG signal analysis

3 open access papers, 3 prototypes, source code available for only 1 of them, and a healthy diversification of topics.

1. Enhancement of accuracy and efficiency for RNA secondary structure prediction by sequence segmentation and MapReduce

code available: I haven’t found it referenced in the paper

Our previous research shows that cutting long sequences into shorter chunks, predicting secondary structures of the chunks independently using thermodynamic methods, and reconstructing the entire secondary structure from the predicted chunk structures can yield better accuracy than predicting the secondary structure using the RNA sequence as a whole.

This is quite unexpected but obviously favourable for the kind of pleasingly parallel implementation MapReduce can offer. Lots of benchmarking and comparison; quite methodology-focused.
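The segmentation idea maps naturally onto the model: chunking happens up front, each chunk is folded by an independent mapper, and a reducer stitches the predicted chunk structures back together by offset. A minimal sketch of the chunking half (the chunk length and overlap below are arbitrary illustration values, not the parameters used in the paper):

```python
def segment(seq, chunk_len, overlap):
    """Cut a sequence into overlapping chunks, tagged with their offsets.

    Each (offset, subsequence) pair would become one mapper's input;
    the overlap gives the reducer context for stitching the predicted
    chunk structures back into one structure for the whole sequence.
    """
    if not 0 <= overlap < chunk_len:
        raise ValueError("need 0 <= overlap < chunk_len")
    step = chunk_len - overlap
    return [(start, seq[start:start + chunk_len])
            for start in range(0, max(len(seq) - overlap, 1), step)]
```

For example, `segment("GGGAAACCCUUU", 6, 2)` yields `[(0, "GGGAAA"), (4, "AACCCU"), (8, "CUUU")]`, with each adjacent pair sharing two bases of context.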

2. Random forests on Hadoop for genome-wide association studies of multivariate neuroimaging phenotypes

code and documentation available: but only Java classes; no project management or automated software build tools were used

We have developed a parallel version of the RF algorithm for regression and genetic similarity learning tasks in large-scale population genetic association studies involving multivariate traits, called PaRFR (Parallel Random Forest Regression)… Notable speed-ups are obtained by introducing a distance-based criterion for node splitting in the tree estimation process. PaRFR has been applied to a genome-wide association study on Alzheimer’s disease (AD) in which the quantitative trait consists of a high-dimensional neuroimaging phenotype describing longitudinal changes in the human brain structure. PaRFR provides a ranking of SNPs associated to this trait, and produces pair-wise measures of genetic proximity that can be directly compared to pair-wise measures of phenotypic proximity. Several known AD-related variants have been identified, including APOE4 and TOMM40. We also present experimental evidence supporting the hypothesis of a linear relationship between the number of top-ranked mutated states, or frequent mutation patterns, and an indicator of disease severity.

The first impression is of a well-thought-out, serious study with different types of results.

3. Cloudwave: distributed processing of “big data” from electrophysiological recordings for epilepsy clinical research using hadoop.

code available: I don’t think so

Google invests into DNAnexus: aging-driven big data bioinformatics without the Hadoop Ecosystem?

The first time DNAnexus made me think a little about what they can achieve was when they came up with an alternative search and browse interface for the complete Sequence Read Archive (SRA) database. They came to the ‘rescue’ when NCBI discontinued SRA in 2011, although NCBI later changed its mind, so SRA is still up and running there.

But DNAnexus is here to stay, and they also made me think for personal reasons: at the same time NCBI discontinued SRA, it also discontinued Peptidome, which was NCBI’s shotgun mass spectrometry proteomics data repository. Much of my work in 2011 and 2012 went into reprocessing the valuable Peptidome datasets and making them available at EBI’s PRIDE database (where I work), a process which was documented in the open access paper From Peptidome to PRIDE: Public proteomics data migration at a large scale. As opposed to SRA and next-gen sequencing DNA data, Peptidome was shut down permanently.

But back to DNAnexus and the fresh news that Google Ventures “has joined a group of investors seeding DNAnexus‘ cloud-computing services for biomedical researchers with a $15 million C round”. What makes this news interesting is that recently, DNAnexus announced large-scale projects with Stanford and Baylor College of Medicine in processing tens of thousands of genomes and making the resulting data sets securely available. From the press release:

Through these two projects alone, DNAnexus processed over 17,000 genomes and generated more than 500 terabytes of genomic data. The resulting data sets can be accessed on DNAnexus and are available for follow-on analysis and data sharing to researchers participating in the 1000 Genomes Project and the Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium.

Put this news into the context of Google’s investment in Calico to address aging and healthy lifespan extension, and of the fact that the first scientist hires at Calico, like Cynthia Kenyon and David Botstein, come from a genetics/genomics background. Medium boom, it’s still the opening game on the board!

This all sounds awesome and I hope that the DNAnexus folks are going to deliver & have fun while doing so. But let me offer an angle here that makes this big data bioinformatics news just a little less awesome: DNAnexus already tried MapReduce and Hadoop for their genomics data processing pipelines and, in the end, ditched them for proprietary homegrown software.

Coming of age for proteogenomics: 10% less human protein coding genes based on mass spec proteomics data?

Guessing the number of real protein-coding genes is an ‘ancient’ bioinformatics game, and now a new argument & a newish research field have been applied to this problem. Proteogenomics can refer to different types of studies, but the basic idea is that mass spectrometry peptide/protein evidence is used to improve genome annotations. Now a joint Spanish – British – US team makes an interesting argument based on the LACK of mass spec evidence (negative proteogenomics, mind you) in a pre-print deposited in arXiv entitled The shrinking human protein coding complement: are there fewer than 20,000 genes? The argument and claim in a nutshell: by pulling together and reanalysing 7 large-scale proteomics studies & mapping them to the GENCODE v12 annotation, they identified 60% of protein-coding genes. Then, applying multiple non-coding features (the real meat of the study), they further restricted the non-coding set, yielding in the end a set of 2001 genes, of which they think 1500 do not actually code for proteins at all.
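The shape of the argument is a two-stage filter, which is easy to express in code. The sketch below is a toy reduction of that logic with made-up gene IDs, scores and threshold; it is not the authors' pipeline or feature set.

```python
def classify_genes(genes, peptide_hits, noncoding_score, threshold=0.5):
    """Split an annotation into evidence classes.

    genes: iterable of gene IDs (toy IDs here, not real GENCODE entries)
    peptide_hits: set of genes with at least one mass-spec peptide match
    noncoding_score: gene -> score derived from non-coding features;
                     higher means more likely non-coding (hypothetical scale)
    """
    observed = {g for g in genes if g in peptide_hits}
    unobserved = set(genes) - observed
    # Negative proteogenomics: only genes WITHOUT peptide evidence are
    # candidates, and a second, feature-based filter narrows them further.
    likely_noncoding = {g for g in unobserved if noncoding_score(g) > threshold}
    return observed, unobserved, likely_noncoding
```

The interesting move in the paper is that the final call is never made on missing peptides alone; absence of evidence only nominates candidates for the feature-based second stage.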

Whether the bioinformatics argument is flawless or not, the following findings are exciting:

Three links in Aging, Regenerative Medicine & Healthy Lifespan Extension: 17 December 2013

1. DNA methylation age of human tissues and cell types by Steve Horvath: This is the type of relevant data mining study most bioinformaticians are dreaming of: you pull together a large body of publicly available datasets (CpG methylation) that are not too heterogeneous (Infinium type II assay on Illumina 27K or Illumina 450K array platform), derive robust statistical results (develop a multi-tissue predictor of age) and apply it on a medically relevant field (20 cancer types exhibit significant age acceleration, with an average of 36 years).
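The modelling recipe behind such a predictor is ordinary supervised regression of chronological age on CpG methylation fractions. Horvath's actual predictor is an elastic-net model over hundreds of CpGs; the single-CpG least-squares toy below, on fabricated data, only illustrates the shape of the approach and of the "age acceleration" readout.

```python
import statistics

# Fabricated training data: methylation fraction at one CpG site vs chronological age.
methylation = [0.20, 0.30, 0.42, 0.50, 0.61, 0.72]
ages = [20, 30, 41, 52, 60, 71]

def ols(xs, ys):
    """Ordinary least-squares fit: returns (slope, intercept) for ys ~ xs."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

slope, intercept = ols(methylation, ages)

def predict_age(m):
    """Predicted 'methylation age' for a methylation fraction m."""
    return slope * m + intercept

def age_acceleration(m, chronological_age):
    """Positive values mean the tissue looks epigenetically older than it is."""
    return predict_age(m) - chronological_age
```

The cancer finding above is exactly this last quantity: tumour tissue whose predicted methylation age far exceeds the patient's chronological age.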

Three links in Aging, Regenerative Medicine & Healthy Lifespan Extension: 8 December 2013

1. Is aging linear or does it follow a step function? A good & simple question on Quora that surprised even Aubrey de Grey. If you are a bioinformatician out there – looking for a new pet project – go pull together some data & try to plot it! Let me know if you have something. An interesting answer:

It’s exponential. Starting in your 20s, your probability of death doubles every 8 years, as does your probability of getting cancer. Of course, since we’re talking about high-impact, low-frequency events, they’re governed by a Poisson distribution (i.e. fairly random noise, manifest in “jumpy” changes). But there’s no planned step-function behavior.

If you are interested in knowing more, start with some thorough fact-checking, e.g. of the doubling probability mentioned above.
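The "doubles every 8 years" claim in the answer is the classic Gompertz mortality law, which is easy to state and sanity-check in a couple of lines (the baseline hazard and starting age below are arbitrary placeholders, not fitted values):

```python
def gompertz_hazard(age, h0=0.0005, start=25.0, doubling=8.0):
    """Annual death hazard that doubles every `doubling` years after `start`.

    h0 is the (made-up) baseline hazard at the starting age.
    """
    return h0 * 2.0 ** ((age - start) / doubling)
```

Under these placeholder numbers the hazard at 33 is exactly twice that at 25, and by 105 it is 2**10 = 1024 times higher, which is why an exponential curve and a step function look so different once you actually plot the data.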

2. Developmental senescence: yeah, as in normal, physiological, embryonic development. In mammals. Reported by study_1 and study_2. Apoptosis has long been accepted as part of the healthy embryo’s toolkit, think limb growth and tissue remodelling. Now senescence follows.

Perhaps the most important ramification of the new work relates to its implications for the evolutionary origin of the senescence program. Most research to date has focused on senescence as a tumor-suppressive process, and it has been debated as to how evolution selects for programs that prevent a disorder that typically occurs after reproductive age (Campisi, 2003). The new work raises the possibility that senescence in the adult evolved from a primordial tissue-remodeling program that takes place in the embryo. In both settings, cells arrest in the cell cycle, partially share a common set of functional markers, have an active role in modifying the tissue microenvironment, and are ultimately recognized and cleared by the immune system (Figure 1). These features may have been adapted as part of an emergent adult stress response program that incorporated additional tumor suppressor mechanisms, such as those reliant on p53 and p16, to eliminate damaged cells and that may, in turn, contribute to organismal aging.

3. This anti-aging brain trust is the most interesting startup in Silicon Valley: I don’t care whether it’s hype or not, but I do care about the fact that the sporadic news on Calico has re-energised the whole aging/lifespan extension field and community.

The timing may be just right for a project like Calico. And unlike the vast majority of Silicon Valley’s startups, the technology is addressing a need that is keenly felt by many of us. Most people are in a constant battle against aging and will pay exorbitant sums of money to slow down the rate that our bodies deteriorate.

“Historically, the whole field of aging research has been very underfunded…