Earlier this year (February-April) I ran 9 short, 1-hour hands-on sessions (5 people per session) called Hadoop 101 for bioinformaticians at the Genome Campus for European Bioinformatics Institute and Sanger Institute people. The participants were bioinformaticians, developers and sysadmins. My idea was to start with a ~20-minute theoretical introduction to give participants some handles on whether their particular computational problems might fit the MapReduce/Hadoop distributed computing paradigm. This was followed by a ~40-minute practical session where I aimed to provide enough code and examples to get people started with Hadoop development. I set up a GitHub repo for this called Hadoop 101 for bioinformaticians and here are the slides I used throughout:
MCMC methods guarantee an accurate enough result (say, parameter estimation for a phylogenetic tree). But they usually deliver it only in the long run, and many burn-in steps might be necessary before they perform OK. And as the data size grows, the number of operations needed to draw a sample grows too (for N data points, O(N) per sample for most MCMC methods).
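To make the O(N)-per-sample point concrete, here is a minimal sketch of a random-walk Metropolis sampler in Python. Everything in it is an illustrative assumption rather than anything from a phylogenetics package: the model (estimating the mean of Gaussian data with unit variance and a flat prior), the function names, and the step size. The counter simply records how many data points each likelihood evaluation touches.

```python
import math
import random


def log_likelihood(mu, data, counter):
    """Gaussian log-likelihood (unit variance, constants dropped).

    Every call visits all N data points; the counter records that cost.
    """
    counter["terms"] += len(data)
    return -0.5 * sum((x - mu) ** 2 for x in data)


def metropolis(data, n_steps, burn_in, step_size=0.2, seed=0):
    """Random-walk Metropolis for the mean, assuming a flat prior."""
    rng = random.Random(seed)
    counter = {"terms": 0}
    mu = 0.0  # deliberately poor starting point, hence the burn-in
    current = log_likelihood(mu, data, counter)
    samples = []
    for step in range(n_steps):
        proposal = mu + rng.gauss(0.0, step_size)
        candidate = log_likelihood(proposal, data, counter)  # O(N) work
        # Accept with probability min(1, likelihood ratio).
        accept_prob = math.exp(min(0.0, candidate - current))
        if rng.random() < accept_prob:
            mu, current = proposal, candidate
        if step >= burn_in:  # discard the burn-in part of the chain
            samples.append(mu)
    return samples, counter["terms"]


# Synthetic data: 200 points drawn around a true mean of 3.0.
data_rng = random.Random(1)
data = [data_rng.gauss(3.0, 1.0) for _ in range(200)]
samples, terms = metropolis(data, n_steps=2000, burn_in=500)
print(len(samples), "samples kept;", terms, "data-point evaluations")
```

Each proposal costs one full pass over the data, so doubling N doubles the work per sample, and the chain itself is sequential because every step depends on the previous one.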
Although there have been earlier attempts to express an MCMC algorithm in a distributed manner, it was a big question whether it could be turned into an embarrassingly parallel algorithm (let me not discuss here the difference between parallel and distributed). An embarrassingly or, more positively, a pleasingly parallel algorithm can be executed on many separate nodes on different chunks of the input data without requiring those tasks to communicate with each other and without the need to maintain a global shared state throughout the process.
These are exactly the problems MapReduce was designed for, and for them it provides a nearly ideal fit.
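As a rough illustration of what pleasingly parallel means in practice, the sketch below counts bases in a handful of reads using a map step per chunk and a reduce step that merges the partial results. It is plain Python with a thread pool standing in for cluster nodes, not Hadoop code, and the function names and toy reads are made up for the example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def split_into_chunks(records, n_chunks):
    """Deal the records into n_chunks roughly equal lists."""
    return [records[i::n_chunks] for i in range(n_chunks)]


def map_chunk(chunk):
    """Map step: count bases in one chunk, using no shared state."""
    counts = Counter()
    for seq in chunk:
        counts.update(seq)
    return counts


def reduce_counts(left, right):
    """Reduce step: merge two partial results."""
    return left + right


def base_counts(records, n_chunks=4):
    chunks = split_into_chunks(records, n_chunks)
    # The workers never talk to each other; each sees only its own chunk.
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        partials = list(pool.map(map_chunk, chunks))
    return reduce(reduce_counts, partials, Counter())


reads = ["ACGT", "GGCC", "ATAT", "CCGA", "TTGA"]  # toy input
totals = base_counts(reads)
print(dict(totals))
```

The result does not depend on how many chunks the input is split into, which is the property that lets a framework schedule the map tasks on any number of nodes.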
Today I discovered 2 papers from 2013 that have finally come up with efficient-looking pleasingly parallel MCMC designs and prototypes, and the whole reason for this post is to share the joy I felt over this little insight. These 2 papers finally present the opportunity to write a stable and efficient MapReduce/Hadoop library enabling data-intensive bioinformatics applications and opening up this important subspace of biological methods. While I’m certainly not the type of bioinformatician who is going to implement these designs in Hadoop, I’m certainly the type of applied scientist who is going to use them. So the race is on, dear Hadoop developers, to put another important toolkit in the hands of the bioinformatics crowd.
The 2 papers:
The Global Alliance for Genomics and Health includes > 150 health and research organizations working to accelerate secure and responsible sharing of genomic and clinical data. GA4GH (for short) is something you will hear about more and more in the near future.
In the context of genomics standards, think mainly of data formats and the code to access and process these formats (APIs if you like this term; well, I don’t).
2 emerging projects in the genomics standards field; one of them is bleeding edge, the other one is cutting edge:
what is it: “ADAM is a genomics processing engine and specialized file format built using Apache Avro, Apache Spark and Parquet. Apache 2 licensed.”
Currently it includes a complete variant calling pipeline amongst others.
main codebase: http://bdgenomics.org/
2. Google Genomics
3 open access papers, 3 prototypes, source code available only for 1, healthy diversification of topics.
code available: haven’t found it referenced in the paper
Our previous research shows that cutting long sequences into shorter chunks, predicting secondary structures of the chunks independently using thermodynamic methods, and reconstructing the entire secondary structure from the predicted chunk structures can yield better accuracy than predicting the secondary structure using the RNA sequence as a whole.
This is quite unexpected but obviously favourable for the kind of pleasingly parallel implementation MapReduce can offer. Lots of benchmarking and comparison, quite methodology focused.
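The cut, predict and reconstruct idea quoted above can be sketched in a few lines of Python. This is not the paper's method: `toy_fold` is a hypothetical stand-in for a real thermodynamic predictor (it only pairs complementary bases from the two ends of a chunk inwards), and the fixed-length chunking is an assumption made for the example. The point is only the shape of the computation: each chunk is folded independently, and local base-pair positions are shifted by the chunk offset when the full structure is reassembled.

```python
# Watson-Crick pairs only; a real predictor would also score G-U pairs etc.
COMPLEMENT = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}


def chunk_sequence(seq, chunk_len):
    """Cut seq into consecutive pieces, keeping each piece's start offset."""
    return [(start, seq[start:start + chunk_len])
            for start in range(0, len(seq), chunk_len)]


def toy_fold(chunk):
    """HYPOTHETICAL predictor: pair complementary bases from the ends
    inwards, leaving a hairpin loop of at least three unpaired bases."""
    pairs = []
    i, j = 0, len(chunk) - 1
    while j - i > 3 and (chunk[i], chunk[j]) in COMPLEMENT:
        pairs.append((i, j))
        i += 1
        j -= 1
    return pairs


def fold_in_chunks(seq, chunk_len):
    """Fold every chunk on its own, then rebuild global base-pair indices."""
    structure = []
    for offset, chunk in chunk_sequence(seq, chunk_len):
        # Each iteration is independent, so it could run as its own map task.
        structure.extend((offset + i, offset + j) for i, j in toy_fold(chunk))
    return sorted(structure)


rna = "GGGAAAACCC" + "AUGAAAACAU"  # two toy hairpins, 10 bases each
pairs = fold_in_chunks(rna, chunk_len=10)
print(pairs)
```

Because no chunk needs to know about any other chunk, the loop body maps directly onto independent tasks, with the offset shift as the only bookkeeping in the reconstruction step.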
code and documentation available: http://www2.imperial.ac.uk/~gmontana/parfr.htm but only Java classes; no project management or automated software build tools are used
We have developed a parallel version of the RF algorithm for regression and genetic similarity learning tasks in large-scale population genetic association studies involving multivariate traits, called PaRFR (Parallel Random Forest Regression)… Notable speed-ups are obtained by introducing a distance-based criterion for node splitting in the tree estimation process. PaRFR has been applied to a genome-wide association study on Alzheimer’s disease (AD) in which the quantitative trait consists of a high-dimensional neuroimaging phenotype describing longitudinal changes in the human brain structure. PaRFR provides a ranking of SNPs associated to this trait, and produces pair-wise measures of genetic proximity that can be directly compared to pair-wise measures of phenotypic proximity. Several known AD-related variants have been identified, including APOE4 and TOMM40. We also present experimental evidence supporting the hypothesis of a linear relationship between the number of top-ranked mutated states, or frequent mutation patterns, and an indicator of disease severity.
First impression is of a well-thought-out, serious study with different types of results.
code available: I don’t think so
The first time DNAnexus made me think a little about what they can achieve was when they came up with an alternative search and browse interface for the complete Sequence Read Archive (SRA) database. They came to the ‘rescue’ as NCBI discontinued SRA in 2011, although NCBI later changed its mind, so SRA is still up and running there.
But DNAnexus is here to stay, and they also made me think on a personal level: at the same time NCBI discontinued SRA, it discontinued Peptidome as well, which was NCBI’s shotgun mass spectrometry proteomics data repository. Much of my work in 2011 and 2012 went into reprocessing the valuable Peptidome datasets and making them available at EBI’s PRIDE database (where I work), a process which was documented in the open access paper From Peptidome to PRIDE: Public proteomics data migration at a large scale. As opposed to SRA and next-gen sequencing DNA data, Peptidome was shut down permanently.
But back to DNAnexus and the fresh news that Google Ventures “has joined a group of investors seeding DNAnexus’ cloud-computing services for biomedical researchers with a $15 million C round”. What makes this news interesting is that recently, DNAnexus announced large-scale projects with Stanford and Baylor College of Medicine in processing tens of thousands of genomes and making the resulting data sets securely available. From the press release:
Through these two projects alone, DNAnexus processed over 17,000 genomes and generated more than 500 terabytes of genomic data. The resulting data sets can be accessed on DNAnexus and are available for follow-on analysis and data sharing to researchers participating in the 1000 Genomes Project and the Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium.
Put this news into the context of Google’s investment in Calico to address aging and healthy lifespan extension, and of the fact that the first scientist hires of Calico, like Cynthia Kenyon and David Botstein, come from a genetics/genomics background. Medium boom, it’s still opening game on the board!
This all sounds awesome and I hope that the DNAnexus folks are going to deliver & have fun while doing so. But let me offer an angle here that makes this big data bioinformatics news just a little less awesome: DNAnexus already tried MapReduce and Hadoop for their genomics data processing pipelines and in the end they ditched it for proprietary homegrown software.
Guessing the number of real protein-coding genes is an ‘ancient’ bioinformatics game and now a new argument & newish research field has been applied to this problem. Proteogenomics can refer to different types of studies, but the basic idea is that mass spectrometry peptide/protein evidence is used to improve genome annotations. Now a joint Spanish – British – US team makes an interesting argument based on the LACK of mass spec evidence (negative proteogenomics, mind you) in a pre-print deposited in arXiv entitled The shrinking human protein coding complement: are there fewer than 20,000 genes? The argument and claim in a nutshell: by pulling together and reanalysing 7 large-scale proteomics studies & mapping them to the GENCODE v12 annotation, they identified 60% of protein coding genes. Then, applying multiple non-coding features (the real meat of the study), they further restricted the non-coding set, ending up with a set of 2001 genes, of which they think 1500 do not actually code for proteins at all.
Whether the bioinformatics argument is flawless or not, the following findings are exciting: