Google invests in DNAnexus: aging-driven big data bioinformatics without the Hadoop Ecosystem?

The first time DNAnexus made me think about what they could achieve was when they came up with an alternative search-and-browse interface for the complete Sequence Read Archive (SRA) database. They came to the ‘rescue’ when NCBI announced it would discontinue SRA in 2011; NCBI later changed its mind, so SRA is still up and running there.

But DNAnexus is here to stay, and the episode also touched me personally: around the same time NCBI announced the end of SRA, it discontinued Peptidome as well, NCBI’s repository for shotgun mass spectrometry proteomics data. Much of my work in 2011 and 2012 went into reprocessing the valuable Peptidome datasets and making them available at EBI’s PRIDE database (where I work), a process documented in the open-access paper From Peptidome to PRIDE: Public proteomics data migration at a large scale. As opposed to SRA and next-gen sequencing DNA data, Peptidome was shut down permanently.

But back to DNAnexus and the fresh news that Google Ventures “has joined a group of investors seeding DNAnexus’ cloud-computing services for biomedical researchers with a $15 million C round”. What makes this news interesting is that DNAnexus recently announced large-scale projects with Stanford and Baylor College of Medicine to process tens of thousands of genomes and make the resulting data sets securely available. From the press release:

Through these two projects alone, DNAnexus processed over 17,000 genomes and generated more than 500 terabytes of genomic data. The resulting data sets can be accessed on DNAnexus and are available for follow-on analysis and data sharing to researchers participating in the 1000 Genomes Project and the Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium.

Put this news in the context of Google’s investment in Calico to address aging and healthy lifespan extension, and of the fact that Calico’s first scientist hires, like Cynthia Kenyon and David Botstein, come from a genetics/genomics background. Medium boom; it’s still the opening game on the board!

This all sounds awesome and I hope the DNAnexus folks are going to deliver & have fun while doing so. But let me offer an angle here that makes this big data bioinformatics news just a little less awesome: DNAnexus already tried MapReduce and Hadoop for their genomics data processing pipelines, and in the end they ditched it for proprietary, homegrown software. That was early 2013, and they named the following reasons, amongst others, for doing so (ditching, that is, the kernel of the operating system of big data):

1. “Most bioinformatics software that exists today is not written to run with Hadoop.”

2. “if you process genomics data with an online service, you’re forced to move a lot of data from place to place” (see, for instance, copying data from S3 buckets to EC2 instances)

3. “The other problem with Hadoop is that it wasn’t designed for real-time queries”

Let me comment on these 3 points quickly:

1. True, and as far as I know Hadoop is not used in production bioinformatics anywhere. But the benefits of reimplementing the crucial bioinformatics algorithms in the Hadoop ecosystem would make it a very worthy challenge and would connect the biodata world back to the world of general/web data. /I’m planning to write a follow-up post on the reasons to do this reimplementation./ Also, I already know a handful of bioinformaticians working with Hadoop & co., and their numbers are slowly growing. 😉 See Mikael Huss’ quick overview of existing Hadoop implementations in bioinformatics, and the toy k-mer counting sketch after this list for a flavor of what such a reimplementation can look like.

2. Well, it’s a funny thing that the data locality argument is used against Hadoop at the cloud computing platform/service level, when Hadoop/HDFS were designed around exactly the cluster principle of data locality: /move the code to the data, not the other way round/. On a more serious note: AWS != cloud computing platforms, and developing a black-box, patented system instead might not benefit the whole field as much as a working solution built around Hadoop would. Also, a 10x performance benefit now, won from a system built from scratch, might turn out to block the 100x performance benefit later.

3. True, but it’s probably worth checking what has happened under the hood since early 2013 to address real-time queries within the Hadoop ecosystem, Cloudera Impala going GA and Hadoop 2/YARN opening the platform up to non-MapReduce engines being two examples.
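To make point 1 a bit more tangible, here is a minimal, hypothetical sketch of a bioinformatics staple, k-mer counting over FASTQ reads, written as a Hadoop Streaming job. Streaming keeps the logic in Python, which is how many bioinformaticians would approach it anyway; the k value and all the file paths below are made up for illustration.

```python
#!/usr/bin/env python
# kmer_mapper.py -- emit (k-mer, 1) for every k-mer in each read.
# Simplifying assumption: each mapper sees whole 4-line FASTQ records;
# a production pipeline would use a FASTQ-aware input format instead.
import sys

K = 21  # k-mer length, chosen arbitrarily for this sketch

for i, line in enumerate(sys.stdin):
    if i % 4 != 1:  # the sequence is the 2nd line of each 4-line record
        continue
    seq = line.strip().upper()
    for j in range(len(seq) - K + 1):
        kmer = seq[j:j + K]
        if 'N' not in kmer:  # skip k-mers with ambiguous bases
            print('%s\t1' % kmer)
```

The reducer just sums the counts; Hadoop delivers the mapper output sorted by key, so all lines for one k-mer arrive consecutively:

```python
#!/usr/bin/env python
# kmer_reducer.py -- sum the per-k-mer counts produced by the mapper.
import sys

current, count = None, 0
for line in sys.stdin:
    kmer, n = line.rstrip('\n').split('\t')
    if kmer != current:
        if current is not None:
            print('%s\t%d' % (current, count))
        current, count = kmer, 0
    count += int(n)
if current is not None:  # flush the last key
    print('%s\t%d' % (current, count))
```

Submitting it is a single command (the exact location of the streaming jar depends on the Hadoop version):

```
hadoop jar hadoop-streaming.jar \
    -file kmer_mapper.py -file kmer_reducer.py \
    -mapper kmer_mapper.py -reducer kmer_reducer.py \
    -input /fastq/reads -output /fastq/kmer_counts
```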

With that, I’d like to congratulate the DNAnexus folks on the C round and the new projects, and suggest that they reconsider using the Hadoop ecosystem. These are early days, and it’s probably worth keeping a three-person Hadoop developer/bioinformatician group around, in the basement, say. Besides giving them Hadoop implementations to benchmark the homegrown code against, one day such a group might help them change tracks.

Disclosure: I’ve been running Hadoop jobs on a daily basis for the last couple of years (in proteomics, not in genomics) and, having benefited from it first hand, I am biased towards it. Granted, a couple of hours or 1–2 days for a big job is good enough for ‘real-time’ in research bioinformatics. Also, I’m certainly super-biased towards using MapReduce/Hadoop in aging-driven big data bioinformatics research.

Maybe I should also give Apache Spark a try.
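If I do, here is what the k-mer count from the sketch above could look like in PySpark, again purely illustrative and assuming the sequences have been pre-extracted to plain text, one read per line. Intermediate results live in memory as RDDs, which is exactly the kind of iterative, interactive workload plain MapReduce struggles with:

```python
# kmer_spark.py -- the k-mer count from above as a Spark job (sketch only;
# the paths and k value are illustrative).
from pyspark import SparkContext

K = 21
sc = SparkContext(appName='kmer-count')

counts = (sc.textFile('hdfs:///fastq/sequences.txt')
            .map(lambda line: line.strip().upper())
            .flatMap(lambda s: [s[i:i + K] for i in range(len(s) - K + 1)])
            .filter(lambda kmer: 'N' not in kmer)
            .map(lambda kmer: (kmer, 1))
            .reduceByKey(lambda a, b: a + b))  # the shuffle + reduce phase

counts.saveAsTextFile('hdfs:///fastq/kmer_counts')
```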