3 recent Hadoop/MapReduce applications in the life sciences: RNA structure prediction, neuroimaging genetics, EEG signal analysis

3 open access papers, 3 prototypes, source code available only for 1, healthy diversification of topics.

1. Enhancement of accuracy and efficiency for RNA secondary structure prediction by sequence segmentation and MapReduce

code available: haven’t found it referenced in the paper

Our previous research shows that cutting long sequences into shorter chunks, predicting secondary structures of the chunks independently using thermodynamic methods, and reconstructing the entire secondary structure from the predicted chunk structures can yield better accuracy than predicting the secondary structure using the RNA sequence as a whole.

This is quite unexpected but obviously favourable for a pleasingly parallel implementation MapReduce can offer. Lots of benchmarking and comparison, quite methodology focused.

2. Random forests on Hadoop for genome-wide association studies of multivariate neuroimaging phenotypes

code and documentation available: http://www2.imperial.ac.uk/~gmontana/parfr.htm but only java classes no project management and automated software build tools used

We have developed a parallel version of the RF algorithm for regression and genetic similarity learning tasks in large-scale population genetic association studies involving multivariate traits, called PaRFR (Parallel Random Forest Regression)… Notable speed-ups are obtained by introducing a distance-based criterion for node splitting in the tree estimation process. PaRFR has been applied to a genome-wide association study on Alzheimer’s disease (AD) in which the quantitative trait consists of a high-dimensional neuroimaging phenotype describing longitudinal changes in the human brain structure. PaRFR provides a ranking of SNPs associated to this trait, and produces pair-wise measures of genetic proximity that can be directly compared to pair-wise measures of phenotypic proximity. Several known AD-related variants have been identified, including APOE4 and TOMM40. We also present experimental evidence supporting the hypothesis of a linear relationship between the number of top-ranked mutated states, or frequent mutation patterns, and an indicator of disease severity.

First impression is of a well-thought, serious study with different types of results.

3. Cloudwave: distributed processing of “big data” from electrophysiological recordings for epilepsy clinical research using hadoop.

code available: I don’t think so

took some time to load the provided website http://prism.case.edu/prism/index.php/Cloudwave

data shift: intracranial electrodes generate significantly larger volume of signal data as compared to scalp electrodes

Epilepsy is the most common serious neurological disorder affecting 50–60 million persons worldwide. Multi-modal electrophysiological data, such as electroencephalography (EEG) and electrocardiography (EKG), are central to effective patient care and clinical research in epilepsy. Electrophysiological data is an example of clinical “big data” consisting of more than 100 multi-channel signals with recordings from each patient generating 5–10GB of data. Current approaches to store and analyze signal data using standalone tools, such as Nihon Kohden neurology software, are inadequate to meet the growing volume of data and the need for supporting multi-center collaborative studies with real time and interactive access. We introduce the Cloudwave platform in this paper that features a Web-based signal analysis interface integrated with a Hadoop-based data processing module implemented on clinical data stored in a “private cloud”. Cloudwave has been developed as part of the National Institute of Neurological Disorders and Strokes (NINDS) funded multi-center Prevention and Risk Identification of SUDEP Mortality (PRISM) project. The Cloudwave visualization interface provides real-time rendering of multi-modal signals with “montages” for EEG feature characterization over 2TB of patient data generated at the Case University Hospital Epilepsy Monitoring Unit. Results from performance evaluation of the Cloudwave Hadoop data processing module demonstrate one order of magnitude improvement in performance over 77GB of patient data.