Last week I participated in a one-day Apache Spark workshop in London, developed by Databricks and organised by Big Data Partnership. Databricks Training Resources is the most important link you need to know to get started; it contains the whole training material. Let me share some short comments: Spark is the next, logical generalised step leveraging the… Continue reading 1 day Apache Spark training: randomish insights
Earlier this year (February-April) I ran nine short, one-hour hands-on sessions (5 people per session) called Hadoop 101 for bioinformaticians at the Genome Campus, for people from the European Bioinformatics Institute and the Sanger Institute. The participants were bioinformaticians, developers and sysadmins. My idea was to start with a ~20-minute theoretical introduction so that it provides some handles on whether… Continue reading Hadoop 101 for bioinformaticians: 1 hour crash course, code and slides
MCMC methods guarantee an accurate enough result (say, parameter estimation for a phylogenetic tree). But they usually deliver it only in the long run, and many burn-in steps might be necessary before they perform well. And as the data size grows, the number of operations needed to draw a sample grows too (N -> O(N)… Continue reading Pleasingly Parallel MCMC: cracked wide open for MapReduce and Hadoop
3 open access papers, 3 prototypes, source code available for only 1, healthy diversification of topics. 1. Enhancement of accuracy and efficiency for RNA secondary structure prediction by sequence segmentation and MapReduce. Code available: haven’t found it referenced in the paper. Our previous research shows that cutting long sequences into shorter chunks, predicting secondary structures of… Continue reading 3 recent Hadoop/MapReduce applications in the life sciences: RNA structure prediction, neuroimaging genetics, EEG signal analysis
The first time DNAnexus made me think a little about what they could achieve was when they came up with an alternative search and browse interface for the complete Sequence Read Archive (SRA) database. They came to the ‘rescue’ as NCBI discontinued SRA in 2011, although they later changed their mind, so SRA is still up and running there.… Continue reading Google invests into DNAnexus: aging-driven big data bioinformatics without the Hadoop Ecosystem?