1-day Apache Spark training: randomish insights

Last week I participated in a one-day Apache Spark workshop in London, developed by Databricks and organised by Big Data Partnership.

Databricks Training Resources is the most important link you need in order to get started; it contains the whole training material.

Let me share some short comments:

Spark is the next logical step, generalising the map/shuffle/reduce processing engine paradigm. It is evolution, not revolution.
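A minimal PySpark word count makes the "generalised map/shuffle/reduce" point concrete: the classic MapReduce job collapses into a few chained transformations. This is just a sketch; the input and output paths are hypothetical.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "wordcount-sketch")

counts = (sc.textFile("hdfs:///data/books/*.txt")   # hypothetical input path
            .flatMap(lambda line: line.split())      # the "map" phase
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))        # the "shuffle + reduce" phase

counts.saveAsTextFile("hdfs:///data/wordcounts")     # hypothetical output path
```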

The main motivation behind Spark is to serve as the ultimate glue of the big data stack.

Do not take the in-memory distributed datasets message for granted; there is still a lot of writing to disk involved.

Spark attempts to cache, but it is not automatic and there is no guarantee: if caching does not work, it falls back to the default write-to-disk behaviour, and to default MapReduce performance.
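To make that concrete, here is a small PySpark sketch (the input path is hypothetical): caching is something you request explicitly, and it is best-effort, with partitions that do not fit simply recomputed or spilled to disk.

```python
from pyspark import SparkContext, StorageLevel

sc = SparkContext("local[*]", "caching-sketch")

logs = sc.textFile("hdfs:///data/logs/*")            # hypothetical input path
errors = logs.filter(lambda line: "ERROR" in line)

# cache() is only a hint: partitions that do not fit in memory are not kept,
# and will be recomputed on the next action.
errors.cache()

# persist() lets you state the fallback explicitly, e.g. spill to local disk
# instead of recomputing:
# errors.persist(StorageLevel.MEMORY_AND_DISK)

print(errors.count())   # first action: computes and tries to cache
print(errors.count())   # second action: served from cache where caching succeeded
```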

There is a final map stage that partitions its output so that it is ready for the shuffle.

There is a lightweight shuffle and a more complex shuffle: Spark's shuffle does not sort keys and values by default, as Hadoop/MapReduce does.
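A small sketch of that difference: a Hadoop reducer always sees its keys in sorted order, whereas Spark only sorts when you ask for it explicitly.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "shuffle-sort-sketch")

pairs = sc.parallelize([("b", 1), ("a", 1), ("c", 1), ("a", 1)])

# reduceByKey shuffles by hash partitioning; keys come back in no particular order.
print(pairs.reduceByKey(lambda a, b: a + b).collect())

# Hadoop-style sorted-by-key output has to be requested explicitly.
print(pairs.reduceByKey(lambda a, b: a + b).sortByKey().collect())
```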

In Spark there are typically many more reducers than in MapReduce, and in general many more lightweight tasks.
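A sketch of how you control that: the number of reduce tasks is simply the number of output partitions you ask for (the value 64 below is arbitrary; spark.default.parallelism is used when you don't specify one).

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "partitions-sketch")

words = sc.parallelize(["spark", "mesos", "spark", "hadoop"] * 1000)
pairs = words.map(lambda w: (w, 1))

# The second argument to reduceByKey sets the number of reduce partitions,
# i.e. the number of (cheap) reduce tasks.
counts = pairs.reduceByKey(lambda a, b: a + b, 64)

print(counts.getNumPartitions())   # 64 small reduce tasks rather than a few heavy ones
```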

Productionalizing Python Spark is a nightmare … according to Scala/Java coders.

Nobody is using Spark in production in Europe; POCs only.

Go ahead and launch your first Spark cluster on Google Cloud Platform for free using Mesos and Mesosphere. Google provides a free starter pack for developers ($500 of credit for 60 days); you only need to sign up. I launched my first cluster over the weekend, using Paco Nathan’s Spark atop Mesos on Google Cloud Platform setup. It works just fine.
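Once the cluster is up, pointing Spark at it is mostly configuration. A minimal PySpark sketch, assuming a Mesos master at 10.0.0.1:5050 and a Spark binary the executors can download (both values are placeholders, substitute your own):

```python
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setMaster("mesos://10.0.0.1:5050")   # placeholder Mesos master URL
        .setAppName("first-mesos-smoke-test")
        # placeholder URI to a Spark binary distribution the executors can fetch
        .set("spark.executor.uri", "http://example.com/spark-bin-hadoop2.tgz"))

sc = SparkContext(conf=conf)

# Trivial job to confirm tasks actually run across the cluster.
print(sc.parallelize(range(100000)).sum())

sc.stop()
```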