Big Data, Small Code: Using Java 8 and Apache Crunch to quickly develop concise, efficient, readable and testable data pipelines for Hadoop MapReduce and Spark.

06/06/2016 - 11:00 to 11:40
long talk (40 min)

Session abstract: 

New execution platforms keep appearing, each aiming to be the "hot new thing" in Big Data, yet most of the heavy lifting in data organisations is still done with Hadoop MapReduce, which remains a sensible choice for whole classes of ETL and aggregation problems. Apache Crunch is a simple framework on top of MapReduce (with support for running on Spark as well) that applies typesafe, functional programming idioms to batch data processing pipelines to maximise developer productivity. With the addition of Java 8 and the upcoming crunch-lambda module, it is now simpler than ever to express your intent and get code working on your cluster quickly. This session will cover the concepts behind Crunch, introduce the API, and provide practical examples of how it can be used to simplify your codebase and increase your productivity.