365 days of Spark!

06/12/2017 - 14:00 to 14:40
long talk (40 min)

Session abstract: 

The first 365 days of a relationship are full of discoveries and lessons! The same is true of significant technological shifts. Apache Spark has rapidly transformed the data platform landscape and recently reached version 2.0, completing an important development cycle. But how do you really succeed at rolling Spark out across an organization? How do you support various use cases, from machine learning and data products all the way to analytics and reporting? And how do you do that step by step, gracefully and efficiently?

I would like to share my experience and the lessons learned from a first, intense year of moving Spark into production at GetYourGuide, a Berlin startup of 250+ employees in the travel space, with lots of data to analyse and crunch. Whether the starting point is a legacy data warehouse, a Hadoop infrastructure or anything in between, the "playbook" I would like to present touches on the following important questions and topics:

  • How to support SQL-based use cases as well as complex machine learning on one platform?
  • How to integrate an existing data warehouse into the rollout strategy?
  • Which platform to choose, and what resources to expect to need?
  • Can everything be done with the Dataset API, or are RDDs still relevant?
  • How to get started with streaming?
  • How to best organize your data when moving from relational tables to a data lake on S3?
  • How to integrate with existing BI tools (e.g. Looker)?

This presentation will be informative for anyone currently planning or executing a migration to Spark. It will highlight intermediate technical topics for newcomers, as well as tips and advice for a successful first-year relationship with Spark!