The first 365 days of a relationship are full of discoveries and lessons! The same is true of significant technological shifts. Apache Spark has rapidly transformed the data platform landscape and recently reached version 2.0, completing an important development cycle. But how do you really succeed at rolling Spark out across an organization? How do you support varied use cases, from machine learning and data products all the way to analytics and reporting? And how do you do it step by step, gracefully and efficiently?
I would like to share my experience and lessons from an intense first year of moving Spark into production at GetYourGuide, a Berlin travel startup of 250+ employees with lots of data to crunch and analyse. Whether the starting point is a legacy data warehouse, a Hadoop infrastructure or anything in between, the "playbook" I would like to present touches on the following questions and topics:
- How to support SQL-based use cases, as well as complex machine learning, on one platform?
- How to integrate an existing data warehouse into the rollout strategy?
- Which platform to choose, and what resources to plan for?
- Can everything be done with the Dataset API, or are RDDs still relevant?
- How to get started with streaming?
- How to best organize your data when moving from relational tables to a data lake on S3?
- How to integrate with existing BI tools (e.g. Looker)?
This presentation will be informative for anyone currently planning or executing a migration to Spark. It will cover intermediate technical topics for newcomers, as well as tips and advice for a successful first-year relationship with Spark!