Are you a data analyst who works with Spark and is often confused by failures you don’t understand? Have you seen a bunch of presentations or blog posts about Spark performance but are still not sure how to apply their hints in practice?
Spark is commonly used by people who are not expert programmers but who know SQL and sometimes basic Python. They treat Spark as a tool for getting business value from the data. And that is how it should be! Yet the queries they run often fail for no obvious reason. This talk is designed for such Spark users and will focus on common problems with Spark (especially DataFrames and SQL) that can be solved by anyone familiar with SQL. You don’t need to read bytecode to understand the techniques presented and apply them in practice!
This talk will be a case study of multiple DataFrame queries in Spark that initially do not work. I will not only explain how to fix them; we will also go through each solution step by step, so you will learn what to pay attention to and how to apply similar techniques to your own codebase!