Doing Data: The Critical Process of Data Preparation for Machine Learning and More

06/18/2019 - 11:00 to 11:40
long talk (40 min)

Session abstract: 

Machine learning is rapidly being democratized and is becoming something that developers at large can use effectively. To do so, however, developers need to add appropriate data skills to their coding skills. It’s not the advanced algorithms and modelling they must learn; that may be left to the data scientists. Instead, the most important part of a machine learning system turns out to be the data preparation itself, and the importance of the data is often underestimated, especially by people new to these approaches. Data preparation is an essential skill to learn, and clever techniques for data exploration, feature extraction and data versioning can be applied across a wide variety of machine learning projects. In fact, in some cases, data exploration and feature extraction reveal solutions that make the machine learning part of a machine learning system optional. Honing data preparation skills is useful for both experienced data scientists and newcomers.

This presentation will cover practical techniques for data engineering with specific examples. We will focus on three areas of data preparation: data exploration, feature extraction and data management for machine learning systems, and we will give examples from real-world stories of where these approaches are effective. The approaches we examine are powerful yet simple enough for people new to machine learning to understand easily, and they may surprise more experienced data scientists as well. We include the rationale behind the approaches along with specific implementation details to make it easier for the audience to apply these techniques to their own situations.