Although building ML models on small/ toy data-set is easy, most production-grade problems involve massive datasets which current ML practices don’t scale to. In this talk, we cover how you can drastically increase the amount of data that your models can learn from using distributed data/ml pipes.
It can be difficult to figure out how to work with large data-sets (which do not fit in your RAM), even if you’re already comfortable with ML libraries/ APIs within python. Many questions immediately come up: Which library should I use, and why? What’s the difference between a “map-reduce” and a “task-graph”? What’s a partial fit function, and what format does it expect the data in? Is it okay for my training data to have more features than observations? What’s the appropriate machine learning model to use? And so on…
In this talk, we’ll answer all those questions, and more!
We’ll start by walking through the current distributed analytics (out-of-core learning) landscape in order to understand the pain-points and some solutions to this problem.
Here is a sketch of a system designed to achieve this goal (of building scalable ML models):
- a way to stream instances
- a way to extract features from instances
- an incremental algorithm
Then we’ll read a large dataset into Dask, Tensorflow (tf.data) & sklearn streaming, and immediately apply what we’ve learned about in last section. We’ll move on to the model building process, including a discussion of which model is most appropriate for the task. We’ll evaluate our model a few different ways, and then examine the model for greater insight into how the data is influencing its predictions. Finally, we’ll practice this entire workflow on a new dataset, and end with a discussion of which parts of the process are worth tuning for improved performance.
- Intro to out-of-core learning
- Representing large datasets as instances
- Transforming data (in batches) – live code [3-5]
- Feature Engineering & Scaling
- Building and evaluating a model (on entire datasets)
- Practicing this workflow on another dataset
- Benchmark other libraries/ for OOC learning
- Questions and Answers
By the end of the talk participants would know how to build petabyte scale ML models, beyond the shackles of conventional python libraries.
Participants would have a benchmarks and best case practices for building such ML models at scale.