Distributed and Native Hybrid optimizations for Machine Learning Workloads

06/12/2017 - 11:00 to 11:40
long talk (40 min)

Session abstract: 

Data scientists love tools like R and Scikit-Learn, as they offer a convenient and familiar syntax for analysis tasks. However, these systems are limited to operating serially on data sets that can fit on a single node and do not allow for distributed execution.

Mahout-Samsara is a linear algebra environment that offers both an easy-to-use Scala DSL and efficient distributed execution for linear algebra operations. Data scientists transitioning from R to Mahout can use the Samsara DSL for large-scale data sets with familiar R-like semantics. Machine Learning and Deep Learning algorithms built with the Mahout-Samsara DSL are automatically parallelized and optimized to execute on distributed processing engines like Apache Spark and Apache Flink accelerated natively by CUDA, OpenCL and OpenMP.

In this talk, we will look at Mahout's distributed linear algebra capabilities and demonstrate an EigenFaces classification using Distributed SSVD executing on a GPU cluster. This talk will also demonstrate the ease with which one can roll-out new Machine Learning Algorithms using Mahout-Samsara DSL. 

ML practitioners will come away from this talk with a better understanding of how Samsara's linear algebra environment can help simplify developing highly scalable, CPU/GPU accelerated ML and DL algorithms by focusing solely on the declarative specification of the algorithm without having to worry about the implementation details of a scalable distributed engine or having to learn to program with native math libraries.