Scalable machine-learned model serving

long talk (40 min)

Session abstract: 

Applying machine learning in online applications requires solving the problem of model serving: evaluating the machine-learned model over some data point(s) in real time while the user is waiting for a response. Solutions such as TensorFlow Serving are available to solve this problem where the model only needs to be evaluated over one data point per user request, but what about the case where a model needs to be evaluated over many data points per request, such as in search and recommendation systems?

This talk will show that this is a bandwidth-constrained problem, and outline an architectural solution where computation is pushed down to data shards in parallel. It will demonstrate how this solution can be put into use with Vespa, the open source big data serving engine, to achieve scalable model serving of TensorFlow and ONNX models, and show benchmarks comparing performance and scalability to TensorFlow Serving.
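
The push-down idea can be illustrated with a minimal sketch. It is not Vespa's implementation or API: the shard layout, the dot-product `score` function standing in for a machine-learned model, and the names `evaluate_on_shard` and `serve` are all assumptions made for illustration. Each shard scores its own data points locally and returns only its k best hits, so the data that crosses the network is proportional to k per shard rather than to the number of data points.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

def score(query, point):
    # Hypothetical stand-in for a machine-learned model: a dot product.
    return sum(q * x for q, x in zip(query, point))

def evaluate_on_shard(shard, query, k):
    # Runs where the data lives: score every local point, return only the k best.
    scored = ((score(query, vec), doc_id) for doc_id, vec in shard.items())
    return heapq.nlargest(k, scored)

def serve(shards, query, k):
    # Scatter the query to all shards in parallel, then merge the small partial results.
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(lambda s: evaluate_on_shard(s, query, k), shards))
    merged = heapq.nlargest(k, (hit for part in partials for hit in part))
    return [(doc_id, s) for s, doc_id in merged]

# Two toy shards, each mapping a document id to a feature vector.
shards = [
    {"a": [1.0, 0.0], "b": [0.5, 0.5]},
    {"c": [0.0, 1.0], "d": [2.0, 1.0]},
]
top = serve(shards, query=[1.0, 2.0], k=2)
print(top)
```

In this toy setup the threads only simulate separate shard nodes; in a real deployment each call to `evaluate_on_shard` would run on the machine holding that shard.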

Model serving with Vespa is used today for some of the world's largest recommendation systems, such as serving personalized content on all Yahoo content pages, personalized ads in the world's third largest ad network, and image search and retrieval by similarity in Flickr. These systems evaluate models over millions of data points per request for hundreds of thousands of requests per second.


  • Online evaluation of machine-learned models (model serving) is difficult to scale to large data sets.
  • Vespa is an open source solution to this problem in use today on some of the largest such systems in the world, such as the content pages on the Yahoo network and the world's third largest ad network.
  • This talk will explain the problem and architectural solution, show how Vespa can be used to implement the solution to achieve scalable serving of TensorFlow, ONNX and hand-written models, and present benchmarks comparing performance and scalability to TensorFlow Serving.