Continuous Live Monitoring of Machine Learning Models with Delayed Label Feedback

06/12/2018 - 11:50 to 12:30
Moon Lounge
long talk (40 min)

Session abstract: 

The usual steps of developing a machine learning model are training on a training set, tuning on a validation set, and evaluating performance on a test set. Often, this is the end of the story. However, if the model is particularly good, it will be deployed to serve predictions in a production system. In this talk, we present what happens to a machine learning model after it is deployed in production at Zalando Payments. We focus on the precautions we need to take to ensure that a model’s predictions stay at the high quality we expect. The stakes are high, particularly for models that directly touch the revenue stream. Since we cannot afford to let a drop in prediction quality pass unnoticed, we need to continuously monitor our deployed machine learning models.

As we operate in the fraud detection domain, one additional challenge is that we only learn several weeks later whether a customer paid for their order at Zalando, and therefore whether our prediction for that order was accurate. This makes the simple solution of monitoring prediction accuracy impractical: by the time we notice a problem, it is already too late. Our solution is to monitor the similarity between the distributions of features in the live traffic and the distributions of the same features in the test set on which the model was evaluated. This allows us to detect immediately when the conditions under which the model was evaluated have changed substantially, which would invalidate the conclusions we drew in the initial testing. We describe how these changes in feature distributions are detected automatically using the TDigest algorithm, and how alerts are raised.

Further, we delve into the technical implementation decisions. First, we describe how we collect the live traffic of a mission-critical service in a non-intrusive way, so that we do not interfere with the normal operation of the service. Second, we present how the collected data is processed in a scalable way using Apache Spark. Finally, we show how we automate everything with AWS Data Pipeline.
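
To make the idea of comparing feature distributions concrete, here is a minimal sketch in Python. It is not the implementation presented in the talk: it computes exact percentiles with the standard library as a stand-in for the approximate quantiles a TDigest sketch would provide, and the function names, the drift score, and the threshold of 0.1 are all assumptions chosen for illustration.

```python
import random
from statistics import quantiles


def quantile_drift(reference, live, n=100):
    """Largest gap between matching percentiles of two samples.

    The gap is scaled by the spread of the reference sample (distance
    between its first and last cut points) so that the score does not
    depend on the unit of the feature. Exact quantiles are used here;
    a production system would more likely query a quantile sketch.
    """
    ref_q = quantiles(reference, n=n)   # n - 1 cut points
    live_q = quantiles(live, n=n)
    spread = (ref_q[-1] - ref_q[0]) or 1.0  # guard against a constant feature
    return max(abs(r - l) for r, l in zip(ref_q, live_q)) / spread


def drift_alert(reference, live, threshold=0.1):
    """Return True when the live sample has drifted from the reference.

    The threshold is a hypothetical value and would need tuning per feature.
    """
    return quantile_drift(reference, live) > threshold


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical feature values: test set, unchanged traffic, shifted traffic.
    test_set = [rng.gauss(0.0, 1.0) for _ in range(5000)]
    live_same = [rng.gauss(0.0, 1.0) for _ in range(5000)]
    live_shifted = [rng.gauss(1.0, 1.0) for _ in range(5000)]

    print("unchanged traffic alert:", drift_alert(test_set, live_same))
    print("shifted traffic alert:  ", drift_alert(test_set, live_shifted))
```

Because the check needs only feature values and no labels, it can run as soon as live traffic has been collected, which is the property that matters when label feedback arrives weeks late.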