Learning-to-rank (LTR) is now widely used to tailor user search experiences based on factors external to the search engine. In an era of e-commerce, clickstream data, and low-latency responses, the traditional mode of offline LTR model training can cost business and degrade the user experience: data loses value quickly as it ages, and in some domains and industries, failing to surface results from the recent past can drive users away.
A pipeline framework that ingests streaming data and processes it in near real time can keep LTR feature values current and prevent model staleness. Such a pipeline must be highly available and must scale to the required data throughput.
In this talk we will explore where the worlds of search relevance and data engineering intersect. We will discuss an architecture for a data ingest pipeline designed to process text from a variety of sources, such as news articles and Twitter feeds. We will use Apache Kafka for data ingestion, natural language processing for content analysis, and Apache Flink to process the streaming data and initiate training and deployment of LTR models to Elasticsearch. We will also look at how Flink's stateful stream processing and side-input capabilities can be leveraged for real-time model training and inference.
Attendees will come away with a better understanding of how modern streaming platforms can power real-time LTR model training and inference.