From boolean towards semantic retrieval models

06/12/2018 - 15:40 to 16:00
short talk (20 min)

Session abstract: 

The default boolean retrieval model in generic full text search engines is usually not the best choice for domains like e-commerce because of the lack of decidability about optimal combination of conjunctions and disjunctions of terms that would yield a significant number of relevant results. To improve this, Apache Lucene/Solr introduced the minimum match criteria, where a specified minimum number of query terms must match. But the problem is to decide and tell Solr which are the most important terms (the most salient query theme) that “must” match.

We present a framework to identify (a) important weighted terms in queries (called Must-have Tokens or MTs) and (b) augment them using synonyms. A dependency parser learns weights of tokens to build the MTs list and a neural net embedding and word sense disambiguation is deployed to learn the synonyms specific to MTs in the query context. The models to determine them are built at scale on Apache Spark by analysing clickstream and catalog data across various domains. A custom query parser for Solr is used to augment the queries with these MTs and synonyms.