Imagine you are a multinational company with customers in countries where the business language is not necessarily English. How would you build a centralized search system that caters to the needs of all your users? Apache Lucene, Elasticsearch, and Apache Solr all provide language-specific text analyzers, but which one should you use, and when? We found ourselves in exactly this situation.
We are an email-security company with around 300 billion emails archived, amounting to petabytes of indexed data. We initially used Lucene's whitespace analyzer to serve searches in different languages, but this approach, although simple, revealed serious limitations once we started serving languages such as German, as the sketch below illustrates.
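To make the limitation concrete, here is a minimal Lucene sketch in Java (our own illustration, not code from the talk) comparing whitespace analysis with Lucene's built-in German analyzer on a short German phrase. The exact tokens depend on the Lucene version, but the whitespace analyzer always keeps inflected forms such as "Lieferungen" verbatim, so a query for "Lieferung" will not match them:

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.analysis.de.GermanAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class AnalyzerComparison {

    // Collect the tokens an analyzer emits for a given text.
    static List<String> tokenize(Analyzer analyzer, String text) throws IOException {
        List<String> tokens = new ArrayList<>();
        try (TokenStream ts = analyzer.tokenStream("body", text)) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                tokens.add(term.toString());
            }
            ts.end();
        }
        return tokens;
    }

    public static void main(String[] args) throws IOException {
        String text = "Die Lieferungen wurden verschoben";
        // Whitespace analysis only splits on whitespace: inflected forms
        // and stop words pass through unchanged.
        System.out.println(tokenize(new WhitespaceAnalyzer(), text));
        // GermanAnalyzer lowercases, drops German stop words ("die", "wurden")
        // and applies German stemming, so singular and plural forms of
        // "Lieferung" should reduce to the same indexed term.
        System.out.println(tokenize(new GermanAnalyzer(), text));
    }
}
```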
I will explain how we overcame these problems by first identifying the language of each email with our own language-detection model, which in turn guided the selection of the analyzer used to process it.
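A minimal sketch of that routing step, assuming a detector that returns ISO 639-1 codes; our actual model is proprietary, so the LanguageDetector interface below is a hypothetical stand-in:

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.de.GermanAnalyzer;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.fr.FrenchAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class AnalyzerSelector {

    // Stand-in for the in-house language-detection model described in the
    // talk; any detector returning an ISO 639-1 code would fit here.
    interface LanguageDetector {
        String detect(String text);
    }

    private final LanguageDetector detector;

    AnalyzerSelector(LanguageDetector detector) {
        this.detector = detector;
    }

    // Route each email body to a language-specific Lucene analyzer,
    // falling back to StandardAnalyzer when the language is unknown
    // or unsupported.
    Analyzer analyzerFor(String emailBody) {
        switch (detector.detect(emailBody)) {
            case "de": return new GermanAnalyzer();
            case "en": return new EnglishAnalyzer();
            case "fr": return new FrenchAnalyzer();
            default:   return new StandardAnalyzer();
        }
    }
}
```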
This talk will walk you through how to build multilingual search systems and explore the different possible approaches. It will also discuss the problems one may run into when these language-specific analyzers are used, and the ways to improve search results in those cases. In particular, the talk will focus on query-log analysis as an effective way to improve multilingual search: it provides the feedback needed to fine-tune the analyzers used for stemming and lemmatization, thereby increasing not only the recall but also the precision (relevance) of the search results, as sketched below.
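As one illustration of that feedback loop (not the specific mechanism presented in the talk), Lucene's language analyzers accept a stem-exclusion set: terms that query-log analysis flags as overstemmed can be exempted from stemming to recover precision. The term list below is purely illustrative:

```java
import java.util.Set;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.de.GermanAnalyzer;

public class QueryLogTuning {

    // Hypothetical output of an offline query-log analysis job: terms whose
    // stemmed form triggered complaints about irrelevant matches. These
    // entries are placeholders, not real findings from the talk.
    static final Set<String> DO_NOT_STEM = Set.of("vertrieb", "rechten");

    // Build a German analyzer that keeps the default stop words but exempts
    // the problematic terms from stemming via Lucene's stem-exclusion set.
    static Analyzer tunedGermanAnalyzer() {
        CharArraySet stemExclusions = new CharArraySet(DO_NOT_STEM, /*ignoreCase=*/ true);
        return new GermanAnalyzer(GermanAnalyzer.getDefaultStopSet(), stemExclusions);
    }
}
```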