Complete, precise graph-based phrase query with SpanNearQuery

06/17/2019 - 11:50 to 12:10
short talk (20 min)

Session abstract: 

The current implementation of SpanNearQuery in Lucene sacrifices precision and completeness in favor of performance. This tradeoff significantly limits Lucene's usefulness for some potential high-value use cases. This talk introduces a stable patch that makes SpanQuery matching precise and complete, while maintaining performance comparable to that of the extant implementation.

The patch will be discussed in the context of the issue LUCENE-7398, describing the central problem and how it differs from the problem addressed by the new IntervalsSource API.

We will discuss details of the patch implementation, in an effort to familiarize the audience with techniques involved (and hopefully inspire further innovation). Details covered will include:

  1. The main word-lattice data structure (linked, circular, resizable-array-backed, 2-dimensional queue with stable node references and support for efficient binary seek, serial traversal, and arbitrary node removal).
  2. Performance/GC management for linked data structures (as opposed to the array-based data structures that are more common in the Lucene codebase)
  3. Recording in the index: positionLength, nextStartPosition lookahead, and information about possible decrease of endPosition (de-"sausagization" of the index)

Possible implications/applications of robust support for precise, performant matching over complex indexed token graphs are significant:

  1. Index-time synonyms that leverage document context (the current recommendation for supporting graph-based phrase queries supports synonyms exclusively at query-time)
  2. CJK search: index support for orthographic variation
  3. More thorough scoring (e.g., variant forms, implicit boosting for exact orthographic matches, etc.)
  4. Potential for use in domains that prioritize precision, where "phrase" search could be used over data represented by complex token schemes (e.g., time-series data (scheduling, complex event processing), overlapping shingles, etc.)