Every week we are introducing new speakers who will be on stage at #bbuzz 2015. Thanks to our program committee, we can present part of our new eclectic program. Presentations range from beginner-friendly introductions to hot data analysis topics to in-depth technical talks on scalable architectures. The conference features more than 50 talks by international speakers across the three tags "search", "store" and "scale".
Toke has been working with Lucene and Solr for 8 years, most of that time as technical lead on all things search at the State and University Library, Denmark. During that time he has developed a full hierarchical faceting system for Lucene (LUCENE-2369 / SOLR-2412) and written a patch for improving faceting performance for high cardinality fields in Solr (SOLR-5894). He is somewhat fixated on bit fiddling and search performance.
Toke will talk about netarchive.dk, which maintains a historical archive of Danish net resources. The library is indexing its 500TB of raw data into Solr. One of the requirements is to provide faceting on several fields, the largest of which has billions of unique String values. Stock Solr cannot do that with satisfactory performance on their hardware. Inspecting Solr's core faceting code has led to multiple performance improvements for high-cardinality faceting. The principles behind the improvements will be presented, and their influence on the faceting performance curve will be discussed and visualized with data from test and production systems.
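To make the problem concrete, here is a naive sketch of what faceting computes (the names `facet_counts` and the sample `domain` field are illustrative, not from the talk, and Solr's real implementation works over index structures rather than raw documents): for the documents matching a query, count how often each value of a field occurs. The challenge Toke's talk addresses is doing this efficiently when a field has billions of distinct values.

```python
from collections import Counter

def facet_counts(matching_docs, field):
    # Toy faceting: tally each value of `field` across the matching
    # documents and return values sorted by descending count.
    counts = Counter(doc[field] for doc in matching_docs if field in doc)
    return counts.most_common()

# Hypothetical query hits from a web archive:
hits = [
    {"domain": "dr.dk"},
    {"domain": "dk.wikipedia.org"},
    {"domain": "dr.dk"},
]
print(facet_counts(hits, "domain"))  # → [('dr.dk', 2), ('dk.wikipedia.org', 1)]
```

The naive version is linear in the number of hits and needs a hash entry per distinct value, which is exactly what becomes painful at billions of unique values.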
Adrien Grand is a Lucene/Solr committer at the Apache Software Foundation and a software engineer at Elasticsearch.
His talk is on making search fast with the algorithms and data structures that power Lucene and Elasticsearch. 80% of the job is organizing data so that it can be accessed with as little work as possible, which is exactly why Lucene is built on an inverted index. But there are interesting algorithms and data structures in that last 20% of the job as well. This talk will give insights into some internals of Lucene and Elasticsearch, showing how priority queues, finite state machines, bit-twiddling hacks and several other algorithms and data structures help make them fast.
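The inverted-index idea mentioned above can be sketched in a few lines (a toy illustration only; Lucene's actual postings format adds term dictionaries, skip lists, compression and much more): instead of scanning every document at query time, the data is organized up front as a mapping from each term to the IDs of the documents containing it.

```python
from collections import defaultdict

def build_inverted_index(docs):
    # Map each term to the sorted list of document IDs containing it.
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for term in sorted(set(text.lower().split())):
            index[term].append(doc_id)
    return index

docs = ["Berlin Buzzwords talks", "Lucene search internals", "fast search"]
index = build_inverted_index(docs)
# Answering a term query is now a lookup, not a scan over all documents:
print(index["search"])  # → [1, 2]
```

This organize-the-data-first step is the "80% of the job"; the per-term lists are where the remaining algorithmic work (merging, ranking with priority queues, and so on) happens.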
Ryan is an Elasticsearch developer and enjoys working on anything with bits. He's an Apache Lucene/Solr committer and PMC member. Prior to Elasticsearch, he worked on Amazon's Product Search and AWS CloudSearch.
Ryan will give an introduction to how compression is used in Lucene, including recent improvements in Lucene 5.0. Modern search engines can store billions of records containing both text and structured data, but as the amount of data being searched grows, so do the requirements for disk space and memory. Various compression techniques decrease the necessary storage while still allowing fast access for search. While Lucene has always compressed its inverted index, compression techniques have improved and been generalized to other parts of the index, such as the built-in document and column-oriented data stores.
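As a flavor of the kind of technique the talk covers, here is a sketch of two classic postings-list tricks: delta-encoding sorted document IDs into small gaps, then packing each gap into a variable-length byte encoding. This follows one common varint convention (7 payload bits per byte, high bit set on every byte except the last); Lucene's actual VInt and packed-integer formats differ in detail.

```python
def delta_encode(postings):
    # Sorted doc-ID lists compress well as gaps: the gaps are small
    # numbers, and small numbers need fewer bytes.
    return [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]

def varint_encode(n):
    # Emit 7 bits per byte, setting the high bit on all but the last byte.
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)

postings = [1000, 1003, 1007, 1100]
gaps = delta_encode(postings)                        # [1000, 3, 4, 93]
encoded = b"".join(varint_encode(g) for g in gaps)
print(len(encoded))  # → 5 bytes, versus 16 for four raw 32-bit integers
```

The same principle, trading a little decode work for much less storage, is what has since been generalized to Lucene's stored fields and column-oriented doc values.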