Meeting complex data load and data preparation challenges for search applications with Apache NiFi

06/12/2018 - 15:40 to 16:00
Moon Lounge
short talk (20 min)

Session abstract: 

In many search solutions, the data quality of the index is determined not only by the indexing procedure (e.g. tokenization, lowercasing), but also by the data load and data preparation processes that run before the data is actually pushed into the search application. Developing these processes, however, can be a great challenge. In many cases, data has to be extracted from one or more sources, transformed and enriched in different ways, and loaded into the target search application, all of which can involve a lot of complexity. Various time-consuming challenges may have to be overcome, such as dealing with performance bottlenecks, transforming different data formats, synchronizing source systems, enriching data via lookups in source systems, or taking security mechanisms within clients into account.

Meeting these challenges is particularly difficult in the context of search applications, as the most widespread tools specialized in handling such problems can only deal with structured data. Search solutions, however, mostly require dealing with semistructured (JSON, XML) or unstructured data (logs, full texts). For this reason, Apache NiFi, an open source top-level project of the Apache Software Foundation, is a very powerful option for search solutions. It provides methods to extract data from many different source systems, to process data of any format, and to load data into search applications. Users can develop dataflows via a user interface to achieve resource-efficient, fault-tolerant and highly scalable data processing.

The presentation shows how Apache NiFi can facilitate the development of search solutions. Useful features for processing and enriching semistructured and unstructured data are demonstrated. Finally, recent advances in the interaction with Apache Solr and Elasticsearch are discussed.