Real-time insight into multidimensional data is a key asset for data-driven businesses. We present the architecture of our fast and reliable streaming-only data processing pipeline, which harnesses the strengths of Kafka, Flink and Druid. This trio has proven to be a very good choice for building real-time online analytics systems.
In recent years, Apache Kafka has become the de facto standard for highly available and highly scalable messaging.
Apache Flink allows us to consume, process and produce data with minimal delay. With a streaming-only approach, the challenge is to guarantee the correctness of the data. Because Flink can read from different sources (in our case Kafka for real-time data and HDFS for historical data), we can easily reprocess data without maintaining the multiple code bases that Lambda Architectures often require.
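The single-code-base idea can be illustrated with a minimal, library-free sketch. It does not use the Flink API; the function names, the event fields (`user`, `ts`) and the two stand-in sources are assumptions made purely for illustration. The point is only that one processing function serves both the live stream and the historical replay.

```python
from typing import Dict, Iterable, Iterator

Event = Dict[str, object]


def process(events: Iterable[Event]) -> Iterator[Event]:
    """Shared processing logic: drop malformed events, normalize the rest."""
    for event in events:
        if "user" not in event or "ts" not in event:
            continue  # skip events that lack a required field
        yield {"user": str(event["user"]).lower(), "ts": int(event["ts"])}


def realtime_source() -> Iterator[Event]:
    # Hypothetical stand-in for a Kafka topic delivering live events.
    yield {"user": "Alice", "ts": 3}
    yield {"ts": 4}  # malformed: no user field


def historical_source() -> Iterator[Event]:
    # Hypothetical stand-in for archived events stored on HDFS.
    yield {"user": "BOB", "ts": 1}
    yield {"user": "Alice", "ts": 2}


# The same process() function handles both inputs, so reprocessing
# historical data needs no second (batch) implementation.
live = list(process(realtime_source()))
replayed = list(process(historical_source()))
```

In an actual pipeline the two generators would be replaced by the corresponding source connectors, while the processing logic stays unchanged.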
Druid is a datastore designed for real-time multidimensional analytics that overcomes weaknesses of alternative approaches such as relational databases and key-value stores. Its streaming ingestion works extremely well with Flink, which is able to process every event as it arrives. We have already contributed our Flink sink to the Druid project so that you can use it out of the box.