The Stream Processor as a Database: Building Online Applications Directly on Streams with Apache Flink and Apache Kafka

06/06/2016 - 12:20 to 13:00
long talk (40 min)

Session abstract: 

We present a new design pattern for data streaming applications, using Apache Flink and Apache Kafka: Building applications directly on top of the stream processor, rather than on top of key/value databases populated by data streams.

Unlike classical setups that use stream processors or libraries to pre-process/aggregate events and update a database with the results, this setup simply gives the role of the database to the stream processor (here Apache Flink): queries are routed to its workers, which answer them directly from their internal state, computed over the log of events (Apache Kafka).
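
As a rough illustration, the core of the pattern can be sketched in a few lines. This is a hypothetical, simplified model, not the actual Flink or Kafka APIs: events are hash-partitioned by key across workers, each worker folds its partition of the log into local state, and queries are routed by the same partitioner to the worker that owns the key.

```python
# Conceptual sketch (not the Flink API): state lives inside the
# stream processor's workers; updates and queries for a key are
# routed to the same worker, so no external database is needed.

class Worker:
    def __init__(self):
        self.state = {}  # worker-local key -> aggregate

    def process(self, key, value):
        self.state[key] = self.state.get(key, 0) + value

    def query(self, key):
        return self.state.get(key, 0)

class StreamProcessor:
    def __init__(self, parallelism):
        self.workers = [Worker() for _ in range(parallelism)]

    def _route(self, key):
        # the same hash partitioning is used for updates and queries
        return self.workers[hash(key) % len(self.workers)]

    def on_event(self, key, value):
        # events consumed from the log (e.g. a Kafka partition)
        self._route(key).process(key, value)

    def query(self, key):
        # answered directly from worker-local state
        return self._route(key).query(key)

processor = StreamProcessor(parallelism=4)
for key, value in [("a", 1), ("b", 2), ("a", 3)]:
    processor.on_event(key, value)
print(processor.query("a"))  # 4
```

Because updates and queries hash to the same worker, all state access is local; this locality is what eliminates the round-trips and transactions against an external store.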

This architecture pattern has many interesting implications:

  - All state updates happen locally in the stream processor. This eliminates the need for (distributed) transactions with an external database, leading to very high performance (we saw cases with >100x speedup compared to pushing the state into a database).

  - Consistency (exactly-once) is maintained across the entire pipeline: log, stream processor, and application state. That stands in contrast to today's setups, where duplicate changes may be pushed to the database because the database is typically not integrated with the stream processor's consistency mechanism.

  - Consistency and persistence are realized through distributed snapshots. Changes between snapshots are replayed from the input streams in Kafka. Because the snapshots run asynchronously, their overhead is very low.

  - Because application state is part of a streaming program, one can naturally upgrade/rollback the program and state together, replay the stream with a modified program (playing through what-if scenarios), or fork off a new variant of the state (A/B testing).

  - This architecture eases the handling of out-of-order streams and late-arriving events. The stream processor can expose early results in addition to the later, correct results.
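
The snapshot-and-replay mechanism from the third point can be sketched as follows. This is a simplified, single-worker model under assumed semantics, not Flink's actual asynchronous checkpointing: the state is periodically snapshotted together with its position in the log, and on failure the latest snapshot is restored and the events after that position are replayed, so every event is reflected in the state exactly once.

```python
# Simplified sketch of snapshot + replay recovery (single worker):
# a snapshot pairs the state with the log offset it corresponds to.

import copy

# durable event log (standing in for a Kafka partition)
log = [("a", 1), ("b", 2), ("a", 3), ("b", 4)]

def apply(state, key, value):
    state[key] = state.get(key, 0) + value

state, offset = {}, 0

# process two events, then take a snapshot with the current offset
for key, value in log[:2]:
    apply(state, key, value)
    offset += 1
snapshot = (copy.deepcopy(state), offset)

# process one more event, then "crash": in-memory state is lost
apply(state, *log[2])
state = None

# recovery: restore the snapshot, replay the log from its offset
state, offset = copy.deepcopy(snapshot[0]), snapshot[1]
for key, value in log[offset:]:
    apply(state, key, value)

print(state)  # {'a': 4, 'b': 6}
```

The event processed between the snapshot and the crash is not lost: it is simply replayed from the log, which is why the log, the processor, and the application state stay consistent without an external transaction.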

This talk covers a high-level introduction to the architecture, the techniques in Flink and Kafka that make this approach possible, and experiences and technical details from a large-scale setup.