Stream Analytics with SQL on Apache Flink

06/13/2017 - 12:20 to 13:00
long talk (40 min)

Session abstract: 

SQL is undoubtedly the most widely used language for data analytics. It is declarative, many database systems and query processors feature advanced query optimizers and highly efficient execution engines, and last but not least it is the standard that everybody knows and uses. With stream processing technology becoming mainstream a question arises: “Why isn’t SQL widely supported by open source stream processors?”. One answer is that SQL’s semantics and syntax have not been designed with the characteristics of streaming data in mind. Consequently, systems that want to provide support for SQL on data streams have to overcome a conceptual gap.

Apache Flink is a distributed stream processing system. Due to its support for event-time processing, exactly-once state semantics, and its high throughput capabilities, Flink is very well suited for streaming analytics. Since about a year, the Flink community is working on two relational APIs for unified stream and batch processing, the Table API and SQL. The Table API is a language-integrated relational API and the SQL interface is compliant with standard SQL. Both APIs are semantically compatible and share the same optimization and execution path based on Apache Calcite. A core principle of both APIs is to provide the same semantics for batch and streaming data sources, meaning that a query should compute the same result regardless whether it was executed on a static data set, such as a file, or on a data stream, like a Kafka topic.

In this talk we present the semantics of Apache Flink’s relational APIs for stream analytics. We discuss their conceptual model and showcase their usage. The central concept of these APIs are dynamic tables. We explain how streams are converted into dynamic tables and vice versa without losing information due to the stream-table duality. Relational queries on dynamic tables behave similar to materialized view definitions and produce new dynamic tables. We show how dynamic tables are converted back into changelog streams or are written as materialized views to external systems, such as Apache Kafka or Apache Cassandra, and are updated in place with low latency.