From Batch to Streaming ET(L) with Apache Apex

06/13/2017 - 12:20 to 13:00
Palais Atelier
long talk (40 min)

Session abstract: 

Stream data processing is increasingly required to meet business needs for faster actionable insight from a growing volume of information that arrives from more sources. Apache Apex is a true stream processing framework for low-latency, high-throughput, and reliable processing of complex analytics pipelines on clusters. Apex is designed for quick time-to-production and is used in production by large companies for real-time and batch processing at scale.

This session will use an Apex production use case to walk through the incremental transition from a batch pipeline with hours of latency to an end-to-end streaming architecture that processes billions of events per day to deliver real-time analytical reports. The example is representative of many similar extract-transform-load (ETL) use cases with other data sets that can use a common library of building blocks. The transform (or analytics) piece of such pipelines varies in complexity and often involves custom components specific to the business logic.

Topics include:

  • Pipeline functionality from event source through queryable state for real-time insights.
  • API for application development, and the development process.
  • Library of building blocks, including connectors for sources and sinks such as Kafka, JMS, Cassandra, HBase, and JDBC, and how they enable end-to-end exactly-once results.
  • Stateful processing with event time windowing.
  • Fault tolerance with exactly-once result semantics, checkpointing, and incremental recovery.
  • Scalability and low-latency, high-throughput processing with advanced engine features for auto-scaling, dynamic changes, and compute locality.
  • Who is using Apex in production, and the roadmap.

Following the session, attendees will have a high-level understanding of Apex and how it can be applied to use cases in their own organizations.