With the rise of TensorFlow and libraries like NumPy, Python has become a popular choice for data processing. Applications built with Python are commonly single-node applications and need to be parallelized in order to scale to large amounts of data. However, JVM-based languages are often the only choice for leveraging the power of large-scale data processing tools like Apache Flink or Apache Spark.
This talk introduces Apache Beam, an open-source data processing framework for large-scale batch and stream processing which is designed with portability in mind. Apache Beam lets you use languages like Python, Go, Java, and Scala for data processing. Even better, the resulting programs can be run on the execution engine of your choice.
We will show how easy it is to run data processing jobs on Apache Beam and provide insight into different aspects of Apache Beam's portability architecture. In particular, we will cover how Beam programs:
- execute on top of different execution engines like Apache Spark, Apache Flink, or Google Cloud Dataflow
- support multiple languages like Python, Go, and Java
With Apache Beam's portability, users avoid being locked into a single execution engine or programming language. Moreover, portability enables completely new use cases: data processing jobs that mix multiple languages, Python jobs that reuse Java IO connectors for loading and storing data, or jobs that use libraries (e.g. for machine learning) which do not exist in the job's main language.
Please join us to learn more about the future of data processing, where users are free to choose their programming language and execution engine.