Apache Spark Workshop

Join us for a full day of Apache Spark!

On November 22nd, 2018, we are holding an Apache Spark workshop for beginners, presented by Marcin Szymaniuk, at idealo Berlin.

Where: idealo GmbH, Ritterstr. 11, 10969 Berlin
(nearest U-Bahn stations: Moritzplatz and Kottbusser Tor)
When: 22 November 2018, 9 am to 5 pm; doors open at 8:30 am
Tickets: limited to 20 seats. Tickets are available now here for €395 and include a 10% discount on Trust Us tickets for Berlin Buzzwords 2019.* Each ticket also includes food and drink, kindly provided by idealo GmbH.

About the workshop

The course is designed for people with no previous Spark experience. The goal is to give an overview of the most important Spark features so that attendees gain enough knowledge to start building their first Spark applications. All hands-on exercises will be in Scala, but they are simple enough for anybody with a good command of a modern programming language. All participants should have VirtualBox installed on their laptop so they can do the hands-on exercises and fully benefit from the workshop. After an exciting day of learning, join us for an after-workshop beer to relax.

About the instructor

Marcin is a data developer and architect with experience in data infrastructure administration. He has proven expertise in real-life big data problems, which he solves on a daily basis: he has worked for companies such as Spotify and Apple, and currently consults on big data projects. The course emphasises practical aspects of Spark, including common problems and misconceptions he has encountered when helping clients, and is led by a hands-on practitioner who gained his experience solving real-life problems for many clients.

Programme overview

1. Introduction to Spark

  • What is Spark?
  • Spark vs Hadoop
  • Spark with HDFS: quick overview
  • Spark on YARN: quick overview

2. Basic building blocks in Spark

  • Introduction to Resilient Distributed Datasets
  • Spark shell
  • Overview of RDD operations
  • Hands-on exercises: Log processing using simple transformations (see the sketch below)
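
To give a flavour of this exercise, here is a minimal sketch of log processing with simple RDD transformations, assuming a Spark shell session and a hypothetical access.log input file:

    // In the Spark shell, the SparkContext is already available as `sc`.
    // `access.log` is a hypothetical input file, one request per line.
    val logs = sc.textFile("access.log")

    // Keep only error lines and pull out the request path (second field).
    val errorPaths = logs
      .filter(line => line.contains("ERROR"))
      .map(line => line.split(" ")(1))

    // Transformations are lazy; `take` is an action that triggers execution.
    errorPaths.take(10).foreach(println)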

3. Basic building blocks in Spark (continued)

  • Key-Value Pair RDDs
  • Aggregating Data with pair RDDs
  • Hands-on exercises: Word count (see the sketch below)
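
As a preview of the pair-RDD exercise, here is a minimal word count sketch (input.txt is a placeholder path):

    // Classic word count with a key-value pair RDD.
    val counts = sc.textFile("input.txt")
      .flatMap(line => line.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)   // aggregate the counts per word

    counts.take(10).foreach { case (word, n) => println(s"$word: $n") }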

4. Writing and deploying Spark applications

  • Building Spark applications
  • Submitting a Spark application to a cluster
  • Hands-on exercises: Joining RDDs (see the sketch below)
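
A packaged application is typically handed to the cluster with spark-submit (for example, spark-submit --class com.example.App --master yarn app.jar, where the class and jar names are placeholders). The join exercise itself looks roughly like this sketch:

    // Two small pair RDDs keyed by a hypothetical user id.
    val users  = sc.parallelize(Seq((1, "alice"), (2, "bob")))
    val orders = sc.parallelize(Seq((1, "book"), (1, "pen"), (2, "mug")))

    // join pairs up matching keys: (key, (left value, right value)).
    users.join(orders).collect().foreach(println)
    // e.g. (1,(alice,book)), (1,(alice,pen)), (2,(bob,mug))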

5. Spark on a cluster

  • Spark Web UI
  • RDD partitions: on HDFS, on the local filesystem, after a shuffle
  • Execution model overview: stages, tasks, executors
  • RDD persistence (see the sketch after this list)
  • Data locality
  • Fault tolerance
  • Spark config: important options
  • Logging, YARN log aggregation
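
As a small illustration of the persistence topic above, caching an RDD that is reused avoids recomputing it for every action (the file name is a placeholder):

    // Cache an RDD that several actions will reuse.
    val parsed = sc.textFile("big-input.txt")
      .map(line => line.toLowerCase)
      .cache()   // keep computed partitions in executor memory

    println(parsed.count())                              // computes and caches
    println(parsed.filter(_.contains("spark")).count())  // served from the cache

    // The number of partitions drives how many tasks run in parallel.
    println(parsed.getNumPartitions)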

6. SQL-like Spark features

  • Spark SQL
  • DataFrames
  • Datasets
  • Hands-on exercises: Spark SQL aggregations (see the sketch below)
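
To preview the SQL-flavoured APIs, here is a minimal sketch that runs the same aggregation once through the DataFrame API and once through SQL (the data is made up for illustration):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder().appName("sql-demo").getOrCreate()
    import spark.implicits._

    // A tiny in-memory dataset; in the exercises this would come from a file.
    val sales = Seq(("books", 10.0), ("books", 5.0), ("mugs", 3.0))
      .toDF("category", "price")

    // DataFrame API aggregation...
    sales.groupBy("category").agg(sum("price").as("total")).show()

    // ...and the equivalent SQL query over a temporary view.
    sales.createOrReplaceTempView("sales")
    spark.sql("SELECT category, SUM(price) AS total FROM sales GROUP BY category").show()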

7. Spark use cases overview

  • Data analysis
  • Machine learning
  • Iterative algorithms

Bonus exercises: Spark SQL aggregations, PageRank, data generation with Spark, broadcast join, skewed join problem, aggregateByKey challenge, tree-reduce.

 

*You will receive your discount code via email after purchasing your workshop ticket.

CC BY-SA 3.0 Jan Michalko / Berlin Buzzwords