New data-processing frameworks such as Spark and Flink have made writing Hadoop jobs much easier. However, as data grows, developers face new challenges: from stability issues to allocating the right amount of resources, large jobs are often hard to tune and debug, because their distributed nature and scale make them difficult to observe.
Babar, an open-source profiler developed at Criteo, was introduced to let users profile their Hadoop jobs and their resource usage with little effort and instrumentation. It helps understand CPU usage, memory consumption, and GC load across the entire application, as well as where CPU time is spent, using flame graph visualizations.
In this session, we will see how to instrument a Spark job and walk through its optimization to improve latency and stability while reducing its footprint on the cluster.
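To give a rough idea of what such instrumentation can look like, a profiling agent is typically shipped to the executors and attached through the executor JVM options. This is an illustrative sketch only: the jar name, agent list, and sampling option below are hypothetical placeholders, not Babar's exact command-line interface.

```shell
# Ship the agent jar to every executor and attach it as a -javaagent.
# (babar-agent.jar, StatsAgent, and profilingMs are placeholder names.)
spark-submit \
  --master yarn \
  --files ./babar-agent.jar \
  --conf "spark.executor.extraJavaOptions=-javaagent:./babar-agent.jar=StatsAgent[profilingMs=1000]" \
  --class com.example.MyJob \
  my-job.jar
```

The key point is that profiling happens inside each executor JVM via the agent, so the application code itself needs no changes; the logs collected from the containers are then aggregated to produce the cluster-wide metrics and flame graphs.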