Big Data technologies like distributed databases, queues, batch processors, and stream processors are fun and exciting to play with. Making them play nicely together can be challenging, and keeping it fun for engineers to continuously improve and operate them is hard. At ResearchGate, we run thousands of YARN applications every day to gain insights and to power user-facing features. Along the way, we have faced numerous integration challenges:
- integrating batch and stream processors with operational systems
- ingesting data and playing back results while controlling performance crosstalk
- rolling out new versions of synchronous, stream, and batch applications and their respective data schemas
- controlling the amount of glue and adapter code between different technologies
- modeling cross-flow dependencies while handling failures gracefully and limiting their repercussions
In this talk, we will discuss how ResearchGate has tackled those problems. We will describe our ongoing journey of identifying patterns and principles that make our big data stack integrate well. Technologies covered include MongoDB, Kafka, Hadoop (YARN), Hive (Tez), Flink Batch, and Flink Streaming.