Over the last few years, Apache Solr has evolved to encompass a wide range of new, advanced functionality, making it one of the most widely used solutions for information retrieval and analytics. With this new responsibility, deployments have grown substantially in size and now cover multiple use cases. These larger deployments ingest substantial quantities of data and perform increasingly complex queries.
At large companies, most use cases are consolidated and hosted on a single multi-tenant platform. These use cases generally span different teams, with different data volumes, throughput requirements, and service expectations. As with most complex systems where requests are generated programmatically by clients, malformed or expensive requests can harm the health of the system. This is only amplified by the fact that Solr is currently unbounded by design. As we and our users push it to its limits, it has become increasingly easy to cause catastrophic failures that cannot easily be recovered from.
These failures often lead to degraded performance and even extended downtime for systems that play increasingly critical roles in businesses. In addition, recovering from these situations almost always has a high operational cost, distracting the platform owners from building new things.
During this talk, I would like to discuss common, catastrophic vulnerabilities encountered after running Solr at scale for multiple years, and the newly introduced features that protect against them. These vulnerabilities include, but are not limited to, issues encountered when either too many fields are added to Solr, leading to out-of-memory exceptions and a stalled system, or cores grow too large due to an unexpected indexing spike. At the end of this talk, attendees will be better equipped not only to host a more stable multi-tenant search platform built on Solr, but also to apply the same ideas to almost any scalable, distributed platform.
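As one illustration of the kind of guardrail the talk covers, recent Solr releases ship circuit breakers that reject queries when the node is already under resource pressure, rather than letting load push it into an unrecoverable state. The sketch below is a hypothetical `solrconfig.xml` fragment, assuming a Solr 9.x-style memory circuit breaker; the element and parameter names here are illustrative, and the exact syntax should be checked against the Solr Reference Guide for the version in use.

```xml
<!-- Hypothetical sketch of a solrconfig.xml guardrail (Solr 9.x-style syntax).
     When JVM heap usage crosses the threshold percentage, incoming queries
     are rejected with an error instead of being allowed to stall the node. -->
<config>
  <circuitBreaker class="solr.MemoryCircuitBreaker">
    <!-- Reject queries once heap usage exceeds 75% (illustrative value;
         tune per deployment and heap size). -->
    <double name="threshold">75</double>
  </circuitBreaker>
</config>
```

The design intent is the one argued in the abstract: bounding the system explicitly, so that a single tenant's programmatic burst degrades gracefully into rejected requests instead of a cluster-wide outage with a long recovery.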