Aug. 07, 2019
Bruce Clayton
A Solr administrator can make a single-character configuration error that produces no warning but causes failures later, as the project matures. These failures are not easily diagnosed from their symptoms. One common source is misconfigured Solr “replicas.” This post examines the most common causes of replica issues and how to fix them.
A Solr “replica” is a complete copy of one index. It is a very simple thing. Every Solr collection has at least one replica.
Most SearchStax clients develop and test their projects on single-server deployments for reasons of economy. The server has one copy of the index (one replica). Production systems, however, usually have 2-3 servers, and can scale up to many servers. Each server has its own replica of the index.
In normal operation, replicas support the following behaviors:
Unfortunately, it is easy to create a Solr collection where there are fewer replicas than servers. For instance, people sometimes use the wrong replicationFactor setting when creating a collection. We have often seen three-node systems that had only one replica.
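As a minimal sketch of where this setting lives: the Solr Collections API takes `replicationFactor` as a parameter of its CREATE action. The helper below builds such a request URL with the replication factor tied to the node count, so that a three-node cluster is asked for three replicas. The helper name, host, and collection name are hypothetical, and a single-shard collection is assumed.

```python
from urllib.parse import urlencode


def build_create_url(base_url, collection, node_count, num_shards=1):
    """Build a Collections API CREATE URL that requests one replica per node.

    Assumes every node should hold a full copy of each shard, so
    replicationFactor is set equal to the number of nodes.
    """
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    params = {
        "action": "CREATE",
        "name": collection,
        "numShards": num_shards,
        "replicationFactor": node_count,
    }
    return f"{base_url.rstrip('/')}/admin/collections?{urlencode(params)}"


# Hypothetical three-node deployment; host and collection name are placeholders.
url = build_create_url("http://localhost:8983/solr", "products", node_count=3)
print(url)
```

Deriving the value from the node count, rather than typing it by hand, is one way to keep the single-character mistake described above from slipping in.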
This situation creates the following issues:
Replicas can also go into “recovery mode.” Solr administrators sometimes overload their systems by asking Solr to index too many records in a single batch. CPU usage stays at 100% for extended periods. This causes service outages as one replica after another goes into recovery mode for no visible reason.
ZooKeeper checks the status of each replica every few minutes. When this check times out due to CPU overload, ZooKeeper assumes that the replica’s server is down. It puts the replica into “recovery mode” while it plays back all recent changes to repair the replica. The replica is unavailable to the system until this process is complete.
Replica recovery places additional burdens on the node’s CPU, of course, which interferes with ZooKeeper’s attempts to monitor other replicas on the same node. In a multi-collection system (such as a Sitecore index), one replica after another goes into recovery. Cascading failures can bring down the whole cluster.
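One way to see a cascade like this as it develops is to look at the per-replica `state` field that the Collections API CLUSTERSTATUS action reports. The following sketch walks a CLUSTERSTATUS-style dictionary and lists every replica that is not `active`. The function name and the sample data are hypothetical; the sample is hand-written to mirror the shape of a CLUSTERSTATUS response, not captured from a real cluster.

```python
def replicas_not_active(cluster_status):
    """Return (collection, shard, replica, node, state) for each non-active replica."""
    problems = []
    collections = cluster_status.get("cluster", {}).get("collections", {})
    for coll_name, coll in collections.items():
        for shard_name, shard in coll.get("shards", {}).items():
            for replica_name, replica in shard.get("replicas", {}).items():
                state = replica.get("state")
                if state != "active":
                    problems.append(
                        (coll_name, shard_name, replica_name,
                         replica.get("node_name"), state)
                    )
    return problems


# Hand-written sample in the shape of a CLUSTERSTATUS response (illustrative only).
sample_status = {
    "cluster": {
        "collections": {
            "sitecore_master_index": {
                "shards": {
                    "shard1": {
                        "replicas": {
                            "core_node1": {"node_name": "node1:8983_solr", "state": "active"},
                            "core_node2": {"node_name": "node2:8983_solr", "state": "recovering"},
                            "core_node3": {"node_name": "node3:8983_solr", "state": "down"},
                        }
                    }
                }
            }
        }
    }
}

for problem in replicas_not_active(sample_status):
    print(problem)
```

Run periodically during a heavy ingestion job, a check along these lines could show replicas dropping out one after another before the whole cluster is affected.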
The immediate cure is to stop Solr and then restart it. This interrupts ingestion and gives ZooKeeper a chance to catch up. The replicas quickly come back online. To avoid this behavior, reduce the ingestion batch size so that ZooKeeper has adequate access to the CPU.
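A rough sketch of smaller-batch ingestion follows. It splits the documents into fixed-size batches, posts each batch as JSON to the collection's `/update` handler, and pauses between batches. The batch size of 500 and the one-second pause are illustrative assumptions, not tuned recommendations, and the function names are hypothetical. The sketch sends no explicit commit; it assumes commits are handled by the collection's own configuration.

```python
import json
import time
from itertools import islice
from urllib.request import Request, urlopen


def batched(docs, batch_size):
    """Yield successive lists of at most batch_size documents."""
    iterator = iter(docs)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def post_batch(update_url, batch):
    """POST one batch of documents as a JSON array to a Solr update handler."""
    request = Request(
        update_url,
        data=json.dumps(batch).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urlopen(request) as response:
        return response.status


def index_in_batches(docs, update_url, batch_size=500, pause_seconds=1.0,
                     send=post_batch, sleep=time.sleep):
    """Send docs in small batches, pausing between them; return the batch count."""
    sent = 0
    for batch in batched(docs, batch_size):
        send(update_url, batch)
        sent += 1
        sleep(pause_seconds)  # leave the CPU some idle time between batches
    return sent


# Dry run with a stand-in sender, so nothing is posted to a real server.
batch_sizes = []
count = index_in_batches(
    [{"id": str(i)} for i in range(5)],
    "http://localhost:8983/solr/products/update",  # placeholder URL
    batch_size=2,
    send=lambda url, batch: batch_sizes.append(len(batch)),
    sleep=lambda seconds: None,
)
print(count, batch_sizes)
```

The right batch size and pause depend on document size and node capacity, so they are best found by watching CPU levels while adjusting them.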
Best practice: For each Solr collection, a replica should be present on every node of the cluster.
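This best practice can be checked mechanically. The sketch below compares the `live_nodes` list from a CLUSTERSTATUS-style dictionary against the nodes that host each collection's replicas, and reports the nodes that have none. The function name and sample data are hypothetical, and single-shard collections are assumed, as in the deployments described in this post.

```python
def nodes_missing_replicas(cluster_status):
    """Map each collection to the live nodes that hold none of its replicas."""
    cluster = cluster_status.get("cluster", {})
    live_nodes = set(cluster.get("live_nodes", []))
    missing = {}
    for name, coll in cluster.get("collections", {}).items():
        hosting = {
            replica.get("node_name")
            for shard in coll.get("shards", {}).values()
            for replica in shard.get("replicas", {}).values()
        }
        uncovered = sorted(live_nodes - hosting)
        if uncovered:
            missing[name] = uncovered
    return missing


# Hand-written sample: three live nodes, but "products" has only one replica.
sample_cluster = {
    "cluster": {
        "live_nodes": ["node1:8983_solr", "node2:8983_solr", "node3:8983_solr"],
        "collections": {
            "products": {
                "shards": {"shard1": {"replicas": {
                    "core_node1": {"node_name": "node1:8983_solr", "state": "active"},
                }}}
            },
            "articles": {
                "shards": {"shard1": {"replicas": {
                    "core_node1": {"node_name": "node1:8983_solr", "state": "active"},
                    "core_node2": {"node_name": "node2:8983_solr", "state": "active"},
                    "core_node3": {"node_name": "node3:8983_solr", "state": "active"},
                }}}
            },
        },
    }
}

print(nodes_missing_replicas(sample_cluster))
```

A node reported here is a candidate for a new replica, which the Collections API can create through its ADDREPLICA action.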