SearchStax Managed Search service clients sometimes have questions about the Disk Usage graph. They wonder why one of their Solr nodes uses more disk space than the other(s).
It is normal for Solr nodes to use slightly differing amounts of disk space. For instance, during indexing, transaction logs can build up on a collection’s leader node but not on the other node(s).
Solr replication also introduces some uncertainty into disk consumption. During indexing, new records build up in the memory of a collection’s leader node until they are flushed to disk by a “commit.” This creates a new index segment file, which is then copied to the other nodes.
When documents are deleted from the index, they are flagged in the segment files, but they are not removed. Segment files can carry a significant number of “deleted” records, which occupy disk space.
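As a rough illustration of how much space “deleted” records can hold, the sketch below computes the deleted share of an index from two counts that Solr reports for each core: numDocs (live documents) and maxDoc (live documents plus deleted ones that have not yet been merged away). The function name and the sample figures are hypothetical, chosen only for illustration.

```python
def deleted_fraction(num_docs: int, max_doc: int) -> float:
    """Return the share of index entries that are flagged as deleted.

    num_docs: live documents (Solr's numDocs).
    max_doc:  live plus deleted-but-not-yet-merged documents (Solr's maxDoc).
    """
    if num_docs < 0 or num_docs > max_doc:
        raise ValueError("expected 0 <= num_docs <= max_doc")
    if max_doc == 0:
        return 0.0  # empty index, nothing is deleted
    return (max_doc - num_docs) / max_doc


# Hypothetical figures for two replicas of the same collection.
replica_a = deleted_fraction(num_docs=1_000_000, max_doc=1_250_000)
replica_b = deleted_fraction(num_docs=1_000_000, max_doc=1_050_000)
print(f"replica A: {replica_a:.1%} deleted, replica B: {replica_b:.1%} deleted")
```

Both replicas hold the same live documents, yet replica A carries a larger share of flagged records and would be expected to use more disk space.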
Solr periodically merges some of the smaller segment files together to create larger segments. During merging, the “deleted” records are expunged and their disk space is freed. The logic that decides which segments to merge is a mystery to mere humans, who have devoted many web pages and blog posts to exploring the topic. Suffice it to say that two nodes do not always merge the same segments in the same order at the same time, so their disk utilization can differ.
The three replicas shown in the illustration above contain the same number of “live” documents, but have differing segment structures and different numbers of “deleted” records. Therefore, the three replicas occupy different amounts of disk space.
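The same effect can be reproduced with a toy model. The sketch below is not how Solr stores or merges segments internally. It assumes, for illustration only, that disk space is proportional to the number of stored records (live plus “deleted”), and the Segment class, the merge function, and all counts are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    live: int      # documents still visible to searches
    deleted: int   # documents flagged as deleted but still on disk

    @property
    def stored(self) -> int:
        return self.live + self.deleted


def merge(segments: List[Segment]) -> Segment:
    """Combine segments into one; the flagged records are dropped."""
    return Segment(live=sum(s.live for s in segments), deleted=0)


def live_total(replica: List[Segment]) -> int:
    return sum(s.live for s in replica)


def stored_total(replica: List[Segment]) -> int:
    return sum(s.stored for s in replica)


# Three replicas that began with identical segments but merged differently.
original = [Segment(600, 200), Segment(300, 100), Segment(100, 0)]
replica_1 = list(original)                        # no merge yet
replica_2 = [original[0], merge(original[1:])]    # merged the two small segments
replica_3 = [merge(original)]                     # merged everything

for name, replica in [("replica 1", replica_1),
                      ("replica 2", replica_2),
                      ("replica 3", replica_3)]:
    print(name, "live:", live_total(replica), "stored:", stored_total(replica))
```

In this model every replica reports 1000 live documents, while the stored record counts are 1300, 1200, and 1000.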
In conclusion, minor differences in disk utilization are normal. Major differences should still be investigated.
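One way to turn that rule of thumb into a quick check is sketched below. The article does not define where “minor” ends and “major” begins, so the 25% threshold is purely an assumption to be tuned for your deployment. The function names, node names, and usage figures are likewise hypothetical.

```python
from typing import Mapping


def disk_spread(usage_gb: Mapping[str, float]) -> float:
    """Relative gap between the fullest and emptiest node: (max - min) / max."""
    if not usage_gb:
        return 0.0
    highest = max(usage_gb.values())
    lowest = min(usage_gb.values())
    if highest <= 0:
        return 0.0
    return (highest - lowest) / highest


def needs_investigation(usage_gb: Mapping[str, float],
                        threshold: float = 0.25) -> bool:
    """True when the spread exceeds the (assumed) threshold."""
    return disk_spread(usage_gb) > threshold


# Hypothetical readings taken from a Disk Usage graph, in gigabytes.
balanced = {"node-1": 42.0, "node-2": 45.5, "node-3": 44.1}
lopsided = {"node-1": 42.0, "node-2": 90.0}

print("balanced:", needs_investigation(balanced))
print("lopsided:", needs_investigation(lopsided))
```

With these sample readings the balanced cluster falls under the threshold and the lopsided one exceeds it.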
Questions?
Do not hesitate to contact the SearchStax Support Desk.