Sitecore Solr Stop Words Made Easy

Apr. 03, 2019

Karan Jeet Singh

3 min. read

Web content-management systems, like Sitecore, give creators endless flexibility in the type of content and amount of content they can create and make available to the users. Such flexibility requires an equally flexible yet robust search capability to find the content. Sitecore can use Solr as its backbone to help users parse through all of the content and find the exact searched piece. Solr has many tricks up its sleeve when it comes to searching; one of which is to figure out what exactly to look for and what to ignore. Note that certain features can be set up in either Sitecore or Solr. In general, I recommend centralizing as many of the search parameters as possible in the Solr configuration to simplify management and optimize efficiency. (Solr is very fast at indexing and searching).

We usually search for nouns and verbs rather than for articles. This makes our searches much more relevant. Let’s take an example – “Drinking a latte at Cafe tous le jours.” When performing a search for this statement, we would enter keywords like “drinking latte” and expect the result to pop up. We would not usually enter “a” or “at” as search keywords. Solr understands that and lets us remove such words from the text during indexing. This not only makes the lookup faster, but it also reduces the size of the index. Such common words are known as stop words. We can remove them by using

StopFilterFactory.

In Solr, any document or query can be sent through a series of tokenizers and filters collectively known as analyzers. Tokenizers are used to break up the text into tokens, and filters are used to remove, change, or swap the tokens. StopFilterFactory is a filter provided by Solr that removes stop words from documents and queries.

Analyzers are described when implementing a field type in the Solr schema, like so –

<filter class=”solr.StopFilterFactory” ignoreCase=”true”
words=”lang/stopwords_en.txt” />

Let’s take our example through the index analyzer. First, the tokenizer splits the sentence into tokens:

“Drinking a latte at Cafe tous le jours”
->
[“Drinking”, “a”, “latte”, “at”, “Cafe”, “tous”, “le”, “jours”]

Then our stop-word filter removes commonly-used English words that provide little value to document searching and storing:

[“Drinking”, “a”, “latte”, “at”, “Cafe”, “tous”, “le”, “jours”]
->
[“Drinking”, “latte”, “Cafe”, “tous”, “le”, “jours”]

After this, the document is ready to be indexed. An incoming search query goes through a similar transformation, apart from one additional step where a synonym filter broadens the query before it is executed. If you have been following along and have a keen eye, then you might have noticed that the stop-words filter only removed stop words for the English language and not for French. That’s because Solr provides language-specific stop words out of the box. We used “stopwords_en.txt”, which is a list of stop words for English. Solr provides custom stop-word lists for many languages in the conf/lang/ subdirectory, some of which are:

Arabic	Bulgarian	Catalan	Czech	Danish	German
Greek	English	Spanish	Basque	Farsi	Finnish
French	Galic	Hindi	Hungarian	Armenian	Indonesian
Italian	Japanese	Latvian	Dutch	Norwegian	Portuguese
Russian	Swedish	Thai	Turkish

With the amount of content that can be created, indexed, and searched in Sitecore, search efficiency can make or break the quality of the user experience. Stop words are one of the many tools that help you improve not only the quality of search but also the efficiency of it. Have fun with it, try it out, expand the list to your liking, and watch your collections perform better than ever before.

Sitecore Solr Stop Words Made Easy

By Karan Jeet Singh

Solutions Engineer

"...If you’ve taken the time to build extensive customer journey maps, your team will get to know your customers on a deeper level and be able to personalize the journey...."

Get Our Newsletter

You might also like: