Apr. 03, 2019
Karan Jeet Singh
|
Web content-management systems, like Sitecore, give creators endless flexibility in the type of content and amount of content they can create and make available to the users. Such flexibility requires an equally flexible yet robust search capability to find the content. Sitecore can use Solr as its backbone to help users parse through all of the content and find the exact searched piece. Solr has many tricks up its sleeve when it comes to searching; one of which is to figure out what exactly to look for and what to ignore. Note that certain features can be set up in either Sitecore or Solr. In general, I recommend centralizing as many of the search parameters as possible in the Solr configuration to simplify management and optimize efficiency. (Solr is very fast at indexing and searching).
We usually search for nouns and verbs rather than for articles. This makes our searches much more relevant. Let’s take an example – “Drinking a latte at Cafe tous le jours.” When performing a search for this statement, we would enter keywords like “drinking latte” and expect the result to pop up. We would not usually enter “a” or “at” as search keywords. Solr understands that and lets us remove such words from the text during indexing. This not only makes the lookup faster, but it also reduces the size of the index. Such common words are known as stop words. We can remove them by using
StopFilterFactory
.
In Solr, any document or query can be sent through a series of tokenizers and filters collectively known as analyzers. Tokenizers are used to break up the text into tokens, and filters are used to remove, change, or swap the tokens. StopFilterFactory is a filter provided by Solr that removes stop words from documents and queries.
Analyzers are described when implementing a field type in the Solr schema, like so –
<filter class=”solr.StopFilterFactory” ignoreCase=”true” words=”lang/stopwords_en.txt” />
Let’s take our example through the index analyzer. First, the tokenizer splits the sentence into tokens:
“Drinking a latte at Cafe tous le jours” -> [“Drinking”, “a”, “latte”, “at”, “Cafe”, “tous”, “le”, “jours”]
Then our stop-word filter removes commonly-used English words that provide little value to document searching and storing:
[“Drinking”, “a”, “latte”, “at”, “Cafe”, “tous”, “le”, “jours”] -> [“Drinking”, “latte”, “Cafe”, “tous”, “le”, “jours”]
Arabic | Bulgarian | Catalan | Czech | Danish | German |
Greek | English | Spanish | Basque | Farsi | Finnish |
French | Galic | Hindi | Hungarian | Armenian | Indonesian |
Italian | Japanese | Latvian | Dutch | Norwegian | Portuguese |
Russian | Swedish | Thai | Turkish |
With the amount of content that can be created, indexed, and searched in Sitecore, search efficiency can make or break the quality of the user experience. Stop words are one of the many tools that help you improve not only the quality of search but also the efficiency of it. Have fun with it, try it out, expand the list to your liking, and watch your collections perform better than ever before.
The Stack is delivered bi-monthly with industry trends, insights, products and more
Copyrights © SearchStax Inc.2014-2024. All Rights Reserved.
SearchStax Site Search solution is engineered to give marketers the agility they need to optimize site search outcomes. Get full visibility into search analytics and make real-time changes with one click.
close
SearchStax Managed Search service automates, manages and scales hosted Solr infrastructure in public or private clouds. Free up developers for value-added tasks and reduce costs with fewer incidents.
close
close