Jul. 03, 2019

Karan Jeet Singh

Some of our SearchStax clients index websites that use multiple languages. We were recently asked how to enable Solr indexing of Mandarin on a cloud platform. (This post describes indexing Traditional Chinese characters. It is also possible to use Simplified Chinese by following a similar series of steps. Contact us at support@searchstax.com for an example.)

Solr does not parse Chinese text by default, but it ships with the appropriate tokenizers. The default configuration of the ICU Tokenizer is suitable for Traditional Chinese text. It follows the Word Break rules from the Unicode Text Segmentation algorithm for non-Chinese text, and uses a dictionary to segment Chinese words. To use this tokenizer, you must add extra JAR files to Solr’s classpath (as described below).

Step 1: Obtain Configuration Files

To add Traditional Chinese indexing to your Solr project, you need to modify your project configuration files. If you need to download the files from an existing project, see How can I view my Zookeeper Configurations?

Step 2: Add the Required Library

Update the solrconfig.xml file by adding the following lines after the existing lib declarations.

<!-- Traditional Chinese library -->
<lib dir="${solr.install.dir:../../../..}/contrib/analysis-extras/lucene-libs" regex="lucene-analyzers-icu-\d.*\.jar" />
<lib dir="${solr.install.dir:../../../..}/contrib/analysis-extras/lib" regex="icu4j-\d.*\.jar" />
<!-- Traditional Chinese library - END -->

These libraries ship with Solr, so you don’t have to alter your deployment in any way to make them work.

Step 3: Update the Schema

A. Create a new field type in the managed-schema file that uses the ICU Tokenizer.

<fieldType name="text_mandarin" class="solr.TextField">
    <analyzer>
        <tokenizer class="solr.ICUTokenizerFactory"/>
        <filter class="solr.CJKWidthFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
</fieldType>

B. Create a field that uses this field type.

<field name="text_man" type="text_mandarin" multiValued="true" indexed="true" stored="true"/>
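
Because text_man is declared multiValued, it can also serve as a catch-all that gathers Chinese text from several source fields. A minimal sketch, assuming your schema already defines source fields named title and body (hypothetical names, not part of the configuration above):

```xml
<!-- Hypothetical source fields: replace title and body with fields from your own schema -->
<copyField source="title" dest="text_man"/>
<copyField source="body" dest="text_man"/>
```

copyField copies the raw input value before analysis, so the text_mandarin analyzer chain is applied to the copied text at index time.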

Step 4: Upload Configuration and Reload Collection

Upload the altered configuration to your SearchStax cloud server and reload your collection. See How do I update the Solr Schema? for step-by-step instructions.
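
After the reload, one way to check the new field is to index a small test document. The sketch below uses Solr's XML update format and assumes that your schema's unique key is a field named id. The document ID and the sample text are made up for illustration.

```xml
<add>
  <doc>
    <!-- Assumes the schema's uniqueKey field is named "id" -->
    <field name="id">zh-test-1</field>
    <!-- text_man is multiValued, so it may appear more than once -->
    <field name="text_man">歡迎使用搜尋功能</field>
    <field name="text_man">繁體中文測試</field>
  </doc>
</add>
```

Post the document to the collection's /update handler with a commit, then query the text_man field for one of the words. Which queries match depends on how the ICU dictionary segments the text, so the Analysis screen in the Solr Admin UI, which shows the tokens produced for a given field type, is a useful companion check.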

By Karan Jeet Singh

Solutions Engineer