Crawler Walkthrough


This is an end-to-end walkthrough of indexing the SearchStax product documentation website using the SearchStax Site Search solution and its Crawler. This exercise takes about half an hour to complete.

Enterprise Clients Only!

The Crawler feature is restricted to Enterprise clients only. The following restrictions apply:

  • The feature is restricted to one crawl per day.
  • Each crawl is limited to 10,000 or 100,000 pages, depending on your contract.
  • Individual page size is limited to 100 MB (HTML) or 1 GB (rich-text).

Getting search results for your website is surprisingly easy, but there are moments when you wonder what to do next. This discussion captures those moments.

We assume that, as an Enterprise client, you have a Site Search account courtesy of your SearchStax Onboarding Manager. Log in to the Site Search interface.

The exercise begins with setting up and running the Crawler. When the crawl is complete, we walk through the Site Search features that create the first search experience for your site.

Create the Site Search App

When you Create the Site Search App, be certain to configure it as a “Custom” application.

Configure and Run the Crawler

If the crawler feature is enabled for your account, you’ll find it listed under Connectors in the Search App’s navigation menu:

This link opens the Crawler list, which is initially empty.

A Search App can have one or more crawlers depending on the terms of your contract. Each crawler can index a different website. Click Create a Crawler.

The next step is to provide your crawler with a name and a starting URL. Site Search will verify that the URL is reachable.

The crawler begins with a root URL and follows page links from there to all connected pages within the same corporate domain, subject to a configurable crawl depth.
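One way to picture this crawl scope is as a breadth-first traversal. The sketch below is illustrative only and is not the SearchStax implementation: the link graph, the crawl_scope function, and the max_depth parameter are hypothetical stand-ins for fetched pages and the configurable crawl depth, and "same corporate domain" is approximated by comparing host names.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical link graph standing in for real pages and their outgoing links.
LINKS = {
    "https://www.example.com/docs/": [
        "https://www.example.com/docs/a/",
        "https://other.example.org/",
    ],
    "https://www.example.com/docs/a/": ["https://www.example.com/docs/a/b/"],
    "https://www.example.com/docs/a/b/": ["https://www.example.com/docs/a/b/c/"],
}

def crawl_scope(root, max_depth):
    """Return URLs reachable from root within max_depth link hops on the same host."""
    host = urlparse(root).netloc
    seen = {root}
    queue = deque([(root, 0)])
    order = []
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth == max_depth:
            continue  # do not follow links beyond the configured depth
        for link in LINKS.get(url, []):
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return order

pages = crawl_scope("https://www.example.com/docs/", max_depth=2)
print(pages)
```

With a depth of 2, the page three links away from the root and the page on the other host are both left out of the crawl.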

You can Crawl Now if you wish, but we advise you to visit the list of crawler fields first. The crawler is limited to one run per day, and we need to set up a special field before launching it.

Configure the Crawler Fields

This is an optional step to demonstrate setting up a facet. The crawler imports a set of default fields from webpages (see Default Field Map for details). You will find, however, that your target website uses additional fields. Site Search lets you add these fields to the crawl.

The SearchStax website doc pages contain a Products meta tag that makes a simple facet demonstration:

<meta name="Products" content="Managed Search">

We’d like the crawler to import the value of this tag to the index.
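To check in advance what value a page carries in that tag, you can parse the HTML yourself. This is a minimal sketch using Python's standard html.parser module; the MetaTagReader class is our own helper, not part of Site Search, and it says nothing about how the crawler matches tag names internally (for example, whether matching is case-sensitive).

```python
from html.parser import HTMLParser

class MetaTagReader(HTMLParser):
    """Collect the content of <meta> tags, keyed by their name attribute."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attributes = dict(attrs)
            if "name" in attributes and "content" in attributes:
                self.meta[attributes["name"]] = attributes["content"]

# The sample page contains only the meta tag shown above.
page = '<html><head><meta name="Products" content="Managed Search"></head></html>'
reader = MetaTagReader()
reader.feed(page)
print(reader.meta["Products"])
```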

Open the Manage Fields for Search Index section of the crawler settings. You’ll see the list of default fields.

These fields could be useful in your project, and are harmless if not. Click the Add Custom Field button. The resulting dialog box is described on the Crawler page.

  • Set the Custom Field Name to products. This is the label you will see in the Site Search lists of fields.
  • Select the Meta Tag Name option and enter Products. This is the meta tag name from the target page HTML.
  • For a field destined to become a facet, the string datatype is usually the best choice.

Click Add Field. You’ll see the new field in the list, labeled products_ss (your field name plus the string datatype).

When you are satisfied with the setup, click the Crawl Now button.

Crawling

As the crawl proceeds, you’ll see progress statistics updating.

The error count represents things like incompatible file types encountered in the crawl. Site Search cannot be more specific than that.

Inspect the Document Fields

This section presents some “tips and tricks” that are helpful for inspecting the output of the crawler before you configure the Search Fields and Result Fields in Site Search.

Wait Five Minutes!

Due to search-engine configuration settings, it may take as much as five minutes for the crawl data to be committed to the index. Until this time elapses, Site Search displays and query results will look the same as they did before the crawl.
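If you would rather script the wait than watch the clock, a polling loop is one option. The sketch below is an assumption-laden illustration: wait_for_change is a hypothetical helper, read_count stands for whatever function you write to read the document count from your index, and a changed count is only a rough signal that the commit has happened (a re-crawl of an unchanged site could leave the count the same).

```python
import time

def wait_for_change(read_count, baseline, timeout=300, interval=30, sleep=time.sleep):
    """Poll read_count() until it differs from baseline or timeout seconds pass."""
    waited = 0
    while True:
        if read_count() != baseline:
            return True
        if waited >= timeout:
            return False
        sleep(interval)
        waited += interval

# Stand-in for a real index query: the count changes on the third poll.
fake_counts = iter([0, 0, 125])
changed = wait_for_change(lambda: next(fake_counts), baseline=0, sleep=lambda seconds: None)
print(changed)
```

The default timeout of 300 seconds matches the five-minute commit window described above.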

Navigate to the Site Search App > Settings > Search API tab. You’ll find the Read-Only authentication token about halfway down the screen.

You’ll need to copy the token to the clipboard and paste it into a text buffer temporarily.

Now scroll back up the screen and find the App’s Select Endpoint.

Copy the endpoint to a text buffer and make these changes to it:

  • Change emselect to select.
  • Append ?q=*:*&wt=json&indent=true at the end following select.
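The two changes above amount to a small string transformation. The sketch below shows it in Python; to_select_url is a hypothetical helper, and the endpoint value is a made-up example in the same shape as the sample command in this walkthrough, so substitute your own App's endpoint.

```python
from urllib.parse import urlencode

def to_select_url(endpoint, params):
    """Swap a trailing 'emselect' for 'select' and append the query string."""
    base = endpoint.rstrip("/")
    if base.endswith("/emselect"):
        base = base[: -len("emselect")] + "select"
    # Keep '*' and ':' unescaped so the query reads q=*:* as in the walkthrough.
    return base + "?" + urlencode(params, safe="*:")

# Hypothetical endpoint; copy the real one from the Search API tab.
endpoint = "https://searchcloud-1-us-west-2.searchstax.com/95338/doccrawler-1234/emselect"
url = to_select_url(endpoint, {"q": "*:*", "wt": "json", "indent": "true"})
print(url)
```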

Now we’ll assemble a curl command in the text buffer. Use this format, where <Select Endpoint> is the endpoint URL without its trailing /emselect segment:

curl -H "Authorization: Token <Read-Only Token>" "<Select Endpoint>/select?q=*:*&wt=json&indent=true"

You should now have a command similar to this:

curl -H "Authorization: Token 6e6a32<redacted numbers>597c5a" "https://searchcloud-1-us-west-2.searchstax.com/95338/doccrawler-1234/select?q=*:*&wt=json&indent=true"

Paste this command into a Linux Bash command window (or a PowerShell terminal on Windows) and send it. It will return ten documents from your index, showing all of the fields in use and their content. (Notice the products_ss custom field near the bottom of this list.)

      {
        "id":"https://www.searchstax.com/docs/searchstudio/analytics-glossary/",
        "exif_tenant_id":"2",
        "exif_crawlid":"2151",
        "exif_crawl_definition_id":"43",
        "exif_appid":"studio-1810",
        "url":["https://www.searchstax.com/docs/searchstudio/analytics-glossary/"],
        "paths":["docs / searchstudio / analytics-glossary"],
        "document_type":["html"],
        "date":"2024-06-24T02:36:12Z",
        "title":["Analytics Glossary - SearchStax Site Search Docs"],
        "headings1":["Analytics Glossary"],
        "headings2":["Questions?"],
        "description":["The SearchStax Site Search solution's Analytics Glossary is a summary of key terms and definitions used for analytics in Site Search."],
        "products_ss":["Site Search"],
        "content":["Analytics Glossary - SearchStax Site Search Docs Managed Search Site Search Help 

             <Most of the content was removed for clarity>

        ],
        "_version_":1802708265532915712}

If you have difficulty making this work, contact SearchStax Support for assistance.

This output will be a convenient resource in the following steps.
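If you prefer a script to the curl command, the same request can be sent from Python's standard library. This is a sketch under stated assumptions: fetch_response and field_names are our own hypothetical helpers, the response is assumed to follow the standard Solr JSON layout (a "response" object holding a "docs" list), and the sample below is an abbreviated copy of the output above so the example runs without a network call.

```python
import json
import urllib.request

def fetch_response(select_url, token):
    """Send the same request as the curl command and decode the JSON body."""
    request = urllib.request.Request(
        select_url, headers={"Authorization": f"Token {token}"}
    )
    with urllib.request.urlopen(request) as reply:
        return json.load(reply)

def field_names(solr_response):
    """Return the sorted set of field names used by the returned documents."""
    names = set()
    for doc in solr_response["response"]["docs"]:
        names.update(doc)
    return sorted(names)

# Abbreviated sample in the standard Solr response shape, based on the output above.
sample = {"response": {"numFound": 1, "docs": [{
    "id": "https://www.searchstax.com/docs/searchstudio/analytics-glossary/",
    "title": ["Analytics Glossary - SearchStax Site Search Docs"],
    "products_ss": ["Site Search"],
}]}}
fields = field_names(sample)
print(fields)
```

To run it against your own index, pass the select URL and Read-Only token to fetch_response and hand the result to field_names. The resulting list is a quick checklist of the fields you can expect to see in the Site Search configuration screens.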

Configure Result Fields

We must now tell Site Search which fields to display in the results and how you want them assigned to pre-formatted locations in the Search UI App.

Configure Results first!

You must set up at least one Results field before creating a Relevance Model.

Click Results Configuration in the navigation menu. Expose the Results and Display tab.

The Results and Display tab lets us select fields from the index for display in the search results.

Reload the Schema!

After a crawler run, and after waiting five minutes for the index to commit, you should click the Reload Schema button to update the list of potential display fields.

Choose a field from the Return Field list (in the upper red box above). You can add a human-friendly Label if needed. Then map the field value to a Results Card Field, as explained on the Results Configuration page. The (+) icon at the right adds the configured field to the list of display fields (the lower red box).

For this exercise, make the following mappings:

  • Map the index’s url field to the result card’s URL field. This will make the result-items clickable, linking them to the web pages they represent.
  • Map the index’s title field to the result card’s Title field. This will put the page’s title at the top of the result summary.
  • Map headings1, headings2, and headings3 to “No mapping” on the results card. This lists the field values below the result-item’s title.
  • Map the products_ss field to the result card’s Ribbon field. This will display the product name as a banner above the result-item.

When finished, click the Publish button.
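As a mental model of what these mappings do, the sketch below turns one crawled document into a result card. It is purely illustrative: CARD_MAPPING, UNMAPPED_FIELDS, and build_card are hypothetical names, the slot labels are taken from the mappings listed above, and the custom field is referenced by its stored name, products_ss.

```python
# Hypothetical mapping that mirrors the mappings above: index field -> card slot.
CARD_MAPPING = {"url": "URL", "title": "Title", "products_ss": "Ribbon"}
UNMAPPED_FIELDS = ["headings1", "headings2", "headings3"]

def first_value(doc, field):
    """Crawled fields are lists; return the first value, or None if absent."""
    values = doc.get(field) or [None]
    return values[0]

def build_card(doc):
    """Place mapped fields in their card slots and list unmapped ones as extras."""
    card = {slot: first_value(doc, field) for field, slot in CARD_MAPPING.items()}
    # Fields set to "No mapping" are shown as extra lines under the title.
    card["extra"] = [value for field in UNMAPPED_FIELDS for value in doc.get(field, [])]
    return card

doc = {
    "url": ["https://www.searchstax.com/docs/searchstudio/analytics-glossary/"],
    "title": ["Analytics Glossary - SearchStax Site Search Docs"],
    "headings1": ["Analytics Glossary"],
    "headings2": ["Questions?"],
    "products_ss": ["Site Search"],
}
card = build_card(doc)
print(card)
```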

Configure Search Fields

At this point, the webpage data is in the index, but we can’t search it yet. First we have to choose which fields are searchable. For that step, we need to create a Relevance Model.

Click Relevance Modeling in the Navigation Menu, followed by Create a Model.

Give the Relevance Model a name and click the Create button.

The Search Fields tab of the Relevance Model screen tells Solr which index fields to search.

Reload the Schema!

After a crawler run, and in addition to waiting five minutes for the index to commit, you should click the Reload Schema button to update the list of potential search fields.

The left column is the available fields in the schema (not necessarily present in the crawled documents). Click on a field to move it into the list of searchable fields.

In this case, click on title, description, headings1, headings2, and headings3. These fields contain the most relevant keywords of each page, making it easy to focus the search on pages with appropriate content.

To experiment with a facet list, also add the products_ss field to this list. Facets must be based on search fields.
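To see why the choice of search fields matters, consider this toy matcher. It is a deliberately simplified sketch: the matches function does a plain substring test, which is not how Solr analyzes or scores text, and the document is an abbreviated version of the sample output shown earlier.

```python
SEARCH_FIELDS = ["title", "description", "headings1", "headings2", "headings3"]

def matches(doc, term, fields=SEARCH_FIELDS):
    """Return True if term occurs in any value of the chosen search fields."""
    needle = term.lower()
    return any(
        needle in value.lower()
        for field in fields
        for value in doc.get(field, [])
    )

doc = {
    "title": ["Analytics Glossary - SearchStax Site Search Docs"],
    "headings1": ["Analytics Glossary"],
    "content": ["Analytics Glossary - SearchStax Site Search Docs Managed Search Site Search Help"],
}
in_title = matches(doc, "glossary")      # found in a search field
only_in_content = matches(doc, "help")   # present only in the unsearched content field
print(in_title, only_in_content)
```

A term that appears only in the page body is ignored, which keeps navigation text and boilerplate from matching every query.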

Then click Publish to re-issue the index. Publishing a small project like this one takes a couple of minutes.

Configure a Facet

The search results seem incomplete without at least one facet list off to the side. How do we set that up?

On the Results Configuration screen, select the Faceting tab. Full instructions for operating this screen are on the Faceting page. Check the box that enables faceting.

The fields in the red box let you select an index field to use in a facet. (If you don’t see it in the list, click that Reload Schema button again.) Select products_ss. You can add a label to be the title of the facet list. In our example, the facet options will be ranked by count.
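Ranking facet options by count simply means tallying how many documents carry each value. The sketch below does that tally locally with hypothetical sample documents; facet_counts is our own helper, not a Site Search function.

```python
from collections import Counter

# Hypothetical documents; products_ss holds a list of values, as in the index.
docs = [
    {"id": "page-1", "products_ss": ["Site Search"]},
    {"id": "page-2", "products_ss": ["Managed Search"]},
    {"id": "page-3", "products_ss": ["Site Search"]},
    {"id": "page-4"},  # a page without the Products meta tag
]

def facet_counts(documents, field):
    """Count documents per field value, most frequent value first."""
    counts = Counter()
    for doc in documents:
        counts.update(doc.get(field, []))
    return counts.most_common()

ranked = facet_counts(docs, "products_ss")
print(ranked)
```

Solr itself exposes this kind of tally through its standard facet=true and facet.field query parameters. Pages that lack the meta tag contribute to no facet option.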

Don’t forget to Publish.

View Search Results

Now we can view our search results. The Search Preview screen lets us search the index and inspect our search results and facet list. Click Search Preview in the Navigation Menu.

This is a fully functional search environment. Note that our products_ss facet is present, along with ten documents from the index. The requested display fields are present, plus id and Elevated fields to assist with debugging.

From here, you can go back to the previous steps to tune your search. Just remember to Publish your changes before leaving each of the editing screens. You can then return to the Search Preview screen to inspect your changes.

Share Search Results

The Search Preview screen satisfies the developer’s need to view search behavior and result values, but it has one drawback. To see it, you have to be an authorized Site Search user. Experience has taught us that a search project often has many more stakeholders than developers. The project will need a public search portal for stakeholders.

The Search UI App is a shareable search page for colleagues who are not Site Search users. Click the Search UI Kit item in the Navigation Menu, and then select the Search UI App tab.

This screen provides a URL to a shareable search environment. You can View the page immediately, or you can use the Copy icon on the right to share the URL with coworkers.

Use the Regenerate button to refresh the Search UI App after making changes.

Questions?

Do not hesitate to contact the SearchStax Support Desk.