Crawler


The SearchStax Site Search solution offers a Crawler add-on for Enterprise clients. The Crawler indexes the pages of your website, starting from a single root URL. See Crawler Walkthrough for the full procedure.

Enterprise Clients Only!

The Crawler feature is restricted to Enterprise clients only. The following restrictions apply:

  • The feature is restricted to one crawl per day.
  • Crawls are limited to 10,000 pages or 100,000 pages per crawl, depending on your contract.
  • Individual page size is limited to 100 MB (HTML) or 1 GB (rich-text).

What is the Crawler?

The SearchStax Site Search solution’s Crawler is an add-on connector that finds and indexes all of the pages of a website, making them searchable through a Search App.

The Crawler begins with a root URL and follows page links from there to all connected pages within the same domain, subject to a configurable crawl-depth limit.

Each Search App can have multiple crawlers, putting multiple websites into a single combined index.

The crawlers automatically refresh the index at a predetermined interval, updating, adding, and deleting index entries as needed.

Configure the Crawler

First, create a Search App!

You cannot enable the Crawler until you have created a SearchStax Search App.

If the Crawler feature is enabled for your Enterprise account, you’ll find it listed under Connectors in the Search App’s navigation menu.

This link opens the Crawler list, which is initially empty.

Crawler List

A Search App can have one or more Crawlers, each indexing pages from a different website.

Crawler Limits

An Enterprise account may be authorized to create a certain number of concurrent crawlers. This limit is applied to the account, not to individual Site Search Apps. The progress bar on this screen shows the number of crawlers and the account limit.

From this list, you can monitor crawler status, open an editor to create or modify a crawler, kick off an immediate crawl, or delete a crawler.

When you rerun a crawler, it updates records of existing pages and deletes the records of pages that are no longer reachable. When you delete a crawler, the web pages it found are removed from the index.

To initiate a crawl, check the crawler in the list and use the Crawl Now button.

To view a crawler’s details, settings, and history, click the desired crawler in the list.

Settings Tab

Clicking on a crawler in the Crawler List takes you to the Crawler Details screen. Select the Settings tab.

Crawler Name

Each crawler in your SearchStax account must have a unique name. Names can be multi-word, mixed-case, and alphanumeric. Site Search ignores case when checking for duplicate names.

Start URL

The crawler requires a starting or “seed” web page as the anchor of the crawling process. The crawl follows all the outgoing links from that page recursively until it runs out of pages that have the same DNS domain as the starting page. The crawler will not wander into other domains. If you want to include pages from another domain in the same index, create a second crawler. Your Search App can support more than one, subject to the terms of your contract.
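
To make the same-domain rule concrete, here is a minimal sketch (plain Python, not SearchStax code; the function name and the use of urllib.parse are assumptions for illustration) of the kind of test a crawler applies before following a link:

    from urllib.parse import urlparse

    def same_domain(start_url: str, link: str) -> bool:
        # Follow a link only if it has the same DNS domain as the Start URL.
        return urlparse(link).hostname == urlparse(start_url).hostname

    # Links to other domains are skipped; indexing them would require a second crawler.
    print(same_domain("https://www.example.com/", "https://www.example.com/about"))  # True
    print(same_domain("https://www.example.com/", "https://docs.other.org/guide"))   # False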

Crawl Depth

The “crawl depth” is the maximum number of link hops the crawler follows from the Start URL. It has three defaults, depending on the starting page:

  • If the starting page is a sitemap of pages, default crawl depth = 1.
  • If the starting page is a sitemap of sitemaps, default crawl depth = 2.
  • Otherwise, the default crawl depth is “0” meaning “unlimited.”

You can manually select a crawl depth of 1 through 10 to customize your crawl.
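
As a rough illustration of how a crawl-depth limit bounds the crawl (not the Crawler’s actual implementation), the sketch below does a breadth-first walk from the Start URL and treats a depth of 0 as unlimited; get_links is a hypothetical link extractor supplied by the caller:

    from collections import deque

    def crawl(start_url, max_depth, get_links):
        # Breadth-first walk from the Start URL. A max_depth of 0 means
        # "unlimited"; otherwise links more than max_depth hops away are skipped.
        seen, queue = {start_url}, deque([(start_url, 0)])
        while queue:
            url, depth = queue.popleft()
            yield url
            if max_depth and depth >= max_depth:
                continue  # depth limit reached on this branch
            for link in get_links(url):  # hypothetical link extractor
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))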

Schedule

When enabled, the crawler will repeat its crawl daily at the indicated local time. Subsequent crawls add newly found pages, delete pages that can no longer be found, and refresh the remainder.

Subject to your contract, Site Search will impose a limit of one crawl per day.

Manage Fields for Search Index

The crawler maps information about a web page to Solr schema fields in the Site Search index. Although the crawler has a default set of mappings, some customization is usually needed. The Fields table lets you edit and refine your field mappings.

For a discussion of the default field mappings, see Crawler Default Fields.

You can delete a field using the in-line trash-can icon in the rightmost column.

To add a new custom field, click the Add Custom Field button. This opens a field editing form.

Notes on the field options:

  • The Custom Field Name will be modified by Site Search to indicate the field type and language. For instance, field “paragraph” will become “paragraph-txt-en” in the list of fields.
  • Meta Tag Name retrieves the content of a named meta tag in the web page. The default field list already includes the “description” and “keywords” meta tags.
  • XPath uses an XPath expression to scrape the content of HTML tags in the page. For instance, “//p/text()” retrieves the text content of all paragraph (p) elements.
  • CSS lets you enter a CSS class selector. The crawler will retrieve the content of all HTML elements that match the selector. For instance, “class~=name” matches any element whose class attribute contains the word “name” as a separate word within a space-separated list of words. (Both selector types are illustrated in the sketch after this list.)
  • System offers a droplist of internal Site Search fields about a web page, such as id, title, url, and document_type. Most of these are predefined default fields.
  • Field Type is a droplist of Solr schema field types: Boolean, Date, Float, Integer, String, and Text. This has implications for how the data is indexed and queried. For instance, a “String” field requires an exact whole-string match, but a “Text” field is tokenized to index individual words.
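
To show what the XPath and CSS options extract, here is a small sketch using lxml (an assumption for illustration; it is not necessarily the parser the Crawler uses, and the cssselect package is required for the CSS example):

    from lxml import html

    page = html.fromstring("""
        <html><body>
          <p class="intro name">Welcome to our site.</p>
          <p>Second paragraph.</p>
        </body></html>""")

    # XPath mapping: //p/text() collects the text of every <p> element.
    print(page.xpath("//p/text()"))
    # ['Welcome to our site.', 'Second paragraph.']

    # CSS mapping: [class~=name] matches elements whose class attribute contains
    # the word "name" in its space-separated list (requires the cssselect package).
    print([el.text for el in page.cssselect("[class~=name]")])
    # ['Welcome to our site.']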

Facet fields

Note that the “text” field type does not work well with facet lists: text fields are tokenized, so each word becomes its own facet value. Try the “string” field type instead, which facets on the whole value.
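
The difference is easy to see with a toy example (plain Python, just to illustrate the tokenization effect): a string-typed field facets on whole values, while a text-typed field is split into words first.

    from collections import Counter

    values = ["Press Release", "Press Release", "Blog Post"]

    # String field: each whole value is one facet bucket.
    print(Counter(values))
    # Counter({'Press Release': 2, 'Blog Post': 1})

    # Text field: tokenization splits values into words, so the facet list
    # shows individual words instead of the original labels.
    print(Counter(word.lower() for value in values for word in value.split()))
    # Counter({'press': 2, 'release': 2, 'blog': 1, 'post': 1})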

Exclusions

After your initial crawl, experience may show that the Crawler needs to be more limited in its scope. Exclusions are rules that prevent the crawler from exploring every branch of your domain.

  • Exclusion URLs: Enter part or all of a URL (or a regex pattern) as the basis of an exclusion rule. Site Search will interpret it according to one of the following contexts (illustrated in the sketch after this list):
    • Beginning with: Excludes any page with a URL that begins with this string.
    • Contains: Excludes any page containing the indicated substring.
    • Ending with: Excludes any page where the URL ends with this string.
    • Matching regex: Excludes any page where the URL matches the indicated regular expression.
  • Additional controls:
    • Plus (+) icon: Click here to add the exclusion to the list of active exclusions.
    • Save Changes button: Click to persist the changes you have made on this screen.
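
As a rough sketch of how those contexts behave (plain Python, not the Crawler’s implementation; the function and example URLs are hypothetical), note that every comparison is case-sensitive, as the note below points out:

    import re

    def excluded(url: str, rule: str, context: str) -> bool:
        # Illustrative interpretation of the four exclusion contexts.
        # All comparisons are case-sensitive.
        if context == "Beginning with":
            return url.startswith(rule)
        if context == "Contains":
            return rule in url
        if context == "Ending with":
            return url.endswith(rule)
        if context == "Matching regex":
            return re.search(rule, url) is not None
        raise ValueError(f"unknown context: {context}")

    print(excluded("https://www.example.com/archive/2019", "/archive/", "Contains"))    # True
    print(excluded("https://www.example.com/Archive/2019", "/archive/", "Contains"))    # False: case matters
    print(excluded("https://www.example.com/report.pdf", r"\.pdf$", "Matching regex"))  # True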

Exclusion doesn’t work?

The exclusion URLs are case-sensitive. You might need multiple rules to cover variations in capitalization.

To delete an exclusion, check the box on the left of the exclusion and click the trashcan icon.

History Tab

The History tab presents summary statistics of crawler runs.

This table shows the history of previous crawls.

Not all of the discovered links can be crawled successfully, usually because of unsupported file types. The Items Indexed and URL Crawled columns give a general idea of how successful each crawl was.

Questions?

Do not hesitate to contact the SearchStax Support Desk.