Crawler
The SearchStax Site Search solution offers a Crawler add-on for Enterprise clients. The Crawler indexes the pages of your website starting with a single root node. See Crawler Walkthrough for the full procedure.
The Crawler feature is restricted to Enterprise clients only. Additional restrictions, such as the number of crawlers per account and the crawl frequency, depend on the terms of your contract.
The SearchStax Site Search solution’s Crawler is an add-on connector that finds and indexes all of the pages of a website, making them searchable through a Search App.
The Crawler begins with a root URL and follows page links from there to all connected pages using the same corporate domain, subject to a configurable crawl-depth limitation.
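To make the crawl-depth and same-domain behavior concrete, here is a minimal, hypothetical sketch of a breadth-first crawl in Python. It is not the SearchStax implementation; the function names, parameters, and helper class are illustrative assumptions only.

```python
# Illustrative sketch: a breadth-first crawl that stays inside the seed's
# DNS domain and stops at a configurable depth, mirroring how the Start URL
# and crawl-depth settings bound a crawl. Not the SearchStax implementation.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects href values from <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url: str, max_depth: int = 3):
    """Yield (url, html) for pages reachable from start_url within max_depth
    link hops, skipping any URL whose host differs from the seed's host."""
    seed_host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", errors="ignore")
        except OSError:
            continue  # unreachable or unreadable resource; skip it
        yield url, html
        if depth >= max_depth:
            continue
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)
            # stay inside the seed's domain, as the Crawler does
            if urlparse(link).netloc == seed_host and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))


# Usage (hypothetical): index each page found within two hops of the seed.
# for url, page_html in crawl("https://www.example.com", max_depth=2):
#     index_page(url, page_html)   # index_page is a placeholder, not an API
```

In this sketch, a depth of 1 would cover the Start URL plus the pages it links to directly; larger depths reach pages further from the seed.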
Each Search App can have multiple crawlers, putting multiple websites into a single combined index.
The crawlers automatically refresh the index at a predetermined interval, updating, adding, and deleting index entries as needed.
You cannot enable the Crawler until you have created a SearchStax Search App.
If the Crawler feature is enabled for your Enterprise account, you’ll find it listed under Connectors in the Search App’s navigation menu.
This link opens the Crawler list, which is initially empty.
A Search App can have one or more Crawlers, each indexing pages from a different website.
An Enterprise account may be authorized to create a certain number of concurrent crawlers. This limit is applied to the account, not to individual Site Search Apps. The progress bar on this screen shows the number of crawlers and the account limit.
From this list, you can monitor crawler status, open an editor to create or modify a crawler, kick off an immediate crawl, or delete a crawler.
When you rerun a crawler, it updates records of existing pages and deletes the records of pages that are no longer reachable. When you delete a crawler, the web pages it found are removed from the index.
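The bookkeeping behind a re-crawl can be pictured as a simple comparison between the URLs already in the index and the URLs found by the new run. The sketch below is hypothetical and only illustrates the add/refresh/delete logic described above; the function name is not part of the product.

```python
# Hypothetical sketch of the add/update/delete bookkeeping a re-crawl implies:
# compare the URLs found this run with the URLs already in the index, then
# add, refresh, or remove records accordingly.
def plan_refresh(previously_indexed: set[str], found_this_run: set[str]):
    """Return the URLs to add, to refresh, and to delete for this crawl."""
    to_add = found_this_run - previously_indexed      # newly discovered pages
    to_refresh = found_this_run & previously_indexed  # still reachable; re-index
    to_delete = previously_indexed - found_this_run   # no longer reachable
    return to_add, to_refresh, to_delete
```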
To initiate a crawl, check the crawler in the list and use the Crawl Now button.
To view a crawler’s details, settings, and history, click the desired crawler in the list.
Clicking on a crawler in the Crawler List takes you to the Crawler Details screen. Select the Settings tab.
Each crawler in your SearchStax account must have a unique name. Names can be multi-word, mixed-case, and alphanumeric. Site Search ignores case when checking for duplicate names.
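Because the check ignores case, names that differ only in capitalization count as duplicates. The small helper below is a hypothetical illustration of that rule, not the Site Search API.

```python
# Hypothetical illustration: "News Crawler" and "news crawler" would collide
# because the uniqueness check ignores case.
def is_duplicate_name(new_name: str, existing_names: list[str]) -> bool:
    return new_name.strip().lower() in {n.strip().lower() for n in existing_names}

print(is_duplicate_name("News Crawler", ["news crawler", "Blog Crawler"]))  # True
```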
The crawler requires a starting or “seed” web page as the anchor of the crawling process. The crawl follows all the outgoing links from that page recursively until it runs out of pages that have the same DNS domain as the starting page. The crawler will not wander into other domains. If you want to include pages from another domain in the same index, create a second crawler. Your Search App can support more than one, subject to the terms of your contract.
The “crawl depth” is the number of link hops the crawler follows from the Start URL. It has three default values, depending on the starting page.
You can manually select a crawl depth of 1 through 10 to customize your crawl.
When enabled, the crawler repeats its crawl daily at the indicated local time. Subsequent crawls add newly found pages, delete pages that can no longer be found, and refresh the remainder.
Subject to your contract, Site Search will impose a limit of one crawl per day.
The crawler maps information about a web page to Solr schema fields in the Site Search index. Although the crawler has a default set of mappings, some customization is usually needed. The Fields table lets you edit and refine your field mappings.
For a discussion of the default field mappings, see Crawler Default Fields.
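As a rough illustration of what a field mapping expresses, the hypothetical entries below pair a piece of page data with a Solr field name and type. The source selectors and field names shown here are assumptions for the example, not the product's defaults; the actual defaults are documented under Crawler Default Fields.

```python
# Hypothetical examples of the kind of mapping the Fields table represents:
# each entry ties a piece of page data to a Solr field name and field type.
custom_field_mappings = [
    {"source": "meta[name=author]",   "solr_field": "author_s",       "type": "string"},
    {"source": "meta[name=category]", "solr_field": "category_s",     "type": "string"},
    {"source": "h1",                  "solr_field": "page_heading_t", "type": "text"},
]
```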
You can delete a field using the in-line trash-can icon in the rightmost column.
To add a new custom field, click the Add Custom Field button. This opens a field editing form.
Notes on the field options:
The “text” field type does not work well with facet lists. Try the “string” field type instead.
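The difference matters because a string field facets on whole values, while a text field is tokenized into individual terms. The hedged example below queries the underlying Solr select handler directly with standard facet parameters; the endpoint, collection name, and the field name category_s are assumptions for illustration, not part of the Site Search API.

```python
# Hedged illustration of faceting on a string-typed field with a standard
# Solr facet query. The URL, collection, and field name are assumptions.
from urllib.parse import urlencode
from urllib.request import urlopen
import json

params = urlencode({
    "q": "*:*",
    "rows": 0,
    "facet": "true",
    "facet.field": "category_s",  # a string field facets by whole values
    "wt": "json",
})
with urlopen(f"https://example.com/solr/mycollection/select?{params}") as resp:
    counts = json.load(resp)["facet_counts"]["facet_fields"]["category_s"]
print(counts)  # e.g. ["Press Releases", 12, "Careers", 7, ...]
```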
After your initial crawl, you may find that the Crawler’s scope needs to be narrowed. Exclusions are rules that keep the crawler out of branches of your domain that you do not want indexed.
The exclusion URLs are case-sensitive. You might need multiple rules to cover variations in capitalization.
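For instance, assuming exclusions behave as case-sensitive substring rules (an assumption made for this sketch, not a statement of the product's matching logic), covering both capitalizations of a path requires two rules:

```python
# Sketch of case-sensitive exclusion matching: because the comparison keeps
# case, "/archive/" and "/Archive/" need separate rules. Hypothetical helper.
def is_excluded(url: str, exclusion_rules: list[str]) -> bool:
    """Return True if any exclusion rule appears, case-sensitively, in the URL."""
    return any(rule in url for rule in exclusion_rules)

rules = ["/archive/", "/Archive/"]          # both spellings must be listed
print(is_excluded("https://example.com/Archive/2019", rules))  # True
print(is_excluded("https://example.com/ARCHIVE/2019", rules))  # False; not covered
```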
To delete an exclusion, check the box on the left of the exclusion and click the trashcan icon.
The History tab presents summary statistics of crawler runs.
Not all of the discovered links can be crawled successfully, usually because of inappropriate file types. The Items Indexed and URLs Crawled columns give a general idea of how successful the crawl was.
If you have questions, do not hesitate to contact the SearchStax Support Desk.