Crawler Exclusions

The SearchStax Site Search solution’s Crawler add-on explores the pages of a website beginning at a start URL. It then follows the embedded links in the pages rather than following the hierarchical structure of the website.

The crawl is constrained by three limits:

  • The Crawler will not travel outside of the DNS domain specified in the start URL. For instance, if the start URL is “”, the Crawler will confine itself to pages within “”.
  • You can set the “crawl depth” of the run. The Crawler will confine itself to pages that are no more than N links away from the start URL.
  • The Crawler has configurable Exclusions. These rules stop the Crawler from crawling any page whose URL contains a specified substring. For instance, an exclusion of “/internal/” skips every page that has “/internal/” in its URL.

Note that the exclusion rules are case-sensitive, so “/internal/” will not exclude “/Internal/”.
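The three limits can be sketched as a small bash check. This is only an illustration of the rules described above, not the Crawler's actual implementation: the function names url_host and would_crawl and the www.example.com URLs are hypothetical, and the domain limit is assumed here to be an exact host match (the real Crawler may treat subdomains differently).

```shell
#!/usr/bin/env bash

# Hypothetical helper: print the host part of a URL
# (the text between "://" and the next "/").
url_host() {
  local rest="${1#*://}"        # drop the scheme, e.g. "https://"
  printf '%s\n' "${rest%%/*}"   # keep everything before the first "/"
}

# Hypothetical check: succeed (exit 0) if a page would be crawled.
# Usage: would_crawl START_URL MAX_DEPTH URL DEPTH [EXCLUSION...]
would_crawl() {
  local start_url="$1" max_depth="$2" url="$3" depth="$4"
  shift 4

  # Limit 1: stay on the start URL's host (assumption: exact host match).
  [[ "$(url_host "$url")" == "$(url_host "$start_url")" ]] || return 1

  # Limit 2: no more than MAX_DEPTH links away from the start URL.
  (( depth <= max_depth )) || return 1

  # Limit 3: skip URLs containing any exclusion substring.
  # Quoting "$pattern" makes the match literal and case-sensitive.
  local pattern
  for pattern in "$@"; do
    if [[ "$url" == *"$pattern"* ]]; then
      return 1
    fi
  done
  return 0
}

# "/Internal/" is not excluded by "/internal/" because matching is case-sensitive.
if would_crawl "https://www.example.com/" 3 \
     "https://www.example.com/Internal/faq.html" 1 "/internal/"; then
  echo "would be crawled"
fi
```

In this sketch, adding a second exclusion of “/Internal/” to the argument list is what it would take to skip the capitalized branch as well.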

Exclusions are easy to configure, but it isn’t always obvious which branches of the namespace the Crawler has mistakenly included. One way to check is to list the URLs of the crawled pages. If your Site Search App uses security tokens:

curl -H "Authorization: Token <read-only token>" "*&wt=json&indent=true&fl=url&rows=10&start=1"

If your Site Search App uses Basic Auth credentials:

curl -u <read-only user>:<read-only password> "*&wt=json&indent=true&fl=url&rows=10&start=1"

Entered into a Linux terminal window, this /select query returns a list of URLs from the Site Search index.

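You can adjust the &rows and &start parameters to see different portions of the list. In Solr, start is a zero-based offset, so start=1 skips the first result; use start=0 to begin at the top of the list.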
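To spot branches that were crawled by mistake, it can help to count the returned URLs by their first path segment. The following is a minimal sketch, assuming the JSON response contains the page URLs as quoted http(s) strings (as the fl=url query suggests); the function name summarize_branches and the sample www.example.com data are hypothetical.

```shell
#!/usr/bin/env bash

# Hypothetical helper: read a JSON response on stdin and print one line per
# top-level branch, as "<count><TAB>/<first path segment>", largest first.
summarize_branches() {
  # Pull out every quoted http(s) URL, whether the url field is a string or a list.
  grep -oE 'https?://[^"]+' \
    | awk -F/ '
        # Fields split on "/": $1="https:", $2="", $3=host, $4=first path segment.
        { count["/" $4]++ }
        END { for (branch in count) print count[branch] "\t" branch }
      ' \
    | sort -t $'\t' -k1,1nr -k2,2
}

# Sample input standing in for a saved query response (hypothetical data).
printf '%s\n' \
  '{"response":{"docs":[' \
  '{"url":"https://www.example.com/docs/a.html"},' \
  '{"url":"https://www.example.com/docs/b.html"},' \
  '{"url":"https://www.example.com/internal/x.html"}]}}' \
  | summarize_branches
```

Piping the output of either curl command above into summarize_branches gives the same kind of summary for a live index. A branch such as /internal appearing in the list is a candidate for a new exclusion rule.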


