SearchStax Help Center


Crawler Exclusions

The Crawler add-on of the SearchStax Site Search solution explores a website beginning at a start URL. It discovers pages by following the links embedded in each page rather than by walking the hierarchical structure of the website.

The crawl is constrained by three limits:

  • The Crawler will not travel outside of the DNS domain specified in the start URL. For instance, if the start URL is “https://my.company.com/bios/”, the Crawler will confine itself to pages within “my.company.com”.
  • You can set the “crawl depth” of the run. The Crawler will confine itself to pages that are no more than N links away from the start URL.
  • The Crawler has configurable Exclusions. These rules stop the Crawler from crawling any page whose URL contains a specified substring. For instance, a rule can exclude every page whose URL contains the string “/internal/”.

Note that exclusion rules are case-sensitive, so “/internal/” will not exclude “/Internal/”.
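The matching behavior described above can be illustrated with a small shell function. This is only a sketch of case-sensitive substring matching, not the Crawler's actual implementation; the function name is_excluded and the sample URLs are hypothetical.

```shell
#!/bin/sh
# Illustrative sketch only: mimics a case-sensitive substring exclusion rule.
# is_excluded is a hypothetical helper, not part of the SearchStax Crawler.
# Usage: is_excluded URL RULE [RULE...]
# Returns 0 (true) if URL contains any RULE as a substring, 1 otherwise.
is_excluded() {
    url=$1
    shift
    for rule in "$@"; do
        case $url in
            *"$rule"*) return 0 ;;   # substring found, so the page would be skipped
        esac
    done
    return 1
}

# Example: the rule "/internal/" matches only the lowercase form of the path.
for u in "https://my.company.com/internal/plan.html" \
         "https://my.company.com/Internal/plan.html"; do
    if is_excluded "$u" "/internal/"; then
        echo "excluded: $u"
    else
        echo "crawled:  $u"
    fi
done
```

To cover both spellings in this sketch, you would pass both “/internal/” and “/Internal/” as rules.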

Exclusions are easy to configure, but it isn’t always obvious which branches of the site the Crawler has included by mistake. One way to check is to list the URLs of the crawled pages by querying the index directly. If your Site Search App uses security tokens:

curl -H "Authorization: Token <read-only token>" "https://searchcloud-1-us-west-2.searchstax.com/12345/crawler-1234/select?q=url:*&wt=json&indent=true&fl=url&rows=10&start=1"

If your Site Search App uses Basic Auth credentials:

curl -u "<read-only user>:<read-only password>" "https://searchcloud-1-us-west-2.searchstax.com/12345/crawler-1234/select?q=url:*&wt=json&indent=true&fl=url&rows=10&start=1"

Run from a Linux terminal window, this /select query returns a list of URLs from the Site Search index, similar to this excerpt:

  "response":{"numFound":368,"start":1,"numFoundExact":true,"docs":[
      {
        "url":"https://www.searchstax.com/docs/"},
      {
        "url":"https://www.searchstax.com/docs/searchstax-cloud-filing-a-support-request/"},
      {
        "url":"https://www.searchstax.com/docs/integration-overview/"},
      {
        "url":"https://www.searchstax.com/docs/searchstax-cloud-docs-home/"},

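You can adjust the rows and start parameters to see different portions of the list. Note that start is a zero-based offset, so start=1 (as in the commands above) skips the first document; use start=0 to begin at the top of the list.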
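To look through the whole list rather than one page at a time, one option is to step the start parameter in increments of rows and tally the returned URLs by their first path segment. The sketch below assumes that start behaves as a zero-based offset (as it does in Apache Solr's standard query parameters) and that indent=true puts each "url" field on its own line, as in the excerpt. The helper names page_url and summarize_branches are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: builds a /select URL for one page of results.
# $1 = endpoint up to and including "/select"
# $2 = rows (page size)
# $3 = start (zero-based offset of the first document to return)
page_url() {
    printf '%s?q=url:*&wt=json&indent=true&fl=url&rows=%s&start=%s\n' "$1" "$2" "$3"
}

# Hypothetical helper: reads indented /select output on stdin and prints
# "<count> /<first path segment>" lines, most frequent first.
# Assumes each "url":"..." field sits on its own line (indent=true).
summarize_branches() {
    sed -n 's/.*"url":"\([^"]*\)".*/\1/p' |
    sed 's|^[a-zA-Z]*://[^/]*/\{0,1\}\([^/]*\).*|/\1|' |
    awk '{ count[$0]++ } END { for (k in count) print count[k], k }' |
    sort -rn
}

# Example loop, commented out because it needs a real endpoint and token.
# 368 is the numFound value from the sample response above.
# BASE="https://searchcloud-1-us-west-2.searchstax.com/12345/crawler-1234/select"
# TOKEN="<read-only token>"
# start=0
# while [ "$start" -lt 368 ]; do
#     curl -s -H "Authorization: Token $TOKEN" "$(page_url "$BASE" 100 "$start")"
#     start=$((start + 100))
# done | summarize_branches
```

A branch that appears in the tally but should not be in the index is a candidate for a new exclusion rule.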

Questions?

Do not hesitate to contact the SearchStax Support Desk.

