The SearchStax Site Search solution’s Crawler add-on explores the pages of a website beginning at a start URL. It then follows the embedded links in the pages rather than following the hierarchical structure of the website.
The crawl is constrained by three limits:
- The Crawler will not travel outside of the DNS domain specified in the start URL. For instance, if the start URL is “https://my.company.com/bios/”, the Crawler will confine itself to pages within “my.company.com”.
- You can set the “crawl depth” of the run. The Crawler will confine itself to pages that are no more than N links away from the start URL.
- The Crawler has configurable Exclusions. These rules prevent the Crawler from crawling any page whose URL contains a specified substring. For instance, an exclusion of “/internal/” skips every page that has the string “/internal/” in its URL.
Note that the exclusion rules are case-sensitive, so “/internal/” will not exclude “/Internal/”.
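The three limits above can be illustrated with a small simulation. This is only a sketch of the rules as described here, not the Crawler's actual implementation. The function names, the in-memory SITE link map, and the example URLs are hypothetical. The sketch also assumes that "same DNS domain" means an exact host-name match and that the start URL itself is always crawled, neither of which this page spells out.

```python
from collections import deque
from urllib.parse import urlsplit


def is_allowed(url, start_url, exclusions):
    """Apply the domain limit and the exclusion rules to one URL."""
    # Limit 1: stay on the host named in the start URL (assumed exact match).
    if urlsplit(url).hostname != urlsplit(start_url).hostname:
        return False
    # Limit 3: skip any URL containing an excluded substring.
    # The "in" test is case-sensitive, like the exclusion rules.
    return not any(rule in url for rule in exclusions)


def simulate_crawl(start_url, links, max_depth, exclusions=()):
    """Breadth-first walk over links, a dict of page URL -> linked URLs."""
    seen = {start_url}
    queue = deque([(start_url, 0)])
    crawled = []
    while queue:
        url, depth = queue.popleft()
        crawled.append(url)
        # Limit 2: do not follow links from a page that is already
        # max_depth links away from the start URL.
        if depth == max_depth:
            continue
        for target in links.get(url, []):
            if target not in seen and is_allowed(target, start_url, exclusions):
                seen.add(target)
                queue.append((target, depth + 1))
    return crawled


# Hypothetical site, used only for illustration.
SITE = {
    "https://my.company.com/bios/": [
        "https://my.company.com/bios/alice/",
        "https://my.company.com/internal/payroll/",
        "https://my.company.com/Internal/handbook/",
        "https://other.example.com/partners/",
    ],
    "https://my.company.com/bios/alice/": [
        "https://my.company.com/bios/alice/talks/",
    ],
}

pages = simulate_crawl(
    "https://my.company.com/bios/", SITE, max_depth=1, exclusions=["/internal/"]
)
for page in pages:
    print(page)
```

In this run the “/internal/payroll/” page is excluded, the page on the other host is skipped, and the “/talks/” page is out of reach at depth 1. The “/Internal/handbook/” page is still crawled, because the lowercase rule does not match it.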
Exclusions are easy to configure, but it isn’t always obvious which branches of the site the Crawler has included by mistake. One way to check is to list the URLs of the crawled pages directly from the index. If your Site Search App uses security tokens:
curl -H "Authorization: Token <read-only token>" "https://searchcloud-1-us-west-2.searchstax.com/12345/crawler-1234/select?q=url:*&wt=json&indent=true&fl=url&rows=10&start=1"
If your Site Search App uses Basic Auth credentials:
curl -u "<read-only user>:<read-only password>" "https://searchcloud-1-us-west-2.searchstax.com/12345/crawler-1234/select?q=url:*&wt=json&indent=true&fl=url&rows=10&start=1"
Entered into a Linux terminal window, this /select query returns a list of URLs from the Site Search index, similar to this:
"response":{"numFound":368,"start":1,"numFoundExact":true,"docs":[
{
"url":"https://www.searchstax.com/docs/"},
{
"url":"https://www.searchstax.com/docs/searchstax-cloud-filing-a-support-request/"},
{
"url":"https://www.searchstax.com/docs/integration-overview/"},
{
"url":"https://www.searchstax.com/docs/searchstax-cloud-docs-home/"},
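You can adjust the rows and start parameters to see different portions of the list. Note that start is a zero-based offset, so start=0 begins with the first document and start=1 skips it.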
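For an index with hundreds of pages, it can be easier to step through the list in a script and tally the URLs by branch. The following is a minimal sketch using only the Python standard library. The helper names are hypothetical, the endpoint is the placeholder from the commands above, and fetch_page assumes the token style of authentication. SAMPLE is a shortened, closed-off copy of the response shown above so that the parsing steps can be tried without a network connection.

```python
import json
import urllib.request
from collections import Counter
from urllib.parse import urlencode, urlsplit

# Placeholder endpoint copied from the curl commands; substitute your own.
ENDPOINT = "https://searchcloud-1-us-west-2.searchstax.com/12345/crawler-1234/select"


def build_select_url(endpoint, rows=10, start=0):
    """Build the /select URL for one page; start is a zero-based offset."""
    params = {"q": "url:*", "wt": "json", "indent": "true",
              "fl": "url", "rows": rows, "start": start}
    # safe=":*" keeps the query readable instead of percent-encoding it.
    return endpoint + "?" + urlencode(params, safe=":*")


def page_urls(endpoint, total, rows=10):
    """Yield one /select URL per page, enough to cover total documents."""
    for start in range(0, total, rows):
        yield build_select_url(endpoint, rows=rows, start=start)


def fetch_page(url, token):
    """Fetch one page of results with a read-only token (not called below)."""
    request = urllib.request.Request(url, headers={"Authorization": "Token " + token})
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


def extract_urls(response_text):
    """Pull the url field out of each document in the JSON response."""
    docs = json.loads(response_text)["response"]["docs"]
    return [doc["url"] for doc in docs if "url" in doc]


def count_branches(urls, depth=1):
    """Count pages by their first depth path segments."""
    counts = Counter()
    for url in urls:
        segments = [s for s in urlsplit(url).path.split("/") if s]
        branch = "/" + "/".join(segments[:depth]) + "/" if segments else "/"
        counts[branch] += 1
    return counts


# Shortened copy of the sample response, closed off to make it valid JSON.
SAMPLE = """{"response":{"numFound":368,"start":1,"numFoundExact":true,"docs":[
{"url":"https://www.searchstax.com/docs/"},
{"url":"https://www.searchstax.com/docs/searchstax-cloud-filing-a-support-request/"},
{"url":"https://www.searchstax.com/docs/integration-overview/"}]}}"""

urls = extract_urls(SAMPLE)
for branch, count in count_branches(urls).most_common():
    print(branch, count)
```

Against a live index, each URL from page_urls(ENDPOINT, total=numFound, rows=100) could be passed to fetch_page and the results fed to extract_urls. A branch that should have been excluded then shows up as an unexpected entry in the tally.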
Questions?
Do not hesitate to contact the SearchStax Support Desk.