Data Ingestion for SearchStax Studio

July 11, 2024

Tom Humbarger

If you’re considering SearchStax Site Search or have recently purchased a site search solution, you’re probably thinking through data ingestion. Data ingestion is often a major consideration in getting site search up and running, and developing a sound approach to data import is critical as it directly impacts the quality and relevance of search results.

In the context of site search, data ingestion is the process of importing and loading data from one or more data sources and making it available in a structured format that can be indexed and searched by a search engine. The data may include website content, program details, research, documents and more. Data ingestion also involves repeatedly pulling in data on a real-time or regular batched basis.

The goal of data ingestion for site search is to ensure that the search engine can quickly and accurately retrieve relevant information for a user’s search query and improve the overall user experience.

Getting Data into SearchStax Site Search

With SearchStax Site Search, data ingestion means getting data into a search index so it can be accessed via a search request on a website or in a custom application. This post looks at the various ways to load data into Site Search, identifies the sources and types of data we support and provides recommendations for best practices.

There are three main ways to load data into SearchStax Site Search:

  • SearchStax Crawler
  • CMS Connectors for Sitecore and Drupal
  • Ingest APIs

SearchStax Crawler for Site Search

SearchStax Site Search Crawler (Crawler) makes it easier and faster to get started with Site Search. Crawler works with any website content management system (CMS) and digital experience platform (DXP).

Crawler can automatically discover and index content across your entire website without any modification to your code. You can get your site search up and running in minutes with customized field extraction, search configuration and relevancy tuning – all easily accessible through the Site Search dashboard.

How Does Crawler Work?

Crawler starts with an entry URL or sitemap URL. From there it will automatically discover additional pages on your site by looking for navigation, links and sitemap entries for each page. Once every page has been discovered and crawled, each page gets indexed and is ready for search discovery.
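
The discovery step described above can be illustrated with a short, generic sketch. This is not the SearchStax Crawler implementation; it only shows the general idea of starting from an entry URL and following on-page links within the same host. The function names, the fetch callable and the example.edu pages are all hypothetical, and the "site" is an in-memory dictionary so the sketch runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag, urlparse


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def discover_pages(entry_url, fetch):
    """Breadth-first discovery of same-host pages reachable from entry_url.

    fetch is a hypothetical callable: URL -> HTML string, or None if the
    page cannot be retrieved.
    """
    host = urlparse(entry_url).netloc
    seen = {entry_url}
    queue = deque([entry_url])
    discovered = []
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        discovered.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop #fragments before comparing.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if urlparse(absolute).netloc == host and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return discovered


# Tiny in-memory "site" (made-up pages) standing in for real HTTP fetches.
SITE = {
    "https://example.edu/": '<a href="/programs">Programs</a> <a href="/about#team">About</a>',
    "https://example.edu/programs": '<a href="/">Home</a> <a href="https://other.example/x">External</a>',
    "https://example.edu/about": "<p>No links here.</p>",
}

pages = discover_pages("https://example.edu/", SITE.get)
print(pages)
```

In this sketch the external link is skipped because its host differs from the entry URL, and the "#team" fragment is removed so the About page is only visited once. A real crawler would also read sitemap entries, which are left out here for brevity.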

SearchStax Site Search Crawler performs a daily crawl to keep your search results current as your content evolves over time.

CMS Connectors for Sitecore and Drupal

If you use Sitecore or Drupal as your content management system (CMS), SearchStax has integration modules that automate the data indexing process and offer the added advantage of real-time updates to your search index with every content addition, change or deletion.

Sitecore Integration

The SearchStax Site Search Connector for Sitecore is available for Sitecore versions 9 through 10.3. The Connector integrates with the Sitecore Indexing Manager and automatically indexes all Sitecore content items out-of-the-box. Additional information can be found in the Sitecore Connector product documentation.

Drupal Integration

The SearchStax Site Search Connector for Drupal automatically tracks all search results known to the Drupal Search API. Once the Drupal Connector is installed and configured, it automatically indexes any new or updated content in the Drupal environment. The module adds search functionality while requiring virtually no changes to the Drupal website. The Drupal integration was developed by Thomas Seidl (drunken monkey), the creator and maintainer of the Drupal Search API, and follows all Drupal open source code guidelines. Additional information can be found in the Drupal Connector product documentation or from the Drupal Connector module page at Drupal.org.

SearchStax Ingest APIs

The SearchStax Data Ingest API is a service that allows you to index and search structured data in your SearchStax search service. The API enables you to send data to your search service in real-time, making it immediately searchable by users. Customers can also use the SearchStax Ingest API to load documents into their Site Search application. On the Settings page, the Ingest endpoint is the /update endpoint and uses the “Read-Write” Search API credentials.

The Ingest APIs simplify the data ingestion process by enabling a customer or an implementation partner to create a small piece of code to get data from any source and push it into SearchStax Site Search. You can index individual JSON documents, multiple JSON documents or a JSON file with an array of JSON objects. You can also index XML documents by sending one or multiple tags. Additional information on using the Ingest APIs can be found in the Site Search product documentation.
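
As a rough illustration of the "small piece of code" mentioned above, the sketch below builds a request that posts a JSON array of documents to an /update endpoint. Several details are assumptions and should be checked against the Site Search product documentation: the endpoint URL is a placeholder, the token-style Authorization header is assumed rather than confirmed by this post, and the document fields (id, title, url) are hypothetical. The request is built but not sent, so the example runs without credentials.

```python
import json
import urllib.request


def build_ingest_request(update_url, token, documents):
    """Build (but do not send) a POST request carrying a JSON array of documents."""
    body = json.dumps(documents).encode("utf-8")
    return urllib.request.Request(
        update_url,
        data=body,
        headers={
            "Content-Type": "application/json",
            # Assumption: token-style header using the Read-Write credentials.
            "Authorization": f"Token {token}",
        },
        method="POST",
    )


# Placeholder values: not a real endpoint or real credentials.
UPDATE_URL = "https://search.example.com/my-app/update"
docs = [
    {"id": "program-101", "title": "Data Science BSc", "url": "https://example.edu/programs/101"},
    {"id": "program-102", "title": "Nursing BSN", "url": "https://example.edu/programs/102"},
]

request = build_ingest_request(UPDATE_URL, "READ_WRITE_TOKEN", docs)
print(request.get_method(), request.full_url)
# To actually send it: urllib.request.urlopen(request)
```

Keeping request construction separate from sending makes it easy to inspect or log the payload before anything reaches the index.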

Sources and Types of Data for SearchStax Site Search

The primary use cases for SearchStax Site Search involve adding search capabilities to any content management system, such as Sitecore, Drupal, Acquia, Adobe AEM, WordPress, Hubspot, Optimizely, Coremedia, Hannon Hill, Magnolia and Salesforce DXP, as well as to any HTML website and custom apps.

Most customers working with Sitecore or Drupal use the SearchStax Connectors for these CMS solutions. For other content sources, you will either use our Crawler or have developers use the Ingest APIs to index your content.

SearchStax Site Search can manage the following types of content: HTML web pages, PDFs, Word documents, Excel spreadsheets, PowerPoint files, text files, rich text format (RTF) files and Visio drawing files (VSD).

SearchStax Site Search enables marketers and developers to deliver powerful site search at scale. Schedule a product demo with our search experts to see how search can improve the visitor experience and how actionable insights can help you quickly optimize the search experience.

Data Ingest for Site Search FAQs

What is Site Search Crawler?

SearchStax Crawler is a web crawling tool designed to help website owners index and search the content of their websites or web applications. Crawler scans through the pages of a website, extracts the content and metadata and makes it searchable using SearchStax Site Search. Users configure the crawl settings to meet their specific needs, such as defining the fields to crawl and specifying exclusions. It supports a variety of file types, including HTML, PDF and Microsoft Office documents.

Contact SearchStax to learn more about SearchStax Crawler and pricing.

Will Crawler work on my site?

Crawler is designed for flexible crawling across different CMSs and content formats. Crawler is capable of finding and indexing content from any public-facing website. It uses on-page links and your site’s XML sitemap to find all of the pages within your domain and can also extract data from common file formats such as PDFs, PowerPoints, Excel spreadsheets, Word documents and similar rich text formats.

What is Site Search?

Site search refers to the feature on a website that allows users to search for specific content or information within that website. It typically involves a search box and search results page, which may display relevant pages, documents, products, or other content based on the user’s search query. Site search can improve user experience by helping visitors find what they’re looking for quickly and efficiently.

SearchStax Site Search is our site search solution that makes powerful search easy with best-in-class experience, actionable search insights, self-service marketing tools and quick implementation to accelerate digital transformation projects.

What is Data Ingest for Site Search?

In the context of site search, data ingestion is the process of importing and loading data from one or more data sources and making it available in a structured format that can be indexed and searched by a search engine. The data may include website content, product information, user behavior data, documents and more.

What is the SearchStax Ingest API?

The SearchStax Data Ingest API is a service that allows you to index and search structured data in your SearchStax search service. The API enables you to send data to your search service in real-time, making it immediately searchable by users.

Using the SearchStax Data Ingest API, you can create, update, and delete documents in your search index. You can also configure custom mappings to define how your data should be indexed and searched. The API supports a variety of data formats, including JSON, XML, and CSV, and you can choose to send data to your search service in batches or individually.
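
The batch and delete options mentioned above might look like the following sketch. The helper names are hypothetical, and the delete body uses the Solr-style JSON form; that the Ingest API accepts this exact form is an assumption to verify in the product documentation. The sketch only prepares request bodies and sends nothing.

```python
import json


def batches(documents, size):
    """Yield successive lists of at most `size` documents."""
    for start in range(0, len(documents), size):
        yield documents[start:start + size]


def delete_body(ids):
    """JSON body asking the index to delete documents by id.

    Assumption: Solr-style delete syntax, not confirmed by this post.
    """
    return json.dumps({"delete": list(ids)})


# Made-up documents; field names are illustrative only.
docs = [{"id": f"doc-{n}", "title": f"Page {n}"} for n in range(5)]

# One JSON array per batch, ready to be posted to the /update endpoint.
bodies = [json.dumps(batch) for batch in batches(docs, 2)]
print(len(bodies))
print(delete_body(["doc-0", "doc-3"]))
```

Sending documents in batches reduces the number of requests, while sending them individually keeps each change visible to searchers sooner; which is better depends on how often your content changes.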

By using the SearchStax Data Ingest API, you can ensure that your search service is always up-to-date with the latest data from your application. This can help to improve the relevance of search results and provide a better search experience for your users.

What is the SearchStax Drupal Connector?

SearchStax Drupal Connector is a module for the Drupal content management system that allows you to integrate your website’s search functionality with the SearchStax search engine. The SearchStax Drupal module allows you to easily configure and customize the search experience for your website users. You can use the module to create custom search forms, configure search settings, and manage search results. The module provides several advanced features such as faceted search, autocomplete, and spelling suggestions.

What is the SearchStax Sitecore Connector?

The SearchStax Sitecore Connector is a Sitecore module that Sitecore developers can install to leverage all the search capabilities offered by SearchStax Studio for customer-facing search pages. The Connector contains a Sitecore index connector that can index your Sitecore items using the out-of-the-box Indexing Manager provided by Sitecore. The SearchStax Sitecore Connector is easy to install and integrate into a Sitecore solution, and provides a user-friendly interface for configuring search options and managing search indexes. It also supports multilingual content for websites that serve a global audience.

By Tom Humbarger

Senior Marketing Programs Manager
