July 25, 2022

Thomas DiLascio

|

5 min. read

We get questions about how the SearchStax Disaster Recovery (DR) process works with Managed Solr from an operational standpoint and we dive into the details in this blog post. SearchStax offers three disaster recovery options for Solr deployments to meet a range of business requirements – Hot, Warm and Cold. You can learn more about SearchStax DR in our Solr DR options blog post.

Editor’s Note: Managed Solr is now SearchStax Cloud. It is the same great product that lets developers set up and deploy Solr infrastructure in minutes.

What is RTO and RPO for Disaster Recovery?

Before we get into how Solr Disaster Recovery works in SearchStax, here is a quick refresher on two critical DR terms, RTO and RPO. Check out our blog post on The Important R’s for a Solr Disaster Recovery Plan for more details on these important concepts.

Recovery Time Objective, or RTO, is the amount of time that your Solr service can be unavailable in case of an emergency. RTO defines the maximum acceptable time for your application’s Solr requests to fail before recovery to a healthy state occurs.

Recovery Point Objective, or RPO, defines the amount of time which your disaster recovery data can be comparatively stale versus what is stored in the Production deployment at the time of a DR incident. If your business can tolerate losing four hours of recent documents being added/updated and/or removed from Solr, then a 4 Hour RPO would be considered acceptable. 

How are Solr indexes updated on a DR deployment?

In this example, we will assume a Solr deployment with a warm DR strategy that includes a 24 Hour RPO and a 10 Minute RTO.

To meet these business objectives, we take a backup once every 24 hours and that backup includes all Solr configuration files, Solr collections, aliases, security files and custom JAR files. After a backup is taken successfully, a restoration process would then be initiated in the DR environment. This backup/restore process continues to run day in and day out to meet the 24 Hour RPO.

Does SearchStax send any notification about Solr deployment unavailability and failover mechanism being initiated?

We use a couple of notification channels for DR events. First, account users who are assigned to the Heartbeat Alerts will be notified by email when Solr nodes become unreachable. Once both Solr nodes are unreachable for 5 minutes, a DR event will be triggered. At that time, a SearchStax support ticket is created and our team will CC your specified users into that ticket to inform them that a DR event has occurred. A bridge meeting link will be included for users to join and work with our team through the incident.

How do the SearchStax DR offerings manage traffic routing?

SearchStax DR offerings rely on a vanity DNS, or CName, to route traffic from the primary cluster to the secondary environment or vice versa. When SearchStax DR is fully configured the primary cluster is assigned what’s referred to as a Vanity URL. The DR mechanism will trigger a DNS entry update behind this Vanity URL when a DR event occurs and all requests to the Vanity URL will route to the secondary environment whilst the primary cluster is unhealthy. 

A customer application can always connect to the standard Solr URL endpoint, but the Vanity URL needs to be used in order for DR capabilities to be leveraged. During the onboarding process it’s important to update the application to connect to Solr using the Vanity URL prior to going live or testing DR with your SearchStax onboarding team. 

Below is an example of how a standard Solr URL looks compared to a Vanity URL

How does SearchStax test the Solr DR process?

  • When we conduct a one-time DR test during onboarding, we try to make the test as realistic as possible. As a starting point, we verify which users need to be notified for a live DR event so our Standard Operating Procedure (SOP) can be updated appropriately.

    To keep your team informed of disaster recovery events, you should also ensure that all of your critical users are set up to receive Heartbeat Alerts and that we have a complete list of users who need to be notified for DR failure events.

    During a DR test, a simulated disaster occurs during which time the team makes the Primary Cluster unavailable. We run through a checklist on the SearchStax side, but for the client it’s good to verify from the application that RTO and RPO are being met.

    We use the following checklist to confirm a successful DR test with customers.

What steps do we take during a real Solr DR incident?

Here are the steps we take during a real disaster recovery incident:

  1. SearchStax team receives a node unavailability alert and begins working on the issue
  2. SearchStax tries to restart the Solr node and bring collections back to a healthy state
  3. If all Solr nodes in a cluster become unavailable for 5 minutes, this triggers a DR failover
  4. SearchStax receives a support ticket which notifies us that DR has triggered, and a DNS switch will be occurring
  5. DNS propagation takes 1-2 minutes after the initial 5 minutes. We aim to achieve less than a 10 minute RTO for Warm and Hot DR
  6. SearchStax next step is to notify your team via the support ticket that a DR failover has occurred and provide a meeting link for your users to join and work through the issue together
  7. SearchStax continues working to bring primary cluster Solr nodes up and wait for collections to recover
  8. If we receive responses by email or via the meeting room, we work with the customer’s team to coordinate on switching the DNS back to the primary cluster
  9. If we are receiving no responses from your teams, we move you back to primary as soon as all collections are healthy
  10. If you have purchased Platinum or Platinum Plus support plan, SearchStax provides Post Mortems as an additional service. We follow up with a Post Mortem report within 3 business days and schedule a time with your team to review what happened, answer questions and identify best next steps

Now that you have a clear explanation of the SearchStax DR options and how Disaster Recovery works, you will be able to decide for yourself which option best fits your business case and needs.

By Thomas DiLascio

Senior Product Manager

“…search should not only be for those organizations with massive search budgets...”

You might also like: