PODCASTS

Data Ingestion

Pete Navarra, VP of Onboarding at SearchStax, and Karan Jeet Singh, Sr. Solutions Architect, sit down to talk about data ingestion and how it can help you take your site search to the next level.

Jae 

Okay, so today we have Karan Singh and Pete Navarra from SearchStax, both on the solutions team, and we will be discussing the basic concepts of data ingest from the SearchStax perspective. So let me just start with asking, you know, kind of the highest level question, which is, can you just walk us through when data ingest is important for SearchStax customers? Is this just for Site Search? Do we ever have to worry about it outside of Site Search? And contextually, the reason I ask this question is more to just make sure that for anyone who’s new to SearchStax, we’re giving them that foundational context. So maybe we can start with Pete and then follow up with Karan.

Pete 

I mean, without data ingestion, we wouldn’t have search results, because there would be nothing in the index. So it is extremely important to have a process where, you know, we can get data into the index. We call that ingestion. Some people call it updating, or the update API – it goes by various names. But at the end of the day, it’s really the process for how we get information into the index that allows us to then query against it. Is it just for Site Search? Site Search relies on Solr, and when we’re talking about getting content into Solr so it can come back as search results, we’re talking about data ingestion. So is it just for Site Search? No, I think – and Karan has done a lot of work in this area – there are cases where we have to ingest directly into Solr. And so I think it’s a viable conversation for both our products, Managed Search and Site Search.

Karan 

Yeah, exactly, exactly. So just building off of Pete’s point, you won’t have any data, you won’t have any search results if you don’t ingest data. And that’s because Solr is an index, and you need data to create an index. And that’s why data becomes so important. Because the better the data you have, and the more data you have, the more control you will have over the quality of search results. You’ll have more control over which fields should be used for searching, you’ll have more control over the facets or the sorting options. So the quality of your data will directly impact the quality of the search results you get.

Jae 

That makes a ton of sense. And then let me maybe ask a tangential question to that, which is when people are new to search – maybe they’ve been working on projects that touch search, but they’re not doing it directly themselves – are there any general recommendations or guiding principles that you give people when it comes to data? Like how to think about that from a search standpoint, because, again, as you mentioned, it’s important to get right, obviously.

Karan

So my first advice is just make sure that you can surface up all the data that you are using, because a customer might have some data in one repository, some data in some other repository, some metadata sitting in some third repository. Now, that’s the first thing you need to make sure – that you can surface up all these different data points. So that they can be ingested into Site Search, and then you can provide a search on top of it. So that’s the first thing, just make sure that you can send data to us.

Jae 

Right! Anything to add to that, Pete?

Pete

I mean, yeah, you know, sometimes even the first step is understanding what data you’re going to send, and then going through the process. When we talk about data, we’re really talking about fields. And when we talk about fields, we’re talking about content. And, you know, it’s kind of taking it up a level. A lot of our customers are running these large content management systems where all they focus on is content, and that content does have fields baked into it. And so sometimes, you know, the way for us to get content is going to be a scrape. Sometimes it’s going to be a selection of fields from the backend of the CMS. So taking the time to understand your content, and to understand which portions of your page – which portions are revealed – matter most, before you really get into the process of doing your ingestion, is going to make your ingestion much easier.

Jae 

Yeah, so we touched on two things there. One is the sort of mechanism of being able to actually send data. The other one is the data sources themselves. And even the fields specifically, you know, kind of mapping it to the fields. And maybe a question I have related to that, then, is going back to what you said, Karan. Like, what do I need to have figured out beforehand, in order to know that I can send data? So what are some things that are important for me to think about at the brass-tacks level?

Karan

The first thing you need to figure out, when deciding whether the data can be sent, is whether you can connect to our system. That’s, like, the first thing – whatever system you’re using, that system can connect to us using our APIs. So once you figure that out, once that connection is there, then you can start looking into the data itself. Then you can start looking into fields – how should you populate those fields? And then how often should you push data to Site Search or SearchStax Managed Search, whatever product you’re pushing the data to? So first you check if you can connect to us, then you check if you have sorted out all the fields that you need to send to us, and then you just send data to us.
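
As a rough illustration of that order of operations – check connectivity first, then worry about fields and cadence – here is a minimal Python sketch. The endpoint URL, token, auth header, and field names are placeholders invented for this example, not the actual SearchStax API.

```python
import requests

# Hypothetical values -- substitute the ingest endpoint and credentials
# provided for your own deployment.
INGEST_URL = "https://ingest.example.com/api/documents"  # placeholder
API_TOKEN = "YOUR_TOKEN_HERE"                            # placeholder

headers = {
    "Authorization": f"Token {API_TOKEN}",
    "Content-Type": "application/json",
}

# Step 1: confirm the system you're ingesting from can reach the API at all.
resp = requests.get(INGEST_URL, headers=headers, timeout=10)
print("Connectivity check:", resp.status_code)

# Step 2: once the connection works, push a single test document whose
# fields mirror the schema you've decided on.
test_doc = {
    "id": "test-001",
    "title": "Hello from the ingest pipeline",
    "description": "A throwaway document used to validate the connection.",
    "url": "https://www.example.com/hello",
}
resp = requests.post(INGEST_URL, headers=headers, json=[test_doc], timeout=10)
resp.raise_for_status()
print("Test document accepted:", resp.status_code)
```

The third step Karan mentions – how often to push – then comes down to scheduling this same call, whether that’s on every content publish or as a nightly job.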

Jae 

Right, that makes sense. And what else do we need to have figured out beforehand? You know, we touched on a couple of things there. And maybe what I’ll ask in a minute is just, can we get an overview of Site Search itself? But, you know, what are some of the other things that I really have to have figured out at the data ingest level? Is there anything that’s not specifically related to these areas that we’ve already discussed, but is just as critical? Maybe there are processes internally, you know, ownership of the data, a lot of those kinds of things. Can you give a bit more detail about some of that? Maybe we’ll start with you, Pete, on this one.

Pete

Yeah, I mean, in terms of what you need beforehand, as I kind of said, it’s really just an understanding of where your content lives, what your content is, you know, the mechanisms, right? So most of our content is going to get ingested via crawler. So what are we going to be crawling? Are we going to be crawling just your homepage? Are we going to be crawling your sitemap XML? I would say there’s some homework ahead of time that is going to be helpful for us to understand where the data is coming from. And once we know where the data is coming from, then it’s which data. For instance, this is something that Karan sees almost on a daily basis. When we’re crawling a site, do we want the meta tags? And which meta fields from the header are we looking at? Do you want your H1 tags to stand out as different fields? Do you want to crawl your navigation? I don’t think the answer is going to be yes on that. But when you think about the different parts of any particular page on your website, having an understanding of what parts we really want to index is going to be helpful, because if you do that assessment while we’re trying to onboard you, as an example, that can become a very lengthy onboarding. Because now we’re having to spend time waiting for the client to go through and actually do that assessment. So kind of having an understanding ahead of time of, “hey, I know we’re going to be doing some new search result things and, you know, it’s really gonna be based on content from our site.” Well, let’s make a list of what content, what template types, what page types, what URL structures. Are we going just off your sitemap XML? And is your sitemap XML up to date? Is your robots.txt up to date? Actually, when we’re thinking about crawlers, that’s sometimes even more important: what do you not want us to index? The robots.txt helps a lot there. That got a little technical, but really, for me, especially as somebody who helps kind of funnel customers on this journey of onboarding Site Search – understanding where the content comes from is really the first step when we think about what they need to have figured out beforehand.
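
One concrete piece of that homework is checking what your robots.txt actually allows before anyone points a crawler at the site. Python’s standard library can do that check; the domain and sample URLs below are placeholders, and the generic “*” user agent stands in for whichever crawler you end up using.

```python
from urllib.robotparser import RobotFileParser

# Placeholder site -- swap in your own domain and a few representative URLs.
SITE = "https://www.example.edu"
SAMPLE_URLS = [
    f"{SITE}/admissions/apply",         # content you DO want indexed
    f"{SITE}/search?q=test",            # internal search results you usually don't
    f"{SITE}/staff-intranet/handbook",  # private sections you definitely don't
]

rp = RobotFileParser()
rp.set_url(f"{SITE}/robots.txt")
rp.read()  # fetches and parses the live robots.txt

for url in SAMPLE_URLS:
    allowed = rp.can_fetch("*", url)
    print(f"{'ALLOW' if allowed else 'BLOCK'}  {url}")
```

Running something like this against a handful of URL structures is a quick way to confirm the robots.txt is up to date before onboarding starts.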

Jae 

Okay, that’s really helpful, and you mentioned the customers, and that Karan sees this every day – so maybe, Karan, can we take it to you? Could you provide a couple of examples of projects, recently or even in the past, where you had to do a lot of this type of work? If you could give us an example of one that was relatively straightforward and easy, and maybe one where it was not quite so much, just to give people a sense of the range of what this kind of initial work entails?

Karan

Sure, I’ll give you two examples. One was last year, around October or November. We were onboarding a customer for whom we had to use our crawlers. And during that whole exercise, we did not go through the homework where, you know, we would ask the customer – or the customer would just look at the data and look at the fields that are being indexed, look at the meta tags that are being indexed. They just relied on the default behavior of all the different components, they just relied on our default crawler, and just assumed that it’s gonna grab everything. Because, you know, every website is different, we need to make sure that whatever system we use is catered to get the exact data that you want us to get. And two months into the onboarding, we figured out that there was no information in the description field – that is, the field that is used to describe the page. And customers use that field to show content in the search results. They were not seeing anything, because there was no data in the description tags. So that was one example. And then the other example is from last week, where we were onboarding a customer, but we went through this exercise of just going through everything – different types of pages, going through all the meta tags that are there in the pages, and then just figuring out what exactly the crawler is gonna grab for you. During that exercise, the customer realized that it’s super easy to work with our crawlers – they can just add data in the meta fields, and then our crawler is just going to grab that. And that’s exactly what they did. Within a day, they added nine more fields in the meta tags. And the day after that, our crawler grabbed all of it and pushed it into their data. So within two days, we were able to improve the quality of their data by a lot, just by manipulating the meta tags. And all of this was because of the homework where you just look at the page and see which fields you want indexed. You figure out which data you want in those fields. This is why this exercise becomes so important, because you can just get done with the data ingestion part of it within a few days, rather than stretching it out over weeks or months.
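
To make the meta-tag example concrete: a crawler typically reads name/content pairs out of the page head and maps each one to an index field, so adding a tag is often the only page change needed. Here is a rough sketch of that kind of extraction using Python’s standard-library HTML parser; the tag names (department, degree_level, and so on) are invented for illustration, not a required schema.

```python
from html.parser import HTMLParser

# A slice of a page head after the team added extra meta fields.
PAGE_HEAD = """
<head>
  <title>Master of Data Science</title>
  <meta name="description" content="A two-year graduate program in data science.">
  <meta name="department" content="Computer Science">
  <meta name="degree_level" content="Graduate">
</head>
"""

class MetaExtractor(HTMLParser):
    """Collects <meta name=... content=...> pairs the way a crawler might."""
    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attrs = dict(attrs)
            if "name" in attrs and "content" in attrs:
                self.fields[attrs["name"]] = attrs["content"]

parser = MetaExtractor()
parser.feed(PAGE_HEAD)
print(parser.fields)
# {'description': 'A two-year graduate program in data science.',
#  'department': 'Computer Science', 'degree_level': 'Graduate'}
```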

Pete 

It becomes really important for customers where, you know, it’s the balance – expectation setting. Sometimes customers aren’t prepared for the fact that they’re going to need to have this additional data in their meta tags, and that’s going to require development effort on their part. And it could be that now we’re waiting for a development sprint or development release. And now we’re in a time bucket, basically, where we’re waiting to onboard, because it has come to light for the customer – “oh, I need to have additional fields deployed, and I gotta deploy that.” So all of that kind of bubbles up into the homework and the initial stuff that you need to have. If you don’t have those fields available, it’s probably going to be a development effort on the customer’s part to get them in place.

Jae 

Wow, those are two great examples. And I hear your point there, Pete – it’s critical, it’s foundational. Maybe we can segue from that directly into the data types themselves. And this is kind of a two-part question that we can answer together. I’ll start with you, Karan, on this one. So what are the most common types of data that we support and see often? And then what are the different methods that we would use for each of those you touched on – you know, the crawler and the APIs? Maybe we can go into that in detail and list out a little more comprehensively what we do there.

Karan

So the most common type of data would be JSON objects. You can create a JSON object and then use our API to send it to us. The content of that JSON object can be anything. If you can convert a PDF into a JSON object, then you can send it to us. If you can convert an Excel file into a JSON object, then you can still use the ingest API. Or if you have a CSV, that can be uploaded using our ingest API. If you have XML, even that can be uploaded. Basically, the ingest API can work with a lot of different types of data objects. As I mentioned, it can work with JSON, XML, CSV – sometimes people just plug directly into their pipeline that might be running on Python or Java, so it can ingest Python and Java objects as well, and .NET objects, using the libraries that are available on those platforms. So it’s super flexible, it can ingest any type of data. Whatever Solr can support, it can support. As far as crawlers are concerned, crawlers can also help bring in data like PDFs, Excel files, obviously web pages, and on top of that Word docs, PowerPoints, VSDX files – no one uses those, but our crawler can support them – as well as text files and rich text files. So you do have a lot of bandwidth. You can ingest a lot of different types of data using our crawlers and ingest APIs. It’s up to your imagination.
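
As a sketch of the “anything you can turn into a JSON object” idea, here is how a CSV export might be converted into a batch of documents ready for an ingest call. The CSV content and field names are invented for the example; the resulting payload would then be posted the same way as the hypothetical test document in the earlier connectivity sketch.

```python
import csv
import json

# Placeholder CSV export -- e.g. a course catalog dumped from a database.
CSV_ROWS = """id,title,description,url
course-101,Intro to Biology,Foundations of cell biology,https://www.example.edu/bio101
course-201,Organic Chemistry,Reactions and mechanisms,https://www.example.edu/chem201
"""

# Each CSV row becomes one JSON document; the column names double as field names.
docs = list(csv.DictReader(CSV_ROWS.splitlines()))

payload = json.dumps(docs, indent=2)
print(payload)
# From here the payload would be POSTed to the ingest endpoint, exactly like
# the test document in the earlier sketch.
```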

Jae 

That’s great. I appreciate it. Pete, anything to add on, as far as the importance of that flexibility and the way the product was designed, and the intent behind it?  

Pete

One of the things that is really interesting – and it could be one of our differentiators – is that we provide native Solr. I mean, if Solr can do it, you can do it, and I think that’s really the important piece. It’s not like we’re obfuscating your access to Solr through some API that’s proprietary to us, where you have to conform to our standards. For companies out there that are doing this by hand right now, doing it through Solr, and just aren’t getting a lot of the metrics and analytics and the speed in which you can make adjustments that Site Search provides, most of whatever they’re using for their ingestion will work with our ingestion API. And so if they’ve already got that process outlined, I think it’s really easy to move to Site Search in that regard. That being said, I think it’s also worth noting that with great power comes great responsibility. You can bring your JSON objects and your JSON data in any format that you want. But if it’s not a unified format, where you have a common set of fields across your JSON objects, then it’s going to be an interesting indexing situation – if you don’t have some form and fashion to the data that you’re bringing in. At the end of the day, Solr is a very flexible engine, which makes Site Search very flexible in that regard. So, going back to some of that homework that we were talking about – as you’re uploading documents, having a common form for the objects that you’re putting into Solr is important. Things like, hey, we’re always gonna have this description field that Karan was talking about, right – we’re always gonna have a title field, we need to have a URL field. Defining the fields that we’re going to need for display purposes, especially within the result, is really crucial, because we want to make sure that all the results have that same series of fields. So that’s what I would add to the conversation: just understand that there’s a lot of power here that you get with being able to use the native Solr APIs.
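
One lightweight way to enforce that “same series of fields on every result” idea is to validate documents against a required-field list before they ever reach the index. A minimal sketch, assuming the title/description/URL fields from the conversation plus an id field (Solr’s usual unique key); the exact field set is a choice for your own schema, not a fixed product requirement.

```python
# Fields every document should carry so results render consistently.
REQUIRED_FIELDS = {"id", "title", "description", "url"}

def missing_fields(doc: dict) -> list[str]:
    """Return the required fields that are missing or empty in a document."""
    return [f for f in REQUIRED_FIELDS if not doc.get(f)]

docs = [
    {"id": "1", "title": "Apply Now", "description": "How to apply.",
     "url": "https://www.example.edu/apply"},
    {"id": "2", "title": "Campus Map", "url": "https://www.example.edu/map"},
]

for doc in docs:
    missing = missing_fields(doc)
    if missing:
        print(f"Skipping {doc['id']}: missing {missing}")
    else:
        print(f"OK to index {doc['id']}")
```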

Jae 

Awesome. As far as the methods themselves, then, maybe we can start with you on this one as well, Pete. What are some of the pros and cons associated with different forms of data? You hit upon it in some way – you’re kind of hinting at it with that “with great power comes great responsibility.” I love that. I think the question then is, when I’m a customer, is it pretty clear that I need to make a decision when it comes to the different methods I’m going to use for data ingest? Are there pros and cons? Or is it more like a pretty clear one-to-one mapping, where it’s obvious that I should do it this way in this case and a different way in a different case?

Pete

I think in most cases, you know, it is going to be pretty cut and dried. I would say that with most of our customer base, we are just dealing with crawlers. So we’re just scraping a page and going on to the next link, scraping a page and going on to the next link. But we also provide modules, depending on the content management system that you’re using. As an example, we have a Sitecore connector that works with Sitecore and does things with more of a push mentality, instead of what I would call a pull mentality. Not to go too deep on the technical side of Sitecore specifically, but Sitecore has functionality out of the box for managing a Solr index. So we tap into that by providing a native Solr index to Sitecore that’s really our Site Search index. And so as Sitecore indexes for its own purposes, it’s also indexing into the Site Search one, so it’s already creating that form and fashion and specific set of documents that we would otherwise expect to have to work out when we’re doing it with a crawler. We have the same thing on the Drupal side, where we have a module that can push content into our Site Search index, instead of relying on the crawler. But I would say, largely, it is more crawler based in Site Search. It’s really different on the Managed Search side – when you’re running just normal Solr operations, there are various methods for pushing, especially PDF content; there are other methods for being able to index and push content. I’ll let Karan talk to that a little bit, but a great example that we’re seeing right now is that a lot of customers out there are used to using the data import handler in earlier versions of Solr. In the newer versions of Solr, they’re stepping away from that for one reason or another. There are just different ways to get stuff in. Karan, what’s going on with that data import handler, by the way?

Karan

Yeah, so the community has just decided that they don’t want to maintain that anymore. So what they have done is they have converted the data import handler from a core module into a contributed module. So you can still use it if you want to. But it’s just that it’s not maintained by the same community that’s maintaining Solr – it’s maintained by another community. Now, I’m pretty sure there must be some overlap between the two, but it’s not maintained by the main community anymore. So if you still want to use it, you have to assume the risks, if there are any, and then use it. Otherwise, the Apache Solr community wants people to use the ingest API – the update endpoint that’s built into Solr. They have worked a lot on making sure that that update endpoint is versatile and can handle any kind of data. Like I mentioned earlier, they want to make sure that you can ingest data from different sources, and that it can ingest data in different formats. So they have worked on making sure that that process becomes super easy, and people can slowly start moving away from the data import handler. So that’s what’s going on there. If you’re connecting Solr and your data using a few-years-old method like the data import handler, then, I mean, you can still use it if you want to. But it would be better if you just spent some development effort and moved to the APIs.
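
For anyone replacing a data import handler setup on the Managed Search side, the move Karan describes usually lands on Solr’s standard update endpoint, which accepts a JSON array of documents. A minimal sketch; the host, collection name, and basic-auth credentials are placeholders for your own deployment.

```python
import requests

# Placeholder Solr details -- substitute your own deployment's URL and credentials.
SOLR_URL = "https://solr.example.com/solr/my_collection/update"
AUTH = ("solr_user", "solr_password")  # basic auth placeholder

docs = [
    {"id": "page-1", "title": "Financial Aid", "url": "https://www.example.edu/aid"},
    {"id": "page-2", "title": "Housing", "url": "https://www.example.edu/housing"},
]

# Solr's update handler takes a JSON array of documents; commit=true makes
# them searchable immediately (in production you'd usually rely on autocommit).
resp = requests.post(
    SOLR_URL,
    params={"commit": "true"},
    json=docs,
    auth=AUTH,
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```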

Jae 

Gotcha. That’s really helpful. And thanks for the context, Pete. I’m curious about not only the context for the data import handler, but I have a couple of other questions here. I’d love to hear an example of what you just described, Karan – you know, where somebody made the decision to do that based on some of the older ingest methodology they were using. And then an example or two of – you know, Pete mentioned PDFs – just elaborating on what he was talking about, related to the fact that in general it’s crawler based, and there are obviously connectors and pros and cons. Just a little bit of context to elaborate on some of the things he was saying, if you have any color there.

Karan 

Sure. So, the reason to move away from the data import handler was mainly security, because the ingest API is super secure. It works on port 443, it has TLS 1.3 encryption, so it’s a very secure way of ingesting data. Whereas in the data import handler, there were all different kinds of connectors – JDBC, connecting to a database, connecting to an RSS feed, and so on. All of them worked well, but they were not really secure. So that’s why the community decided to move away from all of those and just focus on the ingest API. That’s why in Site Search, we also just focus on the ingest API – even our web crawlers use the ingest APIs. We consume our own product, where we use our web crawler to get the data, and then, when pushing the data into Solr, we use the same ingest APIs that are available for everyone to use with our product. And the ingest APIs, as Pete mentioned, are super powerful – “with great power comes great responsibility” – you just have to make sure that the data you’re sending is in line with all the other data that’s there, because you can send any data to it. If you’re sending data from multiple sources, you need to make sure that data is normalized into something that is uniform across all the sources that you’re using. So data from one database and a second database needs to have similar fields, so that when you are creating a search experience, you can create a unified search experience – you don’t have, you know, two separate blobs of data sitting in there. You can use the ingest APIs to get the data from anywhere, and it’s super secure. You just need to make sure that your fields are proper, and you’re good to go.
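
A small example of the normalization Karan is describing: two sources that call the same concept different things get mapped onto one shared field set before anything is sent to the index. The source names, field names, and mapping here are made up purely for illustration.

```python
# Each source has its own field names; map them onto one shared schema
# so the index (and the search UI on top of it) sees uniform documents.
FIELD_MAPS = {
    "events_db": {"event_name": "title", "summary": "description", "link": "url"},
    "news_cms":  {"headline": "title",   "teaser":  "description", "permalink": "url"},
}

def normalize(doc: dict, source: str) -> dict:
    """Rename source-specific fields into the common schema."""
    mapping = FIELD_MAPS[source]
    out = {"source": source}
    for src_field, common_field in mapping.items():
        out[common_field] = doc.get(src_field, "")
    return out

print(normalize({"event_name": "Open House", "summary": "Tour the campus.",
                 "link": "https://www.example.edu/open-house"}, "events_db"))
print(normalize({"headline": "New Dean Announced", "teaser": "Meet the new dean.",
                 "permalink": "https://www.example.edu/news/dean"}, "news_cms"))
```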

Pete 

Yeah, it really speaks to – Jae, I know you and I had a conversation about the difference between, you know, federated search versus unified search. And that kind of speaks to the whole idea that this is a unified index we’re working with. Even though you might have federated tendencies, where you’ve got search results coming from five different external websites and different systems, we really are working with a unified index. And so having that common data format and set of fields is really important. It goes back to everything we’ve been talking about.

Jae 

That’s awesome. Do either of you have maybe another anecdote or real-world example where, you know, somebody came and discovered what we could do for them in terms of making this a lot easier? You know, maybe they were looking at a pretty big project.

Karan 

Yeah, actually, it just happened two weeks ago, where we were talking to one of our customers and just trying to figure out how we can best use our crawlers to get the data. We were going through the same motions that we do with every customer, where we’re just trying to make sure that we have access to all the pages, all the data is there, all the metadata is there. And the customer found that they don’t really have a lot of good-quality data available publicly – a lot of the data they have is in their database behind the scenes. That customer was using AEM – Adobe Experience Manager. So a lot of data was in AEM. And when the engineers saw the ingest API, they immediately made the decision to just move to the ingest API, so that they have complete control over what data should be sent to Site Search, in which format, and which field should have which data. They did not want to rely on anything else. So they saw the ingest API, and within a day or two, they completely moved away from crawlers and just went for the ingest APIs. After a day or two, they had already implemented it. And within a week, they were live.

Pete 

And I will say it was interesting, because of what we were expecting when they made the decision that they were going to walk away from using the crawlers and use the ingest API. At least for me, I thought, “okay, this means they’re probably going to come with a bunch of questions, and we might have to provide some enablement in that area.” And they came back and said, yep, we’re good, we’re already launched, like, we’re set – we provided almost no assistance to them to be able to use those ingest APIs. And that really speaks to the fact that we are using the native language of Solr. If you’re familiar with Solr, you’re going to be familiar with our APIs. And so it was really neat to see that kind of light bulb click on, especially for a platform like AEM, where we don’t currently have a large adoption rate yet. For us, it’s a fairly new platform. So being able to see how easy it was for a customer that’s using AEM and using our Site Search product, and that partner having such a light bulb moment of, hey, this is easy, we were able to do it – it was magical to see, I’ll say that.

Jae 

I would imagine – that’s really a great anecdote there. And I would imagine that when we talk about who gets involved with this process – and you’re talking about an Adobe deployment in particular – there’s some potential complexity in terms of the project size, right? So, can you guys just talk a little bit about who gets involved in this process – you know, the whole data ingest process – and what the individual teams’ roles are? We also do a lot with implementation partners, obviously. So what’s their role? What’s SearchStax’s role? Just help our viewers and listeners get a sense for how that typically works, starting with Pete.

Pete 

Well, it starts with a dream, Jae. In terms of the players, I mean, it really depends on the maturity of who’s leading the project, and whether it’s being driven by the customer or being driven by a partner. Going back to this example, and without trying to reveal too much, this was an existing Site Search partner – or sorry, an existing Site Search customer on a different platform that wasn’t AEM. And we had just onboarded them to Site Search on their old platform. And everybody was applauding and gave everybody a pat on the back because we had a successful onboarding, and they were off to the races on their own platform. About a week later, they came back to us and were like, “so we’re switching to AEM, we’re completely rewriting all of our – like, redoing everything, replatforming – and we’re only doing it in like a month.” And we kind of looked at them like, okay, here we go. And again, we were trying to figure out, okay, what’s the fastest way that we can do this? So in terms of the players that were involved – we had a really great partner that was working with that customer. We had a really great customer who had adopted our viewpoint and adopted our product and said, “hey, this is a great product, and we want to take it with us – can we take it with us to another platform?” So we had really great buy-in from the partner, we had really great buy-in from the customer. And then on the platform side, even Adobe got involved and was a champion for us. And so having that really nice synergy between the platform, the marketing team, and the technical team was really important for being able to have such a quick and successful launch – in my opinion, at least, from me being kind of an overseer on some of these things. It was really neat to see. You know, Karan, from a technical side, you can probably get a little bit deeper on that, but it was neat to see how all those players contributed here.

Jae 

Yeah, that’s really good to hear. And by the way, a goal is a dream with a deadline, so what you’re saying is not far off. And I guess, Karan – can you share a little additional context with us about the actual day-to-day operations – you know, how did it all go? And again, without revealing anything important or sensitive, but, you know, with the important aspects being more from the perspective of what we’ve been talking about: what were you able to avoid in terms of general additional friction and chaos and unorganized aspects of the project, by virtue of what we’re talking about here?

Karan 

Yeah, you know what, in all the onboardings I’ve been a part of, it all depends on how much involvement the customer really wants from us. Because most of the time – 99% of the time – we are never the blockers. Because we don’t really have any implementation, we don’t do anything – I mean, we just set up a crawler, and then it’s off to the races. When setting up the environments, we just need a day, and we can set up any number of environments that you want. So we are never really the blocker. That was the case with this customer as well, where, as Pete mentioned, there was really nice synergy between the customer, the implementation partner, and the platform. And immediately, when they saw the ingest API, all of them knew that that was the way to go. There was no hurdle, no blocker from anyone, there was no objection from anyone, everyone just got aligned on the ingest API, everyone got hyper-focused on that, and within a few days they just knocked it out of the park. We do get some customers where there sometimes might be a mismatch, or maybe they don’t even have the technical expertise to make a call. When that happens, they rely on us to make those decisions. And if they don’t really have technical expertise, then we often ask them to just go for crawlers, because that’s a completely hands-off approach. The only thing they have to focus on is creating a website and putting content on their website. Everything else is taken care of by SearchStax, where our crawler goes and gets the data and then just helps them create a nice search experience. So basically, the level of involvement from us depends completely on the customer, the kind of hand-holding they want, the kind of white-glove treatment they want, and the level of technical expertise they have. I remember we had a customer who was looking – we had a trial, I think someone spun up a trial. And we reached out to them and asked them, you know, how are you going to use Solr? Do you need any help? And they were like, “No, I already know how to do dense vector search, and I’ll just use Solr,” and without any help from us, they implemented semantic search on Solr. We do get customers like that, who are highly technical. We have some customers who have played around with Learning to Rank without any involvement from us, and they have like two or three Solr engineers on their staff. So our involvement completely depends on the technical expertise that customers have. We can be as hands-on as you want us to be, or as hands-off as you want us to be.

Jae 

That’s awesome. This has been a really, really great conversation. I really appreciate both of you. Any parting words, just to kind of help people understand the basics? We’re going to have more of this type of content, getting into some of those connectors that Pete talked about, and some other things, but any other kind of parting comments from either of you? Karan, we can start with you.

Karan 

Oh, my parting comment is going to be that data ingestion is hard. And it’s daunting. I’ll just say that. Don’t worry. It’s hard for everyone. We will talk about it. We’ll figure it out.

Pete 

I think that really plays into my parting thoughts. You know, really it’s about expectation management and knowing there’s probably going to be some work on your side that’s going to be needed. And obviously, that’s not just on the ingestion side – we’re only talking about ingestion here, right? Even on the implementation side, you know, customers are gonna have some work to do to implement Site Search and get it to look like they want on their site, and stuff like that. So knowing ahead of time and being aware of what the expectations are, and just setting the proper expectation that it is hard – ingestion is hard. There can be some homework; you’re gonna have to understand your content. And if you don’t have a grasp on it – especially for new digital marketers, who maybe are new to their company and are having to understand their site – make sure that we’ve got folks on the call during our onboarding process who understand the content, because that’s going to make it much easier for us to help you.

Jae 

Awesome. Well, thank you both. This was really terrific. I’m looking forward to more of these sessions.

SearchStax helps companies create exceptional search experiences, managing Solr infrastructure on the backend via SearchStax Managed Search and site search on the front end with SearchStax Site Search. Interested in learning more?