The difference between these two queries is small but significant. Words surrounded by parentheses are grouped with the field name they follow. Without the parentheses, "Nintendo" is searched for against the default field.
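To make the grouping concrete (the title field here is only an illustration of the pattern), the query parser reads the two forms roughly like this:
title:(Super Nintendo)    both Super and Nintendo are searched against title
title:Super Nintendo      Super is searched against title, Nintendo against the default field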
More information about how search queries are formed and processed can be found in the Solr query parser documentation.
Since we can put documents into the index and query them back out, let's look into the details. Each document is composed of a list of its fields (though not all are required, depending on the query). Let's take a look at the Chrono Trigger doc.
<doc>
  <str name="id">ChronoTrigger1995[NA]</str>
  <int name="year_i">1995</int>
  <arr name="title">
    <str>Chrono Trigger</str>
  </arr>
  <str name="publisher_s">Square</str>
  <str name="developer_s">Square</str>
  <arr name="region_ss">
    <str>NA</str>
  </arr>
  <long name="_version_">1429159051143413760</long>
</doc>
The doc is composed of elements named for the type of data they contain. In this case we have an integer for year_i, strings for id, publisher_s, and developer_s, and arrays of strings for title and region_ss. The names and values correspond to those in the JSON we submitted.
{
  "id" : "ChronoTrigger1995[NA]",
  "year_i" : "1995",
  "title" : "Chrono Trigger",
  "publisher_t" : "Square",
  "developer_s" : "Square",
  "region_ss" : "NA"
}
A few things might seem odd. The _i, _t, and _ss suffixes are telltale signs that we're using a Solr concept called dynamic fields. These allow us to create fields as we need them. The id and title fields are not dynamic and exist in every document. (We'll see why in a bit.) Comparing the JSON to the XML, we also see that the JSON holds single values for title and region_ss, while the XML presents them as arrays. This is because those fields are configured to allow multiple values.
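For example, suppose we posted a document containing a field name that doesn't exist in our data set at all, say genre_s (a purely hypothetical field used here for illustration). Solr would accept it without any schema change, because the name matches the *_s dynamic field pattern we'll see in a moment:
{
  "id" : "SomeGame1994[NA]",
  "title" : "Some Game",
  "genre_s" : "Action"
}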
The managed-schema file.
So far we're able to easily query a given field (note: as you may have noticed, these queries are case sensitive) and add data to the system. We did this without even modifying the default configuration; we just started Solr and threw information at it.
Solr's ease of installation and startup alone makes the system worth considering, but this simple method is not without some problems. For example: if we execute a search with the query set to developer_s:square, we'll get 0 results because strings and text are handled differently. Another example: a query for the word Nintendo (and nothing else) will return one result, the ill-fated monstrosity known as the Super Nintendo Scope 6.
Without diving into the very low-level details, this is a problem with analyzers and copy fields. To fix it we're going to venture into a new document: the managed-schema file. This file defines the fields and field types that Solr uses. Our current fields include id, title, several _s fields, and an _ss field.
At the bottom of the managed-schema file we see some defined fields — in our case, we’re making use of the id and title fields.
<field name="_nest_path_" type="_nest_path_"/>
<field name="_root_" type="string" docValues="false" indexed="true" stored="false"/>
<field name="_text_" type="text_general" multiValued="true" indexed="true" stored="false"/>
<field name="_version_" type="plong" indexed="false" stored="false"/>
<field name="id" type="string" multiValued="false" indexed="true" required="true" stored="true"/>
<field name="title" type="text_general"/>
Then we created developer_s and publisher_s, which match the dynamicField with name="*_s", as well as region_ss, which made use of the name="*_ss" dynamic field.
<fields>
  ...
  <dynamicField name="*_ss" type="strings" indexed="true" stored="true"/>
  ...
  <dynamicField name="*_s" type="string" indexed="true" stored="true"/>
  ...
</fields>
The non-dynamic fields always exist and store any incoming content that matches their name. The dynamic fields act as catch-alls for a given combination of type and settings, and allow you to create and query fields as needed without updating the schema.
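As a quick illustration (assuming the same core name used by the curl commands in this walkthrough), one of these on-the-fly fields can be queried straight from the command line; nothing in the schema mentions developer_s by name, the *_s dynamic field is what makes this work.
curl "http://localhost:8983/solr/_default/select?q=developer_s:Nintendo"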
Analyzers.
The managed-schema file also lets us define how Solr processes data. In its current state, searching for developer_s:square without a capital S returns no results, because the fieldType with name="string" doesn't have any settings that tell Solr to modify or process the text before putting it in the index. We need to tell Solr to do something with the input so that queries become case insensitive.
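As a minimal sketch of what such a setting looks like (the name string_lowercase is just an example, not something the default schema ships with), a fieldType can declare an analyzer that keeps the whole value as a single token but lowercases it at both index and query time:
<fieldType name="string_lowercase" class="solr.TextField" sortMissingLast="true" omitNorms="true">
  <analyzer>
    <!-- keep the entire value as one token -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- lowercase it so "Square" and "square" match -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
With a type like this applied to the publisher and developer fields, developer_s:square and developer_s:Square would behave the same.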
Copy fields.
Additionally, our failed "Nintendo" search did not specify a field, so the query fell back to the default field, _text_. The only Super Nintendo game with the word "Nintendo" in its title was, ugh, Super Nintendo Scope 6.
This is the result of what are called copy fields. Copy fields are settings that duplicate incoming data into a second field, so that the same text can be analyzed in multiple ways.
With Solr, you can make your search more reliable by adding the following line at the very bottom of the managed-schema file.
<copyField source="*_s" dest="_text_"/>
This will take all of our dynamic string fields and add them to the _text_ field.
But it won't work just yet. Copy fields are applied at indexing time, so we need to re-process all of the input data, and the schema change itself, like any hand edit to managed-schema, only takes effect after stopping and starting Solr.
Stopping and starting Solr.
solr restart -p 8983
Solr should be starting back up. When it finishes, re-run this command:
curl http://localhost:8983/solr/_default/update?commit=true -H "Content-Type: application/json" -T "snes.json" -X POST
Restarting Solr picks up the schema change, and re-running the curl command sends all of the data back to Solr to be processed again. Because indexing happens as documents are sent to Solr, any configuration change that affects indexing requires re-submitting the data.
Going back to the query window, we can test the new index by searching for "Nintendo" (and nothing else) again. We now get 53 results, and Super Nintendo Scope 6 finally has some company.
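If you prefer the command line to the Admin UI, the same check can be run with curl (again assuming the core name used earlier); the numFound value in the response should now read 53:
curl "http://localhost:8983/solr/_default/select?q=Nintendo&rows=0"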