Searching With Solr

Chances are, you’ve found yourself in the sticky situation of adapting a default search engine. At Blend, we’ve gone in a new direction, looking toward an open source solution called Apache Solr.

    Seth Larson
  • Jun. 21 2013

Chances are, you’ve found yourself in the sticky situation of adapting a default search engine. In the beginning, everything seems “good enough.” It’s tested internally with moderate results. Content is uploaded and files are added. More people begin testing the search engine, and more content is piled on top. Query after query is requested and returned.

Yet, the results don’t seem to be very relevant. So we dive in, hoping the API allows for adaptation of results — make the title higher, create synonyms within search taxonomy, filter based on words in a field.

Can we add suggested terms? Can we adapt the search results at all? Meanwhile, the people we’re working with wonder, “Why isn’t this more like Google?”

Introducing Solr

Search is not a new practice. There are a variety of available solutions for immediate search results: Google Site Search, search engine redirection, or even that default search that came with your software package.

At Blend, we’ve gone in a fourth direction, looking toward an open source solution called Apache Solr.

Solr is an Apache project that aims to make advanced searching and indexing of data as readily available as MySQL made storing and retrieving it. It provides full-text search, faceted navigation, recommended/related searching, spell suggestion/correction, result highlighting, and much more.

Now, instead of relying on default search systems or Google Site Search, we look to Solr to deliver the same features of advanced search appliances (like Google Search Appliance) and surpass its flexibility and scalability. And we’re not alone: Solr is already in use on the back end for sites like LinkedIn, Twitter, CNET, Netflix, and Digg.

Getting Started

We could talk all day about Solr, but let’s go a step further and actually put it to work. Solr’s requirements are simple. You’ll need:

  • The latest version of Java
  • A directory to put the index in
  • A directory to put the configuration files in

Once you’ve confirmed this, go to http://lucene.apache.org/solr/4_2_1/tutorial.html and follow the directions up to the point of having Solr started. You’ll know you’re successful when http://localhost:8983/solr/ displays a running Solr instance.

Go ahead. We’ll wait.

Using Solr

Solr’s up and running. Hooray!

Note: Solr ships with some working examples to familiarize users with its features. Running start.jar uses a default home directory of ‘solr’. If you later want to experiment with the more advanced configurations shipped, you can do so by specifying a different Solr home directory in the start command. For example, to use the shipped “multicore” configuration:

    java -Dsolr.solr.home=multicore -jar start.jar

For this article we’ll only be using the default solr directory.

Selecting a Collection

Begin by opening the Solr homepage at http://localhost:8983/solr/.

Solr dashboard form

This default homepage gives us statistics about the server, but we’re going to focus on collections. Collections are groups of indexed data — the rough equivalent to an individual database in MySQL. We’ve been provided with some default collections — this one’s named “collection1” — so let’s dive in.

Clicking on “Core Selector” and then through to a collection will give us a list of the information and options available specific to that collection. For this article, we’ll primarily focus on Query and Schema.

Solr dashboard form

Execute Query

While in “collection1,” click on Query, then scroll to the bottom and “Execute Query”. The right area of the page will fill with XML of the resulting matches. (NOTE: because no content has been added, results will return as zero.)

    <response>
      <lst name="responseHeader">
        <int name="status">0</int>
        <int name="QTime">0</int>
        <lst name="params">
          <str name="indent">true</str>
          <str name="q">*:*</str>
          <str name="wt">xml</str>
        </lst>
      </lst>
      <result name="response" numFound="0" start="0">
      </result>
    </response>

Solr dashboard form

The response is composed of two sections: responseHeader, which contains technical details of the query, and result, which contains the actual matches. Diving deeper into responseHeader:

  • status — Indicates whether an error has occurred. If everything is clear, it returns “0”.
  • QTime — Time the query took in milliseconds.
  • params — Represents an array of the parameters for the query. It will have a value for everything modified in the query form — from indents to queries to writers.

In the above example, the params array shows that indent was set to true, the query was for *:* and the writer was set for xml.
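The header can be pulled apart programmatically. Here’s a minimal Python sketch, using only the standard library, that parses the response shown above:

```python
import xml.etree.ElementTree as ET

# The XML response from the empty-index query above.
xml = """<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">0</int>
    <lst name="params">
      <str name="indent">true</str>
      <str name="q">*:*</str>
      <str name="wt">xml</str>
    </lst>
  </lst>
  <result name="response" numFound="0" start="0"/>
</response>"""

root = ET.fromstring(xml)
header = root.find("lst[@name='responseHeader']")

# status == 0 means no error occurred.
status = int(header.find("int[@name='status']").text)

# params is a flat list of name/value pairs.
params = {e.get("name"): e.text
          for e in header.find("lst[@name='params']")}

print(status, params)  # 0 {'indent': 'true', 'q': '*:*', 'wt': 'xml'}
```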

In addition to the XML, executing a query will return a url above the XML. In this case, that url is http://localhost:8983/solr/collection1/select?q=*%3A*&wt=xml&indent=true

Everything in Solr is an HTTP request for sending or receiving data. This url breaks down into the Solr servlet, the collection name (collection1), and the select handler along with its query parameters.
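That breakdown can be sketched as a small URL builder. The helper name here is ours, purely illustrative, not part of Solr:

```python
from urllib.parse import urlencode

def solr_select_url(base, collection, **params):
    """Build a Solr select URL: servlet + collection + select handler + params."""
    return "%s/%s/select?%s" % (base.rstrip("/"), collection, urlencode(params))

url = solr_select_url("http://localhost:8983/solr", "collection1",
                      q="*:*", wt="xml", indent="true")
print(url)
```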

Loading Data

Let’s load some data into Solr to get actual results. Begin by opening a command prompt in the Solr directory you started from. Then, download some example content to load:

    curl -O https://gist.github.com/sclarson/5129795/raw/438c0eabd8d906850e87d4872ab756aa0b30eac1/snes.json


Next, we’ll load it into Solr:

    curl 'http://localhost:8983/solr/update/json?commit=true' --data-binary @snes.json -H 'Content-type:application/json'


By clicking into collection1, we now see that there are 845 documents loaded.

Solr dashboard form

Let’s run that original query again and look at the changes in the XML.

    <?xml version="1.0" encoding="UTF-8"?>
    <response>
      <lst name="responseHeader">

        ...

      </lst>
      <result name="response" numFound="845" start="0">

        ...

      </result>
    </response>


The “numFound” attribute on result gives us the total number of matching documents, in this case 845. The “start” attribute gives us the offset of the first document returned; when set explicitly, it will also appear in the lst name="params" element.

Looking at the response, we also see a number of <doc> elements. Everything stored and searched in Solr is part of a document, and any result set that returns content will contain a list of documents.

Search for Single Term

Since we searched for *:*, every document was returned: the wildcard matched all fields and all records. Let’s search again to find all of the games developed by Square.

First, find the q textbox and enter: developer_s:Square. This searches the developer_s field for the text Square, and should return 7 results.

Solr dashboard form

Search for Multiple AND/OR Terms

Maybe you want to know all of the games published by either Square or Nintendo. A common first attempt is to add “Nintendo” to the query, making the value for q developer_s:Square Nintendo. Executing this query gives us a result with only 8 documents found.

Solr dashboard form

But, because we are all well-seasoned Super Nintendo fans, we know this isn’t accurate. Square alone made seven games, and Nintendo made many more, so eight results can’t be correct.

Instead, we must group the terms using parentheses. Wrap parentheses around all the words you want to find in the developer_s field: update the value of q to developer_s:(Square Nintendo), execute the query, and we’ll get 24 results. Sure enough, Legend of Zelda, Super Mario World, and Super Mario Kart are all included.

Solr dashboard form

The difference between these two queries is small but significant. Words surrounded by the parentheses are grouped to the field they follow. Without the parentheses, “Nintendo” is searched against the default field.
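This grouping rule can be captured in a tiny helper. The function is ours, purely illustrative; it only builds the query string Solr expects:

```python
def fielded_query(field, *terms):
    """Build a Solr query matching any of `terms` in `field`.

    A single term is emitted as field:term; multiple terms are
    wrapped in parentheses so they all bind to the field.
    """
    if len(terms) == 1:
        return "%s:%s" % (field, terms[0])
    return "%s:(%s)" % (field, " ".join(terms))

print(fielded_query("developer_s", "Square"))              # developer_s:Square
print(fielded_query("developer_s", "Square", "Nintendo"))  # developer_s:(Square Nintendo)
```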

More information about how search queries are formed and processed can be found in the Solr query parser documentation.

Since we can put documents into the index and query them out, let’s look into the details.

Each document will be composed of a list of its fields (though not all are required, depending on the query).

Let’s take a look at the Chrono Trigger doc.

    <doc>
      <str name="id">ChronoTrigger1995[NA]</str>
      <int name="year_i">1995</int>
      <arr name="title">
        <str>Chrono Trigger</str>
      </arr>
      <str name="publisher_s">Square</str>
      <str name="developer_s">Square</str>
      <arr name="region_ss">
        <str>NA</str>
      </arr>
      <long name="_version_">1429159051143413760</long>
    </doc>


The doc is composed of elements named with the type of data they contain. In this case we have an integer for year, strings for id, publisher_s and developer_s, and arrays of strings for title and region_ss. The names and values match those in the JSON.

    {
      "id": "ChronoTrigger1995[NA]",
      "year_i": "1995",
      "title": "Chrono Trigger",
      "publisher_s": "Square",
      "developer_s": "Square",
      "region_ss": "NA"
    }


A few things might seem odd. The _i, _s, and _ss suffixes are telltale signs that we’re using a Solr concept called dynamic fields, which allow us to create fields as we need them. The id and title fields are not dynamic and exist in every document. (We’ll see why in a bit.) Comparing the JSON to the XML, we also see that the JSON holds a single value for title and region_ss, while they are arrays in the XML. This is because those fields are configured to allow multiple values.
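The suffix convention can be sketched as a lookup table. This is a simplified illustration of a few of the dynamic-field rules in the default example schema, not Solr’s actual implementation:

```python
# Suffix -> (field type, multiValued), mirroring a handful of the
# dynamicField rules shipped in the example schema.
DYNAMIC_SUFFIXES = {
    "_i":  ("int", False),
    "_s":  ("string", False),
    "_ss": ("string", True),
    "_t":  ("text_general", False),
}

def field_type(name):
    """Return (type, multiValued) for a dynamic field name, or None."""
    # Check longer suffixes first so "_ss" wins over "_s".
    for suffix, info in sorted(DYNAMIC_SUFFIXES.items(),
                               key=lambda kv: -len(kv[0])):
        if name.endswith(suffix):
            return info
    return None

print(field_type("year_i"))     # ('int', False)
print(field_type("region_ss"))  # ('string', True)
```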

The schema.xml File

So far we’re able to easily query a given field (NOTE: as you may have noticed, queries are case sensitive in this case) and add data to the system. We did this without even modifying the default configuration; we just started Solr and threw information at it.

Solr’s ease of installation and startup alone makes the system worth considering, but this simple method is not without some problems. For example: if we execute a search with the query set to developer_s:square, we’ll get 0 results because strings and text are handled differently. Another example: a query for the word Nintendo (and nothing else) will return one result, the ill-fated monstrosity known as the Super Nintendo Scope 6.

Without diving into the very low-level details, this is a problem with analyzers and copy fields. To fix it we’re going to venture into a new document: the schema.xml file, which defines what Solr contains. Our current fields include id, title, several _s fields, and an _ss field. Skimming through the schema.xml file, we’ll find a few lines that look similar to these:

    <uniqueKey>id</uniqueKey>

    <!-- copyField commands copy one field to another at the time a document
         is added to the index. It's used either to index the same field differently,
         or to add multiple fields to the same field for easier/faster searching. -->
    <copyField source="title" dest="text"/>

    <types>
      <!-- The StrField type is not analyzed, but indexed/stored verbatim. -->
      <fieldType name="string" class="solr.StrField" sortMissingLast="true" />
      <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
        ... <!-- detail about analyzing here -->
      </fieldType>
    </types>


The schema file begins with defined fields — in our case, we’re making use of the id and title fields.

    <fields>
      <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />

      ...

      <field name="title" type="text_general" indexed="true" stored="true" multiValued="true"/>

      ...

      <field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>

      ...

    </fields>


Then we created developer_s and publisher_s, which match the dynamicField with name="*_s", as well as region_ss, which made use of the name="*_ss" dynamic field.

    <fields>

      ...

      <dynamicField name="*_s"  type="string"  indexed="true"  stored="true" />
      <dynamicField name="*_ss" type="string"  indexed="true"  stored="true" multiValued="true"/>

      ...

    </fields>


The non-dynamic fields always exist and store content that comes in matching their names. The dynamic fields act as catch-alls for a given combination of type and settings, allowing you to create and query fields as needed without updating the schema.

Analyzers

The schema.xml file also defines how Solr analyzes text. In its current state, searching for developer_s:square without a capital S will return no results, because the fieldType with name="string" has no settings telling Solr to modify or process the text before putting it in the index. We need to tell Solr to do something with the input so that queries become case-insensitive.

Copy Fields

Additionally, our failed “Nintendo” search did not specify a field, so the query fell back to the default field, name="text". The only Super Nintendo game with the word “Nintendo” in the title was — ugh — Super Nintendo Scope 6.

This is the result of what are called copy fields. Copy fields are settings that duplicate data being entered into a second field, so the same text can be analyzed in multiple ways.

In our example configuration we see <copyField source="title" dest="text"/>. This tells Solr to always copy the title field into a field named text for every entry. Since the text field has a type of text_general rather than string, different analyzers apply. If you look at the schema.xml file you’ll find many copyFields copying into text, which makes the default search far better.

With Solr, you can make your search more reliable by adding the following line above the top copyField.

    <copyField source="*_s" dest="text"/>


This will take all of our dynamic string fields and add them to the text field.
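To make the behavior concrete, here’s a toy simulation of what copyField does at index time. This is purely an illustration in Python, not Solr’s actual implementation, and it supports only suffix patterns rather than full globs:

```python
def apply_copy_fields(doc, rules):
    """Simulate copyField: append matching source values to a dest field.

    `rules` is a list of (source_suffix, dest) pairs; real Solr copyField
    sources are glob patterns like "*_s", simplified here to suffixes.
    """
    out = dict(doc)
    for source_suffix, dest in rules:
        for name, value in doc.items():
            if name.endswith(source_suffix):
                out.setdefault(dest, []).append(value)
    return out

doc = {"title": "Chrono Trigger", "developer_s": "Square", "publisher_s": "Square"}
indexed = apply_copy_fields(doc, [("title", "text"), ("_s", "text")])
print(indexed["text"])  # ['Chrono Trigger', 'Square', 'Square']
```

With both rules in place, the catch-all text field now contains the title and every dynamic string value, which is why the default search starts finding games by developer name.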

But it won’t work just yet. Since Copy Fields are processed at indexing time, we need to re-process all of the input data, which, like all updates to schema.xml, requires stopping and starting Solr.

Stopping and Starting Solr

    control+c

    java -jar start.jar


Solr should be starting back up. When it finishes, re-run this command:

    curl 'http://localhost:8983/solr/update/json?commit=true' --data-binary @snes.json -H 'Content-type:application/json'


Restarting Solr updates the schema configuration, and re-running the curl command sends all the data back at Solr to be processed again. Because all indexing occurs as documents are sent to Solr, any changes to the configuration will require re-submitting data to be indexed — hence the re-running of the curl command.

Going back to the query window, we can test the recent indexing by searching for “Nintendo” (and nothing else) again. We now get 53 results, and Super Nintendo Scope 6 finally has some company.

Solr dashboard form

Faceted Search

One final Solr perk is easy faceted search. Typically, gathering the unique values and counts needed for filtering is long and arduous. With Solr, we simplify the process by simply specifying the fields we wish to gather values for.

To test this, open up the query screen and enter “Nintendo” for q. Then, check the “facet” checkbox. You’ll now be able to enter “developer_s” for facet.field.

Clicking “Execute Query” will give the “Nintendo” search results with this addition at the bottom:

    <lst name="facet_counts">
      <lst name="facet_queries"/>
      <lst name="facet_fields">
        <lst name="developer_s">
          <int name="Nintendo">17</int>
          <int name="Rare Ltd.">6</int>

          ...

        </lst>
      </lst>
      <lst name="facet_dates"/>
      <lst name="facet_ranges"/>
    </lst>


Resulting Query URL: http://localhost:8983/solr/collection1/select?q=nintendo&wt=xml&indent=true&facet=true&facet.field=developer_s

Solr dashboard form

By simply specifying the field, we received a list of indexed terms and the count of results for each one. Every one of these is a valid value for the developer_s field. Having this information makes it easy to present filters that shrink the user’s result count, and it tells us exactly which query to add to do so.
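The facet counts are straightforward to extract from the response. A minimal standard-library sketch, parsing the fragment shown above:

```python
import xml.etree.ElementTree as ET

# The facet_fields fragment from the response above.
facets_xml = """<lst name="facet_fields">
  <lst name="developer_s">
    <int name="Nintendo">17</int>
    <int name="Rare Ltd.">6</int>
  </lst>
</lst>"""

root = ET.fromstring(facets_xml)

# Each child of the field's <lst> is one facet value with its count.
counts = {e.get("name"): int(e.text)
          for e in root.find("lst[@name='developer_s']")}

print(counts)  # {'Nintendo': 17, 'Rare Ltd.': 6}
```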

Here’s a bonus. When we filter this way, we can use what is called a filter query (the fq parameter). A filter query is both cached and applied before the full-text search happens, which makes the search faster by narrowing the documents searched to only those with the exact term.
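A sketch of what such a request looks like, using the standard fq parameter. The query values here are our own example, not from the dataset above:

```python
from urllib.parse import urlencode

# q is the scored full-text query; fq is the cached, unscored filter
# applied first to narrow the candidate documents.
params = urlencode({
    "q": "mario",
    "fq": "developer_s:Nintendo",
    "wt": "xml",
})
print("http://localhost:8983/solr/collection1/select?" + params)
```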

Harder, Better, Faster, Stronger

This is just the surface of what can be done with Solr, and already these features are more powerful than what most software products support out of the box.

Previous to using Solr, we’d spent weeks customizing delivered search solutions, or configuring and troubleshooting Google Mini appliances. Now, Solr enables us to customize search with days of effort instead of weeks, meeting common requirements without the hassle of wrangling a pre-packaged solution.

What’s more, Solr provides a layer of separation between our content system and our search solution, which allows for an easy debug/testing point for queries.

Finally, Solr also allows us to index things from multiple sources in the same place if we need to, without the worry of non-supported systems and a lack of replacement options if the engine should fail.

Custom solutions. Easy query debugging/reproducing. Multiple index sources and easy faceted search. All the things we’d love to see from a pre-packaged solution, but can never find. Suddenly, “good enough” doesn’t seem to cut it anymore.