Monday, January 27, 2014

Get that Search Button Working with Apache Solr!

Search is probably the most important component of an application when it comes to finding valuable data, and that is precisely what Solr has been made for. Let this article guide you through setting up your own search server that will handle all the search needs of your Web application, whichever language it happens to be built in: Java, Python, PHP or anything else!
Solr is an open source search server that was originally developed at CNET, and then donated to Apache. You should be thinking about Solr only if you want to implement some kind of search facility for your Web application that’s backed by arbitrary sources of data. Although it has been built with enterprise applications in mind, where users would set up instances of a search server on a separate machine dedicated for this purpose, this shouldn’t stop you from using it for small projects too.
Solr is based on the full-text search library called Apache Lucene, and is built largely as an enhancement on top of the index that Lucene provides. Solr is written entirely in Java, so you’ll need to have Java and a servlet container like Tomcat or GlassFish installed, on top of which Solr can be deployed as a WAR package. However, that does not mean that you need to know Java, unless you want to make modifications to Solr!
Solr provides XML/HTTP interfaces and APIs just like Web services, and so can work with almost any programming language. The extensive external configuration options are more than enough for most customisations. Apart from this, it even provides a comprehensive browser-based administration interface.
Solr also gives you a good set of default settings to start with, and provides good documentation along with it. The best applications being used out there are highly customised, but they require a deep understanding of the type of data to be searched, and the queries that users are going to make.
Information can be collected directly from filesystems, databases, websites, directories, mail servers or anything else. On the whole, Solr is a good compromise for enterprise use as well as personal applications.

Why Solr?

Whenever we think about search, what comes to our minds is usually Google; after all, it has been the world leader in terms of search technology. Then why use Solr?
For starters, Solr allows you to have precise control over what gets indexed, how it gets indexed and how it’s retrieved from the index. In short, it is a powerful search solution that is totally under your control, and can change the way users perceive and interact with your application. Setting up a separate instance of a search server will help you streamline your website and help your users wade through data with an ease that they have never experienced before.
But then, why go through the trouble of setting up a separate search server, when even databases can provide such text-indexing capabilities? Why can’t we just SELECT * FROM Content to retrieve data? More importantly, if Solr has been implemented on top of Lucene, why can’t we build our own search application with its help?
Well, building a search application on top of Lucene would require a lot of work — and you’d really have to know what you’re doing (in which case, you wouldn’t need to read this article at all). Talking about comparisons with databases, it depends on what kind of search application you want to build. If you want something that helps with exact-match queries, like the names of all the people who were born in 1988, or a list of all authors who wrote books on a particular subject, then you’re better off with databases, without any extra effort. However, if you want to answer queries that are less predictable and need proper analysis, like a list of documents in which a particular keyword occurs, or anything similar that is more flexible, Solr is there for you.
Before creating a database in a DBMS, you need to properly design a schema — and even minor changes in the structure of the data need to be reflected back into the database. This is not so with Solr — actually, it acts more like a NoSQL document store, where documents are in turn stored in the form of a Lucene index. Does that mean we can use it as a NoSQL database?
Probably, but Solr hasn’t been built for that purpose; it’s more efficient as a search server, and we should let it remain that way.

Features and components

Carefully implemented instances of Solr can provide features that can match even those of Google — and what’s more, it allows for further customisation, if needed; after all, it’s open source!
Let’s take a brief look at some of the important and special features that Solr has to offer, without going into too much technical detail.
  • Faceted search: This is one feature that stands out from the others. Whenever we encounter search in most applications, it comes with options to narrow down our search results with respect to certain parameters. Faceting provides such functionality, by providing facets as the navigational elements in the search. This technique is quite popular in places like online e-commerce websites, where we might need to filter search results from a particular category or a particular company.
    Figure 1: A live example of Facets in the search application of the SourceForge website
  • Spatial search: Although spatial search or geo-aware search is not a “feature” as such, it is quite relevant these days — and geo-location information is often included with the data, so it is important to mention it here. Spatial search is a plugin that is expected to be included with native support in Solr version 4.0, and will provide results filtered according to the location information. For example, a search for a coffee shop should produce results suitable to your current location.
  • Clustering: Clustering is also used for the aggregation of data. Solr provides for search result clustering with the Carrot2 clustering engine. Clustering is different from faceting, in the sense that it is done dynamically. We don’t need to mark documents or specify fields to be categorised for clustering to take place; it is done automatically by recognising the words that occur most commonly in the text, and by recognising the structure of the text. Clusters are discovered when the search query is executed, whereas facets are based on field values that were fixed when the documents were indexed (the facet counts themselves are still computed for each query). One of the clustering algorithms that I am aware of is the k-means clustering algorithm.
Other features (which search engines must obviously have) include hit highlighting, auto-suggest, spell-checker and the “more like this” functionality, which do not need introduction.
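To make the faceting idea described above a little more concrete, here is a minimal Python sketch that counts facet values over a handful of in-memory documents. It only illustrates the concept and is not how Solr computes facets internally; the documents, the field names (cat, brand) and the search and facet_counts helpers are all made up for this example.

```python
from collections import Counter

# Hypothetical documents; in Solr these would live in the index.
DOCS = [
    {"id": "1", "name": "Solr in practice", "cat": "book", "brand": "Acme"},
    {"id": "2", "name": "Solr sticker", "cat": "merchandise", "brand": "Acme"},
    {"id": "3", "name": "Lucene handbook", "cat": "book", "brand": "Initech"},
    {"id": "4", "name": "Solr cookbook", "cat": "book", "brand": "Initech"},
]

def search(docs, keyword):
    """Return the documents whose name contains the keyword (case-insensitive)."""
    needle = keyword.lower()
    return [d for d in docs if needle in d["name"].lower()]

def facet_counts(results, field):
    """Count how many of the results carry each value of the given field."""
    return Counter(d[field] for d in results if field in d)

hits = search(DOCS, "solr")
print(facet_counts(hits, "cat"))    # counts per category among the hits
print(facet_counts(hits, "brand"))  # counts per brand among the hits
```

Each count would typically be shown next to a link that narrows the result list to that value, which is the navigational role facets play in a search page.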

Installation and setup

First of all, let’s get down to installing and setting up a single core instance of Solr on top of Tomcat. We will try to install the example schema that comes along with the archive.
Before proceeding, it is worth mentioning that both the Lucene and Solr projects were merged in March 2010, due to which the version numbering sometimes becomes quite ambiguous. To avoid such confusion, let’s look at the current and next version numbers of Lucene/Solr:
Version    Status
1.4.1      The current version of Apache Solr
1.5/2.0    These were the next versions to be released before the Lucene/Solr merge, and have been skipped in order to keep up with Lucene.
3.1        This is the next Lucene/Solr point release.
4.0        This will be the next major release after the Lucene/Solr merge.
So, having dealt with that, you can begin by downloading and unzipping the apache-solr-1.4.1.tgz file from the website. As mentioned before, you will need to have at least Java 5 and a servlet container installed.
In the apache-solr-1.4.1/dist directory, you will find the apache-solr-1.4.1.war file, which can be uploaded to any servlet container (in our case, I assume it is Tomcat). Rename it to solr.war and deploy the file.
Next, you need to set up the home folder which contains all the configuration files, etc. To do this, you can either go to apache-solr-1.4.1/example, copy the solr directory, and paste it into the bin folder of Tomcat, or you can set the solr.solr.home Java system property to point to the location of your home folder.
Now, visit http://localhost:8080/solr/admin. You should see the Solr Admin console (see Figure 2), and you’re done.
Figure 2: Solr Admin Console
If you’re still not able to get this working, you can try an example setup using the lightweight Jetty server, which can be run using the following command in the example folder:
java -jar start.jar
This will set up Jetty to listen on port 8983 instead of 8080, so you will have to replace the port number in the corresponding examples. You may explore the Admin console now, which consists of several tabs and a query box. At this point, queries will not return results, since we haven’t yet indexed anything; we will do so later.

Architecture

Solr consumes data in the form of documents. A document is the basic unit of information that is fed into Solr. It consists of fields, each of which holds a more specific piece of information.
The process shown so far is a simplified version of the installation procedure, and probably not how you would want it in a production system. For such purposes, you should refer to the actual documentation on the Solr wiki.
Let’s now look at some of the important files, and the directory structure:
Directory    Function
solr/bin     This is an optional directory for keeping replication scripts, in case we want to implement some amount of replication on our servers.
solr/data    This is the default location where the index of your search application will be stored. Its location can be changed in the solrconfig.xml file.
solr/conf    This is the most important directory, where all the configuration files are stored.
The two most important files in the conf directory are as follows; each of these is extensively commented, with every parameter explained in detail:
  • solrconfig.xml — This file handles the entire primary configuration. It is used to define request handlers, response writers (explained below), auto-warming parameters and plugins.
  • schema.xml — This file, as is obvious, dictates the “schema” or the structure of your documents. It also specifies how each field in the document should be indexed, and what fields each document can contain. The file is divided into sections to define types, fields and other definitions pertaining to documents. It also specifies Analyzers in order to determine the manner in which each field is indexed and retrieved. Schema design is one of the most important steps while designing an application on top of Solr.

RequestHandler

A RequestHandler dictates what is to be done when a particular request is made. It basically processes all the interactions with the search server, including querying, indexing, etc. Examples of some of the many RequestHandlers available (besides the StandardRequestHandler) are DisMaxRequestHandler, DataImportHandler, MoreLikeThisHandler, SpellCheckerRequestHandler, etc.
Each of these is built to perform a specific function; Solr even allows you to write custom handlers for your own purpose.

ResponseWriter

ResponseWriter generates and formats the response to a query, or any request made to the server. Multiple ResponseWriters are available to produce the data to be consumed in the form of JSON, PHP, Python, Ruby, CSV, XML and even XSLT. A custom ResponseWriter can also be written in a manner similar to a RequestHandler, by implementing an interface specified for each, in Java.
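As a rough illustration, the response format is chosen per request through the wt parameter (for example wt=json). The Python sketch below builds such a URL and pulls document ids out of a JSON body. The base URL assumes the Tomcat setup used in this article, the sample body is hand-written rather than captured from a server, and the select_url and doc_ids helpers are hypothetical names.

```python
import json
from urllib.parse import urlencode

# Assumed base URL, matching the Tomcat setup used in this article.
SOLR_SELECT = "http://localhost:8080/solr/select"

def select_url(query, writer="json"):
    """Build a query URL that asks Solr for a particular response format."""
    return SOLR_SELECT + "?" + urlencode({"q": query, "wt": writer})

# Hand-written sample of what a JSON response body could look like.
SAMPLE_BODY = """
{
  "responseHeader": {"status": 0, "QTime": 1},
  "response": {"numFound": 1, "start": 0,
               "docs": [{"id": "doc-1", "name": "A sample document"}]}
}
"""

def doc_ids(body):
    """Extract the ids of the returned documents from a JSON response body."""
    data = json.loads(body)
    return [doc["id"] for doc in data["response"]["docs"]]

print(select_url("solr"))
print(doc_ids(SAMPLE_BODY))
```

Asking for JSON instead of the default XML is mostly a convenience for the client: the same query runs on the server, and only the serialisation of the result changes.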

Search components

Search components provide a specific piece of functionality that can be reused with any type of handler, so that common logic is shared between them. They are the most basic extension mechanism in Solr. Each component is configured in the solrconfig.xml file, alongside the RequestHandlers.

Indexing techniques

Indexing is the process of feeding data into Solr’s index. Proceeding with the example, let’s first send the data to our installed application for indexing, with the help of the included simple post application. Go to the apache-solr-1.4.1/example/exampledocs folder, under the location where you initially unzipped the Solr archive. In this directory, run the following command:
java -Durl=http://localhost:8080/solr/update -jar post.jar *.xml
After the process is completed, you will be able to search for your data from the admin console. The data that gets posted to the Solr index consists of all the XML files contained in the exampledocs folder. You may look at the structure of each of the documents that get posted.
There are multiple ways for sending data to be indexed:
  • XML-post: Solr accepts a predefined set of XML messages, each of which specifies the action to be performed. This may range from the simple addition of new documents, to deleting, updating and committing data. These XML commands can simply be POSTed to the server (at http://localhost:8080/solr/update in this case), and it will perform the specified operation. This is precisely what the post.jar application does. For example, to delete a document, you can run the following command in a terminal:
    curl http://localhost:8080/solr/update -F stream.body='<delete><id>VA902B</id></delete>'
  • CSV: Solr can also import data in CSV format, where values are separated by a delimiter such as a comma or a tab. It can be set up to accept streams of CSV data too, if your data source generates them. CSV data is sent to http://localhost:8080/solr/update/csv. A sample books.csv file is included in exampledocs, which can be uploaded using curl in a manner similar to that shown above.
  • Solr Cell: What more could you possibly want from a search server, than its ability to index PDF, Word, Excel, JPEG and even MP3 files? Solr Cell, a.k.a. ExtractingRequestHandler, is what allows you to do so! It has been built as a wrapper around the Apache Tika project, which can strip out data from a wide range of supported documents — and what’s more, it can even auto-detect the MIME type of the data.
  • DataImportHandler: This has been built to handle direct database imports, either with the help of JDBC, or XML imports from a URL or a file. The XML import functionality can be used to index data over HTTP/REST APIs, or even RSS feeds. In addition to full import, it supports delta-imports — that is, a partial update of the index, reflecting only the recent changes in the database since the last build, instead of a complete rebuild of the index. The DataImportHandler can be defined in the solrconfig.xml file; it can further point to a file for detailed configuration of the connection and data source information. The example-DIH folder already contains some examples, along with detailed documentation, for you to try out.
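The XML-post approach from the list above can be sketched in a few lines of Python. This is only an illustration under some assumptions: the update URL matches the Tomcat setup used in this article, the id and name field names are assumed to exist in your schema, and add_message, delete_message and post_xml are hypothetical helper names. The network calls are left commented out so that the snippet runs without a server.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Assumed update endpoint, matching the Tomcat setup used in this article.
SOLR_UPDATE = "http://localhost:8080/solr/update"

def add_message(fields):
    """Build an <add><doc>...</doc></add> message from a dict of field values."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in fields.items():
        field = ET.SubElement(doc, "field", name=name)
        field.text = str(value)
    return ET.tostring(add, encoding="unicode")

def delete_message(doc_id):
    """Build a <delete><id>...</id></delete> message for one document."""
    delete = ET.Element("delete")
    ET.SubElement(delete, "id").text = str(doc_id)
    return ET.tostring(delete, encoding="unicode")

def post_xml(message, url=SOLR_UPDATE):
    """POST an XML message to Solr and return the raw response body."""
    request = urllib.request.Request(
        url,
        data=message.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")

print(add_message({"id": "BOOK1", "name": "A sample book"}))
print(delete_message("VA902B"))

# With a running server, changes become searchable only after a commit:
# post_xml(add_message({"id": "BOOK1", "name": "A sample book"}))
# post_xml("<commit/>")
```

Building the messages with an XML library rather than string concatenation means that characters such as & or < in field values are escaped automatically.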

Querying

So finally we have all our data set, and we are ready to make a query or two in order to know how it works. Let’s make a basic query now: type solr into the search box at the admin console, and see what you get. The URL generated looks something like what’s shown below:
http://localhost:8080/solr/select/?q=solr&version=2.2&start=0&rows=10&indent=on
We can guess that http://localhost:8080/solr/select is the endpoint at which we will be making our queries, while the q=solr parameter represents our search query. The start=0 and rows=10 parameters are used for paginating, since Solr does not display all results at once. Besides this, there are a lot of parameters that have been left out, which include features like boosting, sub-expressions, filter queries, sorting, and of course, faceting!
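The pagination parameters can be derived from a page number, as in this small Python sketch. The base URL assumes the Tomcat setup used in this article, and page_url is a hypothetical helper, not part of any Solr client library.

```python
from urllib.parse import urlencode

# Assumed query endpoint, matching the Tomcat setup used in this article.
SOLR_SELECT = "http://localhost:8080/solr/select"

def page_url(query, page, rows=10):
    """Build the select URL for a 1-based page number."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    params = {
        "q": query,
        "start": (page - 1) * rows,  # offset of the first result on this page
        "rows": rows,                # number of results per page
        "indent": "on",
    }
    return SOLR_SELECT + "?" + urlencode(params)

print(page_url("solr", 1))
print(page_url("solr", 3))
```

Using urlencode also takes care of escaping spaces and special characters in the query string, which matters as soon as users type more than a single word.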
Additionally, the DisMax Query parser allows for even more sophisticated queries that should be more than enough to satisfy even advanced users.
Looking at the search results, we’ll be able to see XML that has been enclosed within the <response> tags. The initial metadata, like the status, query time, etc., is enclosed within the <lst name="responseHeader"> element, and the subsequent results are enclosed within the <result name="response" numFound="1" start="0"> element, which further contains a <doc> element for each document returned as a result. The numFound attribute gives the total number of documents that matched the query, which can be larger than the number of <doc> elements returned on the current page.
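A client can read this structure with any XML parser. The following Python sketch parses a hand-written sample response (not captured from a real server) and returns the total match count along with the documents. The parse_response helper is a hypothetical name, and the sketch only handles single-valued fields.

```python
import xml.etree.ElementTree as ET

# Hand-written sample response with the structure described above.
SAMPLE_RESPONSE = """
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">2</int>
  </lst>
  <result name="response" numFound="1" start="0">
    <doc>
      <str name="id">doc-1</str>
      <str name="name">A sample document</str>
    </doc>
  </result>
</response>
"""

def parse_response(xml_text):
    """Return (total matches, list of documents as dicts) from a response."""
    root = ET.fromstring(xml_text.strip())
    result = root.find("result[@name='response']")
    docs = []
    for doc in result.findall("doc"):
        # Each child element is one field; its name attribute is the field name.
        docs.append({field.get("name"): field.text for field in doc})
    return int(result.get("numFound")), docs

found, docs = parse_response(SAMPLE_RESPONSE)
print(found)
print(docs)
```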
In conclusion, it will be useful to note that add-on modules for using Solr with CMSes like Drupal and WordPress are already available. These modules include all the configuration information to set up Solr for your website, and provide a much better searching mechanism than the one included with the core.

Installing Webmin with YUM (CentOS,RHEL)

 there you go: 
 
echo -e "[Webmin]\nname=Webmin Distribution Neutral\nbaseurl=http://download.webmin.com/download/yum\nenabled=1" > /etc/yum.repos.d/webmin.repo
rpm --import http://www.webmin.com/jcameron-key.asc
yum install webmi