We’ve been working on an interesting POC recently that involved Google like search. After evaluating a few alternatives we chose Elastic Search. Search is an interesting topic, often associated with Big Data, NoSQL and all sorts of other interesting technologies. In this article I’m going to look at a few interesting aspects of Elastic Search and how you might use it.
Elastic Search is an open source search product based on Apache Lucene. It’s all written in Java and installing it is just a matter of copying it to a directory and then as long as you already have Java installed, it’s ready to go. An alternative to Elastic Search is Apache Solr which is also based on Lucene. Both are quite good, but we preferred Elastic Search since it seemed to have added quite a bit of functionality beyond what Solr offers.
Elastic search is basically a way of searching JSON documents. It has a RESTful API and you use this to add JSON documents to be indexed and searched. Sounds simple, but how is this helpful for business applications? There is a plugin that allows you to connect to databases via JDBC and to setup these to be imported and indexed on a regular schedule. This allows you to perform Google like searches on your database, only it isn’t really searching your database, its searching the data it extracted from your database.
This is similar to how web search services like Google works. Google dispatches thousands of web crawlers that spend their time crawling around the Internet and adding anything they find to Google’s search database. This is the real search process. When you do a search from Google’s web site it really just does a read on its database (it actually breaks up what you are searching for and does a bunch of reads). The bottom line though is that when you search, it is really just doing a direct read on its database and this is why it’s so fast. You can see why this is Big Data since this database contains the results for every possible search query.
This is quite different than a relational database where you search with SQL and the search goes out and rifles through the database to get the results. In a SQL database putting the data into the database is quite fast (sometimes) and then reading or fetching it can be quite slow. In NoSQL or BigData type databases much more time goes into adding the data, so that certain predefined queries can retrieve what they need instantly. This means the database has to be designed ahead of time to optimize these queries and then often inserting the data takes much longer (often because this is where the searching really happens).
Elastic Search is designed to scale out from the beginning, it automatically does most of the work for starting and creating clusters. It makes adding nodes with processing and databases really easy so you can easily expand Elastic Search to handle your growing needs. This is why you find Elastic Search as the engine behind so many large Internet sites like GitHub, StumbleUpon and Stack Overflow. Certainly a big part of Elastic Search’s popularity is how easy it is to deploy, scale out, monitor and maintain. Certainly much easier than deploying something based on Hadoop.
When you index your data its fed through a set of analyzers which do things like convert everything to lower case, split up sentences to lower case, split words into roots (walking -> walk, Steve’s -> Steve), deal with special characters, dealing with other language peculiarities, etc. Elastic Search has a large set of configurable analyzers so you can tune your search results based on knowledge of what you are searching.
One of the coolest features is fuzzy search, in this case you might not know exactly what you are searching for or you might spell it wrong and then Elastic Search magically finds the correct values. When ranking the values, Elastic Search uses something called Levenshtein distance to rank which values give the best results. Then the real trick is how does ElasticSearch do this without going through the entire database computing and ranking this distance for everything? The answer is having some sophisticated transformers on what you entered to limit the number of reads it needs to do to find matching terms, combined with good analyzers above, this turns out to be extremely effective and very performant.
Notice that since these search engines don’t search the data directly they won’t be real time. The search database is only updated infrequently, so if data is being rapidly added to the real database, it won’t show up in these type of searches until the next update processes them. Often these synchronization updates are only performed once per day. You can tune these to be more frequent and you can write code to insert critical data directly into the search database, but generally it’s easier to just let it be updated on a daily cycle.
When searching enterprise databases there has to be some care on applying security, ensuring the user has the rights to search through whatever they are searching through. There has to be some controls added so that the enterprise API can only search indexes that the user has the rights to see. Even if the search results don’t display anything sensitive they could still leak information. For instance if you don’t have rights to see a customer’s orders, but can search on them, then you could probably figure out how many orders are done by each customer which could be quite useful information.
Certainly when returning search results you wouldn’t reveal anything sensitive like say salaries, to get these you would need to click a link to drill down into the applications UI where full security screen is done.
Elastic Search is a very powerful search technology that is quite easy to deploy and quite easy to integrate into existing systems. A lot of this is due to a powerful RESTful API to perform operations.