Stephen Smith's Blog

Musings on Machine Learning…


Elastic Search


Introduction

We’ve been working on an interesting POC recently that involved Google-like search. After evaluating a few alternatives we chose Elastic Search. Search is a big topic, often associated with Big Data, NoSQL and all sorts of other interesting technologies. In this article I’m going to look at a few notable aspects of Elastic Search and how you might use it.


Elastic Search is an open source search product based on Apache Lucene. It’s all written in Java, and installing it is just a matter of copying it to a directory; as long as you already have Java installed, it’s ready to go. An alternative to Elastic Search is Apache Solr, which is also based on Lucene. Both are quite good, but we preferred Elastic Search since it seemed to have added quite a bit of functionality beyond what Solr offers.

Elastic Search

Elastic Search is basically a way of searching JSON documents. It has a RESTful API and you use this to add JSON documents to be indexed and searched. Sounds simple, but how is this helpful for business applications? There is a plugin that allows you to connect to databases via JDBC and set these up to be imported and indexed on a regular schedule. This allows you to perform Google-like searches on your database, only it isn’t really searching your database, it’s searching the data it extracted from your database.
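
To make this concrete, here is a minimal sketch (using Python’s requests library against a local node on the default port 9200; the index, type and field names are made up for the example, and the exact URL paths vary a little between Elastic Search versions) of adding a JSON document and then searching for it through the RESTful API:

# Minimal sketch: index a JSON document and search it via the REST API.
# Assumes a local node on the default port 9200; names are illustrative.
import requests

base = "http://localhost:9200"

# Add (index) a JSON document describing a customer order.
doc = {"customer": "ACME Corp", "item": "Blue Widget", "quantity": 12}
requests.put(base + "/orders/order/1", json=doc)

# Force a refresh so the document is searchable right away (normally this
# happens automatically after a short interval).
requests.post(base + "/orders/_refresh")

# Search for it with a simple match query.
query = {"query": {"match": {"customer": "acme"}}}
resp = requests.post(base + "/orders/_search", json=query)
for hit in resp.json()["hits"]["hits"]:
    print(hit["_source"])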

Web Searches

This is similar to how web search services like Google work. Google dispatches thousands of web crawlers that spend their time crawling around the Internet and adding anything they find to Google’s search database; this is where the real search work happens. When you do a search from Google’s web site it really just does a read on that database (it actually breaks up what you are searching for and does a bunch of reads). The bottom line is that when you search, it is really just doing a direct read on its database, and this is why it’s so fast. You can see why this is Big Data, since this database contains the results for every possible search query.

This is quite different from a relational database, where you search with SQL and the search goes out and rifles through the database to get the results. In a SQL database, putting the data into the database is (sometimes) quite fast and then reading or fetching it can be quite slow. In NoSQL or Big Data type databases much more time goes into adding the data, so that certain predefined queries can retrieve what they need instantly. This means the database has to be designed ahead of time to optimize these queries, and inserting the data then often takes much longer (because this is where the searching really happens).

Scale Out

Elastic Search is designed to scale out from the beginning; it automatically does most of the work of starting and creating clusters. It makes adding nodes for processing and storage really easy, so you can easily expand Elastic Search to handle your growing needs. This is why you find Elastic Search as the engine behind so many large Internet sites like GitHub, StumbleUpon and Stack Overflow. Certainly a big part of Elastic Search’s popularity is how easy it is to deploy, scale out, monitor and maintain. Certainly much easier than deploying something based on Hadoop.
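
As a small illustration of watching that growth, the cluster health endpoint reports how many nodes have joined. This sketch assumes a node reachable on localhost:9200; in older versions, nodes on the same network only needed the same cluster.name in elasticsearch.yml to discover each other, while newer versions also want a list of seed hosts:

# Minimal sketch: check how many nodes have joined the cluster as servers are added.
# Assumes a node reachable on localhost:9200.
import requests

health = requests.get("http://localhost:9200/_cluster/health").json()
print(health["cluster_name"], health["status"], "nodes:", health["number_of_nodes"])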

(Figure: Elastic Search cluster topologies)

Analyzers

When you index your data it’s fed through a set of analyzers which do things like convert everything to lower case, split sentences up into individual words, reduce words to their roots (walking -> walk, Steve’s -> Steve), deal with special characters, handle other language peculiarities, etc. Elastic Search has a large set of configurable analyzers so you can tune your search results based on knowledge of what you are searching.
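
You can see what an analyzer does to your text through the _analyze endpoint. A minimal sketch, assuming a local node and the built-in english analyzer (older versions take the parameters on the query string as shown here, newer ones expect a JSON body):

# Minimal sketch: ask Elastic Search how an analyzer breaks text into terms.
# The english analyzer lower-cases, tokenizes, drops stop words and stems,
# so this should print roughly: steve, dog, walk
import requests

resp = requests.get("http://localhost:9200/_analyze",
                    params={"analyzer": "english", "text": "Steve's dogs were walking"})
for token in resp.json()["tokens"]:
    print(token["token"])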

Fuzzy Search

One of the coolest features is fuzzy search. In this case you might not know exactly what you are searching for, or you might spell it wrong, and Elastic Search magically finds the correct values. When ranking the values, Elastic Search uses something called Levenshtein distance to rank which values give the best results. The real trick is how Elastic Search does this without going through the entire database computing and ranking this distance for everything. The answer is that it applies some sophisticated transformations to what you entered to limit the number of reads it needs to do to find matching terms; combined with the good analyzers above, this turns out to be extremely effective and very performant.
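
A minimal sketch of what such a query looks like, reusing the hypothetical orders index from above and deliberately misspelling the search term; the fuzziness parameter bounds the Levenshtein (edit) distance Elastic Search will tolerate:

# Minimal sketch: a fuzzy match query that tolerates spelling mistakes.
# Index and field names are illustrative.
import requests

query = {
    "query": {
        "match": {
            "customer": {
                "query": "acmee",      # misspelled on purpose
                "fuzziness": "AUTO"    # let Elastic Search pick an edit distance
            }
        }
    }
}
resp = requests.post("http://localhost:9200/orders/_search", json=query)
for hit in resp.json()["hits"]["hits"]:
    print(hit["_score"], hit["_source"])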

Real Time

Notice that since these search engines don’t search the data directly, they won’t be real time. The search database is only updated infrequently, so if data is being rapidly added to the real database, it won’t show up in these types of searches until the next update processes it. Often these synchronization updates are only performed once per day. You can tune them to be more frequent, and you can write code to insert critical data directly into the search database, but generally it’s easier to just let it be updated on a daily cycle.

Security

When searching enterprise databases there has to be some care applying security, ensuring the user has the rights to search through whatever they are searching. There have to be controls added so that the enterprise API can only search indexes that the user has the rights to see. Even if the search results don’t display anything sensitive, they could still leak information. For instance, if you don’t have rights to see a customer’s orders, but can search on them, then you could probably figure out how many orders are placed by each customer, which could be quite useful information.
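
A rough sketch of that kind of control, with a hypothetical rights table in the enterprise API layer deciding which indexes a user’s search is allowed to touch (the index names and rights lookup are made up for the example):

# Sketch of an enterprise search wrapper that only queries indexes the user may see.
# The rights mapping is hypothetical; the point is the restriction happens
# before the query ever reaches Elastic Search.
import requests

USER_INDEXES = {
    "alice": ["orders", "customers"],
    "bob": ["customers"],              # bob has no rights to see orders
}

def search_for_user(user, text):
    allowed = USER_INDEXES.get(user, [])
    if not allowed:
        return []
    # Elastic Search can search several indexes at once by listing them in the URL.
    url = "http://localhost:9200/%s/_search" % ",".join(allowed)
    query = {"query": {"query_string": {"query": text}}}
    resp = requests.post(url, json=query)
    return [hit["_source"] for hit in resp.json()["hits"]["hits"]]

print(search_for_user("bob", "blue widget"))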

Certainly when returning search results you wouldn’t reveal anything sensitive like, say, salaries; to get these you would need to click a link to drill down into the application’s UI where full security screening is done.

Summary

Elastic Search is a very powerful search technology that is quite easy to deploy and quite easy to integrate into existing systems. A lot of this is due to its powerful RESTful API for performing operations.

 


NoSQL for ERP?


Introduction

NoSQL databases are a class of database management systems that grew up with the Internet. Generally the term refers to any database system that doesn’t use SQL as its query language (though some say it means Not only SQL, meaning they support SQL and something else as well). This would make them any database system other than the standard relational databases. However, the term is usually associated with a class of database systems intended to serve large Internet companies. In this article we are only considering the large distributed databases and not the specialized in-memory databases. The goal of these is to be highly distributed and highly redundant. As usage grows you would be able to scale by just adding servers and have them add to the capacity of the system in a linear manner. You would also like your database to be distributed all over the world, so it can be accessed from anywhere with very low latency.

The main difference between traditional SQL databases and NoSQL databases comes down to the CAP Theorem, which says that of the following three properties a distributed data store can only guarantee two:

  • Consistency – roughly meaning that all clients of a data store get responses to requests that ‘make sense’. For example, if Client A writes 1 then 2 to location X, Client B cannot read 2 followed by 1.
  • Availability – all operations on a data store eventually return successfully. We say that a data store is ‘available’ for, e.g. write operations.
  • Partition tolerance – if the network stops delivering messages between two sets of servers, will the system continue to work correctly?

You have to choose two; you can’t have all three.

SQL databases typically choose consistency and partition tolerance, while NoSQL databases choose availability and partition tolerance.

Generally then, there has been a proliferation of NoSQL databases that all make slightly different tradeoffs within the CAP theorem. If they can’t have all three, does it make sense to have two-thirds of each, perhaps? There has been some great research that has produced amazing results, but it has also led to a bit of a Tower of Babel of database access languages.

Search Example

The classic example is Google’s BigTable database used to power the Google search engine. How is it that when you type a search into Google, you immediately get back a page of search results? How is it translating what you typed into some sort of database query, and how is that query operating so quickly? How is Google handling millions of these queries coming from all over the world? What is the database schema that allows this magic?

The structure of the database is a really simple key-value store. The key is every possible search term that anyone could type in. The value is the table of web sites (and descriptions) to return for that search term. The database is called BigTable because this value, consisting of all the search results, could have millions of hits for you to scroll through; there is no limit to the size of the value and there is no limit to the number of keys. Even so, an indexed binary tree read of such a table would still be quite slow, and this is where Google’s big advancement, an algorithm called MapReduce, lets these simple key reads happen so quickly. The BigTable database spreads this huge table over many servers and gains redundancy by keeping the data in many places at once. MapReduce is a parallel algorithm in that parts of the work can be done by multiple servers in parallel.
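
To make the key-value idea concrete, here is a toy Python sketch of a map step and a reduce step building an inverted index, after which a “search” is just a single key read; obviously nothing like Google’s scale, just the shape of the idea:

# Toy sketch: a map step emits (term, page) pairs, a reduce step groups them
# into the big key -> value table, and a "search" is then one key read.
from collections import defaultdict

pages = {
    "page1": "nosql databases scale out across many servers",
    "page2": "sql databases give consistency and transactions",
    "page3": "elastic search is built on lucene and scales out",
}

# Map: emit (term, page) pairs.
pairs = [(word, page) for page, text in pages.items() for word in text.split()]

# Reduce: group pairs by term into the inverted index.
index = defaultdict(set)
for word, page in pairs:
    index[word].add(page)

# A "search" is now a single key read, which is why it is so fast.
print(index["databases"])   # {'page1', 'page2'}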

Sound wonderful? Google published papers on MapReduce and BigTable, and there are open source implementations of them (Apache Hadoop and Apache HBase), so why doesn’t everyone use them? All the NoSQL databases provide a huge optimization on one of the things a SQL database does, but at the cost of making something else very slow, or eliminating something else entirely. In the case of BigTable, inserting and updating records in this table is very slow. But for Google search this is ok. Google operates a giant number of “web spiders” that are always out there traversing the web looking for new sites and updating the Google database of search results. These are running all the time and updating all the servers all the time. But these updates and inserts might take hours or days to happen, especially if you count the amount of time to propagate over all the thousands of servers involved. But Google doesn’t mind, as long as it happens and is eventually consistent.

ERP Database Usage

Generally, ERP packages rely very heavily on the tenets of a SQL type relational database. They require database transactions and consistency (often referred to as ACID). ERP packages require general SQL type queries. They need good performance writing transactions to the database as well as reading and reporting from the database. I blogged a bit about Sage 300 ERP’s database usage here.

The NoSQL camp would contest these statements. For instance, if I insert an Order, is it necessary that I see it when I immediately list the Orders, or is it ok if it shows up in that list, say, ten minutes later? Or do ERP packages believe they need all the benefits of SQL without weighing them against the benefits of alternative systems? Often customers will prefer speed over customizability or functionality.

I like the way in SQL you can hit post and get back a response that a transaction is either all committed or all rolled back. However, from working with SQL I also know this comes at a huge cost, especially in our ability to scale out the database, and it is a definite source of multi-user contention. For a large scale system, I’m really questioning the locking requirements that large transactions, like entering an order, place on the system. Remember the “availability” part of NoSQL databases will ensure the Order is entered eventually.
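
Here is a minimal sketch of that all-or-nothing behaviour, using SQLite purely for illustration (the table names are made up): either the order header and all its lines commit together, or the whole thing rolls back.

# Minimal sketch: one transaction covering an order header and its lines.
# A bad line makes the whole order roll back, not just the bad statement.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.isolation_level = None  # autocommit; we manage the transaction explicitly
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT)")
conn.execute("CREATE TABLE order_lines (order_id INTEGER, item TEXT, qty INTEGER CHECK (qty > 0))")

conn.execute("BEGIN")
try:
    conn.execute("INSERT INTO orders VALUES (1, 'ACME Corp')")
    conn.execute("INSERT INTO order_lines VALUES (1, 'Blue Widget', 12)")
    conn.execute("INSERT INTO order_lines VALUES (1, 'Red Widget', -5)")  # violates the CHECK
    conn.execute("COMMIT")
except sqlite3.Error:
    conn.execute("ROLLBACK")  # the whole order disappears, not just the bad line

print(conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0])  # prints 0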

Another problem is ad hoc SQL queries. ERP packages tend to do reporting and data inquiries using ad hoc SQL queries. Usually this is a no-no in the NoSQL world, where all queries must be designed ahead of time using fairly complicated MapReduce jobs. The question is whether this is really necessary. If you have a really powerful Google-like search ability built into the database, then do you really need to search the data with SQL queries? Is it that we just do this because we know how, and that in fact real users would much rather use the Google method than the SQL method?

Hybrid Approach

Suppose you want the data query ability of a NoSQL database, but still want SQL transaction integrity and consistency? One approach has been hybrid systems. Think of these as two databases in one, where both a SQL and a NoSQL database engine operate on your data. Perhaps a cleaner solution is to take a data warehousing approach: create your data in a SQL database, but then archive it to a NoSQL database for various querying and analysis. This is the foundation of many cloud based BI solutions that are out there. With this approach you get some of the benefits of both. What you lack is the high availability and scalability of the database for data entry and regular SQL type queries. But still, this can be a way to get a bit of the best of both worlds.
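
A rough sketch of that data warehousing approach, with the SQL database as the system of record and rows periodically copied into an Elastic Search index for querying; the table, index and connection details here are all illustrative:

# Sketch: periodically copy rows from the transactional SQL database into an
# Elastic Search index that handles the Google-like querying and analysis.
import sqlite3
import requests

conn = sqlite3.connect("erp.db")          # the transactional SQL database (illustrative)
base = "http://localhost:9200"            # the search/analytics side

for row in conn.execute("SELECT id, customer, item, quantity FROM orders"):
    doc = {"customer": row[1], "item": row[2], "quantity": row[3]}
    # Using the row id as the document id makes re-runs overwrite rather than duplicate.
    requests.put("%s/orders_archive/order/%d" % (base, row[0]), json=doc)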

Summary

I think it will be a while longer before NoSQL databases gain acceptance in the ERP world. However, the amount of innovation going on in the NoSQL world is huge and great strides are being taken. I suspect we will start to see NoSQL used for non-traditional BI via data warehouses. Even adding Google-like search to ERP usually involves adding a NoSQL database on the side. I think once this starts to happen we will see more and more functionality move to the NoSQL side of things, and then perhaps down the road somewhere we will be able to remove the SQL database. But the jury is still out as to whether this is a good long term goal.

Written by smist08

January 28, 2012 at 5:20 pm