NoSQL for ERP?
NoSQL databases are a class of database management systems that grew up with the Internet. Generally these refer to any database system that doesn’t use SQL as its query language (though some call them Not only SQL, meaning they support SQL and something else as well). This would make them any database system other than the standard databases. However these are usually associated with a class of database system that is intended to server large Internet companies. In this article we are only considering the large distributed databases and not the specialized in-memory databases. The goal of these is to be highly distributed and highly redundant. As a database scales you would be able to scale usage by just adding servers and have them add to the capacity of the system in a linear manner. You would like to have your database distributed all over the world, so it can be accessed from anywhere with a very low latency.
The main difference between traditional SQL databases and NoSQL databases comes down to the CAP Theorem which says out of the three things:
- Consistency – roughly meaning that all clients of a data store get responses to requests that ‘make sense’. For example, if Client A writes 1 then 2 to location X, Client B cannot read 2 followed by 1.
- Availability – all operations on a data store eventually return successfully. We say that a data store is ‘available’ for, e.g. write operations.
- Partition tolerance – if the network stops delivering messages between two sets of servers, will the system continue to work correctly?
You have to choose two, you can’t have all three.
Then SQL databases choose consistency and partition tolerance and NoSQL databases choose availability and partition tolerance. The CAP Theorem is shown graphically below:
Generally then there has been a proliferation of NoSQL databases that all make a slightly different tradeoffs for the CAP theorem. If they can’t have all three does it make sense to have 2/3 of each perhaps? Generally there has been some great research that has produced amazing results, but this has led to a bit of a tower of Babel of database access languages.
The classic example is Google’s BigData database used to power the Google search engine. How is it that when you type a search into Google, you immediately get back a page of search results instantly? How is it translating what you typed into some sort of database query and how is that query operating so quickly? How is Google handling millions of these queries coming from all over the world so quickly? What is the database schema that allows this magic?
The structure of the database is a really simple key-value store. The key is every possible search term that anyone could type in. The value is the table of web sites (and descriptions) to return for this search term. The database is called BigData because this value consisting of all the search results could have millions of hits for you to scroll through. There is no limit to the size of the value and there is no limit to the number of keys. Even so an indexed binary tree read of such a table would still be quite slow and this is Google’s big advancement with an algorithm called MapReduce that lets these simple key reads happen so quickly. The BigTable database spreads this huge table over many servers to operate and it has redundancy of having the database in many places at once. The MapReduce algorithm is a parallel algorithm in that parts of the search can be done by multiple servers in parallel.
Sound wonderful? Google published the MapReduce algorithm and there is an open source implementation of it called Apache HBase, so why doesn’t everyone use it? All the NoSQL databases provide a huge optimization on one of the things a SQL database does, but at a cost of making something else very slow, or eliminating something else entirely. In the case of BigTable, inserting and updating records in this table is very slow. But for Google search this is ok. Google operates a giant number of “web spiders” that are always out there traversing the web looking for new sites and updating the Google database of search results. These are running all the time and updating all the servers all the time. But these updates and inserts might take hours or days to happen, especially if you count the amount of time to find their way over all the thousands of servers involved. But Google doesn’t mind as long as it happens and is eventually consistent.
ERP Database Usage
Generally ERP packages rely very heavily on the tenants of a SQL type relational database. They require database transactioning and consistency (often referred to as ACID). ERP packages require general SQL type queries. They need good performance writing transactions to the database as well as reading and reporting from the database. I blogged a bit about Sage 300 ERP’s database usage here.
The NoSQL camp would contest these statements. For instance if I insert an Order, is it necessary that if I immediately list the Orders that I see it, or is it ok if it shows up in that list, say ten minutes later? Or do ERP packages believe they need all the benefits of SQL without weighing them against the other benefits of alternative systems? Often customers will prefer speed over customizability or functionality.
I like the way in SQL you can hit post and get back a response that a transaction is either all committed or all rolled back. However from working with SQL, I also know this is a huge cost, especially in our ability to scale out the database and is a definite limit on multi-user contention. For a large scale system, I’m really questioning the locking requirements that large transactions like entering an order place on the system. Remember the “availability” part of NoSQL databases will ensure the Order is entered eventually.
Another problem is adhoc SQL queries. ERP packages tend to like to do reporting and data inquiries using adhoc SQL queries. Usually this is a no no in the NoSQL world. Often in the NoSQL world, all queries must be designed ahead of time using very complicated MapReduce formula. The question is whether this is really necessary. If you have a really powerful Google like search ability built into the database, then do you really need to search the data with SQL queries? Is it that we just do this because we know how and that in fact real users would much rather use the Google method than the SQL method?
Suppose you want the data query ability of a NoSQL database, but still want the SQL transaction integrity and consistency? One approach has been hybrid systems. Think of these as two databases in one where both a SQL and NoSQL database engine operate on your data. Perhaps a cleaner solution is to take a data warehousing approach and create your data in a SQL database, but then archive it to a NoSQL database for various querying and analysis. This is the foundation of many cloud based BI solutions that are out there. With this approach you get some of the benefits of both. What you lack is the high availability and scalability of the database for data entry and regular SQL type queries. But still this can be a way to get a bit of the best of both worlds.
I think it will be a while longer before NoSQL databases gain acceptance in the ERP world. However the amount of innovation going on in the NoSQL world is huge and great strides are being taken. I suspect we will start to see NoSQL begin to be used for non-traditional BI via data warehouses. Even adding Google like search to ERP usually involves adding a NoSQL database on the side. I think once this starts to happen then we will see more and more functionality move to the NoSQL side of things and then perhaps down the road somewhere we will be able to remove the SQL database. But the jury is still our as to whether this is a good long term goal.