Stephen Smith's Blog

Musings on Machine Learning…

NoSQL for ERP?

with 11 comments


Introduction

NoSQL databases are a class of database management systems that grew up with the Internet. Generally these refer to any database system that doesn’t use SQL as its query language (though some call them Not only SQL, meaning they support SQL and something else as well). This would make them any database system other than the standard databases. However these are usually associated with a class of database system that is intended to server large Internet companies. In this article we are only considering the large distributed databases and not the specialized in-memory databases. The goal of these is to be highly distributed and highly redundant. As a database scales you would be able to scale usage by just adding servers and have them add to the capacity of the system in a linear manner. You would like to have your database distributed all over the world, so it can be accessed from anywhere with a very low latency.

The main difference between traditional SQL databases and NoSQL databases comes down to the CAP Theorem which says out of the three things:

  • Consistency – roughly meaning that all clients of a data store get responses to requests that ‘make sense’. For example, if Client A writes 1 then 2 to location X, Client B cannot read 2 followed by 1.
  • Availability – all operations on a data store eventually return successfully. We say that a data store is ‘available’ for, e.g. write operations.
  • Partition tolerance – if the network stops delivering messages between two sets of servers, will the system continue to work correctly?

You have to choose two, you can’t have all three.

Then SQL databases choose consistency and partition tolerance and NoSQL databases choose availability and partition tolerance. The CAP Theorem is shown graphically below:

Generally then there has been a proliferation of NoSQL databases that all make a slightly different tradeoffs for the CAP theorem. If they can’t have all three does it make sense to have 2/3 of each perhaps? Generally there has been some great research that has produced amazing results, but this has led to a bit of a tower of Babel of database access languages.

Search Example

The classic example is Google’s BigData database used to power the Google search engine. How is it that when you type a search into Google, you immediately get back a page of search results instantly? How is it translating what you typed into some sort of database query and how is that query operating so quickly? How is Google handling millions of these queries coming from all over the world so quickly? What is the database schema that allows this magic?

The structure of the database is a really simple key-value store. The key is every possible search term that anyone could type in. The value is the table of web sites (and descriptions) to return for this search term. The database is called BigData because this value consisting of all the search results could have millions of hits for you to scroll through. There is no limit to the size of the value and there is no limit to the number of keys. Even so an indexed binary tree read of such a table would still be quite slow and this is Google’s big advancement with an algorithm called MapReduce that lets these simple key reads happen so quickly. The BigTable database spreads this huge table over many servers to operate and it has redundancy of having the database in many places at once. The MapReduce algorithm is a parallel algorithm in that parts of the search can be done by multiple servers in parallel.

Sound wonderful? Google published the MapReduce algorithm and there is an open source implementation of it called Apache HBase, so why doesn’t everyone use it? All the NoSQL databases provide a huge optimization on one of the things a SQL database does, but at a cost of making something else very slow, or eliminating something else entirely. In the case of BigTable, inserting and updating records in this table is very slow. But for Google search this is ok. Google operates a giant number of “web spiders” that are always out there traversing the web looking for new sites and updating the Google database of search results. These are running all the time and updating all the servers all the time. But these updates and inserts might take hours or days to happen, especially if you count the amount of time to find their way over all the thousands of servers involved. But Google doesn’t mind as long as it happens and is eventually consistent.

ERP Database Usage

Generally ERP packages rely very heavily on the tenants of a SQL type relational database. They require database transactioning and consistency (often referred to as ACID). ERP packages require general SQL type queries. They need good performance writing transactions to the database as well as reading and reporting from the database. I blogged a bit about Sage 300 ERP’s database usage here.

The NoSQL camp would contest these statements. For instance if I insert an Order, is it necessary that if I immediately list the Orders that I see it, or is it ok if it shows up in that list, say ten minutes later? Or do ERP packages believe they need all the benefits of SQL without weighing them against the other benefits of alternative systems? Often customers will prefer speed over customizability or functionality.

I like the way in SQL you can hit post and get back a response that a transaction is either all committed or all rolled back. However from working with SQL, I also know this is a huge cost, especially in our ability to scale out the database and is a definite limit on multi-user contention. For a large scale system, I’m really questioning the locking requirements that large transactions like entering an order place on the system.  Remember the “availability” part of NoSQL databases will ensure the Order is entered eventually.

Another problem is adhoc SQL queries. ERP packages tend to like to do reporting and data inquiries using adhoc SQL queries. Usually this is a no no in the NoSQL world. Often in the NoSQL world, all queries must be designed ahead of time using very complicated MapReduce formula. The question is whether this is really necessary. If you have a really powerful Google like search ability built into the database, then do you really need to search the data with SQL queries? Is it that we just do this because we know how and that in fact real users would much rather use the Google method than the SQL method?

Hybrid Approach

Suppose you want the data query ability of a NoSQL database, but still want the SQL transaction integrity and consistency? One approach has been hybrid systems. Think of these as two databases in one where both a SQL and NoSQL database engine operate on your data. Perhaps a cleaner solution is to take a data warehousing approach and create your data in a SQL database, but then archive it to a NoSQL database for various querying and analysis. This is the foundation of many cloud based BI solutions that are out there. With this approach you get some of the benefits of both. What you lack is the high availability and scalability of the database for data entry and regular SQL type queries. But still this can be a way to get a bit of the best of both worlds.

Summary

I think it will be a while longer before NoSQL databases gain acceptance in the ERP world. However the amount of innovation going on in the NoSQL world is huge and great strides are being taken. I suspect we will start to see NoSQL begin to be used for non-traditional BI via data warehouses. Even adding Google like search to ERP usually involves adding a NoSQL database on the side. I think once this starts to happen then we will see more and more functionality move to the NoSQL side of things and then perhaps down the road somewhere we will be able to remove the SQL database. But the jury is still our as to whether this is a good long term goal.

Written by smist08

January 28, 2012 at 5:20 pm

11 Responses

Subscribe to comments with RSS.

  1. Great MaybeNoSQL post touching on many considerations for ERP.

    Diving further into adhoc SQL queries and ERP vs using Google search. The productivity is too rich to not pursue. Consider how Google (and Bing) have moved to incremental indexing (use the search on forecast inquiries on part #1000 to seed the indexing) via to Percolator and Dremel cite @mlmilleratmit http://bit.ly/wvxIks

    This approach ties back to the Hybrid Approach “thinking of two databases in one” with “both a SQL and NoSQL database engine”. This pattern is essential the architecture of Force.com. Designed for business developers by using inserts to trigger reindexing of ERP important data such as order totals and financial balances.

    Bringing this back to innovation, I’ve encountered very entrepreneurial companies using Sage for small factories internationally, then using some ETL process to build a consolidated planning forecast and financials. This approach usually leads to purchasing a larger system, often after the IT architect leaves.

    Michael Rys @sqlservermike illustrates a different architecture approach using SQL Azure to gain scalability by relaxing the cap theorem http://littlebigpass

    Common to the new approaches seems to be moving the on-premise databases into the Cloud to remove the latency necessary to relax the cap theorem. Perhaps if we had NoSQL ERP databases on premise we could percolate updates to the BI layer in the Cloud, or its simpler to architect ERP for the Cloud.

    clive boulton

    February 1, 2012 at 7:18 pm

    • Thanks, those are all really good points.

      Cheers

      smist08

      February 1, 2012 at 8:24 pm

  2. […] starts to put us into the world of “Big Data”. I blogged about Big Data and ERP here. Currently we are gathering all the data into SQL Server, but this is already strained with only a […]

  3. Please check out Bangdb (http://www.iqlect.com) the new key value store which seems to be very fast in terms of IOPS for both read and write. The Bangdb will be in many flavors for ex; embedded, network, elastic cache/imdg. Being crash proof, with many configuration parameters, it can be tuned to operate in most suitable fashion for a given requirement.

    cloud-dev

    June 22, 2012 at 7:22 pm

  4. Stephen — Bangdb brings to mind was / is Btrieve NoSQL for ERP?

    clive boulton

    June 22, 2012 at 10:55 pm

    • I guess it is in that it is not SQL. But Btrieve is like SQL in that it gives you consistency but not availability and partition tolerance.

      smist08

      June 22, 2012 at 11:29 pm

      • Update the sands are shifting:

        SAP HANA in memory database continues on with a SQL, yet the default storage and query optimizer is columnar, though row based is also supported. Columnar tables for ERP make lots of sense as we shift into smaller mobile devices or wearable computing like Google Glass.

        Perhaps not for SMB, but for mid-market ERP and really BigERP with multisite architectures NoSQL ERP properly using distributed databases could really simplify and add performance.

        This blog post from Adron with Basho on the Riak team gets into the use case. http://compositecode.com/2012/12/19/cloud-computing-and-distributed-computing-something-is-broken/#comments

        boulton clive

        December 21, 2012 at 6:59 am

      • Thanks for the link, that’s a good blog post. Certainly interesting now that innovation in databases is exploding again and we aren’t just doing slight minor tweaks to SQL databases anymore.

        smist08

        December 21, 2012 at 3:59 pm

  5. You might want to check out the JugglingDB ORM http://jsdoc.info/1602/jugglingdb/ for both SQL & noSQL adapter capabilities (Built for RailwayJS but could be used readily with any node.js setup).

    For example, you could do:

    schema ‘redis’, url: process.env.REDISTOGO_URL, ->
    define ‘Lead’
    # other definitions for redis schema

    schema ‘sqlite3’, url: process.env.SQLLITE3_URL, ->
    define ‘Transaction’
    # other definitions for mongoose schema

    I love the idea, by the way!

    Richard

    July 16, 2012 at 4:51 pm

    • oh thats embarrassing, I meant to suggest mongo instead of redis! redis would not be the best of places to store leads

      Richard

      July 16, 2012 at 4:52 pm

  6. […] In the cloud, it’s quite easy to create virtual machines to run computations for you. It’s also quite easy to take advantage of various Big Data databases for storing large amounts of data (these are often referred to as NoSQL databases). […]


Leave a comment

This site uses Akismet to reduce spam. Learn how your comment data is processed.