Stephen Smith's Blog

Musings on Machine Learning…


On Calculating Dashboards


Introduction

Most modern business applications have some sort of dashboard that displays a number of KPIs when you first sign in. For instance, here are a couple of KPIs from Sage 300 ERP:

[Screenshot: Sage 300 Portal KPIs]

To be useful, these KPIs can involve quite sophisticated calculations to display relevant information. However, users need their home page to start extremely quickly so they can get on with their work. This article describes various techniques to calculate and present this information quickly, starting with easy, straightforward approaches and progressing to more sophisticated methods that utilize the power of the cloud.

Simple Approach

The simplest way to program such a KPI is to leverage any existing calculations (or business logic) in the application and use that to retrieve the data. In the case of Sage 300 ERP this involves using the business logic Views which we’ve discussed in quite a few blog posts.

This usually gives a quick way to get something working, but often doesn’t exactly match what is required or is a bit slow to display.

Optimized Approach

Last week, we looked a bit at using the Sage 300 ERP .Net API to do a general SQL query, which can be used to optimize calculating a KPI. In this case you construct a SQL statement that does exactly what you need and optimize it nicely in SQL Management Studio. In some cases this will be much faster than going through the Sage 300 Views; in other cases it won't be, if the business logic already performs an equivalent query.
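
As a rough illustration, here is a minimal Python sketch of the direct query approach. The DSN, table and column names are invented stand-ins, not actual Sage 300 schema:

# Hedged sketch: compute a "sales this month" KPI with one hand-tuned SQL
# query. The DSN, table and column names below are hypothetical.
import pyodbc

conn = pyodbc.connect("DSN=Sage300;UID=user;PWD=secret")   # hypothetical DSN
cursor = conn.cursor()
cursor.execute(
    """
    SELECT SUM(InvoiceTotal)          -- hypothetical column
    FROM ARInvoices                   -- hypothetical table
    WHERE InvoiceDate >= ?            -- parameterized so the plan is reused
    """,
    ("2015-02-01",))
row = cursor.fetchone()
sales_this_month = row[0] or 0        # SUM returns NULL when no rows match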

Incremental Approach

Often KPIs are just sums or consolidations of lots of data. You could maintain the KPIs as you generate the data, so for each new batch posted, the KPI values are stored in another table and incrementally updated. Often KPIs are generated from statistics that are maintained as other operations run. This is a good optimization approach, but it lacks flexibility, since to customize it you need to change the business logic. Plus, the more data that needs to be updated during posting, the slower posting becomes, which annoys the person doing the posting.
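
A minimal sketch of the incremental idea, assuming a hypothetical KpiTotals table and a database cursor: each posted batch folds its total into a one-row-per-KPI table, so the dashboard only ever reads a single row.

# Hedged sketch: incrementally roll a posted batch's total into a small KPI
# table. KpiTotals and its columns are hypothetical.
def update_kpi_on_post(cursor, kpi_name, batch_total):
    cursor.execute(
        "UPDATE KpiTotals SET kpi_value = kpi_value + ? WHERE kpi_name = ?",
        (batch_total, kpi_name))
    if cursor.rowcount == 0:          # first posting for this KPI
        cursor.execute(
            "INSERT INTO KpiTotals (kpi_name, kpi_value) VALUES (?, ?)",
            (kpi_name, batch_total))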

Caching

As a next step you could cache the calculated values. If the user has already executed a KPI once today, its value is cached, so if they exit the program and then re-enter it, the KPIs can be drawn quickly by retrieving the values from the cache, with no SQL or other calculations required.

For a web application like the Sage 300 Portal, rather than caching the data retrieved from the database or calculated, usually you would cache the JSON or XML returned from the web service call that asked for the data. Then when the web page for the KPI makes a request to the server, the cache just hands back the data to return to the browser; no formatting, calculation or anything else is required.

Often a cache that lasts one day is good enough. There can be a manual refresh button to force a recalculation, but mostly the user just waits for the calculation once a day and after that everything is instant.
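
A minimal sketch of this caching pattern in Python, where the in-memory dictionary stands in for whatever cache a real portal would use (memcached, Redis, etc.) and calculate is whatever slow query or business logic produces the KPI:

# Hedged sketch: return cached KPI JSON if it is less than a day old,
# otherwise run the slow calculation once and cache the result.
import json
import time

CACHE = {}                     # stands in for memcached/Redis/etc.
ONE_DAY = 24 * 60 * 60

def get_kpi_json(user_id, settings, calculate):
    key = (user_id, json.dumps(settings, sort_keys=True))
    entry = CACHE.get(key)
    if entry and time.time() - entry["when"] < ONE_DAY:
        return entry["json"]               # instant: no SQL, no formatting
    result = calculate(settings)           # the slow path, once per day
    CACHE[key] = {"json": json.dumps(result), "when": time.time()}
    return CACHE[key]["json"]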

The Cloud

In the cloud, it’s quite easy to create virtual machines to run computations for you. It’s also quite easy to take advantage of various Big Data databases for storing large amounts of data (these are often referred to as NoSQL databases).

Cloud Approach

Cloud applications usually don't calculate things when you ask for them. For instance, when you do a Google search, it doesn't really search anything; it looks up your search in a Big Data database, basically doing a database read that returns the HTML to display. The searching is actually done in the background by agents (or spiders) that are always running, searching the web and adding the data to the Big Data database.

In the cloud it's pretty common to have lots of running processes that are just calculating things on the off chance someone will ask for them.

So in the above example there could be a process running at night that checks each user's KPI settings and performs the calculation, putting the data in the cache, so that the user gets the data instantly first thing in the morning and, unless they hit the manual refresh button, never waits for any calculations to be performed.
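
A sketch of such a nightly agent, continuing the caching idea above. The iterable of saved settings and the calculation function are assumptions, passed in rather than invented:

# Hedged sketch: warm the KPI cache overnight so morning sign-ins are instant.
import json
import time

CACHE = {}   # stands in for memcached/Redis/etc., as in the earlier sketch

def nightly_precalculate(all_user_settings, calculate_kpi):
    # all_user_settings: iterable of (user_id, settings dict) pairs
    # calculate_kpi: the expensive SQL/business logic calculation
    for user_id, settings in all_user_settings:
        key = (user_id, json.dumps(settings, sort_keys=True))
        CACHE[key] = {"json": json.dumps(calculate_kpi(settings)),
                      "when": time.time()}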

That helps things quite a bit, but the user still needs to wait for a SQL query or calculation if they change the settings for their KPI or hit the manual refresh button. A sample KPI configuration screen from Sage 300 is:

[Screenshot: Sage 300 Portal KPI configuration]

As you can see from this example there are quite a few different configuration options, but in some sense not a truly ridiculous number.

I've mentioned "Big Data" a few times in this article, but so far all we've talked about is caching a few bits of data, and the number of values being cached won't be very large. Now suppose we calculate all possible values for this setup screen, use the distributed computing power of the cloud to do the calculations, and then store all the possibilities in a "Big Data" database. This is much larger than what we talked about previously, but we are barely scratching the surface of what these databases are meant to handle.
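
To get a feel for the combinatorics, here is a sketch that enumerates every possible configuration so each one can be pre-calculated. The option names and values are invented for illustration, not the actual Sage 300 settings:

# Hedged sketch: enumerate all KPI configurations ahead of time. The options
# below are invented; the real configuration screen has its own settings.
import itertools

OPTIONS = {
    "period":   ["7 days", "30 days", "90 days", "fiscal year"],
    "currency": ["functional", "source"],
    "chart":    ["bar", "line", "pie"],
}

def all_configurations():
    keys = sorted(OPTIONS)
    for values in itertools.product(*(OPTIONS[k] for k in keys)):
        yield dict(zip(keys, values))

print(sum(1 for _ in all_configurations()))   # -> 4 * 2 * 3 = 24 combinations

Even millions of such combinations across all tenants is small change for a Big Data database.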

We are then using the core functionality of the Big Data database: we do reads based on the inputs and return the JSON (or XML or HTML) to display in the widget. As our cloud grows and we add more and more customers, the number of these will increase greatly, but the Big Data database will just scale out, using more and more servers to perform the work based on the current workload.

Then you can let these run all the time, so the values keep getting updated, and even the refresh button (if you bother to keep it) will just get a new value from the Big Data cache. A SQL query or other calculation is never triggered by a user action.

This is the spider/read model. Another would be to sync the application's SQL database to a NoSQL database that then calculates the KPIs using MapReduce jobs. This approach tends to be quite inflexible, but it can work if the sync'ing and transformation of the database solves a large number of queries at once. Creating such a database in a manner that makes all the MapReduce queries run fast is a rather nontrivial undertaking, and it runs the risk that in the end the MapReduce jobs take too long to calculate. The two methods can also be combined: phase one syncs into the NoSQL database, then the spider processes calculate the caches, doing the KPI calculations as MapReduce jobs.
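
As a toy illustration of the MapReduce shape of such a KPI calculation, here is a single-process "sales by customer" sum over synced order documents; a real system would of course run the map and reduce phases distributed across many machines:

# Hedged sketch: the map/reduce pattern in miniature, for a sales-by-customer KPI.
from collections import defaultdict

def map_phase(order):
    yield (order["customer"], order["total"])    # emit (key, value) pairs

def reduce_phase(key, values):
    return key, sum(values)                      # fold all values for a key

def map_reduce(orders):
    grouped = defaultdict(list)                  # the "shuffle" step
    for order in orders:
        for key, value in map_phase(order):
            grouped[key].append(value)
    return dict(reduce_phase(k, v) for k, v in grouped.items())

print(map_reduce([{"customer": "ACME", "total": 100.0},
                  {"customer": "ACME", "total": 50.0},
                  {"customer": "Globex", "total": 75.0}]))
# -> {'ACME': 150.0, 'Globex': 75.0}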

This is all a lot of work and a lot of setup, but once in the cloud the customer doesn't need to worry about any of this, just the vendor, and with modern PaaS deployments this can all be automated and scaled easily once it's set up correctly (which is a fair amount of work).

Summary

There are lots of techniques to produce and calculate business KPIs quickly. All these techniques are great, but if you have a cloud solution and you want its opening page to display in less than a second, you need more. This is where the power of the cloud can come in to pre-calculate everything so you never need to wait.

Written by smist08

February 14, 2015 at 7:25 pm

Google Mobile Trends


Introduction

In a previous blog post I talked about Apple's mobile directions following their annual developer's conference. This past week Google held their annual I/O developers conference in San Francisco, so it seems like a good time to comment on Google's mobile trends. There is a lot of similarity between Apple's and Google's directions, but there are also differences, since each company has a slightly different view on what direction to take. Both companies are very large and have their hands in a lot of pies. As a consequence their announcements tend to be quite diverse, and it's interesting to see how they try to unify them all.

New Android

Just as Apple announced a new version of iOS, Google announced a new version of Android, namely "L". I don't know what happened to the cute code names like Kit Kat or Ice Cream Sandwich, but "L" it is. One big feature they've added is 64 bit support. Apple recently introduced 64 bit processors in their latest phones and now Google devices can catch up. This means that most new higher end phones are now all going to sport 64 bit quad core processors, which is an amazing amount of computing power in such tiny devices.

As with most new operating systems these days, it includes a new UI look. In the new Android this look is called Material Design, and it follows the fashion of a flatter, more austere look to things.

There are many other new features in the new version of Android which will hopefully make people more productive.

Wearables Everywhere

A big trend in rumored and announced, but generally not yet shipping, products is wearables. Google leads the way in these with Google Glass. Then there are the endless smart watch announcements, rumors and even the odd actual product.

[Image: smartwatch]

I have a Garmin GPS watch with a heart rate monitor. Combined with the website where I upload the data, this is a great device for tracking my cycling and running. It would be nice if its battery lasted longer and if it were more waterproof so I could swim with it. In the meantime there are many great apps that do the same sort of things with your phone. These all do more than the Garmin watch, but the phone is bulkier and less durable in damp environments.

Having a waterproof watch that can do more than my Garmin and has a longer battery life would be fantastic.

Although this is all in the early stages, both Google and Apple see the fitness market for metrics and tracking as a huge potential market. Both companies are partnering and developing new sensors to measure new things, like small waterproof sensors for swimming, or unique ways to measure other sports, such as golf swings. Similarly they are measuring all sorts of biometric data beyond heart rate, including blood pressure, blood glucose levels and all sorts of other things. Eventually these will morph into a Star Trek like medical tricorder.

Home Automation

Google has now purchased a number of home automation companies, including Nest, and is integrating these with Android to provide full control of everything in your home from your phone: remotely setting the thermostat, receiving smoke detector alerts and monitoring security cameras. Most of these things are available now as separate discrete components, but Google is working especially hard to make this whole area much more unified.

[Image: Nest thermostat]

Online Cars

Another big area of interest is integrating into cars. Already most cars can interface with iPhones and Android phones to make calls hands free and play music. Now the goal is to sell Android (and iOS) into the auto industry, to get better, more connected GPS (with real time traffic updates) and access to all your music library. Further, many car companies are enabling using your car as a Wi-Fi hotspot.

I'm not sure how far this should go, since it all gets very distracting. We already have so many potential distractions in cars, and just things like texting are causing many accidents.

Everything in the Cloud

With all Google’s products, the emphasis is storing all data in the cloud. They will only store things on local devices if there is a huge outcry of people that need to work offline (like on airplanes). Chromebooks really showed that this was possible and Google has led the way in offering lots of free cloud storage and making sure everything they do will interact seamlessly with these cloud documents.

They tout the convenience of this: things are always backed up, so if your laptop is stolen or destroyed you don't lose anything. However, critics worry about privacy concerns with storing sensitive data under someone else's control, especially a search provider's. It's rather scary to corporate compliance officers that sensitive corporate documents might start showing up in people's search results. Often this wouldn't be due to Google doing something malicious, but rather to someone misclicking the visibility of a document so it can be viewed by anyone, and by anyone this means anyone on the Internet (not just anyone that finds it on the corporate network).

All that being said, Google, Apple and Microsoft are all pushing this model like mad, and a lot of innovations that are in the pipeline completely rely on the adoption of this philosophy. It certainly is convenient to have all your photos and videos automatically uploaded and not to have to worry about sync’ing things by hand anymore.

Big Data

Google really started the big data revolution when they published the details of the MapReduce algorithm that the original Google search was built upon. This then spawned a huge industry of open source tools and databases all built around this algorithm. MapReduce was revolutionary since it let people get instant results when searching for anything. Basically it worked by having a database of everything people might search for, like a giant cache that could return results instantly.

The limitation of MapReduce is that constructing queries is quite difficult and often requires the database to be rebuilt in a different way. If you don't do that, then although the main query the database was designed to solve returns instantly, any other query takes a week to process.

Google is now claiming that MapReduce, and all the industry tools like Hadoop based on it, are completely obsolete. They were heavily promoting their new Cloud Dataflow service, where they claim the service can do efficient real time analytics as well as preserve the performance of the main functionality.

It will be interesting to see what this new service can really do and whether it will really threaten all the various NoSQL databases like MongoDB.

Summary

There are a lot of interesting things going on in the mobile world. It will be interesting to see whether all our phones are replaced by watches or glasses in a couple of years, and what great things come of all these new cloud big data services.

Written by smist08

June 28, 2014 at 5:28 pm

Elastic Search


Introduction

We've been working on an interesting POC recently that involved Google-like search. After evaluating a few alternatives we chose Elastic Search. Search is an interesting topic, often associated with Big Data, NoSQL and all sorts of other interesting technologies. In this article I'm going to look at a few interesting aspects of Elastic Search and how you might use it.

[Image: Elastic Search logo]

Elastic Search is an open source search product based on Apache Lucene. It's all written in Java, and installing it is just a matter of copying it to a directory; as long as you already have Java installed, it's then ready to go. An alternative to Elastic Search is Apache Solr, which is also based on Lucene. Both are quite good, but we preferred Elastic Search since it seemed to have added quite a bit of functionality beyond what Solr offers.

Elastic Search

Elastic Search is basically a way of searching JSON documents. It has a RESTful API, and you use this to add JSON documents to be indexed and searched. Sounds simple, but how is this helpful for business applications? There is a plugin that allows you to connect to databases via JDBC and to set these up to be imported and indexed on a regular schedule. This allows you to perform Google-like searches on your database, only it isn't really searching your database, it's searching the data it extracted from your database.
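
To make the RESTful API concrete, here is a minimal Python sketch using the requests library against a local node on the default port 9200. The index name and document fields are invented, and the _doc endpoint assumes a reasonably recent Elastic Search version:

# Hedged sketch: add a JSON document and search it back over the REST API.
import requests

# Index a document (refresh=true makes it searchable immediately).
requests.post("http://localhost:9200/customers/_doc",
              params={"refresh": "true"},
              json={"name": "ACME Industries", "city": "Vancouver"})

# Full text search for it.
r = requests.post("http://localhost:9200/customers/_search",
                  json={"query": {"match": {"name": "acme"}}})
for hit in r.json()["hits"]["hits"]:
    print(hit["_score"], hit["_source"])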

Web Searches

This is similar to how web search services like Google work. Google dispatches thousands of web crawlers that spend their time crawling around the Internet and adding anything they find to Google's search database. This is the real search process. When you do a search from Google's web site it really just does a read on its database (it actually breaks up what you are searching for and does a bunch of reads). The bottom line is that when you search, it is really just doing a direct read on its database, and this is why it's so fast. You can see why this is Big Data, since this database contains the results for every possible search query.

This is quite different from a relational database, where you search with SQL and the search goes out and rifles through the database to get the results. In a SQL database putting the data into the database is quite fast (sometimes) and then reading or fetching it can be quite slow. In NoSQL or Big Data type databases much more time goes into adding the data, so that certain predefined queries can retrieve what they need instantly. This means the database has to be designed ahead of time to optimize those queries, and inserting the data often takes much longer (because this is where the searching really happens).

Scale Out

Elastic Search is designed to scale out from the beginning; it automatically does most of the work of starting and creating clusters. It makes adding nodes for processing and storage really easy, so you can easily expand Elastic Search to handle your growing needs. This is why you find Elastic Search as the engine behind so many large Internet sites like GitHub, StumbleUpon and Stack Overflow. A big part of Elastic Search's popularity is how easy it is to deploy, scale out, monitor and maintain; certainly much easier than deploying something based on Hadoop.

[Diagram: Elastic Search cluster topologies]

Analyzers

When you index your data it's fed through a set of analyzers, which do things like convert everything to lower case, split up sentences into individual words, reduce words to their roots (walking -> walk, Steve's -> Steve), deal with special characters, deal with other language peculiarities, etc. Elastic Search has a large set of configurable analyzers, so you can tune your search results based on knowledge of what you are searching.
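
As a sketch of how configurable this is, the following creates an index with a custom analyzer that lower cases text and stems words with the built-in Porter stemmer, then tests it with the _analyze API; the index name is invented:

# Hedged sketch: define a custom analyzer (lowercase + Porter stemming) when
# creating an index, then see what tokens it produces.
import requests

requests.put("http://localhost:9200/articles", json={
    "settings": {"analysis": {"analyzer": {
        "english_stemmed": {
            "type": "custom",
            "tokenizer": "standard",               # split text into words
            "filter": ["lowercase", "porter_stem"]
        }}}}
})

r = requests.post("http://localhost:9200/articles/_analyze",
                  json={"analyzer": "english_stemmed",
                        "text": "Walking the dogs"})
print([t["token"] for t in r.json()["tokens"]])    # roughly ['walk', 'the', 'dog']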

Fuzzy Search

One of the coolest features is fuzzy search: you might not know exactly what you are searching for, or you might spell it wrong, and Elastic Search magically finds the correct values. When ranking the values, Elastic Search uses something called Levenshtein distance (the number of single character edits needed to turn one word into another) to rank which values give the best results. The real trick is how Elastic Search does this without going through the entire database computing and ranking this distance for everything. The answer is that it applies some sophisticated transformations to what you entered to limit the number of reads it needs to do to find matching terms; combined with the good analyzers above, this turns out to be extremely effective and very performant.
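
A sketch of what a fuzzy query looks like over the REST API: the search term below is deliberately misspelled, and with fuzziness set to AUTO Elastic Search will still match terms within a small Levenshtein distance (index and field names invented):

# Hedged sketch: fuzzy search for a misspelled term.
import requests

r = requests.post("http://localhost:9200/customers/_search", json={
    "query": {"fuzzy": {"name": {
        "value": "smiht",          # misspelling of "smith"
        "fuzziness": "AUTO"        # allowed edit distance based on term length
    }}}
})
for hit in r.json()["hits"]["hits"]:
    print(hit["_score"], hit["_source"])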

Real Time

Notice that since these search engines don't search the data directly, they aren't real time. The search database is only updated infrequently, so if data is being rapidly added to the real database, it won't show up in these types of searches until the next update processes it. Often these synchronization updates are only performed once per day. You can tune them to be more frequent, and you can write code to insert critical data directly into the search database, but generally it's easier to just let it be updated on a daily cycle.

Security

When searching enterprise databases there has to be some care in applying security, ensuring the user has the rights to search through whatever they are searching. Controls have to be added so that the enterprise API can only search indexes that the user has the rights to see. Even if the search results don't display anything sensitive, they could still leak information. For instance, if you don't have rights to see a customer's orders but can search on them, then you could probably figure out how many orders are placed by each customer, which could be quite useful information.

Certainly when returning search results you wouldn't reveal anything sensitive like, say, salaries; to get those you would need to click a link to drill down into the application's UI, where full security screening is done.
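
One simple control is for the enterprise search API to restrict every query to the indexes the signed-in user may see, which is easy since Elastic Search accepts a comma separated list of indexes in the search URL. A sketch, where the permissions lookup is a hypothetical helper:

# Hedged sketch: only search indexes the user is permitted to see.
import requests

def secure_search(user_id, query_body, get_permitted_indexes):
    # get_permitted_indexes: hypothetical lookup into the app's security model
    indexes = get_permitted_indexes(user_id)   # e.g. ["customers", "orders"]
    if not indexes:
        return []                              # nothing visible, search nothing
    url = "http://localhost:9200/%s/_search" % ",".join(indexes)
    return requests.post(url, json=query_body).json()["hits"]["hits"]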

Summary

Elastic Search is a very powerful search technology that is quite easy to deploy and quite easy to integrate into existing systems. A lot of this is due to its powerful RESTful API for performing operations.