Stephen Smith's Blog

Musings on Machine Learning…

Archive for the ‘Performance’ Category

Apple M1 Unified Memory

leave a comment »


I recently upgraded three 2008 MacBook Pros from 1Gig to 4Gig of RAM. It was super-easy, you remove the battery (accessible via a coin), remove a small cover over the RAM and hard-drive, then pop the RAM and push in the new ones. Upgrading the hard drive or RAM on these old laptops is straightforward and anyone can do it. Newer MacBooks require partial disassembly which makes the process harder. For the newest ARM based MacBooks, upgrading is impossible. So, do we gain anything for this lack of upgradeability?

This article looks at Apple’s new unified memory architecture that they claim gives large performance gains. Apple hasn’t released a lot of in depth technical details on the M1 chip, but from what they have released, and now that people have received these units and performed real benchmarks we can see that Apple really does have something here.

Why Would We Want To Upgrade?

In the case of the 2008 MacBook Pro, when it was new, 4Gig was expensive. Now 4Gig of DDR2 memory is $10. It makes total sense to upgrade to maximum memory. Similarly, the MacBook came with a mechanical hard drive which is quite small and slow by modern standards. It was easy to upgrade these to larger faster SSD drives for around $40 each.Often this is the case that the maximum configuration is too expensive at the time of the original purchase, but becomes much cheaper a few years later. Performing these upgrades then lets you get quite a few years more service out of your computer. The 2008 MacBook Pros upgraded to maximum configuration are still quite usable computers (of course you have to run Linux on them, since Apple software no longer supports them).

Enter the New Apple ARM Based Macintoshes

The newly released MacBooks based on ARM System on a Chips (SoCs) have their RAM integrated into their CPU chips. This means that unless you can replace the entire CPU, you can’t upgrade the RAM. Apple claims integrating the memory into the CPU gives them a number of performance gains, since the memory is high speed, very close to all the devices and shared by all the devices. A major bottleneck in modern computer systems is moving data between memory and the CPU or copying data from the CPU’s memory to the GPU’s memory.

AMD and nVidia graphics cards contain their own memory separate from the memory used by the CPU. So a modern gaming computer might have 16Gig RAM for the CPU and then 8Gig or RAM for the GPU. If you want the GPU to perform a matrix multiplication you need to transfer the matrices to the GPU, tell it to multiply them and then transfer the resulting matrix back to the CPU’s memory. nVidia and AMD claim this is necessary since they incorporate newer faster memory in their GPUs than is typically installed on the CPU motherboard. Most CPUs currently use DDR4 memory whereas GPUs typically incorporate faster DDR6 memory. There are GPUs (like the Raspberry Pi’s) that share CPU memory, however these tend to be lower end (cheaper since they don’t have their own memory) and slower since there is more contention for the CPU memory.

The Apple M1 tries to address these problems by incorporating the fastest memory and then providing a much wider bandwidth between the memory and the various processors on the M1 chip. For the M1 there isn’t just the GPU, but also a Neural Engine for AI processing (which is similar to a GPU) as well as other units for specialized functions like data encryption and video decoding. Most newer computers have a 64-bit memory controller that can move 64-bits of data between the CPU and RAM at the speed of the RAM, sometimes the RAM is as fast as the CPU, sometimes it’s a bit slower. Newer CPUs have large caches to try to save on some of this transfer, but the caches are in MegaBytes whereas main memory is in GigaBytes. Separate GPU memory helps by having a completely separate memory controller, expensive servers help by having multiple memory controllers. Apple’s block diagrams seem to indicate they have two 64-bit memory controllers or parallel pathways to main memory, but this is a bit hypothetical. As people are benchmarking these new computers, it does appear that Apple has made some significant performance improvements.


If Apple has greatly reduced the memory bottleneck and having the GPU, Neural Engine and CPU all accessing the same memory doesn’t cause too much contention, then saving the copying of data between the processing units will be a big advantage. On the downside, you should overbuy on the memory now, since you can’t upgrade it later.

If you are interested in the M1 ARM processor and want to learn more about how it works internally, then consider my book: Programming with 64-Bit ARM Assembly Language.

Written by smist08

November 20, 2020 at 1:09 pm

Playing with CUDA on my Gaming Laptop

with one comment

Playing with CUDA on my Gaming Laptop


Last year, I blogged on playing with CUDA on my nVidia Jetson Nano. I recently bought a new laptop which contains an nVidia GTX1650 graphics card with 4Gig of RAM. This is more powerful than the coprocessor built into the Jetson Nano.  I took advantage of the release of newer Intel 10th generation processors along with the wider availability of newer nVidia RTX graphics cards to get a good deal on a gaming laptop with an Intel 9th generation processor and nVidia GTX graphics. This is still a very fast laptop with 16Gig of RAM and runs the couple of video games I’ve tried fine. It also compiles and handles my normal projects easily. In this blog post, I’ll repeat a lot of my previous article on the nVidia Jetson, but in the context of running on Windows 10 with an Intel CPU.

I wanted an nVidia graphics card because these have the best software support for graphics, gaming, AI, machine learning and parallel programming. If you use Tensorflow for AI, then it uses the nVidia graphics card automatically. All the versions of DirectX support nVidia and if you are doing general parallel programming then you can use a system like OpenCL. I find nVidia leads AMD in software support and Intel is going to have a lot of trouble with their new Xe graphics cards reaching this same level of software support.


On Windows, most developers use Visual Studio. I could do this all with GCC, but this is more difficult, since when you install the SDK for CUDA, you get all the samples and documentation for Visual Studio. The good news is that you can use Visual Studio Community Edition which is free and actually quite good. Installing Visual Studio is straightforward, just time consuming since it is large.

Next up, you need to install nVidia’s CUDA toolkit. Again, this is straightforward, just large. Although the install is large, you likely have all the drivers already installed, so you are mostly getting the developer tools and samples out of this.

Performing these installs and then dealing with the program upgrades, really makes me miss Linux’s package managers. On Linux, you can upgrade all the software on your computer with one command on a regular basis. On Windows, each program checks for upgrades when it starts and usually wants to upgrade itself before you do any work. I find that this is a real productivity killer on Windows. Microsoft is starting work on a package manager for Windows, but at this point it does little.

Compiling the deviceQuery sample produced the following output on my gaming laptop:

CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "GeForce GTX 1650 with Max-Q Design"
  CUDA Driver Version / Runtime Version          11.0 / 11.0
  CUDA Capability Major/Minor version number:    7.5
  Total amount of global memory:                 4096 MBytes (4294967296 bytes)
  (16) Multiprocessors, ( 64) CUDA Cores/MP:     1024 CUDA Cores
  GPU Max Clock rate:                            1245 MHz (1.25 GHz)
  Memory Clock rate:                             3501 Mhz
  Memory Bus Width:                              128-bit
  L2 Cache Size:                                 1048576 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1024
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 6 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  CUDA Device Driver Mode (TCC or WDDM):         WDDM (Windows Display Driver Model)
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      No
  Device PCI Domain ID / Bus ID / location ID:   0 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 11.0, CUDA Runtime Version = 11.0, NumDevs = 1
Result = PASS

If we compare this to the nVidia Jetson Nano, we see everything is better. The GTX 1650 is based on the newer Turing architecture and the memory is local to the graphics card and not shared with the CPU. The big difference is that we have 1024 CUDA cores, rather than the Jetson’s 128. This means we can perform 1024 operations in parallel for SIMD operations.

CUDA Samples

The CUDA toolkit includes a large selection of sample programs, in the Jetson Nano article we listed the vector addition sample. Compiling and running this on Windows is easy in Visual Studio. These samples are a great source of starting points for your own projects. 

Programming for Portability

If you are writing a specialized program and want the maximum performance on specialized hardware, it makes sense to write directly to nVidia’s CUDA API. However, most software developers want to have their programs to run on as many computers out in the world as possible. The solution is to write to a higher level API that then has drivers for different popular hardware.

For instance, if you are creating a video game, you could write to the DirectX interface and then your program can run on newer versions of Windows on a wide variety of GPUs from different vendors. If you don’t want to be limited to Windows, you could use a portable graphics API like OpenGL. You can also go higher level and create your game in a system like UnReal Engine or Unity. These then have different drivers to run on DirectX, MacOS, Linux, mobile devices or even in web browsers.

If you are creating an AI or Machine Learning application, you can use a library like Tensorflow or PyTorch which have drivers for all sorts of different hardware. You just need to ensure their support is as broad as the market you are trying to reach.

If you are doing something more general or completely new, you can consider a general parallel processing library like OpenCL which has support for all sorts of devices, including the limited SIMD coprocessors included with most modern CPUs. A good example of a program that uses OpenCL is Folding@Home which I blogged on here.


Modern GPUs are powerful and flexible computing devices. They have high speed memory and often thousands of processing cores to work on your task. Libraries to make use of this computing power are getting better and better allowing you to leverage this horsepower in your applications, whether they are graphics related or not. Today’s programmers need to have the tools to harness these powerful devices, so the applications they are working on can reach their true potential.

Written by smist08

June 20, 2020 at 1:43 pm

Windows Bit-Rot

with 9 comments


In investigating some performance problems being reported on some systems running Sage 300 ERP, it lead down the road to investigating Windows Bit-Rot. Generally Bit-Rot refers to the general degradation of a system over time. Windows has a very bad reputation for Bit-Rot, but what is it? And what can we do about it? Some people go so far as to reformat their hard disk and re-install the operating system every year as a rather severe answer to Bit-Rot.

Windows Bit-Rot is the tendency for a Windows system to get slower and slower over time. Becoming slower to boot, taking longer to log-in, and taking longer to start programs. Along with other symptoms like excessive and continuous hard disk activity when nothing is running.

This blog posting is going to look at a few things that I’ve run into as well as some other background from around the web.


I needed to investigate why on some systems printing Crystal reports was quite slow. This involved software we have written as well as a lot of software from third parties. On my laptop Crystal would print quite slowly the first time and then would print quickly on subsequent times. My computer is used for development and is full of development tools, so the things I found here, might be relevant to myself more than real customers. So how to see what is going on? A really useful program for seeing what is going on is Process Monitor (procmon) from Microsoft (from their SysInternals acquisition). This program will show you every access of the registry, the file system and the network. You can filter the display, in particular you can filter to monitor only a single program to see what it’s doing.


ProcMon yielded some very interesting results.

The Registry

My first surprise was to see that every entry in HKEY_CLASSES_ROOT was read. On my computer which has had many pieces of software installed, including several versions of Visual Studio, several versions of Crystal Reports and several versions of Sage 300 ERP, the number of classes registered here was huge. OK, but did it take much time? Well the first time something that’s run that does this it seems to take several seconds, then after this its fast probably because the registry ends up cached in memory. It appears that several .Net programs I tried do this. Not sure why, perhaps just .Net wants to know all the classes in the system.

But this does mean that as your system gets older and you install more and more programs (after all why bother un-installing when you have a multi-terabyte hard drive?), starting these programs will get slightly slower and slower. So to me this counts as Bit-Rot.

So what can we do about this? Un-installing unused programs should help, especially if they use a lot of COM classes. Visual Studio being the big one on my system, followed by Crystal and Sage 300. This helps a bit. But there are still a lot of classes there.

Generally I think uninstall programs leave a lots of bits and pieces in the registry. So what to do? Fortunately this is a good stomping ground for utility programs. Microsoft used to have RegClean.exe, Microsoft discontinued support for this program, but you can still find it around the web. A newer and better utility is Ccleaner from Piriform. Fortunately the free version includes a registry cleaner. I ran RegClean.exe first which helped a bit, but then ran Ccleaner and it found quite a bit more to clean up.

Of course there is danger in cleaning your registry, so it’s a use at your own risk type thing (backing up the registry first is a good bet).

At the end of the day all this reduced the first time startup time of a number of program by about 10 seconds.

Group Policy

My second surprise was the number of calls to check Windows Group Policy settings. Group Policy is a rather ad-hoc mechanism added to Windows to allow administrators to control networked computers on their domain. Each group policy is stored in a registry key, and when Windows goes to do an operation controlled by group policy, it reads that registry key to see what it should do. I was surprised at the amount of registry activity that goes on reading and checking group policy settings. Besides annoying users by restricting what they can do on their computer, it appears group policy causes a general high overhead of excessive registry reading in almost every aspect of Windows operation. There is nothing you can do about this, but it appears as Windows goes from version to version, that more and more gets added to this and the overhead gets higher and higher.


You may not think that you install that many programs on your computer, so you shouldn’t have these sort of problems but remember many programs including Windows/Microsoft Update, Adobe Updater and such are regularly installing new programs on your computer. Chances are these programs are leaving behind unused bits of older versions that are cluttering up your file system and your registry.

Auto-Run Crap

Related to auto-updates, it appears that so many programs now run as icons in the task bar, install Windows services or install programs to run when you log-in. All of these slow down the time it takes you to boot Windows and to sign-in. Further many of these programs, say like Dropbox, will keep frequently polling their server to see if there are any updates. Microsoft has a good tool Autoruns for Windows which helps you see all the things that are automatically run and help you remove them. Again this can be a bit dangerous as some of them are necessary (perhaps like a trackpad utility).

Similarly it seems that everyone and their mother wants to install browser toolbars. Each one of these will slow down the startup of your browser and use up memory and possibly keep polling a server. Removing/disabling these isn’t hard, but it is a nuisance to have to keep doing this.

Hard Disk Fragmentation

Another common problem is hard drive fragmentation. As your system operates the hard disk becomes more and more fragmented. Windows has a de-frag program that is either scheduled to run when your computer is turned off or you never bother to run it by hand. It is worth de-fragging your hard drive from time to time to speed up access. There are third party de-frag programs, but generally I just use the one that comes built into Windows.

Related to the above problems, often un-installation programs leave odds and ends files around and sometimes it’s worth going into explorer (or a cmd prompt) and deleting folders for un-installed programs. Generally it reduces clutter and speeds up operations like reading all the folders under program files.

Dying Hard Drives

Another common cause of slowness is that as hard drives age, rather than just out right failing, often they will start having to retry reading sectors more. Windows can mark sectors bad and move things around.  Hard drives seem to be able to limp along for a while this way before completely failing. I tend to think that if you hear your hard drive resetting itself fairly often then you should replace it. Or when you defrag if you see the number of bad sectors growing, then replace it.


After going through this, I wonder if the people that just reformat their hard drive each year have the right idea? Does the time spent un-installing, registry cleaning, de-fragging just add up to too much? Are you better off just starting clean each year and not worrying about all these maintenance tasks? Especially now that it seems like we replace our computers far less frequently, is Bit-Rot becoming a much worse problem?

Written by smist08

May 4, 2013 at 2:31 pm

Day End Processing

with 20 comments

Sage ERP Accpac Day End Processing is a rather simple form in I/C Periodic Processing. It only has Process and Cancel buttons and you are expected to run it after the close of business every day. But what does it do? The online help states:

Use this dialog box to:

  • Update costing data for all transactions (unless you chose the option to update costing during posting).
  • Produce general ledger journal entries from the transactions that were posted during the day (unless you do item costing during posting or create G/L transactions using the Create G/L Batch icon).
  • Produce a posting journal for each type of transaction that was posted.
  • Update Inventory Control statistics and transaction history.

Day End Processing also performs processing tasks for the Order Entry and Purchase Orders modules, if you have them:

  • Processing transactions that were posted during the day in Order Entry and Purchase Orders.
  • Activating and posting future sales orders and purchase orders that have reached their order date, and updating quantities on sales order and on purchase order.
  • Removing quotes and purchase requisitions with expiration dates up to and including the session date for day-end processing.
  • Updating sales commissions.
  • Creating batches of Accounts Receivable summary invoices and credit notes from posted Order Entry transactions.
  • Deleting completed transaction details if you do not keep transaction history.
  • Updating statistics and history in Order Entry and Purchase Orders.

Below is the flow in and out of Inventory. All of these transactions affect costing and generate sub-ledger transactions.

The main purpose of Day End is to move a lot of processing away from data entry. Generally in a large Accpac installation you will have hundreds of people entering Orders, Invoices, POs, etc. They need to get their work done in the most efficient way possible. A person entering Orders from the CRM system doesn’t want to have to wait for A/R and G/L transactions to be created every time. What they care about is entering their Orders as quickly as possible. As a side benefit, Day End can batch all the transactions together, so rather than each Order creating a single G/L Batch, these can all be combined reducing the number of documents downstream.

However there are a number of misunderstandings and confusion about Day End. This blog posting is looking to cover a few topics around Day End to hopefully make things a bit clearer.

Over the years we have also changed the way Day End operates and added additional options to let you choose when things happen. Prior to version 5.2A, all the functions mentioned above had to be done during Day End and there were no alternatives. People became imaginative and ran macros to run Day End on a frequent basis. Why were people doing this? The main reason was that for many businesses, updating the costing in inventory only once a day is not sufficient, if costs are changing quickly you want this reflected right away. Another reason is that if you operate your business 24×7 then you don’t have an after-hours time period when you can run this. Plus for some people Day End was taking longer than the overnight period to complete.

In version 5.2A we introduced the feature of “Day End at Posting Time”. With this mode essentially whenever you posted a document in I/C, O/E or P/O, we would run Day End. Then you never had to run the original stand-alone Day End screen and you could operate 24×7 without running a separate Day End process and your I/C Costing was always up-to-date. This worked fine for some people (usually people with lower volume), but it caused problems for others. One was that it slowed down posting time of documents too much and impeded the productivity of people posting Orders and such. Second, when you have longer transactions, you now run a larger risk of multi-user conflicts (which are really quite annoying). Third, this resulted in a large number of G/L, A/R and A/P batches being produced. The usual workaround for people that really need this was to turn off other features that slow down posting such as “Keep Statistics” or “Keep History”. You can speed up posting quite a bit by turning off various options in the various Accounting Module’s Options screens. However you then lose use of these features and often which you do can be a difficult trade-off.

In version 5.5A we introduced the feature of “Costing during Posting”. Here when you post an I/C, O/E or P/O document we would run the Costing part of Day End, but not all the other parts. This turned out to be a good compromise. It didn’t noticeably slow down document posting and hence didn’t introduce more multi-user conflicts. So people could now keep their Costing data up-to-date without frequently running Day End. However you still need to run the Day End processing function at night to create all the sub-ledger documents, create audit history and other miscellaneous functions.

Now let’s go through each of the Day End functions in a bit more detail.

Cost Transactions

This is based on the Item Costing Method and is when the costing buckets are updated for the affected items in I/C. Basically depending on whether we are buying or selling:

  • Incoming Transactions: Increases Total Quantity/Cost in Location Details/Costing Buckets
  • Outgoing Transactions: Calculates/Removes Quantity/Cost from Location Details/Costing Buckets

Create Audit Information

Day End is responsible for populating all the various I/C, O/E and P/O audit history tables including:

  • Posting Journals (all transactions)
  • Item Valuation (all transactions)
  • IC Transaction History (all transactions)
  • IC Statistics (all transactions)
  • IC Serial Numbers Audit (IC and OE Shipments)
  • IC Sales Statistics (IC and OE Shipments)
  • OE Sales History (OE transactions)
  • OE Sales Statistics (OE transactions)
  • OE Commissions Audit (OE Invoices/Credit Notes/Debit Notes)
  • PO Purchase History (PO transactions)
  • PO Payables Clearing Audit (PO transactions)

Generate GL/AR/AP/PM/FA Entries

Create all the various batches in the sub-ledgers. These include:

  • GL Entries (all transactions)
  • AR Entries (OE Invoices, Credit Notes, and Debit Notes)
  • AP Entries (PO Invoices, Credit Notes, and Debit Notes)
  • PM Entries (IC Shipments, OE Shipments, PO Purchase Orders / Receipts / Invoices / Returns / Credit Notes / Debit Notes)
  • FA Entries (IC Internal Usages, PO Receipts)

Miscellaneous Functions

Then there are a collection of miscellaneous functions that include:

  • Activates future orders (OE)
  • Deletes expired quotes that have not been activated (OE)
  • Deletes completed orders if “Keep History” is OFF (OE)
  • Activates future purchase orders (PO)
  • Clears completed transactions if “Keep History” is OFF (PO)

Day End Processing Structure

Day End Processing (DEP) is structured as follows:

Note that DEP doesn’t process all transactions in chronological order.


Your best bet is to use “Costing during Posting”. This will give you real-time costing without badly affecting performance. As ERP packages address larger organizations there tend to be more and more of these types of operations. The more people doing data entry, the less you want them burdening the application and database servers to maximize productivity. Many tier one ERP packages split this into more parts, the advantage of this is that several (that aren’t adjacent) can be run at once without causing multi-user conflicts. Sage ERP Accpac has always been under pressure to combine things down into all-in-one operations that work well for smaller businesses, however if we are to move the operations suite into larger Enterprises then we will have to slice these processes up finer.

Written by smist08

April 2, 2011 at 4:21 pm

Automated Testing in Sage ERP Accpac Development

with 7 comments

All modern products rely heavily on automated testing to maintain quality during development. Accpac follows an Agile development methodology where development happens in short three week sprints where at the end of every sprint you want the product at a shippable level of quality. This doesn’t mean you would ship, that would depend on Product Management which determines the level of new features required, but it means that quality and/or bugs wouldn’t be a factor. When performing these short development sprints, the development team wants to know about problems as quickly as possible so any problems can be resolved quickly and a backlog of bugs doesn’t accumulate.

The goal is that as soon as a developer checks in changes, these are immediately built into the full product on a build server and a sequence of automated tests are run on that build to catch any introduced problems. There are other longer running automated tests that are run less frequently to also catch problems.

To perform the continuous builds and to run much of our automated tests we use “Hudson” ( Hudson is an extremely powerful and configurable system to continuously build the complete product. It knows all the project dependencies and knows what to build when things change.  Hudson builds on many other tools like Ant (similar to make or nmake) and Ivy among others to get things done ( The key thing is that it builds the complete Accpac system, not just a single module. Hudson also has the ability to track metrics so we get a graph of how the tests are running, performance metrics, lines of code metrics, etc. And this is updated on every build. Invaluable information for us to review.

The first level of automated testing are the “unit tests” ( These are tests that are included with the source code. They are short tests where each test should run in under 1 second. They are also self contained and don’t rely on the rest of the system being present. If other components are required, they are “mocked” with tools like easy-mock ( Mocking is a process of simulating the presence of other system components. One nice thing about “mocking” is that it makes it easy to introduce error conditions, since the mocking component can easily just return error codes. The unit tests are run as part of building each and every module. They provide a good level of confidence that a programmer hasn’t completely broken a module with the changes they are checking in to source control.

The next level of automated tests are longer running. When the Quality Assurance (QA) department first tests a new module, they write a series of test plans, the automation team takes these and transfers as many of these as possible to automated test scripts. We use Selenium (, which is a powerful scripting engine that drivers an Internet Browser simulating actual users. These tests are run over night to ensure everything is fine. We have a subset of these call the BVT (Build Validation Test) that runs against every build as a further smoke test to ensure things are ok.

All the tests so far are functional. They test whether the program is functioning properly. But further testing is required to ensure performance, scalability, reliability and multi-user are fine. We record the time taken for all the previous tests, so sometime they can detect performance problems, but they aren’t the main line of defense. We use the tool JMeter ( to test multi-user scalability. JMeter is a powerful tool that simulates any number of client’s accessing a server. This tool tests the server by generating SData HTTP requests from a number of workstations and bombarding the server with them. A very powerful tool. For straight performance we use VBA macros that access the Accpac Business Logic Layer to write large numbers of transaction or very large transactions, which are all timed to make sure performance is fine.

We still rely heavily on manual QA testers to devise new tests and to find unique ways to break the software, but once they develop the tests we look to automate them so they can be performed over and over without boring a QA tester to death. Automated testing is a vibrant area of computer science where new techniques are always being devised. For instance we are looking at incorporating “fuzz testing” ( into our automated test suites. This technique if often associated with security testing, but it is proving quite useful for generalized testing as well. Basically fuzz testing takes the more standard automated test suites and adds variation, either by knowing how to vary input to cause problems, or just running for a long time trying all possible combinations.

As we incorporate all these testing strategies into our SDLC (Software Development Life Cycle) we hope to make Accpac more reliable and to make each release more trouble free than the previous.

Written by smist08

April 5, 2010 at 5:24 pm

Sage ERP Accpac 6 Performance Testing

with 4 comments

As Accpac moves to becoming a Web based application, we want to ensure that performance for the end user is at least as good as it was in the Accpac 5.x days running client/server with VB User Interface programs. Specifically can you have as many users running Accpac off one Web Server as you currently have running off one Citrix Server today?  In some regards Accpac 6 has an advantage over Citrix, in that the User Interface programs do not run on the server. On Citrix all the VB User Interface programs run on the Citrix server and consume a lot of memory (the VB runtime isn’t lightweight). In Accpac 6, we are only running the Business Logic (Views) on the Server and the UIs run as JavaScript programs on the client’s Browser. So we should be able to handle more users than Citrix, but we have to set the acceptance criteria somewhere. Performance is one of those things that you can never have enough of, and tends to be a journey that you keep working on to extend the reach of your product.

Web based applications can be quite complex and good tools are required to simulate heavy user loads, automate functional testing and diagnose/troubleshoot problems once they are discovered. We have to ensure that Accpac will keep running reliably under heavy user loads as users are performing quite a diverse set of activities. With Accpac 6 there are quite a few new components like the dashboard data servlets that need to have good performance and not adversely affect people performing other tasks. We want to ensure that sales people entering sales orders inside CRM are as productive as possible.

We are fundamentally using SData for all our Web Service communications. This involved a lot of building and parsing XML. How efficient is this? Fortunately all web based development systems are highly optimized for performing these functions. So far SData has been a general performance boost and the overhead of the XML has been surprisingly light.

Towards ensuring this goal we employ a number of open source or free testing tools. Some of these are used repeatedly run automated tests to ensure our performance goals are being met. Some are used to diagnose problems when they are discovered. Here is a quick list of some of the tools we are using and to what end.

Selenium ( is a User Interface scripting/testing engine for performing automated testing of Web Based applications. It drives the Browser as an end user would to allow automated testing. We run a battery of tests using Selenium and as well as looking for functionality breakages, we record the times all the tests take to run to watch for performance changes.

JMeter ( is a load testing tool for Web Based applications. It basically simulates a large number of Browsers sending HTTP requests to a server. Since SData just uses HTTP, we can use JMeter to test our SData services. This is a very effective test tool for ensuring we run under heavy multiuser loads. You just type in the number of users you want JMeter to simulate and let it go.

Fiddler2 ( is a Web Debugging Proxy that records all the HTTP traffic between your computer and the Internet. We can use this to record the number of network calls we make and the time each one takes. One cool thing Fiddler does is record your network performance and then calculates the time it would have taken for users in other locations or other bandwidths would have taken. So we get the time for our network, but also estimates of the time someone using DSL in California or a modem in China would have taken to load our web page.

dynaTrace AJAX Edition ( is a great tool that monitors where all the time is going in the Browser. It tells you how much time was spent waiting for the network, executing JavaScript, processing CSS, parsing JavaScript, rendering HTML pages. A good tool to tell you where to start diving in if things are slow.

Firebug ( is a great general purpose profiling, measuring, debugging tool for the Firefox browser. Although our main target for release if Internet Explorer, Firebug is such a useful tool, that we find we are often drawn back to doing our testing in Firefox for the great tools like Firebug that are available as add-ins.

IE Developer Tools ( are built into IE 8 and are a very useful collection of tools for analyzing things inside IE. The JavaScript profiler is quite useful for seeing why things are slow in IE. It is necessary to use this tool, since often things that work fine in Firefox or Chrome are very slow in IE and hence you need to use an IE based tool to diagnose the problem.

Using these tools we’ve been able to successfully stress the Accpac system and to find and fix many bugs that have caused the system to crash, lockup or drastically slow down. Our automation group runs performance tests on our nightly builds and posts the results to an internal web site for all our developers to track. As we develop out all the accounting applications in the new SWT (Sage Web Toolkit), we will aggressively be developing new automated tests to ensure performance is acceptable and look to keep expanding the performance of Accpac.

Written by smist08

February 26, 2010 at 9:33 pm

Amazing Google Street View

leave a comment »

I remember when SQL Server 7 was nearly ready for release, Microsoft research had a project to make a 1 terabyte database. Their project was to server up satellite images of anywhere on Earth. Basically they had 1 Terabyte of images, so that was their terabyte database. I didn’t think this was a realistic terabyte database since their weren’t that many records, just each one was an image and quite large. Anyway they put the thing up on the web as a beta, and as soon as a few people tried it, the whole thing collapsed under the load.

A year or two later Google launched Google Earth, which basically did the same thing, only more detailed. But Google Earth can easily handle the load of all the people around the world accessing it. Why the difference? Why could Google do this and Microsoft couldn’t? I think the main difference is that Microsoft hosted it on one single SQL Server and had no way to scale it besides beefing up the hardware at great expense. Whereas Google uses a massively distributed database running on many many servers all coordinated and all sharing and balancing the load. Google uses many low cost Linux based servers keeping costs down and performance high.

This week Google released Google StreetView for major Canadian cities including Vancouver, where I live. So I can virtually cruise around Vancouver streets with very good resolution including seeing my house and neighborhood. This is really amazing technology. Rather than just panning around a patchwork of satellite images, we are actually navigating in 3D around the world. Suddenly Google has produced a virtual model of the entire world at quite good photographic quality.

Think about the size of this distributed database with all these photos, plus all the data to allow them to be stitched together into 3D Views that you can navigate through. This is so far beyond Google Earth, it’s really amazing. Is this the first step to having a completely virtual alternate Earth? If you are wearing 3D goggles, will you be able to tell if they are transparent or viewing these images?

I think we are just seeing the first applications of what is possible with these giant distributed databases. I’m really looking forwards to seeing some really amazing and mind blowing applications in the future. The neat thing is that Google is starting to open source this database technology so others can use it. Are SQL databases just dinosaurs waiting to be replaced? What will be able to accomplish in our business/enterprise databases and data warehouses one we start apply and using this technology?

Written by smist08

October 11, 2009 at 12:24 am

SQL Server, Performance and Clustered Indexes

with 2 comments

One of our engineers Ben Lu, was doing some analysis of why SQL Server was taking along time to find some dashboard type totals for statistical display. There was an index on the data, and their weren’t that many records that met the search criteria that needed adding up. It turned out that even though it was following an index to add these up, that the records were fairly evenly distributed through the whole table. This meant that to read the 1% or so of records that were required for the sum, SQL Server was actually having to read pretty much every data table in the table usually to get one record. This problem is a bit unique to getting total statistics or sums; if the data was fed to a report writer, the index would return enough records quickly to start the report going and then SQL Server can feed the records faster than a printer or report preview windows can handle. However in this case nothing displays till all the records are processed.

It turns out that SQL Server physically will store records in the order on disk or the index that is specified as a clustered index (and only 1 is required). In Accpac we designate the primary index as a clustered index as we found long ago that this speeds up most processing (since most processing is based off the primary index). However in this case the primary index was customer number, document number; and for the query were interested in processing all unpaid documents which are spread fairly evenly over the customers (at least in the customer database we were testing with).

We found that we could speed up this calculation quite a bit by added an additional segment to the beginning of the primary key. For instance without any changes, the query was taking 24 seconds on a large customer database. Putting the document date as the first segment reduced the query to 17 seconds (since hopefully unpaid invoices are near the end). However it turned out there were quite a few old unpaind invoices. Adding an “archive” segment to the beginning where we mark older paid invoices as archived reduced the query time to 4 seconds. We could also add the unpaid flag as the first index to get the same speed up. But were are thinking that perhaps the concept of a virtual archiving would have other applications other than this one query.

Anyway now that we understand what is going on, we can decide how to best address the problem. Whether adding the paid flag to the record, adding the date or adding an archive flag. Its interesting to see how SQL Server’s performance characteristics actually affect the schema design of your database, if you want to get best perforamance. And that often these choices go beyond just adding another index.

Written by smist08

June 7, 2009 at 9:05 pm