Stephen Smith's Blog

Musings on Machine Learning…

Archive for June 2020

Exciting Days for ARM Processors

with 13 comments


ARM CPUs have long dominated the mobile world; nearly all Apple and Android phones and tablets use some model of ARM processor. However, Intel and AMD still dominate the laptop, desktop, server and supercomputer markets. This week we saw a number of announcements suggesting this is likely to change:

  1. Apple announced they are going to transition all Mac computers to the ARM processor over two years.
  2. Ampere announced a 128-core server ARM processor.
  3. Japan now has the world’s most powerful supercomputer and it is based on 158,976 ARM Processors.

In this blog post, we’ll look at some of the consequences of these moves.

Apple Macs Move to ARM

The big announcement at this year’s Apple Worldwide Developers Conference is that Apple will be phasing out Intel processors in their Mac desktop and laptop computers. You wouldn’t know they are switching to ARM processors from all their marketing speak, which exclusively talks about the switch from Intel to Apple Silicon. But at the heart of Apple Silicon are ARM CPU cores. The name Apple Silicon refers to the System on a Chip (SoC) that they are building around the ARM processors. These SoCs will include a number of ARM cores, a GPU, an AI processor, a memory manager and other support functions.

Developers can pay $500 to get a Mac mini running the same ARM CPU as the latest iPad Pro. The downside is that you need to give this hardware back when the real systems ship at the end of this year. It is impressive that you can already get a working ARM Mac running macOS along with a lot of software, including the Xcode development system. One cool feature is that you can run any iPad or iPhone app on your Mac, now that all Apple devices share the same CPU.

The new version of MacOS for ARM (or Apple Silicon) will run Intel compiled programs in an emulator, but the hope from Apple is that developers will recompile their programs for ARM fairly quickly, so this won’t be needed much. The emulation has some limitations, in that it doesn’t support Intel AVX SIMD instructions or instructions related to virtualization.

For developers converting their applications, any Assembly Language code will have to be converted from Intel Assembly to ARM Assembly, and of course a great resource for doing this is my book: Programming with 64-Bit ARM Assembly Language.

I’m excited to see what these new models of ARM based Apple computers look like. We should see them announced as we approach the Christmas shopping season. Incorporating all the circuitry onto a single chip will make these new computers even slimmer, lighter and more compact. Battery life should be far longer but still with great performance.

I think Apple should be thanking the Raspberry Pi world for showing what you can do with SoCs, and for driving so much software to already be ported to the ARM processor.

One possible downside of the new Macs is that Apple keeps talking about the new secure boot feature, which only allows Apple-signed operating systems to boot. Does this mean we won’t be able to run Linux on these new Macs, except using virtualization? This will be a big downside, especially down the road when Apple drops support for them. Apple makes great hardware that keeps on working long after Apple no longer supports it. You can get a lot of extra life out of your Apple hardware by installing Linux and keeping on trucking with new updates.

New Ampere Server ARM Chips

Intel and AMD have long dominated the server and data center markets, but that is beginning to change. Amazon has been designing their own ARM chips for AWS and Ampere has been providing extremely powerful ARM based server chips for everyone else. Last year they announced an 80-core ARM based server chip which is now in production. Just this week they announced the next generation which is a 128-core ARM server chip.

If you aren’t interested in a server, but would like a workstation containing one of these chips then you could consider a computer from Avantek such as this one.

This is just one of several powerful ARM based server chips coming to market. It will be interesting to see if there is a lot of uptake of ARM in this space.

Japan’s Fugaku ARM Based Supercomputer is Number One

Japan just took the number one spot in the list of the world’s most powerful supercomputers. The Fugaku supercomputer is located in Kobe and uses 158,976 Fujitsu 48-core ARM SoCs. Of course this computer runs Linux and currently is being used to solve protein folding problems around developing a cure for COVID-19, similar to folding@home. This is a truly impressive warehouse of technology and shows where you can go with the ARM CPU and the open source Linux operating system.


ARM conquered the mobile world some years ago, and now it looks like ARM is ready to take on the rest of the computer industry. Expect to see more ARM based desktop and laptop computers than just Macs. Only time will tell whether this is a true threat to Intel and AMD, but the advantage ARM has over previous attempts to unseat Intel as king is that they already have more volume production than Intel and AMD combined. The Intel world has stagnated in recent years, and I look forward to seeing the CPU market jump ahead again.

Written by smist08

June 24, 2020 at 11:28 am

Playing with CUDA on my Gaming Laptop

with one comment



Last year, I blogged on playing with CUDA on my nVidia Jetson Nano. I recently bought a new laptop which contains an nVidia GTX1650 graphics card with 4Gig of RAM. This is more powerful than the coprocessor built into the Jetson Nano.  I took advantage of the release of newer Intel 10th generation processors along with the wider availability of newer nVidia RTX graphics cards to get a good deal on a gaming laptop with an Intel 9th generation processor and nVidia GTX graphics. This is still a very fast laptop with 16Gig of RAM and runs the couple of video games I’ve tried fine. It also compiles and handles my normal projects easily. In this blog post, I’ll repeat a lot of my previous article on the nVidia Jetson, but in the context of running on Windows 10 with an Intel CPU.

I wanted an nVidia graphics card because these have the best software support for graphics, gaming, AI, machine learning and parallel programming. If you use TensorFlow for AI, then it uses the nVidia graphics card automatically. All the versions of DirectX support nVidia, and if you are doing general parallel programming then you can use a system like OpenCL. I find nVidia leads AMD in software support, and Intel is going to have a lot of trouble getting their new Xe graphics cards to this same level of software support.


On Windows, most developers use Visual Studio. I could do this all with GCC, but this is more difficult, since when you install the SDK for CUDA, you get all the samples and documentation for Visual Studio. The good news is that you can use Visual Studio Community Edition which is free and actually quite good. Installing Visual Studio is straightforward, just time consuming since it is large.

Next up, you need to install nVidia’s CUDA toolkit. Again, this is straightforward, just large. Although the install is large, you likely have all the drivers already installed, so you are mostly getting the developer tools and samples out of this.

Performing these installs and then dealing with the program upgrades really makes me miss Linux’s package managers. On Linux, you can upgrade all the software on your computer with one command on a regular basis. On Windows, each program checks for upgrades when it starts and usually wants to upgrade itself before you do any work. I find that this is a real productivity killer on Windows. Microsoft is starting work on a package manager for Windows, but at this point it does little.

Compiling the deviceQuery sample produced the following output on my gaming laptop:

CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "GeForce GTX 1650 with Max-Q Design"
  CUDA Driver Version / Runtime Version          11.0 / 11.0
  CUDA Capability Major/Minor version number:    7.5
  Total amount of global memory:                 4096 MBytes (4294967296 bytes)
  (16) Multiprocessors, ( 64) CUDA Cores/MP:     1024 CUDA Cores
  GPU Max Clock rate:                            1245 MHz (1.25 GHz)
  Memory Clock rate:                             3501 Mhz
  Memory Bus Width:                              128-bit
  L2 Cache Size:                                 1048576 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1024
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 6 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  CUDA Device Driver Mode (TCC or WDDM):         WDDM (Windows Display Driver Model)
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      No
  Device PCI Domain ID / Bus ID / location ID:   0 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 11.0, CUDA Runtime Version = 11.0, NumDevs = 1
Result = PASS

If we compare this to the nVidia Jetson Nano, we see everything is better. The GTX 1650 is based on the newer Turing architecture and the memory is local to the graphics card and not shared with the CPU. The big difference is that we have 1024 CUDA cores, rather than the Jetson’s 128. This means we can perform up to 1024 operations in parallel for SIMD-style workloads.
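The core count in the deviceQuery output is just the product of two of the reported numbers, which a short Python sketch makes explicit. The figures are copied from the listing above and from the 128 cores quoted for the Jetson Nano; the variable names are only illustrative.

```python
# Figures taken from the deviceQuery listing and the text above.
multiprocessors = 16        # streaming multiprocessors reported by deviceQuery
cores_per_mp = 64           # CUDA cores per multiprocessor
jetson_nano_cores = 128     # Jetson Nano figure quoted in the text

total_cores = multiprocessors * cores_per_mp
ratio = total_cores / jetson_nano_cores

print(f"GTX 1650 CUDA cores: {total_cores}")
print(f"Relative to the Jetson Nano: {ratio:.0f}x")
```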

CUDA Samples

The CUDA toolkit includes a large selection of sample programs. In the Jetson Nano article we listed the vector addition sample. Compiling and running this on Windows is easy in Visual Studio. These samples are a great source of starting points for your own projects.
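For readers who have not seen that sample, the idea behind it is that each GPU thread computes one element of the result, using its block and thread numbers to work out which one. The sketch below imitates that indexing in plain Python on the CPU. It is not the CUDA sample itself, and `vector_add_emulated` and its parameters are names made up for illustration.

```python
def vector_add_emulated(a, b, threads_per_block=256):
    """Add two equal-length lists the way a CUDA grid would index them."""
    n = len(a)
    c = [0.0] * n
    # Round up so the grid has enough threads to cover every element.
    blocks = (n + threads_per_block - 1) // threads_per_block
    for block_idx in range(blocks):                   # stands in for blockIdx.x
        for thread_idx in range(threads_per_block):   # stands in for threadIdx.x
            i = block_idx * threads_per_block + thread_idx
            if i < n:                                 # guard threads past the end
                c[i] = a[i] + b[i]
    return c


a = [float(i) for i in range(1000)]
b = [2.0 * i for i in range(1000)]
result = vector_add_emulated(a, b)
print(result[:4])
```

On a real GPU the two loops disappear: each (block, thread) pair becomes its own hardware thread, and many of them can run concurrently.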

Programming for Portability

If you are writing a specialized program and want the maximum performance on specialized hardware, it makes sense to write directly to nVidia’s CUDA API. However, most software developers want their programs to run on as many computers out in the world as possible. The solution is to write to a higher level API that then has drivers for different popular hardware.

For instance, if you are creating a video game, you could write to the DirectX interface and then your program can run on newer versions of Windows on a wide variety of GPUs from different vendors. If you don’t want to be limited to Windows, you could use a portable graphics API like OpenGL. You can also go higher level and create your game in a system like Unreal Engine or Unity. These then have different drivers to run on DirectX, macOS, Linux, mobile devices or even in web browsers.

If you are creating an AI or Machine Learning application, you can use a library like TensorFlow or PyTorch, which have drivers for all sorts of different hardware. You just need to ensure their support is as broad as the market you are trying to reach.
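As a small illustration of what this looks like in practice, the sketch below picks a device name at run time. It assumes PyTorch may or may not be installed and falls back to the CPU if it is missing; `pick_device` is a hypothetical helper, not part of either library.

```python
def pick_device():
    """Return "cuda" if PyTorch can see an nVidia GPU, otherwise "cpu"."""
    try:
        import torch  # optional dependency; absent on many machines
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Running on: {device}")
```

The rest of the program can then hand this string to the library (for example via `torch.device(device)`) without caring which hardware it ended up on.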

If you are doing something more general or completely new, you can consider a general parallel processing library like OpenCL which has support for all sorts of devices, including the limited SIMD coprocessors included with most modern CPUs. A good example of a program that uses OpenCL is Folding@Home which I blogged on here.


Modern GPUs are powerful and flexible computing devices. They have high speed memory and often thousands of processing cores to work on your task. Libraries to make use of this computing power are getting better and better allowing you to leverage this horsepower in your applications, whether they are graphics related or not. Today’s programmers need to have the tools to harness these powerful devices, so the applications they are working on can reach their true potential.

Written by smist08

June 20, 2020 at 1:43 pm

Blocking Speculative Execution Vulnerabilities on ARM Processors

with 4 comments


We blogged previously about the Spectre and Meltdown exploits that use the side effects of a CPU’s speculative execution to learn secrets that should be protected. The other day, ARM put out a warning and white paper for another similar exploit called “Straight-line Speculation”. Spectre and Meltdown took advantage of branch prediction. This is the opposite case, where the processor’s speculation mechanism continues on past an unconditional branch instruction.

These side-channel attacks are hard to set up and exploit, but the place where they are most dangerous is in virtual environments, especially in the cloud. There, data can leak from one virtual machine to another, allowing an attacker to steal data from another cloud user. Running on a phone or PC isn’t so dangerous: if an attacker already has the ability to run a program, they have the power to do much more dangerous things, so why bother with this? This means that as ARM is used more often in cloud data centers, ARM vendors are going to have to be very sensitive to these issues.

Why Does a Processor Do This?

Modern CPUs process instructions ahead of the currently executing instruction for performance reasons. There is capacity for a CPU core to process several instructions simultaneously, and if one is delayed, say due to waiting for main memory, other instructions that don’t depend on it can be executed. But why would the processor bother to process instructions after an unconditional branch? After all, ARM added the return (RET) instruction to their 64-bit instruction set so that the CPU would know where execution continued next and the pipeline wouldn’t need to be flushed.

First, ARM provides reference designs, and many manufacturers, like Apple, take these and improve on them. This means that not all ARM processors work the same in this regard. Some might use a simpler brute-force approach that doesn’t interpret the instructions at all; others add intelligence so that their speculative pipeline interprets instructions and knows how to follow unconditional branches. The smarter the approach, the more silicon is required to implement it, and possibly the more heat is generated as it does its job.

The bottom line is that not all ARM processors will have this problem. Some will have limited pipelines and are safe, others interpret instructions and are safe. Then a few will implement the brute-force approach and have the vulnerability. The question then: what needs to be done about this?

ARM Instructions to Mitigate the Problem

Some ARM processors let you turn off speculative execution entirely. This is a very heavy handed approach and terrible for performance. This may make sense in some security sensitive situations, but usually is too heavy a hit on performance. The other approach is to use a number of ARM Assembly Language instructions that will halt speculative execution when encountered. Let’s look at them:

  • Data Synchronization Barrier (DSB): completes when all instructions before this instruction complete.
  • Data Memory Barrier (DMB): ensures that all explicit memory accesses before the DMB instruction complete before any explicit memory accesses after the DMB instruction start.
  • Instruction Synchronization Barrier (ISB): flushes the pipeline in the processor, so that all instructions following the ISB are fetched from cache or memory, after the ISB has been completed.
  • Speculation Barrier (SB): bars speculation of any instructions after this one, until after it is executed. Note: this instruction is optional and not available on all processors.
  • Clear the cache (MSR and SYS): using these instructions you can clear or invalidate all or part of the cache. However, these are privileged instructions and invalid from user space. Generally, these instructions are only used in the operating system kernel.

The various ARM security whitepapers have recommendations on how to use these instructions and the various developer tools like GCC and LLVM have compiler options to add these instructions to your code.

Adding a DSB/ISB instruction pair after an unconditional branch won’t affect the execution performance of your program, since the pair sits after the branch and is never reached in normal execution, but it will increase the program size by 64 bits (two 32-bit instructions) after each unconditional branch. The recommendation is to only turn on the compiler options to generate this extra code in routines that do something sensitive, like handling passwords or users’ sensitive data. Linux Kernel developers have to be cognizant of these issues to ensure kernel data doesn’t leak.
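To get a feel for the size cost, the arithmetic is simple enough to sketch. The function name and the branch count below are made-up values for illustration; only the figure of two 32-bit instructions per branch comes from the text.

```python
BYTES_PER_INSTRUCTION = 4   # each 64-bit ARM instruction is 32 bits wide
BARRIER_INSTRUCTIONS = 2    # the DSB/ISB pair


def barrier_overhead_bytes(unconditional_branches):
    """Extra code size from a DSB/ISB pair after each unconditional branch."""
    return unconditional_branches * BARRIER_INSTRUCTIONS * BYTES_PER_INSTRUCTION


# Hypothetical routine containing 250 unconditional branches.
extra = barrier_overhead_bytes(250)
print(f"{extra} extra bytes")
```

Even for a routine with hundreds of branches the cost is a couple of kilobytes, which is why limiting the option to sensitive routines keeps the overall growth small.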

If you are playing with a Raspberry Pi, you can add the DSB, DMB and ISB instructions and run them in user mode. If you add an SB instruction then you need to add something like “-march=armv8.2-a+sb” to your as command line. The Broadcom ARM processor used in the Pi doesn’t support this instruction, so you will get an “Illegal Instruction” error when you execute it.

Ideally, this should all be transparent to the programmer. Future generations of the ARM processor should fix these problems. These instructions were added so systems programmers can address security problems as they appear. Without these tools, there would be no remedy. Addressing the current crop of speculative execution exploits doesn’t guarantee that new ones won’t be discovered in the future. It’s a cat and mouse, or whack-a-mole, type of game between hackers and security professionals. The next generation of chips will be more secure, but then the ball is in the hackers’ court.


It’s fascinating to see how hackers can exploit the smallest side-effect in programs or hardware to either take control of a system or to steal data from it. These recent exploits show how security has to be taken seriously in all design aspects, whether hardware, microcode, system software or user applications. As new attacks are developed, everyone has to scurry to develop workarounds, solutions and mitigations. In this cat and mouse world, there is no such thing as absolute security and everyone has to be aware that there are always risks if you are attached to a network.

If you are interested in these sorts of topics, be sure to check out my book: Programming with 64-Bit ARM Assembly Language.

Written by smist08

June 12, 2020 at 5:13 pm

Raspberry Pi Gets 8Gig and 64-Bits

with one comment


The Raspberry Pi Foundation recently announced the availability of the Raspberry Pi 4 with 8-Gig of RAM along with the start of a beta for a 64-bit version of the Raspberry Pi OS (renamed from Raspbian). This blog post will look into these announcements, where the Raspberry Pi is today and where it might go tomorrow.

I’ve written two books on ARM Assembly Language programming, one for 32-bits and one for 64-bits. All Raspberry Pis have the ARM CPU as their brains. If you are interested in learning Assembly language, the Raspberry Pi is the ideal place to do so. My books are:

32- Versus 64-Bits

Previously the Raspberry Pi Foundation had been singing the virtues of their 32-bit operating system. It uses less memory than a 64-bit operating system and would run on every Raspberry Pi ever made. Further, if you really wanted 64-bits then you could run alternative versions of Linux from Ubuntu, Gentoo or Kali. The limitation of 32-bits is that you can only address 4-gig of memory, so this seems like a problem for an 8-gig device, but 32-bit Raspbian handles this. Each process is still confined to its own 32-bit address space (4-gig, of which the kernel reserves a part, leaving roughly 3-gig for the program), but different processes can be placed in different parts of physical memory, so all the 8-gig will get used if needed, just across multiple processes.
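The 4-gig limit falls straight out of the pointer width, as this short calculation shows. It is only back-of-the-envelope arithmetic, and the names are illustrative.

```python
GIB = 2 ** 30                      # bytes in one gigabyte (binary)

address_bits = 32
addressable = 2 ** address_bits    # distinct byte addresses a 32-bit pointer can hold
installed = 8 * GIB                # RAM on the new Raspberry Pi 4 model

print(f"32-bit address space: {addressable // GIB} Gig")

# Ceiling division: how many full address spaces it takes to cover all the RAM.
# This is a lower bound; the kernel reserves part of each process's address
# space, so in practice more processes are needed.
min_processes = -(-installed // addressable)
print(f"Processes needed to touch all 8 Gig: at least {min_processes}")
```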

The downside to this is that the ARM 64-bit instruction set is faster, the memory addressing is simpler without this extra Linux memory management and modern ARM processors are optimised around 64-bits and only maintain 32-bits for compatibility. There are no new improvements to the 32-bit instruction set and typically it can’t take advantage of newer features and optimizations in the processor.

The Raspberry Pi Foundation has released a beta version of the Raspberry Pi OS where the kernel is compiled for 64-bits. Many of the applications are still 32-bits but run fine in compatibility mode. This is just a band-aid until everything is compiled 64-bit. I’ve been running the 64-bit version of Kali Linux on my Raspberry Pi 4 with 4-gig for a year now and it is excellent. I think the transition to 64-bits is a good one and there will be many benefits down the road.

New Hardware

The new Raspberry Pi 4 model with 8-gig of RAM is similar to the older model. The change was facilitated by the recent availability of an 8-gig RAM chip in a compatible form factor. They made some small adjustments to the power supply circuitry to handle the slightly higher power requirements of this model. Otherwise, everything else is the same. If a 16-gig part becomes available they would be able to offer such a model as well. The current Raspberry Pi memory controller can only handle up to 16-gig, so to go higher, this would need to be upgraded as well.

The new model costs $75 USD with 8-gig of RAM. The 2-gig model is still only $35 USD. This is incredibly inexpensive for a computer, especially given the power of the Pi. Remember this is the price for the core unit, you still need to provide a monitor, cables, power supply, keyboard and mouse.

Raspberry Pi Limitations

For most daily computer usage the Raspberry Pi is fine. But what is the difference between the Raspberry Pi and computers costing thousands of dollars? Here are the main ones:

  1. No fast SSD interface. You can connect an SSD or mechanical harddrive to a Raspberry Pi USB port, but this isn’t as fast as if there was an M.2 or SATA interface. M.2 would be ideal for a Raspberry Pi given its compact size. Adding an M.2 slot shouldn’t greatly increase the price of a Pi.
  2. Poor GPU. On most computers GPUs can be expensive. For $75 or less you get an older, less powerful GPU. A better GPU, like ARM’s Mali GPU or some nVidia CUDA cores, would be terrific, but would probably double or triple the price of the Pi. Even with the poor GPU, the RetroPie game system is terrific.
  3. Slow memory interface. The Raspberry Pi 4 has LPDDR4 memory, but it doesn’t compare well to other computers with DDR4. This probably indicates a bottleneck in either the PCI bus or the Pi memory controller. I suspect this keeps the price low, but limits CPU performance by restricting the data flow to and from memory.

If the Raspberry Pi addressed these issues, it would be competitive with most computers costing hundreds of dollars more.


The 8-gig version of the Raspberry Pi is a powerful computer for only $75. Having 8-gig of RAM allows you to run more programs at once, have more browser windows open and generally have more work in progress at one time. Each year the Raspberry Pi hardware gets more powerful. Combine this with the forthcoming 64-bit version of the Raspberry Pi OS and you have a powerful system that is ideal for the DIY hobbyist, for people learning about programming, and even people using it as a general purpose desktop computer.

Written by smist08

June 5, 2020 at 4:36 pm