Stephen Smith's Blog

Musings on Machine Learning…

Archive for the ‘Software Architecture’ Category

Rust Never Sleeps



Rust is a relatively new programming language that first appeared in 2010. The purpose of Rust is to provide a systems programming language with the performance of C/C++ but without all the memory/pointer/multithreading/security problems. Version 6 of the Linux kernel will be adding support for writing modules in Rust and many companies have been writing their backend web servers in Rust.

I thought I’d have a look at programming in Rust and had a go on both a regular Raspberry Pi as well as with embedded Rust on the Raspberry Pi Pico. In this article we’ll look at a simple program to generate a Koch snowflake, a relatively simple program that still shows some of the features of Rust. The program listing is at the end of the article.
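Before diving into GTK, it's worth seeing the core recursion on its own. Here is a minimal, dependency-free sketch of the turtle-graphics logic the program uses; the names here are mine, not from the final listing:

```rust
use std::f64::consts::PI;

// Minimal turtle: tracks a position and heading, records each point visited.
struct Turtle {
    x: f64,
    y: f64,
    angle: f64,
    points: Vec<(f64, f64)>,
}

impl Turtle {
    fn move_dist(&mut self, dist: f64) {
        self.x += dist * (self.angle * PI / 180.0).cos();
        self.y += dist * (self.angle * PI / 180.0).sin();
        self.points.push((self.x, self.y));
    }

    fn turn(&mut self, angle_increment: f64) {
        self.angle += angle_increment;
    }
}

// One side of the snowflake: at level 0 draw a line, otherwise recurse on
// four sub-segments with turns of +60, -120, +60 degrees between them.
fn koch_side(t: &mut Turtle, level: u32, size: f64) {
    if level == 0 {
        t.move_dist(size);
    } else {
        koch_side(t, level - 1, size / 3.0);
        t.turn(60.0);
        koch_side(t, level - 1, size / 3.0);
        t.turn(-120.0);
        koch_side(t, level - 1, size / 3.0);
        t.turn(60.0);
        koch_side(t, level - 1, size / 3.0);
    }
}

fn main() {
    let mut t = Turtle { x: 0.0, y: 0.0, angle: 0.0, points: Vec::new() };
    for _ in 0..3 {
        koch_side(&mut t, 3, 400.0);
        t.turn(-120.0);
    }
    // Each side at level 3 expands into 4^3 = 64 segments, 192 in total.
    println!("{}", t.points.len());
}
```

At level 3 each of the snowflake's three sides expands into 64 line segments, which is why even a small recursion depth produces a detailed outline.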

How to Create Graphics?

Rust doesn’t include a giant do-everything runtime like C# has in .NET or Java has in the JRE. Instead, people outside of the core language group have developed wrapper libraries for various commonly used systems. For GUI and graphics, I chose to use the wrapper for GTK3, as this seems to have good support and the included Cairo 2D drawing library is fairly good.

This works, but you can’t implement GUI design patterns like MVC in Rust the usual way, because their memory management schemes are inherently unsafe or require garbage collection. Rust enforces the lifetime of every object and ensures there are no dangling references; this makes the sea-of-objects model most GUIs use, where each object holds references to a bunch of other objects, illegal. Rust recommends alternate design patterns, but if you are wrapping an existing legacy library then you will have to work around a few issues.
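For example, the usual Rust workaround for a shared object graph is reference counting plus interior mutability, rather than plain references. This is a hedged sketch with hypothetical Model/View names of my own, not GTK types:

```rust
use std::cell::RefCell;
use std::rc::Rc;

// A model shared by two "views", the kind of object graph MVC needs.
// Rc gives shared ownership; RefCell moves the borrow checks to run
// time so that either view can mutate the shared model.
struct Model {
    clicks: u32,
}

struct View {
    model: Rc<RefCell<Model>>,
}

impl View {
    fn on_click(&self) {
        self.model.borrow_mut().clicks += 1;
    }
}

fn main() {
    let model = Rc::new(RefCell::new(Model { clicks: 0 }));
    let view_a = View { model: Rc::clone(&model) };
    let view_b = View { model: Rc::clone(&model) };
    view_a.on_click();
    view_b.on_click();
    // Both views updated the one shared model.
    println!("{}", model.borrow().clicks);
}
```

This satisfies the borrow checker, at the cost of run-time borrow checks and reference counts instead of compile-time guarantees.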

Installing Rust and Required Libraries

Rust is easy to install, but it only installs Rust components. Any wrapped components need to be installed separately, in our case GTK3’s development components:

    sudo apt-get install libgtk-3-dev

Now we can install Rust using rustup. The current command from their website is:

    curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

But best to copy the command from their website, in case it’s been updated. Then at the prompt select “1 – proceed with installation” to download and install Rust.

The installation adds a number of environment variables to the shell initialization scripts. To proceed, you need to either start a new terminal window or run:

    source "$HOME/.cargo/env"

Now we can create our Rust project, first creating a folder for our Rust projects and then using Rust’s cargo utility to create a project.

    mkdir rust
    cd rust
    cargo new fractal1 --bin --vcs none

The “--vcs none” part stops cargo from also creating a git repository. We now have a working Rust program, which if we run with:

    cargo run

will compile the simple program and print “Hello, world!”. The first time you run it, the compilation takes a while, since there are a number of system components that need to be compiled once.

To produce the Koch snowflake fractal, copy the Cargo.toml file at the bottom of this article over the one created by cargo, and then copy the main.rs listing over the generated file in the src folder.

With these in place, type:

    cargo run

Then after everything compiles, the program should run and produce a window with the Koch snowflake at level 3.

Notes on the Code

The GTK3 wrapper is from the gtk-rs project. I based the program on one of the examples in their Github repository. I found these examples to be the best source of information on how to use this library, as the documentation tends to be of a reference nature and doesn’t show how to put the parts together into a working program.

If you’ve seen any of my other blog articles where I created a Koch snowflake as an example, then you might wonder why I don’t store the Cairo drawing context in the TurtleGraphics struct rather than having it as a parameter to most method calls. The reason is Rust’s rules around avoiding dangling pointers, which require you to prove the lifetime of every object. This means you can’t just store a reference in an object without guaranteeing that what it refers to outlives the reference. In most practical cases this is difficult, and the hoops to jump through are far worse than just passing it as a parameter. Not as object oriented perhaps, but on the other hand, no dangling pointers and no garbage collection.
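To make the trade-off concrete, here is a small sketch; DrawContext is a stand-in of my own for Cairo’s Context, not the real type:

```rust
// Stand-in for the Cairo drawing context.
struct DrawContext {
    ops: Vec<String>,
}

// Storing a reference drags a lifetime parameter through the whole API:
// TurtleBorrowing<'a> must provably not outlive the DrawContext it holds,
// and the context stays exclusively borrowed for as long as it exists.
struct TurtleBorrowing<'a> {
    cr: &'a mut DrawContext,
}

impl<'a> TurtleBorrowing<'a> {
    fn line(&mut self) {
        self.cr.ops.push("line_to".to_string());
    }
}

// Passing the context as a parameter needs no lifetime annotations at all.
struct TurtlePlain;

impl TurtlePlain {
    fn line(&self, cr: &mut DrawContext) {
        cr.ops.push("line_to".to_string());
    }
}

fn main() {
    let mut cr = DrawContext { ops: Vec::new() };
    {
        let mut t = TurtleBorrowing { cr: &mut cr };
        t.line(); // cr is exclusively borrowed while t exists
    }
    TurtlePlain.line(&mut cr);
    println!("{}", cr.ops.len());
}
```

Both work here, but the borrowing version locks up the context for the struct’s whole lifetime, which is exactly what a long-lived GUI object can’t afford.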


Rust wants to replace C as a system programming language for projects such as the Linux kernel, web servers and even video games. Rust isn’t a simple language, it is large and some of its parts are quite complex. The safety rules to ensure that pointers are correct place a lot of limitations on what you can do as a programmer. The Rust parts of your program are safer than corresponding C programs, but with the state of things today, your Rust program will wrap a lot of components written in C/C++ that could be doing all sorts of bad things with pointers. The other limitation is that you can’t easily call Rust objects from other languages, limiting some interoperability scenarios.

I’m going to continue learning and playing with Rust. It has a number of appealing features, but I’m not sure if its complexity and strict enforcement of lifetime and ownership rules are too constraining.

Source Code Listings

The Cargo.toml file:

[package]
name = "fractal1"
version = "0.1.0"
edition = "2021"

# See more keys and their definitions at
# https://doc.rust-lang.org/cargo/reference/manifest.html

[dependencies]
gtk = "*"
gio = "*"
gdk = "*"
cairo = "*"

The main.rs file follows:

use std::f64::consts::PI;
use gtk::prelude::*;
use gtk::DrawingArea;
use gtk::cairo::{Context};

const STARTX:f64 = 50.0;
const STARTY:f64 = 130.0;
const STARTANGLE:f64 = 0.0;
const LEVEL:i64 = 3;

fn build_ui(application: &gtk::Application) {
    drawable(application, 500, 500, |_, cr| {
        cr.set_source_rgb(0.0, 0.0, 0.0);
        cr.move_to(STARTX, STARTY);
        let tg = TurtleGraphics{x:STARTX, y:STARTY, angle:STARTANGLE};
        let mut kf = KochFlake{tg: tg};
        kf.koch_snowflake(cr, LEVEL);
        cr.stroke().expect("Invalid cairo surface state");
        Inhibit(false)
    });
}

fn main() {
    // Any reverse-DNS style application id will do here.
    let application = gtk::Application::new(
        Some("com.example.fractal1"),
        Default::default(),
    );
    application.connect_activate(build_ui);
    application.run();
}

pub fn drawable<F>(application: &gtk::Application, width: i32, height: i32, draw_fn: F)
where
    F: Fn(&DrawingArea, &Context) -> Inhibit + 'static,
{
    let window = gtk::ApplicationWindow::new(application);
    let drawing_area = Box::new(DrawingArea::new)();
    drawing_area.connect_draw(draw_fn);
    window.set_title("Koch Snowflake");
    window.set_default_size(width, height);
    window.add(&drawing_area);
    window.show_all();
}

struct KochFlake {
    tg: TurtleGraphics,
}

impl KochFlake {
    pub fn koch_snowflake(&mut self, cr:&Context, level:i64)
    {
        self.tg.turn( 60 );
        self.koch_snowflake_side(cr, level, 400);
        self.tg.turn( -120 );
        self.koch_snowflake_side(cr, level, 400);
        self.tg.turn( -120 );
        self.koch_snowflake_side(cr, level, 400);
    }

    // Each side is either a straight line (level 0) or four shorter
    // sides with turns of 60, -120, 60 degrees between them.
    pub fn koch_snowflake_side(&mut self, cr:&Context, level:i64, size:i64)
    {
        if level == 0
        {
            self.tg.move_dist( cr, size );
        }
        else
        {
            self.koch_snowflake_side( cr, level - 1, size / 3 );
            self.tg.turn( 60 );
            self.koch_snowflake_side( cr, level - 1, size / 3 );
            self.tg.turn( -120 );
            self.koch_snowflake_side( cr, level - 1, size / 3 );
            self.tg.turn( 60 );
            self.koch_snowflake_side( cr, level - 1, size / 3 );
        }
    }
}

struct TurtleGraphics {
    x:f64,
    y:f64,
    angle:f64,
}

impl TurtleGraphics {
    pub fn move_dist(&mut self, cr:&Context, dist:i64)
    {
        self.x = self.x + dist as f64 * f64::cos( self.angle * PI / 180.0);
        self.y = self.y + dist as f64 * f64::sin( self.angle * PI / 180.0);
        cr.line_to(self.x, self.y);
    }

    pub fn turn(&mut self, angle_increment:i64)
    {
        self.angle = self.angle + angle_increment as f64;
    }
}

Written by smist08

August 8, 2022 at 12:37 pm

Is Apple Silicon Really ARM?



Apple recently announced they were transitioning their line of Mac computers from using Intel CPUs to what they announced as Apple Silicon. This led to a lot of confusion, since most people expected them to announce they would transition to the ARM CPU. The confusion arises from Apple’s market-speak where everything has to be Apple this or Apple that. Make no mistake, the heart of Apple Silicon is the ARM CPU, but Apple Silicon refers to Apple’s System on a Chip (SoC), which includes a number of additional components along with the ARM CPUs. In this blog we’ll look at what Apple Silicon really is and discuss some of the current misconceptions around it.

ARM at the Core

My book “Programming with 64-Bit ARM Assembly Language” includes examples of adding ARM Assembly Language to Apple iOS apps for the iPhone and iPad. The main processors in iPhones and iPads have been redesignated as Apple Silicon chips, and the SoC used in the latest iPad is being used in Apple’s developer preview hardware, a new Mac Mini.

In fact, one of my readers, Alex vonBelow, converted all the source code for my book to run on the prototype Apple hardware. The Github repository of his work is available here. Again, this all demonstrates that ARM is at the center and is the brains behind Apple Silicon.

Other bloggers have claimed that Apple Silicon does not in fact use ARM processors, since Apple customizes their ARM CPUs rather than using off-the-shelf designs from ARM Holdings. This is true, but in fact most ARM licensees do the same thing, adding their own optimizations to gain competitive advantage. The key point is that Apple licenses the ARM CPU ISA (Instruction Set Architecture), which is the format and syntax of all the Assembly Language instructions the CPU processes, and then Apple uses ARM’s software/hardware verification suite to ensure their designs produce correct results.

These other bloggers claim that Apple may stray from the ARM ISA in the future, which of course is possible, but highly unlikely. One of the keys to Apple’s success is the way they leverage open source software for much of both their operating system and development tools. Both MacOS and iOS are based on the open source Mach kernel from Carnegie Mellon University, combined with components from BSD Unix. The XCode development environment looks Apple proprietary, but to do the actual compiling, it uses the open source tools from GCC or LLVM. Apple has been smart to concentrate their programming resources on areas where they can differentiate themselves from the competition, for instance making everything easier to use, and then using open source components for everything else. If Apple strays from the ARM ISA, then they have to take all this on themselves and will end up like Microsoft with the Windows 10 mess, where they can’t compete effectively with the much larger open source community.

Apple Silicon is an SoC

Modern integrated circuits can contain billions of transistors on a single small chip. This means a single IC has room to hold multiple CPU cores along with all sorts of other components. Early PCs contained hundreds of ICs; nowadays, they contain a handful, with most of the work being done by a single chip. This has been key to cell phones, allowing them to be such powerful computers in such a small package. Similarly, the $35 Raspberry Pi is a credit card sized computer where most of the work is done by a single SoC.

Let’s look at what Apple is cramming into their Apple Silicon SoC.

  1. A number of high performance/high power usage ARM CPU cores.
  2. A number of low power usage/lower performance ARM CPU cores.
  3. A Graphics Processing Unit (GPU).
  4. An Artificial Intelligence Processing Unit.
  5. An audio DSP processor.
  6. Power/performance management.
  7. A cryptographic acceleration unit.
  8. Secure boot controller.

For some reason, Apple marketing doesn’t like to mention ARM, which is too bad. But any programmer knows that to program for Apple Silicon, your compiler has to generate ARM 64-bit instructions. Similarly, if you want to debug a program for iPads, iPhones or the new Macs, then you have to use an ARM capable debugger such as GDB and you have to be able to interpret ARM instructions, ARM registers and ARM memory addressing modes.

It isn’t unusual for marketing departments to present technical topics to the general population in a way that doesn’t make sense to programmers. Programmers have to stick to reality or their programs won’t work, so they ignore most of what comes out of marketing departments. If you attended Apple’s WWDC a few months ago, you could see the programmers struggling to stick to the marketing message and every now and then having to mention ARM processors.


The transition of Macs from Intel to Apple Silicon is an exciting one, but don’t be fooled by the marketing spin, this is a transition from Intel to ARM. This is Apple going all-in on the ARM processor, using the same technology to power all their devices including iPhones, iPads and Mac computers.

If you want to learn more about the ARM processor and programming the new Apple devices, check out my book: Programming with 64-Bit ARM Assembly Language. Available directly from Apress, along with all the main booksellers.

Written by smist08

July 31, 2020 at 11:31 am

Exciting Days for ARM Processors



ARM CPUs have long dominated the mobile world, nearly all Apple and Android phones and tablets utilize some model of ARM processor. However Intel and AMD still dominate the laptop, desktop, server and supercomputer markets. This week we saw a number of announcements where this will likely change:

  1. Apple announced they are going to transition all Mac computers to the ARM processor over two years.
  2. Ampere announced a 128-core server ARM processor.
  3. Japan now has the world’s most powerful supercomputer and it is based on 158,976 ARM Processors.

In this blog post, we’ll look at some of the consequences of these moves.

Apple Macs Move to ARM

The big announcement at this year’s Apple Worldwide Developers Conference is that Apple will be phasing out Intel processors in their Mac desktop and laptop computers. You wouldn’t know they are switching to ARM processors from all their marketing speak, which exclusively talks about the switch from Intel to Apple Silicon. But the heart of Apple Silicon is a set of ARM CPU cores. The name Apple Silicon refers to the System on a Chip (SoC) that they are building around the ARM processors. These SoCs will include a number of ARM cores, a GPU, an AI processor, a memory manager and other support functions.

Developers can pay $500 to get a Mac mini running the same ARM CPU as the latest iPad Pro; the downside is that you need to give this hardware back when the real systems ship at the end of this year. It is impressive that you can get a working ARM Mac running MacOS along with a lot of software already, including the XCode development system. One cool feature is that you can run any iPad or iPhone app on your Mac, now that all Apple devices share the same CPU.

The new version of MacOS for ARM (or Apple Silicon) will run Intel compiled programs in an emulator, but the hope from Apple is that developers will recompile their programs for ARM fairly quickly, so this won’t be needed much. The emulation has some limitations, in that it doesn’t support Intel AVX SIMD instructions or instructions related to virtualization.

For developers converting their applications, if they have Assembly Language code, this will have to be converted from Intel Assembly to ARM Assembly and of course a great resource to do this is my book:

I’m excited to see what these new models of ARM based Apple computers look like. We should see them announced as we approach the Christmas shopping season. Incorporating all the circuitry onto a single chip will make these new computers even slimmer, lighter and more compact. Battery life should be far longer but still with great performance.

I think Apple should be thanking the Raspberry Pi world for showing what you can do with SoCs, and for driving so much software to already be ported to the ARM processor.

One possible downside of the new Macs is that Apple keeps talking about the new secure boot feature only allowing Apple-signed operating systems to boot as a security feature. Does this mean we won’t be able to run Linux on these new Macs, except using virtualization? That would be a big downside, especially down the road when Apple drops support for them. Apple makes great hardware that keeps on working long after Apple no longer supports it. You can get a lot of extra life out of your Apple hardware by installing Linux and keeping on trucking with new updates.

New Ampere Server ARM Chips

Intel and AMD have long dominated the server and data center markets, but that is beginning to change. Amazon has been designing their own ARM chips for AWS and Ampere has been providing extremely powerful ARM based server chips for everyone else. Last year they announced an 80-core ARM based server chip which is now in production. Just this week they announced the next generation which is a 128-core ARM server chip.

If you aren’t interested in a server, but would like a workstation containing one of these chips then you could consider a computer from Avantek such as this one.

These are just one of several powerful ARM based server chips coming to market. It will be interesting to see if there is a lot of uptake of ARM in this space.

Japan’s Fugaku ARM Based Supercomputer is Number One

Japan just took the number one spot in the list of the world’s most powerful supercomputers. The Fugaku supercomputer is located in Kobe and uses 158,976 Fujitsu 48-core ARM SoCs. Of course this computer runs Linux and currently is being used to solve protein folding problems around developing a cure for COVID-19, similar to folding@home. This is a truly impressive warehouse of technology and shows where you can go with the ARM CPU and the open source Linux operating system.


ARM conquered the mobile world some years ago, and now it looks like ARM is ready to take on the rest of the computer industry. Expect to see more ARM based desktop and laptop computers than just Macs. Only time will tell whether this is a true threat to Intel and AMD, but the advantage ARM has over previous attempts to unseat Intel as king is that they already have more volume production than Intel and AMD combined. The Intel world has stagnated in recent years, and I look forward to seeing the CPU market jump ahead again.

Written by smist08

June 24, 2020 at 11:28 am

Learning Electronics with Arduino



I’ve worked with the Raspberry Pi quite a bit, written books about it and blogged fairly extensively about it. However, many people consider the Pi overkill. After all it runs full versions of Linux and usually requires a keyboard, mouse, monitor and Internet connection. Even at $35, many people consider it too expensive for their projects.

Parallel to the Raspberry Pi, there is the Arduino project, which is an open source software and hardware project for microcontrollers. The Raspberry Pi includes a full ARM 64-bit processor with up to 4Gig of RAM. The Arduino is based on various microcontrollers that are often 8-bit, with only 32KB of flash memory and a few KB of RAM. These microcontrollers don’t run a full operating system, they just contain enough code to start your program, whether burned into their flash memory or downloaded via serial port from a PC.

There are a great many Arduino compatible boards that can perform all sorts of functions. A typical Arduino has a set of external pins similar to the Raspberry Pi’s GPIO ports. The big advantage of the Arduino is that they are low cost, simple to program and low power.

In this article, I’ll look at the official Arduino Starter Kit.

Package Contents

The package contains an Arduino Uno microcontroller board, a breadboard and a large assortment of discrete electronic components. It contains a project book with 15 projects you can build out of all these components.

The Arduino Uno contains the Atmel ATmega328P microcontroller, with 32KB of flash memory and 2KB of RAM. These are flexible, low cost processors that are used in many embedded applications.

Programming the Arduino

You can program the Arduino with any compiler that generates the correct machine code for the processor you’ve chosen (or even program it in Assembly Language). However, most people use the Arduino IDE, which is derived from the Processing IDE; Arduino programs are called sketches. You write your programs in a limited version of C (with a few extensions). The IDE then knows how to compile and download your program to a great many Arduino boards so you can test it out.

There are libraries to provide support for common functions like controlling a servo motor or an LED character display. There is a capacitive sensor library to measure a circuit’s capacitance. There are hundreds of sensors you can wire up to your Arduino, and there are libraries available for most of these, making the programming to read or control them easy.

You can debug your program by sending strings back to the PC via a serial port which you can monitor in the IDE. You can also flash a couple of LEDs.

People might point out that C is really old and shouldn’t be used due to its use of pointers and such. However, C is still the best language for low level programming like this. It is also used for nearly all systems programming; Linux is entirely written in C. Learning C is both useful in itself and acts as a jump start to the newer languages based on it, such as Java or C#.

Learning Electronics

I first became interested in electronics when I took Electricity 9 in junior high school. We learned the basics of soldering and I built a Radio Shack transistor radio from a kit. With this course I could fix some basic wiring issues around the house and occasionally fix appliances, and perhaps fix a TV by replacing a tube. The difficulty when I was younger was that it was a lot of work to build anything, since the whole thing needed to be built from discrete components and equipment was expensive.

Today things are much easier. You can build a lot of simple circuits attached to the Arduino, where much of the work is done in software on the microcontroller. Things are much cheaper too: you can purchase a complete Arduino starter kit for under $100, and test equipment is far less expensive. You can pick up a good digital multimeter for under $20 and there are even good oscilloscopes for around $300. There are many simple integrated circuits like optocouplers and H-bridges to further simplify your circuits.

The Arduino is low power, so you can’t electrocute yourself. It has short detection, so if your circuit contains a short circuit, the Arduino shuts down. This all allows you to safely play with electronic components without any risk to yourself or the Arduino.

The starter kit projects include several techniques to connect the Arduino up to external devices safely. For instance controlling a DC motor with either a transistor used as a switch or via an H-bridge. Then how to interface to another device using an optocoupler to keep both devices completely electrically separate.


Arduino provides a great platform to both learn electronics and to learn programming. The IDE is simple to use and helps with learning. Building circuits attached to an Arduino is a safe place to experiment and learn without risking damaging expensive components or equipment. I found working through the 15 labs in the Arduino Projects Book that accompanied the starter kit quite enjoyable and I learned quite a few new things.


ARM Processor Modes



Last time we discussed how ARM Processor interrupts work, and we mentioned that interrupts switch the processor from user mode to an operating system mode, but we never discussed what exactly the ARM Processor modes are. In this article we will discuss the ARM Processor modes, why they exist and when they are used.

The available processor modes vary by ARM model, so we will look at those commonly available. For the exact details on any specific ARM processor you need to check in that processor’s reference manual.

ARM Processor Modes

The purpose of processor modes is to regulate access to memory and hardware resources, so that a process initiated by a specific user can’t access the memory of other processes or access hardware it doesn’t have permission for. The operating system can add quite refined permissions, so users only have access to certain files, read-only access to certain files, or other granular rights. This might all sound like overkill for a Raspberry Pi, but all versions of Linux, including Raspbian, support multiple users and multiple processes all logged in and running at once. Further, you might set up specific users and groups to grant exact rights to processes like web servers, to help protect your system from malicious hackers or program bugs causing havoc.

Most ARM processors have two security levels for processes. PL0 is for user mode programs and then PL1 is for operating system code. Newer ARM processors used in servers have a third level PL2 for virtualization hypervisors, so they can keep their various hosted operating systems completely separate. There is also an optional ARM build for secure computing, if this is present then there is an even higher PL3 level that is used for a system security monitor.

The following table, from the ARM Processor Reference Manual, lists the available processor modes. There are quite a few and we’ll talk about them all, but the two main ones are user mode for regular programs and system mode for the operating system.

Let’s list all the processor modes and describe what each is used for:

  • User – regular programs that can access the resources they have permission for.
  • FIQ – the processor goes into this mode when handling a fast interrupt. The operating system provides this code and it has access to all operating system resources.
  • IRQ – the processor goes into this mode when handling a regular interrupt. The operating system provides this code and it has access to all operating system resources.
  • Supervisor – when a user mode program executes an SVC Assembly instruction, which calls an operating system service, the processor switches to this mode, allowing the code to operate at a privileged level for the duration of the call.
  • Monitor – if you have an ARM processor with security extensions then this mode is used to monitor the system.
  • Abort – if a user mode program tries to access memory it isn’t allowed, then this mode is entered to let the operating system intervene and either terminate the program, or send the program a signal.
  • Hyp – this is hypervisor mode, an optional ARM extension. It allows the virtualization hypervisor to run at a more privileged level than the operating systems it is virtualizing.
  • Undefined – if a user mode program tries to execute an undefined or illegal Assembly instruction, then this mode is entered and the operating system can terminate the program or send it a signal.
  • System – this is the mode that the operating system runs at. Processes that the operating system considers part of itself run at this level.

The mode bits in the table are the values that get set in the low order bits of the Current Program Status Register (CPSR). This way the operating system can check which mode it is in and act accordingly when appropriate.

ARM Boot Process

When powered on, the ARM processor starts up by taking the reset exception. This causes the reset handler code to execute, which will typically be a branch to the code that starts the operating system. After reset the processor is running in Supervisor mode, so the boot code has full privileges as it initiates the operating system boot process. The operating system changes the processor mode by directly manipulating the bits in the CPSR with code like:

MRS   R0, CPSR        @ Move the CPSR to R0
BIC   R0, R0, #0x1F   @ clear the mode bits
ORR   R0, R0, #0x13   @ Set the mode bits to 10011 (SVC mode)
MSR   CPSR, R0        @ Update the CPSR

Note that reading and writing the CPSR like this are privileged instructions and only available to programs running in PL1 or better. Besides updating the processor mode, the operating system uses these to save a program’s state when doing multitasking. Saving the registers is easy, but the CPSR must also be preserved so as not to disrupt the running process.
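The BIC/ORR pair above is plain bit masking, which can be sketched in ordinary code. The mode values come from the ARM documentation; this is just the arithmetic, not privileged processor state:

```rust
// The low 5 bits of the CPSR hold the processor mode.
const MODE_MASK: u32 = 0x1F;
const MODE_IRQ: u32 = 0b10010; // 0x12
const MODE_SVC: u32 = 0b10011; // 0x13

// Mirror of: BIC R0, R0, #0x1F  then  ORR R0, R0, #0x13
fn set_mode(cpsr: u32, mode: u32) -> u32 {
    (cpsr & !MODE_MASK) | mode
}

fn main() {
    // Pretend CPSR: some condition flag bits set, currently in IRQ mode.
    let cpsr = 0x6000_0000 | MODE_IRQ;
    let updated = set_mode(cpsr, MODE_SVC);
    assert_eq!(updated & MODE_MASK, MODE_SVC);
    assert_eq!(updated & !MODE_MASK, 0x6000_0000); // flags untouched
    println!("{:#010x}", updated); // prints 0x60000013
}
```

Clearing the field before setting it is what keeps the rest of the CPSR, including the condition flags, intact.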


This was a quick introduction to the ARM Processor modes. You don’t need to know this for application programming, but if you are interested in writing an operating system or if you are interested in how operating system support works for the ARM processor then this is a starting point.

If you are interested in learning more about ARM Assembly Language Programming, please check out my book, the details are available here.

Written by smist08

December 2, 2019 at 12:37 pm

Interrupting the ARM Processor



I recently published my book: “Raspberry Pi Assembly Language Programming”. Information on my book is available here. In my book, I cover the Assembly language instructions you can use when running under Raspbian, so we could always have working examples that we could play with. However there is more to the ARM processor, the details of which are transparently handled by Linux, so we don’t need to (and can’t) play with these. In this article we’ll start to look at how the ARM Processor deals with interrupts. This will be a simplified discussion, so we don’t get bogged down in all the differences between different Raspberry Pi models, with how the various interrupt controllers work, the various ARM operating modes or consider interactions with the virtual memory manager.

What Are Interrupts?

Interrupts started as a mechanism for devices to report they have data asynchronously. For instance, if you ask the disk drive to read a block of data, then the computer can keep doing other work (perhaps suspend the requesting process and continue running another process), until the disk drive sends the processor an interrupt telling it that it has retrieved the data. The ARM processor then needs to quickly process this data so that the disk drive can go on to other work. Some devices need to have the data handled quickly or it will be overwritten by new data being processed by the device. Most hardware devices have a limited buffer or queue of data they can hold before it is overwritten.

The interrupt mechanism has been used for additional purposes like reporting memory access errors and illegal instruction errors. There are also a number of system timers that send interrupts at regular intervals, these can be used to update the system clock, or preempt the current task, to give other tasks a turn under the operating system’s multitasking algorithm. Operating system calls are often implemented using interrupts, since a side effect of an interrupt being triggered is to change the operating state of the processor from user mode to system mode, allowing the operating system to run at a more privileged level. You can see this described in Chapter 7 of my book, on Linux Operating System Services.

How Are Interrupts Called?

If a device receives data, it notifies the interrupt controller, which then maps the interrupt to one of the ARM processor interrupt codes. Transfer of control immediately switches to the code contained in a specific memory location. Below is a table of the various interrupts supported by one particular ARM model. In Raspbian, the memory offsets are added to 0xffff0000 to get the actual address.

Each ARM instruction is 32-bits in size, so each slot in the interrupt table can hold a single ARM instruction; hence this has to be a branch instruction, or an instruction that does something to the program counter. The exception is the last one, the FIQ interrupt, which is the “fast” interrupt. Since fast interrupts need fast processing, it is deemed too slow to execute a branch instruction first, so the interrupt handler can be placed directly at this address, which is why it’s the last one in the table.

Some example instructions you might see in this table are:

B myhandler @ will be a PC relative address
MOV PC, #0x1234 @ has to be a valid operand2
LDR PC, [PC, #100] @ load PC with value from nearby memory

You can read about operand2 and the details of these instructions in my book.

Interrupt Calling Convention

When you call a regular function, there is a protocol, or calling convention that specifies who is responsible for saving which registers. Some are saved by the caller if they need them preserved and some have to be saved by the callee if it uses them. There are conventions on how to use the stack and how to return values from functions. With interrupt routines, the code that is interrupted can’t do anything. It’s been interrupted and has no knowledge of what is happening. Preserving the state of things is entirely handled by a combination of the ARM CPU and the interrupt handler code.

The ARM processor has a bank of extra (shadow) registers that it will automatically switch with the regular registers when an interrupt happens. This is especially necessary to preserve the Control Program Status Register (CPSR). The block diagram below shows the banks of registers for the various interrupt states.

Consider the code:

CMP R6, #66
BEQ _loop

If the interrupt occurs between these two instructions, then the CPSR (which holds the result of the comparison) must be preserved or the BEQ instruction could do the wrong thing. This is why the ARM processor switches the CPSR with one of the SPSR registers when the interrupt occurs and then switches them back when the interrupt service routine exits.

Similarly there are shadow registers for the Stack Pointer (SP) and Link Return (LR) register.

For fast (FIQ) interrupts, the ARM CPU also switches registers R8-R12, so the FIQ interrupt handler has five registers to use, without wasting time saving and restoring things.

If the interrupt handler uses any other registers then it needs to store them on the stack on entry and pop them from the stack before returning.

Interrupting an Interrupt?

When an interrupt occurs, the ARM processor disables interrupts until the interrupt handler returns. This is the simplest case, since the operating system writer doesn’t have to worry about their interrupt routine being interrupted. This works fine as long as the handler finishes quickly, but some interrupt handlers have quite a bit of work to do, for instance if a device returns 4K of data to be processed. Notice that the shadow registers have separate copies for each type of interrupt. This way, if you are handling an IRQ interrupt, it is easy to enable FIQ interrupts and allow the IRQ handler to be interrupted by the higher priority FIQ. Newer ARM processors have support for more sophisticated nested interrupt handling, but that can be the topic for another article. Linux can disable or enable interrupts as it needs, for instance to finish initialization on a reboot before turning on interrupts.

Returning from an Interrupt

The instruction you use to return from an interrupt is interesting. It is:

SUBS R15, R14, #4

This instruction takes the Link Return register, subtracts 4, and places the result in the Program Counter (PC). The ARM processor recognizes this special form, so it can also swap the shadow registers back.

Normally we would just move LR to PC to return, so why subtract 4? The reason is the instruction pipeline. Remember the classic pipeline model has three steps: first the instruction is fetched from memory, then it’s decoded, and then it’s executed. When we were interrupted we lost the last instruction decode, so we need to go back and do it again.

This is a bit of a relic, since newer ARM processors have much more sophisticated pipelines, but once this was ingrained in enough code the ARM processor had to respect it and stick with this scheme.


This was a quick introductory overview of how the ARM processor handles interrupts. You don’t need to know this unless you are working on the ARM support in the Linux kernel, or you are creating your own operating system. Still it is interesting to see what is going on under the hood as the ARM Processor and Linux operating system provide all their services to make things easy for your programs running in user mode.

Written by smist08

November 22, 2019 at 11:33 am

Out-of-Order Instructions

leave a comment »


We think of computer processors as executing a set of instructions one at a time, in sequential order. As programmers, this is exactly what we expect the computer to do, and the idea of the computer executing our carefully written code in a different order terrifies us: we would expect our program to fail, producing wrong results or crashing. However, we see manufacturers claiming their processors execute instructions out-of-order and that this is a feature that improves performance. In this article, we’ll look at what is really going on here and how it can benefit us, without causing too much fear.


ARM defines the Instruction Set Architecture (ISA), which specifies the Assembly Language instruction set. ARM provides some reference implementations, but individual manufacturers can take these, customize them, or develop their own independent implementation of the ARM instruction set. As a result the internal workings of ARM processors differ from manufacturer to manufacturer. A main point of difference is in performance optimizations. Apple is very aggressive in this regard, which is why the ARM processors in iPads and iPhones beat the competition. This means the level of out-of-order execution differs between manufacturers; further, it is much more prevalent in newer ARM chips. As a result, the examples in this article will apply to a selection of ARM chips but not all.

A Couple of Simple Cases

Consider the following small bit of code to multiply two numbers then load another number from memory and add it to the result of the multiplication:

MUL R3, R4, R5 @ R3 = R4 * R5
LDR R6, [R7]   @ Load R6 with the memory pointed to by R7
ADD R3, R6     @ R3 = R3 + R6

The ARM processor is a RISC processor and its goal is to execute each instruction in one clock cycle. However, multiplication is an exception and takes several clock cycles longer, due to the loop of shifting and adding it has to perform internally. The load instruction doesn’t rely on the result of the multiplication and doesn’t involve the arithmetic unit, so it’s fairly simple for the ARM processor to see this and execute the load while the multiply is still churning away. If the memory location is in cache, chances are the LDR will complete before the MUL, and hence we say the instructions executed out-of-order. The ADD instruction then needs the results of both the MUL and the LDR, so it must wait for both to complete before executing its addition.

Consider another example of three LDR instructions:

LDR R1, [R4] @ memory in swap file
LDR R2, [R5] @ memory not in cache
LDR R3, [R6] @ memory in cache

Here the memory being loaded by the first instruction has been swapped out to secondary storage, so loading it is going to be slow. The second memory location is in regular memory. DDR4 memory, like that used in the new Raspberry Pi 4, is pretty fast, but not as fast as the CPU, and the memory controller is also busy fetching instructions to process, hence this second LDR might take a couple of cycles to execute: it makes a request to the memory controller and its request is queued with everything else going on. The third instruction assumes the memory is in the CPU cache and hence is processed immediately, so this instruction really does take only 1 clock cycle.

The upshot is that these three LDR instructions could well complete in reverse order.

Newer ARM processors can look ahead through the instructions for independent instructions to execute; the size of this pool determines how out-of-order things can get. The important point is that instructions with dependencies can’t start early, so to the programmer it looks like the code is executing in order, and all this magic is transparent to the correct execution of the program.

Since the CPU is executing all these instructions at once, you might wonder what the value of the program counter register (PC) is. This register has a very precisely defined value, since it is used for PC-relative addressing, so the PC can’t be affected by out-of-order execution.


All newer ARM processors include floating-point coprocessors and NEON vector coprocessors. The instructions that execute on these usually take a few instruction cycles to complete. If the instructions that follow a coprocessor instruction are regular ARM instructions and don’t rely on the results of coprocessor operations, then they can continue to execute in parallel with the coprocessor. This is a handy way to get more code parallelism going, keeping all aspects of the CPU busy. Intermixing coprocessor and regular instructions is another great way to leverage out-of-order execution to get better performance.

Compilers and Code Generation

This suggests that if a compiler code generator or an Assembly Language programmer rearranges some of their instructions, they can get more things happening at once in parallel, giving the program better performance. ARM Holdings contributes to the GNU Compiler Collection (GCC) to fully utilize the optimizations present in their reference implementations. In the ARM-specific options for GCC, you can select the ARM processor version that matches your target and get more advanced optimizations. Since Apple creates their own development tools under Xcode, they can add optimizations specific to their custom ARM implementations.

As Assembly Language programmers, if we want the absolute best performance we might consider rearranging some of our instructions so that instructions that are independent of each other sit in a row and can hopefully be executed in parallel. This can require quite a bit of testing to reverse engineer the exact out-of-order capability of your particular target ARM processor model. As always with performance optimizations, you must measure to prove you are improving things, and not just making your code more cryptic.


This all sounds great, but what happens when an interrupt occurs? This could be a timer interrupt to say your time-slice is up and another process gets to use the ARM core, or it could be that more data needs to be read from the WiFi or a USB device.

Here the ARM CPU designer has a choice, they can forget about the work-in-progress and handle the interrupt quickly, or they can wait a couple of cycles to let work-in-progress complete and then handle the interrupt. Either way they have to allow the interrupt handler to save the current context and then restore the context to continue execution. Typically interrupt handlers do this by saving all the CPU and coprocessor registers to the system stack, doing their work and then restoring state.

When you see an ARM processor advertised as designed for real-time or industrial use, this typically means that it handles interrupts quickly with minimal delay. In this case, the work-in-progress is discarded and will be redone after the interrupt is finished. For ARM processors designed for general purpose computing, this usually means that user performance is more important than being super responsive to interrupts and hence they can let some of the work-in-progress complete before servicing the interrupt. For general purpose computing this is ok, since the attached devices like USB, ethernet and such have buffers that can hold enough contents to wait for the CPU to get around to them.

A Step Too Far and Spectre

Hardware designers went even further with branch prediction and speculative execution: if a conditional branch instruction needs to wait for a condition code to be set, the CPU doesn’t wait but keeps going along a predicted branch direction (perhaps based on the result from the last time this code executed). The problem is that the CPU has to save the current state in case it needs to go back when it guesses wrong. This state was saved in a CPU cache that was only used for this purpose but had no security protection, resulting in the Spectre attack, which figured out a way to get at this data. This caused data leakage across processes or even across virtual machines. The whole Spectre debacle showed that great care has to be taken with these sorts of optimizations.

Heat, the Ultimate Gotcha

Suppose your ARM processor has four CPU cores and you write a brilliant Assembly Language program that spreads work across all four cores and fully exploits out-of-order execution. Your program is now using every bit of the ARM CPU: each core is intermixing regular ARM, floating-point and NEON instructions, and you have arranged your ARM instructions to get the arithmetic unit operating in parallel with the memory unit. This will be the fastest implementation yet. Then you run your program; it gets off to a great start, but suddenly slows down to a crawl. What happened?

The enemy of parallel processing on a single chip is heat. Everything the CPU does generates a little heat. The more things you get going at once the more heat will be generated by the CPU. Most ARM based computers like the Raspberry Pi assume you won’t be running the CPU so hard, and only provide heat dissipation for a more standard load. This is why Raspberry Pis usually do so badly playing high-res videos. They can do it, as long as they don’t overheat, which typically doesn’t take long.

This leaves you with a real engineering problem. You need to either add more cooling to your target device, or deliberately reduce the CPU usage of your program; perhaps paradoxically, you may get more work done using two cores rather than four, because you won’t be throttled due to overheating.


This was a quick overview of out-of-order instructions. Hopefully you don’t find these scary and keep in mind the potential benefits as you write your code. As newer ARM processors come to market, we’ll be seeing larger and larger pools of instructions executed in parallel, where the ability for instructions to execute out-of-order will have even greater benefits.

If you are interested in machine code or Assembly Language programming, be sure to check out my book: “Raspberry Pi Assembly Language Programming” from Apress. It is available on all major booksellers or directly from Apress here.

Written by smist08

November 15, 2019 at 11:11 am

RISC Instruction Encoding

with one comment


Modern microprocessors execute programs from memory that are formatted specifically for the processor and the instructions it is capable of executing. This machine code is generated by tools, either fairly directly from Assembly Language source code or via a compiler that translates a high level language to machine code. There are two popular philosophies on how machine code is structured. One is Reduced Instruction Set Computers (RISC), exemplified by ARM, RISC-V, PowerPC and MIPS processors; the other is Complex Instruction Set Computers (CISC), exemplified by Intel and AMD processors. In RISC computers, each instruction is quite small and does a tiny bit of work; in CISC computers the instructions tend to be larger and each one does more work. The advantage of RISC processors is that the circuitry is simpler, which means they use less power; this is why nearly all mobile devices use RISC processors. In this article we will be looking at some of the tricks RISC computers use to keep their instructions small and quick.

32-Bit Instructions

Most RISC processors use 32-bit machine code instructions. It doesn’t matter whether the processor is 32-bit or 64-bit; that only refers to the size of pointers for memory addressing and the size of the registers, and in both cases the instructions stay 32-bits in length. With all rules there are exceptions: in RISC-V processors most instructions are 32-bit, but there is a facility to allow longer instructions where necessary, and ARM processors in 32-bit mode have the ability to limit instructions to 16-bits in length. Modern processors are very powerful and have a lot of functionality, so how do they encode all the information needed for an instruction into 32-bits? This restriction imposes a lot of discipline on the instruction set designers, but the solutions they have come up with are quite interesting. In comparison, Intel x86 instructions are variable length and can be up to 120 bits long.

Having all the instructions 32-bits in length makes building an efficient execution pipeline much simpler, since you can load and start working on a set of instructions in parallel. You don’t need to decode one instruction to learn where the next one starts: you know there is a new instruction every 4 bytes in memory. This uniformity saves a lot of complexity and greatly enhances instruction execution throughput.

Where Do the Bits Go?

What needs to be encoded in a machine language instruction? Here are some of the possible components:

  1. The opcode. This tells the processor what the instruction does, whether it’s adding two numbers, loading data from memory or jumping to another program location. If the opcode takes 8-bits then there are 256 possible instructions. To really save space, some opcodes can use fewer bits, for instance if an opcode starts with 011 then the remaining bits can go to the immediate value.
  2. Registers. Microprocessors load data into registers and then process the data in the registers. Often two or three registers need to be specified in an instruction, like the two numbers to add and then where to put the result. If there are 32 registers, then each register field will take 5-bits.
  3. Immediate data. Most processors have a way to encode some data in an instruction. Like “LOAD R1, 5” might mean load the value 5 into register R1. Here 5 is data encoded in the instruction, and called an immediate value. The size of these varies based on the instruction and use cases.
  4. Memory Addresses. Data has to be loaded from memory, or program execution has to jump to a different memory location. Note that in a modern computer memory addresses are either 32-bit or 64-bits. These are both too big to fit in a 32-bit instruction (we need at least an opcode as well). In RISC, how do we specify memory addresses?
  5. Bits for additional parameters. Perhaps there are several addressing modes, or perhaps other options for an instruction that need to be encoded. Often there are a few bits in each instruction for this purpose.


That’s a lot of information to pack into a 32-bit instruction. How do they do it? My introduction to Raspberry Pi Assembly Language shows how this is done for ARM processors in 32-bit mode.

How to Load a Register

Let’s look at how to load a 32-bit register with data. We can’t fit a full 32-bit value inside a 32-bit instruction, so what do we do? You might suggest we load the value from memory rather than encode it in the instruction. This is a legitimate thing to do, but it just moves the problem, since we now need to get the 32 or 64-bit memory address into a register first.

First, we could do it in two steps: perhaps we can fit a 16-bit value in an instruction and then perform two instructions to load the full value. An ARM processor has a MOV instruction that can load a 16-bit immediate value (clearing the upper bits) and a MOVT instruction that loads a 16-bit immediate value into the top 16-bits of a register. Suppose we want to load 0x12345678 into register R1; then in ARM 32-bit Assembly we would write:

MOV  R1, #0x5678
MOVT R1, #0x1234

This works, and we do expect that working in RISC will take lots of small instructions to perform the work we need to get done. However, this is somehow not satisfying, since loading constants is something we do a lot and it seems wasteful to take two instructions. The other concern is that if we are running in 64-bit mode and want to load a 64-bit register, then this will take four instructions.

Another trick is to make use of the Program Counter (PC) register. This register points to the instructions currently being executed. If we can position the value near the code, then we can load it by dereferencing the PC (plus a small offset). As long as the offset fits in the room we have for an immediate value, this works. In the ARM world, the Assembler helps us generate this code. We write something like:

LDR R1, =mydata


mydata: .WORD 0x12345678

Then the Assembler will convert the LDR instruction to something like:

LDR R1, [PC, #20]

Which means: load the data pointed to by PC + 20 into R1. Now it only takes one instruction to load the data. This technique has the added advantage that it remains one instruction even when dealing with 64-bit data.


This was a quick discussion of how RISC processors encode each machine code instruction as a 32-bit value. This is one of the key things that keeps RISC processors simple, allowing them to be quick and at the same time more power efficient.

If you are interested in machine code or Assembly Language programming, be sure to check out my book: “Raspberry Pi Assembly Language Programming” from Apress. It is available on all major booksellers or directly from Apress here.

Written by smist08

November 8, 2019 at 11:55 am

Playing with CUDA on My NVIDIA Jetson Nano

with 2 comments


I reported last time about my new toy, an NVIDIA Jetson Nano Development Kit. I’m pretty familiar with Linux and ARM processors; I even wrote a couple of articles on Assembler programming, here and here. The thing that intrigued me about the Jetson Nano is its 128 Maxwell GPU cores. What can I do with these? Sure, I can speed up TensorFlow since it uses them automatically, and I could probably do the same with OpenGL programs. But what can I do directly?

So I downloaded the CUDA C Programming Guide from NVIDIA’s website to have a look at what is involved.


The claim is that the microSD image of 64-bit Ubuntu Linux that NVIDIA provides for this computer has all the NVIDIA libraries and utilities you need pre-installed. The programming guide made it clear that you need to use the NVIDIA C compiler nvcc to compile your work. But when I typed nvcc at a command prompt, I just got an error that the command wasn’t found. A bit of Googling revealed that everything is installed, but the installation ran before your user account was created, so you need to add the locations to some paths. Adding:

export PATH=${PATH}:/usr/local/cuda/bin
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/usr/local/cuda/lib64

to my .bashrc file got everything working. These paths also show where CUDA is installed, which is handy since the installation includes a large collection of samples.

Compiling the deviceQuery sample produced the following output on my Nano:

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "NVIDIA Tegra X1"
  CUDA Driver Version / Runtime Version          10.0 / 10.0
  CUDA Capability Major/Minor version number:    5.3
  Total amount of global memory:                 3957 MBytes (4148756480 bytes)
  ( 1) Multiprocessors, (128) CUDA Cores/MP:     128 CUDA Cores
  GPU Max Clock rate:                            922 MHz (0.92 GHz)
  Memory Clock rate:                             13 Mhz
  Memory Bus Width:                              64-bit
  L2 Cache Size:                                 262144 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 32768
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            Yes
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Compute Preemption:            No
  Supports Cooperative Kernel Launch:            No
  Supports MultiDevice Co-op Kernel Launch:      No
  Device PCI Domain ID / Bus ID / location ID:   0 / 0 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.0, CUDA Runtime Version = 10.0, NumDevs = 1

Result = PASS

This is all good information and what all this data means is explained in NVIDIA’s developer documentation (which is actually pretty good). The deviceQuery sample exercises various information APIs in the CUDA library to tell you all it can about what you are running. If you can compile and run deviceQuery in the samples/1_Utilities folder then you should be good to go.

CUDA Hello World

The 128 NVIDIA Maxwell cores basically make up a SIMD computer (Single Instruction Multiple Data). This means you have one instruction that they all execute, but on different data. For instance, if you want to add two arrays of 128 floating point numbers, you have one instruction, add, and each processor core adds a different element of the array. NVIDIA actually calls their processors SIMT, meaning Single Instruction Multiple Threads, since you can partition the processors between different threads and have each thread’s collection of processors doing its SIMD thing at once.

When you write a CUDA program, you have two parts: one part runs on the host CPU and the other runs on the NVIDIA GPUs. The NVIDIA C compiler, NVCC, adds a number of extensions to the C language to specify what runs where and to provide more convenient syntax for the common things you need to do. For the host parts, NVCC translates its custom syntax into CUDA library calls and then passes the result on to GCC to compile regularly. For the GPU parts, NVCC compiles to an intermediate format called PTX. The reason it does this is to support all the various NVIDIA GPU models. When the NVIDIA device driver goes to load this code, it does a just-in-time compile (which it then caches), where the PTX code is compiled to the correct binary code for your particular set of GPUs.

Here is the skeleton of a simple CUDA program:

// Kernel definition
__global__ void VecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main()
{
    ...
    // Kernel invocation with N threads
    VecAdd<<<1, N>>>(A, B, C);
    ...
}

The __global__ identifier marks the VecAdd routine as code to run on the GPU. One instance of this routine will be downloaded to run on N processors. Notice there is no loop to add these vectors: each processor runs as a different thread, and the thread’s threadIdx.x member is used to choose which array element to add.

Then in the main program we call VecAdd using the VecAdd<<<1, N>>> syntax, which indicates we are launching this GPU kernel with one block of N threads, passing it the three arrays.

This little example skips the extra steps of copying the arrays to GPU memory and copying the result out of GPU memory. There are quite a few different memory types, with various trade-offs for using them.

The complete program for adding two vectors from the samples is at the end of this article.

This example also doesn’t explain how to handle larger arrays or how to do error processing. For these extra levels of complexity, refer to the CUDA C Programming Guide.

The CUDA program here is very short, just doing an addition. If you wanted to, say, multiply two 10×10 matrices, you would have your CUDA code do the dot product of a row in the first matrix with a column in the second matrix. Then you would have 100 cores execute this code, so the multiplication would be done up to 100 times faster than just using the host processor. There are a lot of examples of matrix multiplication in the samples and documentation.

Newer CUDA Technologies

The Maxwell GPUs in the Jetson Nano are a bit old and reading and playing with the CUDA libraries revealed a few interesting tidbits on things they are missing. We all know how NVidia has been enhancing their products for gaming and graphics with the introduction of things like real time ray tracing, but the thing of more interest to me is how they’ve been adding features specific to Machine Learning and AI. Even though Google produces their own hardware for accelerating their TensorFlow product in their data centers, NVidia has added specific features that greatly help TensorFlow and other Neural Network programs.

One thing the Maxwell GPU lacks is direct matrix multiplication support; newer GPUs can do A * B + C as a single instruction, where these are all matrices.

Another thing NVIDIA has recently added is direct support for executing computation graphs. If you worked with early versions of TensorFlow then you know that you construct your model by building a computational graph and then training and executing it. The newest NVIDIA GPUs can now execute these graphs directly. NVIDIA has a TensorRT library to move parts of TensorFlow to the GPU; this library does work for the Maxwell GPUs in the Jetson Nano, but is probably far more efficient on the newest, bright and shiny GPUs. Even just using TensorFlow without TensorRT is a great improvement, and moving the matrix calculations to the GPUs works even on the Nano; it just means the libraries have more work to do.


The GPU cores in a product like the Jetson Nano can be easily utilized using products that support them like TensorFlow or OpenGL, but it’s fun to explore the lower level programming models to see how things are working under the covers. If you are interested in parallel programming on a SIMD type machine, then this is a good way to go.


/*
 * Copyright 1993-2015 NVIDIA Corporation.  All rights reserved.
 * Please refer to the NVIDIA end user license agreement (EULA) associated
 * with this source code for terms and conditions that govern your use of
 * this software. Any use, reproduction, disclosure, or distribution of
 * this software and related documentation outside the terms of the EULA
 * is strictly prohibited.
 *
 * Vector addition: C = A + B.
 * This sample is a very basic sample that implements element by element
 * vector addition. It is the same as the sample illustrating Chapter 2
 * of the programming guide with some additions like error checking.
 */
#include <stdio.h>

// For the CUDA runtime routines (prefixed with "cuda_")
#include <cuda_runtime.h>

#include <helper_cuda.h>

/*
 * CUDA Kernel Device code
 * Computes the vector addition of A and B into C. The 3 vectors have the same
 * number of elements numElements.
 */

__global__ void
vectorAdd(const float *A, const float *B, float *C, int numElements)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;

    if (i < numElements)
    {
        C[i] = A[i] + B[i];
    }
}

/*
 * Host main routine
 */
int main(void)
{
    // Error code to check return values for CUDA calls
    cudaError_t err = cudaSuccess;

    // Print the vector length to be used, and compute its size
    int numElements = 50000;
    size_t size = numElements * sizeof(float);
    printf("[Vector addition of %d elements]\n", numElements);

    // Allocate the host input vector A
    float *h_A = (float *)malloc(size);

    // Allocate the host input vector B
    float *h_B = (float *)malloc(size);

    // Allocate the host output vector C
    float *h_C = (float *)malloc(size);

    // Verify that allocations succeeded
    if (h_A == NULL || h_B == NULL || h_C == NULL)
        fprintf(stderr, "Failed to allocate host vectors!\n");

    // Initialize the host input vectors
    for (int i = 0; i < numElements; ++i)
        h_A[i] = rand()/(float)RAND_MAX;
        h_B[i] = rand()/(float)RAND_MAX;

    // Allocate the device input vector A
    float *d_A = NULL;
    err = cudaMalloc((void **)&d_A, size);

    if (err != cudaSuccess)
        fprintf(stderr, "Failed to allocate device vector A (error code %s)!\n", cudaGetErrorString(err));

    // Allocate the device input vector B
    float *d_B = NULL;
    err = cudaMalloc((void **)&d_B, size);

    if (err != cudaSuccess)
        fprintf(stderr, "Failed to allocate device vector B (error code %s)!\n", cudaGetErrorString(err));

    // Allocate the device output vector C
    float *d_C = NULL;
    err = cudaMalloc((void **)&d_C, size);

    if (err != cudaSuccess)
        fprintf(stderr, "Failed to allocate device vector C (error code %s)!\n", cudaGetErrorString(err));

    // Copy the host input vectors A and B in host memory to the device input vectors in
    // device memory
    printf("Copy input data from the host memory to the CUDA device\n");
    err = cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);

    if (err != cudaSuccess)
        fprintf(stderr, "Failed to copy vector A from host to device (error code %s)!\n", cudaGetErrorString(err));

    err = cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    if (err != cudaSuccess)
        fprintf(stderr, "Failed to copy vector B from host to device (error code %s)!\n", cudaGetErrorString(err));

    // Launch the Vector Add CUDA Kernel
    int threadsPerBlock = 256;
    int blocksPerGrid =(numElements + threadsPerBlock - 1) / threadsPerBlock;
    printf("CUDA kernel launch with %d blocks of %d threads\n", blocksPerGrid, threadsPerBlock);
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, numElements);
    err = cudaGetLastError();

    if (err != cudaSuccess)
        fprintf(stderr, "Failed to launch vectorAdd kernel (error code %s)!\n", cudaGetErrorString(err));

    // Copy the device result vector in device memory to the host result vector
    // in host memory.
    printf("Copy output data from the CUDA device to the host memory\n");
    err = cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

    if (err != cudaSuccess)
        fprintf(stderr, "Failed to copy vector C from device to host (error code %s)!\n", cudaGetErrorString(err));

    // Verify that the result vector is correct
    for (int i = 0; i < numElements; ++i)
        if (fabs(h_A[i] + h_B[i] - h_C[i]) > 1e-5)
            fprintf(stderr, "Result verification failed at element %d!\n", i);

    printf("Test PASSED\n");

    // Free device global memory
    err = cudaFree(d_A);

    if (err != cudaSuccess)
        fprintf(stderr, "Failed to free device vector A (error code %s)!\n", cudaGetErrorString(err));

    err = cudaFree(d_B);

    if (err != cudaSuccess)
        fprintf(stderr, "Failed to free device vector B (error code %s)!\n", cudaGetErrorString(err));

    err = cudaFree(d_C);

    if (err != cudaSuccess)
        fprintf(stderr, "Failed to free device vector C (error code %s)!\n", cudaGetErrorString(err));

    // Free host memory

    return 0;

Written by smist08

April 3, 2019 at 6:01 pm

Can NVidia Bake a Better Pi Than Raspberry?

with 6 comments


I love my Raspberry Pi, but I find its limited 1GB of RAM can be quite restricting. It is still pretty amazing what you can do with these $35 computers. I was disappointed when the Raspberry Pi Foundation announced that the Raspberry Pi 4 is still over a year away, so I started to look at Raspberry Pi alternatives. I wanted something with 4GB of RAM and a faster ARM processor. I was considering purchasing an Odroid N2 when I saw the press release from NVidia's Developer Conference announcing their new NVidia Jetson Nano Developer Kit. This board has a faster quad-core ARM A57 processor, 4GB of RAM, plus the bonus of a 128-core Maxwell GPU. The claim is that this is an ideal DIY computer for those interested in AI and machine learning (i.e. me). It showed up for sale, so I bought one and received it via FedEx in 2 days.


If you already have a Raspberry Pi, setup is easy, since you can unplug things from the Pi and plug them into the Nano: the power supply, keyboard, monitor and mouse. Like the Pi, the Nano runs from a microSD card, so I reformatted one of my Pi cards and flashed it with the variant of Ubuntu Linux that NVidia provides for these boards. Once the operating system was burned to the microSD card, I plugged it into the Nano and away I went.

One difference from the Pi is that the Nano does not have built-in Wifi or Bluetooth. Fortunately, the room I'm setting this up in has a wired Ethernet port, so I went into the garage, found a long Ethernet cable in my box of random cables, plugged it in and was connected to the Internet. You can plug in a USB Wifi dongle if you need Wifi, or there is an M.2 Key E slot (which is hard to access) for an M.2 Wifi card. Just be careful of compatibility, since the drivers need to be compiled for ARM64 Linux.

The board doesn’t come with a case, but the box folds into a stand to hold the board. For now that is how I’m running. If they sell enough of these, I’m sure cases will appear, but you will need to ensure there is enough ventilation for the huge heat sink.

Initial Impressions

The Jetson Nano certainly feels faster than the Raspberry Pi. This is all helped by the faster ARM processor, the quadrupled memory, the GPU cores accelerating graphics, and the fact that its version of Linux is 64-bit (unlike Raspbian, which is 32-bit). It ran the preinstalled Chromium browser quite well.

As I installed more software, I found that writing large amounts of data to the microSD card can be a real bottleneck, and I would often have to wait for it to catch up. This is more pronounced than on the Pi, probably because on the Nano everything else is fast enough that the card stands out. It would be nice if there were an M.2 Key M interface for an NVMe SSD drive, but there isn't. I ordered a faster microSD card (over three times faster than what I have) and hope that helps. I can also try putting some things on a USB SSD, though that isn't the fastest either.

I tried running the TensorFlow MNIST tutorial program. The version of TensorFlow provided is 1.11; if I want to try TensorFlow 2.0, I'll have to compile it myself for ARM64, which I haven't attempted yet. Anyway, TensorFlow automatically used the GPU and executed the tutorial orders of magnitude faster than the Pi (a few minutes versus several hours), so I was impressed with that.

This showed up another gotcha: the GPU cores and the CPU share the same memory, so when TensorFlow used the GPU, it took a lot of memory away from the CPU. I was running the tutorial in a locally hosted Jupyter notebook, which meant I was running a web server, Chromium, Python, and then TensorFlow with bits on both the CPU and GPU. This tended to use up all the memory, and then things would grind to a halt until garbage collection sorted them out. Running from scratch was fine, but running iteratively felt like it kept hitting a wall. I think the lesson here is that to do machine learning training on this board, I really have to use a lighter Python environment than Jupyter.

The documentation mentions a utility to control the clock speeds of the ARM cores and GPU cores, so you can tune how much heat is produced. I think this is mostly for when you embed the board inside something, but beware: this sucker can run hot if you keep all the various processors busy.

How is it so Cheap?

The NVidia Jetson Nano costs $99 USD. The Odroid N2 is $79, so the Nano is fairly competitive with other boards trying to be super-Pis. However, it is cheaper than pretty much any NVidia graphics card, and even than their Nano compute module (which has no ports and costs $129 in quantities of 1000).

The obvious cost savings are no Wifi and no Bluetooth. Another is the lack of a SATA or M.2 Key M interface. It does have a camera interface, a serial interface and a Pi-like GPIO block.

The Nano has 128 Maxwell GPU cores. That sounds impressive, but remember that most graphics cards have 700 to 4,000 cores. Further, Maxwell is the oldest supported architecture (version 5), whereas the newest is the version 7 Volta core.

I think NVidia is keeping the cost low to get the DIY crowd using their technologies; they've seen the success of the Raspberry Pi community and want to duplicate it for their various processor boards. I also think they want to be in the ARM board game, so that as better ARM processors come out, they can hope to supplant Intel in producing motherboards for desktop and laptop computers.


If the Raspberry Pi 4 team can produce something like this for $35, they will have a real winner. I'm enjoying playing with the board and learning what it can do, and so far I've been pretty impressed. There are some limitations, but given the $100 price tag, I don't think you can lose. You can play with parallel processing on the GPU cores, interface to robots with the GPIO pins, or play with object recognition via the camera interface.

For a DIY board, there are a lot of projects you can take on.