Stephen Smith's Blog

Musings on Machine Learning…

Archive for the ‘Software Development Lifecycle’ Category

Apple M1 as a Development Workstation

with 12 comments


I’ve been playing with my new M1 based Apple MacBook Air for a few weeks now, so I thought I’d blog about how it is as a development machine. These are now the best way to develop iOS Apps for iPhones and iPads. These systems are really new so there are a few missing pieces, but these are filling in fast. You can run most MacOS Intel based programs using Rosetta 2, but I’m interested in what runs in natively compiled ARM code that ideally uses the builtin M1 functionality where appropriate.


XCode is Apple’s IDE for development. The whole XCode system is a combination of Apple created software along with a number of open source development tools. As long as you don’t compile or debug for Intel, you don’t need Rosetta installed to run all these. That means besides XCode and Swift, you also get natively compiled versions of LLVM, Python and a number of other tools. After installing XCode (which is huge), you can run command line tools to compile C code and Assembly language code. There is a version of make installed and you can do this all from a command shell without using the XCode IDE at all. All these tools are very fast and seem to work perfectly in the native ARM environment. This shouldn’t be too much of a surprise as these have all been working fine on the ARM based Raspberry Pi for quite a few years now.

If you develop iOS or MacOS applications using Cocoa then this is the platform for you. On older Intel based Macs, to test the application the computer had to emulate the ARM processor and the iOS simulators were quite slow and clunky. Now that everything is using the same processor, suddenly the iPhone and iOS simulators are fast and much more productive. In fact currently the M1 processor is faster than any existing iPhone or iPads, so iOS apps actually run fastest on the new Macs.

What’s Missing?

Apple has fully embraced the LLVM open source toolchain and helped have that project fully support the new Macs. Sadly they didn’t provide the same level of help to the GNU GCC open source toolchain. There are now test builds of the GNU toolchain, or you can build it yourself, but not a released one yet. This then slowed down the development of any applications that depend on the GNU toolchain. The most notable case is anything written in Fortran was stalled because GNU has the good Fortran compiler.

Now you might ask, so what? Who uses Fortran these days anyway? The thing is nearly all scientific libraries were written in the 60s and 70s in Fortran and have all been made open source. These libraries are highly reliable and work great. Now you might ask, so what? Not many people do scientific computing? The thing is that all modern machine learning and AI applications make extensive use of these libraries especially for things like linear algebra. This means that even though Python itself is available natively for the Apple M1, all the scientific libraries, most notably numpy are not as they contain Fortran components. Again there are test builds available, but it could be a few months before these are all generally available.

Another problem is the Apple M1 GPU and machine learning accelerator components. Even once these all compile and are available for the M1, which should be soon, it may be considerably longer before versions are available that can make use of the GPU or TPU for vector acceleration. Most of these libraries have support for nVidia and AMD GPUs, however now that Apple has gone their own way, it may be a bit of a wait for Apple versions. Apple has allocated engineers to help with these projects so hopefully it is sooner than later, and any project that previously supported acceleration on an iPhone or iPad will be good to go.

Meanwhile if you use some other programming language or system, like say ERLang, you will have to check their website for native availability, compile them yourself or use Rosetta 2.

The new XCode is great for Apple mobile development, but what about Android? Android Studio is currently being ported to the M1 and there are test builds available, but with lots of missing pieces. Once complete, this will be the best way to develop Android applications since again, you can run the apps natively and don’t require an ARM emulator for testing and debugging.


Whenever a new generation of hardware is released, there is always a delay as software catches up. If you do a lot of development for iOS then you need one of these new Macs as these are now the best environment for mobile development. Once Android Studio finishes their M1 version, the new Apple M1’s will be by far the best platform for mobile development. Apple has done a really good job of having so much work at release for their new generation of Macintosh computers; but, as is always the case at the bleeding edge, there are a few holes to be filled in. Of course most of these projects are open source so if you need them, you can always contribute to help them move a little faster. As more and more M1 based Macs ship and get into people’s hands these problems will start to be knocked off and more things will move into the “it just works” category.

Written by smist08

January 1, 2021 at 11:54 am

Is Internet Time a Good Thing?

leave a comment »


Internet Time is the concept that things happen faster online than they do offline and that this phenomena is accelerating the pace of change. Practically, this is the justification that companies use to cut corners, that to compete they have to keep doing things quicker regardless of the consequences.

Neal Stephenson’s novel “Anathem”, considers what it would be like to go radically in the other direction. In “Anathem” monasteries of Mathematicians have ten, a hundred or even a thousand years to try to solve problems. The idea is that if you have the time, you will undertake solving much harder problems and instead of our current state of incremental improvement, you will get longer periods of stability followed by larger more fundamental changes. Stephenson partly wrote this novel in opposition to Internet time and it certainly makes you think and I recommend it, even though it’s around a thousand pages long.

In this article, I’m going to look at some of the consequences of Internet time, things we have accelerated and what are the pros and cons.

Software Development

It used to be that a software product would introduce a new version every two years, but with delays, it would really be every three or four years. Usually these upgrades were fairly major and rather disruptive to the end users, since there would be major changes to functionality that they had gotten used to over the last couple of years. Nowadays, most installed software such as Windows 10, or most major Linux distributions release every six months or so. Some more frequently, but few less.

For online applications like Facebook or Twitter, the matra is DevOps and continuous deployment, where as soon as a developer checks in their code to source control, it gets built, unit tested and automatically deployed to the live system. There is no such thing as a major release, just small patches being deployed continuously.

Practically speaking, all applications are updating themselves more frequently. Everytime you run Visual Studio, it wants to update some component or another. Your Linux distribution almost always has new updates you can install. So installed applications get small security and bug fixes continuously, with a major feature release every six months. Even Facebook works this way, you don’t see major interface changes that often and they tend to get rolled out to users one group at a time.

But does this make software better? I think both approaches suffer from the same problem, that as the software becomes larger and more complex due to all these updates, it gets harder and harder to understand and harder to add major new features. Once the software reaches a certain size, it becomes cost prohibitive (and a major risk) to refactor major parts of it. Ths slows the pace of change in the software and the increments get smaller and smaller.

If software developers had ten years for a release, could they do more and produce something better? The counter argument is that they would spend nine years in meetings and messing around and then still only work on the new version for less than a year. But suppose you did have the time to rewrite the entire piece of software with the newest tools, techniques and technologies? Could you produce something majorly better? If you had the time, could you use more low level high performance techniques rather than using really high level programming systems? Could you do a better job of QA and security testing?

News Reporting

News reporting used to be a lot of work. There was a lot of time spent researching stories, digging for the underlying reasons things happened. The problem now is that to be read you have to be the first to post to the Internet. If you aren’t the first to Facebook and Twitter, then no one will bother reading you when your wonderfully researched insightful story finally appears.

Internet time has destroyed journalism and led to the crazy world of Internet conspiracy theories.

Sure we get information way quicker these days; but, a lot of the information we get now is wrong, low quality and deliberately misleading. I really like that I get information in real-time as events happen. However, I do miss balanced researched journalism.


I blogged about being brainwashed by social media and how social media divides us. Politics is happening at Internet Time. Events are happening much quicker and the procedures followed by our institutions aren’t keeping up. Politicians are giving up on presenting their policies and using reasoned arguments to convince us to vote for them. Instead they are just brarraging us with malicious misleading information over social media. Now that people live in Internet Time, they don’t have time to read long articles on peoples policies and points of views, they just get bombarded by internet memes and make their decisions based on these. As politics has moved to Internet Time, I don’t think anything has changed for the better.


Long ago we used to communicate by writing letters, then talking on the telephone. Along came e-mail and now messaging. With messaging we don’t even have time to write complete sentences, instead we have gone back to hieroglyphics, sending streams of emojis to communicate. Are we communicating better in Internet Time? Or is something being lost? We are certainly communicating more, as you can message all day and use less time than the phone or e-mail, but is the quality and depth of communications still there?


These are just a few areas that have been affected by moving to Internet Time. There are a lot of advantages to things happening quicker. Humans aren’t necessarily all that patient and like immediate gratification. We instinctively want to innovate at a faster pace, move into the future quicker. But are we taking the time to ensure we are really improving things? Are we losing control of our technologies and spiralling into chaos? I like the faster pace of change, but I do miss some of the deeper thought that used to go into things.

Written by smist08

October 30, 2020 at 2:26 pm

Is Apple Silicon Really ARM?

with one comment


Apple recently announced they were transitioning their line of Mac computers from using Intel CPUs to what they announced as Apple Silicon. This led to a lot of confusion since most people expected them to announce they would transition to the ARM CPU. The confusion arises from Apple’s market-speak where everything has to be Apple this or Apple that. Make no mistake the heart of Apple Silicon is the ARM CPU, but Apple Silicon refers to Apple’s System on a Chip (SoC) which includes a number of additional components along with the ARM CPUs. In this blog we’ll look at what Apple Silicon really is and discuss some of the current misconceptions around it.

ARM at the Core

My book “Programming with 64-Bit ARM Assembly Language” includes examples of adding ARM Assembly Language to Apple iOS apps for the iPhone and iPad. The main processors in iPhones and iPads have been redesignated as Apple Silicon chip, and the SoC used in the latest iPad is being used in Apple’s developer preview hardware of the new Mac Mini.

In fact, one of my readers, Alex vonBelow, converted all the source code for my book to run on the prototype Apple hardware. The Github repository of his work is available here. Again, this all demonstrates that ARM is at the center and is the brains behind Apple Silicon.

Other bloggers have claimed that Apple Silicon does not in fact use ARM processors since Apple customizes their ARM CPUs, rather than using off-the-shelf designs from ARM Holdings. This is true, but in fact most ARM licensees do the same thing, adding their own optimizations to gain competitive advantage. The key point is that Apple licenses the ARM CPU ISA (Instruction Set Architecture), which is the format and syntax of all the Assembly Language instructions the CPU processes, and then Apple uses ARMs software/hardware verification suite to ensure their designs produce correct results.

These other bloggers claim that Apple may stray from the ARM ISA in the future, which of course is possible, but highly unlikely. One of the keys to Apple’s success is the way they leverage open source software for much of both their operating system and development tools. Both MacOS and iOS are based on the open source Mach Unix Kernel from Carnegie Mellon University which in turn is based on BSD Unix. The XCode development environment looks Apple proprietary, but to do the actual compiling, it uses the open source tools from GCC or LLVM. Apple has been smart to concentrate their programming resources on areas where they can differentiate themselves from the competition, for instance making everything easier to use, and then using open source components for everything else. If Apple strays from the ARM ISA, then they have to take all this on themselves and will end up like Microsoft with the Windows 10 mess, where they can’t compete effectively with the much larger open source community.

Apple Silicon is an SoC

Modern integrated circuits allow billions of chips on a single small wafer. This means a single IC has room to hold multiple CPU cores along with all sorts of other components. Early PCs contained hundreds of ICs, nowadays, they contain a handful with most of the work being done by a single chip. This has been key to cell phones allowing them to be such powerful computers in such a small package. Similarly the $35 Raspberry Pi is a credit card sized computer where most of the work is done by a single SoC.

Let’s look at what Apple is cramming into their Apple Silicon SoC.

  1. A number of high performance/high power usage ARM CPU cores.
  2. A number of low power usage/lower performance ARM CPU cores.
  3. A Graphics Processing Unit (GPU).
  4. An Artificial Intelligence Processing Unit.
  5. An audio DSP processor.
  6. Power/performance management.
  7. A cryptographic acceleration unit.
  8. Secure boot controller.

For some reason, Apple marketing doesn’t like to mention ARM, which is too bad. But any programmer knows that to program for Apple Silicon, your compiler has to generate ARM 64-bit instructions. Similarly, if you want to debug a program for iPads, iPhones or the new Macs, then you have to use an ARM capable debugger such as GDB and you have to be able to interpret ARM instructions, ARM registers and ARM memory addressing modes.

It isn’t unusual for marketing departments to try and present technical topics to the general population in a way that doesn’t make sense to programmers. Programmers have to stick to reality or their programs won’t work and ignore most of what comes out of marketing departments. If you attended Apple’s WWDC a few months ago, you could see the programmers struggling to stick to the marketing message and every now and then having to mention ARM processors.


The transition of Macs from Intel to Apple Silicon is an exciting one, but don’t be fooled by the marketing spin, this is a transition from Intel to ARM. This is Apple going all-in on the ARM processor, using the same technology to power all their devices including iPhones, iPads and Mac computers.

If you want to learn more about the ARM processor and programming the new Apple devices, check out my book: Programming with 64-Bit ARM Assembly Language. Available directly from Apress, along with all the main booksellers.

Written by smist08

July 31, 2020 at 11:31 am

Browsing MSDOS and GW-Basic Source Code

leave a comment »


These days I mostly play around with ARM Assembly Language and have written two books on it:

But long ago, my first job out of University involved some Intel 80186 Assembly Language programming, so I was interested when Microsoft recently posted the source code to GW-Basic which is entirely written in 8086 Assembly Language. Microsoft posted the source code to MS-DOS versions 1 and 2 a few years ago, which again is also entirely written in 8086 Assembly Language.

This takes us back to the days when C compilers weren’t as good at optimizing code as they are today, processors weren’t nearly as fast and memory was at a far greater premium. If you wanted your program to be useful, you had to write it entirely in Assembly Language. It’s interesting to scroll through this classic code and observe the level of documentation (low) and the programming styles used by the various programmers.

Nowadays, programs are almost entirely written in high-level programming languages and any Assembly Language is contained in a small set of routines that provide some sort of highly optimized functionality usually involving a coprocessor. But not too long ago, often the bulk of many programs consisted entirely of Assembly Language.

Why Release the Source Code?

Why did Microsoft release the source code for these? One reason is that they are a part of computer history now and there are historians that want to study this code. It provides insight into why the computer industry progressed in the manner it did. It is educational for programmers to learn from. It is a nice gesture and offering from Microsoft to the DIY and open source communities as well.

The other people who greatly benefit from this are those that are working on the emulators that are used in systems like RetroPie. Here they have emulators for dozens of old computer systems that allow vintage games and programs to be run on modern hardware. Having the source code for the original is a great way to ensure their emulations are accurate and a great help to fixing bugs correctly.


Here is an example routine from find.asm in MS-DOS 2.0 to convert a binary number into an ASCII string. The code in this routine is typical of the code throughout MS-DOS. Remember that back then MS-DOS was 16-bits so AX is 16-bits wide. Memory addresses are built using two 16-bit registers, one that provides a segment and the other that gives an offset into that 64K segment. Remember that MS-DOS can only address memory upto 640K (ten such segments).

;       Binary to Ascii conversion routine                
; Entry:                                                          
;       DI      Points to one past the last char in the             
;       AX      Binary number                                       
;             result buffer.                                        
; Exit:                                                             
;       Result in the buffer MSD first                            
;       CX      Digit count                                         
; Modifies:                                                         
;       AX,BX,CX,DX and DI                                          
        mov     bx,0ah
        xor     cx,cx
        inc     cx
        cmp     ax,bx
        jb      div_done
        xor     dx,dx
        div     bx
        add     dl,’0′          ;convert to ASCII
        push    dx
        jmp     short go_div
        add     al,’0′
        push    ax
        mov     bx,cx
        pop     ax
        loop    deposit
        mov     cx,bx

For an 8086 Assembly Language programmer of the day, this will be fairly self evident code and they would laugh at us if we complained there wasn’t enough documentation. But we’re 40 or so years on, so I’ll give the code again but with an explanation of what is going on added in comments.

        mov     bx,0ah ; we will divide by 0ah = 10 to get each digit
        xor     cx,cx ; cx will be the length of the string, initialize it to 0
        inc     cx ; increment the count for the current digit
        cmp     ax,bx ; Is the number < 10 (last digit)?
        jb      div_done   ; If so goto div_done to process the last digit
        xor     dx,dx ; DX = 0
        div     bx ; AX = AX/BX  DX=remainder
        add     dl,’0′          ;convert to ASCII. Know remainder is <10 so can use DL
        push    dx ; push the digit onto the stack
        jmp     short go_div ; Loop for the next digit
        add     al,’0′ ; Convert last digit to ASCII
        push    ax ; Push it on the stack
        mov     bx,cx ; Move string length to BX
        pop     ax ; get the next significant digit off the stack.
        stosb ; Store AX at ES:DI and increment DI
       ; Loop decrements CX and branches if CX not zero.
; Falls through when CX=0
        loop    deposit
        mov     cx,bx ; Put the count back in CX
        ret ; return from routine.

A bit different than a C routine. The routine assumes the DF flag is set, so the stosb increments the memory address, perhaps this is a standard across MS-DOS or perhaps it’s just local to this module. I think the comment is incorrect and that the start of the output buffer is passed in. The routine uses the stack to reverse the digits, since the dividing by 10 algorithm peels off the least significant digit first and we want the most significant digit first in the buffer. The resulting string isn’t NULL terminated so perhaps MS-DOS treats strings as a length and buffer everywhere.

Comparison to ARM

This code is representative of CISC type processors. The 8086 has few registers and their usage is predetermined. For instance the DIV instruction is only passed one parameter, the divisor. The dividend, quotient and remainder are set in hard-wired registers. RISC type processors like the ARM have a larger set of registers and tend to have three operands per instruction, namely two input registers and an output register.

This code could be assembled for a modern Intel 64-bit processor with little alteration, since Intel has worked hard to maintain a good level of compatibility as it has gone from 16-bits to 32-bits to 64-bits. Whereas ARM redesigned their instruction set when they went from 32-bits to 64-bits. This was a great improvement for ARM and only possible now that the amount of Assembly Language code in use is so much smaller.


Kudos to Microsoft for releasing this 8086 Assembly Language source code. It is interesting to read and gives insight into how programming was done in the early 80s. I hope more classic programs have their source code released for educational and historical purposes.

Written by smist08

May 25, 2020 at 6:56 pm

Learning Electronics with Arduino

with 3 comments


I’ve worked with the Raspberry Pi quite a bit, written books about it and blogged fairly extensively about it. However, many people consider the Pi overkill. After all it runs full versions of Linux and usually requires a keyboard, mouse, monitor and Internet connection. Even at $35, many people consider it too expensive for their projects.

Parallel to the Raspberry Pi, there is the Arduino project which is an open source software and hardware project for microcontrollers. The Raspberry Pi includes a full ARM 64-bit processor with up to 4Gig of RAM. The Arduino is based on various microcontrollers that are often 8-bit and only have 32Kb of RAM. These microcontrollers don’t run a full operating system, they just contain enough code to start your program, whether burned on their flash memory or downloaded via serial port from a PC.

There are a great many Arduino compatible boards that can perform all sorts of functions. A typical Arduino has a set of external pins similar to the Raspberry Pi’s GPIO ports. The big advantage of the Arduino is that they are low cost, simple to program and low power.

In this article, I’ll look at the official Arduino Starter Kit.

Package Contents

The package contains an Arduino Uno microcontroller board, a breadboard and a large assortment of discrete electronic components. It contains a project book with 15 projects you can build out of all these components.

The Arduino Uno contains the Amtel ATmega328p microcontroller and 32kb of memory. These are flexible low cost processors that are used in many embedded applications.

Programming the Arduino

You can program the Arduino with any compiler that generates the correct machine code to run on the processor you’ve chosen (or even program it in Assembly Language). However most people use the Arduino IDE. This IDE is based on Sketch and Processing. You write your programs in a limited version of C (with a few extensions). The IDE then knows how to compile and download it to a great many Arduino boards so you can test out your program.

There are libraries to provide support for common functions like controlling a servo motor or controlling a LED character display. There is a Capacitive sensor library to measure a circuit’s capacitance. There are hundreds of sensors you can wire up to your Arduino and there are libraries available for most of these, making the programming to read or control them easy.

You can debug your program by sending strings back to the PC via a serial port which you can monitor in the IDE. You can also flash a couple of LEDs.

People might point out the C is really old and shouldn’t be used due to its use of pointers and such. However C is still the best language for low level programming like this. It is also used for nearly all systems programming. Linux is entirely written in C. Learning C is both useful in itself as well as acting as a jump start to newer languages, mostly based on C such as Java or C#.

Learning Electronics

I first became interested in electronics when I took Electricity 9 in junior high school. We learned the basics of soldering and I built a Radio Shack transistor radio from a kit. With this course I could fix some basic wiring issues around the house and occasionally fix appliances, and perhaps fix a TV by replacing a tube. The difficulty when I was younger was that it was a lot of work to build anything, since the whole thing needed to be built from discrete components and equipment was expensive.

Today things are much easier. You can build a lot of simple circuits attached to the Arduino where a lot of the work is done in software on the microcontroller. Things are much cheaper today. You can purchase a complete Arduino starter kit for under $100 and test equipment is far less expensive. You can pick up a good digital multimeter for under $20 and there are even good oscilloscopes for around $300. There are many simple integrated circuits like optocouplers and H-bridges to further simplify your circuites.

The Arduino is low power, so you can’t electrocute yourself. It has short detection, so if your circuit contains a short circuit, the Arduino shuts down. This all allows you to safely play with electronic components without any risk to yourself or the Arduino.

The starter kit projects include several techniques to connect the Arduino up to external devices safely. For instance controlling a DC motor with either a transistor used as a switch or via an H-bridge. Then how to interface to another device using an optocoupler to keep both devices completely electrically separate.


Arduino provides a great platform to both learn electronics and to learn programming. The IDE is simple to use and helps with learning. Building circuits attached to an Arduino is a safe place to experiment and learn without risking damaging expensive components or equipment. I found working through the 15 labs in the Arduino Projects Book that accompanied the starter kit quite enjoyable and I learned quite a few new things.


Out-of-Order Instructions

leave a comment »


We think of computer processors executing a set of instructions one at a time in sequential order. As programmers this is exactly what we expect the computer to do and if the computer decided to execute our carefully written code in a different order then this terrifies us. We would expect our program to fail, producing wrong results or crashing. However we see manufacturers claiming their processors execute instructions out-of-order and that this is a feature that improves performance. In this article, we’ll look at what is really going on here and how it can benefit us, without causing too much fear.


ARM defines the Instruction Set Architecture (ISA), which defines the Assembly Language instruction set. ARM provides some reference implementations, but individual manufacturers can take these, customize these or develop their own independent implementation of the ARM instruction set. As a result the internal workings of ARM processors differs from manufacturer to manufacturer. A main point of difference is in performance optimizations. Apple is very aggressive in this regard, which is why the ARM processors in iPads and iPhones beat the competition. This means the level of out-of-order execution differs from manufacturer to manufacturer, further this is much more prevalent in newer ARM chips. As a result, the examples in this article will apply to a selection of ARM chips but not all.

A Couple of Simple Cases

Consider the following small bit of code to multiply two numbers then load another number from memory and add it to the result of the multiplication:

MUL R3, R4, R5 @ R3 = R4 * R5
LDR R6, [R7]   @ Load R6 with the memory pointed to by R7
ADD R3, R6     @ R3 = R3 + R6

The ARM Processor is a RISC processor and its goal is to execute each instruction in 1 clock cycle. However multiplication is an exception and takes several clock cycles longer due to the loop of shifting and adding it has to perform internally. The load instruction doesn’t rely on the result of the multiplication and doesn’t involve the arithmetic unit. Thus it’s fairly simple for the ARM Processor to see this and execute the load while the multiply is still churning away. If the memory location is in cache, chances are the LDR will complete before the MUL and hence we say the instructions executed out-of-order. The ADD instruction then needs the results from both the MUL and LDR instruction, so it needs to wait for both of these to complete before executing it’s addition.

Consider another example of three LDR instructions:

LDR R1, [R4] @ memory in swap file
LDR R2, [R5] @ memory not in cache
LDR R3, [R6] @ memory in cache

Here the memory being loaded by the first instruction, has been swapped out of memory to secondary storage, so loading it is going to be slow. The second memory location is in regular memory. DDR4 memory, like that used in the new Raspberry Pi 4, is pretty fast, but not as fast as the CPU and it is also loading instructions to process, hence this second LDR might take a couple of cycles to execute. It makes a request to the memory controller and its request is queued with everything else going on. The third instruction, assumes the memory is in the CPU cache and hence processed immediately, so this instruction really does take only 1 clock cycle.

The upshot is that these three LDR instructions could well complete in reverse order.

Newer ARM processors can look ahead through the instructions looking for independent instructions to execute, the size of this pool will determine how out-of-order things can get. The important point is that instructions that have dependencies can’t start and that to the programmer, it looks like his code is executing in order and that all this magic is transparent to the correct execution of the program.

Since the CPU is executing all these instructions at once, you might wonder what the value of the program counter register (PC) is? This register has a very precisely defined value, since it is used for PC relative addressing. So the PC can’t be affected by out-of-order execution. 


All newer ARM processors include floating-point coprocessors and NEON vector coprocessors. The instructions that execute on these usually take a few instructions cycles to execute. If the instructions that follow a coprocessor instruction are regular ARM instructions and don’t rely on the results of coprocessor operations, then they can continue to execute in parallel to the coprocessor. This is a handy way to get more code parallelism going, keeping all aspects of the CPU busy. Intermixing coprocessor and regular instructions is another great way to leverage out-of-order instructions to get better performance.

Compilers and Code Generation

This indicates that if a compiler code generator or an Assembly Language program rearranges some of their instructions, they can get more things happening at once in parallel giving the program better performance. ARM Holdings contributes to the GNU Compiler Collection (GCC) to fully utilize the optimization present in their reference implementations. In the ARM specific options for GCC, you can select the ARM processor version that matches your target and get more advanced optimizations. Since Apple creates their own development tools under XCode, they can add optimizations specific to their custom ARM implementations.

As Assembly Language programmers, if we want to get the absolute best performance we might consider re-arranging some of our instructions so that instructions that are independent of each other are in a row and hopefully can be executed in parallel. This can require quite a bit of testing to reverse engineer the exact out-of-order instruction capability of your particular target ARM processor model. As always with performance optimizations, you must test the performance to prove you are improving things, and not just making your code more cryptic.


This all sounds great, but what happens when an interrupt happens? This could be a timer interrupt to say your time-slice is up and another process gets to use the ARM Core, or it could be that more data needs to be read from the Wifi or a USB device.

Here the ARM CPU designer has a choice, they can forget about the work-in-progress and handle the interrupt quickly, or they can wait a couple of cycles to let work-in-progress complete and then handle the interrupt. Either way they have to allow the interrupt handler to save the current context and then restore the context to continue execution. Typically interrupt handlers do this by saving all the CPU and coprocessor registers to the system stack, doing their work and then restoring state.

When you see an ARM processor advertised as designed for real-time or industrial use, this typically means that it handles interrupts quickly with minimal delay. In this case, the work-in-progress is discarded and will be redone after the interrupt is finished. For ARM processors designed for general purpose computing, this usually means that user performance is more important than being super responsive to interrupts and hence they can let some of the work-in-progress complete before servicing the interrupt. For general purpose computing this is ok, since the attached devices like USB, ethernet and such have buffers that can hold enough contents to wait for the CPU to get around to them.

A Step Too Far and Spectre

Hardware designers went even further with branch prediction, where if a conditional branch instruction needs to wait for a condition code to be set, they don’t wait but keep going assuming one branch direction (perhaps based on the result from the last time this code executed) and keep going. The problem here is that at this point, the CPU has to save the current state, incase it needs to go back when it guesses wrong. This CPU state was saved in a CPU cache that was only used for this, but had no security protection, resulting in the Spectre attack that figured out a way to get at this data. This caused data leakage across processes or even across virtual machines. The whole spectre debacle showed that great care has to be taken with these sorts of optimizations.

Heat, the Ultimate Gotcha

Suppose your your ARM processor has four CPU cores and you write a brilliant Assembly language program that deploys to use all four cores and fully exploits out-of-order execution. Your program is now using every bit of the ARM CPU, each core is intermixing regular ARM, floating point and NEON instructions You have intermixed your ARM instructions to get the arithmetic unit operating in parallel to the memory unit. This will be the fastest implementation yet. Then you run your program, it gets off to a great start, but then suddenly slows down to a crawl. What happened?

The enemy of parallel processing on a single chip is heat. Everything the CPU does generates a little heat. The more things you get going at once the more heat will be generated by the CPU. Most ARM based computers like the Raspberry Pi assume you won’t be running the CPU so hard, and only provide heat dissipation for a more standard load. This is why Raspberry Pis usually do so badly playing high-res videos. They can do it, as long as they don’t overheat, which typically doesn’t take long.

This leaves you a real engineering problem. You need to either add more cooling to your target device, or you have to deliberately reduce the CPU usage of your program, where perhaps paradoxically you get more work done using two cores rather than four, because you won’t be throttled due to overheating.


This was a quick overview of out-of-order instructions. Hopefully you don’t find these scary and keep in mind the potential benefits as you write your code. As newer ARM processors come to market, we’ll be seeing larger and larger pools of instructions executed in parallel, where the ability for instructions to execute out-of-order will have even greater benefits.

If you are interested in machine code or Assembly Language programming, be sure to check out my book: “Raspberry Pi Assembly Language Programming” from Apress. It is available on all major booksellers or directly from Apress here.

Written by smist08

November 15, 2019 at 11:11 am

Raspberry Pi Assembly Language Programming

with 2 comments



My new book “Raspberry Pi Assembly Language Programming” has just been published by Apress. This is my first book to be published by a real publisher and I’m thrilled to see it appearing on websites of booksellers all over the Internet. In this blog post I’ll talk about how this book came to exist, the process of writing and publishing it and a bit about the book itself.

For anyone interested in this book, here are a few places where it is available:

Most of these sites let you see a preview and the table of contents.

This blog’s dedicated page to my book.

How this Book Came About

I purchased my Raspberry Pi 3+ in late 2017 and had a great deal of fun playing with it. I wrote quite a few blog posts on the Pi, a directory of these is available here. The Raspberry Pi package I purchased included a breadboard and a selection of electronic components. I put together a set of LEDs connected to the Pi’s GPIO ports. I then wrote a series of articles on making these LEDs flash using various programming languages including C, Python, Scratch, Fortran, and Erlang. In early 2018 I was interested in learning more about how the Pi’s ARM processor works and delved into Assembly language programming. This resulted in two blog posts, an introduction and then my flashing LED program ported to ARM Assembly Language.

Earlier this year I was contacted by an Apress Talent Acquisition agent who had seen my blog articles on ARM Assembly Language and wanted to know if I wanted to develop them into a book. I thought about it over the weekend and was intrigued. The material I found when writing the blog articles wasn’t great, and I felt I could do better. I replied to the agent and we had a call to discuss the book. He had me write up a proposal and possible table of contents. I did this, Apress accepted it and sent me a contract to sign.

The Process

Apress provided a Word style sheet and a written style guide. My writing process has been to write in Google Docs and then have my spouse, a professional editor, edit it. The collaboration of Google Docs is just too good to do away with. So I wrote the chapters in Google Docs, got them edited and then transferred them to MS Word and applied the Apress style sheet.

I worked with a coordinating editor at Apress who was very energetic in getting all the pieces done. She found a technical editor who would provide a technical review of each chapter as I wrote it. He was located in the UK, so often I would submit a chapter and see it edited overnight.

Once I had submitted all the chapters then a senior development editor gave the whole book a review. At that point I thought I was done, but then the book was given to Springer’s (Apress’s parent company) production department who did another editing pass. I was surprised that the production department still found quite a few things that needed fixing or improving.

After all that the book appeared fairly quickly. I like the cover, they used my photo of my breadboard with the flashing LEDs. As of today, the book is available at most booksellers, some with stock and some on preorder. I signed the contract in June and did the bulk of the writing in July and August. Overall, I’m pretty happy with the process and how things turned out.

The Book

My philosophy was to introduce complete working programs from Chapter 1 with the traditional “Hello World” program. I only covered topics where you could write the code with the tools included with the Raspberry Pi and run them. I lay the foundations for how to write larger Assembly programs, with how to code the various structured programming constructs, but also include a chapter on how to interoperate with C and Python code.

Raspbian is a 32-bit operating system as older Raspberry Pi’s and the Raspberry Pi Zero can only run 32-bit code. I didn’t want to leave out 64-bit code, as there are 64-bit versions of Linux from other distributions like Ubuntu that are available for the Pi. So I included a chapter on ARM 64-bit Assembly along with guidelines on how to port your 32-bit code to 64-bit. I then included 64-bit versions of several of the programs we had developed along the way.

There is a lot of interest in ARM Assembly Language, especially from hackers, as all phones, tablets and even a few laptops are running ARM processors now. I included a number of hacking related topics like how to reverse engineer code, as security professionals are very interested in this as they work to protect the mobile devices utilized by their organizations.

The ARM Processor is a good example of a RISC processor, so if you are interested in RISC, this book will give a good introduction to the concepts, like how to do everything with instructions that are only 32-bits in length. Once you understand ARM Assembly, picking up the Assembly language of another RISC processor like the Risc-V becomes much easier.

The book also covers how to program the floating point processor included with most ARMs along with the NEON vector processor that is available on newer Raspberry Pis.


If you are interested in learning Assembly Language, please check out my book. The Raspberry Pi provides a great platform to do this. Even if you only program in higher level languages, knowing Assembly Language will help you understand what is going on at a deeper level. How modern processors design their Assembly Language to maximize program performance and minimize memory usage is quite fascinating and I hope you find the topic as interesting as I do.


Written by smist08

November 1, 2019 at 11:22 am

Getting Productive with Julia

with 3 comments


Julia is a programming language that is used quite extensively by the scientific community. It is open source, it just reached its version 1.0 milestone after quite a few years of development and it is nearly as fast as C but with many features associated with interpretive languages like R or Python.

There don’t seem to be many articles on getting all up and running with Julia, so I thought I’d write about some things that I found useful. This is all based on playing with Julia on my laptop running Ubuntu Linux.

Run in the Cloud

One option is to avoid any installation hassles is to just run in the cloud. You can do this with JuliaBox. JuliaBox gives you a Jupyter Notebook interface where you can either play with the various tutorials or do your own programming. Just beware the resources you get for free are quite limited and JuliaBox makes its money by charging you for additional time and computing power.

Sadly at this point, there aren’t very many options for running Julia in the cloud since the big AI clouds seem to only offer Python and R. I’m hoping that Google’s Kaggle will add it as an option, since the better performance will open up some intriguing possibilities in their competitions.

JuliaBox gives you easy direct access to all the tutorials offered from Julia’s learning site. Running through the YouTube videos and playing with these notebooks is a great way to get up to speed with Julia.

Installing Julia

Julia’s website documents how to install Julia onto various operating systems. Generally the Julia installation is just copying files to the right places and adding the Julia executable to the PATH. On Ubuntu you can search for Julia in the Ubuntu Software App and install it from there. Either way this is usually pretty easy straight forward. This gives you the ability to run Julia programs by typing “julia sourefile.jl” at a command prompt. If you just type “julia” you get the REPL environment for entering commands.

You can do quite a lot in REPL, but I don’t find it very useful myself except for doing things like package management.

If you like programming by coding in your favorite text editor and then just running the program, then this is all you need. For many purposes this works out great.

The Juno IDE

If you enjoy working in full IDE’s then there is Juno which is based on the open source Atom IDE. There are commercial variants of this IDE with full support, but I find the free version works just fine.

To install Juno you follow these instructions. Basically this involves installing the Atom IDE by downloading and running a .deb installation package. Then from within Atom, adding Julia support to the IDE.

Atom has integration with Julia’s debugger Gallium as well as provides a plot plane and access to watch variables. The editor is fairly good with syntax highlighting. Generally not a bad way to develop in Julia.


JuliaBox mentioned above uses Jupyter and runs it in the cloud. However, you can install it locally as well. Jupyter is a very popular and powerful notebook metaphor for developing programs where you see  the results of each incremental bit of code as you write it. It is really good at displaying all sorts of fancy formats like graphs. It is web based and will run a local web server that will use the local Julia installation to run. If you develop in Python or R, then you’ve probably already played with Jupyter.

To install it you first have to install Jupyter. The best way to do this is to use “sudo apt install jupyter”. This will install the full Jupyter environment with full Python support. To add Julia Jupyter support, you need to run Julia another way (like just entering julia to get the REPL) and type “Pkg.add(“IJulia”)”. Now next time you start Jupyter (usually by typing “jupyter notebook”), you can create a new notebook based on Julia rather than Python.

Julia Packages

Once you have the core Julia programming environment up and running, you will probably want to install a number of add-on packages. The package manager is call Pkg and you need to type “using Pkg” before using it. These are all installed by the Pkg.add(“”) command. You only need to add a package once. You will probably want to run “Pkg.update()” now and again to see if the packages you are using have been updated.

There are currently about 1900 registered Julia packages. Not all of them have been updated to Julia version 1.0 yet, so check the compatibility first. There are a lot of useful packages for things like machine learning, scientific calculations, data frames, plotting, etc. Certainly have a look at the package library before embarking on writing something from scratch.


These are currently the main ways to play with Julia. I’m sure since Julia is a very open community driven system, that these will proliferate. I don’t miss using the giant IDEs Visual Studio or Eclipse, these have become far too heavy and slow in my opinion. I find I evenly distribute my time between using Jupyter, Juno and just edit/run. Compared to Python it may appear their aren’t nearly as many tools for Julia, but with the current set, I don’t feel deprived.


Written by smist08

October 10, 2018 at 3:55 am

Avoiding Airline Collisions with Julia

leave a comment »


I was just watching an old episode of “Mayday: Air Crash Investigations“, on the crash of a Russian passenger jet with a DHL cargo plane over Switzerland. In this episode, both planes had onboard collision avoidance systems, but one plane listened to air traffic control rather than the collision avoidance system and went down rather than up, resulting in the collision. In reading about the programming language Julia recently, I had noticed several presentations on the development of the next generation of collision avoidance systems, in Julia. This piqued my interest, along with the fact that my wife is currently getting her pilot’s license, to have a slightly deeper look into this.

Modern airliners have employed an onboard Traffic Collision Avoidance Systems (TCAS) since the 1980s. TCAS is required on any passenger airplane that takes more than 19 passengers. These systems work by monitoring the transponders of nearby aircraft and determining when a collision is imminent. At this point it provides a warning to the plane’s pilot along with a course of action. The TCAS systems on the two aircraft communicate so one plane is ordered to go up and the other to descend.

Generally there are three layers to collision avoidance that operate on different timescales. At the coarsest level planes travelling in one direction are required to be at a different altitude than planes in the reversion direction. Usually one direction gets even altitudes like 30,000 feet and the reverse gets odd altitude like 31,000 feet. At a finer level, air traffic control is responsible for keeping the planes apart at medium distances. Then close up (minutes apart) it is TCAS’s job to avoid the collisions. This is partly due to the aftermath of the Russian/DHL crash and partly due to a realization that the latency in communications with air traffic control is too great when things get too close for comfort.

Interestingly it was the collision of two passenger plane’s over the Grand Canyon in 1956 that caused congress to create the FAA and started the development of the current TCAS system. It took thirty years to develop and deploy since it required computers to get much smaller and faster first.

Why Julia

The FAA has funded the development of the next generation of traffic avoidance which has been dubbed ACAS X. This started in 2008 and after quite a bit of study, it was decided to use Julia extensively in its development. Reading the reasons for why Julia was selected is rather scary when you consider what it highlights about the current TCAS system.

Problem 1 – Specifications

A big problem with TCAS was that the people that defined the system wrote the specification first as English like pseudo-code and then re-wrote that as a more programmy pseudo-code with variables and such. Then others would take this code and implement it in Mathlab to test the algorithms. Then the people who actually made the hardware would take this and re-implement it in C++ or Assembler. When people had a recent look at all this code, they found it to be a big mess, where the different specs and code bases had been maintained separately and didn’t match. There was no automation and very little validation. The first idea of fixing this code base was rejected as completely unreliable and impossible to add new features to.

They wanted to the new system to take advantage of modern technologies like satellite navigation systems, GPS, and on-board radar systems. This means the new system will work with other planes that don’t have collision avoidance or perhaps don’t even have a transponder. In fact they wanted the new system to be easily extensible as new sensor inputs are added. Below is a small example of the reams of pseudo code that makes up TCAS.

The hope with Julia is to unify these different code bases into one. The variable pseudo-code would actually be true Julia code and the English code would be incorporated into JavaDoc like comments in the code (actually using Latex). This would then eliminate the need to use Mathlab to test the pseudo-code. The consensus is that Julia code is easily as readable as the above pseudo-code but with the advantage of being runnable and testable.

The FAA doesn’t have the authority to mandate Avionics hardware companies run Julia on their ACAS X systems, but the hope is that the performance of Julia is good enough that they won’t bother reimplementing the system in C++ and that everything will be the same Julia code. Current estimates have the Julia code running 1.5 times the speed of C code and the thought is that with newer computer chips, this should be sufficient. The hope then is that the new system will not have the translation errors that dog TCAS.

Now that the specification is true computer code many other tools can be written or used to help check correctness, such as the tool below which generates a flowchart from the Julia code/specification.

Problem 2 – Testing/Validation

Certainly with TCAS implementing the system in Mathlab was hard. But then Mathlab is quite slow and that greatly restricts the number of test cases that can be effectively be automated. The TCAS system is based on a huge number of giant decision trees and billions of test cases. A number of test/validation frameworks have been developed to test the new ACAS X system including using theorem proving, probabilistic model checking, adaptive stress testing, simulations and weakest precondition code analysis.

Now if the Avionics hardware manufacturers run the actual Julia code, there will have only been one code base from specification to deployment and it will have all been very thoroughly developed, tested and validated.


The new ACAS X system is currently being flight tested and is projected to start being deployed in regular commercial aircraft starting in 2020. Looking at the work that has gone into this system, it looks like it will make flying much safer. Hopefully it also sets the stage for how future large safety-critical systems will be developed. Further it looks like the Julia programming language will play a central part in this.

Written by smist08

October 7, 2018 at 10:28 pm

Performance Testing in Swift

with one comment


A couple of blog posts ago I covered writing my first Swift program for iOS so that I could draw a Koch Snowflake on an iPad or an iPhone. Then last time I covered adding unit tests to that project. This time I’m going to add performance tests.

In the process of adding performance tests, I had to refactor the test project and we’ll also look at why that was and how it makes it better going forwards as more tests are added. I’ll also mention a few things that should be done if this project gets a bit bigger.

I put an updated version of the Koch Snowflake project on Google Drive here.

Performance Tests in XCode

Of course you could instrument your program yourself and perhaps write the performance results out to a file or something, for that matter you can drill down into the Swift test case class and have a look at their implementation. But XCode gives you a bit of support so you generally don’t need to. If you add self.measureBlock {} around code in a unit test then the time taken of the code inside the measureBlock will be recorded and reported inside XCode as shown in the following screenshot.

Screen Shot 2016-06-01 at 3.32.53 PM

Actually it does a bit more than that. When you add measureBlock to a unit test, then when you run that unit test, it won’t just be run once, but will be run ten times, so that the average and standard deviation will be recorded. Due to this it is crucial that any performance tests are idempotent. You can also set a baseline, so the percentage deviation from the baseline gets recorded. This is shown in the following screenshot that is a drill down from the previous screen shot.

Screen Shot 2016-06-01 at 3.33.04 PM

Hence XCode gives a fairly painless way to add some performance metrics to your unit tests.

Test Case Organization

Generally, you want your unit tests to run against every build or your product, so you want then to run in a second or two. Once the performance tests get longer, you will probably want to separate them off into a separate test group and then run this test group perhaps once over night. I haven’t done that, but if the project gets any bigger then I will.

In fact, the test framework inside XCode is quite good for performing integration tests (which would run against real databases and real servers), but since these may require some setup or be quite time consuming, you could also set these to run once per night.

There is also a separate framework for UI testing, which again is too slow to run against every build, but makes sense to run every night.

Refactoring the Unit Tests

For the performance test, I wanted to record the time it takes to draw the Koch Snowflake at various fractal levels. To do this I wanted to do something like the previous testInitialViewController routine, but it contained a lot of setup code. So first the unit test framework includes setUp function that is called before the unit tests are run and a tearDown routine that is called after then finished. So I moved the creating of the graphic context to this routine, along with the code to get the view controller started. Then it was fairly easy to add tests for fractal levels 3 through 7.

Last time I just had 2 unit tests, each was quite large and performed multiple things. Now we’ve split things up into more unit tests that do less, which is generally a better practice. This was actually forced on me since you can only have one measureBlock in any unit test, so I couldn’t performance test the different fractal levels in the same unit test (at least with separate timings). Really I should break up the turtle graphics tests into multiple unit tests, perhaps next time.

The reason I went all the way to fractal level 7, was that the performance reports in XCode are often 2 decimal places (or sometimes 3 decimal places) on the number of seconds the test takes. For my fractal, the drawing is quite quick so I needed go this high to get some longer test times recorded (kind of a good problem to have). I could have gone higher or put them in an additional loop, but thought this was sufficient.

//  KochSnowFlakeTests.swift
//  KochSnowFlakeTests
//  Created by Stephen Smith on 2016-05-13.
//  Copyright © 2016 Stephen Smith. All rights reserved.

import XCTest

@testable import KochSnowFlake

class KochSnowFlakeTests: XCTestCase {
    var storyboard:UIStoryboard!
    var viewController:ViewController!

    override func setUp() {
        // Put setup code here. This method is called before the invocation of each test method in the class.

        UIGraphicsBeginImageContextWithOptions(CGSize(width: 50, height: 50), false, 20);

        self.storyboard = UIStoryboard(name: "Main", bundle: nil)

        self.viewController = storyboard.instantiateInitialViewController() as! ViewController
        _ = viewController.view

    override func tearDown() {
        // Put teardown code here. This method is called after the invocation of each test method in the class.

    func testTurtleGraphics() {
        // Test the turtle graphics library.
        // Note we need a valid graphics context to do this.

        let context = UIGraphicsGetCurrentContext();
        let tg = TurtleGraphics(inContext: context!);
        XCTAssert(tg.x == 50, "Initial X value should be 50");
        XCTAssertEqual(tg.y, 150, "Initial Y value should be 150");
        XCTAssertEqual(tg.angle, 0, "Initial angle should be 0");
        XCTAssertEqual(tg.x, 60, "X should be incremented to 60");
        XCTAssertEqual(tg.y, 150, "Initial Y value should be 150");
        XCTAssertEqual(tg.angle, 0, "Initial angle should be 0");
        XCTAssertEqualWithAccuracy(tg.x, 60, accuracy: 0.0001, "X should be o 60");
        XCTAssertEqualWithAccuracy(tg.y, 160, accuracy: 0.0001, "Initial Y value should be 160");
        XCTAssertEqual(tg.angle, 90, "Initial angle should be 90");
        XCTAssertEqualWithAccuracy(tg.x, 60 + 10 * sqrt(2) / 2, accuracy: 0.0001, "X should be o 60+10*sqrt(2)/2");
        XCTAssertEqualWithAccuracy(tg.y, 160 + 10 * sqrt(2) / 2, accuracy: 0.0001, "Initial Y value should be 160+10*sqrt(2)/2");
        XCTAssertEqual(tg.angle, 45, "Initial angle should be 45");

    func testPerformanceLevel3()
        // Test that the storyboard is connected to the view controller and
        // that we can create and use the view and controls.

        viewController.fractalLevelTextField.text = "3"
        self.measureBlock {
            self.viewController.fracView.drawRect(CGRect(x:0, y:0, width: 50, height: 50))

        XCTAssertTrue(viewController.fracView.level == 3)
        // This next line is just to get 100% code coverage.

    func testPerformanceLevel4() {
        // This is an example of a performance test case.
        viewController.fractalLevelTextField.text = "4"
        self.measureBlock {
            self.viewController.fracView.drawRect(CGRect(x:0, y:0, width: 50, height: 50))

    func testPerformanceLevel5() {
        // This is an example of a performance test case.
        viewController.fractalLevelTextField.text = "5"
        self.measureBlock {
            self.viewController.fracView.drawRect(CGRect(x:0, y:0, width: 50, height: 50))

    func testPerformanceLevel6() {
        // This is an example of a performance test case.
        viewController.fractalLevelTextField.text = "6"
        self.measureBlock {
            self.viewController.fracView.drawRect(CGRect(x:0, y:0, width: 50, height: 50))

    func testPerformanceLevel7() {
        // This is an example of a performance test case.
        viewController.fractalLevelTextField.text = "7"
        self.measureBlock {
            self.viewController.fracView.drawRect(CGRect(x:0, y:0, width: 50, height: 50))



I found adding performance tests to my fractal iOS application quite easy. XCode gives quite nice support to perform these tests painlessly, hopefully motivating more programmers to include them.

At this point I’m not going to optimize the code as it is running fast enough. But if I ever take on drawing more sophisticated or complicated fractals, then drawing speed becomes really important. Some things to consider would be how efficient in the recursive algorithm used, and whether I’m efficiently using floating point and integer arithmetic (or are there unnecessary conversions or perhaps too much precision being used).

Written by smist08

June 7, 2016 at 2:15 am