Stephen Smith's Blog

Musings on Machine Learning…


Interrupting the ARM Processor


Introduction

I recently published my book: “Raspberry Pi Assembly Language Programming”. Information on my book is available here. In my book, I cover the Assembly Language instructions you can use when running under Raspbian, so we always have working examples we can play with. However, there is more to the ARM processor, the details of which are handled transparently by Linux, so we don’t need to (and can’t) play with them. In this article we’ll start to look at how the ARM processor deals with interrupts. This will be a simplified discussion, so we don’t get bogged down in the differences between Raspberry Pi models, the workings of the various interrupt controllers, the various ARM operating modes, or interactions with the virtual memory manager.

What Are Interrupts?

Interrupts started as a mechanism for devices to report they have data asynchronously. For instance, if you ask the disk drive to read a block of data, then the computer can keep doing other work (perhaps suspend the requesting process and continue running another process), until the disk drive sends the processor an interrupt telling it that it has retrieved the data. The ARM processor then needs to quickly process this data so that the disk drive can go on to other work. Some devices need to have the data handled quickly or it will be overwritten by new data being processed by the device. Most hardware devices have a limited buffer or queue of data they can hold before it is overwritten.

The interrupt mechanism has also been used for additional purposes, like reporting memory access errors and illegal instruction errors. There are also a number of system timers that send interrupts at regular intervals; these can be used to update the system clock or to preempt the current task, giving other tasks a turn under the operating system’s multitasking algorithm. Operating system calls are often implemented using interrupts, since a side effect of an interrupt being triggered is to change the operating state of the processor from user mode to system mode, allowing the operating system to run at a more privileged level. You can see this described in Chapter 7 of my book, on Linux Operating System Services.
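For example, on 32-bit ARM Linux a program makes a system call by placing the call number in R7 and executing the SVC (supervisor call, formerly SWI) instruction, which raises the software interrupt and switches the processor to the privileged mode the kernel runs in. A minimal sketch, using the exit system call:

MOV R0, #0   @ exit status of 0
MOV R7, #1   @ 1 is the exit system call number on 32-bit ARM Linux
SVC #0       @ software interrupt: trap into the kernel to handle the call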

How Are Interrupts Called?

If a device receives data, it notifies the interrupt controller, which then maps the interrupt to one of the ARM processor’s interrupt codes. Control immediately transfers to the code contained at a specific memory location. Below is a table of the various interrupts supported by one particular ARM model. In Raspbian, the memory offsets are added to 0xffff0000 to get the actual address.
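For a classic ARM core (the exact details vary a little between ARM versions), the vector layout is:

Offset   Exception
0x00     Reset
0x04     Undefined Instruction
0x08     Software Interrupt (SWI/SVC)
0x0C     Prefetch Abort
0x10     Data Abort
0x14     Reserved (unused)
0x18     IRQ (normal interrupt)
0x1C     FIQ (fast interrupt)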

Each ARM instruction is 32-bits in size, so each slot in the interrupt table can hold a single ARM instruction; hence this has to be a branch instruction, or some other instruction that modifies the program counter. The exception is the last entry, for the FIQ or “Fast” interrupt. Since fast interrupts need fast processing, it is deemed too slow to execute a branch first, so the FIQ handler can be placed directly at this address, which is why it is the last entry in the table.

Some example instructions you might see in this table are:

B myhandler @ will be a PC relative address
MOV PC, #0x1234 @ has to be a valid operand2
LDR PC, [PC, #100] @ load PC with value from nearby memory

You can read about operand2 and the details of these instructions in my book.

Interrupt Calling Convention

When you call a regular function, there is a protocol, or calling convention, that specifies who is responsible for saving which registers. Some are saved by the caller if it needs them preserved and some have to be saved by the callee if it uses them. There are conventions on how to use the stack and how to return values from functions. With interrupt routines, the interrupted code can’t take part in any such convention: it has been interrupted and has no knowledge of what is happening. Preserving the state of things is handled entirely by a combination of the ARM CPU and the interrupt handler code.

The ARM processor has a bank of extra (shadow) registers that it will automatically switch with the regular registers when an interrupt happens. This is especially necessary to preserve the Current Program Status Register (CPSR). The block diagram below shows the banks of registers for the various interrupt states.

Consider the code:

CMP R6, #66
BEQ _loop

If the interrupt occurs between these two instructions, then the CPSR (which holds the result of the comparison) must be preserved or the BEQ instruction could do the wrong thing. This is why the ARM processor switches the CPSR with one of the SPSR registers when the interrupt occurs and then switches them back when the interrupt service routine exits.

Similarly there are shadow registers for the Stack Pointer (SP) and Link Return (LR) register.

For fast (FIQ) interrupts, the ARM CPU also switches registers R8-R12, so the FIQ interrupt handler has five registers to use, without wasting time saving and restoring things.

If the interrupt handler uses any other registers then it needs to store them on the stack on entry and pop them from the stack before returning.
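To make this concrete, here is a rough sketch of what a simple IRQ handler might look like; the label irq_handler and the device servicing step are placeholders for whatever the operating system actually installs:

irq_handler:
    PUSH    {R0-R3, R12}    @ save any regular registers the handler will use
    @ ... read the interrupt controller and service the device here ...
    POP     {R0-R3, R12}    @ restore them before returning
    SUBS    PC, LR, #4      @ return to the interrupted code, restoring CPSR from SPSR

The SUBS return instruction is covered below.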

Interrupting an Interrupt?

When an interrupt occurs, the ARM processor disables interrupts until the interrupt handler returns. This is the simplest case, since the operating system writer doesn’t have to worry about their interrupt routine being interrupted. This works fine as long as the handler can finish quickly, but some interrupt handlers have to do quite a bit of work, for instance if a device returns 4k of data to be processed. Notice that the shadow registers have separate copies for each type of interrupt. This way, if you are handling an IRQ interrupt, it is easy to enable FIQ interrupts and allow the IRQ handler to be interrupted by the higher priority FIQ. Newer interrupt controllers support more sophisticated nested interrupt handling, but that can be the topic for another article. Linux can disable or enable interrupts as it needs, for instance to finish initialization on a reboot before turning on interrupts.

Returning from an Interrupt

The instruction used to return from an interrupt is interesting:

SUBS R15, R14, #4

This instruction takes the Link Return register (LR, which is R14), subtracts 4, and places the result in the Program Counter (PC, which is R15). The S suffix on SUBS, combined with the PC as the destination, tells the ARM processor this is an exception return, so it can re-swap the shadow registers and restore the CPSR from the SPSR.

Normally we would just move LR to PC to return, so why subtract 4? The reason is the instruction pipeline. Remember the classic pipeline model is three steps: first the instruction is fetched from memory, then it is decoded, and then it is executed. When we were interrupted, the saved LR reflects how far ahead the pipeline had gotten, so we lost the last instruction decode and need to go back and do it again. For example, if the instruction at 0x8004 had been fetched but not yet executed when the interrupt arrived, LR holds 0x8008, and subtracting 4 resumes execution at 0x8004.

This is a bit of a relic, since newer ARM processors have much more sophisticated pipelines, but once this behaviour was ingrained in enough code, the ARM processor had to respect it and stick with this scheme.

Summary

This was a quick introductory overview of how the ARM processor handles interrupts. You don’t need to know this unless you are working on the ARM support in the Linux kernel, or you are creating your own operating system. Still it is interesting to see what is going on under the hood as the ARM Processor and Linux operating system provide all their services to make things easy for your programs running in user mode.

Written by smist08

November 22, 2019 at 11:33 am

Out-of-Order Instructions


Introduction

We think of computer processors executing a set of instructions one at a time, in sequential order. As programmers, this is exactly what we expect the computer to do, and the idea of the computer executing our carefully written code in a different order is terrifying: we would expect our program to fail, producing wrong results or crashing. However, we see manufacturers claiming their processors execute instructions out-of-order and that this is a feature that improves performance. In this article, we’ll look at what is really going on here and how it can benefit us, without causing too much fear.

Disclaimer

ARM defines the Instruction Set Architecture (ISA), which specifies the Assembly Language instruction set. ARM provides some reference implementations, but individual manufacturers can take these, customize them, or develop their own independent implementation of the ARM instruction set. As a result, the internal workings of ARM processors differ from manufacturer to manufacturer. A main point of difference is in performance optimizations. Apple is very aggressive in this regard, which is why the ARM processors in iPads and iPhones beat the competition. This means the level of out-of-order execution differs from manufacturer to manufacturer; further, it is much more prevalent in newer ARM chips. As a result, the examples in this article will apply to some ARM chips but not all.

A Couple of Simple Cases

Consider the following small bit of code to multiply two numbers then load another number from memory and add it to the result of the multiplication:

MUL R3, R4, R5 @ R3 = R4 * R5
LDR R6, [R7]   @ Load R6 with the memory pointed to by R7
ADD R3, R6     @ R3 = R3 + R6

The ARM Processor is a RISC processor and its goal is to execute each instruction in 1 clock cycle. However, multiplication is an exception and takes several clock cycles longer due to the loop of shifting and adding it has to perform internally. The load instruction doesn’t rely on the result of the multiplication and doesn’t involve the arithmetic unit. Thus it’s fairly simple for the ARM Processor to see this and execute the load while the multiply is still churning away. If the memory location is in cache, chances are the LDR will complete before the MUL, and hence we say the instructions executed out-of-order. The ADD instruction then needs the results from both the MUL and LDR instructions, so it needs to wait for both of these to complete before executing its addition.

Consider another example of three LDR instructions:

LDR R1, [R4] @ memory in swap file
LDR R2, [R5] @ memory not in cache
LDR R3, [R6] @ memory in cache

Here the memory being loaded by the first instruction has been swapped out to secondary storage, so loading it is going to be slow. The second memory location is in regular memory. DDR4 memory, like that used in the new Raspberry Pi 4, is pretty fast, but not as fast as the CPU, and the memory controller is also busy fetching instructions, so this second LDR might take a couple of cycles to execute: it makes a request to the memory controller, and that request is queued with everything else going on. The third instruction assumes the memory is in the CPU cache and hence is processed immediately, so this instruction really does take only 1 clock cycle.

The upshot is that these three LDR instructions could well complete in reverse order.

Newer ARM processors can look ahead through the instruction stream for independent instructions to execute; the size of this pool determines how out-of-order things can get. The important point is that instructions with dependencies can’t start early, so to the programmer it looks like their code is executing in order, and all this magic is transparent to the correct execution of the program.

Since the CPU is executing all these instructions at once, you might wonder what the value of the program counter register (PC) is. This register has a very precisely defined value, since it is used for PC relative addressing, so the PC can’t be affected by out-of-order execution.

Coprocessors

All newer ARM processors include floating-point coprocessors and NEON vector coprocessors. The instructions that execute on these usually take a few instruction cycles to complete. If the instructions that follow a coprocessor instruction are regular ARM instructions and don’t rely on the results of coprocessor operations, then they can continue to execute in parallel with the coprocessor. This is a handy way to get more code parallelism going, keeping all aspects of the CPU busy. Intermixing coprocessor and regular instructions is another great way to leverage out-of-order execution for better performance.
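As a rough sketch (the register choices here are arbitrary), something like the following lets the integer pipeline keep working while the floating point unit is busy:

VADD.F32 S0, S1, S2   @ floating point add runs on the VFP/NEON unit
ADD      R1, R2, R3   @ integer add can proceed in parallel
LDR      R4, [R5]     @ so can this load, since neither depends on S0
VMUL.F32 S3, S0, S0   @ this one must wait for the VADD result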

Compilers and Code Generation

This means that if a compiler’s code generator, or an Assembly Language programmer, rearranges some of their instructions, they can get more things happening at once in parallel, giving the program better performance. ARM Holdings contributes to the GNU Compiler Collection (GCC) to fully utilize the optimizations present in their reference implementations. In the ARM specific options for GCC, you can select the ARM processor version that matches your target and get more advanced optimizations. Since Apple creates their own development tools under XCode, they can add optimizations specific to their custom ARM implementations.

As Assembly Language programmers, if we want to get the absolute best performance we might consider re-arranging some of our instructions so that instructions that are independent of each other are in a row and hopefully can be executed in parallel. This can require quite a bit of testing to reverse engineer the exact out-of-order instruction capability of your particular target ARM processor model. As always with performance optimizations, you must test the performance to prove you are improving things, and not just making your code more cryptic.
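As a contrived sketch (the registers are arbitrary and any real gain depends on the specific core), compare:

@ Original order: the ADD stalls waiting for the MUL, holding up the loads
MUL R3, R4, R5
ADD R3, R3, R6
LDR R7, [R8]
LDR R9, [R10]

@ Rearranged: the independent loads can run while the MUL is still working
MUL R3, R4, R5
LDR R7, [R8]
LDR R9, [R10]
ADD R3, R3, R6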

Interrupts

This all sounds great, but what happens when an interrupt happens? This could be a timer interrupt to say your time-slice is up and another process gets to use the ARM Core, or it could be that more data needs to be read from the Wifi or a USB device.

Here the ARM CPU designer has a choice, they can forget about the work-in-progress and handle the interrupt quickly, or they can wait a couple of cycles to let work-in-progress complete and then handle the interrupt. Either way they have to allow the interrupt handler to save the current context and then restore the context to continue execution. Typically interrupt handlers do this by saving all the CPU and coprocessor registers to the system stack, doing their work and then restoring state.

When you see an ARM processor advertised as designed for real-time or industrial use, this typically means that it handles interrupts quickly with minimal delay. In this case, the work-in-progress is discarded and will be redone after the interrupt is finished. For ARM processors designed for general purpose computing, this usually means that user performance is more important than being super responsive to interrupts and hence they can let some of the work-in-progress complete before servicing the interrupt. For general purpose computing this is ok, since the attached devices like USB, ethernet and such have buffers that can hold enough contents to wait for the CPU to get around to them.

A Step Too Far and Spectre

Hardware designers went even further with branch prediction: if a conditional branch instruction needs to wait for a condition code to be set, the CPU doesn’t wait but keeps going, assuming one branch direction (perhaps based on the result from the last time this code executed). The problem is that the CPU has to save its current state in case it guessed wrong and needs to go back. This speculative state was kept in CPU caches that had no security protection, and the Spectre attack figured out a way to get at this data, causing data leakage across processes or even across virtual machines. The whole Spectre debacle showed that great care has to be taken with these sorts of optimizations.

Heat, the Ultimate Gotcha

Suppose your ARM processor has four CPU cores and you write a brilliant Assembly Language program that uses all four cores and fully exploits out-of-order execution. Your program is now using every bit of the ARM CPU: each core is intermixing regular ARM, floating point and NEON instructions, and you have arranged your instructions to get the arithmetic unit operating in parallel with the memory unit. This will be the fastest implementation yet. Then you run your program; it gets off to a great start, but then suddenly slows down to a crawl. What happened?

The enemy of parallel processing on a single chip is heat. Everything the CPU does generates a little heat. The more things you get going at once the more heat will be generated by the CPU. Most ARM based computers like the Raspberry Pi assume you won’t be running the CPU so hard, and only provide heat dissipation for a more standard load. This is why Raspberry Pis usually do so badly playing high-res videos. They can do it, as long as they don’t overheat, which typically doesn’t take long.

This leaves you with a real engineering problem. You need to either add more cooling to your target device, or deliberately reduce the CPU usage of your program, where perhaps paradoxically you get more work done using two cores rather than four, because you won’t be throttled due to overheating.

Summary

This was a quick overview of out-of-order instructions. Hopefully you don’t find these scary and keep in mind the potential benefits as you write your code. As newer ARM processors come to market, we’ll be seeing larger and larger pools of instructions executed in parallel, where the ability for instructions to execute out-of-order will have even greater benefits.

If you are interested in machine code or Assembly Language programming, be sure to check out my book: “Raspberry Pi Assembly Language Programming” from Apress. It is available on all major booksellers or directly from Apress here.

Written by smist08

November 15, 2019 at 11:11 am

RISC Instruction Encoding


Introduction

Modern microprocessors execute programs from memory that are formatted specifically for the processor and the instructions it is capable of executing. This machine code is generated by tools, either fairly directly from Assembly Language source code or via a compiler that translates a high level language to machine code. There are two popular philosophies on how machine code is structured. One is Reduced Instruction Set Computers (RISC), exemplified by ARM, RISC-V, PowerPC and MIPS processors; the other is Complex Instruction Set Computers (CISC), exemplified by Intel and AMD processors. In RISC computers, each instruction is quite small and does a tiny bit of work; in CISC computers, the instructions tend to be larger and each one does more work. The advantage of RISC processors is that the circuitry is simpler, which means they use less power; this is why nearly all mobile devices use RISC processors. In this article we will be looking at some of the tricks RISC computers use to keep their instructions small and quick.

32-Bit Instructions

Most RISC processors use 32-bit machine code instructions. It doesn’t matter if the processor is 32-bit or 64-bit; that only refers to the size of pointers for memory addressing and the size of the registers. In both cases the instructions stay 32-bits in length. With all rules there are exceptions: in RISC-V processors most instructions are 32-bit, but there is a facility to allow longer instructions where necessary, and ARM processors, in 32-bit mode, have the ability to limit instructions to 16-bits in length. Modern processors are very powerful and have a lot of functionality, so how do they encode all the information needed for an instruction into 32-bits? This restriction imposes a lot of discipline on the instruction set designers, but the solutions they have come up with are quite interesting. In comparison, Intel x86 instructions are variable length and can be up to 120 bits long.

Having all the instructions 32-bits in length makes building an efficient execution pipeline much simpler, since you can load and start working on a set of instructions in parallel. You don’t need to decode one instruction to learn where the next one starts; you know there is a new instruction every 4 bytes in memory. This uniformity saves a lot of complexity and greatly enhances instruction execution throughput.

Where Do the Bits Go?

What needs to be encoded in a machine language instruction? Here are some of the possible components:

  1. The opcode. This tells the processor what the instruction does, whether it’s adding two numbers, loading data from memory or jumping to another program location. If the opcode takes 8-bits then there are 256 possible instructions. To really save space, some opcodes can use fewer bits; for instance, if an opcode starts with 011, the remaining bits can go to the immediate value.
  2. Registers. Microprocessors load data into registers and then process the data in the registers. Often two or three registers need to be specified in an instruction, like the two numbers to add and then where to put the result. If there are 32 registers, then each register field will take 5-bits.
  3. Immediate data. Most processors have a way to encode some data in an instruction. Like “LOAD R1, 5” might mean load the value 5 into register R1. Here 5 is data encoded in the instruction, and called an immediate value. The size of these varies based on the instruction and use cases.
  4. Memory Addresses. Data has to be loaded from memory, or program execution has to jump to a different memory location. Note that in a modern computer memory addresses are either 32-bit or 64-bits. These are both too big to fit in a 32-bit instruction (we need at least an opcode as well). In RISC, how do we specify memory addresses?
  5. Bits for additional parameters. Perhaps there are several addressing modes, or perhaps other options for an instruction that need to be encoded. Often there are a few bits in each instruction for this purpose.

 

That’s a lot of information to pack into a 32-bit instruction. How do they do it? My introduction to Raspberry Pi Assembly Language shows how this is done for ARM processors in 32-bit mode.
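As a concrete example, a classic 32-bit ARM data-processing instruction (such as ADD or MOV) packs its fields roughly like this:

bits 31-28   condition code (execute only if the flags match)
bits 27-26   00 (marks this as a data-processing instruction)
bit  25      immediate flag (is operand2 an immediate or a register?)
bits 24-21   opcode (ADD, SUB, MOV, CMP, ...)
bit  20      S bit (should the instruction set the condition flags?)
bits 19-16   Rn, the first operand register
bits 15-12   Rd, the destination register
bits 11-0    operand2 (a shifted register, or an 8-bit immediate with a 4-bit rotate)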

How to Load a Register

Let’s look at how to load a 32-bit register with data. We can’t fit a full 32-bit value inside a 32-bit instruction, so what do we do? You might suggest that we load the value from memory rather than encode the value in the instruction. This is a legitimate thing to do, but it just moves the problem, since we now need to get the 32 or 64-bit memory address into a register first.

First, we could do it in two steps; perhaps we can fit a 16-bit value in an instruction and then use two instructions to load the full value. In an ARM processor, there is a MOV instruction that can load a 16-bit immediate value into the bottom half of a register, and a MOVT instruction that loads a 16-bit immediate value into the top 16-bits of a register. Suppose we want to load 0x12345678 into register R1; then in ARM 32-bit Assembly we would encode:

MOV  R1, #0x5678   @ load the bottom 16-bits (this also zeroes the top 16-bits)
MOVT R1, #0x1234   @ load the top 16-bits, leaving the bottom 16-bits alone

This works, and we do expect that working in RISC is going to take lots of small instructions to perform the work we need to get done. However, this is somehow not satisfying, since loading a constant is something we do a lot and it seems wasteful to take two instructions. The other issue is that if we are running in 64-bit mode and want to load a 64-bit register, this takes four instructions.
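For example, in 64-bit (AArch64) mode, loading 0x123456789ABCDEF0 into X1 would look something like this sketch, building up the register 16-bits at a time:

MOVZ X1, #0xDEF0            @ bits 0-15, zeroing the rest of the register
MOVK X1, #0x9ABC, LSL #16   @ bits 16-31, keeping the other bits
MOVK X1, #0x5678, LSL #32   @ bits 32-47
MOVK X1, #0x1234, LSL #48   @ bits 48-63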

Another trick is to make use of the Program Counter (PC) register. This register points at the instructions currently being executed, so if we can position the value nearby, we can load it by dereferencing the PC plus a small offset. As long as the offset fits in the room we have for an immediate value, this works. In the ARM world, the Assembler helps us generate this code. We write something like:

LDR R1, =mydata

...

mydata: .WORD 0x12345678

Then the Assembler will convert the LDR instruction to something like:

LDR R1, [PC, #20]

Which means load the data pointed to by PC + 20 into R1. Now it only takes one instruction to load the data.  This technique has the advantage that it will remain one instruction to execute when dealing with 64-bit data.

Summary

This was a quick discussion of how RISC processors encode each machine code instruction in a 32-bit value. This is one of the key things that keeps RISC processors simple, allowing them to be quick while remaining power efficient.

If you are interested in machine code or Assembly Language programming, be sure to check out my book: “Raspberry Pi Assembly Language Programming” from Apress. It is available on all major booksellers or directly from Apress here.

Written by smist08

November 8, 2019 at 11:55 am

Raspberry Pi Assembly Language Programming


 

Introduction

My new book “Raspberry Pi Assembly Language Programming” has just been published by Apress. This is my first book to be published by a real publisher and I’m thrilled to see it appearing on websites of booksellers all over the Internet. In this blog post I’ll talk about how this book came to exist, the process of writing and publishing it and a bit about the book itself.

For anyone interested in this book, here are a few places where it is available:

Most of these sites let you see a preview and the table of contents.

This blog’s dedicated page to my book.

How this Book Came About

I purchased my Raspberry Pi 3+ in late 2017 and had a great deal of fun playing with it. I wrote quite a few blog posts on the Pi; a directory of these is available here. The Raspberry Pi package I purchased included a breadboard and a selection of electronic components. I put together a set of LEDs connected to the Pi’s GPIO ports. I then wrote a series of articles on making these LEDs flash using various programming languages including C, Python, Scratch, Fortran, and Erlang. In early 2018 I was interested in learning more about how the Pi’s ARM processor works and delved into Assembly Language programming. This resulted in two blog posts, an introduction and then my flashing LED program ported to ARM Assembly Language.

Earlier this year I was contacted by an Apress Talent Acquisition agent who had seen my blog articles on ARM Assembly Language and wanted to know if I wanted to develop them into a book. I thought about it over the weekend and was intrigued. The material I found when writing the blog articles wasn’t great, and I felt I could do better. I replied to the agent and we had a call to discuss the book. He had me write up a proposal and possible table of contents. I did this, Apress accepted it and sent me a contract to sign.

The Process

Apress provided a Word style sheet and a written style guide. My writing process has been to write in Google Docs and then have my spouse, a professional editor, edit it. The collaboration features of Google Docs are just too good to give up. So I wrote the chapters in Google Docs, got them edited, and then transferred them to MS Word and applied the Apress style sheet.

I worked with a coordinating editor at Apress who was very energetic in getting all the pieces done. She found a technical editor who would provide a technical review of each chapter as I wrote it. He was located in the UK, so often I would submit a chapter and see it edited overnight.

Once I had submitted all the chapters then a senior development editor gave the whole book a review. At that point I thought I was done, but then the book was given to Springer’s (Apress’s parent company) production department who did another editing pass. I was surprised that the production department still found quite a few things that needed fixing or improving.

After all that the book appeared fairly quickly. I like the cover, they used my photo of my breadboard with the flashing LEDs. As of today, the book is available at most booksellers, some with stock and some on preorder. I signed the contract in June and did the bulk of the writing in July and August. Overall, I’m pretty happy with the process and how things turned out.

The Book

My philosophy was to introduce complete working programs from Chapter 1 with the traditional “Hello World” program. I only covered topics where you could write the code with the tools included with the Raspberry Pi and run them. I lay the foundations for how to write larger Assembly programs, with how to code the various structured programming constructs, but also include a chapter on how to interoperate with C and Python code.

Raspbian is a 32-bit operating system, as older Raspberry Pis and the Raspberry Pi Zero can only run 32-bit code. I didn’t want to leave out 64-bit code, as there are 64-bit versions of Linux from other distributions, like Ubuntu, that are available for the Pi. So I included a chapter on ARM 64-bit Assembly along with guidelines on how to port your 32-bit code to 64-bit. I then included 64-bit versions of several of the programs we had developed along the way.

There is a lot of interest in ARM Assembly Language, especially from hackers, as all phones, tablets and even a few laptops are running ARM processors now. I included a number of hacking related topics like how to reverse engineer code, as security professionals are very interested in this as they work to protect the mobile devices utilized by their organizations.

The ARM Processor is a good example of a RISC processor, so if you are interested in RISC, this book will give a good introduction to the concepts, like how to do everything with instructions that are only 32-bits in length. Once you understand ARM Assembly, picking up the Assembly Language of another RISC processor like RISC-V becomes much easier.

The book also covers how to program the floating point processor included with most ARMs along with the NEON vector processor that is available on newer Raspberry Pis.

Summary

If you are interested in learning Assembly Language, please check out my book. The Raspberry Pi provides a great platform to do this. Even if you only program in higher level languages, knowing Assembly Language will help you understand what is going on at a deeper level. How modern processors design their Assembly Language to maximize program performance and minimize memory usage is quite fascinating and I hope you find the topic as interesting as I do.

 

Written by smist08

November 1, 2019 at 11:22 am