Stephen Smith's Blog

Musings on Machine Learning…

Out-of-Order Instructions

leave a comment »

Introduction

We tend to think of a computer processor executing our instructions one at a time, in the order we wrote them. As programmers, that is exactly what we expect, and the idea of the computer executing our carefully written code in a different order is alarming: surely the program would produce wrong results or crash. Yet manufacturers advertise that their processors execute instructions out-of-order and claim this as a feature that improves performance. In this article, we'll look at what is really going on here and how it can benefit us, without causing too much fear.

Disclaimer

ARM defines the Instruction Set Architecture (ISA), which specifies the Assembly Language instruction set. ARM provides reference implementations, but individual manufacturers can take these, customize them, or develop their own independent implementation of the ARM instruction set. As a result, the internal workings of ARM processors differ from manufacturer to manufacturer, and a main point of difference is performance optimization. Apple is very aggressive in this regard, which is why the ARM processors in iPads and iPhones beat the competition. The degree of out-of-order execution therefore varies between manufacturers, and it is much more prevalent in newer ARM chips. Consequently, the examples in this article apply to some ARM chips but not all.

A Couple of Simple Cases

Consider the following small bit of code to multiply two numbers then load another number from memory and add it to the result of the multiplication:

MUL R3, R4, R5 @ R3 = R4 * R5
LDR R6, [R7]   @ Load R6 with the memory pointed to by R7
ADD R3, R6     @ R3 = R3 + R6

The ARM processor is a RISC processor and its goal is to execute each instruction in one clock cycle. Multiplication is an exception, however, and takes several clock cycles longer due to the loop of shifting and adding it has to perform internally. The load instruction doesn't rely on the result of the multiplication and doesn't involve the arithmetic unit, so it is fairly simple for the processor to see this and execute the load while the multiply is still churning away. If the memory location is in cache, chances are the LDR will complete before the MUL, and hence we say the instructions executed out-of-order. The ADD instruction needs the results of both the MUL and the LDR, so it must wait for both of these to complete before performing its addition.

Consider another example of three LDR instructions:

LDR R1, [R4] @ memory in swap file
LDR R2, [R5] @ memory not in cache
LDR R3, [R6] @ memory in cache

Here the memory being loaded by the first instruction has been swapped out to secondary storage, so loading it is going to be slow. The second memory location is in regular memory. DDR4 memory, like that used in the new Raspberry Pi 4, is pretty fast, but not as fast as the CPU, which is also fetching instructions from it, so this second LDR might take a couple of cycles to execute: it makes a request to the memory controller and that request is queued with everything else going on. The third instruction assumes the memory is in the CPU cache and hence is processed immediately, so this instruction really does take only one clock cycle.

The upshot is that these three LDR instructions could well complete in reverse order.

Newer ARM processors can look ahead through the instruction stream for independent instructions to execute; the size of this look-ahead pool determines how out-of-order things can get. The important points are that instructions with dependencies can't start early, and that to programmers it still looks like their code is executing in order: all this magic is transparent to the correct execution of the program.

Since the CPU is executing all these instructions at once, you might wonder what the value of the program counter register (PC) is. This register has a precisely defined value, since it is used for PC-relative addressing, so the PC can't be affected by out-of-order execution.

Coprocessors

All newer ARM processors include floating-point and NEON vector coprocessors. The instructions that execute on these usually take a few instruction cycles to complete. If the instructions that follow a coprocessor instruction are regular ARM instructions and don't rely on the results of the coprocessor operation, then they can continue to execute in parallel with the coprocessor. This is a handy way to get more code parallelism going, keeping all aspects of the CPU busy. Intermixing coprocessor and regular instructions is another great way to leverage out-of-order execution to get better performance.
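
For example, here is a rough sketch (my own illustration with made-up registers, not code for any specific chip) of a VFP multiply followed by integer instructions that don't depend on it, so a capable core can keep both units busy:

VMUL.F64 D0, D1, D2   @ floating-point multiply, takes several cycles
LDR R0, [R4]          @ integer load, independent of D0, can proceed
ADD R1, R1, #1        @ integer add, also independent
VADD.F64 D3, D0, D5   @ needs the VMUL result in D0, so it must wait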

Compilers and Code Generation

This suggests that if a compiler code generator or an Assembly Language programmer rearranges instructions, more things can happen at once in parallel, giving the program better performance. ARM Holdings contributes to the GNU Compiler Collection (GCC) so it can fully utilize the optimizations present in their reference implementations. In the ARM-specific options for GCC, you can select the ARM processor version that matches your target and get more advanced optimizations. Since Apple creates their own development tools under Xcode, they can add optimizations specific to their custom ARM implementations.

As Assembly Language programmers, if we want the absolute best performance we might consider rearranging some of our instructions so that instructions that are independent of each other appear in a row and can hopefully be executed in parallel. This can require quite a bit of testing to reverse engineer the exact out-of-order capability of your particular target ARM processor model. As always with performance optimizations, you must measure to prove you are actually improving things, and not just making your code more cryptic.
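
As a rough illustration (my own sketch with arbitrary registers, not output from any real compiler), compare a dependent ordering with an interleaved version of the same work:

@ Dependent ordering: each instruction must wait for the one above it
MUL R0, R1, R2
ADD R0, R0, R3
MUL R4, R5, R6
ADD R4, R4, R7

@ Interleaved ordering: the two independent chains can overlap
MUL R0, R1, R2
MUL R4, R5, R6
ADD R0, R0, R3
ADD R4, R4, R7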

Interrupts

This all sounds great, but what happens when an interrupt happens? This could be a timer interrupt to say your time-slice is up and another process gets to use the ARM Core, or it could be that more data needs to be read from the Wifi or a USB device.

Here the ARM CPU designer has a choice: they can discard the work-in-progress and handle the interrupt immediately, or they can wait a couple of cycles to let the work-in-progress complete and then handle the interrupt. Either way, they have to allow the interrupt handler to save the current context and then restore it to continue execution. Typically interrupt handlers do this by saving all the CPU and coprocessor registers to the system stack, doing their work and then restoring the state.

When you see an ARM processor advertised as designed for real-time or industrial use, this typically means it handles interrupts quickly, with minimal delay; the work-in-progress is discarded and redone after the interrupt is finished. For ARM processors designed for general purpose computing, user performance is usually more important than being super responsive to interrupts, so they can let some of the work-in-progress complete before servicing the interrupt. For general purpose computing this is OK, since attached devices like USB and ethernet have buffers that can hold enough data while they wait for the CPU to get around to them.

A Step Too Far and Spectre

Hardware designers went even further with branch prediction: if a conditional branch instruction needs to wait for a condition code to be set, the CPU doesn't wait but keeps going, assuming one branch direction (perhaps based on the result from the last time this code executed). The catch is that the CPU has to save its current state in case it guessed wrong and needs to go back. This state was saved in an internal cache used only for this purpose, but it had no security protection, and the Spectre attack figured out a way to get at this data, leaking information across processes or even across virtual machines. The whole Spectre debacle showed that great care has to be taken with these sorts of optimizations.
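
As a hedged sketch of the kind of code involved (my own example, not taken from the Spectre papers), the branch below depends on condition flags that may not be ready yet, so the processor guesses a direction and executes speculatively:

CMP R0, #0        @ sets the condition flags, but the result may arrive late
BEQ skip          @ predicted taken or not taken before the flags are known
MUL R1, R2, R3    @ may execute speculatively and be discarded on a bad guess
skip: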

Heat, the Ultimate Gotcha

Suppose your ARM processor has four CPU cores and you write a brilliant Assembly Language program that uses all four cores and fully exploits out-of-order execution. Your program is now using every bit of the ARM CPU: each core is intermixing regular ARM, floating point and NEON instructions, and you have interleaved your ARM instructions so the arithmetic unit operates in parallel with the memory unit. This should be the fastest implementation yet. Then you run your program; it gets off to a great start, but then suddenly slows down to a crawl. What happened?

The enemy of parallel processing on a single chip is heat. Everything the CPU does generates a little heat. The more things you get going at once the more heat will be generated by the CPU. Most ARM based computers like the Raspberry Pi assume you won’t be running the CPU so hard, and only provide heat dissipation for a more standard load. This is why Raspberry Pis usually do so badly playing high-res videos. They can do it, as long as they don’t overheat, which typically doesn’t take long.

This leaves you with a real engineering problem. You either need to add more cooling to your target device, or you have to deliberately reduce the CPU usage of your program; perhaps paradoxically, you may get more work done using two cores rather than four, because you won't be throttled due to overheating.

Summary

This was a quick overview of out-of-order instructions. Hopefully you no longer find them scary, and you will keep the potential benefits in mind as you write your code. As newer ARM processors come to market, we'll be seeing larger and larger pools of instructions executing in parallel, where the ability for instructions to execute out-of-order will have even greater benefits.

If you are interested in machine code or Assembly Language programming, be sure to check out my book: “Raspberry Pi Assembly Language Programming” from Apress. It is available on all major booksellers or directly from Apress here.

Written by smist08

November 15, 2019 at 11:11 am

RISC Instruction Encoding

with one comment

Introduction

Modern microprocessors execute programs from memory that are formatted specifically for the processor and the instructions it is capable of executing. This machine code is generated by tools, either fairly directly from Assembly Language source code or via a compiler that translates a high level language to machine code. There are two popular philosophies on how machine code is structured. One is Reduced Instruction Set Computers (RISC), exemplified by ARM, RISC-V, PowerPC and MIPS processors; the other is Complex Instruction Set Computers (CISC), exemplified by Intel and AMD processors. In RISC computers each instruction is quite small and does a tiny bit of work; in CISC computers the instructions tend to be larger and each one does more work. The advantage of RISC processors is that the circuitry is simpler, which means they use less power; this is why nearly all mobile devices use RISC processors. In this article we will be looking at some of the tricks RISC computers use to keep their instructions small and quick.

32-Bit Instructions

Most RISC processors use 32-bit machine code instructions. It doesn't matter whether the processor is 32-bit or 64-bit; that refers to the size of pointers for memory addressing and the size of the registers, and in both cases the instructions stay 32-bits in length. As with all rules there are exceptions: in RISC-V processors most instructions are 32-bit, but there is a facility to allow longer instructions where necessary, and ARM processors in 32-bit mode have the ability to limit instructions to 16-bits in length. Modern processors are very powerful and have a lot of functionality, so how do they encode all the information needed for an instruction into 32-bits? This restriction imposes a lot of discipline on the instruction set designers, but the solutions they have come up with are quite interesting. In comparison, Intel x86 instructions are variable length and can be as long as 120 bits (15 bytes).

Having all the instructions 32-bits in length makes building an efficient execution pipeline much simpler, since you can load and start working on a set of instructions in parallel. You don't need to decode one instruction to learn where the next one starts; you know there is a new instruction every 4 bytes in memory. This uniformity saves a lot of complexity and greatly enhances instruction execution throughput.

Where Do the Bits Go?

What needs to be encoded in a machine language instruction? Here are some of the possible components:

  1. The opcode. This tells the processor what the instruction does, whether it's adding two numbers, loading data from memory or jumping to another program location. If the opcode takes 8-bits then there are 256 possible instructions. To really save space, some opcodes can use fewer bits; for instance, if an opcode starts with 011, the remaining bits could be given over to a larger immediate value.
  2. Registers. Microprocessors load data into registers and then process the data in the registers. Often two or three registers need to be specified in an instruction, like the two numbers to add and then where to put the result. If there are 32 registers, then each register field will take 5-bits.
  3. Immediate data. Most processors have a way to encode some data in an instruction. Like “LOAD R1, 5” might mean load the value 5 into register R1. Here 5 is data encoded in the instruction, and called an immediate value. The size of these varies based on the instruction and use cases.
  4. Memory Addresses. Data has to be loaded from memory, or program execution has to jump to a different memory location. Note that in a modern computer memory addresses are either 32-bit or 64-bits. These are both too big to fit in a 32-bit instruction (we need at least an opcode as well). In RISC, how do we specify memory addresses?
  5. Bits for additional parameters. Perhaps there are several addressing modes, or perhaps other options for an instruction that need to be encoded. Often there are a few bits in each instruction for this purpose.

 

That’s a lot of information to pack into a 32-bit instruction. How do they do it? My introduction to Raspberry Pi Assembly Language shows how this is done for ARM processors in 32-bit mode.
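
As a rough sketch (my own working, not an excerpt from the book), here is how a classic 32-bit ARM data-processing instruction such as ADD R1, R2, #5 packs everything into one 32-bit word:

@ Bits 31-28  cond    condition code (1110 = always execute)
@ Bits 27-26  00      data-processing instruction class
@ Bit  25     1       operand 2 is an immediate value
@ Bits 24-21  opcode  0100 = ADD
@ Bit  20     S       1 = update the condition flags
@ Bits 19-16  Rn      first operand register (R2 = 0010)
@ Bits 15-12  Rd      destination register (R1 = 0001)
@ Bits 11-8   rotate  rotation applied to the 8-bit immediate
@ Bits 7-0    imm8    the immediate value itself (5 = 00000101)
ADD R1, R2, #5        @ opcode, two registers and an immediate, all in 32 bits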

How to Load a Register

Let's look at how to load a 32-bit register with data. We can't fit a full 32-bit value inside a 32-bit instruction, so what do we do? You might suggest that we load the value from memory rather than encode it in the instruction. This is a legitimate thing to do, but it just moves the problem, since we now need to get the 32 or 64-bit memory address of that data into a register first.

First, we could do it in two steps: perhaps we can fit a 16-bit value in an instruction and then use two instructions to load the full value. In an ARM processor there is a MOV instruction that loads a 16-bit immediate value into the lower half of a register (clearing the upper half) and a MOVT instruction that loads a 16-bit immediate value into the top 16-bits of a register. Suppose we want to load 0x12345678 into register R1; then in ARM 32-bit Assembly we would write:

MOV  R1, #0x5678   @ load the low 16 bits, clearing the top 16 bits
MOVT R1, #0x1234   @ then load the top 16 bits, leaving the low half in place

This works, and we do expect that working in RISC will take lots of small instructions to get our work done. However it is somehow not satisfying, since this is something we do a lot and it seems wasteful to take two instructions. The other thing is that if we are running in 64-bit mode and want to load a 64-bit register, it will take four instructions, as sketched below.
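
For instance, here is a rough sketch (my own, with a made-up constant) of how 64-bit ARM code typically builds a 64-bit value 16 bits at a time using MOVZ and MOVK:

MOVZ X1, #0x1234, LSL #48    // load bits 63-48 and clear the rest of X1
MOVK X1, #0x5678, LSL #32    // insert bits 47-32, keeping the other bits
MOVK X1, #0x9ABC, LSL #16    // insert bits 31-16
MOVK X1, #0xDEF0             // insert bits 15-0, giving 0x123456789ABCDEF0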

Another trick is to make use of the Program Counter (PC) register. This register points to the instruction currently being executed, so if we can position the value near the current code then we can load it by dereferencing the PC plus a small offset. As long as the offset fits in the amount of room we have for an immediate value, this works. In the ARM world, the Assembler helps us generate this code. We write something like:

LDR R1, =mydata

...

mydata: .WORD 0x12345678

Then the Assembler will convert the LDR instruction to something like:

LDR R1, [PC, #20]

Which means load the data pointed to by PC + 20 into R1. Now it only takes one instruction to load the data.  This technique has the advantage that it will remain one instruction to execute when dealing with 64-bit data.
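
The same idea carries over to 64-bit mode. As a rough sketch (my own, not from the book), the Assembler still reduces the pseudo-instruction to a single PC-relative load even though the constant being fetched is now 64 bits:

LDR X1, =0x123456789ABCDEF0   // the assembler places the 64-bit constant in a nearby
                              // literal pool and emits one PC-relative load to fetch it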

Summary

This was a quick discussion of how RISC processors encode each machine code instruction as a 32-bit value. This is one of the key things that keeps RISC processors simple, allowing them to be fast while remaining simple, and hence more power efficient.

If you are interested in machine code or Assembly Language programming, be sure to check out my book: “Raspberry Pi Assembly Language Programming” from Apress. It is available on all major booksellers or directly from Apress here.

Written by smist08

November 8, 2019 at 11:55 am

Raspberry Pi Assembly Language Programming

leave a comment »

 

Introduction

My new book “Raspberry Pi Assembly Language Programming” has just been published by Apress. This is my first book to be published by a real publisher and I’m thrilled to see it appearing on websites of booksellers all over the Internet. In this blog post I’ll talk about how this book came to exist, the process of writing and publishing it and a bit about the book itself.

For anyone interested in this book, here are a few places where it is available:

Most of these sites let you see a preview and the table of contents.

How this Book Came About

I purchased my Raspberry Pi 3+ in late 2017 and had a great deal of fun playing with it. I wrote quite a few blog posts on the Pi; a directory of these is available here. The Raspberry Pi package I purchased included a breadboard and a selection of electronic components, and I put together a set of LEDs connected to the Pi's GPIO ports. I then wrote a series of articles on making these LEDs flash using various programming languages including C, Python, Scratch, Fortran, and Erlang. In early 2018 I was interested in learning more about how the Pi's ARM processor works and delved into Assembly Language programming. This resulted in two blog posts, an introduction and then my flashing LED program ported to ARM Assembly Language.

Earlier this year I was contacted by an Apress Talent Acquisition agent who had seen my blog articles on ARM Assembly Language and wanted to know if I wanted to develop them into a book. I thought about it over the weekend and was intrigued. The material I found when writing the blog articles wasn’t great, and I felt I could do better. I replied to the agent and we had a call to discuss the book. He had me write up a proposal and possible table of contents. I did this, Apress accepted it and sent me a contract to sign.

The Process

Apress provided a Word style sheet and a written style guide. My writing process has been to write in Google Docs and then have my spouse, a professional editor, edit it. The collaboration of Google Docs is just too good to do away with. So I wrote the chapters in Google Docs, got them edited and then transferred them to MS Word and applied the Apress style sheet.

I worked with a coordinating editor at Apress who was very energetic in getting all the pieces done. She found a technical editor who would provide a technical review of each chapter as I wrote it. He was located in the UK, so often I would submit a chapter and see it edited overnight.

Once I had submitted all the chapters then a senior development editor gave the whole book a review. At that point I thought I was done, but then the book was given to Springer’s (Apress’s parent company) production department who did another editing pass. I was surprised that the production department still found quite a few things that needed fixing or improving.

After all that the book appeared fairly quickly. I like the cover, they used my photo of my breadboard with the flashing LEDs. As of today, the book is available at most booksellers, some with stock and some on preorder. I signed the contract in June and did the bulk of the writing in July and August. Overall, I’m pretty happy with the process and how things turned out.

The Book

My philosophy was to introduce complete working programs from Chapter 1, starting with the traditional “Hello World” program. I only covered topics where you could write the code with the tools included with the Raspberry Pi and run it there. I lay the foundations for writing larger Assembly programs, showing how to code the various structured programming constructs, and also include a chapter on how to interoperate with C and Python code.

Raspbian is a 32-bit operating system, as older Raspberry Pis and the Raspberry Pi Zero can only run 32-bit code. I didn't want to leave out 64-bit code, as 64-bit versions of Linux from other distributions like Ubuntu are available for the Pi. So I included a chapter on ARM 64-bit Assembly, along with guidelines on how to port your 32-bit code to 64-bit, and then 64-bit versions of several of the programs we had developed along the way.

There is a lot of interest in ARM Assembly Language, especially from hackers, as all phones, tablets and even a few laptops now run ARM processors. I included a number of hacking-related topics, like how to reverse engineer code, since security professionals are very interested in this as they work to protect the mobile devices used by their organizations.

The ARM Processor is a good example of a RISC processor, so if you are interested in RISC, this book will give a good introduction to the concepts, like how to do everything with instructions that are only 32-bits in length. Once you understand ARM Assembly, picking up the Assembly language of another RISC processor like the Risc-V becomes much easier.

The book also covers how to program the floating point processor included with most ARMs along with the NEON vector processor that is available on newer Raspberry Pis.

Summary

If you are interested in learning Assembly Language, please check out my book. The Raspberry Pi provides a great platform to do this. Even if you only program in higher level languages, knowing Assembly Language will help you understand what is going on at a deeper level. How modern processors design their Assembly Language to maximize program performance and minimize memory usage is quite fascinating and I hope you find the topic as interesting as I do.

 

Written by smist08

November 1, 2019 at 11:22 am

The Race for 64-Bit Raspberry Pi 4 Linux

leave a comment »

Introduction

When the Raspberry Pi 4 was announced and shipped this past June, it caught everyone by surprise. No one was expecting a new Pi until sometime next year, if we were lucky. The Raspberry Pi 4 has faster, updated components, including a newer ARM processor and USB 3.0. Raspbian, the official version of Linux for the Pi, was updated to be based on Debian Buster and shipped before the official Debian Buster actually shipped. However, Raspbian is still 32-bit; the Raspberry Pi Foundation says this is so they only have to support one version of Linux for all Raspberry Pi devices.

Others in the Linux community have figured out how to run 64-bit Linux distributions on the Raspberry Pi. For instance, there are 64-bit versions of Ubuntu Mate, Ubuntu Server and Kali Linux. These work on the Raspberry Pi 3 but, due to changes in the Raspberry Pi architecture, didn't work on the Raspberry Pi 4 when it shipped. We still don't have official 64-bit releases, but we are reaching the point where the test builds are starting to work quite well.

Why 64-Bit?

To be honest, 64-bit Linux never ran very well on the Raspberry Pi 3. 64-bit Linux and 64-bit programs require quite a bit more memory than their 32-bit equivalents: each memory address is now 64-bits instead of 32-bits, and there is a tendency to use 64-bit integers rather than 32-bit integers. ARM instructions are 32-bits in both 32-bit and 64-bit mode, so programs tend to be about the same size, though 64-bit code can't use the 16-bit ARM Thumb instructions. The Raspberry Pi 3 is limited to 1Gig of memory, which can just barely run a 64-bit Linux and tends to run out of memory quickly as you run programs like web browsers. The Raspberry Pi 4 supports up to 4Gig of memory, and that is sufficient to run 64-bit Linux along with a respectable number of programs. Plus the Raspberry Pi 4 has faster access to the SDCard and has USB 3, so you can attach an even faster external drive; if you do get swapping, it isn't as painful.

In spite of these limitations, there are reasons to run 64-bit. The main one is better performance, especially if you actually need to work with 64-bit integers. Further, the 64-bit instruction set has been optimised to work better with the execution pipeline, so you don't get as many stalls when you perform jumps. For instance, in 32-bit ARM there is no function return instruction, so people use regular branches, pop the return address from the stack directly into the program counter, or use a number of other tricks. As a result, function returns cause the execution pipeline to be flushed. In 64-bit mode, the pipeline knows about the return instruction and knows where to get the next few instructions.
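
As a hedged sketch of what this means in practice (my own illustration, not from any particular program): in 32-bit ARM a function return is just another branch, while 64-bit code has a dedicated return instruction the pipeline can recognize.

@ 32-bit ARM: no dedicated return instruction, so code uses tricks like
BX  LR              @ branch to the address in the link register, or
POP {R4-R6, PC}     @ pop the saved return address straight into the PC

// 64-bit ARM (AArch64): a real return instruction
RET                 // branches to X30, the link register, and the pipeline knows it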

If 64-Bit Worked on the Pi 3, What’s the Problem?

If we had 64-bit working for the Pi 3, why doesn't it just work on the Pi 4? There are a few reasons. The first obstacle was that Raspberry changed the whole Pi boot process. The Raspberry Pi 3 booted using the GPU: at startup, the Pi 3's GPU runs a program that knows how to read the boot folder on an SDCard, loads it into memory and then starts the ARM CPU to run what was loaded. The Raspberry Pi 4 has a slightly larger EEPROM; this contains ARM code that executes on startup and then loads a further stage from the SDCard. The volunteers behind the other Linux distributions had to figure out this new process and adapt their code to fit into it. Sadly, the original EEPROM program didn't provide a good way to do this, so the Linux volunteers have been working with Raspberry to get the support they need into newer versions of the EEPROM software. The most recent version finally seems to be working reliably.

The Raspberry Pi 4 then has all new hardware, so new drivers are required for bluetooth, wifi and everything else. To keep the price down, Raspberry uses older standard components, so there are drivers already written for all these devices. It’s just a matter of including the correct drivers and providing default configurations that work and settings dialogs if anything might need user input. This is all being worked on in parallel, and the consensus is that we are already in a better place than we were for the Pi 3.

It’s All Open Source so Why not Copy from Raspbian?

The Raspbian kernel is open source so anyone can look at that source code, but the EEPROM firmware is not open source. This can be reverse engineered, but that takes time. The Raspberry Pi foundation has been quite helpful in supporting people, but that is no substitute for reading the source code. This again shows the importance of open source BIOS.

Development got off to a slow start because the Raspberry Pi Foundation didn't give anyone a heads up that this was coming. The developers of Ubuntu Mate had to order their Raspberry Pi 4s just like everyone else when the announcement happened, which meant no one really got started until well into July.

In spite of claiming up and down that they will never produce a 64-bit version of Raspbian, the Raspberry Pi Foundation has produced a test 64-bit Raspbian Linux kernel. This tests that the Raspberry Pi firmware will support 64-bits and that all the device drivers are available. I couldn't get this kernel to work, but it is proving very helpful for other developers. It also makes people excited that maybe Raspbian will go 64-bit sooner rather than later.

How Are We Doing?

The first distribution to get all this going was Gentoo Linux. They have a very smart developer, Sakaki, who provided the first image that actually worked. This then led to Arch and Manjaro Linux releases based on the Gentoo work. It was a good first step, though these distributions are more for the DIY crowd, given their preference for installing software from source code.

Next, James Chambers put together a guide and images that let you install 64-bit Ubuntu Server on the Pi 4. Ubuntu Server is character based, but installing a desktop is no problem. The main limitation of this release is that you need a wired Internet connection to start; you can't start with Wifi, as the Wifi software isn't installed with the base image. If you do have a wired connection, getting it installed and adding the desktop is quite straightforward and works well. Once you have the desktop installed, you can configure Wifi and ditch the ethernet cable.

The changes required for the Raspberry Pi 4 are being submitted to the standard Linux kernel for version 5.4. When this ships it will have available drivers for the Pi 4 hardware and official support for the Broadcom chips used in the Pi. Version 5.3 of the Linux kernel just shipped and added support for the NVidia Jetson Nano. Hopefully the wait for Linux 5.4 won’t be too long.

Summary

I’ve been running the 64-bit version of Ubuntu Linux Server, with the Xubuntu desktop for a few days now and it works really well on my Raspberry Pi 4 with 4Gig of RAM. Performance is great and everything is working. I’ve installed various software, including CubicSDR which works great. This is the first time I’ve been happy with Software Defined Radio running on a Pi.

I look forward to the official releases, and given the state of the current builds, think we are getting quite close.

Written by smist08

September 20, 2019 at 6:38 pm

Risc-V Assembly Language Hello World

leave a comment »

Introduction

Last time, we started talking about the Risc-V CPU. We looked at some background, and now we are going to start looking at its Assembly Language. We'll write a program to print “Hello World!” to the terminal window, cross-compile it with GCC and run it in a Risc-V emulator. This program lets us start discussing some features of the core Risc-V instruction set. Risc-V supports 32-bit, 64-bit and 128-bit implementations; here we'll run in 64-bit mode.

We’ll start with the program, then discuss various aspects of the Assembly instructions it uses and finally discuss how to build and run the program.

Hello World

First let’s present the program and then we’ll discuss it. This program works by making Linux system calls and like all Linux programs starts execution at the globally exported _start label. The program uses the Assembly directives specified in the GCC documentation.

#
# Risc-V Assembler program to print "Hello World!"
# to stdout.
#
# a0-a2 - parameters to linux function services
# a7 - linux function number
#

.global _start      # Provide program starting address to linker

# Setup the parameters to print hello world
# and then call Linux to do it.

_start: addi  a0, x0, 1      # 1 = StdOut
        la    a1, helloworld # load address of helloworld
        addi  a2, x0, 13     # length of our string
        addi  a7, x0, 64     # linux write system call
        ecall                # Call linux to output the string

# Setup the parameters to exit the program
# and then call Linux to do it.

        addi    a0, x0, 0   # Use 0 return code
        addi    a7, x0, 93  # Service command code 93 terminates
        ecall               # Call linux to terminate the program

.data
helloworld:      .ascii "Hello World!\n"

The ‘#’ character is the comment character and anything after it on a line is a comment.

Registers

The Risc-V processor has 32 registers labeled x0 to x31 and a program counter (PC). x0 is a zero register, and x1-x31 can be used by programs as they wish. If you look at our listing for Hello World, you will notice that we are using registers a0, a1, a2 and a7. What are these? Since the Risc-V architecture provides no standards for register usage, and typical Assembly language programming requires a stack pointer, subroutine return register and some sort of function calling convention, these are defined in an Application Binary Interface (ABI). This is a software standard that the operating system defines so that programs and libraries can work together properly. Here GCC knows about the Risc-V Linux ABI where register usage is defined as:

 

Register   ABI name   Use by convention                     Preserved?
x0         zero       hardwired to 0, ignores writes        n/a
x1         ra         return address for jumps              no
x2         sp         stack pointer                         yes
x3         gp         global pointer                        n/a
x4         tp         thread pointer                        n/a
x5         t0         temporary register 0                  no
x6         t1         temporary register 1                  no
x7         t2         temporary register 2                  no
x8         s0 or fp   saved register 0 or frame pointer     yes
x9         s1         saved register 1                      yes
x10        a0         return value or function argument 0   no
x11        a1         return value or function argument 1   no
x12        a2         function argument 2                   no
x13        a3         function argument 3                   no
x14        a4         function argument 4                   no
x15        a5         function argument 5                   no
x16        a6         function argument 6                   no
x17        a7         function argument 7                   no
x18        s2         saved register 2                      yes
x19        s3         saved register 3                      yes
x20        s4         saved register 4                      yes
x21        s5         saved register 5                      yes
x22        s6         saved register 6                      yes
x23        s7         saved register 7                      yes
x24        s8         saved register 8                      yes
x25        s9         saved register 9                      yes
x26        s10        saved register 10                     yes
x27        s11        saved register 11                     yes
x28        t3         temporary register 3                  no
x29        t4         temporary register 4                  no
x30        t5         temporary register 5                  no
x31        t6         temporary register 6                  no
pc         (none)     program counter                       n/a

 

This table was taken from here. Registers a0 to a7 are used to pass function parameters (arguments), and a7 holds the function number for Linux system calls, taken from unistd.h.

Instructions

We only use three Assembly instructions in this program: LA, ADDI and ECALL. Risc-V works hard to define as few instructions as possible, so some instructions have multiple uses. For instance, ADDI adds an immediate value to a register, and has the form:

      ADDI RD, RS, imm

Where RD is the destination register, RS is the source register and imm is a 12-bit immediate value. Instructions are 32-bits in length, so the size of the immediate value tends to be whatever is left over after encoding the opcode and any required registers.

You can define a NOP instruction with:

      ADDI x0, x0, 0

Or load immediate with:

      ADDI RD, X0, imm

The Assembler will take mnemonics like NOP or LI (Load Immediate) and translate them into the correct underlying instruction. Here we used ADDI, but when we disassemble the compiled program we'll see the disassembler uses these aliases, which do make the program more readable. All our ADDI instructions use the LI pattern.
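
As a worked example (my own breakdown, using an encoding we'll see in the disassembly below), addi a2, x0, 13 assembles to the 32-bit word 0x00d00613, which splits into fields like this:

# bits 31-20  imm[11:0] = 000000001101 = 13, the immediate value
# bits 19-15  rs1       = 00000        = x0, the source register
# bits 14-12  funct3    = 000          = ADDI within this opcode group
# bits 11-7   rd        = 01100        = x12, which the ABI names a2
# bits 6-0    opcode    = 0010011      = OP-IMM, immediate arithmetic
addi a2, x0, 13    # all five fields packed into one 32-bit instruction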

Risc-V provides a separate opcode to call the operating system: the ECALL instruction. When calling Linux, a7 holds the Linux service number and a0 to a6 contain any parameters. When calling write, we need the file descriptor (1 for stdout), the address of the string to write and the length in bytes to write, which we put in registers a0, a1 and a2. The return code, which we don't check, will be in a0. This differs from most other architectures, which use the interrupt mechanism for this purpose; the Risc-V designers feel it is cleaner to separate operating system calls from interrupts, even though both cause privileged kernel code to execute.

The remaining instruction is LA, which isn't a real Risc-V instruction; rather, it tells the Assembler that we want to load an address into a register and leaves it up to the Assembler to figure out how. If we are running with 64-bit addressing then this address is 64-bits. We can't load it with a single load immediate instruction, since the biggest immediate value is 20-bits, and most are smaller. Loading the address directly would therefore take several instructions, building it piece by piece using load immediates, shifts, and logical and/or arithmetic operations. The Assembler has inside knowledge of the value of this address, so it can, say, use PC-relative addressing to load it. There are a lot of tricks for dealing with 64-bit values using 32-bit instructions that we don't have room to go into now, but perhaps in a future blog article.

Building

I don’t have a Risc-V processor, so I built the program using cross-compilation. The instructions on installing the GCC tools for this on a Debian based Linux are here. Then to build you run:

riscv64-linux-gnu-as -march=rv64imac -o HelloWorld.o HelloWorld.s
riscv64-linux-gnu-ld -o HelloWorld HelloWorld.o

We can run a Risc-V objdump to see what was produced with:

riscv64-linux-gnu-objdump -d HelloWorld

And get:

HelloWorld:     file format elf64-littleriscv

Disassembly of section .text:

00000000000100b0 <_start>:
   100b0: 00100513           li a0,1
   100b4: 00001597           auipc a1,0x1
   100b8: 02058593           addi a1,a1,32 # 110d4 <__DATA_BEGIN__>
   100bc: 00d00613           li a2,13
   100c0: 04000893           li a7,64
   100c4: 00000073           ecall
   100c8: 00000513           li a0,0
   100cc: 05d00893           li a7,93
   100d0: 00000073           ecall

We see it has interpreted the ADDI instructions that are just loading an immediate as LI. 

The “LA a1, helloworld” pseudo-instruction has been compiled to:

   100b4: 00001597           auipc a1,0x1
   100b8: 02058593           addi a1,a1,32 # 110d4 <__DATA_BEGIN__>

AUIPC is add upper immediate to PC: it shifts the immediate left by 12 bits and adds it to the PC, so here it puts 0x100b4 + 0x1000 = 0x110b4 into a1, and the following ADDI adds 32 (0x20) to give 0x110d4, the beginning of the data section. Actually, the Assembler marked these values as needing relocation and the constants were filled in by the linker during the LD command. The good thing is that the Assembler and linker took care of these details so we didn't need to. Loading addresses and large integers is always a challenge in RISC processors.

Running

Now we have our HelloWorld executable on my Intel i3 laptop running Ubuntu Linux. To run it, I use the TinyEMU Risc-V emulator. There are instructions on running a mini version of Linux under the emulator; you can then mount your /tmp folder, copy the executable over and run it.

The whole process is:

stephen@stephenubuntu:~/riscV/HelloWorld$ bash -x ./build
+ riscv64-linux-gnu-as -march=rv64imac -o HelloWorld.o HelloWorld.s
+ riscv64-linux-gnu-ld -o HelloWorld HelloWorld.o
stephen@stephenubuntu:~/riscV/HelloWorld$ cp HelloWorld /tmp
stephen@stephenubuntu:~/riscV/HelloWorld$ cd ../../Downloads/diskimage-linux-riscv-2018-09-23/
stephen@stephenubuntu:~/Downloads/diskimage-linux-riscv-2018-09-23$ temu root_9p-riscv64.cfg 
[    0.307640] NET: Registered protocol family 17
[    0.308079] 9pnet: Installing 9P2000 support
[    0.311914] EXT4-fs (vda): couldn't mount as ext3 due to feature incompatibilities
[    0.312757] EXT4-fs (vda): mounting ext2 file system using the ext4 subsystem
[    0.325269] EXT4-fs (vda): mounted filesystem without journal. Opts: (null)
[    0.325552] VFS: Mounted root (ext2 filesystem) on device 254:0.
[    0.326420] devtmpfs: mounted
[    0.326785] Freeing unused kernel memory: 80K
[    0.326949] This architecture does not have kernel memory protection.
~ # mount -t 9p /dev/root /mnt
~ # cp /mnt/HelloWorld .
~ # ./HelloWorld 
Hello World!
~ # 

Note: I had to add:

      kernel: "kernel-riscv64.bin",

To root_9p-riscv64.cfg in order for it to start properly.

Summary

This simple Hello World program showed us a basic Risc-V Assembly Language program that loads some registers and calls Linux to print a string and then exit. This was still a long blog posting since we needed to explain all the Assembly elements and then how to build and run the program without requiring any Risc-V hardware.

Written by smist08

September 7, 2019 at 10:38 pm

Posted in RiscV

Tagged with , , , ,

Introducing Risc-V

with 3 comments

Introduction

Risc-V (pronounced Risc Five) is an open source hardware Instruction Set Architecture (ISA) for Reduced Instruction Set Computers (RISC) developed at UC Berkeley. The Five is because this is Berkeley's fifth RISC ISA design. It is a fully open standard, meaning that any chip manufacturer can create CPUs that use this instruction set without having to pay royalties. Currently the lion's share of the CPU market is dominated by two camps: one is the CISC-based x86 architecture from Intel, with AMD as an alternate source; the other is the ARM camp, where the designs come from ARM Holdings and chip manufacturers license them under royalty agreements.

The x86 architecture dominates server, workstation and laptop computers. These are quite powerful CPUs, but at the expense of using more power. The ARM architecture dominates cell phones, tablets and Single Board Computers (SBCs) like the Raspberry Pi; these are usually a bit less powerful, but use far less power and are typically much cheaper.

Why do we need a third camp? What are the advantages and what are some of the features of Risc-V? This blog article will start to explore the Risc-V architecture and why people are excited about it.

Economies of Scale

The computer hardware business is competitive. For instance, Western Digital hard drives each contain an ARM CPU to manage the controller functions and handle caching. Saving a few dollars on each drive by avoiding the ARM royalty is a big deal. With Risc-V, Western Digital can make or buy a specialized Risc-V processor and save the ARM royalty, either improving their profits or making their drives more price competitive.

The difficulty with introducing a new CPU architecture is that to be price competitive you have to manufacture in huge quantities, or your product will be very expensive. This means that for there to be inexpensive Risc-V processors on the market, there have to be some large orders, and that's why adoption by large companies like Western Digital is so important.

Another giant boost to the Risc-V world is a direct result of Trump's trade war with China. With the US restricting trade in ARM and x86 technology to China, Chinese computer manufacturers are madly investing in Risc-V, since it is open source and trade restrictions can't be applied. If a major Chinese cell phone manufacturer can no longer get access to the latest ARM chips, then switching to Risc-V becomes attractive. This is a big risk Trump is taking, because if the rest of the world invests in Risc-V, it might greatly reduce Intel, AMD and ARM's influence and leadership, having the opposite effect to what Trump wants.

The Software Chicken & Egg Problem

If you create a wonderful new CPU, no matter how good it is, you still need software. For a start you need operating systems, compilers and debuggers. Developing these can be as expensive as developing the CPU chip itself. This is where open source comes to the rescue: UC Berkeley, along with many other contributors, added Risc-V support to the GNU Compiler Collection (GCC) and worked with Debian Linux to produce a Risc-V version of Linux.

Another big help is the availability of open source emulator technology. You are very limited in your choices of actual Risc-V hardware right now, but you can easily set up an emulator to play with. If you’ve ever played with RetroPie, you know the open source world can emulate pretty much any computer ever made. There are several emulator environments available for Risc-V so you can get going on learning the architecture and writing software as the hardware slowly starts to emerge.

Risc-V Basics

The Risc-V architecture is modular. You start with a core integer unit that can load and store registers, add, subtract, perform logical operations, compare and branch. There are 32 registers labeled x0 to x31, though x0 is a dedicated zero register, plus a program counter (PC). The hardware doesn't assign any other roles to the registers; the rest is software convention, such as which register is the stack pointer, which registers are used for passing function parameters, and so on. Base instructions are 32-bits, but an extension module allows 16-bit compressed instructions and extension modules can define longer instructions. The specification supports three different address sizes: 32-bit, 64-bit and 128-bit. This is quite forward thinking, as we don't expect the largest, most powerful computers in the world to exceed 64-bits until 2030 or so.
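
As a rough sketch (my own illustration, not lifted from the specification), the kind of code the base integer module supports looks like this:

add  x5, x6, x7       # x5 = x6 + x7
sub  x5, x5, x8       # subtract
and  x9, x5, x10      # bitwise logical operation
lw   x11, 0(x6)       # load a 32-bit word from the address in x6
sw   x11, 8(x6)       # store it back 8 bytes further along
beq  x11, x0, done    # compare against the zero register and branch
done: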

Then you start adding modules like the multiply/divide module, atomic instruction module, various floating point modules, the compressed instruction module, and quite a few others. Some of these have their specifications frozen, others are still being worked on. The goal is to allow chip manufacturers to produce silicon that exactly meets their needs and keeps power utilization to a minimum.

Getting Started

Most of the current Risc-V hardware available to DIYers consists of small, low power, low memory microcontrollers similar to Arduinos. I'm more interested in getting a Risc-V SBC similar to a Raspberry Pi or NVidia Jetson. As a result, I don't have a physical Risc-V computer to play with, but I can still learn about Risc-V and play with Risc-V Assembly Language programming in an emulator environment.

I’ll list the resources I found useful and the environment I’m using. Then in future blog articles, I’ll go into more detail.

  • The Risc-V Specifications. These are the documents on the ISA. I found them readable, and they give the rationale for the decisions taken along with the reasons for a number of roads not taken. The only thing missing is practical examples.
  • The Debian Risc-V Wiki Page. There is a lot of useful information here. A very big help was the explanation of how to install the Risc-V cross-compilation tools on any Debian release; I used these instructions to install the Risc-V GCC tools on my Ubuntu laptop.
  • TinyEMU, a Risc-V Emulator. There are several Risc-V emulators; this is the first one I tried and it's worked fine for me so far.
  • RV8, a Risc-V Emulator. This emulator looks good, but I haven't had time to try it out yet. They have a good Risc-V instruction set summary.
  • SiFive Hardware. SiFive has produced a number of limited-run Risc-V microcontrollers. Their website has lots of useful information and their employees are major contributors to various Risc-V open source projects. They have started a Risc-V Assembly Programmer's Guide.

Summary

The Risc-V architecture is very interesting. It is always nice to start with a clean slate and learn from all that has gone before. If this ISA gains enough steam to achieve volumes where it can compete with ARM, it is going to allow very powerful low cost computers. I'm very hopeful that perhaps next year we'll see a $25 Risc-V based Raspberry Pi 4B competitor with 4Gig RAM and an M.2 SSD slot.

Written by smist08

September 6, 2019 at 6:07 pm

Posted in Business

Tagged with , , , ,

Raspberry Pi 4 as a Desktop Computer

leave a comment »

Introduction

The Raspberry Pi Foundation is promoting the Raspberry Pi 4 as a full desktop computer for only $35. I've had my Raspberry Pi 4 for about a month now, and in this article we'll discuss whether it really is a full desktop computer replacement. This partly depends on what you use your desktop computer for. My answer is that the $35 price is misleading: you need to add quite a few other things to make it work well.

Making the Raspberry Pi 4 into a Decent Desktop

The Raspberry Pi has always been a barebones computer. You’ve always needed to add a case, a keyboard, a mouse, a monitor, a power supply, a video cable and a microSD card. Many people already have these kicking around, so they don’t need to buy them when they get their Pi. For instance, I already had a keyboard and monitor. The Raspberry Pi 4 even supports two monitors.

Beyond the bare bones, you need two more things for a decent desktop, namely:

  1. The 4GB version of the Raspberry Pi 4
  2. A good USB SSD drive

With these, it starts to feel like you are using a regular desktop computer. You now have enough RAM to run multiple programs, and any good SSD will greatly enhance the performance of the system, leaving the microSD card to do little more than boot the Pi.

The Raspberry Pi 3 is a great little computer. Its main limitation is that if you run too many programs or open too many browser tabs, it bogs down and you have a painful process of closing windows (that aren’t responding well), until things pick up again. Now the Raspberry Pi 4 with 4GB of RAM really opens up the number of things you can do at once. Running multiple browser tabs, LibreOffice and a programming IDE are no problem.

The next thing you run into with the Raspberry Pi 4 is the performance of the SD card. Since I needed a video cable and a new case, I ordered a package deal that also included a microSD card containing Raspbian. Sadly, these bundled microSD cards are the cheapest, and hence slowest, available; having Raspbian bundled on a slow card is just a waste. Switching to a SanDisk Extreme 64GB made a huge difference and the speed was much better. When buying a microSD card, watch the speed ratings; often the bigger cards (64GB or more) are twice as fast as the smaller cards (32GB or less). With a good microSD card, the Raspberry Pi 4 can read and write the microSD twice as fast as a Raspberry Pi 3.

I've never felt I could truly trust running off a microSD card. I've never had one fail, but people report problems all the time. Further, the performance of microSD cards is only a fraction of what you can get from a good SSD. The Raspberry Pi 4 comes with two USB 3 ports, which have a theoretical throughput ten times that of the microSD port. If you shop around, you will find M.2 and SATA SSDs for prices lower than those of microSD cards. I purchased a Kingston A1000 M.2 drive, which was on sale cheap because the A2000 drives had just started shipping. I had to get an M.2 USB caddy to hold it, but combined this was less than $100, and USB caddies are always useful.

Unfortunately, you can't boot the Raspberry Pi 4 directly off a USB port yet. The Raspberry Pi Foundation says this is coming, but it isn't here quite yet. What you can do is put the entire root file system on the USB drive, while the boot partition stays on a microSD card. Setting up the SSD was easier than I thought it would be: I had to partition it, format it, copy everything over to the SSD and then edit /boot/cmdline.txt to tell the kernel where the root of the main file system now lives.

With this done, I feel like I’m using a real desktop computer. I’m confident my data is being stored reliably, the performance is great.

Overheating

The Raspberry Pi 4 uses more power than previous Pis, which means there is more heat to dissipate. The case I received with my Pi 4 didn't have any ventilation holes and would get quite hot. I solved the problem by removing the top of the case; this lets enough heat out that it runs fine for most things. People report that when using a USB SSD, the USB controller chip can overheat and data throughput will be throttled. I haven't run into this, but it is something to be aware of.

I installed Tensorflow, Google's open source AI toolkit. Training a model with Tensorflow does make my Pi 4 overheat; I suspect Tensorflow keeps all four CPU cores busy and produces the maximum amount of heat. This might drive me to add a cooling fan. I like the way the Pi runs so quietly with no fan, making no noise, but I might try a small fan blowing down on the Pi to see if that helps.

Summary

Is the Raspberry Pi 4 a complete desktop computer for $35? No. But if you get the 4GB model for $55 and then add a USB 3 SSD, you do have a good, workable desktop computer. The CPU power of the Pi has been compared to that of a typical 2012 desktop computer, but for the cost that is pretty good, and I suspect the Wifi/LAN and SSD are quite a bit better than that 2012 computer's.

Keep in mind the Raspberry Pi runs Linux, which isn’t for everyone. A typical low cost Windows desktop goes for around $500 these days. You can get a refurbished one for $200-$300. A refurbished desktop can be a good inexpensive option.

I like the Raspberry Pi, partly because you are cleanly out of the WinTel world. No Windows, no Intel. The processor is ARM and the operating system is Raspbian based on Debian Linux. A lot of things you do are DIY, but I enjoy that. With over 25 million Raspberry Pis sold worldwide, there is a lot of community support and you join quite an enthusiastic thriving group.

Written by smist08

August 26, 2019 at 8:17 pm