Stephen Smith's Blog

Musings on Machine Learning…

5 Best Word Processors for Writers


The Write Cup

By Jeff Hortobagyi, Cathalynn Cindy Labonte-Smith, Elizabeth Rains & Stephen Smith


The word-processor market has remained fairly static since MS Word was released as Multi-Tool Word Version 1.0 in 1983, and Word has dominated it ever since. Its main competitor became WordPerfect, but that program soon fizzled out and became a minor player. There are still loyal WordPerfect users out there, and a WordPerfect Office Professional 2020 suite is available, but at over $500 it has priced itself out of the market. MS Word remains the heavy hitter in the word-processing world and it’s affordable at $6.40/month for the entire MS Office package, but an increasing number of free apps keep driving down its price.

In 2006, Google Docs came along and changed the way people worked. No longer did authors need to print out and make paper copies to make redlines. No longer did they need to attach large…


Written by smist08

July 10, 2020 at 9:10 pm

Posted in Uncategorized

Fallout From ARM’s Success



Last time, we talked about a number of ARM’s recent successes. This time we’ll discuss a few of the consequences for the rest of the industry. Many people are discussing the effect on Intel and AMD, but probably a bigger victim of the ARM steamroller is RISC-V, the open source processor.

Trouble for Intel’s Profits

This past year wasn’t a good one for Intel. They’ve been having trouble keeping up with chip manufacturing technology. Most other vendors outsource their chip manufacturing to TSMC, Samsung and a couple of others. What has happened is that TSMC is so large that it is out-spending Intel on R&D by orders of magnitude and as a result is years ahead of Intel in chip technology. The big winners in this are AMD and ARM, whose chips are now denser, faster and more power efficient than Intel’s. AMD gave up manufacturing their chips themselves some years ago and ARM never manufactured chips itself.

Better chip manufacturing technology allows AMD and ARM to fit more processing cores on each chip or produce products in smaller form factors.

Intel’s main problem this past year has been AMD, which has been chipping away at their market share. Now with Apple switching to ARM processors, this could be the start of a migration away from Intel. Microsoft already has an ARM version of their Surface notebook running a limited version of Windows, but they could easily produce something more powerful running a full version of Windows. Similarly, other manufacturers, such as Dell or HP, could start producing ARM based laptops and workstations running Linux.

Although AMD doesn’t have Intel’s manufacturing problems, it does have a problem with requiring all its chips to support all the instructions introduced into the x86/x64 architecture over the many years of its existence. Modern x86 chips run RISC cores internally, but have to translate the old CISC instructions into RISC instructions as they run. This extra layer is required to keep all those old DOS and Windows programs running, many of which are no longer supported but are still used by many users. Both Intel and AMD are at a competitive disadvantage to ARM and RISC-V, who don’t need to waste circuitry doing this, and extra circuitry means higher power consumption and heat production.

Today Intel’s most profitable chips are its data center focused Xeon processors. These are powerful multi-core chips, but with more and more cores being added to ARM processors, even here ARM is starting to chip away at Intel.

RISC-V is having Trouble Launching

I’ve blogged on RISC-V processors a couple of times. RISC-V is an open source hardware specification, so you can develop a processor without paying royalties or fees to any other companies. Anyone can manufacture an ARM processor, but if they use the ARM instruction set, they need to pay royalties to ARM Holdings. The hope of the RISC-V folks was to stimulate competitive innovation and produce lower cost, more powerful processors.

The reality has been that companies designing RISC-V chips can’t get orders to manufacture in the volume they need to be price competitive.

RISC-V is still ticking along, but it is limited to the following applications:

  • Providing low cost processors for the Arduino market, usually 32-bit processors with a few meg of memory.
  • Producing specialty chips for things like AI processors. Again this is having trouble getting going due to low volumes.
  • Manufacturers like Western Digital using them as embedded processors in their products, like WD’s disk controllers.

What RISC-V really needs is a Single Board Computer (SBC) like the Raspberry Pi. That means comparable performance and price, plus the ability to run Linux in a stable, supported way. Without this there won’t be any software development and RISC-V won’t be able to gain any sort of foothold. Doing this will be extremely difficult given how powerful and cheap the current crop of ARM based SBCs are. The level of software support for ARM in the Linux world is phenomenal.


ARM certainly isn’t going to eradicate Intel and AMD anytime soon. But even a small dent in their sales can send their stock price into a tailspin. Investors are going to have to watch the trends very closely, in case they need to bail. RISC-V will continue to have difficulty gaining acceptance, and manufacturing a competitive chip. More companies will adopt ARM and this will increase its competitive advantage. Here ARM’s strategy of licensing designs rather than chips is really paying off in fielding more and more competition for its rivals. Next year will be a very good one for ARM and likely an even tougher year for Intel.

The main conclusion here is that if you are a programmer, you should have a look at ARM and a good way to learn about it is to study its Assembly Language, perhaps by reading my book: “Programming with 64-Bit ARM Assembly Language”.

Written by smist08

July 3, 2020 at 11:23 am

Posted in Business


Exciting Days for ARM Processors



ARM CPUs have long dominated the mobile world: nearly all Apple and Android phones and tablets utilize some model of ARM processor. However, Intel and AMD still dominate the laptop, desktop, server and supercomputer markets. This week we saw a number of announcements that suggest this is likely to change:

  1. Apple announced they are going to transition all Mac computers to the ARM processor over two years.
  2. Ampere announced a 128-core server ARM processor.
  3. Japan now has the world’s most powerful supercomputer and it is based on 158,976 ARM Processors.

In this blog post, we’ll look at some of the consequences of these moves.

Apple Macs Move to ARM

The big announcement at this year’s Apple WorldWide Developers Conference is that Apple will be phasing out Intel processors in their Mac desktop and laptop computers. You wouldn’t know they are switching to ARM processors from all their marketing speak, which exclusively talks about the switch from Intel to Apple Silicon. But at the heart of Apple Silicon are ARM CPU cores. The name Apple Silicon refers to the System on a Chip (SoC) that they are building around the ARM processors. These SoCs will include a number of ARM cores, a GPU, an AI processor, memory manager and other support functions.

Developers can pay $500 to get a Mac mini running the same ARM CPU as the latest iPad Pro. The downside is that you need to give this hardware back when the real systems ship at the end of this year. It is impressive that you can already get a working ARM Mac running macOS along with a lot of software, including the Xcode development system. One cool feature is that you can run any iPad or iPhone app on your Mac, now that all Apple devices share the same CPU.

The new version of MacOS for ARM (or Apple Silicon) will run Intel compiled programs in an emulator, but the hope from Apple is that developers will recompile their programs for ARM fairly quickly, so this won’t be needed much. The emulation has some limitations, in that it doesn’t support Intel AVX SIMD instructions or instructions related to virtualization.

For developers converting their applications, if they have Assembly Language code, this will have to be converted from Intel Assembly to ARM Assembly and of course a great resource to do this is my book, “Programming with 64-Bit ARM Assembly Language”.

I’m excited to see what these new models of ARM based Apple computers look like. We should see them announced as we approach the Christmas shopping season. Incorporating all the circuitry onto a single chip will make these new computers even slimmer, lighter and more compact. Battery life should be far longer but still with great performance.

I think Apple should be thanking the Raspberry Pi world for showing what you can do with SoCs, and for driving so much software to already be ported to the ARM processor.

One possible downside of the new Macs is that Apple keeps talking about the new secure boot feature only allowing Apple signed operating systems to boot as a security feature. Does this mean we won’t be able to run Linux on these new Macs, except using virtualization? This will be a big downside, especially down the road when Apple drops support for them. Apple makes great hardware that keeps on working long after Apple no longer supports it. You can get a lot of extra life out of your Apple hardware by installing Linux and keeping on trucking with new updates.

New Ampere Server ARM Chips

Intel and AMD have long dominated the server and data center markets, but that is beginning to change. Amazon has been designing their own ARM chips for AWS and Ampere has been providing extremely powerful ARM based server chips for everyone else. Last year they announced an 80-core ARM based server chip which is now in production. Just this week they announced the next generation which is a 128-core ARM server chip.

If you aren’t interested in a server, but would like a workstation containing one of these chips then you could consider a computer from Avantek such as this one.

This is just one of several powerful ARM based server chips coming to market. It will be interesting to see if there is a lot of uptake of ARM in this space.

Japan’s Fugaku ARM Based Supercomputer is Number One

Japan just took the number one spot in the list of the world’s most powerful supercomputers. The Fugaku supercomputer is located in Kobe and uses 158,976 Fujitsu 48-core ARM SoCs. Of course this computer runs Linux and currently is being used to solve protein folding problems around developing a cure for COVID-19, similar to folding@home. This is a truly impressive warehouse of technology and shows where you can go with the ARM CPU and the open source Linux operating system.


ARM conquered the mobile world some years ago, and now it looks like ARM is ready to take on the rest of the computer industry. Expect to see more ARM based desktop and laptop computers than just Macs. Only time will tell whether this is a true threat to Intel and AMD, but the advantage ARM has over previous attempts to unseat Intel as king is that they already have more volume production than Intel and AMD combined. The Intel world has stagnated in recent years, and I look forward to seeing the CPU market jump ahead again.

Written by smist08

June 24, 2020 at 11:28 am

Playing with CUDA on my Gaming Laptop




Last year, I blogged on playing with CUDA on my nVidia Jetson Nano. I recently bought a new laptop which contains an nVidia GTX1650 graphics card with 4Gig of RAM. This is more powerful than the coprocessor built into the Jetson Nano.  I took advantage of the release of newer Intel 10th generation processors along with the wider availability of newer nVidia RTX graphics cards to get a good deal on a gaming laptop with an Intel 9th generation processor and nVidia GTX graphics. This is still a very fast laptop with 16Gig of RAM and runs the couple of video games I’ve tried fine. It also compiles and handles my normal projects easily. In this blog post, I’ll repeat a lot of my previous article on the nVidia Jetson, but in the context of running on Windows 10 with an Intel CPU.

I wanted an nVidia graphics card because these have the best software support for graphics, gaming, AI, machine learning and parallel programming. If you use Tensorflow for AI, then it uses the nVidia graphics card automatically. All the versions of DirectX support nVidia and if you are doing general parallel programming then you can use a system like OpenCL. I find nVidia leads AMD in software support and Intel is going to have a lot of trouble with their new Xe graphics cards reaching this same level of software support.


On Windows, most developers use Visual Studio. I could do this all with GCC, but this is more difficult, since when you install the SDK for CUDA, you get all the samples and documentation for Visual Studio. The good news is that you can use Visual Studio Community Edition which is free and actually quite good. Installing Visual Studio is straightforward, just time consuming since it is large.

Next up, you need to install nVidia’s CUDA toolkit. Again, this is straightforward, just large. Although the install is large, you likely have all the drivers already installed, so you are mostly getting the developer tools and samples out of this.

Performing these installs and then dealing with the program upgrades really makes me miss Linux’s package managers. On Linux, you can upgrade all the software on your computer with one command on a regular basis. On Windows, each program checks for upgrades when it starts and usually wants to upgrade itself before you do any work. I find that this is a real productivity killer on Windows. Microsoft is starting work on a package manager for Windows, but at this point it does little.

Compiling and running the deviceQuery sample produced the following output on my gaming laptop:

CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "GeForce GTX 1650 with Max-Q Design"
  CUDA Driver Version / Runtime Version          11.0 / 11.0
  CUDA Capability Major/Minor version number:    7.5
  Total amount of global memory:                 4096 MBytes (4294967296 bytes)
  (16) Multiprocessors, ( 64) CUDA Cores/MP:     1024 CUDA Cores
  GPU Max Clock rate:                            1245 MHz (1.25 GHz)
  Memory Clock rate:                             3501 Mhz
  Memory Bus Width:                              128-bit
  L2 Cache Size:                                 1048576 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1024
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 6 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  CUDA Device Driver Mode (TCC or WDDM):         WDDM (Windows Display Driver Model)
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      No
  Device PCI Domain ID / Bus ID / location ID:   0 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 11.0, CUDA Runtime Version = 11.0, NumDevs = 1
Result = PASS

If we compare this to the nVidia Jetson Nano, we see everything is better. The GTX 1650 is based on the newer Turing architecture and the memory is local to the graphics card and not shared with the CPU. The big difference is that we have 1024 CUDA cores, rather than the Jetson’s 128. This means we can perform 1024 operations in parallel for SIMD operations.
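The core count reported by deviceQuery follows directly from two of its fields. As a quick sanity check, here is a minimal Python sketch; the variable names are my own, and the Jetson Nano figure is simply the 128 cores quoted above:

```python
# Figures copied from the deviceQuery listing above.
multiprocessors = 16      # "(16) Multiprocessors"
cores_per_mp = 64         # "( 64) CUDA Cores/MP"

# Total CUDA cores is just the product of the two fields.
gtx1650_cores = multiprocessors * cores_per_mp

# Figure quoted in the text for the Jetson Nano.
jetson_nano_cores = 128
ratio = gtx1650_cores // jetson_nano_cores

print(f"GTX 1650 CUDA cores: {gtx1650_cores}")
print(f"Times more cores than the Jetson Nano: {ratio}")
```

This only compares core counts; it says nothing about clock speed or memory bandwidth, which also differ between the two devices.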

CUDA Samples

The CUDA toolkit includes a large selection of sample programs. In the Jetson Nano article we listed the vector addition sample. Compiling and running this on Windows is easy in Visual Studio. These samples are a great source of starting points for your own projects.

Programming for Portability

If you are writing a specialized program and want the maximum performance on specialized hardware, it makes sense to write directly to nVidia’s CUDA API. However, most software developers want their programs to run on as many computers out in the world as possible. The solution is to write to a higher level API that then has drivers for different popular hardware.

For instance, if you are creating a video game, you could write to the DirectX interface and then your program can run on newer versions of Windows on a wide variety of GPUs from different vendors. If you don’t want to be limited to Windows, you could use a portable graphics API like OpenGL. You can also go higher level and create your game in a system like Unreal Engine or Unity. These then have different drivers to run on DirectX, macOS, Linux, mobile devices or even in web browsers.

If you are creating an AI or Machine Learning application, you can use a library like Tensorflow or PyTorch which have drivers for all sorts of different hardware. You just need to ensure their support is as broad as the market you are trying to reach.

If you are doing something more general or completely new, you can consider a general parallel processing library like OpenCL which has support for all sorts of devices, including the limited SIMD coprocessors included with most modern CPUs. A good example of a program that uses OpenCL is Folding@Home which I blogged on here.


Modern GPUs are powerful and flexible computing devices. They have high speed memory and often thousands of processing cores to work on your task. Libraries to make use of this computing power are getting better and better allowing you to leverage this horsepower in your applications, whether they are graphics related or not. Today’s programmers need to have the tools to harness these powerful devices, so the applications they are working on can reach their true potential.

Written by smist08

June 20, 2020 at 1:43 pm

Blocking Speculative Execution Vulnerabilities on ARM Processors



We blogged previously about the Spectre and Meltdown exploits that use the side effects of a CPU’s speculative execution to learn secrets that should be protected. The other day, ARM put out a warning and white paper for another similar exploit called “Straight-line Speculation”. Spectre and Meltdown took advantage of branch prediction. This is the opposite case, where the processor’s speculation mechanism continues on past an unconditional branch instruction.

These side-channel attacks are hard to set up and exploit, but the place where they are most dangerous is in virtual environments, especially in the cloud. There, data can leak from one virtual machine to another, allowing an attacker to steal data from another cloud user. Running on a phone or PC isn’t so dangerous, since if you have the ability to run a program, you have way more power to do much more dangerous things, so why bother with this? This means that as ARM is used more often in cloud data centers, they are going to have to be very sensitive to these issues.

Why Does a Processor Do This?

Modern CPUs process instructions ahead of the current executing instruction for performance reasons. There is capacity for a CPU core to process several instructions simultaneously and if one is delayed, say due to waiting for main memory, other instructions that don’t have a dependency on this can be executed. But why would the processor bother to process instructions after an unconditional branch? After all, ARM added the return (RET) instruction to their 64-bit instruction set so that the CPU would know where execution continued next and the pipeline wouldn’t need to be flushed.

First, ARM provides reference designs, and many manufacturers, like Apple, take these and improve on them. This means that not all ARM processors work the same in this regard. Some might use a simpler brute-force approach that doesn’t interpret the instructions at all, while others may add intelligence so that their speculative pipeline interprets instructions and knows how to follow unconditional branches. The smarter the approach, the more silicon is required to implement it and possibly the more heat is generated as it does its job.

The bottom line is that not all ARM processors will have this problem. Some will have limited pipelines and are safe, others interpret instructions and are safe. Then a few will implement the brute-force approach and have the vulnerability. The question then: what needs to be done about this?

ARM Instructions to Mitigate the Problem

Some ARM processors let you turn off speculative execution entirely. This is a very heavy handed approach and terrible for performance. This may make sense in some security sensitive situations, but usually is too heavy a hit on performance. The other approach is to use a number of ARM Assembly Language instructions that will halt speculative execution when encountered. Let’s look at them:

  • Data Synchronization Barrier (DSB): completes when all instructions before this instruction complete.
  • Data Memory Barrier (DMB): ensures that all explicit memory accesses before the DMB instruction complete before any explicit memory accesses after the DMB instruction start.
  • Instruction Synchronization Barrier (ISB): flushes the pipeline in the processor, so that all instructions following the ISB are fetched from cache or memory, after the ISB has been completed.
  • Speculation Barrier (SB): bars speculation of any instructions after this one, until after it is executed. Note: this instruction is optional and not available on all processors.
  • Clear the cache (MSR and SYS): using these instructions you can clear or invalidate all or part of the cache. However, these are privileged instructions and invalid from user space. Generally, these instructions are only used in the operating system kernel.

The various ARM security whitepapers have recommendations on how to use these instructions and the various developer tools like GCC and LLVM have compiler options to add these instructions to your code.

Adding a DSB/ISB instruction pair after an unconditional branch won’t affect the execution performance of your program, but it will increase the program size by 64 bits (two 32-bit instructions) after each unconditional branch. The recommendation is to only turn on the compiler options to generate this extra code in routines that do something sensitive like handling passwords or users’ sensitive data. Linux Kernel developers have to be cognizant of these issues to ensure kernel data doesn’t leak.
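To get a feel for the size cost, here is a minimal Python sketch of the arithmetic above. The function name and the example count of ten branches are hypothetical; the only facts used are that the pair is two instructions and each instruction is 32 bits wide:

```python
INSTRUCTION_BYTES = 4     # each 64-bit ARM instruction is 32 bits wide
BARRIERS_PER_BRANCH = 2   # one DSB plus one ISB


def barrier_overhead_bytes(unconditional_branches: int) -> int:
    """Extra code size from a DSB/ISB pair after every unconditional branch."""
    return unconditional_branches * BARRIERS_PER_BRANCH * INSTRUCTION_BYTES


# Hypothetical routine containing 10 unconditional branches.
overhead = barrier_overhead_bytes(10)
print(f"Extra code size: {overhead} bytes")
```

The growth is linear in the number of unconditional branches, which is why limiting the compiler option to sensitive routines keeps the overall size increase small.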

If you are playing with a Raspberry Pi, you can add the DSB, DMB and ISB instructions and run them in user mode. If you add an SB instruction then you need to add something like “-march=armv8.2-a+sb” to your as command line. The Broadcom ARM processor used in the Pi doesn’t support this instruction, so you will get an “Illegal Instruction” error when you execute it.

Ideally, this should all be transparent to the programmer. Future generations of the ARM processor should fix these problems. These instructions were added so systems programmers can address security problems as they appear. Without these tools, there would be no remedy. Addressing the current crop of speculative execution exploits doesn’t guarantee that new ones won’t be discovered in the future. It’s a cat and mouse, or whack-a-mole type game between hackers and security professionals. The next generation of chips will be more secure, but then the ball is in the hackers’ court.


It’s fascinating to see how hackers can exploit the smallest side-effect in programs or hardware to either take control of a system or to steal data from it. These recent exploits show how security has to be taken seriously in all design aspects, whether hardware, microcode, system software or user applications. As new attacks are developed, everyone has to scurry to develop workarounds, solutions and mitigations. In this cat and mouse world, there is no such thing as absolute security and everyone has to be aware that there are always risks if you are attached to a network.

If you are interested in these sorts of topics, be sure to check out my book: Programming with 64-Bit ARM Assembly Language.

Written by smist08

June 12, 2020 at 5:13 pm

Raspberry Pi Gets 8Gig and 64-Bits



The Raspberry Pi Foundation recently announced the availability of the Raspberry Pi 4 with 8-Gig of RAM along with the start of a beta for a 64-bit version of the Raspberry Pi OS (renamed from Raspbian). This blog post will look into these announcements, where the Raspberry Pi is today and where it might go tomorrow.

I’ve written two books on ARM Assembly Language programming, one for 32-bits and one for 64-bits. All Raspberry Pis have the ARM CPU as their brains. If you are interested in learning Assembly language, the Raspberry Pi is the ideal place to do so.

32- Versus 64-Bits

Previously the Raspberry Pi Foundation had been singing the virtues of their 32-bit operating system. It uses less memory than a 64-bit operating system and would run on every Raspberry Pi ever made. Further, if you really wanted 64-bits then you could run alternative versions of Linux from Ubuntu, Gentoo or Kali. The limitation of 32-bits is that you can only address 4 Gig of memory, which seems like a problem for an 8-gig device. However, 32-bit Raspbian handles this: each process can have up to 4-gig of RAM, so all 8-gig will get used if needed, just across multiple processes.
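The 4 Gig figure is simply what a 32-bit pointer can reach. A minimal Python sketch of that arithmetic follows; the names are my own, and the per-process figure is an upper bound, since the kernel normally reserves part of each process’s address space for itself:

```python
GIB = 2**30  # bytes in one gigabyte (binary)


def addressable_bytes(pointer_bits: int) -> int:
    """Number of distinct byte addresses a pointer of this width can hold."""
    return 2**pointer_bits


per_process_limit = addressable_bytes(32)
installed_ram = 8 * GIB

# Ceiling division: fewest processes needed before all 8 gig can be in use.
min_processes = -(-installed_ram // per_process_limit)

print(f"32-bit address space: {per_process_limit // GIB} GiB")
print(f"Processes needed to touch all RAM: at least {min_processes}")
```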

The downside to this is that the ARM 64-bit instruction set is faster, the memory addressing is simpler without this extra Linux memory management and modern ARM processors are optimised around 64-bits and only maintain 32-bits for compatibility. There are no new improvements to the 32-bit instruction set and typically it can’t take advantage of newer features and optimizations in the processor.

The Raspberry Pi Foundation has released a beta version of the Raspberry Pi OS where the kernel is compiled for 64-bits. Many of the applications are still 32-bits but can run fine in compatibility mode. This is just a band-aid until everything is compiled 64-bit. I’ve been running the 64-bit version of Kali Linux on my Raspberry Pi 4 with 4-gig for a year now and it is excellent. I think the transition to 64-bits is a good one and there will be many benefits down the road.

New Hardware

The new Raspberry Pi 4 model with 8-gig of RAM is similar to the older model. The change was facilitated by the recent availability of an 8-gig RAM chip in a compatible form factor. They made some small adjustments to the power supply circuitry to handle the slightly higher power requirements of this model. Otherwise, everything else is the same. If a 16-gig part becomes available they would be able to offer such a model as well. The current Raspberry Pi memory controller can only handle up to 16-gig, so to go higher, this would need to be upgraded as well.

The new model costs $75 USD with 8-gig of RAM. The 2-gig model is still only $35 USD. This is incredibly inexpensive for a computer, especially given the power of the Pi. Remember this is the price for the core unit, you still need to provide a monitor, cables, power supply, keyboard and mouse.

Raspberry Pi Limitations

For most daily computer usage the Raspberry Pi is fine. But what are the differences between the Raspberry Pi and computers costing thousands of dollars? Here are the main ones:

  1. No fast SSD interface. You can connect an SSD or mechanical hard drive to a Raspberry Pi USB port, but this isn’t as fast as if there was an M.2 or SATA interface. M.2 would be ideal for a Raspberry Pi given its compact size. Adding an M.2 slot shouldn’t greatly increase the price of a Pi.
  2. Poor GPU. On most computers GPUs can be expensive. For $75 or less you get an older less powerful GPU. A better GPU, like ARM’s Mali GPU or some nVidia CUDA cores would be terrific, but will probably double or triple the price of the Pi. Even with the poor GPU, the RetroPie game system is terrific.
  3. Slow memory interface. The Raspberry Pi 4 has DDR4 memory, but it doesn’t compare well to other computers with DDR4. This probably indicates a bottleneck in either the PCI bus or Pi memory controller. I suspect this keeps the price low, but limits CPU performance due to bottlenecks limiting the data flow to and from memory.

If the Raspberry Pi addressed these issues, it would be competitive with most computers costing hundreds of dollars more.


The 8-gig version of the Raspberry Pi is a powerful computer for only $75. Having 8-gig of RAM allows you to run more programs at once, have more browser windows open and generally have more work in progress at one time. Each year the Raspberry Pi hardware gets more powerful. Combine this with the forthcoming 64-bit version of the Raspberry Pi OS and you have a powerful system that is ideal for the DIY hobbyist, for people learning about programming, and even people using it as a general purpose desktop computer.

Written by smist08

June 5, 2020 at 4:36 pm

Coffee in the Age of Social Distancing


The Write Cup

By Stephen Smith


Here in British Columbia, Canada, COVID-19 restrictions are slowly being relaxed. As they are relaxed, coffee shops are scrambling to re-open while meeting the various government regulations for social distancing and cleaning. In this article I’ll discuss the various setups and trade-offs various shops are taking.

Inside vs Outside Seating

It is far easier for coffee shops to offer outside patio seating than providing inside seating. In both cases social distancing is required and the tables have to be measured to ensure they are sufficiently separated. Many coffee shops don’t have enough room for any inside seating and they have to keep the people in the counter lineup sufficiently separated. Often setting up the counter lineup takes all their inside floor space. Some have an indoor and snaky lineup to the counter, the pickup area and then an exit door.

BEWARE! None of the washrooms are…


Written by smist08

May 29, 2020 at 1:40 pm

Posted in Uncategorized

Browsing MSDOS and GW-Basic Source Code



These days I mostly play around with ARM Assembly Language and have written two books on it.

But long ago, my first job out of University involved some Intel 80186 Assembly Language programming, so I was interested when Microsoft recently posted the source code to GW-Basic, which is entirely written in 8086 Assembly Language. Microsoft posted the source code to MS-DOS versions 1 and 2 a few years ago, which is also written entirely in 8086 Assembly Language.

This takes us back to the days when C compilers weren’t as good at optimizing code as they are today, processors weren’t nearly as fast and memory was at a far greater premium. If you wanted your program to be useful, you had to write it entirely in Assembly Language. It’s interesting to scroll through this classic code and observe the level of documentation (low) and the programming styles used by the various programmers.

Nowadays, programs are almost entirely written in high-level programming languages and any Assembly Language is contained in a small set of routines that provide some sort of highly optimized functionality, usually involving a coprocessor. But not too long ago, many programs consisted mostly or entirely of Assembly Language.

Why Release the Source Code?

Why did Microsoft release the source code for these? One reason is that they are a part of computer history now and there are historians that want to study this code. It provides insight into why the computer industry progressed in the manner it did. It is educational for programmers to learn from. It is a nice gesture and offering from Microsoft to the DIY and open source communities as well.

The other people who greatly benefit from this are those working on the emulators that are used in systems like RetroPie. These systems have emulators for dozens of old computer systems that allow vintage games and programs to be run on modern hardware. Having the source code for the original is a great way to ensure their emulations are accurate and a great help in fixing bugs correctly.


Here is an example routine from find.asm in MS-DOS 2.0 to convert a binary number into an ASCII string. The code in this routine is typical of the code throughout MS-DOS. Remember that back then MS-DOS was a 16-bit system, so AX is 16 bits wide. Memory addresses are built using two 16-bit registers, one that provides a segment and the other that gives an offset into that 64K segment. Also keep in mind that MS-DOS can only address memory up to 640K (ten such segments).

;       Binary to Ascii conversion routine
; Entry:
;       AX      Binary number
;       DI      Points to one past the last char in the
;               result buffer.
; Exit:
;       Result in the buffer MSD first
;       CX      Digit count
; Modifies:
;       AX,BX,CX,DX and DI
        mov     bx,0ah
        xor     cx,cx
go_div:
        inc     cx
        cmp     ax,bx
        jb      div_done
        xor     dx,dx
        div     bx
        add     dl,'0'          ;convert to ASCII
        push    dx
        jmp     short go_div
div_done:
        add     al,'0'
        push    ax
        mov     bx,cx
deposit:
        pop     ax
        stosb
        loop    deposit
        mov     cx,bx
        ret

For an 8086 Assembly Language programmer of the day, this would be fairly self-evident code and they would laugh at us if we complained there wasn’t enough documentation. But we’re 40 or so years on, so I’ll give the code again but with an explanation of what is going on added in comments.

        mov     bx,0ah          ; we will divide by 0ah = 10 to get each digit
        xor     cx,cx           ; cx will be the length of the string, initialize it to 0
go_div:
        inc     cx              ; increment the count for the current digit
        cmp     ax,bx           ; Is the number < 10 (last digit)?
        jb      div_done        ; If so goto div_done to process the last digit
        xor     dx,dx           ; DX = 0
        div     bx              ; AX = DX:AX / BX, DX = remainder
        add     dl,'0'          ; convert to ASCII. Know remainder is <10 so can use DL
        push    dx              ; push the digit onto the stack
        jmp     short go_div    ; Loop for the next digit
div_done:
        add     al,'0'          ; Convert last digit to ASCII
        push    ax              ; Push it on the stack
        mov     bx,cx           ; Move string length to BX
deposit:
        pop     ax              ; get the next most significant digit off the stack
        stosb                   ; Store AL at ES:DI and increment DI
                                ; Loop decrements CX and branches if CX not zero.
                                ; Falls through when CX=0
        loop    deposit
        mov     cx,bx           ; Put the count back in CX
        ret                     ; return from routine.

A bit different than a C routine. The routine assumes the direction flag (DF) is clear, so that stosb increments the memory address after storing AL; perhaps this is a standard across MS-DOS or perhaps it’s just local to this module. I think the header comment is incorrect and that the start of the output buffer is passed in, since the digits are written forward from DI. The routine uses the stack to reverse the digits, since the dividing by 10 algorithm peels off the least significant digit first and we want the most significant digit first in the buffer. The resulting string isn’t NULL terminated, so perhaps MS-DOS treats strings as a length and buffer everywhere.
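For comparison, here is a rough sketch of how the same routine might look in C. This is my own illustration and not anything from the MS-DOS source: the name bin_to_ascii and the small local array standing in for the 8086 stack are assumptions made for the example. Like the Assembly Language version, it writes the most significant digit first, doesn't NULL terminate the result and hands back the digit count.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C counterpart of the Assembly Language routine.
 * Writes the decimal digits of value into buf, most significant digit
 * first, with no terminating NUL. Returns the digit count, which is
 * what the 8086 version leaves in CX. */
size_t bin_to_ascii(uint16_t value, char *buf)
{
    char stack[5];      /* a 16-bit value has at most 5 decimal digits */
    size_t count = 0;

    assert(buf != NULL);

    /* Peel off the least significant digit first, like the go_div loop. */
    while (value >= 10) {
        stack[count++] = (char)('0' + value % 10);
        value /= 10;
    }
    stack[count++] = (char)('0' + value);   /* last digit, as at div_done */

    /* Pop the digits in reverse order, like the deposit loop. */
    for (size_t i = 0; i < count; i++) {
        buf[i] = stack[count - 1 - i];
    }
    return count;
}
```

Calling bin_to_ascii(1234, buf) should leave the four characters 1, 2, 3, 4 at the start of buf and return 4. The caller has to supply a buffer of at least five bytes, just as the caller of the 8086 routine has to make sure there is room at ES:DI.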

Comparison to ARM

This code is representative of CISC type processors. The 8086 has few registers and their usage is predetermined. For instance the DIV instruction is only passed one parameter, the divisor. The dividend, quotient and remainder are set in hard-wired registers. RISC type processors like the ARM have a larger set of registers and tend to have three operands per instruction, namely two input registers and an output register.
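To make the hard-wired registers concrete, here is a small C model of what the 16-bit DIV instruction does. It is only a sketch: the struct regs type and the div16 function are names I made up for illustration. It shows why the routine clears DX before each divide, since the dividend is really the 32-bit value held in DX:AX.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the two registers that a 16-bit DIV implicitly uses. */
struct regs {
    uint16_t ax;
    uint16_t dx;
};

/* Models "div bx": the 32-bit dividend is DX:AX, the quotient ends up
 * in AX and the remainder in DX. Only the divisor is an operand. */
void div16(struct regs *r, uint16_t bx)
{
    uint32_t dividend = ((uint32_t)r->dx << 16) | r->ax;

    assert(bx != 0);                    /* the real CPU raises a divide error */
    assert(dividend / bx <= 0xFFFF);    /* a quotient that overflows AX also faults */

    r->ax = (uint16_t)(dividend / bx);
    r->dx = (uint16_t)(dividend % bx);
}
```

Starting with AX = 1234 and DX = 0, calling div16 with a divisor of 10 leaves 123 in AX and 4 in DX, which is exactly one pass through the go_div loop. If DX held leftover data from the previous pass, the dividend would be wrong, hence the xor dx,dx.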

This code could be assembled for a modern Intel 64-bit processor with little alteration, since Intel has worked hard to maintain a good level of compatibility as it has gone from 16 bits to 32 bits to 64 bits. ARM, by contrast, redesigned their instruction set when they went from 32 bits to 64 bits. This was a great improvement for ARM and only possible now that the amount of Assembly Language code in use is so much smaller.


Kudos to Microsoft for releasing this 8086 Assembly Language source code. It is interesting to read and gives insight into how programming was done in the early 80s. I hope more classic programs have their source code released for educational and historical purposes.

Written by smist08

May 25, 2020 at 6:56 pm

Virtual LinuxFest Northwest


Last year we packed up the r-pod travel trailer and headed down to Bellingham, WA for some mountain biking and LinuxFest Northwest 2019. It was a really fun and informative show that I blogged about here. I greatly enjoyed the show and hoped to return the next year participating by giving a presentation. I applied last fall and was accepted. This looked like it was going to really work out since the show corresponded with the release of my second computer book: Programming with 64-Bit ARM Assembly Language from Apress.

Enter Covid-19

Things were looking good until February, when things started to lock down and get cancelled due to the Covid-19 outbreak. Eventually LinuxFest Northwest in Bellingham was added to the list of cancelled events. Even if the organizers hadn’t cancelled, the border between Canada and the USA was closed to all non-essential travel. I suspect I would have had a hard time convincing the border guards that my presentation at LinuxFest was essential.

The organizers of LinuxFest Northwest weren’t happy abandoning the show altogether, so they asked all the presenters if they would record their presentations and upload them to be posted on the LinuxFest YouTube channel. Further, they set up a question and answer section on the LinuxFest discussion forums.

It looks like quite a few presenters participated and you can find all the sessions here. The Q&A forums are all here. More specifically my presentation is here.


I’m disappointed that LinuxFest Northwest didn’t happen live this year. I was looking forward to it. But at least we had the virtual event. I invite you to browse and watch a few of the sessions. Hopefully, we will gather in person next year.

Written by smist08

May 22, 2020 at 1:09 pm

Programming with 64-Bit ARM Assembly Language


My first book on Assembly Language is Raspberry Pi Assembly Language Programming, which is all about ARM 32-Bit Assembly Language Programming. This is because the official variant of Linux for the Raspberry Pi, Raspbian, is 32-bit. There are good reasons for this, the most important being that until the Raspberry Pi 4, the maximum memory was 1Gig, which isn’t enough to properly run a 64-bit version of Linux. Yes, you can do it, but it’s rather painful.

Now with the Raspberry Pi 4 supporting 4Gig of RAM and other SBCs like the nVidia Jetson Nano also containing 4Gig of RAM, running 64-bit operating systems makes a lot more sense. Further, in the ARM world, all phones and tablets have moved to 64-bits. All Apple products are 64-bit and all but the very cheapest Android phones are 64-bit.

Hence I felt it made sense to create a 64-bit version of my book and my publisher Apress agreed. This resulted in my newest book: Programming with 64-Bit ARM Assembly Language.

Beyond the Raspberry Pi

Along with teaching how to program 64-bit ARM Assembly Language, the book goes beyond the Raspberry Pi to cover how to add Assembly Language routines to your Apple iOS or Google Android Apps. Every App developer is struggling to get their App noticed out of the millions of Apps in the App stores. Having better performance is one great way to get users to recommend your App to their friends.

The book also covers how to write Assembly Language for ARM 64-Bit Linux, including Ubuntu as included with the nVidia Jetson Nano or Kali Linux running on a Raspberry Pi 4. Further, the book covers how to cross compile your code, that is, how to compile/assemble it on a powerful Intel/AMD computer and then run it on your target device.

There is a lot of interest in IoT and embedded devices these days. Often these are based on ARM processors and often you need to do some Assembly Language programming to write the device drivers for the various custom pieces of hardware you are developing.

About ARM 64-Bit Assembly Language

When ARM developed the 64-bit version of their processor, they took the time to fix many problems that had accumulated over the years in the 32-bit versions. The Assembly Language syntax is more streamlined and a lot of little-used features, like the ability to make nearly every instruction conditional, were removed. As a consequence this new book is a complete rewrite. Although anyone familiar with 32-bit ARM Assembly should find 64-bit Assembly familiar, there are a lot of differences and improvements, such as doubling the number of registers.

With the new book you learn how to utilize all the new features that are now available to you, how the instruction syntax is much more uniform across all the coprocessors and how to use all the new registers you have at your disposal.

The newest generations of the ARM processor all have deep execution pipelines and multiple cores. The new 64-bit instruction set is the foundation that allows the ARM processor to fully exploit these features and get the best performance for the smallest amount of power usage.

Where to Buy

With Covid-19, things are moving a bit slower than normal. The ePub versions of my book are available now from Apress directly. This should flow to all the other retailers shortly; in the meantime they have the book available for presale. The print version is in process, but I’m not sure how long it will take this time around. Here are some sample places where it is listed:

Over the coming weeks, it’ll change from pre-release to shipping now.


If you are interested in learning 64-Bit ARM Assembly Language, either to optimize your programs or to learn about the architecture of a modern RISC processor then this book is for you. I hope this book motivates people to use more Assembly Language in their work to produce high performance applications. When people are surveyed for their favorite features in applications, better performance is always top of the list.

Written by smist08

May 2, 2020 at 10:46 am