Stephen Smith's Blog

Musings on Machine Learning…

Archive for April 2019

LinuxFest Northwest 2019



2019 is the 50th anniversary of Unix and the 25th anniversary of Linux. Last weekend, I attended the 20th LinuxFest Northwest in Bellingham at Bellingham Technical College. It was a great celebration with over 1200 attendees and 84 speakers. Most of the main Linux distributions were represented, along with many hardware, software and service companies associated with Linux.

I attended many great presentations and learned quite a lot. In this article, I’ll give a quick survey of what I got out of the conference. In each time slot there were typically ten talks to choose from and I chose the one that interested me the most. I tended to go to the security and overview presentations.

Computers are Broken

The first presentation I went to was “Computers are Broken (and we’re all going to die)” by Bryan Lunduke. This presentation laid out the problems with the continued increase in the complexity of all software, and how this is slowing down current development, since programming teams need to be much larger and understanding what is already there is so difficult. He gave his presentation running Windows for Workgroups 3.11 and PowerPoint 4. His point was that he can do everything he needs with this, but with way less RAM, disk space and processing power. There were lots of arguments about how software gets into everything and how hard it is to test, which is getting quite dangerous. Just look at Boeing’s problems with the 737 Max.

50 Years of Unix

Next I went to Maddog’s presentation on “50 Years of Unix, the Internet and more”. Maddog has been around Unix the whole time and had a lot of great stories from the history of Unix, Linux and computers. He spent most of his career at DEC, but has done many other things along the way.

Freedom, Security and Privacy

Then I went to Kyle Rankin’s talk, which started with a slide on Oxford commas and why there is only one comma in the title of his presentation. The Linux community has some very paranoid people, and maintaining security and privacy is a major theme of the conference. One of the most hated items by the Linux community is the UEFI BIOS and how it gives corporations and governments backdoors into everyone’s computers. If you can, get a computer with a Coreboot BIOS, which is open source and lacks all these security problems. One claim is that security in Linux is better because there are so many eyes on it, but he made the point that unless they are the right eyes, you don’t really gain anything. Getting the best security researchers to test and analyse Linux remains a challenge. Also, people tend to be a bit complacent about where they get their software. Even if it’s open source, they don’t build it themselves, which leaves room for bad things to be inserted.

Early Technology and Ideas for the Future

Jeff Fitzmaurice gave a presentation that looked at some examples from the history of science and how various theoretical breakthroughs led to technological developments. Then there was speculation on which developments happening in science now will lead to future technological developments. We discussed AI, materials science and quantum computing, among others.

Ubuntu 19.04+

I went to Simon Quigley’s presentation on Ubuntu, mostly because I use Ubuntu, both on this laptop and on my NVidia Jetson Nano. This talk covered what is new in 19.04 (Disco Dingo) and how work is going towards 19.10 (note the version numbers are year.month of the release target). I’ve been running the LTS (long term support) version and I was surprised to find out they only do an LTS every two years, so when I got home, I changed my configuration to install any new released version. It was interesting to hear how they need to get open source contributors to commit to the five year support commitment of the LTS.

People who work on all the derivatives, like Kubuntu and Lubuntu, were present. Most of the work they do actually goes into the upstream Debian release, which benefits even more people.

The Fight for a Secure Linux BIOS

David Spring gave this presentation on all the evils of UEFI and why we need Coreboot so badly. He has a lot of stories about the evils done by the NSA, including causing the Deepwater Horizon disaster. According to him, when the NSA released the second version of Stuxnet to attack the Iranian nuclear program, it got away from them. The oil industry uses a lot of the same Siemens equipment and got infected. Before the disaster, Deepwater Horizon’s monitoring computers were all down because of the NSA and Stuxnet. If not for the NSA, they would have detected the problem and resolved it without the disaster. For all the propaganda on Chinese and Russian hacking, he said the NSA employs 100 hackers for every single Chinese one. Their budget is huge.

Past, Present and Future of Blockchain

My friend Clive Boulton (from the GWT days) gave this presentation on the commercial uses of blockchain. This had nothing to do with cryptocurrencies and was on using the algorithms to secure and enable commercial transactions without third party intermediaries. The presentation covered a number of frameworks like Hyperledger and Openchain that enable blockchain for application developers.

Zero Knowledge Architecture

M4dz’s presentation showed how to limit access to application data, for instance to stop insurance companies seeing your medical records. Zero knowledge protocols find ways to verify that you have some knowledge without revealing the knowledge itself. For instance, if you want to know if someone can access a room, you can watch them open the door. You don’t need to get a copy of the key. Similarly, you can watch a service use a password without it giving you the password. These protocols are quite difficult, especially when you get into key recovery procedures, but ultimately if these gain traction we will all get better privacy.

Linux Gaming – the Dark Ages, Today and Beyond…

Ray Shimko’s presentation covered the state of Linux gaming from all the old console emulators to native ports of games where the source code has been released, to better packaging of all the layers required to run Windows games (right version of Wine, etc.). There are a lot of games on Linux now, but sadly the newest hot releases lag quite a while before showing up.

One interesting story is how the emulator contributors are trying to deal with games like “Duck Hunt”. Duck Hunt came with a gun that you pointed at the TV to shoot the ducks. The way this worked was that when you pressed the trigger, the game would flash the screen white. On a CRT this meant the refresh would scan down the screen in 1/60th of a second. A sensor in the gun would record when it saw white, and by measuring the time difference, the software would know where the gun was pointing. The problem is that modern screens don’t work that way, so this whole aiming technique doesn’t work. Evidently a workaround is forthcoming.
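The timing idea in that story can be sketched in a few lines of Python. This only illustrates the description above and is not emulator code: the 240 visible lines and 256 pixels per line are assumed numbers, and `beam_position` is a hypothetical helper name.

```python
# Sketch of the light-gun timing idea described above (not real emulator code).
FIELD_TIME_S = 1 / 60      # time for the beam to scan the whole screen (from the text)
VISIBLE_LINES = 240        # assumption: number of scanlines on screen
PIXELS_PER_LINE = 256      # assumption: horizontal resolution


def beam_position(delay_s):
    """Convert the delay between the start of the white flash and the
    moment the gun's sensor sees light into a (line, column) estimate."""
    if not 0 <= delay_s < FIELD_TIME_S:
        raise ValueError("sensor never saw the flash during this scan")
    # Fraction of the scan completed when the sensor fired.
    fraction = delay_s / FIELD_TIME_S
    # The beam draws left to right, top to bottom, so the fraction maps
    # directly onto a pixel index in raster order.
    pixel_index = int(fraction * VISIBLE_LINES * PIXELS_PER_LINE)
    return divmod(pixel_index, PIXELS_PER_LINE)


if __name__ == "__main__":
    # A hit seen halfway through the scan lands on the middle line.
    print(beam_position(FIELD_TIME_S / 2))
```

A screen that does not draw the picture as a single top-to-bottom sweep breaks the link between delay and position, which is why the technique fails on modern displays.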


The conference ended with a Q&A session hosted by Maddog, Kyle Rankin and Simon Quigley. The audience could ask whatever they wanted and perhaps got an answer or perhaps got a story. Lots of why doesn’t Linux do X and how can I contribute to Y.


Hard to believe Linux is 25 years old already. This is a great show and, in the spirit of free software, the show is also free to attend. There was lots of interesting discussion and it’s refreshing to see software developing in the direction users really want, rather than what you see under various corporate agendas.

When you buy a new computer, make sure it uses Coreboot BIOS and not UEFI.



Written by smist08

April 30, 2019 at 7:01 pm

Posted in Life


Playing with Software Defined Radio



Most ham radios these days receive signals through an antenna, convert the signal to digital, process the signal with a built-in computer, and then convert the result back to analog for the speaker. This trend towards doing all the radio signal processing in software instead of using electronic components is called Software Defined Radio (SDR). The ICOM 7300 is built around SDR, as are all the expensive Flex Radios.

Inexpensive SDR

Some clever hackers figured out that an inexpensive chip used in boards to receive TV into a computer, could actually tune to any frequency. From this discovery, many inexpensive USB dongles have been produced that utilize this “TV Tuner” chip, but to tune radio instead of TV. This is possible because all this chip does is receive a signal from an antenna and then convert it to digital for the computer to process. I purchased the RTL-SDR dongle for around $30 which included a small VHF/UHF antenna.

I run Linux, both on my laptop and on a Raspberry Pi. I looked around for software to use with this device and found several candidates. I chose CubicSDR because it easily installed from the Ubuntu App store on both my laptop and on my Raspberry Pi.

I tried it first on the Pi, but it just didn’t work well. It would keep hanging and the sound was never good. I then tried it on my laptop and it worked great. This led me to believe that the Raspberry Pi just doesn’t have the horsepower to run this sort of system, either due to lack of memory (only having 1Gig) or because the ARM processor isn’t quite powerful enough. Doing some reading online, the consensus seemed to be that you couldn’t run both the radio software and a GUI on the same Pi. You needed to either have two Pis or use a command line version of the software. I was disappointed the Pi wasn’t up to the challenge, but got along just fine using my laptop.

Enter the NVidia Jetson Nano

I recently acquired an NVidia Jetson Nano Developer Kit. This is similar to a Raspberry Pi, but with a more powerful quad-core ARM processor, 4Gig of RAM and 128 NVidia Tegra GPU cores (it also costs $99 rather than $35).

I installed CubicSDR on this, and it worked right away like a charm. I was impressed. Getting software for the Nano can sometimes be difficult since it runs true 64-Bit Ubuntu Linux on ARM, so you need to have that built. But CubicSDR was in the App Store and installed with no problem. I fired it up and it recognized the RTL-SDR and it recognized the NVidia Tegra GPUs. It took over 10 of them for doing its signal processing and worked really well.

Below is a screenshot of CubicSDR playing an FM radio station.


CubicSDR is open source and free. It uses open source libraries such as liquid-dsp and SoapySDR under the covers for the low level radio processing. CubicSDR has quite an impressive display. Like fancy high end radios, you can see what is happening on the frequencies around where you are tuned in. The interface can be a bit cryptic and you need to refer to the documentation to do some things. For instance, the volume doesn’t honor the system setting and you have to use the green slider in the upper right. Knowing what the various sliders do is quite helpful. Tuning frequencies is a bit tricky at first, but once you check the manual and play with it, it becomes easy. Using CubicSDR really is like using a high end radio, just for a fraction of the cost.

It is certainly helpful to know ham terminology and to know what radio protocol is used where. For instance most VHF communications use narrow band FM. Most longer wavelength ham communications are either upper or lower sideband. Aeronautical uses AM. Commercial FM stations use wide band FM.


Although the RTL-SDR supports pretty much any frequency, you need the correct antenna for what you are doing. The ham bands that bounce off the ionosphere to allow you to talk to people halfway around the world use quite long wavelengths. The longer the wavelength, the larger the antenna you need to receive them. Don’t expect to receive anything from the 20 meter band without a good sized antenna. That doesn’t mean it has to be expensive. You can get good results using a dipole or end-fed antenna, both of which are just made out of wire, but you do have to string them high up and facing the right direction.
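To put rough numbers on "the longer the wavelength, the larger the antenna", here is a small Python sketch. It assumes an ideal half-wave dipole in free space; real wire antennas come out a few percent shorter, and the two example frequencies are just illustrative picks (14.2 MHz sits inside the 20 meter band).

```python
SPEED_OF_LIGHT_M_S = 299_792_458  # metres per second


def wavelength_m(freq_hz):
    """Free-space wavelength for a given frequency."""
    return SPEED_OF_LIGHT_M_S / freq_hz


def half_wave_dipole_m(freq_hz):
    """Ideal half-wave dipole length (an upper bound for a real wire dipole)."""
    return wavelength_m(freq_hz) / 2


if __name__ == "__main__":
    examples = [("20 meter ham band (14.2 MHz)", 14.2e6),
                ("FM broadcast (100 MHz)", 100e6)]
    for label, freq_hz in examples:
        print(f"{label}: wavelength {wavelength_m(freq_hz):.2f} m, "
              f"dipole about {half_wave_dipole_m(freq_hz):.2f} m")
```

The dipole for the 20 meter band works out to roughly ten meters of wire, compared with about a meter and a half for FM broadcast, which is why the small antenna that ships with the dongle is only good for VHF/UHF.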

What About Transmitting?

This RTL-SDR only receives signals. If you want to transmit as well, then you need a more expensive model. These sorts of SDR transmitters are very low power, so if you want to be heard, you will need a good linear amplifier, rated for the frequencies you want to use. You will also need a better antenna.

If you transmit, you also require a ham radio license and call sign. You are responsible for not causing interference and for making sure your signal doesn’t bleed through to adjacent channels. Since you are assembling this all yourself, an advanced license is required.


SDR is great fun to play with and there are lots of great projects you can create with this and an inexpensive single board computer. It’s too bad the Raspberry Pi isn’t quite up to the task. However, more powerful Pi competitors like the Jetson Nano run SDR just fine.

Written by smist08

April 16, 2019 at 2:08 am

Playing with CUDA on My NVIDIA Jetson Nano



I reported last time about my new toy, an NVIDIA Jetson Nano Development Kit. I’m pretty familiar with Linux and ARM processors. I even wrote a couple of articles on Assembler programming, here and here. The thing that intrigued me about the Jetson Nano is its 128 Maxwell GPU cores. What can I do with these? Sure, I can speed up TensorFlow since it uses these automatically. I could probably do the same with OpenGL programs. But what can I do directly?

So I downloaded the CUDA C Programming Guide from NVIDIA’s website to have a look at what is involved.


The claim is that the microSD image of 64Bit Ubuntu Linux that NVIDIA provides for this computer has all the NVIDIA libraries and utilities you need pre-installed. The programming guide made it clear that you need to use the NVIDIA C compiler, nvcc, to compile your work. But if I typed nvcc at a command prompt, I just got an error that this command wasn’t found. A bit of Googling revealed that everything is installed, but it was installed before the setup process created your user, so you need to add the locations to some PATHS. Adding:

export PATH=${PATH}:/usr/local/cuda/bin
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/usr/local/cuda/lib64

to my .bashrc file got everything working. It also shows where CUDA is installed. This is handy since it includes a large collection of samples.
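If you want to confirm that the PATH change took effect after opening a new terminal, a small Python check like the following can help. This is only a sketch: `find_nvcc` is a hypothetical helper name, and the directory is the one from the export line above.

```python
import shutil

CUDA_BIN_DIR = "/usr/local/cuda/bin"  # location taken from the export line above


def find_nvcc(search_path=None):
    """Return the full path to nvcc, or None if it cannot be found.

    search_path is a PATH-style string; None means use the current PATH.
    """
    return shutil.which("nvcc", path=search_path)


if __name__ == "__main__":
    location = find_nvcc()
    if location is None:
        print(f"nvcc not on PATH; check that {CUDA_BIN_DIR} was added in .bashrc")
    else:
        print(f"nvcc found at {location}")
```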

Compiling the deviceQuery sample produced the following output on my Nano:

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "NVIDIA Tegra X1"
  CUDA Driver Version / Runtime Version          10.0 / 10.0
  CUDA Capability Major/Minor version number:    5.3
  Total amount of global memory:                 3957 MBytes (4148756480 bytes)
  ( 1) Multiprocessors, (128) CUDA Cores/MP:     128 CUDA Cores
  GPU Max Clock rate:                            922 MHz (0.92 GHz)
  Memory Clock rate:                             13 Mhz
  Memory Bus Width:                              64-bit
  L2 Cache Size:                                 262144 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 32768
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            Yes
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Compute Preemption:            No
  Supports Cooperative Kernel Launch:            No
  Supports MultiDevice Co-op Kernel Launch:      No
  Device PCI Domain ID / Bus ID / location ID:   0 / 0 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.0, CUDA Runtime Version = 10.0, NumDevs = 1

Result = PASS

This is all good information and what all this data means is explained in NVIDIA’s developer documentation (which is actually pretty good). The deviceQuery sample exercises various information APIs in the CUDA library to tell you all it can about what you are running. If you can compile and run deviceQuery in the samples/1_Utilities folder then you should be good to go.

CUDA Hello World

The 128 NVidia Maxwell cores basically consist of a SIMD computer (Single Instruction Multiple Data). This means you have one instruction that they all execute, but on different data. For instance if you want to add two arrays of 128 floating point numbers you have one instruction, add, and then each processor core adds a different element of the array. NVidia actually calls their processors SIMT meaning single instruction multiple threads, since you can partition the processors to different threads and have the two threads each with a collection of processors doing their SIMD thing at once.

When you write a CUDA program, you have two parts: one is the part that runs on the host CPU and the other is the part that runs on the NVidia GPUs. The NVidia C compiler, NVCC, adds a number of extensions to the C language to specify what runs where and to provide some more convenient syntaxes for the common things you need to do. For the host parts, NVCC translates its custom syntax into CUDA library calls and then passes the result on to GCC to compile regularly. For the GPU parts, NVCC compiles to an intermediate format called PTX. The reason it does this is to support all the various NVidia GPU models. When the NVidia device driver goes to load this code, it does a just in time compile (which it then caches), where the PTX code is compiled to the correct binary code for your particular set of GPUs.

Here is the skeleton of a simple CUDA program:

// Kernel definition
__global__ void VecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main()
{
    ...
    // Kernel invocation with N threads
    VecAdd<<<1, N>>>(A, B, C);
    ...
}


The __global__ identifier marks the VecAdd routine as one that runs on the GPU. One instance of this routine will be downloaded to run on N processors. Notice there is no loop to add these vectors. Each processor will be a different thread, and the thread’s index, threadIdx.x, will be used to choose which array element to add.

Then in the main program we call VecAdd using the VecAdd<<<1, N>>> syntax, which indicates we are calling a GPU function with these three arrays. The two values between the angle brackets set the launch configuration: one block of N threads.
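If you don’t have a GPU handy, the launch model can be mimicked on the CPU. The following Python sketch is only an emulation of the idea, not CUDA: `launch_one_block` is a hypothetical stand-in for the launch syntax, and it runs the kernel body once per thread index in an ordinary loop, where a real GPU would run the instances in parallel.

```python
# CPU-only emulation of a one-block kernel launch; no GPU is involved.


def vec_add(thread_idx_x, a, b, c):
    """Body of the kernel: each thread handles exactly one element."""
    i = thread_idx_x
    c[i] = a[i] + b[i]


def launch_one_block(kernel, n_threads, *args):
    """Stand-in for a launch of one block with n_threads threads.

    The loop lives here, in the launcher, not in the kernel itself.
    """
    for thread_idx_x in range(n_threads):
        kernel(thread_idx_x, *args)


if __name__ == "__main__":
    n = 8
    a = [float(i) for i in range(n)]
    b = [10.0 * i for i in range(n)]
    c = [0.0] * n
    launch_one_block(vec_add, n, a, b, c)
    print(c)
```

The point of the sketch is the shape of the code: the kernel never iterates over the array, it only uses its own index to pick an element.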

This little example skips the extra steps of copying the arrays to GPU memory or copying the result out of GPU memory. There are quite a few different memory types, and various trade offs for using them.

The complete program for adding two vectors from the samples is at the end of this article.

This example also doesn’t explain how to handle larger arrays or how to do error processing. For these extra levels of complexity, refer to the CUDA C Programming Guide.
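As a rough guide to the larger-array case: the deviceQuery output above reports a maximum of 1024 threads per block, so bigger arrays are split across several blocks and each thread works out a global index from its block and thread numbers. The Python sketch below emulates that arithmetic on the CPU, using the same formulas as the vectorAdd sample at the end of this article. The `launch` helper is a hypothetical stand-in for the kernel launch, not a CUDA API.

```python
# CPU-only emulation of a multi-block launch; it mirrors the index
# arithmetic of the vectorAdd sample but is not CUDA code.


def blocks_per_grid(num_elements, threads_per_block):
    """Ceiling division: enough blocks to cover every element."""
    return (num_elements + threads_per_block - 1) // threads_per_block


def vector_add(block_idx_x, block_dim_x, thread_idx_x, a, b, c, num_elements):
    """Kernel body: compute a global index, then guard against overrun."""
    i = block_dim_x * block_idx_x + thread_idx_x
    if i < num_elements:  # threads past the end of the array do nothing
        c[i] = a[i] + b[i]


def launch(kernel, n_blocks, threads_per_block, *args):
    """Run the kernel once for every (block, thread) pair."""
    for block_idx_x in range(n_blocks):
        for thread_idx_x in range(threads_per_block):
            kernel(block_idx_x, threads_per_block, thread_idx_x, *args)


if __name__ == "__main__":
    num_elements = 50000
    threads_per_block = 256
    n_blocks = blocks_per_grid(num_elements, threads_per_block)
    a = [float(i) for i in range(num_elements)]
    b = [2.0 * i for i in range(num_elements)]
    c = [0.0] * num_elements
    launch(vector_add, n_blocks, threads_per_block, a, b, c, num_elements)
    idle = n_blocks * threads_per_block - num_elements
    print(f"{n_blocks} blocks of {threads_per_block} threads, {idle} idle threads")
```

Rounding the block count up means the last block usually has a few threads with nothing to do, which is why the kernel needs the bounds check.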

The CUDA program here is very short, just doing an addition. If you wanted to, say, multiply two 10×10 matrices, you would have your CUDA code do the dot product of a row in the first matrix by a column in the second matrix. Then you would have 100 cores execute this code, so in the ideal case, ignoring the overhead of copying data and launching the kernel, the multiplication would finish about 100 times faster than computing the 100 dot products one after another on the host processor. There are a lot of examples of how to do matrix multiplication in the samples and documentation.
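The one-dot-product-per-thread idea can be sketched on the CPU as well. This Python emulation is an illustration under assumptions rather than CUDA code: the mapping from a flat thread index to a (row, column) pair via `divmod` is one simple choice, and `mat_mul` is a hypothetical launcher that loops where a GPU would run the threads in parallel.

```python
# CPU-only emulation: one "thread" per output element of C = A x B.
N = 10


def mat_mul_kernel(thread_idx, a, b, c, n):
    """Each thread computes one dot product: a row of A by a column of B."""
    row, col = divmod(thread_idx, n)
    total = 0.0
    for k in range(n):
        total += a[row][k] * b[k][col]
    c[row][col] = total


def mat_mul(a, b, n=N):
    """Emulate launching n * n threads (100 for 10x10 matrices)."""
    c = [[0.0] * n for _ in range(n)]
    for thread_idx in range(n * n):
        mat_mul_kernel(thread_idx, a, b, c, n)
    return c


if __name__ == "__main__":
    identity = [[1.0 if r == col else 0.0 for col in range(N)] for r in range(N)]
    a = [[float(r * N + col) for col in range(N)] for r in range(N)]
    # Multiplying by the identity should give back the original matrix.
    print(mat_mul(identity, a) == a)
```

Note that each thread still contains a short loop over k for its own dot product. What gets spread across the cores is the 100 independent output elements.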

Newer CUDA Technologies

The Maxwell GPUs in the Jetson Nano are a bit old and reading and playing with the CUDA libraries revealed a few interesting tidbits on things they are missing. We all know how NVidia has been enhancing their products for gaming and graphics with the introduction of things like real time ray tracing, but the thing of more interest to me is how they’ve been adding features specific to Machine Learning and AI. Even though Google produces their own hardware for accelerating their TensorFlow product in their data centers, NVidia has added specific features that greatly help TensorFlow and other Neural Network programs.

One thing the Maxwell GPU lacks is direct matrix multiplication support. Newer GPUs can just do A * B + C as a single instruction, where these are all matrices.

Another thing that NVidia just added is direct support for executing computation graphs. If you worked with the early version of TensorFlow then you know that you construct your model by building a computational graph and then training and executing it. The newest NVidia GPUs can now execute these graphs directly. NVidia has a TensorRT library to move parts of TensorFlow to the GPU, this library does work for the Maxwell GPUs in the Jetson Nano, but is probably way more efficient in the newest, bright and shiny GPUs. Even just using TensorFlow without TensorRT is a great improvement and handles moving the matrix calculations to the GPUs even for the Nano, it just means the libraries have more work to do.


The GPU cores in a product like the Jetson Nano can be easily utilized using products that support them like TensorFlow or OpenGL, but it’s fun to explore the lower level programming models to see how things are working under the covers. If you are interested in parallel programming on a SIMD type machine, then this is a good way to go.


/**
 * Copyright 1993-2015 NVIDIA Corporation.  All rights reserved.
 *
 * Please refer to the NVIDIA end user license agreement (EULA) associated
 * with this source code for terms and conditions that govern your use of
 * this software. Any use, reproduction, disclosure, or distribution of
 * this software and related documentation outside the terms of the EULA
 * is strictly prohibited.
 */

/**
 * Vector addition: C = A + B.
 *
 * This sample is a very basic sample that implements element by element
 * vector addition. It is the same as the sample illustrating Chapter 2
 * of the programming guide with some additions like error checking.
 */

#include <stdio.h>
#include <stdlib.h>  // malloc, free, rand, exit
#include <math.h>    // fabs

// For the CUDA runtime routines (prefixed with "cuda_")
#include <cuda_runtime.h>

#include <helper_cuda.h>

/**
 * CUDA Kernel Device code
 *
 * Computes the vector addition of A and B into C. The 3 vectors have the same
 * number of elements numElements.
 */
__global__ void
vectorAdd(const float *A, const float *B, float *C, int numElements)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;

    if (i < numElements)
    {
        C[i] = A[i] + B[i];
    }
}

/**
 * Host main routine
 */
int
main(void)
{
    // Error code to check return values for CUDA calls
    cudaError_t err = cudaSuccess;

    // Print the vector length to be used, and compute its size
    int numElements = 50000;
    size_t size = numElements * sizeof(float);
    printf("[Vector addition of %d elements]\n", numElements);

    // Allocate the host input vector A
    float *h_A = (float *)malloc(size);

    // Allocate the host input vector B
    float *h_B = (float *)malloc(size);

    // Allocate the host output vector C
    float *h_C = (float *)malloc(size);

    // Verify that allocations succeeded
    if (h_A == NULL || h_B == NULL || h_C == NULL)
    {
        fprintf(stderr, "Failed to allocate host vectors!\n");
        exit(EXIT_FAILURE);
    }

    // Initialize the host input vectors
    for (int i = 0; i < numElements; ++i)
    {
        h_A[i] = rand()/(float)RAND_MAX;
        h_B[i] = rand()/(float)RAND_MAX;
    }

    // Allocate the device input vector A
    float *d_A = NULL;
    err = cudaMalloc((void **)&d_A, size);

    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to allocate device vector A (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    // Allocate the device input vector B
    float *d_B = NULL;
    err = cudaMalloc((void **)&d_B, size);

    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to allocate device vector B (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    // Allocate the device output vector C
    float *d_C = NULL;
    err = cudaMalloc((void **)&d_C, size);

    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to allocate device vector C (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    // Copy the host input vectors A and B in host memory to the device input vectors in
    // device memory
    printf("Copy input data from the host memory to the CUDA device\n");
    err = cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);

    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to copy vector A from host to device (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    err = cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to copy vector B from host to device (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    // Launch the Vector Add CUDA Kernel
    int threadsPerBlock = 256;
    int blocksPerGrid =(numElements + threadsPerBlock - 1) / threadsPerBlock;
    printf("CUDA kernel launch with %d blocks of %d threads\n", blocksPerGrid, threadsPerBlock);
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, numElements);
    err = cudaGetLastError();

    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to launch vectorAdd kernel (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    // Copy the device result vector in device memory to the host result vector
    // in host memory.
    printf("Copy output data from the CUDA device to the host memory\n");
    err = cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to copy vector C from device to host (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    // Verify that the result vector is correct
    for (int i = 0; i < numElements; ++i)
    {
        if (fabs(h_A[i] + h_B[i] - h_C[i]) > 1e-5)
        {
            fprintf(stderr, "Result verification failed at element %d!\n", i);
            exit(EXIT_FAILURE);
        }
    }

    printf("Test PASSED\n");

    // Free device global memory
    err = cudaFree(d_A);

    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to free device vector A (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    err = cudaFree(d_B);

    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to free device vector B (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    err = cudaFree(d_C);

    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to free device vector C (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    // Free host memory
    free(h_A);
    free(h_B);
    free(h_C);

    return 0;
}

Written by smist08

April 3, 2019 at 6:01 pm