Stephen Smith's Blog

Musings on Machine Learning…

Posts Tagged ‘nvidia jetson nano develoment kit

Playing with Software Defined Radio

with 2 comments

Introduction

Most Ham Radios these days, receive signals through an antenna, convert the signal to digital, process the signal with a built-in computer, and then output the result converting back to analog for the speaker. This trend to doing all the radio signal processing in software instead of using electronic components is called Software Defined Radio (SDR). The ICOM 7300 is built around SDR as are all the expensive Flex Radios.

Inexpensive SDR

Some clever hackers figured out that an inexpensive chip used in boards to receive TV into a computer, could actually tune to any frequency. From this discovery, many inexpensive USB dongles have been produced that utilize this “TV Tuner” chip, but to tune radio instead of TV. This is possible because all this chip does is receive a signal from an antenna and then convert it to digital for the computer to process. I purchased the RTL-SDR dongle for around $30 which included a small VHF/UHF antenna.

I run Linux, both on my laptop and on a Raspberry Pi. I looked around for software to use with this device and found several candidates. I chose CubicSDR because it easily installed from the Ubuntu App store on both my laptop and on my Raspberry Pi.

I tried it first on the Pi, but it just didn’t work well. It would keep hanging and the sound was never good. I then tried it on my laptop and it worked great. This led me to believe that the Raspberry Pi just doesn’t have the horsepower to run this sort of system. Either due to lack of memory (only having 1Gig) or that the ARM processor isn’t quite powerful enough. Doing some reading online, the consensus seemed to be that you couldn’t run both the radio software and a GUI on the same Pi. You needed to either have two Pi’s or use a command line version of the software. I was disappointed the Pi wasn’t up to the challenge, but got along just fine using my laptop.

Enter the NVidia Jetson Nano

I recently acquired an NVidia Jetson Nano Developers Kit. This is similar to a Raspberry Pi, but with a more powerful quad-core ARM processor, 4Gig or RAM and 120 Tegra NVidia GPU processors (it also costs $99 rather than $35).

I installed CubicSDR on this, and it worked right away like a charm. I was impressed, getting software for the Nano can sometime be difficult since it runs true 64-Bit Ubuntu Linux on ARM, so you need to have that built. But CubicSDR was in the App Store and installed with no problem. I fired it up and it recognized the RTL-SDR and it recognized the NVidia Tegra GPUs. It took over 10 of them for doing its signal processing and worked really well.

Below is a screenshot of CubicSDR playing an FM radio station.

CubicSDR

CubicSDR is open source and free, it uses GNURadio under the covers (low level open source radio processing). CubicSDR has quite an impressive display. Like fancy high end radios you can see what is happening on the frequencies around where you are tuned in. The interface can be a bit cryptic and you need to refer to the documentation to do some things. For instance the volume, doesn’t honor the system setting and you have to use the green slider in the upper right. Knowing what the various sliders do is quite helpful. Tuning frequencies is a bit tricky at first, but once you check the manual and play with it, it becomes easy. Using CubicSDR really is like using a high end radio, just for a fraction of the cost.

It is certainly helpful to know ham terminology and to know what radio protocol is used where. For instance most VHF communications use narrow band FM. Most longer wavelength ham communications are either upper or lower sideband. Aeronautical uses AM. Commercial FM stations use wide band FM.

Antennas

Although the RTL-SDR supports pretty much any frequency, you need the correct antenna for what you are doing. The ham bands that bounce off the stratosphere to allow you to talk to people halfway around the world use quite long wavelengths. The longer the wavelength, the larger the antenna you need to receive them. Don’t expect to receive anything from the 20 meter band without a good sized antenna. That doesn’t mean it has to be expensive, you can get good results using a dipole or end-fed antenna, both of these are just made out of wires, but you do have to string them high up and facing the right direction.

What About Transmitting?

This RTL-SDR only receives signals. If you want to transmit as well, then you need a more expensive model. These sort of SDR transmitters are very low power, so if you want to be heard, you will need a good linear amplifier, rated for the frequencies you want to use. You will also need a better antenna.

If you transmit you also require a ham radio license and call sign. You are responsible for not causing interference and that you signal doesn’t bleed through to adjacent channels. Since you are assembling this all yourself, an advanced license is required.

Summary

SDR is great fun to play with and there are lots of great projects you can create with this and an inexpensive single board computer. It’s too bad the Raspberry Pi isn’t quite up to the task. However, more powerful Pi competitors like the Jetson Nano run SDR just fine.

Advertisements

Written by smist08

April 16, 2019 at 2:08 am

Playing with CUDA on My NVIDIA Jetson Nano

leave a comment »

Introduction

I reported last time about my new toy, an NVIDIA Jetson Nano Development Kit. I’m pretty familiar with Linux and ARM processors. I even wrote a couple of articles on Assembler programming, here and here. The thing that intrigued be about the Jetson Nano is its 128 Maxwell GPU cores. What can I do with these? Sure I can speed up TensorFlow since it uses these automatically. I could probably do the same with OpenGL programs. But what can I do directly?

So I downloaded the CUDA C Programming Guide from NVIDIA’s website to have a look at what is involved.

Setup

The claim is that the microSD image of 64Bit Ubuntu Linux that NVIDIA provides for this computer has all the NVIDIA libraries and utilities you need all pre-installed. The programming guide made it clear that if you need to use the NVIDIA C compiler nvcc to compile your work. But if I typed nvcc at a command prompt, I just got an error that this command wasn’t found. A bit of Googling revealed that everything is installed, but it did it before installation created your user, so you need to add the locations to some PATHS. Adding:

export PATH=${PATH}:/usr/local/cuda/bin
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/usr/local/cuda/lib64

To my .bashrc file got everything working. It also shows where cuda is installed. This is handy since it includes a large collection of samples.

Compiling the deviceQuery sample produced the following output on my Nano:

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "NVIDIA Tegra X1"
  CUDA Driver Version / Runtime Version          10.0 / 10.0
  CUDA Capability Major/Minor version number:    5.3
  Total amount of global memory:                 3957 MBytes (4148756480 bytes)
  ( 1) Multiprocessors, (128) CUDA Cores/MP:     128 CUDA Cores
  GPU Max Clock rate:                            922 MHz (0.92 GHz)
  Memory Clock rate:                             13 Mhz
  Memory Bus Width:                              64-bit
  L2 Cache Size:                                 262144 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 32768
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            Yes
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Compute Preemption:            No
  Supports Cooperative Kernel Launch:            No
  Supports MultiDevice Co-op Kernel Launch:      No
  Device PCI Domain ID / Bus ID / location ID:   0 / 0 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.0, CUDA Runtime Version = 10.0, NumDevs = 1

Result = PASS

This is all good information and what all this data means is explained in NVIDIA’s developer documentation (which is actually pretty good). The deviceQuery sample exercises various information APIs in the CUDA library to tell you all it can about what you are running. If you can compile and run deviceQuery in the samples/1_Utilities folder then you should be good to go.

CUDA Hello World

The 128 NVidia Maxwell cores basically consist of a SIMD computer (Single Instruction Multiple Data). This means you have one instruction that they all execute, but on different data. For instance if you want to add two arrays of 128 floating point numbers you have one instruction, add, and then each processor core adds a different element of the array. NVidia actually calls their processors SIMT meaning single instruction multiple threads, since you can partition the processors to different threads and have the two threads each with a collection of processors doing their SIMD thing at once.

When you write a CUDA program, you have two parts, one is the part that runs on the host CPU and the other is the part that runs on the NVidia GPUs. The NVidia C compiler, NVCC adds a number of extensions to the C language to specify what runs where and provide some more convenient syntaxes for the common things you need to do. For the host parts, NVCC translates its custom syntax into CUDA library calls and then passes the result onto GCC to compile regularly. For the GPU parts, NVCC compiles to an intermediate format called PTX. The reason it does this is to support all the various NVidia GPU models. When the NVidia device driver goes to load this code, it does a just in time compile (which it then caches), where the PTX code is compiled to the correct binary code for your particular set of GPUs.

Here is the skeleton of a simple CUDA program:

// Kernel definition
__global__ void VecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main()
{
    ...
    // Kernel invocation with N threads
    VecAdd<<<1, N>>>(A, B, C);
    ...
}

 

The __global__ identifier specifies the VecAdd routine as to run on the GPU. One instance of this routine will be downloaded to run on N processors. Notice there is no loop to add these vectors, Each processor will be a different thread and the thread’s x member will be used to choose which array element to add.

Then in the main program we call VecAdd using the VecAdd<<>> syntax which indicates we are calling a GPU function with these three arrays (along with the size).

This little example skips the extra steps of copying the arrays to GPU memory or copying the result out of GPU memory. There are quite a few different memory types, and various trade offs for using them.

The complete program for adding two vectors from the samples is at the end of this article.

This example also doesn’t explain how to handles larger arrays or how to do error processing. For these extra levels of complexity, refer to the CUDA C Programming Guide.

The CUDA program here is very short, just doing an addition. If you wanted to say multiply two 10×10 matrices, you would have your CUDA code do the dot product of a row in the first matrix by a column in the second matrix. Then you would have 100 cores execute this code, so the result of the multiplication would be done 100 times faster than just using the host processor. There are a lot of samples on how to do matrix multiplication in the samples and documentation.

Newer CUDA Technologies

The Maxwell GPUs in the Jetson Nano are a bit old and reading and playing with the CUDA libraries revealed a few interesting tidbits on things they are missing. We all know how NVidia has been enhancing their products for gaming and graphics with the introduction of things like real time ray tracing, but the thing of more interest to me is how they’ve been adding features specific to Machine Learning and AI. Even though Google produces their own hardware for accelerating their TensorFlow product in their data centers, NVidia has added specific features that greatly help TensorFlow and other Neural Network programs.

One thing the Maxwell GPU lacks is direct matrix multiplication support, newer GPUs can just do A * B + C as a single instruction, where these are all matrices.

Another thing that NVidia just added is direct support for executing computation graphs. If you worked with the early version of TensorFlow then you know that you construct your model by building a computational graph and then training and executing it. The newest NVidia GPUs can now execute these graphs directly. NVidia has a TensorRT library to move parts of TensorFlow to the GPU, this library does work for the Maxwell GPUs in the Jetson Nano, but is probably way more efficient in the newest, bright and shiny GPUs. Even just using TensorFlow without TensorRT is a great improvement and handles moving the matrix calculations to the GPUs even for the Nano, it just means the libraries have more work to do.

Summary

The GPU cores in a product like the Jetson Nano can be easily utilized using products that support them like TensorFlow or OpenGL, but it’s fun to explore the lower level programming models to see how things are working under the covers. If you are interested in parallel programming on a SIMD type machine, then this is a good way to go.

 

/**
 * Copyright 1993-2015 NVIDIA Corporation.  All rights reserved.
 *
 * Please refer to the NVIDIA end user license agreement (EULA) associated
 * with this source code for terms and conditions that govern your use of
 * this software. Any use, reproduction, disclosure, or distribution of
 * this software and related documentation outside the terms of the EULA
 * is strictly prohibited.
 *
 */

/**
 * Vector addition: C = A + B.
 *
 * This sample is a very basic sample that implements element by element
 * vector addition. It is the same as the sample illustrating Chapter 2
 * of the programming guide with some additions like error checking.
 */

#include <stdio.h>

// For the CUDA runtime routines (prefixed with "cuda_")
#include <cuda_runtime.h>

#include <helper_cuda.h>

/**
 * CUDA Kernel Device code
 *
 * Computes the vector addition of A and B into C. The 3 vectors have the same
 * number of elements numElements.
 */

__global__ void
vectorAdd(const float *A, const float *B, float *C, int numElements)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;

    if (i < numElements)
    {
        C[i] = A[i] + B[i];
    }
}

/**
 * Host main routine
 */

int
main(void)
{
    // Error code to check return values for CUDA calls
    cudaError_t err = cudaSuccess;

    // Print the vector length to be used, and compute its size
    int numElements = 50000;
    size_t size = numElements * sizeof(float);
    printf("[Vector addition of %d elements]\n", numElements);

    // Allocate the host input vector A
    float *h_A = (float *)malloc(size);

    // Allocate the host input vector B
    float *h_B = (float *)malloc(size);

    // Allocate the host output vector C
    float *h_C = (float *)malloc(size);

    // Verify that allocations succeeded
    if (h_A == NULL || h_B == NULL || h_C == NULL)
    {
        fprintf(stderr, "Failed to allocate host vectors!\n");
        exit(EXIT_FAILURE);
    }

    // Initialize the host input vectors
    for (int i = 0; i < numElements; ++i)
    {
        h_A[i] = rand()/(float)RAND_MAX;
        h_B[i] = rand()/(float)RAND_MAX;
    }

    // Allocate the device input vector A
    float *d_A = NULL;
    err = cudaMalloc((void **)&d_A, size);

    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to allocate device vector A (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    // Allocate the device input vector B
    float *d_B = NULL;
    err = cudaMalloc((void **)&d_B, size);

    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to allocate device vector B (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    // Allocate the device output vector C
    float *d_C = NULL;
    err = cudaMalloc((void **)&d_C, size);

    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to allocate device vector C (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    // Copy the host input vectors A and B in host memory to the device input vectors in
    // device memory
    printf("Copy input data from the host memory to the CUDA device\n");
    err = cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);

    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to copy vector A from host to device (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    err = cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to copy vector B from host to device (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    // Launch the Vector Add CUDA Kernel
    int threadsPerBlock = 256;
    int blocksPerGrid =(numElements + threadsPerBlock - 1) / threadsPerBlock;
    printf("CUDA kernel launch with %d blocks of %d threads\n", blocksPerGrid, threadsPerBlock);
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, numElements);
    err = cudaGetLastError();

    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to launch vectorAdd kernel (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    // Copy the device result vector in device memory to the host result vector
    // in host memory.
    printf("Copy output data from the CUDA device to the host memory\n");
    err = cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to copy vector C from device to host (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    // Verify that the result vector is correct
    for (int i = 0; i < numElements; ++i)
    {
        if (fabs(h_A[i] + h_B[i] - h_C[i]) > 1e-5)
        {
            fprintf(stderr, "Result verification failed at element %d!\n", i);
            exit(EXIT_FAILURE);
        }
    }

    printf("Test PASSED\n");

    // Free device global memory
    err = cudaFree(d_A);

    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to free device vector A (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    err = cudaFree(d_B);

    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to free device vector B (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    err = cudaFree(d_C);

    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to free device vector C (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    // Free host memory
    free(h_A);
    free(h_B);
    free(h_C);

    printf("Done\n");
    return 0;
}






Written by smist08

April 3, 2019 at 6:01 pm

Can NVidia Bake a Better Pi Than Raspberry?

with 3 comments

Introduction

I love my Raspberry Pi, but I find it’s limited 1Gig of RAM can be quite restricting. It is still pretty amazing what you can do with these $35 computers. I was disappointed when the Raspberry Foundation announced that the Raspberry Pi 4 is still over a year away, so I started to look at Raspberry Pi alternatives. I wanted something with 4Gig of RAM and a faster ARM processor. I was considering purchasing an Odroid N2, when I saw the press release from NVidia’s Developer Conference that they just released their NVidia Jetson Nano Developer Kit. This board has a faster ARM A57 quad core processor, 4 Gig of RAM plus the bonus of a 128 core Maxwell GPU. The claim being that this is an ideal DIY computer for those interested in AI and machine learning (i.e. me). It showed up for sale on arrow.com, so I bought one and received it via FedEx in 2 days.

Setup

If you already have a Raspberry Pi, setup is easy, since you can unplug things from the Pi and plug them into the Nano, namely the power supply, keyboard, monitor and mouse. Like the Pi, the Nano runs from a microSD card, so I reformatted one of my Pi cards to a download of the variant of Ubuntu Linux that NVidia provides for these. Once the operating system was burned to the microSD card, I plugged it into the Nano and away I went.

One difference from the Pi is that the Nano does not have built in Wifi or Bluetooth. Fortunately the room I’m setting this up in has a wired Internet port, so I went into the garage and found a long Internet cable in my box of random cables, plugged it in and was all connected to the Internet. You can plug a USB Wifi dongle in if you need Wifi, or there is an M.2 E slot (which is hard to access) for an M.2 Wifi card. Just be careful of compatibility, since the drivers need to be compiled for ARM64 Linux.

The board doesn’t come with a case, but the box folds into a stand to hold the board. For now that is how I’m running. If they sell enough of these, I’m sure cases will appear, but you will need to ensure there is enough ventilation for the huge heat sink.

Initial Impressions

The Jetson Nano certainly feels faster than the Raspberry Pi. This is all helped by the faster ARM processor, the quadrupled memory, using the GPU cores for graphics acceleration and that the version of Linux is 64 Bit (unlike Raspbian which is 32 Bit). It ran the pre installed Chromium Browser quite well.

As I installed more software, I found that writing large amounts of data to the microSD card can be a real bottleneck, and I would often have to wait for it to catch up. This is more pronounced than on the Pi, probably because other things are quite slow as well. It would be nice if there was an M.2 M interface for an NVMe SSD drive, but there isn’t. I ordered a faster microSD card (over three times faster than what I have) and hope that helps. I can also try putting some things on a USB SSD, but again this isn’t the fastest.

I tried running the TensorFlow MNIST tutorial program. The version of TensorFlow for this is 1.11. If I want to try TensorFlow 2.0, I’ll have to compile it myself for ARM64, which I haven’t attempted yet. Anyway, TensorFlow automatically used the GPU and executed the tutorial orders of magnitude faster than the Pi (a few minutes versus several hours). So I was impressed with that.

This showed up another gotcha. The GPU cores and CPU share the same memory. So when TensorFlow used the GPU, that took a lot of memory away from the CPU. I was running the tutorial in a Jupyter notebook running locally, so that meant I was running a web server, Chromium, Python, and then TensorFlow with bits on the CPU and GPU. This tended to use up all memory and then things would grind to a halt until garbage collection sorted things out. Running from scratch was fine, but running iteratively felt like it kept hitting a wall. I think the lesson here is that to do machine learning training on this board, I really have to use a lighter Python environment than Jupyter.

The documentation mentions a utility to control the processor speeds of the ARM cores and GPU cores, so you can tune the heat produced. I think this is more for if you embed the board inside something, but beware this sucker can run hot if you keep all the various processors busy.

How is it so Cheap?

The NVidia Jetson Nano costs $99 USD. The Odroid is $79 so it is fairly competitive with other boards trying to be super-Pis. However, it is cheaper than pretty much any NVidia graphics card and even their Nano compute board (which has no ports and costs $129 in quantities of 1000).

The obvious cost saving is no Wifi and no bluetooth. Another is the lack of a SATA or M.2 M interface. It does have a camera interface, a serial interface and a Pi like GPIO block.

The Nano has 128 Maxwell GPU cores. Sounds impressive, but remember most graphics cards have 700 to 4000 cores. Further Maxwell is the oldest supported platform (version 5) where as the newest is the version 7 Volta core.

I think NVidia is keeping the cost low, to get the DIY crowd using their technologies, they’ve seen the success of the Raspberry Pi community and want to duplicate it for their various processor boards. I also think they want to be in the ARM board game, so as better ARM processors come out, they might hope to supplant Intel in producing motherboards for desktop and laptop computers.

Summary

If the Raspberry Pi 4 team can produce something like this for $35 they will have a real winner. I’m enjoying playing with the board and learning what it can do. So far I’ve been pretty impressed. There are some limitations, but given the $100 price tag, I don’t think you can lose. You can play with parallel processing with the GPU cores, you can interface to robots with the GPIO pins, or play with object recognition via the camera interface.

For an DIY board, there are a lot of projects you can take on.