Stephen Smith's Blog

Musings on Machine Learning…

Posts Tagged ‘nvidia

The Ups and Downs of Chip Manufacturing

with 3 comments


Previously, we blogged about a number of successes in the ARM world: being adopted for the next generation Macs, moving into the server and supercomputer markets and of course continuing to dominate the mobile world. ARM is owned by the Japanese conglomerate holding company Softbank. ARM is Softbank’s big success, balanced with a number of major failures such as WeWork. With ARM’s current success, Softbank is considering whether this is a good time to sell off ARM at a big profit.

Meanwhile, Intel just reported their quarterly earnings and along with those, announced their 7nm next generation manufacturing process has been delayed until 2022. As a result Intel’s stock price took a major haircut and it seems Intel is even considering outsourcing their manufacturing.

In this article, we’ll look at the ramifications that we might see over the next year or so.

Who Might Buy ARM?

Softbank paid $32 billion to acquire ARM in 2016 and it is suspected that Softbank is trying to get in excess of $40 billion from the sale. Softbank is considering spawning ARM off as a separate company via an IPO, or selling it to either nVidia or Apple.

It is reported that Apple has already said that it isn’t interested in buying ARM. The main reason is that Apple only licenses the ARM ISA (Instruction Set Architecture) and not the CPU circuitry designs, so it doesn’t gain that much and will bring on anti-trust attention if it buys the company that designs the processors for all its competitors. Even if Apple has good intentions to its competitors, it doesn’t want to deal with them, as buying ARM would force it to.

nVidia is the main interested party. nVidia develops graphics cards that are used for AI applications and playing video games. The explosion in AI has greatly benefited nVidia, but nVidia faces a struggle as its main competitors in the graphics/AI world, namely Intel and AMD both make CPUs and have been including more and more graphics processing on the core CPU chips. nVidia would dearly like to enter the CPU market and be able to compete with newer AMD CPU/GPU chips. Intel has struggled of late, but even their built in Intel graphics are good enough for most people, meaning they don’t feel like they need an nVidia product. nVidia already has a lot of experience with ARM, we’ve blogged about their nVidia Jetson Nano, which is just one in a line of processor boards with integrated nVidia SIMD cores.

nVidia is large enough that they can afford to buy ARM, but the question is whether nVidia will be a good steward of the ARM technology portfolio? At least Softbank, being a holding company has largely left ARM alone to do its thing. One question will be how hands-on nVidia is if they do acquire ARM, or will they allow it independence. Will they try to milk more revenue from ARM, raising prices for everyone? Will they force the inclusion of nVidia technology. For instance ARM designs their own GPU called Mali, will nVidia mandate this be replaced by nVidia GPU technology? Personally, I feel that GPU is a weak spot for ARM and that a migration to the excellent nVidia graphics cores will be a big benefit. But this will take time, since most software expects Mali or one of its competitors.

If this deal does pass all the complicated regulatory and financial hurdles, only time will tell if this turns out to be a good thing for ARM.

Intel’s Manufacturing Problems

Intel has been struggling to release its 10nm based chips and they are just starting to come out. Meanwhile, AMD, Apple and others have been having their chips manufactured by TSMC using 7nm technology for almost a year now. Intel claims their 10nm technology is as good as TSMC’s 7nm process, but independent analysis shows that TSMC is beating Intel in the number of transistors per square millimeter (hence denser chips with more transistors), while using less power and generating less heat. Add to that, Intel announcing their 7nm technology has been deleted to 2022, and that TSMC is starting to produce 5nm based chips now.

These problems caused Apple to hurry up their switch from Intel to ARM for their Mac computers. It has also resulted in tremendous growth for AMD which produces Intel compatible chips using TSMC technology.

With my first job after University, they sent me on a 2 week course at Intel in Santa Clara on using their embedded processors (at that time the 80186). The course started with an Intel marketing video on how Intel was centered on their chip manufacturing process technology, how this was their crown jewel and the core of everything they did. Everything else was based on their excellence in manufacturing chips. This certainly remained true for many years, but recently a chink has appeared in Intel’s armour as TSMC has passed Intel in processing technology.

At Intel’s earnings call, Bob Swan, CEO said the unthinkable that Intel was considering outsourcing their manufacturing to TSMC as well. Would this work for Intel? Will their chip designs which are optimized for their process technology work on TSMC’s? Is it in TSMC’s interest to ramp up for Intel, when Intel is likely to take the manufacturing back in-house down the road? None of these questions were answered by Bob Swan on the call. Another question is whether Intel would exit the chip manufacturing business forever? How would this affect Intel long term? Is Intel’s chip design capability competitive, without a process technology advantage to help them? These are all hard questions and Intel is going to have to find some answers to all these or face an accelerating decline over the next few years.


Just when ARM is on a roll, Softbank has thrown it a curved ball by offering it up for sale. Whether anything comes of this is yet to be seen, if it does happen, I think nVidia would be the best choice to acquire ARM and that long term it would be a good thing. I think nVidia has shown how to do excellent SoCs based on ARM processors and nVidia GPU cores with products like the Jetson Nano and that hopefully nVidia can be a good steward for ARM.

Intel certainly faces some challenges in the coming months. AMD is eating away at their market share and getting their new chips to market seems to be getting more and more challenging. Hopefully Intel can find a solution to their problems, but these things can take several years and billions in investment to turn around.

To learn more about the internal architecture of the ARM Processor, consider my book: Programming with 64-Bit ARM Assembly Language.

Written by smist08

July 24, 2020 at 11:35 am

Playing with CUDA on my Gaming Laptop

leave a comment »

Playing with CUDA on my Gaming Laptop


Last year, I blogged on playing with CUDA on my nVidia Jetson Nano. I recently bought a new laptop which contains an nVidia GTX1650 graphics card with 4Gig of RAM. This is more powerful than the coprocessor built into the Jetson Nano.  I took advantage of the release of newer Intel 10th generation processors along with the wider availability of newer nVidia RTX graphics cards to get a good deal on a gaming laptop with an Intel 9th generation processor and nVidia GTX graphics. This is still a very fast laptop with 16Gig of RAM and runs the couple of video games I’ve tried fine. It also compiles and handles my normal projects easily. In this blog post, I’ll repeat a lot of my previous article on the nVidia Jetson, but in the context of running on Windows 10 with an Intel CPU.

I wanted an nVidia graphics card because these have the best software support for graphics, gaming, AI, machine learning and parallel programming. If you use Tensorflow for AI, then it uses the nVidia graphics card automatically. All the versions of DirectX support nVidia and if you are doing general parallel programming then you can use a system like OpenCL. I find nVidia leads AMD in software support and Intel is going to have a lot of trouble with their new Xe graphics cards reaching this same level of software support.


On Windows, most developers use Visual Studio. I could do this all with GCC, but this is more difficult, since when you install the SDK for CUDA, you get all the samples and documentation for Visual Studio. The good news is that you can use Visual Studio Community Edition which is free and actually quite good. Installing Visual Studio is straightforward, just time consuming since it is large.

Next up, you need to install nVidia’s CUDA toolkit. Again, this is straightforward, just large. Although the install is large, you likely have all the drivers already installed, so you are mostly getting the developer tools and samples out of this.

Performing these installs and then dealing with the program upgrades, really makes me miss Linux’s package managers. On Linux, you can upgrade all the software on your computer with one command on a regular basis. On Windows, each program checks for upgrades when it starts and usually wants to upgrade itself before you do any work. I find that this is a real productivity killer on Windows. Microsoft is starting work on a package manager for Windows, but at this point it does little.

Compiling the deviceQuery sample produced the following output on my gaming laptop:

CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "GeForce GTX 1650 with Max-Q Design"
  CUDA Driver Version / Runtime Version          11.0 / 11.0
  CUDA Capability Major/Minor version number:    7.5
  Total amount of global memory:                 4096 MBytes (4294967296 bytes)
  (16) Multiprocessors, ( 64) CUDA Cores/MP:     1024 CUDA Cores
  GPU Max Clock rate:                            1245 MHz (1.25 GHz)
  Memory Clock rate:                             3501 Mhz
  Memory Bus Width:                              128-bit
  L2 Cache Size:                                 1048576 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1024
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 6 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  CUDA Device Driver Mode (TCC or WDDM):         WDDM (Windows Display Driver Model)
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      No
  Device PCI Domain ID / Bus ID / location ID:   0 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 11.0, CUDA Runtime Version = 11.0, NumDevs = 1
Result = PASS

If we compare this to the nVidia Jetson Nano, we see everything is better. The GTX 1650 is based on the newer Turing architecture and the memory is local to the graphics card and not shared with the CPU. The big difference is that we have 1024 CUDA cores, rather than the Jetson’s 128. This means we can perform 1024 operations in parallel for SIMD operations.

CUDA Samples

The CUDA toolkit includes a large selection of sample programs, in the Jetson Nano article we listed the vector addition sample. Compiling and running this on Windows is easy in Visual Studio. These samples are a great source of starting points for your own projects. 

Programming for Portability

If you are writing a specialized program and want the maximum performance on specialized hardware, it makes sense to write directly to nVidia’s CUDA API. However, most software developers want to have their programs to run on as many computers out in the world as possible. The solution is to write to a higher level API that then has drivers for different popular hardware.

For instance, if you are creating a video game, you could write to the DirectX interface and then your program can run on newer versions of Windows on a wide variety of GPUs from different vendors. If you don’t want to be limited to Windows, you could use a portable graphics API like OpenGL. You can also go higher level and create your game in a system like UnReal Engine or Unity. These then have different drivers to run on DirectX, MacOS, Linux, mobile devices or even in web browsers.

If you are creating an AI or Machine Learning application, you can use a library like Tensorflow or PyTorch which have drivers for all sorts of different hardware. You just need to ensure their support is as broad as the market you are trying to reach.

If you are doing something more general or completely new, you can consider a general parallel processing library like OpenCL which has support for all sorts of devices, including the limited SIMD coprocessors included with most modern CPUs. A good example of a program that uses OpenCL is Folding@Home which I blogged on here.


Modern GPUs are powerful and flexible computing devices. They have high speed memory and often thousands of processing cores to work on your task. Libraries to make use of this computing power are getting better and better allowing you to leverage this horsepower in your applications, whether they are graphics related or not. Today’s programmers need to have the tools to harness these powerful devices, so the applications they are working on can reach their true potential.

Written by smist08

June 20, 2020 at 1:43 pm

Playing with Software Defined Radio

with 5 comments


Most Ham Radios these days, receive signals through an antenna, convert the signal to digital, process the signal with a built-in computer, and then output the result converting back to analog for the speaker. This trend to doing all the radio signal processing in software instead of using electronic components is called Software Defined Radio (SDR). The ICOM 7300 is built around SDR as are all the expensive Flex Radios.

Inexpensive SDR

Some clever hackers figured out that an inexpensive chip used in boards to receive TV into a computer, could actually tune to any frequency. From this discovery, many inexpensive USB dongles have been produced that utilize this “TV Tuner” chip, but to tune radio instead of TV. This is possible because all this chip does is receive a signal from an antenna and then convert it to digital for the computer to process. I purchased the RTL-SDR dongle for around $30 which included a small VHF/UHF antenna.

I run Linux, both on my laptop and on a Raspberry Pi. I looked around for software to use with this device and found several candidates. I chose CubicSDR because it easily installed from the Ubuntu App store on both my laptop and on my Raspberry Pi.

I tried it first on the Pi, but it just didn’t work well. It would keep hanging and the sound was never good. I then tried it on my laptop and it worked great. This led me to believe that the Raspberry Pi just doesn’t have the horsepower to run this sort of system. Either due to lack of memory (only having 1Gig) or that the ARM processor isn’t quite powerful enough. Doing some reading online, the consensus seemed to be that you couldn’t run both the radio software and a GUI on the same Pi. You needed to either have two Pi’s or use a command line version of the software. I was disappointed the Pi wasn’t up to the challenge, but got along just fine using my laptop.

Enter the NVidia Jetson Nano

I recently acquired an NVidia Jetson Nano Developers Kit. This is similar to a Raspberry Pi, but with a more powerful quad-core ARM processor, 4Gig or RAM and 120 Tegra NVidia GPU processors (it also costs $99 rather than $35).

I installed CubicSDR on this, and it worked right away like a charm. I was impressed, getting software for the Nano can sometime be difficult since it runs true 64-Bit Ubuntu Linux on ARM, so you need to have that built. But CubicSDR was in the App Store and installed with no problem. I fired it up and it recognized the RTL-SDR and it recognized the NVidia Tegra GPUs. It took over 10 of them for doing its signal processing and worked really well.

Below is a screenshot of CubicSDR playing an FM radio station.


CubicSDR is open source and free, it uses GNURadio under the covers (low level open source radio processing). CubicSDR has quite an impressive display. Like fancy high end radios you can see what is happening on the frequencies around where you are tuned in. The interface can be a bit cryptic and you need to refer to the documentation to do some things. For instance the volume, doesn’t honor the system setting and you have to use the green slider in the upper right. Knowing what the various sliders do is quite helpful. Tuning frequencies is a bit tricky at first, but once you check the manual and play with it, it becomes easy. Using CubicSDR really is like using a high end radio, just for a fraction of the cost.

It is certainly helpful to know ham terminology and to know what radio protocol is used where. For instance most VHF communications use narrow band FM. Most longer wavelength ham communications are either upper or lower sideband. Aeronautical uses AM. Commercial FM stations use wide band FM.


Although the RTL-SDR supports pretty much any frequency, you need the correct antenna for what you are doing. The ham bands that bounce off the stratosphere to allow you to talk to people halfway around the world use quite long wavelengths. The longer the wavelength, the larger the antenna you need to receive them. Don’t expect to receive anything from the 20 meter band without a good sized antenna. That doesn’t mean it has to be expensive, you can get good results using a dipole or end-fed antenna, both of these are just made out of wires, but you do have to string them high up and facing the right direction.

What About Transmitting?

This RTL-SDR only receives signals. If you want to transmit as well, then you need a more expensive model. These sort of SDR transmitters are very low power, so if you want to be heard, you will need a good linear amplifier, rated for the frequencies you want to use. You will also need a better antenna.

If you transmit you also require a ham radio license and call sign. You are responsible for not causing interference and that you signal doesn’t bleed through to adjacent channels. Since you are assembling this all yourself, an advanced license is required.


SDR is great fun to play with and there are lots of great projects you can create with this and an inexpensive single board computer. It’s too bad the Raspberry Pi isn’t quite up to the task. However, more powerful Pi competitors like the Jetson Nano run SDR just fine.

Written by smist08

April 16, 2019 at 2:08 am

Playing with CUDA on My NVIDIA Jetson Nano

with 2 comments


I reported last time about my new toy, an NVIDIA Jetson Nano Development Kit. I’m pretty familiar with Linux and ARM processors. I even wrote a couple of articles on Assembler programming, here and here. The thing that intrigued be about the Jetson Nano is its 128 Maxwell GPU cores. What can I do with these? Sure I can speed up TensorFlow since it uses these automatically. I could probably do the same with OpenGL programs. But what can I do directly?

So I downloaded the CUDA C Programming Guide from NVIDIA’s website to have a look at what is involved.


The claim is that the microSD image of 64Bit Ubuntu Linux that NVIDIA provides for this computer has all the NVIDIA libraries and utilities you need all pre-installed. The programming guide made it clear that if you need to use the NVIDIA C compiler nvcc to compile your work. But if I typed nvcc at a command prompt, I just got an error that this command wasn’t found. A bit of Googling revealed that everything is installed, but it did it before installation created your user, so you need to add the locations to some PATHS. Adding:

export PATH=${PATH}:/usr/local/cuda/bin
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/usr/local/cuda/lib64

To my .bashrc file got everything working. It also shows where cuda is installed. This is handy since it includes a large collection of samples.

Compiling the deviceQuery sample produced the following output on my Nano:

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "NVIDIA Tegra X1"
  CUDA Driver Version / Runtime Version          10.0 / 10.0
  CUDA Capability Major/Minor version number:    5.3
  Total amount of global memory:                 3957 MBytes (4148756480 bytes)
  ( 1) Multiprocessors, (128) CUDA Cores/MP:     128 CUDA Cores
  GPU Max Clock rate:                            922 MHz (0.92 GHz)
  Memory Clock rate:                             13 Mhz
  Memory Bus Width:                              64-bit
  L2 Cache Size:                                 262144 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 32768
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            Yes
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Compute Preemption:            No
  Supports Cooperative Kernel Launch:            No
  Supports MultiDevice Co-op Kernel Launch:      No
  Device PCI Domain ID / Bus ID / location ID:   0 / 0 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.0, CUDA Runtime Version = 10.0, NumDevs = 1

Result = PASS

This is all good information and what all this data means is explained in NVIDIA’s developer documentation (which is actually pretty good). The deviceQuery sample exercises various information APIs in the CUDA library to tell you all it can about what you are running. If you can compile and run deviceQuery in the samples/1_Utilities folder then you should be good to go.

CUDA Hello World

The 128 NVidia Maxwell cores basically consist of a SIMD computer (Single Instruction Multiple Data). This means you have one instruction that they all execute, but on different data. For instance if you want to add two arrays of 128 floating point numbers you have one instruction, add, and then each processor core adds a different element of the array. NVidia actually calls their processors SIMT meaning single instruction multiple threads, since you can partition the processors to different threads and have the two threads each with a collection of processors doing their SIMD thing at once.

When you write a CUDA program, you have two parts, one is the part that runs on the host CPU and the other is the part that runs on the NVidia GPUs. The NVidia C compiler, NVCC adds a number of extensions to the C language to specify what runs where and provide some more convenient syntaxes for the common things you need to do. For the host parts, NVCC translates its custom syntax into CUDA library calls and then passes the result onto GCC to compile regularly. For the GPU parts, NVCC compiles to an intermediate format called PTX. The reason it does this is to support all the various NVidia GPU models. When the NVidia device driver goes to load this code, it does a just in time compile (which it then caches), where the PTX code is compiled to the correct binary code for your particular set of GPUs.

Here is the skeleton of a simple CUDA program:

// Kernel definition
__global__ void VecAdd(float* A, float* B, float* C)
    int i = threadIdx.x;
    C[i] = A[i] + B[i];

int main()
    // Kernel invocation with N threads
    VecAdd<<<1, N>>>(A, B, C);


The __global__ identifier specifies the VecAdd routine as to run on the GPU. One instance of this routine will be downloaded to run on N processors. Notice there is no loop to add these vectors, Each processor will be a different thread and the thread’s x member will be used to choose which array element to add.

Then in the main program we call VecAdd using the VecAdd<<>> syntax which indicates we are calling a GPU function with these three arrays (along with the size).

This little example skips the extra steps of copying the arrays to GPU memory or copying the result out of GPU memory. There are quite a few different memory types, and various trade offs for using them.

The complete program for adding two vectors from the samples is at the end of this article.

This example also doesn’t explain how to handles larger arrays or how to do error processing. For these extra levels of complexity, refer to the CUDA C Programming Guide.

The CUDA program here is very short, just doing an addition. If you wanted to say multiply two 10×10 matrices, you would have your CUDA code do the dot product of a row in the first matrix by a column in the second matrix. Then you would have 100 cores execute this code, so the result of the multiplication would be done 100 times faster than just using the host processor. There are a lot of samples on how to do matrix multiplication in the samples and documentation.

Newer CUDA Technologies

The Maxwell GPUs in the Jetson Nano are a bit old and reading and playing with the CUDA libraries revealed a few interesting tidbits on things they are missing. We all know how NVidia has been enhancing their products for gaming and graphics with the introduction of things like real time ray tracing, but the thing of more interest to me is how they’ve been adding features specific to Machine Learning and AI. Even though Google produces their own hardware for accelerating their TensorFlow product in their data centers, NVidia has added specific features that greatly help TensorFlow and other Neural Network programs.

One thing the Maxwell GPU lacks is direct matrix multiplication support, newer GPUs can just do A * B + C as a single instruction, where these are all matrices.

Another thing that NVidia just added is direct support for executing computation graphs. If you worked with the early version of TensorFlow then you know that you construct your model by building a computational graph and then training and executing it. The newest NVidia GPUs can now execute these graphs directly. NVidia has a TensorRT library to move parts of TensorFlow to the GPU, this library does work for the Maxwell GPUs in the Jetson Nano, but is probably way more efficient in the newest, bright and shiny GPUs. Even just using TensorFlow without TensorRT is a great improvement and handles moving the matrix calculations to the GPUs even for the Nano, it just means the libraries have more work to do.


The GPU cores in a product like the Jetson Nano can be easily utilized using products that support them like TensorFlow or OpenGL, but it’s fun to explore the lower level programming models to see how things are working under the covers. If you are interested in parallel programming on a SIMD type machine, then this is a good way to go.


 * Copyright 1993-2015 NVIDIA Corporation.  All rights reserved.
 * Please refer to the NVIDIA end user license agreement (EULA) associated
 * with this source code for terms and conditions that govern your use of
 * this software. Any use, reproduction, disclosure, or distribution of
 * this software and related documentation outside the terms of the EULA
 * is strictly prohibited.

 * Vector addition: C = A + B.
 * This sample is a very basic sample that implements element by element
 * vector addition. It is the same as the sample illustrating Chapter 2
 * of the programming guide with some additions like error checking.

#include <stdio.h>

// For the CUDA runtime routines (prefixed with "cuda_")
#include <cuda_runtime.h>

#include <helper_cuda.h>

 * CUDA Kernel Device code
 * Computes the vector addition of A and B into C. The 3 vectors have the same
 * number of elements numElements.

__global__ void
vectorAdd(const float *A, const float *B, float *C, int numElements)
    int i = blockDim.x * blockIdx.x + threadIdx.x;

    if (i < numElements)
        C[i] = A[i] + B[i];

 * Host main routine

    // Error code to check return values for CUDA calls
    cudaError_t err = cudaSuccess;

    // Print the vector length to be used, and compute its size
    int numElements = 50000;
    size_t size = numElements * sizeof(float);
    printf("[Vector addition of %d elements]\n", numElements);

    // Allocate the host input vector A
    float *h_A = (float *)malloc(size);

    // Allocate the host input vector B
    float *h_B = (float *)malloc(size);

    // Allocate the host output vector C
    float *h_C = (float *)malloc(size);

    // Verify that allocations succeeded
    if (h_A == NULL || h_B == NULL || h_C == NULL)
        fprintf(stderr, "Failed to allocate host vectors!\n");

    // Initialize the host input vectors
    for (int i = 0; i < numElements; ++i)
        h_A[i] = rand()/(float)RAND_MAX;
        h_B[i] = rand()/(float)RAND_MAX;

    // Allocate the device input vector A
    float *d_A = NULL;
    err = cudaMalloc((void **)&d_A, size);

    if (err != cudaSuccess)
        fprintf(stderr, "Failed to allocate device vector A (error code %s)!\n", cudaGetErrorString(err));

    // Allocate the device input vector B
    float *d_B = NULL;
    err = cudaMalloc((void **)&d_B, size);

    if (err != cudaSuccess)
        fprintf(stderr, "Failed to allocate device vector B (error code %s)!\n", cudaGetErrorString(err));

    // Allocate the device output vector C
    float *d_C = NULL;
    err = cudaMalloc((void **)&d_C, size);

    if (err != cudaSuccess)
        fprintf(stderr, "Failed to allocate device vector C (error code %s)!\n", cudaGetErrorString(err));

    // Copy the host input vectors A and B in host memory to the device input vectors in
    // device memory
    printf("Copy input data from the host memory to the CUDA device\n");
    err = cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);

    if (err != cudaSuccess)
        fprintf(stderr, "Failed to copy vector A from host to device (error code %s)!\n", cudaGetErrorString(err));

    err = cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    if (err != cudaSuccess)
        fprintf(stderr, "Failed to copy vector B from host to device (error code %s)!\n", cudaGetErrorString(err));

    // Launch the Vector Add CUDA Kernel
    int threadsPerBlock = 256;
    int blocksPerGrid =(numElements + threadsPerBlock - 1) / threadsPerBlock;
    printf("CUDA kernel launch with %d blocks of %d threads\n", blocksPerGrid, threadsPerBlock);
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, numElements);
    err = cudaGetLastError();

    if (err != cudaSuccess)
        fprintf(stderr, "Failed to launch vectorAdd kernel (error code %s)!\n", cudaGetErrorString(err));

    // Copy the device result vector in device memory to the host result vector
    // in host memory.
    printf("Copy output data from the CUDA device to the host memory\n");
    err = cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

    if (err != cudaSuccess)
        fprintf(stderr, "Failed to copy vector C from device to host (error code %s)!\n", cudaGetErrorString(err));

    // Verify that the result vector is correct
    for (int i = 0; i < numElements; ++i)
        if (fabs(h_A[i] + h_B[i] - h_C[i]) > 1e-5)
            fprintf(stderr, "Result verification failed at element %d!\n", i);

    printf("Test PASSED\n");

    // Free device global memory
    err = cudaFree(d_A);

    if (err != cudaSuccess)
        fprintf(stderr, "Failed to free device vector A (error code %s)!\n", cudaGetErrorString(err));

    err = cudaFree(d_B);

    if (err != cudaSuccess)
        fprintf(stderr, "Failed to free device vector B (error code %s)!\n", cudaGetErrorString(err));

    err = cudaFree(d_C);

    if (err != cudaSuccess)
        fprintf(stderr, "Failed to free device vector C (error code %s)!\n", cudaGetErrorString(err));

    // Free host memory

    return 0;

Written by smist08

April 3, 2019 at 6:01 pm

Can NVidia Bake a Better Pi Than Raspberry?

with 5 comments


I love my Raspberry Pi, but I find it’s limited 1Gig of RAM can be quite restricting. It is still pretty amazing what you can do with these $35 computers. I was disappointed when the Raspberry Foundation announced that the Raspberry Pi 4 is still over a year away, so I started to look at Raspberry Pi alternatives. I wanted something with 4Gig of RAM and a faster ARM processor. I was considering purchasing an Odroid N2, when I saw the press release from NVidia’s Developer Conference that they just released their NVidia Jetson Nano Developer Kit. This board has a faster ARM A57 quad core processor, 4 Gig of RAM plus the bonus of a 128 core Maxwell GPU. The claim being that this is an ideal DIY computer for those interested in AI and machine learning (i.e. me). It showed up for sale on, so I bought one and received it via FedEx in 2 days.


If you already have a Raspberry Pi, setup is easy, since you can unplug things from the Pi and plug them into the Nano, namely the power supply, keyboard, monitor and mouse. Like the Pi, the Nano runs from a microSD card, so I reformatted one of my Pi cards to a download of the variant of Ubuntu Linux that NVidia provides for these. Once the operating system was burned to the microSD card, I plugged it into the Nano and away I went.

One difference from the Pi is that the Nano does not have built in Wifi or Bluetooth. Fortunately the room I’m setting this up in has a wired Internet port, so I went into the garage and found a long Internet cable in my box of random cables, plugged it in and was all connected to the Internet. You can plug a USB Wifi dongle in if you need Wifi, or there is an M.2 E slot (which is hard to access) for an M.2 Wifi card. Just be careful of compatibility, since the drivers need to be compiled for ARM64 Linux.

The board doesn’t come with a case, but the box folds into a stand to hold the board. For now that is how I’m running. If they sell enough of these, I’m sure cases will appear, but you will need to ensure there is enough ventilation for the huge heat sink.

Initial Impressions

The Jetson Nano certainly feels faster than the Raspberry Pi. This is all helped by the faster ARM processor, the quadrupled memory, using the GPU cores for graphics acceleration and that the version of Linux is 64 Bit (unlike Raspbian which is 32 Bit). It ran the pre installed Chromium Browser quite well.

As I installed more software, I found that writing large amounts of data to the microSD card can be a real bottleneck, and I would often have to wait for it to catch up. This is more pronounced than on the Pi, probably because other things are quite slow as well. It would be nice if there was an M.2 M interface for an NVMe SSD drive, but there isn’t. I ordered a faster microSD card (over three times faster than what I have) and hope that helps. I can also try putting some things on a USB SSD, but again this isn’t the fastest.

I tried running the TensorFlow MNIST tutorial program. The version of TensorFlow for this is 1.11. If I want to try TensorFlow 2.0, I’ll have to compile it myself for ARM64, which I haven’t attempted yet. Anyway, TensorFlow automatically used the GPU and executed the tutorial orders of magnitude faster than the Pi (a few minutes versus several hours). So I was impressed with that.

This showed up another gotcha. The GPU cores and CPU share the same memory. So when TensorFlow used the GPU, that took a lot of memory away from the CPU. I was running the tutorial in a Jupyter notebook running locally, so that meant I was running a web server, Chromium, Python, and then TensorFlow with bits on the CPU and GPU. This tended to use up all memory and then things would grind to a halt until garbage collection sorted things out. Running from scratch was fine, but running iteratively felt like it kept hitting a wall. I think the lesson here is that to do machine learning training on this board, I really have to use a lighter Python environment than Jupyter.

The documentation mentions a utility to control the processor speeds of the ARM cores and GPU cores, so you can tune the heat produced. I think this is more for if you embed the board inside something, but beware this sucker can run hot if you keep all the various processors busy.

How is it so Cheap?

The NVidia Jetson Nano costs $99 USD. The Odroid is $79 so it is fairly competitive with other boards trying to be super-Pis. However, it is cheaper than pretty much any NVidia graphics card and even their Nano compute board (which has no ports and costs $129 in quantities of 1000).

The obvious cost saving is no Wifi and no bluetooth. Another is the lack of a SATA or M.2 M interface. It does have a camera interface, a serial interface and a Pi like GPIO block.

The Nano has 128 Maxwell GPU cores. Sounds impressive, but remember most graphics cards have 700 to 4000 cores. Further Maxwell is the oldest supported platform (version 5) where as the newest is the version 7 Volta core.

I think NVidia is keeping the cost low, to get the DIY crowd using their technologies, they’ve seen the success of the Raspberry Pi community and want to duplicate it for their various processor boards. I also think they want to be in the ARM board game, so as better ARM processors come out, they might hope to supplant Intel in producing motherboards for desktop and laptop computers.


If the Raspberry Pi 4 team can produce something like this for $35 they will have a real winner. I’m enjoying playing with the board and learning what it can do. So far I’ve been pretty impressed. There are some limitations, but given the $100 price tag, I don’t think you can lose. You can play with parallel processing with the GPU cores, you can interface to robots with the GPIO pins, or play with object recognition via the camera interface.

For an DIY board, there are a lot of projects you can take on.