Stephen Smith's Blog

Musings on Machine Learning…

Playing with CUDA on My NVIDIA Jetson Nano

with 2 comments


I reported last time about my new toy, an NVIDIA Jetson Nano Development Kit. I’m pretty familiar with Linux and ARM processors. I even wrote a couple of articles on Assembler programming, here and here. The thing that intrigued be about the Jetson Nano is its 128 Maxwell GPU cores. What can I do with these? Sure I can speed up TensorFlow since it uses these automatically. I could probably do the same with OpenGL programs. But what can I do directly?

So I downloaded the CUDA C Programming Guide from NVIDIA’s website to have a look at what is involved.


The claim is that the microSD image of 64Bit Ubuntu Linux that NVIDIA provides for this computer has all the NVIDIA libraries and utilities you need all pre-installed. The programming guide made it clear that if you need to use the NVIDIA C compiler nvcc to compile your work. But if I typed nvcc at a command prompt, I just got an error that this command wasn’t found. A bit of Googling revealed that everything is installed, but it did it before installation created your user, so you need to add the locations to some PATHS. Adding:

export PATH=${PATH}:/usr/local/cuda/bin
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/usr/local/cuda/lib64

To my .bashrc file got everything working. It also shows where cuda is installed. This is handy since it includes a large collection of samples.

Compiling the deviceQuery sample produced the following output on my Nano:

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "NVIDIA Tegra X1"
  CUDA Driver Version / Runtime Version          10.0 / 10.0
  CUDA Capability Major/Minor version number:    5.3
  Total amount of global memory:                 3957 MBytes (4148756480 bytes)
  ( 1) Multiprocessors, (128) CUDA Cores/MP:     128 CUDA Cores
  GPU Max Clock rate:                            922 MHz (0.92 GHz)
  Memory Clock rate:                             13 Mhz
  Memory Bus Width:                              64-bit
  L2 Cache Size:                                 262144 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 32768
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            Yes
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Compute Preemption:            No
  Supports Cooperative Kernel Launch:            No
  Supports MultiDevice Co-op Kernel Launch:      No
  Device PCI Domain ID / Bus ID / location ID:   0 / 0 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.0, CUDA Runtime Version = 10.0, NumDevs = 1

Result = PASS

This is all good information and what all this data means is explained in NVIDIA’s developer documentation (which is actually pretty good). The deviceQuery sample exercises various information APIs in the CUDA library to tell you all it can about what you are running. If you can compile and run deviceQuery in the samples/1_Utilities folder then you should be good to go.

CUDA Hello World

The 128 NVidia Maxwell cores basically consist of a SIMD computer (Single Instruction Multiple Data). This means you have one instruction that they all execute, but on different data. For instance if you want to add two arrays of 128 floating point numbers you have one instruction, add, and then each processor core adds a different element of the array. NVidia actually calls their processors SIMT meaning single instruction multiple threads, since you can partition the processors to different threads and have the two threads each with a collection of processors doing their SIMD thing at once.

When you write a CUDA program, you have two parts, one is the part that runs on the host CPU and the other is the part that runs on the NVidia GPUs. The NVidia C compiler, NVCC adds a number of extensions to the C language to specify what runs where and provide some more convenient syntaxes for the common things you need to do. For the host parts, NVCC translates its custom syntax into CUDA library calls and then passes the result onto GCC to compile regularly. For the GPU parts, NVCC compiles to an intermediate format called PTX. The reason it does this is to support all the various NVidia GPU models. When the NVidia device driver goes to load this code, it does a just in time compile (which it then caches), where the PTX code is compiled to the correct binary code for your particular set of GPUs.

Here is the skeleton of a simple CUDA program:

// Kernel definition
__global__ void VecAdd(float* A, float* B, float* C)
    int i = threadIdx.x;
    C[i] = A[i] + B[i];

int main()
    // Kernel invocation with N threads
    VecAdd<<<1, N>>>(A, B, C);


The __global__ identifier specifies the VecAdd routine as to run on the GPU. One instance of this routine will be downloaded to run on N processors. Notice there is no loop to add these vectors, Each processor will be a different thread and the thread’s x member will be used to choose which array element to add.

Then in the main program we call VecAdd using the VecAdd<<>> syntax which indicates we are calling a GPU function with these three arrays (along with the size).

This little example skips the extra steps of copying the arrays to GPU memory or copying the result out of GPU memory. There are quite a few different memory types, and various trade offs for using them.

The complete program for adding two vectors from the samples is at the end of this article.

This example also doesn’t explain how to handles larger arrays or how to do error processing. For these extra levels of complexity, refer to the CUDA C Programming Guide.

The CUDA program here is very short, just doing an addition. If you wanted to say multiply two 10×10 matrices, you would have your CUDA code do the dot product of a row in the first matrix by a column in the second matrix. Then you would have 100 cores execute this code, so the result of the multiplication would be done 100 times faster than just using the host processor. There are a lot of samples on how to do matrix multiplication in the samples and documentation.

Newer CUDA Technologies

The Maxwell GPUs in the Jetson Nano are a bit old and reading and playing with the CUDA libraries revealed a few interesting tidbits on things they are missing. We all know how NVidia has been enhancing their products for gaming and graphics with the introduction of things like real time ray tracing, but the thing of more interest to me is how they’ve been adding features specific to Machine Learning and AI. Even though Google produces their own hardware for accelerating their TensorFlow product in their data centers, NVidia has added specific features that greatly help TensorFlow and other Neural Network programs.

One thing the Maxwell GPU lacks is direct matrix multiplication support, newer GPUs can just do A * B + C as a single instruction, where these are all matrices.

Another thing that NVidia just added is direct support for executing computation graphs. If you worked with the early version of TensorFlow then you know that you construct your model by building a computational graph and then training and executing it. The newest NVidia GPUs can now execute these graphs directly. NVidia has a TensorRT library to move parts of TensorFlow to the GPU, this library does work for the Maxwell GPUs in the Jetson Nano, but is probably way more efficient in the newest, bright and shiny GPUs. Even just using TensorFlow without TensorRT is a great improvement and handles moving the matrix calculations to the GPUs even for the Nano, it just means the libraries have more work to do.


The GPU cores in a product like the Jetson Nano can be easily utilized using products that support them like TensorFlow or OpenGL, but it’s fun to explore the lower level programming models to see how things are working under the covers. If you are interested in parallel programming on a SIMD type machine, then this is a good way to go.


 * Copyright 1993-2015 NVIDIA Corporation.  All rights reserved.
 * Please refer to the NVIDIA end user license agreement (EULA) associated
 * with this source code for terms and conditions that govern your use of
 * this software. Any use, reproduction, disclosure, or distribution of
 * this software and related documentation outside the terms of the EULA
 * is strictly prohibited.

 * Vector addition: C = A + B.
 * This sample is a very basic sample that implements element by element
 * vector addition. It is the same as the sample illustrating Chapter 2
 * of the programming guide with some additions like error checking.

#include <stdio.h>

// For the CUDA runtime routines (prefixed with "cuda_")
#include <cuda_runtime.h>

#include <helper_cuda.h>

 * CUDA Kernel Device code
 * Computes the vector addition of A and B into C. The 3 vectors have the same
 * number of elements numElements.

__global__ void
vectorAdd(const float *A, const float *B, float *C, int numElements)
    int i = blockDim.x * blockIdx.x + threadIdx.x;

    if (i < numElements)
        C[i] = A[i] + B[i];

 * Host main routine

    // Error code to check return values for CUDA calls
    cudaError_t err = cudaSuccess;

    // Print the vector length to be used, and compute its size
    int numElements = 50000;
    size_t size = numElements * sizeof(float);
    printf("[Vector addition of %d elements]\n", numElements);

    // Allocate the host input vector A
    float *h_A = (float *)malloc(size);

    // Allocate the host input vector B
    float *h_B = (float *)malloc(size);

    // Allocate the host output vector C
    float *h_C = (float *)malloc(size);

    // Verify that allocations succeeded
    if (h_A == NULL || h_B == NULL || h_C == NULL)
        fprintf(stderr, "Failed to allocate host vectors!\n");

    // Initialize the host input vectors
    for (int i = 0; i < numElements; ++i)
        h_A[i] = rand()/(float)RAND_MAX;
        h_B[i] = rand()/(float)RAND_MAX;

    // Allocate the device input vector A
    float *d_A = NULL;
    err = cudaMalloc((void **)&d_A, size);

    if (err != cudaSuccess)
        fprintf(stderr, "Failed to allocate device vector A (error code %s)!\n", cudaGetErrorString(err));

    // Allocate the device input vector B
    float *d_B = NULL;
    err = cudaMalloc((void **)&d_B, size);

    if (err != cudaSuccess)
        fprintf(stderr, "Failed to allocate device vector B (error code %s)!\n", cudaGetErrorString(err));

    // Allocate the device output vector C
    float *d_C = NULL;
    err = cudaMalloc((void **)&d_C, size);

    if (err != cudaSuccess)
        fprintf(stderr, "Failed to allocate device vector C (error code %s)!\n", cudaGetErrorString(err));

    // Copy the host input vectors A and B in host memory to the device input vectors in
    // device memory
    printf("Copy input data from the host memory to the CUDA device\n");
    err = cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);

    if (err != cudaSuccess)
        fprintf(stderr, "Failed to copy vector A from host to device (error code %s)!\n", cudaGetErrorString(err));

    err = cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    if (err != cudaSuccess)
        fprintf(stderr, "Failed to copy vector B from host to device (error code %s)!\n", cudaGetErrorString(err));

    // Launch the Vector Add CUDA Kernel
    int threadsPerBlock = 256;
    int blocksPerGrid =(numElements + threadsPerBlock - 1) / threadsPerBlock;
    printf("CUDA kernel launch with %d blocks of %d threads\n", blocksPerGrid, threadsPerBlock);
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, numElements);
    err = cudaGetLastError();

    if (err != cudaSuccess)
        fprintf(stderr, "Failed to launch vectorAdd kernel (error code %s)!\n", cudaGetErrorString(err));

    // Copy the device result vector in device memory to the host result vector
    // in host memory.
    printf("Copy output data from the CUDA device to the host memory\n");
    err = cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

    if (err != cudaSuccess)
        fprintf(stderr, "Failed to copy vector C from device to host (error code %s)!\n", cudaGetErrorString(err));

    // Verify that the result vector is correct
    for (int i = 0; i < numElements; ++i)
        if (fabs(h_A[i] + h_B[i] - h_C[i]) > 1e-5)
            fprintf(stderr, "Result verification failed at element %d!\n", i);

    printf("Test PASSED\n");

    // Free device global memory
    err = cudaFree(d_A);

    if (err != cudaSuccess)
        fprintf(stderr, "Failed to free device vector A (error code %s)!\n", cudaGetErrorString(err));

    err = cudaFree(d_B);

    if (err != cudaSuccess)
        fprintf(stderr, "Failed to free device vector B (error code %s)!\n", cudaGetErrorString(err));

    err = cudaFree(d_C);

    if (err != cudaSuccess)
        fprintf(stderr, "Failed to free device vector C (error code %s)!\n", cudaGetErrorString(err));

    // Free host memory

    return 0;

Written by smist08

April 3, 2019 at 6:01 pm

2 Responses

Subscribe to comments with RSS.

  1. Thanks man, this saved me some time finding out why nvcc wasn’t working.

    Christian Ruocco

    June 10, 2020 at 10:18 am

  2. […] year, I blogged on playing with CUDA on my nVidia Jetson Nano. I recently bought a new laptop which contains an […]

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.

%d bloggers like this: