Home A bit of Cuda and Thrust
Post
Cancel

A bit of Cuda and Thrust

I have a limited background in Parallel programming. I took a course during my undergrad as well as the Udacity offering a few years ago. While I’ve been intrigued, for a while the hardware was difficult to get access to or the libraries were difficult to use.

While things aren’t perfect — the driver packages still don’t play very nicely with each other — there has been a marked improvement. The Cuda drivers and libraries can be installed by apt-get, I didn’t have to build a kernel image, and the Thrust library is now included with the Cuda toolkit. As we’ll see this makes it a little bit easier to work with.

In this post I take a look at creating a simple program using Cuda’s core library, Thrust, a serialized version.

The source-code can be found here.

Set Up and Build Config

My computer has a Nvidia GTX 1060 (6GB) graphics card. I’m running Ubuntu 18.04, using the nvidia-driver-390 metapackage.

There were a couple of gotchas I ran into. However, because Thrust is included as part of the Cuda toolkit, I don’t need to link any external libraries. This allowed my final Makefile to be fairly simple.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
CC:=/usr/local/cuda-9.1/bin/nvcc
PROF:=/usr/local/cuda-9.1/bin/nvprof

run: main-cuda.cuda
	./main-cuda.cuda

prof: main-cuda main-cpu main-thrust
	${PROF} ./main-cuda
	@echo
	${PROF} ./main-thrust
	@echo
	time ./main-cuda
	@echo
	time ./main-thrust
	@echo
	time ./main-cpu

main-cuda: main-cuda.cu
	${CC} -ccbin g++-6 main-cuda.cu -o main-cuda

main-thrust: main-thrust.cu
	${CC} -ccbin g++-6 main-thrust.cu -o main-thrust

main-cpu: main-cpu.cpp
${CC} -ccbin g++-6 main-cpu.cpp -o main-cpu

The compiler, referenced as CC, is packaged with Cuda version 9.1 — the maximum supported by my driver version. You’ll also notice the ccbin compiler option for each of the builds. As of this writing, Ubuntu’s default g++ version is 7.3.0 and Cuda 9.1 only supports up to 6.

Serial Version

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
#include <iostream>
#include <math.h>
#include <cstdlib>
#include <vector>

int main () {
  int N = 8000 * 8000; // 800px x 800px image
  int iterations = 10;
  int size = N*sizeof(float);
  float *x, *y, *output;

  x = (float *) malloc(size);
  y = (float *) malloc(size);
  output = (float *) malloc(size);

  for (int i = 0; i < N; i++) {
    x[i] = ((float) std::rand()) / (float) RAND_MAX;
    y[i] = ((float) std::rand()) / (float) RAND_MAX;
  }

  // initialize random arrays
  for (int blerp = 0; blerp < iterations; blerp++) {
    for (int i = 0; i < N; i++) {
      output[i] = x[i] + y[i];
    }
  }

  free(x);
  free(y);
  free(output);

  return 0;
}

The serial version isn’t meant to be well formed C++ code, but rather it mimics the structure of the Cuda program closely. First the memory size of the vectors is calculated. This is 8000x8000 (64M) floating point numbers. Next, the vectors are allocated on the heap using malloc and initialized with random floating point values between 0 and 1. Then the vectors are summed 10 times and the result is are stored in output. Lastly the memory is freed up.

It’s a simple if contrived and inelegant program. On my CPU the run-time is around 4.5 seconds.

Cuda Version

As mentioned above, the cuda version is intentionally similar to the serial version.

The allocation is done with the cudaMallocManaged function instead of malloc, and the free function is replaced by cudaFree. This means that the memory is available on the host (code run on the cpu) and on the device (code run on the gpu).

The way the data is interacted with is quite different.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
#include <iostream>
#include <math.h>
#include <cstdlib>
#include <vector>

int main () {
  int N = 8000 * 8000; // 800px x 800px image
  int iterations = 10;
  int size = N*sizeof(float);
  float *x, *y, *output;

  x = (float *) malloc(size);
  y = (float *) malloc(size);
  output = (float *) malloc(size);

  for (int i = 0; i < N; i++) {
    x[i] = ((float) std::rand()) / (float) RAND_MAX;
    y[i] = ((float) std::rand()) / (float) RAND_MAX;
  }

  // initialize random arrays
  for (int blerp = 0; blerp < iterations; blerp++) {
    for (int i = 0; i < N; i++) {
      output[i] = x[i] + y[i];
    }
  }

  free(x);
  free(y);
  free(output);

  return 0;
}

Notice first that this is a .cu file, so there are some values available that aren’t in .cpp files. The __global__ keyword indicates that the function is available on the host and the device. Both return void, they mutate the shared memory rather than return values.

The first function add, as you might guess, produces the output of the sum operation. This is done element-wise, each kernel is responsible for a subset of the input. This is done by calculating idx which is based on the block location and thread index of this particular kernel. Thus each kernel only performs a single addition operation.

The second function is the random initialization. This function actually uses the thrust API to sample from a normal distribution. There are some other options for doing this on the GPU, but I wasn’t able to find anything that was this easy to use.

The calls to these kernel functions are a little strange looking. On lines 20–22 and 27/28 you’ll see how the functions is called. The function parameters are provided normally, but they are parameterized with <<<numBlocks,blockSize>>> . This expression tells Cuda how many kernels will be used. In our case, we are making blocks of size 256, and then there are 64M / 256 blocks. Cuda then distributes work to all of these kernels. The cudaDeviceSynchronize is kind of like join on threads or await in JavaScript async functions. This causes the program to stop and wait for all of the kernels to finish before the program continues.

This is a bit more work than the serialized version, and way less readable than a well written C++ program. Is the juice worth the squeeze? On my hardware, this is a dramatic yes. The Cuda version finishes in around 0.14s, over 30x faster than the serialized version. If we look at the output of nvprof we can get a bit more detail.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
==21393== Profiling application: ./main-cuda.cuda
==21393== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   73.42%  227.76ms         2  113.88ms  113.37ms  114.40ms  void initRandom<float>(float*, float, float)
                   26.58%  82.465ms        10  8.2465ms  4.8140ms  39.102ms  void add<float>(float*, float*, float*)
      API calls:   61.88%  310.24ms        11  28.204ms  4.8177ms  227.74ms  cudaDeviceSynchronize
                   32.78%  164.31ms         3  54.771ms  49.680us  164.11ms  cudaMallocManaged
                    4.87%  24.428ms         3  8.1428ms  8.1220ms  8.1685ms  cudaFree
                    0.32%  1.6288ms        12  135.74us  19.770us  559.03us  cudaLaunch
                    0.11%  528.26us        94  5.6190us     240ns  229.88us  cuDeviceGetAttribute
                    0.02%  115.80us         1  115.80us  115.80us  115.80us  cuDeviceTotalMem
                    0.01%  53.541us         1  53.541us  53.541us  53.541us  cuDeviceGetName
                    0.00%  8.6900us        36     241ns     130ns  1.3400us  cudaSetupArgument
                    0.00%  6.8300us        12     569ns     250ns  2.1200us  cudaConfigureCall
                    0.00%  1.9900us         3     663ns     240ns  1.4500us  cuDeviceGetCount
0.00% 1.1000us 2 550ns 250ns 850ns cuDeviceGet

The most expensive operations were memory allocation, and device synchronization. And the most expensive kernel operation was the random initialization despite the summation being run 10x more. This is a good demonstration of an important parallel programming principle: communication is expensive.

Thrust Version

The Thrust version differs quite a bit from the Cuda verison. The patterns are more STL C++ idiomatic, rather than C-like.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
struct initRandomPrg {
  float minValue, maxValue;

  __host__ __device__
  initRandomPrg(float _mnV=0.f, float _mxV=1.f) : minValue(_mnV), maxValue(_mxV) {};

  __host__ __device__
  float operator()(const unsigned int n) const {
    thrust::default_random_engine rng;
    thrust::uniform_real_distribution<float> dist(minValue, maxValue);
    rng.discard(n);

    return dist(rng);
  }
};

int main() {
  // some stuff
  auto x = thrust::device_vector<float>(N);
  auto y = thrust::device_vector<float>(N);
  auto output = thrust::device_vector<float>(N);

   // initilize array
  auto index_sequence_begin = thrust::counting_iterator<unsigned int>(0);

  // initialize X
  thrust::transform(
      index_sequence_begin,
      index_sequence_begin + N,
      x.begin(),
      initRandomPrg()
  );

  // initialize Y

  // add them up
  for (int i = 0; i < iterations; i++) {
    thrust::transform(
        x.begin(), x.end(),
        y.begin(),
        output.begin(),
        thrust::plus<float>()
    );
  }
}

The custom kernel operation initRandomPrg is organized as a struct that implements the () operator, meaning it can be executed after initialization. In a JavaScript context I would think of this as a second order function. The first call sets the bounds for the random floats. Executions of the returned function will return a random number sample from a uniform distribution.

The data is stored in device_vectors which are initialized, and can be interacted with much like std::vector objects. Despite their name these vectors can be accessed and mutated from the host as well as the device. It isn’t necessary to calculate the sizeof the vectors, or manually allocate them.

The kernel executions are organized similar to the STL algorithm package functions. There is not declaration of block sizes or counts, and the devices don’t have to be synchronized.

The convenience of not having to manually allocate and de-allocate memory seems like it should come at a cost. However, that appears not to be the case. In multiple trials, the Thrust version appears to run slightly faster than the Cuda version. This could be because of a number of factors — not the least that I’m not familiar extremely with Cuda API performance characteristics.

The output of nvprof is a little more difficult to read.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
==24803== NVPROF is profiling process 24803, command: ./main-thrust
==24803== Profiling application: ./main-thrust
==24803== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   74.98%  157.91ms         2  78.954ms  77.478ms  80.430ms  void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrust::cuda_cub::__transform::unary_transform_f<thrust::counting_iterator<unsigned int, thrust::use_default, thrust::use_default, thrust::use_default>, thrust::detail::normal_iterator<thrust::device_ptr<float>>, thrust::cuda_cub::__transform::no_stencil_tag, initRandomPrg, thrust::cuda_cub::__transform::always_true_predicate>, long>, thrust::cuda_cub::__transform::unary_transform_f<thrust::counting_iterator<unsigned int, thrust::use_default, thrust::use_default, thrust::use_default>, thrust::detail::normal_iterator<thrust::device_ptr<float>>, thrust::cuda_cub::__transform::no_stencil_tag, initRandomPrg, thrust::cuda_cub::__transform::always_true_predicate>, long>(thrust::use_default, thrust::use_default)
                   22.78%  47.984ms        10  4.7984ms  4.7934ms  4.8072ms  void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrust::cuda_cub::__transform::binary_transform_f<thrust::detail::normal_iterator<thrust::device_ptr<float>>, thrust::detail::normal_iterator<thrust::device_ptr<float>>, thrust::detail::normal_iterator<thrust::device_ptr<float>>, thrust::cuda_cub::__transform::no_stencil_tag, addPrg, thrust::cuda_cub::__transform::always_true_predicate>, long>, thrust::cuda_cub::__transform::binary_transform_f<thrust::detail::normal_iterator<thrust::device_ptr<float>>, thrust::detail::normal_iterator<thrust::device_ptr<float>>, thrust::detail::normal_iterator<thrust::device_ptr<float>>, thrust::cuda_cub::__transform::no_stencil_tag, addPrg, thrust::cuda_cub::__transform::always_true_predicate>, long>(thrust::device_ptr<float>, thrust::detail::normal_iterator<thrust::device_ptr<float>>)
                    2.24%  4.7185ms         3  1.5728ms  1.5658ms  1.5834ms  void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrust::cuda_cub::__uninitialized_fill::functor<thrust::device_ptr<float>, float>, unsigned long>, thrust::cuda_cub::__uninitialized_fill::functor<thrust::device_ptr<float>, float>, unsigned long>(thrust::device_ptr<float>, float)
      API calls:   58.03%  216.79ms         3  72.265ms  2.3659ms  210.11ms  cudaFree
                   41.66%  155.63ms         3  51.877ms  310.45us  155.00ms  cudaMalloc
                    0.14%  525.39us        94  5.5890us     600ns  211.77us  cuDeviceGetAttribute
                    0.08%  287.94us         1  287.94us  287.94us  287.94us  cuDeviceTotalMem
                    0.04%  131.16us        15  8.7440us  6.1100us  25.190us  cudaLaunch
                    0.02%  75.650us         1  75.650us  75.650us  75.650us  cuDeviceGetName
                    0.02%  72.710us        15  4.8470us  4.4100us  7.9100us  cudaFuncGetAttributes
                    0.00%  10.040us        15     669ns     560ns  1.4200us  cudaGetDevice
                    0.00%  9.3600us        15     624ns     530ns  1.2200us  cudaDeviceGetAttribute
                    0.00%  5.8600us        30     195ns     140ns     480ns  cudaSetupArgument
                    0.00%  4.9900us        30     166ns     130ns     330ns  cudaPeekAtLastError
                    0.00%  4.1000us         3  1.3660us     650ns  2.5200us  cuDeviceGetCount
                    0.00%  3.6800us        15     245ns     180ns     780ns  cudaConfigureCall
0.00% 2.1800us 2 1.0900us 640ns 1.5400us cuDeviceGet

The function names are generated monstrosities rather than the easy to read ones from before. I’m assuming the most expensive GPU operation is the random initialization, but I’m not sure. And I don’t know what the third GPU operation is. At least the API calls are still broken out nicely.

There are no calls to cudaDeviceSynchroniz. A more intelligent synchronization could explain why the Thrust version was a little faster.

Conclusions

Thrust appears to be supported internally by Nvidia. Despite the appearance on the project’s former homepage, it is going to be under active development going forward.

From my perspective Thrust has a lot of value. With simple allocation, deallocation, block sizing, and synchronization, there are fewer opportunities for me to shoot myself in the foot. Also there are a few nice algorithms available.

Assuming I can come up with a suitable application(some kind of simulation perhaps), I’m anxious to dig more into Cuda, Thrust, and the other libraries Nvidia has provided.

-->