I have a limited background in Parallel programming. I took a course during my undergrad as well as the Udacity offering a few years ago. While I’ve been intrigued, for a while the hardware was difficult to get access to or the libraries were difficult to use.
While things aren’t perfect — the driver packages still don’t play very nicely with each other — there has been a marked improvement. The Cuda drivers and libraries can be installed by apt-get
, I didn’t have to build a kernel image, and the Thrust library is now included with the Cuda toolkit. As we’ll see this makes it a little bit easier to work with.
In this post I take a look at creating a simple program using Cuda’s core library, Thrust, a serialized version.
The source-code can be found here.
Set Up and Build Config
My computer has a Nvidia GTX 1060 (6GB) graphics card. I’m running Ubuntu 18.04, using the nvidia-driver-390
metapackage.
There were a couple of gotchas I ran into. However, because Thrust is included as part of the Cuda toolkit, I don’t need to link any external libraries. This allowed my final Makefile
to be fairly simple.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
CC:=/usr/local/cuda-9.1/bin/nvcc
PROF:=/usr/local/cuda-9.1/bin/nvprof
run: main-cuda.cuda
./main-cuda.cuda
prof: main-cuda main-cpu main-thrust
${PROF} ./main-cuda
@echo
${PROF} ./main-thrust
@echo
time ./main-cuda
@echo
time ./main-thrust
@echo
time ./main-cpu
main-cuda: main-cuda.cu
${CC} -ccbin g++-6 main-cuda.cu -o main-cuda
main-thrust: main-thrust.cu
${CC} -ccbin g++-6 main-thrust.cu -o main-thrust
main-cpu: main-cpu.cpp
${CC} -ccbin g++-6 main-cpu.cpp -o main-cpu
The compiler, referenced as CC
, is packaged with Cuda version 9.1 — the maximum supported by my driver version. You’ll also notice the ccbin
compiler option for each of the builds. As of this writing, Ubuntu’s default g++
version is 7.3.0
and Cuda 9.1
only supports up to 6.
Serial Version
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
#include <iostream>
#include <math.h>
#include <cstdlib>
#include <vector>
int main () {
int N = 8000 * 8000; // 800px x 800px image
int iterations = 10;
int size = N*sizeof(float);
float *x, *y, *output;
x = (float *) malloc(size);
y = (float *) malloc(size);
output = (float *) malloc(size);
for (int i = 0; i < N; i++) {
x[i] = ((float) std::rand()) / (float) RAND_MAX;
y[i] = ((float) std::rand()) / (float) RAND_MAX;
}
// initialize random arrays
for (int blerp = 0; blerp < iterations; blerp++) {
for (int i = 0; i < N; i++) {
output[i] = x[i] + y[i];
}
}
free(x);
free(y);
free(output);
return 0;
}
The serial version isn’t meant to be well formed C++ code, but rather it mimics the structure of the Cuda program closely. First the memory size of the vectors is calculated. This is 8000x8000 (64M) floating point numbers. Next, the vectors are allocated on the heap using malloc
and initialized with random floating point values between 0 and 1. Then the vectors are summed 10 times and the result is are stored in output
. Lastly the memory is freed up.
It’s a simple if contrived and inelegant program. On my CPU the run-time is around 4.5 seconds.
Cuda Version
As mentioned above, the cuda version is intentionally similar to the serial version.
The allocation is done with the cudaMallocManaged
function instead of malloc
, and the free
function is replaced by cudaFree
. This means that the memory is available on the host (code run on the cpu) and on the device (code run on the gpu).
The way the data is interacted with is quite different.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
#include <iostream>
#include <math.h>
#include <cstdlib>
#include <vector>
int main () {
int N = 8000 * 8000; // 800px x 800px image
int iterations = 10;
int size = N*sizeof(float);
float *x, *y, *output;
x = (float *) malloc(size);
y = (float *) malloc(size);
output = (float *) malloc(size);
for (int i = 0; i < N; i++) {
x[i] = ((float) std::rand()) / (float) RAND_MAX;
y[i] = ((float) std::rand()) / (float) RAND_MAX;
}
// initialize random arrays
for (int blerp = 0; blerp < iterations; blerp++) {
for (int i = 0; i < N; i++) {
output[i] = x[i] + y[i];
}
}
free(x);
free(y);
free(output);
return 0;
}
Notice first that this is a .cu
file, so there are some values available that aren’t in .cpp
files. The __global__
keyword indicates that the function is available on the host and the device. Both return void
, they mutate the shared memory rather than return values.
The first function add
, as you might guess, produces the output of the sum operation. This is done element-wise, each kernel is responsible for a subset of the input. This is done by calculating idx
which is based on the block
location and thread
index of this particular kernel. Thus each kernel only performs a single addition operation.
The second function is the random initialization. This function actually uses the thrust
API to sample from a normal distribution. There are some other options for doing this on the GPU, but I wasn’t able to find anything that was this easy to use.
The calls to these kernel functions are a little strange looking. On lines 20–22 and 27/28 you’ll see how the functions is called. The function parameters are provided normally, but they are parameterized with <<<numBlocks,blockSize>>>
. This expression tells Cuda how many kernels will be used. In our case, we are making blocks of size 256, and then there are 64M / 256 blocks. Cuda then distributes work to all of these kernels. The cudaDeviceSynchronize
is kind of like join
on threads or await
in JavaScript async
functions. This causes the program to stop and wait for all of the kernels to finish before the program continues.
This is a bit more work than the serialized version, and way less readable than a well written C++ program. Is the juice worth the squeeze? On my hardware, this is a dramatic yes. The Cuda version finishes in around 0.14s, over 30x faster than the serialized version. If we look at the output of nvprof
we can get a bit more detail.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
==21393== Profiling application: ./main-cuda.cuda
==21393== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 73.42% 227.76ms 2 113.88ms 113.37ms 114.40ms void initRandom<float>(float*, float, float)
26.58% 82.465ms 10 8.2465ms 4.8140ms 39.102ms void add<float>(float*, float*, float*)
API calls: 61.88% 310.24ms 11 28.204ms 4.8177ms 227.74ms cudaDeviceSynchronize
32.78% 164.31ms 3 54.771ms 49.680us 164.11ms cudaMallocManaged
4.87% 24.428ms 3 8.1428ms 8.1220ms 8.1685ms cudaFree
0.32% 1.6288ms 12 135.74us 19.770us 559.03us cudaLaunch
0.11% 528.26us 94 5.6190us 240ns 229.88us cuDeviceGetAttribute
0.02% 115.80us 1 115.80us 115.80us 115.80us cuDeviceTotalMem
0.01% 53.541us 1 53.541us 53.541us 53.541us cuDeviceGetName
0.00% 8.6900us 36 241ns 130ns 1.3400us cudaSetupArgument
0.00% 6.8300us 12 569ns 250ns 2.1200us cudaConfigureCall
0.00% 1.9900us 3 663ns 240ns 1.4500us cuDeviceGetCount
0.00% 1.1000us 2 550ns 250ns 850ns cuDeviceGet
The most expensive operations were memory allocation, and device synchronization. And the most expensive kernel operation was the random initialization despite the summation being run 10x more. This is a good demonstration of an important parallel programming principle: communication is expensive.
Thrust Version
The Thrust version differs quite a bit from the Cuda verison. The patterns are more STL C++ idiomatic, rather than C-like.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
struct initRandomPrg {
float minValue, maxValue;
__host__ __device__
initRandomPrg(float _mnV=0.f, float _mxV=1.f) : minValue(_mnV), maxValue(_mxV) {};
__host__ __device__
float operator()(const unsigned int n) const {
thrust::default_random_engine rng;
thrust::uniform_real_distribution<float> dist(minValue, maxValue);
rng.discard(n);
return dist(rng);
}
};
int main() {
// some stuff
auto x = thrust::device_vector<float>(N);
auto y = thrust::device_vector<float>(N);
auto output = thrust::device_vector<float>(N);
// initilize array
auto index_sequence_begin = thrust::counting_iterator<unsigned int>(0);
// initialize X
thrust::transform(
index_sequence_begin,
index_sequence_begin + N,
x.begin(),
initRandomPrg()
);
// initialize Y
// add them up
for (int i = 0; i < iterations; i++) {
thrust::transform(
x.begin(), x.end(),
y.begin(),
output.begin(),
thrust::plus<float>()
);
}
}
The custom kernel operation initRandomPrg
is organized as a struct
that implements the ()
operator, meaning it can be executed after initialization. In a JavaScript context I would think of this as a second order function. The first call sets the bounds for the random floats. Executions of the returned function will return a random number sample from a uniform distribution.
The data is stored in device_vectors
which are initialized, and can be interacted with much like std::vector
objects. Despite their name these vectors can be accessed and mutated from the host as well as the device. It isn’t necessary to calculate the sizeof
the vectors, or manually allocate them.
The kernel executions are organized similar to the STL algorithm
package functions. There is not declaration of block sizes or counts, and the devices don’t have to be synchronized.
The convenience of not having to manually allocate and de-allocate memory seems like it should come at a cost. However, that appears not to be the case. In multiple trials, the Thrust version appears to run slightly faster than the Cuda version. This could be because of a number of factors — not the least that I’m not familiar extremely with Cuda API performance characteristics.
The output of nvprof
is a little more difficult to read.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
==24803== NVPROF is profiling process 24803, command: ./main-thrust
==24803== Profiling application: ./main-thrust
==24803== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 74.98% 157.91ms 2 78.954ms 77.478ms 80.430ms void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrust::cuda_cub::__transform::unary_transform_f<thrust::counting_iterator<unsigned int, thrust::use_default, thrust::use_default, thrust::use_default>, thrust::detail::normal_iterator<thrust::device_ptr<float>>, thrust::cuda_cub::__transform::no_stencil_tag, initRandomPrg, thrust::cuda_cub::__transform::always_true_predicate>, long>, thrust::cuda_cub::__transform::unary_transform_f<thrust::counting_iterator<unsigned int, thrust::use_default, thrust::use_default, thrust::use_default>, thrust::detail::normal_iterator<thrust::device_ptr<float>>, thrust::cuda_cub::__transform::no_stencil_tag, initRandomPrg, thrust::cuda_cub::__transform::always_true_predicate>, long>(thrust::use_default, thrust::use_default)
22.78% 47.984ms 10 4.7984ms 4.7934ms 4.8072ms void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrust::cuda_cub::__transform::binary_transform_f<thrust::detail::normal_iterator<thrust::device_ptr<float>>, thrust::detail::normal_iterator<thrust::device_ptr<float>>, thrust::detail::normal_iterator<thrust::device_ptr<float>>, thrust::cuda_cub::__transform::no_stencil_tag, addPrg, thrust::cuda_cub::__transform::always_true_predicate>, long>, thrust::cuda_cub::__transform::binary_transform_f<thrust::detail::normal_iterator<thrust::device_ptr<float>>, thrust::detail::normal_iterator<thrust::device_ptr<float>>, thrust::detail::normal_iterator<thrust::device_ptr<float>>, thrust::cuda_cub::__transform::no_stencil_tag, addPrg, thrust::cuda_cub::__transform::always_true_predicate>, long>(thrust::device_ptr<float>, thrust::detail::normal_iterator<thrust::device_ptr<float>>)
2.24% 4.7185ms 3 1.5728ms 1.5658ms 1.5834ms void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrust::cuda_cub::__uninitialized_fill::functor<thrust::device_ptr<float>, float>, unsigned long>, thrust::cuda_cub::__uninitialized_fill::functor<thrust::device_ptr<float>, float>, unsigned long>(thrust::device_ptr<float>, float)
API calls: 58.03% 216.79ms 3 72.265ms 2.3659ms 210.11ms cudaFree
41.66% 155.63ms 3 51.877ms 310.45us 155.00ms cudaMalloc
0.14% 525.39us 94 5.5890us 600ns 211.77us cuDeviceGetAttribute
0.08% 287.94us 1 287.94us 287.94us 287.94us cuDeviceTotalMem
0.04% 131.16us 15 8.7440us 6.1100us 25.190us cudaLaunch
0.02% 75.650us 1 75.650us 75.650us 75.650us cuDeviceGetName
0.02% 72.710us 15 4.8470us 4.4100us 7.9100us cudaFuncGetAttributes
0.00% 10.040us 15 669ns 560ns 1.4200us cudaGetDevice
0.00% 9.3600us 15 624ns 530ns 1.2200us cudaDeviceGetAttribute
0.00% 5.8600us 30 195ns 140ns 480ns cudaSetupArgument
0.00% 4.9900us 30 166ns 130ns 330ns cudaPeekAtLastError
0.00% 4.1000us 3 1.3660us 650ns 2.5200us cuDeviceGetCount
0.00% 3.6800us 15 245ns 180ns 780ns cudaConfigureCall
0.00% 2.1800us 2 1.0900us 640ns 1.5400us cuDeviceGet
The function names are generated monstrosities rather than the easy to read ones from before. I’m assuming the most expensive GPU operation is the random initialization, but I’m not sure. And I don’t know what the third GPU operation is. At least the API calls are still broken out nicely.
There are no calls to cudaDeviceSynchroniz
. A more intelligent synchronization could explain why the Thrust version was a little faster.
Conclusions
Thrust appears to be supported internally by Nvidia. Despite the appearance on the project’s former homepage, it is going to be under active development going forward.
From my perspective Thrust has a lot of value. With simple allocation, deallocation, block sizing, and synchronization, there are fewer opportunities for me to shoot myself in the foot. Also there are a few nice algorithms available.
Assuming I can come up with a suitable application(some kind of simulation perhaps), I’m anxious to dig more into Cuda, Thrust, and the other libraries Nvidia has provided.