Field Report: N-Body simulator 1: Uniform Random Images

I recently came across the Thrust library when I was toying around with Cuda. In that post I mentioned, “Assuming I can come up with a suitable application, I’m anxious to dig more into Cuda, Thrust”. The project I’ve set out for myself is a Cuda implementation of an N-Body Simulation.

These simulations yield some gorgeous results, and can be solved efficiently using a Barnes-Hut tree. There are already a few different implementations to reference. This obviously isn’t novel, but I have some ideas I’d like to pursue outside of a simple simulation. I’m not going over those here because I’m not sure how feasible they are.

This post is the first of hopefully of several that outline problems and solutions I come across while create a visually interesting simulations.

You can see the source code here.

##Thrust Makes Management, and data Initialization Easiernn

The goal for this first iteration was to create a large number of bodies from a uniform distribution and save a PNG image of them. I wanted pixels with more bodies mapped to be rendered brighter than other pixels. So I needed to count them up.

In order to do this I needed to create a vector with as many cells as there would be pixels in the final image initialized all zeros. A cuda version of this might look something like.

const int PIXEL_COUNT = 500 * 500;
usigned int *pixelBodyCounts;
cudaMallocManaged(&pixelBodyCounts, PIXEL_COUNT * sizeof(unsigned int));
initZeros<<<numBlocks, blockSize>>>(pixelBodyCounts);

// do stuff with the pixels

cudaFree(pixelBodyCounts);

That’s a lot of work before we even start creating the sums. This doesn’t even include the kernel code for initZeroes. There are multiple opportunities to mismanage memory: if the type of pixelBodyCounts is changed, there is a possibility of not allocating the correct amount of memory. We might accidentally delete our cudaFree statement, or attempt to use the data after it’s freed.

Now check out the Thrust version.

const int PIXEL_COUNT = 500 * 500;
auto pixelCounts = thrust::device_vector<unsigned int>(PIXEL_COUNT);
thrust::fill(pixelCounts.begin(), pixelCounts.end(), 0);

Much better! The data type can now be specified changed without worrying about misallocation. Deallocation will happen automatically when pixelBodyCounts goes off the stack. Finally the thrust::fill initializes the data without needing to write a custom kernel function.

##But what about structs?

We’ve already seen how easy it was to create vectors of scalar values with Thrust. Creating vectors of structs ended up being pretty simple as well.

struct Body {
  float x, y, vx, vy;

  __host__ __device__
    Body (float _x, float _y, float _vx, float _vy):
        x(_x), y(_y), vx(_vx), vy(_vy) {};
  __host__ __device__
    Body () {};
};

int main(int argc, char **argv) {
  // ...
  auto bodies = thrust::device_vector<Body>(BODY_COUNT);
}

There are two different constructors. One without arguments Body(){}; is used when allocating the vector. Another that allows the component values to be set within a custom kernel. The Body constructors have been annotated as __host__ __device__ so they can be allocated on either processor.

##Mapping Bodies to Pixels: Sometimes you need the raw pointer

It’s not always convenient to work with Thrust vectors and algorithms. When calculating the number of bodies for a given pixel, I needed to perform an atomic add operation. Cuda provides an implementation that works with memory addresses. I wasn’t able to figure out how to use Thrust vector references with this function, but it ended up being pretty simple to get a raw pointer to the data.

struct mapBodyToPixelCounts
{
  int *pixelCounts;
  const int imageDim;

  __host__ __device__
    mapBodyToPixelCounts(
        int _imageDim,
        int *_pixelCounts
        ): imageDim(_imageDim), pixelCounts(_pixelCounts) {};

  __device__
    void operator()(const Body &body) const
    {
      auto offset = ImageCoord(body, imageDim).toOffset();
      atomicAdd(&pixelCounts[offset], 1);
    };
}

int main () {
  // ...
  auto pixelCount_ptr = thrust::raw_pointer_cast(pixelCounts.data());
  thrust::for_each(
    bodies.begin(), bodies.end(),
    mapBodyToPixelCounts(IMAGE_DIM, pixelCount_ptr)
  );
}

The pixelCount.data method returns a thrust::pointer which is then converted to a raw pointer.

The mapBodyToPixelCounts operation takes the raw pointer as a first-order argument . When called via thrust::for_each the function has access to the pixelCount pointer and a reference to the current body. Now we can pass the address corresponding to the pixel occupied by the body to Cuda’s atomicAdd function.

Casting to raw pointers is a little messier than working with the vectors, but sometimes it is the easiest of best way to work with an existing library.

##From Pixel Counts to Pixels

After counting up the number of bodies in each pixel, it’s time to map those counts to an RGB pixel. In this case, we’ll be creating a gray-scale image. To do this we’ll need to map each pixel count to three channel values each.

const int BODY_COUNT_GRAYSCALE_RATIO = 3;

struct mapPixelCountToRGBA
{
  const int BODY_COUNT_GRAYSCALE_RATIO = 1; // like 12 points in white
  const int *pixelCount_ptr;
  unsigned char *image_ptr;

  __host__ __device__
  mapPixelCountToRGBA(int *_pixelCounts, unsigned char *_image):
    pixelCount_ptr(_pixelCounts), image_ptr(_image) {};

  __host__ __device__
  void operator()(const unsigned int idx) {
    auto count = pixelCount_ptr[idx];
    auto grayValue = min(BODY_COUNT_GRAYSCALE_RATIO * count, 127);
    // assign rgba values
    auto baseIdx = idx * PIXEL_RGBA_RATIO;
    image_ptr[baseIdx] = grayValue;
    image_ptr[baseIdx + 1] = grayValue;
    image_ptr[baseIdx + 2] = grayValue;
  }
};

int main() {
  // ...
  auto deviceImage = thrust::device_vector<unsigned char>(RGBA_IMAGE_SIZE);
  auto deviceImage_ptr = thrust::raw_pointer_cast(deviceImage.data());

  auto index_sequence_begin = thrust::counting_iterator<unsigned int>(0);
  thrust::for_each(
    index_sequence_begin,
    index_sequence_begin + PIXEL_COUNT,
    mapPixelCountToRGBA(pixelCount_ptr, deviceImage_ptr)
);

The image vector is type unsigned char, this type is compatible with Open CV Matrix container, which we’ll use later on.

The index_sequence_begin iterator is used by the thrust::for_each function to deliver a sequence of numbers from 0 to PIXEL_COUNT to the different kernel tasks. I couldn’t iterate over the pixels directly because I needed to map the indices to a larger vector.

The need to change indexes meant that I couldn’t easily use the thrust::device_vector. So, mapPixelCountToRGBA takes raw pointers as first-order arguments. These pointers are accessed by indices calculated using the BODY_COUNT_GRAYSCALE_RATIO constant.

From **device_image** to Image File

Now that we have a device_vector of channel values, we need to save them to a file. The solution I came up with was a three-step process: copy from device to host, copy host vector to OpenCV matrix, and then save the matrix to a file.

First I needed to copy the data from the device to host. This is made easier with Thrust.

int main () {
  // copy image to host
  auto hostImage = thrust::host_vector<unsigned char>(RGBA_IMAGE_SIZE);
  thrust::copy(deviceImage.begin(), deviceImage.end(), hostImage.begin());
}

Next, the image data needs to be copied to an OpenCV Mat. Then saving to an image file is easy.

int main () {
  int const IMAGE_DIM = 400;
  int const PIXEL_COUNT = IMAGE_DIM * IMAGE_DIM;
  int const RGBA_IMAGE_SIZE = PIXEL_COUNT * PIXEL_RGBA_RATIO;

  auto hostImage_ptr = thrust::raw_pointer_cast(hostImage.data());
  cv::Mat imageMat(IMAGE_DIM, IMAGE_DIM, CV_8UC3);
  memcpy(imageMat.data, hostImage_ptr, sizeof(unsigned char) * RGBA_IMAGE_SIZE);

  cv::imwrite(argv[1], imageMat);
}

This is a little nastier. The matrix has to be initialized with the image dimensions and the channel format. I also needed to dip into raw pointers again and provide the memory size. I hope to find a way to do this with iterators later on.

The cv::imwrite function is really easy to use. It can save to many different formats, and switching between .jpg and .png is as easy as specifying a different output file through the command line.

Below, you can see a gif of several images produced by this iteration of the program.

Animation of multiple uniform random distributions. (Very exciting)

##Conclusion

That covers all the things I found noteworthy while working through this first iteration. I hope to report back with bodies that actually interact soon!