Notes on CS344: Introduction to Parallel Programming
Jan 18, 2019
Overview of CUDA processing pipeline:
1. Allocate host and device buffers, and fill the device buffers with the data to process.
2. Set the parameters for kernel execution (grid size, thread block size, amount of shared memory to allocate).
3. Launch the kernel with the prepared data.
If the kernel uses shared memory, the amount to allocate dynamically (in bytes) can be specified as the third kernel launch parameter: kernel<<<gridSize, blockSize, sharedMemBytes>>>(...).
4. Synchronize the CUDA device before retrieving data from the device buffers.
Problem set #1: Color to Greyscale Conversion
RGB -> Grayscale conversion:
gray = .299f * R + .587f * G + .114f * B
Kernel implementation:
RGB -> Gray Scale
It is interesting to note that (X * Y) + Z calculated on the CPU does not necessarily equal (X * Y) + Z calculated on the GPU, because the GPU can use a fused multiply-add (FMA) operation. The CPU version computes Round(Round(X * Y) + Z), while the GPU with FMA computes Round(X * Y + Z). This may result in a small difference between the CPU and GPU output. In my kernel I use explicit rounding to match the reference (CPU) result.
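The rounding difference can be reproduced on the CPU alone. The following is a minimal C++ sketch, not the kernel from this assignment: the function names mulAddSeparate, mulAddFused and toGray are made up for illustration, and std::fma stands in for the GPU's FMA instruction.

```cpp
#include <cfloat>
#include <cmath>

// Multiply-add with two roundings: Round(Round(x * y) + z).
// The volatile temporary forces the product to be rounded to float and
// keeps the compiler from contracting the expression into an FMA.
float mulAddSeparate(float x, float y, float z) {
    volatile float product = x * y;
    return product + z;
}

// Fused multiply-add with a single rounding: Round(x * y + z).
float mulAddFused(float x, float y, float z) {
    return std::fma(x, y, z);
}

// Grayscale value accumulated with separately rounded operations,
// using the weights from the formula above.
float toGray(float r, float g, float b) {
    float gray = 0.299f * r;
    gray = mulAddSeparate(0.587f, g, gray);
    gray = mulAddSeparate(0.114f, b, gray);
    return gray;
}
```

For example, with x = y = 1.0f + FLT_EPSILON and z = -(1.0f + 2.0f * FLT_EPSILON), the exact product is 1 + 2^-22 + 2^-46. Rounding the product first drops the 2^-46 term, so mulAddSeparate returns 0, while mulAddFused keeps it and returns 2^-46.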
CPU: ~2.25 msec (Intel i3-6006U 2GHz)
GPU: ~0.67 msec (Nvidia 920MX)
Problem set #2: Image Blurring
The input image is in RGB format, in which the color components are interleaved (RGBRGBRGBRGB…). Working with such a structure results in an inefficient memory access pattern.
Steps to follow:
1. Separate the RGB channels so that different color components are stored contiguously (RRRRR…, GGGGG…, BBBBB…).
2. Perform blurring on each color channel (convolve each channel with a filter).
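The two steps above can be sketched on the CPU as follows. This is a plain C++ reference sketch under my own assumptions, not the GPU solution: the names Planes, separateChannels and blurChannel are hypothetical, the filter is assumed to be square with an odd width, and pixels outside the image are clamped to the nearest edge.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Planar image: one contiguous buffer per color channel.
struct Planes {
    std::vector<std::uint8_t> r, g, b;
};

// Step 1: split interleaved RGBRGB... data into RRR..., GGG..., BBB...
Planes separateChannels(const std::vector<std::uint8_t>& rgb) {
    assert(rgb.size() % 3 == 0);
    const std::size_t n = rgb.size() / 3;
    Planes p;
    p.r.resize(n);
    p.g.resize(n);
    p.b.resize(n);
    for (std::size_t i = 0; i < n; ++i) {
        p.r[i] = rgb[3 * i];
        p.g[i] = rgb[3 * i + 1];
        p.b[i] = rgb[3 * i + 2];
    }
    return p;
}

// Step 2: convolve one channel with a square filter of odd width.
// Reads outside the image are clamped to the nearest edge pixel.
std::vector<std::uint8_t> blurChannel(const std::vector<std::uint8_t>& in,
                                      int width, int height,
                                      const std::vector<float>& filter,
                                      int filterWidth) {
    assert(filterWidth % 2 == 1);
    assert(in.size() == static_cast<std::size_t>(width) * height);
    assert(filter.size() == static_cast<std::size_t>(filterWidth) * filterWidth);
    const int half = filterWidth / 2;
    std::vector<std::uint8_t> out(in.size());
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            float sum = 0.0f;
            for (int fy = -half; fy <= half; ++fy) {
                for (int fx = -half; fx <= half; ++fx) {
                    const int sy = std::min(std::max(y + fy, 0), height - 1);
                    const int sx = std::min(std::max(x + fx, 0), width - 1);
                    sum += filter[(fy + half) * filterWidth + (fx + half)] *
                           in[sy * width + sx];
                }
            }
            // Round to the nearest integer intensity.
            out[y * width + x] = static_cast<std::uint8_t>(sum + 0.5f);
        }
    }
    return out;
}
```

With the planar layout, neighbouring pixels of one channel are neighbours in memory, which is what lets adjacent GPU threads read adjacent addresses.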
Problem set #3: HDR Tone-mapping
Use shared memory to calculate a histogram for each thread block, then copy the result to global memory.
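The idea of a per-block histogram can be illustrated with a sequential C++ sketch. It is only an analogy under stated assumptions: the function name blockedHistogram is hypothetical, the local vector plays the role of shared memory, and the chunks are processed one after another instead of in parallel.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// bins[i] is the histogram bin of element i; blockSize is the number of
// elements handled by one (simulated) thread block.
std::vector<unsigned> blockedHistogram(const std::vector<unsigned>& bins,
                                       std::size_t numBins,
                                       std::size_t blockSize) {
    assert(blockSize > 0);
    std::vector<unsigned> global(numBins, 0u);
    std::vector<unsigned> local(numBins);  // stands in for shared memory
    for (std::size_t start = 0; start < bins.size(); start += blockSize) {
        const std::size_t end = std::min(start + blockSize, bins.size());
        std::fill(local.begin(), local.end(), 0u);
        // Accumulate this block's elements into the private histogram.
        for (std::size_t i = start; i < end; ++i) {
            assert(bins[i] < numBins);
            ++local[bins[i]];
        }
        // Merge the private histogram into the global one. On a GPU,
        // blocks run concurrently, so this merge would need atomic adds.
        for (std::size_t b = 0; b < numBins; ++b) {
            global[b] += local[b];
        }
    }
    return global;
}
```

The benefit on the GPU is that most increments hit fast shared memory, and global memory is touched only once per bin per block.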
Problem set #4: Red Eye Removal
The main task for this assignment is to implement the radix sort algorithm.
For each bit, repeat the following steps:
1. Calculate the histogram of the number of occurrences of 0s and 1s.
2. Calculate the predicates (whether a particular bit is 0 or 1) and perform a block-based exclusive scan on them. Save the last value of the prefix sum in each block so it can be used later to calculate the correct offset when moving the array's elements.
3. For each thread block (as in step 2), apply the corresponding offsets to the calculated prefix sum.
4. Move the elements in the Value and Position arrays. For elements whose predicate is 0, the output index equals the calculated prefix sum; if the predicate is 1, add the histogram[0] value to the corresponding prefix sum value.
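The block-based exclusive scan and the later offset fix-up can be sketched sequentially in C++. This is an illustration under my own assumptions rather than the code used in the assignment: the name blockedExclusiveScan is hypothetical, and instead of the last prefix-sum value I store each block's total (the last exclusive value plus the last input), which is what the fix-up needs.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

std::vector<unsigned> blockedExclusiveScan(const std::vector<unsigned>& in,
                                           std::size_t blockSize) {
    assert(blockSize > 0);
    const std::size_t n = in.size();
    std::vector<unsigned> out(n);
    std::vector<unsigned> blockTotals;

    // Independent exclusive scan inside every block; remember each total.
    for (std::size_t start = 0; start < n; start += blockSize) {
        const std::size_t end = std::min(start + blockSize, n);
        unsigned running = 0;
        for (std::size_t i = start; i < end; ++i) {
            out[i] = running;
            running += in[i];
        }
        blockTotals.push_back(running);
    }

    // Fix-up: add the sum of all preceding blocks to each block's results.
    unsigned offset = 0;
    for (std::size_t block = 0; block < blockTotals.size(); ++block) {
        const std::size_t start = block * blockSize;
        const std::size_t end = std::min(start + blockSize, n);
        for (std::size_t i = start; i < end; ++i) {
            out[i] += offset;
        }
        offset += blockTotals[block];
    }
    return out;
}
```

For the predicates {1, 0, 1, 1, 0, 1, 1} and a block size of 3, the per-block scans are {0, 1, 1}, {0, 1, 1}, {0}, and the fix-up turns them into the global result {0, 1, 1, 2, 3, 3, 4}.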
It is possible to get rid of step 3 by changing the exclusive scan algorithm.
For each bit, repeat the following steps:
1. Calculate the histogram of the number of occurrences of 0s and 1s.
2. Calculate the predicates (whether a particular bit is 0 or 1) for each value.
3. Apply the exclusive scan from the Thrust library (thrust::exclusive_scan).
4. Move the elements in the Value and Position arrays. For elements whose predicate is 0, the output index equals the calculated prefix sum; if the predicate is 1, add the histogram[0] value to the corresponding prefix sum value.
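Putting the steps together, a whole sort can be sketched on the CPU in C++. This is a sequential sketch under stated assumptions, not the Thrust-based GPU code: the name radixSort is hypothetical, a simple loop replaces the library scan, and the predicate is taken to be the bit value itself. For 1-bit elements the output index is histogram[0] plus the prefix sum, as described above; for 0-bit elements I derive the index as the element's position minus the number of 1-bit elements before it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sorts `values` in ascending order and applies the same permutation
// to `positions` (least significant bit first, one bit per pass).
void radixSort(std::vector<std::uint32_t>& values,
               std::vector<std::uint32_t>& positions) {
    assert(values.size() == positions.size());
    const std::size_t n = values.size();
    std::vector<std::uint32_t> outValues(n), outPositions(n);
    std::vector<std::size_t> onesBefore(n);

    for (unsigned bit = 0; bit < 32; ++bit) {
        // Predicate = current bit. The exclusive scan counts the 1-bit
        // elements that precede each index; the final sum is histogram[1].
        std::size_t ones = 0;
        for (std::size_t i = 0; i < n; ++i) {
            onesBefore[i] = ones;
            ones += (values[i] >> bit) & 1u;
        }
        const std::size_t zeros = n - ones;  // histogram[0]

        // Scatter: 0-bit elements go to the front in their original order,
        // 1-bit elements follow after all the zeros.
        for (std::size_t i = 0; i < n; ++i) {
            const bool isOne = ((values[i] >> bit) & 1u) != 0;
            const std::size_t dst =
                isOne ? zeros + onesBefore[i] : i - onesBefore[i];
            outValues[dst] = values[i];
            outPositions[dst] = positions[i];
        }
        values.swap(outValues);
        positions.swap(outPositions);
    }
}
```

Because every pass keeps the relative order of elements with the same bit, the sort is stable: equal values keep the order of their original positions.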