
Atomic operations
Consider a situation in which a large number of threads try to modify a small portion of memory. This happens frequently, and it is particularly problematic when the threads perform a read-modify-write operation. An example of such an operation is d_out[i]++, where d_out[i] is first read from memory, then incremented, and then written back to memory. When multiple threads perform this operation on the same memory location, the result can be wrong.
Suppose a memory location holds an initial value of six, and threads p and q both try to increment it; the final answer should then be eight. At execution time, however, both p and q may read the value simultaneously, so both get six. Each increments it to seven and stores seven back in memory. So instead of eight, our final answer is seven, which is wrong. How dangerous this can be is best understood with an ATM cash withdrawal example. Suppose you have a balance of Rs 5,000 in your account and two ATM cards for the same account. You and your friend go to two different ATMs at the same time to withdraw Rs 4,000 each. Both of you swipe your cards simultaneously, so when each ATM checks the balance, both see Rs 5,000. Since the amount to withdraw is less than the balance each machine sees, both machines dispense Rs 4,000. Even though your balance was Rs 5,000, you got Rs 8,000, which is dangerous. To demonstrate this phenomenon, we take an example in which a large number of threads try to access a small array. The kernel function for this example is shown as follows:
#include <stdio.h>
#define NUM_THREADS 10000
#define SIZE 10
#define BLOCK_WIDTH 100

__global__ void gpu_increment_without_atomic(int *d_a)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    // Each thread increments an element; the index wraps around at SIZE
    tid = tid % SIZE;
    d_a[tid] += 1;
}
The kernel function just increments a memory location in the d_a[tid] += 1 line. The question is how many times each memory location gets incremented. The total number of threads is 10,000, and the array has only 10 elements. We index the array by taking the thread ID modulo the size of the array, so 1,000 threads will try to increment each memory location. Ideally, every location in the array should be incremented 1,000 times, but as we will see in the output, this is not the case. Before looking at the output, we will write the main function:
int main(int argc, char **argv)
{
    printf("%d total threads in %d blocks writing into %d array elements\n",
        NUM_THREADS, NUM_THREADS / BLOCK_WIDTH, SIZE);

    // declare and allocate host memory
    int h_a[SIZE];
    const int ARRAY_BYTES = SIZE * sizeof(int);

    // declare and allocate GPU memory
    int *d_a;
    cudaMalloc((void **)&d_a, ARRAY_BYTES);
    // Initialize GPU memory with zero value
    cudaMemset((void *)d_a, 0, ARRAY_BYTES);

    gpu_increment_without_atomic <<<NUM_THREADS / BLOCK_WIDTH, BLOCK_WIDTH>>>(d_a);

    // copy back the array of sums from GPU and print
    cudaMemcpy(h_a, d_a, ARRAY_BYTES, cudaMemcpyDeviceToHost);
    printf("Number of times a particular Array index has been incremented without atomic add is: \n");
    for (int i = 0; i < SIZE; i++)
    {
        printf("index: %d --> %d times\n", i, h_a[i]);
    }

    cudaFree(d_a);
    return 0;
}
In the main function, the device array is declared and initialized to zero. Here, the cudaMemset function is used to initialize memory on the device. The array is passed as a parameter to the kernel, which increments these 10 memory locations. A total of 10,000 threads are launched, arranged as 100 blocks of 100 threads each. The result stored on the device after the kernel's execution is copied back to the host, and the value of each memory location is displayed on the console.
The output is as follows:

As discussed previously, each memory location should ideally have been incremented 1,000 times, but most of the memory locations hold values of 16 or 17. This is because many threads read the same location simultaneously, increment the same value, and store it back in memory. Since the timing of thread execution is beyond the programmer's control, it is not known how many of these simultaneous accesses will happen. If you run your program a second time, will your output be the same as the first time? Your output might look like the following:

As you might have guessed, the memory locations may hold different values every time you run the program. This happens because the order in which threads execute on the device is unpredictable.
To solve this problem, CUDA provides an atomicAdd operation. It is a blocking operation, which means that when multiple threads try to access the same memory location, only one thread can access it at a time; the other threads have to wait until that thread has finished and written its result to memory. The kernel function that uses the atomicAdd operation is shown as follows:
#include <stdio.h>
#define NUM_THREADS 10000
#define SIZE 10
#define BLOCK_WIDTH 100

__global__ void gpu_increment_atomic(int *d_a)
{
    // Calculate thread index
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    // Each thread increments an element; the index wraps around at SIZE
    tid = tid % SIZE;
    atomicAdd(&d_a[tid], 1);
}
The kernel function is quite similar to the earlier one. Instead of incrementing the memory location with the += operator, the atomicAdd function is used. It takes two arguments: the first is the address of the memory location we want to increment, and the second is the value by which that location has to be incremented. In this code, 1,000 threads will again try to access each location, so when one thread is updating a location, the other 999 threads have to wait. This increases the execution time. The main function for incrementing with atomic operations is shown as follows:
int main(int argc, char **argv)
{
    printf("%d total threads in %d blocks writing into %d array elements\n",
        NUM_THREADS, NUM_THREADS / BLOCK_WIDTH, SIZE);

    // declare and allocate host memory
    int h_a[SIZE];
    const int ARRAY_BYTES = SIZE * sizeof(int);

    // declare and allocate GPU memory
    int *d_a;
    cudaMalloc((void **)&d_a, ARRAY_BYTES);
    // Initialize GPU memory with zero value
    cudaMemset((void *)d_a, 0, ARRAY_BYTES);

    gpu_increment_atomic <<<NUM_THREADS / BLOCK_WIDTH, BLOCK_WIDTH>>>(d_a);

    // copy back the array from GPU and print
    cudaMemcpy(h_a, d_a, ARRAY_BYTES, cudaMemcpyDeviceToHost);
    printf("Number of times a particular Array index has been incremented is: \n");
    for (int i = 0; i < SIZE; i++)
    {
        printf("index: %d --> %d times\n", i, h_a[i]);
    }

    cudaFree(d_a);
    return 0;
}
In the main function, the array of 10 elements is initialized to zero and passed to the kernel, but now the kernel performs the atomic add operation. The output of this program should therefore be correct: each element in the array should be incremented exactly 1,000 times. The following will be the output:

If you measure the execution time of the program with atomic operations, it may run longer than the simple program that writes to global memory directly. This is because many threads wait for access to the same memory location during the atomic operation. Using shared memory can help speed this up, as shown in the sketch below. Also, if the same number of threads access a larger number of memory locations, the atomic operations incur less time overhead, since fewer threads have to wait for any single location.
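To give an idea of how shared memory can help, the following is a minimal sketch and not part of the original example; the kernel name gpu_increment_shared_atomic and the cudaEvent-based timing code are additions for illustration. Each block first accumulates its increments in a shared memory copy of the counters using atomicAdd on shared memory, so contention is limited to the threads of one block, and then only SIZE threads per block add the per-block totals to global memory:
#include <stdio.h>
#define NUM_THREADS 10000
#define SIZE 10
#define BLOCK_WIDTH 100

// Illustrative kernel (hypothetical name): per-block privatization of the counters
__global__ void gpu_increment_shared_atomic(int *d_a)
{
    // Per-block copy of the counters in shared memory
    __shared__ int s_a[SIZE];
    if (threadIdx.x < SIZE)
        s_a[threadIdx.x] = 0;
    __syncthreads();

    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    // Atomic add on shared memory: only threads of this block compete
    atomicAdd(&s_a[tid % SIZE], 1);
    __syncthreads();

    // Only SIZE threads per block perform an atomic add on global memory
    if (threadIdx.x < SIZE)
        atomicAdd(&d_a[threadIdx.x], s_a[threadIdx.x]);
}

int main()
{
    int h_a[SIZE];
    const int ARRAY_BYTES = SIZE * sizeof(int);
    int *d_a;
    cudaMalloc((void **)&d_a, ARRAY_BYTES);
    cudaMemset((void *)d_a, 0, ARRAY_BYTES);

    // Time the kernel with CUDA events to compare against the plain atomic version
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    gpu_increment_shared_atomic <<<NUM_THREADS / BLOCK_WIDTH, BLOCK_WIDTH>>>(d_a);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0;
    cudaEventElapsedTime(&ms, start, stop);
    printf("Kernel execution time: %f ms\n", ms);

    cudaMemcpy(h_a, d_a, ARRAY_BYTES, cudaMemcpyDeviceToHost);
    for (int i = 0; i < SIZE; i++)
    {
        printf("index: %d --> %d times\n", i, h_a[i]);
    }
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_a);
    return 0;
}
With 100 threads per block and an array of 10 elements, each block now performs only 10 atomic additions on global memory instead of 100, while every element still ends up incremented 1,000 times.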
In this section, we have seen that atomic operations help avoid race conditions in memory operations and make the code simpler to write and understand. In the next section, we will explain two special types of memory, constant and texture memory, which help accelerate certain types of code.