Monday, December 7, 2015

Seven Official Ways to Optimize Parallel Reduction in CUDA

Referring to the official document reduction.pdf, I explored the optimization of reduction in CUDA and wrote up this post as a record.


Reduction

Write a program to compute the reduction of a vector.


  • The reduction

First, each thread on the GPU computes its partial reduction according to its global index. Then the main function sums up every thread's partial result. Here is the core code:

//result of index-th thread
for(i = index; i < N; i += THREADS_NUM*BLOCK_NUM)
{
    sum += num[i]*num[i];
}
result[index] = sum;

//sum of each thread's result
int final_sum = 0;
for(int i=0; i < THREADS_NUM*BLOCK_NUM; i++ )
{
    final_sum += sum[i];
}

As we did in previous TPs (lab exercises), we need to define and initialize our vector, then allocate space in host and device memory. The kernel running on the GPU is the core part.


//Complete Code
#include <stdio.h>
#define N (2048*2048)
#define THREADS_NUM 512
#define BLOCK_NUM 32

__global__ void reduction(int *num, int *result)
{
    /* calculate the global index using blockIdx.x, blockDim.x, threadIdx.x */
    int index = threadIdx.x + blockIdx.x * blockDim.x;   //global index
    int sum = 0;
    int i;
    for(i = index; i < N; i += THREADS_NUM*BLOCK_NUM)
    {
        sum += num[i]*num[i];
    }
    result[index] = sum;  //result of index-th thread
}

/* experiment with N */
/* how large can it be? (note: with data[i] = i, the sum of squares overflows a 32-bit int long before this N) */


int main()
{
    int *data;
    int *gpu_data, *result;  //gpu

    int size = N * sizeof( int );  //data size
    int sum[THREADS_NUM*BLOCK_NUM];

    /* allocate space for device copies of the data and the per-thread results */
    cudaMalloc( (void **) &gpu_data, size );
    cudaMalloc( (void **) &result, THREADS_NUM*BLOCK_NUM * sizeof( int ) );

    /* allocate space for the host copy and set up the input values */
    data = (int *)malloc( size );

    for( int i = 0; i < N; i++ )
    {
        data[i] = i;
    }

    /* copy inputs to device */
    cudaMemcpy( gpu_data, data, size, cudaMemcpyHostToDevice );

    /* launch the kernel on the GPU: BLOCK_NUM blocks of THREADS_NUM threads */
    reduction<<<BLOCK_NUM, THREADS_NUM>>>(gpu_data, result);

    /* copy the per-thread results back to the host */
    cudaMemcpy( sum, result, sizeof(int)*THREADS_NUM*BLOCK_NUM, cudaMemcpyDeviceToHost );

    //sum of every thread's result
    int final_sum = 0;
    for(int i=0; i < THREADS_NUM*BLOCK_NUM; i++ )
    {
        final_sum += sum[i];
    }

    printf("sum:%d\n", final_sum);

    /* clean up */
    free(data);
    cudaFree( gpu_data );
    cudaFree( result );

    return 0;
} /* end main */


  • Optimization

With the help of the reduction.pdf supplied by the teacher, I continued to optimize the code. Meanwhile, I am not sure whether I have understood REDUCTION well...

The pdf presents seven different versions that optimize parallel reduction step by step. My code above is the most basic iterative one.

We need to keep all multiprocessors on the GPU busy, that is to say, we should use as many thread blocks as possible, and there is also the problem of synchronizing them. Besides, how to allocate blocks and threads is important. We need to focus on how the per-thread (and per-block) sums are combined.

The optimization goal is to reach GPU peak performance, quantified as GFLOP/s and bandwidth: GFLOP/s matters for compute-bound kernels, bandwidth for memory-bound kernels. Reduction has very low arithmetic intensity (roughly one add per element loaded), so it is memory-bound and we should aim at peak bandwidth.
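As a side note, one simple way to measure the achieved bandwidth (a minimal sketch of my own, reusing the reduction kernel and the defines from the program above) is to time the kernel with CUDA events and divide the bytes read by the elapsed time:

//sketch: effective bandwidth of the reduction kernel, measured with CUDA events
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
reduction<<<BLOCK_NUM, THREADS_NUM>>>(gpu_data, result);
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);    //elapsed time in milliseconds

double bytes = (double)N * sizeof(int);    //the kernel reads N ints from global memory
printf("bandwidth: %f GB/s\n", bytes / (ms * 1.0e6));

cudaEventDestroy(start);
cudaEventDestroy(stop);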

  1. Interleaved Addressing

//do reduction in shared memory
extern __shared__ int sdata[];
unsigned int tid = threadIdx.x;
sdata[tid] = num[index];          //each thread loads one element (index is the global index as before)
__syncthreads();

for(unsigned int s=1; s<blockDim.x; s*=2){
    if(tid % (2*s) == 0) {
        sdata[tid] += sdata[tid+s];
    }
    __syncthreads();
}
if(tid == 0) result[blockIdx.x] = sdata[0];
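These fragments (here and in the versions below) are not complete kernels. As a minimal sketch of how this first one fits into a whole kernel (my own reconstruction, keeping the num/result naming from the first program; each block now produces one partial sum that the host still has to add up):

//sketch: complete "interleaved addressing" kernel, one partial sum per block
__global__ void reduce_v1(int *num, int *result)
{
    extern __shared__ int sdata[];                      //dynamic shared memory, one int per thread
    unsigned int tid   = threadIdx.x;
    unsigned int index = blockIdx.x * blockDim.x + threadIdx.x;

    sdata[tid] = num[index];                            //load (the squaring from the first program could be done here)
    __syncthreads();

    for(unsigned int s = 1; s < blockDim.x; s *= 2){    //interleaved addressing
        if(tid % (2*s) == 0)
            sdata[tid] += sdata[tid+s];
        __syncthreads();
    }

    if(tid == 0)                                        //thread 0 writes the block's partial sum
        result[blockIdx.x] = sdata[0];
}

//hypothetical launch: one thread per element, dynamic shared memory as third launch parameter
//reduce_v1<<<N / THREADS_NUM, THREADS_NUM, THREADS_NUM * sizeof(int)>>>(gpu_data, result);
//the host (or a second reduction pass) then sums the N / THREADS_NUM partial results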


  2. Interleaved Addressing - removing the divergent branch in the inner loop

In the first version, only the threads whose tid is a multiple of 2*s do any work, so the branch is highly divergent within each warp; a strided index keeps the active threads contiguous.

//do reduction in shared memory
extern __shared__ int sdata[];
unsigned int tid = threadIdx.x;
sdata[tid] = num[index];
__syncthreads();

for(unsigned int s=1; s<blockDim.x; s*=2){
    int index2 = 2*s*tid;             //strided index instead of a divergent branch
    if(index2 < blockDim.x) {
        sdata[index2] += sdata[index2+s];
    }
    __syncthreads();
}
if(tid == 0) result[blockIdx.x] = sdata[0];


  3. Sequential Addressing

The strided indexing of the previous version (index2 = 2*s*tid) causes shared memory bank conflicts, so the inner loop is changed to a reversed loop with sequential addressing.

//do reduction in shared memory
extern __shared__ int sdata[];
unsigned int tid = threadIdx.x;
sdata[tid] = num[index];
__syncthreads();

for(unsigned int s = blockDim.x/2; s > 0; s >>= 1){
    if(tid < s) {
        sdata[tid] += sdata[tid+s];
    }
    __syncthreads();
}
if(tid == 0) result[blockIdx.x] = sdata[0];

But it's a big waste: half of the threads are idle from the first loop iteration onwards, so we move on to the next method.

  4. First Add During Load


Because of those wasted threads, we halve the number of blocks: each thread performs two loads and the first add of the reduction while filling shared memory.

//each thread loads two elements and does the first add of the reduction
int index = threadIdx.x + blockIdx.x * (blockDim.x * 2);   //each block now covers 2*blockDim.x elements
sdata[threadIdx.x] = num[index] + num[index + blockDim.x];
__syncthreads();

This method optimizes the loading phase: we reduce the waste by halving the number of blocks.
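The launch has to change accordingly: since each block now consumes 2*blockDim.x input elements, only half as many blocks are needed. A hypothetical launch, reusing the defines of the first program and an illustrative kernel name:

//hypothetical launch for "first add during load": half as many blocks as the previous versions
reduce_v4<<<N / (2 * THREADS_NUM), THREADS_NUM, THREADS_NUM * sizeof(int)>>>(gpu_data, result);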

  5. Unroll the Last Warp

The last warp part is difficult to understand. As I understand it, once s <= 32 only the threads of a single warp are still active; instructions within a warp execute in lockstep (SIMD), so the last six steps can be unrolled without any __syncthreads().

__device__ void warpReduce(volatile int* sdata, int tid) {
    //sdata is volatile so the compiler does not cache the values in registers between steps
    sdata[tid] += sdata[tid + 32];
    sdata[tid] += sdata[tid + 16];
    sdata[tid] += sdata[tid + 8];
    sdata[tid] += sdata[tid + 4];
    sdata[tid] += sdata[tid + 2];
    sdata[tid] += sdata[tid + 1];
}

for(unsigned int s = blockDim.x/2; s > 32; s >>= 1) {
    if(tid < s)
        sdata[tid] += sdata[tid+s];
    __syncthreads();
}

if(tid < 32) warpReduce(sdata, tid);

  6. Completely Unrolled

This goes further than the last method: with the block size known at compile time (as a template parameter), every remaining loop iteration can be unrolled.

template <unsigned int blockSize>
__device__ void warpReduce(volatile int* sdata, int tid) {
    if (blockSize >= 64) sdata[tid] += sdata[tid + 32];
    if (blockSize >= 32) sdata[tid] += sdata[tid + 16];
    if (blockSize >= 16) sdata[tid] += sdata[tid + 8];
    if (blockSize >= 8)  sdata[tid] += sdata[tid + 4];
    if (blockSize >= 4)  sdata[tid] += sdata[tid + 2];
    if (blockSize >= 2)  sdata[tid] += sdata[tid + 1];
}
}

if (blockSize >= 512) {
    if (tid < 256) { sdata[tid] += sdata[tid + 256]; } __syncthreads(); }
if (blockSize >= 256) {
    if (tid < 128) { sdata[tid] += sdata[tid + 128]; } __syncthreads(); }
if (blockSize >= 128) {
    if (tid < 64)  { sdata[tid] += sdata[tid + 64];  } __syncthreads(); }
if (tid < 32) warpReduce<blockSize>(sdata, tid);

Each blockSize comparison will be evaluated at compile time, because blockSize is a template parameter, so the branches that do not apply are removed for free. To call the templated kernel with a block size chosen at run time, we dispatch with a switch-case -.-
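A rough sketch of that dispatch (the kernel name reduce6 and the variables blocks, threads, smem are illustrative, following reduction.pdf):

//sketch: dispatch the templated kernel on the block size chosen at run time
switch (threads)
{
    case 512: reduce6<512><<<blocks, threads, smem>>>(gpu_data, result); break;
    case 256: reduce6<256><<<blocks, threads, smem>>>(gpu_data, result); break;
    case 128: reduce6<128><<<blocks, threads, smem>>>(gpu_data, result); break;
    case  64: reduce6< 64><<<blocks, threads, smem>>>(gpu_data, result); break;
    case  32: reduce6< 32><<<blocks, threads, smem>>>(gpu_data, result); break;
    //...and so on down to 1, always with a compile-time constant
}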

  7. Multiple Adds / Threads

The grid size is now decoupled from the data size and chosen through blockSize and gridDim, because the pdf reports the best performance on the G80 GPU with 64-256 blocks of 128 threads. Each thread sums multiple elements with a grid-sized stride while loading.

unsigned int index = blockIdx.x*(blockSize*2) + threadIdx.x;
unsigned int gridSize = blockSize*2*gridDim.x;
sdata[threadIdx.x] = 0;

while (index < n) {
    sdata[threadIdx.x] += num[index] + num[index+blockSize];
    index += gridSize;
}
__syncthreads();
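With this version the grid size is a free parameter. A hypothetical launch following the pdf's G80 recommendation (the kernel name reduce7 and its signature are mine):

//hypothetical launch for the final version: a fixed, small grid plus the grid-stride loop above
int threads = 128;                              //128 threads per block, as reduction.pdf suggests for G80
int blocks  = 64;                               //the pdf reports 64-256 blocks as the sweet spot
reduce7<128><<<blocks, threads, threads * sizeof(int)>>>(gpu_data, result, N);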


  • Conclusion

There are different types of bottlenecks: memory, core computation, or instruction overhead. We can optimize the code by using template parameters. The official references are very useful; by the way, I read a lot on Stack Overflow to better understand reduction.pdf.

At last, I still have not understood bank conflicts well, even though I have searched a lot. I also think there must be some misunderstandings in my report. Anyway, I have gained an overall knowledge of reduction optimization and at least know how to optimize reduction code. I am interested in CUDA, so I am going to apply for an individual TER next semester.
