Thursday, December 10, 2015

Overview of Distributed System Modules

I came across an overview diagram of distributed computing online. This semester's distributed systems course covered only two parts, distributed programming languages and distributed algorithms, which is just a drop in the ocean.

In the most complete distributed architecture, a system can be composed of the following modules:
Distributed task processing service: handles the concrete business logic
Distributed node registration and lookup: manages the registration and lookup of every node's name and physical information; it is the bridge between nodes
Distributed DB: distributed storage and retrieval of structured data
Distributed cache: distributed storage and retrieval of cached (non-persistent) data
Distributed files: distributed file storage and retrieval
Network communication: network data communication between nodes
Monitoring and management: collects, monitors, and diagnoses the running state of all nodes
Distributed programming languages: languages dedicated to distributed environments, such as Erlang and Scala
Distributed algorithms: algorithms for problems specific to distributed environments, such as the Paxos algorithm for consensus
Therefore, to study cloud computing and distributed systems in depth, one has to dig into all of the areas above. Each of them runs deep and rests on very low-level knowledge and techniques, so for developers who want to improve their skills, distributed systems are an excellent entry point: they can serve as a thread for exploring every corner of the computing world.

Tuesday, December 8, 2015

Introduction to Parallel Computing

I found a good introduction to the basics of parallel computing. Reading through it also served as a review of the first half of the semester; the link is below.


https://computing.llnl.gov/tutorials/parallel_comp/

Monday, December 7, 2015

Seven Official Ways to Optimize Parallel Reduction in CUDA

Following the official document reduction.pdf, I explored how to optimize reduction in CUDA and wrote this post as a record.


Reduction

Write a program to compute the reduction of a vector.


  • The reduction

First, each thread on the GPU computes the partial reduction for its global index. Then, in the main function, we sum up each thread's partial result. Here are the core pieces of code:

//partial result of the index-th thread
for(i = index; i < N; i += THREADS_NUM*BLOCK_NUM)
{
    sum += num[i]*num[i];
}
result[index] = sum;

//sum of each thread's partial result (sum[] is copied back from the GPU)
int final_sum = 0;
for(int i = 0; i < THREADS_NUM*BLOCK_NUM; i++)
{
    final_sum += sum[i];
}

As in previous TPs, we need to define and initialize our vector, then allocate device and host memory. The kernel running on the GPU is the core part.


//Complete Code
#include <stdio.h>
#define N (2048*2048)
#define THREADS_NUM 512
#define BLOCK_NUM 32

__global__ void reduction(int *num, int *result)
{
    /* calculate the global index using blockIdx.x, blockDim.x, threadIdx.x */
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    int sum = 0;
    int i;
    //each thread strides over the input; THREADS_NUM*BLOCK_NUM threads in total
    for(i = index; i < N; i += THREADS_NUM*BLOCK_NUM)
    {
        sum += num[i]*num[i];
    }
    result[index] = sum;  //partial result of the index-th thread
}

/* experiment with N */
/* how large can it be? */
/* note: with data[i] = i, the int sum of squares overflows for large N, so the printed value wraps around */

int main()
{
    int *data;
    int *gpu_data, *result;  //device pointers

    int size = N * sizeof( int );  //input data size
    int sum[THREADS_NUM*BLOCK_NUM];  //per-thread partial results on the host

    /* allocate space for device copies of the input and the partial results */
    cudaMalloc( (void **) &gpu_data, size );
    cudaMalloc( (void **) &result, THREADS_NUM * BLOCK_NUM * sizeof(int) );

    /* allocate space for the host copy of the input and set up input values */
    data = (int *)malloc( size );

    for( int i = 0; i < N; i++ )
    {
        data[i] = i;
    }

    /* copy inputs to device */
    cudaMemcpy( gpu_data, data, size, cudaMemcpyHostToDevice );

    /* launch the kernel on the GPU */
    /* the grid must match the stride THREADS_NUM*BLOCK_NUM used in the kernel,
       otherwise elements would be counted more than once */
    reduction<<<BLOCK_NUM, THREADS_NUM>>>(gpu_data, result);

    /* copy the partial results back to the host */
    cudaMemcpy( sum, result, sizeof(int)*THREADS_NUM*BLOCK_NUM, cudaMemcpyDeviceToHost );

    //sum of every thread's partial result
    int final_sum = 0;
    for(int i = 0; i < THREADS_NUM*BLOCK_NUM; i++ )
    {
        final_sum += sum[i];
    }

    printf("sum:%d ", final_sum);

    /* clean up */
    free(data);
    cudaFree( gpu_data );
    cudaFree( result );

    return 0;
} /* end main */
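
The TP skeleton does not check CUDA API errors; here is a minimal sketch of a checking macro one might add (CHECK is my own helper, not part of the TP; it needs stdlib.h for exit):

//minimal error-checking helper (my own addition, not from the TP)
#define CHECK(call)                                                     \
    do {                                                                \
        cudaError_t err = (call);                                       \
        if (err != cudaSuccess) {                                       \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",                \
                    cudaGetErrorString(err), __FILE__, __LINE__);       \
            exit(1);                                                    \
        }                                                               \
    } while (0)

//usage: CHECK(cudaMalloc((void **)&gpu_data, size));
//       CHECK(cudaMemcpy(gpu_data, data, size, cudaMemcpyHostToDevice));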


  • Optimization

With the help of the reduction.pdf supplied by our teacher, I continued to optimize the code. That said, I am not sure whether I have fully understood reduction...

The pdf presents seven different versions of parallel reduction, each an optimization of the previous one. My code above corresponds to the most basic iterative version.

We need to keep all multiprocessors on the GPU busy, that is to say, we should use as many thread blocks as possible, and there is also the problem of synchronization across blocks. Besides, how we allocate blocks and threads is important. We need to focus on how the per-thread sums are combined.

The optimization goal is to approach GPU peak performance, quantified as GFLOP/s and memory bandwidth: GFLOP/s matters for compute-bound kernels, bandwidth for memory-bound kernels. Reduction has very low arithmetic intensity, so it is memory-bound.
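
A small sketch of how one might measure effective bandwidth with CUDA events (the event names are my own; the kernel launch stands for any of the versions below):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
reduction<<<BLOCK_NUM, THREADS_NUM>>>(gpu_data, result);   //any version
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);   //elapsed time in milliseconds

//effective bandwidth = bytes moved / time; the reduction reads N ints
double gbps = (double)N * sizeof(int) / (ms * 1.0e6);
printf("effective bandwidth: %.2f GB/s\n", gbps);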

  1. Interleaved Addressing

//load one element per thread into shared memory, then do the reduction there
extern __shared__ int sdata[];
sdata[threadIdx.x] = num[index];
__syncthreads();

for(unsigned int s = 1; s < blockDim.x; s *= 2){
    if(threadIdx.x % (2*s) == 0) {
        sdata[threadIdx.x] += sdata[threadIdx.x + s];
    }
    __syncthreads();
}
if(threadIdx.x == 0) result[blockIdx.x] = sdata[0];
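
Since sdata is declared extern __shared__, the per-block shared-memory size must be passed as the third launch parameter. A minimal launch sketch, reusing the names from the complete code above and assuming result now has one slot per block:

//one element per thread, so N/THREADS_NUM blocks; one int of shared memory per thread
reduction<<<N/THREADS_NUM, THREADS_NUM, THREADS_NUM * sizeof(int)>>>(gpu_data, result);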


  2. Interleaved Addressing - removing the divergent branch in the inner loop

//do reduction in shared memory with a strided index instead of a divergent branch
extern __shared__ int sdata[];
sdata[threadIdx.x] = num[index];
__syncthreads();

for(unsigned int s = 1; s < blockDim.x; s *= 2){
    int index2 = 2 * s * threadIdx.x;
    if(index2 < blockDim.x) {
        sdata[index2] += sdata[index2 + s];
    }
    __syncthreads();
}
if(threadIdx.x == 0) result[blockIdx.x] = sdata[0];

This removes warp divergence, but the new strided access pattern causes shared memory bank conflicts.


  3. Sequential Addressing

Reverse the loop: instead of growing the stride (s *= 2), start at blockDim.x/2 and halve it (s >>= 1), so that active threads access consecutive shared-memory addresses and the bank conflicts disappear.

//do reduction in shared memory with sequential addressing
extern __shared__ int sdata[];
unsigned int tid = threadIdx.x;
sdata[tid] = num[index];
__syncthreads();

for(unsigned int s = blockDim.x/2; s > 0; s >>= 1){
    if(tid < s) {
        sdata[tid] += sdata[tid + s];
    }
    __syncthreads();
}
if(tid == 0) result[blockIdx.x] = sdata[0];

But it is wasteful: half of the threads are already idle on the first loop iteration, which motivates the next method.

  4. First Add During Load

Because half of the threads would otherwise be wasted, we halve the number of blocks, let each thread perform two loads, and do the first add of the reduction during the load.

unsigned int tid = threadIdx.x;
unsigned int index = blockIdx.x * (blockDim.x * 2) + threadIdx.x;   //each block now covers twice the data
sdata[tid] = num[index] + num[index + blockDim.x];   //first add performed while loading
__syncthreads();

This method optimizes the loading phase: we reduce the waste by halving the number of blocks.
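
Accordingly, the kernel must be launched with half as many blocks; a sketch under the same assumptions as before:

//half as many blocks, since each thread now loads and adds two elements
reduction<<<N/(2*THREADS_NUM), THREADS_NUM, THREADS_NUM * sizeof(int)>>>(gpu_data, result);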

  5. Unroll the Last Warp

The last warp was difficult for me to understand. The idea is that instructions are SIMD-synchronous within a single warp of 32 threads, so once s <= 32 only the first warp is still active and no __syncthreads() is needed; the last six iterations can therefore be unrolled.

__device__ void warpReduce(volatile int* sdata, int tid) {
    //volatile keeps the compiler from caching sdata values in registers
    sdata[tid] += sdata[tid + 32];
    sdata[tid] += sdata[tid + 16];
    sdata[tid] += sdata[tid + 8];
    sdata[tid] += sdata[tid + 4];
    sdata[tid] += sdata[tid + 2];
    sdata[tid] += sdata[tid + 1];
}

//in the kernel: stop the loop at s = 32 and let warpReduce finish the job
for(unsigned int s = blockDim.x/2; s > 32; s >>= 1) {
    if(tid < s)
        sdata[tid] += sdata[tid + s];
    __syncthreads();
}

if(tid < 32) warpReduce(sdata, tid);

  6. Completely Unrolled

This goes one step further: if the block size is known at compile time as a template parameter, the remaining loop can be completely unrolled.

template <unsigned int blockSize>
__device__ void warpReduce(volatile int* sdata, int tid) {
    if (blockSize >= 64) sdata[tid] += sdata[tid + 32];
    if (blockSize >= 32) sdata[tid] += sdata[tid + 16];
    if (blockSize >= 16) sdata[tid] += sdata[tid + 8];
    if (blockSize >= 8) sdata[tid] += sdata[tid + 4];
    if (blockSize >= 4) sdata[tid] += sdata[tid + 2];
    if (blockSize >= 2) sdata[tid] += sdata[tid + 1];
}

//in the kernel body (blockSize is a template parameter of the kernel too):
if (blockSize >= 512) {
    if (tid < 256) { sdata[tid] += sdata[tid + 256]; } __syncthreads(); }
if (blockSize >= 256) {
    if (tid < 128) { sdata[tid] += sdata[tid + 128]; } __syncthreads(); }
if (blockSize >= 128) {
    if (tid < 64) { sdata[tid] += sdata[tid + 64]; } __syncthreads(); }
if (tid < 32) warpReduce<blockSize>(sdata, tid);

Each blockSize comparison is evaluated at compile time, so the dead branches are eliminated and cost nothing at run time. On the host side, the right template instantiation is selected with a switch on the number of threads.
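
A sketch of that dispatch, following the pattern shown in reduction.pdf (reduce6 and smemSize are the pdf's names; I only show three of the cases):

switch (threads) {
    case 512:
        reduce6<512><<<blocks, threads, smemSize>>>(d_idata, d_odata, n); break;
    case 256:
        reduce6<256><<<blocks, threads, smemSize>>>(d_idata, d_odata, n); break;
    case 128:
        reduce6<128><<<blocks, threads, smemSize>>>(d_idata, d_odata, n); break;
    //...and so on down to case 1 in the pdf
}
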
  7. Multiple Adds per Thread

Replace the single load-and-add with a loop over the whole input, strided by the total grid size, so each thread performs multiple adds. The grid size is defined from blockSize and gridDim; the pdf reports the best performance with 64-256 blocks of 128 threads on a G80 GPU.

unsigned int tid = threadIdx.x;
unsigned int index = blockIdx.x * (blockSize*2) + threadIdx.x;
unsigned int gridSize = blockSize * 2 * gridDim.x;
sdata[tid] = 0;

//each thread sums as many element pairs as the grid stride allows
while (index < n) {
    sdata[tid] += num[index] + num[index + blockSize];
    index += gridSize;
}
__syncthreads();
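
Putting all the steps together, reduction.pdf ends with a final optimized kernel; here is my transcription of it (a plain sum, unlike the squared sum in my TP code):

template <unsigned int blockSize>
__global__ void reduce6(int *g_idata, int *g_odata, unsigned int n) {
    extern __shared__ int sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * (blockSize*2) + tid;
    unsigned int gridSize = blockSize * 2 * gridDim.x;
    sdata[tid] = 0;

    //multiple adds per thread, with the first add during load
    while (i < n) { sdata[tid] += g_idata[i] + g_idata[i + blockSize]; i += gridSize; }
    __syncthreads();

    //completely unrolled reduction in shared memory
    if (blockSize >= 512) { if (tid < 256) { sdata[tid] += sdata[tid + 256]; } __syncthreads(); }
    if (blockSize >= 256) { if (tid < 128) { sdata[tid] += sdata[tid + 128]; } __syncthreads(); }
    if (blockSize >= 128) { if (tid < 64)  { sdata[tid] += sdata[tid + 64];  } __syncthreads(); }
    if (tid < 32) warpReduce<blockSize>(sdata, tid);

    //the first thread writes this block's partial sum
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}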


  • Conclusion

There are different types of bottlenecks: memory bandwidth, core computation, or instruction overhead. Template parameters let us generate efficient code for each block size. The official references are very useful; I also read a lot on Stack Overflow to better understand reduction.pdf.

Finally, I still do not fully understand bank conflicts, even though I have searched a lot, and there are probably still some misunderstandings in this report. Still, I now have an overall picture of reduction optimization and at least know how to optimize reduction code. I am interested in CUDA, so I am going to apply for an individual TER next semester.