Thread synchronization in tree-style addition

system · June 25, 2010, 03:59

#define DATA_SIZE 1048576
#define BLOCK_NUM 32
#define THREAD_NUM 256

__global__ static void sumOfSquares(int *num, int *result,
    clock_t *time)
{
    // One int per thread; the size is given at kernel launch.
    extern __shared__ int shared[];
    const int tid = threadIdx.x;
    const int bid = blockIdx.x;
    int i;
    int offset = 1, mask = 1;
    if(tid == 0) time[bid] = clock();
    shared[tid] = 0;
    // Each thread accumulates its own partial sum of squares.
    for(i = bid * THREAD_NUM + tid; i < DATA_SIZE;
        i += BLOCK_NUM * THREAD_NUM) {
        shared[tid] += num[i] * num[i];
    }

    __syncthreads();
    // Tree-style addition of the partial sums within the block.
    while(offset < THREAD_NUM) {
        if((tid & mask) == 0) {
            shared[tid] += shared[tid + offset];
        }
        offset += offset;
        mask = offset + mask;
        __syncthreads();
    }

    if(tid == 0) {
        result[bid] = shared[0];
        time[bid + BLOCK_NUM] = clock();
    }
}
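In the while loop between the __syncthreads(); calls, how do the threads of the same block synchronize and compute in parallel? I mean this line: shared[tid] += shared[tid + offset]; Also, why does mask get offset added to it on every iteration?
Isn't the point to add up all the shared[] entries of one block and store that block's total in the result array? result[bid] = shared[0]; seems to assign only once. There are 32 blocks, so result should hold 32 values! I can't figure it out...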
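One way to see what the while loop does is to replay it on the host. The sketch below is plain C and is not taken from the tutorial: the function name tree_reduce and the inner for loop over tid are assumptions made for illustration. Each pass of that for loop stands in for one GPU thread, and leaving the for loop stands in for __syncthreads().

```c
#include <assert.h>

#define THREAD_NUM 256 /* same value as in the post */

/* Hypothetical host-side replay of the kernel's while loop.
 * shared[] holds one partial sum per thread; on return shared[0]
 * holds the total. Returns the number of rounds executed. */
int tree_reduce(int shared[THREAD_NUM])
{
    int offset = 1, mask = 1, rounds = 0;

    /* The pairing below assumes THREAD_NUM is a power of two. */
    assert((THREAD_NUM & (THREAD_NUM - 1)) == 0);

    /* Rounds use offset = 1, 2, 4, ..., 128 and mask = 1, 3, 7, ..., 255. */
    while (offset < THREAD_NUM) {
        /* Each pass of this for loop plays the role of one GPU thread. */
        for (int tid = 0; tid < THREAD_NUM; tid++) {
            /* mask == 2*offset - 1, so only tids that are multiples
             * of 2*offset pass this test. */
            if ((tid & mask) == 0) {
                shared[tid] += shared[tid + offset];
            }
        }
        /* Leaving the for loop corresponds to __syncthreads(). */
        offset += offset;
        mask = offset + mask;
        rounds++;
    }
    return rounds;
}
```

Filling shared with the values 0 to 255 and calling tree_reduce leaves 32640 in shared[0] after 8 rounds. Adding offset to mask keeps mask equal to 2*offset - 1, so in each round only the threads whose tid is a multiple of 2*offset add in their neighbour at tid + offset, and the number of active threads halves until thread 0 holds the block's total. As for result: each of the 32 blocks runs the same kernel with its own blockIdx.x, so the single statement result[bid] = shared[0]; executes once per block and fills 32 entries.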
By the way, I came across this code in 深入浅出cuda. Site: http://www.pcinlife.com/article/graphics/2008-06-04/1212575164d532_3.html
I would sincerely appreciate any help! Thanks in advance...

[ Last edited by ycc892009 on 2010-6-25 12:01 ]