Hi everyone. I've recently been implementing an algorithm that uses Harris's tree-based reduction, and I have a question for you all. The code is below:
For the original algorithm, see the attached file [attach]3142[/attach]:
#ifdef DEVICE_EMULATION
#define EMUSYNC __syncthreads()
#else
#define EMUSYNC
#endif
… (abridged)
#ifndef DEVICE_EMULATION
if (tid < 32)
#endif
{
if (blockSize >= 64) { sdata[tid] = min(sdata[tid], sdata[tid + 32]); EMUSYNC; }
if (blockSize >= 32) { sdata[tid] = min(sdata[tid], sdata[tid + 16]); EMUSYNC; }
if (blockSize >= 16) { sdata[tid] = min(sdata[tid], sdata[tid + 8]); EMUSYNC; }
if (blockSize >= 8) { sdata[tid] = min(sdata[tid], sdata[tid + 4]); EMUSYNC; }
if (blockSize >= 4) { sdata[tid] = min(sdata[tid], sdata[tid + 2]); EMUSYNC; }
if (blockSize >= 2) { sdata[tid] = min(sdata[tid], sdata[tid + 1]); EMUSYNC; }
}
…
Testing on a Fermi GPU, the minimum-reduction result comes out wrong unless __syncthreads() is added to the code above; with it added, the result is correct!
My understanding is that the 32 threads within a CUDA warp (which is what tid < 32 expresses here) execute along the same path in lockstep, so why is this synchronization required?
Could it be because each warp is issued in two halves?
Thanks.
Hello OP. Since you are performing the reduction in shared memory, you need the volatile qualifier; without it, the code will misbehave on compute capability 2.x and 3.x cards. Alternatively, adding a __syncthreads() after each step also guarantees correctness. For details, please refer to the reduction sample code in the CUDA C Programming Guide.
Also, this has nothing to do with the specifics of how warps are issued.
That is roughly the situation; please fix the code and try again.
Good luck~
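A minimal sketch of the volatile fix described above, modeled on the SDK reduction sample (the templated blockSize parameter and the helper name warpReduceMin are assumptions for illustration, not the OP's actual code):

```cuda
// Warp-synchronous tail of a shared-memory min-reduction. Declaring the
// pointer volatile forces the compiler to store sdata[tid] back to shared
// memory after every step instead of caching it in a register, so the
// EMUSYNC / __syncthreads() calls are no longer needed for correctness
// within the warp on Fermi.
template <unsigned int blockSize>
__device__ void warpReduceMin(volatile int *sdata, unsigned int tid)
{
    if (blockSize >= 64) sdata[tid] = min(sdata[tid], sdata[tid + 32]);
    if (blockSize >= 32) sdata[tid] = min(sdata[tid], sdata[tid + 16]);
    if (blockSize >= 16) sdata[tid] = min(sdata[tid], sdata[tid +  8]);
    if (blockSize >=  8) sdata[tid] = min(sdata[tid], sdata[tid +  4]);
    if (blockSize >=  4) sdata[tid] = min(sdata[tid], sdata[tid +  2]);
    if (blockSize >=  2) sdata[tid] = min(sdata[tid], sdata[tid +  1]);
}
```

The caller would invoke this only from threads with tid < 32, after a full __syncthreads() at the end of the strided phase of the reduction.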
Hello!
If volatile solves the problem, could you help me understand the cause from the perspective of compiler optimization?
I have read the sample code and now understand the correct way to write this. Thank you for the answer.
Hello OP. NVIDIA's official Fermi Compatibility Guide states:
“If your kernels implement this sort of optimization when passing values among threads
in a warp using shared or global memory, it is essential that the pointer into that
memory is declared with the volatile qualifier (shown in red below) to force the
compiler to write the intermediate values out to memory after each step rather than
holding the values (smem[tid] in the example below) in registers.
Code such as this that omits the volatile qualifier will not work correctly on Fermi due
to enhanced compiler optimizations. In the example below, the volatile qualifier tells the
compiler that it must store smem[tid] back to shared memory after every assignment
rather than simply allocating a register and optimizing away the stores to shared
memory. ”
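The pattern the guide describes can be sketched as follows (the shared-array size and surrounding kernel context here are assumptions for illustration only):

```cuda
__global__ void reduceMinKernel(/* ... */)
{
    __shared__ int sdata[256];  // size assumed for illustration

    // ... strided reduction phase with __syncthreads() between steps ...

    unsigned int tid = threadIdx.x;
    if (tid < 32) {
        // Alias the shared array through a volatile-qualified pointer, as
        // the guide recommends: each assignment to smem[tid] is then
        // written out to shared memory rather than optimized into a
        // register, so the other threads in the warp observe it.
        volatile int *smem = sdata;
        smem[tid] = min(smem[tid], smem[tid + 32]);
        smem[tid] = min(smem[tid], smem[tid + 16]);
        // ... remaining steps down to smem[tid + 1] ...
    }
}
```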
This document ships free with CUDA Toolkit 4.2.
Happy coding~
[attach]3148[/attach]