G8x硬件下原子函数实现问题

system · 2013 年11 月 26 日 01:50

由于G8x硬件不支持原子操作，实现的原子函数如下：
这是我看文章的源代码：device void addData256( volatile unsigned int *s_WarpHist, unsigned int data, unsigned int threadTag )
{ unsigned int count;
do{ count = s_WarpHist[data] & 0x07FFFFFFU;
count = threadTag | (count + 1);
s_WarpHist[data] = count;
}while(s_WarpHist[data] != count);
}

这个程序共享存储器的组织形式：行：WARP_N（每个线程块的warp个数）列：灰度个数（每行记录一个warp线程指向像素的灰度的分布，每个warp有自己的灰度分布计数数组，这里warp是怎么实现原子操作的呢？）。原文的解释是这样的，但是还是不懂，尤其是do…while循环中的第二行“位或”，这里得出的count代表什么了，还有这里的volatile有什么作用，红字具体指的是什么操作？请版主解惑：
addData256() is the core of the 256-bin histogram implementation. Let’s describe its logic in detail. According to data value (lying within 0 … 255 range), read from global memory, each warp thread must increment corresponding value in the s_WarpHistp array – a “frame”(row) within s_Hist array, corresponding to current warp. Each warp thread reads current warp counter s_WarpHist[data], corresponding to data value, then locally increments, tags it by warp-local thread ID (equal to threadIdx.x % 32), and writes it back to the same s_WarpHist[data] position. In case each warp thread received unique data values (from global memory), there are no collisions at all and no additional actions need to be done. Otherwise, when two or more threads collide on the same bin counter, the hardware performs shared memory write combining, resulting in acceptance of the tagged counter from one thread and rejection from all other pending threads. After the write attempt each thread queries the shared memory count value (the same s_WarpHist[data]) and owing to the tag decides whether its pending increment made its way to shared memory. If true, it becomes idle (masked out by hardware) until the entire warp is done (all the collisions are resolved). Otherwise, some other thread has submitted its increment into s_WarpHist, and current thread needs to grab the new counter value and perform the same actions. Since each warp is isolated and warp threads are always synchronized we do not rely on warp scheduling order (which is undefined). Not more than after 32 loop iterations all the warp threads submit their increments into s_WarpHist.

system · 2013 年11 月 26 日 03:11

贴上线程自己的标签而已，然后自我检测是否自己的操作成功。这个方法很好的。

system · 2013 年11 月 26 日 03:12

这里“|”不是位或操作符吗，怎么起到贴上标签的作用的，难道在GPU中含义与CPU不同？

system · 2013 年11 月 26 日 03:20

线程标签占据一些BIT位，计数占据另外一些BIT位。

system · 2013 年11 月 26 日 03:27

此问题版主们已经回答过多次了，楼主询问前不妨在本论坛搜索，比你发帖更快！

请参考http://cudazone.nvidia.cn/forum/forum.php?mod=viewthread&tid=7624&page=1

system · 2013 年11 月 26 日 03:53

哦，我这里又想起个问题：
就是在高配置的GPU上，这种方式和直接原子操作，哪个效率更高？
哪种方式是NV推荐的？

system · 2013 年11 月 26 日 03:57

我只说我的推荐吧：

您应当毫不犹豫的在有atomicAdd（整数版本）的卡上使用atomicAdd.
而不要使用该变通。

感谢来访。

system · 2013 年11 月 26 日 03:59

OK，明白了。多谢！

system · 2013 年11 月 26 日 07:37

您客气了，服务您是我们的荣幸。

感谢您的莅临。