Question: results computed with CUDA are unstable

I have run into a very strange problem:
1) Written as follows:
if (tid == 0) // Thread 0 takes a ticket
{
    zerocountPerBlock0[bid] = zerocount0 + zerocount1;
    __threadfence();
    unsigned int ticket = atomicInc(&retirementCount, gridDim.x);
    amLast = (ticket == gridDim.x - 1);
}
__syncthreads();

if (amLast)
{
    int ii;
    for (ii = 0; ii < gridDim.x; ii++)
    {
        if (tid == 0)
        {
            zerosum[0] += zerocountPerBlock0[ii];
            zerocountPerBlock[ii] = zerosum[0] - zerocountPerBlock0[ii];
            retirementCount = 0;
        }
        __threadfence_block();
        __syncthreads();
    }
}

if (bid == 0)
{
    for (i = 0; i < gridDim.x; i++)
    {
        __syncthreads();
        if (bid == 0 && tid < blockDim.x - zerocountPerBlock0[i])
            g_after_Triindices[tid + i*blockDim.x - zerocountPerBlock[i]] = tmpTriindices[tid + i*blockDim.x];
    }
}
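Here zerocountPerBlock0[i] is the total number of zeros in tmpTriindices within block i, and zerocountPerBlock[i] is the total number of zeros in tmpTriindices over the blocks before block i (an exclusive running sum), i.e. zerocountPerBlock[0]=0 and zerocountPerBlock[1]=zerocountPerBlock0[0]. However, the final values in g_after_Triindices are wrong from g_after_Triindices[0+1*blockDim.x-zerocountPerBlock[1]] onward.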
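One possible reading of the symptom in 1): nothing makes block 0 wait until the last block has finished filling zerocountPerBlock, so block 0 may read it too early. Below is a minimal sketch, not the original code, that sidesteps the ordering question by splitting the work into three kernel launches. Launches on the same stream run in order, so each launch boundary acts as a grid-wide barrier. The kernel names, the host wrapper compactOnGpu and the test data are hypothetical; the sketch also assumes that the non-zero entries of each block's segment come first, as the index arithmetic above suggests, and it omits CUDA error checking for brevity.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Kernel 1: each block counts the zeros in its own segment of `in`.
__global__ void countZeros(const int *in, int *zerosPerBlock)
{
    __shared__ int count;
    if (threadIdx.x == 0) count = 0;
    __syncthreads();
    if (in[blockIdx.x * blockDim.x + threadIdx.x] == 0) atomicAdd(&count, 1);
    __syncthreads();
    if (threadIdx.x == 0) zerosPerBlock[blockIdx.x] = count;
}

// Kernel 2: a single thread builds the exclusive running sum.
// With only a few dozen blocks a serial loop is enough.
__global__ void exclusiveScan(const int *zerosPerBlock, int *zerosBefore, int n)
{
    int sum = 0;
    for (int i = 0; i < n; ++i) {
        zerosBefore[i] = sum;
        sum += zerosPerBlock[i];
    }
}

// Kernel 3: each block copies its leading non-zero entries to the packed position.
__global__ void scatter(const int *in, const int *zerosPerBlock,
                        const int *zerosBefore, int *out)
{
    int b = blockIdx.x, t = threadIdx.x;
    if (t < blockDim.x - zerosPerBlock[b])
        out[t + b * blockDim.x - zerosBefore[b]] = in[t + b * blockDim.x];
}

// Hypothetical host wrapper; in.size() must be a multiple of `threads`.
std::vector<int> compactOnGpu(const std::vector<int> &in, int threads)
{
    int blocks = (int)in.size() / threads;
    size_t bytes = in.size() * sizeof(int);
    int *dIn, *dOut, *dPer, *dBefore;
    cudaMalloc(&dIn, bytes);
    cudaMalloc(&dOut, bytes);
    cudaMalloc(&dPer, blocks * sizeof(int));
    cudaMalloc(&dBefore, blocks * sizeof(int));
    cudaMemcpy(dIn, in.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemset(dOut, 0, bytes);

    // Each launch starts only after the previous one has completed.
    countZeros<<<blocks, threads>>>(dIn, dPer);
    exclusiveScan<<<1, 1>>>(dPer, dBefore, blocks);
    scatter<<<blocks, threads>>>(dIn, dPer, dBefore, dOut);

    std::vector<int> out(in.size());
    cudaMemcpy(out.data(), dOut, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dIn); cudaFree(dOut); cudaFree(dPer); cudaFree(dBefore);
    return out;
}

int main()
{
    const int threads = 64, blocks = 35;   // sizes mentioned in this thread
    std::vector<int> in(threads * blocks), expected;
    for (int b = 0; b < blocks; ++b)
        for (int t = 0; t < threads; ++t) {
            // Made-up data: block b ends with (b % 5) zeros.
            int v = (t < threads - b % 5) ? b * threads + t + 1 : 0;
            in[b * threads + t] = v;
            if (v != 0) expected.push_back(v);
        }
    expected.resize(in.size(), 0);          // the tail of the output stays zero

    std::vector<int> out = compactOnGpu(in, threads);
    printf("%s\n", out == expected ? "match" : "mismatch");
    return 0;
}
```

The extra launches cost a little overhead, but no block ever reads a value that another block of the same launch is still producing.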

2) If instead I use the atomic function like this:
if (tid == 0) // Thread 0 takes a ticket
{
    zerocountPerBlock0[bid] = zerocount0 + zerocount1;
    __threadfence();
    unsigned int ticket = atomicInc(&retirementCount, gridDim.x);
    amLast = (ticket == gridDim.x - 1);
}
__syncthreads();

if (amLast)
{
    int ii;
    for (ii = 0; ii < gridDim.x; ii++)
    {
        if (tid == 0)
        {
            zerosum[0] += zerocountPerBlock0[ii];
            zerocountPerBlock[ii] = zerosum[0] - zerocountPerBlock0[ii];
            retirementCount = 0;
        }
        __threadfence_block();
        __syncthreads();
        if (tid < blockDim.x - zerocountPerBlock0[ii])
        {
            g_after_Triindices[tid + ii*blockDim.x - zerocountPerBlock[ii]] = tmpTriindices[tid + ii*blockDim.x];
        }
    }
}
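g_after_Triindices[0], g_after_Triindices[1], … and so on show the correct values. But with something like
if(tid==0 && bid ==0) A=g_after_Triindices[tid+bidc], that is, as soon as the index involves values of the form tid, bid, bidc and so on, the result is wrong: A is always just 0. Other arrays that are not assigned this way, such as tmpTriindices, still show the correct values for tmpTriindices[tid+bidc].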

3)
if (tid == 0) // Thread 0 takes a ticket
{
    zerocountPerBlock0[bid] = zerocount0 + zerocount1;
    __threadfence();
    unsigned int ticket = atomicInc(&retirementCount, gridDim.x);
    amLast = (ticket == gridDim.x - 1);
}
__syncthreads();

if (amLast)
{
    int ii;
    for (ii = 0; ii < gridDim.x; ii++)
    {
        if (tid == 0)
        {
            zerosum[0] += zerocountPerBlock0[ii];
            zerocountPerBlock[ii] = zerosum[0] - zerocountPerBlock0[ii];
            retirementCount = 0;
        }
        __threadfence_block();
        __syncthreads();
    }
}

if (amLast)
{
    for (i = 0; i < gridDim.x; i++)
    {
        __syncthreads();
        if (tid < blockDim.x - zerocountPerBlock0[i])
        {
            g_after_Triindices[tid + ii*blockDim.x - zerocountPerBlock[i]] = tmpTriindices[tid + i*blockDim.x];
        }
    }
}

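If, instead of continuing inside the first if(amLast), I start a new, separate block for the copy, then the results in g_after_Triindices are wrong no matter what. That is, both ways of reading the output described in 2) give wrong results, just as in case 1).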

What on earth is the cause of this? I would appreciate any advice... thx!

http://cudabbs.it168.com/thread-659-1-1.html You could take a look at this thread.

Was this computed on the GPU, or in CPU emulation?

They are all results from Release builds; the only difference is the three ways of writing the code.

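I suspect that using the atomic operation might cause some kind of deadlock. I learned this use of the atomic operation from the manual and from the reduction sample. The number of threads in each block (64) is larger than the number of blocks in the grid (35).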
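On the deadlock suspicion: as far as I know, atomicInc does not take a lock, so on its own it should not deadlock. What the ticket pattern does rely on is that every thread of a block sees the same amLast (for example by declaring it __shared__), and that only the last block reads what the other blocks wrote. Here is a minimal single-kernel sketch of that pattern, not the original code. The kernel name, the wrapper compactSinglePass and the test data are hypothetical. It assumes the input array is filled by the host before the launch and that each block's non-zero entries come first; if the input were written inside the same kernel, every writing thread would presumably need its own __threadfence() before the ticket is taken. Error checking is omitted.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__device__ unsigned int retirementCount = 0;

__global__ void compactLastBlock(const int *in, volatile int *zerosPerBlock, int *out)
{
    __shared__ int count;     // zeros found in this block's segment
    __shared__ bool amLast;   // shared, so the whole block takes the same branch
    int tid = threadIdx.x, bid = blockIdx.x;

    if (tid == 0) count = 0;
    __syncthreads();
    if (in[bid * blockDim.x + tid] == 0) atomicAdd(&count, 1);
    __syncthreads();

    if (tid == 0) {
        zerosPerBlock[bid] = count;
        __threadfence();      // publish the count before taking the ticket
        unsigned int ticket = atomicInc(&retirementCount, gridDim.x);
        amLast = (ticket == gridDim.x - 1);
    }
    __syncthreads();

    if (amLast) {
        // Only the last block gets here, after all counts have been published.
        // Each thread keeps its own running sum in a register, so the loop
        // needs no shared accumulator and no __syncthreads().
        int zerosBefore = 0;
        for (int b = 0; b < (int)gridDim.x; ++b) {
            int zeros = zerosPerBlock[b];   // volatile read, not served from a stale cache line
            if (tid < (int)blockDim.x - zeros)
                out[tid + b * blockDim.x - zerosBefore] = in[tid + b * blockDim.x];
            zerosBefore += zeros;
        }
        if (tid == 0) retirementCount = 0;  // reset for the next launch
    }
}

// Hypothetical host wrapper; in.size() must be a multiple of `threads`.
std::vector<int> compactSinglePass(const std::vector<int> &in, int threads)
{
    int blocks = (int)in.size() / threads;
    size_t bytes = in.size() * sizeof(int);
    int *dIn, *dOut, *dPer;
    cudaMalloc(&dIn, bytes);
    cudaMalloc(&dOut, bytes);
    cudaMalloc(&dPer, blocks * sizeof(int));
    cudaMemcpy(dIn, in.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemset(dOut, 0, bytes);

    compactLastBlock<<<blocks, threads>>>(dIn, dPer, dOut);

    // The packed result is read only after the kernel has finished.
    std::vector<int> out(in.size());
    cudaMemcpy(out.data(), dOut, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dIn); cudaFree(dOut); cudaFree(dPer);
    return out;
}

int main()
{
    const int threads = 64, blocks = 35;
    std::vector<int> in(threads * blocks), expected;
    for (int b = 0; b < blocks; ++b)
        for (int t = 0; t < threads; ++t) {
            // Made-up data: block b ends with (b % 5) zeros.
            int v = (t < threads - b % 5) ? b * threads + t + 1 : 0;
            in[b * threads + t] = v;
            if (v != 0) expected.push_back(v);
        }
    expected.resize(in.size(), 0);

    std::vector<int> out = compactSinglePass(in, threads);
    printf("%s\n", out == expected ? "match" : "mismatch");
    return 0;
}
```

In this sketch the packed array is complete only once the last block has finished, so any other block that reads g_after_Triindices-style output inside the same kernel may still see old values; reading it back after the kernel, as the wrapper does, avoids that.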