A question about thread synchronization in ReduceSum

otakuxiang · November 13, 2020, 11:22

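I am using CUDA to do a ReduceSum over the 6x6 matrices that each thread produces inside a kernel, and I found that the result depends on how many __syncthreads() calls I insert in the middle, which is strange. The code is as follows: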

__syncthreads();
for (int ii = 0; ii < 6; ii++) for (int jj = 0; jj < 6; jj++) {
    dim_shared1[tid] = new_H(ii, jj);
    __syncthreads();
    for (int s = 128; s > 0; s >>= 1) {
        if (tid < s) {
            dim_shared1[tid] += dim_shared1[tid + s];
        }
        __syncthreads();
        __syncthreads();
        __syncthreads();
        __syncthreads();
        __syncthreads();
        __syncthreads();
        __syncthreads();
        __syncthreads();
    }
    __syncthreads();
    __syncthreads();
    __syncthreads();
    if (tid == 0) {
        atomicAdd(&acc_H->at(ii, jj), dim_shared1[0]);
    }
}

If I replace the run of consecutive __syncthreads(); calls in the middle with a single __syncthreads();, the result becomes wrong. What is causing this?
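
For comparison, here is a minimal sketch of a shared-memory block reduction in which one __syncthreads() per step is normally sufficient. It is not a diagnosis of the snippet above. It assumes things the post does not state: the block has exactly 256 threads (to match the loop starting at s = 128), the shared array has at least 256 elements, and every thread in the block reaches every barrier (calling __syncthreads() inside a branch that only some threads take is undefined behavior). The names block_reduce_sum, reduce_sum_on_gpu and kBlockSize are hypothetical, and plain floats stand in for the entries of new_H.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Assumption: the block size matches the hard-coded "s = 128" starting stride.
constexpr int kBlockSize = 256;

// Each block sums up to kBlockSize input values and adds its total to *out.
__global__ void block_reduce_sum(const float* in, int n, float* out) {
    __shared__ float buf[kBlockSize];
    const int tid = threadIdx.x;
    const int gid = blockIdx.x * blockDim.x + tid;

    // Out-of-range threads contribute 0 but still take part in every barrier.
    buf[tid] = (gid < n) ? in[gid] : 0.0f;
    __syncthreads();

    for (int s = kBlockSize / 2; s > 0; s >>= 1) {
        if (tid < s) {
            buf[tid] += buf[tid + s];
        }
        // One barrier per step, placed outside the divergent branch.
        __syncthreads();
    }

    if (tid == 0) {
        atomicAdd(out, buf[0]);
    }
}

// Hypothetical host wrapper; assumes host_in is non-empty.
// Error checking is omitted to keep the sketch short.
float reduce_sum_on_gpu(const std::vector<float>& host_in) {
    const int n = static_cast<int>(host_in.size());
    float* d_in = nullptr;
    float* d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, host_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, sizeof(float));

    const int blocks = (n + kBlockSize - 1) / kBlockSize;
    block_reduce_sum<<<blocks, kBlockSize>>>(d_in, n, d_out);

    float result = 0.0f;
    cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return result;
}
```

If a reduction of this shape only gives the right answer with extra barriers, the things I would check first are whether the launch really uses 256 threads per block, whether the shared array is large enough for dim_shared1[tid + s], and whether the whole loop sits inside a condition that some threads in the block skip.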