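A question about atomic operations

system · March 14, 2013, 08:31

My GPU is compute capability 1.3, which should support atomic operations on int values, yet it cannot run the global-memory atomic operations example from Chapter 9 of 《CUDA实战》 (CUDA by Example): at run time it reports "unknown error". The same compiled executable does run correctly on a compute capability 2.0 card. What puzzles me is not that, though. My card can run other programs that use atomic operations without any problem. Can anyone explain this?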

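system · March 14, 2013, 08:42

Hello. An unknown error usually means that your kernel made an illegal memory access (an out-of-bounds access can cause this, for example, although an out-of-bounds access may also go unnoticed).

Since the program runs on your 2.0 card, the address used by the atomic operation is apparently valid on 2.0 but invalid on 1.3.
(Given that 2.x and later devices support generic addressing, and may also use a different minimum allocation granularity, this is possible.)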
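To make the point about error reporting concrete: kernel launches are asynchronous, so an error raised inside a kernel is often reported only by a later call such as cudaMemcpy. Below is a minimal sketch of checking directly after the launch, so that the reported location points at the kernel itself. The helper `check`, the macro `CHECK`, the kernel `count_kernel`, and the function `run_checked_launch` are hypothetical names made up for illustration; none of them comes from the book.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not from the book): report a failed call with its location.
static bool check(cudaError_t err, const char *what, const char *file, int line) {
    if (err == cudaSuccess) return true;
    fprintf(stderr, "%s failed at %s:%d: %s\n",
            what, file, line, cudaGetErrorString(err));
    return false;
}
#define CHECK(call) check((call), #call, __FILE__, __LINE__)

// Toy kernel for illustration: a grid-stride loop with a global-memory atomicAdd.
__global__ void count_kernel(const unsigned char *buffer, long size,
                             unsigned int *histo) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int stride = blockDim.x * gridDim.x;
    while (i < size) {
        atomicAdd(&histo[buffer[i]], 1u);
        i += stride;
    }
}

// Returns true if allocation, launch, and kernel execution all succeeded.
static bool run_checked_launch(long size) {
    unsigned char *dev_buffer = NULL;
    unsigned int *dev_histo = NULL;
    bool ok = CHECK(cudaMalloc((void**)&dev_buffer, size))
           && CHECK(cudaMemset(dev_buffer, 0, size))
           && CHECK(cudaMalloc((void**)&dev_histo, 256 * sizeof(unsigned int)))
           && CHECK(cudaMemset(dev_histo, 0, 256 * sizeof(unsigned int)));
    if (ok) {
        count_kernel<<<16, 256>>>(dev_buffer, size, dev_histo);
        // First call catches launch-time errors (e.g. a bad configuration);
        // second call waits for the kernel and catches errors raised while it ran.
        ok = CHECK(cudaGetLastError())
          && CHECK(cudaDeviceSynchronize());
    }
    cudaFree(dev_histo);   // cudaFree(NULL) is a harmless no-op
    cudaFree(dev_buffer);
    return ok;
}
```

Calling `run_checked_launch(1 << 20)` from `main` prints nothing on success. On failure, the message names the call that failed, which separates a launch problem from a fault during kernel execution.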

I have not bought 《CUDA实战》, so could you post the Chapter 9 example to this forum? Then the forum can take a look at it for you.

system · March 15, 2013, 01:25

/*
 * Copyright 1993-2010 NVIDIA Corporation.  All rights reserved.
 *
 * NVIDIA Corporation and its licensors retain all intellectual property and 
 * proprietary rights in and to this software and related documentation. 
 * Any use, reproduction, disclosure, or distribution of this software 
 * and related documentation without an express license agreement from
 * NVIDIA Corporation is strictly prohibited.
 *
 * Please refer to the applicable NVIDIA end user license agreement (EULA) 
 * associated with this source code for terms and conditions that govern 
 * your use of this NVIDIA software.
 * 
 */


#include "../common/book.h"

#define SIZE    (100*1024*1024)


__global__ void histo_kernel( unsigned char *buffer,
                              long size,
                              unsigned int *histo ) {
   // calculate the starting index and the offset to the next
   // block that each thread will be processing
   int i = threadIdx.x + blockIdx.x * blockDim.x;
   int stride = blockDim.x * gridDim.x;
   while (i < size) {
      atomicAdd( &histo[buffer[i]], 1 );
      i += stride;
   }
}

int main( void ) {
   unsigned char *buffer =
      (unsigned char*)big_random_block( SIZE );

   // capture the start time
   // starting the timer here so that we include the cost of
   // all of the operations on the GPU.
   cudaEvent_t     start, stop;
   HANDLE_ERROR( cudaEventCreate( &start ) );
   HANDLE_ERROR( cudaEventCreate( &stop ) );
   HANDLE_ERROR( cudaEventRecord( start, 0 ) );

   // allocate memory on the GPU for the file's data
   unsigned char *dev_buffer;
   unsigned int *dev_histo;
   HANDLE_ERROR( cudaMalloc( (void**)&dev_buffer, SIZE ) );
   HANDLE_ERROR( cudaMemcpy( dev_buffer, buffer, SIZE,
                             cudaMemcpyHostToDevice ) );

   HANDLE_ERROR( cudaMalloc( (void**)&dev_histo,
                             256 * sizeof( int ) ) );
   HANDLE_ERROR( cudaMemset( dev_histo, 0,
                             256 * sizeof( int ) ) );

   // kernel launch - 2x the number of mps gave best timing
   cudaDeviceProp  prop;
   HANDLE_ERROR( cudaGetDeviceProperties( &prop, 0 ) );
   int blocks = prop.multiProcessorCount;
   histo_kernel<<<blocks*2,256>>>( dev_buffer, SIZE, dev_histo );
   
   unsigned int    histo[256];
   HANDLE_ERROR( cudaMemcpy( histo, dev_histo,
                             256 * sizeof( int ),
                             cudaMemcpyDeviceToHost ) );

   // get stop time, and display the timing results
   HANDLE_ERROR( cudaEventRecord( stop, 0 ) );
   HANDLE_ERROR( cudaEventSynchronize( stop ) );
   float   elapsedTime;
   HANDLE_ERROR( cudaEventElapsedTime( &elapsedTime,
                                       start, stop ) );
   printf( "Time to generate:  %3.1f ms\n", elapsedTime );

   long histoCount = 0;
   for (int i=0; i<256; i++) {
      histoCount += histo[i];
   }
   printf( "Histogram Sum:  %ld\n", histoCount );

   // verify that we have the same counts via CPU
   for (int i=0; i<SIZE; i++)
      histo[buffer[i]]--;
   for (int i=0; i<256; i++) {
      if (histo[i] != 0)
         printf( "Failure at %d!  Off by %d\n", i, histo[i] );
   }

   HANDLE_ERROR( cudaEventDestroy( start ) );
   HANDLE_ERROR( cudaEventDestroy( stop ) );
   cudaFree( dev_histo );
   cudaFree( dev_buffer );
   free( buffer );
   return 0;
}

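system · March 15, 2013, 06:19

Hello. Having read through your code, I do not see any obvious problem.

So I suggest the following:
(1) Please post the line number reported when the error occurs, to help narrow things down. (I do not have the supporting code your program depends on, nor an old 1.3 card, so I cannot test it on a machine for you.)
(2) If you are using the book's precompiled binary, I suggest recompiling it yourself.
(3) If possible, upgrade the driver and toolkit to the latest versions, to rule out a compiler bug.
(4) I would also ask other members, moderators, NVIDIA official technical support, or the chief moderator to help take a look.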
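As a companion to suggestions (1) and (2), here is a minimal sketch for confirming what the runtime reports about each card before recompiling. It relies on the documented rule that atomic functions on 32-bit integers in global memory require compute capability 1.1 or higher. The function names `supports_global_int_atomics` and `report_devices` are hypothetical and not part of any library.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Global-memory atomics on 32-bit integers need compute capability >= 1.1.
static bool supports_global_int_atomics(int major, int minor) {
    return major > 1 || (major == 1 && minor >= 1);
}

// Prints the name and compute capability of every visible device.
// Returns the device count, or -1 if a runtime call failed.
static int report_devices(void) {
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount: %s\n", cudaGetErrorString(err));
        return -1;
    }
    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        err = cudaGetDeviceProperties(&prop, d);
        if (err != cudaSuccess) {
            fprintf(stderr, "cudaGetDeviceProperties(%d): %s\n",
                    d, cudaGetErrorString(err));
            return -1;
        }
        printf("Device %d: %s, compute capability %d.%d, "
               "global int atomics: %s\n",
               d, prop.name, prop.major, prop.minor,
               supports_global_int_atomics(prop.major, prop.minor) ? "yes" : "no");
    }
    return count;
}
```

When recompiling for a 1.3 card with a toolkit from that period, the target architecture can be given explicitly, for example `nvcc -arch=sm_13 histo.cu` (the file name here is a placeholder).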

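I have no better suggestions beyond these.

system · March 15, 2013, 07:26

Thank you, moderator, for the patient reply! I will give it a try. I was simply puzzled by this behavior, so I asked.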