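How to measure peak GPU floating-point performance with CUDA

system · May 2, 2013, 14:34

I have seen several examples on this forum and on Stack Overflow that measure a GPU's floating-point throughput.
The code below is typical: the kernel consists of a large number of MAD operations and performs no memory reads. (Many variants of this code are easy to find.)
Someone ran it on an M2090 in 2.16 ms, which works out to 1000*1024*30*80 / 2.16 ms ≈ 1.14 TFLOPS (quoted there as 1.15 TFLOPS), against a theoretical peak of 1.33 TFLOPS.

Problem: I am using an NVS 4200M (compute capability 2.1) on Windows 7 x64 with VS2010 and CUDA 5.0. The kernel takes 55 ms, which is only 45 GFLOPS, while the theoretical peak of this card is roughly 100 GFLOPS. I ran similar tests on a C2070 and the results were just as poor.

What could cause such a large gap between measured and theoretical performance? Discussion is welcome, and thanks in advance for any replies.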

#include <stdio.h>

// Compute-only kernel: 40 multiply-adds per loop iteration, no memory reads.
__global__ void test(float loop, float *out)
{
   register float a=1.0f;
   register float b=1.0f;
   register float c=1.0f;
   register float d=1.0f;
   register float e=1.0f;
   register float f=1.0f;
   register float g=1.0f;
   register float h=1.0f;

   for (float x=0;x<loop;x++)
   {
   a+=x*loop;
   b+=x*loop;
   c+=x*loop;
   d+=x*loop;
   e+=x*loop;
   f+=x*loop;
   g+=x*loop;
   h+=x*loop;

   a+=x*loop;
   b+=x*loop;
   c+=x*loop;
   d+=x*loop;
   e+=x*loop;
   f+=x*loop;
   g+=x*loop;
   h+=x*loop;

   a+=x*loop;
   b+=x*loop;
   c+=x*loop;
   d+=x*loop;
   e+=x*loop;
   f+=x*loop;
   g+=x*loop;
   h+=x*loop;

   a+=x*loop;
   b+=x*loop;
   c+=x*loop;
   d+=x*loop;
   e+=x*loop;
   f+=x*loop;
   g+=x*loop;
   h+=x*loop;

   a+=x*loop;
   b+=x*loop;
   c+=x*loop;
   d+=x*loop;
   e+=x*loop;
   f+=x*loop;
   g+=x*loop;
   h+=x*loop;
   }
   if (out!=NULL) *out=a+b+c+d+e+f+g+h;
}

int main(int argc, char *argv[])
{
   float timestamp;
   cudaEvent_t event_start,event_stop;
   // Initialise
   cudaDeviceReset();
   cudaSetDevice(0);
   cudaThreadSetCacheConfig(cudaFuncCachePreferShared);
   // Allocate and generate buffers
   cudaEventCreate(&event_start);
   cudaEventCreate(&event_stop);
   cudaEventRecord(event_start, 0);
   dim3 threadsPerBlock;
   dim3 blocks;
   threadsPerBlock.x=32;
   threadsPerBlock.y=32;
   threadsPerBlock.z=1;
   blocks.x=1;
   blocks.y=1000;
   blocks.z=1;

   test<<<blocks,threadsPerBlock,0>>>(30,NULL);

   cudaEventRecord(event_stop, 0);
   cudaEventSynchronize(event_stop);
   cudaEventElapsedTime(&timestamp, event_start, event_stop);
   printf("Calculated in %f\n", timestamp);
}

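system · May 2, 2013, 14:48

Three things might affect performance. Note that I say might.

(1) Try launching your kernel with a block size of (16,16).
(2) Try compiling with -O2.
(3) Try launching the kernel once beforehand as a warm-up, and use the timing of the second launch.

Please try these and post the new results.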
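
Suggestion (3) can be sketched as a generic host-side harness. Note that this swaps in std::chrono wall-clock timing instead of the cudaEvent timers used in this thread, and the name time_after_warmup_ms is made up for illustration. Because kernel launches are asynchronous, a callable that launches a kernel would also have to wait for it to finish (for example with cudaDeviceSynchronize()) before returning.

```cpp
#include <chrono>

// Runs `work` once untimed as a warm-up, then returns the wall-clock
// duration of a second run in milliseconds.
template <typename Work>
double time_after_warmup_ms(Work work) {
    work();  // warm-up run: absorbs one-time costs such as context creation
    const auto start = std::chrono::steady_clock::now();
    work();  // timed run
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

The callable runs exactly twice per measurement, and only the second run contributes to the returned time.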

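system · May 2, 2013, 15:00

Thank you for the reply. I ran the following tests as you suggested.

With a (16,16) block the runtime is 13.65 ms, which is still 45 GFLOPS.
Adding -O2 under Additional Compiler Options in the CUDA C/C++ settings leaves the runtime essentially unchanged.
Launching the kernel twice and timing the second launch leaves the runtime essentially unchanged.

I still cannot see why the gap is so large.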

system · May 2, 2013, 15:02

I don't understand it either. Let's wait for others to reply.

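system · May 2, 2013, 15:07

Thanks.

There is also code like the following. I have tested it many times on the NVS 4200M with many different parameters, and it basically always runs below 10 GFLOPS. On the C2070 it reaches only a little over 50 GFLOPS, far from the theoretical peak.

These kernels are extremely simple. They are not memory-bound, and the arithmetic should let the GPU perform well, yet the measured numbers fall well short. (The pages that provide this code, however, report many good results, mostly around 90% of the theoretical peak.)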

#include <stdio.h>
#include <stdlib.h>
#include <cuda_profiler_api.h>

#define LOOP (100000)
#define BLOCKS (200)
#define THPB (256)

__global__ void new_ker(float *x)
{
  int index = threadIdx.x+blockIdx.x*blockDim.x;
  float a = 1.0f, b = -1.0f;

  for(int i = 0; i < LOOP; i++){
   a = a*b + b;
  }  

  x[index] = a;
}

int main(int argc,char **argv)
{

   //Initializations
   float *x;
   float *dx;
   cudaEvent_t new_start,new_stop;
   float elapsed;
   double gflops;
   x = (float *)malloc(sizeof(float)*THPB*BLOCKS);
   cudaProfilerStart();
   // One float per thread: the kernel writes THPB*BLOCKS elements.
   cudaMalloc((void **)&dx,sizeof(float)*THPB*BLOCKS);

   //ILP=1  
   cudaEventCreate(&new_start);
   cudaEventCreate(&new_stop);

   printf("Kernel1:\n");
   cudaEventRecord(new_start, 0);

   new_ker<<<BLOCKS,THPB>>>(dx);

   cudaEventRecord(new_stop,0);
   cudaEventSynchronize(new_stop);
   cudaEventElapsedTime(&elapsed,new_start,new_stop);
   
   cudaMemcpy(x,dx,sizeof(float)*THPB*BLOCKS,cudaMemcpyDeviceToHost);

   cudaEventDestroy(new_start);
   cudaEventDestroy(new_stop);
   cudaDeviceReset();
   cudaProfilerStop();

   printf("\t%f\n",elapsed);
   gflops = 2.0e-6 * ((double)(LOOP)*(double)(THPB*BLOCKS)/(double)elapsed);
   printf("\t%f\n",gflops);

   return 0;
}

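system · May 2, 2013, 15:32

Hello OP, you could try running it under NVVP and see whether it gives any hints. Adjusting your code according to the NVVP results might give you some insight.

Good luck~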

system · May 2, 2013, 16:46

I used a GTX 650 and compiled with nvcc *.cu -arch=sm_30.
With float I get (elapsed ms / GFLOPS)
18.876352/542.477690
The theoretical value is 812.5 GFLOPS.
With double I get
149.229248/34.309628

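system · May 2, 2013, 19:57

I can't make sense of this. Very strange.

Also, I suggest adding #pragma unroll to the loop in the second kernel. The loop body is too small.

Hopefully someone else can give an answer.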

system · May 3, 2013, 02:10

Thanks.
The second kernel originally had unroll, but it made no significant difference to the result.

system · May 3, 2013, 02:11

Your result is pretty good. With some parameter tuning it should reach 90% of the theoretical peak.

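system · May 3, 2013, 02:17

Thanks.
I ran it under NVVP earlier.
It reported essentially no issues, apart from a few kernel concurrency warnings (clearly unrelated to this).
In the details, everything that should be 100% is 100% and everything that should be 0% is 0%, so it all looks normal.
I'm rather puzzled.
The build settings follow the projects in the CUDA SDK.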

system · May 3, 2013, 02:21

On a Tesla C2070 I tried all 64 combinations of threadPerBlock = {32, 64, 128, 192, 256, 320, 384, 512} and blocks = {1, 2, 4, 16, 32, 64, 128, 256}. The highest result was only 63 GFLOPS, while the theoretical peak is 1.03 TFLOPS.
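
A sweep like this can be organised as below. This is only a sketch: sweep_configs and best_gflops are made-up names, and `measure` stands for whatever caller-supplied routine launches the kernel with one configuration and returns its GFLOP/s.

```cpp
#include <utility>
#include <vector>

// All (threads per block, blocks) pairs of the sweep described above.
std::vector<std::pair<int, int>> sweep_configs() {
    const int threads[] = {32, 64, 128, 192, 256, 320, 384, 512};
    const int blocks[] = {1, 2, 4, 16, 32, 64, 128, 256};
    std::vector<std::pair<int, int>> configs;
    for (int t : threads)
        for (int b : blocks)
            configs.emplace_back(t, b);
    return configs;
}

// Returns the highest GFLOP/s that `measure(threads, blocks)` reports
// over the whole sweep.
template <typename Measure>
double best_gflops(Measure measure) {
    double best = 0.0;
    for (const auto& cfg : sweep_configs()) {
        const double gflops = measure(cfg.first, cfg.second);
        if (gflops > best) best = gflops;
    }
    return best;
}
```

The two lists of 8 values give the 64 configurations mentioned in the post.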

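system · May 3, 2013, 04:20

Hello,
I just tried again, this time without Visual Studio 2010.
I compiled on the command line with nvcc kernel.cu -arch=sm_21.
The best run reached 103 GFLOPS (which now seems a little high: CUDA-Z measures the peak performance of the NVS 4200M at just 103 GFLOPS).

Could something be wrong with the Visual Studio build settings? My CUDA C/C++ build settings follow the projects in the CUDA 5.0 SDK.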

system · May 3, 2013, 04:25

Based on your reply, my guess is that you were building and running in debug mode in VS2010.
Try building in release mode (not debug) in VS2010 and test again.

system · May 3, 2013, 04:26

Thanks. Yes, I was running in debug mode. Does that make a big difference? Why?

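system · May 3, 2013, 04:28

Debug is the debugging mode, and to support debugging it necessarily adds some "extra" code (I don't know the details, and I hope moderator 横扫 can give a fuller answer). That's the general idea.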

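system · May 3, 2013, 04:30

Following your suggestion I switched to release mode, and the speed did improve dramatically.
May I ask why debug and release differ so much? Or is it because some build settings in the debug configuration are wrong?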

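system · May 3, 2013, 04:32

Hmm, so my understanding was a bit off. I thought debug and release differed only in compiler options and never considered that debug "adds extra code". How frustrating, I keep running into this kind of problem.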
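
On the Debug versus Release question: as far as I know, the Debug configuration of a Visual Studio CUDA project passes -G to nvcc by default, which generates device debug information and turns off device-code optimizations, so the gap does come from compiler options. According to the nvcc documentation, the macro __CUDACC_DEBUG__ is defined when compiling with -G, and MSVC defines _DEBUG in its debug configurations. A benchmark could use these macros to guard against timing a debug build. The sketch below uses made-up function names and is not taken from this thread.

```cpp
#include <cstdio>

// True when the file was compiled by nvcc in device-debug mode (-G).
bool is_device_debug_build() {
#if defined(__CUDACC_DEBUG__)
    return true;
#else
    return false;
#endif
}

// True when the host compiler was run in an MSVC-style debug configuration.
bool is_host_debug_build() {
#if defined(_DEBUG)
    return true;
#else
    return false;
#endif
}

// Prints a warning to stderr and returns true if timings from this build
// should not be compared against the theoretical peak.
bool warn_if_debug_build() {
    const bool debug = is_device_debug_build() || is_host_debug_build();
    if (debug) {
        std::fprintf(stderr,
                     "warning: debug build, timings are not representative\n");
    }
    return debug;
}
```

Calling warn_if_debug_build() at the start of main makes an accidental Debug-mode measurement visible in the program output.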

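system · May 3, 2013, 04:33

I already explained what I know in post #16. I don't use VS very much myself. For more detail, you can wait for moderators 横扫 or ICE to answer. (I honestly don't know, and I don't want to guess.)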

system · May 3, 2013, 04:37

Thank you for your reply :P