GPU precision and runtime

system · March 10, 2010, 17:14

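I am currently implementing the ADI and Domain Decomposition algorithms on a GPU to solve the heat equation, a partial differential equation. When I compare the parallel results with the serial results, I run into two problems. The first is a precision problem. The second is how to use CUDA_SAFE_CALL. The details are in the English description below.
I would appreciate any help. If needed, I am happy to upload my code for reference.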

The results are as follows:
1. Values of the solution: For a small solution matrix and a short diffusion time, the parallel results can be exactly the same as the sequential results. However, as the diffusion time or the dimension of the solution matrix increases, the difference grows, though it usually stays below 10^(-6).

My first question is whether this difference is due to machine epsilon or to the accuracy of the GPU.

2. Runtime of the code: When the command "CUDA_SAFE_CALL(cudaThreadSynchronize())" is used, the runtime is much greater than without it. For example, for a 200 by 200 solution matrix, the runtime is 352 units with "CUDA_SAFE_CALL(cudaThreadSynchronize())" but only 1 unit without it.

My second question is whether the command "CUDA_SAFE_CALL(cudaThreadSynchronize())" is necessary, since it makes the code much slower.

system · March 11, 2010, 01:33

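The CPU and the GPU store intermediate values with different numbers of bits during computation. For example, intermediate int data is stored in 48 bits on the CPU, but the GPU does not seem to have that, so the GPU's precision is worse than the CPU's.

As far as I know, cuda_safe_call just checks the returned error code, so in general there is no need to use it, haha! I never use non-standard things; when an error occurs, I handle it with my own error function.

If you can upload the code, I would be very grateful, haha!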

system · March 11, 2010, 01:34

Yes! Let's look into it together.

system · March 11, 2010, 04:57

1. There is a precision problem.
2. CUDA_SAFE_CALL is a macro definition; just look at it in the header file. It has changed since 2.3.

system · March 11, 2010, 16:16

Thank you very much.