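I am currently implementing the ADI and domain decomposition algorithms on a GPU to solve the heat equation (a partial differential equation). When I compared the parallel results with the serial results, I ran into two problems. The first is a precision problem. The second is how to use CUDA_SAFE_CALL. The details of both are given below.
I would appreciate any help with this. If needed, I am happy to upload my code for reference.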
The results are as follows:
1. Values of the solution: For a small solution matrix and a short diffusion time, the parallel results can be exactly the same as the sequential results. However, as the diffusion time or the dimension of the solution matrix increases, the difference grows, although it usually stays below 10^(-6).
My first question is whether this difference is due to machine epsilon (floating-point rounding) or to the accuracy of the GPU.
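To make the first question concrete, here is a minimal sketch of one possible source of such a difference: single-precision addition is not associative, so a parallel reduction that adds values in a different order than a serial loop can give a slightly different result even when both are correct to within rounding. This is not the ADI code in question. The names `treeSum`, `serialSum`, `gpuTreeSum`, and `makeData`, the test data, and the 200 by 200 size are all hypothetical, and error checking is omitted for brevity.

```cuda
#include <cmath>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: one block sums n floats. Each thread first adds a
// strided slice, then the partial sums are combined by a shared-memory tree.
__global__ void treeSum(const float* in, float* out, int n) {
    extern __shared__ float buf[];
    int tid = threadIdx.x;
    float acc = 0.0f;
    for (int i = tid; i < n; i += blockDim.x) acc += in[i];
    buf[tid] = acc;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) buf[tid] += buf[tid + s];
        __syncthreads();
    }
    if (tid == 0) *out = buf[0];
}

// Left-to-right single-precision sum on the host.
float serialSum(const std::vector<float>& v) {
    float s = 0.0f;
    for (float x : v) s += x;
    return s;
}

// Same data, summed on the GPU in a different order.
float gpuTreeSum(const std::vector<float>& v) {
    const int threads = 256;  // power of two, required by the tree loop
    float *dIn = nullptr, *dOut = nullptr, result = 0.0f;
    cudaMalloc(&dIn, v.size() * sizeof(float));
    cudaMalloc(&dOut, sizeof(float));
    cudaMemcpy(dIn, v.data(), v.size() * sizeof(float), cudaMemcpyHostToDevice);
    treeSum<<<1, threads, threads * sizeof(float)>>>(dIn, dOut, (int)v.size());
    // Blocking copy: it returns only after the kernel has finished.
    cudaMemcpy(&result, dOut, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return result;
}

// Arbitrary test values (an assumption, not real temperature data).
std::vector<float> makeData(int n) {
    std::vector<float> v(n);
    for (int i = 0; i < n; ++i) v[i] = 0.001f * (float)((i % 7) + 1);
    return v;
}

int main() {
    std::vector<float> v = makeData(200 * 200);
    double ref = 0.0;  // double-precision reference
    for (float x : v) ref += x;
    float s = serialSum(v);
    float p = gpuTreeSum(v);
    printf("serial float  : %.9g (rel. err %.3g)\n", s, std::fabs(s - ref) / ref);
    printf("GPU tree float: %.9g (rel. err %.3g)\n", p, std::fabs(p - ref) / ref);
    return 0;
}
```

If the two single-precision results differ from each other by roughly the same amount as each differs from the double-precision reference, the discrepancy is probably ordinary rounding rather than a bug. Repeating the comparison in double precision should then shrink it by many orders of magnitude.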
2. Runtime of the code: When the command "CUDA_SAFE_CALL(cudaThreadSynchronize())" is applied, the runtime of the code is much greater than without it. For example, when the size of the solution matrix is 200 by 200, the runtime is 352 units with the command "CUDA_SAFE_CALL(cudaThreadSynchronize())", while it is only 1 unit without it.
My second question is whether the command "CUDA_SAFE_CALL(cudaThreadSynchronize())" is necessary, since it makes the code much slower.
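A minimal sketch of what may be happening with the timing, assuming the runtime is measured with a host-side timer: kernel launches return to the host immediately, so without a synchronization call the timer can stop before the GPU has finished, and the small number measures only the launch overhead. The kernel `busyKernel`, the helper `timeLaunchMs`, and the iteration count are hypothetical. The sketch uses `cudaDeviceSynchronize()`, which replaced the deprecated `cudaThreadSynchronize()` in newer CUDA versions, and omits error checking.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel that just keeps the GPU busy for a while.
__global__ void busyKernel(float* data, int n, int iters) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = data[i];
        for (int k = 0; k < iters; ++k) x = x * 0.999f + 0.001f;
        data[i] = x;
    }
}

// Host-side time in milliseconds around one launch. With wait == true the
// measurement includes kernel execution; with wait == false it only covers
// the time needed to queue the launch.
double timeLaunchMs(float* d, int n, bool wait) {
    auto t0 = std::chrono::steady_clock::now();
    busyKernel<<<(n + 255) / 256, 256>>>(d, n, 20000);
    if (wait) cudaDeviceSynchronize();
    auto t1 = std::chrono::steady_clock::now();
    cudaDeviceSynchronize();  // drain the GPU so the next measurement is clean
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

// Measures the same launch without and with synchronization.
void measure(double* noSyncMs, double* withSyncMs) {
    const int n = 200 * 200;
    float* d = nullptr;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));
    timeLaunchMs(d, n, true);  // warm-up: the first launch pays start-up costs
    *noSyncMs = timeLaunchMs(d, n, false);
    *withSyncMs = timeLaunchMs(d, n, true);
    cudaFree(d);
}

int main() {
    double noSyncMs = 0.0, withSyncMs = 0.0;
    measure(&noSyncMs, &withSyncMs);
    printf("without sync: %.3f ms (launch overhead only)\n", noSyncMs);
    printf("with sync   : %.3f ms (includes kernel execution)\n", withSyncMs);
    return 0;
}
```

If this is the cause, the synchronized time is the real cost of the computation, and the call is needed only where the host must wait: before stopping a host timer, for example. A blocking `cudaMemcpy` back to the host also waits for preceding kernels in the same stream, so the results are not affected by removing the explicit synchronization in that case.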