Kernel function does not execute

system · June 2, 2013, 09:57

__global__ void fsou_kernel_cuda(float* device_field_current,
                                 int nx, int ny,
                                 float coef_x, float coef_y)
{
    int xid = blockIdx.x*blockDim.x + threadIdx.x + 1;
    int yid = blockIdx.y*blockDim.y + threadIdx.y + 1;

    int xiyj = yid*(nx+1) + xid;
    int xi_1yj = xiyj - 1;
    int xiyj_1 = xiyj - (nx+1);

    float next_field;
    if (xid<=nx && yid <=ny)
        next_field = (1.f - coef_x - coef_y)*device_field_current[xiyj]
                   + coef_x*device_field_current[xi_1yj]
                   + coef_y*device_field_current[xiyj_1];

    __syncthreads();
    device_field_current[xiyj] = next_field;
    __syncthreads();
}
The calling code:
for (int timecount=0; timecount<ntime; timecount++)
{
    if (timecount % check_t_interval == 0)
    {
        printf("time iterations have completed %2d%% \n", complete_count);
        complete_count += 10;
    }
    fsou_kernel_cuda<<<dim_grid, dim_block>>>(
        device_field_current, nx, ny, coef_x, coef_y);
}
After copying device_field_current back with memcpy, I find that it has not been updated, and nvvp shows no execution time for the kernel either. What could be the cause?

system · June 2, 2013, 10:17

Many things can prevent a kernel from running. Try running the program under cuda-gdb, which will give you more useful information.

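system · June 2, 2013, 12:14

Hello OP. As #2 said, more debugging information would help pin down the problem. You also need to make sure that every argument you pass when launching the kernel is appropriate and valid.

Separately, I would like to point out a possible logic problem in your code.
Your code looks like some kind of finite-difference scheme, and it contains "int xiyj_1 = xiyj - (nx+1)". If you launch more than one block, the element of device_field_current at xiyj_1 may or may not already have been updated by another block, because the execution order of blocks is not guaranteed and __syncthreads() only synchronizes the threads within a single block. So your __syncthreads() calls cannot guarantee that every thread in every block first computes its local next_field from the previous step's field values before anything is written back. This may be where things go wrong.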
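One way to sidestep the ordering problem described above is to read from one buffer and write to another, so that no thread can read a value another block has already overwritten. The following is only a minimal sketch under that assumption, not the OP's actual code: the names upwind_step, prev and next, the grid size, and the coefficient values are made up for illustration, and the stencil is assumed to be the same update as in the posted kernel. It runs a single step and compares the result with a plain CPU loop.

```cuda
#include <cmath>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Out-of-place update: every thread reads only `prev` and writes only `next`,
// so the result does not depend on the order in which blocks run.
__global__ void upwind_step(const float* prev, float* next,
                            int nx, int ny, float coef_x, float coef_y)
{
    int xid = blockIdx.x * blockDim.x + threadIdx.x + 1;
    int yid = blockIdx.y * blockDim.y + threadIdx.y + 1;
    if (xid > nx || yid > ny) return;  // guard the reads and the write

    int stride = nx + 1;
    int c = yid * stride + xid;
    next[c] = (1.f - coef_x - coef_y) * prev[c]
            + coef_x * prev[c - 1]
            + coef_y * prev[c - stride];
}

int main()
{
    const int nx = 100, ny = 60;               // illustrative sizes
    const float coef_x = 0.3f, coef_y = 0.2f;  // illustrative coefficients
    const size_t n = size_t(nx + 1) * (ny + 1);
    const size_t bytes = n * sizeof(float);

    std::vector<float> h_prev(n), h_gpu(n);
    for (size_t i = 0; i < n; ++i) h_prev[i] = float(i % 17);

    // CPU reference for one step; row 0 and column 0 stay unchanged.
    std::vector<float> h_ref = h_prev;
    for (int y = 1; y <= ny; ++y)
        for (int x = 1; x <= nx; ++x) {
            int c = y * (nx + 1) + x;
            h_ref[c] = (1.f - coef_x - coef_y) * h_prev[c]
                     + coef_x * h_prev[c - 1]
                     + coef_y * h_prev[c - (nx + 1)];
        }

    float *d_prev = nullptr, *d_next = nullptr;
    cudaMalloc(&d_prev, bytes);
    cudaMalloc(&d_next, bytes);
    cudaMemcpy(d_prev, h_prev.data(), bytes, cudaMemcpyHostToDevice);
    // Seed d_next too: the kernel never writes row 0 or column 0.
    cudaMemcpy(d_next, h_prev.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(16, 16);
    dim3 grid((nx + block.x - 1) / block.x, (ny + block.y - 1) / block.y);
    upwind_step<<<grid, block>>>(d_prev, d_next, nx, ny, coef_x, coef_y);
    cudaMemcpy(h_gpu.data(), d_next, bytes, cudaMemcpyDeviceToHost);

    float max_err = 0.f;
    for (size_t i = 0; i < n; ++i)
        max_err = std::fmax(max_err, std::fabs(h_gpu[i] - h_ref[i]));
    std::printf("max |gpu - cpu| = %g\n", max_err);

    cudaFree(d_prev);
    cudaFree(d_next);
    return max_err < 1e-4f ? 0 : 1;
}
```

Note that the posted kernel performs the write device_field_current[xiyj] = next_field outside the if, so threads that fall outside the range still store an uninitialized value. The early return in this sketch avoids that.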

That is the gist of it, for your reference. Good luck with your coding.

system · June 3, 2013, 07:01

You are right, that problem is indeed there. I have changed the code, but it is still not working:

Host side:
for (int timecount=0; timecount<ntime; timecount++)
{
    cudaMemcpy(device_field_previous, device_field_current, mem_size,
               cudaMemcpyDeviceToDevice);

    if (timecount % check_t_interval == 0)
    {
        printf("time iterations have completed %2d%% \n", complete_count);
        complete_count += 10;
    }
    fsou_kernel_cuda<<<dim_grid, dim_block>>>(
        device_field_previous, device_field_current,
        nx, ny, coef_x, coef_y);
}

Device side:
__global__ void fsou_kernel_cuda(float* device_field_previous,
                                 float* device_field_current,
                                 int nx, int ny,
                                 float coef_x, float coef_y)
{
    int xid = blockIdx.x*blockDim.x + threadIdx.x + 1;
    int yid = blockIdx.y*blockDim.y + threadIdx.y + 1;

    int xiyj = yid*(nx+1) + xid;
    int xi_1yj = xiyj - 1;
    int xiyj_1 = xiyj - (nx+1);

    float next_field;
    if (xid < nx && yid < ny)
        next_field = (1.f - coef_x - coef_y)*device_field_previous[xiyj]
                   + coef_x*device_field_previous[xi_1yj]
                   + coef_y*device_field_previous[xiyj_1];

    device_field_current[xiyj] = next_field;
}

Is device memory released every time the kernel is launched again?

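system · June 3, 2013, 07:06

Hello OP. Regarding your latest question, the answer is no.

Device memory is released only when you explicitly call cudaFree; it is not freed automatically when a kernel finishes.

As for the problem from your earlier posts: apart from the <<<>>> launch itself, you can check the return value of every API call and test whether it is cudaSuccess. That will roughly narrow down where the problem lies. For example:
cudaError_t result = cudaMemcpy(…);
if (result != cudaSuccess) …; // then you know the problem occurred at this call or somewhere before it.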
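The check above can be wrapped in a small macro so that it is cheap to apply to every call. The following is a minimal sketch, not a definitive implementation: CUDA_CHECK and the fill kernel are hypothetical names chosen for illustration. Although the <<<>>> launch itself returns no value, calling cudaGetLastError() right after it reports launch failures (for example an invalid launch configuration), and cudaDeviceSynchronize() reports errors raised while the kernel was running.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with file, line and error string on any failure.
#define CUDA_CHECK(call)                                                  \
    do {                                                                  \
        cudaError_t err_ = (call);                                        \
        if (err_ != cudaSuccess) {                                        \
            std::fprintf(stderr, "%s:%d: %s failed: %s\n",                \
                         __FILE__, __LINE__, #call,                       \
                         cudaGetErrorString(err_));                       \
            std::exit(EXIT_FAILURE);                                      \
        }                                                                 \
    } while (0)

// Trivial kernel used only to have something to launch.
__global__ void fill(float* data, int n, float value)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = value;
}

int main()
{
    const int n = 1 << 20;
    float* d_data = nullptr;
    CUDA_CHECK(cudaMalloc(&d_data, n * sizeof(float)));

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    fill<<<blocks, threads>>>(d_data, n, 1.0f);
    CUDA_CHECK(cudaGetLastError());       // errors from the launch itself
    CUDA_CHECK(cudaDeviceSynchronize());  // errors raised during execution

    float first = 0.f;
    CUDA_CHECK(cudaMemcpy(&first, d_data, sizeof(float),
                          cudaMemcpyDeviceToHost));
    std::printf("data[0] = %f\n", first);

    CUDA_CHECK(cudaFree(d_data));
    return 0;
}
```

If a launch is rejected, the kernel never runs, which would be consistent with seeing no kernel entry in the profiler. Checking right after the launch is a quick way to confirm or rule that out.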

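system · June 3, 2013, 07:19

Hello OP, moderator 横扫 in #5 has already answered your question.
Here is a small suggestion about your implementation.

You use two arrays, current and previous, to hold your values. Each step updates current from previous, and then a memcpy copies current back into previous.

That means every step shuffles the entire data set through an extra copy.

In fact you can simply swap the roles of the two pointers. For example, in the first step the previous array is used to update the current array; in the next step the current array (which logically holds that step's previous data) is used to update the previous array (which then holds that step's current values).

You only need to tell the kernel which time step it is on, based on timecount in the host loop, and have it pick the matching update direction. There is no need to memcpy the data every step.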
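A minimal sketch of the idea, assuming it is acceptable to exchange the two device pointers on the host instead of selecting the direction inside the kernel; the effect is the same alternation of roles. The names upwind_step, d_prev and d_next and all numeric values are hypothetical and only for illustration.

```cuda
#include <cstdio>
#include <utility>
#include <vector>
#include <cuda_runtime.h>

// Reads the old field from `prev` and writes the new field to `next`.
__global__ void upwind_step(const float* prev, float* next,
                            int nx, int ny, float coef_x, float coef_y)
{
    int xid = blockIdx.x * blockDim.x + threadIdx.x + 1;
    int yid = blockIdx.y * blockDim.y + threadIdx.y + 1;
    if (xid > nx || yid > ny) return;

    int stride = nx + 1;
    int c = yid * stride + xid;
    next[c] = (1.f - coef_x - coef_y) * prev[c]
            + coef_x * prev[c - 1]
            + coef_y * prev[c - stride];
}

int main()
{
    const int nx = 64, ny = 64, ntime = 100;     // illustrative values
    const float coef_x = 0.25f, coef_y = 0.25f;  // illustrative values
    const size_t n = size_t(nx + 1) * (ny + 1);
    const size_t bytes = n * sizeof(float);

    // Illustrative initial field: 1 on the x = 0 column, 0 elsewhere.
    std::vector<float> h_field(n, 0.f);
    for (int y = 0; y <= ny; ++y) h_field[y * (nx + 1)] = 1.f;

    float *d_prev = nullptr, *d_next = nullptr;
    cudaMalloc(&d_prev, bytes);
    cudaMalloc(&d_next, bytes);
    // Seed both buffers once: the kernel never writes row 0 or column 0,
    // so each buffer needs its own copy of the boundary values.
    cudaMemcpy(d_prev, h_field.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_next, h_field.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(16, 16);
    dim3 grid((nx + block.x - 1) / block.x, (ny + block.y - 1) / block.y);
    for (int t = 0; t < ntime; ++t) {
        upwind_step<<<grid, block>>>(d_prev, d_next, nx, ny, coef_x, coef_y);
        std::swap(d_prev, d_next);  // the field just written becomes the input
    }

    // After the final swap the newest field is reachable through d_prev.
    cudaMemcpy(h_field.data(), d_prev, bytes, cudaMemcpyDeviceToHost);
    std::printf("field at (1,1) after %d steps: %f\n",
                ntime, h_field[1 * (nx + 1) + 1]);

    cudaFree(d_prev);
    cudaFree(d_next);
    return 0;
}
```

Only the two pointer values are exchanged each step, so the per-step device-to-device copy of the whole field disappears. Kernels launched into the same stream run in order, so no extra synchronization is needed between steps.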

For your reference, and good luck.

system · June 3, 2013, 09:56

Thank you for the answer, I will give it a try.
I have now found where things go wrong, but I do not really understand why. Please advise:

The correct code:
cudaMemcpy(device_field_previous, host_field_current, mem_size,
           cudaMemcpyHostToDevice);
for (int timecount=0; timecount<ntime; timecount++)
{
    fsou_kernel_cuda<<<dim_grid, dim_block>>>(
        device_field_previous, device_field_current,
        nx, ny, coef_x, coef_y);
    cudaMemcpy(device_field_previous, device_field_current, mem_size,
               cudaMemcpyDeviceToDevice);
}
The wrong code:

cudaMemcpy(device_field_current, host_field_current, mem_size,
           cudaMemcpyHostToDevice);

for (int timecount=0; timecount<ntime; timecount++)
{
    cudaMemcpy(device_field_previous, device_field_current, mem_size,
               cudaMemcpyDeviceToDevice);
    fsou_kernel_cuda<<<dim_grid, dim_block>>>(
        device_field_previous, device_field_current,
        nx, ny, coef_x, coef_y);

}

The only difference is the order of the data transfers, so how can the results differ? In the wrong version nothing gets updated at all.

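system · June 3, 2013, 10:12

Hello OP,

Going by the call order you gave:
The "correct code" executes as: host_current is copied to device_previous; the kernel computes device_current from device_previous; the newly computed device_current is copied to device_previous; and so on.

The "wrong code" executes as: host_current is copied to device_current; device_current is copied to device_previous; the kernel computes device_current from device_previous; and so on.
The two should be essentially equivalent apart from a few details. For example, in the former both device_previous and device_current hold the final result at the end, whereas in the latter only device_current holds it after the last step.

So I cannot see why the two cases give different results. If your loop ran only once and you inspected device_previous, the second version would show data that has not been updated; otherwise it should make no difference.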
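One way to test this reasoning is to run both orderings from the same initial field and compare the final device_current element by element. The harness below is only a sketch under stated assumptions: upwind_step and run are hypothetical names, the sizes and coefficients are made up, the kernel is a guarded version of the posted stencil, and both buffers are seeded with the initial field (the posted snippets initialize only one of them).

```cuda
#include <cmath>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__global__ void upwind_step(const float* prev, float* cur,
                            int nx, int ny, float coef_x, float coef_y)
{
    int xid = blockIdx.x * blockDim.x + threadIdx.x + 1;
    int yid = blockIdx.y * blockDim.y + threadIdx.y + 1;
    if (xid > nx || yid > ny) return;

    int stride = nx + 1;
    int c = yid * stride + xid;
    cur[c] = (1.f - coef_x - coef_y) * prev[c]
           + coef_x * prev[c - 1]
           + coef_y * prev[c - stride];
}

// Runs ntime steps and returns the final `cur` field.
// copy_first = true : copy cur -> prev, then launch (the "wrong code" order).
// copy_first = false: launch, then copy cur -> prev (the "correct code" order).
static std::vector<float> run(bool copy_first, const std::vector<float>& init,
                              int nx, int ny, int ntime, float cx, float cy)
{
    const size_t bytes = init.size() * sizeof(float);
    float *d_prev = nullptr, *d_cur = nullptr;
    cudaMalloc(&d_prev, bytes);
    cudaMalloc(&d_cur, bytes);
    // Assumption: both buffers start from the initial field, so the boundary
    // row and column (never written by the kernel) are defined in each.
    cudaMemcpy(d_prev, init.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_cur, init.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(16, 16);
    dim3 grid((nx + block.x - 1) / block.x, (ny + block.y - 1) / block.y);
    for (int t = 0; t < ntime; ++t) {
        if (copy_first)
            cudaMemcpy(d_prev, d_cur, bytes, cudaMemcpyDeviceToDevice);
        upwind_step<<<grid, block>>>(d_prev, d_cur, nx, ny, cx, cy);
        if (!copy_first)
            cudaMemcpy(d_prev, d_cur, bytes, cudaMemcpyDeviceToDevice);
    }

    std::vector<float> out(init.size());
    cudaMemcpy(out.data(), d_cur, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_prev);
    cudaFree(d_cur);
    return out;
}

int main()
{
    const int nx = 64, ny = 48, ntime = 20;    // illustrative values
    const float cx = 0.3f, cy = 0.2f;          // illustrative values
    std::vector<float> init(size_t(nx + 1) * (ny + 1));
    for (size_t i = 0; i < init.size(); ++i) init[i] = float(i % 13);

    std::vector<float> a = run(true, init, nx, ny, ntime, cx, cy);
    std::vector<float> b = run(false, init, nx, ny, ntime, cx, cy);

    float max_diff = 0.f;
    for (size_t i = 0; i < a.size(); ++i)
        max_diff = std::fmax(max_diff, std::fabs(a[i] - b[i]));
    std::printf("max difference between the two orderings: %g\n", max_diff);
    return max_diff == 0.f ? 0 : 1;
}
```

If the two orderings agree in a harness like this but not in the real program, the difference is more likely to come from something outside the loop, such as which buffers were initialized or which buffer is copied back for inspection.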

Please check your code.