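My program works as follows:
1. Create 4 streams.
2. Transfer the input data to the device asynchronously in the four streams.
3. Run the matrix multiplication kernel (each stream performs two matrix multiplications).
4. Run the matrix addition kernel (each stream adds the two products from the previous step, so each stream ends up with one sum matrix and the four streams produce four matrices in total).
5. Copy the results back to the host asynchronously in the four streams.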
cudaMemcpyAsync(d_a1,a1,size,cudaMemcpyHostToDevice,stream[0]);
cudaMemcpyAsync(d_b1,b1,size,cudaMemcpyHostToDevice,stream[0]);
cudaMemcpyAsync(d_a2,a2,size,cudaMemcpyHostToDevice,stream[1]);
cudaMemcpyAsync(d_b2,b2,size,cudaMemcpyHostToDevice,stream[1]);
cudaMemcpyAsync(d_a3,a3,size,cudaMemcpyHostToDevice,stream[2]);
cudaMemcpyAsync(d_b3,b3,size,cudaMemcpyHostToDevice,stream[2]);
cudaMemcpyAsync(d_a4,a4,size,cudaMemcpyHostToDevice,stream[3]);
cudaMemcpyAsync(d_b4,b4,size,cudaMemcpyHostToDevice,stream[3]);
dim3 threads=dim3(BLOCK_SIZE,BLOCK_SIZE);
dim3 blocks=dim3(N/(2*BLOCK_SIZE),N/(2*BLOCK_SIZE));
matrixMul<<<blocks,threads,0,stream[0]>>>(d_c1,d_a1,d_b1,N/2,N/2);
matrixMul<<<blocks,threads,0,stream[1]>>>(d_c3,d_a1,d_b2,N/2,N/2);
matrixMul<<<blocks,threads,0,stream[2]>>>(d_c5,d_a3,d_b1,N/2,N/2);
matrixMul<<<blocks,threads,0,stream[3]>>>(d_c7,d_a3,d_b2,N/2,N/2);
matrixMul<<<blocks,threads,0,stream[0]>>>(d_c2,d_a2,d_b3,N/2,N/2);
matrixMul<<<blocks,threads,0,stream[1]>>>(d_c4,d_a2,d_b4,N/2,N/2);
matrixMul<<<blocks,threads,0,stream[2]>>>(d_c6,d_a4,d_b3,N/2,N/2);
matrixMul<<<blocks,threads,0,stream[3]>>>(d_c8,d_a4,d_b4,N/2,N/2);
addMat<<<blocks,threads,0,stream[0]>>>(d_C1,d_c1,d_c2);
addMat<<<blocks,threads,0,stream[1]>>>(d_C2,d_c3,d_c4);
addMat<<<blocks,threads,0,stream[2]>>>(d_C3,d_c5,d_c6);
addMat<<<blocks,threads,0,stream[3]>>>(d_C1,d_c7,d_c8);
cudaMemcpyAsync(C1,d_C1,size,cudaMemcpyDeviceToHost,stream[0]);
cudaMemcpyAsync(C2,d_C2,size,cudaMemcpyDeviceToHost,stream[1]);
cudaMemcpyAsync(C3,d_C3,size,cudaMemcpyDeviceToHost,stream[2]);
cudaMemcpyAsync(C4,d_C4,size,cudaMemcpyDeviceToHost,stream[3]);
cudaMemcpyAsync(c1,d_c1,size,cudaMemcpyDeviceToHost,stream[0]);
cudaMemcpyAsync(c3,d_c3,size,cudaMemcpyDeviceToHost,stream[1]);
cudaMemcpyAsync(c5,d_c5,size,cudaMemcpyDeviceToHost,stream[2]);
cudaMemcpyAsync(c7,d_c7,size,cudaMemcpyDeviceToHost,stream[3]);
cudaMemcpyAsync(c2,d_c2,size,cudaMemcpyDeviceToHost,stream[0]);
cudaMemcpyAsync(c4,d_c4,size,cudaMemcpyDeviceToHost,stream[1]);
cudaMemcpyAsync(c6,d_c6,size,cudaMemcpyDeviceToHost,stream[2]);
cudaMemcpyAsync(c8,d_c8,size,cudaMemcpyDeviceToHost,stream[3]);
cudaStreamSynchronize(stream[0]);
cudaStreamSynchronize(stream[1]);
cudaStreamSynchronize(stream[2]);
cudaStreamSynchronize(stream[3]);
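A note on ordering in the listing above: several kernels read buffers that are uploaded in a different stream (for example, the launch in stream[1] reads d_a1, which is copied in stream[0]), and CUDA does not order work between different non-default streams unless asked to. Below is a minimal sketch of one way to express such a dependency with an event. The kernel `scale2`, the function `runCrossStreamDemo`, and the buffer size are hypothetical and not part of the original program.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: writes 2 * src[i] into dst[i].
__global__ void scale2(double* dst, const double* src, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) dst[i] = 2.0 * src[i];
}

// Uploads in stream s0, computes in stream s1, and uses an event so that the
// kernel in s1 cannot start before the upload in s0 has finished.
// Returns true if every output element equals 2 * input.
// (Sketch only: resources are not released on the early-return path.)
bool runCrossStreamDemo()
{
    const int n = 1 << 16;
    const size_t bytes = n * sizeof(double);
    double *h_in = nullptr, *h_out = nullptr, *d_in = nullptr, *d_out = nullptr;
    cudaStream_t s0, s1;
    cudaEvent_t uploaded;

    bool ok = true;
    // Pinned host memory, which cudaMemcpyAsync needs to overlap with other work.
    ok &= cudaMallocHost((void**)&h_in, bytes) == cudaSuccess;
    ok &= cudaMallocHost((void**)&h_out, bytes) == cudaSuccess;
    ok &= cudaMalloc((void**)&d_in, bytes) == cudaSuccess;
    ok &= cudaMalloc((void**)&d_out, bytes) == cudaSuccess;
    ok &= cudaStreamCreate(&s0) == cudaSuccess;
    ok &= cudaStreamCreate(&s1) == cudaSuccess;
    ok &= cudaEventCreate(&uploaded) == cudaSuccess;
    if (!ok) return false;

    for (int i = 0; i < n; ++i) h_in[i] = i;

    // Upload in s0, then mark the point at which the upload is complete.
    cudaMemcpyAsync(d_in, h_in, bytes, cudaMemcpyHostToDevice, s0);
    cudaEventRecord(uploaded, s0);

    // Everything queued in s1 after this call waits for the event.
    cudaStreamWaitEvent(s1, uploaded, 0);
    scale2<<<(n + 255) / 256, 256, 0, s1>>>(d_out, d_in, n);
    cudaMemcpyAsync(h_out, d_out, bytes, cudaMemcpyDeviceToHost, s1);
    ok &= cudaStreamSynchronize(s1) == cudaSuccess;

    for (int i = 0; ok && i < n; ++i) ok = (h_out[i] == 2.0 * h_in[i]);

    cudaEventDestroy(uploaded);
    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    cudaFree(d_in);
    cudaFree(d_out);
    cudaFreeHost(h_in);
    cudaFreeHost(h_out);
    return ok;
}
```

Calling `runCrossStreamDemo()` from `main` should return true on a working device. In the original program the same pattern would mean recording one event after each upload and making every stream that reads that buffer wait on it before its matrixMul launch.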
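My question:
The matrix products computed by the kernels in step 3 are correct in every stream, but the matrix addition results in step 4 are wrong. Single-stepping with the Nsight Monitor debugger, I found that each thread (for example the thread with tx=0, ty=0 in block 0) executes the kernel four times: after finishing one pass it starts again from the beginning of the kernel, and the arguments A and B passed in are different each time. I never modify A or B inside the kernel, so how can they change? And why does the kernel run four times? Is it related to the four streams?
The addition kernel: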
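__global__ void addMat(double* C, double* A, double* B)
{
    // Block index
    int bx = blockIdx.x;
    int by = blockIdx.y;
    // Thread index
    int tx = threadIdx.x;
    int ty = threadIdx.y;
    // Global column (idx) and row (idy) of the element this thread handles
    int idx = tx + bx * blockDim.x;
    int idy = ty + by * blockDim.y;
    // Each matrix is (N/2) x (N/2), stored row-major
    C[idy*(N/2)+idx] = A[idy*(N/2)+idx] + B[idy*(N/2)+idx];
}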
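The kernel above relies on the macro N and on the grid covering the matrix exactly. The following is a hedged variant, shown only as a sketch: the matrix width is passed as a parameter and threads that fall outside the matrix exit early. The names `flatIndex` and `addMatChecked` and the parameter `width` are hypothetical and do not appear in the original program.

```cuda
#include <cuda_runtime.h>

// Row-major offset of the element handled by one thread (hypothetical helper).
// bdx and bdy are the block dimensions; width is the number of columns.
__host__ __device__ inline int flatIndex(int bx, int by, int tx, int ty,
                                         int bdx, int bdy, int width)
{
    int idx = tx + bx * bdx;  // column
    int idy = ty + by * bdy;  // row
    return idy * width + idx;
}

// Adds two width x width matrices; threads outside the matrix do nothing.
__global__ void addMatChecked(double* C, const double* A, const double* B,
                              int width)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    int idy = threadIdx.y + blockIdx.y * blockDim.y;
    if (idx >= width || idy >= width) return;
    int i = flatIndex(blockIdx.x, blockIdx.y, threadIdx.x, threadIdx.y,
                      blockDim.x, blockDim.y, width);
    C[i] = A[i] + B[i];
}
```

As a worked example, with 16 x 16 blocks and width = 512, the thread (tx=3, ty=4) in block (bx=1, by=2) handles column 19 of row 36, that is offset 36*512 + 19 = 18451. A launch in the style of the listing would look like `addMatChecked<<<blocks,threads,0,stream[0]>>>(d_C1,d_c1,d_c2,N/2);`.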
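The values change like this:
Pass 1: arguments A[0]=4, B[0]=52, C is a random number
Pass 2: arguments A[0]=6, B[0]=62, C[0]=random number
Pass 3: arguments A[0]=36, B[0]=212, C[0]=random number
Pass 4: arguments A[0]=70, B[0]=254, C[0]=56 (this is the value I want; why is it only written now?)
In the end I get C[0]=324. Could a moderator please explain?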
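One reading that is consistent with the numbers above, offered as a possibility rather than a confirmed diagnosis: addMat is launched four times (once per stream), so a breakpoint on thread (0,0) of block (0,0) is hit once per launch, each time with that launch's own A and B. In the listing, the fourth launch writes to d_C1 rather than d_C4, which would explain why C[0] already holds 56 = 4 + 52 on the fourth pass and ends up as 324 = 70 + 254. The sketch below reproduces that arithmetic with one-element buffers. It uses a single stream so that the order is deterministic (in the real program the two launches that write d_C1 run in different streams, so their order is not guaranteed), and the names `addOne` and `runFourLaunches` are hypothetical.

```cuda
#include <cuda_runtime.h>

// Hypothetical one-element stand-in for addMat.
__global__ void addOne(double* C, const double* A, const double* B)
{
    C[0] = A[0] + B[0];
}

// Runs four launches with the same output targets as the listing
// (C1, C2, C3, and C1 again) and copies the four outputs into out[0..3].
// Returns false if an allocation or the final copy fails.
bool runFourLaunches(double out[4])
{
    // First elements reported in the post for the four passes.
    const double a[4] = {4, 6, 36, 70};
    const double b[4] = {52, 62, 212, 254};
    const double init[4] = {-1, -1, -1, -1};  // sentinel: "never written"
    double *d_a = nullptr, *d_b = nullptr, *d_C = nullptr;

    bool ok = true;
    ok &= cudaMalloc((void**)&d_a, sizeof(a)) == cudaSuccess;
    ok &= cudaMalloc((void**)&d_b, sizeof(b)) == cudaSuccess;
    ok &= cudaMalloc((void**)&d_C, sizeof(init)) == cudaSuccess;
    if (!ok) return false;
    cudaMemcpy(d_a, a, sizeof(a), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, sizeof(b), cudaMemcpyHostToDevice);
    cudaMemcpy(d_C, init, sizeof(init), cudaMemcpyHostToDevice);

    // Output slot used by each launch: the fourth reuses slot 0 (d_C1).
    const int target[4] = {0, 1, 2, 0};
    for (int k = 0; k < 4; ++k)
        addOne<<<1, 1>>>(d_C + target[k], d_a + k, d_b + k);

    ok &= cudaMemcpy(out, d_C, sizeof(init), cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_C);
    return ok;
}
```

Under these assumptions `out` ends up as {324, 68, 248, -1}: slot 0 first receives 56 and is then overwritten by the fourth launch, while the slot standing in for d_C4 is never written.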