CUDA kernel error

My post from yesterday was moved to the off-topic section; I'm not sure why.

I'd like to ask the experts here: I have two kernels, a and b, called by functions A and B respectively. Function A allocates the device memory for all the global variables.

Kernel b has to process the data generated by kernel a.

I have checked that the data generated by kernel a is correct, function A does not free the device memory, and the data is held in global variables. So why can't kernel b, launched from function B, access the data that kernel a generated?

cudaDeviceSynchronize reports that kernel b failed with error code 30, "unknown error".

The failing kernel b:
__global__ void removeRedundantTrianglesKernel(bool *triangles1, float *triCenters1, float *triCenters2, size_t width, size_t height, float *K1, float *R1, float *T1, float *K2, float *R2, float *T2, float threshhold_r)
{
    int x = threadIdx.x; // thread index
    int y = blockIdx.x;  // block index

    // First check whether this cell holds a triangle
    if (triangles1[y * width + x])
    {
        // If it is an edge triangle, project it onto the neighboring depth-map plane
        if ((x == 0) || (y == 0) || !(triangles1[height * width + y * width + x] && triangles1[height * width + y * width + x - 1] && triangles1[height * width + (y - 1) * width + x]))
        {
            float depth = triCenters1[y * width + x];
            float c1x = (x - K1[0]) * depth / K1[2] - T1[0];
            float c1y = (y - K1[1]) * depth / K1[3] - T1[1];
            float c1z = depth - T1[2];

            float wx = R1[0] * c1x + R1[1] * c1y + R1[2] * c1z;
            float wy = R1[3] * c1x + R1[4] * c1y + R1[5] * c1z;
            float wz = R1[6] * c1x + R1[7] * c1y + R1[8] * c1z;

            float d = R2[0] * R2[4] * R2[8] + R2[1] * R2[5] * R2[6] + R2[2] * R2[3] * R2[7] - R2[2] * R2[4] * R2[6] - R2[1] * R2[3] * R2[8] - R2[0] * R2[5] * R2[7];
            float c2x = (wx * R2[4] * R2[8] + R2[1] * R2[5] * wz + R2[2] * wy * R2[7] - R2[2] * R2[4] * wz - R2[1] * wy * R2[8] - wx * R2[5] * R2[7]) / d;
            float c2y = (R2[0] * wy * R2[8] + wx * R2[5] * R2[6] + R2[2] * R2[3] * wz - R2[2] * wy * R2[6] - wx * R2[3] * R2[8] - R2[0] * R2[5] * wz) / d;
            float c2z = (R2[0] * R2[4] * wz + R2[1] * wy * R2[6] + wx * R2[3] * R2[7] - wx * R2[4] * R2[6] - R2[1] * R2[3] * wz - R2[0] * wy * R2[7]) / d;

            float depth2 = c2z + T2[2];
            int x2 = int((c2x + T2[0]) * K2[2] / depth2 + K2[0]);
            int y2 = int((c2y + T2[1]) * K2[3] / depth2 + K2[1]);

            // Project into the camera coordinate system and compute Euclidean distances
            for (int i = y2 - 8; i <= y2 + 7; i++)
            {
                for (int j = x2 - 8; j <= x2 + 7; j++)
                {
                    if (i >= 0 && i < 480 && j >= 0 && j < 640)
                    {
                        float tempDepth = triCenters2[i * width + j];
                        float tempx = (x2 + 0.5 - K2[0]) * tempDepth / K2[2] - T2[0];
                        float tempy = (y2 + 0.5 - K2[1]) * tempDepth / K2[3] - T2[1];
                        float tempz = tempDepth - T2[2];

                        if (getDistance(c2x, c2y, c2z, tempx, tempy, tempz) < threshhold_r)
                        {
                            triangles1[y * width + x] = false;
                            break;
                        }
                    }
                }
            }
        }
    }
}

Debugging shows that none of the matrix data is accessible inside the kernel.

Hello OP:

1: What you want to do, kernel a producing data and kernel b consuming it, is absolutely achievable.

2: Since it is certainly achievable in theory, the error you see is almost certainly in your own implementation, because CUDA's execution machinery has been battle-tested for years.
For those two reasons, since you previously declined to discuss implementation details, the thread could only be moved to the off-topic section; please understand.

The error code you report generally indicates a memory-access problem.
So first, check how you declare and use your global variables (for example, whether you used __device__ to declare device-side globals) for anything you may have missed. You haven't posted that code, so I can't advise further on it yet.

Second, you did post the code of kernel b. At a glance its usage looks fine, but the internal index logic is fairly involved; you need to verify its correctness carefully and make sure there are no out-of-bounds accesses.

Those are my two suggestions, for your reference.
Further feedback and discussion are welcome.

Happy coding!

OP, please keep posting code.

Don't post only the kernel. Your CUDA "unknown error" is usually an out-of-bounds access; you need to provide the buffer definitions at the call site, the launch configuration, and so on, so the forum can look for where an overflow might occur.

For example, you use (x, y) directly, but without the data requested above there is no way to tell whether that goes out of bounds (I'm not saying it necessarily does; I just want to find it for you).

If you still can't post the code, then I'm sorry, but I can't help further.
You could consider a third-party company that would provide paid support under an NDA.

Hello moderator! I added __device__ and the problem is unchanged.
With the second kernel commented out:

extern "C" cudaError_t removeRedundantTrianglesWithCuda(bool *hst_triangles1, bool *dev_triangles1, float *triCenters1, float *triCenters2, float *K1, float *R1, float *T1, float *K2, float *R2, float *T2)
{
cudaError_t cudaStatus;

float thresh_r = 10.0;

// grid and block dimensions
dim3 dimGrid(IMG_HEIGHT, 1, 1);
dim3 dimBlock(IMG_WIDTH, 1, 1);

cudaEvent_t start, stop;
float elapsedTime;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);

//removeRedundantTrianglesKernel<<<dimGrid, dimBlock>>>(dev_triangles1, triCenters1, triCenters2, IMG_WIDTH, IMG_HEIGHT, K1, R1, T1, K2, R2, T2, thresh_r);

//// cudaDeviceSynchronize waits for the kernel to finish, and returns
//// any errors encountered during the launch.
//cudaStatus = cudaDeviceSynchronize();
//if (cudaStatus != cudaSuccess) {
//	//fprintf(stderr, "cudaDeviceSynchronize returned error code %d after launching addKernel!\n", cudaStatus);
//	std::cout<<"cudaDeviceSynchronize returned error code "<<cudaStatus<<" after launching removeRedundantTrianglesKernel!"<<std::endl;
//	std::cout<<cudaGetErrorString(cudaStatus)<<std::endl;
//	return cudaStatus;
//}

cudaStatus = cudaMemcpy(hst_triangles1, dev_triangles1, 2 * IMG_HEIGHT * IMG_WIDTH * sizeof(bool), cudaMemcpyDeviceToHost);
if (cudaStatus != cudaSuccess) {
	std::cout<<"cudaMemcpy failed!";
	return cudaStatus;
}

cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&elapsedTime, start, stop);
std::cout<<"Time to remove: "<<elapsedTime<<" ms."<<std::endl;

return cudaStatus;

}

All this function does is copy out the data produced by the earlier kernel, and even that copy fails. Debugging shows the device data still can't be accessed; every pointer is 0x00000000.
It seems this function can't see the device data that function A allocated, even though I never freed it in function A. What is going on?

Hello OP,

Your latest report says: right after function A finishes, without calling function B, a plain cudaMemcpy already fails to copy the data out.

So the flow reduces to:
kernel_a
cudaMemcpy

Since these two steps fail, either kernel a is wrong or the cudaMemcpy immediately following it is.

But you stated in #1: "I have checked that the data generated by kernel a is correct."
Then only one possibility remains: the cudaMemcpy after kernel a is at fault (e.g. wrong arguments).
Please check it carefully.

By your own description, this is the only possibility left.
(All of the above is deduced from your description; the forum assumes your description is accurate.)

Hello OP:

I wasn't suggesting that you simply add __device__; I was asking whether you had used it, because __device__ globals are handled specially and are easy to get wrong. You can also use an ordinary pointer plus cudaMalloc; the pointer approach is the recommended one.

The code you posted in #4 still doesn't include the declaration and allocation of your global variables, the buffer definitions you use, the kernel launch parameters, etc.

Without those I honestly cannot answer "what is going on"; please understand.

Good luck!

The current situation: if I run cudaMemcpy right after the kernel inside function A, there is no error, which is why I say the first kernel is fine.
But when I move the cudaMemcpy into function B, it fails, and debugging shows the data, i.e. the pointer variables in the code above, cannot be accessed, even though they are accessible inside function A, and function A never frees the device buffers after its kernels finish.
I'm not withholding code for secrecy; there are just many parameters, and most of the code is device-memory allocation and data copying. With function A alone the program runs correctly; function B was added later and is posted above.

Hello OP:

1: What you intend to do is achievable; there is no doubt about that.
2: The bug in your implementation can only be found by reading the code, not from a textual description. Please understand this.

So, if you respect the moderators' time spent reading this thread, please post the code. We have asked several times; this will be the last request.

Good luck!

Globally declared device-side data:
float* dev_depth1 = 0;
bool* dev_triangles1 = 0;
float* dev_triCenters1 = 0;
float* dev_x1 = 0;
float* dev_y1 = 0;
float* dev_z1 = 0;
float* dev_K1 = 0;
float* dev_R1 = 0;
float* dev_T1 = 0;

float* dev_depth2 = 0;
bool* dev_triangles2 = 0;
float* dev_triCenters2 = 0;
float* dev_x2 = 0;
float* dev_y2 = 0;
float* dev_z2 = 0;
float* dev_K2 = 0;
float* dev_R2 = 0;
float* dev_T2 = 0;

float* dev_depth3 = 0;
bool* dev_triangles3 = 0;
float* dev_triCenters3 = 0;
float* dev_x3 = 0;
float* dev_y3 = 0;
float* dev_z3 = 0;
float* dev_K3 = 0;
float* dev_R3 = 0;
float* dev_T3 = 0;

float* dev_depth4 = 0;
bool* dev_triangles4 = 0;
float* dev_triCenters4 = 0;
float* dev_x4 = 0;
float* dev_y4 = 0;
float* dev_z4 = 0;
float* dev_K4 = 0;
float* dev_R4 = 0;
float* dev_T4 = 0;

Function A
extern "C" cudaError_t triangulateWithCudaDirect(const float *depth_map, const float *K, const float *R, const float *T, bool *hst_triangles, float *hst_x, float *hst_y, float *hst_z,
float *dev_depth, float *dev_K, float *dev_R, float *dev_T, bool *dev_triangles, float *dev_x, float *dev_y, float *dev_z, float *dev_triCenters)
{
cudaError_t cudaStatus;

float thresh_d = 20.0;

// grid and block dimensions
dim3 dimGrid(IMG_HEIGHT, 1, 1);
dim3 dimBlock(IMG_WIDTH, 1, 1);

cudaEvent_t start, stop;
float elapsedTime;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);


// Choose which GPU to run on, change this on a multi-GPU system.
cudaStatus = cudaSetDevice(0);
if (cudaStatus != cudaSuccess) {
	std::cout<<"cudaSetDevice failed!  Do you have a CUDA-capable GPU installed?";
	return cudaStatus;
}

// copy the intrinsics and extrinsics to device memory
cudaStatus = cudaMalloc((void**)&dev_K, 4 * sizeof(float));
if (cudaStatus != cudaSuccess) {
	std::cout<<"cudaMalloc failed!";
	return cudaStatus;
}

cudaStatus = cudaMalloc((void**)&dev_R, 9 * sizeof(float));
if (cudaStatus != cudaSuccess) {
	std::cout<<"cudaMalloc failed!";
	return cudaStatus;
}

cudaStatus = cudaMalloc((void**)&dev_T, 3 * sizeof(float));
if (cudaStatus != cudaSuccess) {
	std::cout<<"cudaMalloc failed!";
	return cudaStatus;
}

cudaStatus = cudaMemcpy(dev_K, K, 4 * sizeof(float), cudaMemcpyHostToDevice);
if (cudaStatus != cudaSuccess) {
	std::cout<<"cudaMemcpy failed!";
	return cudaStatus;
}

cudaStatus = cudaMemcpy(dev_R, R, 9 * sizeof(float), cudaMemcpyHostToDevice);
if (cudaStatus != cudaSuccess) {
	std::cout<<"cudaMemcpy failed!";
	return cudaStatus;
}

cudaStatus = cudaMemcpy(dev_T, T, 3 * sizeof(float), cudaMemcpyHostToDevice);
if (cudaStatus != cudaSuccess) {
	std::cout<<"cudaMemcpy failed!";
	return cudaStatus;
}

// allocate device memory for the triangle matrix
cudaStatus = cudaMalloc((void**)&dev_triangles, 2 * IMG_HEIGHT * IMG_WIDTH * sizeof(bool));
if (cudaStatus != cudaSuccess) {
	std::cout<<"cudaMalloc failed!";
	return cudaStatus;
}

cudaStatus = cudaMemset(dev_triangles, 0, 2 * IMG_HEIGHT * IMG_WIDTH);
if (cudaStatus != cudaSuccess) {
	std::cout<<"cudaMemset failed!";
	return cudaStatus;
}

cudaStatus = cudaMalloc((void**)&dev_depth, IMG_HEIGHT * IMG_WIDTH * sizeof(float));
if (cudaStatus != cudaSuccess) {
	std::cout<<"cudaMalloc failed!";
	return cudaStatus;
}

cudaStatus = cudaMemcpy(dev_depth, depth_map, IMG_HEIGHT * IMG_WIDTH * sizeof(float), cudaMemcpyHostToDevice);
if (cudaStatus != cudaSuccess) {
	std::cout<<"cudaMemcpy failed!";
	return cudaStatus;
}

// allocate device memory for the triangle centers
cudaStatus = cudaMalloc((void**)&dev_triCenters, 2 * IMG_HEIGHT * IMG_WIDTH * sizeof(float));
if (cudaStatus != cudaSuccess) {
	std::cout<<"cudaMalloc failed!";
	return cudaStatus;
}

// initialize the triangle-center matrix
cudaStatus = cudaMemset(dev_triCenters, 0, 2 * IMG_HEIGHT * IMG_WIDTH);
if (cudaStatus != cudaSuccess) {
	std::cout<<"cudaMemset failed!";
	return cudaStatus;
}

// allocate device memory for the point-cloud coordinates
cudaStatus = cudaMalloc((void**)&dev_x, IMG_HEIGHT * IMG_WIDTH * sizeof(float));
if (cudaStatus != cudaSuccess) {
	std::cout<<"cudaMalloc failed!";
	return cudaStatus;
}

cudaStatus = cudaMalloc((void**)&dev_y, IMG_HEIGHT * IMG_WIDTH * sizeof(float));
if (cudaStatus != cudaSuccess) {
	std::cout<<"cudaMalloc failed!";
	return cudaStatus;
}

cudaStatus = cudaMalloc((void**)&dev_z, IMG_HEIGHT * IMG_WIDTH * sizeof(float));
if (cudaStatus != cudaSuccess) {
	std::cout<<"cudaMalloc failed!";
	return cudaStatus;
}




// launch the kernel that projects the 2D depth-map points into 3D space

// projectKernelDirect<<<dimGrid, dimBlock>>>(dev_depth, dev_x, dev_y, dev_z, IMG_WIDTH, ccx, ccy, fcx, fcy);
projectKernel<<<dimGrid, dimBlock>>>(dev_depth, dev_x, dev_y, dev_z, IMG_WIDTH, dev_K, dev_R, dev_T);

// launch the kernel that builds the triangles and computes their centers
triangulateKernelDirect<<<dimGrid, dimBlock>>>(dev_depth, dev_triangles, dev_triCenters, IMG_WIDTH, IMG_HEIGHT, thresh_d);

// cudaDeviceSynchronize waits for the kernel to finish, and returns
// any errors encountered during the launch.
cudaStatus = cudaDeviceSynchronize();
if (cudaStatus != cudaSuccess) {
	//fprintf(stderr, "cudaDeviceSynchronize returned error code %d after launching addKernel!\n", cudaStatus);
	std::cout<<"cudaDeviceSynchronize returned error code "<<cudaStatus<<" after launching triangulateKernel!"<<std::endl;
	std::cout<<cudaGetErrorString(cudaStatus)<<std::endl;
	return cudaStatus;
}

// Copy output vector from GPU buffer to host memory.
//cudaStatus = cudaMemcpy(hst_triangles, dev_triangles, 2 * IMG_HEIGHT * IMG_WIDTH * sizeof(bool), cudaMemcpyDeviceToHost);
//if (cudaStatus != cudaSuccess) {
//std::cout<<"cudaMemcpy failed!";
//return cudaStatus;
//}

cudaStatus = cudaMemcpy(hst_x, dev_x, IMG_HEIGHT * IMG_WIDTH * sizeof(float), cudaMemcpyDeviceToHost);
if (cudaStatus != cudaSuccess) {
	std::cout<<"cudaMemcpy failed!";
	return cudaStatus;
}

cudaStatus = cudaMemcpy(hst_y, dev_y, IMG_HEIGHT * IMG_WIDTH * sizeof(float), cudaMemcpyDeviceToHost);
if (cudaStatus != cudaSuccess) {
	std::cout<<"cudaMemcpy failed!";
	return cudaStatus;
}

cudaStatus = cudaMemcpy(hst_z, dev_z, IMG_HEIGHT * IMG_WIDTH * sizeof(float), cudaMemcpyDeviceToHost);
if (cudaStatus != cudaSuccess) {
	std::cout<<"cudaMemcpy failed!";
	return cudaStatus;
}

cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&elapsedTime, start, stop);
std::cout<<"Time to generate: "<<elapsedTime<<" ms."<<std::endl;

return cudaStatus;

}

It just allocates device memory, runs the kernels, and copies some of the data back to the host.

The commented-out part (shown in red on the forum) is the code that fails when moved into function B; it runs correctly in A.
In B those device pointers become inaccessible; the debugger shows 0x00000000.

I tried to post this reply several times before, but the forum kept reporting an error; it really wasn't that I refused to post. I assumed the post was too long to go through.

Hello OP:

It is a pointer-passing problem.

In function A you pass the *values* of the globally defined pointers in as parameters and then cudaMalloc() through them. The allocated memory therefore ends up pointed to by function A's local parameter variables, while the corresponding global variables are never changed.

So inside function A, using the local variable, the copy works; but in function B you use the global variable, and since its value never changed, neither the kernel launch nor the copy can succeed.

Check your local variable dev_triangles against your global variable dev_triangles1 in light of the above.

In general: if you only need to use the buffer a pointer points to, pass the pointer itself as the argument; if you need to modify the pointer's value inside the function, pass the address of the pointer.

You can do it like this:
int* dev_a;
void malloc_a(int** a)
{ cudaMalloc((void**)a, memsize); }

and at the call site:
malloc_a(&dev_a);

This way it is the value of the global variable dev_a that gets modified.

Roughly that; please take it as a reference.

Good luck!

OP, there is a dedicated "code mode" for posting code: when posting or replying, click the "<>" button and paste your code into it in one go.
That gives you line numbers and indentation, and avoids trouble with escaped characters.

Good luck!

I forgot to post the call site.
Calling function A:
cudaError_t cudaStatus = triangulateWithCudaDirect(depth_map1, K1, R1, T1, hst_triangles1, hst_x1, hst_y1, hst_z1, dev_depth1, dev_K1, dev_R1, dev_T1, dev_triangles1, dev_x1, dev_y1, dev_z1, dev_triCenters1);
// handle errors

cudaStatus = triangulateWithCudaDirect(depth_map2, K2, R2, T2, hst_triangles2, hst_x2, hst_y2, hst_z2, dev_depth2, dev_K2, dev_R2, dev_T2, dev_triangles2, dev_x2, dev_y2, dev_z2, dev_triCenters2);
// handle errors

cudaStatus = triangulateWithCudaDirect(depth_map3, K3, R3, T3, hst_triangles3, hst_x3, hst_y3, hst_z3, dev_depth3, dev_K3, dev_R3, dev_T3, dev_triangles3, dev_x3, dev_y3, dev_z3, dev_triCenters3);
// handle errors

cudaStatus = triangulateWithCudaDirect(depth_map4, K4, R4, T4, hst_triangles4, hst_x4, hst_y4, hst_z4, dev_depth4, dev_K4, dev_R4, dev_T4, dev_triangles4, dev_x4, dev_y4, dev_z4, dev_triCenters4);
// handle errors

Calling function B:
cudaStatus = removeRedundantTrianglesWithCuda(hst_triangles1, dev_triangles1, dev_triCenters1, dev_triCenters2, dev_K1, dev_R1, dev_T1, dev_K2, dev_R2, dev_T2);
// handle errors

cudaStatus = removeRedundantTrianglesWithCuda(hst_triangles2, dev_triangles2, dev_triCenters2, dev_triCenters3, dev_K2, dev_R2, dev_T2, dev_K3, dev_R3, dev_T3);
// handle errors

cudaStatus = removeRedundantTrianglesWithCuda(hst_triangles3, dev_triangles3, dev_triCenters3, dev_triCenters4, dev_K3, dev_R3, dev_T3, dev_K4, dev_R4, dev_T4);
// handle errors

cudaStatus = removeRedundantTrianglesWithCuda(hst_triangles4, dev_triangles4, dev_triCenters4, dev_triCenters1, dev_K4, dev_R4, dev_T4, dev_K1, dev_R1, dev_T1);
// handle errors

And then function B cannot access the device-side data. With the moderator's approach, does malloc_a amount to allocating a second time inside B?

Hello OP:

I've seen the call-site code you added, and it matches my guess.

Also, please reread my explanation in #12 of how to modify a global pointer. (Note that this is purely basic C usage.)

And no, I never said you need to allocate again inside B, and that is not what my example code means.

Please think it through carefully.

Good luck!

Sorry, I get it now.
The pointer parameters passed into the allocating function A should be double pointers (**); the pointer's address has to be passed in. Otherwise the function only receives a copy, which is destroyed when the function returns.
I don't write much C and missed that.
Thanks, moderator.

You're welcome; congratulations on solving the problem.

It should go more smoothly from here on!

Happy coding!