On using cublas<t>gemmBatched

system · March 28, 2013, 08:15

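Hello everyone! I want to perform a batch of complex matrix multiplications AB, with A (m×k) and B (k×n), so I called the library function cublasCgemmBatched.
Two problems came up:
1. The measured execution time of the library call is zero (with an ordinary library function such as cublasCgemm, a time can be measured);
2. After the library call has run, the variables cannot be freed (if the call is skipped, they are freed normally).
Why is this happening? Thanks for any advice!
The complete source code is attached: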

	const int n=300;
	const int m=200;
	const int k=100;
	const int batch=8;
	cuComplex alpha,beta;
	alpha.x=1.0f;alpha.y=0.0f;
	beta.x=0.0f;beta.y=0.0f;

	cublasHandle_t handle;
	cublasCreate(&handle);
	cublasStatus_t status;
	cudaError_t err;

	const cuComplex* (A[batch]);
	const cuComplex* (B[batch]);
	Complex* (C[batch]);
	for(int i=0;i<batch;i++)
		err=cudaMalloc((void**)&(A[i]),m*k*sizeof(Complex));
	for(int i=0;i<batch;i++)
		err=cudaMalloc((void**)&(B[i]),n*k*sizeof(Complex));
	for(int i=0;i<batch;i++)
		err=cudaMalloc((void**)&(C[i]),n*m*sizeof(Complex));

	cudaEvent_t start,stop;
	cudaEventCreate(&start);
	cudaEventCreate(&stop);
	float timerr;
	cudaEventRecord(start,0);

	status=cublasCgemmBatched(handle,CUBLAS_OP_N,CUBLAS_OP_N,m,n,k,&alpha,A,m,B,k,&beta,C,m,batch);
	//status=cublasCgemm(handle,CUBLAS_OP_T,CUBLAS_OP_N,m,n,k,&alpha,A[0],k,B[0],k,&beta,C[0],m);// with this call instead, a time can be measured
	cudaEventRecord(stop,0);
	cudaEventSynchronize(stop);
	cudaEventElapsedTime(&timerr,start,stop);
	printf("%.4f\n",timerr);

	cublasDestroy(handle);
	cudaEventDestroy(start);
	cudaEventDestroy(stop);

	for(int i=0;i<batch;i++)
		err=cudaFree((cuComplex*)(A[i]));// freeing fails here with the error "ErrorUnKnown"
	for(int i=0;i<batch;i++)
		err=cudaFree((cuComplex*)(B[i]));
	for(int i=0;i<batch;i++)
		err=cudaFree(C[i]);
	getchar();

system · March 28, 2013, 09:13

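I don't know cuBLAS... but comparing the two calls in your code, could the problem be here? Namely:
cublasCgemmBatched(handle,CUBLAS_OP_N,CUBLAS_OP_N,m,n,k,&alpha,A,m,B,k,&beta,C,m,batch);

Change it to:
cublasCgemmBatched(handle,CUBLAS_OP_T,CUBLAS_OP_N,m,n,k,&alpha,A,k,B,k,&beta,C,m,batch);

I don't know this library, so this is only a guess and the reply may be wrong. Please check it carefully.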

system · March 28, 2013, 09:31

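Or, alternatively, without switching to OP_T,
change it directly to:
cublasCgemmBatched(handle,CUBLAS_OP_N,CUBLAS_OP_N,m,n,k,&alpha,A,k,B,n,&beta,C,n,batch);

Like this?

I haven't used cuBLAS... I hope others can advise. I'm posting this knowing I may embarrass myself.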

system · March 28, 2013, 10:42

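I took a quick look, and the function arguments seem fine. That said, I don't know why the OP's run fails either.

I'll leave it for others to add.

Good luck to the OP~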

system · April 1, 2013, 08:18

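The moderator is being modest. I found some source code online that uses this function and discovered the problem was in how the function's arguments were set up (this function feels really cumbersome to use). Anyone who needs this function in the future can refer to:
h_t_t_p://_docs._nvidia._com/cuda/cuda-samples/index.html#batchcublas
(Why can't I post links?)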
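For later readers, here is a minimal sketch of one argument setup that works with cublasCgemmBatched. The reply above does not say which argument was wrong, so this is a guess: the cuBLAS documentation requires the pointer arrays Aarray, Barray and Carray to reside in device memory, whereas the code in the first post passes host arrays, which would be consistent with both symptoms. The CHECK_CUDA and CHECK_CUBLAS macros and the names hA/dA etc. are made up for this example, and the inputs are simply zero-filled.

```cpp
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>
#include <cuComplex.h>
#include <cublas_v2.h>

// Hypothetical helper macros (not part of CUDA or cuBLAS): stop at the first failure.
#define CHECK_CUDA(call) do { cudaError_t e_ = (call); if (e_ != cudaSuccess) { \
    std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__, cudaGetErrorString(e_)); \
    std::exit(1); } } while (0)
#define CHECK_CUBLAS(call) do { cublasStatus_t s_ = (call); if (s_ != CUBLAS_STATUS_SUCCESS) { \
    std::fprintf(stderr, "%s:%d: cuBLAS status %d\n", __FILE__, __LINE__, (int)s_); \
    std::exit(1); } } while (0)

int main() {
    const int m = 200, n = 300, k = 100, batch = 8;
    const cuComplex alpha = make_cuComplex(1.0f, 0.0f);
    const cuComplex beta  = make_cuComplex(0.0f, 0.0f);

    cublasHandle_t handle;
    CHECK_CUBLAS(cublasCreate(&handle));

    // Host-side arrays that hold the device address of every matrix.
    cuComplex *hA[batch], *hB[batch], *hC[batch];
    for (int i = 0; i < batch; ++i) {
        CHECK_CUDA(cudaMalloc((void**)&hA[i], sizeof(cuComplex) * m * k));
        CHECK_CUDA(cudaMalloc((void**)&hB[i], sizeof(cuComplex) * k * n));
        CHECK_CUDA(cudaMalloc((void**)&hC[i], sizeof(cuComplex) * m * n));
        // Zero-fill the inputs so the example reads defined data.
        CHECK_CUDA(cudaMemset(hA[i], 0, sizeof(cuComplex) * m * k));
        CHECK_CUDA(cudaMemset(hB[i], 0, sizeof(cuComplex) * k * n));
    }

    // The pointer arrays themselves are copied to device memory as well;
    // these device copies are what gets passed to the batched call.
    const cuComplex **dA, **dB;
    cuComplex **dC;
    CHECK_CUDA(cudaMalloc((void**)&dA, sizeof(hA)));
    CHECK_CUDA(cudaMalloc((void**)&dB, sizeof(hB)));
    CHECK_CUDA(cudaMalloc((void**)&dC, sizeof(hC)));
    CHECK_CUDA(cudaMemcpy(dA, hA, sizeof(hA), cudaMemcpyHostToDevice));
    CHECK_CUDA(cudaMemcpy(dB, hB, sizeof(hB), cudaMemcpyHostToDevice));
    CHECK_CUDA(cudaMemcpy(dC, hC, sizeof(hC), cudaMemcpyHostToDevice));

    cudaEvent_t start, stop;
    CHECK_CUDA(cudaEventCreate(&start));
    CHECK_CUDA(cudaEventCreate(&stop));
    CHECK_CUDA(cudaEventRecord(start, 0));

    // Column-major storage: A is m x k (lda = m), B is k x n (ldb = k), C is m x n (ldc = m).
    CHECK_CUBLAS(cublasCgemmBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                                    &alpha, dA, m, dB, k, &beta, dC, m, batch));

    CHECK_CUDA(cudaEventRecord(stop, 0));
    CHECK_CUDA(cudaEventSynchronize(stop));  // also reports errors from the launched kernels
    float ms = 0.0f;
    CHECK_CUDA(cudaEventElapsedTime(&ms, start, stop));
    std::printf("%.4f ms\n", ms);

    CHECK_CUDA(cudaEventDestroy(start));
    CHECK_CUDA(cudaEventDestroy(stop));
    for (int i = 0; i < batch; ++i) {
        CHECK_CUDA(cudaFree(hA[i]));
        CHECK_CUDA(cudaFree(hB[i]));
        CHECK_CUDA(cudaFree(hC[i]));
    }
    CHECK_CUDA(cudaFree(dA));
    CHECK_CUDA(cudaFree(dB));
    CHECK_CUDA(cudaFree(dC));
    CHECK_CUBLAS(cublasDestroy(handle));
    return 0;
}
```

It should build with something like nvcc example.cu -lcublas (the file name is arbitrary). Checking every return value, instead of storing err and status and never reading them, also makes the first failing call visible right away.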

system · April 1, 2013, 08:21

How embarrassing, it seems it is right there in NVIDIA's sample programs...

system · April 8, 2013, 06:04

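Er... here's the thing: I haven't used cublasCgemmBatched, but one point to watch is that cuBLAS stores matrices in column-major order. If your C++ matrices are stored row-major, you pass the arguments in the normal order, and the matrices are not square, calling this function is bound to cause memory errors!
To check, run cuda-memcheck your_program.exe [-your_arguments] on the command line. If the program does not pass cuda-memcheck, the argument order is likely wrong...

Take a row-major matrix A: if no transpose is applied, cuBLAS reads it as column-major and it becomes A(T), so the original AB becomes A(T)B(T). If the matrices are not square, there is no guarantee that these two can even be multiplied.
To get AB, the cuBLAS input order should be B first, then A: C(T) = (AB)(T) = B(T)*A(T). The column-major C(T) that cuBLAS outputs is automatically transposed into C when read as a row-major C++ matrix.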
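The operand swap described above can be checked on the CPU without any GPU. The sketch below is not cuBLAS code: gemm_colmajor is a made-up reference routine that only mimics the column-major convention and argument order of cublasCgemm with CUBLAS_OP_N, and the other function names are likewise invented for the example.

```cpp
#include <cassert>
#include <complex>
#include <vector>

using Cplx = std::complex<float>;

// Reference column-major GEMM (hypothetical stand-in for a cuBLAS-style call):
// C (m x n) = A (m x k) * B (k x n), element (i, j) of a matrix with leading
// dimension ld is stored at index i + j * ld.
void gemm_colmajor(int m, int n, int k,
                   const Cplx* A, int lda,
                   const Cplx* B, int ldb,
                   Cplx* C, int ldc) {
    assert(lda >= m && ldb >= k && ldc >= m);
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            Cplx sum(0.0f, 0.0f);
            for (int p = 0; p < k; ++p) {
                sum += A[i + p * lda] * B[p + j * ldb];
            }
            C[i + j * ldc] = sum;
        }
    }
}

// Plain row-major product for comparison: C (m x n) = A (m x k) * B (k x n).
std::vector<Cplx> matmul_rowmajor(int m, int n, int k,
                                  const std::vector<Cplx>& A,
                                  const std::vector<Cplx>& B) {
    std::vector<Cplx> C(m * n);
    for (int i = 0; i < m; ++i) {
        for (int j = 0; j < n; ++j) {
            Cplx sum(0.0f, 0.0f);
            for (int p = 0; p < k; ++p) {
                sum += A[i * k + p] * B[p * n + j];
            }
            C[i * n + j] = sum;
        }
    }
    return C;
}

// Row-major C = A * B through the column-major routine: pass B first and A
// second, swap m with n, and use the row lengths as leading dimensions.
// The routine then computes C(T) = B(T) * A(T) in column-major storage,
// which is the same memory layout as row-major C.
std::vector<Cplx> matmul_via_colmajor(int m, int n, int k,
                                      const std::vector<Cplx>& A,
                                      const std::vector<Cplx>& B) {
    std::vector<Cplx> C(m * n);
    gemm_colmajor(n, m, k, B.data(), n, A.data(), k, C.data(), n);
    return C;
}
```

For row-major inputs A (m×k) and B (k×n), matmul_via_colmajor(m, n, k, A, B) returns the same buffer as matmul_rowmajor(m, n, k, A, B). With cuBLAS, the analogous call would pass B with leading dimension n, then A with leading dimension k, and C with leading dimension n.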