A question about the matrix multiplication example that ships with CUDA

// Note: in the SDK source this kernel is a template, which is where
// block_size and size_type come from, and AS/BS are indexing macros
// defined just above it:
// #define AS(i, j) As[i][j]
// #define BS(i, j) Bs[i][j]

template <int block_size, typename size_type>
__global__ void
matrixMul(float *C, float *A, float *B, size_type wA, size_type wB)
{
   // Block index
   size_type bx = blockIdx.x;
   size_type by = blockIdx.y;

   // Thread index
   size_type tx = threadIdx.x;
   size_type ty = threadIdx.y;

   // Index of the first sub-matrix of A processed by the block
   size_type aBegin = wA * block_size * by;

   // Index of the last sub-matrix of A processed by the block
   size_type aEnd   = aBegin + wA - 1;

   // Step size used to iterate through the sub-matrices of A
   size_type aStep  = block_size;

   // Index of the first sub-matrix of B processed by the block
   size_type bBegin = block_size * bx;

   // Step size used to iterate through the sub-matrices of B
   size_type bStep  = block_size * wB;

   // Csub is used to store the element of the block sub-matrix
   // that is computed by the thread
   float Csub = 0;

   // Loop over all the sub-matrices of A and B
   // required to compute the block sub-matrix
   for (size_type a = aBegin, b = bBegin;
        a <= aEnd;
        a += aStep, b += bStep)
   {
      // Declaration of the shared memory array As used to
      // store the sub-matrix of A
      __shared__ float As[block_size][block_size];

      // Declaration of the shared memory array Bs used to
      // store the sub-matrix of B
      __shared__ float Bs[block_size][block_size];

      // Load the matrices from device memory
      // to shared memory; each thread loads
      // one element of each matrix
      AS(ty, tx) = A[a + wA * ty + tx];
      BS(ty, tx) = B[b + wB * ty + tx];

      // Synchronize to make sure the matrices are loaded
      __syncthreads();

      // Multiply the two matrices together;
      // each thread computes one element
      // of the block sub-matrix
      for (size_type k = 0; k < block_size; ++k)
         Csub += AS(ty, k) * BS(k, tx);

      // Synchronize to make sure that the preceding
      // computation is done before loading two new
      // sub-matrices of A and B in the next iteration
      __syncthreads();
   }

   // Write the block sub-matrix to device memory;
   // each thread writes one element
   size_type c = wB * block_size * by + block_size * bx;
   C[c + wB * ty + tx] = Csub;
}

// Matrix dimensions used by the host code (these #defines come from the
// sample's header, not from inside the kernel):
#define WA (4 * block_size) // Matrix A width
#define HA (6 * block_size) // Matrix A height
#define WB (4 * block_size) // Matrix B width
#define HB WA  // Matrix B height
#define WC WB  // Matrix C width
#define HC HA  // Matrix C height

I was studying this matrix multiplication example today, but I really couldn't make sense of it. I even drew the matrices out on paper and still didn't get it.

What is the outer loop for? Is this traversal pattern commonly used in matrix multiplication code?

One more beginner question: what is a good way to learn CUDA programming? My programming background is only average. Many thanks to the moderator for any pointers.

Hello, OP:

Regarding this matrix multiplication example, please see the CUDA C Programming Guide. The manual walks through it with detailed comments (roughly as many comment lines as code lines), and it also includes a version that does not use shared memory, which you can study side by side for comparison.
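To make that comparison concrete, here is a minimal sketch in the spirit of the Guide's non-shared-memory version (not its exact code): each thread reads its entire row of A and column of B straight from global memory, so every element of A and B is fetched once per thread that needs it.

// Naive version for comparison: no shared memory, no tiling.
// Each thread computes one element of C = A * B, with A of width wA
// and B of width wB (sketch only; bounds checks and launch code omitted).
__global__ void matrixMulNaive(float *C, const float *A, const float *B,
                               int wA, int wB)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    float sum = 0.0f;
    for (int k = 0; k < wA; ++k)
        sum += A[row * wA + k] * B[k * wB + col];  // both reads hit global memory

    C[row * wB + col] = sum;
}

Seen against this naive version, the outer loop you asked about is the tiling loop: each block walks across one row of tiles of A and down one column of tiles of B, staging each block_size x block_size tile in shared memory so that the block's threads can reuse it block_size times instead of re-reading global memory.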

Also, this example exists purely to teach the use of shared memory; matrix multiplication has far more efficient implementations in the various linear algebra libraries. You can use those libraries directly, or study the algorithms they employ.
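For example, with cuBLAS (which ships with the CUDA toolkit) the whole multiplication is a single library call. A rough sketch, assuming square n x n matrices already resident on the device; note that cuBLAS uses column-major storage, so the operation flags and leading dimensions need care in real code:

#include <cublas_v2.h>

// Compute C = A * B on the device with cuBLAS SGEMM.
// d_A, d_B, d_C are device pointers to n*n floats, column-major.
void gemm_with_cublas(cublasHandle_t handle, const float *d_A,
                      const float *d_B, float *d_C, int n)
{
    const float alpha = 1.0f, beta = 0.0f;
    // C = alpha * A * B + beta * C
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, n, n,
                &alpha, d_A, n, d_B, n,
                &beta, d_C, n);
}

Libraries like this apply far more aggressive blocking, register tiling, and architecture-specific tuning than the SDK sample, which is why the sample is best read as a shared-memory tutorial rather than as a fast GEMM.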

As for learning CUDA programming, look at the common, reasonably recent reference books, the CUDA course videos and slides from universities abroad, and the manuals that ship with CUDA, especially the CUDA C Programming Guide. Beyond that, make good use of online resources, such as Google searches and browsing the past discussion threads on this forum.

Most important of all, of course, is putting in enough effort; there is no shortcut in this field.

Good luck!

Thank you!
I hope to keep making progress!