A question about the matrix multiplication example that ships with CUDA

// Note: in the SDK source this kernel is a template, which is where
// block_size and size_type come from, and AS/BS are indexing macros
// defined just above it:
// #define AS(i, j) As[i][j]
// #define BS(i, j) Bs[i][j]

template <int block_size, typename size_type>
__global__ void
matrixMul(float *C, float *A, float *B, size_type wA, size_type wB)
{
   // Block index
   size_type bx = blockIdx.x;
   size_type by = blockIdx.y;

   // Thread index
   size_type tx = threadIdx.x;
   size_type ty = threadIdx.y;

   // Index of the first sub-matrix of A processed by the block
   size_type aBegin = wA * block_size * by;

   // Index of the last sub-matrix of A processed by the block
   size_type aEnd   = aBegin + wA - 1;

   // Step size used to iterate through the sub-matrices of A
   size_type aStep  = block_size;

   // Index of the first sub-matrix of B processed by the block
   size_type bBegin = block_size * bx;

   // Step size used to iterate through the sub-matrices of B
   size_type bStep  = block_size * wB;

   // Csub is used to store the element of the block sub-matrix
   // that is computed by the thread
   float Csub = 0;

   // Loop over all the sub-matrices of A and B
   // required to compute the block sub-matrix
   for (size_type a = aBegin, b = bBegin;
        a <= aEnd;
        a += aStep, b += bStep)
   {
      // Declaration of the shared memory array As used to
      // store the sub-matrix of A
      __shared__ float As[block_size][block_size];

      // Declaration of the shared memory array Bs used to
      // store the sub-matrix of B
      __shared__ float Bs[block_size][block_size];

      // Load the matrices from device memory
      // to shared memory; each thread loads
      // one element of each matrix
      AS(ty, tx) = A[a + wA * ty + tx];
      BS(ty, tx) = B[b + wB * ty + tx];

      // Synchronize to make sure the matrices are loaded
      __syncthreads();

      // Multiply the two matrices together;
      // each thread computes one element
      // of the block sub-matrix
      for (size_type k = 0; k < block_size; ++k)
         Csub += AS(ty, k) * BS(k, tx);

      // Synchronize to make sure that the preceding
      // computation is done before loading two new
      // sub-matrices of A and B in the next iteration
      __syncthreads();
   }

   // Write the block sub-matrix to device memory;
   // each thread writes one element
   size_type c = wB * block_size * by + block_size * bx;
   C[c + wB * ty + tx] = Csub;
}

// Matrix dimensions used by the host code (these #defines come from the
// sample's header, not from inside the kernel):
#define WA (4 * block_size) // Matrix A width
#define HA (6 * block_size) // Matrix A height
#define WB (4 * block_size) // Matrix B width
#define HB WA  // Matrix B height
#define WC WB  // Matrix C width
#define HC HA  // Matrix C height

I was studying this matrix multiplication example today, but I really couldn't make sense of it. I even drew the matrices out on paper and still didn't get it.

What is the outer loop for? Is this traversal pattern commonly used in matrix multiplication code?

One more beginner question: what is a good way to learn CUDA programming? My programming background is only average. Many thanks to the moderator for any pointers.

Hello, OP:

Regarding this matrix multiplication example, please see the CUDA C Programming Guide. The manual walks through it with detailed comments (roughly as many comment lines as code lines), and it also includes a version that does not use shared memory, which you can study side by side for comparison.
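To make that comparison concrete, here is a minimal sketch in the spirit of the Guide's non-shared-memory version (not its exact code): each thread reads its entire row of A and column of B straight from global memory, so every element of A and B is fetched once per thread that needs it.

// Naive version for comparison: no shared memory, no tiling.
// Each thread computes one element of C = A * B, with A of width wA
// and B of width wB (sketch only; bounds checks and launch code omitted).
__global__ void matrixMulNaive(float *C, const float *A, const float *B,
                               int wA, int wB)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    float sum = 0.0f;
    for (int k = 0; k < wA; ++k)
        sum += A[row * wA + k] * B[k * wB + col];  // both reads hit global memory

    C[row * wB + col] = sum;
}

Seen against this naive version, the outer loop you asked about is the tiling loop: each block walks across one row of tiles of A and down one column of tiles of B, staging each block_size x block_size tile in shared memory so that the block's threads can reuse it block_size times instead of re-reading global memory.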

Also, this example exists purely to teach the use of shared memory; matrix multiplication has far more efficient implementations in the various linear algebra libraries. You can use those libraries directly, or study the algorithms they employ.
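For example, with cuBLAS (which ships with the CUDA toolkit) the whole multiplication is a single library call. A rough sketch, assuming square n x n matrices already resident on the device; note that cuBLAS uses column-major storage, so the operation flags and leading dimensions need care in real code:

#include <cublas_v2.h>

// Compute C = A * B on the device with cuBLAS SGEMM.
// d_A, d_B, d_C are device pointers to n*n floats, column-major.
void gemm_with_cublas(cublasHandle_t handle, const float *d_A,
                      const float *d_B, float *d_C, int n)
{
    const float alpha = 1.0f, beta = 0.0f;
    // C = alpha * A * B + beta * C
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, n, n,
                &alpha, d_A, n, d_B, n,
                &beta, d_C, n);
}

Libraries like this apply far more aggressive blocking, register tiling, and architecture-specific tuning than the SDK sample, which is why the sample is best read as a shared-memory tutorial rather than as a fast GEMM.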

As for learning CUDA programming, look at the common, reasonably recent reference books, the CUDA course videos and slides from universities abroad, and the manuals that ship with CUDA, especially the CUDA C Programming Guide. Beyond that, make good use of online resources, such as Google searches and browsing the past discussion threads on this forum.

Most important of all, of course, is putting in enough effort; there is no shortcut in this field.

Good luck!

Thank you!
I hope to keep making progress!