如何打破memory bandwidth limited

system · 2012 年12 月 3 日 02:11

对于如下的Kernel用Compute Visual Profile进行分析，这样简单的Kernel也能运用像它提供的建议吗？[attach]2850[/attach]
global void KernelV2(int *dev_ZZ,double *dev_V,double dev_VKK)
{
int k=blockIdx.x512+threadIdx.x;
int i=blockIdx.y;

if((i<XROW)&&(k<XROW))
{
for(int j=0;j<(XCOL*(XCOL-1)/2);j++)
{
dev_V[i*XROW+k]=dev_V[i*XROW+k]+dev_ZZ[i+j*XROW]dev_ZZ[k+jXROW]*dev_VKK[j];
}
}
}
结果如下所示：

Analysis for kernel KernelV2 on device GeForce GTX 580

Summary profiling information for the kernel:

Number of calls: 1

GPU time(us): 287692.88

GPU time (%): 15.17

Grid size: [2 600 1]

Block size: [512 1 1]

Limiting Factor

Achieved Instruction Per Byte Ratio: 2.73 ( Balanced Instruction Per Byte Ratio: 4.16 )

Achieved Occupancy: 0.24 ( Theoretical Occupancy: 1.00 )

IPC: 1.63 ( Maximum IPC: 2 )

Achieved global memory throughput: 129.96 ( Peak global memory throughput(GB/s): 192.38 )

Hint(s)
[list]

[]The achieved instructions per byte ratio for the kernel is less than the balanced instruction per byte ratio for the device. Hence, the kernel is likely memory bandwidth limited. For details, click on Memory Throughput Analysis.
[]The kernel occupancy is low. For details, click on Occupancy Analysis.
[/list]
Factors that may affect analysis
[list]

[]The counters of type SM are collected only for 1 multiprocessor in the chip and the values are extrapolated to get the behavior of entire GPU assuming equal work distribution. This may result in some inaccuracy in the analysis in some cases.
[]The counters for some derived stats are collected in different runs of application. This may cause some inaccuracy in the derived statistics as the blocks scheduled on each multiprocessor may be different for each run and for some applications the behavior changes for each run.
[*]The derived statistics instruction per byte ratio and IPC assume that all instructions are single precision floating point instructions. If the application uses double precision floating point instructions then the limiting factor predicted here may be incorrect.
[/list]
Memory Throughput Analysis for kernel KernelV2 on device GeForce GTX 580

[list]

[]Kernel requested global memory read throughput(GB/s): 38.88
[]Kernel requested global memory write throughput(GB/s): 72.68
[*]Kernel requested global memory throughput(GB/s): 111.56
[/list][list]

[*]L2 cache texture memory read throughput(GB/s): 0.00
[/list][list]

[]L2 cache global memory read throughput(GB/s): 163.40
[]L2 cache global memory write throughput(GB/s): 72.68
[]L2 cache global memory throughput(GB/s): 236.08
[]Global memory excess load(%): 76.20
[*]Global memory excess store(%): 0.00
[/list][list]

[]Achieved global memory read throughput(GB/s): 58.44
[]Achieved global memory write throughput(GB/s): 71.52
[*]Achieved global memory throughput(GB/s): 129.96
[/list][list]

[*]Peak global memory throughput(GB/s): 192.38
[/list]
The following derived statistic(s) cannot be computed as required counters are not available:
[list]

[]L1 cache read throughput(GB/s)
[]L1 cache global hit ratio (%)
[]Texture cache memory throughput(GB/s)
[]Texture cache hit rate(%)
[*]Local memory bus traffic(%)
[/list]
Hint(s)
[list]

[*]Memory access pattern is not coalesced. The kernel requested throughput and achieved global memory throughput can be different because of following two reasons:
[list]

[]Scattered/misaligned pattern: not all transaction bytes are utilized.
[]Broadcast: the same transaction serves many requests (due to sector size, cache line size and caching);
[/list]
Refer to the “Global Memory” section in the “Performance Guidelines” chapter of the CUDA C Programming Guide for more details.
[*]Access pattern of global memory load is not coalesced resulting in multiple transactions. In perfectly coalesced access 1 gld instruction for 32, 64,128 bit word size should cause 1,2 and 4 L1 cache line(128 byte) accesses respectively.
[list]

[]Try disabling L1 cache for global memory using compiler option -Xptxas dlcm=cg. The uncached memory transactions are in multiples of 32, 64 and 128 bytes, where as the cached transactions are always in multiples of 128 bytes;
[]Consider using shared memory as a user managed cache to store frequently accessed data, instead of making multiple accesses to global memory;
[]Try using texture memory for read only resources, the transaction size is smaller and it has a separate cache;
[]Try re-arranging the data if possible;
[/list][]Consider using shared memory as a user managed cache for frequently accessed global memory resources.
Refer to the “Shared Memory” section in the “CUDA C Runtime” chapter of the CUDA C Programming Guide for more details.
[]The achieved global memory throughput is low compared to the peak global memory throughput. To achieve closer to peak global memory throughput, try to
[list]

[]Launch enough threads to hide memory latency (check occupancy analysis);
[]Process more data per thread to hide memory latency;
[/list][*]Consider using texture memory for read only global memory, texture memory has its own cache so it does not pollute L1 cache, this cache is also optimized for 2D spatial locality.
Refer to the “Texture Memory” section in the “CUDA C Runtime” chapter of the CUDA C Programming Guide for more details.
[/list]
Factors that may affect analysis
[list]

[]If display is attached to the GPU that is being profiled, the DRAM reads, DRAM writes, l2 read hit ratio and l2 write hit ratio may include data for display in addition to the data for kernel that is being profiled.
[]The thresholds that are used to provide the hints may not be accurate in all cases. It is recommended to analyze all derived statistics and signals and correlate them with your algorithm before arriving to any conclusion.
[]The value of a particular derived statistic provided in the analysis window is the average value of the derived statistic for all calls of that kernel. To know the value of the derived statistic corresponding to a particular call please refer to the kernel profiler table.
[]The counters of type SM are collected only for 1 multiprocessor in the chip and the values are extrapolated to get the behavior of entire GPU assuming equal work distribution. This may result in some inaccuracy in the analysis in some cases.
[*]The counters for some derived stats are collected in different runs of application. This may cause some inaccuracy in the derived statistics as the blocks scheduled on each multiprocessor may be different for each run and for some applications the behavior changes for each run.
[/list]

system · 2012 年12 月 3 日 04:59

除了合并访问，你还可以尝试将dev_VKK存储在常数存储器中。
它对一个warp内多个线程访问同一全局内存地址有比较好的加速效果。

system · 2012 年12 月 3 日 06:47

大致地看了一下您的代码和profiler的结果，首先，简单的kernel也完全可以利用profiler的结果进行优化，kernel无论复杂简单一律平等。其次，profiler的结果仅供参考包括建议解决方法等，因为profiler的这些统计量是采集信息以后根据固定的公式计算出来的，还需要使用者灵活掌握。

具体到您的代码的profiler结果
因为您的kernel里面的计算较为简单，而访存挺多，所以卡在访存上大体是应该的，应该着手设法减少访存或者访存中的开销。

您的Achieved Occupancy为 0.24，这个数值不正常地偏低，如果您提供kernel的occupancy analysis的详细内容，应该有助于分析和解决问题。

下面看一下您的Memory Throughput Analysis
Kernel requested global memory read throughput(GB/s): 38.88
Kernel requested global memory write throughput(GB/s): 72.68
Kernel requested global memory throughput(GB/s): 111.56
这个看上去没有什么问题。
Achieved global memory read throughput(GB/s): 58.44
Achieved global memory write throughput(GB/s): 71.52
Achieved global memory throughput(GB/s): 129.96
这个说明写入吞吐量基本吻合，读取吞吐量要比requested的数量大20GB/s，这表示某些开销增加了实际的读取量。是哪些开销呢？后面再分析。
Peak global memory throughput(GB/s): 192.38
从这里看出，achieved global memory throughput还没有超出硬件提供的最大吞吐量。是访存不成为问题么？先放一放，继续往下看。

L2 cache global memory read throughput(GB/s): 163.40
L2 cache global memory write throughput(GB/s): 72.68
L2 cache global memory throughput(GB/s): 236.08
Global memory excess load(%): 76.20
Global memory excess store(%): 0.00

可以看到L2 cache global memory read throughput远远高于Kernel requested global memory read throughput。这代表kernel从L2读取数据的时候有很多开销。
同时Global memory excess load=100-（100kernel requested global memory read throughput/L2 read throughput）=100-（10038.88/163.40）=76.20，此项表示访问的L2读取里面有多少是实际损失掉没有用的。此项数值偏高，表示访问开销大，可能是非合并访问造成。

（从上面看到，写入一直都很正常。）

L2 cache global memory throughput(GB/s): 236.08
同时注意到这个数值，根据某来源的说法，FERMI的L2cache的带宽只是比global的带宽高一点，现在global的带宽是192GB/s，而L2的带宽已经用到了236.08GB/s。

所以，我认为，应该是您kernel访存合并性不好，造成L2读取带宽被大量占用（也就是效率比较低），虽然经过L2优化后，实际的global memory的带宽只用了2/3左右，但是L2的带宽已经被占满，所以就无法再进一步提升global memory的使用了。简单地说，L2被您的kernel折磨的精疲力尽了。——这里为个人观点，欢迎指正/补充。

好，我们现在回到kernel的代码里面。
核心的一句赋值为
dev_V[i*XROW+k]=dev_V[i*XROW+k]+dev_ZZ[i+j*XROW]dev_ZZ[k+jXROW]*dev_VKK[j];
考虑一个warp的行为，考虑循环中某次循环的情况。
赋值号左边的写入是合并的，问题不大。
右边第一项读取是合并的；第二项是一个广播，广播不影响延迟，但是因为访存粒度的问题，会丢弃一些结果，损失吞吐量；第三项是连续的，在一个循环内看，似乎问题不大，但是j每次变化对应读取位置变化很大，结合访存粒度可能会有一些损失；第四项是一个广播。

同样考虑右边第三项，现在考虑有同一block中的两个warp，因为这两个warp运行的进度一般来说是不同的，所以内循环的j一般来说是不同的。如果j相同，两个warp编号上也相邻，那么前一个warp因为这里不对齐（XROW为600）而多读取的数据可以被下一个warp使用，会在L2内部处理。但现在j一般是不相同的，所以这里读取局部性不好，不利于L2工作。

那么肿么办呢？个人的习惯和建议是把j循环拿到kernel外面。通过反复发射kernel来实现这个循环。根据您之前的说法，XROW甚至可能达到60000这样的数，那么只对应j循环的一次循环，规模也还是不小的。
如果这样，那么赋值号右边的第二项和第四项在整个block内不变的，您可以放到shared memory中，避开L2cache——global memory这个路径。同时第三项，因为j为固定值，同一个block内多个warp访问这里实际上是连续访问一段显存，此时不对齐的开销可以忽略。而剩下两项原本访问就没有什么问题。

个人预计这样能更好一些。当然，上述内容只代表个人的看法，也可能和实际完全不同，欢迎指正/补充！

欢迎LZ再次莅临cudazone，祝您编码调试愉快~

system · 2012 年12 月 3 日 06:50

另外补充一下，直接用profiler建议的这一项：
Try disabling L1 cache for global memory using compiler option -Xptxas dlcm=cg. The uncached memory transactions are in multiples of 32, 64 and 128 bytes, where as the cached transactions are always in multiples of 128 bytes;
可能能直接改善一些广播引起的读取效能下降。

system · 2012 年12 月 3 日 07:14

好的，我需要慢慢理解一下，操作一下，看看效果如何

system · 2012 年12 月 3 日 07:15

从来没有用过常数存储器，一般什么情况用比较好？

system · 2012 年12 月 3 日 07:19

Occupancy Analysis for kernel KernelV2 on device GeForce GTX 580,

Kernel details: Grid size: [2 600 1], Block size: [512 1 1]
Register Ratio: 0.5625 ( 18432 / 32768 ) [12 registers per thread]
Shared Memory Ratio: 0 ( 0 / 49152 ) [0 bytes per Block]
Active Blocks per SM: 3 (Maximum Active Blocks per SM: 8)
Active threads per SM: 1536 (Maximum Active threads per SM: 1536)
Potential Occupancy: 1 ( 48 / 48 )
Occupancy limiting factor: None
Note: The potential occupancy is calculated assuming the default cache configuration i.e. 48KB of shared memory. If the cache configuration is changed, it will lead to a difference between ‘Occupancy’ in the profiler table and potential occupancy. Also it will affect the following fields in the Kernel Occupancy Analysis: Shared Memory Ratio, Potential Occupancy, Occupancy limiting factor

system · 2012 年12 月 3 日 07:23

这里显示Occupancy limiting factor: None
不知为何您顶楼提示的占用率那么低。

要不您host安排循环把kernel跑个百十遍的，看看是不是profiler前面提示的信息不正确。
另外把XROW选成1024这种32的整倍数试试看。

system · 2012 年12 月 3 日 13:35

从代码看，你这主要是矩阵的列相乘，计算和存储器比例很高，理应加速比非常不错。

代码目前的问题是广播和非合并访问。

建议，使用cublas计算矩阵列相乘，计算前先对dev_ZZ做一下处理。

system · 2012 年12 月 3 日 14:16

正常的话占用率应该在多少啊？

system · 2012 年12 月 3 日 14:19

对了，例如我的dev_XMatrix矩阵的值是永远不变的，它是从.txt文件中读取的实验数据，是不是像这样的数据用常量存储器比较好呢？常量存储器适宜的数据大小一般是多少？

system · 2012 年12 月 3 日 14:23

好的，我再学习一下吧，现在我用cublas只会用A*B=C这种整个A乘以整个B的，如果是A的列乘以整个B的，我没有用过，我试试看吧

system · 2012 年12 月 3 日 15:43

正常的话应该接近profiler提供的理论值，你这里应该接近1。

system · 2012 年12 月 4 日 02:03

好的，现在600个样本，121个标记的情况下，程序用了50min，比之前又提高了17min，但是在600个样本的情况下，速度还是没有超过SAS——43min，样本大点肯定会比SAS算的快，不过朱军老师的QTLNETWOR(GPU)软件在100个样本的时候就能比他自己的CPU版本快4倍，我想经过优化我的算法在600个样本能达到4-5倍的加速就很满意了。这段代码之所以加速不明显我觉得是因为反复的数据传输，数据传输的次数达到（while循环*for循环）的次数，所以有可能的话减少数据传递也是加速的一个办法。

while(error>1e-8)
{	for(...){		double *dev_ZTVI;
			double *dev_RTVI;
			cudaMalloc((void**)&dev_ZTVI,XROW*sizeof(double));                //GPU端分配显存，调用Kernel
			cudaMalloc((void**)&dev_RTVI,XROW*sizeof(double));
			cudaMemcpy(dev_VI,VI,XROW*XROW*sizeof(double),cudaMemcpyHostToDevice);
			cudaMemcpy(dev_ZTVI,ZTVI,XROW*sizeof(double),cudaMemcpyHostToDevice);
			cudaMemcpy(dev_RTVI,RTVI,XROW*sizeof(double),cudaMemcpyHostToDevice);
			KernelZR22<<<blocknumZR,threadnum>>>(dev_ZTVI,dev_RTVI,dev_R,dev_VI,dev_ZZ,i,j);
			cudaMemcpy(ZTVI,dev_ZTVI,XROW*sizeof(double),cudaMemcpyDeviceToHost);
			cudaMemcpy(RTVI,dev_RTVI,XROW*sizeof(double),cudaMemcpyDeviceToHost);
			cudaFree(dev_ZTVI);
			cudaFree(dev_RTVI);

			for(int n=0;n<XROW;n++)
			{
				ZVZ=ZVZ+ZTVI[n]*ZZ[((2*XCOL*i-i*i-i)*XROW/2)+(j-i-1)*XROW+n];
			}
   }
}

system · 2012 年12 月 4 日 02:43

常数类型依然是在显存中的，不过因为可以被cache，所以比较快。
cache大小是有限的，所以比较大的数据是不行的。
这种永远不变的矩阵直接扔到显存里面，后面反复用也可以。

（另：您引用的内容和您本楼的内容无关）

欢迎莅临cudazone，祝您编码愉快。

system · 2012 年12 月 4 日 03:08

您引用我13#的内容是关于occupancy的，其逻辑关系在于你在顶楼的profiler结果中给出了
Achieved Occupancy: 0.24 ( Theoretical Occupancy: 1.00 )，只有理论占用率的不到1/4。
而您在7#给出的occupancy analysis却指出没有影响占用率的项。
这似乎是矛盾的。
您在14#引用了我在13#根据上述逻辑关系给出的说法，却在14#没有提及任何关于occupancy的内容，所以我决定无视关于occupancy的问题。

关于14#主题内容
1：你没有给出在什么情况下提高了17min，没给出做了什么程序修改，也没给出修改后profiler情况如何，又或者是使用了库？因此无法回答和评价。也无法理解关于“600个样本，121个标记”是什么意思，这只和您的算法有关，请自行保证。

2：关于某老师自己的程序100个样本时比自己的CPU程序快。这一点无法评价，除非你们使用相同的benchmark数据，相同的硬件，跑绝对的时间，否则说自己的GPU程序比CPU程序快多少倍没什么意义。他的CPU程序用了几个核心？优化了么？用的神马CPU和神马GPU？

3：关于您的算法能否达到您预期的加速比，这个需要您根据算法，计算量等等各方面评估。直接主观预设一个目标的话，无法保证您的目标能够实现。也许能加速几十倍，也许比CPU慢，这个和您的算法有关，也和实现方式有关。

4：关于您的代码，第一没看到是如何计时的；第二，您把除了kernel以外各种cudamalloc，cudamemcpy，cudafree乃至串行实现的循环都扔到大循环里面了，不知您为何要这样做，是算法要求么？这里面其他内容占用的时间可能比您的kernel计算时间还长很多。

5：减少pci-e上的传输量，尽量将数据扔到显存反复用，是正确的优化方式。

祝您编码愉快。

system · 2012 年12 月 4 日 03:25

1.我并没有忽视occupancy的问题，只是现在还没有开始解决，我现在从网上下了一些关于constant memory的东西，正在学习constant memory的特点，以及别人是怎么用的。
2.关于朱军老师的软件，是在我的电脑上运行的，和运行我的程序是一样的硬件配置（64 位WIN7+i5 2400 +GTX580 +8G内存），用的数据是它软件中自带的，QTLNETWORK在百度是可以搜到的。
3.我是用clock()函数计时的，代码比较乱比较长就没有上传，虽然我知道数据传输可能花费了大量的时间，但是这个算法我目前还没有细想如何把数据传输放到循环外面，就是先写好一些能够并行的成分，得到跟SAS软件一样的结果，下面就是开始优化了。
4.将数据扔到显存上，意思就是cudaMalloc、cudaMemcpy之后，不要cudaFree就可以了吗？

system · 2012 年12 月 4 日 03:31

常数存储器是只读的，假设我定义了一个__constant__ VK[100]，然后给VK[100]赋了值，是不是我以后就不能更新VK[100]了呢？因为VK[j]在每个while循环中是会改变的？还有关于gloabl memory合并访问，网上的资料大都跟官网的指导手册大同小异，我还不是很清楚如何才能做到合并访问，理解的很模糊~

system · 2012 年12 月 4 日 05:44

1：必须指出，您在14#引用内容是关于occupancy的，但后面没有任何关于occupancy的内容，这样引用是不合乎写作逻辑的。其次，您在1#给出的achieved occupancy是一个异常偏低的值，而且和7#给出的occupancy analysis分析结论对不上。一个是异常偏低的occupancy，一个是没有因素限制occupancy。

一般来说，较低的occupancy不可能取得良好的实现结果。况且这里的分析结果还有矛盾之处。所以建议您先研究下profiler给出的结果是不是有什么问题，又或者您的occupancy是否真的被限制住了。如果保持如此低的occupancy，那么您把constant cache用到爆可能也没什么变化。如果您使用了constant 提高了occupancy，那么有助于性能的提升。

2：关于某老师的软件，您在17#补充您的运行软硬件环境。但这其实无法完全说明问题。这表示了他的两个实现，在同一套配置上，使用同样的数据，运行速度的差别。但这无法说明他的两个程序是否都是优化的。可能他的CPU程序和GPU程序都是优化的，并且优化的好，然后GPU有4倍的提升；也可能他的CPU程序没有优化，完全没有发挥出来CPU的能力，此时衬托出了GPU的“强大”。他的CPU程序计算的时候CPU多核心都跑满了么？您用自己的CPU程序和他的CPU程序用同样的数据比较过时间么？和您们用SAS算的相比，速度如何呢？这些对比，也许您都是知道的，但是从您在帖子里面给出的信息，我们并无法得知该老师实现的实际情况，也无法以此推断您的GPU程序在何等规模下能达到如何的加速程度。（顺便说一下，QYLENTWORK什么的这一点与我无关。我并没有搜索、考证该软件的义务。如果该软件与您的算法有关，或者您在学术研究中需要使用或探讨该软件，那么一般地，请联系原作者。）

3：如果是这样，这个计时结果又让人如何评判呢？评判kernel写的好坏？（前面一直都是这个工作）你这里计时又不是kernel的，并无法以此评判kernel，换句话说，即使这里没有kernel，单纯其他代码也会花去很多时间。而同时，前面kernel自身的问题，无论是occupancy，还是合并访问，还是profiler结果的进一步验证，这里一个也没有。如果是评价host端调用的合理与否，实现如何，那么请您先思考这里面的逻辑关系，哪部分可以放在循环外，哪部分必须放在循环内，根据您的算法逻辑要求确定实现。当然本着先调对，再优化的原则，可以说您完成了第一步，但既然是没有优化过的代码，对比时间神马的，有那么好纠结么？（而且，从标题和1#的分析profiler结果，优化kernel访存，到这里的host端调用逻辑，这楼歪的…）

4:不是的。究竟如何使用，需要按照您的算法逻辑来。如果是申请一段空间，而中途一直在用（注意，不限定是一个kernel在用），那么只需要在最前面和最后面cudamalloc和cudafree（手工配对使用cudamalloc和cudafree自己控制释放显存是良好的习惯）。如果中途需要拷贝和回拷数据，那么需要多次cudamemcpy，如果中途的数据都是GPU自己生成的，而host只需要最后结果，那么也只需要开始拷贝一次过去，最终拷贝一次回来。总之需要在算法许可的情况下，尽量减少cudamemcpy，尽量复用已经申请的空间而不是反复申请。

5：您依然没有告知14#提到的“提高了17min”是如何得到的，是参考了（或反驳了）本帖前面十几楼的各种意见得到的么？亦或只是在某个特定的运算规模下得到？亦或是我们所不知道的某种情况下得到？以至于完全无法判断前面提出个各种建议是有效？亦或无效？无法获得反馈，自然也无法再进一步给出建议。

祝您编码愉快。

system · 2012 年12 月 4 日 05:52

1：根据programming guide的说法，constant__类型在kernel内固定不变的。我印象中修改后，在下一次kernel启动应该就是新值了。也即，启动kernel A，修改__constant，启动kernel B，那么B里面是修改过的值，而A是之前的值。您也可以写代码验证一下。另外 constant cache只有64KB大小，适合于放一些小的数表，这样缓冲效果好，如果数据量比较大，而读取又很随机，可能缓冲效果不好。

2：关于global memory的合并访问，请以官方的几本手册为准。如有兴趣，可以参考如下资料：[attach]2845[/attach]