For the kernel below I ran an analysis with Compute Visual Profiler. Can the suggestions it provides really be applied to a kernel as simple as this one? [attach]2850[/attach]
[code]
__global__ void KernelV2(int *dev_ZZ, double *dev_V, double *dev_VKK)
{
    int k = blockIdx.x * 512 + threadIdx.x;
    int i = blockIdx.y;
    if ((i < XROW) && (k < XROW))
    {
        for (int j = 0; j < (XCOL * (XCOL - 1) / 2); j++)
        {
            dev_V[i * XROW + k] = dev_V[i * XROW + k] + dev_ZZ[i + j * XROW] * dev_ZZ[k + j * XROW] * dev_VKK[j];
        }
    }
}
[/code]
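For reference, the following is a minimal host-side launch sketch that is consistent with the grid size [2 600 1] and block size [512 1 1] reported by the profiler below; the concrete values of XROW and XCOL and the buffer sizes are assumptions made only for illustration and are not given in the original post.
[code]
#include <cuda_runtime.h>

// Assumed values, chosen only so that grid.x = ceil(XROW/512) = 2 and
// grid.y = XROW = 600 match the profiler report; XCOL = 10 is hypothetical.
#define XROW 600
#define XCOL 10

// KernelV2 is assumed to be defined as in the listing above.

int main(void)
{
    const int nPairs = XCOL * (XCOL - 1) / 2;
    int *dev_ZZ;
    double *dev_V, *dev_VKK;

    cudaMalloc(&dev_ZZ,  sizeof(int)    * XROW * nPairs);   // indexed as dev_ZZ[i + j*XROW]
    cudaMalloc(&dev_V,   sizeof(double) * XROW * XROW);     // indexed as dev_V[i*XROW + k]
    cudaMalloc(&dev_VKK, sizeof(double) * nPairs);          // indexed as dev_VKK[j]
    cudaMemset(dev_V, 0, sizeof(double) * XROW * XROW);     // kernel accumulates into dev_V

    dim3 block(512, 1, 1);                  // block size [512 1 1]
    dim3 grid((XROW + 511) / 512, XROW, 1); // grid size [2 600 1] for XROW = 600
    KernelV2<<<grid, block>>>(dev_ZZ, dev_V, dev_VKK);
    cudaDeviceSynchronize();

    cudaFree(dev_ZZ);
    cudaFree(dev_V);
    cudaFree(dev_VKK);
    return 0;
}
[/code]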
The profiling results are as follows:
Analysis for kernel KernelV2 on device GeForce GTX 580
Summary profiling information for the kernel:
Number of calls: 1
GPU time(us): 287692.88
GPU time (%): 15.17
Grid size: [2 600 1]
Block size: [512 1 1]
Limiting Factor
Achieved Instruction Per Byte Ratio: 2.73 ( Balanced Instruction Per Byte Ratio: 4.16 )
Achieved Occupancy: 0.24 ( Theoretical Occupancy: 1.00 )
IPC: 1.63 ( Maximum IPC: 2 )
Achieved global memory throughput: 129.96 ( Peak global memory throughput(GB/s): 192.38 )
Hint(s)
[list]
[*]The achieved instructions per byte ratio for the kernel is less than the balanced instruction per byte ratio for the device. Hence, the kernel is likely memory bandwidth limited. For details, click on Memory Throughput Analysis.
[*]The kernel occupancy is low. For details, click on Occupancy Analysis.
[/list]
Factors that may affect analysis
[list]
[*]The counters of type SM are collected only for 1 multiprocessor in the chip and the values are extrapolated to get the behavior of entire GPU assuming equal work distribution. This may result in some inaccuracy in the analysis in some cases.
[*]The counters for some derived stats are collected in different runs of application. This may cause some inaccuracy in the derived statistics as the blocks scheduled on each multiprocessor may be different for each run and for some applications the behavior changes for each run.
[*]The derived statistics instruction per byte ratio and IPC assume that all instructions are single precision floating point instructions. If the application uses double precision floating point instructions then the limiting factor predicted here may be incorrect.
[/list]
Memory Throughput Analysis for kernel KernelV2 on device GeForce GTX 580
[list]
[*]Kernel requested global memory read throughput(GB/s): 38.88
[*]Kernel requested global memory write throughput(GB/s): 72.68
[*]Kernel requested global memory throughput(GB/s): 111.56
[/list][list]
[*]L2 cache texture memory read throughput(GB/s): 0.00
[/list][list]
[*]L2 cache global memory read throughput(GB/s): 163.40
[*]L2 cache global memory write throughput(GB/s): 72.68
[*]L2 cache global memory throughput(GB/s): 236.08
[*]Global memory excess load(%): 76.20
[*]Global memory excess store(%): 0.00
[/list][list]
[*]Achieved global memory read throughput(GB/s): 58.44
[*]Achieved global memory write throughput(GB/s): 71.52
[*]Achieved global memory throughput(GB/s): 129.96
[/list][list]
[*]Peak global memory throughput(GB/s): 192.38
[/list]
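As a quick cross-check of these numbers, the reported global memory excess load of 76.20% appears consistent with the ratio between requested and L2 read throughput: 1 − 38.88/163.40 ≈ 0.762, i.e. only about a quarter of the bytes read through L2 were actually requested by the kernel, which is what the coalescing hints below are pointing at.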
The following derived statistic(s) cannot be computed as required counters are not available:
[list]
[*]L1 cache read throughput(GB/s)
[*]L1 cache global hit ratio (%)
[*]Texture cache memory throughput(GB/s)
[*]Texture cache hit rate(%)
[*]Local memory bus traffic(%)
[/list]
Hint(s)
[list]
[*]Memory access pattern is not coalesced. The kernel requested throughput and achieved global memory throughput can differ for the following two reasons:
[list]
[*]Scattered/misaligned pattern: not all transaction bytes are utilized.
[*]Broadcast: the same transaction serves many requests (due to sector size, cache line size and caching);
[/list]
Refer to the “Global Memory” section in the “Performance Guidelines” chapter of the CUDA C Programming Guide for more details.
[*]Access pattern of global memory load is not coalesced, resulting in multiple transactions. In perfectly coalesced access, 1 gld instruction for 32, 64, 128 bit word sizes should cause 1, 2 and 4 L1 cache line (128 byte) accesses respectively.
[list]
[*]Try disabling the L1 cache for global memory using the compiler option -Xptxas -dlcm=cg. The uncached memory transactions are in multiples of 32, 64 and 128 bytes, whereas the cached transactions are always in multiples of 128 bytes;
[*]Consider using shared memory as a user-managed cache to store frequently accessed data, instead of making multiple accesses to global memory;
[*]Try using texture memory for read-only resources; the transaction size is smaller and it has a separate cache;
[*]Try re-arranging the data if possible;
[/list][*]Consider using shared memory as a user-managed cache for frequently accessed global memory resources.
Refer to the “Shared Memory” section in the “CUDA C Runtime” chapter of the CUDA C Programming Guide for more details.
[*]The achieved global memory throughput is low compared to the peak global memory throughput. To achieve closer to peak global memory throughput, try to
[list]
[*]Launch enough threads to hide memory latency (check occupancy analysis);
[*]Process more data per thread to hide memory latency;
[/list][*]Consider using texture memory for read only global memory, texture memory has its own cache so it does not pollute L1 cache, this cache is also optimized for 2D spatial locality.
Refer to the “Texture Memory” section in the “CUDA C Runtime” chapter of the CUDA C Programming Guide for more details.
[/list]
Factors that may affect analysis
[list]
[*]If display is attached to the GPU that is being profiled, the DRAM reads, DRAM writes, l2 read hit ratio and l2 write hit ratio may include data for display in addition to the data for the kernel that is being profiled.
[*]The thresholds that are used to provide the hints may not be accurate in all cases. It is recommended to analyze all derived statistics and signals and correlate them with your algorithm before arriving at any conclusion.
[*]The value of a particular derived statistic provided in the analysis window is the average value of the derived statistic for all calls of that kernel. To know the value of the derived statistic corresponding to a particular call please refer to the kernel profiler table.
[*]The counters of type SM are collected only for 1 multiprocessor in the chip and the values are extrapolated to get the behavior of entire GPU assuming equal work distribution. This may result in some inaccuracy in the analysis in some cases.
[*]The counters for some derived stats are collected in different runs of application. This may cause some inaccuracy in the derived statistics as the blocks scheduled on each multiprocessor may be different for each run and for some applications the behavior changes for each run.
[/list]
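To make the profiler's shared-memory hint concrete for this particular kernel, below is a minimal sketch (not part of the original post) of one way KernelV2 could be rewritten: dev_VKK is loaded once per block into shared memory as a user-managed cache, and the running sum is kept in a register so that dev_V is read and written only once per thread instead of once per loop iteration. It assumes that NPAIRS = XCOL*(XCOL-1)/2 doubles fit in shared memory and that the kernel is launched with the same grid and block configuration as before; whether it actually improves the reported throughput would still have to be measured.
[code]
// Sketch only: NPAIRS doubles must fit in shared memory
// (48 KB per SM on a GTX 580, i.e. up to 6144 doubles).
#define NPAIRS (XCOL * (XCOL - 1) / 2)

__global__ void KernelV2_shared(const int *dev_ZZ, double *dev_V, const double *dev_VKK)
{
    __shared__ double s_VKK[NPAIRS];

    // Each block loads dev_VKK into shared memory once (user-managed cache),
    // instead of every thread re-reading it from global memory in the loop.
    for (int j = threadIdx.x; j < NPAIRS; j += blockDim.x)
        s_VKK[j] = dev_VKK[j];
    __syncthreads();

    int k = blockIdx.x * blockDim.x + threadIdx.x;
    int i = blockIdx.y;
    if ((i < XROW) && (k < XROW))
    {
        double acc = 0.0;                       // accumulate in a register
        for (int j = 0; j < NPAIRS; j++)
            acc += dev_ZZ[i + j * XROW] * dev_ZZ[k + j * XROW] * s_VKK[j];
        dev_V[i * XROW + k] += acc;             // one global read-modify-write per thread
    }
}
[/code]
The dev_ZZ[k + j*XROW] load is already coalesced across the threads of a warp (consecutive k), and dev_ZZ[i + j*XROW] is the same address for every thread in a block, so most of the remaining global traffic should come from the two dev_ZZ reads per iteration; caching the block's dev_ZZ[i + j*XROW] values in shared memory as well could be a next step if this version still shows up as bandwidth limited.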