Help: access violation on load (global memory)

The debug output is as follows:

Nsight Debug

CUDA Memory Checker detected 64 threads caused an access violation:
Launch Parameters
CUcontext = 00496c38
CUstream = 0357e5c0
CUmodule = 035e83f0
CUfunction = 0361a440
FunctionName = Z7encryptPjS_iS_PhS0_S0_S_S
gridDim = {32,1,1}
blockDim = {512,1,1}
sharedSize = 896
Parameters:
dev_input = 0x05d60000 16843009
key = 0x05aa0000 4294967295
length = 720896
dev_out = 0x05da0000 3452816845
S = 0x059a0000 99 ‘c’
Logtable = 0x059a0400 0 ’
Nsight Debug
Memory Checker detected 64 access violations.
error = access violation on load (global memory)
blockIdx = {0,0,0}
threadIdx = {224,0,0}
address = 0x00130e68
accessSize = 4

Nsight Debug
CUDA grid launch failed: CUcontext: 4811832 CUmodule: 56525808 Function: Z7encryptPjS_iS_PhS0_S0_S_S
Nsight Debug

CUDA Memory Checker detected 64 threads caused an access violation:
Launch Parameters
CUcontext = 02aa6c38
CUstream = 0352e5c0
CUmodule = 035983c8
CUfunction = 035ca420
FunctionName = Z7encryptPjS_iS_PhS0_S0_S_S
gridDim = {32,1,1}
blockDim = {512,1,1}
sharedSize = 896
Parameters:
dev_input = 0x05d60000 16843009
key = 0x05aa0000 4294967295
length = 720896
dev_out = 0x05da0000 0
S = 0x059a0000 99 ‘c’
Logtable = 0x059a0400 0 ’
Nsight Debug
Memory Checker detected 64 access violations.
error = access violation on load (global memory)
blockIdx = {2,0,0}
threadIdx = {128,0,0}
address = 0x00884868
accessSize = 4

The device is a GT630M with 1 GB of video memory.

THREAD_NUM = 512
BLOCK_NUM = 32
The kernel is performing this step:
key[index] = keyExpansion[keyIndex];

key and keyExpansion are defined as follows:
word32* key;          length: 44 * THREAD_NUM * BLOCK_NUM
word32* keyExpansion; length: 4 * THREAD_NUM * BLOCK_NUM

The indices are:
index    = 44 * (THREAD_NUM * (BLOCK_NUM * blockIdx.y) + threadIdx.x);
keyIndex = 4 * (THREAD_NUM * (BLOCK_NUM * blockIdx.y) + threadIdx.x);

I can't work out what is causing the memory access violation. Any advice would be appreciated.
Thanks!

Hello. The device address you are trying to read is invalid.

As for why it is invalid:
(1) Check your index calculations: do they go out of bounds, and do they compute what you actually intended?

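Also, look at the expression 4 * (THREAD_NUM * (BLOCK_NUM * blockIdx.y) + threadIdx.x). With your launch shape (gridDim = {32,1,1}), blockIdx.y is always 0, so the expression reduces to 4 * threadIdx.x. If that is really what you intended, then an out-of-bounds read of keyExpansion means 4 * threadIdx.x is larger than the largest valid index. Because threadIdx.x < 512, that can only happen if keyExpansion holds fewer than 2048 elements. So:
(2) Check the device memory allocation in your host code and make sure enough space was allocated for keyExpansion. If you need 2048 elements, the call has to be cudaMalloc(&your_pointer, 2048 * sizeof(element_type)); check that you did not forget the * sizeof(element_type) factor.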
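To make the index arithmetic above concrete, here is a minimal host-only C++ sketch (it does not need the CUDA toolkit) that evaluates the posted expression for the launch shape Nsight reports. The function names are my own, and keyIndexPerThread is only a guess at what you might have intended (indexing by blockIdx.x); your real kernel may well differ.

```cpp
#include <cassert>
#include <cstddef>

// Launch shape reported by Nsight: blockDim = {512,1,1}, gridDim = {32,1,1}.
constexpr std::size_t THREAD_NUM = 512;
constexpr std::size_t BLOCK_NUM  = 32;

// keyIndex exactly as posted. blockIdxY stands in for blockIdx.y, which is
// always 0 when gridDim.y == 1.
constexpr std::size_t keyIndexAsPosted(std::size_t blockIdxY, std::size_t threadIdxX) {
    return 4 * (THREAD_NUM * (BLOCK_NUM * blockIdxY) + threadIdxX);
}

// ASSUMPTION (hypothetical, not from the original post): one 4-word slot per
// thread across the whole grid, selected by blockIdx.x instead of blockIdx.y.
constexpr std::size_t keyIndexPerThread(std::size_t blockIdxX, std::size_t threadIdxX) {
    return 4 * (THREAD_NUM * blockIdxX + threadIdxX);
}

// Largest keyIndex the posted formula can produce in this launch.
static_assert(keyIndexAsPosted(0, 511) == 2044, "posted formula tops out at 2044");

// Under the assumed per-thread layout, the last thread starts at the last
// 4-word slot of a buffer of 4 * THREAD_NUM * BLOCK_NUM words.
static_assert(keyIndexPerThread(31, 511) == 4 * THREAD_NUM * BLOCK_NUM - 4,
              "last slot of a full-size buffer");

// Host-side guard: aborts if index is not a valid position in a buffer that
// holds elementCount words.
inline void requireInBounds(std::size_t index, std::size_t elementCount) {
    assert(index < elementCount);
}
```

With the posted formula every block computes the same 512 indices (0, 4, ..., 2044), so all 32 blocks touch the same keyExpansion entries. Whether that is what you want is something only you can confirm.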

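Those are the two suggestions I can give based on the information you provided. To repeat: check that your device code computes the keyExpansion index correctly, and check that the allocation is large enough.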
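For the allocation check, a small host-only C++ sketch of the size arithmetic may help. It assumes word32 is a 32-bit unsigned integer (the post does not show its definition), and the helper names are hypothetical. The cudaMalloc call appears only in a comment so that the snippet compiles without the CUDA toolkit.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// ASSUMPTION: word32 is a 32-bit unsigned type; its definition is not shown in the post.
using word32 = std::uint32_t;

constexpr std::size_t THREAD_NUM = 512;
constexpr std::size_t BLOCK_NUM  = 32;

// Element counts as stated in the question.
constexpr std::size_t keyElements()          { return 44 * THREAD_NUM * BLOCK_NUM; }
constexpr std::size_t keyExpansionElements() { return 4 * THREAD_NUM * BLOCK_NUM; }

// cudaMalloc takes a size in BYTES, so the element count must be multiplied
// by the element size.
template <typename T>
constexpr std::size_t bytesFor(std::size_t elementCount) {
    return elementCount * sizeof(T);
}

// Intended use on the host (shown as a comment only):
//   cudaMalloc(&keyExpansion, bytesFor<word32>(keyExpansionElements()));
//   cudaMalloc(&key,          bytesFor<word32>(keyElements()));

// Host-side guard: aborts if an allocation of allocatedBytes is too small to
// hold elementCount values of type word32.
inline void requireEnoughBytes(std::size_t allocatedBytes, std::size_t elementCount) {
    assert(allocatedBytes >= bytesFor<word32>(elementCount));
}
```

If the * sizeof(...) factor were forgotten, keyExpansion would get 65536 bytes instead of 262144, which is room for only a quarter of the elements, and requireEnoughBytes(65536, keyExpansionElements()) would abort. As a side note, 44 * 512 * 32 = 720896, which is the same number as the length parameter in your Nsight output.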