The program shows 40 fps when it first starts, then drops to 5 fps after about 10 seconds

system · July 19, 2013, 07:23

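Hello, and thank you for the analysis.

As for the equivalence question, I looked into it myself and think the two cases can be treated as essentially equivalent. My program is combined with OpenGL, and the kernel is called from the OpenGL display function, which is itself executed over and over in an endless loop.

I also ran the kernel 400 times in a for loop, and the results were still the same.

As for the fps calculation, my reasoning is that if the calculation were wrong, the fps should be low from the moment the program starts, rather than dropping after it has run for a while.

What do you think?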
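One way to rule the fps counter in or out is to derive the reading from measured per-frame durations over a short sliding window, so that the displayed value reflects only recent frames. The following is a minimal host-side C++ sketch under that assumption; `FpsWindow` and its methods are hypothetical names and are not taken from the program discussed in this thread.

```cpp
#include <cstddef>
#include <deque>

// Hypothetical helper for illustration only.
// Keeps the durations of the most recent `capacity` frames.
class FpsWindow {
public:
    explicit FpsWindow(std::size_t capacity) : capacity_(capacity) {}

    // Record the duration of one finished frame, in seconds.
    void addFrame(double seconds) {
        frames_.push_back(seconds);
        total_ += seconds;
        if (frames_.size() > capacity_) {
            total_ -= frames_.front();  // drop the oldest frame
            frames_.pop_front();
        }
    }

    // Average frames per second over the frames currently in the window.
    double fps() const {
        return total_ > 0.0
                   ? static_cast<double>(frames_.size()) / total_
                   : 0.0;
    }

private:
    std::size_t capacity_;
    std::deque<double> frames_;
    double total_ = 0.0;
};
```

In a display callback, the duration passed to `addFrame` would come from a wall-clock timer sampled once per frame; if the windowed value and the existing counter disagree, the counter is worth a closer look.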

system · July 19, 2013, 07:55

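Since you have shown for yourself that the CUDA kernel's execution time does not change over time, the "performance drop" you are seeing is probably caused by something else and is not related to your kernel.

Also, because the kernel itself shows no performance drop, your problem may fall outside the scope of technical support on this forum.

Please check the parts of your code other than the CUDA kernel.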

system · July 19, 2013, 16:31

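Hello. Tonight I went through my code again and compared it with the CUDA sample code, and I also ran the profiler on the CUDA sample code.

From the comparison I found that the CUDA sample's execution time (3 ms) is far lower than that of my code (186 ms), so my conclusion is that the problem is in my kernel. Although my kernel's execution time does not change noticeably across repeated runs, its steady-state execution time is far higher than the sample's, which also explains why my fps is so low.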
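The link between kernel time and frame rate can be checked with simple arithmetic. The sketch below assumes one kernel launch per frame and ignores all other per-frame work, so it only gives an upper bound; `maxFps` is a hypothetical helper name.

```cpp
// Upper bound on the frame rate when every frame costs at least
// `frameMs` milliseconds (assumes one kernel launch per frame and
// no other per-frame work).
double maxFps(double frameMs) {
    return 1000.0 / frameMs;
}
```

With the profiler figures quoted above, `maxFps(186.0)` is about 5.4 and `maxFps(3.0)` is about 333, which is consistent with the reported 5 fps.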

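So the question now is why my execution time exceeds the sample's by so much. After comparing the two, I found a few places in my code that look most suspicious, for example two extra texture reads: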
float4 startPosition = tex2D(cudafronttex,u,v);
float4 endPostion = tex2D(cudabacktex,u,v);
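There is also an extra square-root computation:
depth = sqrt((endPostion.x-pos.x)*(endPostion.x-pos.x) + (endPostion.y-pos.y)*(endPostion.y-pos.y) + (endPostion.z-pos.z)*(endPostion.z-pos.z));
These are what I consider the most likely causes.

Since I do not yet have a good way to swap these statements for more efficient alternatives to verify this, I would like to ask the moderators to draw on their experience: does using tex2D and sqrt in a kernel have a significant impact on performance? (Both textures are 512*512.)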
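For reference, the distance statement amounts to three subtractions, three multiplies, two additions and one square root. The host-side C++ sketch below spells that out so it can be checked in isolation; `Vec3` is a stand-in for CUDA's `float3`, and `segmentLength` is a hypothetical name, not a function from the code in this thread.

```cpp
#include <cmath>

// Stand-in for CUDA's float3 so the sketch compiles on the host.
struct Vec3 {
    float x, y, z;
};

// Distance between the ray entry point `a` and exit point `b`,
// written with plain multiplies and a single square root.
inline float segmentLength(const Vec3& a, const Vec3& b) {
    const float dx = b.x - a.x;
    const float dy = b.y - a.y;
    const float dz = b.z - a.z;
    return std::sqrt(dx * dx + dy * dy + dz * dz);
}
```

For example, `segmentLength(Vec3{0, 0, 0}, Vec3{3, 4, 0})` evaluates to 5.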

system · July 22, 2013, 06:29

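Hello. I have spent the last two days analyzing my problem further; the analysis is in my reply above to moderator 横扫千军. I am now posting part of the code. Could you please take a look? Sorry for the trouble.

CUDA sample code, profiler time 3 ms: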
__global__ void
d_render(uint *d_output, uint imageW, uint imageH,
float density, float brightness,
float transferOffset, float transferScale)
{
const int maxSteps = 500;
const float tstep = 0.01f;
const float opacityThreshold = 0.95f;
const float3 boxMin = make_float3(-1.0f, -1.0f, -1.0f);
const float3 boxMax = make_float3(1.0f, 1.0f, 1.0f);

uint x = blockIdx.x*blockDim.x + threadIdx.x;

uint y = blockIdx.y*blockDim.y + threadIdx.y;

if ((x >= imageW) || (y >= imageH)) return;

float u = (x / (float) imageW)*2.0f-1.0f;
float v = (y / (float) imageH)*2.0f-1.0f;

// calculate eye ray in world space
Ray eyeRay;
eyeRay.o = make_float3(mul(c_invViewMatrix, make_float4(0.0f, 0.0f, 0.0f, 1.0f)));
eyeRay.d = normalize(make_float3(u, v, -2.0f));
eyeRay.d = mul(c_invViewMatrix, eyeRay.d);

// find intersection with box

float tnear, tfar;
int hit = intersectBox(eyeRay, boxMin, boxMax, &tnear, &tfar);
if (!hit) return;

if (tnear < 0.0f) tnear = 0.0f; // clamp to near plane

// march along ray from front to back, accumulating color
float4 sum = make_float4(0.0f);
float t = tnear;
float3 pos = eyeRay.o + eyeRay.d*tnear;
float3 step = eyeRay.d*tstep;

for (int i=0; i<maxSteps; i++)

{
float sample = tex3D(tex, pos.x*0.5f+0.5f, pos.y*0.5f+0.5f, pos.z*0.5f+0.5f);

// lookup in transfer function texture
float4 col = tex1D(transferTex, (sample-transferOffset)*transferScale);
col.w *= density;

// “under” operator for back-to-front blending
//sum = lerp(sum, col, col.w);

// pre-multiply alpha
col.x *= col.w;
col.y *= col.w;
col.z *= col.w;
// “over” operator for front-to-back blending
sum = sum + col*(1.0f - sum.w);

// exit early if opaque
if (sum.w > opacityThreshold)
break;

t += tstep;
if (t > tfar) break;

pos += step;
}

sum *= brightness;

// write output color
d_output[y*imageW + x] = rgbaFloatToInt(sum);
}

My code, profiler time 186 ms:
__global__ void
d_render(uint *d_output, uint imageW, uint imageH,
float density, float brightness,
float transferOffset, float transferScale)
{
const int maxSteps = 500;
const float tstep = 0.01f;
const float opacityThreshold = 0.95f;

uint x = blockIdx.x*blockDim.x + threadIdx.x;
uint y = blockIdx.y*blockDim.y + threadIdx.y;
if ((x >= imageW) || (y >= imageH)) return;

float u = (x / (float) imageW);
float v = (y / (float) imageH);

// calculate eye ray in world space
Ray eyeRay;
eyeRay.o = make_float3(mul(c_invViewMatrix, make_float4(0.0f, 0.0f, 0.0f, 1.0f)));
eyeRay.d = normalize(make_float3(u-0.5f, v-0.5f, -1.0f));
eyeRay.d = mul(c_invViewMatrix, eyeRay.d);

float4 sum = make_float4(0.0f);

float3 step = eyeRay.d*tstep;

  float4 startPosition = tex2D(cudafronttex,u,v);

float4 endPostion = tex2D(cudabacktex,u,v);
float3 pos;
pos.x = startPosition.x;
pos.y = startPosition.y;
pos.z = startPosition.z;

depth = sqrtf(pow((endPostion.x-startPosition.x),2) + pow((endPostion.y-startPosition.y),2) + pow((endPostion.z-startPosition.z),2));
if (depth<1e-7) return;
float t = 0;

for(int i=0; i<maxSteps; i++)

{
float sample = tex3D(tex, pos.x, pos.y, pos.z);

// lookup in transfer function texture
float4 col = tex1D(transferTex, (sample-transferOffset)*transferScale);
col.w *= density;

// “under” operator for back-to-front blending

// pre-multiply alpha
col.x *= col.w;
col.y *= col.w;
col.z *= col.w;
// “over” operator for front-to-back blending
sum = sum + col*(1.0f - sum.w);

// exit early if opaque
if (sum.w > opacityThreshold) break;

t += tstep;
if (t > depth) break;

pos += step;

}
sum *= brightness;

// write output color
d_output[y*imageW + x] = rgbaFloatToInt(sum);
}

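1. In the code above, the parts shown in green are identical in both versions. Judging from the two listings, the loop body is essentially the same. Outside the loop, my code has two extra texture reads followed by a square-root computation, while the CUDA sample has an extra intersectBox call, whose cost appears to be negligible judging by the results.
2. What puzzles me is that even if the texture reads and the square root are relatively expensive, they sit outside the loop and each thread executes them only once, so they should not increase the time this much, from 3 ms to more than 180 ms.

3. Could it be related to my textures changing dynamically? Both of my textures are 512*512 and change in real time, and the kernel is executed repeatedly, so between two launches the same texture coordinates return different values. From what I have read, CUDA textures are cached. Could this constantly changing texture content affect the cache, cause cache thrashing, and slow things down?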
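On point 2, a rough count of texture fetches supports the reasoning that the work outside the loop is small compared with the loop itself. The host-side C++ sketch below mirrors the `t`/`tstep` exit test of both kernels and ignores the opacity early exit. It assumes a 512×512 output image, which the thread does not state (only the texture size is given); `marchIterations`, `FetchCounts` and `worstCaseFetches` are hypothetical names.

```cpp
#include <cstdint>

// Number of iterations the marching loop runs for one thread,
// ignoring the opacity early exit. `tEnd` plays the role of
// `tfar` in the sample and of `depth` in the modified kernel.
int marchIterations(float tStart, float tEnd, float tstep, int maxSteps) {
    float t = tStart;
    int iterations = 0;
    for (int i = 0; i < maxSteps; ++i) {
        ++iterations;  // one tex3D and one tex1D fetch happen here
        t += tstep;
        if (t > tEnd) break;
    }
    return iterations;
}

struct FetchCounts {
    std::int64_t outsideLoop;  // tex2D fetches (front and back textures)
    std::int64_t insideLoop;   // tex3D + tex1D fetches in the loop
};

// Worst case: every thread runs all `maxSteps` iterations.
FetchCounts worstCaseFetches(int imageW, int imageH, int maxSteps) {
    const std::int64_t threads = static_cast<std::int64_t>(imageW) * imageH;
    return {2 * threads, 2 * threads * maxSteps};
}
```

Under these assumptions, `worstCaseFetches(512, 512, 500)` gives 524,288 fetches outside the loop against 262,144,000 inside it, a factor of 500. That suggests comparing how many iterations each thread actually runs in the two kernels, for example how large `depth` is relative to `tfar - tnear`, before attributing the difference to the two `tex2D` reads.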