Experts, please help!

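system · March 28, 2013, 13:27

I am using the GPU to find the solution of X^2-X+1=0 in [1,2]. I split [a,b] into 64 equal parts and hand them to 64 threads, which find the range that contains the root.

The source code is as follows:
////////////////////////////////////////////////////////////////////////////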
// Copyright 1993-2012 NVIDIA Corporation. All rights reserved.
//
// Please refer to the NVIDIA end user license agreement (EULA) associated
// with this source code for terms and conditions that govern your use of
// this software. Any use, reproduction, disclosure, or distribution of
// this software and related documentation outside the terms of the EULA
// is strictly prohibited.
//
////////////////////////////////////////////////////////////////////////////

/* Template project which demonstrates the basics on how to setup a project
 * example application.
 * Host code.
 */

// includes, system
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>
#include <iostream>
#define F(x) ((x)*(x)-(x)+1)

// includes CUDA
#include <cuda_runtime.h>

// includes, project
#include <helper_cuda.h>
#include <helper_functions.h> // helper functions for SDK examples

////////////////////////////////////////////////////////////////////////////////
// declaration, forward
void runTest(int argc, char **argv);
using std::endl;
using std::cout;

__global__ void
FindAnswer(float *g_A,float *g_B, float *g_C)
{
int tid=threadIdx.x;
int bid=blockIdx.x;
float a=*g_A;
float b=*g_B;
float New_a,New_b;
do
{
New_a=*g_A+(*g_B-*g_A)/64*(bid*blockDim.x+tid);
New_b=*g_A+(*g_B-*g_A)/64*(bid*blockDim.x+tid+1);
if((F(New_a))*(F(New_b))<0)
{
*g_A=New_a;
*g_B=New_b;
}
__syncthreads();
}
while(fabs(*g_B-*g_A)>0.001);
if(!(a==*g_A&&b==*g_B)&&(bid*blockDim.x+tid)==0)
*g_C=(*g_A+*g_B)/2;
}

////////////////////////////////////////////////////////////////////////////////
// Program main
////////////////////////////////////////////////////////////////////////////////
int
main(int argc, char **argv)
{
runTest(argc, argv);
}

////////////////////////////////////////////////////////////////////////////////
//! Run a simple test for CUDA
////////////////////////////////////////////////////////////////////////////////
void
runTest(int argc, char **argv)
{
bool bTestResult = true;

printf("%s Starting...\n\n", argv[0]);
cudaError_t err=cudaSuccess;

// use command-line specified CUDA device, otherwise use device with highest Gflops/s
int devID = findCudaDevice(argc, (const char **)argv);

StopWatchInterface *timer = 0;
sdkCreateTimer(&timer);
sdkStartTimer(&timer);

unsigned int num_threads = 32;
size_t mem_size = sizeof(float) *1;

// allocate host memory
float *h_A = (float *) malloc(mem_size);
float *h_B = (float *) malloc(mem_size);
// initialize the memory
h_A[0]=1;
h_B[0]=2;

// allocate device memory
float *d_A=NULL;
float *d_B=NULL;
err=cudaMalloc((void **) &d_A, mem_size);

err=cudaMalloc((void **) &d_B, mem_size);

// copy host memory to device
err=cudaMemcpy(d_A, h_A, mem_size,
cudaMemcpyHostToDevice);

err=cudaMemcpy(d_B, h_B, mem_size,
cudaMemcpyHostToDevice);

// allocate device memory for result
float *d_C;
checkCudaErrors(cudaMalloc((void **) &d_C, mem_size));

// setup execution parameters
//dim3 grid(2, 1, 1);
//dim3 threads(num_threads, 1, 1);

// execute the kernel
FindAnswer<<<2,32>>>(d_A, d_B,d_C);

// check if kernel execution generated an error
getLastCudaError("Kernel execution failed");

// allocate mem for the result on host side
float *h_C = (float *) malloc(mem_size);
// copy result from device to host
checkCudaErrors(cudaMemcpy(h_C, d_C, mem_size,
cudaMemcpyDeviceToHost));

sdkStopTimer(&timer);
printf("Processing time: %f (ms)\n", sdkGetTimerValue(&timer));
sdkDeleteTimer(&timer);
std::cout<<"Solution: "<<*h_C;

// compute reference solution
//float *reference = (float *) malloc(mem_size);
//computeGold(reference, h_idata, num_threads);

// check result
/* if (checkCmdLineFlag(argc, (const char **) argv, "regression"))
{
// write file for regression test
sdkWriteFile("./data/regression.dat", h_odata, num_threads, 0.0f, false);
}
else
{
// custom output handling when no regression test running
// in this case check if the result is equivalent to the expected solution
bTestResult = compareData(reference, h_odata, num_threads, 0.0f, 0.0f);
}*/

// cleanup memory
free(h_A);
free(h_B);
free(h_C);
//free(reference);
checkCudaErrors(cudaFree(d_A));
checkCudaErrors(cudaFree(d_B));
checkCudaErrors(cudaFree(d_C));

cudaDeviceReset();
exit(bTestResult ? EXIT_SUCCESS : EXIT_FAILURE);
}

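When this program runs, it fails on the line marked in red above (the cudaMemcpy that copies the result back to the host). When I single-step to that point, the screen goes black for a moment. Further down, checkCudaErrors(cudaFree(d_C)) also fails. What on earth is going on?
I don't know how to single-step through a kernel function in VC and inspect the variables on the GPU.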

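system · March 28, 2013, 14:40

I solved this problem: once I capped the number of loop iterations, it no longer happened. But now I have a new problem. The computed result is 0.5, which means the loop had no effect. What is going on?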
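
A possible explanation, sketched on the host in plain C++ rather than in the kernel: X^2-X+1 is positive for every real X (its discriminant is 1-4 = -3), so none of the 64 subintervals of [1,2] shows a sign change, the `if` in the kernel never fires, and `*g_A` and `*g_B` are never updated. The helper `count_sign_changes` and the comparison function `poly_with_root` are hypothetical names made up for this sketch; they do not appear in the posted code.

```cpp
#include <cassert>

// The polynomial from the post, written as a function instead of a macro.
float poly_from_thread(float x) { return x * x - x + 1.0f; }

// Hypothetical comparison function that does have a root in [1, 2] (sqrt(2)).
float poly_with_root(float x) { return x * x - 2.0f; }

// Split [a, b] into `parts` equal subintervals and count how many of them
// have a strict sign change of f between their two endpoints.
int count_sign_changes(float (*f)(float), float a, float b, int parts) {
    assert(parts > 0);
    int count = 0;
    const float step = (b - a) / parts;
    for (int i = 0; i < parts; ++i) {
        const float lo = a + step * i;
        const float hi = a + step * (i + 1);
        if (f(lo) * f(hi) < 0.0f) ++count;
    }
    return count;
}
```

Calling `count_sign_changes(poly_from_thread, 1.0f, 2.0f, 64)` finds no bracketing subinterval, while the same call with `poly_with_root` finds exactly one. This only shows why the interval never narrows; it does not by itself account for the specific value 0.5.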

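system · March 28, 2013, 14:44

Hello OP, the logic of your algorithm appears to have some problems.
What your kernel does is, roughly, search for a root of an equation (the point where the curve crosses the x-axis). But there are quite a few issues:

1: Your method only works when the curve crosses the x-axis exactly once. With several crossings it will miss solutions (because you keep only one global set of variables for the position), and it may also loop forever (suppose two solutions fall inside the same initial subinterval: both are gloriously ignored, and then nobody finds a solution).

2: Even with a single solution, if the initial interval you supply does not contain it, the whole program will also fall into an infinite loop.

So your implementation has many problems. If the kernel enters an infinite loop, then after the timeout the screen goes black and the display driver restarts.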
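
Both failure modes above can be guarded against. The following is a minimal serial sketch in host-side C++, not a corrected kernel: it caps the number of refinement rounds and stops as soon as no subinterval brackets a root. The names `RootResult`, `refine_root`, `f_no_root` and `f_one_root` are hypothetical and chosen only for illustration. Like the original, it still finds at most one root, so point 1 about missed solutions continues to apply.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical result type for this sketch.
struct RootResult {
    bool found;      // true once at least one bracketing subinterval was seen
    float root;      // midpoint of the last bracket (valid only if found)
    int iterations;  // refinement rounds actually performed
};

// Serial sketch of the N-way interval refinement with two safeguards:
// a cap on the number of rounds, and an explicit "no sign change" exit.
RootResult refine_root(float (*f)(float), float a, float b,
                       int parts, float tol, int max_rounds) {
    assert(parts > 0 && max_rounds >= 0);
    RootResult r = {false, 0.0f, 0};
    while (std::fabs(b - a) > tol && r.iterations < max_rounds) {
        const float step = (b - a) / parts;
        bool narrowed = false;
        for (int i = 0; i < parts && !narrowed; ++i) {
            const float lo = a + step * i;
            const float hi = a + step * (i + 1);
            if (f(lo) * f(hi) <= 0.0f) {  // <= also accepts a root on an endpoint
                a = lo;
                b = hi;
                narrowed = true;
            }
        }
        if (!narrowed) return r;  // no bracket: give up instead of spinning
        r.found = true;
        ++r.iterations;
    }
    if (r.found) r.root = 0.5f * (a + b);
    return r;
}

// The equation from the post (no real root) and a comparison case (sqrt(2)).
float f_no_root(float x) { return x * x - x + 1.0f; }
float f_one_root(float x) { return x * x - 2.0f; }
```

With `refine_root(f_no_root, 1.0f, 2.0f, 64, 0.001f, 100)` the function returns immediately with `found == false`, whereas `f_one_root` narrows to within the tolerance in two rounds of 64 subintervals.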

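In addition, if you need to pin down whether it is the red copy step that fails or the preceding kernel that died, you can add suitable code to check for CUDA errors. That was covered in another thread, so I won't repeat it here.

Good luck with your debugging!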

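system · March 28, 2013, 15:19

I got the program to produce a result. The issues you raised are exactly what worries me too, so I am also looking for an algorithm that handles them. I'll go look up some references.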

system · March 28, 2013, 16:48

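Hello OP, some books on computational methods describe commonly used numerical methods for finding roots of equations. You may want to consult them and consider whether they can be implemented in parallel.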
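
As one example of the kind of method such books cover, here is a minimal sketch of Newton's method in plain C++. The choice of Newton's method is an assumption made for illustration; the reply does not name a specific method. The function names `newton_root`, `sq_minus_two`, `thread_poly` and their derivative helpers are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Newton's method for a scalar equation f(x) = 0, given the derivative df.
// Returns true and writes the estimate to *root if |f(x)| <= tol is reached
// within max_iter steps; returns false otherwise.
bool newton_root(double (*f)(double), double (*df)(double),
                 double x0, double tol, int max_iter, double* root) {
    assert(root != nullptr);
    double x = x0;
    for (int k = 0; k < max_iter; ++k) {
        const double fx = f(x);
        if (std::fabs(fx) <= tol) {
            *root = x;
            return true;
        }
        const double slope = df(x);
        if (slope == 0.0) return false;  // flat tangent: the step is undefined
        x -= fx / slope;
    }
    return false;  // iteration cap reached without meeting the tolerance
}

// Comparison case with a real root at sqrt(2).
double sq_minus_two(double x) { return x * x - 2.0; }
double d_sq_minus_two(double x) { return 2.0 * x; }

// The equation from the first post, which has no real root.
double thread_poly(double x) { return x * x - x + 1.0; }
double d_thread_poly(double x) { return 2.0 * x - 1.0; }
```

Starting from 1.5, the first pair converges to sqrt(2) in a handful of steps. For the equation from the first post the call returns false, because the iteration cap makes the failure explicit instead of looping forever.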

Good luck.

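system · March 29, 2013, 06:45

Yes, CS has a course called "Numerical Computation" (《数值计算》) that includes a lot of hands-on equation solving. Why not buy a copy and have a look?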

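system · March 31, 2013, 13:37

What is CS? In the library I found a book called "Introduction to Numerical Computation Methods" (《数值计算方法引论》), but it is too theoretical and I can't follow it. A practical book that is easy to pick up is hard to find. I also didn't catch the details of the book you mentioned.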

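system · March 31, 2013, 13:39

Mm-hm. That's the one. "Introduction to Numerical Computation" (《数值计算导论》), exactly. If even this one counts as a "theory-heavy" book... then there is nothing more practical out there.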

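system · March 31, 2013, 13:48

OK, may I ask: is there any problem with running a four-level nested loop on the GPU? Would computing 512*512*512 iterations cause a black screen?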

system · March 31, 2013, 13:49

I refuse to answer this question. It is an insult to me!

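system · March 31, 2013, 13:54

This is my first contact with CUDA, and I am not a computer science major. I really meant nothing else by it. If I made a fool of myself, please forgive me.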

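system · March 31, 2013, 13:59

Mm-hm. Then let me give a summary answer to some of the forum's frequently asked questions:
(1) Q: Can CUDA use goto? A: Yes.
(2) Q: Can CUDA use do...while? A: Yes.
(3) Q: Can CUDA use local arrays and pointers? A: Yes.
(4) Q: Can CUDA nest loops 2 levels deep? A: Yes.
(5) Q: Can CUDA nest loops 3 levels deep? A: Yes.
(6) Q: Can CUDA nest loops 4 levels deep? A: Yes.

Thank you for visiting. Please refer to the above.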
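
Nested loops are allowed, as listed above. A separate and common pattern, not mentioned in this thread, is to give each GPU thread one iteration of the loop nest by recovering the loop indices from a single flat index. The sketch below shows that index arithmetic in plain host-side C++; `Index4`, `unflatten4` and `flatten4` are hypothetical names made up for this example, and a 64-bit flat index is assumed so that large products of extents do not overflow.

```cpp
#include <cassert>

// Four loop indices, with l varying fastest and i slowest.
struct Index4 { int i, j, k, l; };

// Recover (i, j, k, l) from one flat index for inner extents (nj, nk, nl).
// The outermost extent is not needed for the decomposition itself.
Index4 unflatten4(long long flat, int nj, int nk, int nl) {
    assert(flat >= 0 && nj > 0 && nk > 0 && nl > 0);
    Index4 idx;
    idx.l = static_cast<int>(flat % nl);
    flat /= nl;
    idx.k = static_cast<int>(flat % nk);
    flat /= nk;
    idx.j = static_cast<int>(flat % nj);
    flat /= nj;
    idx.i = static_cast<int>(flat);
    return idx;
}

// Inverse mapping: the flat index of the iteration (i, j, k, l).
long long flatten4(Index4 idx, int nj, int nk, int nl) {
    return ((static_cast<long long>(idx.i) * nj + idx.j) * nk + idx.k) * nl + idx.l;
}
```

For inner extents of 512 each, flat index 512*512*512 - 1 maps to (0, 511, 511, 511) and the next flat index maps to (1, 0, 0, 0).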

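system · March 31, 2013, 14:07

OK. I have written a CPU program that computes view factors and am now debugging the CUDA version. It seems my black-screen problem is not caused by the four-level loop. I'll check again for logic errors, and once it is finished I'll compare the two programs to see how much speedup I get.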