I have a CUDA program that drives two CUDA streams from two CPU threads. I don't want the kernels in the two streams to overlap; I only want to hide the data copies.
I get this behavior with the MSVC compiler on Windows, but with the GCC compiler on Linux it doesn't always work the way I want.
Below are the execution timelines of the two programs, which I analyzed with Nsight Systems. I've added stream synchronization, and it's still not right. Can someone tell me why the two behave differently? (The first image is Linux GCC.)
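For anyone trying to reproduce this, here is a minimal sketch of the kind of setup described above. It is not the original program: the kernel `scale`, the helpers `worker` and `run_two_workers`, the mutex, and the buffer sizes are all hypothetical stand-ins. It assumes each CPU thread owns one non-blocking stream and uses pinned host memory, and it serializes kernels with a host-side mutex so that only the copies can overlap with the other thread's kernel.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>
#include <mutex>
#include <thread>

// Abort with a message if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                                        \
    do {                                                                        \
        cudaError_t err_ = (call);                                              \
        if (err_ != cudaSuccess) {                                              \
            std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err_)); \
            std::exit(1);                                                       \
        }                                                                       \
    } while (0)

// Hypothetical kernel standing in for the real workload.
__global__ void scale(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Assumption: a host mutex is used to keep the two threads' kernels apart.
static std::mutex kernel_mutex;

// One CPU thread: own stream, own buffers, copy -> kernel -> copy per iteration.
static void worker(int n, int iters, float* result) {
    cudaStream_t stream;
    CUDA_CHECK(cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking));

    const size_t bytes = static_cast<size_t>(n) * sizeof(float);
    float* h_buf = nullptr;
    float* d_buf = nullptr;
    CUDA_CHECK(cudaMallocHost(&h_buf, bytes));  // pinned memory, so async copies can overlap
    CUDA_CHECK(cudaMalloc(&d_buf, bytes));
    for (int i = 0; i < n; ++i) h_buf[i] = 1.0f;

    for (int it = 0; it < iters; ++it) {
        // Host-to-device copy is issued outside the lock, so it may run
        // while the other thread's kernel is executing.
        CUDA_CHECK(cudaMemcpyAsync(d_buf, h_buf, bytes, cudaMemcpyHostToDevice, stream));
        CUDA_CHECK(cudaStreamSynchronize(stream));
        {
            // The lock is held until the kernel has finished, so kernels
            // from the two threads never run at the same time.
            std::lock_guard<std::mutex> lock(kernel_mutex);
            scale<<<(n + 255) / 256, 256, 0, stream>>>(d_buf, n, 2.0f);
            CUDA_CHECK(cudaGetLastError());
            CUDA_CHECK(cudaStreamSynchronize(stream));
        }
        // Device-to-host copy is again outside the lock.
        CUDA_CHECK(cudaMemcpyAsync(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost, stream));
        CUDA_CHECK(cudaStreamSynchronize(stream));
    }

    *result = h_buf[n - 1];
    CUDA_CHECK(cudaFreeHost(h_buf));
    CUDA_CHECK(cudaFree(d_buf));
    CUDA_CHECK(cudaStreamDestroy(stream));
}

// Runs two workers concurrently and checks that each element was doubled
// once per iteration in both threads.
bool run_two_workers(int n, int iters) {
    float r0 = 0.0f, r1 = 0.0f;
    std::thread t0(worker, n, iters, &r0);
    std::thread t1(worker, n, iters, &r1);
    t0.join();
    t1.join();
    float expected = 1.0f;
    for (int i = 0; i < iters; ++i) expected *= 2.0f;
    return r0 == expected && r1 == expected;
}
```

Calling `run_two_workers(1 << 20, 4)` from `main` and profiling the result with Nsight Systems should give a timeline that can be compared between the Windows and Linux builds. Whether this matches the synchronization in my actual program is an assumption; the real code may use events or a different locking scheme.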
Hi @Simple_Liu,
Welcome to our developer forum! I've seen your question and will pass it on to the relevant colleagues to take a look.