How does the host know that a kernel has finished executing? Or, put the other way, how does a finished kernel notify the host?
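When the host invokes a kernel, the call returns immediately. In other words, the CPU keeps executing while the kernel may still be running. If a cudaMemcpy follows (the blocking version, in the same default stream), it automatically waits for the preceding kernel to finish before it copies.
You can also call something like cudaDeviceSynchronize() so that the CPU side continues only after everything issued before it has finished.
Would these approaches cover what the OP needs?
Happy coding~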
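To illustrate the reply above, here is a minimal sketch of both options: an explicit cudaDeviceSynchronize() and the implicit wait performed by a blocking cudaMemcpy. The kernel add_one and the helper run_demo are hypothetical names made up for this example, and it assumes everything runs in the default stream.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: adds 1 to every element.
__global__ void add_one(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

// Returns 0 on success, 1 on a CUDA error or a wrong result.
int run_demo() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(int);
    int *h = new int[n];
    for (int i = 0; i < n; ++i) h[i] = i;

    int *d = nullptr;
    if (cudaMalloc(&d, bytes) != cudaSuccess) { delete[] h; return 1; }
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);

    // The launch returns immediately; the kernel may still be running
    // while the host continues past this line.
    add_one<<<(n + 255) / 256, 256>>>(d, n);

    // Option 1: block the host until all previously issued device work is done.
    cudaError_t err = cudaDeviceSynchronize();

    // Option 2: a blocking cudaMemcpy in the default stream also waits for
    // the preceding kernel, so the explicit synchronize above is not
    // strictly required before this copy.
    if (err == cudaSuccess)
        err = cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);

    int bad = (err != cudaSuccess);
    for (int i = 0; i < n && !bad; ++i)
        if (h[i] != i + 1) bad = 1;

    cudaFree(d);
    delete[] h;
    return bad;
}

int main() {
    int rc = run_demo();
    printf(rc == 0 ? "kernel finished, results OK\n" : "failed\n");
    return rc;
}
```

Checking the return value of cudaDeviceSynchronize() is also a common way to surface errors that occurred while the kernel was running.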
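In addition, cudaStreamSynchronize() and cudaEventSynchronize() can be used to synchronize, and there are also the non-blocking query calls such as cudaEventQuery() and cudaStreamQuery(). The OP can look these up in the manual, read the descriptions, and then use whichever fits.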
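As a rough sketch of how these calls fit together, the example below records an event after a kernel in a non-default stream, polls it with cudaEventQuery() without blocking, and then uses cudaStreamSynchronize() and cudaStreamQuery() on the stream. The kernel scale and the helper run_stream_demo are hypothetical names for this example only.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: multiplies every element by factor.
__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Returns 0 on success, 1 on any CUDA error.
int run_stream_demo() {
    const int n = 1 << 20;
    float *d = nullptr;
    cudaStream_t stream;
    cudaEvent_t done;

    if (cudaMalloc(&d, n * sizeof(float)) != cudaSuccess) return 1;
    if (cudaStreamCreate(&stream) != cudaSuccess) { cudaFree(d); return 1; }
    if (cudaEventCreate(&done) != cudaSuccess) {
        cudaStreamDestroy(stream); cudaFree(d); return 1;
    }

    // Queue work in the stream; none of these calls block the host.
    cudaMemsetAsync(d, 0, n * sizeof(float), stream);
    scale<<<(n + 255) / 256, 256, 0, stream>>>(d, n, 2.0f);
    cudaEventRecord(done, stream);  // completes once the work above is done

    // Non-blocking check: cudaErrorNotReady means the event is still pending.
    // The host is free to do other work between polls.
    long polls = 0;
    cudaError_t status;
    while ((status = cudaEventQuery(done)) == cudaErrorNotReady) ++polls;
    int bad = (status != cudaSuccess);

    // Blocking wait on this one stream only (not the whole device).
    if (cudaStreamSynchronize(stream) != cudaSuccess) bad = 1;

    // After the synchronize, the stream should report itself as idle.
    if (cudaStreamQuery(stream) != cudaSuccess) bad = 1;

    printf("polled the event %ld times before it completed\n", polls);

    cudaEventDestroy(done);
    cudaStreamDestroy(stream);
    cudaFree(d);
    return bad;
}

int main() { return run_stream_demo(); }
```

cudaEventSynchronize(done) could replace the polling loop when the host has nothing else to do while it waits.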
Since all operations in non-default streams are non-blocking with respect to the host code, you will run across situations where you need to synchronize the host code with operations in a stream. There are several ways to do this. The “heavy hammer” way is to use cudaDeviceSynchronize(), which blocks the host code until all previously issued operations on the device have completed. In most cases this is overkill, and can really hurt performance due to stalling the entire device and host thread.
The CUDA stream API has multiple less severe methods of synchronizing the host with a stream. The function cudaStreamSynchronize(stream) can be used to block the host thread until all previously issued operations in the specified stream have completed. The function cudaStreamQuery(stream) tests whether all operations issued to the specified stream have completed, without blocking host execution. The functions cudaEventSynchronize(event) and cudaEventQuery(event) act similarly to their stream counterparts, except that their result is based on whether a specified event has been recorded rather than whether a specified stream is idle. You can also make later operations in a stream wait on a specific event using cudaStreamWaitEvent(stream, event) (even if the event is recorded in a different stream, or on a different device!).
Quoted from NVIDIA's official write-up: developer.nvidia.com/content/how-overlap-data-transfers-cuda-cc
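The cross-stream case mentioned at the end of the quoted passage can be sketched as follows. This is an illustration under assumptions, not code from the NVIDIA article: the kernels fill and add_one, the stream names, and the helper run_wait_event_demo are all hypothetical. Note that cudaStreamWaitEvent() makes the stream wait on the device side; the host itself is not blocked by that call.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical producer kernel: writes value into every element.
__global__ void fill(int *data, int n, int value) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = value;
}

// Hypothetical consumer kernel: adds 1 to every element.
__global__ void add_one(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

// Returns 0 on success, 1 on a CUDA error or a wrong result.
int run_wait_event_demo() {
    const int n = 1 << 16;
    const size_t bytes = n * sizeof(int);
    int *h = new int[n];
    int *d = nullptr;
    cudaStream_t producer, consumer;
    cudaEvent_t filled;

    if (cudaMalloc(&d, bytes) != cudaSuccess) { delete[] h; return 1; }
    cudaStreamCreate(&producer);
    cudaStreamCreate(&consumer);
    cudaEventCreate(&filled);

    // Producer stream writes the data, then records an event.
    fill<<<(n + 255) / 256, 256, 0, producer>>>(d, n, 1);
    cudaEventRecord(filled, producer);

    // Work queued in the consumer stream after this call will not start
    // until the event has completed. The host continues immediately.
    cudaStreamWaitEvent(consumer, filled, 0);
    add_one<<<(n + 255) / 256, 256, 0, consumer>>>(d, n);
    cudaMemcpyAsync(h, d, bytes, cudaMemcpyDeviceToHost, consumer);

    // Only now does the host block, and only on the consumer stream.
    int bad = (cudaStreamSynchronize(consumer) != cudaSuccess);
    for (int i = 0; i < n && !bad; ++i)
        if (h[i] != 2) bad = 1;  // 1 from fill, plus 1 from add_one

    cudaEventDestroy(filled);
    cudaStreamDestroy(producer);
    cudaStreamDestroy(consumer);
    cudaFree(d);
    delete[] h;
    return bad;
}

int main() {
    int rc = run_wait_event_demo();
    printf(rc == 0 ? "cross-stream dependency respected\n" : "failed\n");
    return rc;
}
```

Without the cudaStreamWaitEvent() call, the two kernels would have no ordering guarantee between them, since they are queued in different streams.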