Hello, we have a box built around an NVIDIA Orin Nano with 8 GB of RAM, on which we have deployed DL model inference software. The software runs as a Linux system service (call it AAA) and exposes a REST interface through which users submit images for inference. After deployment the box ran fine, but recently we noticed that the service is in an abnormal state.
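Clients call the service roughly like this (the endpoint path, port, and payload format below are placeholders for illustration, not the actual API of AAA):

# hypothetical request only; the real endpoint, port and payload are defined by AAA
curl -X POST -F "image=@test.jpg" http://<device-ip>:8000/infer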
Inside the service's CGroup, the software's main process (call it PROCA) is missing; only run.sh, the watchdog script responsible for launching the main process, is still running.
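For reference, the check was roughly the following (AAA/PROCA are the placeholder names above; the cgroup.procs path assumes cgroup v2 and may differ on this JetPack image):

systemctl status AAA                                        # the CGroup section only lists run.sh
cat /sys/fs/cgroup/system.slice/AAA.service/cgroup.procs    # PIDs currently in the unit's cgroup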
We also could not find the PROCA process with the top command.
By continuously watching the process list, we found that PROCA disappears very shortly after being launched.
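The observation was done roughly like this (PROCA is a placeholder name):

watch -n 1 'pgrep -af PROCA'        # PROCA appears briefly after run.sh starts it, then vanishes
while true; do date; pgrep -af PROCA || echo "PROCA not running"; sleep 1; done   # same check, with timestamps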
Yet this service and this process had always worked fine before.
We suspect a problem with the system environment or the driver: the system log keeps printing large numbers of NVIDIA kernel messages:
nvidia@nvidia:/logs/bak$ tail /var/log/kern.log
Apr 23 15:39:56 nvidia kernel: [1233149.724310] NVRM nvAssertFailedNoLog: Assertion failed: NV_FALSE @ gpu_mgr.c:296
Apr 23 15:39:57 nvidia kernel: [1233150.425566] NVRM rpcRmApiControl_dce: NVRM_RPC_DCE: Failed RM ctrl call cmd:0x2080013f result 0x56:
Apr 23 15:39:57 nvidia kernel: [1233150.426438] NVRM rpcRmApiControl_dce: NVRM_RPC_DCE: Failed RM ctrl call cmd:0x2080017e result 0x56:
Apr 23 15:39:57 nvidia kernel: [1233150.429764] NVRM rpcRmApiControl_dce: NVRM_RPC_DCE: Failed RM ctrl call cmd:0x2080014a result 0x56:
Apr 23 15:39:57 nvidia kernel: [1233150.468826] NVRM rpcRmApiControl_dce: NVRM_RPC_DCE: Failed RM ctrl call cmd:0x731341 result 0xffff:
Apr 23 15:39:57 nvidia kernel: [1233150.469224] NVRM rpcRmApiControl_dce: NVRM_RPC_DCE: Failed RM ctrl call cmd:0x730190 result 0x56:
Apr 23 15:39:57 nvidia kernel: [1233150.604976] NVRM gpumgrGetSomeGpu: Failed to retrieve pGpu - Too early call!.
Apr 23 15:39:57 nvidia kernel: [1233150.604983] NVRM nvAssertFailedNoLog: Assertion failed: NV_FALSE @ gpu_mgr.c:296
Apr 23 15:39:57 nvidia kernel: [1233150.928171] cpufreq: cpu0,cur:1501000,set:960000,set ndiv:75
Apr 23 15:40:04 nvidia kernel: [1233157.023131] cpufreq: cpu0,cur:1145000,set:729600,set ndiv:57
nvidia@nvidia:/logs/bak$ tail /var/log/kern.log
Apr 23 15:40:36 nvidia kernel: [1233189.551729] cpufreq: cpu0,cur:1191000,set:960000,set ndiv:75
Apr 23 15:40:36 nvidia kernel: [1233189.553505] cpufreq: cpu0,cur:1135000,set:1510400,set ndiv:118
Apr 23 15:40:37 nvidia kernel: [1233190.570364] cpufreq: cpu4,cur:1286000,set:1510400,set ndiv:118
Apr 23 15:40:38 nvidia kernel: [1233191.585591] cpufreq: cpu0,cur:1503000,set:729600,set ndiv:57
Apr 23 15:40:41 nvidia kernel: [1233194.632140] cpufreq: cpu0,cur:1323000,set:1510400,set ndiv:118
Apr 23 15:40:45 nvidia kernel: [1233198.695936] cpufreq: cpu0,cur:1264000,set:1510400,set ndiv:118
Apr 23 15:40:47 nvidia kernel: [1233200.730169] cpufreq: cpu0,cur:958000,set:1510400,set ndiv:118
Apr 23 15:40:48 nvidia kernel: [1233201.648874] cpufreq: cpu0,cur:1158000,set:729600,set ndiv:57
Apr 23 15:40:49 nvidia kernel: [1233202.665527] cpufreq: cpu0,cur:1179000,set:1510400,set ndiv:118
Apr 23 15:40:49 nvidia kernel: [1233202.760847] cpufreq: cpu0,cur:1238000,set:729600,set ndiv:57
nvidia@nvidia:/logs/bak$ tail /var/log/kern.log
Apr 23 15:40:52 nvidia kernel: [1233205.085634] NVRM rpcRmApiControl_dce: NVRM_RPC_DCE: Failed RM ctrl call cmd:0x2080017e result 0x56:
Apr 23 15:40:52 nvidia kernel: [1233205.089077] NVRM rpcRmApiControl_dce: NVRM_RPC_DCE: Failed RM ctrl call cmd:0x2080014a result 0x56:
Apr 23 15:40:52 nvidia kernel: [1233205.127339] NVRM rpcRmApiControl_dce: NVRM_RPC_DCE: Failed RM ctrl call cmd:0x731341 result 0xffff:
Apr 23 15:40:52 nvidia kernel: [1233205.127802] NVRM rpcRmApiControl_dce: NVRM_RPC_DCE: Failed RM ctrl call cmd:0x730190 result 0x56:
Apr 23 15:40:52 nvidia kernel: [1233205.259640] NVRM gpumgrGetSomeGpu: Failed to retrieve pGpu - Too early call!.
Apr 23 15:40:52 nvidia kernel: [1233205.259647] NVRM nvAssertFailedNoLog: Assertion failed: NV_FALSE @ gpu_mgr.c:296
Apr 23 15:40:53 nvidia kernel: [1233206.822477] cpufreq: cpu0,cur:1113000,set:1510400,set ndiv:118
Apr 23 15:40:53 nvidia kernel: [1233206.824891] cpufreq: cpu4,cur:1196000,set:1510400,set ndiv:118
Apr 23 15:40:56 nvidia kernel: [1233209.870925] cpufreq: cpu0,cur:1333000,set:1190400,set ndiv:93
Apr 23 15:40:56 nvidia kernel: [1233209.872137] cpufreq: cpu0,cur:1373000,set:1510400,set ndiv:118
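To gauge how often these NVRM lines recur, they can be counted roughly like this (a sketch; adjust for log rotation):

grep -c 'NVRM' /var/log/kern.log                  # total NVRM lines in the current log file
journalctl -k --since "1 hour ago" | grep NVRM    # recent occurrences with timestamps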