Excerpt of the errors in the Fabric Manager startup log:
[Jul 31 2024 00:07:36] [ERROR] [tid 7543] NVLink access connection not detected NVLink:26 <=======> NVLink:1
[Jul 31 2024 00:07:36] [ERROR] [tid 7543] NVLink access connection not detected NVLink:28 <=======> NVLink:4
[Jul 31 2024 00:07:36] [ERROR] [tid 7543] NVLink access connection not detected NVLink:29 <=======> NVLink:5
[Jul 31 2024 00:07:36] [ERROR] [tid 7543] NVLink access connection not detected NVLink:29 <=======> NVLink:6
[Jul 31 2024 00:07:36] [ERROR] [tid 7543] NVLink access connection not detected NVLink:28 <=======> NVLink:7
[Jul 31 2024 00:07:36] [ERROR] [tid 7543] NVLink access connection not detected NVLink:10 <=======> NVLink:9
[Jul 31 2024 00:07:36] [ERROR] [tid 7543] NVLink access connection not detected NVLink:26 <=======> NVLink:11
[Jul 31 2024 00:07:36] [ERROR] [tid 7543] NVLink access connection not detected NVLink:24 <=======> NVLink:1
[Jul 31 2024 00:07:36] [ERROR] [tid 7543] NVLink access connection not detected NVLink:26 <=======> NVLink:4
[Jul 31 2024 00:07:36] [ERROR] [tid 7543] NVLink access connection not detected NVLink:27 <=======> NVLink:5
[Jul 31 2024 00:07:36] [ERROR] [tid 7543] NVLink access connection not detected NVLink:31 <=======> NVLink:6
[Jul 31 2024 00:07:36] [ERROR] [tid 7543] NVLink access connection not detected NVLink:30 <=======> NVLink:7
[Jul 31 2024 00:07:36] [ERROR] [tid 7543] NVLink access connection not detected NVLink:26 <=======> NVLink:9
[Jul 31 2024 00:07:36] [ERROR] [tid 7543] NVLink access connection not detected NVLink:11 <=======> NVLink:11
[Jul 31 2024 00:07:36] [ERROR] [tid 7543] NVSwitch failure detected and degraded mode configuration set to abort Fabric Manager
Jul 30 23:17:45 GQ203-JTGPUA800-01 systemd[1]: nvidia-fabricmanager.service: Failed with result 'exit-code'.
Jul 30 23:17:45 GQ203-JTGPUA800-01 systemd[1]: Failed to start NVIDIA fabric manager service.
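
To make the repeated "not detected" lines above easier to compare, they can be condensed into a list of link-number pairs. This is only a minimal sketch: it assumes the log format is exactly as pasted above, and the function name `summarize_missing_links` and the sample string are made up for illustration. It says nothing about which GPU or NVSwitch each link number belongs to.

```python
import re

# Matches Fabric Manager lines of the form pasted above, e.g.
# "... NVLink access connection not detected NVLink:26 <=======> NVLink:1"
PAIR_RE = re.compile(
    r"NVLink access connection not detected NVLink:(\d+)\s*<=+>\s*NVLink:(\d+)"
)

def summarize_missing_links(log_text):
    """Return (left, right) link numbers for each 'not detected' line, in order.

    Hypothetical helper: lines that do not match the pattern are ignored.
    """
    pairs = []
    for line in log_text.splitlines():
        match = PAIR_RE.search(line)
        if match:
            pairs.append((int(match.group(1)), int(match.group(2))))
    return pairs

# Sample input copied from the log excerpt above (two of the lines).
sample = (
    "[Jul 31 2024 00:07:36] [ERROR] [tid 7543] NVLink access connection "
    "not detected NVLink:26 <=======> NVLink:1\n"
    "[Jul 31 2024 00:07:36] [ERROR] [tid 7543] NVSwitch failure detected and "
    "degraded mode configuration set to abort Fabric Manager\n"
)

for left, right in summarize_missing_links(sample):
    print(f"missing connection: NVLink:{left} <-> NVLink:{right}")
```

Passing the full log text instead of `sample` would give one pair per error line, which can then be checked against the NVLink status output.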
The NVLink status is as follows: