hpe_hba_cabling_check falsely issues a warning
After a routine update of a 6-node Nutanix cluster, a Nutanix Cluster Check (NCC) warning popped up indicating a problem with the SAS cabling. Running the check on the CLI offered some more details.
Running : health_checks hardware_checks disk_checks hpe_hba_cabling_check
[==================================================] 100%
/health_checks/hardware_checks/disk_checks/hpe_hba_cabling_check                                                                                   [ WARN ]
-----------------------------------------------------------------------------------------------------------------------------------------------------------+
Detailed information for hpe_hba_cabling_check:
Node 10.99.1.205:
WARN: Disk cabling for disk(s) S6GLNG0T610113 are detected at incorrect location(s) 3:251:8 respectively where each value in the location corresponds to box:bay
Node 10.99.1.202:
WARN: Disk cabling for disk(s) S6GLNG0T610203 are detected at incorrect location(s) 3:251:8 respectively where each value in the location corresponds to box:bay
Node 10.99.1.206:
WARN: Disk cabling for disk(s) S6GLNG0T610104 are detected at incorrect location(s) 3:251:8 respectively where each value in the location corresponds to box:bay
Node 10.99.1.201:
WARN: Disk cabling for disk(s) S6GLNG0T610248, S6GLNG0T610219, S6GLNG0T610220, S6GLNG0T610081, S6GLNG0T603894, S6GLNG0T603909, S6GLNG0T610222 are detected at incorrect location(s) 3:252:1, 3:252:7, 3:252:6, 3:252:2, 3:252:3, 3:252:4, 3:252:5 respectively where each value in the location corresponds to box:bay
Node 10.99.1.203:
WARN: Disk cabling for disk(s) S6GLNG0T610247 are detected at incorrect location(s) 3:251:8 respectively where each value in the location corresponds to box:bay
Node 10.99.1.204:
WARN: Disk cabling for disk(s) S6GLNG0T610213 are detected at incorrect location(s) 3:251:8 respectively where each value in the location corresponds to box:bay
Refer to KB 11310 (http://portal.nutanix.com/kb/11310) for details on hpe_hba_cabling_check or Recheck with: ncc health_checks hardware_checks disk_checks hpe_hba_cabling_check --cvm_list=10.99.1.205,10.99.1.202,10.99.1.206,10.99.1.201,10.99.1.203,10.99.1.204
+-----------------------+
| State         | Count |
+-----------------------+
| Warning       | 1     |
| Total Plugins | 1     |
+-----------------------+
Plugin output written to /home/nutanix/data/logs/ncc-output-latest.log
All six nodes were affected. The cluster has been running for quite some time without any problems, and this warning had never come up before. It appeared right after installing the latest patches.
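With six nodes and one long line per node, the output is easier to read once it is condensed into a per-node list of disks and reported locations. The following is a minimal sketch, assuming the NCC output has been saved as plain text and that the message wording matches the output above exactly. The function name `parse_cabling_warnings` and the regular expressions are my own and not part of NCC; the sample only contains two of the six nodes.

```python
import re

# Patterns derived from the NCC output shown above. They are an assumption
# and may not match other NCC versions or message wordings.
NODE_RE = re.compile(r"^Node (\d+\.\d+\.\d+\.\d+):\s*$")
WARN_RE = re.compile(
    r"^WARN: Disk cabling for disk\(s\) (?P<disks>.+?) are detected at "
    r"incorrect location\(s\) (?P<locs>.+?) respectively"
)


def parse_cabling_warnings(text):
    """Return {node_ip: [(disk_serial, location), ...]} from NCC output text."""
    result = {}
    node = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        node_match = NODE_RE.match(line)
        if node_match:
            node = node_match.group(1)
            continue
        warn_match = WARN_RE.match(line)
        if warn_match and node is not None:
            disks = [d.strip() for d in warn_match.group("disks").split(",")]
            locs = [l.strip() for l in warn_match.group("locs").split(",")]
            # Serials and locations are listed in the same order ("respectively").
            result[node] = list(zip(disks, locs))
    return result


# Two nodes copied from the output above (the message tail is shortened).
SAMPLE = """\
Node 10.99.1.205:
WARN: Disk cabling for disk(s) S6GLNG0T610113 are detected at incorrect location(s) 3:251:8 respectively where each value
Node 10.99.1.201:
WARN: Disk cabling for disk(s) S6GLNG0T610248, S6GLNG0T610219, S6GLNG0T610220, S6GLNG0T610081, S6GLNG0T603894, S6GLNG0T603909, S6GLNG0T610222 are detected at incorrect location(s) 3:252:1, 3:252:7, 3:252:6, 3:252:2, 3:252:3, 3:252:4, 3:252:5 respectively where each value
"""

if __name__ == "__main__":
    for node_ip, pairs in parse_cabling_warnings(SAMPLE).items():
        print(node_ip)
        for serial, location in pairs:
            print(f"  {serial} -> {location}")
```

Run against the full output, this makes it easy to see that five nodes report a single disk at 3:251:8, while node 10.99.1.201 reports seven disks at 3:252:1 through 3:252:7.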
AHV: el7.nutanix.20220304.478
AOS: 6.5.5.1 LTS
NCC: 4.6.6.1
The warning referenced Nutanix KB article 11310. The article turned out to be pretty interesting, because it mentions a problem with the check itself.
This check is introduced in NCC 4.3.0. On HPE platform, HPE has identified an inconsistent backplane cabling that shipped from the factory, which requires a physical re-cabling of the backplane and reconfiguration of the CVM to properly utilize disks.
Beyond that:
Note: This check is disabled temporarily in NCC 4.6.5 and enabled again in NCC 4.6.6 after fixing the error.
So the hpe_hba_cabling_check was re-enabled in NCC 4.6.6, and the cluster is running NCC 4.6.6.1. To rule out a problem with the SAS cabling, I opened a case with HPE, and HPE confirmed that there might be an issue with NCC. The SAS cabling was checked nevertheless, and no problem was found. HPE then opened a case with Nutanix, and Nutanix confirmed an issue with NCC 4.6.6.1. The fix is expected in NCC 4.6.6.2. Until then, the warning can be ignored.
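To decide quickly whether a given cluster falls into the same situation, the version logic can be written down as a small sketch. It rests on an assumption: the affected range is taken to be NCC 4.6.6 up to, but not including, 4.6.6.2. Only 4.6.6.1 was confirmed by Nutanix in this case, and 4.6.6.2 is the expected, not yet confirmed, fix release. The function names are hypothetical and not part of any Nutanix tooling.

```python
def parse_version(version):
    """Turn a dotted version string such as '4.6.6.1' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def warning_possibly_false(ncc_version):
    """Return True if the NCC version lies in the assumed affected range.

    Assumption: affected from 4.6.6 (check re-enabled) until the expected
    fix in 4.6.6.2. Only 4.6.6.1 was confirmed as affected by Nutanix.
    """
    first_affected = parse_version("4.6.6")
    expected_fix = parse_version("4.6.6.2")
    return first_affected <= parse_version(ncc_version) < expected_fix


if __name__ == "__main__":
    for version in ("4.6.5", "4.6.6.1", "4.6.6.2"):
        print(version, warning_possibly_false(version))
```

A result of True only means the warning may be a false positive. The cabling should still be verified once, as was done here, before the warning is ignored.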
