Wrong iovDisableIR setting on ProLiant Gen8 might cause a PSOD
TL;DR: There’s a script at the bottom of the page that fixes the issue.
Some days ago, this HPE customer advisory caught my attention:
And there is also a corresponding VMware KB article:
ESXi host fails with intermittent NMI PSOD on HP ProLiant Gen8 servers
It isn’t clear WHY this setting was changed, but in VMware ESXi 5.5 patch 10, 6.0 patch 4, 6.0 U3, and 6.5, the Intel IOMMU’s interrupt remapper functionality was disabled. So if you are running one of these ESXi versions on an HPE ProLiant Gen8, you might want to check if you are affected.
To make it clear again: only HPE ProLiant Gen8 models are affected. Neither newer (Gen9) nor older (G6, G7) models are affected.
Currently there is no resolution, only a workaround. The iovDisableIR setting must be set to FALSE. If it’s set to TRUE, the Intel IOMMU’s interrupt remapper functionality is disabled.
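On a single host, the workaround boils down to one esxcli call. The following is a minimal sketch, assuming the `esxcli system settings kernel set` syntax with `--setting` and `--value`; the function name `set_iov_disable_ir_false` is made up for illustration. As with other kernel settings, expect the new value to become active only after the host has been rebooted.

```shell
# Sketch only: run in an SSH session on the affected ESXi host.
# set_iov_disable_ir_false is a hypothetical helper name, not part of esxcli.
set_iov_disable_ir_false() {
    # Re-enable the interrupt remapper by setting iovDisableIR to FALSE.
    esxcli system settings kernel set --setting=iovDisableIR --value=FALSE
}

# Call it manually, then reboot the host in a maintenance window:
#   set_iov_disable_ir_false
```

Wrapping the call in a function keeps it from running by accident when the snippet is pasted or sourced.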
To check this setting, you have to SSH to each host, and use esxcli to check the current setting:
[root@esx1:~] esxcli system settings kernel list -o iovDisableIR
Name          Type  Description                                Configured  Runtime  Default
------------  ----  -----------------------------------------  ----------  -------  -------
iovDisableIR  Bool  Disable Interrupt Routing in the IOMMU...  FALSE       FALSE    TRUE
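If you want to evaluate that output in a script rather than by eye, something like the following might work. It's a minimal sketch, assuming the column layout shown above (Configured, Runtime and Default are the last three columns) and that awk is available in the shell you run it from. `iov_status` is a hypothetical helper name; it reads the esxcli output from stdin.

```shell
# Sketch only: iov_status is a hypothetical helper, not part of esxcli.
# Reads the output of
#   esxcli system settings kernel list -o iovDisableIR
# from stdin and prints one line describing the state.
iov_status() {
    awk '
        $1 == "iovDisableIR" {
            found = 1
            # The description contains spaces, so count columns from the end.
            configured = $(NF - 2)
            runtime = $(NF - 1)
            if (configured == "FALSE" && runtime == "FALSE")
                print "OK: interrupt remapper enabled"
            else if (configured == "FALSE")
                print "PENDING: configured FALSE, runtime " runtime " (reboot needed)"
            else
                print "AFFECTED: iovDisableIR configured " configured
        }
        END { if (!found) print "UNKNOWN: iovDisableIR not found in input" }
    '
}
```

On a host, this would be used as `esxcli system settings kernel list -o iovDisableIR | iov_status`. The Default column is deliberately ignored, because only the configured and runtime values say whether the host is currently running with the interrupt remapper disabled.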
I have written a small PowerCLI script that uses the Get-EsxCli cmdlet to check all hosts in a cluster. The script only checks the setting; it doesn’t change the iovDisableIR setting.
Here’s another script that analyzes and fixes the issue.