Troubleshooting

Full Root FS on ESXi due to iLOREST logfile

This really annoying issue was haunting me for several weeks until I discovered the root cause. One of my customers is running VMware ESXi on top of HPE ProLiant DX hardware, HPE’s customized hardware for Nutanix. It’s simply a ProLiant DL with a specific set of available components, firmware, drivers and branding. Instead of running AHV, this customer chose to run VMware ESXi as the hypervisor. Everything was running fine until the customer reported recurring failures of a specific Nutanix Cluster Check, in this case the ‘host_disk_usage_check’. While investigating the issue, I noticed that the root filesystem on all nodes of the cluster was full.
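
In the end, the check boils down to spotting oversized files on the root filesystem. On ESXi, `vdf -h` shows the ramdisk usage; the hunt for the offending file can be sketched in plain Python (the filename and the 50 MiB default threshold are my assumptions for illustration, not what NCC actually uses):

```python
import os
import tempfile

def find_large_files(root, min_bytes=50 * 1024 * 1024):
    """Walk a directory tree and return (size, path) tuples for files
    of at least min_bytes, largest first."""
    hits = []
    for dirpath, _dirs, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                continue  # unreadable or vanished file, skip it
            if size >= min_bytes:
                hits.append((size, path))
    return sorted(hits, reverse=True)

# Self-contained demo: a fake oversized logfile in a scratch directory.
# On a real host you would scan from "/" with the default threshold.
demo = tempfile.mkdtemp()
with open(os.path.join(demo, "ilorest.log"), "wb") as f:
    f.write(b"x" * 4096)
for size, path in find_large_files(demo, min_bytes=1024):
    print(f"{size} bytes  {path}")
```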

vCenter vLCM and HPE iLO Amplifier: Error occurred during compliance scan

The HPE iLO Amplifier integration into the vSphere Lifecycle Manager is an easy way to update your ESXi hosts not only with patches, but also with the latest firmware. Although HPE has discontinued HPE iLO Amplifier, the product is still widely used. The final version is 2.22; the final version of the HPE iLO Amplifier HSM Provider for VMware vSphere is 1.5.0.

A recurring problem is an error message during the compliance scan:

hpe_hba_cabling_check falsely issues a warning

After a routine update of a 6-node Nutanix cluster, a Nutanix Cluster Check (NCC) warning popped up indicating a problem with the SAS cabling. Running the check on the CLI offered some more details.

Running : health_checks hardware_checks disk_checks hpe_hba_cabling_check
[==================================================] 100%
/health_checks/hardware_checks/disk_checks/hpe_hba_cabling_check                                                                                   [ WARN ]
-----------------------------------------------------------------------------------------------------------------------------------------------------------+

Detailed information for hpe_hba_cabling_check:
Node 10.99.1.205:
WARN: Disk cabling for disk(s) S6GLNG0T610113 are detected at incorrect location(s) 3:251:8 respectively where each value in the location corresponds to box:bay
Node 10.99.1.202:
WARN: Disk cabling for disk(s) S6GLNG0T610203 are detected at incorrect location(s) 3:251:8 respectively where each value in the location corresponds to box:bay
Node 10.99.1.206:
WARN: Disk cabling for disk(s) S6GLNG0T610104 are detected at incorrect location(s) 3:251:8 respectively where each value in the location corresponds to box:bay
Node 10.99.1.201:
WARN: Disk cabling for disk(s) S6GLNG0T610248, S6GLNG0T610219, S6GLNG0T610220, S6GLNG0T610081, S6GLNG0T603894, S6GLNG0T603909, S6GLNG0T610222 are detected at incorrect location(s) 3:252:1, 3:252:7, 3:252:6, 3:252:2, 3:252:3, 3:252:4, 3:252:5 respectively where each value in the location corresponds to box:bay
Node 10.99.1.203:
WARN: Disk cabling for disk(s) S6GLNG0T610247 are detected at incorrect location(s) 3:251:8 respectively where each value in the location corresponds to box:bay
Node 10.99.1.204:
WARN: Disk cabling for disk(s) S6GLNG0T610213 are detected at incorrect location(s) 3:251:8 respectively where each value in the location corresponds to box:bay
Refer to KB 11310 (http://portal.nutanix.com/kb/11310) for details on hpe_hba_cabling_check or Recheck with: ncc health_checks hardware_checks disk_checks hpe_hba_cabling_check --cvm_list=10.99.1.205,10.99.1.202,10.99.1.206,10.99.1.201,10.99.1.203,10.99.1.204
+-----------------------+
| State         | Count |
+-----------------------+
| Warning       | 1     |
| Total Plugins | 1     |
+-----------------------+
Plugin output written to /home/nutanix/data/logs/ncc-output-latest.log

All six nodes were affected. The cluster had been running for quite some time without any issues, and this warning had never come up before. It appeared right after installing the latest patches.

Failed to connect to IKEv2 VPN using iPhone USB tethering

Usually I tend to use the iPhone WiFi hotspot feature. But lately, I had to switch to USB tethering, because I had to work a whole workday over the phone connection. USB tethering saves battery, and the connection was more reliable for me. Please note that you need to install iTunes to use USB tethering, because the necessary Ethernet driver only ships with iTunes. Without this driver, Windows won’t recognize the iPhone as an Ethernet connection.

Why you should change your KRBTGT password prior to disabling RC4

While chilling on my couch, I stumbled over this pretty interesting Reddit thread: Story Time - How I blew up my company’s AD for 24 hours and fixed it : sysadmin (reddit.com)

Long story short: a poor guy applied some STIG hardening and his Active Directory blew up. The root cause was disabling RC4, which caused Kerberos failures, primarily documented by errors like “The encryption type requested is not supported by the KDC.” He fixed it by shutting down all domain controllers, changing the KRBTGT account password on one domain controller, and finally everything came back.

Upgrade to ESXi 7.0: Missing dependencies VIBs Error

This error gets me from time to time, regardless of the server vendor, mostly on hosts that have been upgraded a couple of times. In this case it was an ESXi host running a pretty old build of ESXi 6.7 U3, and my job was to upgrade it to 7.0 Update 3c.

If you add an upgrade baseline to the cluster or host and try to remediate the host, the task fails with a dependency error. When taking a closer look into the task details, you are told that the task failed because of a missing dependency, but not which VIB caused it.

Outlook Web Access fails with "440 Login Timeout"

Today I faced an interesting problem. A customer told me that their Exchange 2010, which is currently part of an Exchange cross-forest migration project, had an issue with Outlook Web Access and the Exchange Control Panel. Both web sites failed with a white screen and a single message:

440 Login Timeout

I checked some basics, like the certificate and the configuration of the virtual directories, and found nothing suspicious. Most hints on the internet pointed towards problems with the IUSR_servername user, which is not used with IIS 7 and later. But the authentication configuration and filesystem permissions were okay. The IIS and event logs were pretty unhelpful, too.

Escaping special characters in proxy auth passwords in vCenter

EDIT: It seems that this was fixed in vCenter 7.0 U3.

While debugging a vCenter Lifecycle Manager that was unable to download updates, I stumbled over a weird behaviour which is (IMHO) by design.

Some of you might use a proxy server. And some of you might use a proxy server that requires credentials. In my case, my customer uses a Sophos SG appliance as a web proxy server with authentication. The customer created a user with a complex password. But I was unable to get a working internet connection.
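
The underlying pitfall is generic: when credentials are embedded in a proxy URL, reserved characters like @, :, / or # in the password break URL parsing unless they are percent-encoded. A minimal sketch of the encoding (the username, password and proxy host below are made up):

```python
from urllib.parse import quote

def proxy_url(user, password, host, port=3128):
    """Build a proxy URL with percent-encoded credentials.
    safe='' forces every reserved character to be encoded, so a
    password containing @ : / # cannot break URL parsing."""
    return f"http://{quote(user, safe='')}:{quote(password, safe='')}@{host}:{port}"

# Example: a complex password full of URL-reserved characters
print(proxy_url("svc-proxy", "p@ss:w0rd/#1", "proxy.example.com"))
# http://svc-proxy:p%40ss%3Aw0rd%2F%231@proxy.example.com:3128
```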

Exchange Control Panel /ecp broken after certificate replacement

As part of an ongoing Exchange 2010 to 2016 migration, I had to replace the self-signed certificate with a certificate from the customer’s PKI. Everything went fine: the customer had a suitable template, we’ve added the necessary hostnames and bound IIS and SMTP to the certificate. The mess started with an iisreset /noforce.

The iisreset took longer than expected. After that, I tried to log in to the ECP, entered username and password, and got an error.

Update Manager fails with unknown error during host remediation

During a vSphere 6.5 to 6.7 update, a host was failing continuously at remediation with an “unknown error”. The host was being upgraded from ESXi 6.5 to 6.7 using an upgrade baseline. Other hosts were updated to 6.7 with the latest patches without any issues. Something strange was going on…

The esxupdate.log and the vua.log on the host itself showed nothing special. So I checked the vmware-vum-server-log4cpp.log, which was much more informative!
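
When the interesting log runs to thousands of lines, a crude error-level filter is often the fastest first pass. A minimal sketch; the sample entries below are made up and only mimic a log4cpp-style layout, not actual VUM output:

```python
def error_lines(lines, markers=("error", "fatal")):
    """Return the lines that contain any marker, case-insensitively."""
    return [ln for ln in lines if any(m in ln.lower() for m in markers)]

# Hypothetical sample entries, for illustration only.
sample = [
    "[2022-01-10 09:15:01 'HostUpdateDepotManager' INFO] scanning depot",
    "[2022-01-10 09:15:02 'HostUpdateDepotManager' ERROR] download failed",
    "[2022-01-10 09:15:03 'VciTaskBase' INFO] task queued",
]
for ln in error_lines(sample):
    print(ln)
# prints only the ERROR line
```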