VMware vCenter: Host state 'not responding' flapping
While I was onsite at a customer to decommission an old storage system, one of my very first tasks was to unmount and detach some old datastores. No big deal, until I saw one ESXi host after another go to “not responding”. Time for a heart attack, but hey: why should a host run into a PDL/APD while I was unmounting datastores on the vSphere layer? The LUNs were still there and accessible. The hosts came back quickly, and from that point on I watched them flap between “connected” and “not responding”. Time for an investigation. My first thought was that it must have something to do with the network. But the network was okay: no problems with interfaces, (M/R)STP or similar. Then I checked the logs and found this:
2014-09-18 10:53:42,072 [Timer-13] ERROR com.vmware.vim.sms.provider.vasa.event.EventDispatcher - Error occurred while polling events
com.vmware.vim.sms.fault.VasaServiceException: org.apache.axis2.AxisFault: No buffer space available (maximum connections reached?): JVM_Bind
and this
2014-09-18 10:53:42,946 [Timer-15] DEBUG com.vmware.vim.sms.provider.vasa.event.EventDispatcher - [pollEvents] Last event ID = -1
2014-09-18 10:53:42,946 [Timer-13] INFO org.apache.commons.httpclient.HttpMethodDirector - I/O exception (java.net.SocketException) caught when processing request: No buffer space available (maximum connections reached?): JVM_Bind
in the sms.log on the vCenter server. In addition, this message was logged multiple times per second in the Windows Server event log:
Transport authentication failed.
Service: https://localhost:3509/
ClientIdentity: CN=SMS-120913144508527, O=VMware; C595AE5B1DF15E922569E6E3BDB4D512C1366353
ActivityId: <null>
MessageSecurityException: The HTTP request with client authentication scheme 'Anonymous' failed with 'Forbidden' status.
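The "maximum connections reached?" hint in the sms.log entries suggests looking at how many TCP connections the server currently holds. The following is a minimal sketch of such a check, not the exact procedure used here. It assumes the text output of `netstat -ano` on Windows; the function name `summarize_netstat` and the sample output are made up for illustration.

```python
from collections import Counter

# Invented sample in the layout that "netstat -ano" prints on Windows.
SAMPLE = """
Active Connections

  Proto  Local Address          Foreign Address        State           PID
  TCP    0.0.0.0:135            0.0.0.0:0              LISTENING       708
  TCP    127.0.0.1:49200        127.0.0.1:3509         TIME_WAIT       0
  TCP    127.0.0.1:49201        127.0.0.1:3509         ESTABLISHED     1234
  TCP    [::1]:49202            [::1]:8443             ESTABLISHED     1234
  UDP    0.0.0.0:123            *:*                                    900
"""


def summarize_netstat(output: str) -> dict:
    """Count TCP connections per state and those whose remote end is loopback."""
    states = Counter()
    loopback = 0
    for line in output.splitlines():
        fields = line.split()
        # TCP rows have at least: proto, local address, foreign address, state.
        if len(fields) < 4 or fields[0] != "TCP":
            continue
        states[fields[3]] += 1
        # Strip the port; IPv6 addresses are printed in brackets, e.g. [::1]:8443.
        remote_host = fields[2].rsplit(":", 1)[0]
        if remote_host.startswith("127.") or remote_host == "[::1]":
            loopback += 1
    return {
        "total": sum(states.values()),
        "loopback": loopback,
        "by_state": dict(states),
    }


if __name__ == "__main__":
    print(summarize_netstat(SAMPLE))
```

On a real server, the input could come from `subprocess.run(["netstat", "-ano"], capture_output=True, text=True).stdout`. A loopback count in the thousands would match a server that is exhausting its dynamic ports by connecting to itself.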
A short search in the Microsoft Knowledge Base pointed me to KB2577795. Checking the number of current connections showed that the server was opening hundreds, even thousands, of connections to itself. Unfortunately, installing KB2577795 did not solve the problem. My second attempt was to raise the value of MaxUserPort, which sets the highest port number that can be handed out as a dynamic port and thereby limits the number of dynamic ports available to applications. Open the registry editor and locate the key:
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters
Create a new DWORD named MaxUserPort and set its value to 65534.
Value Name: MaxUserPort
Value Type: DWORD
Value data: 65534
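The same registry change can also be scripted instead of made by hand in the registry editor. This is a rough sketch using Python's standard `winreg` module, assuming it runs on the vCenter server with administrative rights. The helper names `validate_max_user_port` and `set_max_user_port` are hypothetical, and the 5000 to 65534 range check reflects my reading of Microsoft's documentation for this value, so treat it as an assumption.

```python
import sys

# Registry location and value name as given in the text above.
KEY_PATH = r"SYSTEM\CurrentControlSet\Services\Tcpip\Parameters"
VALUE_NAME = "MaxUserPort"


def validate_max_user_port(value: int) -> int:
    """Reject values outside the range assumed to be valid for MaxUserPort."""
    if not 5000 <= value <= 65534:
        raise ValueError(f"MaxUserPort must be between 5000 and 65534, got {value}")
    return value


def set_max_user_port(value: int = 65534) -> None:
    """Write MaxUserPort as a DWORD below HKEY_LOCAL_MACHINE (Windows only)."""
    import winreg  # only available on Windows, hence the local import

    value = validate_max_user_port(value)
    with winreg.OpenKey(
        winreg.HKEY_LOCAL_MACHINE, KEY_PATH, 0, winreg.KEY_SET_VALUE
    ) as key:
        winreg.SetValueEx(key, VALUE_NAME, 0, winreg.REG_DWORD, value)


if __name__ == "__main__":
    if sys.platform == "win32":
        set_max_user_port(65534)
        print("MaxUserPort written.")
    else:
        print("Not running on Windows, nothing changed.")
```

The script only writes the value. It does not restart anything, so a reboot is still a separate step.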
After a reboot of the vCenter server, the host state flapping stopped and did not recur. This is NOT the solution; it’s only a dirty workaround. The customer and I decided to reinstall the virtual vCenter Server and update it to vCenter 5.5 U2. The reinstallation wasn’t a big deal thanks to an external database and a backup of the SSL certificates.
I still don’t have a clue what happened to the vCenter server. I was able to create a workaround, but I wasn’t able to find the root cause. The customer used this situation to make a clean cut and start over with a new vCenter server.