Troubleshooting

Event ID 4625 - Failure Reason: Domain sid inconsistent

Over the last two days I had a lot of trouble with Microsoft Remote Desktop Services (RDS), or, to use the older name, Terminal Services. To be honest, terminal servers are not really my specialty, and I was actually at the customer's site to help with some vSphere-related changes. But since I was there, I was asked to take a closer look at some problems with their Microsoft Windows Server 2008 R2 based terminal server farm. Some problems with removable media (USB sticks etc.) and audio on IGEL thin clients were hard to troubleshoot, but we were able to fix them. The main problem did not look like a problem at first glance.

VMware ESXi 5.5 host doesn't mount VMFS 5 datastore

Yesterday I stumbled over a post in a German VMware forum. After a vSphere 5.5 update, a user noticed that a newly updated ESXi 5.5 host wasn’t able to mount some datastores. The host had been updated with an HP customized ESXi 5.5 image. The other two hosts, running ESXi 5.1 installed from an HP customized image, had no problems. An HP P2000 G3 MSA array with iSCSI was used as shared storage. The datastores with VMFS version 5.54 were mounted. Only the datastores with VMFS 5.58 were not mounted. The user evacuated the VMs from one of these datastores, then deleted and recreated it. The recreated datastore appeared for a short moment and then disappeared again.
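The pattern in the report, where the mount state follows the VMFS version, is easier to see when the inventory is grouped by version. Here is a minimal Python sketch of that grouping. The datastore names and the `mount_state_by_version` helper are hypothetical; only the two version numbers come from the forum post.

```python
from collections import defaultdict


def mount_state_by_version(datastores):
    """Group datastore names by VMFS version and by mount state."""
    grouped = defaultdict(lambda: {"mounted": [], "unmounted": []})
    for name, version, mounted in datastores:
        key = "mounted" if mounted else "unmounted"
        grouped[version][key].append(name)
    return dict(grouped)


# Hypothetical inventory as seen from the updated ESXi 5.5 host:
# (datastore name, VMFS version, mounted?)
inventory = [
    ("datastore-a", "5.54", True),
    ("datastore-b", "5.54", True),
    ("datastore-c", "5.58", False),
]

summary = mount_state_by_version(inventory)
for version, state in sorted(summary.items()):
    print(version, state)
```

If every unmounted datastore ends up under a single version, as in this made-up inventory, the version is a reasonable first suspect, although that alone does not prove it is the cause.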

Problem analysis with Kepner-Tregoe

When you deal with problems in IT, you often deal with problems whose root cause is unknown. To solve such problems, you have to use a systematic method. Only a systematic method leads to a fast, effective and efficient solution. One of the methods I have observed most often in my career is based on approximation. We all know it as “trial and error”. Someone keeps trying until the problem is solved. Often this method makes things worse than they were before, and it often leads to wrong conclusions and, furthermore, to wrong results. If someone draws a wrong connection at the beginning of the analysis, this leads down a totally wrong path. I would like to illustrate this with an example:

Trouble with Broadcom NetXtreme II and VMware ESXi

Today I faced a really nasty problem. I have four HP ProLiant DL360 G6 servers in my lab. This server type has two onboard 1 GbE NICs with the Broadcom NetXtreme II BCM5709 chip, which are usually claimed by the bnx2 driver. While I was applying a host profile to three of the hosts, one host reported an error. Supposedly the host had no vmnic0, and because of this the host profile couldn’t be applied. Okay, quick check in the vSphere Web Client: only three NICs. The C# client showed the same result. Now it got interesting:

HP 4 Gb Fibre Channel Pass-Thru Module for c-Class BladeSystem & 8 Gb SFP+ transceiver

TL;DR: The HP 4Gb Fibre Channel Pass-Thru Module is (as the name says) a 4 Gb Fibre Channel module. Even if HP delivers the module with 8 Gb SFP+ transceivers, the module can only provide a 4 Gb link. Don't make the same mistake I did. Just because 8 Gb SFP+ transceivers are included, it doesn't necessarily mean that the module provides an 8 Gb connection.

The HP 4Gb Fibre Channel Pass-Thru Module for c-Class BladeSystem (PN 403626-B21) is an interconnect module for the HP BladeSystem c-Class. It’s a simple pass-thru module, which provides 1:1 non-switched, non-blocking paths between the server blades and a Fibre Channel fabric. There are several other Fibre Channel interconnect modules, like the Virtual Connect 8 Gb Fibre Channel modules (20 or 24 ports) or the Brocade and Cisco 8Gb SAN Switches for HP BladeSystem c-Class. The pass-thru module is a good choice if the customer has a good Fibre Channel infrastructure and the number of servers is manageable. It’s much cheaper than the Virtual Connect Fibre Channel modules (which require a Virtual Connect Ethernet module for management) or the Brocade or Cisco MDS Fibre Channel switches for HP BladeSystem c-Class. But it also has a disadvantage: it only provides a 4 Gb Fibre Channel link! Even if HP delivers the module with 8 Gb SFP+ transceivers, a maximum of 4 Gb is possible. Neither the QuickSpecs nor HP support could tell me which SFP+ transceivers are included. I found out only by chance that 8 Gb SFP+ transceivers are included. Unfortunately HP doesn’t offer an 8 Gb pass-thru module, and the 4 Gb pass-thru module doesn’t support 8 Gb connections, even with 8 Gb SFP+ transceivers. If you need an 8 Gb connection, you have to use Virtual Connect or the Brocade or Cisco MDS Fibre Channel switches.
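The point can be reduced to a small piece of arithmetic: a link can only run as fast as the slowest component in the path. Here is a minimal Python sketch of that rule. It is a simplified model, not the actual Fibre Channel speed negotiation. The `negotiated_speed_gbit` helper is a hypothetical name, and the 8 Gb HBA is an assumption made for the example.

```python
def negotiated_speed_gbit(*component_speeds_gbit):
    """Return the usable link speed, limited by the slowest component."""
    if not component_speeds_gbit:
        raise ValueError("at least one component speed is required")
    return min(component_speeds_gbit)


# 4 Gb pass-thru module, 8 Gb SFP+ transceiver, 8 Gb HBA (HBA speed assumed)
print(negotiated_speed_gbit(4, 8, 8))  # prints 4
```

The 8 Gb transceiver changes nothing as long as the module itself is limited to 4 Gb.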

Flooded network due to HP Networking Switches & Windows NLB

Today I was onsite at a customer to bring a tiny VMware vSphere cluster to life (HP BladeSystem c7000 with 7 HP ProLiant BL460c Gen8). Normally no big deal, but it started with two unavailable Onboard Administrator (OA) network interfaces. I switched from static IP addresses to DHCP, but had no luck. I noticed that both interfaces were available when I connected my notebook directly to them. I also noticed that the Insight Display became unresponsive after connecting one or both OAs to the network. The customer told me that they had had network-related problems with virtual AND physical machines the day before. Short outages, lost pings, things like that. This morning, before I arrived on site, the problems had become worse. The customer told me that they had had these network problems for a while. They had a lot of work and the outages were annoying, but not a big problem. The BladeSystem was already connected to the network (HP 10GbE Pass-Thru modules), but this kind of interconnect couldn’t cause this kind of problem. I checked the switches and found an enormous amount of “Drops TX” on EVERY SINGLE ACTIVE port. But I found no loops or anything like that. The network was flat. One VLAN and a /16 network. Not nice, but functional. I asked the customer to start Wireshark. I wanted to take a look around and get a feeling for what was going on in the network. Wireshark started and… stopped responding. After a couple of seconds it came back and I saw traffic that was… spooky. Usually I expect things like broadcasts, ARP, and traffic from or for my client. But I saw traffic from a domain controller to a Windows NLB cluster and Citrix traffic to a Windows NLB cluster. I checked whether the workstation was connected to a monitoring port, but it wasn’t. And it was only traffic destined for the Windows NLB cluster. Our network problems had something to do with the Windows NLB. The customer and I decided to stop both NLB nodes.
After that: Silence… I saw the expected traffic in Wireshark and both of my OAs were responding. Everything was fine… until we started the NLB again.
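The Wireshark observation rests on a simple rule: on a normal switched port, a host should mostly see broadcast or multicast frames and unicast frames it sends or receives itself. Here is a minimal Python sketch of that check. It assumes the capture has already been reduced to (source MAC, destination MAC) pairs. The `unexpected_unicast` helper and all MAC addresses are hypothetical.

```python
def unexpected_unicast(frames, local_mac):
    """Return unicast frames that neither come from nor go to local_mac."""
    local = local_mac.lower()
    suspicious = []
    for src, dst in frames:
        src, dst = src.lower(), dst.lower()
        # The lowest bit of the first octet marks multicast and broadcast.
        is_group_address = int(dst.split(":")[0], 16) & 1 == 1
        if not is_group_address and local not in (src, dst):
            suspicious.append((src, dst))
    return suspicious


# Hypothetical capture taken on the workstation 02:00:00:00:00:01
capture = [
    ("02:00:00:00:00:01", "02:00:00:00:00:10"),  # sent by the workstation
    ("02:00:00:00:00:20", "ff:ff:ff:ff:ff:ff"),  # broadcast, expected
    ("02:00:00:00:00:20", "02:00:00:00:00:30"),  # foreign unicast
]
print(unexpected_unicast(capture, "02:00:00:00:00:01"))
```

A few foreign unicast frames can show up on a healthy network while a switch is still learning addresses. A steady stream of them to one destination, as in the capture described above, suggests that the switches are flooding that traffic to every port.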

Regenerating expired vCenter SSL certificates

During a vSphere 5.0 > 5.5 upgrade I got this message:

The SSL certificate for this product is expired. See Knowledge Base article kb.vmware.com/kb/1009092

The customer hasn’t installed CA-signed certificates, so the expired certificates are the out-of-the-box self-signed certificates. Depending on the version, these certificates are valid for two years (VirtualCenter 2.5) or ten years (since vCenter 4.x). The only way to continue the installation is to renew the certificates. After renewing them, you can simply continue the setup, because the vCenter service is stopped at this point of the setup and loads the new certificates during startup. It’s the setup that checks the validity of the certificates. KB1009092 describes in great detail what to do, so I will not repeat what is already written there. You should note that you can’t use the ESXi BusyBox shell to renew the certificates. The necessary OpenSSL binary isn’t included. The KB article recommends OpenSSL for Windows. I simply used my Linux root server, but you can also use a small Linux VM. After renewing the certificates for vCenter, Inventory Service and Web Client, I simply continued the setup and it ran without problems. The deployment of CA-signed certificates is planned.
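To avoid running into this message in the middle of an upgrade, the remaining lifetime of a certificate can be checked beforehand. Here is a minimal Python sketch of such a check. It is not the procedure from KB1009092. It assumes you have already read the notAfter date from the certificate, for example from the output of `openssl x509 -noout -enddate`. The `days_until_expiry` helper and the dates are hypothetical.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after, now=None):
    """Return the days left before a certificate's notAfter date.

    not_after uses the OpenSSL text form, e.g. "Jan  1 00:00:00 2024 GMT".
    A negative result means the certificate has already expired.
    """
    now = now or datetime.now(timezone.utc)
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).total_seconds() / 86400


# Hypothetical dates: the certificate expired ten days before the check.
check_time = datetime(2024, 1, 11, tzinfo=timezone.utc)
print(days_until_expiry("Jan  1 00:00:00 2024 GMT", check_time))  # prints -10.0
```

A negative value corresponds to the state the setup complained about. A small positive value is a hint to renew the certificates before starting the upgrade.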