Troubleshooting

"Cannot execute upgrade script on host" during ESXi 6.5 upgrade

I was onsite at one of my customers to update a small VMware vSphere 6.0 U3 environment to 6.5 U2c. The environment consists of three hosts: two hosts in a cluster, and a third host that is only used to run an HPE StoreVirtual Failover Manager.

The update of the first host, using the Update Manager and an HPE custom ESXi 6.5 image, was pretty flawless. But the update of the second host failed with “Cannot execute upgrade script on host”.

Veeam backups fail because of time differences

Last week I had an interesting incident at a customer. The customer reported that one of multiple Veeam backup jobs constantly failed.

The backup job included two VMs, and the backup of one of these VMs failed with this error:

Error: Failed to open VDDK disk [[VMDS-SAS-01] VMDC1/VMDC1_1.vmdk] ( is read-only mode - [true] ) 
Failed to open virtual disk Logon attempt with parameters [VC/ESX: [vcenter.domain.tld];Port: 443;Login: 
[AD\Administrator];VMX Spec: [moref=vm-59];Snapshot mor: [snapshot-20226];Transports: [san];Read Only: [true]]
failed because of the following errors: Failed to open virtual disk Logon attempt with parameters 
VC/ESX: [vcenter.domain.tld];Port: 443;Login: [AD\Administrator

I verified the credentials used for that job, but re-entering the password did not solve the issue. I then checked the Veeam backup logs located under %ProgramData%\VeeamBackup (look for the Agent.Job_Name.Source.VM_Name.vmdk.log) and found VDDK Error 3014:

Demystifying "Interfaces on which heartbeats are not seen"

By accident, I found a heartbeat/VLAN issue on a NetScaler cluster at one of my customers. The NetScaler ADC appliances have three interfaces connected to a switch stack. Two of the three interfaces were configured as a channel (LAG). This is a snippet from the config:

set channel LA/1 -tagall ON -throughput 0 -lrMinThroughput 0 -bandwidthHigh 0 -bandwidthNormal 0
...
bind vlan 10 -ifnum 1/3
bind vlan 10 -ifnum LA/1 -tagged
bind vlan 54 -ifnum LA/1 -tagged
bind vlan 55 -ifnum LA/1 -tagged

On the switch stack, the port to which interface 1/3 is connected is configured as an access port. The ports to which the channel is connected are configured as trunk ports with some permitted VLANs. The customer is using HPE Comware based switches. The terminology is the same for Cisco. If you use HPE ProVision or Alcatel-Lucent Enterprise, translate “access” to “untagged” and “trunk” to “tagged”. Because the channel is connected to trunk ports on the switch, the tagall option was set.
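To make the tagged/untagged relationship easier to see, here is a minimal Python sketch that reads the bind vlan lines from the snippet above and lists, per interface, which VLANs are bound tagged and which untagged. The function name vlan_tagging and the simple token-based parsing are my own assumptions for illustration; this is not a NetScaler tool and it only understands the two command forms shown in the snippet.

```python
from collections import defaultdict

# The config lines from the snippet above (the "..." line is omitted).
CONFIG = """\
set channel LA/1 -tagall ON -throughput 0 -lrMinThroughput 0 -bandwidthHigh 0 -bandwidthNormal 0
bind vlan 10 -ifnum 1/3
bind vlan 10 -ifnum LA/1 -tagged
bind vlan 54 -ifnum LA/1 -tagged
bind vlan 55 -ifnum LA/1 -tagged
"""


def vlan_tagging(config: str) -> dict:
    """Map each interface to {vlan_id: 'tagged' | 'untagged'}.

    Hypothetical helper: only 'bind vlan <id> -ifnum <if> [-tagged]'
    lines are evaluated, everything else is skipped.
    """
    result = defaultdict(dict)
    for line in config.splitlines():
        tokens = line.split()
        if tokens[:2] != ["bind", "vlan"] or "-ifnum" not in tokens:
            continue
        vlan_id = int(tokens[2])
        ifnum = tokens[tokens.index("-ifnum") + 1]
        result[ifnum][vlan_id] = "tagged" if "-tagged" in tokens else "untagged"
    return dict(result)


if __name__ == "__main__":
    for ifnum, vlans in vlan_tagging(CONFIG).items():
        print(ifnum, vlans)
```

For the snippet above, interface 1/3 carries VLAN 10 untagged (matching the access port on the switch), while LA/1 carries VLANs 10, 54 and 55 tagged (matching the trunk ports).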

Unsupported hardware family 'vmx-06'

A customer of mine got an appliance from a software vendor. The appliance was delivered as a ZIP file containing a VMDK, an MF, and an OVF file. Unfortunately, the appliance was created with VMware Workstation 6.0 with virtual machine hardware version 6, which is incompatible with VMware ESXi (see “Virtual machine hardware versions”). During deployment, my customer got this error:

unsupported hardware family 'vmx-06'

The OVF file includes a line with the VM hardware version.
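If you want to check the hardware version before deploying, the following Python sketch reads an OVF descriptor and returns the content of every VirtualSystemType element, which is where the vmx-NN value is stored. The sample descriptor is a stripped-down, hypothetical example and not the vendor's file, and the function name hardware_versions is my own. The script only reads the descriptor, it does not change anything.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily shortened OVF descriptor for illustration only.
SAMPLE_OVF = """\
<Envelope xmlns:vssd="http://schemas.dmtf.org/wbem/wscim/1/cim-schema/2/CIM_VirtualSystemSettingData">
  <VirtualSystem>
    <VirtualHardwareSection>
      <System>
        <vssd:VirtualSystemType>vmx-06</vssd:VirtualSystemType>
      </System>
    </VirtualHardwareSection>
  </VirtualSystem>
</Envelope>
"""


def hardware_versions(ovf_text: str) -> list:
    """Return the text of all VirtualSystemType elements in the descriptor."""
    root = ET.fromstring(ovf_text)
    return [
        (element.text or "").strip()
        for element in root.iter()
        # ElementTree writes namespaced tags as '{uri}LocalName';
        # compare the local name only, so the prefix does not matter.
        if element.tag.rsplit("}", 1)[-1] == "VirtualSystemType"
    ]


if __name__ == "__main__":
    print(hardware_versions(SAMPLE_OVF))
```

To check a real appliance, read the OVF file into a string and pass it to hardware_versions instead of SAMPLE_OVF.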

Exchange DAG member dies during snapshot creation

Yesterday, a customer called me and told me about a scary observation on one of his Exchange 2016 DAG (Database Availability Group) nodes.

In preparation for a security check, my customer created a snapshot of an Exchange 2016 DAG node. This node is part of a two-node Windows Server 2012 R2/Exchange 2016 CU7 cluster.

It was instantly clear that something had gone wrong when the first alarm messages were received. My customer opened a console window and saw that the VM was booting.

Exchange receive connector rejects incoming connections

As part of a bigger Microsoft Exchange migration, one of my customers moved the in- and outbound mailflow to a newly installed mail relay cluster. We modified MX records to move the mailflow to the new mail relay, because the customer also switched the ISP. While we changed the MX records for ~40 domains, and more and more mails were therefore received through the new mail relay cluster, we noticed events from MSExchangeTransport (event ID 1021):

Roaming of AppData-Local breaks Windows 10 Start Menu

One of my customers has started a project to create a Windows 10 Enterprise (LTSB 2016) master for their VMware Horizon View environment. Besides the fact (okay, it is more of a personal feeling) that Windows 10 is a real PITA for VDI, I noticed an interesting issue during tests.

The issue

For convenience, I adopted some settings of the current Persona Management GPO for Windows 7 for the new Windows 10 environment. During the tests, the customer and I noticed a strange behaviour: After login, the start menu would not open. The only solution was to log off and delete the persona folder (most folders are redirected using native Folder Redirection, not the redirection feature of the View Persona Management). While debugging this issue, I found this error in the event log.

Checking the 3PAR Quorum Witness appliance

Two 3PAR StoreServs running in a Peer Persistence setup lost the connection to the Quorum Witness appliance. The appliance is an important part of a 3PAR Peer Persistence setup, because it acts as a tie-breaker in a split-brain scenario.

While analyzing this issue, I saw this message in the 3PAR Management Console:

Patrick Terlisten/ vcloudnine.de/ Creative Commons CC0

In addition to that, the customer got e-mails that the 3PAR StoreServ arrays lost the connection to the Quorum Witness appliance. In my case, the CouchDB process died. A restart of the appliance brought it back online.

Solving problems: A structured approach

What is a problem? A problem is an obstacle that has to be surmounted. Solving a problem means dealing with obstacles. Or, more generally: Problem solving is the process of getting from an unsatisfactory to a satisfactory situation.

Most of us get paid for solving problems. It’s irrelevant if you are paid for solving technical problems (e.g. My computer doesn’t work), or if you are paid to create solutions for customers (e.g. design infrastructure for a Citrix XenApp farm). In the end, you solve a problem.

Windows receives wrong DNS server from DHCP after DHCPINFORM

Last week, I was unexpectedly booked by a customer who observed a problem in his network. Unfortunately, colleagues had worked on this network some days before (moving servers, routers etc. to a new pair of HP 7509 core switches).

It quickly became clear that some of the clients had received the wrong DNS servers from the DHCP server. The environment is a bit unusual. The customer is running two Active Directory domains (root and sub domain) in a single layer 2 broadcast domain. This alone is nothing unusual, but he is also running two DHCP servers in the same layer 2 broadcast domain. To get this working, the customer uses exclusion ranges and reservations. This guarantees that each client receives the correct DHCP information.