Veeam backups fails because of time differences
Last week I had an interesting incident at a customer. The customer reported that one of multiple Veeam backup jobs jobs constantly failed.
The backup job included two VMs, and the backup of one of these VMs failed with this error:
Error: Failed to open VDDK disk [[VMDS-SAS-01] VMDC1/VMDC1_1.vmdk] ( is read-only mode - [true] )
Failed to open virtual disk Logon attempt with parameters [VC/ESX: [vcenter.domain.tld];Port: 443;Login:
[ADAdministrator];VMX Spec: [moref=vm-59];Snapshot mor: [snapshot-20226];Transports: [san];Read Only: [true]]
failed because of the following errors: Failed to open virtual disk Logon attempt with parameters
VC/ESX: [vcenter.domain.tld];Port: 443;Login: [AD\Administrator
The verified the used credentials for that job, but re-entering the password does not solved the issue. I then checked the Veeam backup logs located under %ProgramData%\VeeamBackup
(look for the Agent.Job_Name.Source.VM_Name.vmdk.log) and found VDDK Error 3014:
Insufficient permissions in the host operating system
The user, that was used to connect to the vCenter, was an Active Directory located account. The account were granted administrator privileges root of the vCenter. Switching from an AD located account to Administrator@vsphere.local solved the issue. Next stop: vmware-sts-idmd.log on the vCenter Server appliance. The error found in this log confirmed my theory, that there was an issue with the authentication itself, not an issue with the AD located account.
[2018-07-04T11:59:49.848+02:00 vsphere.local 142f5216-8316-4752-b02c-e02be4154816 INFO ] [VmEventAppender] EventLog: source=[VMware Identity Server], tenant=[vsphere.local], eventid=[USER_NAME_PWD_AUTH_FAILED], level=[ERROR], category=[VMEVENT_CATEGORY_IDM], text=[Failed to authenticate principal [AD\Administrator]. Native platform error [code: 851968][null][null]], detailText=[com.vmware.identity.interop.idm.IdmNativeException: Native platform error [code: 851968][null][null]
[2018-07-04T11:59:49.848+02:00 vsphere.local 142f5216-8316-4752-b02c-e02be4154816 ERROR] [IdentityManager] Failed to authenticate principal [AD\Administrator]. Native platform error [code: 851968][null][null]
com.vmware.identity.interop.idm.IdmNativeException: Native platform error [code: 851968][null][null]
[2018-07-04T12:10:41.603+02:00 vsphere.local 64051ea1-0d7f-453d-8e34-92f0c8c37e77 INFO ] [IdentityManager] Authentication succeeded for user [AD\Administrator] in tenant [vsphere.local] in [37] milliseconds with provider [ad.domain.tld] of type [com.vmware.identity.idm.server.provider.activedirectory.ActiveDirectoryProvider]
To make a long story short: Time differences. The vCenter, the ESXi hosts and some servers had the wrong time. vCenter and ESXi hosts were using the Domain Controllers as time source.
This is the ntpq output of the vCenter. You might notice the jitter values on the right side, both noted in milliseconds.
vcenter:/storage/log/vmware/sso # ntpq
ntpq> peer
remote refid st t when poll reach delay offset jitter
==============================================================================
vmdc2.ad. 192.168.16.11 2 u 53 64 363 0.532 207.553 152007.
vmdc1.ad. .LOCL. 1 u 2 64 377 0.257 204.559 161964.
After some investigation, the root cause seemed to be a bad DCF77 receiver, which was connected to the domain controller that was hosting the PDC Emulator role. The DCF77 receiver was connected using an USB-2-LAN converter. Instead of using a DCF77 receiver, the customer and I implemented a NTP hierarchy using a valid NTP source on the internet (pool.ntp.org).