Nutanix

Fault Tolerance Status shows only FT2 in case of 2N/2D

One of my customers is running a five-node Nutanix cluster, which is configured for FT2 (2N/2D). This means that if the storage containers are configured with RF3, two of the five nodes can fail. Now I stumbled over the “Cluster Resiliency / Fault Tolerance Status” widget on the dashboard, which clearly showed that only a single node failure would be tolerated. This was a bit confusing, because the cluster is configured for FT2, which should allow two node failures (with RF3 storage containers).
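To restate the arithmetic the widget seemed to contradict: with replication factor RF, the cluster keeps RF copies of each piece of data, so it can lose RF − 1 nodes and still serve everything. A minimal sketch of that relationship (this helper is my own illustration, not a Nutanix API):

```python
def tolerated_failures(replication_factor: int) -> int:
    """Number of simultaneous node failures a container with the given
    replication factor can survive: at least one copy must remain."""
    if replication_factor < 2:
        raise ValueError("RF must be at least 2 for any fault tolerance")
    return replication_factor - 1

# RF2 (FT1) tolerates one node failure, RF3 (FT2) tolerates two.
print(tolerated_failures(2))  # 1
print(tolerated_failures(3))  # 2
```

So with RF3 containers, the widget should have reported a tolerable failure count of two, not one.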

Full Root FS on ESXi due to iLOREST logfile

This really annoying issue was haunting me for several weeks until I discovered the root cause. One of my customers is running VMware ESXi on top of HPE ProLiant DX hardware, the customized hardware from HPE for Nutanix. It’s simply a ProLiant DL with a specific set of available components, firmware, drivers and branding. Instead of running AHV, this customer chose to run VMware ESXi as the hypervisor. Everything was running fine until the customer reported recurring failures of a specific Nutanix Cluster Check, in this case the ‘host_disk_usage_check’. While investigating the issue, I noticed that the root filesystem on all nodes of the cluster was full.
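When a root filesystem fills up and the culprit is not obvious from `df` output, the first step is usually to find the biggest files. A generic sketch of that search (on an ESXi host itself you would use something like `find / -size +50m` in the shell; this Python version is my own illustration for any system where Python is available):

```python
import os

def largest_files(root, top_n=10):
    """Walk a directory tree and return the top_n biggest files as
    (size_in_bytes, path) tuples, largest first."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                sizes.append((os.path.getsize(path), path))
            except OSError:
                continue  # skip files that vanish or deny access mid-walk
    return sorted(sizes, reverse=True)[:top_n]

# Example: list the five biggest files under /tmp
for size, path in largest_files("/tmp", top_n=5):
    print(f"{size:>12}  {path}")
```

In this case, a search like this pointed straight at an oversized iLOREST logfile.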

hpe_hba_cabling_check falsely issues a warning

After a routine update of a 6-node Nutanix cluster, a Nutanix Cluster Check (NCC) warning popped up indicating a problem with the SAS cabling. Running the check on the CLI offered some more details.

Running : health_checks hardware_checks disk_checks hpe_hba_cabling_check
[==================================================] 100%
/health_checks/hardware_checks/disk_checks/hpe_hba_cabling_check                                                                                   [ WARN ]
-----------------------------------------------------------------------------------------------------------------------------------------------------------+

Detailed information for hpe_hba_cabling_check:
Node 10.99.1.205:
WARN: Disk cabling for disk(s) S6GLNG0T610113 are detected at incorrect location(s) 3:251:8 respectively where each value in the location corresponds to box:bay
Node 10.99.1.202:
WARN: Disk cabling for disk(s) S6GLNG0T610203 are detected at incorrect location(s) 3:251:8 respectively where each value in the location corresponds to box:bay
Node 10.99.1.206:
WARN: Disk cabling for disk(s) S6GLNG0T610104 are detected at incorrect location(s) 3:251:8 respectively where each value in the location corresponds to box:bay
Node 10.99.1.201:
WARN: Disk cabling for disk(s) S6GLNG0T610248, S6GLNG0T610219, S6GLNG0T610220, S6GLNG0T610081, S6GLNG0T603894, S6GLNG0T603909, S6GLNG0T610222 are detected at incorrect location(s) 3:252:1, 3:252:7, 3:252:6, 3:252:2, 3:252:3, 3:252:4, 3:252:5 respectively where each value in the location corresponds to box:bay
Node 10.99.1.203:
WARN: Disk cabling for disk(s) S6GLNG0T610247 are detected at incorrect location(s) 3:251:8 respectively where each value in the location corresponds to box:bay
Node 10.99.1.204:
WARN: Disk cabling for disk(s) S6GLNG0T610213 are detected at incorrect location(s) 3:251:8 respectively where each value in the location corresponds to box:bay
Refer to KB 11310 (http://portal.nutanix.com/kb/11310) for details on hpe_hba_cabling_check or Recheck with: ncc health_checks hardware_checks disk_checks hpe_hba_cabling_check --cvm_list=10.99.1.205,10.99.1.202,10.99.1.206,10.99.1.201,10.99.1.203,10.99.1.204
+-----------------------+
| State         | Count |
+-----------------------+
| Warning       | 1     |
| Total Plugins | 1     |
+-----------------------+
Plugin output written to /home/nutanix/data/logs/ncc-output-latest.log

All six nodes were affected. The cluster had been running for quite some time without any issues, and this warning had never come up before. It appeared right after installing the latest patches.
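For documenting the findings, or for comparing the reported locations against the physical cabling, the WARN lines can be parsed into serial/location pairs. A small sketch that works on the output shown above (the regex is my own, not part of NCC):

```python
import re

# Matches the serial list and the location list in an
# hpe_hba_cabling_check WARN line, as printed by NCC.
WARN_RE = re.compile(
    r"Disk cabling for disk\(s\) (?P<serials>[\w, ]+) are detected at "
    r"incorrect location\(s\) (?P<locations>[\d:, ]+) respectively"
)

def parse_cabling_warning(line):
    """Map each disk serial from a WARN line to the box:bay location
    NCC reported for it ('respectively' pairs them up in order)."""
    match = WARN_RE.search(line)
    if not match:
        return {}
    serials = [s.strip() for s in match.group("serials").split(",")]
    locations = [l.strip() for l in match.group("locations").split(",")]
    return dict(zip(serials, locations))

warn = ("WARN: Disk cabling for disk(s) S6GLNG0T610113 are detected at "
        "incorrect location(s) 3:251:8 respectively where each value in "
        "the location corresponds to box:bay")
print(parse_cabling_warning(warn))  # {'S6GLNG0T610113': '3:251:8'}
```

The same function handles the multi-disk line from node 10.99.1.201, since serials and locations are paired positionally.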

Is Nutanix the perfect fit for SMBs?

There’s a world below clouds and enterprise environments with thousands of VMs and hundreds or thousands of hosts. A world that consists of at most three hosts. I’m working with quite a few customers that are using VMware vSphere Essentials Plus. Those environments typically consist of two or three hosts and somewhere between 10 and 100 VMs. Just to mention it: I don’t have any VMware vSphere Essentials customers. I can’t see any benefit in buying that license. Most of these environments are designed for a lifetime of three to four years. After that time, I come back and replace them with new gear. I can’t remember any customer that upgraded their VMware vSphere Essentials Plus license. Even if the demands on the IT infrastructure increase, the license stays the same. The hosts and storage get bigger, but the requirements stay the same: HA, vMotion, sometimes vSphere Replication, often (vSphere API for) Data Protection. Maybe this is a German thing, and customers outside of Germany are growing faster and investing more in their IT.

Useful stuff about Nutanix

Nutanix was founded in 2009 and left stealth mode in 2011. Their Virtual Computing Platform combines storage and computing resources in a building-block scheme. Each appliance consists of up to four nodes with local storage (SSDs and rotating rust). At least three nodes are necessary to form a cluster. If you need more storage or compute resources, you can add more appliances, and thus nodes, to the cluster (scale-out). Nutanix scales proportionately with cluster growth. The magic is not the hardware - it’s the software. The local storage resources of each appliance are passed to the Nutanix Controller VM (CVM). The CVM serves I/O and storage to the VMs and runs on each node, regardless of the hypervisor. You can run VMware ESXi, Microsoft Hyper-V and KVM on the nodes. Although the Nutanix Distributed File System (NDFS) is stretched across all nodes, I/O for a VM is served by the local CVM. The storage can be presented to the hypervisor via iSCSI, NFS or SMB3.