The Top 10 Things to Check for a healthy vSAN Cluster


1-vSAN Metrics
Topic: Performance and Troubleshooting
Problem: Poor performance
Impact: High. Workloads might not receive the resources they need for baseline performance
Cause: Host, Device or Network failure. Suboptimal vSAN Design. Design or Sizing not aligned with Best Practices
Max Disk Group Congestion
Read Cache / Write Cache Latency (ms)
Avg Read / Write Latency (ms)
vSAN Port Group Packets Dropped
Capacity Disk Latency (ms)
Min Disk Group Write Buffer free (%)
Sum Disk Group Errors
Read Cache Hit Rate (%) (Hybrid vSAN Cluster)
Read Cache Miss Rate Ratio (Hybrid vSAN Cluster)
Best Practice: Align Cache, Endurance and Capacity disks with the expected Workload behaviour (Write-intensive, Read-intensive or Mixed-use)

2-What if
Topic: Potential failures on Host Resources or Fault Domains
Problem: After a failure, the vSAN Cluster doesn’t have the minimum Resources needed to provide the Availability required by the PFTT Policy Rule
Impact: Medium-High. Components might be in a Degraded, Absent or Stale state. Some VM Objects might not be available
Cause: A Host in Maintenance mode, Network partition, Host Isolated, Controller Failure, Disk Failure
RVC: vsan.whatif_host_failures
vSphere Client Health Check -> Limits -> After 1 additional Host failure
ESXCLI: esxcli vsan health cluster get -t "After 1 additional host failure"
Best Practice: Don’t size Clusters with only the minimum number of Hosts
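
The host-count side of this check can be sketched in a few lines of Python. This is a minimal illustration, not a replacement for vsan.whatif_host_failures: it assumes the commonly documented minimums for disk-group based vSAN (2 x PFTT + 1 hosts for RAID-1 mirroring, 4 hosts for RAID-5 and 6 hosts for RAID-6), it ignores capacity entirely, and the function names are hypothetical.

```python
def min_hosts_required(pftt: int, ftm: str = "RAID-1") -> int:
    """Minimum hosts (or fault domains) needed to place an object.

    Assumption: the usual minimums for disk-group based vSAN.
    """
    if pftt < 1:
        raise ValueError("PFTT must be at least 1 for this check")
    if ftm == "RAID-1":
        # Mirroring: PFTT + 1 replicas plus PFTT witnesses.
        return 2 * pftt + 1
    if ftm == "RAID-5/6":
        # Erasure coding: RAID-5 (PFTT=1) needs 4 hosts, RAID-6 (PFTT=2) needs 6.
        if pftt not in (1, 2):
            raise ValueError("erasure coding supports PFTT 1 or 2 only")
        return 2 * pftt + 2
    raise ValueError(f"unknown FTM: {ftm}")


def can_rebuild_after_failure(hosts: int, pftt: int, ftm: str = "RAID-1") -> bool:
    """True if enough hosts remain after losing one to re-protect objects."""
    return hosts - 1 >= min_hosts_required(pftt, ftm)


if __name__ == "__main__":
    # A 3-host cluster with PFTT=1 mirroring has nowhere to rebuild after a failure.
    print(can_rebuild_after_failure(hosts=3, pftt=1))
    # A 4-host cluster with the same policy does.
    print(can_rebuild_after_failure(hosts=4, pftt=1))
```

A cluster that only meets the minimum keeps running after one failure, but it stays non-compliant until the failed host returns, which is why one extra host per cluster is worth the cost.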

3-Hardware Compatibility
Topic: vSAN Compatibility Guide (VCG)
Problem: Hardware not supported. Firmware and Drivers not validated
Impact: Medium-High. vSphere Health Check will show a warning or error. VMware support may not accept the ticket
Cause: The Hardware is not in the vSAN VCG for the current vSphere version. The Hardware-Firmware-Driver combination is not supported or validated for the current version. Firmware and/or Drivers were not updated after a vSphere Upgrade
vSphere Client vSAN Health Check -> Hardware compatibility
vSphere Client vSAN Health Check -> Online health -> vCenter Server up to date
esxcli vsan debug controller list
Best Practice: Use vSAN Ready Nodes if possible. Always check the VCG before Upgrading. Keep the vSAN HCL DB (vCenter Health Check) up to date.

4-Network Performance
Topic: Network Configuration and Bandwidth
Problem: Network misconfiguration, physical errors, dropped packets, poor performance
Impact: High. Network problems might result in Isolated Hosts, vSAN Cluster Partitions and reduced Availability and Performance
Cause: Not following the Best Practices for Network Design. The Network resources provided for the vSAN VMkernel are insufficient. Potential failures in the Physical layer
Sum vSAN Portgroup Packets Dropped (%)
Total Throughput (KBps)
vSphere Client vSAN Health Check -> Network -> Hosts with connectivity issues
Best Practice: 10Gbps for All-Flash at a minimum. QoS at the physical layer. NIOC if you share vmnics. Jumbo Frames and one VLAN per vSAN Cluster. Enable vDS with Health Check in vCenter.

5-vSAN Components Resynchronizing
Topic: vSAN Object Compliance
Problem: After a Failure or Rebalance the vSAN Cluster has to re-create Components. While that process takes place it is not recommended to run any Maintenance task such as an Upgrade, applying a new Policy to existing VMs, forcing a Proactive Rebalance or putting a Host in Maintenance mode.
Impact: Medium-High. Performance might be affected. Depending on the Available Resources and the PFTT and FTM policies, putting a Host in Maintenance mode might affect the Availability of some Components.
Cause: Host or Device failure, proactive or reactive rebalance, Maintenance tasks or a vSAN Policy change.
vSphere Client -> vSAN Cluster -> Monitor -> vSAN -> Resyncing Components
RVC: vsan.resync_dashboard
PowerCLI -> Get-VsanResyncingComponent -Cluster $cluster
Best Practice: Provide enough Network resources and avoid deploying vSAN Clusters with only the minimum number of Hosts (based on the PFTT and FTM rules).

6-vSAN Hosts and KMS Clusters
Topic: vSAN Encryption
Problem: After a general outage of a vSAN Cluster with Encryption services enabled, the Hosts cannot reach the KMS Servers.
Impact: High. The Virtual Machines in that Cluster cannot be powered on.
Cause: A general outage that powered off all the Hosts and Virtual Machines, including vCenter Server VM.
vSphere Client vSAN Health Check -> Encryption -> vCenter and all hosts are connected to Key Management Servers
vCenter and the Hosts must be able to reach the KMS Cluster on port 5696
Best Practice: Avoid single points of failure. Add the KMS Cluster by IP address. Don't encrypt the vCenter VM.
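
A quick way to confirm that the KMS port answers is a plain TCP connection test. The sketch below uses only the Python standard library. The can_reach function and the kms01.example.local address are hypothetical, and the result reflects reachability from the machine that runs the script, which is not necessarily what vCenter or the ESXi Hosts see.

```python
import socket


def can_reach(host: str, port: int = 5696, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


if __name__ == "__main__":
    # Hypothetical KMS address; replace with the IPs of your KMS Cluster nodes.
    for kms in ["kms01.example.local"]:
        state = "reachable" if can_reach(kms) else "NOT reachable"
        print(f"{kms}:5696 is {state}")
```

This only proves that the port accepts connections. It does not validate certificates or the KMIP trust relationship, which the vSAN Health Check covers.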

7-Host Membership
Topic: vSAN Cluster Partitioned
Problem: The Host is not able to provide resources to the Cluster.
Impact: Medium-High. Some Objects will appear as non-compliant and some Components might be Absent.
Cause: A logical problem, network partition, misconfiguration or human error leaves the vSAN Cluster partitioned, a Host isolated, or a Host outside the Cluster membership (even if the vSphere Client shows the Host inside the Cluster in the UI).
esxcli vsan cluster get
RVC: vsan.cluster_info
vSphere Client vSAN Health Check -> Cluster -> vSphere cluster members
Best Practice: Follow the vSAN Network Design Best Practices. Avoid a SPOF.

8-Stretched Cluster Sites Connectivity
Topic: Stretched Cluster
Problem: Insufficient Bandwidth, high Latency and lost connectivity between Sites.
Impact: Medium. In the case of failures or high latency between Sites, Replicas might be impacted. A Witness failure results in Absent Components and non-compliant Objects, which puts the Cluster at Risk.
Cause: Poor network resources such as low Bandwidth, high Latency and unstable connectivity between Sites.
vSphere Client vSAN Health Check -> Stretched cluster
Available Bandwidth and Round Trip Latency between Sites (using 3rd party tools)
Best Practice: Follow the vSAN Network Design Best Practices for Stretched Cluster and 2 Node Cluster.

9-Available Capacity
Topic: vSAN Storage Capacity
Problem: Low available capacity in the vSAN Cluster.
Impact: High. This situation might create a Risk if any failure takes place. It will limit some maintenance tasks and may restrict the creation of new VMs.
Cause: The design didn't account for usable capacity, growth, snapshots, swap files, slack space or the impact of the Storage Policies.
Slack space (between 25% and 30%)
Total Disk Space (GB)
Disk Space Used (%)
Used Disk Space (GB)
Best Practice: Keep an additional 25%-30% of capacity free as Slack space. Consider the Cache:Capacity ratio when adding more capacity.
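
The slack rule can be turned into a small calculation from the Total Disk Space and Used Disk Space metrics listed above. The sketch below is illustrative only: slack_report is a hypothetical helper, the 25% default is the lower bound quoted in this section, and the sample numbers are made up.

```python
def slack_report(total_gb: float, used_gb: float, slack_pct: float = 25.0) -> dict:
    """Compare current usage against the slack space to keep free."""
    if total_gb <= 0 or not 0 <= used_gb <= total_gb:
        raise ValueError("expected 0 <= used_gb <= total_gb and total_gb > 0")
    used_pct = 100.0 * used_gb / total_gb
    free_pct = 100.0 - used_pct
    # Capacity that can still be consumed before eating into the slack reserve.
    headroom_gb = total_gb * (1 - slack_pct / 100.0) - used_gb
    return {
        "used_pct": round(used_pct, 1),
        "free_pct": round(free_pct, 1),
        "headroom_gb": round(headroom_gb, 1),
        "slack_ok": free_pct >= slack_pct,
    }


if __name__ == "__main__":
    # Made-up example: 100 TB raw capacity with 80 TB used.
    print(slack_report(total_gb=100_000, used_gb=80_000))
```

A negative headroom_gb means the Cluster is already using space that should be reserved for rebuilds and policy changes.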

10-Are you Following the vSAN Best Practices?
Topic: vSAN Best Practices to check
Two or more Disk Groups per Host
Two (or more) Disk Controllers per Host
QoS and Jumbo Frames
LACP (if already configured). Align physical switch configuration with vDS LACP
1 vSAN Cluster, 1 VMkernel PG, 1 VLAN
Use Passthrough Controller mode. Set 100% Read Cache on Controllers
Avoid Dedup and Compression on High-Performance Workloads
Sharing vmnics? Use vDS with NIOC. Configure Bandwidth reservation and high custom shares
Align Cache, Endurance and Capacity disks with the expected Workload behaviour (Write-intensive, Read-intensive or Mixed-use)
Deploy homogeneous Host Configurations for CPU, RAM, Network and Disk
Set Host Power Management in the BIOS to OS Controlled
Use multiple Storage Policies
Using controllers with high queue depth improves performance
Consider NVMe Devices for high performance
