I am running a Windows Server 2012 R2 Hyper-V cluster with JBOD storage attached via DataCore (over iSCSI).
Situation: within the last 6 months the following has occurred 3 times, each time on a different Hyper-V host. For some reason the host loses its connection to the virtual disk and cannot see the other Hyper-V servers in the cluster. The Failover Cluster service is stopped on the host and, according to the event logs, the Hyper-V host is removed from the failover cluster and later rejoins it.
The virtual disk is moved to another Hyper-V host, and all VMs are moved to other Hyper-V hosts and restarted.
By the time I look at the failover cluster, the Hyper-V host that lost the connection is already part of the cluster again. It owns a virtual disk and several VMs are running on it, so the failed Hyper-V host is fine again.
I am trying to find out what causes this behavior. All VMs on the affected host go through a crash restart, and that isn't something I want to see in my production environment.
There is no AV running on the Hyper-V hosts. The storage system/DataCore does not show any errors. No backups or snapshots are being taken during that time frame. I checked with our network admin whether the switches, routers, or firewall logged anything during that time frame, but nothing was found.
The errors I see are:
In the FailoverClustering Diagnostic event log, the first event I see is:
[NETFTAPI] Signaled NetftRemoteUnreachable event, local address x.x.x.x:3343 remote address x.x.x.x:3343
I see this event for all networks.
In the System event log, at the same time:
The cluster Resource Hosting Subsystem (RHS) process was terminated and will be restarted. This is typically associated with cluster health detection and recovery of a resource. Refer to the System event log to determine which resource and resource DLL is causing the issue. (Error ID 1146).
The next error I see is:
Cluster Shared Volume 'vDisk02-LUN02-IaaS-R01L' ('vDisk02-LUN02-IaaS-R01L') has entered a paused state because of '(c000000e)'. All I/O will temporarily be queued until a path to the volume is reestablished. (Error ID 5120)
Afterwards I see additional errors with IDs 1146 and 1135 (about removing Hyper-V hosts from the cluster), and the error:
Cluster Shared Volume XXX has entered a paused state because of '(c000026e)'. All I/O will temporarily be queued until a path to the volume is reestablished. (Error ID 5120).
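For reference, here is a minimal sketch of how the two status codes in the 5120 events break down as NTSTATUS values. The helper name decode_ntstatus and the two-entry lookup table are only for illustration; the symbolic names are the ones I understand ntstatus.h to define for these values, so treat them as an assumption to double-check.

```python
# NTSTATUS layout: bits 31-30 = severity, bits 27-16 = facility, bits 15-0 = code.
SEVERITY = {0: "SUCCESS", 1: "INFORMATIONAL", 2: "WARNING", 3: "ERROR"}

# Only the two codes seen in the 5120 events; names assumed from ntstatus.h.
KNOWN = {
    0xC000000E: "STATUS_NO_SUCH_DEVICE",
    0xC000026E: "STATUS_VOLUME_DISMOUNTED",
}


def decode_ntstatus(text):
    """Decode a hex status string as it appears in the event text, e.g. "(c000000e)"."""
    value = int(text.strip("()' "), 16)
    return {
        "value": f"0x{value:08X}",
        "severity": SEVERITY[value >> 30],
        "facility": (value >> 16) & 0xFFF,
        "code": value & 0xFFFF,
        "name": KNOWN.get(value, "unknown"),
    }


if __name__ == "__main__":
    for raw in ("(c000000e)", "(c000026e)"):
        print(decode_ntstatus(raw))
```

Both are error-severity codes. As far as I can tell, the first suggests the host could no longer see the device at all, and the second that the volume was dismounted afterwards.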
But none of this explains why RHS failed, or why the connection to the storage and to the other Hyper-V hosts was lost.
Any ideas what I could do to determine the cause?