Hi everyone,
we've got a cluster virtual disk whose partition table and volume name have been corrupted. Has anyone experienced a similar problem and has some hints on how to recover?
The problem occurred last Friday. I restarted node3 for Windows updates. During the restart, node1 had a bluescreen and also restarted. The Failover Cluster Manager tried to bring the cluster resources online but failed several times. Eventually the resource failover settled on node1, which came back up quickly after the crash. Many virtual disks were in an unhealthy state, but the repair process managed to repair all of them, so they are now healthy again. We cannot explain why node1 crashed. Since the storage pool uses dual parity, the virtual disks should remain available even with only 2 of the 4 nodes running.
One virtual disk, however, lost its partition information.
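For reference, this is roughly how we map the virtual disk to its Windows disk object and look at what is left of the partition table (a minimal sketch, run on the current owner node; the output will of course differ on other systems):

# Map the virtual disk to its disk object and show the partition style
Get-VirtualDisk -FriendlyName Archiv | Get-Disk |
    Format-List Number, PartitionStyle, OperationalStatus, HealthStatus

# List the partitions and any volumes that are still recognized on that disk
Get-VirtualDisk -FriendlyName Archiv | Get-Disk | Get-Partition |
    Get-Volume |
    Format-Table DriveLetter, FileSystemLabel, FileSystem, HealthStatus, Size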
Network config:
Hardware: 2x Emulex OneConnect OCe14102-NT, 2x Intel(R) Ethernet Connection X722 for 10GBASE-T
Backbone network: on the "right" Emulex port (the only members of this subnet are the 4 nodes)
Client-access teaming network: Emulex "left" and Intel "left" ports combined in a team; 1 untagged network and 2 tagged (VLAN) networks
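For completeness, this is roughly how the team and its member adapters can be dumped on each node (a sketch using the built-in LBFO cmdlets; run per node):

# Team mode and status
Get-NetLbfoTeam | Format-Table Name, TeamingMode, LoadBalancingAlgorithm, Status

# Member adapters - this is where the Emulex + Intel mix shows up
Get-NetLbfoTeamMember | Format-Table Name, InterfaceDescription, Team, OperationalStatus

# Team interfaces: the untagged default interface and the 2 tagged (VLAN) interfaces
Get-NetLbfoTeamNic | Format-Table Name, Team, VlanID, Primary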
Software Specs:
- Windows Server 2016
- Cluster with 4 cluster nodes
- Failover Cluster Manager + File Server roles running on the cluster
- 1 storage pool with 36 HDDs / 12 SSDs (9 HDDs / 3 SSDs per node)
- Virtual disks are configured to use dual parity:
Get-VirtualDisk Archiv | Get-StorageTier | fl

FriendlyName           : Archiv_capacity
MediaType              : HDD
NumberOfColumns        : 4
NumberOfDataCopies     : 1
NumberOfGroups         : 1
ParityLayout           : Non-rotated Parity
PhysicalDiskRedundancy : 2
ProvisioningType       : Fixed
ResiliencySettingName  : Parity
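To double-check the assumption above that the pool should survive two failed nodes, the redundancy settings of all tiers in the pool can be compared, roughly like this (a sketch; the pool name tn-sof-cluster is the one shown further below):

# PhysicalDiskRedundancy = 2 means each tier should tolerate two failed fault domains
Get-StoragePool -FriendlyName tn-sof-cluster | Get-VirtualDisk | Get-StorageTier |
    Format-Table FriendlyName, ResiliencySettingName, PhysicalDiskRedundancy, NumberOfColumns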
Hardware Specs per Node:
- 2x Intel Xeon Silver 4110
- 9 HDDs à 4 TB and 3 SSDs à 1 TB
- 32 GB RAM per node
Additional information:
The virtual disk is currently in a healthy state:
Get-VirtualDisk -FriendlyName Archiv
FriendlyName ResiliencySettingName OperationalStatus HealthStatus IsManualAttach   Size
------------ --------------------- ----------------- ------------ --------------   ----
Archiv                             OK                Healthy      True           500 GB
The storage pool is also healthy:
PS C:\Windows\system32> Get-StoragePool
FriendlyName   OperationalStatus HealthStatus IsPrimordial IsReadOnly
------------   ----------------- ------------ ------------ ----------
Primordial     OK                Healthy      True         False
Primordial     OK                Healthy      True         False
tn-sof-cluster OK                Healthy      False        False
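To rule out a broken physical disk, the disks behind the pool can be summarized like this (sketch):

# Summarize the 36 HDDs / 12 SSDs by media type and health
Get-StoragePool -FriendlyName tn-sof-cluster | Get-PhysicalDisk |
    Group-Object MediaType, HealthStatus |
    Format-Table Count, Name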
Since the incident, the event log (on the current owner node, node2) has shown various errors for this disk, such as:
[RES] Physical Disk <Cluster Virtual Disk (Archiv)>: VolumeIsNtfs: Failed to get volume information for \\?\GLOBALROOT\Device\Harddisk13\ClusterPartition2\. Error: 1005.
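Error 1005 is ERROR_UNRECOGNIZED_VOLUME, i.e. the cluster partition no longer contains a file system that NTFS recognizes. The per-node CSV state can be queried with something like this (a sketch; the resource name is the one from the log line above):

# Show how each node currently sees the CSV (direct, redirected, no access, ...)
Get-ClusterSharedVolumeState -Name "Cluster Virtual Disk (Archiv)" |
    Format-Table Node, StateDesc, FileSystemRedirectedIOReason, BlockRedirectedIOReason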
Before the incident we also had errors that might indicate a problem:
[API] ApipGetLocalCallerInfo: Error 3221356570 calling RpcBindingInqLocalClientPID.
Our suspicions so far:
We made registry changes under SYSTEM\CurrentControlSet\Control\Class\{4d36e972-e325-11ce-bfc1-08002be10318}\0001 (through 0009) and set the value PnPCapabilities to 280, which disables the checkbox "Allow the computer to turn off this device to save power".
However, not all network adapters support this checkbox, so the change may have had side effects.
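Instead of editing PnPCapabilities in the registry directly, the same checkbox can be inspected and disabled with the NetAdapter cmdlets, roughly like this (a sketch of what we intend to use going forward):

# Show the current power-management setting of every adapter
Get-NetAdapterPowerManagement | Format-Table Name, AllowComputerToTurnOffDevice

# Disable "Allow the computer to turn off this device to save power" on all physical adapters
Get-NetAdapter -Physical | ForEach-Object { Disable-NetAdapterPowerManagement -Name $_.Name }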
One curiosity: after the incident we noticed that one of the 2 tagged networks had the wrong subnet on two nodes. This may have caused some of the failover role switches that occurred on Friday, but we are unsure about the cause, since those networks had been configured correctly some time before.
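To catch that kind of drift earlier, the cluster networks and the per-node IP configuration can be compared, roughly like this (a sketch; node1 to node4 stand in for our actual host names):

# Subnets as the cluster sees them
Get-ClusterNetwork | Format-Table Name, Role, Address, AddressMask

# Per-node view of the IPv4 configuration
Invoke-Command -ComputerName node1, node2, node3, node4 {
    Get-NetIPAddress -AddressFamily IPv4 |
        Select-Object InterfaceAlias, IPAddress, PrefixLength
} | Sort-Object PSComputerName, InterfaceAlias |
    Format-Table PSComputerName, InterfaceAlias, IPAddress, PrefixLength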
We had a similar problem in our test environment after activating jumbo frames on the network interfaces. In that case we lost more and more file systems after moving the file server role to another server. In the end all file systems were lost and we reinstalled the whole cluster without enabling jumbo frames.
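If we ever try jumbo frames again, we would first verify that the setting is identical on every adapter of every node, e.g. like this (a sketch; *JumboPacket is the standardized advanced-property keyword, the display names differ between the Emulex and Intel drivers, and node1 to node4 are again placeholders):

# Compare the jumbo-frame setting across all nodes and adapters
Invoke-Command -ComputerName node1, node2, node3, node4 {
    Get-NetAdapterAdvancedProperty -RegistryKeyword "*JumboPacket" |
        Select-Object Name, DisplayName, RegistryValue
} | Format-Table PSComputerName, Name, DisplayName, RegistryValue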
We now suspect that combining two different network card models in the same network team may cause this problem.
What are your ideas? What may have caused the problem and how can we prevent this from happening again?
We could live with the loss of this virtual disk, since it only holds archive data and we have a backup, but we would like to be able to fix this problem.
Best regards
Tobias Kolkmann