Channel: High Availability (Clustering) forum
Viewing all 5648 articles

Stretched Cluster with active/active storage and split brain


Hello forums,

we have a problem understanding what will happen in a split brain situation in our current setup:

Site A:

- 2 Hyper-V Servers (but I guess for the question the type of application does not matter)
- 1 active storage node

Site B:

- 2 Hyper-V Servers (but I guess for the question the type of application does not matter)
- 1 active storage node

So 2 identical sites, each with an active storage node.
The storage is synchronous, so it will only send the ACK to the VM when both nodes have written the data.
Because of a quick interconnect, this performs well.
If the interconnect is broken, both nodes will continue to work and allow the VMs (or whatever wants to write data) to write their data.

If we now use a disk witness as the tie breaker for the cluster and the interconnect is broken, both sites will see 2 Hyper-V hosts and the quorum lun.
If my understanding is right, then both would assume the majority and start all VMs, which is the worst scenario you can have...

Do we have to use a file share witness (on a VM / another server), or can we somehow prevent such a scenario?
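For what it's worth, a file share witness at a third site is the usual answer here: unlike a disk witness that lives on the replicated storage (and is therefore visible to both partitions after the interconnect fails), only one partition can take ownership of the share. A sketch, with placeholder cluster and share names:

```powershell
# Sketch: point the cluster at a file share witness hosted at a third site,
# reachable from Site A and Site B independently of the inter-site link.
# 'MyCluster' and the share path are placeholders.
Set-ClusterQuorum -Cluster 'MyCluster' -NodeAndFileShareMajority '\\siteC-server\ClusterWitness'
```

The key design point is that the witness server must not depend on the same interconnect the two sites use to talk to each other.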


Nr of VMs per LUN (CSV)

I have a seven-node Hyper-V cluster (Server 2012 R2) with 133 VMs currently deployed. I'm using only one CSV LUN of 10 TB, of which 8 TB is used by the VMs. I have not had any problems with LUN performance; I'm monitoring it with SC Operations Manager.
Somehow I've missed that it is recommended to create multiple LUNs for better performance.
I've read some blogs, but can anyone give me recommendations on the best way to do this?
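If you do decide to spread the VMs over several CSVs, adding a new volume is roughly the sketch below (the new LUN must already be presented to every node):

```powershell
# Sketch: take the first disk visible to all cluster nodes, add it to the
# cluster, and convert the resulting disk resource to a Cluster Shared Volume.
$disk = Get-ClusterAvailableDisk | Select-Object -First 1
$res  = Add-ClusterDisk -InputObject $disk
Add-ClusterSharedVolume -Name $res.Name
```

After that, new VMs (or storage-migrated existing ones) can be placed on the new C:\ClusterStorage mount point.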

CAU Hotfix Plugin - The plug-in argument HotfixRootFolderPath has invalid value


Hi. I have a 2012 R2 cluster configured for CAU in self-updating mode with both the WindowsUpdate and Hotfix plug-ins. The configuration went fine; however, when I try to run CAU using these options, it fails with the error "The plug-in argument HotfixRootFolderPath has invalid value".

I've repeatedly checked that the path is correct and browsable and has all the permissions it should have. I've tried DisableAclChecks both True and False; it didn't make a difference. The path contains a space, so I tried enclosing it in double quotes; that didn't help either.

I've run CAU from the GUI; here's the command it generates:

Invoke-CauRun -ClusterName cluster01 -CauPluginName 'Microsoft.WindowsUpdatePlugin','Microsoft.HotfixPlugin' -CauPluginArguments @{ 'HotfixConfigFileName' = 'DefaultHotfixConfig.xml'; 'DisableAclChecks' = 'False'; 'HotfixRootFolderPath' = '\\fileserver\CAU\Windows Server 2012 R2\Hotfixes\Hyper-V\Root'; 'IncludeRecommendedUpdates' = 'True'; 'RequireSmbEncryption' = 'True' } -MaxFailedNodes -1 -MaxRetriesPerNode 3 -EnableFirewallRules -FailbackMode Immediate -Force

The root folder contains DefaultHotfixConfig.xml per documentation and also there's CAUHotfix_All folder (currently empty as there are no hotfixes I need to install).

As I've said above, I tried modifying the path in the command above to 'HotfixRootFolderPath' = '"\\fileserver\CAU\Windows Server 2012 R2\Hotfixes\Hyper-V\Root"', which didn't help.

Any idea what's wrong?
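One thing worth ruling out is that every cluster node (not just the machine you run CAU from) can reach the share under the relevant credentials. A quick sketch, using the path from the post:

```powershell
# Sketch: check from each cluster node that the hotfix root and its config
# file are reachable. Run from a machine with the FailoverClusters module.
$path = '\\fileserver\CAU\Windows Server 2012 R2\Hotfixes\Hyper-V\Root'
Invoke-Command -ComputerName (Get-ClusterNode -Cluster 'cluster01').Name -ScriptBlock {
    param($p)
    Test-Path -LiteralPath $p
    Test-Path -LiteralPath (Join-Path -Path $p -ChildPath 'DefaultHotfixConfig.xml')
} -ArgumentList $path
```

If any node returns False, the plug-in would see the path as invalid even though it looks fine from your workstation.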



An error was encountered while configuring the clustered virtual machine


Hello.

I have 3 VMs on a Hyper-V server (2012 R2) which is part of a failover cluster.

These machines are not highly available, but I would like to have them highly available, added to my cluster.

These VMs have 4-5 snapshots and when the snapshots were taken, there was an ISO file mounted, from a drive which no longer exists.

When I try to create an HA VM role in the cluster, I select the VM which is not HA, and I get the error in the title.

I tried to see how to remove the ISO file from the snapshots, but the only solution I found was to delete the snapshots, which is not possible as I need them. Is there a PowerShell solution to get rid of the ISO file from the snapshots? Is there another way to do this?
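There is a PowerShell route worth trying: the virtual DVD drive of a checkpoint can be cleared without deleting the checkpoint itself. A sketch, run on the Hyper-V host ("MyVM" is a placeholder):

```powershell
# Sketch: for every checkpoint of the VM, eject the mounted ISO from the
# virtual DVD drive(s) by setting the media path to $null.
# The checkpoints themselves are left intact.
Get-VMSnapshot -VMName 'MyVM' |
    Get-VMDvdDrive |
    Set-VMDvdDrive -Path $null
```

Verify afterwards with `Get-VMSnapshot -VMName 'MyVM' | Get-VMDvdDrive` that no drive still references the missing ISO, then retry making the VM highly available.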

Thank you,

Gabi

New-Cluster tries to cluster wrong disks


Dell chassis with two blade servers and storage.

Three virtual disks on the chassis presented to both servers.

Each server has two disks on local storage, for local C and D drives.

Windows Server 2012 R2

I am trying to automate the configuration of the servers, including clustering them.

When I run Test-Cluster, the report shows it identified all of the drives on each node, and knows that only the three shared disks are to be validated for clustering.

When I run:

New-Cluster -Name Cluster1 -Node Server1, Server2 -StaticAddress 10.10.10.10

the three shared disks are correctly clustered.

But it is also doing something with the local D: drives. After the command runs, the disks containing the D: drives are offline and read-only on both servers. If I try to bring one of them online, I get:

set-disk : The specified object is managed by the Microsoft Failover Clustering component. The disk must be in cluster
maintenance mode and the cluster resource status must be online to perform this operation.

How do I get New-Cluster to ignore my local D: drives?
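A common workaround (a sketch, not a guaranteed fix for your hardware) is to create the cluster with -NoStorage so nothing is auto-clustered, then add only the disks the cluster can see from all nodes:

```powershell
# Sketch: build the cluster without touching any disks, then explicitly add
# only the shared disks. Get-ClusterAvailableDisk returns only disks visible
# to every node, so the local D: drives are never considered.
New-Cluster -Name Cluster1 -Node Server1, Server2 -StaticAddress 10.10.10.10 -NoStorage
Get-ClusterAvailableDisk | Add-ClusterDisk
```

This keeps the automation fully scripted while sidestepping whatever heuristic New-Cluster is applying to the local drives.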


Tim Curwick
MadWithPowerShell.com

virtual lab space


hey all,

What is the link to the Clustering test environment, please?

Failover Happened on RTO?


Hello Everyone,

I'm using a Windows 2012 R2 failover cluster. Can anyone help me understand how many RTOs must occur on the network before a failover to another node is initiated?

NLB and the hosts it load balances


Hi,

I have been playing around with NLB and noticed some unexpected behaviour. 

Scenario: I have 2 NLB hosts load balancing 2 web servers, e.g. Server1 running NLB and Apache1, and Server2 running NLB and Apache2. Single affinity is enabled with a timeout of 5 minutes.

Issue / understanding 1: If I hit Server1 then I only ever see the Apache1 instance. I assume this is because (a) the affinity is between the client and NLB, not the client and the host, and (b) once the request hits Server1 it only uses the Apache instance on Server1 (I expected NLB on Server1 to be able to see both Apache instances). Is this understanding correct?

Issue / understanding 2: If I stop Apache1 on this server (reflecting a failure of the Apache instance), then I just get blank pages, as NLB is still running and accepting requests. It seems unable to detect that there is an issue with the service it is load balancing. Is this correct?

If these are both right, then how is NLB fault tolerant under affinity? Surely once Apache fails on Server1, almost 50% of all requests will continue to fail unless NLB is stopped on that host.

Sorry but I appear to have either a misunderstanding of how NLB works or my environment is not configured correctly.

Thanks,

Eddie



Windows 2012 : sharing tab slow


We have a cluster with two Windows Server 2012 R2 nodes.

When we create a share inside another share, the Sharing tab is very slow to appear (about 2 minutes).

We thought it is because the root folder has many subfolders, but with an empty root folder the behavior is the same.

Does anyone know how to fix this behavior?

Windows Failover Cluster - Architecture documents??


Hi guys !!

I am looking for information about Windows Failover Cluster architecture. I need in-depth internal info.

Does anybody know where to find this? I can't find info about the CLIUSR account, low-level networking, DLLs, and so on. (A customer wants to rename the CLIUSR account.)

Thanks in advance,

JD

trouble authenticating to the domain for Live Migration


Hi

I have a scenario in failover clustering which I am unsure on how to work around.

Our setup is a 2 node cluster, one node in building A, the other in building B.

In each building we have the following setup:

Datacenter switch where Live Migration, Heartbeat, and Disk connection goes through

A main public switch where Host management and Virtual switch connect through.

A physical domain controller is a separate physical machine and utilises the public switch

The cluster communication therefore is private between the datacenter switches

Earlier today our public switch crashed. At the time we didn't know what caused the problem, but it prevented the host in Building A from talking to the DC in Building A and to the host in Building B. According to the cluster everything was fine, except that initially the cluster was partitioned, and then it figured out that only Building A was not contactable. I first tried a live migration but this failed (I see event errors 21502, 2050 and 2051); I had to use quick migration, and this worked. As far as I can tell, the problems related to authentication to the domain (all workloads were running in Building A, by the way), which makes sense since the Building A host couldn't talk to a DC, and the DC in Building B is not routable through the datacenter switch (a fibre optic link joins the two data switches between the buildings).

How can I use live migration when I can’t authenticate to a domain controller – is this the Kerberos authentication option in the hyper-v configuration (constrained delegation) or will this not help me?

Since the two nodes can talk to each other through the datacenter switches and belong to the same cluster…. I thought the node wouldn’t need to contact AD for authentication, or if it did then the second node would have some sense to talk to it on behalf of the other server – maybe I am expecting too much from the inner workings of the cluster to do this.

Any suggestions on what I need to do to get this to work under this scenario? We cannot re-route physical cables between the two buildings, unfortunately, and I am not looking for "buy additional switches" as an answer; I already know that. I am specifically looking for a server configuration change which could address this, if there is one.
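For reference, the Kerberos option mentioned above is configured roughly as in the sketch below, but note the caveat: Kerberos authentication still needs a reachable KDC (domain controller), so constrained delegation mainly helps when a migration is initiated remotely, not when no DC is reachable at all. Node names are placeholders:

```powershell
# Sketch: set live migration authentication to Kerberos on both nodes
# (the default is CredSSP). Constrained delegation must additionally be
# configured on each host's AD computer account for this to work when a
# migration is initiated from a third machine.
Set-VMHost -ComputerName 'NodeA', 'NodeB' -VirtualMachineMigrationAuthenticationType Kerberos
```

In your outage scenario, making a DC reachable over the surviving (datacenter) network path is likely the actual requirement, since the cluster nodes cannot cache each other's Kerberos tickets indefinitely.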

Many thanks

Steve

Winrm and URL Prefix

My company changed the default wsman URL prefix for WinRM. As a consequence, we cannot access file shares on the failover cluster. We get event error codes 142 and 49 in the Windows Remote Management event logs. Changing the URL prefix back to wsman resolves the issue. Is there a way to resolve this while keeping our changed URL prefix?

Changing Cluster IP and Cluster Name


Hi All,

We have a 2-node cluster on physical Windows 2008 Enterprise servers.

We are planning to change the cluster IP and cluster name of this existing cluster. I am aware of the steps:

The cluster name and cluster IP can be changed directly from the Failover Cluster Manager console.

However, I am not sure whether the above change can be carried out without an outage, or whether we need to stop all Services and Applications and then shut down the cluster.

Could anyone please guide with their experience and expertise. Thanks

Problems live migrating some VMs between nodes - 2012 R2 Cluster


Today we migrated from an old 2008 R2 cluster to a new 2012 R2 cluster. We are having problems with live migration of selected VMs between the 2 nodes currently in the cluster.

When we attempt this we see the following error in failover cluster manager

Live migration of 'SCVMM RC-LYNC-MON' failed.

Virtual machine migration operation for 'RC-LYNC-MON' failed at migration source 'RC-HYPERV-4'. (Virtual machine ID D3E6F2DF-A25B-459B-944D-A35979963B92)

The Virtual Machine Management Service failed to establish a connection for a Virtual Machine migration with host 'RC-HYPERV-5': General access denied error (0x80070005).

The Virtual Machine Management Service failed to establish a connection for a Virtual Machine migration because the destination host rejected the request: General access denied error (0x80070005).

The following is seen on the destination node:

The Virtual Machine Management Service denied a request for a Virtual Machine migration at the destination host: General access denied error (0x80070005).

The Virtual Machine Management Service blocked a connection request for a Virtual Machine migration from client address '192.168.4.6': General access denied error (0x80070005).

I've read things about permissions but am unsure what exactly I am looking at and where I should be looking.

For example, the ACL on this folder C:\ClusterStorage\Volume1\RC-LYNC-MON\Virtual Machines\D3E6F2DF-A25B-459B-944D-A35979963B92 is missing RC-HYPERV-5$ as an entry.  Other VMs that are migrating fine have this entry.

All VMs reside on a CSV, but I don't want to go altering permissions in case I make the situation worse!

Can anybody suggest some things to try?
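If you do want to try aligning the ACL with the VMs that migrate fine, a cautious sketch using the path and account from the post (DOMAIN is a placeholder; test on a single VM first):

```powershell
# Sketch: grant the destination host's computer account full control on the
# VM configuration folder, mirroring the entry seen on VMs that migrate OK.
# (OI)(CI) = inherit to files and subfolders; /T applies recursively.
$path = 'C:\ClusterStorage\Volume1\RC-LYNC-MON\Virtual Machines\D3E6F2DF-A25B-459B-944D-A35979963B92'
icacls $path /grant 'DOMAIN\RC-HYPERV-5$:(OI)(CI)F' /T
```

Since the working VMs already carry the RC-HYPERV-5$ entry, adding the same entry to the failing VM is a low-risk change, and icacls can back up the existing ACL first (`icacls $path /save acl-backup.txt /T`) if you want a rollback path.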


Server 2012 R2 Cluster Failover issues during catastrophic failure of iSCSI shared storage on Node 1 of 2


In a previous posting, I detailed a simple two-node Server 2012 R2 cluster configured with a single file server role. Both nodes access shared iSCSI storage. The client system performs a simple looped 64 KB read on a file share from the clustered server. This test applet will fail any time the ReadFile() API fails.

When one node experiences a complete power loss, the other node will take over and no failure will occur on the client system (other than an ~30 second delay where the synchronous ReadFile() API does not return).  This was detailed in the other post. The key point is that failover completed successfully and NO failure was seen on the client PC.

I want to test out a different failover scenario so with Node 1 being the current host, and the owner node for the file server role, quorum disk, and shared storage, I simulate a catastrophic failure of Node 1’s iSCSI storage.  The OS itself boots from a virtual ATA drive so the OS drive will remain alive.  What I do is go into device manager, find the NIC(s) that are configured for the iSCSI network, and then I disable them.  This causes all iSCSI I/O activity to fail.  When I perform this test, the following is seen from the client PC:

  • 11:20:01 - Node 1 loss of iSCSI shared storage
  • 11:22:01 - Synchronous ReadFile() returns after 120 seconds with ERROR_INVALID_HANDLE
  • 11:22:20 - CloseHandle() completes successfully after 19 seconds

I understand the 120 second delay.  It’s a combination of the iSCSI link down timer as well as the MPIO PDO remove period timer.  What I don’t understand is why the request failed.  I expected a failover and successful completion of the ReadFile like I saw when I did a power loss of Node 1.

What’s interesting is that Windows did move the host server, file server, and disk storage to Node 2. So a failover did occur. At least this time it did; other times, I have seen the file server role stopped. It appears that the failover occurred immediately after the client PC received the read failure. Why not complete the failover before failing the client ReadFile request? Are there timeout adjustments I could experiment with?

Here is Node 1's event log which shows the failover sequence from Node 1's perspective starting just after the catastrophic iSCSI loss on Node 1:

   ProviderName: Microsoft-Windows-FailoverClustering

TimeCreated                     Id LevelDisplayName Message
-----------                     -- ---------------- -------
7/29/2015 11:20:06 AM         1132 Information      Cluster network interface 'TESTCLUSTER1 - Ethernet 2' for node 'TESTCLUSTER1' on network 'Cluster Network 1' was removed.
7/29/2015 11:21:38 AM         1649 Information      Cluster resource 'Cluster Disk 1' in clustered role 'Cluster Group' has taken more than one minute to respond to a control code. The control code was 'STORAGE_GET_DISK_INFO_EX'.
7/29/2015 11:22:01 AM         5264 Information      Physical Disk resource 'd47df305-c3a6-4bbe-8475-48e1398bbee6' has been disconnected from this node.
7/29/2015 11:22:01 AM         5264 Information      Physical Disk resource 'b543b5ab-67c9-4836-89a1-ae5636916de5' has been disconnected from this node.
7/29/2015 11:22:01 AM         1637 Information      Cluster resource 'Cluster Disk 1' in clustered role 'Cluster Group' has transitioned from state Online to state ProcessingFailure.
7/29/2015 11:22:01 AM         1637 Information      Cluster resource 'Cluster Disk 1' in clustered role 'Cluster Group' has transitioned from state ProcessingFailure to state WaitingToTerminate. Cluster resource 'Cluster Disk 1' is waiting on the following resources: .
7/29/2015 11:22:01 AM         1637 Information      Cluster resource 'Cluster Disk 1' in clustered role 'Cluster Group' has transitioned from state WaitingToTerminate to state Terminating.
7/29/2015 11:22:01 AM         1637 Information      Cluster resource 'Cluster Disk 1' in clustered role 'Cluster Group' has transitioned from state Terminating to state DelayRestartingResource.
7/29/2015 11:22:01 AM         1637 Information      Cluster resource 'Cluster Disk 2' in clustered role 'DATAFS' has transitioned from state Online to state ProcessingFailure.
7/29/2015 11:22:01 AM         1637 Information      Cluster resource 'Cluster Disk 2' in clustered role 'DATAFS' has transitioned from state ProcessingFailure to state WaitingToTerminate. Cluster resource 'Cluster Disk 2' is waiting on the following resources: File Server (\\DATAFS).
7/29/2015 11:22:01 AM         1637 Information      Cluster resource 'File Server (\\DATAFS)' in clustered role 'DATAFS' has transitioned from state Online to state WaitingToTerminate. Cluster resource 'File Server (\\DATAFS)' is waiting on the following resources: .
7/29/2015 11:22:01 AM         1637 Information      Cluster resource 'File Server (\\DATAFS)' in clustered role 'DATAFS' has transitioned from state WaitingToTerminate to state Terminating.
7/29/2015 11:22:01 AM         1637 Information      Cluster resource 'File Server (\\DATAFS)' in clustered role 'DATAFS' has transitioned from state Terminating to state WaitingToComeOnline. Cluster resource 'File Server (\\DATAFS)' is waiting on the following resources: Cluster Disk 2.
7/29/2015 11:22:01 AM         1637 Information      Cluster resource 'Cluster Disk 2' in clustered role 'DATAFS' has transitioned from state WaitingToTerminate to state Terminating.
7/29/2015 11:22:01 AM         1637 Information      Cluster resource 'Cluster Disk 2' in clustered role 'DATAFS' has transitioned from state Terminating to state DelayRestartingResource.
7/29/2015 11:22:02 AM         1637 Information      Cluster resource 'Cluster Disk 1' in clustered role 'Cluster Group' has transitioned from state DelayRestartingResource to state OnlineCallIssued.
7/29/2015 11:22:02 AM         1637 Information      Cluster resource 'Cluster Disk 1' in clustered role 'Cluster Group' has transitioned from state OnlineCallIssued to state ProcessingFailure.
7/29/2015 11:22:02 AM         1633 Information      The Cluster service failed to bring clustered role 'Cluster Group' completely online or offline. One or more resources may be in a failed or an offline state. This may impact the availability of the clustered role.
7/29/2015 11:22:02 AM         1637 Information      Cluster resource 'Cluster Disk 1' in clustered role 'Cluster Group' has transitioned from state ProcessingFailure to state WaitingToTerminate. Cluster resource 'Cluster Disk 1' is waiting on the following resources: .
7/29/2015 11:22:02 AM         1637 Information      Cluster resource 'Cluster Disk 1' in clustered role 'Cluster Group' has transitioned from state WaitingToTerminate to state Terminating.
7/29/2015 11:22:02 AM         1637 Information      Cluster resource 'Cluster Disk 1' in clustered role 'Cluster Group' has transitioned from state Terminating to state CannotComeOnlineOnThisNode.
7/29/2015 11:22:02 AM         1153 Information      The Cluster service is attempting to fail over the clustered role 'Cluster Group' from node 'TESTCLUSTER1' to node 'TESTCLUSTER2'.
7/29/2015 11:22:02 AM         1637 Information      Cluster resource 'Cluster Disk 2' in clustered role 'DATAFS' has transitioned from state DelayRestartingResource to state OnlineCallIssued.
7/29/2015 11:22:02 AM         1637 Information      Cluster resource 'Cluster Disk 2' in clustered role 'DATAFS' has transitioned from state OnlineCallIssued to state ProcessingFailure.
7/29/2015 11:22:02 AM         1637 Information      Cluster resource 'Cluster Disk 2' in clustered role 'DATAFS' has transitioned from state ProcessingFailure to state WaitingToTerminate. Cluster resource 'Cluster Disk 2' is waiting on the following resources: File Server (\\DATAFS).
7/29/2015 11:22:02 AM         1637 Information      Cluster resource 'Cluster Disk 2' in clustered role 'DATAFS' has transitioned from state WaitingToTerminate to state Terminating.
7/29/2015 11:22:02 AM         1637 Information      Cluster resource 'File Server (\\DATAFS)' in clustered role 'DATAFS' has transitioned from state WaitingToComeOnline to state OfflineDueToProvider. Cluster resource 'File Server (\\DATAFS)' is waiting on the following resources: Cluster Disk 2.
7/29/2015 11:22:02 AM         1203 Information      The Cluster service is attempting to bring the clustered role 'Cluster Group' offline.
7/29/2015 11:22:02 AM         1637 Information      Cluster resource 'Cluster IP Address' in clustered role 'Cluster Group' has transitioned from state Online to state WaitingToGoOffline. Cluster resource 'Cluster IP Address' is waiting on the following resources: Cluster Name.
7/29/2015 11:22:02 AM         1637 Information      Cluster resource 'Cluster Name' in clustered role 'Cluster Group' has transitioned from state Online to state WaitingToGoOffline. Cluster resource 'Cluster Name' is waiting on the following resources: .
7/29/2015 11:22:02 AM         1637 Information      Cluster resource 'Cluster Name' in clustered role 'Cluster Group' has transitioned from state WaitingToGoOffline to state OfflineCallIssued.
7/29/2015 11:22:02 AM         1637 Information      Cluster resource 'Cluster Disk 2' in clustered role 'DATAFS' has transitioned from state Terminating to state CannotComeOnlineOnThisNode.
7/29/2015 11:22:02 AM         1204 Information      The Cluster service successfully brought the clustered role 'DATAFS' offline.
7/29/2015 11:22:02 AM         1637 Information      Cluster resource 'Cluster Name' in clustered role 'Cluster Group' has transitioned from state OfflineCallIssued to state OfflinePending.
7/29/2015 11:22:02 AM         1153 Information      The Cluster service is attempting to fail over the clustered role 'DATAFS' from node 'TESTCLUSTER1' to node 'TESTCLUSTER2'.
7/29/2015 11:22:02 AM         1203 Information      The Cluster service is attempting to bring the clustered role 'DATAFS' offline.
7/29/2015 11:22:02 AM         1637 Information      Cluster resource 'DATAFS' in clustered role 'DATAFS' has transitioned from state Online to state WaitingToGoOffline. Cluster resource 'DATAFS' is waiting on the following resources: .
7/29/2015 11:22:02 AM         1637 Information      Cluster resource 'DATAFS' in clustered role 'DATAFS' has transitioned from state WaitingToGoOffline to state OfflineCallIssued.
7/29/2015 11:22:02 AM         1637 Information      Cluster resource 'IP Address 10.18.236.0' in clustered role 'DATAFS' has transitioned from state Online to state WaitingToGoOffline. Cluster resource 'IP Address 10.18.236.0' is waiting on the following resources: DATAFS.
7/29/2015 11:22:02 AM         1637 Information      Cluster resource 'DATAFS' in clustered role 'DATAFS' has transitioned from state OfflineCallIssued to state OfflinePending.
7/29/2015 11:22:02 AM         1637 Information      Cluster resource 'Cluster Name' in clustered role 'Cluster Group' has transitioned from state OfflinePending to state OfflineSavingCheckpoints.
7/29/2015 11:22:02 AM         1637 Information      Cluster resource 'Cluster Name' in clustered role 'Cluster Group' has transitioned from state OfflineSavingCheckpoints to state Offline.
7/29/2015 11:22:02 AM         1637 Information      Cluster resource 'Cluster IP Address' in clustered role 'Cluster Group' has transitioned from state WaitingToGoOffline to state OfflineCallIssued.
7/29/2015 11:22:02 AM         1637 Information      Cluster resource 'Cluster IP Address' in clustered role 'Cluster Group' has transitioned from state OfflineCallIssued to state OfflineSavingCheckpoints.
7/29/2015 11:22:02 AM         1637 Information      Cluster resource 'Cluster IP Address' in clustered role 'Cluster Group' has transitioned from state OfflineSavingCheckpoints to state Offline.
7/29/2015 11:22:02 AM         1204 Information      The Cluster service successfully brought the clustered role 'Cluster Group' offline.
7/29/2015 11:22:02 AM         1641 Information      Clustered role 'Cluster Group' is moving to cluster node 'TESTCLUSTER2'.
7/29/2015 11:22:02 AM         1637 Information      Cluster resource 'DATAFS' in clustered role 'DATAFS' has transitioned from state OfflinePending to state OfflineSavingCheckpoints.
7/29/2015 11:22:02 AM         1637 Information      Cluster resource 'DATAFS' in clustered role 'DATAFS' has transitioned from state OfflineSavingCheckpoints to state Offline.
7/29/2015 11:22:02 AM         1637 Information      Cluster resource 'IP Address 10.18.236.0' in clustered role 'DATAFS' has transitioned from state WaitingToGoOffline to state OfflineCallIssued.
7/29/2015 11:22:02 AM         1637 Information      Cluster resource 'IP Address 10.18.236.0' in clustered role 'DATAFS' has transitioned from state OfflineCallIssued to state OfflineSavingCheckpoints.
7/29/2015 11:22:02 AM         1637 Information      Cluster resource 'IP Address 10.18.236.0' in clustered role 'DATAFS' has transitioned from state OfflineSavingCheckpoints to state Offline.
7/29/2015 11:22:02 AM         1204 Information      The Cluster service successfully brought the clustered role 'DATAFS' offline.
7/29/2015 11:22:02 AM         1641 Information      Clustered role 'DATAFS' is moving to cluster node 'TESTCLUSTER2'.


diskshadow and Hyper-V VMs on 2012 R2 CSV


Does diskshadow support taking snapshots of CSV volumes on Server 2012 R2?

I am unable to include the "Cluster Shared Volume VSS Writer" using the verify option.
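For reference, a minimal diskshadow session against a CSV path might look like the sketch below (assumes the CSV VSS writer is present and healthy; if it does not appear under `list writers`, the verify option will also fail to include it):

```
DISKSHADOW> set context persistent
DISKSHADOW> list writers
DISKSHADOW> add volume C:\ClusterStorage\Volume1
DISKSHADOW> create
```

Checking `vssadmin list writers` on the CSV owner node first is a quick way to confirm whether the "Cluster Shared Volume VSS Writer" is registered and in a stable state at all.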


Ayush

I got this error on my Windows Server 8. Can anyone help me?

Cluster network name resource 'GLOB-PAXFS1' failed to create its associated computer object in domain 'AUHD.LOC' for the following reason: Unable to obtain the Primary Cluster Name Identity token.

The text for the associated error code is: An attempt has been made to operate on an impersonation token by a thread that is not currently impersonating a client.

Random VM reboots with Veeam B&R


I’m wondering if anyone can help us with the issue below.

We are currently running around 12 HA VMs, on a 2-node Windows Server 2012 R2 Hyper-V cluster. VM storage is housed on SMB 3.0 shares, which are running from a 2-node Windows Server 2012 Scale-out File Server cluster.

2 SMB 3.0 shares have been provisioned from the 2 CSVs presented by the SOFS cluster. Each SOFS cluster node is an owner of a CSV. Storage hardware is a Lenovo ThinkServer JBOD.

For VM backup, we are utilising Veeam Backup and Replication 8, with update 2b applied. The backup run begins at 10pm each evening, by means of a scheduled PowerShell script.

The backup completes successfully, but we have found over the past week or so that upon arriving at the office the next day, there are multiple 1069 cluster events for VMs which have rebooted at random.

The VMs in question are random in terms of which ones reboot each evening.

In an effort to find the root cause of the problem, we disabled all Veeam VM backups one evening. The following morning, our Hyper-V cluster reported that it had gone the entire period without having any issues.

We then manually ran the backup script during office hours and waited for any issues. The backup ran without issue until it came to one of the last few VMs.

What then happened was that 3 of the VMs restarted: specifically EMAIL, PRINTSRV & TS4. The VMs rebooted during the TS4 backup.

These restarted between 12.47pm and 12.48pm, and all 3 came back up. There doesn’t appear to be any link between the three (apart from the fact that all 3 were running on the same HV node). What’s more odd is that the reboots occurred well after the backups of 2 of the VMs had completed.

I should add that there were other VMs running on that same HV node.

Backup completion times:

EMAIL – 10.49am

PRINTSRV – 12.08am

TS4 – 12.55am

The backup then proceeded until reaching the penultimate VM. Suddenly, I noticed that all VMs on all HV nodes lost connection to their storage and were either turning off or starting up on another node of the HV cluster. A few seconds after seeing this, I checked the logs for our SOFS cluster and noticed that RHS had stopped unexpectedly, which caused the file cluster to restart and the VMs to bomb out.

Amazingly, the Veeam backup proceeded to back up the last VM once both clusters returned to a normal running state.

Does anyone have any ideas what is causing this problem? I keep reading about disabling ODX in Server 2012 for storage hardware that doesn’t support it.

All I know is that running the backup causes problems.
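For reference, ODX is disabled via a registry value; the sketch below shows the commonly documented method (not a confirmed fix for this particular issue). It would need to be applied to the Hyper-V and SOFS nodes, followed by a reboot, in a maintenance window:

```powershell
# Sketch: disable ODX (FilterSupportedFeaturesMode: 1 = disabled, 0 = enabled),
# then read the value back to verify.
Set-ItemProperty -Path 'HKLM:\SYSTEM\CurrentControlSet\Control\FileSystem' `
    -Name 'FilterSupportedFeaturesMode' -Value 1
Get-ItemProperty -Path 'HKLM:\SYSTEM\CurrentControlSet\Control\FileSystem' `
    -Name 'FilterSupportedFeaturesMode'
```

Since the JBOD may not support ODX correctly, ruling this in or out is cheap compared with continued random reboots.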

Many thanks.


Windows Clustering AND Network card Power Management

Should power management on the network card be disabled in the Windows OS for a Windows cluster heartbeat NIC? Are there any articles, KBs, etc.?
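The general guidance is yes: a NIC that is allowed to sleep can cause missed heartbeats. On Server 2012 and later, the setting can be turned off per adapter from PowerShell (a sketch; the adapter name is a placeholder):

```powershell
# Sketch: disable power management on the heartbeat adapter, then confirm
# the resulting settings.
Disable-NetAdapterPowerManagement -Name 'Heartbeat'
Get-NetAdapterPowerManagement -Name 'Heartbeat'
```

On older systems the same effect is achieved from Device Manager (adapter Properties, Power Management tab, untick "Allow the computer to turn off this device to save power").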

John M. Couch

Add disk to geo-cluster unavailable


Dear all,

I have a little specific problem: I can't add a disk to the cluster.

Here is a short description.

I created a 2-node cluster, both nodes in the same network but on different BladeCenters. VS-11 and VS-12 are the names of the servers. I created a cluster with the name VS-10, and I tried to add 1 TB of storage to each server. On both servers the storage is visible and I can write to it. BUT, when I try to add it to the cluster, I can't; no storage is found. It is visible only from one node (VS-11). On VS-12 the storage is unavailable from the cluster, i.e. I can't add it to the cluster. I know that in a geo-cluster it is normal to have one store offline, depending on the owner, but in this case I can't add it at all.

I have attached some screenshots about it.

Please, can somebody advise or help me?

If somebody is interested, I can give a TeamViewer account to check it.


