Channel: High Availability (Clustering) forum

Always on Cluster Error


Hi,

I have a 3-node AlwaysOn cluster. The DRC node had a problem, so I evicted it from the cluster. I then rejoined the DRC node to the domain and was able to add it back to the Windows cluster. However, the cluster disks will not come online on the DRC server, and the cluster disk resources cannot be brought online either.
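For reference, a minimal PowerShell sketch of the evict / re-add sequence and the disk checks described above (cluster and node names are placeholders, not taken from the post):

# Placeholders: cluster 'SQLCLUS', re-added node 'DRCNODE'.
Remove-ClusterNode -Cluster SQLCLUS -Name DRCNODE -Force   # evict the failed node
Add-ClusterNode    -Cluster SQLCLUS -Name DRCNODE          # re-add it after the domain rejoin

# See which resources (including the Physical Disk resources) fail to come online and where.
Get-ClusterResource -Cluster SQLCLUS | Format-Table Name, ResourceType, State, OwnerNode, OwnerGroup

# Validate storage visibility from every node (note: the storage tests take clustered disks offline).
Test-Cluster -Node NODE1, NODE2, DRCNODE -Include 'Storage'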

Thanks


ADAM (AD LDS) sync issue


Dear All

I am facing an issue syncing an ADAM (AD LDS) server. The sync started and has now been running for a day; in LDP.exe the state still shows "running", and the log has grown to around 12 GB. Has anyone seen this before, and is such a long run time normal with a log of this size? Also, when I search for some users with LDP, no records are found so far. I do not know whether results will appear once the sync completes, or whether it is pulling nothing from AD at all and is simply running and growing the log while getting nothing.
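For context, a minimal sketch of how an ADAMSync pass is typically started with its own log so progress can be checked while it runs (the instance port, configuration DN and log path below are placeholders, not taken from the post):

# Placeholders: AD LDS instance on localhost:50000, sync partition "OU=Sync,DC=contoso,DC=com".
adamsync.exe /sync localhost:50000 "OU=Sync,DC=contoso,DC=com" /log C:\Temp\adamsync.log

# Tail the log from PowerShell to confirm objects are actually being written.
Get-Content C:\Temp\adamsync.log -Tail 20 -Wait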

SQL Cluster node not working


Hi Team,

I have a two-node SQL Server cluster on Windows Server 2012 R2.

Problem: the D: drive on Server A cannot be brought online (see the screenshot below for reference), and the quorum drive is gone on the server that had the D: drive.
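For reference, a minimal PowerShell sketch of checking the disk and quorum resources described above (run on either node):

# List resources (including the Physical Disk and quorum witness resources) with state and owner.
Get-ClusterResource | Format-Table Name, ResourceType, State, OwnerNode, OwnerGroup

# Show how the quorum is currently configured (disk witness, file share witness, ...).
Get-ClusterQuorum | Format-List *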

Stuck on joining state


Hi guys, I have a 2-node cluster and both nodes went down due to an issue.

Before the reboot, node1 was the primary node and node2 the secondary.

Node1 had problems rebooting, but node2 came back up fine after its reboot.

However, node2 is not able to start the cluster; when I check, its state is stuck on Joining.

Please let me know if you need more information. 
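For reference, a minimal sketch of the checks one might run on the surviving node (the node name is a placeholder):

# How each node currently sees cluster membership.
Get-ClusterNode | Format-Table Name, State

# Collect the last hour of the cluster log from the node stuck in Joining.
Get-ClusterLog -Node NODE2 -TimeSpan 60 -Destination C:\Temp

# Last resort, only if the partner node is confirmed down and quorum cannot be reached:
# force-start the cluster service without quorum on the surviving node.
net start clussvc /forcequorum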

S2D 2 node cluster


Hello,

We have a 2-node S2D cluster on Windows Server 2019. Between the two nodes there is a directly connected RDMA storage network (Cluster only) and a client-facing network based on LACP teaming on each node (Client and Cluster). Failover testing works: when we power off one node, the virtual machines migrate to the other host as expected. But when we unplug the client-facing adapters (the two adapters in the LACP team) on the node where the VMs reside, VM migration fails, and after some time the Cluster Network Name and Cluster IP Address resources fail as well. When we plug the client-facing adapters back into that node, the cluster IP address recovers and the VM client network works again.

So the problem is: migration fails after an unexpected loss of the client-facing network on the node where the VMs reside. The nodes can still communicate over the storage network and all nodes show as up in Failover Cluster Manager. When the client network is down, the VMs should migrate to the other node, which still has a working client-facing network, but instead the cluster resources fail and the VMs do not migrate. Where can we fix this behaviour? Has anyone seen this before?
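For reference, a minimal PowerShell sketch of the settings involved; the "Protected network" checkbox on each VM network adapter (Advanced Features in the VM settings) is what makes the cluster move a VM when its client network is disconnected:

# Which cluster networks carry cluster vs. client traffic; the client-facing network should be
# ClusterAndClient and the storage network Cluster only.
Get-ClusterNetwork | Format-Table Name, Role, Address, State

# State of the core network name / IP resources after the adapters are unplugged.
Get-ClusterGroup 'Cluster Group' | Get-ClusterResource | Format-Table Name, ResourceType, State, OwnerNode

# Connectivity state of the VM network adapters on the affected host.
Get-VM | Get-VMNetworkAdapter | Format-Table VMName, SwitchName, Connected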

windows server 2016 failover cluster virtual machine is in locked state


Please can anyone provide a solution:

Hyper-V replication stopped unexpectedly, and from that point onwards the automatic checkpoint creation has been stuck at 7%. I cannot access the VM, its status in Failover Cluster Manager shows as locked, and I am unable to remove the checkpoint.
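For reference, a minimal sketch of inspecting the replication state and the stuck checkpoint ('VM01' is a placeholder for the affected VM):

# Replication state and health for the VM.
Get-VMReplication -VMName VM01 | Format-List *

# List checkpoints, including recovery/backup checkpoints that the GUI may not show.
Get-VMSnapshot -VMName VM01
Get-VMSnapshot -VMName VM01 -SnapshotType Recovery

# Attempt to remove the stuck checkpoint (this merges the .avhdx back into the parent disk).
Get-VMSnapshot -VMName VM01 | Remove-VMSnapshot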

Cluster shared storage issue


Hi

I have a Windows Server 2012 R2 cluster with 7 drives shared from SAN storage.

I am now unable to open any of the 7 drives from either node; I get the error below.

C:\ClusterStorage\Volume1 is not accessible.

The referenced account is currently locked out and may not be logged on to.
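Since the error points at a locked-out account, a minimal sketch of checking the cluster name object (CNO) in Active Directory (the account name is a placeholder, and the AD cmdlets require the RSAT ActiveDirectory module):

# Find locked-out computer accounts, which would include a locked-out CNO.
Search-ADAccount -LockedOut -ComputersOnly | Format-Table Name, LockedOut, LastLogonDate

# Unlock the cluster computer account if it is the one affected ('CLUSTER1$' is a placeholder).
Unlock-ADAccount -Identity 'CLUSTER1$'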

Cluster IP keep switching


Dear All,

I have a cluster network name with 2 IPs, one active and one passive. When I run NSLOOKUP I get both IPs, and when I ping the cluster name it resolves to the passive IP instead of the active one (the passive IP does not respond, so the ping times out); it should resolve to the active IP. I deleted both A records in DNS and it worked fine for a while, but it then went back to the passive IP again. What I need is for a ping of the cluster name to always hit the active IP.
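For reference, a minimal sketch of the network name parameters that control which IPs are registered in DNS (the resource name 'Cluster Name' is the default; adjust it to the actual listener or network name resource):

# Current DNS-related parameters of the network name resource.
Get-ClusterResource 'Cluster Name' | Get-ClusterParameter |
    Where-Object { $_.Name -in 'RegisterAllProvidersIP', 'HostRecordTTL' }

# Register only the online IP and shorten the TTL, then cycle the resource so DNS is re-registered.
Get-ClusterResource 'Cluster Name' | Set-ClusterParameter -Name RegisterAllProvidersIP -Value 0
Get-ClusterResource 'Cluster Name' | Set-ClusterParameter -Name HostRecordTTL -Value 300
Stop-ClusterResource  'Cluster Name'
Start-ClusterResource 'Cluster Name'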

Thank you 


Not able to rebuild cluster, issue on disks ?


Hi all,

I have two Windows Server 2012 R2 machines (DB1A and DB1B) where a failover cluster and SQL Server Availability Groups used to work. But something went wrong (I don't really know what, maybe an aggressive GPO) and the cluster is now completely dead.

When I try to rebuild it, I get this kind of warning:

List Disks To Be Validated
Physical disk ab780ec8 is visible from only one node and will not be tested. Validation requires that the disk be visible from at least two nodes. The disk is reported as visible at node: DB1A
Physical disk ab780ec0 is visible from only one node and will not be tested. Validation requires that the disk be visible from at least two nodes. The disk is reported as visible at node: DB1A
No disks were found on which to perform cluster validation tests. To correct this, review the following possible causes:
* The disks are already clustered and currently Online in the cluster. When testing a working cluster, ensure that the disks that you want to test are Offline in the cluster.
* The disks are unsuitable for clustering. Boot volumes, system volumes, disks used for paging or dump files, etc., are examples of disks unsuitable for clustering.
* Review the "List Disks" test. Ensure that the disks you want to test are unmasked, that is, your masking or zoning does not prevent access to the disks. If the disks seem to be unmasked or zoned correctly but could not be tested, try restarting the servers before running the validation tests again.
* The cluster does not use shared storage. A cluster must use a hardware solution based either on shared storage or on replication between nodes. If your solution is based on replication between nodes, you do not need to rerun Storage tests. Instead, work with the provider of your replication solution to ensure that replicated copies of the cluster configuration database can be maintained across the nodes.
* The disks are Online in the cluster and are in maintenance mode.
No disks were found on which to perform cluster validation tests.

and when I open the Failover Cluster Manager, I can see the two nodes but nothing under the Roles folder, nor under Disks.
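For reference, a minimal sketch of the storage checks that go with the validation warning above (run the first command on each node; the disk number in the last line is a placeholder):

# Compare what each node can actually see.
Get-Disk | Format-Table Number, FriendlyName, SerialNumber, Size, IsClustered, OperationalStatus

# Re-run only the inventory and storage validation tests against both nodes
# (disks must not be online in a cluster for the storage tests to run).
Test-Cluster -Node DB1A, DB1B -Include 'Inventory', 'Storage'

# If stale SCSI persistent reservations from the dead cluster block access, clear them per disk.
Clear-ClusterDiskReservation -Disk 2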

Of course, SQL Server Availability Groups are not possible:


The local node is not part of quorum and is therefore unable to process this operation. This may be due to one of the following reasons:
•   The local node is not able to communicate with the WSFC cluster.
•   No quorum set across the WSFC cluster.

I'm a bit lost. It would be great if someone could help.

Windows Server 2016 cluster system Failover Cluster Validation Report shows error on the CNO


Hi All,

I'm having an issue with my Windows Server 2016 cluster system.
It consists of 2 nodes, say Node1 (showing as down) and Node2 (up).

Node1 can ping Node2 and vice versa, but I am not sure why it is showing as down.

The Failover Cluster Validation Report shows an error only for the CNO below:

  • The cluster network name resource 'PRDSQL-CLUS01' has issues in the Active Directory. The account could have been disabled or deleted. It could also be because of a bad password. This might result in a degradation of functionality dependent on the cluster network name. Offline the cluster network name resource and run the repair action on it. 
    An error occurred while executing the test.
    The operation has failed. An error occurred while checking the state of the Active Directory object associated with the network name resource 'Cluster Name'.

    Access is denied
This is the error logged from the Failover Cluster Manager.

Event ID 1069

Cluster resource 'Cluster Name' of type 'Network Name' in clustered role 'Cluster Group' failed. Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it. Check the resource and group state using Failover Cluster Manager or the Get-ClusterResource Windows PowerShell cmdlet.

Event ID 1688
Cluster network name resource detected that the associated computer object in Active Directory was disabled and failed in its attempt to enable it. This may impact functionality that is dependent on Cluster network name authentication.
Network Name: Cluster Name
Organizational Unit:
Guidance: Enable the computer object for the network name in Active Directory.

The virtual cluster front end called PRDSQL-CLUS01 is reporting that it is disabled in Active Directory, as per the above error.
 
I have tried:

• Taking the virtual endpoint offline and running a repair, but the errors state "File not Found" and "Error Displaying Cluster Information".
• Creating a blank role. SQL and CAU are still working; it is only the front-end failover cluster virtual network name AD account (the CNO) that has the issue.
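For reference, a minimal sketch of the AD-side checks for the CNO described above ('Cluster Name' is the default resource name; the AD cmdlets require the RSAT ActiveDirectory module and rights on the object):

# State of the CNO computer object in Active Directory.
Get-ADComputer -Identity 'PRDSQL-CLUS01' -Properties Enabled, LockedOut |
    Format-List Name, Enabled, LockedOut, DistinguishedName

# Re-enable the account if it is disabled.
Enable-ADAccount -Identity 'PRDSQL-CLUS01$'

# Then try to bring the network name resource back online from the cluster side.
Stop-ClusterResource  'Cluster Name'
Start-ClusterResource 'Cluster Name'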

Any help would be greatly appreciated.

Thanks,


/* Server Support Specialist */

windows HPC 2019 Pack Configuration


Hi,

We plan to configure Windows HPC Pack in our lab on the high-end workstations we have installed, and need some help getting started.

Partition information lost on cluster shared disk


Hi everyone,


We have a cluster virtual disk whose partition table and volume name are broken. Has anyone experienced a similar problem and got some hints on how to recover?


The problem occurred last Friday. I restarted node3 for Windows updates. During the restart, node1 had a bluescreen and also restarted. The failover cluster manager tried to bring the cluster resources online but failed several times. Eventually the resource-swapping came to rest on node1, which had come back up shortly after the crash. Many virtual disks were in an unhealthy state, but the repair process managed to repair them all, so they are now healthy. We are not able to explain why node1 crashed. Since the storage pool is in dual parity mode, the disks should be able to keep working even with only 2 nodes running.

One virtual disk, however, lost its partition information.


Network config:

Hardware: 2x Emulex OneConnect OCe14102-NT, 2x Intel(R) Ethernet Connection X722 for 10GBASE-T

Backbone network: on the "right" Emulex network card (the only members of this subnet are the 4 nodes)

Client-access teaming network: the "left" Emulex and "left" Intel cards in a team; 1 untagged network and 2 tagged networks


Software Specs:

    • Windows Server 2016
    • Cluster with 4 Clusternodes
    • Failover Cluster Manager + File Server Roles running on the cluster
    • 1 storage pool with 36 HDDs / 12 SSDs (9 HDD / 3 SSD on each node)
    • Virtual disks are configured to use dual parity:
Get-VirtualDisk Archiv | get-storagetier | fl

FriendlyName           : Archiv_capacity
MediaType              : HDD
NumberOfColumns        : 4
NumberOfDataCopies     : 1
NumberOfGroups         : 1
ParityLayout           : Non-rotated Parity
PhysicalDiskRedundancy : 2
ProvisioningType       : Fixed
ResiliencySettingName  : Parity

Hardware Specs per Node:

  • 2x Intel Xeon Silver 4110
  • 9 HDDs of 4 TB each and 3 SSDs of 1 TB each
  • 32GB RAM on each node

Additional information:

The virtual disk is currently in a Healthy state:

Get-VirtualDisk -FriendlyName Archiv

FriendlyName ResiliencySettingName OperationalStatus HealthStatus IsManualAttach   Size

------------ --------------------- ----------------- ------------ --------------   ----
Archiv                             OK                Healthy      True           500 GB
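Even though the virtual disk reports Healthy, the partition and volume layer underneath can be inspected separately; a minimal sketch ('Archiv' is the disk name from above):

# Map the cluster virtual disk to its Windows disk object (PartitionStyle RAW would mean the
# partition table is gone) and list its partitions and volumes.
Get-VirtualDisk -FriendlyName Archiv | Get-Disk |
    Format-List Number, PartitionStyle, OperationalStatus, HealthStatus
Get-VirtualDisk -FriendlyName Archiv | Get-Disk | Get-Partition
Get-VirtualDisk -FriendlyName Archiv | Get-Disk | Get-Partition | Get-Volume

# Repair/regeneration jobs that ran after the crash.
Get-StorageJob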


The storagepool is also healthy:

PS C:\Windows\system32> Get-StoragePool
FriendlyName   OperationalStatus HealthStatus IsPrimordial IsReadOnly

------------   ----------------- ------------ ------------ ----------
Primordial     OK                Healthy      True         False
Primordial     OK                Healthy      True         False
tn-sof-cluster OK                Healthy      False        False


Since the incident, the event log (on the current owner node, Node2) has shown various errors for this disk, such as:

[RES] Physical Disk <Cluster Virtual Disk (Archiv)>: VolumeIsNtfs: Failed to get volume information for \\?\GLOBALROOT\Device\Harddisk13\ClusterPartition2\. Error: 1005.


Before the incident we also had errors that might indicate a problem:

[API] ApipGetLocalCallerInfo: Error 3221356570 calling RpcBindingInqLocalClientPID.


Our suspicions so far:

We made registry changes under SYSTEM\CurrentControlSet\Control\Class\{4d36e972-e325-11ce-bfc1-08002be10318}\0001 (through 0009), setting the value PnPCapabilities to 280 (disabling the checkbox "Allow the computer to turn off this device to save power"). Not all network adapters support this checkbox, so this may have had some side effects.



One curiosity: after the error we noticed that one of the 2 tagged networks had the wrong subnet on two nodes. This may have caused some of the failover role switches that occurred on Friday, but we are unsure about the cause, since the networks had been configured correctly some time before.

We've had a similar problem in our test environment after activating jumbo frames on the network interfaces. In that case we lost more and more filesystems after moving the file server role to another server. In the end all filesystems were lost and we reinstalled the whole cluster without enabling jumbo frames.

We now suspect that maybe two different network cards in the same network team may cause this problem.

What are your ideas? What may have caused the problem and how can we prevent this from happening again?

We could endure the loss of this virtual disk since it was only archive data and we have a backup, but we'd like to be able to fix this problem.

Best regards

Tobias Kolkmann


Active Directory Domain Services: Best practice for 3 Domain Controller servers (High Availability), etc.


Hello. I have some clarifications.

1. What would be the best practice for an environment with 3 Domain Controller servers when I want it to be highly available?

2. If the primary DC server goes down because of hardware problems, which server takes its place?

3. How can I control which server users connect to if the primary server is down? (See the sketch after these questions.)

4. Once the primary server is up and running again, would I need to reconfigure AD DS to point out again which server users should connect to?
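For illustration, a minimal sketch of how to see which domain controller clients have located and which DC holds the PDC emulator role (the domain name is a placeholder):

# Which DC did this machine locate?
nltest /dsgetdc:contoso.com

# All DCs in the domain and any FSMO roles they hold.
Get-ADDomainController -Filter * | Format-Table HostName, Site, OperationMasterRoles
Get-ADDomain | Format-List PDCEmulator, RIDMaster, InfrastructureMaster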

Thanks!

Live Migration and WorkGroup Cluster on windows 2019


Hi ,

I found the following document about live migration and workgroup clusters on Windows Server 2016.

https://techcommunity.microsoft.com/t5/Failover-Clustering/Workgroup-and-Multi-domain-clusters-in-Windows-Server-2016/ba-p/372059

I understand that live migration is not supported, only quick migration. Is it the same on Windows Server 2019, or are there any plans to change this?
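For context, a minimal sketch of how a workgroup (AD-detached) cluster is created, as described in the linked article (names and the address are placeholders):

# Prerequisites on each node: a common local administrator account and a primary DNS suffix.
# Create the cluster with a DNS administrative access point instead of an AD computer object.
New-Cluster -Name WGCLUSTER -Node NODE1, NODE2 -AdministrativeAccessPoint Dns -StaticAddress 192.168.1.50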


Failed Server 2016 Rolling cluster upgrade.


Hi,

I have a 2-node Hyper-V Server 2012 R2 cluster. I would like to upgrade the cluster to 2016 to take advantage of the new nested virtualisation feature. I have been following this guide (https://technet.microsoft.com/en-us/windows-server-docs/failover-clustering/cluster-operating-system-rolling-upgrade) and have got to the point of adding the first upgraded node back into the cluster, but it fails with the following error.

Cluster service on node CAMHVS02 did not reach the running state. The error code is 0x5b4. For more information check the cluster log and the system event log from node CAMHVS02. This operation returned because the timeout period expired.

In the event log of the 2016 node I have these errors.

FAILOVER CLUSTERING LOG

mscs_security::BaseSecurityContext::DoAuthenticate_static: (30)' because of '[Schannel] Received wrong header info: 1576030063, 4089746611'

cxl::ConnectWorker::operator (): HrError(0x0000001e)' because of '[SV] Security Handshake failed to obtain SecurityContext for NetFT driver'

[QUORUM] Node 2: Fail to form/join a cluster in 6-7 minutes

[QUORUM] An attempt to form cluster failed due to insufficient quorum votes. Try starting additional cluster node(s) with current vote or as a last resort use Force Quorum option to start the cluster. Look below for quorum information,

[QUORUM] To achieve quorum cluster needs at least 2 of quorum votes. There is only 1 quorum votes running

[QUORUM] List of running node(s) attempting to form cluster: CAMHVS02, 

[QUORUM] List of running node(s) with current vote: CAMHVS02, 

[QUORUM] Attempt to start some or all of the following down node(s) that have current vote: CAMHVS01, 

join/form timeout (status = 258)

join/form timeout (status = 258), executing OnStop

SYSTEM LOG

Cluster node 'CAMHVS02' failed to join the cluster because it could not communicate over the network with any other node in the cluster. Verify network connectivity and configuration of any network firewalls.

Cluster failed to start. The latest copy of cluster configuration data was not available within the set of nodes attempting to start the cluster. Changes to the cluster occurred while the set of nodes were not in membership and as a result were not able to receive configuration data updates. .

The Cluster Service service terminated unexpectedly.  It has done this 1 time(s).  The following corrective action will be taken in 15000 milliseconds: Restart the service.

The Service Control Manager tried to take a corrective action (Restart the service) after the unexpected termination of the Cluster Service service, but this action failed with the following error: 
The service cannot be started, either because it is disabled or because it has no enabled devices associated with it.

For information: node 2 is the 2016 node, Windows Firewall is disabled on both nodes, and all interfaces can ping each other between nodes. There was one error in the validation wizard which may be related, but I thought it was caused by the different Windows versions.

CAMHVS01.gemalto.com

  Device-specific module (DSM) Name   Major Version   Minor Version   Product Build   QFE Number
  Microsoft DSM                       6               2               9200            17071

Getting information about registered device-specific modules (DSMs) from node CAMHVS02.gemalto.com.

CAMHVS02.gemalto.com

  Device-specific module (DSM) Name   Major Version   Minor Version   Product Build   QFE Number
  Microsoft DSM                       10              0               14393           0

For the device-specific module (DSM) named Microsoft DSM, versions do not match between node CAMHVS01.gemalto.com and node CAMHVS02.gemalto.com.
For the device-specific module (DSM) named Microsoft DSM, versions do not match between node CAMHVS02.gemalto.com and node CAMHVS01.gemalto.com.
Stop: 31/10/2016 09:03:47.
Indicates two revision levels are incompatible.
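For reference, a minimal sketch of the mixed-mode checks during a rolling upgrade (the cluster stays at the 2012 R2 functional level, 8, until Update-ClusterFunctionalLevel is run once all nodes are on 2016):

# Current functional level and per-node versions as seen by the cluster.
Get-Cluster | Select-Object Name, ClusterFunctionalLevel
Get-ClusterNode | Format-Table Name, State, MajorVersion, MinorVersion, BuildNumber

# Cluster log from the failing node around the Schannel/NetFT handshake errors.
Get-ClusterLog -Node CAMHVS02 -TimeSpan 30 -Destination C:\Temp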

Any help would be really appreciated.

Thanks



Error with Cluster-Aware Updating (CAU) in a Windows 2012 Failover Cluster


Hi,

I have configured self-updating CAU for a 2-node Windows 2012 failover cluster. I use the "Microsoft.HotfixPlugin" to install hotfixes on the cluster nodes.

In my lab there are 3 Windows 2012 servers: 1 AD server and 2 cluster nodes (W2016-N1 & W2016-N2). There is a shared folder on the AD server (BLUEAD01 - 10.10.10.10) with "Full Permission" for the cluster administrator (user name: Cluadmin; both nodes log in as the Blue.Local\Cluadmin user). I suspect the issue is related to WUSA and a conflict with security patches that have the *.msu extension.
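For reference, a minimal sketch of validating and scanning with the hotfix plug-in (the share path is a placeholder; the HotfixRootFolderPath argument and DefaultHotfixConfig.xml layout are as I recall them from the CAU hotfix plug-in documentation, so verify them with Get-CauPlugin on your nodes):

# Check CAU prerequisites on the cluster and list the registered plug-ins.
Test-CauSetup -ClusterName BLUECLUSTER        # cluster name is a placeholder
Get-CauPlugin

# Preview an updating run with the hotfix plug-in; the root folder must contain the expected
# folder structure and a DefaultHotfixConfig.xml describing the .msu packages.
Invoke-CauScan -ClusterName BLUECLUSTER -CauPluginName Microsoft.HotfixPlugin `
    -CauPluginArguments @{ HotfixRootFolderPath = '\\BLUEAD01\Hotfixes' }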

Please check attached snap. If you have any way out, please let me know.

-Suddhaman

Performance Issue on Storage Spaces Direct (Server 2019) - Getting High Read and Write Latency


Hello All,

On S2D I am having a performance issue with high read and write latency. For some days it has been getting worse: IOPS are not constant, reaching thousands one second and dropping to hundreds the next, and the same is happening with read and write throughput. We had a performance issue earlier as well, but IOPS were at least constant then. In Admin Center the IOPS and throughput graphs show spikes, and because of this the hosted VPSs hang and become slow.

I have configured S2D with 4 nodes that have NVMe for caching and SSD for capacity, as below:

Node 1: 1x 250 NVMe, 3x 1 TB SSD, no Hyper-V role

Node 2: 1x 500 NVMe, 3x 1 TB SSD, no Hyper-V role

Node 3: 2x 250 NVMe, 4x 1 TB SSD, Hyper-V role

Node 4: 2x 250 NVMe, 4x 1 TB SSD, no Hyper-V role

Nodes 5, 6, 7: no SSD or NVMe for storage, Hyper-V role only

All servers are connected over 10 Gb Ethernet, and the VM files are stored on CSVs.
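For reference, a minimal sketch of the built-in health and performance-history checks on a 2019 S2D cluster (run on any cluster node):

# S2D health summary and any outstanding faults.
Get-StorageSubSystem Cluster* | Get-StorageHealthReport
Get-StorageSubSystem Cluster* | Debug-StorageSubSystem

# How the disks are being consumed (cache vs. capacity) across the nodes.
Get-PhysicalDisk | Format-Table FriendlyName, MediaType, Usage, Size, HealthStatus

# Per-volume performance history (latency, IOPS, throughput) collected by Windows Server 2019.
Get-Volume | Get-ClusterPerformanceHistory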

Please suggest how to resolve the issue.

Evict node failed

Hi,

What is the procedure for removing a Hyper-V cluster node (mixed 2012 R2 and 2016) that can no longer be connected to the cluster?
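For reference, a minimal sketch of evicting an unreachable node and cleaning it up afterwards ('BADNODE' is a placeholder):

# Run from a working node: evict the node that can no longer join.
Remove-ClusterNode -Name BADNODE -Force

# If the evicted machine comes back online later, clear its stale local cluster configuration
# before rebuilding or re-adding it.
Clear-ClusterNode -Name BADNODE -Force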

Thank you.

HyperV Failover cluster backup leaving avhdx files


I am running a 4-node 2016 Hyper-V failover cluster with about 30+ servers. I just noticed that one of my servers is leaving behind a .avhdx, a .avhdx.mrt and a .avhdx.rct file after every backup with DPM 2016.

The machine lists no checkpoints under Hyper-V, either in the GUI or in PowerShell. The files are being created when DPM backs up the VM (which completes successfully). How do I figure out what is causing this, and how do I go about cleaning it up?
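The .avhdx.mrt and .avhdx.rct files are part of the Resilient Change Tracking that Hyper-V 2016 uses for backups and are expected to stick around; the lingering .avhdx itself usually means a recovery checkpoint was left behind. A minimal sketch for checking that ('VM01' is a placeholder):

# Recovery (backup) checkpoints are a separate type and do not show as standard checkpoints.
Get-VMSnapshot -VMName VM01 -SnapshotType Recovery

# Which VHD/AVHDX file is each virtual disk currently running on?
Get-VM VM01 | Get-VMHardDiskDrive | Format-Table VMName, ControllerType, Path

# Removing a lingering recovery checkpoint merges the .avhdx chain back into the parent disk.
Get-VMSnapshot -VMName VM01 -SnapshotType Recovery | Remove-VMSnapshot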

Syncing updates between domain controllers, DHCP, SQL and IIS servers


Hello,

So I'm curious about keeping Windows updates in sync between machines. From my understanding, if you have 2 domain controllers in the same domain (Server 2016) set to automatic updates, they coordinate their reboots so that one DC is always up and nothing falls out of sync. Is something like that possible for DHCP, SQL and IIS servers if those are set up as a cluster or in a failover configuration? It's something I'm trying to research and am just not 100% sure about. I know there's SCCM (which I'm also looking at), but if I can just cluster the servers, enable automatic updates and move on to other things, that would be ideal. Any help appreciated. Thanks in advance.
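For failover clusters specifically, Cluster-Aware Updating is the built-in way to patch nodes one at a time while the roles stay online; a minimal sketch of enabling its self-updating mode (cluster name and schedule are placeholders):

# Enable the CAU self-updating role so the cluster patches one node at a time on a schedule,
# draining roles before each reboot.
Add-CauClusterRole -ClusterName CLUSTER1 -DaysOfWeek Sunday -WeeksOfMonth 2 -Force

# Check status or kick off an on-demand updating run.
Get-CauRun -ClusterName CLUSTER1
Invoke-CauRun -ClusterName CLUSTER1 -Force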
