Sol Group
Post your issues!!!! Solaris 11 Ultimate Support

Thursday, December 5, 2013


Sun Cluster 3.2 & Above Versions


It is suggested to move the UDLM port range to 7000-7032 for Sun Cluster 3.2 and higher.

To check the port UDLM is running on

#scrgadm -pvv |grep udlm:port

To check whether anything is already running in the current port range

#netstat -an|egrep '\.60[0-3][0-9]'

Also check that the new port range is clear

#netstat -an|egrep '\.70[0-3][0-9]'

Now let's disable and unmanage the RAC framework resource group


#scswitch -F -g rac-framework-rg

#scswitch -n -j rac_new
#scswitch -n -j rac_udlm
#scswitch -u -g rac-framework-rg

Now we have to reboot the cluster so it comes up with rac_udlm offline to allow the port change.

Remember to shut down the cluster using scshutdown from the master node, so that we get a consistent reboot with respect to the assigned quorum.

#scshutdown -y -g0


Once the nodes are up, make the port change from the master node:

#scrgadm -c -j rac_udlm-rs -x port=7000

This sets the base of the new port range (7000-7032).

Now bring the rac-framework resource group online:

#scswitch -Z -g rac-framework-rg

Verify that the cluster is using the new UDLM port 7000:

#scrgadm -pvv |grep udlm:port

If any resources are left pending after the port change, disable and re-enable them:


clresource disable -R -g rac-framework-rg +

clresource enable -R -g rac-framework-rg +
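
Once re-enabled, it's worth confirming that everything has come back cleanly; a quick sanity check using the same group name as above:

#clrg status rac-framework-rg
#clrs status -g rac-framework-rg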


What is Split-Brain?
The term "Split-Brain" is often used to describe the scenario when two or more co-operating processes in a distributed system, typically a high availability cluster, lose connectivity with one another but then continue to operate independently of each other, including acquiring logical or physical resources, under the incorrect assumption that the other process(es) are no longer operational or using the said resources.

What does "co-operating" mean?
Co-operating processes are those that use shared or otherwise related resources, including accessing or modifying shared system state, during the process of performing some coordinated action, typically at the request of a client.

What's at risk?
The biggest risk following a Split-Brain event is the potential for corrupting system state.  There are three typical causes of corruption:
The processes that were once co-operating prior to the Split-Brain event occurring independently modify the same logically shared state, thus leading to conflicting views of system state.  This is often called the "multi-master problem".
New requests are accepted after the Split-Brain event and then performed on potentially corrupted system state (thus potentially corrupting system state even further).
When the processes of the distributed system "rejoin" together it is possible that they have conflicting views of system state or resource ownerships.  During the process of resolving conflicts, information may be lost or become corrupted.
Examples of potential corruption include, creating multiple copies of the same information, updating the same information multiple times, deleting information, creating multiple events for a single operation, processing an event multiple times, starting duplicate services or suspending existing services.

What if the processes aren't co-operating?
If the processes in a distributed system aren't co-operating in any way (ie: they don't use any shared resources), Split-Brain, or its effects, may not occur.
I thought Split-Brain was all about physical network infrastructure and connectivity failures. Can it really occur at the process level, on the same physical server?
While network infrastructure failure is one of the more common causes of Split-Brain, the loss of communication or connectivity between two or more processes on a single physical server, even running on a single processor, may also cause a Split-Brain event.

For example; if one of two co-operating processes on a server is swapped out for a long period of time, longer than the configured network or connectivity time-out between the processes, a Split-Brain may occur if each process continues to operate independently, especially when the swapped-out process returns to normal operation. ie: The swapped process does not take into account that it has been unavailable for a long period of time.
Similarly, if a process is interrupted for a long period of time, say due to an unusually long Garbage Collection or when a physical processor is unavailable to a process due to heavy contention on virtualized infrastructure, the said process may not respond to communication requests from another process, and thus a Split-Brain may occur.
Split-Brain does not require physical network infrastructure to occur.
Garbage Collection can't cause a Split-Brain.  Right?

Unfortunately not.  As explained above, excessively long Garbage Collection or regular back-to-back Garbage Collections may make a process seem unavailable to other processes in a distributed system and thus not be in a position to respond to communication requests.
Split-Brain will only ever occur when a system has two processes.  With three or more processes it can easily be detected and resolved.  Right?

Unfortunately not.  Split-Brain scenarios may just as easily occur when there are n processes (where n > 2) in a distributed system.  For example: it's possible that all n processes in a distributed system, especially on heavily loaded hardware, could be swapped out at the same point in time, thus effectively losing connectivity with each other, thus potentially creating an n-way Split-Brain.

As Split-Brains occur as the result of incorrectly performing some action due to an observed communication failure, the number of "pieces" of "brains" may be greater than two.
Splits always occur "down-the-middle". Right?

Unfortunately not.  Following on from above, as most distributed systems consist of large numbers of processes (large values of n), splits rarely occur "down-the-middle", especially when n is an odd number!  In fact, even when n is an even number, there is absolutely no guarantee that a split will contain two equally sized collections of processes.  For example: a system consisting of five processes may be split such that one side of the system (ie: brain) may have three processes and the other side may have two processes.  Alternatively it may be split such that one side has four processes and the other has just one process. In a system with six processes, a split may occur with four processes on one side and just two on the other.
Stateless architectures don't suffer from Split-Brain. Right?

Correct.  Systems that don't have shared state or use shared resources typically don't suffer from Split-Brain.
Client+Server architectures don't suffer from Split-Brain. Right?

Correct - if and only if the Server component of the architecture operates as a single process.  However, if the Server component of an architecture operates as a collection of processes, it's possible that such an architecture will suffer from Split-Brain.

All Stateful architectures suffer from Split-Brain. Right?

The "statefulness" of an architecture does not imply that it will suffer from Split-Brain.  It is very possible to define an architecture that is stateful and yet avoids the possibility of Split-Brain (as defined above), by ensuring no shared resources are accessed across processes.  
The solution is simple - completely avoid Split-Brains by waiting longer for communication to recover, doing more checks and avoiding assumptions. Right?

Unfortunately in most Split-Brain scenarios all that can be observed is an inability to send and/or receive information. It is from these observations that systems must make assumptions about a failure.  When these assumptions are incorrect, Split-Brain may occur. 

While waiting longer for a communication response may seem like a reasonable solution, the challenge is not in waiting. The waiting part is easy. The challenge is determining "how long to wait" or "how many times to retry".

Unfortunately to determine "how long to wait" or "retry" we need to make some assumptions about process connectivity and ultimately process availability.  The challenge here is that those assumptions may quickly become invalid, especially in a dynamically or arbitrarily loaded distributed system.  Alternatively if there is a sudden spike in the number of requests, processes may pause more frequently (especially in the case of Garbage Collection or in virtualized environments) and thus increase the potential for a Split-Brain scenario to form.

Split-Brain only occurs in systems that use unreliable network protocols (ie: protocols other than TCP/IP). Right?

The network protocol used by a distributed system, whether a reliable protocol like TCP/IP or unreliable protocol like UDP (unicast or multicast) does not preclude a Split-Brain from occurring.  As discussed above, a physical network is not required for a Split-Brain scenario to develop.
TCP/IP decreases the chances of a Split-Brain occurring. Right?

Unfortunately not.  Following on from above, the operational challenge with the use of TCP/IP within a distributed system is that the protocol does not directly expose internal communication failures to the system, for example the ability to send from a process but not receive (often called "deafness") or the ability to receive but not send (often called "muteness").  Instead, TCP/IP protocol failures are only notified after a fixed, typically operating-system-designated time-out (usually configurable, but typically set to seconds or even minutes by default).  However, the use of unreliable protocols such as UDP may highlight communication problems sooner, including the ability to detect "deafness" or "muteness", and thus allow a system to take corrective action sooner.

Having all processes in a distributed system connected via a single physical switch will help prevent Split-Brain. Right?

While deploying a distributed system such that all processes are interconnected via a single physical switch may seem to reduce the chances of a Split-Brain occurring, switches rarely fail atomically, all at once.  Typically, when a switch fails, it does so in an unreliable and degrading manner; that is, some components of the switch will continue to remain operational whereas others may become intermittent.  Thus, in their entirety, switches become intermittent before they fail completely (or are shut down completely for maintenance).

What are the best practices for dealing with Split-Brain?
While the problem of Split-Brains can't be solved using a generalized approach, "prevention" and "cure" are possible.
There are essentially six approaches that may be used to prevent a Split-Brain from occurring:
1.      Use high quality and reliable network infrastructure.
2.      Provide multiple paths of communication between processes, so that a single observed communication failure does not trigger a Split-Brain.
3.      Avoid overloading physical resources so that processes are not swapped out for long periods of time.
4.      Avoid unexpectedly long Garbage Collection pauses.
5.      Ensure communication time-outs are suitably long so that a Split-Brain does not occur "too early" due to (3) or (4).
6.      Architect an application so that it uses few shared resources (this is rarely possible).

However, even when implementing all of these approaches, a Split-Brain may still occur, in which case we need to focus on the "cure".
There are essentially four approaches to "curing" a Split-Brain:

Fail-Fast: As soon as a Split-Brain scenario is detected, the entire system or suitable processes in the system are immediately shutdown to avoid the possibility of corruption.

Isolation is a weaker form of Fail-Fast.  Instead of shutting down processes of a system, they are simply isolated.  When the Split-Brain is recovered, the isolated processes are re-introduced into the system as if they were new (ie: they drop any previously held information/assumptions to avoid corruption). After the Split-Brain event has occurred, the Isolated processes may continue to perform work, but on re-introduction to the system, current information/processing will be lost.

Fencing is a stronger form of isolation, but still a weaker form of Fail-Fast.  Fencing requires an additional constraint over Isolation in that fenced-processes must stop execution immediately (and release resources), rather than continue to operate after a Split-Brain has occurred.

Resolve Conflicts (assumes the above approaches have not been used):
When the communication channels have been recovered, it's highly possible that there are conflicts at the resource level, including conflicts in system state.  By providing a "Conflict Resolving" interface, developers may provide an application-specific mechanism for resolving said conflicts, thus allowing the system to continue to operate.  Of course, this is completely application dependent and development intensive, but it provides the best way to recover.

How can a Split-Brain event be detected?  How are they defined?
Unfortunately there is no simple way to define or detect when a Split-Brain has occurred.  While it's fairly obvious and perhaps trivial in a system composed of only two processes, such systems are increasingly rare.  Most distributed systems have tens if not thousands of processes.

For example; if four processes in a five process system collectively lose contact with a single process, does that mean a four-to-one Split-Brain has occurred, or simply that a single process has failed?
The common solution to this problem is to define what is called a Quorum, the idea being that those processes not belonging to the quorum should Fail-Fast or be Isolated.  The typical way to define a quorum is to specify the minimum number of co-operating processes that must be collectively available to continue operating.

Hence for the above example, with a quorum of "three processes", the system would treat a failure of a single process not as a Split-Brain, but simply as a lost process.  However if the quorum were incorrectly defined to be "a single process", a Split-Brain would occur - with a four to one split.

Obviously depending on size-based quorums may be problematic as it's yet another assumption we need to make - "how big should a quorum be?".  Often however the definition of a quorum is less about the number of processes that are collectively available, but instead more about the roles or locality of the said processes.

For example; if three processes in a five process system collectively lose contact with two other processes, but those two processes remain in contact, we essentially have a three-to-two Split-Brain.  With a quorum that is defined using a "size-based" approach, the two processes may be failed-fast/isolated.  However let's consider that the three processes have also lost connectivity with a valuable resource (say a database or network attached storage device), but the other two processes have not.  In this situation it's often preferable that the two surviving processes should remain available and the other three processes should be failed-fast/isolated.

Further, and as previously mentioned, it's important to remember that failures in communication between processes are rarely observed to occur at the same time.  Rather they are observed over a period of time, perhaps seconds, minutes or even hours.

Thus when discussing "when" a Split-Brain occurs, we usually need to consider the entire period of time, during which there may be multiple failure and recovery events, to fully conclude that a Split-Brain has occurred.

How does Oracle Coherence deal with Split-Brain?
Internally Oracle Coherence uses a variety of both proprietary (UDP uni-cast and multi-cast based) and standard (TCP/IP) network technologies for inter-process communication and maintaining system health. These technologies are combined in multiple ways to enable highly scalable and high-performance one-to-one and one-to-many communication channels to be established and reliably observed, across hundreds (even thousands if you really like) of processes, with little CPU or network overhead.

For example, through the combined use of these technologies Coherence can easily detect and appropriately deal with remote garbage collection across a system in under a second (using commodity 1Gb switch infrastructure).

The major philosophy, and the delivered advantage, of this approach is to ideally "prevent" a Split-Brain from occurring as much as possible.  Should a Split-Brain occur, say due to a switch failure, Coherence does the following:
·        Uses Isolation to ensure reliable system communication between the individual fragments of the Split-Brain. Without reliable communication, data integrity within the fragments cannot be guaranteed (or recovered).
·        Uses Fencing around processes that are deemed unresponsive (ie: deaf) but continue to communicate with other processes in the system. This is a form of dynamic blacklisting to ensure system-wide communication reliability and prevent Fenced processes from corrupting state.
·        Raises real-time programmatic events concerning process health. This enables developers to provide custom Fail-Fast algorithms based on system health changes.  In Coherence these events are handled using MemberListeners.
·        Raises real-time programmatic events concerning the locality and availability of partitions (ie: information). This enables developers to provide custom Fail-Fast algorithms based on partitions being moved or becoming unavailable due to catastrophic system failure.  In Coherence these events may be handled using Backing Map Listeners and/or Partition Listeners.
·        Uses a simple "largest side wins" size-based quorum rule to resolve resource (and state) ownership when a Split-Brain is recovered.



Amnesia Scenario:
Node node-1 is shut down.
Node node-2 crashes and will not boot due to hardware failure.
Node node-1 is rebooted but stops and prints out the messages:
    Booting as part of a cluster
    NOTICE: CMM: Node node-1 (nodeid = 1) with votecount = 1 added.
    NOTICE: CMM: Node node-2 (nodeid = 2) with votecount = 1 added.
    NOTICE: CMM: Quorum device 1 (/dev/did/rdsk/d4s2) added; votecount = 1, bitmask of nodes with configured paths = 0x3.
    NOTICE: CMM: Node node-1: attempting to join cluster.
    NOTICE: CMM: Quorum device 1 (gdevname /dev/did/rdsk/d4s2) can not be acquired by the current cluster members. This quorum device is held by node 2.
    NOTICE: CMM: Cluster doesn't have operational quorum yet; waiting for quorum.
Node node-1 cannot boot completely because it cannot achieve the needed quorum vote count.
NOTE: For Oracle Solaris Cluster 3.2 update 1 and above with Solaris 10, the boot continues, in non-cluster mode, after a timeout.
In the above case, node node-1 cannot start the cluster due to the amnesia protection of Oracle Solaris Cluster. Since node node-1 was not a member of the cluster when the cluster was last shut down (when node-2 crashed), there is a possibility that it has an outdated CCR, so it should not be allowed to automatically start up the cluster on its own.
The general rule is that a node can only start the cluster if it was part of the cluster when the cluster was last shut down. In a multi-node cluster it is possible for more than one node to be among "the last" to leave the cluster.

Resolution:
If you have a 3-node cluster, start with a node that is suitable for starting the cluster, e.g. a node connected to the majority of the storage. In this example node-1 represents this first node.
1. Stop node-1 and reboot in non-cluster mode. (Single user not necessary, only faster)
ok boot -sx
2. Make a backup of /etc/cluster/ccr/infrastructure file or /etc/cluster/ccr/global/infrastructure depending upon cluster and patch revisions noted in UPDATE_NOTE #1 below.
# cd /etc/cluster/ccr
# /usr/bin/cp infrastructure infrastructure.old
Or if UPDATE_NOTE #1 applies
# cd /etc/cluster/ccr/global
# /usr/bin/cp infrastructure infrastructure.old
3. Get this node's id.
# cat /etc/cluster/nodeid
4. Edit the /etc/cluster/ccr/infrastructure file or /etc/cluster/ccr/global/infrastructure depending upon cluster and patch revisions noted in UPDATE_NOTE #1 below.
Change the quorum_vote to 1 for the node that is up (node-1, nodeid = 1).
           cluster.nodes.1.name    node-1
           cluster.nodes.1.state   enabled
           cluster.nodes.1.properties.quorum_vote  1
For all other nodes and any Quorum Device, set the votecount to zero.
For the other nodes, where <N> is any node id other than the one edited above:
           cluster.nodes.<N>.properties.quorum_vote  0
For the Quorum Device(s), where <Q> is the quorum device id, which is internal to the cluster code:
           cluster.quorum_devices.<Q>.properties.votecount   0
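
For illustration only, a hypothetical three-node cluster (nodeids 1-3, starting node-1, one quorum device with id 1) would end up with entries along these lines (do not paste verbatim; use your own ids):

           cluster.nodes.1.properties.quorum_vote  1
           cluster.nodes.2.properties.quorum_vote  0
           cluster.nodes.3.properties.quorum_vote  0
           cluster.quorum_devices.1.properties.votecount   0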

5. Regenerate the checksum of the infrastructure file by running:
# /usr/cluster/lib/sc/ccradm -i /etc/cluster/ccr/infrastructure -o

Or if UPDATE_NOTE #1 applies
# /usr/cluster/lib/sc/ccradm -i /etc/cluster/ccr/global/infrastructure -o
NOTE: If running SC 3.3, SC 3.2u3 or a 3.2 Cluster core patch equal to or greater than 126105-36 (5.9), 126106-36 (5.10) or 126107-36 (5.10 x86), the ccradm command would be
# /usr/cluster/lib/sc/ccradm recover -o infrastructure
6. Boot node node-1 into the cluster.
# /usr/sbin/reboot
7. The cluster is now started, so as long as the other nodes have been repaired they can be booted up and join the cluster again. When these nodes join the cluster their votecount will be reset to its original value, and if a node is connected to any quorum device, the quorum device's votecount will also be reset.
UPDATE_NOTE #1: If running Solaris Cluster 3.2u2 or higher, the directory path /etc/cluster/ccr is replaced with /etc/cluster/ccr/global. The same applies if running a Cluster core patch equal to or greater than 126105-27 (5.9), 126106-27 (5.10) or 126107-27 (5.10 x86).

Troubleshooting Steps Outline for node not booting into cluster

1. Verify booting from the right system disk
2. Verify actual cluster doesn't start issue
3. Verify correct boot syntax
4. Verify not “invalid CCR table”
5. Verify not “waiting for operational quorum” amnesia
6. Confirm private interfaces have started
7. Verify that the node did not already join the cluster with a date/time in the future.
8. Verify if other errors are reported on the console.

Troubleshooting Steps in Detail:

1) Verify booting from the right system disk
In many cases, there are several system disks attached to a system: the current disk,
the mirror of the current disk, a backup of the system disk, a system disk with a previous Solaris version, ...
Be sure that you are booting from the right disk.
If you are using an alias, check that the devalias points to the right system disk.

2) Verify actual cluster doesn't start issue
Ensure the system will boot in non-cluster mode. 
This step will verify the hardware and Solaris operating system are functional. 
A system may be booted in non-cluster mode by adding the '-x' boot flag.
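For example, on a SPARC system this is done from the OpenBoot prompt (on x86, add -x to the kernel boot arguments in the boot menu):

ok boot -x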


3) Verify correct boot syntax
Check that you used the right boot syntax to boot the node.
If you boot using an alias, make sure the alias points to the right system disk (see step 1).
In order to boot a node in cluster mode, you must not use the -x flag, and you must see the following message on the console:
Booting as part of a cluster

4) Verify not “invalid CCR table”
The cluster uses several tables stored as files in the CCR repository.
If one of these tables is corrupted, it could prevent the node from booting into the cluster.
You may see errors like the following ones on the console:
* UNRECOVERABLE ERROR: /etc/cluster/ccr/infrastructure file is corrupted Please reboot in noncluster mode(boot -x) and Repair
* WARNING: CCR: Invalid CCR table : infrastructure
* WARNING: CCR: Invalid CCR table : dcs_service_xx.
* WARNING: CCR: Table /etc/cluster/ccr/directory has invalid checksum field. .....
UNRECOVERABLE ERROR: Sun Cluster boot: Could not initialize cluster framework Please reboot in non cluster mode(boot -x) and Repair
* ...
 In such a case, further troubleshooting is required to understand the failure. For additional support, contact Oracle Support.

 5) Verify not “waiting for operational quorum” amnesia
A cluster must reach an operational quorum in order to start.
It means that we must have enough votes (more than half of all configured votes) to be able to start.
A node will report the "waiting for operational quorum” message during boot if that quorum cannot be reached.
NOTICE: CMM: Node ita-v240c: attempting to join cluster. NOTICE: CMM: Cluster doesn't have operational quorum yet; waiting for quorum.
 This happens when the node is not able to talk to the other cluster nodes through the private links and is not able to access the quorum device(s).
This usually happens when the other cluster node was the last one to be stopped and we are rebooting another node first.
In that case check that you are booting the nodes in the right order.
If the other node owning the quorum device(s) cannot be booted due to other problems, then you need to recover from amnesia.

6) Verify that the private interfaces have started.
A node will not be able to join the cluster if the private links do not start.
You'll then see messages like the following ones:
NOTICE: clcomm: Adapter bge3 constructed
NOTICE: clcomm: Adapter bge2 constructed
NOTICE: CMM: Node ita-v240c: attempting to join cluster.
NOTICE: CMM: Cluster doesn't have operational quorum yet; waiting for quorum.
NOTICE: clcomm: Path ita-v240c:bge3 - ita-v240b:bge3 errors during initiation
NOTICE: clcomm: Path ita-v240c:bge2 - ita-v240b:bge2 errors during initiation
WARNING: Path ita-v240c:bge3 - ita-v240b:bge3 initiation encountered errors, errno = 62. Remote node may be down or unreachable through this path.
WARNING: Path ita-v240c:bge2 - ita-v240b:bge2 initiation encountered errors, errno = 62. Remote node may be down or unreachable through this path.

The node won't be able to talk to the other node(s) and continue its boot until it prints:
NOTICE: clcomm: Path ita-v240c:bge3 - ita-v240b:bge3 online
NOTICE: clcomm: Path ita-v240c:bge2 - ita-v240b:bge2 online
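
Once the paths are online (or from another node already in the cluster), the interconnect state can be double-checked with the commands referenced elsewhere in this document:

# scstat -W
# clintr status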


7) Verify that the node did not already join the cluster with a date/time in the future.
When node C joins node B in a cluster, node B saves the boot date/time of node C.
It is used by node B as node C's incarnation number.
May 19 03:12:05 ita-v240b cl_runtime: [ID 537175 kern.notice] NOTICE: CMM: Node ita-v240c (nodeid: 1, incarnation #: 1242748936) has become reachable.
May 19 03:12:05 ita-v240b cl_runtime: [ID 377347 kern.notice] NOTICE: CMM: Node ita-v240c (nodeid = 1) is up; new incarnation number = 1242748936.

Now assume that node C joined the cluster with the wrong date/time being set, say a few hours in the future.
The administrator realizes that and decides to reboot node C with a fixed date/time.
Node B will then see node C trying to join the cluster with an incarnation number before the one it already knows.
Node B then thinks that this is a stale version of node C trying to join the cluster; it will refuse the join and report the following message:

May 19 03:21:08 ita-v240b cl_runtime: [ID 182413 kern.warning] WARNING: clcomm: Rejecting communication attempt from a stale incarnation of node ita-v240c;
reported boot time Tue May 19 08:21:01 GMT 2009, expected boot time Tue May 19 16:02:16 GMT 2009 or later.

In older SunCluster versions (3.0 for example), this message is not printed at all.
You may find more information by dumping a cluster debug buffer on node B.
ita-v240b# mdb -k
Loading modules: [ unix krtld genunix dtrace specfs ufs sd pcisch ip hook neti sctp arp usba fcp fctl qlc nca ssd lofs zfs random logindmux ptm cpc sppp crypto fcip nfs ]
>*udp_debug/s
 .....
th 3000305f6c0 tm 964861043: PE 60012f10440 peer 1 laid 2 udp_putq stale remote incarnation 1242721261
 .....
Node C comes with an incarnation of 1242721261 which is before the already known incarnation of 1242748936.
Node C will only be allowed to join the cluster once its date/time goes beyond the date/time when it first joined the cluster.
To fix this, wait for the time to go beyond the first date/time and reboot node C.
This is not always possible if the date was too far in the future (days, weeks, months or years!).
The only other fix is to reboot the whole cluster.

May 19 03:46:14 ita-v240b cl_runtime: [ID 537175 kern.notice] NOTICE: CMM: Node ita-v240c (nodeid: 1, incarnation #: 1242722767) has become reachable.
May 19 03:46:14 ita-v240b cl_runtime: [ID 377347 kern.notice] NOTICE: CMM: Node ita-v240c (nodeid = 1) is up; new incarnation number = 1242722767.
May 19 03:46:14 ita-v240b cl_runtime: [ID 377347 kern.notice] NOTICE: CMM: Node ita-v240b (nodeid = 2) is up; new incarnation number = 1242722750.

8) Verify if other errors are reported on the console.
The node can fail to join the cluster due to other errors seen during the boot sequence in the various scripts being run.
This can block the boot sequence until the issue is fixed.
This includes:
- failure to mount the global devices filesystem
- missing mount point
- invalid /etc/vfstab entry
Fix the error seen and reboot the node in the cluster.
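
As an illustration of the vfstab case, the global-devices file system normally has an entry similar to the following; the boot-disk slice shown is hypothetical, and the /global/.devices/node@<nodeid> mount point and the "global" mount option are the parts to verify:

# hypothetical boot-disk slice; only the mount point and "global" option matter here
/dev/dsk/c0t0d0s3 /dev/rdsk/c0t0d0s3 /global/.devices/node@1 ufs 2 no global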




Cluster Panics due to Reservation Conflict:
A major issue for clusters is a failure that causes the cluster to become partitioned (called split brain). When split brain occurs, not all nodes can communicate, so individual nodes or subsets of nodes might try to form individual or subset clusters. Each subset or partition might believe it has sole access and ownership to the multihost devices. When multiple nodes attempt to write to the disks, data corruption can occur. In order to avoid data corruption Solaris Cluster software uses SCSI Reservation method to implement failure fencing and to limit the nodes leaving the cluster from accessing shared devices. 

When the hosts are in the cluster they are able to write their keys on the shared devices and access those devices. However, when a node becomes unreachable, the cluster framework will drop the node from the cluster and have the dropped node's keys removed. Therefore, the cluster framework fences that node out and limits its access to the shared devices.


  There are three types of SCSI reservation methods: SCSI-2, SCSI-3, and SCSI-2 with SCSI-3 Persistent Group Reservation Emulation (PGRE).


 In a SCSI-2 type reservation, only one host at a time can reserve and access the disk, whereas with SCSI-3 and PGRE every host connected to the device can write its keys and access the disk. Also, a SCSI-2 reservation is cleared by reboots, whereas the other two methods are persistent and survive reboots.

The Solaris Cluster framework uses PGRE (SCSI-2 based) for storage devices with two physical paths to the hosts, and SCSI-3 for devices with more than two paths to the hosts.
  
A reservation conflict arises when a node that is not in communication with the surviving node(s) in the cluster (either a fenced-out member of the cluster or a non-member host from another cluster) tries to access a shared device it has been fenced off from, i.e. its key is not present on the device. In such a case, if the failfast mechanism is enabled on the offending node (e.g. when the node is running in the cluster but unaware that it has been fenced off due to unforeseen circumstances such as a hang), the node will panic by design in order to limit its access to the shared device.

A Solaris Cluster 3.x node panics with "SCSI Reservation Conflict" and the following messages:

Nov 14 15:14:28 node2 cl_dlpitrans: [ID 624622 kern.notice] Notifying cluster that this node is panicking
Nov 14 15:14:28 node2 unix: [ID 836849 kern.notice]
Nov 14 15:14:28 node2 ^Mpanic[cpu8]/thread=2a100849ca0:
Nov 14 15:14:28 node2 unix: [ID 747640 kern.notice] Reservation Conflict
Nov 14 15:14:28 node2 unix: [ID 100000 kern.notice]

To troubleshoot this, we have to do a few verifications as given below:

1. Verify that no disk is accessible by hosts that are not part of this cluster:
If a disk is visible to host(s) of another cluster, or to host(s) using SCSI reservation methods, then even if it is not being used or mounted, the keys on the disk might be altered without the knowledge of any of the nodes in our cluster.
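
To see how the shared disks map to DID devices and physical paths from each node, list the DID instances from any cluster node (on Sun Cluster 3.2, cldevice list -v gives similar output):

# scdidadm -L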


2. Check whether EMC Gatekeeper or VCMDB disks are accessible by cluster nodes:


  If EMC Gatekeeper or VCMDB disks are exposed to Sun Cluster, the cluster nodes might panic with SCSI Reservation Conflict panics. To avoid such circumstances, one might "blacklist" the VCMDB disks. Please consult EMC for the appropriate procedure.

These disks have to be masked or blacklisted in the fp driver configuration file (fp.conf).
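
As a hypothetical sketch only (the exact property name, syntax and file location vary by Solaris release and should be confirmed with EMC/Oracle before use), the fp driver configuration can exclude specific LUNs by port WWN, along these lines:

# hypothetical entry: replace the port WWN and LUN number with the VCMDB device's actual values
pwwn-lun-blacklist="50060482cafd7743,0";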

3. Check the time stamps of the messages on all the nodes during and prior to the panic: 


  When nodes booted into a cluster stop receiving heartbeats from the other node(s) via the private interconnects, they have no way of telling whether the problem was due to:


·        a problem with the private interconnect paths, e.g. cable problems,
·        the other node(s) going down as part of a normal shutdown, i.e. "init 0",
·        the other node(s) panicking, or
·        the other node(s) being hung.
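
Comparing /var/adm/messages across all the nodes around the panic time helps distinguish these cases; for example, using the (hypothetical) timestamp from the panic above, run on each node:

# grep "Nov 14 15:1" /var/adm/messages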


Removing SCSI Reservations

Solaris Cluster uses two types of SCSI reservations: SCSI-2 Reservations and SCSI-3 Persistent Reservations. Solaris Cluster uses only one SCSI reservation type for a given shared storage device; however, both SCSI reservation types may be used at the same time in the same cluster.
To remove SCSI-2 Reservations use the 'scsi' command with the 'release' option as shown below.  This command, however, must be executed from the system that owns the reservation in order to successfully remove the reservation.

  # /usr/cluster/lib/sc/scsi -c disfailfast -d /dev/did/rdsk/d#s2

  # /usr/cluster/lib/sc/scsi -c release -d /dev/did/rdsk/d#s2
The '#' character is a reference to the DID device number of the disk from which you wish to remove a SCSI-2 Reservation.  The standard Solaris disk device reference, /dev/rdsk/c#t#d#s2, can also be used in place of the DID device reference.
Note: Sun Microsystems recommends that you first disable the failfast mechanism using the 'scsi' command with the 'disfailfast' option, as shown above, before removing any SCSI reservations.  The system will panic if the system is running in cluster mode and you do not disable the failfast mechanism and you mistakenly attempt a release of a SCSI-2 Reservation when the SCSI reservation was in fact a SCSI-3 Persistent Reservation.

An alternative is to use the 'reserve' command with the 'release' option to remove a SCSI-2 Reservation as shown below.

  # /usr/cluster/lib/sc/reserve -c disfailfast -z /dev/did/rdsk/d#s2
  # /usr/cluster/lib/sc/reserve -c release -z /dev/did/rdsk/d#s2
Note: SCSI-2 Reservations will be removed automatically if the system that owns the SCSI-2 Reservations is shut down or power cycled.  Likewise, SCSI-2 Reservations will be removed automatically if the storage devices that have SCSI-2 Reservations on them are shut down or power cycled.  There are other methods that also remove SCSI-2 Reservations without resorting to the commands presented above to do so.  For example, a SCSI bus reset will remove SCSI-2 Reservations from the storage devices affected by the reset.

Example 1:If DID device d15 has a SCSI-2 Reservation you wish to remove, use the following commands.

  # /usr/cluster/lib/sc/scsi -c disfailfast -d /dev/did/rdsk/d15s2
  do_enfailfast returned 0
  # /usr/cluster/lib/sc/scsi -c release -d /dev/did/rdsk/d15s2
  do_scsi2_release returned 0
Alternatively, if DID device d15 corresponds to disk device c3t4d0 on the cluster node where the commands will be executed, you could use the following commands instead.
  # /usr/cluster/lib/sc/reserve -c disfailfast -z /dev/rdsk/c3t4d0s2
  do_enfailfast returned 0
  # /usr/cluster/lib/sc/reserve -c release -z /dev/rdsk/c3t4d0s2
  do_scsi2_release 0
Note, again, that you can use either the DID device reference or the Solaris disk device reference with both the 'scsi' and the 'reserve' commands.

Example 2:If DID device d15 has a SCSI-3 Persistent Reservation and you mistakenly execute the 'scsi' command with the 'release' option without disabling the failfast mechanism, the system will panic if it is running in cluster mode.  So, if you execute the following command against a device that has a SCSI-3 Persistent Reservation and the command prompt does not return, then your system has probably experienced a 'Reservation Conflict' panic.

  # /usr/cluster/lib/sc/scsi -c release -d /dev/did/rdsk/d15s2
Check the system console for a panic message like the following.
  panic[cpu0]/thread=2a1003e5d20: Reservation Conflict
If, however, DID device d15 has a SCSI-3 Persistent Reservation and you mistakenly use both commands presented at the beginning of this section, you should see the following results.

  # /usr/cluster/lib/sc/scsi -c disfailfast -d /dev/did/rdsk/d15s2

  do_enfailfast returned 0
  # /usr/cluster/lib/sc/scsi -c release -d /dev/did/rdsk/d15s2
  do_scsi2_release returned -1
The failfast mechanism will be disabled for DID device d15, but the SCSI-3 Persistent Reservation will remain intact.  The 'release' option has no effect on a SCSI-3 Persistent Reservation.
Removing SCSI-3 Persistent Reservations
To remove SCSI-3 Persistent Reservations and all the reservation keys registered with a device use the 'scsi' command with the 'scrub' option as shown below.  This command does not need to be executed from the system that owns the reservation or from a system running in the cluster.

  # /usr/cluster/lib/sc/scsi -c disfailfast -d /dev/did/rdsk/d#s2

  # /usr/cluster/lib/sc/scsi -c scrub -d /dev/did/rdsk/d#s2
The '#' character is a reference to the DID device number of the disk from which you wish to remove a SCSI-3 Persistent Reservation.  The standard Solaris disk device reference, /dev/rdsk/c#t#d#s2, can also be used in place of the DID device reference.

Note: Sun Microsystems recommends that you first disable the failfast mechanism using the 'scsi' command with the 'disfailfast' option, as shown above, before removing any SCSI reservations.


Before executing the scrub operation, it is recommended that you confirm there are reservation keys registered with the device.  To confirm there are reservation keys registered with the device execute the 'scsi' command with the 'inkeys' option and then with the 'inresv' option as shown below.

  # /usr/cluster/lib/sc/scsi -c inkeys -d /dev/did/rdsk/d#s2
  # /usr/cluster/lib/sc/scsi -c inresv -d /dev/did/rdsk/d#s2

Note: SCSI-3 Persistent Reservations are 'persistent' by design which means that any reservation keys registered with a storage device or any reservation placed on a storage device are to be retained by the storage device, even if powered off, and the only way to remove them is by issuing a specific SCSI command to do so.  In other words, there are no automatic methods whereby a SCSI-3 Persistent Reservation will be removed, such as a reset or power cycle of the device.  Sun Cluster uses the Persistent Reserve Out (PROUT) SCSI command with the PREEMPT AND ABORT service action to remove SCSI-3 Persistent Reservations.  This PROUT SCSI command is also programmed into the 'scsi' command through the 'scrub' option for your use when absolutely necessary.  


Note: Most storage systems also provide a method to manually remove SCSI-3 Persistent Reservations by accessing or logging into the storage device controller and executing commands defined by the storage vendor for this purpose.  


Example 3:If DID device d27 has a SCSI-3 Persistent Reservation you wish to remove, use the following commands.

The first command displays the reservation keys registered with the device.
  # /usr/cluster/lib/sc/scsi -c inkeys -d /dev/did/rdsk/d27s2
  Reservation keys(3):
  0x44441b0800000003
  0x44441b0800000001
  0x44441b0800000002
The second command displays the SCSI-3 Persistent Reservation owner and the reservation type.

  # /usr/cluster/lib/sc/scsi -c inresv -d /dev/did/rdsk/d27s2

  Reservations(1):
  0x44441b0800000001
  type ---> 5
Note: Type 5 corresponds to a Write-Exclusive Registrants-Only type of SCSI-3 Persistent Reservation.  This is the SCSI-3 Persistent Reservation type used in Sun Cluster.

The third command removes the SCSI-3 Persistent Reservation and all the reservation keys registered with the device.

  # /usr/cluster/lib/sc/scsi -c scrub -d /dev/did/rdsk/d27s2
  Reservation keys currently on disk:
  0x44441b0800000003
  0x44441b0800000001
  0x44441b0800000002
  Attempting to remove all keys from the disk...
  Scrubbing complete, use '/usr/cluster/lib/sc/scsi -c inkeys -d /dev/did/rdsk/d27s2' to verify success
When executing the 'scsi' command with the 'scrub' option, the output first displays the keys currently registered with the device and then recommends that you check the device again to confirm that the registered keys have been removed.  This command and the result is shown below.
  # /usr/cluster/lib/sc/scsi -c inkeys -d /dev/did/rdsk/d27s2
  Reservation keys(0):
You should also confirm that the SCSI-3 Persistent Reservation has also been removed.
  # /usr/cluster/lib/sc/scsi -c inresv -d /dev/did/rdsk/d27s2
  Reservations(0):

Removing PGRe Keys

Sun Cluster uses the SCSI-3 Persistent Reservation keys that are registered with the quorum device when the Cluster Membership Monitor (CMM) determines the quorum vote tally.  How these keys are used by the CMM is beyond the scope of this document, however when SCSI-2 Reservations are used there are no reservation keys.  Therefore, when SCSI-2 Reservations are used with a quorum device, Sun Cluster uses an emulation mode to store SCSI-3 Persistent Reservation keys on the quorum device for when the CMM needs to use them. 
 This mechanism is referred to as Persistent Group Reservation emulation, or PGRe, and even though the reservation keys are the same ones used with SCSI-3 compliant devices they will be referred to in the remainder of this document as emulation keys because of the special, non-SCSI, way they are handled and stored in this case.

Note: Emulation keys are stored in a vendor defined location on the disk and do not interfere with or take away from the storage of user data on the quorum device.


Note: The PGRe mechanism is used only by Sun Cluster and the way emulation keys are handled and stored are defined only by Sun Cluster.  The PGRe mechanism and the emulation keys are not a part of the SCSI Specification.  In other words, the PGRe mechanism and the way emulation keys are handled and stored do not have any operational effect as it relates to what is defined in the SCSI Specification documents.  Therefore, any desire or need to remove emulation keys must be associated only with Sun Cluster itself and should have nothing whatsoever to do with the storage devices used as quorum devices


To remove emulation keys use the 'pgre' command with the 'pgre_scrub' option as shown below.

  # /usr/cluster/lib/sc/pgre -c pgre_scrub -d /dev/did/rdsk/d#s2
The '#' character is a reference to the DID device number of the disk from which you wish to remove emulation keys.  The standard Solaris disk device reference, /dev/rdsk/c#t#d#s2, can also be used in place of the DID device reference.
Before executing the pgre_scrub operation, it is recommended that you confirm there are reservation keys registered with the device.  To confirm there are reservation keys registered with the device execute the 'pgre' command with the 'pgre_inkeys' option and then with the 'pgre_inresv' option as shown below.
  # /usr/cluster/lib/sc/pgre -c pgre_inkeys -d /dev/did/rdsk/d#s2
  # /usr/cluster/lib/sc/pgre -c pgre_inresv -d /dev/did/rdsk/d#s2

Example 4:If DID device d4 has emulation keys you wish to remove, use the following commands.

The first command displays the emulation keys that have been registered with the device.
  # /usr/cluster/lib/sc/pgre -c pgre_inkeys -d /dev/did/rdsk/d4s2
  key[0]=0x447f129700000001.
The second command displays which emulation key is the reservation owner.
  # /usr/cluster/lib/sc/pgre -c pgre_inresv -d /dev/did/rdsk/d4s2
  resv[0]: key=0x447f129700000001.
The third command removes the emulation keys from the device.
  # /usr/cluster/lib/sc/pgre -c pgre_scrub -d /dev/did/rdsk/d4s2
  Scrubbing complete. Use '/usr/cluster/lib/sc/pgre -c pgre_inkeys -d /dev/did/rdsk/d4s2'
  to verify success.
When executing the 'pgre' command with the 'pgre_scrub' option, the output displays whether the operation completed successfully and then recommends that you check the device again to confirm that the emulation keys have been removed.  This command and the result is shown below.
  # /usr/cluster/lib/sc/pgre -c pgre_inkeys -d /dev/did/rdsk/d4s2
  No keys registered.
You should also confirm that the reservation owner has also been removed.
  # /usr/cluster/lib/sc/pgre -c pgre_inresv -d /dev/did/rdsk/d4s2
  No reservations on the device.

Example 5:If the DID device has never been used as a quorum device you will receive the following error when you use the 'pgre' command.  This error means that the storage area on the disk used by Sun Cluster to store the emulation keys has not been initialized.  Sun Cluster initializes the area on the disk to store emulation keys when the quorum device is created.

  # /usr/cluster/lib/sc/pgre -c pgre_inkeys -d /dev/did/rdsk/d7s2
  quorum_scsi2_sector_read: pgre id mismatch. The sector id is .
  scsi2 read returned error (22).
  /usr/cluster/lib/sc/pgre -c pgre_inkeys -d /dev/did/rdsk/d7s2 command failed errno = 22.

Sun Cluster software is unable to set any SCSI-3 reservations on the array even though the array is functioning:

The reservations need to be cleared at the array level.
From any host connected to the array, unmount the volumes. Confirm that you have a host with serial access to the array.

On the host managing the array, run sccli:

   # /usr/sbin/sccli
Choose the correct array if multiple arrays are connected to the host. Run the command to shut down the controller; this is necessary to sync the cache to the disks so that the LDs (logical drives) remain consistent.
   sccli> shutdown controller
Reset the controller
   sccli> reset controller
From the host with serial port access, watch the boot-up until you find the message:

Restoring saved persistent reservations. Preparing to restore saved persistent reservations. 

Type "skip", which will skip the loading of any persistent reservation keys stored on the array, essentially scrubbing them from the array.
Allow the array to continue to boot normally. At this time, retry the cluster command to initiate SCSI-3 reservations.
The reset on the array needs to be run only once. If any host held reservations prior to this procedure, that host will no longer have access to the LDs, which could result in a panic of that host or its ejection from the cluster.

Sun Cluster 3.2 has the following features and limitations:
·         Support for 2-16 nodes.
·         Global device capability--devices can be shared across the cluster.
·         Global file system --allows a file system to be accessed simultaneously by all cluster nodes.
·         Tight implementation with Solaris--The cluster framework services have been implemented in the kernel.
·         Application agent support.
·         Tight integration with zones.
·         Each node must run the same revision and update of the Solaris OS.
·         Two node clusters must have at least one quorum device.
·         Each cluster needs at least two separate private networks. (Supported hardware, such as ce and bge may use tagged VLANs to run private and public networks on the same physical connection.)
·         Each node's boot disk should include a 500M partition mounted at /globaldevices
·         Attached storage must be multiply connected to the nodes.
·         ZFS is a supported file system and volume manager. Veritas Volume Manager (VxVM) and Solaris Volume Manager (SVM) are also supported volume managers.
·         Veritas multipathing (vxdmp) is not supported. Since vxdmp must be enabled for current VxVM versions, it must be used in conjunction with mpxio or another similar solution like EMC's Powerpath.
·         SMF services can be integrated into the cluster, and all framework daemons are defined as SMF services
·         PCI and SBus based systems cannot be mixed in the same cluster.
·         Boot devices cannot be on a disk that is shared with other cluster nodes. Doing this may lead to a locked-up cluster due to data fencing.

The overall health of the cluster may be monitored using the cluster status or scstat -v commands. Other useful options include:
scstat -g: Resource group status
scstat -D: Device group status
scstat -W: Heartbeat status
scstat -i: IPMP status
scstat -n: Node status

Failover applications (also known as "cluster-unaware" applications in the Sun Cluster documentation) are controlled by rgmd (the resource group manager daemon). Each application has a data service agent, which is the way that the cluster controls application startups, shutdowns, and monitoring. Each application is typically paired with an IP address, which will follow the application to the new node when a failover occurs.
"Scalable" applications are able to run on several nodes concurrently. The clustering software provides load balancing and makes a single service IP address available for outside entities to query the application.
"Cluster aware" applications take this one step further, and have cluster awareness programmed into the application. Oracle RAC is a good example of such an application.
All the nodes in the cluster may be shut down with cluster shutdown -y -g0. To boot a node outside of the cluster (for troubleshooting or recovery operations), run boot -x.
clsetup is a menu-based utility that can be used to perform a broad variety of configuration tasks, including configuration of resources and resource groups.
Cluster Configuration
The cluster's configuration information is stored in global files known as the "cluster configuration repository" (CCR). The cluster framework files in /etc/cluster/ccr should not be edited manually; they should be managed via the administrative commands.
The cluster show command displays the cluster configuration in a nicely-formatted report.
The CCR contains:
·         Names of the cluster and the nodes.
·         The configuration of the cluster transport.
·         Device group configuration.
·         Nodes that can master each device group.
·         NAS device information (if relevant).
·         Data service parameter values and callback method paths.
·         Disk ID (DID) configuration.
·         Cluster status.
Some commands to directly maintain the CCR are:
ccradm: Allows (among other things) a checksum re-configuration of files in /etc/cluster/ccr after manual edits. (Do NOT edit these files manually unless there is no other option. Even then, back up the original files.) ccradm -i /etc/cluster/ccr/filename -o
scgdefs: Brings new devices under cluster control after they have been discovered by devfsadm.
The scinstall and clsetup commands may also update the CCR.
We have observed that the installation process may disrupt a previously installed NTP configuration (even though the installation notes promise that this will not happen). It may be worth using ntpq to verify that NTP is still working properly after a cluster installation.
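
A quick way to do that check:

# ntpq -p

The output should list the configured peers with reasonable offset and jitter values.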



Resource Groups
Resource groups are collections of resources, including data services. Examples of resources include disk sets, virtual IP addresses, or server processes like httpd.
Resource groups may either be failover or scalable resource groups. Failover resource groups allow groups of services to be started on a node together if the active node fails. Scalable resource groups run on several nodes at once.
The rgmd is the Resource Group Management Daemon. It is responsible for monitoring, stopping, and starting the resources within the different resource groups.
Some common resource types are:
·         SUNW.LogicalHostname: Logical IP address associated with a failover service.
·         SUNW.SharedAddress: Logical IP address shared between nodes running a scalable resource group.
·         SUNW.HAStoragePlus: Manages global raw devices, global file systems, non-ZFS failover file systems, and failover ZFS zpools.
Resource groups also handle resource and resource group dependencies. Sun Cluster allows services to start or stop in a particular order. Dependencies are a particular type of resource property. The r_properties man page contains a list of resource properties and their meanings. The rg_properties man page has similar information for resource groups. In particular, the Resource_dependencies property specifies something on which the resource is dependent.
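As a hypothetical example (the resource names are invented for illustration), a dependency can be set with the standard property syntax, making apache-rs wait for hasp-rs to come online first:
#         clrs set -p Resource_dependencies=hasp-rs apache-rs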
Some resource group cluster commands are:
#         clrt register resource-type: Register a resource type.
#         clrt register -n node1name,node2name resource-type: Register a resource type to specific nodes.
#         clrt unregister resource-type: Unregister a resource type.
#         clrt list -v: List all resource types and their associated node lists.
#         clrt show resource-type: Display all information for a resource type.
#         clrg create -n node1name,node2name rgname: Create a resource group.
#         clrg delete rgname: Delete a resource group.
#         clrg set -p property-name rgname: Set a property.
#         clrg show -v rgname: Show resource group information.
#         clrs create -t HAStoragePlus -g rgname -p AffinityOn=true -p FilesystemMountPoints=/mountpoint resource-name
#         clrg online -M rgname
#         clrg switch -M -n nodename rgname
#         clrg offline rgname: Offline the resource, but leave it in a managed state.
#         clrg restart rgname
#         clrs disable resource-name: Disable a resource and its fault monitor.
#         clrs enable resource-name: Re-enable a resource and its fault monitor.
#         clrs clear -n nodename -f STOP_FAILED resource-name
#         clrs unmonitor resource-name: Disable the fault monitor, but leave resource running.
#         clrs monitor resource-name: Re-enable the fault monitor for a resource that is currently enabled.
#         clrg suspend rgname: Preserves online status of group, but does not continue monitoring.
#         clrg resume rgname: Resumes monitoring of a suspended group
#         clrg status: List status of resource groups.
#         clrs status -g rgname
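As a worked example combining several of the commands above (all names here are hypothetical), a simple failover group with a logical hostname and an HAStoragePlus resource might be built like this:
#         clrt register SUNW.HAStoragePlus
#         clrg create -n node1,node2 app-rg
#         clrslh create -g app-rg app-lh
#         clrs create -t SUNW.HAStoragePlus -g app-rg -p FilesystemMountPoints=/app app-hasp-rs
#         clrg online -M app-rg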

Data Services
A data service agent is a set of components that allow a data service to be monitored and fail over within the cluster. The agent includes methods for starting, stopping, monitoring, or failing the data service. It also includes a registration information file allowing the CCR to store the information about these methods in the CCR. This information is encapsulated as a resource type.
The fault monitors for a data service place the service's daemons under the control of the process monitoring facility (rpc.pmfd) and monitor the service itself using client commands.

Public Network
The public network uses pnmd (Public Network Management Daemon) and the IPMP in.mpathd daemon to monitor and control the public network addresses.
IPMP should be used to provide failovers for the public network paths. The health of the IPMP elements can be monitored with scstat -i
The clrslh and clrssa commands are used to configure logical and shared hostnames, respectively.
#         clrslh create -g rgname logical-hostname

Private Network
The "private," or "cluster transport" network is used to provide a heartbeat between the nodes so that they can determine which nodes are available. The cluster transport network is also used for traffic related to global devices.
While a 2-node cluster may use crossover cables to construct a private network, switches should be used for anything more than two nodes. (Ideally, separate switching equipment should be used for each path so that there is no single point of failure.)
The default base IP address is 172.16.0.0, and private networks are assigned subnets based on the results of the cluster setup. Available network interfaces can be identified by using a combination of dladm show-dev and ifconfig.
Private networks should be installed and configured using the scinstall command during cluster configuration. Make sure that the interfaces in question are connected, but down and unplumbed before configuration. The clsetup command also has menu options to guide you through the private network setup process.
Alternatively, something like the following command string can be used to establish a private network:
#         clintr add nodename1:ifname1
#         clintr add nodename2:ifname2
#         clintr add switchname
#         clintr add nodename1:ifname1,switchname
#         clintr add nodename2:ifname2,switchname
#         clintr status

The health of the heartbeat networks can be checked with the scstat -W command. The physical paths may be checked with clintr status or cluster status -t intr.
Quorum
Sun Cluster uses a quorum voting system to prevent split-brain and cluster amnesia. The Sun Cluster documentation refers to "failure fencing" as the mechanism to prevent split-brain (where two nodes run the same service at the same time, leading to potential data corruption).
"Amnesia" occurs when a change is made to the cluster while a node is down, then that node attempts to bring up the cluster. This can result in the changes being forgotten, hence the use of the word "amnesia."
One result of this is that the last node to leave a cluster when it is shut down must be the first node to re-enter the cluster. Later in this section, we will discuss ways of circumventing this protection.
Quorum voting works by giving each voting device one vote. A quorum device may be a cluster node, a specified external server running the quorum server software, or a disk or NAS device. A majority of all defined quorum votes is required in order to form a cluster, and at least half of the quorum votes must remain present for cluster services to stay in operation. (For example, a two-node cluster with one quorum disk has three votes, so either node together with the quorum disk can keep the cluster running, but a node left with only its own vote will panic.) If a node cannot contact at least half of the quorum votes, it panics; during the subsequent reboot, if a majority cannot be contacted, the boot process is frozen. Nodes that are removed from the cluster due to a quorum problem also lose access to any shared file systems, which the Sun Cluster documentation calls "data fencing."
Quorum devices must be available to at least two nodes in the cluster.
Disk quorum devices may also contain user data. (Note that if a ZFS disk is used as a quorum device, it should be brought into the zpool before being specified as a quorum device.)
Sun recommends configuring n-1 quorum devices (the number of nodes minus 1). Two node clusters must contain at least one quorum device.
Disk quorum devices must be specified using the DID names.
Quorum disk devices should be at least as available as the storage underlying the cluster resource groups.
Quorum status and configuration may be investigated using:
#         scstat -q
#         clq status
These commands report on the configured quorum votes, whether they are present, and how many are required for a majority.
Quorum devices can be manipulated through the following commands:
#         clq add did-device-name
#         clq remove did-device-name: (Only removes the device from the quorum configuration. No data on the device is affected.)
#         clq enable did-device-name
#         clq disable did-device-name: (Removes the quorum device from the total list of available quorum votes. This might be valuable if the device is down for maintenance.)
#         clq reset: (Resets the configuration to the default.)
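For example, a quorum disk can be removed from the vote count during storage maintenance and returned afterwards (d4 is an illustrative DID device name):
#         clq disable d4
#         clq status
#         clq enable d4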

By default, disk quorum devices connected to exactly two nodes use SCSI-2 reservations, while devices connected to more than two nodes use SCSI-3 persistent reservations. Because SCSI-2 does not support persistent reservations, emulation software is used; it writes a 64-bit reservation key to a private area on the disk.
In either case, the cluster node that wins a race to the quorum device attempts to remove the keys of any node that it is unable to contact, which cuts that node off from the quorum device. As noted before, any group of nodes that cannot communicate with at least half of the quorum devices will panic, which prevents a cluster partition (split-brain).
In order to add nodes to a 2-node cluster, it may be necessary to change the default fencing with scdidadm -G prefer3 or cluster set -p global_fencing=prefer3, create a SCSI-3 quorum device with clq add, then remove the SCSI-2 quorum device with clq remove.
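Put together, that procedure might look like the following sketch (d4 and d10 are illustrative DID devices; confirm the fencing property name supported by your release before running it):
#         cluster set -p global_fencing=prefer3
#         clq add d10
#         clq remove d4
#         clq status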
NetApp filers and systems running the scqsd daemon may also be selected as quorum devices. NetApp filers use SCSI-3 locking over the iSCSI protocol to perform their quorum functions.
The claccess deny-all command may be used to deny all other nodes access to the cluster. claccess allow nodename re-enables access for a node.

Purging Quorum Keys
CAUTION: Purging the keys from a quorum device may result in amnesia. It should only be done after careful diagnostics have established why the cluster is not coming up, and never while the cluster is able to come up on its own. It may be necessary if the last node to leave the cluster is unable to boot, leaving every other node fenced out. In that case, boot one of the other nodes to single-user mode, identify the quorum device, and proceed as follows:
For SCSI-2 disk reservations, the relevant command is pgre, which is located in /usr/cluster/lib/sc:
#         pgre -c pgre_inkeys -d /dev/did/rdsk/d#s2 (List the keys on the quorum device.)
#         pgre -c pgre_scrub -d /dev/did/rdsk/d#s2 (Remove the keys from the quorum device.)
Similarly, for SCSI-3 disk reservations, the relevant command is scsi, located in the same directory:
#         scsi -c inkeys -d /dev/did/rdsk/d#s2 (List the keys on the quorum device.)
#         scsi -c scrub -d /dev/did/rdsk/d#s2 (Remove the keys from the quorum device.)

Global Storage
Sun Cluster provides a unique global device name for every disk, CD-ROM, and tape drive in the cluster. The format of these global device names is /dev/did/device-type (e.g., /dev/did/dsk/d2s3). (Note that DIDs are a global naming system, which is separate from the global device and global file system functionality.)
DIDs may be used as components of SVM volumes, though VxVM does not recognize DID device names as components of VxVM volumes.

DID disk devices, CD-ROM drives, tape drives, SVM volumes, and VxVM volumes may be used as global devices. A global device is physically accessed by just one node at a time, but all other nodes may access it by communicating across the cluster transport network.
The file systems in /global/.devices store the device files for global devices on each node. These are mounted on mount points of the form /global/.devices/node@nodeid, where nodeid is the identification number assigned to the node. These are visible on all nodes. Symbolic links may be set up to the contents of these file systems, if they are desired. Sun Cluster sets up some such links in the /dev/global directory.
Global file systems may be ufs, VxFS, or hsfs. To mount a file system as a global file system, add a "global" mount option to the file system's vfstab entry and remount. Alternatively, run a mount -o global... command.
(Note that all nodes in the cluster should have the same vfstab entry for all cluster file systems. This is true for both global and failover file systems, though ZFS file systems do not use the vfstab at all.)
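As an illustration, a global UFS file system might be described by a vfstab entry like the following on every node (the metadevice and mount point are hypothetical):
/dev/md/ds1/dsk/d100  /dev/md/ds1/rdsk/d100  /global/data  ufs  2  yes  global,logging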
In the Sun Cluster documentation, global file systems are also known as "cluster file systems" or "proxy file systems."
Note that global file systems are different from failover file systems. The former are accessible from all nodes; the latter are only accessible from the active node.
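A failover file system, by contrast, is typically placed under the control of a SUNW.HAStoragePlus resource inside the failover resource group, roughly as follows (the group, resource, and mount point names are placeholders):
#         clrs create -g rg-name -t SUNW.HAStoragePlus -p FilesystemMountPoints=/failover/data hasp-rs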

Maintaining Devices
New devices need to be read into the cluster configuration as well as the OS. As usual, we run something like devfsadm or drvconfig; disks to create the /devices and /dev links on each node. Then we use the scgdevs or scdidadm command to add the new disk devices to the cluster configuration.
Some useful options for scdidadm are:
#         scdidadm -l: Show local DIDs
#         scdidadm -L: Show all cluster DIDs
#         scdidadm -r: Rebuild DIDs

We should also clean up unused links from time to time with devfsadm -C and scdidadm -C.
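Putting these pieces together, a typical sequence after presenting a new LUN to the cluster nodes might be (a sketch; run devfsadm on every node):
#         devfsadm
#         scgdevs
#         scdidadm -L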
The status of device groups can be checked with scstat -D. Devices may be listed with cldev list -v. They can be switched to a different node via a cldg switch -n target-node dgname command.
Monitoring for devices can be enabled and disabled by using commands like:
#         cldev monitor all
#         cldev unmonitor d#
#         cldev unmonitor -n nodename d#
#         cldev status -s Unmonitored

Parameters may be set on device groups using the cldg set command, for example:
#         cldg set -p failback=false dgname

A device group can be taken offline or placed online with:
#         cldg offline dgname
#         cldg online dgname

VxVM-Specific Issues
Since vxdmp cannot be disabled, we need to make sure that VxVM sees only one path to each disk. This is usually done by implementing MPxIO or a third-party product such as EMC PowerPath. The order of installation for such an environment would be:
·         Install Solaris and patches.
·         Install and configure multipathing software.
·         Install and configure Sun Cluster.
·         Install and configure VxVM.
If VxVM disk groups are used by the cluster, all nodes attached to the shared storage must have VxVM installed. Each vxio number in /etc/name_to_major must also be the same on each node. This can be checked (and fixed, if necessary) with the clvxvm initialize command. (A reboot may be necessary if the /etc/name_to_major file is changed.)
The clvxvm encapsulate command should be used if the boot drive is encapsulated (and mirrored) by VxVM. That way the /global/.devices information is set up properly.
The clsetup "Device Groups" menu contains items to register a VxVM disk group, unregister a device group, or synchronize volume information for a disk group. We can also re-synchronize with the cldg sync dgname command.
Solaris Volume Manager-Specific Issues
Sun Cluster allows us to specify metadb and partition information either in /dev/did format or in standard c#t#d# format. In general:
Use the local (c#t#d#) format for boot drive mirroring, in case we need to boot the node outside the cluster framework.
Use the cluster (/dev/did) format for shared disksets, because otherwise the controller numbers would have to be identical on every node.
Configuration information is kept in the metadatabase replicas. At least three local replicas are required to boot a node; these should be put on their own partitions on the local disks. They should be spread across controllers and disks to the degree possible. Multiple replicas may be placed on each partition; they should be spread out so that if any one disk fails, there will still be at least three replicas left over, constituting at least half of the total local replicas.
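For example, three local replicas spread across three disks might be created with (the slice names are illustrative):
#         metadb -a -f c0t0d0s7 c1t0d0s7 c2t0d0s7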
When disks are added to a shared diskset, database replicas are automatically added. These will always be added to slice 7, where they need to remain. If a disk containing replicas is removed, the replicas must be removed using metadb.

If fewer than 50% of the replicas in a diskset are available, the diskset ceases to operate. If exactly 50% of the replicas are available, the diskset will continue to operate, but will not be able to be enabled or switched on another node.
A mediator can be assigned to a shared diskset. The mediator data is contained within a Solaris process on each node and counts for two votes in the diskset quorum voting.
Standard c#t#d#s# naming should be used when creating local metadb replicas, since it will make recovery easier if we need to boot the node outside of a cluster context. On the other hand, /dev/did/rdsk/d#s# naming should be used for shared disksets, since otherwise the paths will need to be identical on all nodes.
Creating a new shared diskset involves the following steps:
#         metaset -s set-name -a -h node1-name node2-name (Create an empty diskset and add its hosts.)
#         metaset -s set-name -a -m node1-name node2-name (Create a mediator.)
#         metaset -s set-name -a /dev/did/rdsk/d# /dev/did/rdsk/d# (Add disks to the diskset.)
Check that the diskset is present in the cluster configuration:
#         cldev list -v
#         cldg status
#         cldg show set-name
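Once the diskset exists, metadevices are built inside it by adding the -s option to the usual SVM commands. A simple mirror might be created like this (the DID devices and metadevice names are illustrative):
#         metainit -s set-name d11 1 1 /dev/did/rdsk/d5s0
#         metainit -s set-name d12 1 1 /dev/did/rdsk/d6s0
#         metainit -s set-name d10 -m d11
#         metattach -s set-name d10 d12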
