SUNCLUSTER 3.2 & Above Versions - Notes I feel are important for a Sys Admin - Anand
It is suggested to move the UDLM port range to 7000-7032 for Sun Cluster 3.2 and higher.
To check which port UDLM is running on:
#scrgadm -pvv |grep udlm:port
To check if something is already running in the old port range:
#netstat -an|egrep '\.60[0-3][0-9]'
Also check that the new port range is clear:
#netstat -an|egrep '\.70[0-3][0-9]'
Now let's disable and unmanage the RAC framework:
#scswitch -F -g rac-framework-rg
#scswitch -n -j rac_new
#scswitch -n -j rac_udlm
#scswitch -n -j rac-framework-rg
Now we have to reboot the cluster so it comes up with rac_udlm offline, allowing the port change.
Remember to shut the cluster down with scshutdown from the master node, so that we get a consistent reboot with respect to the assigned quorum.
#scshutdown -y -g0
Once the nodes are up, make the port change from the master node to set the new port range:
#scrgadm -c -j rac_udlm-rs -x port=7000
Now bring the rac-framework resource group online:
#scswitch -Z -g rac-framework-rg
Verify the cluster is using the new UDLM port 7000.
#scrgadm -pvv |grep udlm:port
If you have any pending resources after the port change:
#clresource disable -R -g rac-framework-rg +
#clresource enable -R -g rac-framework-rg +
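Putting the steps above together, something like the following sequence (a recap only, assuming the same resource and group names used above) covers the whole port change:
#scrgadm -pvv |grep udlm:port (note the current UDLM port)
#scswitch -F -g rac-framework-rg
#scswitch -n -j rac_udlm (and any other resources in the group, as above)
#scshutdown -y -g0
(boot all nodes back into the cluster)
#scrgadm -c -j rac_udlm-rs -x port=7000
#scswitch -Z -g rac-framework-rg
#scrgadm -pvv |grep udlm:port (verify the new port)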
What is Split-Brain?
The term "Split-Brain" is
often used to describe the scenario when two or more co-operating processes in
a distributed system, typically a high availability cluster, lose connectivity
with one another but then continue to operate independently of each other,
including acquiring logical or physical resources, under the incorrect
assumption that the other process(es) are no longer operational or using the
said resources.
What does "co-operating"
mean?
Co-operating processes are those
that use shared or otherwise related resources, including accessing or
modifying shared system state, during the process of performing some
coordinated action, typically at the request of a client.
What's at risk?
The biggest risk following a
Split-Brain event is the potential for corrupting system state. There are
three typical causes of corruption:
1. The processes that were once co-operating prior to the Split-Brain event independently modify the same logically shared state, thus leading to conflicting views of system state. This is often called the "multi-master problem".
2. New requests are accepted after the Split-Brain event and then performed on potentially corrupted system state (thus potentially corrupting system state even further).
3. When the processes of the distributed system "rejoin" together it is possible that they have conflicting views of system state or resource ownerships. During the process of resolving conflicts, information may be lost or become corrupted.
Examples of potential corruption include creating multiple copies of the same information, updating the same
information multiple times, deleting information, creating multiple events for
a single operation, processing an event multiple times, starting duplicate
services or suspending existing services.
What if the processes aren't
co-operating?
If the processes in a distributed
system aren't co-operating in any way (ie: they don't use any shared
resources), Split-Brain, or its effects, may not occur.
I thought Split-Brain was all about
physical network infrastructure and connectivity failures. Can it really occur
at the process level, on the same physical server?
While network infrastructure failure
is one of the more common causes of Split-Brain, the loss of communication or
connectivity between two or more processes on a single physical server, even
running on a single processor, may also cause a Split-Brain event.
For example: if one of two co-operating processes on a server is swapped out for a long period of time, longer than the configured network or connectivity time-out between the processes, a Split-Brain may occur if each process continues to operate independently, especially when the swapped-out process returns to normal operation (ie: the swapped process does not take into account that it has been unavailable for a long period of time).
Similarly, if a process is interrupted for a long period of time, say due to an unusually long Garbage Collection or when a physical processor is unavailable to a process due to heavy contention on virtualized infrastructure, the said process may not respond to communication requests from another process, and thus a Split-Brain may occur.
Split-Brain does not require
physical network infrastructure to occur.
Garbage Collection can't cause a
Split-Brain. Right?
Unfortunately not. As explained above, an excessively long Garbage Collection or regular back-to-back Garbage Collections may make a process seem unavailable to other processes in a distributed system and thus not be in a position to respond to communication requests.
Split-Brain will only ever occur
when a system has two processes. With three or more processes it can
easily be detected and resolved. Right?
Unfortunately not. Split-Brain
scenarios may just as easily occur when there are n processes (where n > 2)
in a distributed system. For example: it's possible that all n processes
in a distributed system, especially on heavily loaded hardware, could be
swapped out at the same point in time, thus effectively losing connectivity
with each other, thus potentially creating an n-way Split-Brain.
As Split-Brains occur as the result
of incorrectly performing some action due to an observed communication failure,
the number of "pieces" of "brains" may be greater than two.
Splits always occur
"down-the-middle". Right?
Unfortunately not. Following on from above, as most distributed systems consist of large numbers of processes (large values of n), splits rarely occur "down-the-middle", especially when n is an odd number! In fact, even when n is an even number, there
is absolutely no guarantee that a split will contain two equally sized
collections of processes. For example: A system consisting of five
processes may be split such that one side of the system (ie: brain) may have
three processes and the other side may have two processes. Alternatively
it may be split such that one side has four processes and the other has just
one process. In a system with six processes, a split may occur with four
processes on one side and just two on another.
Stateless architectures don't suffer
from Split-Brain. Right?
Correct. Systems that don't
have shared state or use shared resources typically don't suffer from
Split-Brain.
Client+Server architectures don't
suffer from Split-Brain. Right?
Correct - if and only if the Server component of the architecture operates as a single process. However, if the Server component of an architecture operates as a collection of processes, it's possible that such an architecture will suffer from Split-Brain.
All Stateful architectures suffer
from Split-Brain. Right?
The "statefulness" of an
architecture does not imply that it will suffer from Split-Brain. It is
very possible to define an architecture that is stateful and yet avoids the
possibility of Split-Brain (as defined above), by ensuring no shared resources
are accessed across processes.
The solution is simple - completely
avoid Split-Brains by waiting longer for communication to recover, doing more
checks and avoiding assumptions. Right?
Unfortunately in most Split-Brain
scenarios all that can be observed is an inability to send and/or receive
information. It is from these observations that systems must make assumptions
about a failure. When these assumptions are incorrect, Split-Brain may
occur.
While waiting longer for a communication
response may seem like a reasonable solution, the challenge is not in waiting.
The waiting part is easy. The challenge is determining "how long to
wait" or "how many times to retry".
Unfortunately to determine "how
long to wait" or "retry" we need to make some assumptions about
process connectivity and ultimately process availability. The challenge
here is that those assumptions may quickly become invalid, especially in a
dynamically or arbitrarily loaded distributed system. Alternatively if there
is a sudden spike in the number of requests, processes may pause more
frequently (especially in the case of Garbage Collection or in virtualized
environments) and thus increase the potential for a Split-Brain scenario to
form.
Split-Brain only occurs in systems
that use unreliable network protocols (ie: protocols other than TCP/IP). Right?
The network protocol used by a
distributed system, whether a reliable protocol like TCP/IP or unreliable
protocol like UDP (unicast or multicast) does not preclude a Split-Brain from
occurring. As discussed above, a physical network is not required for a
Split-Brain scenario to develop.
TCP/IP decreases the chances of a
Split-Brain occurring. Right?
Unfortunately not. Following on from above, the operational challenge with the use of TCP/IP within a distributed system is that the protocol does not directly expose internal communication failures to the system, for example the ability to send from a process but not receive (often called "deafness") or the ability to receive but not send (often called "muteness"). Instead, TCP/IP protocol failures are only reported after a fixed, typically operating-system-designated time-out (usually configurable but typically set to seconds or even minutes by default). However, the use of unreliable protocols such as UDP may highlight the communication problems sooner, including the ability to detect "deafness" or "muteness", and thus allow a system to take corrective action sooner.
Having all processes in a
distributed system connected via a single physical switch will help prevent
Split-Brain. Right?
While deploying a distributed system such that all processes are interconnected via a single physical switch may seem to reduce the chances of a Split-Brain occurring, the probability of a switch failing atomically (all at once) is extremely low. Typically when a switch fails, it does so in an unreliable and degrading manner; that is, some components of the switch will continue to remain operational whereas others may be intermittent. Thus, taken in their entirety, switches become intermittent before they fail completely (or are shut down completely for maintenance).
What are the best practices for
dealing with Split-Brain?
While the problem of Split-Brains
can't be solved using a generalized approach, "prevention" and
"cure" are possible.
There are essentially six approaches that may be used to prevent a Split-Brain from occurring:
1. Use high quality and reliable network infrastructure.
2. Provide multiple paths of communication between processes, so that a single observed communication failure does not trigger a Split-Brain.
3. Avoid overloading physical resources so that processes are not swapped out for long periods of time.
4. Avoid unexpectedly long Garbage Collection pauses.
5. Ensure communication time-outs are suitably long, to prevent a Split-Brain occurring "too early" due to (3) or (4).
6. Architect an application so that it uses few shared resources (this is rarely possible).
However, even after implementing all of these approaches, a Split-Brain may still occur, in which case we need to focus on "cure".
There are essentially four
approaches to "curing" a Split-Brain:
Fail-Fast: As soon as a Split-Brain scenario is detected, the entire
system or suitable processes in the system are immediately shutdown to avoid
the possibility of corruption.
Isolation is a weak form of Fail-Fast. Instead of shutting down processes of a system, they are simply isolated. When the Split-Brain is recovered, the isolated processes are re-introduced into the system as if they were new (ie: they drop any previously held information/assumptions to avoid corruption). After the Split-Brain event has occurred, the isolated processes may continue to perform work, but on re-introduction to the system, current information/processing will be lost.
Fencing is a stronger form of isolation, but still a weaker form of
Fail-Fast. Fencing requires an additional constraint over Isolation in
that fenced-processes must stop execution immediately (and release resources),
rather than continue to operate after a Split-Brain has occurred.
Resolve Conflicts (assumes the above approaches have not been used)
When the communication channels have
been recovered, it's highly possible that there are conflicts at the resource
level - including conflicts in system state. By providing a
"Conflict Resolving" interface, developers may provide an application
specific mechanism for resolving said conflicts, thus allowing a system to
continue to operate. Of course, this is completely application
dependent and development intensive, but provides the best way to recover.
How can a Split-Brain event be
detected? How are they defined?
Unfortunately there is no simple way
to define or detect when a Split-Brain has occurred. While it's fairly
obvious and perhaps trivial in a system composed of only two processes, these
are increasingly rare. Most distributed systems have tens, if not thousands, of processes.
For example; if four processes in a
five process system collectively lose contact with a single process, does that
mean a four-to-one Split-Brain has occurred, or simply that a single process
has failed?
The common solution to this problem is to define what is called a Quorum, the idea being that those processes not belonging to the quorum should Fail-Fast or be Isolated. The typical way to define a quorum is to specify the minimum number of co-operating processes that must be collectively available to continue operating.
Hence for the above example, with a
quorum of "three processes", the system would treat a failure of a
single process not as a Split-Brain, but simply as a lost process.
However if the quorum were incorrectly defined to be "a single
process", a Split-Brain would occur - with a four to one split.
Obviously depending on size-based
quorums may be problematic as it's yet another assumption we need to make -
"how big should a quorum be?". Often however the definition of
a quorum is less about the number of processes that are collectively available,
but instead more about the roles or locality of the said processes.
For example; if three processes in a
five process system collectively lose contact with two other processes, but
those two processes remain in contact, we essentially have a three-to-two
Split-Brain. With a quorum that is defined using a "size-based"
approach, the two processes may be failed-fast/isolated. However let's
consider that the three processes have also lost connectivity with a valuable
resource (say a database or network attached storage device), but the other two
processes have not. In this situation it's often preferable that the two
surviving processes should remain available and the other three processes
should be failed-fast/isolated.
Further, and as previously
mentioned, it's important to remember that failures in communication between
processes are rarely observed to occur at the same time. Rather they are
observed over a period of time, perhaps seconds, minutes or even hours.
Thus when discussing
"when" a Split-Brain occurs, we usually need to consider the entire
period of time, during which there may be multiple failure and recovery events,
to fully conclude that a Split-Brain has occurred.
How does Oracle Coherence deal with
Split-Brain?
Internally Oracle Coherence uses a
variety of both proprietary (UDP uni-cast and multi-cast based) and standard
(TCP/IP) network technologies for inter-process communication and maintaining
system health. These technologies are combined in multiple ways to enable
highly scalable and high-performance one-to-one and one-to-many communication
channels to be established and reliably observed, across hundreds (even
thousands if you really like) of processes, with little CPU or network
overhead.
For example, through the combined use of these technologies Coherence can easily detect and appropriately deal with remote garbage collection across a system in under a second (using commodity 1Gb switch infrastructure).
The major philosophy of this approach, and its delivered advantage, is ideally to "prevent" a Split-Brain from occurring as much as possible. Should a Split-Brain occur, say due to a switch failure, Coherence does the following:
Uses Isolation to ensure reliable
system communication between the individual fragments of the Split-Brain.
Without reliable communication, data integrity within the fragments cannot be guaranteed (or recovered).
Uses Fencing around processes that
are deemed unresponsive (ie: deaf), but continue to communicate with other
processes in the system. This is a form of dynamic blacklisting to ensure
system-wide communication reliability and prevent Fenced processes corrupting
state.
Raises real-time programmatic events
concerning process health. This enables developers to provide custom Fail-Fast
algorithms based on system health changes. In Coherence these events are
handled using MemberListeners.
Raises real-time programmatic events
concerning the locality and availability of partitions (ie: information). This
enables developers to provide custom Fail-Fast algorithms based on partitions
being moved or becoming unavailable due to catastrophic system failure. In
Coherence these events may be handled using Backing Map Listeners and/or
Partition Listeners.
Uses a simple "largest side
wins" size-based quorum rule to resolve resource (and state) ownership
when a Split-Brain is recovered.
Amnesia Scenario:
Node node-1 is shut down.
Node node-2 crashes and will not
boot due to hardware failure.
Node node-1 is rebooted but stops
and prints out the messages:
Booting as part of a cluster
NOTICE: CMM: Node node-1 (nodeid = 1) with votecount = 1 added.
NOTICE: CMM: Node node-2 (nodeid = 2) with votecount = 1 added.
NOTICE: CMM: Quorum device 1 (/dev/did/rdsk/d4s2) added; votecount = 1, bitmask of nodes with configured paths = 0x3.
NOTICE: CMM: Node node-1: attempting to join cluster.
NOTICE: CMM: Quorum device 1 (gdevname /dev/did/rdsk/d4s2) can not be acquired by the current cluster members. This quorum device is held by node 2.
NOTICE: CMM: Cluster doesn't have operational quorum yet; waiting for quorum.
Node node-1 cannot boot completely
because it cannot achieve the needed quorum vote count.
NOTE: for Oracle Solaris Cluster 3.2 update 1 and above with Solaris 10, the boot continues in non-cluster mode after a timeout.
In the above case, node node-1
cannot start the cluster due to the amnesia protection of Oracle Solaris
Cluster. Since node node-1 was not a member of the cluster when it was shut
down (when node-2 crashed) there is a possibility it has an outdated CCR and
should not be allowed to automatically start up the cluster on its own.
The general rule is that a node can only start the cluster if it was part of the cluster when the cluster was last shut down. In a multi-node cluster it is possible for more than one node to be "the last" to leave the cluster.
Resolution:
If you have a 3-node cluster, start with a node that is suitable for starting the cluster, e.g. a node connected to the majority of the storage. In this example node-1 represents this first node.
1. Stop node-1 and reboot in
non-cluster mode. (Single user not necessary, only faster)
ok boot -sx
2. Make a backup of
/etc/cluster/ccr/infrastructure file or /etc/cluster/ccr/global/infrastructure
depending upon cluster and patch revisions noted in UPDATE_NOTE #1 below.
# cd /etc/cluster/ccr
# /usr/bin/cp infrastructure
infrastructure.old
Or if UPDATE_NOTE #1 applies
# cd /etc/cluster/ccr/global
# /usr/bin/cp infrastructure
infrastructure.old
3. Get this node's id.
# cat /etc/cluster/nodeid
4. Edit the
/etc/cluster/ccr/infrastructure file or /etc/cluster/ccr/global/infrastructure
depending upon cluster and patch revisions noted in UPDATE_NOTE #1 below.
Change the quorum_vote to 1 for the
node that is up (node-1, nodeid = 1).
cluster.nodes.1.name node-1
cluster.nodes.1.state enabled
cluster.nodes.1.properties.quorum_vote 1
For all other nodes and any Quorum Device, set the votecount to zero.
Other nodes, where N is any node id but the one edited above:
cluster.nodes.N.properties.quorum_vote 0
Quorum Device(s), where Q is the quorum device id, which is internal to the cluster code:
cluster.quorum_devices.Q.properties.votecount 0
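As an illustration only, for a hypothetical three-node cluster where node-1 (nodeid 1) is the node being started and there is a single quorum device with id 1, the edited entries would look something like:
cluster.nodes.1.properties.quorum_vote 1
cluster.nodes.2.properties.quorum_vote 0
cluster.nodes.3.properties.quorum_vote 0
cluster.quorum_devices.1.properties.votecount 0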
5. Regenerate the checksum of the
infrastructure file by running:
# /usr/cluster/lib/sc/ccradm -i
/etc/cluster/ccr/infrastructure -o
Or if UPDATE_NOTE #1 applies
# /usr/cluster/lib/sc/ccradm -i
/etc/cluster/ccr/global/infrastructure -o
NOTE: If running SC 3.3, SC 3.2u3, or a 3.2 Cluster core patch equal to or greater than 126105-36 (5.9), 126106-36 (5.10) or 126107-36 (5.10 x86), the ccradm command would be
# /usr/cluster/lib/sc/ccradm recover -o infrastructure
6. Boot node node-1 into the
cluster.
# /usr/sbin/reboot
7. The cluster is now started, so as long as the other nodes have been repaired they can be booted up and join the cluster again. When these nodes join the cluster their votecount will be reset to its original value, and if a node is connected to any quorum device its votecount will also be reset.
UPDATE_NOTE #1: If running Solaris Cluster 3.2u2 or higher, the directory path /etc/cluster/ccr is replaced with /etc/cluster/ccr/global. The same applies if running a Cluster core patch equal to or greater than 126105-27 (5.9), 126106-27 (5.10) or 126107-27 (5.10 x86).
Troubleshooting Steps Outline for
node not booting into cluster
1. Verify booting from the right
system disk
2. Verify actual cluster doesn't
start issue
3. Verify correct boot syntax
4. Verify not “invalid CCR table”
5. Verify not “waiting for
operational quorum” amnesia
6. Confirm private interfaces have
started
7. Verify that the node did not
already join the cluster with a date/time in the future.
8. Verify if other errors are
reported on the console.
Troubleshooting Steps in Detail:
1) Verify booting from the right
system disk
In many cases, there are several
system disks attached to a system: the current disk,
the mirror of the current disk,
backup of system disk, system disk with a previous solaris version, ...
Be sure that you are booting from
the right disk.
If you are using an alias, check
that the devalias points to the right system disk.
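For example, from the OpenBoot PROM you can check the boot device and the defined aliases (output and alias names vary per system):
ok printenv boot-device
ok devalias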
2) Verify actual cluster doesn't
start issue
Ensure the system will boot in
non-cluster mode.
This step will verify the hardware
and Solaris operating system are functional.
A system may be booted in
non-cluster mode by adding the '-x' boot flag.
3) Verify correct boot syntax
Check that you used the right boot syntax to boot the node, including any alias you boot from.
In order to boot a node in cluster mode, you must not use the -x flag, and you must see the following message on the console:
Booting as part of a cluster
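For example, assuming a hypothetical 'disk' alias pointing to the right system disk:
ok boot disk (boots the node in cluster mode)
ok boot disk -x (boots the node in non-cluster mode)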
4) Verify not “invalid CCR table”
The cluster uses several tables
stored as files in the CCR repository.
If one of these tables is corrupted, it could prevent the node from booting into the cluster.
You may see errors like the
following ones on the console:
* UNRECOVERABLE ERROR: /etc/cluster/ccr/infrastructure file is corrupted. Please reboot in noncluster mode (boot -x) and repair.
* WARNING: CCR: Invalid CCR table : infrastructure
* WARNING: CCR: Invalid CCR table : dcs_service_xx
* WARNING: CCR: Table /etc/cluster/ccr/directory has invalid checksum field.
* UNRECOVERABLE ERROR: Sun Cluster boot: Could not initialize cluster framework. Please reboot in non cluster mode (boot -x) and repair.
* ...
In such case, further
troubleshooting is required to understand the failure. For additional support
contact Oracle Support.
5) Verify not “waiting for
operational quorum” amnesia
A cluster must reach an operational
quorum in order to start.
It means that we must have enough
votes (more than half of all configured votes) to be able to start.
A node will report the "waiting
for operational quorum” message during boot if that quorum cannot be reached.
NOTICE: CMM: Node ita-v240c: attempting to join cluster.
NOTICE: CMM: Cluster doesn't have operational quorum yet; waiting for quorum.
This happens when the node is not
able to talk to the other cluster nodes through the private links and is not
able to access the quorum device(s).
This usually happens when the other
cluster node has been the last one being stopped and we are rebooting another
node first.
In that case check that you are
booting the nodes in the right order.
If the other node owning the quorum
device(s) cannot be booted due to other problems, then you need to recover from
amnesia.
6) Verify that the private
interfaces have started.
A node will not be able to join the
cluster if the private links do not start.
You'll then see messages like the
following ones:
NOTICE: clcomm: Adapter bge3 constructed
NOTICE: clcomm: Adapter bge2 constructed
NOTICE: CMM: Node ita-v240c: attempting to join cluster.
NOTICE: CMM: Cluster doesn't have operational quorum yet; waiting for quorum.
NOTICE: clcomm: Path ita-v240c:bge3 - ita-v240b:bge3 errors during initiation
NOTICE: clcomm: Path ita-v240c:bge2 - ita-v240b:bge2 errors during initiation
WARNING: Path ita-v240c:bge3 - ita-v240b:bge3 initiation encountered errors, errno = 62. Remote node may be down or unreachable through this path.
WARNING: Path ita-v240c:bge2 - ita-v240b:bge2 initiation encountered errors, errno = 62. Remote node may be down or unreachable through this path.
The node won't be able to talk to
the other node(s) and continue its boot until it prints:
NOTICE: clcomm: Path ita-v240c:bge3 - ita-v240b:bge3 online
NOTICE: clcomm: Path ita-v240c:bge2 - ita-v240b:bge2 online
7) Verify that the node did not
already join the cluster with a date/time in the future.
When node C joins node B in a
cluster, node B saves the boot date/time of node C.
It is used by node B as node C's
incarnation number.
May 19 03:12:05 ita-v240b cl_runtime: [ID 537175 kern.notice] NOTICE: CMM: Node ita-v240c (nodeid: 1, incarnation #: 1242748936) has become reachable.
May 19 03:12:05 ita-v240b cl_runtime: [ID 377347 kern.notice] NOTICE: CMM: Node ita-v240c (nodeid = 1) is up; new incarnation number = 1242748936.
Now assume that node C joined the
cluster with the wrong date/time being set, say a few hours in the future.
The administrator realizes that and
decides to reboot node C with a fixed date/time.
Node B will then see node C trying
to join the cluster with an incarnation number before the one it already knows.
Node B then thinks that this is a stale version of node C trying to join the cluster; it will refuse that and report the following message:
May 19 03:21:08 ita-v240b cl_runtime: [ID 182413 kern.warning] WARNING: clcomm: Rejecting communication attempt from a stale incarnation of node ita-v240c; reported boot time Tue May 19 08:21:01 GMT 2009, expected boot time Tue May 19 16:02:16 GMT 2009 or later.
In older SunCluster versions (3.0 for example), this message is not printed at all.
You may find more information by
dumping a cluster debug buffer on node B.
ita-v240b# mdb -k
Loading modules: [ unix krtld
genunix dtrace specfs ufs sd pcisch ip hook neti sctp arp usba fcp fctl qlc nca
ssd lofs zfs random logindmux ptm cpc sppp crypto fcip nfs ]
>*udp_debug/s
.....
th 3000305f6c0 tm 964861043: PE
60012f10440 peer 1 laid 2 udp_putq stale remote incarnation 1242721261
.....
Node C comes with an incarnation of
1242721261 which is before the already known incarnation of 1242748936.
Node C will only be allowed to join
the cluster once its date/time goes beyond the date/time when it first joined
the cluster.
To fix this, wait for the time to go
beyond the first date/time and reboot node C.
This is not always possible if the
date was too far in the future (days, weeks, months or years!).
The only other fix is to reboot the
whole cluster.
May 19 03:46:14 ita-v240b cl_runtime: [ID 537175 kern.notice] NOTICE: CMM: Node ita-v240c (nodeid: 1, incarnation #: 1242722767) has become reachable.
May 19 03:46:14 ita-v240b cl_runtime: [ID 377347 kern.notice] NOTICE: CMM: Node ita-v240c (nodeid = 1) is up; new incarnation number = 1242722767.
May 19 03:46:14 ita-v240b cl_runtime: [ID 377347 kern.notice] NOTICE: CMM: Node ita-v240b (nodeid = 2) is up; new incarnation number = 1242722750.
8) Verify if other errors are
reported on the console.
The node can fail to join the
cluster due to other errors seen during the boot sequence in the various
scripts being run.
This can block the boot sequence
until the issue is fixed.
This includes:
- failure to mount the global devices filesystem
- mount point missing
- invalid /etc/vfstab entry
Fix the error seen and reboot the
node in the cluster.
Cluster Panics due to Reservation
Conflict:
A major issue for clusters is a
failure that causes the cluster to become partitioned (called split brain).
When split brain occurs, not all nodes can communicate, so individual nodes or
subsets of nodes might try to form individual or subset clusters. Each subset
or partition might believe it has sole access and ownership to the multihost
devices. When multiple nodes attempt to write to the disks, data corruption can
occur. In order to avoid data corruption, Solaris Cluster software uses SCSI reservation methods to implement failure fencing and to prevent nodes leaving the cluster from accessing shared devices.
When the hosts are in the cluster they are able to write their keys on shared devices and access the devices. However, when a node becomes unreachable, the cluster framework will drop the node from the cluster and have the dropped node's keys removed. The cluster framework thereby fences that node out and limits its access to the shared devices.
There are three types of SCSI reservation methods: SCSI2, SCSI3 and
SCSI2/SCSI3 Persistent Group Reservation Emulation (PGRE).
In a SCSI2 type reservation, only one host at a time can reserve and access the disk, whereas in SCSI3 and SCSI2/SCSI3 PGRE every host connected to the device can write its keys and access the disk. Also, SCSI2 reservations are cleared by reboots whereas the other two methods are persistent and survive reboots.
The Solaris Cluster framework uses SCSI2 with PGRE for storage devices with two physical paths to the hosts, and SCSI3 for devices with more than two paths to the hosts.
Reservation conflict arises when a
node, not in communication with the surviving node(s) in the cluster, either a
fenced out member of the cluster or a non-member host from another cluster
tries to access a shared device it has been fenced off from, i.e. its key is
not present on the device. In such a case, if the failfast mechanism is enabled on the offending node (e.g. when the node is running in the cluster but unaware that it has been fenced off due to unforeseen circumstances such as a hung state), the node will panic by design in order to limit its access to the shared device.
Solaris Cluster 3.x node panics with "SCSI Reservation
Conflict" with the following messages:
Nov 14 15:14:28 node2 cl_dlpitrans: [ID 624622 kern.notice] Notifying cluster that this node is panicking
Nov 14 15:14:28 node2 unix: [ID 836849 kern.notice]
Nov 14 15:14:28 node2 ^Mpanic[cpu8]/thread=2a100849ca0:
Nov 14 15:14:28 node2 unix: [ID 747640 kern.notice] Reservation Conflict
Nov 14 15:14:28 node2 unix: [ID 100000 kern.notice]
To troubleshoot this we have to do a few verifications, as given below:
1. Verify that no disk is accessible
by hosts that are not part of this cluster:
If a disk is used by another host(s)
of another cluster or host(s) using SCSI reservation methods, even if it is not
being used or mounted, the keys on the disk might be altered without the
knowledge of any of the nodes in our cluster.
2. Check whether EMC Gate Keeper or VCMDB disks are accessible by cluster nodes:
If EMC Gate Keeper or VCMDB disks are exposed to Sun Cluster, the
cluster nodes might panic with SCSI Reservation Conflict panics. To avoid such
circumstances, one might "blacklist" VCMDB disks. Please consult EMC
for appropriate procedure.
We have to mask or blacklist them in the /etc/fp.conf file.
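As a sketch only, and assuming the pwwn-lun-blacklist property supported by the Solaris fp driver (the WWN and LUN below are placeholders; the exact file location and syntax depend on the Solaris release, so verify with EMC/Oracle before applying):
pwwn-lun-blacklist="2100001234567890,0";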
3. Check the time stamps of the messages on all the nodes during and prior to the panic:
When the nodes booted into a cluster stop receiving heartbeats from the other node(s) via the private interconnects, they have no way of telling whether the problem was due to:
- a problem with the private interconnect paths, e.g. cable problems,
- the other node(s) going down as part of a normal shutdown, i.e. "init 0",
- the other node(s) having panicked, or
- the other node(s) being hung.
Removing SCSI Reservations
There are two types of SCSI reservations: SCSI-2 Reservations and SCSI-3 Persistent Reservations. Solaris Cluster uses only one SCSI reservation type for a given shared storage device. However, both SCSI reservation types may be used at the same time in the same cluster.
To remove SCSI-2 Reservations use
the 'scsi' command with the 'release' option as shown below. This
command, however, must be executed from the system that owns the reservation in
order to successfully remove the reservation.
# /usr/cluster/lib/sc/scsi -c disfailfast -d /dev/did/rdsk/d#s2
# /usr/cluster/lib/sc/scsi -c
release -d /dev/did/rdsk/d#s2
The '#' character is a reference to
the DID device number of the disk from which you wish to remove a SCSI-2
Reservation. The standard Solaris disk device reference, /dev/rdsk/c#t#d#s2,
can also be used in place of the DID device reference.
Note: Sun Microsystems recommends
that you first disable the failfast mechanism using the 'scsi' command with the
'disfailfast' option, as shown above, before removing any SCSI reservations.
The system will panic if the system is running in cluster mode and you do
not disable the failfast mechanism and you mistakenly attempt a release of a
SCSI-2 Reservation when the SCSI reservation was in fact a SCSI-3 Persistent
Reservation.
An alternative is to use the 'reserve' command with the 'release' option to
remove a SCSI-2 Reservation as shown below.
# /usr/cluster/lib/sc/reserve
-c disfailfast -z /dev/did/rdsk/d#s2
# /usr/cluster/lib/sc/reserve
-c release -z /dev/did/rdsk/d#s2
Note: SCSI-2 Reservations will be
removed automatically if the system that owns the SCSI-2 Reservations is shut
down or power cycled. Likewise, SCSI-2 Reservations will be removed
automatically if the storage devices that have SCSI-2 Reservations on them are
shut down or power cycled. There are other methods that also remove
SCSI-2 Reservations without resorting to the commands presented above to do so.
For example, a SCSI bus reset will remove SCSI-2 Reservations from the
storage devices affected by the reset.
Example 1:If DID device d15 has a SCSI-2 Reservation you wish to remove, use
the following commands.
# /usr/cluster/lib/sc/scsi -c
disfailfast -d /dev/did/rdsk/d15s2
do_enfailfast returned 0
# /usr/cluster/lib/sc/scsi -c
release -d /dev/did/rdsk/d15s2
do_scsi2_release returned 0
Alternatively, if DID device d15
corresponds to disk device c3t4d0 on the cluster node where the commands will
be executed, you could use the following commands instead.
# /usr/cluster/lib/sc/reserve
-c disfailfast -z /dev/rdsk/c3t4d0s2
do_enfailfast returned 0
# /usr/cluster/lib/sc/reserve
-c release -z /dev/rdsk/c3t4d0s2
do_scsi2_release 0
Note, again, that you can use either
the DID device reference or the Solaris disk device reference with both the
'scsi' and the 'reserve' commands.
Example 2:If DID device d15 has a SCSI-3 Persistent Reservation and you
mistakenly execute the 'scsi' command with the 'release' option without
disabling the failfast mechanism, the system will panic if it is running in
cluster mode. So, if you execute the following command against a device
that has a SCSI-3 Persistent Reservation and the command prompt does not
return, then your system has probably experienced a 'Reservation Conflict'
panic.
# /usr/cluster/lib/sc/scsi -c
release -d /dev/did/rdsk/d15s2
Check the system console for a panic
message like the following.
panic[cpu0]/thread=2a1003e5d20: Reservation Conflict
If, however, DID device d15 has a
SCSI-3 Persistent Reservation and you mistakenly use both commands presented at
the beginning of this section, you should see the following results.
# /usr/cluster/lib/sc/scsi -c disfailfast -d /dev/did/rdsk/d15s2
do_enfailfast returned 0
# /usr/cluster/lib/sc/scsi -c
release -d /dev/did/rdsk/d15s2
do_scsi2_release returned -1
The failfast mechanism will be
disabled for DID device d15, but the SCSI-3 Persistent Reservation will remain
intact. The 'release' option has no effect on a SCSI-3 Persistent
Reservation.
Removing SCSI-3 Persistent
Reservations
To remove SCSI-3 Persistent
Reservations and all the reservation keys registered with a device use the
'scsi' command with the 'scrub' option as shown below. This command does
not need to be executed from the system that owns the reservation or from a
system running in the cluster.
# /usr/cluster/lib/sc/scsi -c disfailfast -d /dev/did/rdsk/d#s2
# /usr/cluster/lib/sc/scsi -c
scrub -d /dev/did/rdsk/d#s2
The '#' character is a reference to
the DID device number of the disk from which you wish to remove a SCSI-3 Persistent
Reservation. The standard Solaris disk device reference,
/dev/rdsk/c#t#d#s2, can also be used in place of the DID device reference.
Note: Sun Microsystems recommends that you first disable the failfast mechanism
using the 'scsi' command with the 'disfailfast' option, as shown above, before
removing any SCSI reservations.
Before executing the scrub operation, it is recommended that you confirm there
are reservation keys registered with the device. To confirm there are
reservation keys registered with the device execute the 'scsi' command with the
'inkeys' option and then with the 'inresv' option as shown below.
# /usr/cluster/lib/sc/scsi -c
inkeys -d /dev/did/rdsk/d#s2
# /usr/cluster/lib/sc/scsi -c
inresv -d /dev/did/rdsk/d#s2
Note: SCSI-3 Persistent Reservations are 'persistent' by design which means
that any reservation keys registered with a storage device or any reservation
placed on a storage device are to be retained by the storage device, even if
powered off, and the only way to remove them is by issuing a specific SCSI
command to do so. In other words, there are no automatic methods whereby
a SCSI-3 Persistent Reservation will be removed, such as a reset or power cycle
of the device. Sun Cluster uses the Persistent Reserve Out (PROUT) SCSI
command with the PREEMPT AND ABORT service action to remove SCSI-3 Persistent
Reservations. This PROUT SCSI command is also programmed into the 'scsi'
command through the 'scrub' option for your use when absolutely necessary.
Note: Most storage systems also provide a method to manually remove SCSI-3
Persistent Reservations by accessing or logging into the storage device
controller and executing commands defined by the storage vendor for this
purpose.
Example 3:If DID device d27 has a SCSI-3 Persistent Reservation you wish to
remove, use the following commands.
The first command displays the
reservation keys registered with the device.
# /usr/cluster/lib/sc/scsi -c
inkeys -d /dev/did/rdsk/d27s2
Reservation keys(3):
0x44441b0800000003
0x44441b0800000001
0x44441b0800000002
The second command displays the
SCSI-3 Persistent Reservation owner and the reservation type.
# /usr/cluster/lib/sc/scsi -c inresv -d /dev/did/rdsk/d27s2
Reservations(1):
0x44441b0800000001
type ---> 5
Note: Type 5 corresponds to a
Write-Exclusive Registrants-Only type of SCSI-3 Persistent Reservation.
This is the SCSI-3 Persistent Reservation type used in Sun Cluster.
The third command removes the SCSI-3 Persistent Reservation and all the
reservation keys registered with the device.
# /usr/cluster/lib/sc/scsi -c
scrub -d /dev/did/rdsk/d27s2
Reservation keys currently on
disk:
0x44441b0800000003
0x44441b0800000001
0x44441b0800000002
Attempting to remove all keys
from the disk...
Scrubbing complete, use
'/usr/cluster/lib/sc/scsi -c inkeys -d /dev/did/rdsk/d27s2' to verify success
When executing the 'scsi' command
with the 'scrub' option, the output first displays the keys currently
registered with the device and then recommends that you check the device again
to confirm that the registered keys have been removed. This command and
the result is shown below.
# /usr/cluster/lib/sc/scsi -c
inkeys -d /dev/did/rdsk/d27s2
Reservation keys(0):
You should also confirm that the
SCSI-3 Persistent Reservation has also been removed.
# /usr/cluster/lib/sc/scsi -c
inresv -d /dev/did/rdsk/d27s2
Reservations(0):
Removing PGRe Keys
Sun Cluster uses the SCSI-3
Persistent Reservation keys that are registered with the quorum device when the
Cluster Membership Monitor (CMM) determines the quorum vote tally. How
these keys are used by the CMM is beyond the scope of this document, however
when SCSI-2 Reservations are used there are no reservation keys.
Therefore, when SCSI-2 Reservations are used with a quorum device, Sun
Cluster uses an emulation mode to store SCSI-3 Persistent Reservation keys on
the quorum device for when the CMM needs to use them.
This mechanism is referred to
as Persistent Group Reservation emulation, or PGRe, and even though the reservation
keys are the same ones used with SCSI-3 compliant devices they will be referred
to in the remainder of this document as emulation keys because of the special,
non-SCSI, way they are handled and stored in this case.
Note: Emulation keys are stored in a vendor defined location on the disk and do
not interfere with or take away from the storage of user data on the quorum
device.
Note: The PGRe mechanism is used only by Sun Cluster and the way emulation keys
are handled and stored are defined only by Sun Cluster. The PGRe
mechanism and the emulation keys are not a part of the SCSI Specification.
In other words, the PGRe mechanism and the way emulation keys are handled
and stored do not have any operational effect as it relates to what is defined
in the SCSI Specification documents. Therefore, any desire or need to
remove emulation keys must be associated only with Sun Cluster itself and
should have nothing whatsoever to do with the storage devices used as quorum
devices
To remove emulation keys use the 'pgre' command with the 'pgre_scrub' option as
shown below.
# /usr/cluster/lib/sc/pgre -c
pgre_scrub -d /dev/did/rdsk/d#s2
The '#' character is a reference to
the DID device number of the disk from which you wish to remove emulation keys.
The standard Solaris disk device reference, /dev/rdsk/c#t#d#s2, can also
be used in place of the DID device reference.
Before executing the pgre_scrub
operation, it is recommended that you confirm there are reservation keys
registered with the device. To confirm there are reservation keys
registered with the device execute the 'pgre' command with the 'pgre_inkeys'
option and then with the 'pgre_inresv' option as shown below.
# /usr/cluster/lib/sc/pgre -c
pgre_inkeys -d /dev/did/rdsk/d#s2
# /usr/cluster/lib/sc/pgre -c
pgre_inresv -d /dev/did/rdsk/d#s2
Example 4:If DID device d4 has emulation keys you wish to remove, use the
following commands.
The first command displays the
emulation keys that have been registered with the device.
# /usr/cluster/lib/sc/pgre -c
pgre_inkeys -d /dev/did/rdsk/d4s2
key[0]=0x447f129700000001.
The second command displays which
emulation key is the reservation owner.
# /usr/cluster/lib/sc/pgre -c
pgre_inresv -d /dev/did/rdsk/d4s2
resv[0]:
key=0x447f129700000001.
The third command removes the
emulation keys from the device.
# /usr/cluster/lib/sc/pgre -c
pgre_scrub -d /dev/did/rdsk/d4s2
Scrubbing complete. Use
'/usr/cluster/lib/sc/pgre -c pgre_inkeys -d /dev/did/rdsk/d4s2'
to verify success.
When executing the 'pgre' command with
the 'pgre_scrub' option, the output displays whether the operation completed
successfully and then recommends that you check the device again to confirm
that the emulation keys have been removed. This command and the result is
shown below.
# /usr/cluster/lib/sc/pgre -c
pgre_inkeys -d /dev/did/rdsk/d4s2
No keys registered.
You should also confirm that the
reservation owner has also been removed.
# /usr/cluster/lib/sc/pgre -c
pgre_inresv -d /dev/did/rdsk/d4s2
No reservations on the
device.
Example 5:If the DID device has never been used as a quorum device you will
receive the following error when you use the 'pgre' command. This error
means that the storage area on the disk used by Sun Cluster to store the
emulation keys has not been initialized. Sun Cluster initializes the area
on the disk to store emulation keys when the quorum device is created.
# /usr/cluster/lib/sc/pgre -c
pgre_inkeys -d /dev/did/rdsk/d7s2
quorum_scsi2_sector_read:
pgre id mismatch. The sector id is .
scsi2 read returned error
(22).
/usr/cluster/lib/sc/pgre -c
pgre_inkeys -d /dev/did/rdsk/d7s2 command failed errno = 22.
Sun Cluster software is unable to set any SCSI-3 reservations on the array even though the array is functioning:
The reservations need to be cleared at the array level.
From any host connected to the array, unmount the volumes. Confirm that you have a host with serial access to the array.
On the host managing the array, run sccli:
# /usr/sbin/sccli
Choose the correct array if multiple are connected to the host. Run the command to shut down the controller; it is necessary to sync the cache to the disks so that the LDs remain consistent.
sccli> shutdown
controller
Reset the controller
sccli> reset
controller
From the host with serial port access, watch the boot-up until you find the message "Restoring saved persistent reservations. Preparing to restore saved persistent reservations."
Type skip, which will skip the loading of any persistent reservation keys stored on the array, essentially scrubbing them from the array.
Allow the array to continue to boot
normally. At this time, retry the cluster command to initiate scsi-3
reservations.
The reset on the array needs to be run only once. If any reservations on any hosts existed prior to this procedure, the host will no longer have access to the LDs, which could result in a panic of that host or an ejection from the cluster.
Sun
Cluster 3.2 has the following features and limitations:
· Support for 2-16 nodes.
· Global device capability--devices can be shared across the
cluster.
· Global file system --allows a file system to be accessed
simultaneously by all cluster nodes.
· Tight implementation with Solaris--The cluster framework
services have been implemented in the kernel.
· Application agent support.
· Tight integration with zones.
· Each node must run the same revision and update of the
Solaris OS.
· Two node clusters must have at least one quorum device.
· Each cluster needs at least two separate private networks.
(Supported hardware, such as ce and bge may use tagged VLANs to run private and
public networks on the same physical connection.)
· Each node's boot disk should include a 500M partition mounted
at /globaldevices
· Attached storage must be multiply connected to the nodes.
· ZFS is a supported file system and volume manager. Veritas
Volume Manager (VxVM) and Solaris Volume Manager (SVM) are also supported
volume managers.
· Veritas multipathing (vxdmp) is not supported. Since vxdmp
must be enabled for current VxVM versions, it must be used in conjunction with
mpxio or another similar solution like EMC's Powerpath.
· SMF services can be integrated into the cluster, and all
framework daemons are defined as SMF services
· PCI and SBus based systems cannot be mixed in the same
cluster.
· Boot devices cannot be on a disk that is shared with other cluster nodes. Doing this may lead to a locked-up cluster due to data fencing.
The
overall health of the cluster may be monitored using the cluster status or
scstat -v commands. Other useful options include:
scstat -g: Resource group status
scstat -D: Device group status
scstat -W: Heartbeat status
scstat -i: IPMP status
scstat -n: Node status
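The equivalent object-based commands introduced in Sun Cluster 3.2 can be used the same way (a quick sketch; most of these also appear later in these notes):
# clnode status (node status)
# clrg status (resource group status)
# cldg status (device group status)
# clintr status (heartbeat/interconnect status)
# clq status (quorum status)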
Failover
applications (also known as "cluster-unaware" applications in the Sun
Cluster documentation) are controlled by rgmd (the resource group manager
daemon). Each application has a data service agent, which is the way that the
cluster controls application startups, shutdowns, and monitoring. Each
application is typically paired with an IP address, which will follow the
application to the new node when a failover occurs.
"Scalable"
applications are able to run on several nodes concurrently. The clustering
software provides load balancing and makes a single service IP address
available for outside entities to query the application.
"Cluster
aware" applications take this one step further, and have cluster awareness
programmed into the application. Oracle RAC is a good example of such an
application.
All the nodes in the cluster may be shut down with cluster shutdown -y -g0. To boot a node outside of the cluster (for troubleshooting or recovery operations), run boot -x.
clsetup
is a menu-based utility that can be used to perform a broad variety of
configuration tasks, including configuration of resources and resource groups.
Cluster Configuration
The
cluster's configuration information is stored in global files known as the
"cluster configuration repository" (CCR). The cluster framework files
in /etc/cluster/ccr should not be edited manually; they should be managed via
the administrative commands.
The
cluster show command displays the cluster configuration in a nicely-formatted
report.
The
CCR contains:
· Names of the cluster and the nodes.
· The configuration of the cluster transport.
· Device group configuration.
· Nodes that can master each device group.
· NAS device information (if relevant).
· Data service parameter values and callback method paths.
· Disk ID (DID) configuration.
· Cluster status.
Some
commands to directly maintain the CCR are:
ccradm: Allows (among other things) a checksum re-configuration of files in /etc/cluster/ccr after manual edits. (Do NOT edit these files manually unless there is no other option. Even then, back up the original files.)
ccradm -i /etc/cluster/ccr/filename -o
scgdevs: Brings new devices under cluster control after they have been discovered by devfsadm.
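For example, after attaching new shared storage, something like the following (a sketch; scgdevs lives under /usr/cluster/bin) brings the new devices under cluster control:
# devfsadm
# /usr/cluster/bin/scgdevs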
The scinstall and clsetup commands may also be used to modify the cluster configuration.
We
have observed that the installation process may disrupt a previously installed
NTP configuration (even though the installation notes promise that this will
not happen). It may be worth using ntpq to verify that NTP is still working
properly after a cluster installation.
Resource Groups
Resource
groups are collections of resources, including data services. Examples of
resources include disk sets, virtual IP addresses, or server processes like
httpd.
Resource
groups may either be failover or scalable resource groups. Failover resource
groups allow groups of services to be started on a node together if the active
node fails. Scalable resource groups run on several nodes at once.
The
rgmd is the Resource Group Management Daemon. It is responsible for monitoring,
stopping, and starting the resources within the different resource groups.
Some
common resource types are:
· SUNW.LogicalHostname: Logical IP address associated with a
failover service.
· SUNW.SharedAddress: Logical IP address shared between nodes
running a scalable resource group.
·
SUNW.HAStoragePlus: Manages global
raw devices, global file systems, non-ZFS failover file systems, and failover
ZFS zpools.
Resource
groups also handle resource and resource group dependencies. Sun Cluster allows
services to start or stop in a particular order. Dependencies are a particular
type of resource property. The r_properties man page contains a list of
resource properties and their meanings. The rg_properties man page has similar
information for resource groups. In particular, the Resource_dependencies
property specifies something on which the resource is dependent.
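For example, a hypothetical application resource app-rs that depends on a storage resource hasp-rs (both names made up for illustration) could be tied together with something like:
# clrs set -p Resource_dependencies=hasp-rs app-rs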
Some
resource group cluster commands are:
# clrt register resource-type: Register a resource type.
# clrt register -n node1name,node2name resource-type: Register a resource type to specific nodes.
# clrt unregister resource-type: Unregister a resource type.
# clrt list -v: List all resource types and their associated node lists.
# clrt show resource-type: Display all information for a resource type.
# clrg create -n node1name,node2name rgname: Create a resource group.
# clrg delete rgname: Delete a resource group.
# clrg set -p property-name rgname: Set a property.
# clrg show -v rgname: Show resource group information.
# clrs create -t HAStoragePlus -g rgname -p AffinityOn=true -p FilesystemMountPoints=/mountpoint resource-name
# clrg online -M rgname
# clrg switch -M -n nodename rgname
# clrg offline rgname: Offline the resource group, but leave it in a managed state.
# clrg restart rgname
# clrs disable resource-name: Disable a resource and its fault monitor.
# clrs enable resource-name: Re-enable a resource and its fault monitor.
# clrs clear -n nodename -f STOP_FAILED resource-name
# clrs unmonitor resource-name: Disable the fault monitor, but leave the resource running.
# clrs monitor resource-name: Re-enable the fault monitor for a resource that is currently enabled.
# clrg suspend rgname: Preserves the online status of the group, but does not continue monitoring.
# clrg resume rgname: Resumes monitoring of a suspended group.
# clrg status: List status of resource groups.
# clrs status -g rgname
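Putting a few of these together, something like the following builds a simple failover group with a logical hostname and an HAStoragePlus resource (rg1, lh1, hasp1, the node names and the mount point are all hypothetical):
# clrt register SUNW.HAStoragePlus
# clrg create -n node1,node2 rg1
# clrslh create -g rg1 lh1
# clrs create -t HAStoragePlus -g rg1 -p AffinityOn=true -p FilesystemMountPoints=/export/data hasp1
# clrg online -M rg1
# clrs status -g rg1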
Data Services
A
data service agent is a set of components that allow a data service to be
monitored and fail over within the cluster. The agent includes methods for
starting, stopping, monitoring, or failing the data service. It also includes a
registration information file allowing the CCR to store the information about
these methods in the CCR. This information is encapsulated as a resource type.
The fault monitors for a data service place the daemons under the control of the process monitoring facility (rpc.pmfd), and monitor the service itself using client commands.
Public Network
The
public network uses pnmd (Public Network Management Daemon) and the IPMP
in.mpathd daemon to monitor and control the public network addresses.
IPMP
should be used to provide failovers for the public network paths. The health of
the IPMP elements can be monitored with scstat -i
The
clrslh and clrssa commands are used to configure logical and shared hostnames,
respectively.
# clrslh create -g rgname logical-hostname
Private Network
The
"private," or "cluster transport" network is used to
provide a heartbeat between the nodes so that they can determine which nodes
are available. The cluster transport network is also used for traffic related
to global devices.
While
a 2-node cluster may use crossover cables to construct a private network,
switches should be used for anything more than two nodes. (Ideally, separate
switching equipment should be used for each path so that there is no single
point of failure.)
The
default base IP address is 172.16.0.0, and private networks are assigned
subnets based on the results of the cluster setup. Available network interfaces
can be identified by using a combination of dladm show-dev and ifconfig.
Private
networks should be installed and configured using the scinstall command during
cluster configuration. Make sure that the interfaces in question are connected,
but down and unplumbed before configuration. The clsetup command also has menu
options to guide you through the private network setup process.
Alternatively,
something like the following command string can be used to establish a private
network:
# clintr add nodename1:ifname1
# clintr add nodename2:ifname2
# clintr add switchname
# clintr add nodename1:ifname1,switchname
# clintr add nodename2:ifname2,switchname
# clintr status
The
health of the heartbeat networks can be checked with the scstat -W command. The
physical paths may be checked with clintr status or cluster status -t intr.
Quorum
Sun
Cluster uses a quorum voting system to prevent split-brain and cluster amnesia.
The Sun Cluster documentation refers to "failure fencing" as the
mechanism to prevent split-brain (where two nodes run the same service at the
same time, leading to potential data corruption).
"Amnesia"
occurs when a change is made to the cluster while a node is down, then that
node attempts to bring up the cluster. This can result in the changes being
forgotten, hence the use of the word "amnesia."
One
result of this is that the last node to leave a cluster when it is shut down
must be the first node to re-enter the cluster. Later in this section, we will
discuss ways of circumventing this protection.
Quorum voting is defined by allowing each device one vote. A quorum device may be a cluster node, a specified external server running quorum software, or a disk or NAS device. A majority of all defined quorum votes is required in order to form a cluster, and at least half of the quorum votes must be present in order for cluster services to remain in operation. (If a node cannot contact at least half of the quorum votes, it will panic. During the reboot, if a majority cannot be contacted, the boot process will be frozen. Nodes that are removed from the cluster due to a quorum problem also lose access to any shared file systems; this is called "data fencing" in the Sun Cluster documentation.)
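As a worked example of the vote arithmetic: a two-node cluster with one quorum disk has three votes in total (one per node plus one for the disk). A majority of two is needed to form the cluster, so one node together with the quorum disk can boot and run cluster services, while a node that loses contact with both its peer and the quorum disk holds only one of the three votes and panics.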
Quorum devices must be available to at least two nodes in the cluster.
Disk quorum devices may also contain user data. (Note that if a ZFS disk is used as a quorum device, it should be brought into the zpool before being specified as a quorum device.)
Sun recommends configuring n-1 quorum devices, where n is the number of nodes. Two-node clusters must contain at least one quorum device.
Disk quorum devices must be specified using their DID names.
Quorum disk devices should be at least as available as the storage underlying the cluster resource groups.
Quorum status and configuration may be investigated using:
# scstat -q
# clq status
These commands report on the configured quorum votes, whether they are present, and how many are required for a majority.
Quorum devices can be manipulated through the following commands:
# clq add did-device-name
# clq remove did-device-name (Only removes the device from the quorum configuration; no data on the device is affected.)
# clq enable did-device-name
# clq disable did-device-name (Removes the quorum device from the total list of available quorum votes. This might be valuable if the device is down for maintenance.)
# clq reset (Resets the configuration to the default.)
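For example, to take a quorum disk out of the vote count while its storage array is serviced (d5 here is a hypothetical DID device name):
# clq disable d5
# clq status (confirm that the device no longer contributes a vote)
# clq enable d5 (once the maintenance is complete)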
By default, doubly-connected disk quorum devices use SCSI-2 locking, while devices connected to more than two nodes use SCSI-3 locking. SCSI-3 offers persistent reservations; SCSI-2 requires the use of emulation software, which uses a 64-bit reservation key written to a private area on the disk.
In either case, the cluster node that wins the race to the quorum device attempts to remove the keys of any node that it is unable to contact, which cuts that node off from the quorum device. As noted before, any group of nodes that cannot communicate with at least half of the quorum votes will panic, which prevents a cluster partition (split-brain).
In order to add nodes to a 2-node cluster, it may be necessary to change the default fencing with scdidadm -G prefer3 or cluster set -p global_fencing=prefer3, create a SCSI-3 quorum device with clq add, and then remove the SCSI-2 quorum device with clq remove.
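A minimal sketch of that sequence, assuming the existing SCSI-2 quorum device is d4 and the new SCSI-3-capable device is d7 (both DID names are hypothetical):
# cluster set -p global_fencing=prefer3
# clq add d7
# clq remove d4
# clq status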
NetApp filers and systems running the scqsd daemon may also be selected as quorum devices. NetApp filers use SCSI-3 locking over the iSCSI protocol to perform their quorum functions.
The claccess deny-all command may be used to deny all other nodes access to the cluster; claccess allow nodename re-enables access for a node.
Purging Quorum Keys
CAUTION: Purging the keys from a quorum device may result in amnesia. It should only be done after careful diagnostics have verified why the cluster is not coming up, and it should never be done as long as the cluster is able to come up. It may need to be done if the last node to leave the cluster is unable to boot, leaving everyone else fenced out. In that case, boot one of the other nodes to single-user mode, identify the quorum device, and:
For SCSI-2 disk reservations, the relevant command is pgre, which is located in /usr/cluster/lib/sc:
# pgre -c pgre_inkeys -d /dev/did/rdsk/d#s2 (List the keys on the quorum device.)
# pgre -c pgre_scrub -d /dev/did/rdsk/d#s2 (Remove the keys from the quorum device.)
Similarly, for SCSI-3 disk reservations, the relevant command is scsi:
# scsi -c inkeys -d /dev/did/rdsk/d#s2 (List the keys on the quorum device.)
# scsi -c scrub -d /dev/did/rdsk/d#s2 (Remove the keys from the quorum device.)
Global Storage
Sun Cluster provides a unique global device name for every disk, CD, and tape drive in the cluster. The format of these global device names is /dev/did/device-type (e.g., /dev/did/dsk/d2s3). (Note that the DIDs are a global naming system, which is separate from the global device or global file system functionality.)
DIDs are components of SVM volumes, though VxVM does not recognize DID device names as components of VxVM volumes.
DID disk devices, CD-ROM drives, tape drives, SVM volumes, and VxVM volumes may be used as global devices. A global device is physically accessed by just one node at a time, but all other nodes may access the device by communicating across the global transport network.
The file systems in /global/.devices store the device files for global devices on each node. These are mounted on mount points of the form /global/.devices/node@nodeid, where nodeid is the identification number assigned to the node, and they are visible on all nodes. Symbolic links may be set up to the contents of these file systems if desired; Sun Cluster sets up some such links in the /dev/global directory.
Global file systems may be ufs, VxFS, or hsfs. To mount a file system as a global file system, add the "global" mount option to the file system's vfstab entry and remount. Alternatively, run a mount -o global... command.
(Note that all nodes in the cluster should have the same vfstab entry for all cluster file systems. This is true for both global and failover file systems, though ZFS file systems do not use the vfstab at all.)
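As an illustration only (the metadevice path, diskset name, and mount point are hypothetical), a vfstab entry for a global UFS file system built on an SVM metadevice might look like:
/dev/md/webds/dsk/d100 /dev/md/webds/rdsk/d100 /global/web ufs 2 yes global,logging
Once the same entry is in place on every node, mounting /global/web on one node makes it available cluster-wide.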
In the Sun Cluster documentation, global file systems are also known as "cluster file systems" or "proxy file systems."
Note that global file systems are different from failover file systems: the former are accessible from all nodes, while the latter are only accessible from the active node.
Maintaining Devices
New devices need to be read into the cluster configuration as well as the OS. As usual, we should run something like devfsadm or drvconfig; disks to create the /device and /dev links across the cluster. Then we use the scgdevs or scdidadm command to add the new disk devices to the cluster configuration.
Some useful options for scdidadm are:
# scdidadm -l: Show local DIDs
# scdidadm -L: Show all cluster DIDs
# scdidadm -r: Rebuild DIDs
We should also clean up unused links from time to time with devfsadm -C and scdidadm -C.
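Pulling those commands together, a minimal sketch of presenting a new LUN to the cluster might look like the following (run the first step on every node; the grep pattern is just a placeholder for whatever identifies the new disk):
# devfsadm (build the /devices and /dev links on each node)
# scgdevs (on one node, add the new devices to the cluster configuration)
# scdidadm -L | grep new-disk (confirm that a DID was assigned)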
The status of device groups can be checked with scstat -D. Devices may be listed with cldev list -v. They can be switched to a different node with a cldg switch -n target-node dgname command.
Monitoring for devices can be enabled and disabled by using commands like:
# cldev monitor all
# cldev unmonitor d#
# cldev unmonitor -n nodename d#
# cldev status -s Unmonitored
Parameters may be set on device groups using the cldg set command, for example:
# cldg set -p failback=false dgname
A device group can be taken offline or placed online with:
# cldg offline dgname
# cldg online dgname
VxVM-Specific Issues
Since vxdmp cannot be disabled, we need to make sure that VxVM can only see one path to each disk. This is usually done by implementing mpxio or a third-party product like PowerPath. The order of installation for such an environment would be:
· Install Solaris and patches.
· Install and configure multipathing software.
· Install and configure Sun Cluster.
· Install and configure VxVM.
If VxVM disk groups are used by the cluster, all nodes attached to the shared storage must have VxVM installed. Each vxio number in /etc/name_to_major must also be the same on each node. This can be checked (and fixed, if necessary) with the clvxvm initialize command. (A reboot may be necessary if the /etc/name_to_major file is changed.)
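A quick manual check of the vxio entries, using only standard Solaris tools (clvxvm initialize remains the supported way to fix a mismatch):
# grep vxio /etc/name_to_major (run on each node and compare the major numbers)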
The clvxvm encapsulate command should be used if the boot drive is encapsulated (and mirrored) by VxVM, so that the /global/.devices information is set up properly.
The clsetup "Device Groups" menu contains items to register a VxVM disk group, unregister a device group, or synchronize volume information for a disk group. We can also re-synchronize with the cldg sync dgname command.
Solaris Volume Manager-Specific Issues
Sun Cluster allows us to add metadb or partition information in the /dev/did format or in the usual c#t#d#s# format. In general:
Use the local format for boot drive mirroring, in case we need to boot outside the cluster framework.
Use the cluster (DID) format for shared disksets, because otherwise we would need to assume the same controller numbers on each node.
Configuration information is kept in the metadatabase replicas. At least three local replicas are required to boot a node; these should be put on their own partitions on the local disks and spread across controllers and disks to the degree possible. Multiple replicas may be placed on each partition; they should be spread out so that if any one disk fails, there will still be at least three replicas left over, constituting at least half of the total local replicas.
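A minimal sketch of laying out local replicas along those lines, assuming two local disks with a small slice 7 reserved on each (the slice names are hypothetical):
# metadb -a -f -c 3 c0t0d0s7 c1t0d0s7
# metadb (verify the replica layout)
This creates three replicas on each slice, six in total, so losing either disk still leaves three replicas, or half of the total.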
When disks are added to a shared diskset, database replicas are automatically added. These will always be added to slice 7, where they need to remain. If a disk containing replicas is removed, the replicas must be removed using metadb.
If fewer than 50% of the replicas in a diskset are available, the diskset ceases to operate. If exactly 50% of the replicas are available, the diskset will continue to operate, but it cannot be enabled or switched to another node.
A mediator can be assigned to a shared diskset. The mediator data is contained within a Solaris process on each node and counts for two votes in the diskset quorum voting.
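As a worked example of why that matters: picture a diskset whose disks, and therefore replicas, are split evenly across two storage arrays. If one array fails, exactly 50% of the replicas remain, which by the rule above keeps the diskset running but blocks it from being switched to another node; the mediator's two extra votes push the surviving half past the 50% mark, so a takeover can still proceed.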
Standard c#t#d#s# naming should be used when creating local metadb replicas, since it will make recovery easier if we need to boot the node outside of a cluster context. On the other hand, /dev/did/rdsk/d#s# naming should be used for shared disksets, since otherwise the paths would need to be identical on all nodes.
Creating a new shared diskset involves the following steps:
(Create an empty diskset.) # metaset -s set-name -a -h node1-name node2-name
(Create a mediator.) # metaset -s set-name -a -m node1-name node2-name
(Add disks to the diskset.) # metaset -s set-name -a /dev/did/rdsk/d# /dev/did/rdsk/d#
(Check that the diskset is present in the cluster configuration.)
# cldev list -v
# cldg status
# cldg show set-name
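As a follow-on usage sketch (the diskset name webds, the DID devices d9 and d10, and the metadevice numbers are all hypothetical), a simple mirrored volume inside the new diskset could then be built with:
# metainit -s webds d101 1 1 /dev/did/rdsk/d9s0
# metainit -s webds d102 1 1 /dev/did/rdsk/d10s0
# metainit -s webds d100 -m d101
# metattach -s webds d100 d102
# metastat -s webds (verify the mirror and watch the resync)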