ORA-29740
ORA-29740 is an error specific to cluster (RAC) databases: it indicates that an instance has been evicted from the cluster by another member. It can have many causes, but performance problems and hardware faults are the most common. Let's look at a real-world case in which instance #1 was evicted by instance #2 due to a performance issue.
In my case, instance #2 found instance #1 hard to communicate with, so it ordered instance #1 to be evicted from the cluster. The alert log of instance #2 shows the situation:
...
Sun Aug 22 18:27:35 2010
Communications reconfiguration: instance 0
Evicting instance 1 from cluster
Sun Aug 22 18:28:01 2010
Waiting for instances to leave:
1
Sun Aug 22 18:28:08 2010
Trace dumping is performing id=[30224035720]
Sun Aug 22 18:28:21 2010
Waiting for instances to leave:
1
Sun Aug 22 18:28:41 2010
Waiting for instances to leave:
1
Sun Aug 22 18:29:01 2010
Waiting for instances to leave:
1
Sun Aug 22 18:29:21 2010
Waiting for instances to leave:
1
...
Sun Aug 22 18:36:20 2010
Reconfiguration started (old inc 9, new inc 10)
List of nodes:
1
Sun Aug 22 18:36:20 2010
Reconfiguration started (old inc 9, new inc 11)
List of nodes:
1
Nested/batched reconfiguration detected.
Global Resource Directory frozen
one node partition
Communication channels reestablished
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Resources and enqueues cleaned out
Resources remastered 14197
251008 GCS shadows traversed, 20 cancelled, 16610 closed
234471 GCS resources traversed, 0 cancelled
set master node info
Submitted all remote-enqueue requests
Update rdomain variables
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
251008 GCS shadows traversed, 0 replayed, 16629 unopened
Submitted all GCS remote-cache requests
0 write requests issued in 234379 GCS resources
5 PIs marked suspect, 0 flush PI msgs
Sun Aug 22 18:36:27 2010
Reconfiguration complete
Post SMON to start 1st pass IR
Sun Aug 22 18:36:27 2010
Instance recovery: looking for dead threads
Sun Aug 22 18:36:27 2010
ARC9: Completed archiving log 10 thread 2 sequence 869333
Sun Aug 22 18:36:27 2010
Beginning instance recovery of 1 threads
Sun Aug 22 18:36:27 2010
Started redo scan
Sun Aug 22 18:36:29 2010
Completed redo scan
125469 redo blocks read, 3706 data blocks need recovery
...
The alert log of instance #1 showed that it hit an ORA-29740 condition and had to shut down:
Sun Aug 22 18:32:15 2010
Trace dumping is performing id=[30224035720]
Sun Aug 22 18:32:53 2010
SMON: terminating instance due to error 481
Sun Aug 22 18:34:15 2010
Errors in file /oracle/admin/ORCL/bdump/orcl_lmon_7234.trc:
ORA-29740: evicted by member 1, group incarnation 10
Instance terminated by SMON, pid = 3452
...
The LMON trace file of instance #1 recorded the state just before the shutdown:
*** 2010-08-22 18:29:34.050
kjxgrdtrt: Evicted by 1, seq (10, 9)
IMR state information
Member 0, thread 1, state 4, flags 0x0040
RR seq 9, propstate 5, pending propstate 0
Member information:
Member 0, incarn 9, version 678769
thrd 1, prev thrd 65535, status 0x0007, err 0x0000
Member 1, incarn 9, version 109808
thrd 2, prev thrd 65535, status 0x0007, err 0x0000
Group name: ORCL
Member id: 0
Cached SKGXN event: 0
Group State:
State: 9 6
Commited Map: 0 1
New Map: 0 1
SKGXN Map: 0 1
Master node: 0
Memcnt 2 Rcvcnt 0
Substate Proposal: false
Inc Proposal:
incarn 0 memcnt 0 master 0
proposal false matched false
map:
Master Inc State:
incarn 0 memcnt 0 agrees 0 flag 0x1
wmap:
nmap:
ubmap:
Submitting asynchronized dump request [1]
*** 2010-08-22 18:30:44.766
error 29740 detected in background process
ORA-29740: evicted by member 1, group incarnation 10
Communication problems like this can have several possible causes:
- The system is halted or in the middle of booting, so the heartbeat has stopped.
- The system is hung due to performance problems.
- Software or hardware faults in the network interface cards.
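When triaging a case like this, it helps to grep the system logs for the signatures of each cause. Below is a minimal sketch (not from the original incident tooling) that scans syslog/dmesg lines for the two symptoms that turned up in this case; the pattern strings and the `triage` helper are my own illustrative names:

```python
import re

# Hypothetical signatures for two of the eviction causes listed above
# (assumed patterns, based on the messages seen in this incident).
PATTERNS = {
    "memory pressure": re.compile(r"clcomm: memory low"),
    "heartbeat/NIC degradation": re.compile(
        r"in\.mpathd.*Cannot meet requested failure detection time"),
}

def triage(lines):
    """Return the causes whose signature appears in the log lines."""
    hits = set()
    for line in lines:
        for cause, pat in PATTERNS.items():
            if pat.search(line):
                hits.add(cause)
    return sorted(hits)

sample = [
    "Aug 22 18:12:50 dbhost cl_runtime: WARNING: clcomm: memory low: "
    "freemem 0x1605",
    "Aug 22 18:23:42 dbhost in.mpathd[2052]: Cannot meet requested "
    "failure detection time of 20000 ms on (inet ge3)",
]
print(triage(sample))  # ['heartbeat/NIC degradation', 'memory pressure']
```

This is only a first-pass filter; a hit still needs to be confirmed against the actual dmesg output, as we do below.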
Luckily, something unusual was found in dmesg:
...
Aug 22 18:12:50 dbhost cl_runtime: [ID 661778 kern.warning] WARNING: clcomm: memory low: freemem 0x1605
Aug 22 18:23:42 dbhost in.mpathd[2052]: [ID 585766 daemon.error] Cannot meet requested failure detection time of 20000 ms on (inet ge3) new failure detection time for group "nafo0" is 227332 ms
Aug 22 18:35:34 dbhost in.mpathd[2052]: [ID 585766 daemon.error] Cannot meet requested failure detection time of 20000 ms on (inet ge3) new failure detection time for group "nafo0" is 933156 ms
Aug 22 18:35:46 dbhost eTAudit GenericRec: [ID 778245 user.error] Failed to submit message to router.
Aug 22 18:36:42 dbhost in.mpathd[2052]: [ID 302819 daemon.error] Improved failure detection time 466578 ms on (inet ge3) for group "nafo0"
Aug 22 18:37:32 dbhost in.mpathd[2052]: [ID 302819 daemon.error] Improved failure detection time 233289 ms on (inet ge3) for group "nafo0"
...
It seemed that the server was under heavy memory pressure, which stalled the heartbeat so that instance #1 could not keep pace with the other cluster members. The DBA reported the bottleneck to the system administrator, who decided to add more physical memory to ease the pressure. After bouncing instance #1, the cluster database returned to normal.
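For a sense of scale, the clcomm "memory low" warning above can be translated into bytes. The value 0x1605 is a free-page count; the back-of-the-envelope sketch below assumes the 8 KB base page size typical of SPARC Solaris hosts of that era (the real page size should be confirmed with the `pagesize` command on the host):

```python
# Back-of-the-envelope conversion of the clcomm freemem value.
freemem_pages = 0x1605    # free-page count reported by clcomm (5637 pages)
page_size = 8192          # bytes per page -- an assumption, verify with `pagesize`
free_mb = freemem_pages * page_size / (1024 * 1024)
print(f"free memory ~= {free_mb:.1f} MB")  # roughly 44 MB free on the whole host
```

Around 44 MB of free memory on a database server is starvation territory, which is consistent with the heartbeat stalling and the instance being evicted.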