ORA-29740
ORA-29740 is an error specific to cluster (RAC) databases: it indicates that an instance has been evicted from the cluster by another member. It can have many causes, but performance problems and hardware faults are the most common. Let's look at a real-world case in which instance #1 was evicted by instance #2 due to a performance issue.
In my case, instance #2 found instance #1 hard to communicate with, so it ordered instance #1 to be evicted from the cluster. The alert log of instance #2 shows the situation:
...
Sun Aug 22 18:27:35 2010
Communications reconfiguration: instance 0
Evicting instance 1 from cluster
Sun Aug 22 18:28:01 2010
Waiting for instances to leave:
1
Sun Aug 22 18:28:08 2010
Trace dumping is performing id=[30224035720]
Sun Aug 22 18:28:21 2010
Waiting for instances to leave:
1
Sun Aug 22 18:28:41 2010
Waiting for instances to leave:
1
Sun Aug 22 18:29:01 2010
Waiting for instances to leave:
1
Sun Aug 22 18:29:21 2010
Waiting for instances to leave:
1
...
Sun Aug 22 18:36:20 2010
Reconfiguration started (old inc 9, new inc 10)
List of nodes:
1
Sun Aug 22 18:36:20 2010
Reconfiguration started (old inc 9, new inc 11)
List of nodes:
1
Nested/batched reconfiguration detected.
Global Resource Directory frozen
one node partition
Communication channels reestablished
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Resources and enqueues cleaned out
Resources remastered 14197
251008 GCS shadows traversed, 20 cancelled, 16610 closed
234471 GCS resources traversed, 0 cancelled
set master node info
Submitted all remote-enqueue requests
Update rdomain variables
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
251008 GCS shadows traversed, 0 replayed, 16629 unopened
Submitted all GCS remote-cache requests
0 write requests issued in 234379 GCS resources
5 PIs marked suspect, 0 flush PI msgs
Sun Aug 22 18:36:27 2010
Reconfiguration complete
Post SMON to start 1st pass IR
Sun Aug 22 18:36:27 2010
Instance recovery: looking for dead threads
Sun Aug 22 18:36:27 2010
ARC9: Completed archiving log 10 thread 2 sequence 869333
Sun Aug 22 18:36:27 2010
Beginning instance recovery of 1 threads
Sun Aug 22 18:36:27 2010
Started redo scan
Sun Aug 22 18:36:29 2010
Completed redo scan
125469 redo blocks read, 3706 data blocks need recovery
...
The alert log of instance #1 showed that it hit an ORA-29740 condition and had to shut down:
Sun Aug 22 18:32:15 2010
Trace dumping is performing id=[30224035720]
Sun Aug 22 18:32:53 2010
SMON: terminating instance due to error 481
Sun Aug 22 18:34:15 2010
Errors in file /oracle/admin/ORCL/bdump/orcl_lmon_7234.trc:
ORA-29740: evicted by member 1, group incarnation 10
Instance terminated by SMON, pid = 3452
...
The LMON trace file of instance #1 recorded the state just before the shutdown:
*** 2010-08-22 18:29:34.050
kjxgrdtrt: Evicted by 1, seq (10, 9)
IMR state information
Member 0, thread 1, state 4, flags 0x0040
RR seq 9, propstate 5, pending propstate 0
Member information:
Member 0, incarn 9, version 678769
thrd 1, prev thrd 65535, status 0x0007, err 0x0000
Member 1, incarn 9, version 109808
thrd 2, prev thrd 65535, status 0x0007, err 0x0000
Group name: ORCL
Member id: 0
Cached SKGXN event: 0
Group State:
State: 9 6
Commited Map: 0 1
New Map: 0 1
SKGXN Map: 0 1
Master node: 0
Memcnt 2 Rcvcnt 0
Substate Proposal: false
Inc Proposal:
incarn 0 memcnt 0 master 0
proposal false matched false
map:
Master Inc State:
incarn 0 memcnt 0 agrees 0 flag 0x1
wmap:
nmap:
ubmap:
Submitting asynchronized dump request [1]
*** 2010-08-22 18:30:44.766
error 29740 detected in background process
ORA-29740: evicted by member 1, group incarnation 10
Communication problems like this can have several possible causes:
- The system is halted or in the middle of booting, so the heartbeat has stopped.
- The system is hung due to performance problems.
- Software or hardware faults in the network interface cards.
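When triaging a case like this, it helps to grep the system logs for the signatures of each cause. Below is a minimal sketch (not from the original incident tooling) that scans syslog/dmesg lines for the two symptoms that turned up in this case; the pattern strings and the `triage` helper are my own illustrative names:

```python
import re

# Hypothetical signatures for two of the eviction causes listed above
# (assumed patterns, based on the messages seen in this incident).
PATTERNS = {
    "memory pressure": re.compile(r"clcomm: memory low"),
    "heartbeat/NIC degradation": re.compile(
        r"in\.mpathd.*Cannot meet requested failure detection time"),
}

def triage(lines):
    """Return the causes whose signature appears in the log lines."""
    hits = set()
    for line in lines:
        for cause, pat in PATTERNS.items():
            if pat.search(line):
                hits.add(cause)
    return sorted(hits)

sample = [
    "Aug 22 18:12:50 dbhost cl_runtime: WARNING: clcomm: memory low: "
    "freemem 0x1605",
    "Aug 22 18:23:42 dbhost in.mpathd[2052]: Cannot meet requested "
    "failure detection time of 20000 ms on (inet ge3)",
]
print(triage(sample))  # ['heartbeat/NIC degradation', 'memory pressure']
```

This is only a first-pass filter; a hit still needs to be confirmed against the actual dmesg output, as we do below.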
Luckily, something unusual was found in dmesg:
...
Aug 22 18:12:50 dbhost cl_runtime: [ID 661778 kern.warning] WARNING: clcomm: memory low: freemem 0x1605
Aug 22 18:23:42 dbhost in.mpathd[2052]: [ID 585766 daemon.error] Cannot meet requested failure detection time of 20000 ms on (inet ge3) new failure detection time for group "nafo0" is 227332 ms
Aug 22 18:35:34 dbhost in.mpathd[2052]: [ID 585766 daemon.error] Cannot meet requested failure detection time of 20000 ms on (inet ge3) new failure detection time for group "nafo0" is 933156 ms
Aug 22 18:35:46 dbhost eTAudit GenericRec: [ID 778245 user.error] Failed to submit message to router.
Aug 22 18:36:42 dbhost in.mpathd[2052]: [ID 302819 daemon.error] Improved failure detection time 466578 ms on (inet ge3) for group "nafo0"
Aug 22 18:37:32 dbhost in.mpathd[2052]: [ID 302819 daemon.error] Improved failure detection time 233289 ms on (inet ge3) for group "nafo0"
...
It seemed that the server was under heavy memory pressure, which stalled the heartbeat so that instance #1 could not keep pace with the other cluster members. The DBA reported the bottleneck to the system administrator, who decided to add more physical memory to ease the pressure. After bouncing instance #1, the cluster database returned to normal.
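For a sense of scale, the clcomm "memory low" warning above can be translated into bytes. The value 0x1605 is a free-page count; the back-of-the-envelope sketch below assumes the 8 KB base page size typical of SPARC Solaris hosts of that era (the real page size should be confirmed with the `pagesize` command on the host):

```python
# Back-of-the-envelope conversion of the clcomm freemem value.
freemem_pages = 0x1605    # free-page count reported by clcomm (5637 pages)
page_size = 8192          # bytes per page -- an assumption, verify with `pagesize`
free_mb = freemem_pages * page_size / (1024 * 1024)
print(f"free memory ~= {free_mb:.1f} MB")  # roughly 44 MB free on the whole host
```

Around 44 MB of free memory on a database server is starvation territory, which is consistent with the heartbeat stalling and the instance being evicted.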