How to Analyze an Oracle Solaris OS Crash Dump


This topic explains how to generate an Oracle Solaris OS crash dump report using the Solaris Crash Analyzer Tool.


It also shows how to find the root cause of an Oracle Solaris OS crash, unexpected reboot, or hang from the crash dump.

If your system is properly installed and configured according to EIS (Enterprise Installation Standard), then your dump device must be configured as a separate raw disk and must be at least as large as the physical memory installed on the server. On most systems the dump device is the same as the swap device. If your system is frame-based and has more than 320 GB of installed memory (SAS disk), you need to assign the system a dedicated dump device no smaller than the physical memory. Otherwise, when the system panics, the memory state will not be fully written to the dump device, and you will lose the last memory snapshot needed for root cause analysis of the crash.
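The sizing rule above can be checked with a few lines of shell. This is only a minimal sketch: it assumes the installed memory is read from the "Memory size:" line that prtconf prints (in megabytes) and that you supply the dump device size in megabytes yourself. The function names are hypothetical.

```shell
#!/bin/sh
# Extract installed memory in MB from prtconf-style text on stdin.
# Expects a line such as "Memory size: 229376 Megabytes".
mem_mb_from_prtconf() {
    awk '/^Memory size:/ { print $3; exit }'
}

# Compare a dump device size (MB) with the installed memory (MB).
# Prints "ok" when the device is at least as large as memory,
# otherwise "too small".
check_dump_size() {
    mem_mb=$1
    dev_mb=$2
    if [ "$dev_mb" -ge "$mem_mb" ]; then
        echo "ok"
    else
        echo "too small"
    fi
}
```

A possible call is check_dump_size "`prtconf | mem_mb_from_prtconf`" 128000. For a server with 224G of memory (229376 MB) and a 125G dump device (128000 MB), this prints "too small".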


First of all, we need to collect the memory dump file from the dump device. By default, when the system comes back up after a crash reboot, the Solaris savecore tool stores the dump file as /var/crash/server/vmdump.0. If the file system under /var does not have enough space, you can save the dump file to separate storage or to an NFS shared resource.

To find out the dump device, run the dumpadm command:

# dumpadm

Dump content: kernel pages

Dump device: /dev/md/dsk/d3 (swap) <-- dump device

Savecore directory: /var/crash/mexico

Savecore enabled: yes

Save compressed: on
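If you want to reuse these values in a script rather than read them by eye, the fields can be pulled out of the dumpadm output with awk. The sketch below assumes the output format shown above; the function names dump_device and savecore_dir are hypothetical helpers, not part of Solaris.

```shell
#!/bin/sh
# Print the dump device path from dumpadm-style text on stdin.
# Matches a line such as "Dump device: /dev/md/dsk/d3 (swap)".
dump_device() {
    awk '/Dump device:/ { print $3; exit }'
}

# Print the savecore directory from the same text.
# Matches a line such as "Savecore directory: /var/crash/mexico".
savecore_dir() {
    awk '/Savecore directory:/ { print $3; exit }'
}
```

On a live system the input would come from the command itself, for example dumpadm | dump_device.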

Now we will save the crash dump file to the shared resource on the NFS server:

# savecore -vd -f /dev/md/dsk/d3 /net/nfsserver/dump/server01/crash_dump/

This produces the file /net/nfsserver/dump/server01/crash_dump/vmdump.0:

root@server # savecore -vd -f /dev/md/dsk/d3 /net/nfsserver/dump/server01/crash_dump/

savecore: System dump time: Sun Nov 27 23:08:00 2011

savecore: Saving compressed system crash dump in /net/nfsserver/dump/server01/crash_dump/vmdump.0

savecore: Copying /dev/md/dsk/d3 to /net/nfsserver/dump/server01/crash_dump/vmdump.0

savecore: Decompress the crash dump with

'savecore -vf /net/nfsserver/dump/server01/crash_dump/vmdump.0'

3:15 dump copy is done

Now we need to decompress the vmdump.0 file. We can do this with the savecore command:

# savecore -vd -f vmdump.0 /net/nfsserver/dump/server01/crash_dump/

System dump time: Sun Nov 27 23:08:00 2011

Constructing namelist /net/nfsserver/dump/server01/crash_dump/unix.0

Constructing corefile /net/nfsserver/dump/server01/crash_dump/vmcore.0

: 3902138 of 3902138 pages saved


Now we have the crash dump output files: unix.0 (the Oracle Solaris kernel namelist) and vmcore.0 (the Oracle Solaris memory snapshot).

Before continuing, you need to install SUNWscat, the Solaris Crash Analyzer Tool. You can get it from Oracle Support.

After installation, scat is located at /opt/SUNWscat/bin/scat.

There are two ways to generate a crash dump analysis report: interactively or as an explorer collection.

First, the interactive way:

# /opt/SUNWscat/bin/scat /net/nfsserver/dump/server01/crash_dump/vmcore.0

Interactive mode produces the following output:

 

Solaris[TM] CAT 5.2 for Solaris 10 64-bit UltraSPARC

SV4990M, Aug 26 2009

Copyright © 2009 Sun Microsystems, Inc. All rights reserved.

Use is subject to license terms.

Feedback regarding the tool should be sent to SolarisCAT_Feedback@Sun.COM

Visit the Solaris CAT blog at
http://blogs.sun.com/SolarisCAT

opening ./vmcore.0 …dumphdr…

WARNING: ./vmcore.0 incomplete/corrupt. size: 32167952384, expected: 32167960576

symtab…core…done

loading core data: modules…symbols…CTF…done

core file: /net/nfsserver/dump/server01/crash_dump/vmcore.0

user: UNIX Administrator Eldar Aydayev (eldara:5002)

release: 5.10 (64-bit)

version: Generic_144488-17

machine: sun4u

node name: mexico

domain: mtn.com.ng

hw_provider: Sun_Microsystems

system type: SUNW,SPARC-Enterprise (SPARC64-VII)

hostid: 847c1fc7

dump_conflags: 0x10000 (DUMP_KERNEL) on /dev/md/dsk/d3(125G)

kmem_flags: 0xf (AUDIT|DEADBEEF|REDZONE|CONTENTS)

time of crash: Sun Nov 27 23:06:55 WAT 2011 (core is 74 days old)

age of system: 13 days 1 hours 51 minutes 3.68 seconds

panic CPU: 210 (56 CPUs, 224G memory, 3 nodes)

panic string: BAD TRAP: type=34 rp=2a1040da720 addr=deadbeefdeadc187 mmu_fsr=0

sanity checks: settings…

NOTE: /etc/system: symbol not found "set noexec-stack=0x1"

NOTE: /etc/system: lwp_default_stksize set to 0x6000 2 times

NOTE: /etc/system: rpcmod:svc_default_stksize set to 0x6000 2 times

vmem…CPU…

WARNING: TS thread 0x303572b1120 on CPU32 using 99%CPU

WARNING: TS thread 0x376abc6e140 on CPU105 using 100%CPU

WARNING: CPU136 has cpu_intr_actv for PIL 6

WARNING: PIL6 interrupt thread 0x2a101917ca0 on CPU136 pinning TS thread 0x30351a6d200

WARNING: TS thread 0x4cecc9e7b00 on CPU208 using 100%CPU

sysent…

WARNING: unknown module acctctl seen 4 times in sysent table

clock…misc…

WARNING: hat_kpr_enabled is 0

WARNING: 80 severe kstat errors (run "kstat xck")

done

SolarisCAT(./vmcore.0/10U)>

Now we will run the analyze command against the crash dump and check its output:

 

SolarisCAT(./vmcore.0/10U)> analyze

core file: /net/nfsserver/dump/server01/crash_dump/vmcore.0

user: UNIX Administrator Eldar Aydayev (eldara:5002)

release: 5.10 (64-bit)

version: Generic_144488-17

machine: sun4u

node name: server01

domain: aydayev.com

hw_provider: Sun_Microsystems

system type: SUNW,SPARC-Enterprise (SPARC64-VII)

hostid: 847c1fc7

dump_conflags: 0x10000 (DUMP_KERNEL) on /dev/md/dsk/d3(125G)

kmem_flags: 0xf (AUDIT|DEADBEEF|REDZONE|CONTENTS)

time of crash: Sun Nov 27 23:06:55 WAT 2011 (core is 74 days old)

age of system: 13 days 1 hours 51 minutes 3.68 seconds

panic CPU: 210 (56 CPUs, 224G memory, 3 nodes)

panic string: BAD TRAP: type=34 rp=2a1040da720 addr=deadbeefdeadc187 mmu_fsr=0


==== checking for trap information ====

CPU 210 had the panic


==== panic thread: 0x3035f247c20 ==== CPU: 210 ====

==== panic user (LWP_SYS) thread: 0x3035f247c20 PID: 17031 on CPU: 210 affinity CPU: 210 ====

cmd: /usr/sbin/in.ftpd -l -a (root cause: the command that crashed the system)

t_procp: 0x4fc4b0f4170

p_as: 0x4fc4c2ad000 size: 4210688 RSS: 3358720

hat: 0x4cf37103500

cnum: CPU160:948/4416 CPU162:544/7597 CPU164:460/5949 CPU166:489/4946 CPU136:985/46 CPU138:370/4159 CPU140:273/5099 CPU142:290/5991 CPU208:1090/6512 CPU210:655/4632 CPU212:473/2761 CPU214:467/3957 CPU0:1026/217 CPU2:382/4415 CPU4:274/4129 CPU6:267/6665 CPU72:1034/7355 CPU74:363/42 CPU76:252/6184 CPU78:252/1587 CPU32:1079/7151 CPU34:370/2611 CPU36:254/5165 CPU38:256/7132 CPU104:1057/6148 CPU106:380/1438 CPU108:273/671 CPU110:263/6088

cpusran: 0,1,2,3,4,5,6,7,33,34,35,36,37,38,39,72,73,74,75,76,77,78,79,104,105,106,107,108,109,110,111,136,137,138,139,140,141,142,143,160,161,162,163,164,165,166,167,208,209,210,211,212,213,214,215

zone: global

t_stk: 0x2a1040dbae0 sp: 0x18c0791 t_stkbase: 0x2a1040d6000

t_pri: 41(TS) t_tid: 1 pctcpu: 23.594872

t_lwp: 0x6013936acd8 machpcb: 0x2a1040dbae0

mstate: LMS_SYSTEM ms_prev: LMS_USER

ms_state_start: 0.0003415 seconds earlier

ms_start: 1 minutes 6.2930931 seconds earlier

psrset: 0 last CPU: 210

idle: 0 ticks (0 seconds)

start: Sun Nov 27 23:05:49 2011

age: 66 seconds (1 minutes 6 seconds)

syscall: #236 shutdown(, 0xffbfd370) (sysent: unix:shutdown+0x0)

tstate: TS_ONPROC - thread is being run on a processor

tflg: T_PANIC - thread initiated a system panic

T_DFLTSTK - stack is default size

tpflg: TP_TWAIT - wait to be freed by lwp_wait

TP_MSACCT - collect micro-state accounting information

tsched: TS_LOAD - thread is in memory

TS_DONT_SWAP - thread/LWP should not be swapped

pflag: SMSACCT - process is keeping micro-state accounting

SMSFORK - child inherits micro-state accounting

pc: unix:panicsys+0x48: call unix:setjmp

unix:panicsys+0x48(0x10a4ec8, 0x2a1040da4c8, 0x18c1160, 0x1, , , 0x1605, , , , , , , , 0x10a4ec8, 0x2a1040da4c8)

unix:vpanic_common+0x78(0x10a4ec8, 0x2a1040da4c8, 0x0, 0x0, 0x0, 0x60104661340)

unix:panic+0x1c(0x10a4ec8, 0x34, 0x2a1040da720, 0xdeadbeefdeadc187, 0x0, 0x7b2272f4)

unix:die+0x9c(0x34, 0x2a1040da720, 0xdeadbeefdeadc187, 0x0)

unix:trap+0x69c(0x2a1040da720, 0xdeadbeefdeadc187)

unix:ktl0+0x48()

-- trap data type: 0x34 (memory address not aligned) rp: 0x2a1040da720 --

addr: 0xdeadbeefdeadc187

pc: 0x7b2272f4 ip:ip_output_options+0xd3c: ldx [%o7 + 0x298], %l7

npc: 0x7b2272f8 ip:ip_output_options+0xd40: ldub [%i1 + 0x19], %o4

global: %g1 0x7b2d4bfc

%g2 0x1 %g3 0x10000

%g4 0 %g5 0

%g6 0 %g7 0x3035f247c20

out: %o0 0x68d %o1 0xffffffffffffffff

%o2 0x68e %o3 0x68e

%o4 0x30230d79940 %o5 0x6011de11aa0

%sp 0x2a1040d9fc1 %o7 0xdeadbeefdeadbeef

loc: %l0 0 %l1 0xdeadbeefdeadbeef

%l2 0x33c20aa4b00 %l3 0x6011e6ef580

%l4 0 %l5 0x334bbebbe38

%l6 0x3006 %l7 0x4f9d8bf5890

in: %i0 0x33c20aa4b00 %i1 0x4f9d8bf5800

%i2 0x334bbebbe38 %i3 0

%i4 0x70065f14 %i5 0x2

%fp 0x2a1040da0d1 %i7 0x7b2d4bfc

ip:ip_output_options+0xd3c(, 0x6011de11aa0, 0x334bbebbe38?, , 0x70065f14, 0x2)

ip:ip_output(0x33c20aa4b00, 0x6011de11aa0, 0x334bbebbe38, 0x2) - frame recycled

ip:tcp_send_data+0x1d4(0x33c20aa4d00, 0x349955c03c0, 0x6011de11aa0)

ip:tcp_rput_data+0x35b4(, 0x6011de11aa0?)

ip:tcp_input(0x33c20aa4b00, 0x6011de11aa0, 0x601043f1700) - frame recycled

ip:squeue_enter_nodrain+0x31c(0x601043f1700, 0x6011de11aa0, 0x7b2ca160, 0x33c20aa4b00, 0x1a)

ip:ip_fanout_tcp+0x868(0x419671c5658, 0x6011de11aa0, 0x601042ee4a8, 0x3600ed592d0, 0xa3, 0x0, 0x0)

ip:ip_wput_local+0x6f4(0x419671c5658, 0x601042ee4a8, 0x3600ed592d0, 0x6011de11aa0, 0x5b3b17c8ff8, 0x0, 0x0)

ip:ip_wput_ire+0x2fbc(0x419671c5658, 0x6011de11aa0, 0x5b3b17c8ff8, 0x6011e6ef580, 0x2, 0x0)

ip:ip_output_options+0xa14(, 0x6011de11aa0, 0x419671c5658?, , 0x70065f14, 0x2)

ip:ip_output(0x6011e6ef580, 0x6011de11aa0, 0x419671c5658, 0x2) - frame recycled

ip:tcp_send_data+0x1d4(0x6011e6ef780, 0x419671c5658, 0x6011de11aa0)

ip:tcp_xmit_end+0x98(0x6011e6ef780)

ip:tcp_wput_proto+0x410(0x6011e6ef580, 0x6011de11aa0, 0x601043f1700)

ip:squeue_enter+0x74()

ip:tcp_wput(0x419671c5658, 0x6011de11aa0) - frame recycled

unix:putnext+0x218(0x4fbdec59650, 0x6011de11aa0?)

genunix:strput+0x1b4(0x334da7c7ad8, 0x6011de11aa0, 0x0, 0x2a1040db958, 0x0, 0x0)

genunix:kstrputmsg+0x33c(0x38444d08580, , 0x0, 0x0, 0x0, 0x2c4, 0x0)

sockfs:sotpi_shutdown+0x324(, 0x1)

sockfs:shutdown+0x28(, 0x1, 0x1, 0x0)

unix:syscall_trap32+0xcc()

-- switch to user thread's user stack --


==== analyzing panic thread stack for trap frames ====


==== using trap() frame 1 @ 0x2a1040da520, rp(%i0): 0x2a1040da720 ====

type(%l2): 0x34 (memory address not aligned)

pc: 0x7b2272f4 ip:ip_output_options+0xd3c: ldx [%o7 + 0x298], %l7 (root cause: a bug in the kernel IP stack)

npc: 0x7b2272f8 ip:ip_output_options+0xd40: ldub [%i1 + 0x19], %o4

global: %g1 ip:tcp_send_data+0x1d4

%g2 0x1 %g3 0x10000

%g4 0 %g5 0

%g6 0 %g7 0x3035f247c20

out: %o0 0x68d %o1 0xffffffffffffffff

%o2 0x68e %o3 0x68e

%o4 0x30230d79940 %o5 0x6011de11aa0

%sp 0x2a1040d9fc1 %o7 0xdeadbeefdeadbeef

loc: %l0 0 %l1 0xdeadbeefdeadbeef

%l2 0x33c20aa4b00 %l3 0x6011e6ef580

%l4 0 %l5 0x334bbebbe38

%l6 0x3006 %l7 0x4f9d8bf5890

in: %i0 0x33c20aa4b00 %i1 0x4f9d8bf5800

%i2 0x334bbebbe38 %i3 0

%i4 ip(bss):zero_info+0x0 %i5 0x2

%fp 0x2a1040da0d1 %i7 ip:tcp_send_data+0x1d4

ip:ip_output_options+0xd14: ldx [%fp + 0x7f7], %i0

ip:ip_output_options+0xd18: call genunix:freemsg

ip:ip_output_options+0xd1c: restore %g0, %g0, %g0 ( restore )

ip:ip_output_options+0xd20: 80: ldx [%fp + 0x7f7], %i1

ip:ip_output_options+0xd24: or %g0, %l5, %i0 ( mov %l5, %i0 )

ip:ip_output_options+0xd28: call genunix:putq

ip:ip_output_options+0xd2c: restore %g0, %g0, %g0 ( restore )

ip:ip_output_options+0xd30: 81: ldx [%fp + 0x7f7], %o5

ip:ip_output_options+0xd34: ldx [%l5 + 0x28], %o7

ip:ip_output_options+0xd38: ldx [%o5 + 0x28], %i1

ip:ip_output_options+0xd3c: ldx [%o7 + 0x298], %l7

ip:ip_output_options+0xd40: ldub [%i1 + 0x19], %o4

ip:ip_output_options+0xd44: subcc %o4, 0x0, %g0 ( cmp %o4, 0x0 )

ip:ip_output_options+0xd48: be,pn %icc, ip:ip_output_options+0xef8 (95f)

ip:ip_output_options+0xd4c: or %g0, %i3, %l1 ( mov %i3, %l1 )

ip:ip_output_options+0xd50: subcc %o4, 0xd, %g0 ( cmp %o4, 0xd )

ip:ip_output_options+0xd54: 82: bne,a,pn %icc, ip:ip_output_options+0xee8 (94f)

ip:ip_output_options+0xd58: or %g0, %i3, %i0 ( mov %i3, %i0 )

ip:ip_output_options+0xd5c: ldx [%fp + 0x7f7], %g1

ip:ip_output_options+0xd60: ldx [%g1 + 0x10], %l0

ip:ip_output_options+0xd64: ldx [%g1 + 0x18], %l2

SolarisCAT(./vmcore.0/10U)>
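The faulting address in the trap data can be cross-checked by hand. The instruction at pc is ldx [%o7 + 0x298], %l7, and %o7 holds 0xdeadbeefdeadbeef, the pattern that kernel memory debugging (the DEADBEEF bit in kmem_flags) writes into freed buffers. Adding the 0x298 offset to that value gives the addr value 0xdeadbeefdeadc187 from the panic string, which is an odd address and therefore not aligned for an 8-byte ldx. The sketch below repeats the addition on the low 16 bits only, to stay within plain shell arithmetic; it assumes a POSIX shell such as ksh or bash, and the function name is hypothetical.

```shell
#!/bin/sh
# Add a load offset to the low 16 bits of a base register value and
# print the low 16 bits of the result in hex.
# Only the low 16 bits are handled; this is enough here because
# 0xbeef + 0x298 does not carry into the upper bits.
low16_sum() {
    printf '%x\n' $(( ($1 + $2) & 0xffff ))
}

low16_sum 0xbeef 0x298   # prints c187
```

The result c187 matches the last four hex digits of addr, which suggests the code was following a pointer read from memory that had already been freed.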

From the annotated lines in the output above (the cmd: line of the panic thread and the faulting pc: instruction), you can see the exact root cause of the system crash. Based on this information, you can search for a fix released by the vendor as a patch or update. Analyzing the dump this way reduces the time needed to resolve the issue and bring the production system back online.
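When the analyze output has been saved to a text file, the few lines that matter for a first look can be filtered out instead of scrolling through the whole report. This is a minimal sketch, assuming the report uses the same labels as the output above (panic string:, cmd:, trap data type:); the function name panic_summary is hypothetical.

```shell
#!/bin/sh
# Print the key lines of a saved analyze report read from stdin:
# the panic string, the command of the panic thread, and the trap type.
panic_summary() {
    awk '/panic string:/ || /cmd:/ || /trap data type:/ { print }'
}
```

For example, panic_summary < saved_report.txt, where saved_report.txt is a placeholder for wherever you stored the analyze output.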

 

You can also generate a Solaris CAT explorer report and send it to the vendor (Oracle) support department to get the right solution for the issue:

 

# /opt/SUNWscat/bin/scat --scat_explore ./vmcore.0

WARNING: ./vmcore.0 incomplete/corrupt. size: 32167952384, expected: 32167960576

Was the system hung? [ y or n ] : y

Please enter a one line basic problem description [ max 256 chars ] :

System has unexpected reboot with crash dump

#Extracting crash data…

#Gathering Hang Related data…

#Successful extraction

SCAT_EXPLORE_DATA_DIR=./scat_explore_server01_847c1fc7_0x6bc73a4_vmcore.0

This produces a directory containing the report files and a compressed tarball:

# ls -alF scat_explore_server01_847c1fc7_0x6bc73a4_vmcore.0/

total 2804

drwxr-xr-x 2 root root 32 Feb 10 04:04 ./

drwxr-xr-x 3 root root 8 Feb 10 04:04 ../

-rw-r--r-- 1 root root 8238 Feb 10 04:02 analyze.out

-rw-r--r-- 1 root root 60230 Feb 10 04:02 callout-a.out

-rw-r--r-- 1 root root 0 Feb 10 04:02 callout-xck.out

-rw-r--r-- 1 root root 62305 Feb 10 04:03 clockinfo.out

-rw-r--r-- 1 root root 1603 Feb 10 04:02 coreinfo.out

-rw-r--r-- 1 root root 66627 Feb 10 04:02 cpu-L.out

-rw-r--r-- 1 root root 205444 Feb 10 04:03 cpu-t.out

-rw-r--r-- 1 root root 156 Feb 10 04:04 dev_busy.out

-rw-r--r-- 1 root root 33154 Feb 10 04:02 dev_info.out

-rw-r--r-- 1 root root 6423 Feb 10 04:02 dispq.out

-rw-r--r-- 1 root root 993 Feb 10 04:02 etcsystem.out

-rw-r--r-- 1 root root 1293 Feb 10 04:02 ifconf.out

-rw-r--r-- 1 root root 3270 Feb 10 04:03 intr.out

-rw-r--r-- 1 root root 757 Feb 10 04:02 memerr.out

-rw-r--r-- 1 root root 22679 Feb 10 04:02 modinfo.out

-rw-r--r-- 1 root root 25528 Feb 10 04:02 msgbuf.out

-rw-r--r-- 1 root root 4956 Feb 10 04:02 panic.out

-rw-r--r-- 1 root root 1209 Feb 10 04:02 panic_buf.out

-rw-r--r-- 1 root root 4860 Feb 10 04:02 panic_thread.out

-rw-r--r-- 1 root root 45 Feb 10 04:02 prob_desc.out

-rw-r--r-- 1 root root 37393 Feb 10 04:04 proc.out

-rw-r--r-- 1 root root 12630 Feb 10 04:04 proc_tree.out

-rw-r--r-- 1 root root 508755 Feb 10 04:04 scat_explore_server01_847c1fc7_0x6bc73a4_vmcore.0.tar.Z

-rw-r--r-- 1 root root 16104 Feb 10 04:04 stack-l.out

-rw-r--r-- 1 root root 26093 Feb 10 04:03 stack_summary.out

-rw-r--r-- 1 root root 57167 Feb 10 04:03 stream_summary.out

-rw-r--r-- 1 root root 1223 Feb 10 04:03 thread_summary.out

-rw-r--r-- 1 root root 7938 Feb 10 04:04 tlist_rfscall.out

-rw-r--r-- 1 root root 8978 Feb 10 04:02 tunables.out

-rw-r--r-- 1 root root 7305 Feb 10 04:02 vfstab.out

 

Eldar Aydayev ©

UNIX Systems Professional Consultant | Aydayev’s Investment Business Group

1676. 23rd Ave, Noriega St. San Francisco, CA 94122

E-mail: eldar@aydayev.com

URL: http://eldar.aydayev.com

LinkedIn: http://www.linkedin.com/in/eldar

Phone: +1 (650) 2062624