Nagios 3.2.0 coredumps when started via SMF on Solaris 10
By Alasdair Lumsden on 16 Oct 2009
This one was quite interesting. If you compile your own nagios-3.2.0 from source on Solaris 10, and start it manually, it runs just fine. If you run it via SMF with a service manifest, the process continually dumps core, so you get messages such as:
[ Oct 16 19:24:48 Enabled. ]
[ Oct 16 19:24:48 Executing start method ("/opt/nagios/bin/nagios /opt/nagios/etc/nagios.cfg &") ]
[ Oct 16 19:24:48 Method "start" exited with status 0 ]
[ Oct 16 19:24:49 Stopping because process dumped core. ]
[ Oct 16 19:24:49 Executing stop method (:kill) ]
Successfully shutdown... (PID=29180)
[ Oct 16 19:24:49 Executing start method ("/opt/nagios/bin/nagios /opt/nagios/etc/nagios.cfg &") ]
[ Oct 16 19:24:49 Method "start" exited with status 0 ]
[ Oct 16 19:24:50 Stopping because process dumped core. ]
[ Oct 16 19:24:50 Executing stop method (:kill) ]
Successfully shutdown... (PID=29232)
[ Oct 16 19:24:51 Executing start method ("/opt/nagios/bin/nagios /opt/nagios/etc/nagios.cfg &") ]
[ Oct 16 19:24:51 Method "start" exited with status 0 ]
[ Oct 16 19:24:52 Stopping because process dumped core. ]
[ Oct 16 19:24:52 Executing stop method (:kill) ]
Successfully shutdown... (PID=29246)
So, why does Nagios crash when started via SMF? To find out, I enabled core dumps with coreadm:
# mkdir /cores
# coreadm -g /cores/core.%f.%p -i /cores/core.%f.%p -e global -e global-setid -e log -e process -e proc-setid
# coreadm
global core file pattern: /cores/core.%f.%p
global core file content: all
init core file pattern: /cores/core.%f.%p
init core file content: all
global core dumps: enabled
per-process core dumps: enabled
global setid core dumps: enabled
per-process setid core dumps: enabled
global core dump logging: enabled
We can then check the core dump with:
# gdb /opt/nagios/bin/nagios /cores/core.nagios.23536
GNU gdb 6.6
Copyright (C) 2006 Free Software Foundation, Inc.
*snip*
Core was generated by `/opt/nagios/bin/nagios /opt/nagios/etc/nagios.cfg'.
Program terminated with signal 11, Segmentation fault.
#0 0xfed3590c in strlen () from /lib/libc.so.1
(gdb) bt
#0 0xfed3590c in strlen () from /lib/libc.so.1
#1 0xfed8eda6 in _ndoprnt () from /lib/libc.so.1
#2 0xfed9192d in fprintf () from /lib/libc.so.1
#3 0x08067c42 in run_async_host_check_3x ()
#4 0x08066f69 in run_scheduled_host_check_3x ()
#5 0x080658d0 in perform_scheduled_host_check ()
#6 0x0807c0e8 in handle_timed_event ()
#7 0x0807bd8c in event_execution_loop ()
#8 0x0805ecaa in main ()
(gdb) quit
Interesting: it’s crashing when the Nagios function run_async_host_check_3x calls fprintf, and the fault is inside strlen. Looks like a null pointer to me. Let’s get the actual line number by installing a Nagios binary that has not been stripped of debugging symbols. Thankfully the Nagios Makefile already has a target for this:
# cd /opt/src/nagios-3.2.0
# gmake install-unstripped
cd ./base && gmake install-unstripped
gmake[1]: Entering directory `/opt/src/build/nagios/files/nagios-3.2.0/base'
gmake install-basic
gmake[2]: Entering directory `/opt/src/build/nagios/files/nagios-3.2.0/base'
/opt/sfw/bin/install -c -m 775 -o nagios -g nagios -d /opt/nagios/bin
/opt/sfw/bin/install -c -m 774 -o nagios -g nagios nagios /opt/nagios/bin
*snip*
Now we re-run nagios via SMF, then gdb the latest coredump:
gdb /opt/nagios/bin/nagios /globalcore/core.nagios.29248
GNU gdb 6.6
Copyright (C) 2006 Free Software Foundation, Inc.
*snip*
Core was generated by `/opt/nagios/bin/nagios /opt/nagios/etc/nagios.cfg'.
Program terminated with signal 11, Segmentation fault.
#0 0xfed3590c in strlen () from /lib/libc.so.1
(gdb) bt
#0 0xfed3590c in strlen () from /lib/libc.so.1
#1 0xfed8eda6 in _ndoprnt () from /lib/libc.so.1
#2 0xfed9192d in fprintf () from /lib/libc.so.1
#3 0x08067c42 in run_async_host_check_3x (hst=0x8139b78, check_options=0, latency=0.048000000000000001, scheduled_check=1, reschedule_check=1, time_is_valid=0x8047b40, preferred_time=0x8047b48) at /opt/src/build/nagios/files/nagios-3.2.0/base/checks.c:3134
#4 0x08066f69 in run_scheduled_host_check_3x (hst=0x8139b78, check_options=0, latency=0.048000000000000001) at /opt/src/build/nagios/files/nagios-3.2.0/base/checks.c:2791
#5 0x080658d0 in perform_scheduled_host_check (hst=0x8139b78, check_options=0, latency=0.048000000000000001) at /opt/src/build/nagios/files/nagios-3.2.0/base/checks.c:2108
#6 0x0807c0e8 in handle_timed_event (event=0x8133010) at /opt/src/build/nagios/files/nagios-3.2.0/base/events.c:1261
#7 0x0807bd8c in event_execution_loop () at /opt/src/build/nagios/files/nagios-3.2.0/base/events.c:1132
#8 0x0805ecaa in main (argc=134510324, argv=0x8139b78) at nagios.c:849
(gdb) quit
Aha! Now we have a line number. The line in question, line 3134 of checks.c, reads:
fprintf(check_result_info.output_file_fp,"output=%s\n",checkresult_dbuf.buf);
So checkresult_dbuf.buf must be null. I googled and found someone talking about it on the nagios-devel mailing list. It seems the fix they committed (checking whether checkresult_dbuf.buf is null) has since been reverted or overwritten, as the check is no longer in place in Nagios 3.2.0. Not to worry, here’s a patch:
--- base/checks.c.orig 2009-10-16 19:28:42.082321083 +0100
+++ base/checks.c 2009-10-16 19:29:02.197305557 +0100
@@ -3131,7 +3131,7 @@
 		fprintf(check_result_info.output_file_fp,"early_timeout=%d\n",check_result_info.early_timeout);
 		fprintf(check_result_info.output_file_fp,"exited_ok=%d\n",check_result_info.exited_ok);
 		fprintf(check_result_info.output_file_fp,"return_code=%d\n",check_result_info.return_code);
-		fprintf(check_result_info.output_file_fp,"output=%s\n",checkresult_dbuf.buf);
+		fprintf(check_result_info.output_file_fp,"output=%s\n",(checkresult_dbuf.buf==NULL)?"(null)":checkresult_dbuf.buf);
 
 		/* close the temp file */
 		fclose(check_result_info.output_file_fp);
Apply this and you should be all set!