Nagios 3.2.0 coredumps when started via SMF on Solaris 10

By Alasdair Lumsden on 16 Oct 2009

This one was quite interesting. If you compile your own nagios-3.2.0 from source on Solaris 10 and start it manually, it runs just fine. If you run it via SMF with a service manifest, the process repeatedly dumps core and is restarted, so the service log fills with messages such as:

[ Oct 16 19:24:48 Enabled. ]
[ Oct 16 19:24:48 Executing start method ("/opt/nagios/bin/nagios /opt/nagios/etc/nagios.cfg &") ]
[ Oct 16 19:24:48 Method "start" exited with status 0 ]
[ Oct 16 19:24:49 Stopping because process dumped core. ]
[ Oct 16 19:24:49 Executing stop method (:kill) ]
Successfully shutdown... (PID=29180)
[ Oct 16 19:24:49 Executing start method ("/opt/nagios/bin/nagios /opt/nagios/etc/nagios.cfg &") ]
[ Oct 16 19:24:49 Method "start" exited with status 0 ]
[ Oct 16 19:24:50 Stopping because process dumped core. ]
[ Oct 16 19:24:50 Executing stop method (:kill) ]
Successfully shutdown... (PID=29232)
[ Oct 16 19:24:51 Executing start method ("/opt/nagios/bin/nagios /opt/nagios/etc/nagios.cfg &") ]
[ Oct 16 19:24:51 Method "start" exited with status 0 ]
[ Oct 16 19:24:52 Stopping because process dumped core. ]
[ Oct 16 19:24:52 Executing stop method (:kill) ]
Successfully shutdown... (PID=29246)

So why does Nagios crash when started via SMF? To find out, I enabled core dumps with coreadm:

# mkdir /cores
# coreadm -g /cores/core.%f.%p -i /cores/core.%f.%p -e global -e global-setid -e log -e process -e proc-setid
# coreadm
     global core file pattern: /cores/core.%f.%p
     global core file content: all
       init core file pattern: /cores/core.%f.%p
       init core file content: all
            global core dumps: enabled
       per-process core dumps: enabled
      global setid core dumps: enabled
 per-process setid core dumps: enabled
     global core dump logging: enabled

We can then check the core dump with:

# gdb /opt/nagios/bin/nagios /cores/core.nagios.23536
GNU gdb 6.6
Copyright (C) 2006 Free Software Foundation, Inc.
*snip*
Core was generated by `/opt/nagios/bin/nagios /opt/nagios/etc/nagios.cfg'.
Program terminated with signal 11, Segmentation fault.
#0  0xfed3590c in strlen () from /lib/libc.so.1
(gdb) bt
#0  0xfed3590c in strlen () from /lib/libc.so.1
#1  0xfed8eda6 in _ndoprnt () from /lib/libc.so.1
#2  0xfed9192d in fprintf () from /lib/libc.so.1
#3  0x08067c42 in run_async_host_check_3x ()
#4  0x08066f69 in run_scheduled_host_check_3x ()
#5  0x080658d0 in perform_scheduled_host_check ()
#6  0x0807c0e8 in handle_timed_event ()
#7  0x0807bd8c in event_execution_loop ()
#8  0x0805ecaa in main ()
(gdb) quit

Interesting: it’s crashing when the Nagios function run_async_host_check_3x calls fprintf, and the fault is inside strlen. Looks like a null pointer to me. Let’s get the actual line number by installing a Nagios binary which has not been stripped of debugging symbols. Thankfully the Nagios Makefile already has a target for this:

# cd /opt/src/nagios-3.2.0
# gmake install-unstripped
cd ./base && gmake install-unstripped
gmake[1]: Entering directory `/opt/src/build/nagios/files/nagios-3.2.0/base'
gmake install-basic
gmake[2]: Entering directory `/opt/src/build/nagios/files/nagios-3.2.0/base'
/opt/sfw/bin/install -c -m 775 -o nagios -g nagios -d /opt/nagios/bin
/opt/sfw/bin/install -c -m 774 -o nagios -g nagios nagios /opt/nagios/bin
*snip*

Now we re-run nagios via SMF, then gdb the latest coredump:

# gdb /opt/nagios/bin/nagios /globalcore/core.nagios.29248
GNU gdb 6.6
Copyright (C) 2006 Free Software Foundation, Inc.
*snip*
Core was generated by `/opt/nagios/bin/nagios /opt/nagios/etc/nagios.cfg'.
Program terminated with signal 11, Segmentation fault.
#0  0xfed3590c in strlen () from /lib/libc.so.1
(gdb) bt
#0  0xfed3590c in strlen () from /lib/libc.so.1
#1  0xfed8eda6 in _ndoprnt () from /lib/libc.so.1
#2  0xfed9192d in fprintf () from /lib/libc.so.1
#3  0x08067c42 in run_async_host_check_3x (hst=0x8139b78, check_options=0, latency=0.048000000000000001,
    scheduled_check=1, reschedule_check=1, time_is_valid=0x8047b40, preferred_time=0x8047b48)
    at /opt/src/build/nagios/files/nagios-3.2.0/base/checks.c:3134
#4  0x08066f69 in run_scheduled_host_check_3x (hst=0x8139b78, check_options=0, latency=0.048000000000000001)
    at /opt/src/build/nagios/files/nagios-3.2.0/base/checks.c:2791
#5  0x080658d0 in perform_scheduled_host_check (hst=0x8139b78, check_options=0, latency=0.048000000000000001)
    at /opt/src/build/nagios/files/nagios-3.2.0/base/checks.c:2108
#6  0x0807c0e8 in handle_timed_event (event=0x8133010) at /opt/src/build/nagios/files/nagios-3.2.0/base/events.c:1261
#7  0x0807bd8c in event_execution_loop () at /opt/src/build/nagios/files/nagios-3.2.0/base/events.c:1132
#8  0x0805ecaa in main (argc=134510324, argv=0x8139b78) at nagios.c:849
(gdb) quit

Aha! Now we have a line number. The line in question, line 3134 of checks.c, reads:

fprintf(check_result_info.output_file_fp,"output=%s\n",checkresult_dbuf.buf);

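So checkresult_dbuf.buf must be NULL here. Passing a NULL pointer to %s is undefined behaviour in C, and Solaris libc simply dereferences it in strlen (glibc would print "(null)" instead). I googled, and found someone discussing the same crash on the nagios-devel mailing list. It seems the fix they committed (checking whether checkresult_dbuf.buf is NULL) has since been reverted or overwritten, as the check is no longer present in Nagios 3.2.0. Not to worry, here’s a patch: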

--- base/checks.c.orig  2009-10-16 19:28:42.082321083 +0100
+++ base/checks.c       2009-10-16 19:29:02.197305557 +0100
@@ -3131,7 +3131,7 @@
                                fprintf(check_result_info.output_file_fp,"early_timeout=%d\n",check_result_info.early_timeout);
                                fprintf(check_result_info.output_file_fp,"exited_ok=%d\n",check_result_info.exited_ok);
                                fprintf(check_result_info.output_file_fp,"return_code=%d\n",check_result_info.return_code);
-                               fprintf(check_result_info.output_file_fp,"output=%s\n",checkresult_dbuf.buf);
+                               fprintf(check_result_info.output_file_fp,"output=%s\n",(checkresult_dbuf.buf==NULL)?"(null)":checkresult_dbuf.buf);

                                /* close the temp file */
                                fclose(check_result_info.output_file_fp);

Apply this and you should be all set!
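
If you would rather not repeat that ternary at every call site, the same guard can be pulled out into a small helper. What follows is only a minimal sketch of the idea, not Nagios code: the names safe_str and write_check_output are hypothetical and exist purely for illustration.

```c
#include <stdio.h>

/* Hypothetical helper (not part of Nagios): return a printable
 * placeholder when the string pointer is NULL, so that "%s" is
 * never handed a NULL pointer. */
const char *safe_str(const char *s)
{
    return (s == NULL) ? "(null)" : s;
}

/* Hypothetical wrapper mirroring the patched line in checks.c:
 * writes "output=<text>\n" to fp and returns fprintf's result
 * (a negative value on error). */
int write_check_output(FILE *fp, const char *buf)
{
    return fprintf(fp, "output=%s\n", safe_str(buf));
}
```

Calling write_check_output(stdout, NULL) prints "output=(null)" instead of crashing, which is the same behaviour the patch gives the real fprintf call.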