By Alasdair Lumsden on 23 Mar 2010
We use Solaris Zones, with each zone stored on its own zpool. Each zpool lives on a SAN and is accessed via iSCSI. We’ve been doing this since Solaris 10 update 6, and Solaris 10 update 8 introduced an interesting issue.
When we asked a S10u8 box to reboot, it sat there for 10 minutes shutting down. Why? Because it was trying to stop the iSCSI initiator whilst there were live iSCSI filesystems in use. Duh! Stupid Solaris.
So I compared the iSCSI manifest from S10u7 to S10u8, and it has changed in a few places. It used to depend on svc:/network/physical and svc:/system/metainit; now it depends on svc:/network/service and svc:/network/loopback. The biggest change, however, was the stop timeout, which was upped from 5 seconds to 600 seconds. Yes, 10 minutes.
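If you want to check what your box is doing, the stop timeout is visible directly in the service repository. A sketch of how I'd inspect it (the FMRI `svc:/network/iscsi_initiator` is what I'd expect on Solaris 10, but verify it with `svcs -a | grep iscsi` on your system):

```shell
# Find the iSCSI initiator service FMRI on this box
svcs -a | grep iscsi

# Show the stop method's timeout_seconds property
# (600 on S10u8, 5 on S10u7, per the manifest diff above)
svcprop -p stop/timeout_seconds svc:/network/iscsi_initiator

# Show its declared dependencies for comparison
svcs -d svc:/network/iscsi_initiator
```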
This highlighted an interesting problem: on reboot, Solaris had always tried to stop the iSCSI initiator while there were live filesystems on it. Previously it would just give up after 5 seconds and the box would come down anyway.
Rather than hack the timeout value back to 5 seconds, I decided to investigate and see if I could add a dependency to fix this properly. I decided to make the svc:/system/filesystem/local service depend on the iSCSI initiator service. The theory here was that filesystem/local mounts and unmounts the ZFS filesystems, so if it depends on the initiator, the initiator won’t be stopped before the ZFS filesystems have been unmounted.
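For the record, adding such a dependency doesn't require editing the manifest XML; it can be done live with svccfg. A hedged sketch of what I mean (the property group name `iscsi` is arbitrary, and whether `require_all` is the right grouping here is exactly the sort of thing that went wrong for me):

```shell
# Add a dependency on the iSCSI initiator to filesystem/local
svccfg -s svc:/system/filesystem/local <<'EOF'
addpg iscsi dependency
setprop iscsi/grouping = astring: require_all
setprop iscsi/restart_on = astring: none
setprop iscsi/type = astring: service
setprop iscsi/entities = fmri: svc:/network/iscsi_initiator
EOF

# Make the running service pick up the new dependency
svcadm refresh svc:/system/filesystem/local
```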
Unfortunately this didn’t work. Somewhere in the enormous SMF dependency tree I’d created a cycle, and services wouldn’t come up on boot. At that point I gave up and set the timeout back to 5 seconds.
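Setting the timeout back is a one-liner, at least. Roughly what I did (again, confirm the FMRI on your box first; the property lives on the service's stop method):

```shell
# Restore the S10u7 behaviour: give up stopping the initiator after 5s
svccfg -s svc:/network/iscsi_initiator setprop stop/timeout_seconds = count: 5
svcadm refresh svc:/network/iscsi_initiator
```

This just papers over the problem, of course: the initiator still fails to stop cleanly, it just fails quickly.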
If I can find the time, I’ll try to reproduce this issue on OpenSolaris and file it on defects.opensolaris.org. Once it’s been accepted, I’ll escalate it against our Solaris 10 premium support contract and see if Sun will actually fix something for us.