Adjusting drive timeouts with mdb on Solaris or OpenIndiana

By Alasdair Lumsden on 14 May 2011

Update (New): These timeouts don’t do squat because mpt_sas doesn’t honour the timeouts. This was recently uncovered by Nexenta and a patch to fix it is about to hit Illumos shortly. I’ll post when it does. Another patch is in progress which will further improve how mpt_sas handles failed drives. Thanks to Albert Lee for his work on them – you, sir, rock!

Update (Old): These timeouts don’t work nearly as well as one would hope, unfortunately the sd timeouts get passed to the driver which in the case of mpt/mpt_sas, appear to do very little with them. I have raised this as an issue within the Illumos community and the debate was quite polarising; the kernel developers deny there is a problem or disagree on how to solve it, despite lots of people complaining of the same symptoms. Unfortunately I think it’s a difficult problem to solve due to the wide variety of hardware types that ZFS/Illumos is deployed on.

Our way of coping with dodgy drives is to preempt their failure via trigger happy SMART/iostat monitoring scripts that zpool offline bad drives before they fail.

Yesterday we suffered our first disk failure in our shiny new NFS cluster that has been operating flawlessly for 3 months. The NFS cluster we have is quite nice – it consists of a pair of NFS servers (96GB of RAM, Dual Intel E5620 CPUs) dual-attached to a set of LSI SAS 6Gbps JBOD arrays, with lots of Seagate Constellation ES 2TB enterprise SAS drives. For good measure there’s 1.5TB of SSD cache (6x256GB SSDs) acting as a read cache (L2ARC), and a ZeusRAM SSD acting as the write cache (ZIL). It runs a custom build of OpenIndiana.

Ordinarily a disk failure would result in at most a few minutes of stall while the OS waits for the drive to recover, and gives up. However, this drive decided simply to run glacially slowly, so it didn’t get removed in a timely fashion. In fact, it didn’t get removed at all, resulting in all IO to the SAN being stuck, causing a rather severe outage. 45 minutes in total.

When things became unresponsive, we logged in, and “iostat -xn” showed a 100% busy time on one of the disks, while the others did nothing. We attempted to “zpool offline baddisk”. Nothing much happened, presumably because the OS thought the drive was fine and was waiting on some queued IO finishing, or something along those lines. We had no immediate way of yanking the disk out, so we decided to failover the cluster from the primary NFS node to the secondary. This consists of powering off the primary node and letting the cluster software import the ZFS zpool and bring NFS services online.

When the secondary NFS node started importing the zpool, iostat once again showed a 100% busy time on the bad disk. Crap. Andrzej had the bright idea of deleting the disk entries from /dev, and sure enough this prompted ZFS to think the drive had disappeared, and the pool finally imported.

So immediately the question springs to mind, why did the OS not take this bad disk out of service? We consulted with our upstream vendor (contacted the folks over at Illumos) and all became clear.

The answer lays in the defaults in the Solaris SCSI subsystem. The default timeout for IO is 60 seconds with 5 retries (or 3 retries if its fibre channel/eSAS). For a storage array like ours, this is a 3 minute timeout for a single IO – or in other words, a very long time. Since the disk was accepting a trickle of IO, this timeout was never really reached.

Thankfully the timeouts can be adjusted, and Garrett D’Amore, the founder of Illumos and one of the lead developers who works at Nexenta, strongly suggested tuning the timeout to 5 seconds, with 3 retries.

Setting the timeout value is quite easy – its the system wide tunable sd_io_time. Keep in mind this will affect all disks. Edit /etc/system and drop in:

set sd:sd_io_time=5

If you have desktop SATA drives you’ll probably want a higher timeout, especially if you don’t have TLER (Time limited error recovery) on them, which limits error recovery to around 7 seconds.

The number of retries is set via /kernel/drv/sd.conf via sd-config-list. This file allows the setting to be set per-disk type via sd-config-list. To get 3 retries, the variable would be “retries-timeout:3”. The format of this file is a bit weird, here is an example for two disks:

sd-config-list = "STEC    ZeusRAM         ", "throttle-max:32, disksort:false, cache-nonvolatile:true",
                 "SEAGATE ST32000444SS    ", "retries-timeout:3";

The bit where you define the disk type is a fixed length field, consisting of 8 characters for the vendor, and 16 characters for the product. So you have to pad the field out to the correct length with spaces.

Once these are set, reboot to activate. You can check the values are set by doing:

## Print system wide sd_io_time timeout value:
# echo "sd_io_time::print" | mdb -k

## Print per-disk timeout and retry values:
# echo "::walk sd_state | ::grep '.!=0' | ::sd_state" | mdb -k | egrep "^un|un_retry_count|un_cmd_timeout"
un: ffffff093239d9c0
    un_retry_count = 0x3
    un_cmd_timeout = 0x5
un: ffffff093239d380
    un_retry_count = 0x3
    un_cmd_timeout = 0x5

The return values are in hexadecimal, so for example 0x3c is 60 seconds.

Adjusting values without rebooting

We have a number of storage servers in production, some of which we really didn’t want to reboot just to change the timeout value. After discussions with some of the Illumos kernel developers,
we worked out how to set the property at runtime using the modular Solaris debugger, mdb. This allows editing kernel values at runtime.

The system wide sd_io_time is used to populate a per-disk timeout value which is also stored in the same structure as the per-disk retry count. So changing the values is pretty similar.

First, we want to obtain the memory values for the settings we wish to edit:

# echo "::walk sd_state | ::grep '.!=0' | ::print -a struct sd_lun un_cmd_timeout" | mdb -k > /tmp/un_cmd_timeouts

# cat /tmp/un_cmd_timeouts
ffffff0d347a3a7c un_cmd_timeout = 0x3c
ffffff0d247983bc un_cmd_timeout = 0x3c
ffffff0d3429d3fc un_cmd_timeout = 0x3c
ffffff0d55daf37c un_cmd_timeout = 0x3c

Now we have the values in /tmp/un_cmd_timeouts, we can set the value using mdb -kw:

# for i in `cat /tmp/un_cmd_timeouts  | awk '{print $1}'` ; do echo ${i}/W 0x5 | mdb -kw ; done

We can then check the value was set by re-running:

# echo "::walk sd_state | ::grep '.!=0' | ::print -a struct sd_lun un_cmd_timeout" | mdb -k

Now we can do the same for un_retry_count:

# echo "::walk sd_state | ::grep '.!=0' | ::print -a struct sd_lun un_retry_count" | mdb -k > /tmp/un_retry_count
# for i in `cat /tmp/un_retry_count  | awk '{print $1}'` ; do echo ${i}/W 0x3 | mdb -kw ; done

Hey presto, we just adjusted boot time kernel parameters on the fly :-)

If you need to know which disk is which, you can assume the output from mdb is ordered, and do:

echo "::walk sd_state | ::grep '.!=0' | ::print struct sd_lun un_sd | ::print struct scsi_device sd_dev | ::devinfo -q" | mdb -k

This returns the sd instance id, which can be seen from “iostat -E”. StackOverflow has some answers for mapping from sd to device name should you need to.

Concluding Remarks

With these values in place, our timeout is reduced from upwards of 3 minutes, to a mere 15 seconds. This is far more likely to cause the OS to offline dodgy disks like the one we were experiencing issues with.

There has been some recent discussion on the Illumos mailing lists regarding the default sd_io_time value, suggesting that the default should be lowered to 8 seconds. This has caused a bit of a furore, as people using Solaris with fibre channel disk arrays require higher timeouts, say 180 seconds. So there are people on both sides of the fence. But one thing is for sure – its a setting more people should know about.