illumos watch – December 2013 & January 2014

By Alasdair Lumsden on 6 Feb 2014

Hi all, welcome to my second illumos watch blog post. As I missed December due to the festive period, this January edition will include updates from December as well.

Some big news is that Nexenta have released their fork of illumos-gate to the wild, with complete history and changesets. This is great news for the community, as there are a lot of improvements, including slow drive ejection code. Looking forward to giving this a spin.

Commits down at the illumos-gate

Over December and January there haven’t been too many major commits to the gate directly. The biggest noteworthy commit was the introduction of ZFS Bookmarks:

4369 implement zfs bookmarks

This new ZFS feature (com.delphix:bookmarks) allows you to create a bookmark of a snapshot. The bookmark can be used as a point in time for an incremental send, even if you destroy the snapshot and free its data. A curious feature, I’m still not completely clear on when I’d use it.
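To get my head around it, here is a toy model in Python. It is only a sketch: as far as I understand, a bookmark records little more than the point in time (the transaction group, or txg) at which the snapshot was created, and an incremental send picks up the blocks written after that point. The Block class, the incremental_send function and all the txg numbers below are made up for illustration and are not ZFS code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    name: str
    birth_txg: int  # transaction group in which the block was written


def incremental_send(blocks, from_txg, to_txg):
    """Return the names of blocks born after from_txg and no later than to_txg."""
    return [b.name for b in blocks if from_txg < b.birth_txg <= to_txg]


# Toy dataset: three blocks written in different transaction groups.
blocks = [Block("a", 10), Block("b", 20), Block("c", 30)]

snap1_txg = 20            # first snapshot taken at txg 20
bookmark_txg = snap1_txg  # the bookmark remembers only this number
snap2_txg = 30            # a later snapshot

# Even if the first snapshot is destroyed, the bookmark still marks the start.
stream = incremental_send(blocks, bookmark_txg, snap2_txg)
print(stream)  # ['c']
```

If that model is right, the use case would be a sending host that is short on space: it can destroy old snapshots after sending them and keep only bookmarks, while the receiving host keeps the snapshots themselves.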

4506 GCC should be the primary compiler

Short and sweet, this commit simply changes the default compiler to be GCC instead of Sun Studio. Since Sun Studio is closed source and not generally free, this makes sense. Most major illumos consumers, such as Joyent, are using gcc. There was some talk on IRC of removing cw (a tool that translates between Sun Studio and gcc arguments) and using gcc arguments directly; however, that’s quite a step.

4370 avoid transmitting holes during zfs send

It wasn’t immediately clear if this related to sparse files or not, but a chap on IRC in #illumos said he thought it related to ZVOL SCSI unmaps. Either way, this should improve zfs send performance when you’ve got holey data.

4493 want siginfo, 4494 Make dd show progress when you send INFO/USR1 signals

This is my personal favourite. Solaris/illumos dd has long lacked the progress indicator present in GNU dd, which prints its progress to the console when you run “kill -USR1 [pid of dd]”. As part of this commit we’ve also gained the SIGINFO signal, available on the BSDs.
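For anyone who hasn't seen the pattern, here is a minimal Python sketch of a long-running copy loop that reports progress when it receives SIGUSR1. It is not the dd implementation; the names progress, reports and report_progress are invented for the example, and the loop signals itself only so that the sketch is self-contained.

```python
import os
import signal

progress = {"bytes_copied": 0}
reports = []


def report_progress(signum, frame):
    # Runs when the signal arrives; a real dd would print records in/out.
    message = f"{progress['bytes_copied']} bytes copied"
    reports.append(message)
    print(message)


signal.signal(signal.SIGUSR1, report_progress)

# Pretend to copy four 512-byte blocks, poking ourselves halfway through.
for i in range(4):
    progress["bytes_copied"] += 512
    if i == 1:
        # Stands in for running `kill -USR1 <pid>` from another terminal.
        os.kill(os.getpid(), signal.SIGUSR1)
```

In real use the signal comes from outside the process, and the copy carries on uninterrupted after the handler returns.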

4211 Some syslog facility names and symbols are missing

This is quite a cute one. We finally gain authpriv, ftp, ntp, console and cron syslog names. To quote Gary:

The reason for adding the missing facility codes and names is to make syslog work better when syslog messages are forwarded from other systems. Without these changes, the only way to specify those facilities in syslog.conf is to use numbers. With these changes, names can be specified instead.
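To illustrate why the names matter, here is a small Python sketch that decodes the priority value carried at the front of a forwarded syslog message (facility multiplied by 8, plus severity) into the facility.severity form used in syslog.conf. The facility numbers in the table are an assumption on my part, based on the usual BSD-style numbering, and decode_pri is a made-up helper rather than anything from the commit.

```python
# Facility numbers assumed from the common BSD-style numbering;
# check sys/syslog.h on your system before relying on them.
FACILITIES = {
    10: "authpriv",
    11: "ftp",
    12: "ntp",
    14: "console",
    15: "cron",
}
SEVERITIES = ["emerg", "alert", "crit", "err", "warning", "notice", "info", "debug"]


def decode_pri(pri):
    """Split a syslog PRI value into 'facility.severity'."""
    facility, severity = divmod(pri, 8)
    # Facilities without a known name fall back to their number.
    name = FACILITIES.get(facility, str(facility))
    return f"{name}.{SEVERITIES[severity]}"


print(decode_pri(86))   # authpriv.info
print(decode_pri(107))  # 13.err
```

The second call shows the old situation: a facility with no name in the table can only be referred to by its number.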

4304 fmdump shall emit JSON

fmdump is a utility allowing the viewing of fault management events, and adding JSON support provides an easy way of sucking the information into software that wants it.

1023 nv_sata support for NVIDIA MCP61

Useful for those with such chipsets :-)

4500 mptsas_hash_traverse() is unsafe, leads to missing devices

Another fix for everyone’s favourite SAS controller, this helps in cases where a timing issue results in an incomplete set of devices showing up.

Fixes and minor enhancements

In addition to the above, there were a bunch of enhancements to snoop, such as large file support and output enhancements (rpcbind GETTIME, rpc summary, NFS_ACL mask in hex). MDB (the modular debugger command line tool) gained tab completion support for ::help. There was a fix for a memory leak in the l2arc compression code. We updated our pci.ids database. We got a fix so uptime uses locale settings for current time. There were also numerous other fixes; see the full list here.

Commits over at SmartOS Towers


There were surprisingly few commits that didn’t make their way back up to illumos-gate from Joyent over the past few months. The only ones of interest are OS-2629 want ability to provide additional objects at boot, OS-2679 NFS mount hardcodes udp (not sure what problem was addressed here), and OS-2706 zone hung in down state, stuck in ilb_stack_fini.

I must say, Joyent are doing an excellent job of upstreaming things.

A sneak peek at Nexenta’s illumos-gate


Nexenta recently snuck their NexentaStor 4.0 illumos-gate work onto Github. This came as a very welcome shock, as I was beginning to wonder if we’d ever see it. Nexenta had hinted they’d make their illumos-gate work public when NexentaStor 4.0 shipped, but we’ve been waiting for 4.0 for so long now (since 2010) that it got the nickname Duke Nukem Forever.

The Github repo contains a considerable number of changes. However since we don’t have access to their private Jira bug tracker, many of them are without comment or context. I heard third-hand that Nexenta do intend to upstream them into illumos, which will be very welcome. Not only do all illumos consumers benefit, but the upstreaming process involves code review, improving quality too.

Below is a list of some of the most exciting commits.

Features to evict slow disk drives


A single slow disk caused by partial failure is one of the worst failure modes you can encounter. Such a disk in a busy storage array will stall all your IO, causing a catastrophic outage. We’ve encountered this situation across a range of drive types and systems. It caused considerable pain on our shared NFS storage product. I love ZFS, and the alternatives are considerably inferior, but our customers do not accept outages, so we were very regrettably forced to switch to NetApp.

Thankfully Nexenta, who compete with NetApp, have been working on several features to detect and evict slow drives from a system, so that IO can continue unimpeded. If these work as advertised, it’s a big win for illumos and ZFS based storage. ZFS may once again tempt us back.

There appear to be several main components to their work: a Fault Management (fmd) module, code in the SCSI SD layer, and support within the mpt_sas driver. The first commit I can see removes Nexenta’s previous primitive way of detecting drive timeouts directly within mpt_sas:

OS-59 remove automated target removal mechanism from mpt_sas.

The next commit adds an FM module, mpt_sas hooks and sd hooks:

OS-62 slow io error detector is needed.

Then there are other related commits:

OS-60 mptsas watchdog resolution considered way to long for accurate CMD timeouts.

OS-61 Need ability for fault injection in mptsas.

OS-70 remove zio timer code.

OS-65 New FMA agent is needed to consume diagnosed slow IO.

OS-58 ZFS and failfast need to recover from unresponsive devices.

OS-91 mptsas does inquiry without setting pkt_time.

OS-116 provide more detailed information about diagnosed fault.

OS-117 slow IO DE creates bad FMA messages.

These commits touch many parts of the system, so it will take time to untangle them and get a working build out, but we’re going to give this a go.

Class of Service (Tiered Storage) support

It looks like Nexenta have added Class of Service support to ZFS, with support for storing particular types of data on particular vdevs. This could be useful for storing metadata or important data on higher speed storage. Their comment block contains a lot of information:

 + * There already exist several types of "special" vdevs in zpool:
 + * log, cache, and spare. However, there are other dimensions of
 + * the issue that could be addressed in a similar fashion:
 + *  - vdevs for storing ZFS metadata, including DDTs
 + *  - vdevs for storing important ZFS data
 + *  - vdevs that absorb write load spikes and move the data
 + *    to regular devices during load valleys
 + *
 + * Clearly, there are lots of options. So, a generalized "special"
 + * vdev class is introduced that can be configured to assume the
 + * following personalities:
 + *  - ZIL     - store ZIL blocks in a way quite similar to SLOG
 + *  - META    - in addition to ZIL blocks, store ZFS metadata
 + *  - WRCACHE - in addition to ZIL blocks and ZFS metadata, also
 + *              absorb write load spikes (store data blocks),
 + *              and move the data blocks to "regular" vdevs
 + *              when the system is not too busy
 + *
 + * The ZIL personality is self-explanatory. The remaining two
 + * personalities are also given additional parameters:
 + *  - low/high watermarks for space use
 + *  - enable/disable special device
 + *
 + * The watermarks for META personality determine if the metadata
 + * can be placed on the special device, with hysteresis:
 + * until the space used grows above high watermark, metadata
 + * goes to the special vdev, then it stops going to the vdev
 + * until the space used drops below low watermark
 + *
 + * For WRCACHE, the watermarks also gradually reduce the load
 + * on the special vdev once the space consumption grows beyond
 + * the low watermark yet is still below high watermark:
 + * the closer to the high watermark the space consumtion gets,
 + * the smaller percentage of writes goes to the special vdev,
 + * and once the high watermark is reached, all the data goes to
 + * the regular vdevs.
 + *
 + * Additionally, WRCACHE moves the data off the special device
 + * when the system write load subsides, and the amount of data
 + * moved off the special device increases as the load falls. Note
 + * that metadata is not moved off the WRCACHE vdevs.
 + *
 + * The pool configuration parameters that describe special vdevs
 + * are stored as nvlist in the vdevs' labels along with other
 + * standard pool and vdev properties. These parameters include:
 + * - class of special vdevs in the pool (ZIL, META, WRCACHE)
 + * - whether special vdevs are enabled or not
 + * - low and high watermarks for META and WRCACHE
 + * - a flag that marks special vdevs
 + *
 + * The currently supported modes are ZIL and META
 + * (see usr/src/common/zfs/zpool_prop.c) but WRCACHE support will
 + * be provided soon
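
To make the watermark behaviour concrete, here is a rough Python sketch of my reading of that comment. It is not Nexenta's code: MetaPlacement and wrcache_share are invented names, the watermarks are expressed as percentages of space used, and the linear ramp for WRCACHE is my assumption (the comment only says the share of writes shrinks as usage approaches the high watermark).

```python
class MetaPlacement:
    """Hysteresis between a low and a high watermark (percent of space used)."""

    def __init__(self, low, high):
        self.low, self.high = low, high
        self.accepting = True

    def use_special(self, used_pct):
        if self.accepting and used_pct > self.high:
            self.accepting = False  # crossed the high watermark: stop
        elif not self.accepting and used_pct < self.low:
            self.accepting = True   # dropped below the low watermark: resume
        return self.accepting


def wrcache_share(used_pct, low, high):
    """Fraction of writes sent to the special vdev (linear ramp is an assumption)."""
    if used_pct <= low:
        return 1.0
    if used_pct >= high:
        return 0.0
    return (high - used_pct) / (high - low)


meta = MetaPlacement(low=60, high=80)
decisions = [meta.use_special(p) for p in (50, 79, 81, 70, 59, 70)]
print(decisions)                  # [True, True, False, False, True, True]
print(wrcache_share(70, 60, 80))  # 0.5
```

Note how 70% usage gives a different answer for metadata depending on whether the high watermark was crossed recently; that is the hysteresis the comment describes.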

Most (all?) of the available CoS work is contained in the following commits:

Moved closed ZFS files to open repo, changed Makefiles accordingly.
OS-80 support for vdev and CoS properties for the new I/O scheduler.
Issue #9: Support for persistent CoS/vdev attributes with feature flags.

Now, this “Moved closed ZFS files to open repo” got me wondering, what closed ZFS bits? Was this work previously closed source, and they’ve decided to open it? It turns out, after asking around, that yes, this is the case. I’m glad Nexenta, as a company that defines themselves as selling “open storage”, saw the light on this. One wouldn’t want to think Nexenta was aspiring to be the next Oracle.

There are other examples of the closed vs open debate going on within Nexenta. It’s unfortunate that Nexenta considered this route, and indeed that they waited so long to release their source code. Delphix, Joyent and various others have been strong community players, releasing their work as they go (and benefiting from the public review process). I hope Nexenta moves in this direction as it benefits everybody. It’s especially unfair if one player consumes the work of others without giving back.

Multi-threaded zpool import (speed improvement)

SUP-647 Long failover times dominated by zpool import times trigger client-side errors

On large systems with many disks, zpool import can take a very long time. In a clustered environment, cluster failover involves zpool importing the pool on the good system, and if zpool import is slow, failover is slow. In a previous incarnation of our cloud environment where we used centralised ZFS storage for Virtual Machines, zpool import could be slow enough to result in drive timeouts within VMs, causing cloud-wide outages. Not ideal.

With the above commit, it appears Nexenta have added multi-threaded support for the mount portion of the import procedure, which I presume significantly improves pool import time. This will be very welcome for clustered systems.
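
I haven't picked through the commit itself, so the following Python sketch only illustrates the general idea of mounting filesystems in parallel, not how Nexenta did it. The mount function is a stand-in that records the order, parallel_mount is a made-up name, and the approach of going one directory depth at a time (so parents are always mounted before their children) is my own assumption.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import groupby


def depth(path):
    return path.rstrip("/").count("/")


def mount(path, log):
    # Stand-in for the real mount call; here we only record the order.
    log.append(path)


def parallel_mount(mountpoints, workers=4):
    log = []
    ordered = sorted(mountpoints, key=depth)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _, level in groupby(ordered, key=depth):
            # Mountpoints at the same depth cannot contain one another,
            # so they are handed to the worker threads together.
            list(pool.map(lambda p: mount(p, log), level))
    return log


order = parallel_mount(["/tank/a/x", "/tank", "/tank/b", "/tank/a"])
print(order[0], order[-1])  # /tank /tank/a/x
```

With thousands of filesystems in a pool, the time saved comes from the wide levels in the middle of the tree, where many mounts can be in flight at once.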

Partial scrub support (scrub only Metadata or MOS)

Issue #26: partial scrub

This is an interesting feature – it allows ZFS to do a scrub of only metadata or MOS. I wasn’t sure what the MOS was, so I googled:

The MOS contains pool wide information for describing and managing relationships between and properties of various object sets. There is only one MOS per pool pointed to by the block pointer in the uberblock. The MOS also contains the pointer to the ZIL.

– Page 4 of Reliability Analysis of ZFS by Asim Kadav

I can see this being a useful feature on large systems, where long scrubs may degrade performance. If you can do regular scrubs of the Metadata/MOS, you can ensure pool integrity, which can often be of a higher importance than data integrity.

General error handling improvements

OS-90 Kernel should generate events when device gets retired / unretired
OS-119 use disk sense data to trigger over-temp fault
NEX-941 zfs doesn’t replace “UNAVAIL” disk from spares in pool
OS-104 handle attach-failure ereport

The above commits add some general improved error handling for various situations, which are always welcome. Whether these commits work in isolation or in combination with others remains to be seen. Again, hopefully someone will have the time to pick through them and upstream them.

From the Water Cooler


Tribblix Milestone 8 Released


Peter Tribble happily announced that Tribblix Milestone 8 is now available! Congratulations Peter, excellent work.

The main focus of this release was PXE support and network install. More info here.

Milestone 8 can be found on the Tribblix Downloads page.

Project Bardiche – Enhanced KVM network throughput


Robert Mustacchi at Joyent has been hard at work on a project called Bardiche, which aims to provide a high throughput layer 2 datapath, useful in virtualisation for KVM guests. In addition to improved performance for KVM, it should also provide firewalling via fwadm/ipf of KVM guests via the Global Zone – a very neat feature. More information can be found on the Bardiche Project Page.

Really looking forward to this feature making its way into SmartOS mainline!

Tickless Kernel Project Musings

There was an interesting discussion over on illumos-developer about the status of the tickless kernel project. In theory a tickless kernel should use less power. Rafael Vanoni reports:

The tickless project had a handful of sub-projects, but we only got to
implement two or three of them. The lbolt decoupling was the last one,
and not a particularly challenging task although a really fun one
(specially for a young engineer). You really need all of those
sub-projects before seeing any benefits.

It’s a shame, really. We’re just about the only OS that isn’t
tickless, which by itself should have been enough justification to
finish it.

I’d still love to get it done, fwiw.

The thread had a bit of a discussion regarding the value of Tickless for datacenter systems.

/etc/profile.d and /etc/.login.d

Alexander Pyhalov, who does a lot of fantastic work on OpenIndiana, proposed adding /etc/profile.d and /etc/.login.d directories, which are common on other systems to provide scripts to be run upon login. These directories are especially useful for package managers, which can easily add/remove files but not necessarily edit them.

The thread got over 40 replies, and was slightly polarising. Peter Tribble was strongly against:

I think this is a thoroughly bad idea. I’ve used systems that have these
sorts of startup initialisations in place, and have ended up having to
expend considerable effort to work around some of the rank stupidities
that 3rd-party packages (or no doubt well-intentioned but short sighted
power users or sysadmins) have inflicted on users. Often, the startup
scripts are only relevant to a small subset of users some of the time,
or should never be inflicted on any user at any time.

Consider this a -1 from me. Having a standard location for software
to drop its environment probably isn’t such a bad thing, although I’m
not sure the given location is that great (having become used to
such locations being used for self-assembly). Having every such
startup file forced on all users at all times regardless seems very
unwise. – Peter Tribble

Andrew Stormont was for it. Garrett D’Amore was in favour of making this a distribution issue, with Gordon Ross suggesting illumos go further and stop shipping profile bits from /etc:

An interesting suggestion made re. that issue was to
simply remove this environment setup from illumos.
(No common /etc/… no default .profile, nothing.)
Distributions can then easily add what they like.

It turns out that really none of those customizations
are needed to give you a reasonably functional
login shell.

The length of this thread suggests that might be
the easiest way to satisfy everyone. – Gordon Ross

I’m generally in favour of it. IMHO preventing useful features being added because “vendors might abuse it” is not helpful. I’d rather they add files that can be removed than edit files in place. Alexander reports that OpenIndiana’s Hipster branch has the change integrated, so upstreaming becomes less important, and with the discussion tailing off in mid-December, I imagine this will be left as-is in illumos.

ZFS throttle and txg sync times

Bryan Cantrill mailed illumos-developer regarding long transaction sync times, which spawned an interesting thread regarding the new ZFS IO scheduler which was put back in August 2013. The use of disksort algorithms with modern disks was also discussed.

Dan McDonald moves from Nexenta to OmniTI


Dan McDonald, who is a lovely person to chat to and has helped me with many issues in the past, including with mpt_sas, has moved from Nexenta to OmniTI. Dan was the IPsec project lead at Sun and has contributed many fixes and features to illumos. I wish him all the best at OmniTI!

NVMe (NVM Express, or Non-Volatile Memory Host Controller Interface, NVMHCI)


Nexenta has been working on illumos support for NVMe, a high speed SSD interconnect technology. The driver is based on the FreeBSD driver and has been done in collaboration with Intel.

Linux Branded Zone Support


David Mackay mailed illumos-developer mid-January with a webrev resurrecting (and improving) Linux 2.6 branded zone support. This work was very well received by the community:

Anyway, I think that this is great work — and it’s clear from the response here that many others in the community agree! – Bryan Cantrill

… this is fantastic work and I tip my hat to you for getting
this done. – Saso Kiselkov

+1. exciting stuff. – Theo Schlossnagle

I’m personally very interested in strong Linux branded zone support. Although KVM on SmartOS and larger memory machines reduce the requirement for it, maintaining whole VMs is still painful. We have clients with legacy Linux applications that I’d be keen to switch over to it.

The webrev is out for review at the moment, so hopefully, if people are happy with it, it will end up being committed and see continued improvement over time. David has already added support for some Linux 3.9 system calls, and is working on epoll support.

Closing Remarks

What a great end of 2013 and start of 2014! Lots of interesting work going on… Linux 2.6 branded zones support and the Nexenta drive timeout work are my favourites so far. Looking forward to a great year ahead :)
