By Alasdair Lumsden on 14 Jun 2010
Update 2010-07-01: Sun got back to one of the blog commenters regarding the issue with Broadcom NICs dropping out on HP servers. Sun stated that the issue relates to the HP-supplied Broadcom drivers, and Sun recommended using these. So HP people may be seeing a different issue. Please see this blog comment for details. Many thanks for passing this information on, Daniel!
BREAKING NEWS – 2010-06-25 11:30 BST (GMT+1): I’ve just spoken with a chap called mui on #opensolaris on irc.freenode.net who reports that this issue relates to “C States”. Disabling “C States” in the BIOS (it’s in “Processor Settings” on Dell boxes) will supposedly work around the issue. C States support was added in Solaris 10 update 8, so this is probably why our Solaris 10 update 7 boxes are unaffected.
Supposedly Sun/Oracle have an internal patch for Solaris 10 that they can supply if you have a support contract. If you’re on OpenSolaris, mui has made this package available that works with snv_134. DISCLAIMER: Please test this prior to putting it into production, as it’s provided with no warranty. Alternatively, you might be able to grab the latest 6.0.1 BNX driver from the on-closed-bins.i386.tar.bz2 package on the OpenSolaris website.
Here’s the rest of the (now somewhat out of date) post…
We’ve encountered this bug quite a few times and up until I found these bug reports, we weren’t sure what was causing the issue:
The symptoms are basically that the server loses network connectivity – traffic just stalls. Because this keeps happening on production boxes, we have to reboot pretty damn quickly, so we haven’t had an opportunity to diagnose the issue in detail. We tried a number of fixes to no avail, and I was at my wits’ end until I encountered the above bug report.
Our servers are Dell R410 machines and we’ve seen this happening on Dell R710 machines as well, with Solaris 10 update 8. We’re running with the latest Solaris 10 patches and the latest Broadcom drivers from the Broadcom website (5.2.2). I believe we’ve seen this issue with the stock drivers shipped with Solaris 10 update 8 as well.
From the bug reports, the issue seems related to the firmware running on the cards – version 5* is affected, version 4* isn’t. I believe the firmware is tied to the Dell BIOS running on the machine. Here’s the output from one of our affected boxes:
    # prtdiag | head -n 2
    System Configuration: Dell Inc. PowerEdge R410
    BIOS Configuration: Dell Inc. 1.3.9 04/07/2010
    # grep -i BCM /var/adm/mes*
    /var/adm/messages:Jun 12 03:21:38 bnx: [ID 995108 kern.info] NOTICE: bnx0: BCM5709 device with F/W Ver500000b is initialized.
    /var/adm/messages:Jun 12 03:21:38 bnx: [ID 995108 kern.info] NOTICE: bnx1: BCM5709 device with F/W Ver500000b is initialized.
Here is the output from a machine that’s not affected:
    # prtdiag | head -n 2
    System Configuration: Dell Inc. PowerEdge R410
    BIOS Configuration: Dell Inc. 1.1.5 07/29/2009
    # grep BCM /var/adm/messages*
    /var/adm/messages.2:May 27 15:11:43 bnx: [ID 995108 kern.info] NOTICE: bnx1: BCM5709 device with F/W Ver4060004 is initialized.
    /var/adm/messages.2:May 27 15:11:43 bnx: [ID 995108 kern.info] NOTICE: bnx0: BCM5709 device with F/W Ver4060004 is initialized.
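If you have a lot of boxes to check, the comparison above can be scripted. What follows is only a rough sketch that applies the 5*/4* rule of thumb from the bug reports to bnx log lines in the format shown above. The function names (fw_version, fw_status, report) are made up for illustration, and it assumes a POSIX-style shell such as ksh or bash rather than the legacy Solaris /bin/sh. It says nothing about whether a given box will actually drop off the network.

```shell
#!/bin/sh
# Sketch only: classify bnx firmware versions found in syslog-style lines.
# fw_version, fw_status and report are hypothetical helper names.

# Print the firmware version string (e.g. 500000b) from one bnx NOTICE line.
fw_version() {
    printf '%s\n' "$1" |
        sed -n 's/.*F\/W Ver\([0-9A-Za-z]*\) is initialized.*/\1/p'
}

# Apply the rule of thumb from the bug reports: 5* affected, 4* not.
fw_status() {
    case "$1" in
        5*) echo affected ;;
        4*) echo unaffected ;;
        *)  echo unknown ;;
    esac
}

# Read log lines on stdin; print "interface version status" per match.
report() {
    while IFS= read -r line; do
        ver=$(fw_version "$line")
        [ -n "$ver" ] || continue   # skip lines without a firmware notice
        nic=$(printf '%s\n' "$line" |
            sed -n 's/.*NOTICE: \(bnx[0-9]*\):.*/\1/p')
        echo "$nic $ver $(fw_status "$ver")"
    done
}
```

With the functions loaded into your shell, something like `grep BCM /var/adm/messages* | report` should print one line per interface initialisation found in the logs.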
My understanding is that the fix is to downgrade the BIOS of the machine to a previous release that uses a 4* Broadcom Firmware release. We haven’t yet tested this but should be able to later this week. So far it doesn’t look like Sun/Oracle have released a publicly available patch to address the issue.
Update: 2010-06-25 – Upgrading/Downgrading the system BIOS makes no difference to the Broadcom FW (duh! silly me). I’ve written an updated post with more information here: https://everycity.co.uk/alasdair/2010/06/update-to-broadcom-nic-dropping-out-on-solaris-issue/