By Alasdair Lumsden on 26 Jun 2010
Update 2010-07-01: Sun got back to one of the blog commenters regarding the issue with Broadcom NICs dropping out on HP servers and stated the issue relates to the HP supplied Broadcom drivers, and Sun recommended using these. So HP people may be seeing a different issue. Please see this blog comment for details. Many thanks for passing this information on, Daniel!
As previously mentioned, we’ve been having a nightmare with Broadcom NICs suddenly dropping out / hanging / freezing. All network traffic ceases / halts, despite the interfaces being up and showing no signs of any issues. This issue started affecting us after rolling out an upgrade to Solaris 10 update 8, but it also affects recent OpenSolaris builds. This has been on Dell R410 servers and R710 servers, and we’ve heard about people on HP servers having the same issue.
We thankfully found a workaround for it, which basically consists of disabling C-States in the BIOS. This is a power saving feature and support for it was added into Solaris 10 update 8, which is where we’re seeing the issue.
However, prior to finding this workaround, I contacted Broadcom via the “Submit a support request” feature on their website. Nobody got back to me, and we were getting rather desperate, so I was rather naughty and dropped one of their kernel driver engineers a direct email. I won’t say who, as he probably doesn’t want others mailing him directly.
The chap replied promptly, which was very impressive. He was very polite and explained that he couldn’t really help customers directly, as the OEM suppliers get upset, but he did offer some hints and tips. He mentioned that MSI-X was causing issues on Linux and suggested disabling it if we were using v5.2.3 drivers or later. We’re not: we’re on 5.2.2, which is the newest release available on the Broadcom website, so that was quite interesting.
He attached the release notes for the 6.0.1 driver which isn’t publicly available yet. Here is a snippet of the contents:
Broadcom NetXtreme II Gigabit Ethernet Driver For Solaris 10 for i386 platform
Copyright (c) 2000-2010 Broadcom Corporation
All rights reserved.

Version 6.0.1 (21 May, 2010)
============================
Fixes
-----
1) Problem : default MTU now set to 1500, fixed jumboframe and vlan issues.
   Cause   : buffer sizes weren't being allocated properly to account for MAC header overhead w/ vlan tags
   Change  : allocations are now correct
2) Problem : when MSIX interrupt allocation failed driver fails to attach
   Cause   : code didn't exist to revert down to Fixed
   Change  : driver now reverts to Fixed when MSIX interrupt allocation fails

Version 5.2.3 (23 March, 2010)
==============================
Enhancements
------------
1) Change : Reworked interrupt code to no longer use deprecated Solaris interrupt APIs.
2) Change : Added support for MSI-X interrupts. MSI-X is now used by default and can be turned off via "disable_msix" inside bnx.conf. When MSI-X is disabled then Fixed level interrupts are used.
2) Change : Added a new "statistics" group to kstat which contains driver version and interrupt information.

Version 5.2.2 (14 December, 2009)
=================================
Fixes
-----
1) Problem : Kernel Panic in the send routine: assertion failed: umpacket->mp == NULL,
   Cause   : The umpacket->mp was not scrubbed properly because the umpacket never went through the bnx_xmit_ring_reclaim() function.
   Change  : After recycling the packet in the TX routine, the packet is now reclaimed before it is being used.
The 6.x driver for Solaris 10 should hopefully be available later this year. The one that’s in OpenSolaris unfortunately can’t be used with Solaris 10 due to network stack differences.
But the interesting thing is that there *is* a newer 5.2.3 driver out there, which came out in March this year. So I had a google, and it looks like this driver has been supplied to OEMs but still isn’t available from Broadcom directly. So I downloaded an IBM driver ISO image that contains this newer driver, and it installs fine. We’re going to be using this in conjunction with disabling C-States, and I’ll report back on how that combination is going.
After discovering the C-States workaround for the NIC dropouts I mailed the Broadcom guy again to let him know, and stated we’d be disabling C-States to see if it fixes the issue. He replied with:
Please let me know if this works for you so that I can pass it on to our Solaris developers. I checked with them to see if this was a known issue and they replied that they had been trying to duplicate the problem but had not been successful to date. When performance testing we often disable certain CPU features in order to maximize Ethernet throughput so it may be that the system BIOS settings are the key difference here.
So this is very encouraging – hopefully this tip will enable the Broadcom Solaris engineers to reproduce the issue and fix it.
One final thing: to keep all our servers identical, in addition to flashing the system BIOS, DRAC firmware and LSI/SAS6i firmware, we’ve now started upgrading the firmware on all the Broadcom NICs too.
This is easier said than done. My method involved producing a 2.88MB DOS boot image with the appropriate files, taken from various places. I nabbed the latest Dell Broadcom NIC firmware Linux package to get the firmware files. I then pinched the DOS uxdiag.exe tool from the Broadcom diagnostics ISO to do the upgrades. I then produced a .bat file which runs:
uxdiag -c 1 -t abcd -F -fbc bc09x50b.bin
uxdiag -c 2 -t abcd -F -fbc bc09x50b.bin
uxdiag -c 1 -t abcd -F -fncsi ncsifw_x.205
uxdiag -c 2 -t abcd -F -fncsi ncsifw_x.205
uxdiag -c 1 -t abcd -F -fib_ipv4n6 ib6btv41.06
uxdiag -c 2 -t abcd -F -fib_ipv4n6 ib6btv41.06
uxdiag -c 1 -t abcd -F -fmba bxmba508.nic
uxdiag -c 2 -t abcd -F -fmba bxmba508.nic
uxdiag -c 1 -t abcd -mfw 0
uxdiag -c 2 -t abcd -mfw 0
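The batch file above is the same handful of commands repeated for each NIC port, so if your servers have a different number of ports or a different set of firmware files, it may be easier to generate the list than to type it. Here is a rough Python sketch of that idea. The uxdiag options and filenames are copied straight from the commands above; the function names (build_commands, write_batch) and the output filename are made up for illustration, and I'm assuming each firmware image gets flashed to every port in turn, as in my batch file.

```python
# NIC port indices passed to uxdiag's -c option (two ports, as in the batch file above).
PORTS = [1, 2]

# (uxdiag option, firmware file) pairs, copied from the commands above.
FIRMWARE = [
    ("-fbc", "bc09x50b.bin"),
    ("-fncsi", "ncsifw_x.205"),
    ("-fib_ipv4n6", "ib6btv41.06"),
    ("-fmba", "bxmba508.nic"),
]


def build_commands(ports=PORTS, firmware=FIRMWARE):
    """Return the uxdiag command lines in the same order as the batch file."""
    lines = []
    # Flash each firmware image to every port before moving to the next image.
    for option, image in firmware:
        for port in ports:
            lines.append(f"uxdiag -c {port} -t abcd -F {option} {image}")
    # Finish with the "-mfw 0" step on every port, as the original batch file does.
    for port in ports:
        lines.append(f"uxdiag -c {port} -t abcd -mfw 0")
    return lines


def write_batch(path, lines):
    """Write the commands to a .bat file with DOS (CRLF) line endings."""
    with open(path, "w", newline="\r\n") as fh:
        for line in lines:
            fh.write(line + "\n")


if __name__ == "__main__":
    # Print rather than write, so nothing is created on disk by accident.
    for command in build_commands():
        print(command)
```

To produce the actual file, something like write_batch("update.bat", build_commands()) would do, where update.bat is just a placeholder name. Check the generated commands against the uxdiag documentation for your NIC before flashing anything.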
What a lot of faffing about. You’d think Dell would make this stuff easier to do. Anyway, if you’re interested, please feel free to download my Broadcom DOS Firmware update disk image.