Tuesday, September 28, 2010

SAS 5/iR FAIL!

The Dell E2K-UCS-51 (B), simply known as UCS-51, is an inexpensive RAID 0/1 card for SAS drives. It had worked dutifully for a couple of years until our mission-critical CentOS 5.3 box froze with:
sd 0:1:0:0 rejecting I/O to offline device

I thought to myself that a hard drive I/O error would have said sda, not just sd. Surely the controller hadn't gone bad; I'd never had a RAID controller die on me. Upon reboot the kernel quickly blurted out:
mptbase: ioc0: ERROR - Diagnostic reset FAILED! (ffffffffh)
mptbase: ioc0: ERROR - didn't initialize properly (-1)
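For anyone sorting through similar kernel output, here is a minimal Python sketch that separates the two kinds of messages quoted above: the SCSI-layer "offline device" line (the numbers after `sd` are the host:channel:target:LUN address) and the `mptbase` controller errors. The function name `classify` and the category labels are made up for illustration, and the patterns cover only the exact messages shown in this post. It just buckets lines; it does not prove which component failed.

```python
import re

# Patterns based only on the messages quoted in this post (assumption:
# other kernels or drivers may word these differently).
SCSI_OFFLINE = re.compile(
    r"sd (\d+):(\d+):(\d+):(\d+):? rejecting I/O to offline device"
)
MPT_ERROR = re.compile(r"mptbase: (ioc\d+): ERROR - (.+)")


def classify(line):
    """Return a (kind, detail) tuple for one kernel log line.

    kind is one of the hypothetical labels "scsi-offline",
    "controller", or "other".
    """
    m = SCSI_OFFLINE.search(line)
    if m:
        # detail is the host:channel:target:lun address
        return ("scsi-offline", ":".join(m.groups()))
    m = MPT_ERROR.search(line)
    if m:
        # detail is "<ioc name>: <error text>"
        return ("controller", f"{m.group(1)}: {m.group(2)}")
    return ("other", line.strip())
```

Feeding it saved `dmesg` output line by line and counting the "controller" results would give a rough idea of whether the errors are coming from the card's driver or from a single disk behind it.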
Since this is our mission-critical server, my heart fell to the floor. On the next boot I pressed CTRL-C to go into the card's RAID configuration and received this:
I/O card parity interrupt at 41A7:A382
So now the card's own BIOS was crashing. Great, did it corrupt my RAID 1 array?! Frantically I purchased another card off eBay and drove there and back as quickly as possible. The new card worked. All I had to do was reactivate the array, which resynced the secondary hard drive automatically. CentOS booted fine.
Although I have backups, I did a lot of praying over the weekend because putting a backup box into production is a bit harder than just swapping out a card. The Lord Jesus brought me through it.
-eric wood

4 comments:

Hosting Nuggets said...

Hi Eric! I've got a Dell R410 with a SAS 6/iR and now get the same error message as in your post:
sd 4:1:0:0: rejecting I/O to offline device
and then the server simply crashes. Do you think it might also be the SAS card, as in your case?

Eric Wood said...

Sounds like it, since it is an 'sd' error. Can you go into the card's BIOS? For me, the BIOS crashed while inside the CTRL-C configuration, much like a Windows blue screen of death. You could have bad EEPROM chips. I really think my card suffered a blown capacitor, since I see it is cracked open a little bit on top.

Hosting Nuggets said...

I checked the BIOS and at least this looks fine; I could enter it without any trouble. I also physically checked the card itself and can't see any abnormalities on the circuit. Dell simply suggested that it's a problem with the partition related to the OS, but I don't really believe that. Anyway, thanks for your feedback ;-)

Eric Wood said...

Now on 12/11/2012 the replacement UCS-51 card (which I got off eBay) failed. I replaced it with another UCS-51 I had on standby, reactivated the array, and CentOS boots fine once more. So that's two of these cards that have failed me. I suspect aging capacitors since they look swollen.