ATTO R680 Firmware reset ?

Simon Blackledge
ATTO R680 Firmware reset ?
on Mar 13, 2015 at 4:21:59 pm

Hi all,
Have emailed ATTO support but thought someone here may have had a similar issue.

Just had an R680 do a firmware reset on the OS X server. It ejected the RAID, which then remounted OK.

It seems one disk's SMART Raw Read Error Rate was going up and down, dropping as low as 75 or so, but the threshold is 16 :-/

What I don’t understand is:

1. Why wasn't the disk just taken offline and a rebuild started? I have a hot spare.

2. The threshold is 16, so why would 75 trip the R680 into resetting the firmware?

I have Heartbeat enabled, but I don’t understand why a disk read error would trip it. It seems wrong that a read error would reset the firmware, meaning the disk is ejected :-/

Cheers

Simon



Bob Zelin
Re: ATTO R680 Firmware reset ?
on Mar 13, 2015 at 11:27:29 pm

Hi Simon !
hope to see other responses on this thread.

bob Zelin

Bob Zelin
Rescue 1, Inc.
bobzelin@icloud.com



Petros Kolyvas
Re: ATTO R680 Firmware reset ?
on Mar 14, 2015 at 1:26:57 am

This response is anything but definitive and almost exclusively anecdotal.

We had an R680 that was repeatedly suffering "firmware resets", and the cause turned out to be a dying drive; removing the affected disk completely resolved it.

Regarding your specific points:
1. I don't use auto-rebuild. I keep a spare spun up in each array, but I choose when rebuild occurs so I can't help there except to say (and this will actually transition to point 2) there was some interaction with OS X causing the resets.

2. SMART thresholds don't, AFAIK, trigger a disk to go offline; rather, they offer a baseline for where one should "take a very close look." In our case the disk was usable to some degree, just incredibly slow, and it would run into completely unreadable sectors, but it didn't fail outright. It seems OS X would cause the array to keep retrying a read or write operation that the failing disk kept disrupting, and the firmware reset would happen almost as if the R680 had decided to terminate the operation itself (though this is a wholly anthropomorphized version of events). Pulling the affected disk and rebuilding the array on demand solved the issue.
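
On the threshold question, a minimal sketch of how SMART thresholds are conventionally read may help. It assumes the figures quoted in this thread (75 and 16) are normalized attribute values, where higher is healthier and an attribute only counts as failed once the value falls to or below its threshold. The function name is made up for illustration, and nothing here describes what the R680 firmware actually does with SMART data.

```python
def smart_attribute_failed(normalized_value: int, threshold: int) -> bool:
    """Return True when a normalized SMART value has fallen to or below its threshold.

    Assumption: higher normalized values are healthier, which is the usual
    SMART convention. This is an illustration, not ATTO's logic.
    """
    return normalized_value <= threshold


# Figures from this thread: lowest observed value about 75, threshold 16.
print(smart_attribute_failed(75, 16))  # False: 75 is still well above 16
print(smart_attribute_failed(16, 16))  # True: the value has reached the threshold
```

Read this way, a value of 75 against a threshold of 16 is a disk that is degrading but has not yet crossed the line SMART itself would call a failure.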

Furthermore this is from ATTO's heartbeat documentation:
Heartbeat
Choices: enabled, disabled Default: enabled
When enabled, the RAID controller’s firmware is required to respond to periodic activity. If the firmware does not respond, the system driver resets the firmware on the controller.


So indeed, if the array caused an I/O issue such that the controller was stuck in a loop or could not complete an operation, the firmware would be reset.
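
The behaviour described in that documentation is essentially a watchdog. A rough sketch of the idea, with every name hypothetical (`ping`, `reset`, `heartbeat_watchdog` are placeholders, not ATTO driver calls), might look like this:

```python
import time
from typing import Callable


def heartbeat_watchdog(ping: Callable[[], bool],
                       reset: Callable[[], None],
                       interval_s: float,
                       checks: int) -> int:
    """Poll the firmware `checks` times and reset it whenever it fails to respond.

    `ping` returns True if the firmware answered; `reset` performs the reset.
    Both are stand-ins for whatever the real driver does. Returns the number
    of resets that were triggered.
    """
    resets = 0
    for _ in range(checks):
        if not ping():
            reset()
            resets += 1
        time.sleep(interval_s)
    return resets


# Simulated firmware that stops answering on the third check.
responses = iter([True, True, False, True])
resets = heartbeat_watchdog(lambda: next(responses),
                            lambda: print("resetting firmware"),
                            interval_s=0.0,
                            checks=4)
print(resets)  # 1
```

The point of the sketch is only that the reset is triggered by the firmware going quiet, not by the SMART value itself; a disk that stalls I/O long enough could plausibly cause that silence.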

I wish there was "working" or "failed" for drives but there's a grey area a mile wide between the two unfortunately.

Sorry I couldn't offer you more.

--
There is no intuitive interface, not even the nipple. It's all learned. - Bruce Ediger




Bob Zelin
Re: ATTO R680 Firmware reset ?
on Mar 14, 2015 at 2:28:55 pm

Hi Petros -
I enjoyed your reply. I too have seen about 15 systems experience firmware resets (with the ATTO R680), and I never had an explanation of why this happens.

I have a couple of questions for you.
1) do you disable heartbeat in the ATTO Configuration Tool, to prevent the firmware reset ?

2) when you say "take a very close look" - exactly what do you mean by this. In the ATTO Configuration Tool, all the drives appear to be ON LINE (as well as the raid group), and then all of a sudden, a drive will be marked offline, or the RAID Group will become degraded, yet no drives are indicated as "failed". The very amateur test that I do, when I see poor performance or strange behavior, is to run a simple diagnostic (like AJA System Test with the 16 Gig file), and watch the LED lights on the RAID to see if one of them is getting "stuck" while the others are flashing away. This sometimes will point to a failing drive, and I can remove that drive and install another. Sometimes I am lucky and the raid performance increases, and I avoid a disaster, and sometimes I am not so lucky, and the raid degrades. I wish there was a diagnostic that I could use (unless you can tell me how SMART can give you a hint that something bad is happening with a specific drive).

I appreciate your response, and I look forward to your reply.

bob Zelin

Bob Zelin
Rescue 1, Inc.
bobzelin@icloud.com



Rainer Wirth
Re: ATTO R680 Firmware reset ?
on Mar 15, 2015 at 12:52:56 pm

My experience with drives that don't show "fail" is that they are just about to fail and are showing the first signs of failure. Later they go red.
The AJA test is a nice tip, thank you Bob.
What does ATTO say about this?
They have great support, and the R680 is a good piece of hardware in my experience. Or not?

cheers

Rainer

factstory
Rainer Wirth
phone_0049-177-2156086
Mac pro 8core
Adobe,FCP,Avid
several raid systems



Simon Blackledge
Re: ATTO R680 Firmware reset ?
on Mar 17, 2015 at 10:57:01 am

Ok so I have disabled heartbeat. Not at ATTO's recommendation.

Maybe if it happens now it will take the whole raid offline - Who knows!

The system has been stable since 2011, same disks etc.

That's all I can conclude, as ATTO are unsure about the issue.

Some drives are showing degradation on some SMART reports, going up and down between 100 > 95 > 100,
mainly on 01: Raw Read Error Rate.

A few show drops on 08: Seek Time Performance, but they rise again.

So all I can assess is that the discs are old and starting to have issues.

None show bad blocks, reallocated sectors, etc.
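
Since these dips never reach the threshold, one way to keep an eye on them is to log the normalized values over time and flag any attribute that has dropped below its own best reading. A small sketch follows, assuming the readings are collected by hand or by some other tool. The function name is invented, the Raw Read Error Rate figures and its threshold of 16 come from this thread, and the second attribute and its threshold are made up purely for contrast.

```python
from typing import Dict, List


def dipping_attributes(history: Dict[str, List[int]],
                       thresholds: Dict[str, int]) -> Dict[str, int]:
    """Return {attribute: lowest value} for attributes that dipped below their
    best reading while still staying above the SMART threshold."""
    flagged = {}
    for name, values in history.items():
        lowest = min(values)
        if thresholds[name] < lowest < max(values):
            flagged[name] = lowest
    return flagged


history = {
    "Raw Read Error Rate": [100, 95, 100],  # readings quoted in this thread
    "Spin Up Time": [100, 100, 100],        # hypothetical stable attribute
}
thresholds = {
    "Raw Read Error Rate": 16,  # threshold quoted in this thread
    "Spin Up Time": 21,         # hypothetical threshold
}
print(dipping_attributes(history, thresholds))  # {'Raw Read Error Rate': 95}
```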

Have replaced the 2 disks that show SMART issues, so will report back if anything further happens.

For me the main issue is that ATTO only reports SMART to users. It doesn't seem to be used to take a disk offline.

So if a drive's having issues, having a HOT SPARE won't kick in a rebuild :-/

S





Bob Zelin
Re: ATTO R680 Firmware reset ?
on Mar 17, 2015 at 12:09:39 pm

please keep us informed as to what happens !

Bob Zelin

Bob Zelin
Rescue 1, Inc.
bobzelin@icloud.com



Petros Kolyvas
Re: ATTO R680 Firmware reset ?
on Mar 26, 2015 at 12:50:53 am
Last Edited By Petros Kolyvas on Mar 26, 2015 at 12:52:30 am

Hi Bob,

So so sorry for the late reply. Life has been crazy.



[Bob Zelin] "1) do you disable heartbeat in the ATTO Configuration Tool, to prevent the firmware reset ? "


I do not. In fact, in my limited experience it was heartbeat's firmware resets that prevented the array with the not-quite-failed-disk from taking down OS X's entire I/O subsystem. So instead of an apparent system hang, the firmware reset would release the system for a more controlled/managed crash landing.

Again though, my experience remains anecdotal.

[Bob Zelin] "2) when you say "take a very close look" - exactly what do you mean by this. In the ATTO Configuration Tool, all the drives appear to be ON LINE (as well as the raid group), and then all of a sudden, a drive will be marked offline, or the RAID Group will become degraded, yet no drives are indicated as "failed". The very amateur test that I do, when I see poor performance or strange behavior, is to run a simple diagnostic (like AJA System Test with the 16 Gig file), and watch the LED lights on the RAID to see if one of them is getting "stuck" while the others are flashing away. This sometimes will point to a failing drive, and I can remove that drive and install another. Sometimes I am lucky and the raid performance increases, and I avoid a disaster, and sometimes I am not so lucky, and the raid degrades. I wish there was a diagnostic that I could use (unless you can tell me how SMART can give you a hint that something bad is happening with a specific drive). "

Yes - they do appear to be ONLINE. And I fully agree with you, there's this grey zone where it's hard to figure out what's going on.

I can only tell you what I did, and maybe it's something you can add to your toolkit: I used the OS X system Console. I filter for ATTO firmware log entries (you can often do so by filtering on the model number of the HBA/RAID adapter and/or the manufacturer/vendor).

In the case of the system where I experienced this, filtering by R680 listed all the disk activity, and it would indicate which disk was pulling the volume offline. SAS firmware will often report physical disk activity and SMART changes directly to the console output. Each line is also tied to the disk's serial number, so as long as you have a "recall" sheet or chart of which disk is which, you can correlate the SMART issue or disk firmware feedback with the actual disk. For example, in the case I spoke of, the drive's firmware would time out, and then a short while later heartbeat would kick in and the controller's firmware would reset. You could watch it all happen in the console.
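
If the log has been saved to a text file, the same filter-and-correlate step can be roughed out in a few lines. This is only a sketch under assumptions: the sample lines and serial numbers below are invented and do not reproduce real ATTO or OS X log output, and the function name is hypothetical. It simply keeps lines that mention a keyword and counts how often each serial number from the recall sheet appears in them.

```python
from collections import Counter
from typing import Dict, Iterable


def count_disk_mentions(lines: Iterable[str],
                        keyword: str,
                        serial_to_bay: Dict[str, str]) -> Counter:
    """Count, per bay, the log lines that contain `keyword` and a known serial."""
    counts: Counter = Counter()
    for line in lines:
        if keyword.lower() not in line.lower():
            continue  # not an entry from the adapter we care about
        for serial, bay in serial_to_bay.items():
            if serial in line:
                counts[bay] += 1
    return counts


# Invented log lines, for illustration only.
sample_lines = [
    "kernel: R680 disk SERIAL-A timeout",
    "kernel: R680 disk SERIAL-B ok",
    "kernel: R680 disk SERIAL-A timeout",
    "kernel: unrelated message SERIAL-A",
]
# The "recall" sheet: which serial number sits in which bay.
recall_sheet = {"SERIAL-A": "bay 3", "SERIAL-B": "bay 5"}

print(count_disk_mentions(sample_lines, "R680", recall_sheet).most_common())
# [('bay 3', 2), ('bay 5', 1)]
```

A bay that shows up far more often than the others is a reasonable first suspect, though the count alone says nothing about what kind of events those lines record.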

I've included a sample output here with markup to help/clarify, but I fear I'm often obtuse with these things, especially since some of this was not entirely fresh in my memory, so if you can stand the wait, I'm happy to work through any more issues - I don't pretend to have anywhere near a complete understanding of this and am happy to pool whatever knowledge I can.



--
There is no intuitive interface, not even the nipple. It's all learned. - Bruce Ediger



Petros Kolyvas
Re: ATTO R680 Firmware reset ?
on Mar 26, 2015 at 12:57:43 am

One other thing, for Windows arrays this information is probably going to the Event Log, but don't quote me on that.

--
There is no intuitive interface, not even the nipple. It's all learned. - Bruce Ediger



© 2017 CreativeCOW.net All Rights Reserved