RAID 6, With or Without Hot Spare
Being paranoid (inexperienced and uneducated), I plan on setting up my Pegasus2 R8 as RAID 6.
My question is whether or not I should also designate a hot spare drive in the array.
If I have to make this decision on my own (meaning me and the internet), I'd say yes. Designate a hot spare.
That would be the safe way to go.
Not necessarily. Having a cold spare on a shelf is more efficient if you don't run your array 24/7 and assuming you'll know immediately that a drive failed.
Hot spare means that (a) it's just sitting there doing nothing until a drive fails, (b) your capacity and efficiency are down by 1/8th, something to consider given that RAID6 already takes away 2/8ths of the capacity, and (c) a drive failure triggers an auto-rebuild that may bring performance down quite a bit, possibly for more than a day.
I.e. hot spares make more sense on larger arrays with RAID6.
My personal preference in the event a drive is marked as failed on smaller arrays (less than 16 spindles) is to check that the drive indeed failed rather than just "marked as failed" because of a timeout. Quite often the drive is actually healthy, just hiccupped, and can be marked as "online" w/o side effects. (A rebuild may still be necessary.)
I also prefer to start and monitor a rebuild manually on smaller arrays with non-24/7 duty. Auto-rebuild in the middle of a project may not be a good idea.
If it's uptime and data protection you're looking for: RAID6 with a cold spare or two sitting on a shelf, and backups.
-- Alex Gerulaitis | Systems Engineer | DV411 - Los Angeles, CA
What's this for? Video production work or as a backup? What drives?
The raid6 read-modify-write (RMW) penalty is significantly worse than raid5, plus the hot spare means you're at best 5x read speeds, which depending on the drives will vary a lot unless they are short stroked. Once even one disk has failed, I'm very skeptical that degraded mode performance will supply what you need to actively use the array while the rebuild is occurring. So I'm not sure what the extra drive redundancy gets you. In any case there should be sufficient regular backups that you shouldn't need to depend on raid6. If you're going to forgo regular backups and think raid6 bridges that gap, I think that's a mistake.
As for hot spare, forget it. It costs you both performance and capacity and isn't worth it. Before raid6 I'd consider raid10. Before raid6+hotspare I'd consider raid15. But before all of that, make sure you're doing regular scrubs on the array, use drives that have proper error timeouts so their errors are actually corrected during scrubs, have one or two same model drives on hand and properly replace them when the time comes. If you don't do those things you can still get bit in the ass with raid6 and a hot spare. And if you don't need the extra capacity this layout implies, get an R6.
[Chris Murphy] "plus the hot spare means you're at best 5x read speeds"
You sure it's not 7x Chris?
(I understand real-world numbers support your thesis, yet theoretically 6 has the same read performance as 0. Perhaps it's the implementation that brings the performance down, not RAID level.)
Yes, because eight drives in raid6 with a hot spare means the number of stripe members (the ones we get data chunks from during full stripe reads) is five. Parity chunks do not count, and that's 2 "drives" worth. And the 3rd is the unused hot spare.
Makes sense - thank you.
So yes you're right raid5 and raid6 and raid0 have no read penalty, but to compute the performance factor you only count data drives. A five drive raid0 has five data drives so that's 5x reads. A six drive raid5 also has five data drives (yes it's distributed parity but for any full stripe read one of the drives produces a parity chunk which is not read during normal operation). And a 7 drive raid6 likewise has five data drives. So for reads, those raids should perform the same all other things being equal. They each have equal number of data drives, but unequal total drives.
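To put numbers on that rule of thumb, here is a minimal Python sketch. The function name data_drives and the list of layouts are made up for illustration, and it only counts data-bearing drives in a full-stripe read; it says nothing about what a real controller actually delivers.

```python
def data_drives(level: str, total_drives: int, hot_spares: int = 0) -> int:
    """Drives that supply data chunks during a full-stripe read."""
    # Parity "drives" per stripe: distributed, but one (raid5) or two (raid6)
    # chunks in every stripe hold parity and are skipped on normal reads.
    parity = {"raid0": 0, "raid5": 1, "raid6": 2}[level]
    return total_drives - hot_spares - parity

layouts = [
    ("raid0", 5, 0),
    ("raid5", 6, 0),
    ("raid6", 7, 0),
    ("raid6", 8, 1),  # the R8 as raid6 plus one hot spare
]
for level, total, spares in layouts:
    factor = data_drives(level, total, spares)
    print(f"{level}, {total} drives, {spares} spare(s): {factor}x reads")
```

All four layouts come out at five data drives, which is the point: equal data drives, unequal total drives.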
It sounds like you're suggesting RAID 5, no hot spare. Correct?
In my limited RAID experiences, I've never had a hot spare. But I thought a hot spare would be better than a cold spare. Aren't HDDs happier when they're spinning? They're designed to spin, not sit, right?
Although, I think I understand the reasoning behind not having a hot spare, since you can manually control when and if the array is rebuilt. (as Alex points out)
Still, wouldn't it be better for the drive to be sitting in the array, always up and running? I could still use that single HDD for other purposes. (maybe cloned backups of my boot drive?) If a drive fails, shouldn't I be able to turn that extra HDD into the replacement HDD?
Wouldn't RAID 5 (7 drives) plus scratch drive (to be used as spare if/when needed) be kind of like RAID 6, but with a benefit of using the non-Array drive for other purposes until it's needed to rebuild. Plus, I'd be running at RAID 5 performance the entire time.
Also, why RAID 10 over RAID 6? RAID 0+1 or 1+0?
My understanding is that RAID 10 can recover from 1 drive failure per span. With an 8 drive array, that would be two spans, so it could recover if you lost one drive from each, a two-drive failure. But I guess you could lose more drives on one span and still recover, assuming you lost no drives on the other span. Correct?
Is RAID 10 faster (write) than the overhead required for RAID 6, with 8 drives and no spares?
Raid 6 is far more reliable than raid 5 because the controller has 2 levels of parity to verify against. The problem with current drives failing over time is that parity information can corrupt on 1 drive over time, causing considerable damage to the raid integrity. It is not uncommon to lose a raid 5 array to corruption because of this. However raid 6, having that extra level of parity, protects against that provided you have the parity verification scheduled periodically. It is extremely rare to have 2 discs corrupting over time, especially in the same parity data regions. I have never seen a raid 6 completely unravel with NTFS as the file system. The worst I have seen on a raid 6 is a few files corrupted at most. Also the performance difference between an 8 drive raid 5 and an 8 drive raid 6 is 720MB/s for the raid 5 versus 650MB/s for the raid 6. Not nearly enough to justify the greater chance of failure.
Raid 10 is block level mirroring and far more expensive than Raid 6 in amount of disks used. You can only verify the data of 1 mirror to another versus 2 levels of parity so verification is not nearly as good as Raid 6. The performance on 8 drive raid 10 would still be around the 8 drive raid 6 so there is no gain there. I do not suggest raid 10 unless dealing with massive disk arrays where failure rate is much larger over all and needs to be reduced to mirrored partner level percentages.
If silent data corruption is a real problem needing mitigation, then we can't consider non-checksumming parity raid6 qualified to deal with that. That realm is for ZFS, Btrfs, ReFS, and PI.
Parity isn't a checksum, so any disagreement between a data and parity chunk, even when two parity chunks agree in a mismatch with a data chunk, is still ambiguous. Hence the "write hole" applies to raid6 every bit as much as to raid5 and raid1. It's an assumption to defer to two agreeing Q and P chunks against a data chunk, and in fact this is a wrong strategy because this very thing can happen in a power failure where data chunks were correctly written but parity chunks were not, they have their old values and therefore still agree.
Further, in normal operation, parity chunks aren't even consulted. So the system doesn't know about any mismatches. That's why regular scrubs are important. Most cases of raid5 total collapse despite only a single drive failure are due to wrong setups. The wrong drives were spec'd, scrubs weren't scheduled, the drive and controller timeouts weren't set correctly. Bad sectors end up not being fixed. A drive fails, rebuild commences, and a bad sector is encountered, but since we're degraded there's no parity to rebuild from and the array collapses even though there's only been a single drive failure. So the usual way around this is to go with raid6 rather than fix the underlying sources of the problem.
Now, there's no question that drive sizes are growing more quickly than drive performance. Therefore rebuild times are going up a ton and that's why it's sane to recommend raid6, because a 2nd drive could die during rebuild. But corruption mitigation isn't the use case. Raid is intended to defer the checksumming/corruption mitigation to the hardware, the drive does actually write checksums to each sector, and its ECC is designed to detect and correct problems. If it can't, it should report a read error and then the raid can do something about that in normal operation.
Many of those factors described why Raid 5 fails are not controllable by the client even with the Raid Controller management. Raid 6 is able to repair the corrupt parity data which I have seen done whereas raid 5 has not been able to for medium to severe occurrences. The overall results have been near foolproof reliability for years on every raid 6 I have dealt with. I can't say how many Raid 5's I have seen unravel on both OSX and Windows since I lost count years ago. The results speak for themselves. An Intel/LSI Raid Controller takes 5 to 6 hours to rebuild a 16TB 8 drive array. So even 32TB won't take more than 10 to 12 hours on 1 of those arrays. That is perfectly acceptable to rely on 1 level of parity until completed.
Haha, you know, I find the proper math is often surprisingly helpful!
4000GB @ 130MB/s does indeed translate into ~8.5 hours to fully write as a single block device if its sequential write performance is maximized. And raid1/10 (and 0+1) can do that. Parity raid will be slower, but how much slower greatly depends on the controller, and some even have settings that affect the trade off between array responsiveness while degraded vs rebuild performance. It's probably within +20% for a 1x raid6 failure. Uncertain how much slower 2x failures will be.
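For anyone who wants to redo that arithmetic with their own drives, here is a small Python sketch. The function name full_write_hours is made up, decimal units (1 GB = 1000 MB) are assumed, and the 1.2 multiplier is just the "+20%" guess from above rather than a measured figure.

```python
def full_write_hours(capacity_gb: float, write_mb_per_s: float) -> float:
    """Hours to write an entire drive once at a sustained sequential rate."""
    seconds = capacity_gb * 1000 / write_mb_per_s  # decimal GB -> MB
    return seconds / 3600

# Best case: a mirror rebuild running at the drive's sustained write speed.
mirror_hours = full_write_hours(4000, 130)

# Guess for a 1x-failure raid6 rebuild, using the +20% estimate as-is.
parity_hours_guess = mirror_hours * 1.2

print(f"mirror rebuild:       ~{mirror_hours:.1f} h")
print(f"parity rebuild guess: ~{parity_hours_guess:.1f} h")
```

Swap in the sustained write speed from your own drive's spec sheet; the controller's rebuild priority setting will move the parity number further.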
I think Kevin is fine choosing raid6, but still needs to look at the 1x fail raid6 performance, and if he can tolerate that performance level for ~10-12 hours. It's also worth looking at the 2x fail performance as well just to be aware of how long it's going to take. I still think he can skip the hot spare.
EricBowen:Many of those factors described why Raid 5 fails are not controllable by the client even with the Raid Controller management.
Sure, there are products that neither set things up correctly, nor expose the settings so that the user might have a chance of doing so themselves. But papering over these mistakes with raid6 isn't correct either. While it's reasonable to say, only the bottom line matters, it doesn't change the fact that wrong raid5 and raid6 configurations can fail for identical reasons - reasons that neither raid5 nor raid6 were designed to mitigate.
EricBowen:Raid 6 is able to repair the corrupt parity data which I have seen done whereas raid 5 has not been able to for medium to severe occurrences.
How is the parity corrupted? What enables raid6 to either detect or correct it? And is this raid detectable corruption also detected by the drive ECC?
The raid6 I'm referring to is the commonly available, non-checksumming, P+Q parity based on Galois field algebra. There are no checksums for data or parity chunks. There's no way for this implementation to directly detect or correct corruption in data or parity chunks, nor is it designed to. It defers to drive and controller ECC, which do use checksumming, and it's the drive ECC that detects and corrects corruption. This kind of raid6 cannot detect or correct for silent data corruption. During normal operation, only data chunks are read, and so long as the drive doesn't report a read error, the raid doesn't question the veracity of the data. If the drive reports a read error due to ECC detection of error but inability to correct it, then the affected data chunk is reconstructed from parity, and the data propagates up to the application layer and is also written back to the device that previously reported the read error. The overwrite of the affected sector(s) fixes the problem, either by successful overwrite of the physical sector or the drive firmware remaps to a reserve sector if there's a persistent write failure for that sector.
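As a rough illustration of the read-error path described above, here is a single-parity sketch in Python (XOR only, i.e. just the P chunk). The chunk contents and the helper name xor_chunks are made up, and a real raid6 also carries a Q chunk computed with Galois field arithmetic, which this leaves out. What it is meant to show is that reconstruction works because the drive's read error tells the raid which chunk is missing.

```python
from functools import reduce

def xor_chunks(chunks):
    """Byte-wise XOR across equal-length chunks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*chunks))

# One stripe: five data chunks plus a P parity chunk.
data = [b"AAAA", b"BBBB", b"CCCC", b"DDDD", b"EEEE"]
p_chunk = xor_chunks(data)

# The drive holding chunk 2 reports an unrecoverable read error, so the
# position of the missing chunk is known; nothing has to be guessed.
failed_index = 2
survivors = [c for i, c in enumerate(data) if i != failed_index]

# XOR of the surviving data chunks and P yields the missing chunk, which
# would then be handed to the application and written back to the bad LBA.
rebuilt = xor_chunks(survivors + [p_chunk])
print(rebuilt == data[failed_index])
```

If the drive had returned wrong bytes without reporting an error, none of this would run, because nothing would have told the raid to look at parity in the first place.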
There are obviously more data chunks than there are parity chunks. If either Q or P parity chunks are being corrupted somehow, then absolutely data chunks are being corrupted. And again in normal operation, parity chunks aren't consulted. If the drive doesn't report a read error, corrupted data chunks propagate to the application layer undetected and uncorrected.
So why is it that raid5 instances are having so many unravelings? Because they're configured wrong. If they're configured correctly, the incidence of bogus ejection of drives as faulty for being unresponsive goes to essentially zero. Bad sectors are corrected on the fly, and also during normally scheduled scrubs. Yes, of course, there still could be two legitimate drive failures at the same time, and mitigating that possibility is why we have raid6. But dual drive failures are still rare in the raid sizes discussed, compared to single drive failure with a subsequent bad sector causing a 2nd disk ejection and hence array collapse.
The most likely cause, to me, is that the parity is corrupted by bad blocks on the drive in those areas. As to how the parity information is corrected you would have to ask Intel or LSI. I just watch the logs that report from the web management consoles or error logs. If it says fixed then I assume it means it was fixed. Either way the raid integrity was maintained with the raid 6 in those cases whereas the raid 5's would fall apart in some cases.
I am not misconfiguring raid 5's when I create them nor am I failing to schedule parity checks. There is no magic formula or settings when creating a raid 5 volume. There is just the Console settings and initialization. None of the console settings other than Full initialization are going to affect the reliability of the raid 5 volume. I am talking raid 5's with enterprise drives only which include the timeout recovery option in the firmware. They are still unraveling and until I see greater reliability to rebuild with corruption or handle corruption than raid 6 I won't suggest it over 6.
EricBowen: The most likely cause, to me, is that the parity is corrupted by bad blocks on the drive in those areas.
Why only parity? Drives know nothing about RAID, they won't discriminate between data and parity chunks. Bad blocks have a much greater chance of corrupting data chunks, simply because there are more of them. If you're really seeing parity chunks corrupted more often than the ratio of parity to data disks, that sounds like raid firmware bugs to me.
Bad blocks that are not detected or corrected by drive ECC are quite rare. When it happens, it's not something parity raid knows about in normal operation. The usual case is the drive's ECC detects error and corrects it without informing the controller; another possibility is detection without correction while informing the controller with a read error. That read error includes the LBA of the bad sector so the controller knows what data needs to be reconstructed from parity, and then it sends a copy up to the application layer as if nothing has happened, and causes a copy to be written to that same (bad) LBA. Then it's up to the drive firmware to determine if merely overwriting the sector fixes the problem, or if it's a persistent write failure it will remove that physical sector from use by dereferencing, and the LBA and data get assigned to a reserve sector. Once that happens, the old sector isn't accessible with general purpose commands (it has no LBA).
Anyway, nothing else in the storage stack discriminates between data and parity. So if you mean to indicate a high instance of parity corruption (either single or dual parities) compared to data corruption, that sounds like firmware problems. And to the contrary, it's not unheard of.
An Analysis of Data Corruption in the Storage Stack
EricBowen: As to how the parity information is corrected you would have to ask Intel or LSI. I just watch the logs that report from the web management consoles or error logs. If it says fixed then I assume it means it was fixed.
It may very well be there are proprietary implementations that the manufacturer's won't talk about.
EricBowen: I am not misconfiguring raid 5's when I create them nor am I failing to schedule parity checks.
I take your word for it. But then, well before raid5 unravels, scrubbing would reveal mismatches. Mismatches aren't normal or OK. In small amounts, it can represent silent data corruption, which again raid6 doesn't mitigate. Anything more than this indicates a problem.
EricBowen: I am talking raid 5's with enterprise drives only which include the timeout recovery option in the firmware. They are still unraveling and until I see greater reliability to rebuild with corruption or handle corruption than raid 6 I won't suggest it over 6.
OK but again raid6 isn't meant to handle corruption, it's meant to handle two drive failures. Corruption≠failure. There is an expected redundancy improvement in raid6 over raid5 in the use case it's designed for, which is protection from an additional drive failure while still partially degraded from a one disk failure. Here's a presentation from NetApp.
Parity Lost and Parity Regained
Corruption occurs outside of drives a significant percent of the time, before it even gets to the raid controller, which will promptly write corrupt data and correct parities for that corrupt data to disk (short of additional corruptions).
Are Disks the Dominant Contributor for Storage Failures?
You've got enterprise drives: the mid-range enterprise SATA drives spec an order of magnitude fewer unrecoverable errors than consumer drives, and enterprise SAS yet another order of magnitude fewer. And on top of that regular scrubs, which should spot mismatches before they become problems. Yet you're reporting significantly higher raid5 implosions compared to raid6? This is unexpected but without logs, maybe even debug logs, it may remain obscure what the cause is.
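To get a feel for what an order of magnitude means during a degraded raid5 rebuild, here is a naive back-of-the-envelope model in Python. Everything numeric in it is an assumption, not something from this thread: the rates of one unrecoverable read error per 1e14, 1e15 and 1e16 bits read are typical datasheet ceilings for the three drive classes, the array is taken to be eight 4 TB drives, and errors are treated as independent. Datasheet figures are upper bounds, so real drives generally do better than this suggests.

```python
import math

def p_at_least_one_ure(bytes_read: float, bits_per_ure: float) -> float:
    """Chance of hitting one or more unrecoverable read errors (naive model)."""
    expected_errors = bytes_read * 8 / bits_per_ure
    return -math.expm1(-expected_errors)  # 1 - exp(-expected_errors)

# Assumed datasheet ceilings: bits read per unrecoverable error.
ASSUMED_SPECS = {
    "consumer SATA": 1e14,
    "enterprise SATA": 1e15,
    "enterprise SAS": 1e16,
}

# Degraded 8-drive raid5 of 4 TB drives: the rebuild reads all 7 survivors.
bytes_read = 7 * 4e12

for drive_class, bits_per_ure in ASSUMED_SPECS.items():
    p = p_at_least_one_ure(bytes_read, bits_per_ure)
    print(f"{drive_class:16s} {p:6.1%} chance of a read error during rebuild")
```

On a degraded raid5 any one of those read errors has no parity left to cover it, which is the collapse scenario described earlier; on raid6 with one failed drive the second parity still covers it.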
[Chris Murphy] "So why is it that raid5 instances are having so many unravelings? Because they're configured wrong. "
The Pegasus unit does have controller settings, but they don't offer much explanation about the settings, nor any recommendations. The options are:
SMART Log: On or off
SMART Polling Interval: 1 to 1440 minutes
Enable Coercion: On or off (I believe this is for mixed drive capacities)
Coercion Method: GBTruncate, 10GBTruncate, GrpRounding, TableRounding
Write Back Cache Flush Interval: 1 to 12 seconds
Enclosure Polling: 15 to 255 seconds.
Are these the kind of settings you're referring to? I believe these are the only controller settings.
I contacted Promise, but only talked briefly with sales. I will be trying to contact their tech support tomorrow to get some more information.
I guess I should be glad my Mac Pro won't get here until February, so I can spend the time fully understanding the Pegasus array and its options. As opposed to plugging it in and happily running away with a complete lack of RAID knowledge.
In case I forget, thanks Alex, Chris and Eric for all your comments and advice.
One drawback of sane UI is that by not showing esoteric settings, users have no way of knowing if they're being handled on their behalf or not. I don't see anything in here that could be set flat out wrong enough that it would explain corruption. The write back cache flush of up to 12 seconds could mean a rather spectacular amount of corruption if there were also a power loss during heavy write. However the user manual for the R6 says the write back cache is battery backed, so that ought to mean once power is reapplied, the contents of the cache are written to the drives on power-up. For there to be no data loss or corruption, the drive write caches also need to be disabled. That way anything sent to the drives is committed (in theory) and anything in the controller write cache is preserved until power is restored. This doesn't mean there will be zero corruption but it's significantly reduced.
[Chris Murphy] "If either Q or P parity chunks are being corrupted somehow, then absolutely data chunks are being corrupted. "
Not sure I get it. If we're not looking at a possibility of simultaneous and independent corruption of two data blocks within one stripe... If one parity chunk is corrupted, you could always re-create it from the other parity chunk, and the data?
Isn't 6 equivalent to a 3-way RAID1 for the purpose of data recovery? One copy is corrupted, you're not sure which one is good out of three - just check which ones match, assume those ones are good, discard the mismatching one?
Applying that to 6: calculate P and Q again from the data chunks, compare them to existing P and Q; whichever one mismatches - re-write it, and Bob's your uncle?
If the data chunks were corrupted, then P and Q would still be healthy and you could re-create data from them, supposedly?
Alex Gerulaitis: If we're not looking at a possibility of simultaneous and independent corruption of two data blocks within one stripe...
Right, this would be a problem because two corruptions in the same stripe can't be treated the same as two missing (read error, or failed drives). For failures, the exact affected chunks are known. For corruptions it has to be deduced and with two corruptions that's difficult at best, and in the category of specialized data recovery.
Alex Gerulaitis: If one parity chunk is corrupted, you could always re-create it from the other parity chunk, and the data?
It could just recompute P+Q and overwrite - no need for a reconstruct. But before that, how was the corruption determined and isolated? Standard in parity raid implementations is a parity verification (scrub check or read-only scrub) but all this does is report mismatches. That is, the new parities don't match existing. So that just says there's a problem.
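A read-only scrub of that kind can be sketched in a few lines of Python. This is single parity (XOR) only, with made-up chunk values and made-up helper names (xor_parity, scrub_check); it is not how any particular controller does it. It just shows that the result is a mismatch count and nothing more.

```python
from functools import reduce

def xor_parity(chunks):
    """Byte-wise XOR parity across equal-length data chunks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*chunks))

def scrub_check(stripes):
    """Count stripes whose stored parity differs from freshly computed parity."""
    return sum(1 for data, stored_p in stripes if xor_parity(data) != stored_p)

good = [b"\x01\x02", b"\x10\x20", b"\x0f\xf0"]
stripes = [
    # Stripe 0: consistent.
    (good, xor_parity(good)),
    # Stripe 1: one data byte changed silently, parity untouched.
    ([b"\x01\x03", b"\x10\x20", b"\x0f\xf0"], xor_parity(good)),
    # Stripe 2: data fine, parity chunk stale or corrupt.
    (good, b"\x00\x00"),
]

mismatches = scrub_check(stripes)
print(f"mismatch count: {mismatches}")
```

Stripes 1 and 2 look exactly the same to the scrub: each is one mismatch, with no indication of whether data or parity is the wrong one.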
There might be proprietary implementations that go to the effort of deducing whether it's D or P or Q that is corrupt, but for a one off corruption rather than whole disk? Seems doubtful, given that there are still raid6 corruptions in sufficient quantity that the industry has developed alternative solutions to avoid or mitigate them: T10 DIF/DIX (now called PI), ZFS, Btrfs, and ReFS.
Alex Gerulaitis: Isn't 6 equivalent to a 3-way RAID1 for the purpose of data recovery? One copy is corrupted, you're not sure which one is good out of three - just check which ones match, assume those ones are good, discard the mismatching one?
It's maybe worse with raid1 because the vote is an arbitrary decision. It seems like it makes sense to go with the majority, 2 vs 1. But in reality you'll get sufficiently arbitrary results that it doesn't really fix the problem.
Alex Gerulaitis: Applying that to 6: calculate P and Q again from the data chunks, compare them to existing P and Q; whichever one mismatches - re-write it, and Bob's your uncle? If the data chunks were corrupted, then P and Q would still be healthy and you could re-create data from them, supposedly?
Sure it's possible. I don't know any implementations that do this, but that proves nothing. It seems like a lot of additional code, testing and maintenance of that code, ensuing greater risk for bugs and additional corruptions, for a small use case that should only rarely be a problem with the kind of hardware we're talking about. Again, raid6 is about enabling recovery from specific known missing chunks, not deducing what's still present but maybe wrong. The use cases where some corruption is a big problem, there are better solutions for this than raid6.
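For the curious, the deduction Alex describes can be sketched with the standard P+Q algebra. The Python below is only an illustration, not a claim about what Intel, LSI, Promise or anyone else implements. It assumes the common GF(2^8) field with polynomial 0x11d and generator 2, works on a single byte per chunk to stay short, and the names (gmul, pq, locate) are made up. It also assumes at most one chunk in the stripe is wrong, which is exactly the assumption that can't be verified.

```python
# Exp/log tables for GF(2^8), polynomial 0x11d, generator 2 (assumed field).
EXP = [0] * 510
LOG = [0] * 256
_x = 1
for _i in range(255):
    EXP[_i] = _x
    LOG[_x] = _i
    _x <<= 1
    if _x & 0x100:
        _x ^= 0x11D
for _i in range(255, 510):
    EXP[_i] = EXP[_i - 255]

def gmul(a, b):
    """Multiply two field elements."""
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]

def pq(data):
    """P = XOR of the data bytes; Q = XOR of g^i * data[i]."""
    p = q = 0
    for i, d in enumerate(data):
        p ^= d
        q ^= gmul(EXP[i], d)
    return p, q

def locate(data, p_stored, q_stored):
    """Guess the one bad chunk, assuming at most one chunk is wrong."""
    p_new, q_new = pq(data)
    sp, sq = p_new ^ p_stored, q_new ^ q_stored
    if sp == 0 and sq == 0:
        return "clean"
    if sq == 0:
        return "P"  # only P disagrees
    if sp == 0:
        return "Q"  # only Q disagrees
    z = (LOG[sq] - LOG[sp]) % 255  # sq / sp = g^z for a bad data chunk z
    return f"D{z}" if z < len(data) else "more than one chunk"

data = [0x11, 0x22, 0x33, 0x44, 0x55]
p, q = pq(data)
print(locate(data, p, q))                  # consistent stripe
print(locate(data, p ^ 0x01, q))           # P chunk flipped
bad = list(data)
bad[3] ^= 0x5A
print(locate(bad, p, q))                   # data chunk 3 flipped
```

Note that the last case is indistinguishable from the power-failure scenario mentioned earlier: if chunk 3 was correctly rewritten but P and Q kept their old values, this logic would blame the new, correct data and "repair" it back to the old contents.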
[Chris Murphy] "scrubs weren't scheduled"
I'm not sure what a scrub is. Both the definition and how that translates to the features of a Pegasus array. The background activities that are available for the array are:
Media Patrol: Default on. This appears to check media, not data. If a media error is encountered, it triggers PDM.
Redundancy Check: Ensures all data matches exactly. Settings for Low, Medium and High. Balancing system resources vs. r/w operations. Options to Auto Fix or Pause on Error.
Synchronization: Recalculates redundancy data on physical drives to ensure sync. Settings for Low, Medium and High.
[Chris Murphy] "drive and controller timeouts weren't set correctly"
I'm not sure this is even a setting for the Pegasus unit. (unless I missed something)
Media Patrol requires translation, it seems like a marketing term to me. I'll guess it's either a selective or extended self-test via SMART. Default On means it's happening, but how often and what time of day? If it's a SMART test, it slightly reduces drive performance so it might be something you'd rather want done on demand if the schedule can't be specified.
Redundancy check sounds like a scrub/verify. I'm not sure what the options are, I'd rather just have a mismatch count, rather than it either stopping or fixing. I also don't know what autofix means exactly - if it ignores disk read errors or fixes them; or if it overwrites parities in the event of a mismatch, or what. There are multiple potential problems, each with different fixes possible.
Synchronization sounds like it'll read data chunks, recompute parities, and overwrite existing parities. You wouldn't normally do this without a reason. For example if a redundancy check shows mismatches have spiked since the last time you ran it, you'd want to power down, check all connections, maybe even reseat the drives if they're in a backplane. Power back up. And do a file system check/repair - of the full variety. Now you can rebuild parities, which will also reset any mismatch counts. If you were to immediately do another redundancy check there should be no errors/mismatches at all. If there are, then there's almost certainly a hardware problem and it needs to be found.
Disk Warrior if you want GUI only, or check the btree rebuild options under the -R option in "man fsck_hfs". Disk Utility does not rebuild any of the btrees. I'd use the 2nd form, -fy -f first, then separately rebuild each btree with -Rc, -Re, -Ra. This could take a while. And it only fixes file system metadata. Actual data files are untouched, and the file system itself knows nothing of the underlying raid so that's not fixed either. However, if the raid is in bad shape, file system checks will be in bad shape and likely not repairable. If it doesn't need repairing or is readily fixable, then at least the data chunks for the file system are likely consistent.
I'm not recommending raid5, maybe raid6 is a good fit. But you asked an open ended question without any detail on the use case, work load, backup strategy, or the drives being used. So I'm just poking holes. Maybe raid6 fits your use case exactly right.
RAID6+hotspare is a red flag to me, that says, "this data is really super important and needs to always be available". And it necessarily implies the workload can still get by on degraded performance, or you can tolerate waiting for the rebuild. Video production workloads are demanding as is parity array rebuild, so I'm skeptical that you can do both in a 1x fail raid6. But I have to defer to more experienced people on this question. For a 2x fail raid6, the performance must obviate doing any meaningful work.
Let's look at the rebuild times. The Promise web site says "Pegasus2 R8 32TB model is populated with 5900 RPM SATA drives" which is why I asked what drives you're using. Those are probably Seagate Barracuda XTs, which average about 130MB/s on sequential writes, meaning each 4TB drive will take about 8.5 hours to fully write. Raid1/10 can rebuild at the drive's max sustained write speed. Parity raid will take longer. How much longer depends on the controller.
So there are three questions: Is the 1x degraded raid6 performance still decent enough to get work done? If not, can you tolerate the downtime for the rebuild? And how much longer than 8.5 hours is it?
There are all sorts of ways to resolve this. One extreme is getting 10K-15K SAS drives of smaller capacities so the rebuild times are shorter and the performance is better, except the Promise spec sheet says drive support is SATA, not SAS, so that's maybe a dead end. The other is raid10, which gets you a much smaller performance hit in degraded mode, even with a 3x drive failure, let alone the 1x failure you're most likely to encounter. The rebuild time is also faster.
Also, raid6+hot spare means you're setting aside 37.5% of your storage capacity. And another 12.5% hit isn't much compared to the gain of raid10 over raid6+hot spare.
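Those percentages are easy to check. Here is a minimal Python sketch; the function name usable_fraction is made up, and it assumes eight equal drives with raid10 built from mirrored pairs.

```python
from fractions import Fraction

def usable_fraction(level: str, total_drives: int, hot_spares: int = 0) -> Fraction:
    """Share of raw capacity left for data."""
    active = total_drives - hot_spares
    if level == "raid10":
        usable = Fraction(active, 2)  # every drive has a mirror partner
    else:
        usable = active - {"raid0": 0, "raid5": 1, "raid6": 2}[level]
    return Fraction(usable) / total_drives

for label, args in [
    ("raid6", ("raid6", 8)),
    ("raid6 + hot spare", ("raid6", 8, 1)),
    ("raid10", ("raid10", 8)),
]:
    set_aside = 1 - usable_fraction(*args)
    print(f"{label:18s} sets aside {float(set_aside):.1%}")
```

That prints 25.0%, 37.5% and 50.0%, so going from raid6 plus hot spare to raid10 costs the extra 12.5% mentioned above.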
With regard to raid10 vs 0+1: Use raid10 over 0+1 because it always rebuilds faster, and you can lose more drives than 0+1. Their performance in normal mode is the same.
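The "can lose more drives" part can be checked by brute force. The Python sketch below enumerates every combination of failed drives on an 8-drive array under a simple model that I'm assuming for illustration: raid10 is four mirrored pairs striped together, and 0+1 is two 4-drive stripes mirrored, where any single failure takes its whole stripe offline. The function names are made up and a given controller may behave differently.

```python
from itertools import combinations

DRIVES = range(8)

def raid10_survives(failed: set) -> bool:
    # Pairs (0,1), (2,3), (4,5), (6,7): dead only if both halves of a pair fail.
    return all(not {2 * k, 2 * k + 1} <= failed for k in range(4))

def raid01_survives(failed: set) -> bool:
    # Stripes {0,1,2,3} and {4,5,6,7}: at least one stripe must be untouched.
    return not (failed & {0, 1, 2, 3}) or not (failed & {4, 5, 6, 7})

def survival(survives, n_failed: int):
    """Return (surviving combinations, total combinations)."""
    combos = [set(c) for c in combinations(DRIVES, n_failed)]
    return sum(survives(c) for c in combos), len(combos)

for n in (1, 2, 3, 4):
    ok10, total = survival(raid10_survives, n)
    ok01, _ = survival(raid01_survives, n)
    print(f"{n} failed: raid10 survives {ok10}/{total}, raid0+1 survives {ok01}/{total}")
```

Both survive any single failure, but under this model raid10 gets through far more of the two- and three-drive combinations than 0+1 does.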
You can relax your concern whether drives are better off spinning or on a shelf. Consider the manufacturer themselves will have a pile of a given drive model in reserve for years. I don't consider raid5+hot spare to be raid6. And I'm unaware of an implementation that permits a hot spare to be a working rw mount that is suddenly yanked from the user without warning, destroyed, and used in a rebuild upon a single disk failure calling it to duty.