Tuesday, July 26, 2011

NAS devices and Linux Raid

http://www.nber.org/sys-admin/linux-nas-raid.html

The NBER has several file stores, including proprietary boxes from Netapp, semi-proprietary NAS boxes from Excel-Meridian and Dynamic Network Factory (DNF) based on Linux (with proprietary MVD or Storbank software added), and home-brewed Linux software RAID boxes based on stock Redhat distributions and inexpensive Promise IDE (not RAID) controllers. Along the way we have learned some simple lessons that don't seem to be widely appreciated. Here are a few.

How high can you fill it?

The two NAS devices from Netapp run contentedly at 99% of "capacity" without a noticeable slowdown when reading or writing. Of course "capacity" doesn't include the 20% reserved for snapshots or the 10% for "overhead". Our two boxes based on Mountain View Data software and the one from ECCS begin to slow down seriously at about 70% of capacity if snapshots are enabled. The slowdown can be extreme - several orders of magnitude. The problem is most noticeable when 4 or 5 users are trying to write simultaneously, and it looks like a crash to the users. Sometimes I/O operations will fall to zero for several minutes, then resume. Our homebrew Linux systems don't have snapshots, but they work well even at high levels of space utilization, even with the reserved overhead reduced to 5% via tunefs.
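The "overhead" on our homebrew boxes is just the filesystem's ordinary reserved-block percentage, so reclaiming it is a one-line operation. A minimal sketch with hypothetical device names (tune2fs is the Linux counterpart of the tunefs mentioned above; tunefs itself is the FreeBSD/UFS tool and wants the filesystem unmounted):

   tune2fs -m 5 /dev/md0       # ext2/ext3: cut the reserved "overhead" blocks to 5%
   tunefs -m 5 /dev/ad0s1e     # UFS: the same idea on FreeBSD, run against an unmounted filesystem
   df -h /export               # the reclaimed space shows up as available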

Can you use whole drive partitions?

We got in the habit of using 3 or 4 drives in a RAID 5 array, with the entire drive used for the RAID partition. Then we noticed that just because two drives are the same brand, model and rev level doesn't mean they have the same number of sectors. At the initial install this won't matter - the Linux md software will set the partition size to the minimum of the member drive sizes. The problem won't appear until a drive needs to be replaced, and then you need to find a drive with at least as many sectors as the remaining good drives. For example, out of six Maxtor 160s delivered in the same shipment, four had 19,929 cylinders and two had 19,930 cylinders, a difference of six megabytes. Of course you can't tell which size the store has in stock, so it becomes difficult to replace a drive without saving and restoring the entire array - or buying the next larger size of drive. 3ware controllers automatically round down in a process they call "drive coercion".
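Two habits would have saved us some grief here: record the exact sector count of each member when the array is built, and carve the RAID partitions a little short of the end of the disk so that a marginally smaller replacement still fits. A rough sketch, with hypothetical device names:

   for d in /dev/sdb /dev/sdc /dev/sdd
   do
      echo -n "$d: "; blockdev --getsz $d   # capacity in 512-byte sectors; "identical" drives may differ
   done
   # build the array from partitions sized slightly below the smallest drive,
   # not from the raw disks, so a replacement drive has a few megabytes of slack
   mdadm --create /dev/md0 --level=5 --raid-devices=3 /dev/sdb1 /dev/sdc1 /dev/sdd1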

How easy is it to upgrade? (software)

Tech support for the Excel Meridian box required us to do a backup and restore of the data for an OS upgrade. That turns 20 minutes of scheduled downtime into several days. I can only assume the motivation was to discourage the upgrade.

Does it work with NIS?

All NAS appliances promise to work with NIS, but only the Netapp and ECCS boxes recognized netgroups. The MVD-based devices did not.

Can you get under the GUI?

As you might expect, the failure to recognize netgroups was a GUI problem in the semi-proprietary Linux-based systems. If one "looked under the hood" at the Linux code that actually ran the NAS, it was easy to specify netgroups in /etc/exports. Both Excel Meridian and Dynamic Network Factory threatened to withdraw all support if we did so, however.
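For the record, the change "under the hood" amounts to nothing more exotic than an ordinary exports entry naming the netgroup; the path and netgroup name below are made up:

   # /etc/exports on the NAS itself
   /export/data    @nberhosts(rw,sync)
   # then have the NFS server re-read the file
   exportfs -ra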

How do you access snapshots?

On the Netapp, every directory has a subdirectory ".snapshot", and within that directory are subdirectories for snapshots of the prior state - "nightly.0", "nightly.1", etc. This is very convenient if you just want yesterday's file. However, if you are doing a backup that crosses midnight, you have the problem that "nightly.0" changes its name to "nightly.1" right in the middle of your run. That is a nuisance and will cause some software to choke. You can create a snapshot with indefinite life over the telnet/rsh interface, but we didn't want to put plaintext passwords on the LAN, so we use the weekly snapshot for backups. [Chris Madden has a solution to this problem - shown in the letter appended at the end of this document.]
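Chris Madden's suggestion (in the letter appended below) is to mount the snapshot directory itself rather than the live filesystem, so the backup source is pinned to one snapshot no matter what it is renamed to. Roughly, with made-up paths:

   mount -t nfs filer:/vol/data/.snapshot/nightly.0 /mnt/backup-src
   # back up /mnt/backup-src; the mount should keep pointing at the same snapshot
   # even after the filer renames nightly.0 to nightly.1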
The ECCS box puts the snapshots in a separately exported filesystem named after the month and day it was created. This avoids the name-changing problem, but means backup software can't be given an unchanging name for the backup source.
The FreeBSD version of snapshots lets the system administrator place and name the snapshots to his liking - an advantage of a slightly lower-level user interface. However, we don't use snapshots in FreeBSD because they take so long - five times as long as fsck on our system.
An early version of the DNF Storbank system exported the snapshots only as SMB shares, even if the current filesystem was exported over NFS. That ensured that even system administrators would have trouble accessing the snapshots. Later, NFS exports of snapshots were added at our request. This is the only time any NAS vendor was willing to learn anything from us.

How many snapshots can you have? How long can you keep them?

For some reason vendors think they know how many snapshots you will want, and how long you will want them. Earlier versions of the Netapp software restricted us to 21 snapshots, but that has been increased to 32. Netapp keeps a bitmap with one bit per block of storage per snapshot, so a limit of 32 is not completely arbitrary. MVD-based systems have license surcharges for more than 5 snapshots, with a maximum of 32. Storbank was also limited to 32. Some of our filesystems change very slowly, so older snapshots would not be a burden in terms of inodes or space, but we are constrained by these policies.

How long will it take to get a replacement drive?

With the homebrew filestores, we can order a drive for next day delivery 6 days a week, or buy one locally on any day. Of course the other vendors have service plans with 24 hour service, and with most NAS boxes the drives are commodity items anyway - you could replace them with store-bought drives if the vendor was uncooperative.
The Netapp effectively requires drives purchased from Netapp. Once we were a bit shocked when the "next business day" for Netapp to replace a drive that failed on June 30th was July 6th. The explanation? It was past noon on the 30th. The 1st and 2nd were a weekend. The 3rd was the day before the 4th and everyone had been given an extra day off. The 4th was of course a holiday, and the 5th was the earliest ship date, for delivery on the 6th. They offered (and we accepted) same day delivery for a $500 surcharge. They called this "sudden" service.

Why do drive failures come in pairs?

Most of the drives in our NAS boxes and drive arrays claim an MTBF of 500,000 hours. That's about a 2% chance of failure per year. With three drives, the chance of at least one failing in a year is a little less than 6% (1 - 0.98^3). Our experience is that such numbers are at least a reasonable approximation of reality (but see Schroeder and Gibson, 2007). (We especially like the 5400 RPM Maxtor 5A300J0 300GB drives for long life.)
Suppose you have three drives in a RAID 5. If it takes 24 hours to replace and reconstruct a failed drive, one is tempted to calculate that the chance of one of the two remaining drives failing before full redundancy is established is about 2 x .02/365, or roughly one in ten thousand. The total probability of a double failure then seems like it should be about 6 in a million per year.
Our double failure rate is nearly four orders of magnitude worse than that - the majority of single drive failures are followed by a second drive failure before redundancy is established. This prevents rebuilding the array with a new drive replacing the original failed drive, although you can probably recover most files if you stay in degraded mode and copy the files to a different location. It isn't that failures are correlated because the drives are from the same batch, or the controller is at fault, or the environment is bad (a common electrical spike or heat problem). The fault lies with the Linux md driver, which stops rebuilding parity after a drive failure at the first point it encounters an uncorrectable read error on the remaining "good" drives. Of course with two drives unavailable there isn't an unambiguous reconstruction of the bad sector, so it might be best to go to the backups instead of continuing. At least that is apparently the reason for the decision.
Alternatively, if the first drive failed was readable on that sector (even if it could not read some other sectors), it should be possible to fully recover all the data with a high degree of confidence even if a second drive is failed later. Since that is far from an unusual situation (a drive will be failed for a single uncorrectable error even if further reads are possible on other sectors), it isn't clear to us why that isn't done. Even if that sector isn't readable, logging the bad block, writing zeroes to the targets, and going on might be better than simply giving up.
A single unreadable sector isn't unusual among the hundreds of millions of sectors on a modern drive. If the sector has never been written to, there is no occasion for the drive electronics or the OS even to know it is bad. If the OS tried to write to it, the drive would automatically remap the sector and no damage would be done - not even a log entry. But that one bad sector will render the entire array unrecoverable, no matter where on the disk it is, if another drive has already been failed.
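That remap can be triggered deliberately once SMART reports a pending (unreadable) sector. A sketch only - the device name and LBA are made up, and the write destroys whatever data occupied that sector, so the smartmontools documentation on mapping an LBA back to a file is worth reading first:

   smartctl -l selftest /dev/sdb      # the self-test log reports the LBA of the first read error
   dd if=/dev/zero of=/dev/sdb bs=512 count=1 seek=123456789   # writing the (hypothetical) LBA forces the remap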
Let's repeat the reliability calculation with our new knowledge of the situation. In our experience perhaps half of all drives have at least one unreadable sector in their first year. Again assume a 6 percent chance of a single drive failure. The chance that at least one of the remaining two drives has a bad sector is 75% (1 - (1 - .5)^2). So the RAID 5 failure rate is about 4.5%/year, which is half a percentage point MORE than the 4% failure rate one would expect from a two-drive RAID 0 of the same capacity. Alternatively, if you just had two drives with a partition on each and no RAID of any kind, the chance of a failure would still be 4%/year but with only half the data lost per incident - considerably better than the RAID 5 can even hope for under the current reconstruction policy, even with the most expensive hardware.
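For anyone who wants to check the arithmetic, the whole calculation fits in a few lines of bc:

   echo '1 - 0.98^3' | bc -l             # chance at least one of three drives fails in a year, about .059
   echo '0.06 * 2 * 0.02 / 365' | bc -l  # naive double-failure estimate with a 24-hour rebuild, about 6.6e-6
   echo '0.06 * (1 - 0.5^2)' | bc -l     # failure rate once latent bad sectors are counted, about .045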
We don't know what the reconstruction policy is for other raid controllers, drivers or NAS devices. None of the boxes we bought acknowledged this "gotcha" but none promised to avoid it either. We assume Netapp and ECCS have this under control, since we have had several single drive failures on those devices with no difficulty resyncing. We have not had a single drive failure yet in the MVD based boxes, so we really don't know what they will do. [Since that was written we have had such failures, and they were able to reconstruct the failed drive, but we don't know if they could always do so].
Some mitigation of the danger is possible. You could read and write the entire drive surface periodically, and replace any drive with even a single uncorrectable block visible. A daemon, smartd, is available for Linux that will scan the disks in the background for errors and report them. We had been running it, but ignored errors on unwritten sectors, because we were used to such errors disappearing when the sector was written (and the bad sector remapped).
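A sketch of the kind of periodic surface checking we have in mind, assuming a Linux md array at /dev/md0, a reasonably recent 2.6 kernel, and smartmontools; the device name and test schedule are only examples:

   # make the md driver read every sector of every member, so latent bad blocks
   # are found (and rewritten from parity) while redundancy still exists
   echo check > /sys/block/md0/md/sync_action
   cat /proc/mdstat                     # shows the progress of the scrub
   # /etc/smartd.conf: monitor all attributes and run a long self-test Saturdays
   # at 3am, mailing root when something like Current_Pending_Sector goes bad
   /dev/sda -a -s L/../../6/03 -m root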
Our inclination was to shift to a recent 3ware controller, which we understand has a "continue on error" rebuild policy available as an option in the array setup. But we would really like to know more about just what that means. What do the apparently similar RAID controllers from Mylex, LSI Logic and Adaptec do about this? A look at their web sites reveals no information. [Note added 10/2009: For some time now we have stuck with software raid, because it renders the drives pretty much hardware independent and there doesn't appear to be much of a performance loss.]

Can you add drives?

With Netapp and many other proprietary systems, the answer is yes, provided you have shelves and buy the drives from the OEM. Each filer unit has a software-programmed maximum number of bytes that it will support. In our case the limit was very constraining for an otherwise lightly used unit. The Netapp allows you to add drives to a volume without a backup/restore or any downtime, a remarkable ability they don't advertise. But Netapp has a couple of ways to discourage you from buying third party drives. The first is that you can't get drive caddies. The disk shelf you buy appears to have drive caddies for 14 drives, but the empty caddies are not drilled to accept drives.
There also seem to be tests in the OS software that detect and refuse to mount drives not purchased from Netapp. There are a few postings on the web from users who claim to have used non-Netapp drives (and Netapp at one time posted a list of compatible drives), but in our experience, even drives from that list marked with the appropriate firmware revision level and sector size were rejected as unsupported. Recently a vendor modified a third party drive for us so that the Netapp would accept it. A comparative bargain at $110. We don't know what they did.
For Linux/FreeBSD-based boxes, there are few or no compatibility issues. But expansion capability is more than just a count of ports and slots. The big question for us is "Will the system still boot from the original boot drive after adding new drives?" This will no doubt depend on the interaction of the motherboard and OS, and has given us no end of trouble. In our experience with cheap 2-channel controllers such as Promise or SIIG, you can add new drives (but not new controllers) without interfering with the boot order, especially if one disables the label feature of fstab and specifies drives by device letter.
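A minimal fstab fragment of the sort we mean, with hypothetical devices on the Promise controller:

   # /etc/fstab -- name drives explicitly instead of relying on labels
   /dev/hda1    /          ext3    defaults    1 1
   /dev/hde1    /export    ext3    defaults    1 2
   # rather than:  LABEL=/   /   ext3   defaults   1 1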
Our experience adding 4 drives to a 3ware 9500-S controller was less pleasant. This was probably "our fault" for installing the OS on the first RAID group, composed of drives on ports 1, 3, 5 and 7, instead of on ports 0, 2, 4 and 6 as a knowledgebase article suggested. But it was a hard-won lesson, and we were unable to install additional drives on the remaining ports without disrupting the boot. Later experience with a 3ware 9550 controller was marred by incorrect instructions from 3ware for adding driver support to the FreeBSD kernel.
We have always used drives from the local Micro-Center in our IDE and SATA arrays, but this gives us pause.

What if non-drive hardware fails?

Netapp has a thriving aftermarket, so you could be fairly confident of finding replacement shelves and "heads" even if Netapp decided a product was sufficiently non-strategic to EOL it. With some of the other vendors, you'd be looking for generic motherboards, cables, etc. That also gives me some confidence I could repair a system even if the vendor didn't want to help.
Both MVD devices use a standard XFS filesystem, and a readily available 3ware controller, so we have some confidence we could recover the data from the drives alone, without the rest of the hardware. This is an important insurance policy.
I'd be very reluctant to put data on a proprietary system with no aftermarket support. Every vendor is in constant danger of being acquired, divested or turned around. When that happens, you and your box are no longer "strategic", and contract or not, requests for help are likely to be brushed aside. Even with an enforceable contract, the vendor can easily discourage calls for service by proposing solutions that don't save your data.
We recently received an email message from Netapp dated July 14th, 2008, announcing that spare parts for their S300 product would not be available for purchase after August 14th of the same year. So don't imagine that a "big company" ensures parts availability; only competitive supply does that.

Are SCSI/FC drives really better?

They probably are; the Fibre Channel-based Netapps have had the fewest problems. But none of the problems mentioned above seem to be in any way related to the controllers or drives - all are really policy decisions made by implementors and vendors who don't see things quite our way.

Are the drives kept cool?

We have assiduously avoided the many systems that appear to allow little or no air to flow around the drives. Vendors claim these systems meet the drive manufacturer's heat specification, and perhaps they do. But drives kept 10 or 20 degrees cooler than the maximum specified will last a lot longer, even if they do take up a bit more space.
The DNF box comes with 14 mostly tiny fans, and the failure of any one of those fans generally endangers the data on two adjacent disks. Five of those fans are accessible for replacement only with considerable disassembly. Note that two fans next to each other do not provide redundancy, since a failed fan acts as an alternate source of feed air for the working fan, and the breeze then bypasses the electronic components.
In the boxes we construct ourselves, we have used a single 120mm case fan for cooling (plus the CPU fan), which increases the (fan) MTBF by an order of magnitude and keeps the drives much cooler. The CMOS setup on inexpensive motherboards seems always to include a high-temperature shutoff (based on the CPU temperature), which we set at the lowest level offered. This shuts down power to all internal components should the CPU or case fans fail. For some reason our more deluxe HP compute servers don't have this function.
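On the Linux boxes the drive and motherboard temperatures are also easy to watch from userland; a sketch using smartmontools and lm_sensors (attribute names vary by drive):

   smartctl -A /dev/sda | grep -i temperature   # most drives expose Temperature_Celsius via SMART
   sensors                                      # CPU, case and fan readings, if lm_sensors is configured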
In April 2006 we had an air conditioning failure, and the room temperature reached over 100 degrees before we got to turn systems off. Remarkably, none of the NAS boxes shut themselves off at that temperature. The NetApp shut down the head, but it did not shut down the drives - a remarkable oversight, to my mind. In a day when every $500 desktop has an overheating shutdown in the CMOS setup, it is distressing that $10,000-$100,000 storage servers don't do as well.

Is hotplugging really necessary?

In the disk storage business there are two possible interpretations of "reliable". One is never going offline; the other is never losing data. The former requires hot-swapping hardware, which introduces new failure possibilities and aggravates the latter. Our experience is that caddy problems occur at a rate in the neighborhood of .5%/year/caddy - less than disk problems, but still significant. About half the time all that is necessary is to remove and reinsert the drive, but this is still disruptive. We are most interested in keeping user data safe, and can tolerate minor downtime if necessary. Therefore in our home-made boxes we do not use caddies, but mount the drives directly in the bays.

Why is it beeping?

All the systems have a large number of monitored fans and thermocouples. If a fan slows down, or the temperature gets too hot, they start beeping. In a crowded server room you won't be able to tell which system is beeping, so some visual indicator is essential - but not generally provided. Furthermore, with as many as 20 fans, some not easily inspected, it would be nice to know which fan was spinning a tad too slowly, or which chip was a tad too hot. That isn't available either.

Interesting Reading

  1. A study of data loss on 1.5 million drives in deployed Netapp systems.
  2. Also from Netapp, a study of non-drive storage failures.
  3. A comparison of data protection mechanisms, with brand names.

Comments Welcome

If I've said anything wrong - please write me (address below) and I will do my best to post corrected information. I'd also be interested in other lessons.

Daniel Feenberg
feenberg isat nber dotte org

with thanks to Mohan Ramanajan

last update 22 April 2006 then 4 December 2005 then 9 December 2005 then 21 February 2006 then 28 April 2006 then 26 September 2006 then 12 March 2008 then 15 July 2008
Since the original posting I have received the following very informative email message about the Netapp. Quotes from the page above are between rows of equal signs; the rest of the message is from Chris Madden:
On Sat, 26 Nov 2005, Chris Madden wrote:

 Hi,

 I was just reading your article
 (http://www.nber.org/sys-admin/linux-nas-raid.html) and thought I'd give
 some comments since I know NetApp pretty well.

 ===============================================================
 How do you access snapshots?
 On the Netapp, every directory has a subdirectory ".snapshot" and within
 that directory are subdirectories for snapshots of the prior state -
 "nightly.0", "nightly.1", etc. This is very convinient if you just want
 yesterday's file, however if you are doing a backup that crosses midnight
 you have the problem that "nightly.0" changes its name to "nightly.1" right
 in the middle of your run. That is a nuisnace and will cause some software
 to choke. You can create a snapshot with indefinite life over the telnet/rsh
 interface, but we didn't want to put plaintext passwords on the LAN, so we
 use the weekly snapshot for backups.
 ===============================================================

 My comments...
 I think you can avoid this by mounting the relevant snapshot directory
 itself (like "/vol/data/.snapshot/nightly.0") at a mount point and then
 backing up the mount point.

 So like this and then backup /mnt/test:

 bnl-chris:/vol/vol0   25052408   3431728  21620680  14% /mnt/bnl-chris-vol0
 bnl-chris:/vol/vol0/.snapshot/test.1
                        6263100      3564   6259536   1% /mnt/test

 ===============================================================
 How many snapshots can you have? How long can you keep them?
 For some reason vendors think they know how many snapshots you will want,
 and how long you will want them. So Netapp restricts you to 21 snapshots.
 MVD based systems have license surcharges for more than 5 snapshots, with a
 maximum of 32. Storbank was also limited to 32. Some of our filesystems
 change very slowly, so that older snapshots would not be a burden in terms
 of inode or space, but we are constrained by these policies.
 ===============================================================

 My comments...
 We use NetApp snapshots for vaulting purposes and have over 200 on a volume
 and I think the limit is 255.  Maybe you are running an older ONTAP version
 and could upgrade for support of more snapshots?



 ===============================================================

 We don't know what the reconstruction policy is for other raid controllers,
 drivers or NAS devices. None of the boxes we bought acknowledged this
 "gotcha" but none promised to avoid it either. We assume Netapp and ECCS
 have this under control, since we have had several single drive failures on
 those devices with no difficulty resyncing. We have not had a single drive
 failure yet in the MVD based boxes, so we really don't know what they will
 do.

 ===============================================================

 My comments...

 NetApp does a few things that will make it better than the linux storage
 stack.  First thing that NetApp does is sector checksumming.  So every
 sector written has a checksum also written.  The technique varies based on
 512byte/sector (ATA disks & older FC disks) vs. 520 byte/sector (newer FC
 disks)  but on all disk types this is done.  The second thing that NetApp
 does is constant media scrubbing.  This is to say that the NetApp issues a
 SCSI verify request to every sector on every drive if the drive is idle.
 Since a verify request doesn't actually read data there is almost no cost
 for this operation.  If the verify fails however it's an indication that
 there is a drive problem and can be monitored by the system more closely
 with too many of these issuing a proactive drive failure.  The third thing
 that is done once per week is volume scrubbing (or actually raid scrubbing).
 So this is the process of reading each stripe and comparing with parity. If
 something doesn't agree then using sector checksums and parity data the
 incorrect block can be determined and fixed.  There's also a feature called
 "rapid raid recovery" that in the case of disk that isn't dead yet (but
 shows signs of dying) where it is copied, block for block, to a spare disk
 before it is failed out.  This avoids the performance cost of a raid
 recovery because there is 1 read and 1 write vs. 7 reads and 1 write (raid-4
 with 8 disks RG).  I agree that the biggest cause of raid reconstruct
 failures is unrecoverable read error on the surviving disks but with the
 above techniques the chance of error can be minimized by these proactive
 (media and vol scrubbing) tasks.  There's also RAID-DP which gives you two
 parity disks per raidgroup and even allows for that unrecoverable read error
 during reconstruct...

 Let me know if you have any questions!

 Cheers,

 Chris Madden


And the following two messages deprecating RAID-5 are a little harsh if, like us, you do long sequential I/O. But if you have a random-access database they are quite pertinent. I only hope that implementers of the RAID drivers for Linux and FreeBSD don't use them as an excuse to leave the RAID-5 reconstruction procedure so broken. Mark's message is marked with angle brackets; the unmarked section is my message to him.
On Sat, 10 Dec 2005, Morris, Mark wrote:

> Hi,
> 
> I enjoyed reading your web page about disk failure modes.  I am amazed
> by how perceptive you are about the failure rate issues caused by 
> unrecoverable bit errors.  This is a topic that is rarely discussed by
> RAID vendors or even academic researchers.   There are emerging double
> protection schemes - like RAID-6 - that aim to cover this failure mode,
> but they suffer from even poorer write performance than RAID-5.    The



I wonder about poor RAID-5 performance. I understand that for random
access the system has to write to every drive in the set to update one
sector on one drive, but when doing sequential writes, can't the drives
be arranged so that, for instance if I am writing a large number of
sequential sectors, the parity calculation can be done in memory
using only the sectors being written? I am thinking of a sector layout
like the following.

Consider 3 drives in a RAID 5 array, a, b and c, each with 4 sectors:

    a    b    c
    1*   2    3
    4    5*   6
    7    8    9*
   10*  11   12

The * marks the parity sector for the horizontal stripe.

Now if I want to write 2 sectors in a row, such as 4 and 6, I can do the
entire calculation of parity in memory before writing any sectors, and
in the end I need to write only 3 sectors (the two data sectors plus the
parity sector 5) for 2 sectors worth of storage, rather than performing
the four I/Os per sector that most people claim. Now it becomes a
speed advantage to have many drives in a set, rather than a
disadvantage.

I realize that a database might not write sectors in sequence, but we do
a lot of sequential writes, and have always wondered why RAID 5 is slow for us.

> disk vendors specify the unrecoverable bit error rate - usually 1 in
> 10**14 bits read for PC class disks, and 1 in 10**15 for "enterprise"

We don't get results nearly that good. A 500 gig drive is only 5*10**12
bits, or 5% of 10**14, or about 10 times better than we get. Maybe we
could do that well if we paid special attention to smartdisk.

> class disks.  But still that doesn't stop many RAID-5 vendors 
> recommending wide stripes to achieve lower cost of redundancy.  But as

> you mention, the wider the RAID config, the more data required to 
> reconstruct and therefore the more likely to hit the second failure 
> during the rebuild.
> 
> Mark
> 


From jm160021 at ncr dotte com Sat Dec 10 15:58:30 2005
Date: Sat, 10 Dec 2005 13:56:35 -0500
From: "Morris, Mark" 
To: Daniel Feenberg 
Subject: RE: your web page about disk failure modes

Sure, you can add it to your web page including the new stuff below.  I
would also like to refer you to one of my favorite web sites:
www.baarf.com  - this is a web site dedicated to raging against RAID-5.
You might consider contributing your observations there as well, because
I don't think I've seen anyone cover this issue of unrecoverable bit
errors.   

Regarding your RAID-5 question, yes full stripe writes avoid a lot of
write overhead in RAID-5.  In fact many RAID vendors tout this in their
literature.  Unfortunately many applications don't have this type of
workload pattern.  To get the array to use full stripe writes, I think
you either have to write in very large chunks that span a whole stripe
(considering alignment), or you have to have write back caching enabled
so that the array can piece the full stripe write together.

For random writes (that don't cross a stripe boundary), RAID-5 takes 4
IO's:  read old parity, read old data, write new parity, write new data.

My company vends large massively parallel databases.  Because of both
poor write performance and a host of other issues with RAID-5 (including
the bit error problem) we pretty much switched our entire product line
to RAID-1 in 1999 at the transition from 4 GB to 9 GB disks.  Sure there
were always die hards that still configured their systems in RAID-5 but
that was almost always driven by some tightwad in purchasing who didn't
understand the issues.

Mark


And here is something I haven't experienced, but we can all learn from others:
From rreynolds@postcix.co.uk Mon Jun 26 11:43:55 2006
Date: Mon, 26 Jun 2006 16:43:16 +0100
From: Rupert Reynolds
To: Daniel Feenberg 
Subject: Re: To Daniel Feenberg re linux-nas-raid web page


On Mon, 26 Jun 2006, Rupert Reynolds wrote:

I read your linux-nas-raid page at nber.org with interest, many thanks.

One small point is that you mention removing and re-inserting a hot 
swappable drive. I came across a gotcha with this, which may affect other 
hardware.

On an Intel branded server box I had to nurse for a while, if a drive is
removed and re-inserted /before/ the RAID controller spots the missing
drive fault (which can take a minute or two, I'm told) then the controller 
continues writing to the drive as part of the array without realising that
some of the data (in cache when it was removed) was never written. This
can damage the array and prevent rebuilding, obviously depending on which
drive actually fails next.

Perhaps this is common knowledge: I just thought I'd mention it in case 
not:-)

Rupert Reynolds
