Tuesday, July 26, 2011

Petabytes on a Budget v2.0




135 Terabytes for $7,384
It’s been over a year since Backblaze revealed the designs of our first generation (67 terabyte) storage pod. During that time, we’ve remained focused on our mission to provide an unlimited online backup service for $5 per month. To maintain profitability, we continue to avoid overpriced commercial solutions, and we now build the Backblaze Storage Pod 2.0: a 135-terabyte, 4U server for $7,384. It’s double the storage and twice the performance—at lower cost than the original.
In this post, we’ll share how to make a 2.0 storage pod, and you’re welcome to use the design. We’ll also share some of our secrets from the last three years of deploying more than 16 petabytes worth of Backblaze storage pods. As before, our hope is that others can benefit from this information and help us refine the pods. (Some of the enhancements are contributions from helpful kindred pod builders, so if you do improve your Backblaze pod farm, please balance the Karma and send us your suggestions!)

Quick Review – What makes a Backblaze Storage Pod

A Backblaze Storage Pod is a self-contained unit that puts storage online. It’s made up of a custom metal case with commodity hardware inside. You can find a parts list in Appendix A. You can also link to a power wiring diagram, see an exploded diagram of parts, and check out a half-assembled pod. The two most noteworthy factors are that the cost of the hard drives dominates the price of the overall pod and that the system is made entirely of commodity parts. For more background, read the original blog post. Now let’s talk about the changes.

Density Matters – Double the Storage in the Same Enclosure

We upgraded the hard drives inside the 4U sheet metal pod enclosure to store twice as much data in the same space. Beyond the hardware itself, one datacenter rack containing 10 pods costs Backblaze about $2,100 per month to operate, divided roughly into thirds for physical space rental, bandwidth, and electricity. Doubling the density cuts our per-terabyte spending on both physical space and electricity roughly in half. The picture below is from our datacenter, showing 15 petabytes racked in a single row of cabinets. The newest cabinets squeeze one petabyte into three-quarters of a single cabinet for $56,696.

Backblaze Storage Servers in Datacenter

Our online backup cloud storage is our largest cost, and we are obsessed with providing a service that remains secure, reliable and, above all, inexpensive. We’ve seen competitors unable to react to these demands who were forced to exit the market, like Iron Mountain, or raise prices, like Mozy and Carbonite. Controlling the hardware design has allowed us to keep prices low.
We are constantly looking at new hard drives, evaluating them for reliability and power consumption. The Hitachi 3TB drive (Hitachi Deskstar 5K3000 HDS5C3030ALA630) is our current favorite for both its low power demand and astounding reliability. The Western Digital and Seagate equivalents we tested saw much higher rates of popping out of RAID arrays and drive failure. Even the Western Digital Enterprise Hard Drives had the same high failure rates. The Hitachi drives, on the other hand, perform wonderfully.

Twice as Fast

We’ve made several improvements to the design that together double the performance of the storage pod. Most of the improvements were straightforward and helped along by Moore’s Law. We bumped the CPU up from a dual-core Intel chip to the Intel i3 540 and moved from a motherboard with one Gigabit Ethernet port to a Supermicro motherboard with two. RAM dropped in price, so we doubled it to 8 GB in the new pod. More RAM lets our custom Backblaze software layer create larger disk caches, which can really speed up certain types of disk I/O.
In the first generation storage pod, we ran out of the faster PCIe slots and had to use one slower PCI slot, creating a bottleneck. Justin Stottlemyer from Shutterfly found a better PCIe SATA card, which enabled us to reduce the SATA cards from four to three. Our upgraded motherboard has three PCIe slots, completely eliminating the slower PCI bottleneck from the system. The updated SATA wiring diagram is seen below. Hint: The pod will work if you connect every port multiplier backplane to a random SATA connection, but if you wire it up as shown below, the 45 drives will appear named in sequential order.
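For readers wiring up their own pod, a quick sanity check after the first boot is to count the drives the kernel has enumerated. The short Python sketch below is only an illustration (not part of the Backblaze software), and it assumes a Linux system where the 45 data drives plus the boot drive all show up as /dev/sd* block devices:

    #!/usr/bin/env python3
    # Sketch: list the disks Linux has enumerated and check that all 45 data
    # drives (plus the boot drive) are present after cabling up the backplanes.
    import glob
    import re

    def enumerated_disks():
        # Whole disks only (/dev/sda), not partitions (/dev/sda1); sort sda..sdz, then sdaa..
        disks = [d for d in glob.glob("/dev/sd*") if re.fullmatch(r"/dev/sd[a-z]+", d)]
        return sorted(disks, key=lambda d: (len(d), d))

    if __name__ == "__main__":
        disks = enumerated_disks()
        for d in disks:
            print(d)
        print("total disks seen:", len(disks))
        if len(disks) < 46:
            print("warning: expected 46 disks (45 data + 1 boot); check the backplane wiring")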

We upgraded the Linux 64-bit OS from Debian 4 to Debian 5, but we no longer use JFS as the file system. We selected JFS years ago for its ability to accommodate large volumes and low CPU usage, and it worked well. However, ext4 has since matured in both reliability and performance, and we realized that with a little additional effort we could get all the benefits and live within the unfortunate 16 terabyte volume limitation of ext4. One of the required changes to work around ext4’s constraints was to add LVM (Logical Volume Manager) above the RAID 6 but below the file system. In our particular application (which features more writes than reads), ext4’s performance was a clear winner over ext3, JFS, and XFS.
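To see why LVM is needed at all, here is the capacity arithmetic in a small sketch. It assumes the same layout as the first-generation pod (three RAID 6 groups of fifteen 3 TB drives each); treat the exact volume carving as an assumption rather than a description of our production configuration:

    # Rough capacity arithmetic for one pod (assumed layout: 3 x 15-drive RAID 6 groups).
    import math

    DRIVE_TB = 3              # terabytes per drive
    DRIVES_PER_GROUP = 15     # assumed RAID 6 group size
    GROUPS = 3                # 3 groups x 15 drives = 45 drives total
    EXT4_LIMIT_TB = 16        # the ext4 volume ceiling discussed above

    usable_per_group = (DRIVES_PER_GROUP - 2) * DRIVE_TB   # RAID 6 spends two drives on parity
    volumes_per_group = math.ceil(usable_per_group / EXT4_LIMIT_TB)

    print("usable space per RAID 6 group:", usable_per_group, "TB")   # 39 TB
    print("usable space per pod:", usable_per_group * GROUPS, "TB")   # 117 TB of the 135 TB raw
    print("ext4 volumes needed per group:", volumes_per_group)        # 3, hence LVM in the middle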
With these performance improvements, we see the new storage pods in our datacenter accepting customer data more than twice as fast as the older generation pods. It takes approximately 25 days to fill a new pod with 135 terabytes of data. The chart below shows the measured fill rates of an old Pod versus a new Pod, both under real-world maximum load in our datacenter.


Please note: The above graph is not the benchmarked write performance of a pod; we have easily saturated the Gigabit pipes copying data from one pod to another internally. This graph shows pods running in production, accepting data from thousands of simultaneous and independent desktop machines running Windows and Mac OS, where each desktop is forming HTTPS connections to the Tomcat web server and pushing data to the pod. At the same time, as customers are preparing restores that read data off those drives, there are system cleanup processes running, occasional RAID repairs, etc. In this end-to-end measurement, the new pods are twice as fast in our environment.
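For a sense of scale, the fill rate quoted above works out to roughly half a gigabit per second of sustained ingest. A quick back-of-the-envelope check (using decimal terabytes, as drive vendors do):

    # Back-of-the-envelope: average rate needed to fill a 135 TB pod in about 25 days.
    TB = 10**12
    pod_bytes = 135 * TB
    fill_days = 25

    bytes_per_sec = pod_bytes / (fill_days * 24 * 3600)
    print("average fill rate: %.0f MB/s (%.2f Gbit/s)"
          % (bytes_per_sec / 10**6, bytes_per_sec * 8 / 10**9))
    # ~62 MB/s, i.e. about half of one Gigabit Ethernet link, sustained around the clock.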

Lessons Learned: Three Years, 16 Petabytes and Counting

Backblaze is employee owned (with no VC funding or other deep pockets), so we have two choices: 1) stay profitable by keeping costs low or 2) go out of business. Staying profitable is not just about upfront hardware costs; there are ongoing expenses to consider.
One of the hidden costs of a datacenter is the headcount (salary) for the employees who deploy pods, maintain them, replace bad drives with good ones, and generally manage the facility. Backblaze has 16 petabytes and growing, and we employ one guy (Sean) whose full-time job is to maintain our fleet of 201 pods, which hold 9,045 drives. Typically, once every two weeks, Sean deploys six pods during an eight-hour work day. (He gets a little help from one of us to lift each pod into place, because they each weigh 143 pounds.)
Our philosophy is to plan for equipment failure and build a system that operates in spite of it. We have a lot of redundancy, ensuring that if a drive fails, immediate replacement isn’t critical. So at his leisure, Sean also spends one day each week replacing drives that have gone bad. As of this week, Backblaze has more than 9,000 hard drives spinning in the datacenter, the oldest of which we purchased four years ago. We see fairly high infant mortality on the hard drives deployed in brand new pods, so we like to burn the pods in for a few days before storing any customer data. We have yet to see any drives die because of old age, which will be fascinating to monitor in the next few years. All told, Sean replaces approximately 10 drives per week, indicating a 5 percent per year drive failure rate across the entire fleet, which includes infant mortality and also the higher failure rates of previous drives. (We are currently seeing failures in less than 1 percent of the Hitachi Deskstar 5K3000 HDS5C3030ALA630 drives that we’re installing in pod 2.0.)
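The replacement numbers above translate directly into an annualized failure rate; here is the arithmetic, using only the figures quoted in this post:

    # Annualized drive failure rate implied by the replacement figures above.
    drives_in_fleet = 9045
    replacements_per_week = 10

    annual_rate = replacements_per_week * 52 / drives_in_fleet
    print("fleet-wide failure rate: %.1f%% per year" % (annual_rate * 100))
    # Roughly 5-6% per year, in line with the "approximately 5 percent" figure above,
    # which includes infant mortality and the higher failure rates of older drive models.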
We monitor the temperature of every drive in our datacenter through the standard SMART interface, and we’ve observed in the past three years that: 1) hard drives in pods at the top of racks run three degrees warmer on average than those in pods on the lower shelves; 2) drives in the center of the pod run five degrees warmer than those on the perimeter; 3) pods do not need all six fans: the drives maintain the recommended operating temperature with as few as two fans; and 4) heat doesn’t correlate with drive failure (at least in the ranges seen in storage pods).
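A minimal sketch of that kind of temperature check is shown below. It assumes smartmontools is installed and that the drive reports the usual Temperature_Celsius attribute (ID 194); it is an illustration, not the Backblaze monitoring code, and the device names are placeholders:

    # Sketch: poll a drive's temperature via smartctl (smartmontools must be installed).
    import subprocess

    def drive_temperature(device):
        out = subprocess.run(["smartctl", "-A", device],
                             capture_output=True, text=True, check=False).stdout
        for line in out.splitlines():
            fields = line.split()
            # SMART attribute rows look like: ID# NAME FLAG VALUE WORST THRESH ... RAW_VALUE
            if len(fields) >= 10 and fields[0] == "194":
                return int(fields[9])
        return None

    if __name__ == "__main__":
        for dev in ["/dev/sda", "/dev/sdb"]:   # placeholder device names
            print(dev, drive_temperature(dev), "C")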
One important note: Because all of the parts (including drives) in the Backblaze storage pod come with a three-year warranty, we rarely pay for a replacement part. The drive manufacturers take back failed drives with “no questions asked” and send free replacements. If you figure that storage resellers, such as NetApp and EMC, tack on a three-year support fee, a petabyte of Backblaze storage costs less than their support contract alone. A chart below takes all of our experience into account and shows what it costs to own and maintain a Petabyte of storage for three years:


In the chart above, the economies of scale only kick in if you really do need to store a full petabyte or more. For a small amount of data (a few terabytes), Amazon S3 could easily save money, but the Amazon option is clearly a dubious financial choice for a company with large, multi-petabyte storage needs.

Final Thoughts

The Backblaze storage pod is just one building block in making a cloud storage service. If all you need is cheap storage, it may suffice on its own. If you need to build a reliable, redundant, monitored storage system, you’ve got more work ahead of you. At Backblaze, the software that manages and monitors the cloud service is proprietary technology we’ve developed over the years.
We offer our storage pod design free of any licensing or any future claims of ownership. Anybody is allowed to use and improve upon it. You may build your own cloud system and use the Backblaze storage pod as part of your solution. The steps to assemble a storage pod, including diagrams, can be found on our original blog post, and an updated list of parts is provided below in Appendix A. We don’t sell the design, so we don’t provide support or a warranty for people who build their own. To all of those builders who take up the challenge, we’d love to hear from you and welcome any insights you provide about the experience. And please send us a photo of your new 135 Terabyte pod.

Appendix A – Price List:


Item – Qty x Unit Price = Total

3 Terabyte Drives – Hitachi 3TB 5400 RPM HDS5C3030ALA630: 45 x $120.00 = $5,400
4U Custom Case (available in quantities of 1 from Protocase for $875; 3D design linked): 1 x $350 = $350
760 Watt Power Supply – Zippy PSM-5760: 2 x $270 = $540
Port Multiplier Backplanes – CFI Group CFI-B53PM 5-Port Backplane (SiI3726), available in quantities of 9 for $47: 9 x $41 = $369
Intel i3 540 3.06 GHz CPU: 1 x $110 = $110
4-Port PCIe SATA II Card – Syba SY-PEX40008 PCI Express SATA II 4-Port RAID Controller Card: 3 x $50 = $150
Motherboard – Supermicro MBD-X8SIL-F-B: 1 x $154 = $154
Case Fan – Mechatronics G1238M(OR E)12B1-FSR 12V 3-Wire Fan: 6 x $12 = $70
8GB DDR3 RAM – Crucial CT25672BA1339 2GB DDR3 PC3-10600 (4 x 2GB = 8GB total): 2 x $58 = $116
160 GB Boot Drive – Western Digital Caviar Blue WD1600AAJS 160GB 7200 RPM: 1 x $39 = $39
On/Off Switch – FrozenCPU ele-302 Bulgin Vandal Momentary LED Power Switch, 12″ 2-pin: 1 x $30 = $30
SATA II Cable – Newegg GC36AKM12 3-foot SATA cable: 9 x $2 = $18
Nylon Backplane Standoffs – Fastener SuperStore 1/4″ Round Nylon Standoffs, Female/Female, 4-40 x 3/4″: 72 x $0.18 = $13
HD Anti-Vibration Sleeves – Aero Rubber Co. 3.0 x .500 inch EPDM (0.03″ wall): 45 x $0.23 = $10
Power Supply Vibration Dampener – Vantec VDK-PSU: 2 x $4.50 = $9
Fan Mount (front) – Acoustic Ultra Soft Anti-Vibration Fan Mount AFM02: 12 x $0.18 = $2
Fan Mount (middle) – Acoustic Ultra Soft Anti-Vibration Fan Mount AFM03: 12 x $0.18 = $2
Nylon Screws – Small Parts MPN-0440-06P-C Nylon Pan Head Phillips 4-40 x 3/8″: 72 x $0.02 = $1
Foam Rubber Pad – House of Foam 16″ x 17″ x 1/8″ Foam Rubber Pad: 1 x $1 = $1

TOTAL: $7,384



Custom wiring harnesses for PSU1 and PSU2 (the Zippy power supplies):
See detailed wiring harness diagrams.
 

Power Supply Wiring Harness

SATA Chipsets
SiI3726 on each port multiplier backplane to attach five drives to one SATA port.
SiI3124 on three PCIe SATA cards. Each PCIe card has four SATA ports on it, although we only use three of the four ports.

NAS devices and Linux RAID

http://www.nber.org/sys-admin/linux-nas-raid.html

The NBER has several file stores, including proprietary boxes from Netapp, semi-proprietary NAS boxes from Excel-Meridian and Dynamic Network Factory (DNF) based on Linux (with proprietary MVD or Storbank software added), and home-brewed Linux software RAID boxes based on stock Red Hat distributions and inexpensive Promise IDE (non-RAID) controllers. Along the way we have learned some simple lessons that don't seem to be widely appreciated. Here are a few.

How high can you fill it?

The two NAS devices from Netapp run contentedly at 99% of "capacity" without a noticeable slowdown when reading or writing. Of course, capacity doesn't include the 20% reserved for snapshots or the 10% for "overhead". Our two boxes based on Mountain View Data software and the one from ECCS begin to slow down seriously at about 70% of capacity if snapshots are enabled. The slowdown can be extreme - several orders of magnitude. The problem is most noticeable if 4 or 5 users are trying to write simultaneously, and it looks like a crash to the users. Sometimes I/O operations will fall to zero for several minutes, then resume. Our homebrew Linux systems don't have snapshots, but they work well even at high levels of space utilization, even with overhead reduced to 5% via tunefs.

Can you use whole drive partitions?

We got in the habit of using 3 or 4 drives in a RAID 5 array with the entire drive used for the RAID partition. Then we noticed that just because two drives are the same brand, model and rev level doesn't mean they have the same number of sectors. At the initial install this won't matter - the Linux md software will set the partition size to the minimum of the member drive sizes. The problem won't appear until a drive needs to be replaced, and then you need to find a drive with at least as many sectors as the remaining good drives. For example, out of six Maxtor 160s delivered in the same shipment, four had 19,929 cylinders and two had 19,930 cylinders, a difference of about six megabytes. Of course you can't tell which size the store has in stock, so it becomes difficult to replace a drive without saving and restoring the entire array - or buying the next larger size of drive. 3ware controllers automatically round down in a process they call "drive coercion".
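One way to avoid that surprise is to record the exact size of every array member when the array is built and to check any replacement against it before adding it. A small sketch of such a check, assuming Linux and the standard blockdev utility (the device names are examples only):

    # Sketch: confirm a replacement drive is at least as large as the surviving RAID members,
    # using "blockdev --getsize64" (reports the device size in bytes).
    import subprocess

    def size_bytes(device):
        out = subprocess.run(["blockdev", "--getsize64", device],
                             capture_output=True, text=True, check=True).stdout
        return int(out.strip())

    members = ["/dev/sdb", "/dev/sdc", "/dev/sdd"]   # surviving array members (examples)
    replacement = "/dev/sde"                          # candidate replacement (example)

    smallest = min(size_bytes(d) for d in members)
    if size_bytes(replacement) >= smallest:
        print("replacement is large enough to join the array")
    else:
        print("replacement is", smallest - size_bytes(replacement), "bytes too small")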

How easy is it to upgrade? (software)

Tech support for the Excel Meridian box required us to do a backup/restore of the data for an OS upgrade. That turns 20 minutes of scheduled downtime into several days. I can only assume the motivation was to discourage the upgrade.

Does it work with NIS?

All NAS appliances promise to work with NIS, but only the Netapp and ECCS recognized netgroups. The MVD based devices did not.

Can you get under the GUI?

As you might expect, the failure to recognize netgroups was a GUI problem in the semi-proprietary Linux based systems. If one "looked under the hood" at the Linux code that actually ran the NAS, it was easy to specify netgroups in /etc/exports. Both Excel Meridian and Dynamic Network Factory threatened to withdraw all support if we did so, however.

How do you access snapshots?

On the Netapp, every directory has a subdirectory ".snapshot", and within that directory are subdirectories for snapshots of the prior state - "nightly.0", "nightly.1", etc. This is very convenient if you just want yesterday's file; however, if you are doing a backup that crosses midnight you have the problem that "nightly.0" changes its name to "nightly.1" right in the middle of your run. That is a nuisance and will cause some software to choke. You can create a snapshot with an indefinite life over the telnet/rsh interface, but we didn't want to put plaintext passwords on the LAN, so we use the weekly snapshot for backups. [Chris Madden has a solution to this problem - shown in the letter appended at the end of this document.]
The ECCS box puts the snapshots in a separately exported filesystem named after the month and day it was created. This avoids the name-changing problem, but means backup software can't be given an unchanging name for the backup source.
The FreeBSD version of snapshots lets the system administrator place and name the snapshots to his liking - an advantage of a slightly lower-level user interface. However, we don't use snapshots in FreeBSD because they take so long - five times as long as fsck on our system.
An early version of the DNF Storbank system exported the snapshots as SMB filesystems only, even if the current filesystem was exported as NFS. That makes sure even system administrators will have trouble accessing the snapshots. Later, NFS exports of snapshots were added at our request. This is the only time any NAS vendor was willing to learn anything from us.

How many snapshots can you have? How long can you keep them?

For some reason vendors think they know how many snapshots you will want, and how long you will want to keep them. Earlier versions of the Netapp restricted us to 21 snapshots, but that has been increased to 32. Netapp keeps a bitmap with one bit per block of storage per snapshot, so a limit of 32 is not completely arbitrary. MVD based systems have license surcharges for more than 5 snapshots, with a maximum of 32. Storbank was also limited to 32. Some of our filesystems change very slowly, so older snapshots would not be a burden in terms of inodes or space, but we are constrained by these policies.

How long will it take to get a replacement drive?

With the homebrew filestores, we can order a drive for next day delivery 6 days a week, or buy one locally on any day. Of course the other vendors have service plans with 24 hour service, and with most NAS boxes the drives are commodity items anyway - you could replace them with store-bought drives if the vendor was uncooperative.
The Netapp nearly requires drives purchased from Netapp. Once we were a bit shocked when the "next business day" for Netapp to replace a drive that failed on June 30th was July 6th. The explanation? It was past noon on the 30th. The 1st and 2nd were a weekend. The 3rd was the day before the 4th and everyone had been given an extra day off. The 4th was of course a holiday, and the 5th was the earliest ship date, for delivery on the 6th. They offered (and we accepted) same-day delivery for a $500 surcharge. They called this "sudden" service.

Why do drive failures come in pairs?

Most of the drives in our NAS boxes and drive arrays claim an MTBF of 500,000 hours. That's about 2% per year. With three drives, the chance of at least one failing in a year is a little less than 6% (1-(1-.02)^3). Our experience is that such numbers are at least a reasonable approximation of reality (but see Schroeder and Gibson, 2007). (We especially like the 5400 RPM Maxtor 5A300J0 300GB drives for long life.)
Suppose you have three drives in a RAID 5 array. If it takes 24 hours to replace and reconstruct a failed drive, one is tempted to calculate that the chance of a second drive failing before full redundancy is established is about .02/365, or roughly five in a hundred thousand. The total probability of a double failure then seems like it should be about 6 in a million per year.
Our double failure rate is roughly four orders of magnitude worse than that - the majority of single drive failures are followed by a second drive failure before redundancy is established. This prevents rebuilding the array with a new drive replacing the original failed drive; however, you can probably recover most files if you stay in degraded mode and copy the files to a different location. It isn't that failures are correlated because the drives are from the same batch, or the controller is at fault, or the environment is bad (a common electrical spike or heat problem). The fault lies with the Linux md driver, which stops rebuilding parity after a drive failure at the first point it encounters an uncorrectable read error on the remaining "good" drives. Of course, with two drives unavailable there isn't an unambiguous reconstruction of the bad sector, so it might be best to go to the backups instead of continuing. At least that is apparently the reason for the decision.
Alternatively, if the sector in question is still readable on the first failed drive (even if other sectors on it are not), it should be possible to fully recover all the data with a high degree of confidence even if a second drive fails later. Since that is far from an unusual situation (a drive will be failed for a single uncorrectable error even if further reads are possible on other sectors), it isn't clear to us why that isn't done. Even if that sector isn't readable, logging the bad block, writing zeroes to the target, and going on might be better than simply giving up.
A single unreadable sector isn't unusual among the hundreds of millions of sectors on a modern drive. If the sector has never been written to, there is no occasion for the drive electronics or the OS to even know it is bad. If the OS tried to write to it, the drive would automatically remap the sector and no damage would be done - not even a log entry. But that one bad sector will render the entire array unrecoverable, no matter where on the disk it is, if one other drive has already been failed.
Let's repeat the reliability calculation with our new knowledge of the situation. In our experience perhaps half of drives have at least one unreadable sector in the first year. Again assume a 6 percent chance of a single failure. The chance of at least one of the remaining two drives having a bad sector is 75% (1-(1-.5)^2). So the RAID 5 failure rate is about 4.5%/year, which is .5% MORE than the 4% failure rate one would expect from a two-drive RAID 0 of the same capacity. Alternatively, if you just had two drives with a partition on each and no RAID of any kind, the chance of a failure would still be 4%/year, but with only half the data lost per incident - considerably better than the RAID 5 can even hope for under the current reconstruction policy, even with the most expensive hardware.
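The arithmetic in the last few paragraphs is easy to lose track of, so here it is collected in one place. The sketch simply restates the assumptions made above (2% annual per-drive failure rate, a 50% chance of at least one unreadable sector, a three-drive RAID 5, a one-day rebuild, and a rebuild that aborts on the first unreadable sector):

    # The reliability arithmetic from the text, collected in one place.
    p_fail = 0.02        # assumed per-drive annual failure probability (~500,000 hour MTBF)
    p_bad_sector = 0.5   # assumed chance a drive has at least one unreadable sector
    n = 3                # drives in the RAID 5 set
    rebuild_days = 1     # time to replace and reconstruct a failed drive

    p_single = 1 - (1 - p_fail) ** n                          # ~5.9% per year

    # Naive double failure: a second drive dies during the one-day rebuild window.
    p_second = 1 - (1 - p_fail * rebuild_days / 365) ** (n - 1)
    p_double_naive = p_single * p_second                      # ~6 in a million per year

    # Revised estimate: the rebuild aborts if either surviving drive has a bad sector.
    p_surviving_bad = 1 - (1 - p_bad_sector) ** (n - 1)       # 75%
    p_array_loss = p_single * p_surviving_bad                 # ~4.4% per year

    p_raid0 = 1 - (1 - p_fail) ** 2                           # ~4% for a two-drive RAID 0

    for name, p in [("single drive failure", p_single),
                    ("double failure (naive)", p_double_naive),
                    ("array loss (abort-on-bad-sector policy)", p_array_loss),
                    ("two-drive RAID 0 loss", p_raid0)]:
        print("%-42s %.6f per year" % (name, p))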
We don't know what the reconstruction policy is for other RAID controllers, drivers or NAS devices. None of the boxes we bought acknowledged this "gotcha", but none promised to avoid it either. We assume Netapp and ECCS have this under control, since we have had several single drive failures on those devices with no difficulty resyncing. We had not had a single drive failure in the MVD based boxes when this was first written, so we didn't know what they would do. [Since then we have had such failures, and they were able to reconstruct the failed drive, but we don't know if they could always do so.]
Some mitigation of the danger is possible. You could read and write the entire drive surface periodically, and replace any drive with even a single uncorrectable block visible. A daemon, smartd, is available for Linux that will scan the disks in the background for errors and report them. We had been running it, but ignored errors on unwritten sectors, because we were used to such errors disappearing when the sector was written (and the bad sector remapped).
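On Linux software RAID, the periodic surface read described above is available out of the box: writing "check" to an md array's sync_action file makes the kernel read every member in the background and count mismatches. A minimal sketch, assuming an array named md0 and root privileges:

    # Sketch: start a full-surface read ("check") of a Linux md array and report mismatches.
    import time

    ARRAY = "md0"                          # adjust for your system
    SYSFS = "/sys/block/%s/md" % ARRAY

    with open(SYSFS + "/sync_action", "w") as f:
        f.write("check\n")                 # read every sector of every member drive

    while True:
        with open(SYSFS + "/sync_action") as f:
            if f.read().strip() == "idle": # the check has finished
                break
        time.sleep(60)

    with open(SYSFS + "/mismatch_cnt") as f:
        print("mismatched blocks found:", f.read().strip())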
Our inclination was to shift to a recent 3ware controller, which we understand has a "continue on error" rebuild policy available as an option in the array setup. But we would really like to know more about just what that means. What do the apparently similar RAID controllers from Mylex, LSI Logic and Adaptec do about this? A look at their web sites reveals no information. [Note added 10/2009: For some time now we have stuck with software RAID, because it makes the drives pretty much hardware independent and there doesn't appear to be much of a performance loss.]

Can you add drives?

With Netapp and many other proprietary systems, the answer is yes, provided you have shelves and buy the drives from the OEM. Each filer unit has a software-programmed maximum number of bytes that it will support. In our case the limit was very constraining for an otherwise lightly used unit. The Netapp allows you to add drives to a volume without a backup/restore or any downtime, a remarkable ability they don't advertise. But Netapp has a couple of ways to discourage you from buying third-party drives. The first is that you can't get drive caddies. The disk shelf you buy appears to have drive caddies for 14 drives, but the empty caddies are not drilled to accept drives.
There also seem to be tests in the OS software that detect and refuse to mount drives not purchased from Netapp. There are a few postings on the web from users who claim to have used non-Netapp drives (and Netapp at one time posted a list of compatible drives), but in our experience, even drives from that list marked with the appropriate firmware revision level and sector size were rejected as unsupported. Recently a vendor modified a third party drive for us so that the Netapp would accept it. A comparative bargain at $110. We don't know what they did.
For Linux/FreeBSD based boxes, there are few or no compatibility issues. But expansion capability is more than just a count of ports and slots. The big question for us is "Will the system still boot from the original boot drive after adding new drives?" This will no doubt depend on the interaction of the motherboard and OS, and has given us no end of trouble. In our experience with cheap 2 channel controllers such as Promise or SIIG, you can add new drives (but not new controllers) without interfering with the boot order, especially if one disables the label feature of the fstab, and specifies drives by letter.
Our experience adding 4 drives to a 3ware 9500-s controller was less pleasant. This was probably "our fault" for installing the OS on the first RAID group composed of drives on ports 1,3,5 and 7 instead of on ports 0,2,4 and 6 as a knowledgebase article suggested. But it was a hard won lesson and we were unable to install additional drives on the remaining ports without disrupting the boot. Later experience with a 3ware 9550 controller was marred by incorrect instructions from 3ware for adding driver support to the FreeBSD kernel.
We have always used drives from the local Micro-Center in our IDE and SATA arrays, but this gives us pause.

What if non-drive hardware fails?

Netapp has a thriving aftermarket, so you could be fairly confident of finding replacement shelves and "heads" even if Netapp decided a product was sufficiently non-strategic to EOL it. With some of the other vendors, you'd be looking for generic motherboards, cables, etc. That also gives me some confidence I could repair a system even if the vendor didn't want to help.
Both MVD devices use a standard XFS filesystem, and a readily available 3ware controller, so we have some confidence we could recover the data from the drives alone, without the rest of the hardware. This is an important insurance policy.
I'd be very reluctant to put data on a proprietary system with no aftermarket support. Every vendor is in constant danger of being acquired, divested or turned around. When that happens, you and your box are no longer "strategic", and, contract or not, requests for help are likely to be brushed aside. Even with an enforceable contract, the vendor can easily discourage calls for service by proposing solutions that don't save your data.
We recently received an email message from Netapp dated July 14th, 2008, announcing that spare parts for their S300 product would not be available for purchase after August 14th of the same year. So don't imagine that "big company" ensures parts availability, only competitive supply does that.

Are SCSI/FC drives really better?

They probably are; the Fibre Channel based Netapps have had the fewest problems. But none of the problems mentioned above seem to be in any way related to the controllers or drives - all are really policy decisions made by implementers and vendors who don't see things quite our way.

Are the drives kept cool?

We have assiduously avoided the many systems that appear to allow little or no air to flow around the drives. Vendors claim these systems meet the drive manufacturer's heat specification, and perhaps they do. But drives kept 10 or 20 degrees cooler than the maximum specified will last a lot longer, even if they do take up a bit more space.
The DNF box comes with 14 mostly tiny fans, and the failure of any one of those fans generally endangers the data on two adjacent disks. Five of those fans are accessible for replacement only with considerable disassembly. Note that two fans next to each other are not redundant: a failed fan acts as an alternate source of feed air for the working fan, and the breeze then bypasses the electronic components.
In the boxes we construct ourselves, we have used a single 120mm fan for cooling (plus the CPU fan), which increases the (fan) MTBF by an order of magnitude and keeps the drives much cooler. The CMOS setup on inexpensive motherboards seems always to include a high-temperature shutoff (based on the CPU temperature), which we set at the lowest level offered. This shuts down power to all internal components should the CPU or case fans fail. For some reason our more deluxe HP compute servers don't have this function.
In April 2006 we had an air conditioning failure, and the room temperature reached over 100 degrees before we got to turn systems off. Remarkably, none of the NAS boxes shut themselves off at that temperature. The NetApp shut down the head, but it did not shut down the drives - a remarkable oversight, to my mind. In a day when every $500 desktop has an overheating shutdown in the CMOS setup, it is distressing that $10,000-$100,000 storage servers don't do as well.

Is hotplugging really necessary?

In the disk storage business there are two possible interpretations of "reliable". One is never going offline; the other is never losing data. The former requires hot-swapping hardware, which introduces new failure possibilities and aggravates the latter. Our experience is that caddy problems occur at a rate in the neighborhood of .5%/year/caddy - less than disk problems, but still significant. About half the time all that is necessary is to remove and reinsert the drive, but this is still disruptive. We are most interested in keeping user data safe, and can tolerate minor downtime if necessary. Therefore in our home-made boxes we do not use caddies, but mount the drives directly in the bays.

Why is it beeping?

All the systems have a large number of monitored fans and thermocouples. If a fan slows down, or the temperature gets too hot, they start beeping. In a crowded server room you won't be able to tell which system is beeping, so some visual indicator is essential - but one is not generally provided. Furthermore, with as many as 20 fans, some not easily inspected, it would be nice to know which fan was spinning a tad too slowly, or which chip was a tad too hot. That isn't available either.

Interesting Reading

  1. A study of data loss on 1.5 million drives in deployed Netapp systems.
  2. Also from Netapp, a study of non-drive storage failures.
  3. A comparison of data protection mechanisms, with brand names.

Comments Welcome

If I've said anything wrong - please write me (address below) and I will do my best to post corrected information. I'd also be interested in other lessons.

Daniel Feenberg
feenberg isat nber dotte org

with thanks to Mohan Ramanajan

last update 22 April 2006 then 4 December 2005 then 9 December 2005 then 21 February 2006 then 28 April 2006 then 26 September 2006 then 12 March 2008 then 15 July 2008
Since the original posting I have received the following very informative email message about the Netapp. Quotes from the page above are between rows of equal signs, the rest of the message is from Chris Madden:
On Sat, 26 Nov 2005, Chris Madden wrote:

 Hi,

 I was just reading your article
 (http://www.nber.org/sys-admin/linux-nas-raid.html) and thought I'd give
 some comments since I know NetApp pretty well.

 ===============================================================
 How do you access snapshots?
 On the Netapp, every directory has a subdirectory ".snapshot" and within
 that directory are subdirectories for snapshots of the prior state -
 "nightly.0", "nightly.1", etc. This is very convinient if you just want
 yesterday's file, however if you are doing a backup that crosses midnight
 you have the problem that "nightly.0" changes its name to "nightly.1" right
 in the middle of your run. That is a nuisnace and will cause some software
 to choke. You can create a snapshot with indefinite life over the telnet/rsh
 interface, but we didn't want to put plaintext passwords on the LAN, so we
 use the weekly snapshot for backups.
 ===============================================================

 My comments...
 I think you can avoid this by mounting the relevant snapshot directory
 itself (like "/vol/data/.snapshot/nightly.0") at a mount point and then
 backing up the mount point.

 So like this and then backup /mnt/test:

 bnl-chris:/vol/vol0   25052408   3431728  21620680  14% /mnt/bnl-chris-vol0
 bnl-chris:/vol/vol0/.snapshot/test.1
                        6263100      3564   6259536   1% /mnt/test

 ===============================================================
 How many snapshots can you have? How long can you keep them?
 For some reason vendors think they know how many snapshots you will want,
 and how long you will want them. So Netapp restricts you to 21 snapshots.
 MVD based systems have license surcharges for more than 5 snapshots, with a
 maximum of 32. Storbank was also limited to 32. Some of our filesystems
 change very slowly, so that older snapshots would not be a burden in terms
 of inode or space, but we are constrained by these policies.
 ===============================================================

 My comments...
 We use NetApp snapshots for vaulting purposes and have over 200 on a volume
 and I think the limit is 255.  Maybe you are running an older ONTAP version
 and could upgrade for support of more snapshots?



 ===============================================================

 We don't know what the reconstruction policy is for other raid controllers,
 drivers or NAS devices. None of the boxes we bought acknowledged this
 "gotcha" but none promised to avoid it either. We assume Netapp and ECCS
 have this under control, since we have had several single drive failures on
 those devices with no difficulty resyncing. We have not had a single drive
 failure yet in the MVD based boxes, so we really don't know what they will
 do.

 ===============================================================

 My comments...

 NetApp does a few things that will make it better than the linux storage
 stack.  First thing that NetApp does is sector checksumming.  So every
 sector written has a checksum also written.  The technique varies based on
 512byte/sector (ATA disks & older FC disks) vs. 520 byte/sector (newer FC
 disks)  but on all disk types this is done.  The second thing NetApp does
 is constant media scrubbing.  This is to say that the NetApp issues a
 SCSI verify request to every sector on every drive if the drive is idle.
 Since a verify request doesn't actually read data there is almost no cost
 for this operation.  If the verify fails however it's an indication that
 there is a drive problem and can be monitored by the system more closely
 with too many of these issuing a proactive drive failure.  The third thing
 that is done once per week is volume scrubbing (or actually RAID scrubbing).
 So this is the process of reading each stripe and comparing with parity. If
 something doesn't agree then using sector checksums and parity data the
 incorrect block can be determined and fixed.  There's also a feature called
 "rapid raid recovery" that in the case of disk that isn't dead yet (but
 shows signs of dieing) where it is copied, block for block, to a spare disk
 before it is failed out.  This avoids the performance cost of a raid
 recovery because there is 1 read and 1 write vs. 7 reads and 1 write (raid-4
 with 8 disks RG).  I agree that the biggest cause of raid reconstruct
 failures is unrecoverable read error on the surviving disks but with the
 above techniques the chance of error can be minimized by these proactive
 (media and vol scrubbing) tasks.  There's also RAID-DP which gives you two
 parity disks per raidgroup and even allows for that unrecoverable read error
 during reconstruct...

 Let me know if you have any questions!

 Cheers,

 Chris Madden


And the following two messages deprecating RAID-5 are a little harsh if, like us, you do long sequential I/O. But if you have a random-access database they are quite pertinent. I only hope that implementers of the RAID drivers for Linux and FreeBSD don't use them as an excuse to leave the RAID-5 reconstruction procedure so broken. Mark's message is marked with angle brackets; the unmarked section is my message to him.
On Sat, 10 Dec 2005, Morris, Mark wrote:

> Hi,
> 
> I enjoyed reading your web page about disk failure modes.  I am amazed
> by how perceptive you are about the failure rate issues caused by 
> unrecoverable bit errors.  This is a topic that is rarely discussed by
> RAID vendors or even academic researchers.   There are emerging double
> protection schemes - like RAID-6 - that aim to cover this failure mode,
> but they suffer from even poorer write performance than RAID-5.    The



I wonder about poor RAID-5 performance. I understand that for random
access the system has to do extra reads and writes to update one
sector on one drive, but when doing sequential writes, can't the drives
be arranged so that, for instance if I am writing a large number of
sequential sectors, the parity calculation can be done in memory
using only the sectors being written? I am thinking of how the logical
sectors are laid out across the drives.

Consider 3 drives in a RAID 5 array, a, b and c, each with 4 sectors:

    a    b    c
    1*   2    3
    4    5*   6
    7    8    9*
   10*  11   12

The * marks the parity sector for the horizontal stripe.

Now if I want to write 2 sectors in a row, such as 4 and 6, I can do the
entire calculation of parity in memory before writing any sectors, and
in the end I only need to write 3 sectors for 2 sectors' worth of
storage, rather than paying the read-modify-write overhead most people
describe. Now it becomes a speed advantage to have many drives in a
set, rather than a disadvantage.

I realize that a database might not write sectors in sequence, but we do
a lot of sequential writes, and always wonder why RAID 5 is slow for us.
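To make the full-stripe point concrete, here is a toy sketch in Python: the parity block is just the XOR of the data blocks, so when a whole stripe's worth of data is in memory the parity can be computed there and written alongside the data, with no reads of old data or old parity at all:

    # Toy illustration of a full-stripe RAID 5 write: parity is the XOR of the data blocks.
    def parity(blocks):
        out = bytearray(len(blocks[0]))
        for block in blocks:
            for i, b in enumerate(block):
                out[i] ^= b
        return bytes(out)

    # Two data sectors being written (say, logical sectors 4 and 6 in the layout above).
    sector4 = bytes([0x11] * 16)
    sector6 = bytes([0x2A] * 16)
    p = parity([sector4, sector6])        # new contents of parity sector 5

    # Three writes total (two data + one parity), and any single lost block is recoverable:
    assert parity([sector6, p]) == sector4
    assert parity([sector4, p]) == sector6
    print("full stripe written with 3 writes and 0 reads")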

> disk vendors specify the unrecoverable bit error rate - usually 1 in
> 10**14 bits read for PC class disks, and 1 in 10**15 for "enterprise"

We don't get results nearly that good. A 500 gig drive is only 5*10**12
bits, or 5% of 10**14 - about 10 times better than what we see. Maybe we
could do that well if we paid special attention to smartd.

> class disks.  But still that doesn't stop many RAID-5 vendors 
> recommending wide stripes to achieve lower cost of redundancy.  But as

> you mention, the wider the RAID config, the more data required to 
> reconstruct and therefore the more likely to hit the second failure 
> during the rebuild.
> 
> Mark
> 


From jm160021 at ncr dotte com Sat Dec 10 15:58:30 2005
Date: Sat, 10 Dec 2005 13:56:35 -0500
From: "Morris, Mark" 
To: Daniel Feenberg 
Subject: RE: your web page about disk failure modes

Sure, you can add it to your web page including the new stuff below.  I
would also like to refer you to one of my favorite web sites:
www.baarf.com  - this is a web site dedicated to raging against RAID-5.
You might consider contributing your observations there as well, because
I don't think I've seen anyone cover this issue of unrecoverable bit
errors.   

Regarding your RAID-5 question, yes full stripe writes avoid a lot of
write overhead in RAID-5.  In fact many RAID vendors tout this in their
literature.  Unfortunately many applications don't have this type of
workload pattern.  To get the array to use full stripe writes, I think
you either have to write in very large chunks that span a whole stripe
(considering alignment), or you have to have write back caching enabled
so that the array can piece the full stripe write together.

For random writes (that don't cross a stripe boundary), RAID-5 takes 4
IO's:  read old parity, read old data, write new parity, write new data.

My company vends large massively parallel databases.  Because of both
poor write performance and a host of other issues with RAID-5 (including
the bit error problem) we pretty much switched our entire product line
to RAID-1 in 1999 at the transition from 4 GB to 9 GB disks.  Sure there
were always die hards that still configured their systems in RAID-5 but
that was almost always driven by some tightwad in purchasing who didn't
understand the issues.

Mark


And here is something I haven't experienced, but we can all learn from others:
From rreynolds@postcix.co.uk Mon Jun 26 11:43:55 2006
Date: Mon, 26 Jun 2006 16:43:16 +0100
From: Rupert Reynolds
To: Daniel Feenberg 
Subject: Re: To Daniel Feenberg re linux-nas-raid web page


On Mon, 26 Jun 2006, Rupert Reynolds wrote:

I read your linux-nas-raid page at nber.org with interest, many thanks.

One small point is that you mention removing and re-inserting a hot 
swappable drive. I came across a gotcha with this, which may affect other 
hardware.

On an Intel branded server box I had to nurse for a while, if a drive is
removed and re-inserted /before/ the RAID controller spots the missing
drive fault (which can take a minute or two, I'm told) then the controller 
continues writing to the drive as part of the array without realising that
some of the data (in cache when it was removed) was never written. This
can damage the array and prevent rebuilding, obviously depending on which
drive actually fails next.

Perhaps this is common knowledge: I just thought I'd mention it in case 
not:-)

Rupert Reynolds

Saturday, July 16, 2011

iOS Programming Notes (repost)

Neil Ferguson is the developer of the iPhone game "Virus Strike," and he has distilled the process of building an iPhone (that is, iOS) game into ten steps. Ferguson currently works at a software startup in London; although he is a veteran programmer, he believes that building a successful iOS game may not require much programming experience at all. Here are his lessons.

(1) An original idea

I only came up with the idea for Virus Strike about a year ago. I had been playing a physics-based game called Line Rider, along with flight-control games, and I thought a game that used a physics engine and let you draw a line to match three identical items would be fun. So I spent a few days going through every puzzle game in the App Store to see whether anything like it already existed, and I didn't find a single one. That's when I realized I had thought of the concept first, or at least that no one had built it yet, so why not develop it myself? That is how Virus Strike began.

Translator's note: the idea doesn't have to be earth-shattering; a small creative twist can make an outstanding product. Most of the time you won't be the first, but with a little extra care you can be the best.

(2) Use the right tools

If you are a beginning programmer, try a drag-and-drop game creation tool such as GameSalad. It lets you build a game with very little programming knowledge, and it is designed specifically for the iPhone. You may also find it easier to write a game in Flash than in Objective-C (the standard language for iPhone development); Flash games can now be converted to run on the iPhone, and there are plenty of good Flash game development books for beginners.

If you do develop in Objective-C, use a game framework; it makes the game code much easier to write. I used Cocos2D, an excellent free, open-source framework for iPhone game development. It also has an integrated physics engine, which saved me a lot of work on Virus Strike.

(3) Take advantage of free tutorials

Virus Strike was my first iPhone game, and I had never written Objective-C before, so I learned a great deal while building it. Online tutorials helped enormously: Ray Wenderlich's site, www.raywenderlich.com, offers many free tutorials on iOS programming and was extremely useful, and Apple's official developer site, developer.apple.com, has plenty of resources as well.

(4) Outsource what you cannot do yourself

If you are not a programmer, it may be best to hand your initial idea to someone with experience. For example, you can post your app idea on odesk.com and programmers will apply to take on the project. Likewise, if there is one part of your app you cannot get done, outsource that piece. Just remember to give the programmer as much information and detail as possible, so the finished app is more likely to match what you wanted.

(5) Think about iPhone-specific features

The most successful games in the App Store are the ones that fit the iPhone and are suited to being played on it. In Virus Strike I combined classic match-three, falling-block gameplay with the iPhone's touchscreen and accelerometer: you draw a line on the screen to guide the viruses, and when you tilt the phone the viruses slide accordingly. When designing your game, think about how to work the iPhone's distinctive controls into the gameplay, and be as original as you can.

(6) Make sure the game is challenging

Once I had the basic technical pieces working (line drawing, color matching, and the physics engine), the biggest problem was turning the original concept into a game that was challenging yet quick for players to pick up.

For a game, the length of each play session and the level of challenge matter enormously. The game should get progressively harder while staying fair: players should feel they lost a round because of their own mistakes. Players also need to feel they are making progress; as they advance through the levels, give them a sense of achievement, whether through higher scores or other rewards.

(7) Free sound effects

All of the sound effects in my game came from freesound.org. Choosing them is a tedious process, so I suggest asking other people for their opinions; a sound you like may annoy someone else. I also used a free program, Audacity, to edit the sound effects so they fit the game better.

(8) Get feedback

Don't assume the game is finished at this point. Until you get feedback from other people, you won't really know how many of them find your game challenging, interesting, and worth playing, or whether people will play it at all.

Don't expect honest feedback from your friends, and don't demonstrate how to play the game. Let people try it on their own, and ideally watch from the side to see how they play and where they run into trouble.

You can also easily find beta testers on iPhone forums who will give you feedback for free.

(9) Make a video

My testers made me realize that a tutorial video would help. I used ScreenFlow to make a one-minute video of the game, and after testing I added a one-page written description for people who open the game for the first time and skip the video.

A video is well worth the effort; it goes a long way toward making sure people understand how to play. It was also very useful for my wife Donna, who handled PR for Virus Strike: reporters could quickly watch the video online, so they knew how the game worked without having to play it themselves before writing about it, and it helped them decide they liked it before spending time on the download.

(10) Market your game

No matter how good your game is, if you don't market it, how will anyone find it in the App Store and download it? Be prepared to spend a lot of time on app review sites and other technology sites.

The press release my wife wrote when I launched Virus Strike worked remarkably well. You can envy me for having a journalist for a wife: she knows how to put together a good release and to include the kind of story angles that catch other reporters' attention. We paid PRMac a $20 distribution fee, which proved to be money well spent; the release spread across the web, and many sites reprinted it in full.