Monday, June 09, 2014

RAID++: Erasures aren't Errors

A previous piece in this series starts as quoted below the fold, raising the question: The Berkeley group in 1987 were very smart, and Leventhal in 2009 no less smart, so how did they both make the same fundamental attribution error? This isn't just a statistical "Type I" or "Type II" error, it's conflating and confusing completely different sources of data loss.
Current RAID schemes, and going back to the 1987/8 Patterson, Gibson, Katz RAID paper, make no distinction between transient and permanent failures: errors or dropouts versus failure. 
This is a faulty assumption in the 2009 "Triple Parity RAID" piece by Leventhal, an author of ZFS, a proven production quality sub-system, and not just a commentator like myself.
The second major error is relying on the 2005 "Kryder's Law" 40%/year growth in capacity. Capacity growth had collapsed by 2009 and will probably not even meet 10%/year from now on, because the market can't fund the new fabrication plants needed.
The last major error by Leventhal was that 3.5 inch disks would remain the preferred Enterprise Storage. 2.5 inch drives are the only logical long-term choice of form-factor.
Storage has evolved considerably since 1987:
  • IBM sold drives for $25,000/GB ($40,000 now), allowing EMC to sell their Symmetrix 4200 around $10,000/GB, using 5.25" Seagate drives that sold for ~$5,000/GB.
    • EMC stuffed 24GB into a single rack, along with 256MB of RAM and some CPU's
    • The 1% cache was important in achieving the high IO/sec
    • The real competitive advantage was size and power (and cooling). One fifth the floorspace and one fifth the cooling.
  • disks spun at 3,600 to 5,400 RPM
    • fast 5.25" disks achieved 12-15msec avg seek times,
    • commodity 3.5" drives achieved 25 - 30 msec
    • Raw IO/sec (IOPS) varied from 30-40 for slow drives to 60-80 for fast drives
    • For database applications, typically 4KB blocks, the read time, ~4msec, becomes significant.
      • fast drives take 16.5msec to 20 msec to complete a read: 50-60 IO/sec.
  • heads read data off drives at around 10Mbps
    • sectors were still 512 bytes (half a kilobyte) vs the 4KB used now.
  • SCSI was new and ran at 5MHz, 8-bits wide (40Mbps total, shared up and down).
  • The largest 3.5" disks were 100MB (Conner?) and 5.25" drives were 600MB - 1,000MB (1GB)
  • Scan times of whole drives, allowing for seeking between tracks, were near the raw read rate:
    • 2-3 minutes for a 100MB 3.5" drive
    • 15-20 minutes for a 1GB drive
    • For 1-1.25Gbps current drives of 1,000GB to 6,000GB, scan times are 6-30 hours.
  • MTBF was 40,000 hrs for server-grade 5.25" drives and ~25,000hrs for commodity 3.5" drives, vs the 120,000 hrs for the 3380.
  • I'm guessing that design life, for server rooms, was the same 4-5 years, compatible with the 40,000 hr MTBF of 5.25" drives (~5 years).
    • Commodity 3.5" drives were used in PC's with a low duty cycle and 25% power-on time.
  • The Bit Error Rate, BER, of drives was 10^-12 to 10^-14, now 10^-14 to 10^-16.
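
The IO/sec and scan-time figures in the list above follow from simple arithmetic. The sketch below reproduces it under stated assumptions: decimal units (1MB = 10^6 bytes), the function names are mine, and the scan times are raw lower bounds at the quoted head rates, with no allowance for seeks, track switches or controller overhead. Real scans take longer.

```python
MB = 10**6  # assumption: decimal megabytes
GB = 10**9  # assumption: decimal gigabytes


def iops_from_service_time(service_ms: float) -> float:
    """IO operations per second if every request takes service_ms milliseconds."""
    return 1000.0 / service_ms


def raw_scan_seconds(capacity_bytes: float, read_rate_bits_per_sec: float) -> float:
    """Lower bound on the time to read a whole drive at the raw head rate."""
    return capacity_bytes * 8 / read_rate_bits_per_sec


# Fast 1987 drive: 16.5 ms to 20 ms to complete a 4KB read, as quoted above.
for ms in (16.5, 20.0):
    print(f"{ms} ms per read -> {iops_from_service_time(ms):.0f} IO/sec")

# Raw scan times at the head rates quoted above.
for label, capacity, rate in (
    ("100MB at 10Mbps", 100 * MB, 10e6),
    ("1GB at 10Mbps", 1 * GB, 10e6),
    ("1,000GB at 1Gbps", 1000 * GB, 1e9),
    ("6,000GB at 1.25Gbps", 6000 * GB, 1.25e9),
):
    minutes = raw_scan_seconds(capacity, rate) / 60
    print(f"{label}: at least {minutes:.1f} minutes")
```

The gap between these raw bounds and the quoted scan times is the cost of everything other than streaming bits off the platter.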
The Berkeley group were very clear in their view: more, smaller drives were the future of RAID.
EMC, then StorTech, didn't deliver this vision: they stayed with the largest form factors available, and continued to do so. This Theory/Practice gap will be explored elsewhere.

For random IO, e.g. 4KB reads, fast drives could only achieve 4KB*40-50 IO/sec or 160-200KB/sec, under ¼ MB/sec.
Over a year (32M seconds), 5TB-6.26TB, 5-6x10^12 bytes, was theoretically possible, but not seen in practice.
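
As a rough check on that yearly ceiling, here is a minimal sketch. It assumes "4KB" means 4,000 bytes and uses the rounded 32M seconds per year from the text; with 4,096-byte blocks or a more exact year the totals shift by a few percent. The names are illustrative only.

```python
SECONDS_PER_YEAR = 32_000_000  # rounded figure used in the text
BLOCK_BYTES = 4_000            # assumption: "4KB" taken as 4,000 bytes


def bytes_per_year(iops: int, block_bytes: int = BLOCK_BYTES,
                   seconds: int = SECONDS_PER_YEAR) -> int:
    """Bytes moved in a year of non-stop random IO at the given rate."""
    return iops * block_bytes * seconds


for iops in (40, 50):
    rate_kb = iops * BLOCK_BYTES / 1000
    total_tb = bytes_per_year(iops) / 1e12
    print(f"{iops} IO/sec: {rate_kb:.0f}KB/sec, about {total_tb:.2f}TB/year")
```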

This upper limit assumes both 24/7 power-on and 100% duty-cycle, far in excess of what was demanded from drives. PC's and LAN's were yet to happen. ARPAnet existed, but links were typically around 9,600bps, with "fast" being 1Mbps. Mainframes and mini-computers were connected to networks of terminals and printers and only staffed and run when the business required, modulo the after-hours batch processing and at month-end, additional shifts.

Airline reservation systems were the only sites commonly running 24/7, and even then demand was linked to operational hours. Bookings only happened when offices were open. Systems like SABRE had been running long enough, starting with drums, to routinely triplicate drives, now called RAID-1 or mirroring, with three drives. This allowed them to "split" a drive from the live system for maintenance, backups or testing, while keeping dual drives running in production.

Another factor was the hang-over from batch processing and adapted COBOL programs. Even serial processing, the classic "Master File Update", didn't push disk drives hard. A reasonable guide for 100MB 3330's was to allow 3-4 hours for a serial scan of their contents. Even with 1KB blocks (100,000 per drive), they could average 10 IO/sec, vs the notional 40-50 IO/sec.

An average site would power-on drives 30%-40% (2500-3200hrs/year), with duty cycles of 15%-25%. A reasonable derating from the naive rating is 5%-10% actual utilisation.

For mainframes, the theoretical limit of 6x10^12 bytes of random IO per drive per year was cut to under one-tenth of that in practice: 5x10^11 bytes/year.
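
The derating above is just power-on fraction multiplied by duty cycle, applied to the theoretical ceiling. A minimal sketch, with function names of my own choosing and the percentages taken from the text:

```python
THEORETICAL_BYTES_PER_YEAR = 6e12  # upper limit quoted in the text


def utilisation(power_on_fraction: float, duty_cycle: float) -> float:
    """Fraction of the year a drive spends actually doing IO."""
    return power_on_fraction * duty_cycle


def derated_bytes_per_year(power_on_fraction: float, duty_cycle: float,
                           limit: float = THEORETICAL_BYTES_PER_YEAR) -> float:
    """Theoretical yearly IO scaled down by real-world utilisation."""
    return limit * utilisation(power_on_fraction, duty_cycle)


for power_on, duty in ((0.30, 0.15), (0.40, 0.25)):
    u = utilisation(power_on, duty)
    total = derated_bytes_per_year(power_on, duty)
    print(f"power-on {power_on:.0%}, duty cycle {duty:.0%}: "
          f"utilisation {u:.1%}, {total:.1e} bytes/year")
```

The two ends of the range come out at roughly 4.5% and 10% utilisation, which brackets the 5x10^11 bytes/year used in the text.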

In normal operations, drives were likely to suffer less than 1 error per year, caught by the internal CRC mechanisms.

The IBM 3380 drives, with 4 drives per cabinet and max 4 cabinets per string, for max 16 drives per controller, had lower BER's (10^-13 to 10^-14) and ~120,000 hrs MTBF. At 3,000 hrs/year operation, 16 drives would only see a drive failure every three years, much longer than the average stay of operators and junior Sys Progs. Most people didn't see a drive failure, though they did occur.
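
The failure interval for a string of drives can be estimated as below. This is a sketch that assumes independent drives with a constant failure rate, so the MTBF of the string is the drive MTBF divided by the drive count; the function name is mine.

```python
def years_between_failures(mtbf_hours: float, drives: int,
                           operating_hours_per_year: float) -> float:
    """Expected calendar years between drive failures across a group of drives.

    Assumes independent drives with a constant failure rate, so the group
    MTBF is the single-drive MTBF divided by the number of drives.
    """
    group_mtbf_hours = mtbf_hours / drives
    return group_mtbf_hours / operating_hours_per_year


# IBM 3380 string: 16 drives, ~120,000 hrs MTBF, 3,000 operating hrs/year.
interval = years_between_failures(120_000, 16, 3_000)
print(f"One drive failure roughly every {interval:.1f} years")
```

With the figures from the text this gives about 2.5 years, in line with the "every three years" above.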

In the days of removable disk packs (3330's), it was common to sequentially copy whole drives as backups. With the advent of sealed drives (3380/90) starting at ~$200,000/cabinet, this option was no longer available.

A string of 16 drives might read 8TB, 8x10^12 bytes, per year. Read errors might be seen once in 10 years on those drives. As well, IBM scheduled regular "preventative maintenance". Their engineers probably tracked problems and replaced susceptible components, like bearings, well before they exceeded specifications and caused problems.

These three forces, low power-on hours, low duty-cycle and active maintenance, drove down the rate of data errors experienced in mainframes.

In the mid-1970's, Unix acquired the reputation as "being very hard on disks". This was because the O/S multi-tasked, ran a system-wide filesystem cache and was able to adopt simple access optimisation schemes like "The Elevator Scheduler". During their busy periods, drives were pushed to 100% duty-cycle for extended periods. For Universities with students, this would ramp up to near saturated performance 24/7 for the weeks leading up to end of term. System owners would exchange their experiences in Unix newsletters and conferences, leading to increases in sales for stronger drives.

The commodity drive vendors were forced to address weaknesses in their designs to handle extended periods of 100% duty-cycle and 24/7 operations. Those that didn't lost sales and eventually failed.

When LANs and workstations, and later PC's and Fileservers, came on the scene, it became standard practice to run servers, and their disks, 24/7. As RAID became more common in Fileservers and workstations, fuelled by both cheap hardware RAID cards and O/S tools, like Logical Volume Managers (LVM), able to "slice and dice" disks for sharing, drive counts increased while total installed capacity experienced explosive growth for nearly 2 decades.

These forces have driven commodity drive vendors over decades to improve the MTBF of drives and improve error detection/correction as well.


In 1987, the mainframe market for the Single Large Expensive Drive (SLED) was what the Berkeley group addressed in their RAID paper.

They were well aware of the economics, physics and operational aspects of disk drives and their analysis was correct:

  • commodity drives at $1,000-2,500 each were 25% of the per-GB price IBM charged,
  • lashing together a number of commodity drives and adding extra-drive parity provided improved performance and increased MTBF beyond what was needed. Read errors, responsible for under 5%-10% of Data Loss events, were notionally handled by the same extra-drive parity.
  • Scaling drives down consumes far less power in aerodynamic drag (which goes as the fifth power of diameter), allowing both faster rotation rates and lower seek times due to the much smaller scale and mass of components. The 1987/8 RAID paper pitched 3.5" drives, with later papers suggesting the Industry would further scale down to single platter 1" drives.
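
The fifth-power scaling in the last point can be made concrete. The sketch below is an illustration under loud assumptions: it treats the form-factor names (5.25", 3.5", 2.5", 1") as if they were platter diameters, holds rotation rate constant, and takes the fifth-power exponent from the text. Actual platters are somewhat smaller than the form factor.

```python
def relative_drag_power(diameter: float, reference_diameter: float,
                        exponent: float = 5.0) -> float:
    """Drag power of one platter size relative to another at the same RPM.

    Assumes power scales as diameter**exponent (fifth power, per the text).
    """
    return (diameter / reference_diameter) ** exponent


# Assumption: form-factor sizes used as stand-ins for platter diameters.
for small, large in ((3.5, 5.25), (2.5, 3.5), (1.0, 3.5)):
    ratio = relative_drag_power(small, large)
    print(f'{small}" vs {large}": {ratio:.1%} of the drag power')
```

Because the exponent is so steep, even the modest step from 3.5" to 2.5" cuts drag power to under a fifth, which is the headroom that can be spent on faster rotation.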

The reality of commodity drives, Storage System vendors and Computer system vendors was that fewer, larger drives were preferred. We are now in a period of transition to 2.5" drives for "performance" 10K & 15K RPM drives, but capacity drives are still 26.1mm thick, 3.5" drives, albeit variable speed ~5,900RPM.

The "balance" of drive characteristics has shifted substantially, with MTBF's rising 25-fold (40k to 1M hrs), capacity by 5,000 times and BER's improving only 10-100-fold.

Close packed storage systems now achieve 430-450 drives per rack, vs 120-150 of the 1990's.

The combination of radically increased capacity, parity drives and extended scan times has promoted the minor problem of BER Data Loss, to a "drop-dead" for current RAID systems: a single read error during a RAID Volume rebuild due to a drive failure is death to single-parity protection. The likelihood of these events has compounded as drive scan-times continue to increase and the volume of data per Volume Group has increased, while BER's are relatively static.
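
To put a number on that exposure, here is a hedged sketch of the chance of hitting at least one read error while reading every surviving drive in a single-parity rebuild. It assumes the BER is the probability of an unrecoverable error per bit read and that errors are independent, which is the usual simplification rather than a measured model. The 8-drive group of 4,000GB drives is a hypothetical example, not a configuration from the text.

```python
import math


def p_read_error(bytes_read: float, ber: float) -> float:
    """Probability of at least one bit error when reading bytes_read bytes.

    Assumes independent errors at rate ber per bit: 1 - (1 - ber)**bits,
    computed with log1p/expm1 to stay accurate for tiny ber.
    """
    bits = bytes_read * 8
    return -math.expm1(bits * math.log1p(-ber))


# Hypothetical single-parity group: 8 drives of 4,000GB, one has failed,
# so the rebuild must read all 7 survivors end to end.
surviving_drives = 7
drive_bytes = 4_000 * 10**9

for ber in (1e-14, 1e-15, 1e-16):
    p = p_read_error(surviving_drives * drive_bytes, ber)
    print(f"BER {ber:.0e}: {p:.1%} chance the rebuild hits a read error")
```

Under these assumptions the result is dominated by bits read multiplied by BER, so growing capacity against a static BER pushes the failure probability up almost linearly until it saturates.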

All major vendors adopted dual-parity Data Protection by around 2005 to provide adequate reliability for their clientele. The 1-day to 1-week long RAID rebuild times, during which the Storage Array throughput is halved, seem to be accepted by clients.

2.5" drives which can potentially pack 5,000 drives per rack create a whole new class of problems.
