## Sunday, October 30, 2011

### Revisiting RAID in 2011: Reappraising the 'Manufactured Block Device' Interface.

Reliable Persistent Storage is fundamental to I.T./Computing, especially in the post-PC age of "Invisible Computers". RAID, as large Disk Arrays backed up by tape libraries, has been a classic solution to this problem for around 25 years.

The techniques to construct "high-performance", "perfect" devices with "infinite" data-life from real-world devices with many failure modes, performance limitations and a limited service life vary with cost constraints, available technologies, processor organisation, demand/load and expectations of "perfect" and "infinite".

RAID Disk Arrays are facing at least 4 significant technology challenges or changes in use as I.T./Computing continues to evolve:
• From SOHO to Enterprise level, removable USB drives are becoming the media of choice for data-exchange, off-site backups and archives as their price/GB drops to a fraction of that of tape and optical media.
• M.A.I.D. (Massive Array of Idle Disks) is becoming more widely used for archive services.
• Flash memory, at A$2-$4/GB, packaged as SSDs or direct PCI devices, is replacing hard disks (HDDs) for low-latency random-IO applications. eBay, for example, has announced a new 'Pod' design based around flash memory, legitimising the approach for Enterprises and providing good "case studies" for vendors to use in marketing and sales.
• As Peta- and Exabyte systems are designed and built, it's obvious that the current One-Big-Box model for Enterprise Disk Arrays is insufficient. IBM's General Parallel File System (GPFS) home page notes that large processor arrays are needed to create/consume the I/O load [e.g. 30,000 file creates/sec] the filesystem is designed to provide. The aggregate bandwidth provided is orders of magnitude greater than can be handled by any SAN (Storage Area Network, sometimes called a 'Storage Fabric').
Internet-scale datacentres, consuming 20-30MW, house ~100,000 servers and potentially 200-500,000 disks, notably not as Disk Arrays. Over a decade ago, Google solved its unique scale-up/scale-out problems with whole-system replication and GFS (Google File System): a 21st-century variant of the 1980s research work at Berkeley by Patterson et al called "N.O.W." (Network of Workstations), a portion of which was the concept written up in 1988 as the seminal RAID paper: "Redundant Arrays of Inexpensive Disks".

The 1988 RAID paper by Patterson et al and a follow-on paper in 1993/4, "RAID: High-Performance, Reliable Secondary Storage", plus the 'NOW' talks/papers, contain a very specific vision:
• RAID arrays of low-end/commodity drives, not even mid-range drives.
• Including a specific forecast for "1995-2000" of 20MB 1.3" drives being used to create 10TB arrays (50,000 drives; inferring a design from a diagram, these would take a modest 5-6 racks).
RAID as a concept and product was hugely successful, as noted in the 1994 Berkeley paper, with a whole industry comprising many products and vendors springing up within 5 years. IBM announced the demise of the SLED (Single Large Expensive Disk) with the 3990 Model 9 in 1993 as a direct consequence of RAID arriving.

These were all potent victories, but the commercial reality was, and still is, very different to that initial vision of zillions of PC drives somehow lashed together:
• Enterprise Disk Arrays are populated with the most expensive, 'high-spec' drives available. This has been the case from the beginning for the most successful and enduring products.
• The drives are the cheapest element of Enterprise Disk Arrays: the supporting infrastructure and software dominate costs.
• Specialised "Storage Fabrics" (SANs) are needed to connect the One-Big-Box to the many servers within a datacentre. These roughly double costs again.
• Vendor "maintenance" and upgrade costs, specialist consultants and dedicated admins create OpEx (Operational Expenditure) costs that dwarf the cost of replacement drives.
• Large Disk Array organisations are intended to be space-efficient (i.e. a low 5-10% overhead for Redundancy), but because of the low-level "block device" interface presented, used and usable filesystem space is often 25% of notional space. Committed but unused space, allocated but unused "logical devices" and backup/recovery buffers and staging areas are typical causes of waste. This has led directly to the need for "de-duplication", in-place compression and more.
• With so many layers and so much complexity, Disk Array failure, especially when combined with large database auto-management, has led to spectacular public service-failures, such as the 4-week disruption/suspension of Microsoft's "Sidekick" sold by T-Mobile in 2009. There is nothing publicly documented about the many private failures, disruptions and mass data losses that have occurred. It is against the commercial interests of the only businesses that could collect these statistics, the RAID and database vendors, to do so. Without reliable data, nothing can be analysed or changed.
• The resultant complexity of RAID systems has created a new industry: Forensic Data Recovery. Retrieving usable data from drives in a failed/destroyed Disk Array is not certain and requires specialist tools and training.
• To achieve "space efficiency", admins routinely define very large RAID sets (group size 50) versus the maximum group of 20 disks proposed in 1994. These overly large RAID groups, combined with link congestion, have led to two increasing problems:
• RAID rebuild times for a failed drive have gone from minutes to 5-30 hours. During rebuilds, Array performance is 'degraded' as the common link to drives is congested: all data on all drives in the RAID set needs to be read, competing with "real" load.
• Single parity RAID is no longer sufficient for a 99+% probability of a rebuild succeeding. RAID 6, with dual parity (each parity calculated differently), has been recommended for over 5 years. These extra parity calculations and writes create further overhead and reduce normal performance more.
• Whilst intended to provide very high 'reliability', in practice large Disk Arrays and their supporting 'fabrics' are a major source of failures and downtime. The management of complex and brittle systems has proven to exceed the abilities of many paid-professional admins. As noted above, the extent and depth of these problems are unreported. Direct observation and anecdotal evidence suggest that the problems are widespread, bordering on universal.
• At the low-end, many admins still prefer internal hardware RAID controllers in servers. As a cost-saving measure, they often also select fixed, not hot-swap, drives. This leads to drive failures not being noticed or corrected for months on end, completely negating the data-protection offered by RAID. These controllers are relatively unsophisticated, which creates severe problems in the case of power loss: "failed" drive state can be forgotten, as the controller relies on read errors to prompt data reconstruction via parity. The result is instant, irreversible corruption of complete hosted filesystems: the worst outcome possible.
• Low-cost "appliances" have started to appear offering RAID of 2-8 drives. They use standard Ethernet and iSCSI or equivalent instead of a dedicated "storage fabric". SOHO and SMEs are increasingly using these appliances, both as on-site working storage and for off-site backups and archives. These are usually managed as "Set and Forget", avoiding many of the One-Big-Box failure modes, faults and administration errors.
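
To put rough numbers on the rebuild-time and single-parity points above, here is a minimal back-of-envelope sketch in Python. Every figure in it is an assumption chosen for illustration, not a measurement: 1 TB drives, a sustained rebuild rate of 20 MB/s under competing load, and an unrecoverable read error (URE) rate of 1 in 10^14 bits read. The function names are hypothetical, and the model treats read errors as independent per bit, which real drives do not guarantee.

```python
import math


def rebuild_hours(drive_bytes: float, rebuild_bytes_per_sec: float) -> float:
    """Hours needed to rewrite one replacement drive at a sustained rebuild rate."""
    return drive_bytes / rebuild_bytes_per_sec / 3600.0


def p_rebuild_ok(group_size: int, drive_bytes: float, ure_per_bit: float) -> float:
    """Chance of reading every surviving drive in full with no unrecoverable
    read error, which is what a single-parity rebuild depends on."""
    bits_read = (group_size - 1) * drive_bytes * 8
    # (1 - p) ** n, computed through log1p to stay accurate for tiny p
    return math.exp(bits_read * math.log1p(-ure_per_bit))


if __name__ == "__main__":
    DRIVE = 1e12       # assumed: 1 TB drive
    RATE = 20e6        # assumed: 20 MB/s sustained rebuild rate under load
    URE = 1e-14        # assumed: 1 unrecoverable read error per 10^14 bits

    print(f"rebuild time: {rebuild_hours(DRIVE, RATE):.1f} hours")
    for group in (8, 20, 50):
        p = p_rebuild_ok(group, DRIVE, URE)
        print(f"group of {group:2d}: P(rebuild completes cleanly) = {p:.3f}")
```

Under these assumptions the rebuild time lands inside the 5-30 hour range quoted above, and the chance of a clean single-parity rebuild falls well short of 99% as the group size grows. Changing the assumed drive size, rebuild rate or URE rate moves the numbers, but not the direction of the trend.
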
Questions to consider and investigate:
• Reliable Persistent Storage solutions are now needed across a very large range of scales: from 1 disk in a household to multiple copies of the entire Public Internet in both Internet Search companies and Archival services, like "The Wayback Machine". There are massive collections of raw data from experiments, exploration, simulations and even films to be stored/retrieved. Some of this data is relatively insensitive to data-corruption, other data needs to be bit-perfect. A "one-size-fits-all" approach is clearly not sufficient, workable or economic.
• Like Security and encryption, every site, from household, to service provider, to government agencies, requires multiple levels of "Data Protection" (DP). Not every bit needs "perfect protection", nor "infinite datalife", but identifying protection levels may not be easy or obvious.
• Undetected errors are now certain with the simple RAID "block device" interface. Filesystem and object-level error detection techniques, with error-correction hooks back to the Storage system, are required as the stored data-time product and total IO-volume increase (exponentially) beyond the ability of simple CRC/checksum systems to detect all errors.
• There may be improvements in basic RAID design and functionality that apply across all scales and DP levels, from single disk to Exabyte. Current RAID designs work in only a single dimension, duplication across drives, whilst at least one other dimension is available, and already used in optical media (CD-ROMs): data duplication along tracks. "Lateral" and "Longitudinal" data duplication seem adequate descriptions of the two techniques.
• Specific problems and weaknesses/vulnerabilities of RAID systems need analysis to understand if improvements can be made within existing frameworks or new techniques are required.
• Scale and Timing are everything. The 1988 solutions and tradeoffs were perfect and correct at the time. Solutions and approaches created for existing and near-term problems and technologies cannot be expected to be Universal or Optimal for different technologies, or even current technologies for all time.
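
The point above about filesystem and object-level error detection, with error-correction hooks back to the Storage system, can be sketched as a toy in Python. This is an illustration under stated assumptions, not a description of any real controller or filesystem: the stripe is three small in-memory blocks plus one XOR parity block, the object-level checksum is SHA-256 kept outside the "block device", and all function names (`xor_blocks`, `write_stripe`, `digest`, `read_verified`) are hypothetical.

```python
import hashlib
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks (single-parity arithmetic)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def write_stripe(data_blocks):
    """Return the stripe as stored: the data blocks plus one XOR parity block."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


def digest(data_blocks):
    """Object-level checksum, held by the filesystem rather than the array."""
    return hashlib.sha256(b"".join(data_blocks)).hexdigest()


def read_verified(stripe, expected_digest):
    """Read the data blocks and verify them against the object checksum.

    On a mismatch, use the 'hook' back to the storage layer: rebuild each
    data block in turn from the other blocks plus parity, and accept the
    first candidate whose checksum matches.
    Returns (data_blocks, index_of_repaired_block_or_None).
    """
    data = stripe[:-1]
    if digest(data) == expected_digest:
        return data, None
    for i in range(len(data)):
        others = stripe[:i] + stripe[i + 1:]  # all blocks except i, parity included
        candidate = data[:i] + [xor_blocks(others)] + data[i + 1:]
        if digest(candidate) == expected_digest:
            return candidate, i
    raise IOError("corruption detected but not correctable with single parity")


if __name__ == "__main__":
    blocks = [b"AAAA", b"BBBB", b"CCCC"]
    stripe = write_stripe(blocks)
    expected = digest(blocks)

    # Silent corruption: the device hands back wrong bytes with no read error,
    # so a plain block read has nothing to trigger parity reconstruction.
    stripe[1] = b"BxBB"

    repaired, bad_index = read_verified(stripe, expected)
    print("repaired block index:", bad_index)
    print("data intact after repair:", repaired == blocks)
```

The design point is that parity alone can rebuild a block only once something says which block is bad. A checksum held above the block interface supplies that missing signal, and also confirms that the reconstruction is correct.
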
Why did the 1988 and 1994 Patterson et al papers get some things so wrong?

Partly because of human perceptual difficulties in dealing with super-linear change (exponential or higher growth exceeds our perceptual system's abilities), and also because "The Devil is in the Detail".
Like software, a "paper design" can be simple and elegant but radically incomplete. Probing a design's limits, assumptions, constraints and blind-spots requires implementation and real-world exposure. The more real-world events and situations any piece of software is exposed to, the deeper the level of understanding of the problem-field that is possible.