Tuesday, April 22, 2014


Current RAID schemes, going back to the 1987/8 Patterson, Gibson and Katz RAID paper, make no distinction between transient and permanent failures: errors or dropouts versus outright drive failure.

This is a faulty assumption in the 2009 "Triple Parity RAID" piece by Leventhal, an author of ZFS (a proven, production-quality sub-system) and not just a commentator like myself.

The second major error is relying on the 2005 "Kryder's Law" 40%/year growth in capacity. Capacity growth had collapsed by 2009 and will probably not even meet 10%/year from now on, because the market can't fund the new fabrication plants needed.

The last major error by Leventhal was assuming that 3.5 inch disks would remain the preferred Enterprise Storage. 2.5 inch drives are the only logical long-term choice of form-factor.

This is a pretty simple observation:
Long-length error mitigation and correction techniques are well known and used in CD/CD-ROM and later DVD/Blu-ray. The same long-range techniques could be applied on top of the ECC techniques already used by drive controller electronics.
With the complex geometry and techniques used in current drives, we can't know the track layout or the precise raw transfer rate. How long on a disk is a 4kB block? The current 0.7Tbit/square-inch is around 850,000 bits/inch if bit cells are square, making a 4kB (32,000 bit) block around 1mm long. But that's too simple: disks are much more complex than that these days.
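The back-of-envelope length estimate can be reproduced in a few lines. This is only a sketch: it assumes an areal density of about 0.7 Tbit per square inch (the figure consistent with roughly 850,000 bits/inch) and square bit cells, which real drives do not have. The function name `block_length_mm` is made up for illustration.

```python
import math

def block_length_mm(areal_density_bits_per_sq_inch, block_bits):
    """Rough on-platter length of a block, ASSUMING square bit cells."""
    # With square cells, linear density is the square root of areal density.
    linear_bits_per_inch = math.sqrt(areal_density_bits_per_sq_inch)
    length_inches = block_bits / linear_bits_per_inch
    return length_inches * 25.4  # inches to millimetres

# Assumed inputs: 0.7 Tbit/sq-inch, and a 4kB block counted as 32,000 bits.
length = block_length_mm(0.7e12, 32_000)
print(f"approx. {length:.2f} mm")
```

Real bit cells are wider than they are long, so the true figure will differ; the point is only the order of magnitude.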

What we can reliably deal with is transfer rates.
At 1Gbps, a reasonable current rate, a 4kB block transfers in ~32 micro-seconds, or about 31,000 blocks/second.

64kB and 128kB reads take about 0.5 and 1.0 milliseconds respectively. In the context of 8-12msec per revolution, this is a reasonable overhead.
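The transfer-time figures can be checked with a short calculation. This sketch assumes a flat 1Gbps media rate and takes 1kB as 1024 bytes; the 7200 RPM revolution time is an assumed example within the 8-12msec range above, and the function name is hypothetical.

```python
def transfer_time_us(size_bytes, rate_bits_per_s=1e9):
    """Time to move size_bytes at a constant bit rate, in microseconds."""
    return size_bytes * 8 / rate_bits_per_s * 1e6

# Assumed example drive: 7200 RPM, so one revolution takes 60/7200 seconds.
REVOLUTION_US = 60 / 7200 * 1e6

for size_kb in (4, 64, 128):
    t = transfer_time_us(size_kb * 1024)
    share = t / REVOLUTION_US * 100
    print(f"{size_kb:>3}kB: {t:8.1f} us, {share:4.1f}% of a revolution")
```

Even a 128kB read occupies only a small fraction of one revolution, which is the basis for calling the overhead reasonable.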

If you're taking the trouble to detect and correct up to 4kB in a read (more is possible with higher overhead), then you need to also impose higher level error checking: MD5 or SHA1 fingerprints.

Storing raw data on disks could be done using an error correcting technique based on Erasure Codes (Galois Fields), in super-blocks.
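To make the idea of an erasure code concrete, here is a minimal sketch of the simplest one: a single XOR parity block over a super-block of equal-sized data blocks. This is not the Galois Field (Reed-Solomon style) coding suggested above, which can survive more than one lost block; it is swapped in only because it fits in a few lines. All names are hypothetical.

```python
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def encode(data_blocks):
    """Return the data blocks followed by one XOR parity block."""
    return list(data_blocks) + [xor_blocks(data_blocks)]

def recover(stripe):
    """Rebuild the data blocks when at most one block (marked None) is lost."""
    missing = [i for i, block in enumerate(stripe) if block is None]
    if len(missing) > 1:
        raise ValueError("single parity can only repair one erased block")
    stripe = list(stripe)
    if missing:
        # XOR of all surviving blocks reproduces the erased one.
        survivors = [block for block in stripe if block is not None]
        stripe[missing[0]] = xor_blocks(survivors)
    return stripe[:-1]  # drop the parity block

data = [b"abcd", b"efgh", b"ijkl"]
stripe = encode(data)
stripe[1] = None  # simulate an unreadable block
print(recover(stripe) == data)
```

An erasure code only needs to know which block is missing, which is exactly the information a failed read or a checksum mismatch provides.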

For data to be portable, standards are needed. This needs:

  • a tag for the record type
  • a small number of "chunk" sizes, 64kB and 128kB are already used in some file systems and SSD blocks
  • the convolution data
  • a high-level checksum, either MD5 or SHA1, of the raw data.
These chunks can be converted at the drive or in a controller. The MD5/SHA1, once computed, should travel with the data over the whole data path, back to its point of consumption.
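As a rough illustration of such a record, the sketch below packs a tag, the chunk length and a SHA1 checksum in front of the raw data, then verifies the checksum on the way back. The header layout, the tag value and the function names are all assumptions made for this example, not a proposed standard, and the error-correction field is left out to keep it short.

```python
import hashlib
import struct

# Hypothetical header: 4-byte record tag, 4-byte payload length, 20-byte SHA1.
HEADER = struct.Struct(">4sI20s")
CHUNK_SIZES = (64 * 1024, 128 * 1024)  # the two chunk sizes mentioned above

def pack_chunk(tag, payload):
    """Prefix a chunk of raw data with its tag, length and SHA1 checksum."""
    if len(payload) not in CHUNK_SIZES:
        raise ValueError("payload must be 64kB or 128kB")
    digest = hashlib.sha1(payload).digest()
    return HEADER.pack(tag, len(payload), digest) + payload

def unpack_chunk(record):
    """Split a record and check the payload against its stored checksum."""
    tag, length, digest = HEADER.unpack_from(record)
    payload = record[HEADER.size:HEADER.size + length]
    if len(payload) != length or hashlib.sha1(payload).digest() != digest:
        raise ValueError("chunk failed its checksum")
    return tag, payload

record = pack_chunk(b"RAW0", bytes(64 * 1024))
tag, payload = unpack_chunk(record)
print(tag, len(payload))
```

Because the checksum sits in the record itself, any stage of the data path can re-verify the chunk without recomputing anything upstream.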

I'd like to be able to specify more about how chunks are organised, but a first cut would be the 3-tier scheme of CD-ROMs. Test implementations could start with the 2kB blocks of CDs: the code is well known, free and well tested.

Doing this and adopting 2.5" drives will go a long way to avoiding the Error Meltdown predicted by Leventhal. It also provides a very good platform for increasing the performance of other RAID levels.
