Wednesday, December 14, 2011

The 35TB drive (of 2020) and Using them.

What's the maximum capacity possible in a disk drive?

Kryder, 2009, projects 7TB/platter for 2.5 inch platters will be commercially available in 2020.
[10Tbit/in² demo by 2015 and $3/TB for drives]

Given that prices of drive components are driven by production volumes, in the next decade we're likely to see the end of 3.5 inch platters in commercial disks with 2.5 inch platters taking over.
The fifth-power relationship between platter-size and drag/power-consumed also suggests "Less is More". A 3.5 inch platter needs 5+ times more power to twirl it around than a 2.5 inch platter. This is the reason that 10K and 15K drives run the small platters: they already use the same media/platters for 3.5 inch and 2.5 inch drives.
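The "5+ times" figure can be checked with a few lines of Python. This is only a sketch: it assumes the nominal form-factor sizes (3.5 and 2.5 inch) stand in for the real platter diameters, that both platters spin at the same RPM, and the function name is made up for illustration.

```python
def relative_spin_power(diameter_a_in, diameter_b_in, exponent=5):
    """Ratio of power needed to spin platter A versus platter B.

    Assumes power scales with diameter**exponent at equal RPM
    (the fifth-power relationship quoted in the text).
    """
    return (diameter_a_in / diameter_b_in) ** exponent


# Nominal sizes are an assumption; real platters are a little smaller.
ratio = relative_spin_power(3.5, 2.5)
print(f"3.5 inch vs 2.5 inch platter: {ratio:.2f}x the power")
```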

Sankar, Gurumurthi, and Stan in "Intra-Disk Parallelism: An Idea Whose Time Has Come" ISCA, 2008, discuss both the fifth-power relationship and that multiple actuators (2 or 4) make a significant difference in seek times.

How many platters are fitted in the 25.4 mm (1 inch) thickness of a 3.5 inch drive's form-factor?

This report on the Hitachi 4TB drive (Dec, 2011) says they use 4 * 1TB platters in a 3.5 inch drive, with 5 possible.

It seems we're on-track to at least the Kryder 2020 projection, with 6TB per 3.5 inch platter already demonstrated using 10nm grains enhanced with Sodium Chloride.

How might those maximum capacity drives be lashed together?

If you want big chunks of data, then even in a world of 2.5 inch componentry, it still makes sense to use the thickest form-factor around to squeeze in more platters. All the other power-saving tricks of variable-RPM and idling drives are still available.
The 101.6mm [4 inch] width of the 3.5 inch form-factor allows 4 to sit comfortably side-by-side in the usual 17.75 inch wide "19 inch rack", using just more than half the 1.75 inch height available.

It makes more sense to make a half-rack-width storage blade with 4 * 3.5 inch disks (2 across, 2 deep), a small/low-power CPU, a reasonable amount of RAM, "SCM" (Flash Memory or similar) as working-memory and cache, and dual high-speed Ethernet, InfiniBand or similar ports (10Gbps) as redundant uplinks.
SATA controllers with 4 drives per motherboard are already common.
Such "storage bricks", to borrow Jim Gray's term, would store a protected 3 * 35TB, or about 100TB per unit, or 200TB per Rack Unit (RU). A standard 42RU rack, allowing for a controller (3RU), switch (2RU), patch-panel (1RU) and common power-supplies (4RU), leaves 32RU for bricks and would have a capacity of about 6.5PB.
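The rack arithmetic can be laid out as a rough sketch. The constant names are invented for illustration, and it assumes one drive's worth of capacity in each 4-drive brick goes to protection and that decimal units are used (1PB = 1000TB). The 6.5PB quoted above sits between the 6.4PB obtained from the rounded 100TB per brick and the 6.72PB obtained from the unrounded 105TB.

```python
RACK_UNITS = 42
# Rack units reserved for non-storage equipment, as listed in the text.
OVERHEAD_RU = {"controller": 3, "switch": 2, "patch_panel": 1, "power": 4}
BRICKS_PER_RU = 2            # half-rack-width bricks, two side by side
DRIVE_TB = 35
DATA_DRIVES_PER_BRICK = 3    # assumption: 4 drives, one drive's worth of parity

usable_ru = RACK_UNITS - sum(OVERHEAD_RU.values())
brick_tb = DATA_DRIVES_PER_BRICK * DRIVE_TB
rack_pb = usable_ru * BRICKS_PER_RU * brick_tb / 1000
rack_pb_rounded = usable_ru * BRICKS_PER_RU * 100 / 1000  # using 100TB per brick

print(f"usable rack units: {usable_ru}")
print(f"protected capacity per brick: {brick_tb} TB")
print(f"rack capacity: {rack_pb_rounded:.2f} to {rack_pb:.2f} PB")
```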

Kryder projected a unit cost of $40 per drive, with the article suggesting 2 platters/drive.
Scaled up, ~$125 per 35TB drive, or ~$1,000 for 100TB protected ($10/TB) [$65-100,000 per rack]

The "scan time" or time-to-populate a disk is the rate-limiting factor for many tasks, especially RAID parity rebuilds.
For a single actuator drive using 7TB platters and streaming at 1GB/sec, "scan time" is a daunting 2 hours per platter: at best 10 hours just to read a 35TB drive.

Putting 4 actuators in the drive cuts scan time to 2-2.5 hours, with some small optimisations.
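A minimal sketch of the scan-time arithmetic follows. It assumes decimal units (1TB = 1000GB), that every actuator streams at the full 1GB/sec in parallel, and that seek and settle time are ignored; `scan_hours` is a made-up helper name.

```python
def scan_hours(capacity_tb, stream_gb_per_s=1.0, actuators=1):
    """Hours to read the whole capacity once, streaming sequentially.

    Assumes each actuator streams independently at stream_gb_per_s,
    so aggregate bandwidth scales linearly with the actuator count.
    """
    seconds = capacity_tb * 1000 / (stream_gb_per_s * actuators)
    return seconds / 3600


print(f"one 7TB platter, 1 actuator: {scan_hours(7):.2f} h")
print(f"35TB drive, 1 actuator:      {scan_hours(35):.2f} h")
print(f"35TB drive, 4 actuators:     {scan_hours(35, actuators=4):.2f} h")
```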

While not exceptional, it compares favourably with the 3-5 hours minimum currently reported for 1TB drives.

But a single-parity drive won't work for such large RAID volumes!

Leventhal, 2009, in "Triple Parity and Beyond", suggested that the UER (Unrecoverable Error Rate) of large drives would force parity-group RAID implementations to use a minimum of 3 parity drives to achieve a 99.2% probability of a successful (Nil Data Loss) RAID rebuild following a single-drive failure. Obviously, triple parity is not possible with only 4 drives.

The extra parity drives are NOT to cover additional drive failures (this scenario is not calculated), but to cover read errors, with the assumption that a single error invalidates all data on a drive.

Leventhal uses in his equations:
  •  512 byte sectors,
  • 1 in 10^16 probability of UER,
  • hence one unreadable sector per 200 billion (10TB) read, or
  • 10 sectors per 2 trillion (100TB) read.
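Figures like these depend heavily on the units behind the UER. The sketch below assumes the UER is quoted as one unrecoverable error per N bits read, which is how drive datasheets usually state it; if a per-sector rate is intended the numbers change. The helper names are hypothetical. It makes the conversion explicit so the list can be rechecked for whichever UER and sector size is meant.

```python
def tb_per_expected_error(uer_bits):
    """Decimal terabytes read, on average, per unrecoverable error."""
    return uer_bits / 8 / 1e12


def sectors_per_expected_error(uer_bits, sector_bytes):
    """Sectors read, on average, per unrecoverable error."""
    return uer_bits / (sector_bytes * 8)


# Assumption: UER is 1 error per 10**exp bits read.
for exp in (14, 15, 16):
    uer = 10.0 ** exp
    print(
        f"1 in 10^{exp} bits: "
        f"{tb_per_expected_error(uer):g} TB, "
        f"{sectors_per_expected_error(uer, 512):.3g} sectors of 512 bytes, "
        f"{sectors_per_expected_error(uer, 4096):.3g} sectors of 4096 bytes"
    )
```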
Already, drives are using 4KB sectors (with mapping to the 'standard' 0.5KB sectors) to achieve the higher UERs. The calculation should be done with the native disk sector size.

If platter storage densities are increased 32-fold, it makes sense to similarly scale up the native sector size to decrease the UER. There is a strong case for 64-128KB sectors on 7TB platters.

Recasting Leventhal's equations with:
  • 100TB to be read,
  • 64KB native sectors,
  • that is, 1.5625 * 10^9 native sectors to be read, at a UER of 1 in 10^16.
What UER would enable a better than 99.2% probability of reading 1.5 billion native sectors?
First approximation is 1 in 10^18 [confirm].
Zeta claims UER better than 1 in 10^58. It is possible to do much better.
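As a rough way to check the first approximation, the sketch below estimates the chance of reading 100TB without a single unrecoverable error. It is not Leventhal's own equation: it assumes independent errors and a UER quoted per bit read, and `p_clean_read` is a hypothetical name. Under this per-bit model the sector size does not enter directly; larger native sectors matter only in so far as they allow a lower UER.

```python
import math


def p_clean_read(data_tb, uer_bits):
    """Probability of reading data_tb decimal terabytes with no unrecoverable error.

    Assumes each bit fails independently with probability 1/uer_bits.
    """
    bits = data_tb * 1e12 * 8
    # (1 - p)**bits, computed via log1p to keep precision for tiny p.
    return math.exp(bits * math.log1p(-1.0 / uer_bits))


for exp in (16, 17, 18):
    p = p_clean_read(100, 10.0 ** exp)
    print(f"UER 1 in 10^{exp} bits: P(clean 100TB read) = {p:.4%}")
```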

Inserting Gibson's "horizontal" error detection/correction (extra redundancy on the one disk) is around the same overhead, or less. [do exact calculation].

Rotating parity or single-disk parity RAID?

The reason to rotate parity around the disks is simple: avoid "hot-spots". Otherwise the full parallel IO bandwidth possible over all disks is reduced to just that of the parity disk. NetApp neatly solve this problem with their WAFL (Write Anywhere File Layout).

To force disks into mainly sequential access ("seek then stream"), writes should not simply be cached and passed on: they should be held in SCM/Flash, not written to HDD, until writes have quiesced.

The single parity-disk problem only occurs on writes. Reading, in normal or degraded mode, occurs at equal speed.

If writes across all disks are stored then written in large blocks, there is no IO performance difference between single-parity disk and rotating parity.
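A toy model illustrates the claim. It is a sketch under simplifying assumptions: it counts only write operations per disk, ignores the reads needed for read-modify-write parity updates, and spreads small writes evenly across the data disks. The function name and parameters are invented for illustration.

```python
def per_disk_writes(num_disks, num_stripes, rotate_parity, full_stripe):
    """Count write operations landing on each disk.

    rotate_parity: parity moves to the next disk on each stripe,
                   otherwise it always lives on the last disk.
    full_stripe:   each stripe is written whole (buffered writes),
                   otherwise one data block per stripe is updated.
    """
    load = [0] * num_disks
    for stripe in range(num_stripes):
        parity = stripe % num_disks if rotate_parity else num_disks - 1
        data_disks = [d for d in range(num_disks) if d != parity]
        if full_stripe:
            targets = data_disks
        else:
            targets = [data_disks[stripe % len(data_disks)]]
        for disk in targets:
            load[disk] += 1
        load[parity] += 1  # parity is rewritten on every stripe update
    return load


for full in (False, True):
    for rotate in (False, True):
        label = f"{'full-stripe' if full else 'small'} writes, " \
                f"{'rotating' if rotate else 'dedicated'} parity"
        print(f"{label}: {per_disk_writes(4, 12, rotate, full)}")
```

With small writes the dedicated parity disk takes a write for every update while the data disks share the rest; with full-stripe writes every disk receives the same number of writes wherever the parity lives.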
