Saturday, December 24, 2011

Big Drives in 2020

Previously I've written about Mark Kryder's 7TB/platter (2.5 inch) prediction for 2020.
This is more speculation around that topic.

1. What if we don't hit 7TB/platter, maybe only 4TB?

There have been any number of "unanticipated problems" encountered in scaling Silicon and Computing technologies. Will more be encountered with HDDs before 2020?

We already have 1TB platters in 3.5 inch announced in Dec-2011, with at least one new technique announced to increase recording density (Sodium Chloride doping), so it's not unreasonable to expect another 2 doublings in capacity, just in taking what's in the Labs and figuring out how to put it into production.

Which means we can expect 2-4TB/platter (2.5 inch) to be delivered in 2020.
At $40 per single-platter disk?
That depends on a) oligopoly pricing by the two major vendors and b) the yields and costs of the new fabrication plants.

It seems to me that Price/GB will drop, but maybe not to the levels expected.
Especially if the rapid decline in SSD/Flash Memory Price/GB plateaus and removes price competition.


2. Do we need to offer The Full Enchilada to everyone?

Do laptop and ultrabook users really need 4TB of HDD when they are constantly on-line?
1-2TB will store a huge amount of video, many virtual machine images and a lifetime's worth of audio.
There might be a market for smaller capacity disks, either through smaller platters, smaller form-factors or underusing a full-width platter.

Each option has merits.
The final determinant will be perceived consumer Value Proposition, the Price/Performance in the end-user equipment.


3. What will the 1.8 inch market be doing?

If these very small form-factor drives in mobile equipment get to 0.5-2TB, that will seem effectively infinite.

There is no point in adopting old/different platter coatings and head-manufacturing techniques for these smaller form-factors unless other engineering or usability factors come into play: such as sensitivity to electronic noise, contamination, heat, vibration, ...


4. The fifth-power of diameter and cube-of-RPM: impact of size and speed?

2.5 inch drives are set to completely displace 3.5 inch in new Enterprise Storage systems within a year. This is primarily driven by Watts/GB and GB/cubic-space.

The aerodynamic drag of disk platters, hence the power consumed by a drive, varies with the fifth-power of platter diameter and the cube of the rotational velocity (RPM).

If you halve the platter size (5.25 inch to 2.5 inch), drive power reduces 32-fold.
If you then double the RPM of the drive (3600 to 7200), power increases 8-fold,
a nett reduction in power demand of 4 times.

Reducing the platter diameter by a factor of the square-root of 2 (halving the recordable area) cuts drive power roughly 5.7-fold (2^2.5). This is approximately the step between 5.25::3.5 inch, 3.5::2.5 inch and 2.5::1.8 inch.

Reducing a 2.5 inch platter to 1.92 inches (about 60% of the original surface area) allows a drive to be spun up from 5400 RPM to 7200 RPM whilst using no more drive power than before: the fifth-power saving outweighs the cube-of-RPM cost.
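
To make the scaling rule easy to play with, here is a minimal Python sketch of power ∝ diameter^5 * RPM^3 using the figures above; the ratios are relative only, and absolute wattages depend on the particular drive.

```python
def relative_power(d_ratio, rpm_ratio):
    """Relative spindle power for a change in platter diameter and RPM,
    using power ~ diameter^5 * RPM^3 (platter count held constant)."""
    return d_ratio ** 5 * rpm_ratio ** 3

# Halve the platter diameter (treating 5.25 -> 2.5 inch as roughly a halving):
print(relative_power(0.5, 1.0))                  # 0.03125, a 32-fold reduction

# ... then double the RPM (3600 -> 7200): a nett 4-fold reduction remains.
print(relative_power(0.5, 2.0))                  # 0.25

# Shrink the diameter by sqrt(2), halving the platter area, at constant RPM:
print(relative_power(2 ** -0.5, 1.0))            # ~0.177, about a 5.7-fold reduction

# 2.5 inch platter cut to 1.92 inch while spinning up from 5400 to 7200 RPM:
print(relative_power(1.92 / 2.5, 7200 / 5400))   # ~0.63 of the original power
```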

Whilst not in the class of Enterprise Storage "performance optimised" drives (10K and 15K), it would be a noticeable improvement for Desktop PC's, given they will also be using large SSD's/Flash Memory in 2020 and the HDD will be used solely for "Seek and Stream" tasks.

There is very little reason to "de-stroke" drives and limit them to less than full-platter access if they are not "performance-optimised". It's a waste of resource for exactly the same input cost.


5. Will 3.5 inch "capacity-optimised" disks survive?
Will everything be 2.5 or 1.8 inch form-factor?

There are 3 markets that are interested in "capacity-optimised" disks:
  • Storage Appliances [SOHO, SME, Enterprise and Cloud]
  • Desktop PC
  • Consumer Electronics: PVR's etc.
When 1TB 2.5 inch drives are affordable, they will make new, smaller and lighter Desktop PC designs possible. Dell and HP might even offer modules that attach to the back of LCD screens on the 100mm x 100mm "VESA" mounting standard. A smaller variant of the Apple Mac Mini is possible, especially if a single power-supply is available.

Consumer PVR's are interested in Price/GB, not Watts/GB. They will be driven by HDD price.
The manufacturers don't pay for power consumed, customers don't evaluate/compare TCO's and there is no legislative requirement for low-power devices.  Government regulation could be the wild-card driving this market.

There's a saying, something like this, that I thought was made by Dennis Ritchie:
 "Memory is Cheap, until you need to buy enough for 10,000 PC's".
[A comment on MS-Windows' lack of parsimony with real memory.]

Corporations will look to trimming costs of their PC (laptop and Desktop) fleets, and the PC vendors will respond to this demand.

Storage Appliances:
Already Enterprise and Cloud providers are moving to 2.5 inch form-factor to reduce power demand (Watts/GB) and floor-space footprint (GB/cubic-space).

Consumer and entry-level servers and storage appliances (NAS and iSCSI) are currently mostly 3.5 inch because that has always been the "capacity-optimised" sweet spot.

Besides power-use, the slam-dunk reasons for SOHO and SME users to move to 2.5 inch are:
  • lighter
  • smaller
    • smaller footprint and higher drive count per Rack Unit.
    • more aggregate bandwidth from higher actuator count
    • mirrored drives or better, are possible in a small portable and low-power case.
  • more robust, better able to cope with knocks and movement.
2.5 inch drives may be much better suited to "canister" or (sealed) "drive-pack" designs, such as used by Copan in their MAID systems. This is due to their lighter weight and lower power dissipation.
The 14-drive Copan 3.5 inch "Canister" of 4RU could be replaced by a 20-24 drive 2.5 inch Canister of 3RU, roughly doubling the number of drives in the same rack space.

6. What if there are some unforeseen "drop-deads", like low data-retention rates or hyper-sensitivity to heat, that limit useful capacities to the current 300-600GB/platter (2.5 inch)?

We can't know the future perfectly, so can't say just what surprises lie ahead.
If there is some technical reason why current drive densities are an engineering maximum, we cannot rely on technology advances to automatically reduce the Price/GB each year.

Even if technology is frozen, useful price reductions, albeit minor in comparison to "50% per year", will be achievable in the production process. It might take a decade for prices to drop 50% per GB.

I'm not sure exactly how designs might be made to scale if drive sizes/densities are pegged to current levels.
What is apparent and universal: "Free Goods" with apparently Infinite Supply will engender Infinite Demand.

If we do hit a "capacity wall", then the best Social Engineering response is to limit demand, which requires a "Cost" on capacity. This could be charging, as Google does with its Gmail service, or by other means, such as publicly ranking "capacity hogs".

Thursday, December 22, 2011

IDC on Hard Disk Drive market: Transformational Times

One of the problems, as an "industry outsider", of researching the field is lack of access to hard data/research. It's there, it's high-quality and timely. Just expensive and behind pay-walls.

A little information leaks via Press Releases and press articles promoting the research companies.

When one of these professional analyst firms makes a public statement alerting us to a radical restructuring of the industry, that's big news. [Though you'd expect "insiders" to have been aware of this for quite some time.]

What's not spelled out publicly is how this will impact Enterprise Storage vendors/manufacturers.
There seems an implication that the two major HDD vendors will start to compete 'up' the value-chain with RAID and Enterprise Storage vendors, and across storage segments with Flash memory/SSD vendors.

IDC's Worldwide Hard Disk Drive 2011-2015 Forecast: Transformational Times
published in May 2011. The report runs to 62 pages and is priced from US$4,500.

Headline: Transformation to just 3 major vendors. (really 2 major + 1 minor @ 10%)
 "The hard disk drive industry has navigated many technology and product transitions over the past 50 years, but not a transformation. [emphasis added]

 The HDD industry is poised to consolidate from five to three HDD vendors by 2012, and
 HDD unit shipment growth over the next five years will slow.

 HDD revenue will grow faster than unit shipments after 2012, in part because HDD vendors will offer higher-performance hybrid HDD solutions that will command a price premium.

 But for the remaining three HDD vendors to achieve faster revenue growth,
 it will be necessary by the middle of the decade for HDD vendors to transform into [bullets added]
  •  storage device and
  •  storage solution suppliers,
  •  with a much broader range of products for a wider variety of markets
  •  but at the same time a larger set of competitors."

Platters per Disk.

Headline:
  • 2.5 inch:
    • 9.5mm = 1 or 2 platters
    • 12.5mm = 2 or 3 platters
    • 15mm = ? platters. Guess at least 3. 4 unlikely, compare to 3.5 inch density
  • 3.5 inch
    • 25.4mm = commonly 4. Max. 5 platters.
Why is this useful, interesting or important?
To compare capacity across form-factors and for future configuration/design possibilities.

Disk form-factors are related by an approximate halving of platter area between sizes:
8::5.25 inch, 5.25::3.5 inch, 3.5::2.5 inch, 2.5::1.8 inch, 1.8::1.3 inch, 1.3::1 inch...
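
As a rough check of that "approximate halving", here is a small sketch using the nominal (headline) diameters; actual platter and recording-band sizes are smaller, so treat the ratios as indicative only.

```python
# Nominal form-factor sizes in inches; platter area scales with diameter squared.
sizes = [8.0, 5.25, 3.5, 2.5, 1.8, 1.3, 1.0]

for larger, smaller in zip(sizes, sizes[1:]):
    print(f"{larger} vs {smaller} inch: area ratio {(larger / smaller) ** 2:.2f}")
# Ratios come out at roughly 2.3, 2.25, 1.96, 1.93, 1.92 and 1.69, i.e. close to,
# though not exactly, a halving at each step.
```
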
What we (as outsiders) know, but only approximately, is the recording area per platter for the platter sizes.  We know there are at least 3 regions of disk platter, but not their ratios/sizes, and these will vary per form-factor/platter-size:
  • motor/hub. The area of the inside 'torus' is small, not much is lost.
  • recorded area
  • outer 'ring' for landing and idling or "unloading" heads. Coated differently (plastic?) to not damage heads if they "skid" or come into contact with a surface (vs 'flying' on the aerodynamic air-cushion).
Research:

Chris Mellor, 12th September 2011 12:02 GMT, The Register, "Five Platters, 4TB".
Seagate has a 4TB GoFlex Desk external drive but this is a 5-platter
disk with 800GB platters.
IDC, 2009, report sponsored by Hewlett-Packard:
By 2010, the HDD industry is expected to increase the maximum number of platters per 2.5inch performance-optimized HDD from two to three,
enabling them to accelerate delivering a doubling of capacity per drive, and subsequently achieving 50% capacity increases per drive over a shorter time frame.
7th September 2011 06:00 GMT, The Register.
Oddly Hitachi GST is only shipping single-platter versions of these new drives, although it is saying they are the first ones in a new family, with their 569Gbit/in2 areal density. The announced but not yet shipping terabyte platter Barracuda had a 635Gbit/in2 areal density.
Sebastian Anthony, December 12, 2011 at 12:25 pm, Extreme Tech:
"Hitachi, seemingly in defiance of the weather gods, has launched the world’s largest 3.5-inch hard drive: the monstrous 4TB Deskstar 5K. With a rotational speed of 5,900RPM, a 6Gbps SATA 3 interface, and the same 32MB of cache as its 2 and 3TB siblings, the 4TB model is basically the same beast — just with four platters instead of two or three. The list price is around $345."
Silverton Consulting, 13-Sep-2011:
shipping over 1TB per 3.5″ disk platter, using 569Gb/sq-in technology

Monday, December 19, 2011

"Missed by _that_ much": Disk Form Factor vs Rack Units

Apologies to 1965 TV series "Get Smart" and the catch-phrase "Missed by that much" (with a visual indication of a near-miss).

This is a lament, not a call to action or grumble. Standards are necessary and good.
We have two standards that we just have to live with now: too many devices depend on them for a change to be practical. Unlike the "imperial" to metric conversion, there would be few discernible benefits.

There's a fundamental mismatch between the Rack Unit (1.75 inches) or the vertical space allowed for equipment in 19 inch Racks (standard EIA-310) and the Disk form factors of  5.25, 3.5 and 2.5 inches defined by the Small Form Factor Committee.

There is no way to mount a standard disk drive (3.5 or 2.5 inch) exactly in a Rack. There are various amounts of wasted space.
Originally, "full-height" 5.25 inch drives could be mounted horizontally exactly in 2 Rack Units (3.5 inches), three abreast.

The "headline" size of the form-factor is the notional size of the platters or removable media.
The envelope allows for the enclosure.

So whilst "3.5 inch" looks like a perfect multiple of the 1.75 inch Rack Unit, a "3.5 inch" drive is  around 0.5 inch larger.
Manufacturers of vertical-mount "hot-swap" drive carriers allow around 1mm on the thinnest dimension, 9mm on the "height" and 1.5 inches (42-43mm) on the longest dimension (depth).

A guess at the dimensions of hot-swap housings:
1/32 in (0.8mm) or 1mm sheet metal could be used between drives (upright)
and 1.5-2mm sheet metal would be needed to support the load (with an upturned edge?).

In total, around 0.5 inch (12.5mm) might need to be allowed vertically for supporting structures.
An ideal Rack Unit size for the "3.5 inch" drive form-factor would be 4.5 inches.

Or, "3.5 inch" drives could be 3.00 - 3.25 inches wide to fit exactly in 2 Rack Units.

Different manufacturers approach this problem differently:
  • Copan/SGI and Backblaze mount 3.5 inch drives vertically in 4 Rack Units (7 inches).
    Both of these solutions aim for high-density packing: 28 and 11.25 drives per Rack Unit respectively.
    • Copan, via US Patent # 7145770, uses 4U hot-swap "canisters" that store 14 drives in 2 rows, with 8 canisters per "shelf" (112 drives/shelf). In a 42 U rack, they can house 8 shelves, for 896 drives per Rack. Their RAID system is 3+1, with max 5 spares per shelf, yielding 79 data drives per shelf, and 632 drives per Rack.
      These systems are designed specifically to hold archival data, with up to 25% or 50% of drives active at any one time, as "MAID": Massive Array of Idle Disks.
    • Backblaze are not a storage vendor, but have made their design public with a hardware vendor able to supply cases and pre-built (but not populated) systems.
      Their solution, fixed-disks not hot-swap, is 3 rows of 15 disks mounted vertically, sitting on their connectors. The Backblaze systems include a CPU and network card and are targeted at providing affordable and reliable on-line Cloud Backup services [and are specifically "low performance"]. Individual "storage pods" do not supply "High Availability", there is little per-unit redundancy. Like Google, Backblaze rely on whole-system replication and software to achieve redundancy and resilience.
  • Most server and storage appliance vendors use "shelves" of 3 Rack Units (5.25 inches), but fit 13-16 drives across the rack (~17.75 inches or 450mm) depending on their hot-swap carriers.
  • "2.5 inch" drives fitted vertically (2.75 inch) need 2 Rack Units (3.5 inches). Most vendors fit 24 drives across a shelf. "Enterprise class" 2.5 inch drives are typically 12.5 or 15 mm thick.
Another possibility, not widely pursued, is to build disk housings or shelves that don't exactly fit the EIA-310 standard Rack Units. Unfortunately, the available internal width of 450mm cannot be varied.



The form factors:
"5.25 inch": (5.75 in x 8 in x 1.63 in =  146.1 mm x 203 mm x 41.4 mm)
"3.5 inch"  : (4 in x 5.75 in x 1 in =  101.6 mm x 146.05 mm x 25.4 mm)
"2.5" inch  : (2.75 in x 3.945 in x 0.25-0.75 in = 69.85 mm x 100.2 mm x [7, 9.5, 12.5, 15, 19] mm)
Old disk height form factors, originating in 5.25 inch disks circa mid-1980's.
low-profile = 1 inch.
Half-height = 1.63 inch.
Full-height = 3.25 inch. [Fitting well into 2 Rack Units]

Wednesday, December 14, 2011

"Disk is the new Tape" - Not Quite Right. Disks are CD's

Jim Gray, in recognising that Flash Memory was redefining the world of Storage, famously developed between 2002 and 2006 the view that:
Tape is Dead
Disk is Tape
Flash is Disk
RAM Locality is King
My view is that: Disk is the new CD.

Jim Gray was obviously intending that Disk had replaced Tape as the new backup storage medium, with Flash Memory being used for "high performance" tasks. In this he was completely correct. Seeing this clearly and enunciating it a decade ago was remarkably insightful.

Disks do both the Sequential Access of Tapes and Random I/O.
In the New World Order of Storage, they can be considered functionally identical to Read-Write Optical disks or WORM (Write Once, Read Many) media.

As the ratio between access time (seek or latency) and sequential transfer rate, or throughput, continues to change in favour of capacity and throughput, managing disks becomes more about running them in "Seek and Stream" mode than doing Random I/O.

With current 1TB disks, the sequential scan time (capacity ÷ sustained transfer rate) [1,000GB / 1Gbps] is 2-3 hours. However, reading a disk with 4KB random I/O's at ~250/sec (4msec avg. seek), the type of workload a filesystem causes, gives an effective throughput of around 1MB/sec, roughly 125 times slower than a sequential read.
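
A minimal sketch of that comparison, with the assumptions spelled out (1TB capacity, ~125MB/sec sustained streaming, 4KB I/Os at ~250 IOPS); the exact figures will vary by drive.

```python
def scan_hours(capacity_gb, stream_mb_s):
    """Hours to read a whole drive sequentially ("seek and stream")."""
    return capacity_gb * 1000 / stream_mb_s / 3600

def random_read_hours(capacity_gb, io_kb=4, iops=250):
    """Hours to read a whole drive with small random I/Os (filesystem-style)."""
    effective_mb_s = io_kb * iops / 1000           # ~1 MB/sec for 4KB @ 250 IOPS
    return capacity_gb * 1000 / effective_mb_s / 3600

capacity_gb, stream_mb_s = 1000, 125               # 1TB drive, ~1Gbps sustained
print(scan_hours(capacity_gb, stream_mb_s))        # ~2.2 hours, sequential
print(random_read_hours(capacity_gb))              # ~280 hours via 4KB random I/O
print(stream_mb_s / (4 * 250 / 1000))              # ~125x slower than streaming
```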

It behoves system designers to treat disks as faster RW Optical Disk, not as primary Random IO media, and as Jim Gray observed, "Flash is the New Disk".

The 35TB drive (of 2020) and Using them.

What's the maximum capacity possible in a disk drive?

Kryder, 2009, projects 7TB/platter for 2.5 inch platters will be commercially available in 2020.
[10Tbit/in² demo by 2015 and $3/TB for drives]

Given that prices of drive components are driven by production volumes, in the next decade we're likely to see the end of 3.5 inch platters in commercial disks with 2.5 inch platters taking over.
The fifth-power relationship between platter-size and drag/power-consumed also suggests "Less is More". A 3.5 inch platter needs 5+ times more power to twirl it around than a 2.5 inch platter - the reason that 10K and 15K drives run the smaller platters: they already use the same media/platters for 3.5 inch and 2.5 inch drives.

Sankar, Gurumurthi, and Stan in "Intra-Disk Parallelism: An Idea Whose Time Has Come" ISCA, 2008, discuss both the fifth-power relationship and that multiple actuators (2 or 4) make a significant difference in seek times.

How many platters are fitted in the 25.4 mm (1 inch) thickness of a 3.5 inch drive's form-factor?

This report on the Hitachi 4TB drive (Dec, 2011) says they use 4 * 1TB platters in a 3.5 inch drive, with 5 possible.

It seems we're on-track to at least meet the Kryder 2020 projection, with the areal density for 6TB per 3.5 inch platter already demonstrated using 10nm grains enhanced with Sodium Chloride.

How might those maximum capacity drives be lashed together?

If you want big chunks of data, then even in a world of 2.5 inch componentry, it still makes sense to use the thickest form-factor around to squeeze in more platters. All the other power-saving tricks of variable-RPM and idling drives are still available.
The 101.6mm [4 inch] width of the 3.5 inch form-factor allows 4 to sit comfortably side-by-side in the usual 17.75 inch wide "19 inch rack", using just more than half the 1.75 inch height available.

It makes more sense to make a half-rack-width storage blade, with 4 * 3.5 inch disks (2 across, 2 deep) with a small/low-power CPU, a reasonable amount of RAM and "SCM" (Flash Memory or similar) as working-memory and cache and dual high-speed ethernet, infiniband or similar ports (10Gbps) as redundant uplinks.
SATA controllers with 4 drives per motherboard are already common.
Such "storage bricks", to borrow Jim Grays' term, would store a protected 3 * 35Tb, or 100TB per unit, or 200Tb per Rack Unit (RU). A standard 42RU rack, allowing for a controller (3RU), switch (2RU), patch-panel (1RU) and common power-supplies (4RU), would have a capacity of 6.5PB.

Kryder projected a unit cost of $40 per drive, with the article suggesting 2 platters/drive.
Scaled up, ~$125 per 35TB drive, or ~$1,000 for 100TB protected ($10/TB) [$65-100,000 per rack]
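
The arithmetic behind those numbers, as a sketch; the drive counts, rack overheads and per-drive price are the assumptions stated above, and the brick price here covers drives only (the ~$1,000 figure allows for CPU, RAM, SCM and network ports on top).

```python
# Assumed "storage brick": 4 * 35TB drives, 3 data + 1 parity, 2 bricks per RU.
drives_per_brick = 4
drive_tb         = 35
protected_tb     = (drives_per_brick - 1) * drive_tb        # 105 TB, the "~100TB"
bricks_per_ru    = 2
usable_ru        = 42 - (3 + 2 + 1 + 4)                     # 32 RU left for bricks

tb_per_ru = protected_tb * bricks_per_ru                    # ~210 TB per RU
rack_pb   = tb_per_ru * usable_ru / 1000                    # ~6.7 PB per rack

drive_cost = 125                                            # scaled from Kryder's $40
print(protected_tb, tb_per_ru, round(rack_pb, 1), drives_per_brick * drive_cost)
# -> 105 TB protected per brick, ~210 TB/RU, ~6.7 PB/rack, ~$500 of drives per brick
```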

The "scan time" or time-to-populate a disk is the rate-limiting factor for many tasks, especially RAID parity rebuilds.
For a single actuator drive using 7TB  platters and streaming at 1GB/sec, "scan time" is a daunting 2 hours per platter: At best 10 hours to just read a 35TB drive.

Putting 4 actuators in the drive cuts scan time to 2-2.5 hours, with some small optimisations.

While not exceptional, it compares favourably with the 3-5 hour minimum currently reported for 1TB drives.

But a single-parity drive won't work for such large RAID volumes!

Leventhal, 2009, in "Triple Parity and Beyond", suggested that the UER (Unrecoverable Error Rate) of large drives would force parity-group RAID implementations to use a minimum of 3 parity drives to achieve a 99.2% probability of a successful (Nil Data Loss) RAID rebuild following a single-drive failure. Obviously, triple parity is not possible with only 4 drives.

The extra parity drives are NOT to cover additional drive failures (this scenario is not calculated), but to cover read errors, with the assumption that a single error invalidates all data on a drive.

Leventhal uses in his equations:
  •  512 byte sectors,
  • 1 in 10^16 probability of UER,
  • hence roughly one unreadable sector per 20 billion sectors (10TB) read, or
  • roughly 10 unreadable sectors per 200 billion sectors (100TB) read.
Already, drives are using 4KB sectors (with mapping to the 'standard' 512-byte sectors) to achieve better (lower) UERs. The calculation should be done with the native disk sector size.

If platter storage densities are increased 32-fold, it makes sense to similarly scale up the native sector size to decrease the UER. There is a strong case for 64-128KB sectors on 7TB platters.

Recasting Leventhal's equations with:
  • 100TB to be read,
  • 64KB native sectors,
  • hence 1.5625 * 10^9 native sectors to be read (against today's nominal UER of 1 in 10^16).
What UER would enable a better than 99.2% probability of reading 1.5 billion native sectors?
First approximation is 1 in 10^18 [confirm].
Zeta claims a UER better than 1 in 10^58. It is possible to do much better.
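
A back-of-envelope check of that first approximation, assuming independent, uniformly distributed unrecoverable errors. Whether the quoted UER is per bit or per native sector read makes a big difference, so both are shown; this is a sketch, not a substitute for Leventhal's full treatment.

```python
sectors = 100e12 / 64e3        # 100TB expressed as 64KB (decimal) native sectors
target  = 0.992                # desired probability of an error-free full read

# Require (1 - p)^sectors >= target, so p <= 1 - target**(1/sectors)
p_sector = 1 - target ** (1 / sectors)
print(f"max UER: about 1 in {1 / p_sector:.1e} native sectors read")   # ~1 in 2e11

p_bit = p_sector / (64e3 * 8)  # the same limit expressed per bit read
print(f"max UER: about 1 in {1 / p_bit:.1e} bits read")                # ~1 in 1e17
```

On these assumptions the per-bit requirement works out nearer 1 in 10^17 than 1 in 10^18, but either way it is well beyond today's nominal 1 in 10^16.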

Inserting Gibson's "horizontal" error detection/correction (extra redundancy on the one disk) is around the same overhead, or less. [do exact calculation].


Rotating parity or single-disk parity RAID?

The reasons to rotate parity around the disks are simple - avoiding "hot-spots": otherwise the full parallel IO bandwidth possible over all disks is reduced to just that of the parity disk. NetApp neatly solve this problem with their WAFL (Write Anywhere File Layout).

In order to force disks into mainly sequential access, "seek then stream", writes shouldn't simply be cached and written through to HDD, but held in SCM/Flash until writes have quiesced.

The single parity-disk problem only occurs on writes. Reading, in normal or degraded mode, occurs at equal speed.

If writes across all disks are stored then written in large blocks, there is no IO performance difference between single-parity disk and rotating parity.

Tuesday, December 13, 2011

Revolutions End: Computing in 2020

We haven't reached the end of the Silicon Revolution yet, but "we can see it from here".

Why should anyone care? Discussed at the end.

There are two expert commentaries that point the way:
  • David Patterson's 2004 HPEC Keynote, "Latency vs Bandwidth", and
  • Mark Kryder's 2009 paper in IEEE Magnetics, "After Hard Drives—What Comes Next?"
    [no link]
Kryder projected the current expected limits of magnetic recording technology in 2020 (2.5": 7TB/platter) and how another dozen technologies will compare, but there's no guarantee. Some unanticipated problem might, as happened with CPU's, derail Kryder's Law (disk capacity doubling every year) before then.
We will get an early "heads-up": by 2015 Kryder expects 7TB/platter to be demonstrated.

This "failure to fulfil the roadmap" has happened before: In 2005 Herb Sutter pointed out that 2003 marked the end of Moore's Law for single-core CPU's in "The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software". Whilst Silicon fabrication kept improving, CPU's hit a "Heat Wall" limiting the clock-frequency, spawning a new generation of "multi-core" CPUs.

IBM with its 5.2GHz Z-series processors and gamers "over-clocking" standard x86 CPUs showed part of the problem was a "Cooling Wall". This is still to play out fully with servers and blades.
Back to water-cooling, anyone?
We can't "do a Cray" anymore and dunk the whole machine in a vat of Freon (a CFC refrigerant, now banned).

Patterson examines the evolution of four computing technologies over 25 years from ~1980 and the increasing disparity between "latency" (like disk access time) and "bandwidth" (throughput):
  • Disks
  • Memory (RAM)
  • LANs (local Networking)
  • CPUs
He  neglects "backplanes", PCI etc, Graphic sub-systems/Video interfaces and non-LAN peripheral interconnection.

He argues there are 3 ways to cope with "Latency lagging Bandwidth":
  • Caching (substitute different types of capacity)
  • Replication (leverage capacity)
  • Prediction (leverage bandwidth)
Whilst Patterson doesn't attempt to forecast the limits of technologies as Kryder does, he provides an extremely important and useful insight:
If everything improves at the same rate, then nothing really changes
When rates vary, real innovation is required
In this new milieu, Software and System designers will have to step-up to build systems that are effective and efficient, and any speed improvements will only come from better software.

There is an effect that will dominate bandwidth improvement, especially in networking and interconnections (backplanes, video, CPU/GPU and peripheral interconnects):
the bandwidth-distance product
This affects both copper and fibre-optic links. Using a single technology, a 10-times speed-up shortens the effective distance 10-times. This is well known in transmission line theory.

For LANs to go from 10Mbps to 100Mbps to 1Gbps, higher-spec cable (Cat 3, Cat 5, Cat 5e/6) had to be used. Although 40Gbps and 100Gbps Ethernet have been agreed and ratified, I expect these speeds will only ever be Fibre Optic. Copper versions will either be very limited in length (1-3m) or use very bulky, heavy and expensive cables: worse in every dimension than fibre.

See the "International Technology Roadmap for Semiconductors" for the expert forecasts of the underlying Silicon Fabrication technologies, currently out to 2024. There is a lot of detail in there.

The one solid prediction I have is Kryder's 7TB/platter:
a 32-times increase in bit-areal density, or 5 doublings of capacity.
This implies the transfer rate of disks will increase 5-6 times, given there's no point in increasing rotational speed, to roughly 8Gbps. Faster than "SATA 3.0" (6Gbps) but within the current cable limits. Maintaining the current "headroom" would require a 24Gbps spec - needing a new generation of cable. The SATA Express standard/proposal of 16Gbps might work.
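
The reasoning behind that 5-6 times figure, sketched out: at constant RPM the sustained transfer rate tracks linear bit density (bits passing under the head per revolution), which grows only with the square root of the areal density increase. The ~1.5Gbps baseline is an assumption about current sustained media rates.

```python
areal_gain  = 32                      # 5 doublings of areal density by 2020
linear_gain = areal_gain ** 0.5       # transfer rate scales with bits per track
print(round(linear_gain, 2))          # ~5.66x

baseline_gbps = 1.5                   # assumed current sustained media rate
print(round(baseline_gbps * linear_gain, 1))   # ~8.5 Gbps, past SATA 3.0's 6 Gbps
```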

There are three ways disk connectors could evolve:
  • SATA/SAS (copper) at 10-20Gbps
  • Fibre Optic
  • Thunderbolt (already 2 * 10Gbps)
Which type comes to dominate will be determined by the Industry, particularly the major Vendors.

The disk "scan time" (to fully populate a drive) at 1GB/sec, will be about 2hours/platter. Or 6 hours for a 20Tb laptop drive, or 9 hours for a 30Tb server class drive. [16 hours if 50TB drives are packaged in 3.5" (25.4mm thick) enclosures].  Versus the ~65 minutes for a 500Gb drive now.

There is one unequivocal outcome:
Populating a drive using random I/O, as we now do via filesystems, is not an option. Random I/O is 10-100 times slower than streaming/sequential I/O. It's not good enough to take a month or two to restore a single drive, when 1-24 hours are the real business requirements.

Also, for laptops and workstations with large drives (SSD or HDD), they will require 10Gbps networking as a minimum. This may be Ethernet or the much smaller and available Thunderbolt.

A caveat: This piece isn't "Evolution's End", but "(Silicon) Revolution's End". Hardware Engineers are really smart folk; they will keep innovating and providing Bigger, Faster, Better hardware. Just don't expect the rates of increase to be nearly as fast. Moore's Law didn't get repealed in 2003, the rate-of-doubling changed...


Why should anyone care? is really: Who should care?

If you're a consumer of technology or a mid-tier integrator, very little of this will matter. In the same way that now when buying a motor vehicle, you don't care about the particular technologies under the hood, just what it can do versus your needs and budget.

People designing software and systems, the businesses selling those technology/services and Vendors supplying parts/components hardware or software that others build upon, will be intimately concerned with the changes wrought by Revolutions End.

One example is provided above:
 backing up and restoring disks can no longer be a usual filesystem copy. New techniques are required.

Wednesday, December 07, 2011

RAID: Something funny happened on the way to the Future...

With apologies to Stephen Sondheim et al, "A Funny Thing Happened on the Way to the Forum", the book, 1962 musical and 1966 film.

Contents:

Summary:
Robin Harris of "StorageMojo" in "Google File System Eval", June 13th, 2006, neatly summarises my thoughts/feelings:
As regular readers know, I believe that the current model of enterprise storage is badly broken.
Not discussed in this document is The Elephant in the Room, or the new Disruptive Technology: Enterprise Flash Memory or SSD (Solid State Disk). It offers (near) "zero latency" access and random I/O performance 20-50 times cheaper than "Tier 1" Enterprise Storage arrays.

Excellent presentations by Jim Gray about the fundamental changes in Storage are available on-line:
  • 2006 "Flash is good": "Flash is Disk, Disk is Tape, Tape is dead".
  • 2002 "Storage Bricks". Don't ship tapes or even disks. Courier whole fileservers, it's cheaper, faster and more reliable.

[Top]
The current state of the Enterprise Storage market: Pricing and Budget Impact.

  • The promise inherent in the first RAID paper's title (Inexpensive Disks) doesn't seem to be met.
  • Are there other challenges, limits or oddities?
Budget Impact and per-GB pricing
A new entrant, Coraid, talking-up the benefits of its solution lays out some disturbing statistics:
With storage cost consuming 25% to 45% of IT budgets ... (ours) offer up to a 5-8x price-performance advantage over legacy Fibre Channel and iSCSI solutions.
Vendor Gross Margins
A commentary on one of the 6-7 dominant players (EMC, IBM, Network Appliance, Hewlett-Packard, Hitachi Data Systems, Dell, SUN/Oracle) who control 70-80% of the market by revenue: [compare to gross margins on Intel servers of 20-30%]:
Committing to massive disruption of the storage and networking businesses by moving to 40% gross margins from the current 60+% range through the use of volume and scale out technologies, open source software and targeted innovation leveraging the latest technology.
Can HP compete with 40% gross margins? Well, Apple has done pretty well. [Apple is known for 'premium pricing' and the best returns in the industry.]
A large, mature market
The market is large and growing, according to IDC:
... end-user spending on enterprise storage systems reached $30.8 billion in 2010, and the 18% growth over 2009 was the highest since IDC started tracking the market in 1993.
The enterprise storage systems market will grow at a comfortable 3.9% compound annual growth rate (CAGR) between 2010 and 2015 ...
Poor utilisation of raw disk capacity?
IBM, touting the benefits of its XIV Storage, claims massive waste by others' systems:
Overall, we find that the reliability attributes of the system limit the net capacity of a system to 46%-84% depending, for the most part, on the RAID configuration. [46%  for a RAID-1 (mirror) config. 84% for RAID-5 (parity/check disk)]

Overall, we see that, by virtue of its built-in efficiency, the XIV system uses 100% of its net capacity, compared with an estimated 28-61% net capacity used by comparable systems. [IBM list 3 types of "wasted space": Orphaned/Unreclaimable Space, Full Backups, Thick Provisioning. Clones and Backups are replaced by 'snapshots' in XIV. IBM neglect filesystem/database 'slack space' within allocated storage.]
The combined effect of the reliability and efficiency attributes is such that, on average, a traditional storage system using mirroring effectively uses less than 21% of its raw capacity (37% when using RAID-5). An XIV system uses approximately 46% of its raw capacity.
In the absence of good Operational Expenditure [OpEx] data, a guess
There is speculation that the Operational costs of 'Tier 1' (high-performance, high-availability, most expensive) storage are very high, but no good figures are available [Compare this recurrent cost to the retail price of 1TB SATA disks of ~$100, while fast, durable SAS drives, e.g. 146-160GB, are $200-300, and 300-600GB are ~$400]:
With an estimated cost of Tier 1 storage at around $8,000 per TB per year,
Indirect Total Cost of Ownership [TCO] estimates
Hitachi Data Systems lays out the costs ownership and product conversion/data migration when pushing the benefits of their "virtual storage" architecture:
Forrester 2007 research has shown that in excess of 70 percent of enterprise IT budgets is devoted to maintaining existing infrastructure.
Migration project expenditures are on average 200 percent of the acquisition cost of enterprise storage.
 With an average of four years useful life,  the annual operating expenses associated to migration represent ~50 percent of acquisition cost.
  • Enterprise storage migration costs can exceed US$15,000 per terabyte migrated. [implying a 2007(?) acquisition cost of $7,500/TB.]
  • For example, an average FORTUNE 1000® company has an average of  800TB of network attached storage (NAS) and  nearly 3PB of storage (InfoPro Wave 12–Q2, 2009)  with, on average, 300TB per storage system
  • As the useful life of most storage systems is three to five years, ...
Current admin challenges
The size and complexity of current Enterprise Storage solutions, and the resulting administrator workload, has provoked comments along the lines quoted below. System complexity increases combinatorially as additional layers are added and heterogeneous systems and networks are interfaced. This increases admin workload, task difficulty and execution times, consequently increasing preventable faults and errors.
Today, storage is the single most complex and expensive component in the virtualized data center.
How do Enterprises choose between Vendors and Products?
Many commentators assert that Storage is bought on a single metric: Price per GB.
Comparing other important performance metrics, {latency, IO/second, throughput MB/sec} for "random" and "sequential" I/O workloads is being addressed by the Storage Performance Council, with their SPC1 and SPC2 specifications. For those vendors who choose to participate and publish data on their systems, it provides an "Apples and Apples" comparison for potential customers: including system pricing and discount information.

But there's a problem: system price is unrelated to the cost of drives.
Overwhelmingly, the price of the raw disk drives is an almost insignificant fraction of the purchase price of "Tier 1" Storage arrays. Competitors claim that the dominant players price their disks at up to 30 times the normal retail price, hiding the true cost of the infrastructure wrapped around the drives. Vendors often load special firmware in their drives to prevent substitution. [SPC1 and SPC2 detailed pricing confirms list prices of $1,000-2,000 per drive, 5-20 times retail prices.]

This pricing practice distorts customer system specifications: customers purchase far fewer drives, creating additional administrative work in managing allocated disk space and achieving target performance levels.

What is missing is good data on: Price / GB-available-to-Applications.
[Ignoring all the other overheads and "slack space" for Logical Volume Managers and Operating Systems.]

With Enterprise Storage systems, this is at least 50-100 times the raw cost of drives.

The lesson: Optimising the utilisation of the cheapest and paradoxically least available resource in a system is poor practice.

Meanwhile, there are significant technical and performance issues looming in the world of Enterprise Storage:
  • In Triple Parity RAID and Beyond, Adam Leventhal, ACM Queue, Dec 2009, suggests that by 2020 three parity drives will be needed to achieve a 99.2% probability of a successful RAID rebuild recovering from a single drive failure. Multi-parity drives introduce new problems:
    • increases Price per GB (more drives for same capacity),
    • reduces write performance (1P = 4 IO, 2P = 6 IO, 3P = 8 IO, NP = 2*(N+1) IO; see the sketch after this list)
    • increases compute intensity for parity calculations (1P uses trivial 'XOR')
    • increases system complexity in efforts to compensate for performance, such as delayed parity writes and caching.
    • system robustness and durability is adversely affected by increased component count and software complexity.
  • RAID rebuilds severely affect access times and throughput and now take from 3-24 hours, up from "minutes" in the first systems. Documented in "Comparison Test: Storage Vendor Drive Rebuild Times and Application Performance Implications", Feb 18, 2009, Dennis Martin. There are anecdotal reports of RAID rebuilds taking up to a week, degrading performance of all tasks and leaving organisations with unprotected data for the duration.
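
The small-write penalty quoted in the list above follows from having to read and re-write the data block plus every parity block; a minimal sketch, assuming read-modify-write with no caching or full-stripe writes:

```python
def small_write_ios(parity_drives):
    """I/Os per small (read-modify-write) update in an N-parity RAID group:
    read old data + N old parity blocks, then write new data + N new parity."""
    return 2 * (parity_drives + 1)

for n in (1, 2, 3):
    print(f"{n}-parity: {small_write_ios(n)} I/Os per small write")
# 1-parity: 4, 2-parity: 6, 3-parity: 8, matching the figures above.
```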


[Top]
Disk Drives: Characteristics and Evolution.

  • The architecture and organisation of Enterprise Storage Systems are driven by Usage demands and the underlying storage components.
  • What's gone before and what might be coming?
In "A Converstaion with Jim Gray", ACM Queue, 2003, the evolution and limits of Hard Disk Drive (HDD) technology is discussed, projected to occur around 2020:
But starting about 1989, disk densities began to double each year. Rather than going slower than Moore’s Law, they grew faster. Moore’s Law is something like 60 per-cent a year, and disk densities improved 100 percent per year.

Today disk-capacity growth continues at this blistering rate, maybe a little slower. But disk access, which is to say, “Move the disk arm to the right cylinder and rotate the disk to the right block,” has improved about tenfold. The rotation speed has gone up from 3,000 to 15,000 RPM, and the access times have gone from 50 milliseconds down to 5 milliseconds. That’s a factor of 10. Bandwidth has improved about 40-fold, from 1 megabyte per second to 40 megabytes per second. Access times are improving about 7 to 10 percent per year. Meanwhile, densities have been improving at 100 percent per year.

At the FAST [File and Storage Technologies] conference about a year-and-a-half ago, Mark Kryder of Seagate Research was very apologetic. He said the end is near; we only have a factor of 100 left in density—then the Seagate guys are out of ideas. So this 200-gig disk that you’re holding will soon be 20 terabytes, and then the disk guys are out of ideas. [now revised to 14TB @ $40 for a 2.5 inch drive]
The definitive paper on the limits and evolution of hard disk technology, including a comparison with a dozen other prospective technologies, by Mark Kryder:

"After Hard Drives—What Comes Next?" Kryder and Chang Soo Kim. IEEE  Magnetics, Oct 2009.
Assuming HDDs continue to progress at the pace they have in the recent past, in 2020 a two-disk, 2.5-in disk drive [2 platters] will be capable of storing over 14 TB and will cost about $40.
 Given the current 40% compound annual growth rate in areal density, this technology should be in volume production by 2020. [Expect a demonstration in 2015]
In 2005, Scientific American discussed his work and described "Kryder's Law": in recent years HDD capacity has doubled every year, outstripping Moore's Law for CPU speed.

Sankar, Gurumurthi and Stan, ISCA 2008, describe the relationship of power consumption to RPM and platter size necessary to understand current drive design [reformatted]:
Since the power consumption of a disk drive is
proportional to the fifth-power of the platter size,
is cubic with the RPM, and
is linear with the number of platters... [citing a 1990 IEEE paper]
The external form-factor defines capacity of current HDD's in surprising ways:
  • The 2.5 inch form-factor allows thickness to vary between 7 mm and 19mm, though 19mm is now unusual.
    • Compare to the 25.4mm (1 inch) thickness of 3.5" drives.
    • Consumer drives, in laptops and PC's, are now normally 9.5mm, with some 7mm. [single platter?]
    • Enterprise drives are 15mm, allowing higher capacities by including more platters. [2-3?]
    • In 2020, Enterprise 2.5 inch drives will be 2 or 3 platters, hence  14-21TB.
  • The maximum platter size is not used in every drive. 
    • For consumer drives and high-capacity/low-energy Enterprise drives, the largest platters possible are used.
    • For high-speed (10,000 [10K] and 15,000 RPM [15K]) drives, the same size platters are used in both 3.5 inch and 2.5 inch drives. This reduces power consumption and seek time through reduced head travel.
    • Vendors sell 3.5 inch 15K drives because they can fit more platters in the 25.4 mm vs 15 mm form factor. [4 platters are common, 5 platters are "possible"]
      "More than an interface — SCSI vs. ATA", Anderson, Dykes, Riedel, Seagate, FAST 2003.
A fundamental characteristic of HDD's, "access density", or IOPS/GB, and its importance and evolution, as discussed in 2004 by a disk drive manufacturer. [Last row added to the table.]

                         1987         2004             times increase
CPU Performance          1 MIPS       2,000,000 MIPS   2,000,000 x
Memory Size              16 Kbytes    32 Gbytes        2,000,000 x
Memory Performance       100 usec     2 nsec           50,000 x
Disc Drive Capacity      20 Mbytes    300 Gbytes       15,000 x
Disc Drive Performance   60 msec      5.3 msec         11 x
Disc Drive Bandwidth     0.6 MB/sec   86 MB/sec        140 x   [Patterson 2004]

Figure 2. Disc drive performance has not kept pace with other components of the system.
Last row data from 2004 Patterson address on "Latency and Bandwidth", covering 20+ years evolution of all system components.

The vendor table omits "transfer rate" or "bandwidth", necessary to calculate the other fundamental characteristic of HDD's, described by Leventhal [ACM Queue 2009], disk "scan time":
By dividing capacity by throughput, we can compute the amount of time required to fully scan or populate a drive.
This storage analyst discussion of access density, cost and capacity using SPC benchmarks delves into more fine detail on this and related topics.

RAID array IO/sec performance suffers through the declining access density, which can be addressed through higher RPM drives, "short-stroking"/"de-stroking" (using only the outer ~20% of each drive) or provisioning the number of drives ('spindles') not on capacity but on desired IO/sec. (Most designers and storage architects would consider this as "over-provisioning" and unnecessarily expensive.)

RAID rebuild times cannot proceed faster than the individual drive scan time, while in normal RAID-5 configurations the total amount of data read for an (N data + 1 parity) group is N times the drive size. The minimum rebuild time is therefore (N * disk-size ÷ common-channel speed), the parity-group scan time, analogous to single-drive scan time. The networking concept of "over-subscription", the ratio of uplink to total downlink capacity (1:1 is ideal but expensive; higher ratios cause congestion in high-performance environments like server rooms), can be applied to the common channel and the drives it supports.
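
A sketch of that lower bound, assuming an (N data + 1 parity) group rebuilt over a shared channel; the group size, drive capacity and channel speed are illustrative only.

```python
def rebuild_floor_hours(n_data, drive_tb, channel_gb_s):
    """Lower bound on rebuild time: every surviving data drive must be read
    in full across the common channel to reconstruct the failed drive."""
    return n_data * drive_tb * 1000 / channel_gb_s / 3600

# 7 data + 1 parity group of 2TB drives behind a ~0.6GB/sec shared channel:
print(rebuild_floor_hours(7, 2, 0.6))    # ~6.5 hours, before any competing load
# Halve the over-subscription (a ~1.2GB/sec channel) and the floor halves too:
print(rebuild_floor_hours(7, 2, 1.2))    # ~3.2 hours
```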


[Top]
The Early View
  • Why are we now facing these problems?
  • Is this the future that the instigators of RAID foresaw?
In 1987/8 when Patterson, Gibson and Katz at UCB (University of California, Berkeley) wrote their industry-changing paper, "A Case for Redundant Arrays of Inexpensive Disks", their group was working on new processing architectures (MPP - Massively Parallel Processing) amongst other things.

Patterson et al describe an application scale-up problem, as CPU speed increases faster than disk speed, and proffer this: "A Solution: Arrays of Inexpensive Disks".
1.2 The Pending I/O Crisis
What is the impact of improving the performance of some pieces of a problem while leaving others the same? Amdahl's answer is now known as Amdahl's Law...

Suppose that some current applications spend 10% of their time in I/O. Then when computers are 10X faster - according to Bill Joy in just over three years - then Amdahl's Law predicts effective speedup will be only 5X. When we have computers 100X faster - via evolution of uniprocessors or by multiprocessors - this application will be less than 10X faster, wasting 90% of the potential speedup.
and
 In transaction-processing situations using no more than 50% of storage capacity, then the choice is mirrored disks (Level 1). However, if the situation calls for using more than 50% of storage capacity, or for supercomputer applications, or for combined supercomputer applications and transaction processing, then Level 5 looks best.
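
The Amdahl arithmetic in that passage checks out; a minimal sketch, with the I/O fraction held at 10% and I/O speed unchanged:

```python
def amdahl_speedup(io_fraction, cpu_speedup):
    """Overall speedup when only the non-I/O fraction of the work is accelerated."""
    return 1 / (io_fraction + (1 - io_fraction) / cpu_speedup)

print(round(amdahl_speedup(0.10, 10), 1))    # ~5.3x ("effective speedup will be only 5X")
print(round(amdahl_speedup(0.10, 100), 1))   # ~9.2x ("less than 10X faster")
```
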
They also suggest that storage arrays will commonly be comprised of very large numbers of disks:
MTTR and thereby increase the MTTF of a large system. For example, a 1000 disk level 5 RAID with a group size of 10 and a few standby spares could have a calculated MTTF of 45 years.
But explicitly pose as unsolved questions, just how this might be accomplished:
  • Can information be automatically redistributed over 100 to 1000 disks to reduce contention? 
  • Will disk controller design limit RAID performance?
  • How should 100 to 1000 disks be constructed and physically connected to the processor? 
  • Where should a RAID be connected to a CPU so as not to limit performance? Memory bus? I/O bus? Cache?
A further insight into the thinking of these pioneers is found in the Introduction of "RAIDframe: A Rapid Prototyping Tool for RAID Systems" by Courtright,  Gibson, et al in 1997. They give a very exact description of how they envision RAID arrays developing [many tiny drives] and why [access time and bandwidth].
Unfortunately, the 1.3 inch drive by HP (C3014 'Kittyhawk') cited was introduced in early 1992 and discontinued due to slow sales at the end of 1994.
The 1 inch drives they forecast did appear in 1999 as the IBM Microdrive, packaged as Compact Flash (CF) cards, and for a number of years they were the largest capacity available in the CF form-factor. They were discontinued in 2006 when Flash Memory overtook them in capacity and Price per GB, never having made it into the Enterprise Storage market.
Wikipedia (Dec 2011) claims that by 2009 all drives smaller than 1.8 inch (used in portable devices) had been discontinued.

Another good source from 1995 is "ISCA’95 Reliable, Parallel Storage Tutorial", Garth Gibson, CMU. Gibson co-authored the original Berkeley RAID paper.

From the "Raidframe" paper:
Further impetus for this trend derived from the fact that smaller-form-factor drives have several inherent advantages over large disks:
  • smaller disk platters and smaller, lighter disk arms yield faster seek operations,
  • less mass on each disk platter allows faster rotation,
  • smaller platters can be made smoother, allowing the heads to fly lower, which improves storage density,
  • lower overall power consumption reduces noise problems.
These advantages, coupled with very aggressive development efforts necessitated by the highly competitive personal computer market, have caused the gradual demise of the larger drives.

 In 1994, the best price/performance ratio was achieved using 3.5 inch disks, and the 14-inch form factor has all but disappeared.

 The trend is toward even smaller form factors:
  •  2.5 inch drives are common in laptop computers [ST9096], and
  •  1.3-inch drives are available [HPC3013].
  •  One-inch-diameter disks should appear on the market by 1995 and  should be common by about 1998. [appeared 1999, only achieved commercial success in Digital Cameras]
  •  At a (conservative) projected recording density in excess of 1-2 GB per square inch [Wood93], one such disk should hold well over 2 GB of data. [got there about 2002]
These tiny disks will enable very large-scale arrays.
 For example, a one-inch disk might be fabricated for surface-mount, rather than using cables for interconnection as is currently the norm,  and thus a single, printed circuit board could easily hold an 80-disk array. [did they mean 8x11in cards? see Gibson's 1995 tutorial, Slide 5/74, for a diagram]

 Several such boards could be mounted in a single rack to produce an array containing on the order of 250 disks.
 Such an array would store at least 500 GB, and even if disk performance does not improve at all between now and 1998,  could service either 12,500 concurrent I/O operations or deliver 1.25-GB-per-second aggregate bandwidth.

The entire system (disks, controller hardware, power supplies, etc.) would fit in a volume the size of a filing cabinet.

To summarize, the inherent advantages of small disks,  coupled with their ability to provide very high I/O performance through disk-array technology,  leads to the conclusion that storage subsystems are and will continue to be, constructed from a large number of small disks,  rather than from a small number of powerful disks.


[Top]
The Gap: Then and Now.
  • Around 25 years on, have the original expectations been met?
  • Has the field and technology developed as they envisioned?
  • What are the unresolved questions?
Whilst EMC released one of the first commercial RAID systems, one of the two commercial systems described in the UCB group's follow-up 1994 paper, "RAID: High-Performance, Reliable Secondary Storage", was the Storage Technology Iceberg 9200, released that year. The Iceberg and the 1989 Berkeley RAID-I prototype used (32 and 28 respectively) 5.25 inch drives. The Berkeley RAID-II, written up in 1994, graduated to 72-144 * 320MB 3.5 inch drives supplied by IBM.

Whilst the main thrust of the 1987/8 paper was recommending the use of the smallest disk drives available, at the time 3.5 inch, this was not the practice, especially in commercial systems.

Even in the beginning, vendors and their customers chose differently, using 5.25 inch drives.
The direct impact on system design was that the Iceberg needed dual-parity to achieve sufficient data-protection, especially for read-errors when rebuilding a RAID parity group after a single-disk failure. The UBER (Unrecoverable Bit Error Rate) of 10 in 10^14 and total parity-group size of 8-16 GB gave an unacceptably high chance of a rebuild failure.

The 5.25 inch drive was the smallest 'enterprise' drive available. The Fujitsu Super Eagle, at 10.5 inches, was discussed in the 1988 paper. At the same time 8 inch SCSI drives were on the market.

In 2005, Herb Sutter wrote a commentary on the end of Moore's Law for single-core CPU's, "The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software". Early in 2003, CPU's hit a "Heat Wall" limiting the clock-frequency. Whilst more transistors could be placed on a CPU die, the clock-frequency (speed) stalled. To improve CPU throughput, designers placed more cores in each CPU, the equivalent of a network connection using more parallel conductors to increase total bandwidth without increasing the speed of an individual connection/conductor.

In contrast to this sudden change, Hard Disk access density has been steadily eroding whole array performance for the last two decades, steadily forcing design changes and increasing complexity.

It's unlikely that when fundamental limits are reached an "IO Performance Wall" will be created for hard disks. Though, like CPU's, some unexpected physical constraint may limit realisable capacities.

The projected maximum areal density of disk drives results in fixed maximum capacities for 2.5 inch and 3.5 inch platters. The Industry Standard form-factors fix the size available, and potentially the number of platters possible in each form factor. With the advent of cheap Flash Memory in SSD's, there is little need for high RPM Hard Drives.

There has been one major form factor conversion, from 5.25 inch to 3.5 inch. Currently 'Tier 1' storage is being sold in both 3.5 inch and 2.5 inch form factors. To understand if 2.5 inch drives will become universal, we need to examine the history of the last conversion and the differences to now.

 Slide 25/74, "Disk Diameter Trends",  of Garth Gibson's ISCA’95 Reliable, Parallel Storage Tutorial covers these transitions:
Decreasing diameter dominated 1980s
• 5.25” created desktop market (16+ GB soon)
• 3.5” created laptop market (4+ GB 1/2 high; 500+ MB 19mm)
• 2.5” dominating laptop market (200+ MB; IBM 720 MB)
• 1.8” creating PCMCIA disk market (80+ MB)

Decreasing diameter trend slowed to a stop
• 1.8” market not 10X 2.5” market
• 1.3” (HP Kittyhawk) discontinued
• vendors continue work on smaller disks to lower access time
It isn't clear when 5.25 inch drives went out of production and Enterprise storage swapped completely to 3.5 inch drives. Seagate's 10GB "Elite 9" ST410800N (the ST410800N manual, static URL) was released in 1994. The cut-over time will have been before the product's end-of-life in 1997/8.
Slide 20/74 of Gibson's 1995 tutorial plots the year of introduction of different drive capacities by drive size. It tracks 5.25 inch drives until 1994 with 10GB introduced, confirming the first estimate.

The reasons will have been complex and many, but at some point all the significant metrics would've favoured 3.5 inch drives and the result was a foregone conclusion.

Metrics that matter to Enterprise Storage vendors:
  • Price per GB
    • Both form factors are in high-volume production, with a small premium for 2.5 inch drives.
    • Laptops and notebooks have outsold desktop PC's for a number of years, at least balancing demand for 2.5 inch drives.
  • GB per cubic-unit (normally per "Rack Unit" (RU) in a "19 inch" rack (1.75 in x 17.75 in x 24-36 in))
    • Standard vertical mounting sees 13-16 3.5 inch  hot-swap  drives in a 3RU space, and
    • 24 2.5 inch drives in a 2RU space.
    • 4.3-5.3 drives/RU vs 12 drives/RU.
    • The ratio of platter area, a close approximation to capacity per platter, is about 2:1 in favour of 3.5 inch, making the capacity-per-RU ratio about 5:6 in favour of 2.5 inch.
    • That is, 2.5 inch drives give roughly 20% higher storage density per RU for a given recording technology and equal platter counts per drive,
    • but a 25.4 mm 3.5 inch drive holds roughly 50% more platters than a 15 mm 2.5 inch drive.
    • For high-RPM drives with identical capacities, 2.5 inch wins;
    • for same-RPM, capacity-optimised drives, 3.5 inch drives store somewhat more per RU.
    • Orienting 3.5 inch drives horizontally, 3-high by 4-wide (12) will fit in 2RU; the best packing is 6 drives/RU.
  • Watts per GB and implied operational costs.
    • In higher RPM drives, Watts per GB currently is comparable (shown above).
    • For 'slow' 7200RPM drives, the fifth-power law of platter diameter to power consumption means 2.5 inch drives consume roughly one-fifth the power of 3.5 inch drives (3-4W vs 15W).
    • Western Digital now sell a line of 'Green' 3.5 inch drives with variable RPM, reducing power demand considerably.
    • Operational costs, e.g. cooling capacity and cost of electricity, over the 5 year service life of equipment (~45,000 hours) may be the deciding factor (worked through in the sketch after this list).
      1,000 * 3.5 inch drives will consume ~650,000 kilowatt-hours vs ~125,000 kWh.
      At $0.25/kWh, that is a lifetime saving of ~$135 per drive.
  • Average access time (based on rotational latency and seek times). 
    • Smaller drives can be spun faster and still consume less power.
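
The operating-cost comparison flagged under "Watts per GB" above, worked through as a sketch. The 15W and 3W figures, 45,000 powered-on hours and $0.25/kWh are the assumptions from the list (the ~650,000 / ~125,000 kWh quoted there imply slightly lower average wattages).

```python
def lifetime_kwh(watts, hours=45_000):
    """Energy drawn by one drive over its assumed 5-year service life."""
    return watts * hours / 1000

fleet = 1000
kwh_35, kwh_25 = lifetime_kwh(15), lifetime_kwh(3)         # ~675 vs ~135 kWh/drive
print(kwh_35 * fleet, kwh_25 * fleet)                      # ~675,000 vs ~135,000 kWh

price_per_kwh = 0.25
saving_per_drive = (kwh_35 - kwh_25) * price_per_kwh
print(saving_per_drive)                                    # ~$135 per drive, as above
```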

[Top]

A little production theory: Why "popular" drives are much cheaper.

Accounting Theory has the concept of the "Learning Curve", also called "Experience Curve".
It originated with Wright in 1936, examining small-scale production and noting that doubling cumulative production reduces unit costs by 10-15%.


Does the Experience Curve scale-up?
If you produce 100,000 times more (16+ doublings), can you realise these benefits all the way?
  • 10%-per-doubling improvement = 18.5% (81.5% cost reduction)
  • 15%-per-doubling improvement =  7.5% (92.5% cost reduction)
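
Those percentages follow from compounding the per-doubling saving over the roughly 16-17 doublings in a 100,000-fold volume increase; a minimal sketch:

```python
import math

def unit_cost_fraction(per_doubling_saving, volume_multiple):
    """Remaining unit cost on a Wright/experience curve: cost falls by a fixed
    percentage with each doubling of cumulative production volume."""
    doublings = math.log2(volume_multiple)
    return (1 - per_doubling_saving) ** doublings

for saving in (0.10, 0.15):
    print(f"{saving:.0%} per doubling -> {unit_cost_fraction(saving, 100_000):.1%} of original cost")
# ~17% and ~7% of the original unit cost; the 18.5% / 7.5% above use exactly 16 doublings.
```
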
Large-scale Silicon chip production has very high "barriers to entry", the cost of building "fabrication plants" is now measured in Billions. As components are made smaller, with tighter tolerances, volume production plants for Hard Disk Drives must be following the same trend. To make "next generation" products with features and tolerances 70% the size of the current technology, every process is affected, needing to be more precise, cleaner and with less vibration. These new technologies don't come cheaply.

These plant costs and the product Research and Development costs have to be amortised across all units produced. For fixed and overhead costs to be a small fraction of the per-unit cost, the number of units now has to be very large: 10-100 million.

In 2011, there are forecast to be 400M+ PC's (desktops and laptops) produced. Laptops, using 2.5 inch drives, passed desktops as the most popular format around 5 years ago. These high volume demands underpin the production of both 3.5 inch and 2.5 inch disk drives.

Volume production counts for a lot.
If there isn't already a high demand for a product, the per-unit price will be considerably inflated to cover "sunk costs": everything involved in creating the product.
Once the plant and Research costs are paid for, per-unit costs can be lowered more.
After a time, the Price/Capacity of old-technology products, even when fully amortised, will be too far above that of new technology, and demand will fall away rapidly.
When demand falls and production runs are too small, the plant becomes uneconomic because fixed-costs (operational/production costs) start to overwhelm the per-unit price. Manufacturers must close plants when they start to make losses: the situation can only get worse, rapidly.

As an example, consider the two machines that Seymour Cray designed for Control Data Corporation (CDC) in the 1960's/70's before leaving to create his own company:
  • CDC 6600  World's fastest computer: 1964-1969.
  • CDC 7600  World's fastest computer: 1969-1975. Reputedly $5M (equivalent to ~$30M now).
Both CDC and Cray himself knew everything there was to know about the ECL and TTL technologies they used, yet by the mid-1990's all the fastest computers were VLSI CMOS: the CPU chips in use today.

The difference between Intel et al and Cray/CDC was volume production.
When you hand-craft 100-1,000 machines, they cost millions.
When you build CPU's by the million, they cost $100-$1,000.
It redefines how you architect and organise super-computers.

This "substitution effect" affects all commodity products.


[Top]

Further Questions

Economics asserts "Price is the Mediator between Supply and Demand".
In a mature, free market, where purchasers have "perfect information" and products are fungible (perfect substitutes for each other),  competition will drive prices down (and consumption of goods will increase) and inefficient producers will be driven from the market.

But the Enterprise Storage market has the very high gross margins associated with new markets or non-substitutable products.

Q: What's happening between vendors and purchasers to produce this skewed market?
  • Are Enterprise Storage products fungible or not?
  • Are secondary effects at work preventing product substitution? [warranty conditions, staff capability, integrated platform management software, decision inertia/product loyalty or technical conservatism.]
  • Are the purchasing criteria "Price/GB", I/O performance, (perceived) Reliability/Availability etc, functionality or something else?


Performance (latency and throughput) of HDD-based Enterprise Storage arrays appears to be rated "good enough" by the market: for 10-15 years many new entrants have attempted to break in by offering low-cost or high-performance products, with limited success.

Q: What "figure of merit" do purchases use to chose between vendors and products? Is it "just" price/GB, because it isn't "performance".
  • Without extensive customer research, this may be unknowable.


There aren't convincing reasons favouring smaller or larger form-factor drives unless high RPM drives or SSD (packaged in 2.5 inch or 1.8 inch) are included.

Q: Why has the take-up of SSD's in Enterprise environments been slow when the price/performance ratios are so far ahead of 'conventional' Storage systems?



Kryder's Law, the on-going compound increase in HDD capacity and decrease in Price/GB, has caused some changes in fundamental ratios:
 Storage Arrays initially lashed together many small devices into "large enough" logical/virtual devices.
 Somewhere in the last 25 years, drive capacity exceeded normal use cases by multiples.

The basic RAID I/O performance drivers for both latency and throughput (many spindles and actuators reduce latency; parallel transfers increase throughput) were invalidated, but the designs seemed not to change.
  • 1988: IBM 3380 7.5GB. Databases and files fitted within this limit.
    • RAID built from 320MB-1GB drives: more spindles, more IO/sec, higher throughput.

  • 2010: 600GB 10/15K, 2TB SATA drives.
    • These units are now (much!) larger than most common Databases and file stores.
    • Not just video, images, audio, scanned stuff. MS-Office documents as well.

Q: Why did Storage arrays not respond to this change in fundamental drivers?
  • Did the change happen so slowly that nobody noticed?
  • Storage vendors are generally very innovative and competitive, employing some of the "best and brightest" in computing.
    This wasn't a failure of ability or capability. Perhaps of vision?
  • Are incumbent vendors locked into their own solutions, leaving innovation to new entrants?
  • Did consumers demand products they were familiar and comfortable with, preventing vendors from changing designs?


Why didn't large numbers of really small HDD's get tried by major vendors, even as an experiment?
IBM, the leader in 1 inch drives, sold its drive business to Hitachi (becoming Hitachi GST), sister company to Hitachi Data Systems (HDS), a leading Storage system vendor with expertise in many related areas.

HDS had the capability to create custom packaging, custom electronics (ASIC's) and to redesign the 1 inch drive format (Compact Flash with IDE). For very small drives to be soldered onto boards, a simple, continuous serial interface was needed. SAS, Serial Attached SCSI, would fit the bill today.

In the 3 years before 1 inch HDD's lost their price advantage to Flash memory, HDS could have built a prototype and proven the concept, but (seemingly) didn't. Nor did any academic projects.

Q: Why did the Industry and Academic researchers not build a "many tiny drives" Array between 2000 and 2005?



To show that very low overhead Storage devices can be built outside Google datacentres, these are the full on-line instructions and Bill of Materials for one service provider's solution:

"Petabytes on a budget: How to build cheap cloud storage", September 1, 2009
"Petabytes on a Budget v2.0: Revealing More Secrets", July 20, 2011
135TB for $7,384, around 50% more than the raw cost of the drives.
A chart in the first piece compares the cost of a petabyte of storage across different solutions:

Raw drives          $81,000   [~660 * 1.5TB drives; 45 drives per pod @ $120 = $5,400]
Backblaze          $117,000   [~50% overhead on raw drives]
Dell MD1000        $826,000
SUN/Oracle X4550 $1,000,000
NetApp FAS-6000  $1,714,000
Amazon S3        $2,806,000
EMC NS-960       $2,860,000

Pix of Storage Pod & component costs.