Saturday, December 24, 2011

Big Drives in 2020

Previously I've written about Mark Kryder's 7TB/platter (2.5 inch) prediction for 2020.
This is more speculation around that topic.

1. What if we don't hit 7TB/platter, maybe only 4TB?

There have been any number of "unanticipated problems" encountered in scaling Silicon and Computing technologies; will more be encountered with HDDs before 2020?

We already have 1TB platters in 3.5 inch announced in Dec-2011, with at least one new technique announced to increase recording density (Sodium Chloride doping), so it's not unreasonable to expect another 2 doublings in capacity, just in taking what's in the Labs and figuring out how to put it into production.

Which means we can expect 2-4TB/platter (2.5 inch) to be delivered in 2020.
At $40 per single-platter disk?
That depends on a) the two major vendors and the oligopoly pricing and b) the yields and costs of the new fabrication plants.

Seems to me that Price/GB will drop, but maybe not to levels expected.
Especially if the rapid decline in SSD's/Flash Memory Price/GB plateaus and removes price competition.

2. Do we need to offer The Full Enchilada to everyone?

Do laptop and ultrabook users really need 4TB of HDD when they are constantly on-line?
1-2TB will store a huge amount of video, many virtual machine images and a lifetime's worth of audio.
There might be a market for smaller capacity disks, either through smaller platters, smaller form-factors or underusing a full-width platter.

Each option has merits.
The final determinant will be perceived consumer Value Proposition, the Price/Performance in the end-user equipment.

3. What will the 1.8 inch market be doing?

If these very small form-factor drives in mobile equipment get to 0.5-2TB, that will seem effectively infinite.

There is no point in adopting old/different platter coatings and head-manufacturing techniques for these smaller form-factors unless other engineering or usability factors come into play: such as sensitivity to electronic noise, contamination, heat, vibration, ...

4. The fifth-power of diameter and cube-of-RPM: impact of size and speed?

2.5 inch drives are set to completely displace 3.5 inch in new Enterprise Storage systems within a year. This is primarily driven by Watts/GB and GB/cubic-space.

The aerodynamic drag of disk platters, hence the power consumed by a drive, varies with the fifth-power of platter diameter and the cube of the rotational velocity (RPM).

If you halve the platter diameter (5.25 inch to roughly 2.5 inch), drive power reduces 32-fold.
If you then double the RPM of the drive (3600 to 7200), power increases 8-fold,
a net reduction in power demand of 4 times.

Changing platter diameter by square-root of 2 (halving the recordable area), the drive power reduction is about 5.7-fold. This is roughly the same proportion for 5.25::3.5 inch, 3.5::2.5 inch and 2.5::1.8 inch.

Reducing a 2.5 inch platter to about 2.1 inches allows a drive to be spun up from 5400 RPM to 7200 RPM whilst using the same drive power, with roughly 70% of the original surface area.
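As a sanity check on the arithmetic above, a few lines of Python (the `power_ratio` helper is purely illustrative):

```python
# Relative drive power: P is proportional to d^5 * rpm^3
# (platter diameter d, spindle speed rpm).

def power_ratio(d_old, rpm_old, d_new, rpm_new):
    """Return new power as a fraction of old power."""
    return (d_new / d_old) ** 5 * (rpm_new / rpm_old) ** 3

# Halve the platter (5.25" -> 2.625"), same RPM: 32-fold reduction.
halved = power_ratio(5.25, 3600, 2.625, 3600)

# Then double the RPM (3600 -> 7200): 8-fold increase, net 4-fold saving.
halved_fast = power_ratio(5.25, 3600, 2.625, 7200)

# Shrink diameter by sqrt(2) (half the area), same RPM: ~5.7-fold reduction.
root2 = power_ratio(3.5, 7200, 3.5 / 2 ** 0.5, 7200)

# Equal-power diameter when spinning a 2.5" platter up from 5400 to 7200 RPM:
# d_new = d_old * (rpm_old / rpm_new) ** (3/5), which comes out near 2.1 inches.
d_equal = 2.5 * (5400 / 7200) ** 0.6

print(halved, halved_fast, root2, d_equal)
```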

Whilst not in the class of Enterprise Storage "performance optimised" drives (10K and 15K RPM), it would be a noticeable improvement for Desktop PC's, given they will also be using large SSD's/Flash Memory in 2020, with the HDD used solely for "Seek and Stream" tasks.

There is very little reason to "de-stroke" drives and limit them to less than full-platter access if they are not "performance-optimised". It's a waste of resource for exactly the same input cost.

5. Will 3.5 inch "capacity-optimised" disks survive?
Will everything be 2.5 or 1.8 inch form-factor?

There are 3 markets that are interested in "capacity-optimised" disks:
  • Storage Appliances [SOHO, SME, Enterprise and Cloud]
  • Desktop PC
  • Consumer Electronics: PVR's etc.
When 1TB 2.5 inch drives are affordable, they will make new, smaller and lighter Desktop PC designs possible. Dell and HP might even offer modules that attach to the 100mm x 100mm "VESA" standard mount on the back of LCD screens. A smaller variant of the Apple Mac Mini is possible, especially if a single power-supply is available.

Consumer PVR's are interested in Price/GB, not Watts/GB. They will be driven by HDD price.
The manufacturers don't pay for power consumed, customers don't evaluate/compare TCO's and there is no legislative requirement for low-power devices.  Government regulation could be the wild-card driving this market.

There's a saying, which I thought was made by Dennis Ritchie, something like this:
 "Memory is Cheap, until you need to buy enough for 10,000 PC's".
[A comment on MS-Windows lack of parsimony with real memory.]

Corporations will look to trimming costs of their PC (laptop and Desktop) fleets, and the PC vendors will respond to this demand.

Storage Appliances:
Already Enterprise and Cloud providers are moving to 2.5 inch form-factor to reduce power demand (Watts/GB) and floor-space footprint (GB/cubic-space).

Consumer and entry-level servers and storage appliances (NAS and iSCSI) are currently mostly 3.5 inch because that has always been the "capacity-optimised" sweet spot.

Besides power-use, the slam-dunk reasons for SOHO and SME users to move to 2.5 inch are:
  • lighter
  • smaller
    • smaller footprint and higher drive count per Rack Unit.
    • more aggregate bandwidth from higher actuator count
    • mirrored drives or better, are possible in a small portable and low-power case.
  • more robust, better able to cope with knocks and movement.
2.5 inch drives may be much better suited to "canister" or (sealed) "drive-pack" designs, such as used by Copan in their MAID systems. This is due to their lighter weight and lower power dissipation.
The 14-drive Copan 3.5 inch "Canister" of 4RU could be replaced by a 20-24 drive 2.5 inch Canister of 3RU, putting 3-4 times the number of drives in the same space.

6. What if there are some unforeseen "drop-deads", like low data-retention rates or hyper-sensitivity to heat, that limit useful capacities to the current 300-600GB/platter (2.5 inch)?

We can't know the future perfectly, so can't say just what surprises lie ahead.
If there is some technical reason why current drive densities are an engineering maximum, we cannot rely on technology advances to automatically reduce the Price/GB each year.

Even if technology is frozen, useful price reductions, albeit minor in comparison to "50% per year", will be achievable in the production process. It might take a decade for prices to drop 50% per GB.

I'm not sure exactly how designs might be made to scale if drive sizes/densities are pegged at current levels.
What is apparent and universal: "Free Goods" with apparently Infinite Supply will engender Infinite Demand.

If we do hit a "capacity wall", then the best Social Engineering response is to limit demand, which requires a "Cost" on capacity. This could be charging, as Google does with its Gmail service, or by other means, such as publicly ranking "capacity hogs".

Thursday, December 22, 2011

IDC on Hard Disk Drive market: Transformational Times

One of the problems, as an "industry outsider", of researching the field is lack of access to hard data/research. It's there, it's high-quality and timely. Just expensive and behind pay-walls.

A little information leaks via Press Releases and press articles promoting the research companies.

When one of these professional analyst firms makes a public statement alerting us to a radical restructuring of the industry, that's big news. [Though you'd expect "insiders" to have been aware of this for quite some time.]

What's not spelled out publicly, is How will this impact Enterprise Storage vendors/manufacturers?
There seems an implication that the two major HDD vendors will start to compete 'up' the value-chain with RAID and Enterprise Storage vendors, and across storage segments with Flash memory/SSD vendors.

IDC's Worldwide Hard Disk Drive 2011-2015 Forecast: Transformational Times
 published in May, 2011. The 62-page report is priced from US$4,500.

Headline: Transformation to just 3 major vendors. (really 2 major + 1 minor @ 10%)
 "The hard disk drive industry has navigated many technology and product transitions over the past 50 years, but not a transformation. [emphasis added]

 The HDD industry is poised to consolidate from five to three HDD vendors by 2012, and
 HDD unit shipment growth over the next five years will slow.

 HDD revenue will grow faster than unit shipments after 2012, in part because HDD vendors will offer higher-performance hybrid HDD solutions that will command a price premium.

 But for the remaining three HDD vendors to achieve faster revenue growth,
 it will be necessary by the middle of the decade for HDD vendors to transform into [bullets added]
  •  storage device and
  •  storage solution suppliers,
  •  with a much broader range of products for a wider variety of markets
  •  but at the same time a larger set of competitors."

Platters per Disk.

  • 2.5 inch:
    • 9.5mm = 1 or 2 platters
    • 12.5mm = 2 or 3 platters
    • 15mm = ? platters. Guess at least 3. 4 unlikely, compare to 3.5 inch density
  • 3.5 inch
    • 25.4mm = commonly 4. Max. 5 platters.
Why is this useful, interesting or important?
To compare capacity across form-factors and for future configuration/design possibilities.

Disk form-factors are related by an approximate halving of platter area between sizes:
8::5.25 inch, 5.25::3.5 inch, 3.5::2.5 inch, 2.5::1.8 inch, 1.8::1.3 inch, 1.3::1 inch, ...
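As a quick check that consecutive form-factors really do roughly halve the platter area (nominal diameters only; actual recorded areas differ, as discussed next):

```python
# Nominal platter diameters in inches, largest to smallest.
sizes = [8, 5.25, 3.5, 2.5, 1.8, 1.3, 1.0]

# Ratio of each platter's area to the next-larger size:
# area scales with diameter squared, so the ratio is (d_small / d_large)^2.
area_ratios = [(small / large) ** 2 for large, small in zip(sizes, sizes[1:])]

# Each step comes out between ~0.43 and ~0.59: an approximate halving of area.
print([round(r, 2) for r in area_ratios])
```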
What we (as outsiders) know, but only approximately, is the recording area per platter for the platter sizes.  We know there are at least 3 regions of disk platter, but not their ratios/sizes, and these will vary per form-factor/platter-size:
  • motor/hub. The area of the inside 'torus' is small, not much is lost.
  • recorded area
  • outer 'ring' for landing and idling or "unloading" heads. Coated differently (plastic?) to not damage heads if they "skid" or come into contact with a surface (vs 'flying' on the aerodynamic air-cushion).

Chris Mellor, 12th September 2011, The Register, "Five Platters, 4TB":
Seagate has a 4TB GoFlex Desk external drive but this is a 5-platter disk with 800GB platters.
IDC, 2009, report sponsored by Hewlett-Packard:
By 2010, the HDD industry is expected to increase the maximum number of platters per 2.5inch performance-optimized HDD from two to three,
enabling them to accelerate delivering a doubling of capacity per drive, and subsequently achieving 50% capacity increases per drive over a shorter time frame.
7th September 2011 06:00 GMT, The Register.
Oddly Hitachi GST is only shipping single-platter versions of these new drives, although it is saying they are the first ones in a new family, with their 569Gbit/in2 areal density. The announced but not yet shipping terabyte platter Barracuda had a 635Gbit/in2 areal density.
Sebastian Anthony on December 12, 2011 at 12:25 pm, Extreme Tech
Hitachi, seemingly in defiance of the weather gods, has launched the world’s largest 3.5-inch hard drive: the monstrous 4TB Deskstar 5K. With a rotational speed of 5,900RPM, a 6Gbps SATA 3 interface, and the same 32MB of cache as its 2 and 3TB siblings, the 4TB model is basically the same beast — just with four platters instead of two or three. The list price is around $345.
Silverton Consulting, 13-Sep-2011:
shipping over 1TB/disk platter, using 3.5″ platters with 569Gb/sqin technology

Monday, December 19, 2011

"Missed by _that_ much": Disk Form Factor vs Rack Units

Apologies to 1965 TV series "Get Smart" and the catch-phrase "Missed by that much" (with a visual indication of a near-miss).

This is a lament, not a call to action or grumble. Standards are necessary and good.
We have two standards that we just have to live with now: too many devices depend on them for a change. Unlike the "imperial" to metric conversion, there would be few discernible benefits.

There's a fundamental mismatch between the Rack Unit (1.75 inches) or the vertical space allowed for equipment in 19 inch Racks (standard EIA-310) and the Disk form factors of 5.25, 3.5 and 2.5 inches defined by the Small Form Factor Committee.

There is no way to mount a standard disk drive (3.5 or 2.5 inch) exactly in a Rack. There are various amounts of wasted space.
Originally, "full-height" 5.25 inch drives could be mounted horizontally exactly in 2 Rack Units (3.5 inches), three abreast.

The "headline" size of the form-factor is the notional size of the platters or removable media.
The envelope allows for the enclosure.

So whilst "3.5 inch" looks like a perfect multiple of the 1.75 inch Rack Unit, a "3.5 inch" drive is around 0.5 inch larger.
Manufacturers of vertical-mount "hot-swap" drives allow around 1mm on the thinnest dimension, 9 mm on the "height" and 1.5 inches (42-43 mm) on the longest dimension (depth).

A guess at the dimensions of hot-swap housings:
1/32 in (0.8mm) or 1mm sheet metal could be used between drives (upright)
and 1.5-2mm sheet metal would be needed to support the load (with an upturned edge?).

In total, around 0.5 inch (12.5mm) might need to be allowed vertically for supporting structures.
An ideal Rack Unit size for the "3.5 inch" drive form-factor would be 4.5 inches.

Or, "3.5 inch" drives could be 3.00 - 3.25 inches wide to fit exactly in 2 Rack Units.

Different manufacturers approach this problem differently:
  • Copan/SGI and Backblaze mount 3.5 inch drives vertically in 4 Rack Units (7 inches).
    Both of these solutions aim for high-density packing: 28 and 11.25 drives per Rack Unit respectively.
    • Copan, via US Patent # 7145770, uses 4U hot-swap "canisters" that store 14 drives in 2 rows, with 8 canisters per "shelf" (112 drives/shelf). In a 42 U rack, they can house 8 shelves, for 896 drives per Rack. Their RAID system is 3+1, with max 5 spares per shelf, yielding 79 data drives per shelf, and 632 drives per Rack.
      These systems are designed specifically to hold archival data, with up to 25% or 50% of drives active at any one time, as "MAID": Massive Array of Idle Disks.
    • Backblaze are not a storage vendor, but have made their design public with a hardware vendor able to supply cases and pre-built (but not populated) systems.
      Their solution, fixed-disks not hot-swap, is 3 rows of 15 disks mounted vertically, sitting on their connectors. The Backblaze systems include a CPU and network card and are targeted at providing affordable and reliable on-line Cloud Backup services [and are specifically "low performance"]. Individual "storage pods" do not supply "High Availability", there is little per-unit redundancy. Like Google, Backblaze rely on whole-system replication and software to achieve redundancy and resilience.
  • Most server and storage appliance vendors use "shelves" of 3 Rack Units (5.25 inches), but fit 13-16 drives across the rack (~17.75 inches or 450mm) depending on their hot-swap carriers.
  • "2.5 inch" drives fitted vertically (2.75 inch) need 2 Rack Units (3.5 inches). Most vendors fit 24 drives across a shelf. "Enterprise class" 2.5 inch drives are typically 12.5 or 15 mm thick.
Another possibility, not widely pursued, is to build disk housings or shelves that don't exactly fit the EIA-310 standard Rack Units. Unfortunately, the available internal width of 450mm cannot be varied.
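The packing densities of the designs above, expressed in drives per Rack Unit (drive counts as quoted; the 3RU shelf figure is approximate):

```python
# Drives per Rack Unit for the mounting approaches described above.
designs = {
    "Copan canister (3.5in)": 112 / 4,    # 8 x 14-drive canisters per 4RU shelf
    "Backblaze pod (3.5in)": 45 / 4,      # 3 rows of 15 fixed disks in 4RU
    "Typical 3RU shelf (3.5in)": 15 / 3,  # 13-16 drives across a 3RU shelf
    "2RU shelf (2.5in)": 24 / 2,          # 24 vertical 2.5in drives in 2RU
}

for name, per_ru in designs.items():
    print(f"{name}: {per_ru:.2f} drives/RU")
```

Copan's 28/RU and Backblaze's 11.25/RU match the figures quoted earlier; a conventional hot-swap shelf manages only about 5/RU.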

The form factors:
"5.25 inch": (5.75 in x 8 in x 1.63 in =  146.1 mm x 203 mm x 41.4 mm)
"3.5 inch"  : (4 in x 5.75 in x 1 in =  101.6 mm x 146.05 mm x 25.4 mm)
"2.5 inch"  : (2.75 in x 3.945 in x 0.25-0.75 in = 69.85 mm x 100.2 mm x [7, 9.5, 12.5, 15, 19] mm)
Old disk height form factors, originating in 5.25 inch disks circa mid-1980's.
low-profile = 1 inch.
Half-height = 1.63 inch.
Full-height = 3.25 inch. [Fitting well into 2 Rack Units]

Wednesday, December 14, 2011

"Disk is the new Tape" - Not Quite Right. Disks are CD's

Jim Gray, in recognising that Flash Memory was redefining the world of Storage, famously developed between 2002 and 2006 the view that:
Tape is Dead
Disk is Tape
Flash is Disk
RAM Locality is King
My view is that: Disk is the new CD.

Jim Gray was obviously intending that Disk had replaced Tape as the new backup storage media, with Flash Memory being used for "high performance" tasks. In this he was completely correct. Seeing this clearly and enunciating it a decade ago was remarkably insightful.

Disks do both the Sequential Access of Tapes and Random I/O.
In the New World Order of Storage, they can be considered functionally identical to Read-Write Optical disks or WORM (Write Once, Read Many) media.

As the ratios between access time (seek or latency) and sequential transfer rate, or throughput, continue to change in favour of capacity and throughput, managing disks becomes more about running them in "Seek and Stream" mode than doing Random I/O.

With current 1TB disks, the sequential scan time (capacity ÷ sustained transfer rate) [1,000GB ÷ 1Gbps] is 2-3 hours. However, reading a disk with 4KB random I/Os at ~250/sec (4msec avg. seek), the type of workload a filesystem causes, gives an effective throughput of around 1MB/sec, some 125 times slower than a sequential read.
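A back-of-the-envelope version of that comparison, with 125MB/sec standing in for ~1Gbps sustained:

```python
# Sequential scan time vs random-I/O throughput for a 1TB disk.
capacity_bytes = 1_000e9
seq_rate = 125e6                       # bytes/sec, ~1Gbps sustained

scan_hours = capacity_bytes / seq_rate / 3600   # a little over 2 hours

random_iops = 250                      # ~4ms average positioning time
io_size = 4096                         # 4KB filesystem-style I/Os
random_rate = random_iops * io_size    # ~1 MB/sec effective

slowdown = seq_rate / random_rate      # sequential is ~120x faster
print(scan_hours, slowdown)
```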

It behoves system designers to treat disks as faster RW Optical Disk, not as primary Random IO media, and as Jim Gray observed, "Flash is the New Disk".

The 35TB drive (of 2020) and Using them.

What's the maximum capacity possible in a disk drive?

Kryder, 2009, projects 7TB/platter for 2.5 inch platters will be commercially available in 2020.
[10Tbit/in² demo by 2015 and $3/TB for drives]

Given that prices of drive components are driven by production volumes, in the next decade we're likely to see the end of 3.5 inch platters in commercial disks with 2.5 inch platters taking over.
The fifth-power relationship between platter-size and drag/power-consumed also suggests "Less is More". A 3.5 inch platter needs 5+ times more power to twirl it around than a 2.5 inch platter: the reason that 10K and 15K drives run the small platters (they already use the same media/platters for 3.5 inch and 2.5 inch drives).

Sankar, Gurumurthi, and Stan in "Intra-Disk Parallelism: An Idea Whose Time Has Come" ISCA, 2008, discuss both the fifth-power relationship and that multiple actuators (2 or 4) make a significant difference in seek times.

How many platters are fitted in the 25.4 mm (1 inch) thickness of a 3.5 inch drive's form-factor?

This report on the Hitachi 4TB drive (Dec, 2011) says they use 4 * 1TB platters in a 3.5 inch drive, with 5 possible.

It seems we're on-track to at least the Kryder 2020 projection, with 6TB per 3.5 inch platter already demonstrated using 10nm grains enhanced with Sodium Chloride.

How might those maximum capacity drives be lashed together?

If you want big chunks of data, then even in a world of 2.5 inch componentry, it still makes sense to use the thickest form-factor around to squeeze in more platters. All the other power-saving tricks of variable-RPM and idling drives are still available.
The 101.6mm [4 inch] width of the 3.5 inch form-factor allows 4 to sit comfortably side-by-side in the usual 17.75 inch wide "19 inch rack", using just more than half the 1.75 inch height available.

It makes more sense to make a half-rack-width storage blade, with 4 * 3.5 inch disks (2 across, 2 deep) with a small/low-power CPU, a reasonable amount of RAM and "SCM" (Flash Memory or similar) as working-memory and cache and dual high-speed ethernet, infiniband or similar ports (10Gbps) as redundant uplinks.
SATA controllers with 4 drives per motherboard are already common.
Such "storage bricks", to borrow Jim Gray's term, would store a protected 3 * 35TB, or ~100TB per unit, or 200TB per Rack Unit (RU). A standard 42RU rack, allowing for a controller (3RU), switch (2RU), patch-panel (1RU) and common power-supplies (4RU), would have a capacity of around 6.5PB.
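The capacity arithmetic, using the round figures above:

```python
# Rack capacity for the hypothetical "storage brick" design:
# 4 x 35TB drives per brick (3 data + 1 parity), two half-width bricks per RU.
brick_tb = 100                   # 3 * 35TB protected, rounded down to 100TB
per_ru_tb = 2 * brick_tb         # two half-rack-width bricks side-by-side

overhead_ru = 3 + 2 + 1 + 4      # controller, switch, patch-panel, power
usable_ru = 42 - overhead_ru     # 32 RU left for bricks in a standard rack

rack_pb = usable_ru * per_ru_tb / 1000
print(rack_pb)                   # about 6.5PB per rack
```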

Kryder projected a unit cost of $40 per drive, with the article suggesting 2 platters/drive.
Scaled up, ~$125 per 35TB drive, or ~$1,000 for 100TB protected ($10/TB) [$65-100,000 per rack]

The "scan time" or time-to-populate a disk is the rate-limiting factor for many tasks, especially RAID parity rebuilds.
For a single-actuator drive using 7TB platters and streaming at 1GB/sec, "scan time" is a daunting 2 hours per platter: at best 10 hours just to read a 35TB drive.

Putting 4 actuators in the drive cuts scan time to 2-2.5 hours, with some small optimisations.

While not exceptional, it compares favourably with the 3-5 hours minimum currently reported with 1TB drives.

But a single-parity drive won't work for such large RAID volumes!

Leventhal, 2009, in "Triple-Parity RAID and Beyond", suggested that the UER (Unrecoverable Error Rate) of large drives would force parity-group RAID implementations to use a minimum of 3 parity drives to achieve a 99.2% probability of a successful (Nil Data Loss) RAID rebuild following a single-drive failure. Obviously, triple parity is not possible with only 4 drives.

The extra parity drives are NOT to cover additional drive failures (this scenario is not calculated), but to cover read errors, with the assumption that a single error invalidates all data on a drive.

Leventhal uses in his equations:
  • 512 byte sectors,
  • a UER of 1 in 10^14 bits read (typical for capacity-optimised SATA drives),
  • hence roughly one unreadable sector per 10TB read, or
  • 10 sectors per 100TB read.
Already, drives are using 4KB sectors (with mapping to the 'standard' 512-byte sectors) to achieve higher UERs. The calculation should be done with the native disk sector size.

If platter storage densities are increased 32-fold, it makes sense to similarly scale up the native sector size to decrease the UER. There is a strong case for 64-128KB sectors on 7TB platters.

Recasting Leventhal's equations with:
  • 100TB to be read,
  • 64KB native sectors,
  • hence 1.5625 * 10^9 native sectors to be read.
What UER would enable a better than 99.2% probability of reading 1.5 billion native sectors?
First approximation: a UER of about 1 in 10^17 bits.
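A back-of-envelope calculation of the required UER, assuming independent per-sector errors and 64,000-byte native sectors (matching the 1.5625 * 10^9 sector count above):

```python
# What per-sector error rate gives >= 99.2% chance of reading
# 1.5625e9 native 64KB sectors with no unrecoverable error?
n_sectors = 1.5625e9
target = 0.992

# Require (1 - p)^n >= target, so p <= 1 - target^(1/n),
# which is approximately -ln(target) / n for small p.
p_sector = 1 - target ** (1 / n_sectors)   # ~5e-12 per sector

bits_per_sector = 64_000 * 8               # 64KB native sector
p_bit = p_sector / bits_per_sector         # ~1e-17: about 1 in 10^17 bits

print(p_sector, p_bit)
```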
Zeta claims UER better than 1 in 10^58. It is possible to do much better.

Inserting Gibson's "horizontal" error detection/correction (extra redundancy on the one disk) has around the same overhead, or less. [do exact calculation]

Rotating parity or single-disk parity RAID?

The reasons to rotate parity around disk are simple - avoid "hot-spots", otherwise the full parallel IO bandwidth possible over all disks is reduced to just that of the parity disk. NetApp neatly solve this problem with their WAFL (Write Anywhere File Layout).

In order to force disks into mainly sequential access ("seek then stream"), writes shouldn't simply be cached: they should be held in SCM/Flash and not written to HDD until writes have quiesced.

The single parity-disk problem only occurs on writes. Reading, in normal or degraded mode, occurs at equal speed.

If writes across all disks are stored then written in large blocks, there is no IO performance difference between single-parity disk and rotating parity.

Tuesday, December 13, 2011

Revolutions End: Computing in 2020

We haven't reached the end of the Silicon Revolution yet, but "we can see it from here".

Why should anyone care? Discussed at the end.

There are two expert commentaries that point the way:
  • David Patterson's 2004 HPEC Keynote, "Latency vs Bandwidth", and
  • Mark Kryder's 2009 paper in IEEE Magnetics, "After Hard Drives—What Comes Next?"
    [no link]
Kryder projected the current expected limits of magnetic recording technology in 2020 (2.5 inch: 7TB/platter) and how another dozen technologies will compare, but there's no guarantee. Some unanticipated problem might, as with CPU's, derail Kryder's Law (disk space doubles every year) before then.
We will get an early "heads-up": by 2015 Kryder expects 7TB/platter to be demonstrated.

This "failure to fulfil the roadmap" has happened before: In 2005 Herb Sutter pointed out that 2003 marked the end of Moore's Law for single-core CPU's in "The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software". Whilst Silicon fabrication kept improving, CPU's hit a "Heat Wall" limiting the clock-frequency, spawning a new generation of "multi-core" CPUs.

IBM with its 5.2GHz Z-series processors and gamers "over-clocking" standard x86 CPUs showed part of the problem was a "Cooling Wall". This is still to play out fully with servers and blades.
Back to water-cooling, anyone?
We can't "do a Cray" anymore and dunk the whole machine in a vat of Freon (a CFC refrigerant, now banned).

Patterson examines the evolution of four computing technologies over 25 years from ~1980 and the increasing disparity between "latency" (like disk access time) and "bandwidth" (throughput):
  • Disks
  • Memory (RAM)
  • LANs (local Networking)
  • CPUs
He neglects "backplanes" (PCI etc.), graphics sub-systems/video interfaces and non-LAN peripheral interconnects.

He argues there are 3 ways to cope with "Latency lagging Bandwidth":
  • Caching (substitute different types of capacity)
  • Replication (leverage capacity)
  • Prediction (leverage bandwidth)
Whilst Patterson doesn't attempt to forecast the limits of technologies like Kryder, he provides an extremely important and useful insight:
If everything improves at the same rate, then nothing really changes
When rates vary, real innovation is required
In this new milieu, Software and System designers will have to step up to build systems that are effective and efficient, and any speed improvements will only come from better software.

There is an effect that will dominate bandwidth improvement, especially in networking and interconnections (backplanes, video, CPU/GPU and peripheral interconnects):
the bandwidth-distance product
This affects both copper and fibre-optic links. For a single technology, a 10-times speed-up shortens the effective distance 10-times, as is well known in transmission line theory.

For LANs to go from 10Mbps to 100Mbps to 1Gbps, higher-spec cable (Cat 4, Cat 5, Cat 5e/6) had to be used. Although 40Gbps and 100Gbps Ethernet have been agreed and ratified, I expect these speeds will only ever be Fibre Optic. Copper versions will either be very limited in length (1-3m) or use very bulky, heavy and expensive cables: worse in every dimension than fibre.

See the "International Technology Roadmap for Semiconductors" for the expert forecasts of the underlying Silicon Fabrication technologies, currently out to 2024. There is a lot of detail in there.

The one solid prediction I have is Kryder's 7TB/platter:
a 32-times increase in bit-areal density, or 5 doublings of capacity.
This implies the transfer rate of disks will increase 5-6 times (there being no point in increasing rotational speed), to roughly 8Gbps. Faster than "SATA 3.0" (6Gbps) but within the current cable limits. Maintaining the current "headroom" would require a 24Gbps spec, needing a new generation of cable. The SATA Express standard/proposal of 16Gbps might work.
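The transfer-rate scaling follows because only linear bit density, the square root of areal density, affects the media rate at a fixed RPM. A sketch, with the current media rate (1.5Gbps) an assumed round figure:

```python
import math

# A 32x increase in areal density raises linear (along-track) density,
# and hence the media transfer rate at fixed RPM, by sqrt(32).
areal_gain = 32
linear_gain = math.sqrt(areal_gain)       # ~5.7x

current_gbps = 1.5                        # assumed media rate of a 2011 drive
future_gbps = current_gbps * linear_gain  # ~8.5Gbps: "roughly 8Gbps"
print(linear_gain, future_gbps)
```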

There are three ways disk connectors could evolve:
  • SATA/SAS (copper) at 10-20Gbps
  • Fibre Optic
  • Thunderbolt (already 2 * 10Gbps)
Which type to dominate will be determined by the Industry, particularly the major Vendors.

The disk "scan time" (to fully populate a drive) at 1GB/sec will be about 2 hours/platter: 6 hours for a 20TB laptop drive, or 9 hours for a 30TB server-class drive [16 hours if 50TB drives are packaged in 3.5" (25.4mm thick) enclosures], versus the ~65 minutes for a 500GB drive now.

There is one unequivocal outcome:
Populating a drive using random I/O, as we now do via filesystems, is not an option. Random I/O is 10-100 times slower than streaming/sequential I/O. It's not good enough to take a month or two to restore a single drive, when 1-24 hours are the real business requirements.

Also, for laptops and workstations with large drives (SSD or HDD), they will require 10Gbps networking as a minimum. This may be Ethernet or the much smaller and available Thunderbolt.

A caveat: This piece isn't "Evolution's End", but "(Silicon) Revolution's End". Hardware Engineers are really smart folk; they will keep innovating and providing Bigger, Faster, Better hardware. Just don't expect the rates of increase to be nearly as fast. Moore's Law didn't get repealed in 2003, the rate-of-doubling changed...

Why should anyone care? is really: Who should care?

If you're a consumer of technology or a mid-tier integrator, very little of this will matter. In the same way that now when buying a motor vehicle, you don't care about the particular technologies under the hood, just what it can do versus your needs and budget.

People designing software and systems, the businesses selling those technology/services and Vendors supplying parts/components hardware or software that others build upon, will be intimately concerned with the changes wrought by Revolutions End.

One example is provided above:
 backing up and restoring disks can no longer be a usual filesystem copy. New techniques are required.

Wednesday, December 07, 2011

RAID: Something funny happened on the way to the Future...

With apologies to Stephen Sondheim et al, "A Funny Thing Happened on the Way to the Forum", the book, 1962 musical and 1966 film.


Robin Harris of "StorageMojo" in "Google File System Eval", June 13th, 2006, neatly summarises my thoughts/feelings:
As regular readers know, I believe that the current model of enterprise storage is badly broken.
Not discussed in this document is The Elephant in the Room, or the new Disruptive Technology: Enterprise Flash Memory or SSD (Solid State Disk). It offers (near) "zero latency" access and random I/O performance 20-50 times cheaper than "Tier 1" Enterprise Storage arrays.

Excellent presentations by Jim Gray about the fundamental changes in Storage are available on-line:
  • 2006 "Flash is good": "Flash is Disk, Disk is Tape, Tape is dead".
  • 2002 "Storage Bricks". Don't ship tapes or even disks. Courier whole fileservers, it's cheaper, faster and more reliable.

The current state of the Enterprise Storage market: Pricing and Budget Impact.

  • The promise inherent in the first RAID paper's title (Inexpensive Disks) doesn't seem to be met.
  • Are there other challenges, limits or oddities?
Budget Impact and per-GB pricing
A new entrant, Coraid, talking-up the benefits of its solution lays out some disturbing statistics:
With storage cost consuming 25% to 45% of IT budgets ... (ours) offer up to a 5-8x price-performance advantage over legacy Fibre Channel and iSCSI solutions.
Vendor Gross Margins
A commentary on one of the 6-7 dominant players (EMC, IBM, Network Appliance, Hewlett-Packard, Hitachi Data Systems, Dell, SUN/Oracle) who control 70-80% of the market by revenue: [compare to gross margins on Intel servers of 20-30%]:
Committing to massive disruption of the storage and networking businesses by moving to 40% gross margins from the current 60+% range through the use of volume and scale out technologies, open source software and targeted innovation leveraging the latest technology.
Can HP compete with 40% gross margins? Well, Apple has done pretty well. [Apple is known for 'premium pricing' and the best returns in the industry.]
A large, mature market
The market is large and growing, according to IDC:
... end-user spending on enterprise storage systems reached $30.8 billion in 2010, and the 18% growth over 2009 was the highest since IDC started tracking the market in 1993.
The enterprise storage systems market will grow at a comfortable 3.9% compound annual growth rate (CAGR) between 2010 and 2015 ...
Poor utilisation of raw disk capacity?
IBM, touting the benefits of its XIV Storage, claims massive waste by others' systems:
Overall, we find that the reliability attributes of the system limit the net capacity of a system to 46%-84% depending, for the most part, on the RAID configuration. [46% for a RAID-1 (mirror) config. 84% for RAID-5 (parity/check disk)]

Overall, we see that, by virtue of its built-in efficiency, the XIV system uses 100% of its net capacity, compared with an estimated 28-61% net capacity used by comparable systems. [IBM list 3 types of "wasted space": Orphaned/Unreclaimable Space, Full Backups, Thick Provisioning. Clones and Backups are replaced by 'snapshots' in XIV. IBM neglect filesystem/database 'slack space' within allocated storage.]
The combined effect of the reliability and efficiency attributes is such that, on average, a traditional storage system using mirroring effectively uses less than 21% of its raw capacity (37% when using RAID-5). An XIV system uses approximately 46% of its raw capacity.
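The arithmetic behind those combined figures can be checked by multiplying the two factors together. A quick sketch; the net-of-raw figures are IBM's, while the used-of-net figures are back-solved from the quoted 21%/37% results rather than stated explicitly:

```python
# Effective use of raw capacity = (net / raw) * (used / net).
def effective_raw_utilisation(net_of_raw, used_of_net):
    return net_of_raw * used_of_net

# Traditional mirrored system: 46% net of raw, ~46% of net actually used.
mirror = effective_raw_utilisation(0.46, 0.46)   # "less than 21%" of raw
# Traditional RAID-5 system: 84% net of raw, ~44% of net actually used.
raid5 = effective_raw_utilisation(0.84, 0.44)    # "37%" of raw
print(round(mirror, 2), round(raid5, 2))
```

The point of the multiplication: reliability overhead and allocation waste compound, so each looks tolerable alone while together they leave most raw capacity idle.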
In the absence of good Operational Expenditure [OpEx] data, a guess
There is speculation that the Operational costs of 'Tier 1' (high-performance, high-availability, most expensive) storage are very high, but no good figures are available [Compare this recurrent cost to the retail price of 1TB SATA disks of ~$100, while fast, durable SAS drives, e.g. 146-160GB, are $200-300, and 300-600GB are ~$400]:
With an estimated cost of Tier 1 storage at around $8,000 per TB per year,
Indirect Total Cost of Ownership [TCO] estimates
Hitachi Data Systems lays out the costs ownership and product conversion/data migration when pushing the benefits of their "virtual storage" architecture:
Forrester 2007 research has shown that in excess of 70 percent of enterprise IT budgets is devoted to maintaining existing infrastructure.
Migration project expenditures are on average 200 percent of the acquisition cost of enterprise storage.
 With an average of four years' useful life, the annual operating expenses associated with migration represent ~50 percent of acquisition cost.
  • Enterprise storage migration costs can exceed US$15,000 per terabyte migrated. [implying 2007(?) acquisition cost of $7,500/TB.]
  • For example, an average FORTUNE 1000® company has an average of  800TB of network attached storage (NAS) and  nearly 3PB of storage (InfoPro Wave 12–Q2, 2009)  with, on average, 300TB per storage system
  • As the useful life of most storage systems is three to five years, ...
Current admin challenges
The size and complexity of current Enterprise Storage solutions, and the resulting administrator workload, has provoked comments along the lines quoted below. System complexity increases combinatorially as additional layers are added and heterogeneous systems and networks are interfaced. This increases admin workload, task difficulty and execution times, consequently increasing preventable faults and errors.
Today, storage is the single most complex and expensive component in the virtualized data center.
How do Enterprises choose between Vendors and Products?
Many commentators assert that Storage is bought on a single metric: Price per GB.
Comparing other important performance metrics, {latency, IO/second, throughput MB/sec} for "random" and "sequential" I/O workloads is being addressed by the Storage Performance Council, with their SPC1 and SPC2 specifications. For those vendors who choose to participate and publish data on their systems, it provides an "Apples and Apples" comparison for potential customers: including system pricing and discount information.

But there's a problem: system price is largely unrelated to the cost of the drives.
Overwhelmingly, the price of the raw disk drives is an almost insignificant fraction of the purchase price of "Tier 1" Storage arrays. Competitors claim that the dominant players price their disks at up to 30 times the normal retail price, hiding the true cost of the infrastructure wrapped around the drives. Vendors often load special firmware in their drives to prevent substitution. [SPC1 and SPC2 detailed pricing confirms published prices of $1,000-2,000 per drive, 5-20 times retail prices.]

This pricing practice distorts customer system specifications: customers purchase far fewer drives, creating additional administrative work in managing allocated disk space and achieving target performance levels.

What is missing is good data on: Price / GB-available-to-Applications.
[Ignoring all the other overheads and "slack space" for Logical Volume Managers and Operating Systems.]

With Enterprise Storage systems, this is at least 50-100 times the raw cost of drives.

The lesson: Optimising the utilisation of the cheapest and paradoxically least available resource in a system is poor practice.

Meanwhile, there are significant technical and performance issues looming in the world of Enterprise Storage:
  • In Triple Parity RAID and Beyond, Adam Leventhal, ACM Queue, Dec 2009, suggests that by 2020 three parity drives will be needed to achieve a 99.2% probability of a successful RAID rebuild recovering from a single drive failure. Multi-parity drives introduce new problems:
    • increases Price per GB (more drives for same capacity),
    • reduces write performance (1P = 4 IO, 2P = 6 IO, 3P = 8 IO, NP = 2*(N+1) IO)
    • increases compute intensity for parity calculations (1P uses trivial 'XOR')
    • increases system complexity in efforts to compensate for performance, such as delayed parity writes and caching.
    • system robustness and durability is adversely affected by increased component count and software complexity.
  • RAID rebuilds severely affect access times and throughput and now take from 3-24 hours, up from "minutes" in the first systems. Documented in "Comparison Test: Storage Vendor Drive Rebuild Times and Application Performance Implications", Feb 18, 2009, Dennis Martin. There are anecdotal reports of RAID rebuilds taking up to a week, degrading performance of all tasks and leaving organisations with unprotected data for the duration.
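The write-penalty figures quoted above (1P = 4 IO, 2P = 6 IO, NP = 2*(N+1) IO) follow from read-modify-write: a small write must read the old data and each old parity block, then write the new data and each new parity block. A minimal sketch:

```python
def small_write_ios(parity_drives):
    """IOs for a small (single-block) write in an N-parity RAID:
    read old data + read each old parity, then write new data +
    write each new parity: 2 * (parity_drives + 1)."""
    return 2 * (parity_drives + 1)

# RAID-5 (1 parity), RAID-6 (2 parity), triple parity (3):
print([small_write_ios(p) for p in (1, 2, 3)])  # [4, 6, 8]
```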

Disk Drives: Characteristics and Evolution.

  • The architecture and organisation of Enterprise Storage Systems are driven by Usage demands and the underlying storage components.
  • What's gone before and what might be coming?
In "A Conversation with Jim Gray", ACM Queue, 2003, the evolution and limits of Hard Disk Drive (HDD) technology are discussed, with the limits projected to occur around 2020:
But starting about 1989, disk densities began to double each year. Rather than going slower than Moore’s Law, they grew faster. Moore’s Law is something like 60 per-cent a year, and disk densities improved 100 percent per year.

Today disk-capacity growth continues at this blistering rate, maybe a little slower. But disk access, which is to say, “Move the disk arm to the right cylinder and rotate the disk to the right block,” has improved about tenfold. The rotation speed has gone up from 3,000 to 15,000 RPM, and the access times have gone from 50 milliseconds down to 5 milliseconds. That’s a factor of 10. Bandwidth has improved about 40-fold, from 1 megabyte per second to 40 megabytes per second. Access times are improving about 7 to 10 percent per year. Meanwhile, densities have been improving at 100 percent per year.

At the FAST [File and Storage Technologies] conference about a year-and-a-half ago, Mark Kryder of Seagate Research was very apologetic. He said the end is near; we only have a factor of 100 left in density—then the Seagate guys are out of ideas. So this 200-gig disk that you’re holding will soon be 20 terabytes, and then the disk guys are out of ideas. [now revised to 14TB @ $40 for a 2.5 inch drive]
The definitive paper on the limits and evolution of hard disk technology, including a comparison with a dozen other prospective technologies, by Mark Kryder:

"After Hard Drives—What Comes Next?" Kryder and Chang Soo Kim. IEEE  Magnetics, Oct 2009.
Assuming HDDs continue to progress at the pace they have in the recent past, in 2020 a two-disk, 2.5-in disk drive [2 platters] will be capable of storing over 14 TB and will cost about $40.
 Given the current 40% compound annual growth rate in areal density, this technology should be in volume production by 2020. [Expect a demonstration in 2015]
In 2005, Scientific American discussed his work and described "Kryder's Law": in recent years HDD capacity has doubled every year, outstripping Moore's Law for CPU speed.

Sankar, Gurumurthi and Stan, ICSA 2008,  describe the relationship of power consumption to RPM and platter size necessary to understand current drive design [reformatted]:
Since the power consumption of a disk drive is
proportional to the fifth-power of the platter size,
is cubic with the RPM, and
is linear with the number of platters... [citing a 1990 IEEE paper]
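Taken at face value, that scaling model lets form factors be compared directly. A sketch: the exponents are from the quote above, while the reference drive and the idea of normalising against it are mine; this gives relative power only, not absolute Watts:

```python
def relative_power(platter_in, rpm, platters, ref=(3.5, 7200, 1)):
    """Power relative to a reference drive, using the quoted scaling:
    (platter diameter)^5 * (RPM)^3 * (number of platters)."""
    d0, r0, n0 = ref
    return (platter_in / d0) ** 5 * (rpm / r0) ** 3 * (platters / n0)

# A 2.5 inch 7200RPM single-platter drive vs a 3.5 inch one:
print(round(relative_power(2.5, 7200, 1), 2))  # ~0.19: under a fifth of the power
```

The fifth-power term is why shrinking the platter is by far the cheapest way to cut power, and why high-RPM drives use small platters even inside a 3.5 inch case.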
The external form-factor defines capacity of current HDD's in surprising ways:
  • The 2.5 inch form-factor allows thickness to vary between 7 mm and 19mm, though 19mm is now unusual.
    • Compare to the 25.4mm (1 inch) thickness of 3.5" drives.
    • Consumer drives, in laptops and PC's, are now normally 9.5mm, with some 7mm. [single platter?]
    • Enterprise drives are 15mm, allowing higher capacities by including more platters. [2-3?]
    • In 2020, Enterprise 2.5 inch drives will be 2 or 3 platters, hence  14-21TB.
  • The maximum platter size is not used in every drive. 
    • For consumer drives and high-capacity/low-energy Enterprise drives, the largest platters possible are used.
    • For high-speed (10,000 [10K] and 15,000 RPM [15K]) drives, the same size platters are used in both 3.5 inch and 2.5 inch drives. This reduces power consumption and seek time through reduced head travel.
    • Vendors sell 3.5 inch 15K drives because they can fit more platters in the 25.4 mm vs 15 mm form factor. [4 platters are common, 5 platters are "possible"]
      "More than an interface — SCSI vs. ATA", Anderson, Dykes, Riedel, Seagate, FAST 2003.
A fundamental characteristic of HDD's, "access density, or IOPS/GB", and its importance and evolution, as discussed in 2004 by a disk drive manufacturer. [Last row added to table]

                        1987        2004            times increase
CPU Performance         1 MIPS      2,000,000 MIPS  2,000,000x
Memory Size             16 Kbytes   32 Gbytes       2,000,000x
Memory Performance      100 usec    2 nsec          50,000x
Disc Drive Capacity     20 Mbytes   300 Gbytes      15,000x
Disc Drive Performance  60 msec     5.3 msec        11x
Disc Drive Bandwidth    0.6 MB/sec  86 MB/sec       140x  [Patterson 2004]

Figure 2. Disc drive performance has not kept pace with other components of the system.
Last row data from 2004 Patterson address on "Latency and Bandwidth", covering 20+ years evolution of all system components.

The vendor table omits "transfer rate" or "bandwidth", necessary to calculate the other fundamental characteristic of HDD's, described by Leventhal [ACM Queue 2009], disk "scan time":
By dividing capacity by throughput, we can compute the amount of time required to fully scan or populate a drive.
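Scan time as Leventhal defines it is a one-line calculation. A sketch using the 2004 drive from the vendor table above, plus an assumed (illustrative, not sourced) sustained rate for a 2011-era 3TB SATA drive:

```python
def scan_time_hours(capacity_gb, throughput_mb_s):
    """Time to read (or fully populate) a drive end-to-end:
    capacity divided by sustained throughput."""
    return capacity_gb * 1000 / throughput_mb_s / 3600

# 2004-era drive from the vendor table: 300 GB at 86 MB/sec.
print(round(scan_time_hours(300, 86), 1))    # ~1 hour
# A 2011 3TB SATA drive at an assumed ~120 MB/sec sustained:
print(round(scan_time_hours(3000, 120), 1))  # ~7 hours
```

Capacity has grown 10x while throughput grew less than 1.5x, so scan time, and with it any rebuild floor, keeps lengthening.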
This storage analyst discussion of access density, cost and capacity using SPC benchmarks delves into more fine detail on this and related topics.

RAID array IO/sec performance suffers through the declining access density, which can be addressed through higher RPM drives, "short-stroking"/"de-stroking" (using only the outer 20% of each drive) or provisioning the number of drives ('spindles') not on capacity, but on desired IO/sec. (Most designers and storage architects would consider this "over-provisioning" and unnecessarily expensive.)

RAID rebuild times cannot proceed faster than the individual drive scan time, whilst in normal RAID-5 configurations, the total amount of data read for an (ND+1P) parity group, is N-times the drive size. The minimum time taken is (N*Disk-size ÷ common-channel speed), the parity-group scan time, analogous to single-drive scan time. The Networking concept of "over-subscription", the ratio of uplink to total downlink capacity (1:1 is ideal but expensive, higher numbers cause congestion in high-performance environments like server rooms), can be applied to the common-channel and supported drives.
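The parity-group scan time above can be sketched the same way; the drive size and channel speed here are illustrative assumptions, not figures from any vendor:

```python
def min_rebuild_hours(n_data_drives, drive_tb, channel_mb_s):
    """Lower bound on a RAID-5 rebuild: N data drives' worth of data
    must cross the common channel: N * disk-size / channel-speed."""
    total_mb = n_data_drives * drive_tb * 1e6
    return total_mb / channel_mb_s / 3600

# 7 data drives of 1TB each behind a shared ~400 MB/sec channel:
print(round(min_rebuild_hours(7, 1.0, 400), 1))  # ~4.9 hours, best case
```

The real figure is worse: this floor assumes the channel is dedicated to the rebuild, which is exactly the over-subscription problem noted above.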

The Early View
  • Why are we now facing these problems?
  • Is this the future that the instigators of RAID foresaw?
In 1987/8 when Patterson, Gibson and Katz at UCB (University of California, Berkeley) wrote their industry-changing paper, "A Case for Redundant Arrays of Inexpensive Disks", their group was working on new processing architectures (MPP - Massively Parallel Processing) amongst other things.

Patterson et al describe an application scale-up problem as CPU speeds increase faster than disks, and proffer this: "A Solution: Arrays of Inexpensive Disks".
1.2 The Pending I/O Crisis
What is the impact of improving the performance of some pieces of a problem while leaving others the same? Amdahl's answer is now known as Amdahl's Law...

Suppose that some current applications spend 10% of their time in I/O. Then when computers are 10X faster - according to Bill Joy in just over three years - then Amdahl's Law predicts effective speedup will be only 5X. When we have computers 100X faster - via evolution of uniprocessors or by multiprocessors - this application will be less than 10X faster, wasting 90% of the potential speedup.
 In transaction-processing situations using no more than 50% of storage capacity, then the choice is mirrored disks (Level 1). However, if the situation calls for using more than 50% of storage capacity, or for supercomputer applications, or for combined supercomputer applications and transaction processing, then Level 5 looks best.
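The Amdahl's Law arithmetic quoted above is easy to reproduce; a quick check, not code from the paper:

```python
def amdahl_speedup(io_fraction, cpu_speedup):
    """Overall speedup when only the non-I/O part gets faster:
    1 / (serial fraction + parallelisable fraction / speedup)."""
    compute = 1 - io_fraction
    return 1 / (io_fraction + compute / cpu_speedup)

print(round(amdahl_speedup(0.10, 10), 1))   # ~5.3x: "only 5X"
print(round(amdahl_speedup(0.10, 100), 1))  # ~9.2x: "less than 10X"
```

With 10% of time in I/O, even infinitely fast CPUs cap the speedup at 10x; hence the "pending I/O crisis".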
They also suggest that storage arrays will commonly be comprised of very large numbers of disks:
MTTR and thereby increase the MTTF of a large system. For example, a 1000 disk level 5 RAID with a group size of 10 and a few standby spares could have a calculated MTTF of 45 years.
But explicitly pose as unsolved questions, just how this might be accomplished:
  • Can information be automatically redistributed over 100 to 1000 disks to reduce contention? 
  • Will disk controller design limit RAID performance?
  • How should 100 to 1000 disks be constructed and physically connected to the processor? 
  • Where should a RAID be connected to a CPU so as not to limit performance? Memory bus? I/O bus? Cache?
A further insight into the thinking of these pioneers is found in the Introduction of "RAIDframe: A Rapid Prototyping Tool for RAID Systems" by Courtright,  Gibson, et al in 1997. They give a very exact description of how they envision RAID arrays developing [many tiny drives] and why [access time and bandwidth].
Unfortunately, the 1.3 inch drive by HP (C3014 'Kittyhawk') cited was introduced in early 1992 and discontinued due to slow sales at the end of 1994.
The 1 inch drives they forecast did appear in 1999, the IBM Microdrive, packaged as Compact Flash (CF) cards, and for a number of years they were the largest capacity available in the CF form-factor. They were discontinued in 2006 when Flash Memory overtook them in capacity and Price per GB, never having made it into the Enterprise Storage market.
Wikipedia (Dec 2011) claims that by 2009 all drives smaller than 1.8 inch (used in portable devices) had been discontinued.

Another good source from 1995 is "ISCA’95 Reliable, Parallel Storage Tutorial", Garth Gibson, CMU. Gibson co-authored the original Berkeley RAID paper.

From the "Raidframe" paper:
Further impetus for this trend derived from the fact that smaller-form-factor drives have several inherent advantages over large disks:
  • smaller disk platters and smaller, lighter disk arms yield faster seek operations,
  • less mass on each disk platter allows faster rotation,
  • smaller platters can be made smoother, allowing the heads to fly lower, which improves storage density,
  • lower overall power consumption reduces noise problems.
These advantages, coupled with very aggressive development efforts necessitated by the highly competitive personal computer market, have caused the gradual demise of the larger drives.

 In 1994, the best price/performance ratio was achieved using 3.5 inch disks, and the 14-inch form factor has all but disappeared.

 The trend is toward even smaller form factors:
  •  2.5 inch drives are common in laptop computers [ST9096], and
  •  1.3-inch drives are available [HPC3013].
  •  One-inch-diameter disks should appear on the market by 1995 and  should be common by about 1998. [appeared 1999, only achieved commercial success in Digital Cameras]
  •  At a (conservative) projected recording density in excess of 1-2 GB per square inch [Wood93], one such disk should hold well over 2 GB of data. [got there about 2002]
These tiny disks will enable very large-scale arrays.
 For example, a one-inch disk might be fabricated for surface-mount, rather than using cables for interconnection as is currently the norm,  and thus a single, printed circuit board could easily hold an 80-disk array. [did they mean 8x11in cards? see Gibson's 1995 tutorial, Slide 5/74, for a diagram]

 Several such boards could be mounted in a single rack to produce an array containing on the order of 250 disks.
 Such an array would store at least 500 GB, and even if disk performance does not improve at all between now and 1998,  could service either 12,500 concurrent I/O operations or deliver 1.25-GB-per-second aggregate bandwidth.

The entire system (disks, controller hardware, power supplies, etc.) would fit in a volume the size of a filing cabinet.

To summarize, the inherent advantages of small disks,  coupled with their ability to provide very high I/O performance through disk-array technology,  leads to the conclusion that storage subsystems are and will continue to be, constructed from a large number of small disks,  rather than from a small number of powerful disks.
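The per-disk assumptions behind that 1990s projection are easy to recover by dividing the array totals back out:

```python
disks = 250
capacity_gb = 500       # "at least 500 GB"
iops_total = 12_500     # "12,500 concurrent I/O operations"
bandwidth_mb_s = 1250   # "1.25-GB-per-second aggregate bandwidth"

print(capacity_gb / disks)     # 2 GB per disk, matching the density estimate
print(iops_total / disks)      # 50 IO/sec per disk
print(bandwidth_mb_s / disks)  # 5 MB/sec per disk
```

The per-disk figures (50 IO/sec, 5 MB/sec) were modest even for 1995-era drives, which is why the authors could claim the aggregate "even if disk performance does not improve at all".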

The Gap: Then and Now.
  • Around 25 years on, have the original expectations been met?
  • Has the field and technology developed as they envisioned?
  • What are the unresolved questions?
Whilst EMC released one of the first commercial RAID systems, one of the two commercial systems described in the UCB group's follow-up 1994 paper, "RAID: High-Performance, Reliable Secondary Storage", was the Storage Technology Iceberg 9200, released that year. The Iceberg and the 1989 Berkeley RAID-I prototype used (32 and 28) 5.25 inch drives. The Berkeley RAID-II, written up in 1994, graduated to 72-144 * 320MB 3.5 inch drives supplied by IBM.

Whilst the main thrust of the 1987/8 paper was recommending the use of the smallest disk drives available, at the time 3.5 inch, this was not the practice, especially in commercial systems.

Even in the beginning, vendors and their customers, chose differently, using 5.25 inch drives.
The direct impact on system design was that the Iceberg needed dual-parity to achieve sufficient data-protection, especially for read-errors when rebuilding a RAID parity group after a single-disk failure. The UBER (Unrecoverable Bit Error Rate) of 10 in 10^14 and total parity-group size of 8-16 GB gave an unacceptably high chance of a rebuild failure.
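That risk is simple to estimate: reading B bits with an unrecoverable bit error rate p fails with probability 1 - (1-p)^B. A sketch using the figures above (10 in 10^14 = 1e-13, and the top of the 8-16 GB parity-group range); the illustration is mine, not from the Iceberg documentation:

```python
def rebuild_read_failure(ber, bytes_read):
    """Probability of hitting at least one unrecoverable read error
    while reading bytes_read bytes at bit error rate ber."""
    bits = bytes_read * 8
    return 1 - (1 - ber) ** bits

# Reading a 16 GB parity group with a UBER of 10 in 10^14 (1e-13):
p = rebuild_read_failure(1e-13, 16e9)
print(round(p * 100, 1))  # ~1.3% chance the rebuild fails, per rebuild
```

A failure rate in the percent range per rebuild, across a fleet of arrays, is exactly the "unacceptably high" exposure that pushed the Iceberg to dual parity.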

The 5.25 inch drive was the smallest 'enterprise' drive available. The Fujitsu Super Eagle, at 10.5 inches, was discussed in the 1988 paper. At the same time 8 inch SCSI drives were on the market.

In 2005, Herb Sutter wrote a commentary on the end of Moore's Law for single-core CPU's, "The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software". Early in 2003, CPU's hit a "Heat Wall" limiting the clock-frequency. Whilst more transistors could be placed on a CPU die, the clock-frequency (speed) stalled. To improve CPU throughput, designers placed more cores in each CPU, the equivalent of a network connection using more parallel conductors to increase total bandwidth without increasing the speed of an individual connection/conductor.

In contrast to this sudden change, Hard Disk access density has been steadily eroding whole array performance for the last two decades, steadily forcing design changes and increasing complexity.

Unlike CPU's, hard disks are unlikely to hit a sudden "IO Performance Wall" when fundamental limits are reached. Though, like CPU's, some unexpected physical constraint may limit realisable capacities.

The projected maximum areal density of disk drives results in fixed sizes for 2.5 inch and 3.5 inch platters. The Industry Standard form-factors fix the size available, and potentially the number of platters possible in each form factor. With the advent of cheap Flash Memory in SSD's, there is little need for high RPM Hard Drives.

There has been one major form factor conversion, from 5.25 inch to 3.5 inch. Currently 'Tier 1' storage is being sold in both 3.5 inch and 2.5 inch form factors. To understand if 2.5 inch drives will become universal, we need to examine the history of the last conversion and the differences to now.

 Slide 25/74, "Disk Diameter Trends",  of Garth Gibson's ISCA’95 Reliable, Parallel Storage Tutorial covers these transitions:
Decreasing diameter dominated 1980s
• 5.25” created desktop market (16+ GB soon)
• 3.5” created laptop market (4+ GB 1/2 high; 500+ MB 19mm)
• 2.5” dominating laptop market (200+ MB; IBM 720 MB)
• 1.8” creating PCMCIA disk market (80+ MB)

Decreasing diameter trend slowed to a stop
• 1.8” market not 10X 2.5” market
• 1.3” (HP Kittyhawk) discontinued
• vendors continue work on smaller disks to lower access time
It isn't clear when 5.25 inch drives went out of production and Enterprise storage swapped completely to 3.5 inch drives. Seagate's 10Gb "Elite 9" ST410800N (the ST410800N manual, static URL) was released in 1994. The cut-over will have been before the product's end-of-life in 1997/8.
Slide 20/74 of Gibson's 1995 tutorial plots the year of introduction of different drive capacities by drive size. It tracks 5.25 inch drives until 1994 with 10GB introduced, confirming the first estimate.

The reasons will have been complex and many, but at some point all the significant metrics would've favoured 3.5 inch drives and the result was foregone.

Metrics that matter to Enterprise Storage vendors:
  • Price per GB
    • Both form factors are in high-volume production, with a small premium for 2.5 inch drives.
    • Laptops and notebooks have outsold desktop PC's for a number of years, at least balancing demand for 2.5 inch drives.
  • GB per cubic-unit (normally per "Rack Unit" (RU) in a "19 inch" rack (1.75 in x 17.75 in x 24-36 in))
    • Standard vertical mounting sees 13-16 3.5 inch  hot-swap  drives in a 3RU space, and
    • 24 2.5 inch drives in a 2RU space.
    • 4.3-5.3 drives/RU vs 12 drives/RU.
    • The ratio of platter area, a close approximation to capacity per platter, is about 2:1, making the capacity-per-RU ratio about 5:6 in favour of 2.5 inch.
    • 2.5 inch drives give ~20% higher storage density per RU for a given recording technology,
    • but 3.5 inch drives fit extra platters in 25.4 mm vs 15 mm (50% more platters).
    • For high RPM drives with identical capacities, 2.5 inch wins;
    • for same-RPM, full-size-platter drives, 3.5 inch drives store more per RU.
    • Orienting 3.5 inch drives horizontally, 3-high by 4-wide (12) will fit in 2RU; the best packing is 6/RU.
  • Watts per GB and implied operational costs.
    • In higher RPM drives, Watts per GB currently is comparable (shown above).
    • For 'slow' 7200RPM drives, the fifth-power law of platter diameter to power consumption means 2.5 inch drives consume significantly less power, roughly a factor of 5 (3-4W vs 15W).
    • Western Digital now sell a line of 'Green' 3.5 inch drives with variable RPM, reducing power demand considerably.
    • Operational costs, e.g. cooling capacity and cost of electricity, over the 5 year service life of equipment (~45,000 hours) may be the deciding factor.
      1,000 * 3.5 inch drives will consume ~650,000 kilo-Watt-hours vs 125,000 kWhrs.
      At $0.25/kWhr, a lifetime saving of ~$135 per drive.
  • Average access time (based on rotational latency and seek times). 
    • Smaller drives can be spun faster and still consume less power.
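The electricity figures in the list above can be reproduced directly; the wattages are the approximate ones quoted (3W vs 15W), so the totals come out slightly higher than the rounded ~650,000/125,000 kWh in the text:

```python
def lifetime_kwh(watts, hours=45_000, drives=1_000):
    """Lifetime energy for a fleet of drives, in kilo-Watt-hours."""
    return watts * hours * drives / 1_000

kwh_35 = lifetime_kwh(15)  # 1,000 * 3.5 inch drives: 675,000 kWh
kwh_25 = lifetime_kwh(3)   # 1,000 * 2.5 inch drives: 135,000 kWh
saving_per_drive = (kwh_35 - kwh_25) * 0.25 / 1_000  # at $0.25/kWhr
print(kwh_35, kwh_25, round(saving_per_drive))  # the ~$135/drive above
```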


A little production theory: Why "popular" drives are much cheaper.

Accounting Theory has the concept of the "Learning Curve", also called "Experience Curve".
It originated with Wright in 1936 examining small-scale production and noting that doubling production, reduces costs 10-15%.

Does the Experience Curve scale-up?
If you produce 100,000 times more, (16+ doublings), can you realise these benefits all the way?
  • 10%-per-doubling improvement = 18.5% (81.5% cost reduction)
  • 15%-per-doubling improvement =  7.5% (92.5% cost reduction)
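Checking those figures: a cost multiplier of (1 - r) per doubling compounds over 16 doublings to:

```python
def cost_after_doublings(reduction_per_doubling, doublings=16):
    """Remaining fraction of original unit cost after repeated
    doublings of cumulative production (Wright's learning curve)."""
    return (1 - reduction_per_doubling) ** doublings

print(round(cost_after_doublings(0.10) * 100, 1))  # ~18.5% of original cost
print(round(cost_after_doublings(0.15) * 100, 1))  # ~7.4% of original cost
```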
Large-scale Silicon chip production has very high "barriers to entry", the cost of building "fabrication plants" is now measured in Billions. As components are made smaller, with tighter tolerances, volume production plants for Hard Disk Drives must be following the same trend. To make "next generation" products with features and tolerances 70% the size of the current technology, every process is affected, needing to be more precise, cleaner and with less vibration. These new technologies don't come cheaply.

These plant costs and the product Research and Development costs have to be amortised across all units produced. For fixed and overhead costs to be a small fraction of the per-unit cost, the number of units now has to be very large: 10-100 million.

In 2011, there are forecast to be 400M+ PC's (desktops and laptops) produced. Laptops, using 2.5 inch drives, passed desktops as the most popular format around 5 years ago. These high volume demands underpin the production of both 3.5 inch and 2.5 inch disk drives.

Volume production counts for a lot.
If there isn't already a high demand for a product, the per-unit price will be considerably inflated to cover "sunk costs": everything involved in creating the product.
Once the plant and Research costs are paid for, per-unit costs can be lowered more.
After a time the Price/Capacity of old technology products, even when fully amortised, will be too much higher than new technology and demand will fall away rapidly.
When demand falls and production runs are too small, the plant becomes uneconomic because fixed-costs (operational/production costs) start to overwhelm the per-unit price. Manufacturers must close plants when they start to make losses: the situation can only get worse, rapidly.

As an example, consider the two machines that Seymour Cray designed for Control Data Corporation (CDC) in the 1960's/70's before leaving to create his own company:
  • CDC 6600  World's fastest computer: 1964-1969.
  • CDC 7600  World's fastest computer: 1969-1975. reputedly $5M (equivalent to $30M now).
Both CDC and Cray himself knew everything there was to know about both ECL and TTL technologies they used, yet by the mid-1990's, all the fastest computers were VLSI CMOS: the CPU chips in use today.

The difference between Intel et al and Cray/CDC was volume production.
When you hand-craft 100-1,000 machines, they cost millions.
When you build CPU's by the million, they cost $100-$1,000.
It redefines how you architect and organise super-computers.

This "substitution effect" affects all commodity products.


Further Questions

Economics asserts "Price is the Mediator between Supply and Demand".
In a mature, free market, where purchasers have "perfect information" and products are fungible (perfect substitutes for each other),  competition will drive prices down (and consumption of goods will increase) and inefficient producers will be driven from the market.

But the Enterprise Storage market has the very high gross margins associated with new markets or non-substitutable products.

Q: What's happening between vendors and purchasers to produce this skewed market?
  • Are Enterprise Storage products fungible or not?
  • Are secondary effects at work preventing product substitution? [warranty conditions, staff capability, integrated platform management software, decision inertia/product loyalty or technical conservatism.]
  • Are the purchasing criteria "Price/GB", I/O performance, (perceived) Reliability/Availability, functionality or something else?

Performance (latency and throughput) of HDD-based Enterprise Storage arrays appears to be rated "good enough" in the market: for 10-15 years many new entrants have attempted to break into the market by offering low-cost or high-performance products, with little success.

Q: What "figure of merit" do purchasers use to choose between vendors and products? Is it "just" price/GB, because it isn't "performance"?
  • Without extensive customer research, this may be unknowable.

There aren't convincing reasons favouring smaller or larger form-factor drives unless high RPM drives or SSD (packaged in 2.5 inch or 1.8 inch) are included.

Q: Why has the take-up of SSD's in Enterprise environments been slow when the price/performance ratios are so far ahead of 'conventional' Storage systems?

Kryder's Law, the on-going compound increase in HDD capacity and decline in price/GB, has caused some changes in fundamental ratios:
 Storage Arrays initially lashed together many small devices into "large enough" logical/virtual devices.
 Somewhere in the last 25 years, drive capacity exceeded normal use cases by multiples.

The basic RAID I/O performance drivers for both latency and throughput (many spindles and actuators reduce latency; parallel transfers increase throughput) were invalidated, but the designs seemed not to change.
  • 1988: IBM 3380 7.5GB. Databases, Files fitted within this limit.
    • RAID from 320MB-1GB drives: more spindles, more IO/sec, higher throughput

  • 2010: 600GB 10/15K, 2TB SATA drives.
    • These units are now (much!) larger than most common Databases and file stores.
    • Not just video, images, audio, scanned stuff. MS-Office documents as well.

Q: Why did Storage arrays not respond to this change in fundamental drivers?
  • Did the change happen so slowly that nobody noticed?
  • Storage vendors are generally very innovative and competitive, employing some of the "best and brightest" in computing.
    This wasn't a failure of ability or capability. Perhaps of vision?
  • Are incumbent vendors locked into their own solutions, leaving innovation to new entrants?
  • Did consumers demand products they were familiar and comfortable with, preventing vendors from changing designs?

Why didn't large numbers of really small HDD's get tried by major vendors, even as an experiment?
IBM, the leader in 1 inch drives, sold its drive business to Hitachi Data Systems (HDS), a leading Storage system vendor with expertise in many related areas.

HDS had the capability to create custom packaging, custom electronics (ASIC's) and to redesign the 1 inch drive format (Compact Flash with IDE). For very small drives to be soldered onto boards, a simple, continuous serial interface was needed. SAS, Serial Attached SCSI, would fit the bill today.

In the 3 years before 1 inch HDD's lost their price advantage to Flash memory, HDS could have built a prototype and proven the concept, but (seemingly) didn't. Nor did any academic projects.

Q: Why did the Industry and Academic researchers not build a "many tiny drives" Array between 2000 and 2005?

To show that very low overhead Storage devices can be built outside Google datacentres, these are the full on-line instructions and Bill of Materials for one service provider's solution:

"Petabytes on a budget: How to build cheap cloud storage", September 1, 2009
"Petabytes on a Budget v2.0: Revealing More Secrets", July 20, 2011
135TB for $7,384, around 50% more than the raw cost of the drives.
A chart in the first piece comparing the Price / GB of different solutions.

RAW:               $81,000 [660 @ 1.5TB ] 45 @ $120 == $5400
backblaze:        $117,000 [50% o'head]
Dell MD1000       $826,000
SUN/Ora X4550   $1,000,000
NetApp FAS-6000 $1,714,000
Amazon S3       $2,806,000
EMC NS-960      $2,860,000


Monday, November 28, 2011

Optical Disks as Dense near-line storage?

A follow-up to an earlier post [search for 'Optical']:
Could Optical Disks be a viable near-line Datastore?
Use a robot arm to pick and load individual disks from 'tubes' into multiple drives.
Something the size of a single filing cabinet drawer would be both cheap and easily contain a few thousand disks. That's gotta be interesting!

Short answer, no...

A 3.5" drive has a form-factor of 4 in x 5.75 in x 1 in. Cubic capacity: 23 in³
A 'tube' of 100 optical disks: 5.5 in x 5.5 in x 6.5 in. Cubic capacity: 160-200 in³ [footprint or packed]

A 'tube' of 100, minus all the supporting infrastructure to select and read a disk, is 7-9 times the volume of a 3.5in hard disk drive, so each optical disk must hold at least 7-9% of an HDD's capacity to be competitive.

To replace a 1TB HDD, optical disks must each hold at least 7% of 1,000GB, or 70GB: larger than even Blu-ray, and 15-20 times the capacity of single-layer DVDs (4.7GB).
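The volumetric argument above can be checked with a few lines of arithmetic, using the dimensions and capacities as stated:

```python
# Volumetric break-even: how much must each optical disc hold for a
# 100-disc tube to match a 3.5in HDD on capacity per cubic inch?
hdd_volume = 4 * 5.75 * 1.0          # in^3, a 3.5in drive: 23 in^3
discs_per_tube = 100
hdd_capacity_gb = 1000               # a 1TB drive

for tube_volume in (160, 200):       # in^3, footprint vs packed
    ratio = tube_volume / hdd_volume               # tube displaces this many HDDs
    min_disc_gb = ratio * hdd_capacity_gb / discs_per_tube
    print(f"tube of {tube_volume} in^3 = {ratio:.1f}x HDD volume; "
          f"each disc needs >= {min_disc_gb:.0f} GB")
```

This gives roughly 70-87GB per disc, confirming the 7-9% figure.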

The current per-GB price of 3.5" HDDs is around $0.05-$0.10/GB, squeezing 4.7GB DVDs on price as well.

2TB drives are common, 3TB are becoming available now (2011). And it gets worse.

Estimates put the maximum possible 3.5" HDD size at 20-40TB.
To be competitive, optical disks would need to get up around 1TB in size and cost under $1.

Around 2000, when 20-40GB drives reigned, there was a time when 4.7GB DVDs were both the densest and cheapest storage available. Kryder's Law, a doubling of HDD capacity every 1-2 years, has seen the end of that.

Sunday, November 27, 2011

Journaled changes: One solution to RAID-1 on Flash Memory

As I've posited before, simple-minded mirroring (RAID-1) of Flash Memory devices is not only a poor implementation, but worst-case.

My reasoning is: Flash wears out, and putting identical loads on identical devices will result in the maximum wear-rate of all bits, which is bad but not catastrophic. It also creates the potential for simultaneous failures, where a common weakness fails in two devices at the same time.

The solution is to not put the same write load on the two devices, but still keep two exact copies.
This problem is an especial concern for PCI-SSD devices internal to a system. These devices can't normally be hot-plugged: though there are hot-plug standards for PCI devices (e.g. Thunderbolt and ExpressCard), they are not usually options for servers and may have limited performance.

One solution, I'm not sure if it's optimal but it is 'sufficient', is to write blocks as normal to the primary device and maintain the secondary device as a snapshot plus (compressed) journal entries. When the journal hits a high-water mark, the snapshot is brought up to date and becomes an exact copy again. Other triggers are possible: a timer expiring (hourly, 6-hourly, daily, ...), or detecting that the momentum of changes will fill the journal to 95% before the snapshot could be updated.

If the journal fills, the mirror is invalidated and either changes must be halted or the devices go into unprotected operation; neither is a desirable operational outcome. A temporary, though unprotected, work-around is to write the on-going journal either to the primary device or into memory.

Implementation Outline:

With existing devices, a set of blocks on both devices needs to be allocated for the journal. Whilst the journal area won't be written to on the primary device, it needs to be there:
  • so identical data areas are available on both devices
  • so the primary and secondary devices can be swapped, or another device designated the new primary (either as an additional device or a replacement for the secondary).
I'd prefer additional chips be added to the Flash devices specifically for journaling. NOR chips, expensive but not as prone to wear, could even be used. [If similar speed.]

Better techniques of dealing with the journal as a set of version-changes with unique keys (e.g. sequence number) would allow a device to be removed from a mirror and rejoined with minimal updates, avoiding a slow and expensive full-copy. This edge-case would benefit from writing the journal to the primary. One of the most annoying behaviours of RAID systems is "popping" a drive for a second or two (as in "is this the right drive? Oops, no"), then having to wait hours for a full rebuild to complete. Even if nothing was changed on the volumes in that short time...
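A minimal sketch of the snapshot-plus-journal secondary described above, with in-memory dicts standing in for block devices. All names are illustrative; a real implementation would sit below the block layer and handle crash-consistency, compression and wear:

```python
# Illustrative sketch of the snapshot + journal secondary.
# Dicts stand in for block devices; names are hypothetical.
HIGH_WATER = 0.95   # roll the journal into the snapshot at 95% full

class JournaledMirror:
    def __init__(self, journal_capacity):
        self.primary = {}        # primary device: blocks written in place
        self.snapshot = {}       # secondary: point-in-time exact copy
        self.journal = []        # secondary: (sequence, block, data)
        self.seq = 0
        self.capacity = journal_capacity

    def write(self, block, data):
        self.primary[block] = data              # normal write to primary
        self.seq += 1
        self.journal.append((self.seq, block, data))
        if len(self.journal) >= HIGH_WATER * self.capacity:
            self.roll_snapshot()                # high-water mark reached

    def roll_snapshot(self):
        """Replay the journal so the snapshot is an exact copy again."""
        for _, block, data in self.journal:
            self.snapshot[block] = data
        self.journal.clear()

    def secondary_view(self):
        """Secondary contents = snapshot + journal replay."""
        view = dict(self.snapshot)
        for _, block, data in self.journal:
            view[block] = data
        return view

m = JournaledMirror(journal_capacity=10)
for i in range(25):
    m.write(i % 5, f"rev-{i}")
assert m.secondary_view() == m.primary      # still two exact copies
```

Note that the two devices see very different write loads: the primary takes every block write in place, while the secondary takes journal appends plus an occasional bulk snapshot update, which is the point of the exercise. The sequence numbers are what would allow a cheap rejoin after a brief disconnection.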


RAID-1 provides both protection against device failure and improves read I/O performance. Write performance is limited to the slowest device.

Mirroring can also be used as an operational technique to create full-copies of large/critical filesystems or databases on a live system with no downtime.
A 3rd or 4th volume is added to a mirror, synchronised, then after ensuring content-consistency "split-off" and used separately, typically as the base for a test/conversion environment or backups.  Because the disk, or set of disks, can be loaded onto a truck/plane, very high effective bandwidths are possible for the price of a courier. It can be faster than volume 'snapshots' and over-the-wire replication, not to mention a fraction of the cost. Airlines are known to have moved data-centres between continents this way, whilst maintaining their 24x7 booking and flight systems.

This (block-exact + snapshot-and-journal) model can be scaled up to N-replicas by replicating either or both types of replica. For different operational requirements, different combinations would be preferred. All combinations have uses/advantages in specific instances and won't be enumerated.

Friday, November 25, 2011

Flash Memory: will filesystems become the CPU bottleneck?

Flash memory at 50k+ IO/sec may be too fast for Operating Systems (like Linux): file-system operations consume ever more CPU, potentially saturating it. They are on the way to becoming the system rate-limiting factor, otherwise known as a bottleneck.

What you can get away with at 20-100 IO/sec, consuming 1-2% of CPU, will be a CPU hog at 50k-500k IO/sec, a speed-up of roughly 500-25,000 times.

The effect is the reverse of the way Amdahl speed-up is explained.

Amdahl throughput scaling is usually explained like this:
If your workload has 2 parts (A is single-threaded, B can be parallelised), then as you decrease the time taken for 'B' by adding parallel compute-units, the workload becomes dominated by the single-threaded part, 'A'. Halving the time it takes to run 'B' doesn't halve the total run time. If 'A' and 'B' take equal time (4 units each, 8 total), a 2-times speed-up of 'B' (4 units to 2) gives a 25% reduction in run-time (8 units to 6), and a 4-times speed-up gives 37.5% (8 to 5).
This creates a limit to the speed-up possible: even if 'B' reduces to 0 units, it still takes the same time to run the single-threaded part, 'A' (4 units here).
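The worked numbers above can be checked directly:

```python
# The worked Amdahl example: A (single-threaded) and B (parallelisable)
# each take 4 units, 8 in total.
def run_time(serial, parallel, speedup):
    return serial + parallel / speedup

base = run_time(4, 4, 1)                     # 8 units
print((base - run_time(4, 4, 2)) / base)     # 2x on B -> 0.25 saved
print((base - run_time(4, 4, 4)) / base)     # 4x on B -> 0.375 saved
print(run_time(4, 4, float("inf")))          # limit: 4 units, the serial part
```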

A corollary: if the part being sped up is not well chosen, the rate of improvement for each doubling of cost approaches zero.

The filesystem bottleneck is the reverse of this:
If your workload has an in-memory part (X) and wait-for-I/O part (W) both of which consume CPU, if you reduce the I/O wait to zero without reducing the CPU overhead of 'W', then the proportion of useful work done in 'X' decreases. In the limit, the system throughput is constrained by CPU expended on I/O overhead in 'W'.
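The same point in miniature, with an assumed, purely illustrative, CPU cost per I/O:

```python
# Reverse of Amdahl: the I/O *wait* shrinks to zero, but the CPU cost
# per I/O does not. Assume (illustratively) 20 microseconds of CPU
# spent in the filesystem per I/O.
cpu_cost_per_io = 20e-6
for iops in (100, 1_000, 50_000, 500_000):
    fraction = min(1.0, iops * cpu_cost_per_io)   # of one CPU
    print(f"{iops:>7,} IO/s -> {fraction:6.1%} of a CPU on I/O overhead")
```

At 100 IO/sec the overhead is negligible (0.2% of a CPU); at 50k IO/sec the same per-I/O cost consumes an entire CPU, and at 500k IO/sec the filesystem alone would need ten of them.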

The faster random I/O of Flash Memory will reduce application execution time, but at the expense of increasing % system CPU time. For a single process, the proportion and total CPU-effort of I/O overhead remains the same. For the whole system, more useful work is being done (it's noticeably "faster"), but because the CPU didn't get faster too, it needs to spend a lot more time on the FileSystem.
Jim Gray observed that:
  • CPUs are now mainly idle, i.e. waiting on RAM or I/O.
    Level-1 cache is roughly the same speed as the CPU; everything else is much slower and must be waited for.
  • The time taken to scan a 20TB disk using random I/O will be measured in days, whilst a sequential scan ("streaming") will take hours.
Reading a "Linux Storage and Filesystem Workshop" (LSF) conference report, I was struck by comments that:
  • Linux file systems can consume large amounts of CPU doing their work: not just fsck, but handling directories, file metadata, free block chains, inode block chains, block and file checksums, ...
There's a very simple demonstration of this: optical disk (CD-ROM or DVD) performance.
  • A block-by-block copy (dd) of a CD-ROM at "32x", approx 3MB/sec, will copy a full 650MB in 3-4 minutes. Wikipedia states a 4.7GB DVD takes 7 minutes (5.5MB/sec) at "4x".
  • Mounting a CD or DVD and doing a file-by-file copy takes 5-10 times as long.
  • Installing or upgrading system software from the same CD/DVD is usually measured in hours.
The fastest way to upgrade software from CD/DVD is to copy an image with dd to hard disk, then mount that image. The difference is the random I/O (seek) performance of the underlying media, not the FileSystem. [Haven't tested times or speed-up with a fast Flash drive.]

This performance limit may have been something that the original Plan 9 writers knew and understood:
  • P9 didn't 'format' media for a filesystem: initialised a little and just started writing blocks.
  • didn't have fsck on client machines, only the fileserver.
  • the fileserver wrote to three levels of storage: RAM, disk, Optical disk.
    RAM and disk were treated as cache, not permanent storage.
    Files were pushed to Optical disk daily, creating a daily snapshot of the filesystem at the time. Like Apple's TimeMachine, files that hadn't changed were 'hard-linked' to the new directory tree.
  • The fileserver had operator activities like backup and restore. The design had no super-user with absolute access rights, so avoided many of the usual admin-related security issues.
  • Invented 'overlay mounts', managed at user not kernel level, to combine the disparate file-services available and allow users to define their own semantics.
Filesystems have never, until now, focussed on CPU performance; rather the opposite: they've traded CPU and RAM to reduce I/O latency, historically improving system throughput, sometimes by orders of magnitude.
Early examples were O/S buffering/caching (e.g. Unix) and the 'elevator algorithm' that reorders writes to match disk characteristics.

This 'burn the CPU' trade-off shows up with fsck as well. An older LSF piece suggested that fsck runs slowly because it doesn't do a single pass over the disk, and is effectively forced into near worst-case unoptimised random I/O.

On my little Mac Mini with a 300GB disk, there's 225GB used. Almost all of it, especially the system files, is unchanging. Most of the writing to disk is "append mode" (music, email, downloads), either blocks-to-a-file or file-to-directory. With transactional Databases, it's a different story.

The filesystem treats the whole disk as if every byte could be changed in the next second - and I pay a penalty for that in complexity and CPU cycles. Seeing my little Mac or an older Linux desktop do a filesystem check after a power fail is disheartening...

I suggest future O/S's will have to contend with:
  • Flash or SCM with close to RAM performance 'near' the CPU(s)  (on the PCI bus, no SCSI controller)
  • near-infinite disk ("disk is tape", Jim Gray) that you'll only want to access as "seek and stream". It will also take "near infinite" time to scan with random I/O. [another Jim Gray observation]
And what are the new rules for filesystems in this environment?:
  • two sorts of filesystems that need to interwork:
    • read/write that needs fsck to properly recover after a failure and
    • append-only that doesn't need checking once "imaged", like ISO 9660 on optical disks.
  • 'Flash' file-system organised to minimise CPU and RAM use. High performance/low CPU use will become as important as managing "wear" for very fast PCI Flash drives.
  • 'hard disk' filesystem with on-the-fly append/change of media and 'clone disk' rather than 'repair f/sys'.
  • O/S must seamlessly/transparently:
    1. present a single file-tree view of the two f/sys
    2. like Virtual Memory, safely and silently migrate data/files from fast to slow storage.
I saw a quote from Ric Wheeler (EMC) from LSF-07 [my formatting]:
 the basic contract that storage systems make with the user
 is to guarantee that:
  •  the complete set of data will be stored,
  •  bytes are correct and
  •  in order, and
  •  raw capacity is utilized as completely as possible.
I disagree nowadays with his maximal space-utilisation clause for disk. When 2TB costs $150 (7.5c/GB), you can afford to waste a little here and there to optimise other factors.
With Flash Memory at $2-$5/GB, you don't want to waste much of that space.

Jim Gray (again!) early on formulated "the 5-minute rule", which needs rethinking, especially with cheap Flash Memory redefining the underlying engineering factors/ratios. These sorts of explicit engineering trade-off calculations have to be redone for the current disruptive changes in technology.
  • Gray, J., Putzolu, G.R. 1987. The 5-minute rule for trading memory for disk accesses and the 10-byte rule for trading memory for CPU time. SIGMOD Record 16(3): 395-398.
  • Gray, J., Graefe, G. 1997. The five-minute rule ten years later, and other computer storage rules of thumb. SIGMOD Record 26(4): 63-68.
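The break-even formula from the 1997 paper is easy to re-run. The 1997 inputs below are from Gray and Graefe; the 2011 inputs are rough assumptions of mine, for illustration only:

```python
# Gray & Graefe's break-even reference interval:
#   (pages per MB of RAM / accesses per second per drive)
#     x (price of a drive / price per MB of RAM)
def break_even_seconds(pages_per_mb, accesses_per_sec,
                       price_per_drive, price_per_mb_ram):
    return (pages_per_mb / accesses_per_sec) * (price_per_drive / price_per_mb_ram)

# 1997 inputs from the paper: 8KB pages, 64 IO/s drives, $2000/drive, $15/MB RAM
print(break_even_seconds(128, 64, 2000, 15))    # ~267s: about 5 minutes

# Rough 2011 inputs (my assumptions): $100 drive at 100 IO/s, RAM ~ $0.01/MB
print(break_even_seconds(128, 100, 100, 0.01))  # ~12,800s: hours, not minutes
```

With today's prices the "5-minute" interval stretches to hours, which is exactly why the rule needs re-deriving, and again once Flash's price and IO-rate are substituted for disk's.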
I think Wheeler's Storage Contract also needs to say something about 'preserving the data written', i.e. the durability and dependability of the storage system.
For how long? With what latency? How to express that? I don't know...

There is also the matter of "storage precision", already catered for with CDs and CD-ROM. Wikipedia states:
The difference between sector size and data content are the header information and the error-correcting codes, that are big for data (high precision required), small for VCD (standard for video) and none for audio. Note that all of these, including audio, still benefit from a lower layer of error correction at a sub-sector level.
Again, I don't know how to express this, implement it nor a good user-interface. What is very clear to me is:
  • Not all data needs to come back bit-perfect, though it is always nice when it does.
  • Some data we would rather not have, in whole or part, than come back corrupted.
  • There are many data-dependent ways to achieve Good Enough replay when that's acceptable.
First, the aspects of Durability and Precision need to be defined and refined, then a common File-system interface created and finally, like Virtual Memory, automated and executed without thought or human interaction.

This piece describes FileSystems, not Tabular Databases nor other types of Datastore.
The same disruptive technology problems need to be addressed within these realms.
Of course, it'd be nicer/easier if other Datastores were able to efficiently map to a common interface or representation shared with FileSystems and all the work/decisions happened in Just One Place.

Will that happen in my lifetime? Hmmmm....

Sunday, November 20, 2011

Building a RAID disk array circa 1988

In "A Case for Redundant Arrays of Inexpensive Disks (RAID)" [1988], Patterson et al of University of California Berkeley started a revolution in Disk Storage still going today. Within 3 years, IBM had released the last of their monolithic disk drives, the 3390 Model K, with the line being discontinued and replaced with IBM's own Disk Array.

The 1988 paper has a number of tables where it compares Cost/Capacity, Cost/Performance and Reliability/Performance of IBM Large Drives, large SCSI drives and 3½in SCSI drives.

The prices ($/MB) cited for the IBM 3380 drives are hard to reconcile with published prices: press releases in Computerworld and the IBM Archives for 3380 disks (7.5GB, 14" platter, 6.5kW) and their controllers suggest $63+/MB for 'SLED' (Single Large Expensive Disk), rather than the "$18-10" cited in the Patterson paper.

The prices for the 600MB Fujitsu M2316A ("Super Eagle") [$20-$17/MB] and the 100MB Conner Peripherals CP-3100 [$10-$7/MB] are in line with historical prices found on the web.

The last table in the 1988 paper lists projected prices for different proposed RAID configurations:
  • $11-$8/MB for 100 x CP-3100 [10,000MB] and
  • $11-$8/MB for 10 x CP-3100 [1,000MB]
There are no design details given.

In 1994, Chen et al in "RAID: High-Performance, Reliable Secondary Storage" used two widely sold commercial systems as case studies.
The (low-end) NCR device was closer to what we'd now call a 'hardware RAID controller', ranging from 5 to 25 disks and priced at $22,000-$102,000. It provided a SCSI interface and didn't buffer. A system diagram was included in the paper.

The StorageTek Iceberg was a high-end device meant for connection to IBM mainframes. Advertised as starting at 100GB (32 drives) for $1.3M, up to 400GB for $3.6M, it provided multiple (4-16) IBM ESCON 'channels'.

For the NCR, from InfoWorld, 1 Oct 1990, p19 (via Google Books):
  • min config: 5 x 3½in drives, 420MB each.
  • $22,000 for 1.05GB of storage.
  • Adding 20 x 420MB to reach 8.4GB listed at $102,000 (March 1991).
  • $4,000/drive + $2,000 controller.
  • NCR-designed controller chip + SCSI chip.
  • 4 RAID implementations: RAID 0, 1, 3, 5.
The StorageTek Iceberg was released in late 1992 with projected shipments of 1,000 units in 1993. It was aimed at replacing IBM 'DASD' (Direct Access Storage Device): exactly the comparison made in the 1988 RAID paper.
The IBM-compatible DASD, which resulted from an investment of $145 million and is technically styled the 9200 disk array subsystem, is priced at $1.3 million for a minimum configuration with 64MB of cache and 100GB of storage capacity provided by 32 Hewlett-Packard 5.25-inch drives.

A maximum configuration, with 512MB of cache and 400GB of storage capacity from 128 disks, will run more than $3.6 million. Those capacity figures include data compression and compaction, which can as much as triple the storage level beyond the actual physical capacity of the subsystem.
Elsewhere in the article more 'flexible pricing' (20-25% discount) is suggested:
with most of the units falling into the 100- to 200GB capacity range, averaging slightly in excess of $1 million apiece.
Whilst no technical reference is easily accessible on-line, more technical details are mentioned in the press release for the 1994 upgrade, the 9220. Chen et al [1994] claim "100,000 lines of code" were written.

More clues come from a feature, "Make Room for DASD" by Kathleen Melymuka (p62) in CIO magazine, 1 June 1992 [accessed via Google Books, no direct link]:
  • 5¼in Hewlett-Packard drives were used. [model number & size not stated]
  • The "100GB" may include compaction and compression. [300% claimed later]
  • (32 drives) "arranged in dual redundancy array of 16 disks each (15+1 spare)
  • RAID-6 ?
  • "from the cache, 14 pathways transfer data to and from the disk arrays, and each path can sustain a 5Mbps transfer rate"
The Chen et al paper (p175 of ACM Computing Surveys, Vol 26, No 2) gives this information on the Iceberg/9200:
  • it "implements an extended RAID level-5 and level-6 disk array"
    • 16 disks per 'array', 13 usable, 2 Parity (P+Q), 1 hot spare
    •  "data, parity and Reed-Solomon coding are striped across the 15 active drives of an array"
  • Maximum of 2 Penguin 'controllers' per unit.
  • Each controller is an 8-way processor, handling up to 4 'arrays' each, or 150GB (raw).
    • Implying 2.3-2.5GB per drive.
      • The C3010, seemingly the largest HP disk in 1992, was 2.47GB unformatted and 2GB formatted (512-byte sectors) [notionally 595-byte unformatted sectors].
      • The C3010 specs included:
        • MTBF: 300,000 hrs
        • Unrecoverable Error Rate (UER): 1 in 10^14 bits transferred
        • 11.5 msec avg seek (5.5 msec rotational latency, 5400 RPM)
        • 256KB cache, 1:1 sector interleave, 1,7 RLL encoding, Reed-Solomon ECC
        • max 43W with the 'fast-wide' option, 36W running
    • runs up to 8 'channel programs' and independently transfers on 4 channels (to the mainframe).
    • manages a 64-512MB battery-backed cache (whether shared or per-controller is not stated).
    • implements on-the-fly compression, citing a maximum doubling of capacity.
      • and the dynamic mapping necessary to place variable-sized IBM CKD (count, key, data) blocks onto the fixed blocks used internally.
      • an extra (local?) 8MB of non-volatile memory is used to store these tables/maps.
    • uses a "Log-Structured File System", so blocks are not written back to the same place on the disk.
    • Not stated whether the SCSI buses are one-per-array or 'orthogonal', i.e. redundancy groups made up from one disk per 'array'.
Elsewhere, Katz, one of the authors, uses a diagram of a generic RAID system not subject to any "Single Point of Failure":
  • with dual-controllers and dual channel interfaces.
    • Controllers cross-connected to each interface.
  • dual-ported disks connected to both controllers.
    • This halves the number of unique drives in a system, or doubles the number of SCSI buses/HBA's, but copes with the loss of a controller.
  • Implying any battery-backed cache (not in diagram) would need to be shared between controllers.
From this, a reasonable guess at aspects of the design is:
  • HP C3010 drives were used, 2Gb formatted. [Unable to find list prices on-line]
    • These drives were SCSI-2 (up to 16 devices per bus)
    • available as single-ended (5MB/sec) or 'fast' differential (10MB/sec) or 'fast-wide' (16-bit, 20MB/sec). At least 'fast' differential, probably 'fast-wide'.
  • "14 pathways" could mean 14 SCSI buses, one per line of disks, but it doesn't match with the claimed 16 disks per array.
    • 16 SCSI buses with 16 HBA's per controller matches the design.
    • Allows the claimed 4 arrays of 16 drives per controller (64) and 128 max.
    • SCSI-2 'fast-wide' allows 16 devices total on a bus, including host initiators. This implies that a full 16-disk array plus its initiators could not share a single bus: either more SCSI buses or fewer drives per bus were needed.
  • 5Mbps transfer rate probably means synchronous SCSI-1 rates of 5MB/sec or asynchronous SCSI-2 'fast-wide'.
    • It cannot mean the 33.5-42Mbps burst rate of the C3010.
    • The C3010 achieved transfer rates of 2.5MB/sec asynchronously in 'fast' mode, or 5MB/sec in 'fast-wide' mode.
    • Only the 'fast-wide' SCSI-2 option supported dual-porting.
    • The C3010 technical reference states that both powered-on and powered-off disks could be added/removed to/from a SCSI-2 bus without causing a 'glitch'. Hot swapping (failed) drives should've been possible.
  • RAID-5/6 groups of 15 with 2 parity/check disks of overhead: 26GB usable per array, max 208GB.
    •  RAID redundancy groups are implied to be per (16-disk) 'array' plus one hot-spare .
    • But 'orthogonal' wiring of redundancy groups was probably used, so how many SCSI buses were needed per controller, in both 1 and 2-Controller configurations?
    • No two drives in a redundancy group should be connected via the same SCSI HBA, SCSI bus, power-group or cooling-group.
      This allows live hardware maintenance or single failures.
    • How were the SCSI buses organised?
      With only 14 devices total per SCSI-2 bus, a max of 7 disks per shared controller was possible.
      The only possible configurations that allow in-place upgrades are 4 or 6 drives per bus.
      The 4-drives/bus resolves to "each drive in an array on a separate bus".
    • For manufacturing reasons, components need standard configurations.
      It's reasonable to assume that all disk arrays would be wired identically, internally and with common mass terminations on either side, even to the extent of different connectors (male/female) per side.
      This allows simple assembly and expansion, and trivially correct installation of SCSI terminators on a 1-Controller system.
      Only separate-bus-per-drive-in-array (max 4-drives/bus), meets these constraints.
      SCSI required a 'terminator' at each end of the bus. Typically one end was the host initiator. For dual-host buses, one at each host HBA works.
    • Max 4-drives per bus results in 16 SCSI buses per Controller (64-disks per side).
      'fast-wide' SCSI-2 must have been used to support dual-porting.
      The 16 SCSI buses, one per slot in the disk arrays, would've continued across all arrays in a fully populated system.
      In a minimum system of 32 drives, there would've been only 2 disks per SCSI bus.
  • 1 or 2 controllers with a shared 64M-512M cache and 8Mb for dynamic mapping.
This would be a high-performance and highly reliable design with a believable $1-2M price for 64 drives (200Gb notional, 150Gb raw):
  • 1 Controller
  • 128Mb RAM
  • 8 ESCON channels
  • 16 SCSI controllers
  • 64 * 2Gb drives as 4*16 arrays, 60 drives active, 52 drive-equivalents after RAID-6 parity.
  • cabinets, packaging, fans and power-supplies
From the two price-points, can we tease out a little more of the costs? [no allowance made for ESCON channel cards]
  • 1 Controller + 32 disks + 64M cache = $1.3M
  • 2 Controllers + 128 disks + 512M cache = $3.6M
As a first approximation, assume that 512M of cache costs half as much as one Controller in a 'balanced' system. This gives a solvable set of simultaneous equations:
  • 1.0625 Controllers + 32 disks  = $1.3M
  • 2.5 Controllers + 128 disks = $3.6M
  • $900,000 / Controller [probably $50,000 high]
  • $70,000 / 64M cache [probably $50,000 low]
  • $330,000 / 32 disks ($10k/drive, or $5/MB)
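The elimination can be done explicitly (a sketch, with prices in $M as above):

```python
# Solving the simultaneous equations above (prices in $M):
#   1.0625*C + 32*d  = 1.3     (1 Controller + 64M cache + 32 disks)
#   2.5*C    + 128*d = 3.6     (2 Controllers + 512M cache + 128 disks)
# Multiply the first by 4 and subtract to eliminate the disks.
C = (3.6 - 4 * 1.3) / (2.5 - 4 * 1.0625)     # cost of one Controller
d = (1.3 - 1.0625 * C) / 32                  # cost of one disk
print(f"Controller ~ ${C * 1000:.0f}k, disk ~ ${d * 1000:.1f}k "
      f"(~${d * 1e6 / 2000:.2f}/MB for a 2GB drive)")
```

This reproduces the ~$900k Controller and ~$10k/drive ($5/MB) figures quoted above.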
High-end multi-processor VAX system pricing at the time is in-line with this $900k estimate, but more likely an OEM'd RISC processor (MIPS or SPARC?) was used.
This was a specialist, low-volume device: expected first-year sales volume was ~1,000 units.
By 1994, when the upgrade (9220) was released, sales of 775 units had been reported.

Another contemporary computer press article cites the StorageTek array costing $10/MB compared to $15/MB for IBM DASD. 100GB @ $10/MB is $1M, congruent with the claims above.

How do the real-world products in 1992 compare to the 1988 RAID estimates of Patterson et al?
  • StorageTek Iceberg: $10/MB vs the $11-$8 projected.
    • This was achieved using 2GB 5¼in drives, not the 100MB 3½in drives modelled.
    • HP sold a 1GB 3½in SCSI-2 drive (C2247) in 1992. This may have formed the basis of the 9220 upgrade ~two years later.
  • Using the actual, not notional, supplied capacity (243GB), the Iceberg cost $15/MB.
  • The $15/MB for IBM DASD compares well to the $18-$10 cited in 1988.
    • But IBM, in those intervening 5 years, had halved the per-MB price of their drives once or twice. The 1988 "list price" from the archives of ~$60/MB is reasonable.
  • In late 1992, 122MB Conner CP-30104s were advertised for $400, or $3.25/MB.
    These were IDE drives, though a 120MB version of the SCSI CP-3100 was sold, price unknown.
The 8.4GB 25-drive NCR 6298 gave $12.15/MB, again close to the target zone.
From the Dahlin list, 'street prices' for 420MB-class drives at the time were $1600 for the Seagate ST-1480A and $1300 for a 425MB Quantum, or $3.05-$3.75/MB.
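Cross-checking the cited $/MB figures (the 426MB capacity used for the ST-1480A is my assumption for a '420MB-class' drive):

```python
# Cross-check of the cited dollars-per-MB figures.
products = [
    ("NCR 6298, 8.4GB @ $102k list",     102_000, 8_400),
    ("Seagate ST-1480A, 426MB @ $1600",    1_600,   426),
    ("Quantum 425MB @ $1300",              1_300,   425),
    ("Conner CP-30104, 122MB @ $400",        400,   122),
]
for name, dollars, mb in products:
    print(f"{name}: ${dollars / mb:.2f}/MB")
```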

The price can't be directly compared to either IBM DASD or StorageTek's Iceberg, because the NCR 6298 only provided a SCSI interface, not an IBM 'channel' interface.

The raw storage costs of the StorageTek Iceberg and NCR are roughly 2.5:1.
Not unexpected due to the extra complexity, size and functionality of the Iceberg.