Saturday, December 24, 2011

Big Drives in 2020

Previously I've written about Mark Kryder's 7TB/platter (2.5 inch) prediction for 2020.
This is more speculation around that topic.

1. What if we don't hit 7TB/platter, maybe only 4TB?

There have been any number of "unanticipated problems" encountered in scaling Silicon and Computing technologies; will more be encountered with HDDs before 2020?

We already have 1TB platters in 3.5 inch announced in Dec-2011, with at least one new technique announced to increase recording density (Sodium Chloride doping), so it's not unreasonable to expect another 2 doublings in capacity, just in taking what's in the Labs and figuring out how to put it into production.

Which means we can expect 2-4TB/platter (2.5 inch) to be delivered in 2020.
At $40 per single-platter disk?
That depends on a) the two major vendors and the oligopoly pricing and b) the yields and costs of the new fabrication plants.

Seems to me that Price/GB will drop, but maybe not to levels expected.
Especially if the rapid decline in SSD/Flash Memory Price/GB plateaus and removes price competition.


2. Do we need to offer The Full Enchilada to everyone?

Do laptop and ultrabook users really need 4TB of HDD when they are constantly on-line?
1-2TB will store a huge amount of video, many virtual machine images and a lifetime's worth of audio.
There might be a market for smaller capacity disks, either through smaller platters, smaller form-factors or underusing a full-width platter.

Each option has merits.
The final determinant will be perceived consumer Value Proposition, the Price/Performance in the end-user equipment.


3. What will the 1.8 inch market be doing?

If these very small form-factor drives in mobile equipment get to 0.5-2TB, that will seem effectively infinite.

There is no point in adopting old/different platter coatings and head-manufacturing techniques for these smaller form-factors unless other engineering or usability factors come into play: such as sensitivity to electronic noise, contamination, heat, vibration, ...


4. The fifth-power of diameter and cube-of-RPM: impact of size and speed?

2.5 inch drives are set to completely displace 3.5 inch in new Enterprise Storage systems within a year. This is primarily driven by Watts/GB and GB/cubic-space.

The aerodynamic drag of disk platters, hence the power consumed by a drive, varies with the fifth-power of platter diameter and the cube of the rotational velocity (RPM).

If you halve the platter size (5.25 inch to 2.5 inch), drive power reduces 32-fold.
If you then double the RPM of the drive (3600 to 7200), power increases 8-fold,
a nett reduction in power demand of 4 times.

Reducing platter diameter by the square root of 2 (halving the recordable area) reduces drive power roughly 5.7-fold. The proportion is the same for 5.25::3.5 inch, 3.5::2.5 inch and 2.5::1.8 inch.

Reducing a 2.5 inch platter to about 2.1 inches allows a drive to be spun up from 5400 RPM to 7200 RPM whilst using the same drive power, with around 70% of the original surface area.
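
As a sanity check, here's a minimal Python sketch of that scaling law, assuming drive power is dominated by aerodynamic drag and so proportional to (diameter^5 * RPM^3); the particular diameters and speeds are just the cases discussed above.

```python
def relative_power(diameter_ratio, rpm_ratio):
    """Relative drive power, assuming power scales as (platter diameter)^5 * RPM^3."""
    return diameter_ratio ** 5 * rpm_ratio ** 3

# Halve the platter diameter at the same RPM: ~1/32 of the power.
print(relative_power(0.5, 1.0))                  # 0.03125

# Halve the diameter AND double the RPM (3600 -> 7200): net ~1/4 of the power.
print(relative_power(0.5, 2.0))                  # 0.25

# Shrink the diameter by sqrt(2) (half the recordable area), same RPM: ~5.7x less power.
print(relative_power(2 ** -0.5, 1.0))            # ~0.177

# A 2.5 inch platter shrunk to ~2.1 inches, spun up from 5400 to 7200 RPM: ~equal power.
print(relative_power(2.1 / 2.5, 7200 / 5400))    # ~0.99
```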

Whilst not in the class of Enterprise Storage "performance optimised" drives (10K and 15K RPM), it would be a noticeable improvement for Desktop PCs, given that by 2020 they will also be using large SSDs/Flash Memory and the HDD will be used solely for "Seek and Stream" tasks.

There is very little reason to "de-stroke" drives and limit them to less than full-platter access if they are not "performance-optimised". It's a waste of resource for exactly the same input cost.


5. Will 3.5 inch "capacity-optimised" disks survive?
Will everything be 2.5 or 1.8 inch form-factor?

There are 3 markets that are interested in "capacity-optimised" disks:
  • Storage Appliances [SOHO, SME, Enterprise and Cloud]
  • Desktop PC
  • Consumer Electronics: PVR's etc.
When 1TB 2.5 inch drives are affordable, they will make new, smaller and lighter Desktop PC designs possible. Dell and HP might even offer modules that attach on the 100mm x 100mm "VESA" standard to the back of LCD screens. A smaller variant of the Apple Mac Mini is possible, especially if a single power-supply is available.

Consumer PVR's are interested in Price/GB, not Watts/GB. They will be driven by HDD price.
The manufacturers don't pay for power consumed, customers don't evaluate/compare TCO's and there is no legislative requirement for low-power devices.  Government regulation could be the wild-card driving this market.

There's a saying, something like this, that I thought was made by Dennis Ritchie:
 "Memory is Cheap, until you need to buy enough for 10,000 PC's".
[A comment on MS-Windows lack of parsimony with real memory.]

Corporations will look to trimming costs of their PC (laptop and Desktop) fleets, and the PC vendors will respond to this demand.

Storage Appliances:
Already Enterprise and Cloud providers are moving to 2.5 inch form-factor to reduce power demand (Watts/GB) and floor-space footprint (GB/cubic-space).

Consumer and entry-level servers and storage appliances (NAS and iSCSI) are currently mostly 3.5 inch because that has always been the "capacity-optimised" sweet spot.

Besides power-use, the slam-dunk reasons for SOHO and SME users to move to 2.5 inch are:
  • lighter
  • smaller
    • smaller footprint and higher drive count per Rack Unit.
    • more aggregate bandwidth from higher actuator count
    • mirrored drives or better, are possible in a small portable and low-power case.
  • more robust, better able to cope with knocks and movement.
2.5 inch drives may be much better suited to "canister" or (sealed) "drive-pack" designs, such as used by Copan in their MAID systems. This is due to their lighter weight and lower power dissipation.
The 14-drive Copan 3.5 inch "Canister" of 4RU could be replaced by a 20-24 drive 2.5 inch Canister of 3RU, putting 3-4 times the number of drives in the same space.

6. What if there are some unforeseen "drop-deads", like low data-retention rates or hyper-sensitivity to heat, that limit useful capacities to the current 300-600GB/platter (2.5 inch)?

We can't know the future perfectly, so can't say just what surprises lie ahead.
If there is some technical reason why current drive densities are an engineering maximum, we cannot rely on technology advances to automatically reduce the Price/GB each year.

Even if technology is frozen, useful price reductions, albeit minor in comparison to "50% per year", will be achievable in the production process. It might take a decade for prices to drop 50% per GB.

I'm not sure exactly how designs might be made to scale if drive sizes/densities are pegged to current levels.
What is apparent and universal: "Free Goods" with apparently Infinite Supply will engender Infinite Demand.

If we do hit a "capacity wall", then the best Social Engineering response is to limit demand, which requires a "Cost" on capacity. This could be charging, as Google does with its Gmail service, or by other means, such as publicly ranking "capacity hogs".

Thursday, December 22, 2011

IDC on Hard Disk Drive market: Transformational Times

One of the problems, as an "industry outsider", of researching the field is lack of access to hard data/research. It's there, it's high-quality and timely. Just expensive and behind pay-walls.

A little information leaks via Press Releases and press articles promoting the research companies.

When one of these professional analyst firms makes a public statement alerting us to a radical restructuring of the industry, that's big news. [Though you'd expect "insiders" to have been aware of this for quite some time.]

What's not spelled out publicly is: how will this impact Enterprise Storage vendors/manufacturers?
There seems an implication that the two major HDD vendors will start to compete 'up' the value-chain with RAID and Enterprise Storage vendors, and across storage segments with Flash memory/SSD vendors.

IDC's Worldwide Hard Disk Drive 2011-2015 Forecast: Transformational Times
was published in May 2011. The report runs to 62 pages and pricing starts at US$4,500.

Headline: Transformation to just 3 major vendors. (really 2 major + 1 minor @ 10%)
 "The hard disk drive industry has navigated many technology and product transitions over the past 50 years, but not a transformation. [emphasis added]

 The HDD industry is poised to consolidate from five to three HDD vendors by 2012, and
 HDD unit shipment growth over the next five years will slow.

 HDD revenue will grow faster than unit shipments after 2012, in part because HDD vendors will offer higher-performance hybrid HDD solutions that will command a price premium.

 But for the remaining three HDD vendors to achieve faster revenue growth,
 it will be necessary by the middle of the decade for HDD vendors to transform into [bullets added]
  •  storage device and
  •  storage solution suppliers,
  •  with a much broader range of products for a wider variety of markets
  •  but at the same time a larger set of competitors."

Platters per Disk.

Headline:
  • 2.5 inch:
    • 9.5mm = 1 or 2 platters
    • 12.5mm = 2 or 3 platters
    • 15mm = ? platters. Guess at least 3. 4 unlikely, compare to 3.5 inch density
  • 3.5 inch
    • 25.4mm = commonly 4. Max. 5 platters.
Why is this useful, interesting or important?
To compare capacity across form-factors and for future configuration/design possibilities.

Disk form-factors are related by an approximate halving of platter area between sizes:
8::5.25 inch, 5.25::3.5 inch, 3.5::2.5 inch, 2.5::1.8 inch, 1.8::1.3 inch, 1.3::1 inch...
What we (as outsiders) know, but only approximately, is the recording area per platter for the platter sizes.  We know there are at least 3 regions of disk platter, but not their ratios/sizes, and these will vary per form-factor/platter-size:
  • motor/hub. The area of the inside 'torus' is small, not much is lost.
  • recorded area
  • outer 'ring' for landing and idling or "unloading" heads. Coated differently (plastic?) to not damage heads if they "skid" or come into contact with a surface (vs 'flying' on the aerodynamic air-cushion).
Research:

Chris Mellor, 12th September 2011, The Register, "Five Platters, 4TB":
Seagate has a 4TB GoFlex Desk external drive but this is a 5-platter disk with 800GB platters.

IDC, 2009, report sponsored by Hewlett-Packard:
By 2010, the HDD industry is expected to increase the maximum number of platters per 2.5 inch performance-optimized HDD from two to three, enabling them to accelerate delivering a doubling of capacity per drive, and subsequently achieving 50% capacity increases per drive over a shorter time frame.

The Register, 7th September 2011:
Oddly Hitachi GST is only shipping single-platter versions of these new drives, although it is saying they are the first ones in a new family, with their 569Gbit/in² areal density. The announced but not yet shipping terabyte-platter Barracuda had a 635Gbit/in² areal density.

Sebastian Anthony, 12th December 2011, ExtremeTech:
Hitachi, seemingly in defiance of the weather gods, has launched the world's largest 3.5-inch hard drive: the monstrous 4TB Deskstar 5K. With a rotational speed of 5,900RPM, a 6Gbps SATA 3 interface, and the same 32MB of cache as its 2 and 3TB siblings, the 4TB model is basically the same beast — just with four platters instead of two or three. The list price is around $345.

Silverton Consulting, 13th September 2011:
shipping over 1TB/disk platter using 3.5″ platters with 569Gb/sq in technology

Monday, December 19, 2011

"Missed by _that_ much": Disk Form Factor vs Rack Units

Apologies to 1965 TV series "Get Smart" and the catch-phrase "Missed by that much" (with a visual indication of a near-miss).

This is a lament, not a call to action or grumble. Standards are necessary and good.
We have two standards that we just have to live with now: too many devices depend on them for a change to be practical. Unlike the "imperial" to metric conversion, there would be few discernible benefits.

There's a fundamental mismatch between the Rack Unit (1.75 inches) - the vertical space allowed for equipment in 19 inch Racks (standard EIA-310) - and the Disk form-factors of 5.25, 3.5 and 2.5 inches defined by the Small Form Factor Committee.

There is no way to mount a standard disk drive (3.5 or 2.5 inch) exactly in a Rack. There are various amounts of wasted space.
Originally, "full-height" 5.25 inch drives could be mounted horizontally exactly in 2 Rack Units (3.5 inches), three abreast.

The "headline" size of the form-factor is the notional size of the platters or removable media.
The envelope allows for the enclosure.

So whilst "3.5 inch" looks like a perfect multiple of the 1.75 inch Rack Unit, a "3.5 inch" drive is around 0.5 inch larger (the form-factor is 4 inches wide).
Manufacturers of vertical-mount "hot-swap" drive carriers allow around 1mm on the thinnest dimension, 9mm on the "height" and 1.5 inches (42-43mm) on the longest dimension (depth).

A guess at the dimensions of hot-swap housings:
1/32 in (0.8mm) or 1mm sheet metal could be used between drives (upright)
and 1.5-2mm sheet metal would be needed to support the load (with an upturned edge?).

In total, around 0.5 inch (12.5mm) might need to be allowed vertically for supporting structures.
An ideal Rack Unit size for the "3.5 inch" drive form-factor would be 4.5 inches.

Or, "3.5 inch" drives could be 3.00 - 3.25 inches wide to fit exactly in 2 Rack Units.

Different manufacturers approach this problem differently:
  • Copan/SGI and Backblaze mount 3.5 inch drives vertically in 4 Rack Units (7 inches).
    Both of these solutions aim for high-density packing: 28 and 11.25 drives per Rack Unit respectively.
    • Copan, via US Patent # 7145770, uses 4U hot-swap "canisters" that store 14 drives in 2 rows, with 8 canisters per "shelf" (112 drives/shelf). In a 42 U rack, they can house 8 shelves, for 896 drives per Rack. Their RAID system is 3+1, with max 5 spares per shelf, yielding 79 data drives per shelf, and 632 drives per Rack.
      These systems are designed specifically to hold archival data, with up to 25% or 50% of drives active at any one time, as "MAID": Massive Array of Idle Disks.
    • Backblaze are not a storage vendor, but have made their design public with a hardware vendor able to supply cases and pre-built (but not populated) systems.
      Their solution, fixed-disks not hot-swap, is 3 rows of 15 disks mounted vertically, sitting on their connectors. The Backblaze systems include a CPU and network card and are targeted at providing affordable and reliable on-line Cloud Backup services [and are specifically "low performance"]. Individual "storage pods" do not supply "High Availability", there is little per-unit redundancy. Like Google, Backblaze rely on whole-system replication and software to achieve redundancy and resilience.
  • Most server and storage appliance vendors use "shelves" of 3 Rack Units (5.25 inches), but fit 13-16 drives across the rack (~17.75 inches or 450mm) depending on their hot-swap carriers.
  • "2.5 inch" drives fitted vertically (2.75 inch) need 2 Rack Units (3.5 inches). Most vendors fit 24 drives across a shelf. "Enterprise class" 2.5 inch drives are typically 12.5 or 15 mm thick.
Another possibility, not widely pursued, is to build disk housings or shelves that don't exactly fit the EIA-310 standard Rack Units. Unfortunately, the available internal width of 450mm cannot be varied.



The form factors:
"5.25 inch": (5.75 in x 8 in x 1.63 in =  146.1 mm x 203 mm x 41.4 mm)
"3.5 inch"  : (4 in x 5.75 in x 1 in =  101.6 mm x 146.05 mm x 25.4 mm)
"2.5" inch  : (2.75 in x 3.945 in x 0.25-0.75 in = 69.85 mm x 100.2 mm x [7, 9.5, 12.5, 15, 19] mm)
Old disk height form factors, originating in 5.25 inch disks circa mid-1980's.
low-profile = 1 inch.
Half-height = 1.63 inch.
Full-height = 3.25 inch. [Fitting well into 2 Rack Units]

Wednesday, December 14, 2011

"Disk is the new Tape" - Not Quite Right. Disks are CD's

Jim Gray, in recognising that Flash Memory was redefining the world of Storage, famously developed between 2002 and 2006 the view that:
Tape is Dead
Disk is Tape
Flash is Disk
RAM Locality is King
My view is that: Disk is the new CD.

Jim Gray was obviously intending that Disk had replaced Tape as the new backup storage media, with Flash Memory being used for "high performance" tasks. In this he was completely correct. Seeing this clearly and enunciating it a decade ago was remarkably insightful.

Disks do both the Sequential Access of Tapes and Random I/O.
In the New World Order of Storage, they can be considered functionally identical to Read-Write Optical disks or WORM (Write Once, Read Many).

As the ratios between access time (seek or latency) and sequential transfer rate, or throughput, continue to change in favour of capacity and throughput, managing disks becomes more about running them in "Seek and Stream" mode than doing Random I/O.

With current 1TB disks, the sequential scan time (capacity ÷ sustained transfer rate) [1,000GB / 1Gbps] is 2-3 hours. However, reading a disk with 4KB random I/Os at ~250/sec (4msec avg. seek), the type of workload a filesystem causes, gives an effective throughput of around 1MB/sec, or roughly 125 times slower than a sequential read.
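
A back-of-the-envelope check of those figures, as a sketch; the 1Gbps sustained rate and ~250 IO/sec (4msec average access) are the assumptions above:

```python
capacity_bytes = 1000e9            # 1TB (1,000GB)
seq_rate = 1e9 / 8                 # 1Gbps sustained ~= 125MB/sec
io_size, iops = 4096, 250          # 4KB random reads at ~4msec per access

scan_hours = capacity_bytes / seq_rate / 3600
random_rate = io_size * iops                          # effective bytes/sec
random_days = capacity_bytes / random_rate / 86400

print(f"sequential scan:   {scan_hours:.1f} hours")          # ~2.2 hours
print(f"random throughput: {random_rate / 1e6:.1f} MB/sec")  # ~1 MB/sec
print(f"slowdown:          {seq_rate / random_rate:.0f}x")   # ~120x
print(f"random 'scan':     {random_days:.0f} days")          # ~11 days
```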

It behoves system designers to treat disks as faster RW Optical Disk, not as primary Random IO media, and as Jim Gray observed, "Flash is the New Disk".

The 35TB drives (of 2020) and using them.

What's the maximum capacity possible in a disk drive?

Kryder, 2009, projects 7TB/platter for 2.5 inch platters will be commercially available in 2020.
[10Tbit/in² demo by 2015 and $3/TB for drives]

Given that prices of drive components are driven by production volumes, in the next decade we're likely to see the end of 3.5 inch platters in commercial disks with 2.5 inch platters taking over.
The fifth-power relationship between platter size and drag/power consumed also suggests "Less is More". A 3.5 inch platter needs 5+ times more power to spin than a 2.5 inch platter - the reason that 10K and 15K RPM drives use small platters. Manufacturers already use the same media/platters for 3.5 inch and 2.5 inch drives.

Sankar, Gurumurthi, and Stan in "Intra-Disk Parallelism: An Idea Whose Time Has Come" ISCA, 2008, discuss both the fifth-power relationship and that multiple actuators (2 or 4) make a significant difference in seek times.

How many platters are fitted in the 25.4 mm (1 inch) thickness of a 3.5 inch drive's form-factor?

This report on the Hitachi 4TB drive (Dec, 2011) says they use 4 * 1TB platters in a 3.5 inch drive, with 5 possible.

It seems we're on track to at least meet the Kryder 2020 projection, with 6TB per 3.5 inch platter already demonstrated using 10nm grains enhanced with Sodium Chloride.

How might those maximum capacity drives be lashed together?

If you want big chunks of data, then even in a world of 2.5 inch componentry, it still makes sense to use the thickest form-factor around to squeeze in more platters. All the other power-saving tricks of variable-RPM and idling drives are still available.
The 101.6mm [4 inch] width of the 3.5 inch form-factor allows 4 to sit comfortably side-by-side in the usual 17.75 inch wide "19 inch rack", using just more than half the 1.75 inch height available.

It makes more sense to make a half-rack-width storage blade, with 4 * 3.5 inch disks (2 across, 2 deep) with a small/low-power CPU, a reasonable amount of RAM and "SCM" (Flash Memory or similar) as working-memory and cache and dual high-speed ethernet, infiniband or similar ports (10Gbps) as redundant uplinks.
SATA controllers with 4 drives per motherboard are already common.
Such "storage bricks", to borrow Jim Grays' term, would store a protected 3 * 35Tb, or 100TB per unit, or 200Tb per Rack Unit (RU). A standard 42RU rack, allowing for a controller (3RU), switch (2RU), patch-panel (1RU) and common power-supplies (4RU), would have a capacity of 6.5PB.

Kryder projected a unit cost of $40 per drive, with the article suggesting 2 platters/drive.
Scaled up, ~$125 per 35TB drive, or ~$1,000 for 100TB protected ($10/TB) [$65-100,000 per rack]
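
A sketch of that arithmetic; the blade layout is as described above, and the $500 of non-drive cost per brick is my guess at what brings a 4-drive unit to the ~$1,000 mentioned:

```python
drive_tb, drives_per_blade, parity_drives = 35, 4, 1
blades_per_ru = 2                          # two half-rack-width bricks per Rack Unit
usable_ru = 42 - 3 - 2 - 1 - 4             # less controller, switch, patch-panel, PSUs

protected_tb = (drives_per_blade - parity_drives) * drive_tb       # ~100TB per brick
tb_per_ru = protected_tb * blades_per_ru                           # ~200TB per RU
rack_pb = tb_per_ru * usable_ru / 1000                             # ~6.5PB per rack

drive_cost, other_cost = 125, 500          # $/drive (Kryder, scaled) + guess for the rest
brick_cost = drives_per_blade * drive_cost + other_cost
rack_cost = brick_cost * blades_per_ru * usable_ru

print(protected_tb, tb_per_ru, f"{rack_pb:.1f}PB")             # 105 210 6.7PB
print(f"${brick_cost:,} per brick, ${rack_cost:,} per rack")   # $1,000 and $64,000
```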

The "scan time" or time-to-populate a disk is the rate-limiting factor for many tasks, especially RAID parity rebuilds.
For a single-actuator drive using 7TB platters and streaming at 1GB/sec, "scan time" is a daunting 2 hours per platter: at best 10 hours to just read a 35TB drive.

Putting 4 actuators in the drive, cuts scan time to 2-2.5 hours, with some small optimisations.

While not exceptional, it compares favourably with the 3-5 hours minimum currently reported with 1TB drives.

But a single-parity drive won't work for such large RAID volumes!

Leventhal, 2009, in "Triple-Parity RAID and Beyond", suggested that the UER (Unrecoverable Error Rate) of large drives would force parity-group RAID implementations to use a minimum of 3 parity drives to achieve a 99.2% probability of a successful (Nil Data Loss) RAID rebuild following a single-drive failure. Obviously, triple parity is not possible with only 4 drives.

The extra parity drives are NOT to cover additional drive failures (this scenario is not calculated), but to cover read errors, with the assumption that a single error invalidates all data on a drive.

Leventhal uses in his equations:
  •  512 byte sectors,
  • 1 in 10^16 probability of UER,
  • hence one unreadable sector per 200 billion (10TB) read, or
  • 10 sectors per 2 trillion (100TB) read.
Already, drives are using 4KB sectors (with mapping to the 'standard' 512-byte sectors) to achieve higher UERs. The calculation should be done with the native disk sector size.

If platter storage densities are increased 32-fold, it makes sense to similarly scale up the native sector size to decrease the UER. There is a strong case for 64-128KB sectors on 7TB platters.

Recasting Leventhal's equations with:
  • 100TB to be read,
  • 64KB native sectors,
  • i.e. 1.5625 * 10^9 native sectors to be read, against a UER of 1 in 10^16.
What UER would enable a better than 99.2% probability of reading 1.5 billion native sectors?
First approximation is 1 in 10^18 [confirm].
Zeta claims a UER better than 1 in 10^58. It is possible to do much better.
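
As a rough check of that approximation (assuming unrecoverable read errors are independent and the UER is quoted per bit read), a sketch:

```python
import math

bytes_to_read = 100e12                            # 100TB, as above
bits_to_read = bytes_to_read * 8
native_sectors = bytes_to_read / (64 * 1024)      # ~1.56e9 64KB sectors

for exponent in (16, 17, 18):
    expected_errors = bits_to_read / 10 ** exponent
    p_clean = math.exp(-expected_errors)          # Poisson approximation to (1 - p)^n
    print(f"UER 1 in 10^{exponent}: P(all {native_sectors:.3g} sectors readable) = {p_clean:.4f}")
# 10^16 -> ~0.92, 10^17 -> ~0.992, 10^18 -> ~0.9992
```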

Inserting Gibson's "horizontal" error detection/correction (extra redundancy on the one disk) adds around the same overhead, or less. [do exact calculation]


Rotating parity or single-disk parity RAID?

The reasons to rotate parity around the disks are simple - avoid "hot-spots", otherwise the full parallel IO bandwidth possible over all disks is reduced to just that of the parity disk. NetApp neatly solve this problem with their WAFL (Write Anywhere File Layout).

In order to force disks into mainly sequential access ("seek then stream"), writes shouldn't simply be cached and written through to HDD, but kept in SCM/Flash until writes have quiesced.

The single parity-disk problem only occurs on writes. Reading, in normal or degraded mode, occurs at equal speed.

If writes across all disks are stored then written in large blocks, there is no IO performance difference between single-parity disk and rotating parity.

Tuesday, December 13, 2011

Revolutions End: Computing in 2020

We haven't reached the end of the Silicon Revolution yet, but "we can see it from here".

Why should anyone care? Discussed at the end.

There are two expert commentaries that point the way:
  • David Patterson's 2004 HPEC Keynote, "Latency vs Bandwidth", and
  • Mark Kryder's 2009 paper in IEEE Magnetics, "After Hard Drives—What Comes Next?"
    [no link]
Kryder projected the current expected limits of magnetic recording technology in 2020 (2.5": 7TB/platter) and how another dozen technologies will compare, but there's no guarantee. Some unanticipated problem might, as happened with CPUs, derail Kryder's Law (disk capacity doubling every year) before then.
We will get an early "heads-up": by 2015 Kryder expects 7TB/platter to be demonstrated.

This "failure to fulfil the roadmap" has happened before: In 2005 Herb Sutter pointed out that 2003 marked the end of Moore's Law for single-core CPU's in "The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software". Whilst Silicon fabrication kept improving, CPU's hit a "Heat Wall" limiting the clock-frequency, spawning a new generation of "multi-core" CPUs.

IBM with its 5.2GHz Z-series processors and gamers "over-clocking" standard x86 CPUs showed part of the problem was a "Cooling Wall". This is still to play out fully with servers and blades.
Back to water-cooling, anyone?
We can't "do a Cray" anymore and dunk the whole machine in a vat of Freon (a CFC refrigerant, now banned).

Patterson examines the evolution of four computing technologies over the 25 years from ~1980 and the increasing disparity between "latency" (like disk access time) and "bandwidth" (throughput):
  • Disks
  • Memory (RAM)
  • LANs (local Networking)
  • CPUs
He neglects "backplanes" (PCI etc), graphics sub-systems/video interfaces and non-LAN peripheral interconnection.

He argues there are 3 ways to cope with "Latency lagging Bandwidth":
  • Caching (substitute different types of capacity)
  • Replication (leverage capacity)
  • Prediction (leverage bandwidth)
Whilst Patterson doesn't attempt to forecast the limits of technologies as Kryder does, he provides an extremely important and useful insight:
If everything improves at the same rate, then nothing really changes.
When rates vary, real innovation is required.
In this new milieu, Software and System designers will have to step up to build systems that are effective and efficient, and any speed improvements will only come from better software.

There is an effect that will dominate bandwidth improvement, especially in networking and interconnections (backplanes, video, CPU/GPU and peripheral interconnects):
the bandwidth-distance product
This affects both copper and fibre-optic links. Using a single technology, a 10-times speed-up shortens the effective distance 10-times - well known from transmission line theory.

For LANs to go from 10Mbps to 100Mbps to 1Gbps, higher-spec cable (Cat 4, Cat 5, Cat 5e/6) had to be used. Although 40Gbps and 100Gbps Ethernet have been agreed and ratified, I expect these speeds will only ever be Fibre Optic. Copper versions will either be very limited in length (1-3m) or use very bulky, heavy and expensive cables: worse in every dimension than fibre.

See the "International Technology Roadmap for Semiconductors" for the expert forecasts of the underlying Silicon Fabrication technologies, currently out to 2024. There is a lot of detail in there.

The one solid prediction I have is Kryder's 7TB/platter.
A 32-times increase in bit areal density, or 5 doublings of capacity.
This implies the transfer rate of disks will increase 5-6 times (with the square root of the areal density increase), given there's no point in increasing rotational speed, to roughly 8Gbps. Faster than "SATA 3.0" (6Gbps) but within the current cable limits. Maintaining the current "headroom" would require a 24Gbps spec - needing a new generation of cable. The SATA Express standard/proposal of 16Gbps might work.
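
The 5-6 times figure comes from the transfer rate tracking linear (down-the-track) bit density, i.e. the square root of the areal density increase; a sketch, where the ~1.4Gbps current sustained media rate is my assumption:

```python
import math

areal_density_increase = 32                      # 5 doublings of capacity per platter
linear_density_increase = math.sqrt(areal_density_increase)    # ~5.66x along the track

current_sustained_gbps = 1.4                     # assumed for a current (2011) drive
future_sustained_gbps = current_sustained_gbps * linear_density_increase

print(f"{linear_density_increase:.1f}x faster -> ~{future_sustained_gbps:.0f}Gbps sustained")
```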

There are three ways disk connectors could evolve:
  • SATA/SAS (copper) at 10-20Gbps
  • Fibre Optic
  • Thunderbolt (already 2 * 10Gbps)
Which type to dominate will be determined by the Industry, particularly the major Vendors.

The disk "scan time" (to fully populate a drive) at 1GB/sec, will be about 2hours/platter. Or 6 hours for a 20Tb laptop drive, or 9 hours for a 30Tb server class drive. [16 hours if 50TB drives are packaged in 3.5" (25.4mm thick) enclosures].  Versus the ~65 minutes for a 500Gb drive now.

There is one unequivocal outcome:
Populating a drive using random I/O, as we now do via filesystems, is not an option. Random I/O is 10-100 times slower than streaming/sequential I/O. It's not good enough to take a month or two to restore a single drive, when 1-24 hours are the real business requirements.

Also, for laptops and workstations with large drives (SSD or HDD), they will require 10Gbps networking as a minimum. This may be Ethernet or the much smaller and available Thunderbolt.

A caveat: This piece isn't "Evolution's End", but "(Silicon) Revolution's End". Hardware Engineers are really smart folk; they will keep innovating and providing Bigger, Faster, Better hardware. Just don't expect the rates of increase to be nearly as fast. Moore's Law didn't get repealed in 2003, the rate-of-doubling changed...


Why should anyone care? is really: Who should care?

If you're a consumer of technology or a mid-tier integrator, very little of this will matter. In the same way that now when buying a motor vehicle, you don't care about the particular technologies under the hood, just what it can do versus your needs and budget.

People designing software and systems, the businesses selling those technology/services and Vendors supplying parts/components hardware or software that others build upon, will be intimately concerned with the changes wrought by Revolutions End.

One example is provided above:
 backing up and restoring disks can no longer be a usual filesystem copy. New techniques are required.

Wednesday, December 07, 2011

RAID: Something funny happened on the way to the Future...

With apologies to Stephen Sondheim et al, "A Funny Thing Happened on the Way to the Forum", the book, 1962 musical and 1966 film.

Contents:

Summary:
Robin Harris of "StorageMojo" in "Google File System Eval", June 13th, 2006, neatly summarises my thoughts/feelings:
As regular readers know, I believe that the current model of enterprise storage is badly broken.
Not discussed in this document is The Elephant in the Room, or the new Disruptive Technology: Enterprise Flash Memory or SSD (Solid State Disk). It offers (near) "zero latency" access and random I/O performance 20-50 times cheaper than "Tier 1" Enterprise Storage arrays.

Excellent presentations by Jim Gray about the fundamental changes in Storage are available on-line:
  • 2006 "Flash is good": "Flash is Disk, Disk is Tape, Tape is dead".
  • 2002 "Storage Bricks". Don't ship tapes or even disks. Courier whole fileservers, it's cheaper, faster and more reliable.

Monday, November 28, 2011

Optical Disks as Dense near-line storage?

A follow-up to an earlier post [search for 'Optical']:
Could Optical Disks be a viable near-line Datastore?
Use a robot arm to pick and load individual disks from 'tubes' into multiple drives.
Something the size of a single filing cabinet drawer would be both cheap and easily contain a few thousand disks. That's gotta be interesting!

Short answer, no...

A 3.5" drive has a form-factor of:  4 in x 5.75 in x 1in. Cubic capacity: 23 in³
A 'tube' of 100 optical disks: 5.5in x 5.5in x 6.5in Cubic capacity: 160-200 in³ [footprint or packed]

A 'tube' of 100, minus all the supporting infrastructure to select a disk and read it, is 7-9 times the volume of a 3.5in hard disk drive, so each Optical Disk must contain at least 7-9% of a HDD's capacity to be competitive.

To replace a 1TB HDD, optical disks must hold at least 7% of 1,000GB, or 70GB. Larger than even Blu-ray, and 15-20 times larger than single-layer DVDs (4.7GB).
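
A sketch of that volume break-even, using the form-factor and 'tube' dimensions above:

```python
hdd_volume = 4 * 5.75 * 1                # 3.5" form-factor, cubic inches (~23 in^3)
hdd_capacity_gb = 1000                   # a 1TB drive
tube_volume = 5.5 * 5.5 * 6.5            # ~197 in^3 packed, for a 100-disc 'tube'
discs_per_tube = 100

volume_ratio = tube_volume / hdd_volume                        # ~8.5x
required_gb = volume_ratio * hdd_capacity_gb / discs_per_tube

print(f"tube occupies {volume_ratio:.1f}x the HDD volume")
print(f"each disc must hold >= {required_gb:.0f}GB to match")  # ~85GB vs 4.7GB for a DVD
```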

The current per-GB price of 3.5" HDDs is around $0.05-$0.10/GB, squeezing 4.7GB DVDs on price as well.

2TB drives are common, 3TB are becoming available now (2011). And it gets worse.

There are estimates of a maximum possible 3.5" HDD size of 20-40TB.
To be competitive, Optical disks would need to get up around 1TB in size and cost under $1.

Around 2005, when 20-40GB drives reigned, there was a time when 4.7GB DVDs were both the densest and cheapest storage available. Kryder's Law, a doubling of HDD capacity every 1-2 years, has seen the end of that.

Sunday, November 27, 2011

Journaled changes: One solution to RAID-1 on Flash Memory

As I've posited before, simple-minded mirroring (RAID-1) of Flash Memory devices is not only a poor implementation, but worst-case.

My reasoning is: Flash wears-out and putting identical loads on identical devices will result in maximum wear-rate of all bits, which is bad but not catastrophic. It also creates a potential for simultaneous failures where a common weakness fails in two devices at the one time.

The solution is to not put the same write load on the two  devices, but still have two exact copies.
This problem would be of especial concern for PCI-SSD devices internal to a system. The devices can't normally be hot-plugged; though there are hot-plug standards for PCI devices (e.g. Thunderbolt and ExpressCard), they are not usually options for servers and may have limited performance.

One solution - I'm not sure if it's optimal, but it is 'sufficient' - is to write blocks as normal to the primary device and maintain the secondary device as a snapshot plus (compressed) journal entries. When the journal space hits a high-water mark, the journal is rolled into the snapshot so it again becomes an exact copy. The roll-up could also be triggered when a timer expires (hourly, 6-hourly, daily, ...) or when the momentum of changes would fill the journal to 95% before the snapshot could be updated.

If the journal fills, the mirror is invalidated and either changes must be halted or the devices go into unprotected operation. Neither is a desirable operational outcome. A temporary, though unprotected, work-around is to write the on-going journal either to the primary device or into memory.
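
A minimal sketch of that scheme, purely to illustrate the flow of writes; the devices are modelled as dicts, and compression, timers and the 'momentum' trigger are left out:

```python
class JournaledMirror:
    """Primary gets every write; secondary holds a snapshot plus a journal
    of changes, rolled into the snapshot at a high-water mark."""

    def __init__(self, journal_capacity=1000, high_water=0.8):
        self.primary = {}          # block_number -> data, written immediately
        self.snapshot = {}         # secondary device: point-in-time copy
        self.journal = []          # (block_number, data) changes since the snapshot
        self.capacity = journal_capacity
        self.high_water = high_water
        self.degraded = False      # journal overflowed: mirror invalid

    def write(self, block, data):
        self.primary[block] = data
        if self.degraded:
            return                               # unprotected operation
        if len(self.journal) >= self.capacity:
            self.degraded = True                 # mirror invalidated
            return
        self.journal.append((block, data))       # cheap, mostly-sequential write
        if len(self.journal) >= self.capacity * self.high_water:
            self.roll_up()

    def roll_up(self):
        """Apply the journal so the snapshot is again an exact copy."""
        for block, data in self.journal:
            self.snapshot[block] = data
        self.journal.clear()

m = JournaledMirror()
for i in range(900):
    m.write(i % 50, f"rev-{i}")
# Snapshot plus replayed journal always equals the primary: two exact copies, logically.
assert m.primary == {**m.snapshot, **dict(m.journal)}
```

The secondary sees far fewer, and mostly sequential, writes than the primary, yet snapshot-plus-journal can always be replayed into an exact copy.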

Friday, November 25, 2011

Flash Memory: will filesystems become the CPU bottleneck?

Flash memory with 50+k IO/sec may be too fast for Operating Systems (like Linux) with file-system operations consuming more CPU, even saturating it. They are on the way to becoming the system rate-limiting factor, otherwise known as a bottleneck.

What you can get away with at 20-100 IO/sec, i.e. consuming 1-2% of CPU, will be a CPU hog at 50k-500k IO/sec, a speed-up of 500-25,000 times.

The effect is the reverse of the way Amdahl speed-up is explained.

Amdahl throughput scaling is usually explained like this:
If your workload has 2 parts (A is single-threaded, B can be parallelised), when you decrease the time taken for 'B' by adding parallel compute-units, the workload becomes dominated by the single-threaded part, 'A'. If you halve the time it takes to run 'B', it doesn't halve the total run time. If 'A' and 'B' parts take equal time (4 units each, total 8), then a 2-times speed-up of 'B' (4 units to 2) results in a 25% reduction in run-time (8 units to 6). Speeding 'B' up 4-times is a 37% reduction (8 to 5).
This creates a limit to the speed-up possible: If 'B' reduces to 0 units, it still takes the same time to run all the single-threaded parts, 'A'. (4 units here)
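
A sketch of those numbers, using the same 4 + 4 unit workload:

```python
def run_time(serial, parallel, speedup):
    """Total run time when only the parallelisable part is sped up (Amdahl)."""
    return serial + parallel / speedup

base = run_time(4, 4, 1)                        # 8 units
for s in (2, 4, 1e6):                           # 2x, 4x and an 'infinite' speed-up
    t = run_time(4, 4, s)
    print(f"{s:>9.0f}x: {t:.0f} units, {100 * (base - t) / base:.1f}% reduction")
# 2x -> 6 units (25%), 4x -> 5 units (37.5%), limit -> 4 units (50%)
```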

A corollary of this: the rate of improvement for each doubling of cost nears zero if the investment is not well chosen.

The filesystem bottleneck is the reverse of this:
If your workload has an in-memory part (X) and wait-for-I/O part (W) both of which consume CPU, if you reduce the I/O wait to zero without reducing the CPU overhead of 'W', then the proportion of useful work done in 'X' decreases. In the limit, the system throughput is constrained by CPU expended on I/O overhead in 'W'.

The faster random I/O of Flash Memory will reduce application execution time, but at the expense of increasing % system CPU time. For a single process, the proportion and total CPU-effort of I/O overhead remain the same. For the whole system, more useful work is being done (it's noticeably "faster"), but because the CPU didn't get faster too, it needs to spend a lot more time on the FileSystem.
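
A toy model of that reversal, as a sketch: each I/O costs a fixed slice of CPU in the filesystem regardless of how fast the device is, so as the device wait shrinks the CPU is busy a far larger fraction of the time, and filesystem overhead becomes the cap on whole-system throughput. The 20us/80us per-I/O CPU costs are invented for illustration.

```python
def per_io_profile(device_wait_us, fs_cpu_us=20, app_cpu_us=80):
    """Single-process view: application CPU + filesystem CPU + waiting on the device."""
    total = app_cpu_us + fs_cpu_us + device_wait_us
    io_per_sec = 1e6 / total
    cpu_busy_fraction = (app_cpu_us + fs_cpu_us) / total
    return io_per_sec, cpu_busy_fraction

for wait_us in (10_000, 100, 0):        # slow disk, Flash, 'zero latency'
    iops, busy = per_io_profile(wait_us)
    print(f"wait {wait_us:>6}us: {iops:7.0f} IO/sec, CPU busy {busy:.0%} "
          f"(the filesystem stays a fixed 20% of the CPU work)")
# As the wait disappears, the same 20us-per-I/O of filesystem CPU limits throughput.
```
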
Jim Gray observed that:
  • CPU's are now mainly idle, i.e. waiting on RAM or I/O.
    Level-1 cache is roughly the same speed as the CPU, everything else is much slower and must be waited for.
  • The time taken to scan a 20TB disk using random I/O will be measured in days whilst a sequential scan ("streaming") will take hours.
Reading a "Linux Storage and Filesystem Workshop" (LSF) confrence report, I was struck by comments that:
  • Linux file systems can consume large amounts of CPU doing their work, not just fsck, but handling directories, file metadata, free block chains, inode block chains, block and file checksums, ...
There's a very simple demonstration of this: optical disk (CD-ROM or DVD) performance.
  • A block-by-block copy (dd) of a CD-ROM at "32x", or approx 3MB/sec, will copy a full 650MB in 3-4 minutes. Wikipedia states a 4.7GB DVD takes 7 minutes (5.5MB/sec) at "4x".
  • Mounting a CD or DVD then doing a file-by-file copy takes 5-10 times as long.
  • Installing or upgrading system software from the same CD/DVD is usually measured in hours.
The fastest way to upgrade software from CD/DVD is to copy an image with dd to hard-disk, then mount that image. The difference is the random I/O (seek) performance of the underlying media, not the FileSystem. [Haven't tested times or speedup with a fast Flash drive.]

This performance limit may have been something that the original Plan 9 writers knew and understood:
  • P9 didn't 'format' media for a filesystem: initialised a little and just started writing blocks.
  • didn't have fsck on client machines, only the fileserver.
  • the fileserver wrote to three levels of storage: RAM, disk, Optical disk.
    RAM and disk were treated as cache, not permanent storage.
    Files were pushed to Optical disk daily, creating a daily snapshot of the filesystem at the time. Like Apple's TimeMachine, files that hadn't changed were 'hard-linked' to the new directory tree.
  • The fileserver had operator activities like backup and restore. The design had no super-user with absolute access rights, so avoided many of the usual admin-related security issues.
  • Invented 'overlay mounts', managed at user not kernel level, to combine the disparate file-services available and allow users to define their own semantics.
Filesystems have never, until now, focussed on CPU performance; rather the opposite, they've traded CPU and RAM to reduce I/O latency, historically improving system throughput, sometimes by orders of magnitude.
Early examples were O/S buffers/caching (e.g. Unix) and the 'elevator algorithm' to optimally reorder writes to match disk characteristics.

 This 'burn the CPU' trade-off shows up with fsck as well. An older LSF piece suggested that fsck runs slowly because it doesn't do a single pass of the disk, effectively forced into near worst-case unoptimised random I/O.

On my little Mac Mini with a 300GB disk, there's 225GB used. Almost all of which, especially the system files, is unchanging. Most of the writing to disk is "append mode" - music, email, downloads - either blocks-to-a-file or file-to-directory. With transactional Databases, it's a different story.

The filesystem treats the whole disk as if every byte could be changed in the next second - and I pay a penalty for that in complexity and CPU cycles. Seeing my little Mac or an older Linux desktop do a filesystem check after a power fail is disheartening...

I suggest future O/S's will have to contend with:
  • Flash or SCM with close to RAM performance 'near' the CPU(s) (on the PCI bus, no SCSI controller)
  • near-infinite disk ("disk is tape", Jim Gray) that you'll only want to access as "seek and stream". It will also take "near infinite" time to scan with random I/O. [another Jim Gray observation]
And what are the new rules for filesystems in this environment?:
  • two sorts of filesystems that need to interwork:
    • read/write that needs fsck to properly recover after a failure and
    • append-only that doesn't need checking once "imaged", like ISO 9660 on optical disks.
  • 'Flash' file-system organised to minimise CPU and RAM use. High performance/low CPU use will become as important as managing "wear" for very fast PCI Flash drives.
  • 'hard disk' filesystem with on-the-fly append/change of media and 'clone disk' rather than 'repair f/sys'.
  • O/S must seamlessly/transparently:
    1. present a single file-tree view of the two f/sys
    2. like Virtual Memory, safely and silently migrate data/files from fast to slow storage.
I saw a quote from Ric Wheeler (EMC) from LSF-07 [my formatting]:
 the basic contract that storage systems make with the user
 is to guarantee that:
  •  the complete set of data will be stored,
  •  bytes are correct and
  •  in order, and
  •  raw capacity is utilized as completely as possible.
I disagree nowadays with his maximal space-utilisation clause for disk. When 2TB costs $150 (7.5c/GB) you can afford to waste a little here and there to optimise other factors.
With Flash Memory at $2-$5/GB, you don't want to go wasting much of that space.

Jim Gray (again!) early on formulated "the 5-minute rule" which needs rethinking, especially with cheap Flash Memory redefining the underlying Engineering factors/ratios. These sorts of explicit engineering trade-off calculations have to be done for the current disruptive changes in technology.
  • Gray, J., Putzolu, G.R. 1987. The 5-minute rule for trading memory for disk accesses and the 10-byte rule for trading memory for CPU time. SIGMOD Record 16(3): 395-398.
  • Gray, J., Graefe, G. 1997. The five-minute rule ten years later, and other computer storage rules of thumb. SIGMOD Record 26(4): 63-68.
I think Wheeler's Storage Contract also needs to say something about 'preserving the data written', i.e. the durability and dependability of the storage system.
For how long? With what latency? How to express that? I don't know...

There is also a matter of "storage precision", already catered for with CDs and CD-ROM. Wikipedia states:
The difference between sector size and data content are the header information and the error-correcting codes, that are big for data (high precision required), small for VCD (standard for video) and none for audio. Note that all of these, including audio, still benefit from a lower layer of error correction at a sub-sector level.
Again, I don't know how to express this, implement it nor a good user-interface. What is very clear to me is:
  • Not all data needs to come back bit-perfect, though it is always nice when it does.
  • Some data we would rather not have, in whole or part, than come back corrupted.
  • There are many data-dependent ways to achieve Good Enough replay when that's acceptable.
First, the aspects of Durability and Precision need to be defined and refined, then a common File-system interface created and finally, like Virtual Memory, automated and executed without thought or human interaction.

This piece describes FileSystems, not Tabular Databases nor other types of Datastore.
The same disruptive technology problems need to be addressed within these realms.
Of course, it'd be nicer/easier if other Datastores were able to efficiently map to a common interface or representation shared with FileSystems and all the work/decisions happened in Just One Place.

Will that happen in my lifetime? Hmmmm....

Sunday, November 20, 2011

Building a RAID disk array circa 1988

In "A Case for Redundant Arrays of Inexpensive Disks (RAID)" [1988], Patterson et al of University of California Berkeley started a revolution in Disk Storage still going today. Within 3 years, IBM had released the last of their monolithic disk drives, the 3390 Model K, with the line being discontinued and replaced with IBM's own Disk Array.

The 1988 paper has a number of tables where it compares Cost/Capacity, Cost/Performance and Reliability/Performance of IBM Large Drives, large SCSI drives and 3½in SCSI drives.

The prices ($$/MB) cited for the IBM 3380 drives are hard to reconcile with published prices: press releases in Computerworld and the IBM Archives for 3380 disks (7.5GB, 14" platter, 6.5kW) and their controllers suggest $63+/MB for a 'SLED' (Single Large Expensive Disk) rather than the "$18-10" cited in the Patterson paper.

The prices for the 600MB Fujitsu M2316A ("super eagle") [$20-$17] and 100MB Conner Peripherals CP-3100 [$10-$7] are in line with historical prices found on the web.

The last table in the 1988 paper lists projected prices for different proposed RAID configurations:
  • $11-$8 for 100 * CP-3100 [10,000MB] and
  • $11-$8 for 10 * CP-3100 [1,000MB]
There are no design details given.

In 1994, Chen et al in "RAID: High-Performance, Reliable Secondary Storage" used two widely sold commercial systems as case studies:
The (low-end) NCR device was more what we'd now call a 'hardware RAID controller', ranging from 5 to 25 disks. Pricing was $22,000-$102,000. It provided a SCSI interface and didn't buffer. A system diagram was included in the paper.

StorageTek's Iceberg was a high-end device meant for connection to IBM mainframes. Advertised as starting at 100GB (32 drives) for $1.3M, up to 400GB for $3.6M, it provided multiple (4-16) IBM ESCON 'channels'.

For the NCR, from InfoWorld 1 Oct 1990, p 19 in Google Books
  • min config: 5 * 3½in drives, 420MB each.
  • $22,000 for 1.05GB of storage
  • Adding 20 * 420MB drives takes it to 8.4GB, list $102,000 (March 1991).
  • $4,000/drive + $2,000 controller.
  • NCR-designed controller chip + SCSI chip
  • 4 RAID implementations: RAID 0,1,3,5.
The StorageTek Iceberg was released in late 1992 with projected shipments of 1,000 units in 1993. It was aimed at replacing IBM 'DASD' (Direct Access Storage Device): exactly the comparison made in the 1988 RAID paper.
The IBM-compatible DASD, which resulted from an investment of $145 million and is technically styled the 9200 disk array subsystem, is priced at $1.3 million for a minimum configuration with 64MB of cache and 100GB of storage capacity provided by 32 Hewlett-Packard 5.25-inch drives.

A maximum configuration, with 512MB of cache and 400GB of storage capacity from 128 disks, will run more than $3.6 million. Those capacity figures include data compression and compaction, which can as much as triple the storage level beyond the actual physical capacity of the subsystem.
Elsewhere in the article more 'flexible pricing' (20-25% discount) is suggested:
with most of the units falling into the 100- to 200GB capacity range, averaging slightly in excess of $1 million apiece.
Whilst no technical reference is easily accessible on-line, more technical details are mentioned in the press release on the 1994 upgrade, the 9220. Chen et al [1994] claim "100,000 lines of code" were written.

More clues come from a feature, "Make Room for DASD" by Kathleen Melymuka (p62) of CIO magazine, 1st June 1992 [accessed via Google Books, no direct link]:
  • 5¼in Hewlett-Packard drives were used. [model number & size not stated]
  • The "100Gb" may include compaction and compression. [300% claimed later]
  • (32 drives) "arranged in dual redundancy array of 16 disks each (15+1 spare)
  • RAID-6 ?
  • "from the cache, 14 pathways transfer data to and from the disk arrays, and each path can sustain a 5Mbps transfer rate"
The Chen et al paper (p175 of ACM Computing Surveys, Vol 26, No 2) gives this information on the Iceberg/9200:
  • it "implements an extended RAID level-5 and level-6 disk array"
    • 16 disks per 'array', 13 usable, 2 Parity (P+Q), 1 hot spare
    •  "data, parity and Reed-Solomon coding are striped across the 15 active drives of an array"
  • Maximum of 2 Penguin 'controllers' per unit.
  • Each controller is an 8-way processor, handling up to 4 'arrays' each, or 150GB (raw).
    • Implying 2.3-2.5GB per drive
      • The C3010, seemingly the largest HP disk in 1992, was 2.47GB unformatted and 2GB formatted (512-byte sectors) [notionally 595-byte unformatted sectors]
      • The C3010 specs included:
        • MTBF: 300,000 hrs
        • Unrecoverable Error Rate (UER): 1 in 10^14 bits transferred
        • 11.5 msec avg seek, (5.5msec rotational latency, 5400RPM)
        • 256KB cache, 1:1 sector interleave, 1,7 RLL encoding, Reed-Solomon ECC.
        • max 43W 'fast-wide' option, 36W running.
    • runs up to 8 'channel programs' and independently transfer on 4 channels (to mainframe).
    • manages a 64-512MB battery-backed cache (shared or per controller not stated)
    • implements on-the-fly compression, cites maximum doubling capacity.
      • and the dynamic mapping necessary to map CKD (count, key, data) variable-sized IBM blocks onto the fixed blocks used internally.
      • an extra (local?) 8MB of non-volatile memory is used to store these tables/maps.
    • Uses a "Log-Structured File System" so blocks are not written back to the same place on the disk.
    • Not stated if the SCSI buses are one-per-array or 'orthogonal', i.e. redundancy groups made up from one disk per 'array'.
Elsewhere, Katz, one of the authors, uses a diagram of a generic RAID system not subject to any "Single Point of Failure":
  • with dual-controllers and dual channel interfaces.
    • Controllers cross-connected to each interface.
  • dual-ported disks connected to both controllers.
    • This halves the number of unique drives in a system, or doubles the number of SCSI buses/HBA's, but copes with the loss of a controller.
  • Implying any battery-backed cache (not in diagram) would need to be shared between controllers.
From this, a reasonable guess at aspects of the design is:
  • HP C3010 drives were used, 2Gb formatted. [Unable to find list prices on-line]
    • These drives were SCSI-2 (up to 16 devices per bus)
    • available as single-ended (5MB/sec) or 'fast' differential (10MB/sec) or 'fast-wide' (16-bit, 20MB/sec). At least 'fast' differential, probably 'fast-wide'.
  • "14 pathways" could mean 14 SCSI buses, one per line of disks, but it doesn't match with the claimed 16 disks per array.
    • 16 SCSI buses with 16 HBA's per controller matches the design.
    • Allows the claimed 4 arrays of 16 drives per controller (64) and 128 max.
    • SCSI-2 'fast-wide' allows 16 devices total on a bus, including host initiators. This implies that either more than 16 SCSI
  • 5Mbps transfer rate probably means synchronous SCSI-1 rates of 5MB/sec or asynchronous SCSI-2 'fast-wide'.
    • It cannot mean the 33.5-42Mbps burst rate of the C3010.
    • The C3010 achieved transfer rates of 2.5MB/sec asynchronously in 'fast' mode, or 5MB/sec in 'fast-wide' mode.
    • Only the 'fast-wide' SCSI-2 option supported dual-porting.
    • The C3010 technical reference states that both powered-on and powered-off disks could be added/removed to/from a SCSI-2 bus without causing a 'glitch'. Hot swapping (failed) drives should've been possible.
  • RAID-5/6 groups of 15 with 2 parity/check disks of overhead, 26GB usable per array, max 208GB.
    • RAID redundancy groups are implied to be per (16-disk) 'array' plus one hot-spare.
    • But 'orthogonal' wiring of redundancy groups was probably used, so how many SCSI buses were needed per controller, in both 1 and 2-Controller configurations?
    • No two drives in a redundancy group should be connected via the same SCSI HBA, SCSI bus, power-group or cooling-group.
      This allows live hardware maintenance or single failures.
    • How were the SCSI buses organised?
      With only 14 devices total per SCSI-2 bus, a max of 7 disks per shared controller was possible.
      The only possible configurations that allow in-place upgrades are: 4 or 6 drives per bus.
      The 4-drives/bus resolves to "each drive in an array on a separate bus".
    • For manufacturing reasons, components need standard configurations.
      It's reasonable to assume that all disk arrays would be wired identically, internally and with common mass terminations on either side, even to the extent of different connectors (male/female) per side.
      This allows simple assembly and expansion, and trivially correct installation of SCSI terminators on a 1-Controller system.
      Only separate-bus-per-drive-in-array (max 4-drives/bus), meets these constraints.
      SCSI required a 'terminator' at each end of the bus. Typically one end was the host initiator. For dual-host buses, one at each host HBA works.
    • Max 4-drives per bus results in 16 SCSI buses per Controller (64-disks per side).
      'fast-wide' SCSI-2 must have been used to support dual-porting.
      The 16 SCSI buses, one per slot in the disk arrays, would've continued across all arrays in a fully populated system.
      In a minimum system of 32 drives, there would've been only 2 disks per SCSI bus.
  • 1 or 2 controllers with a shared 64MB-512MB cache and 8MB for dynamic mapping.
This would be a high-performance and highly reliable design with a believable $1-2M price for 64 drives (200GB notional, 150GB raw):
  • 1 Controller
  • 128MB RAM
  • 8 ESCON channels
  • 16 SCSI controllers
  • 64 * 2GB drives as 4*16 arrays, 60 drives active, 52 drive-equivalents after RAID-6 parity.
  • cabinets, packaging, fans and power-supplies
From the two price-points, we can tease out a little more of the costs [no allowance for ESCON channel cards]:
  • 1 Controller + 32 disks + 64MB cache = $1.3M
  • 2 Controllers + 128 disks + 512MB cache = $3.6M
As a first approximation, assume that 512MB of cache RAM costs half as much as 2 Controllers for a 'balanced' system. Giving us a solvable set of simultaneous equations (a quick check of the arithmetic follows the figures below):
  • 1.0625 Controllers + 32 disks = $1.3M
  • 2.5 Controllers + 128 disks = $3.6M
roughly:
  • $900,000 / Controller [probably $50,000 high]
  • $70,000 / 64MB cache [probably $50,000 low]
  • $330,000 / 32 disks ($10k/drive, or $5/MB)
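
A quick check of that arithmetic, as a sketch (the 64MB of cache is costed at 1/16 of the 512MB, which in turn is assumed to cost half as much as 2 Controllers):

```python
# Solve:  1.0625*C + D = 1.3   and   2.5*C + 4*D = 3.6   (in $M), where
# C = one Controller (including its cache-cost equivalent) and D = 32 disks.
a1, b1, k1 = 1.0625, 1.0, 1.3
a2, b2, k2 = 2.5, 4.0, 3.6

C = (k1 * b2 - k2 * b1) / (a1 * b2 - a2 * b1)      # ~0.914 => ~$900k per Controller
D = (k1 - a1 * C) / b1                             # ~0.329 => ~$330k per 32 disks

print(f"Controller ~${C * 1e6:,.0f}")
print(f"32 disks   ~${D * 1e6:,.0f} (~${D * 1e6 / 32:,.0f}/drive, ~${D * 1e6 / 32 / 2000:.0f}/MB)")
```
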
High-end multi-processor VAX system pricing at the time is in-line with this $900k estimate, but more likely an OEM'd RISC processor (MIPS or SPARC?) was used.
This was a specialist, low-volume device: expected 1st year sales volume was ~1,000.
In 1994, they'd reported sales of 775 units when the upgrade (9220) was released.

Another contemporary computer press article cites the StorageTek Array as costing $10/MB compared to $15/MB for IBM DASD. 100GB @ $10/MB is $1M, so congruent with the claims above.

How do the real-world products in 1992 compare to the 1988 RAID estimates of Patterson et al?
  • StorageTek Iceberg: $10/MB vs $11-$8 projected.
    • This was achieved using 2GB 5¼in drives, not the 100MB 3½in drives modelled
    • HP sold a 1GB 3½in SCSI-2 drive (C2247) in 1992. This may have formed the basis of the upgraded 9220 ~two years later.
  • Using the actual, not notional, supplied capacity (243GB), the Iceberg cost $15/MB.
  • The $15/MB for IBM DASD compares well to the $18-$10 cited in 1988.
    • But IBM, in those intervening 5 years, had halved the per-MB price of their drives once or twice. The 1988 "list price" from the archives of ~$60/MB is reasonable.
  • In late 1992, 122MB Conner CP-30104 drives were advertised for $400, or $3.25/MB.
    These were IDE drives, though a 120MB version of the SCSI CP-3100 was sold, price unknown.
The 8.4GB 25-drive NCR 6298 gave $12.15/MB, again close to the target zone.
From the Dahlin list, 'street prices' for 420MB drives at the time were $1600 for a Seagate ST-1480A and $1300 for a 425MB Quantum, or $3.05-$3.75/MB.

The price can't be directly compared to either IBM DASD or StorageTek's Iceberg, because the NCR 6298 only provided a SCSI interface, not an IBM 'channel' interface.

The raw storage costs of the StorageTek Iceberg and NCR are roughly 2.5:1.
Not unexpected due to the extra complexity, size and functionality of the Iceberg.

Friday, November 11, 2011

High Density Racking for 2.5" disks

Datacentre space is expensive: one 2010 article puts construction at $1200/sq ft and rental at $600/pa/sq ft for an approximate design heat load of 10-50W/sq ft. Google is reputed to be spending $3,000/sq ft building datacentres with many times this heat load.

There are 3 different measures of area:
  • "gross footprint". The room plus all ancillary equipment and spaces.
  • "room". The total area of the room. Each rack, with aisles and work-space, uses 16-20 sq ft.
  • "equipment footprint". The floor area directly under computing equipment. 24" x 40", ~7 sq ft.
Presumably the $600/pa rental cost is for "equipment footprint".

    The 2½ inch drive form-factor is (100.5 mm x 69.85 mm x 7-15 mm).
    Enterprise drives are typically 12.5 or 15mm thick. Vertically stacked, 20-22 removable 2½ inch drives can be fitted across a rack, taking a 2RU space, with around 15mm of 89mm unused.

    2½ inch drive don't fit well in standard 19" server racks (17" wide, by 900-1000mm deep, 1RU = 1.75" tall), especially if you want equal access (eg. from the front) to all drives without disturbing any drives. Communications racks are typically 600mm deep, but not used for equipment in datacentres.

    With cabling, electronics and power distribution, a depth of 150mm (6") should be sufficient to house 2½ inch drives. Power supply units take additional space.

    Usable space inside a server rack, 17" wide and 1000mm deep, would leave 850mm wasted.
    Mounting front and back, would still leave 700mm wasted, but create significant heat removal problems, especially in a "hot aisle" facility.

    The problem reduces to finding physical arrangements that either maximise exposed area (long, thin rectangles - or the close-to-square 19" rack mounted dual-sided, with a chimney) or maximise surface area while minimising floor space: a circle or cylinder.

    The "long-thin rectangle" arrangement was popular in mainframe days, often as an "X" or a central-spine with many "wings". It assumes that no other equipment will be sited within the working clearances needed to open doors and remove equipment.

    A cylinder, or pipe, must contain a central chimney to remove waste heat. There is also a requirement to plan cabling for power and data. Power supplies can be in the plinth as the central cooling void can't be blocked mid-height and extraction fan(s) need to be mounted at the top.

    For a 17" diameter pipe, 70-72 disks can be mounted around the circumference. Allowing 75mm of height per row, 20-24 rows fit within a normal rack height once a plinth is allowed for. This leaves a central void of around 7" to handle the ~8kW of power of ~1550 drives.

    A 19" pipe would allow 75-80 disks per row and a 9" central void to handle ~9.5kW of ~2000 drives.
    Fully populated unit weight would be 350-450kg.
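
    As a cross-check, a minimal Python sketch of the same geometry. The ~19mm of circumference per drive and ~5W per drive are my assumptions, chosen to reproduce the figures quoted above (a 15mm enterprise drive plus clearance; ~8kW for ~1550 drives):

        import math

        DRIVE_PITCH_MM  = 19.0   # 15mm-thick drive plus clearance (assumed)
        WATTS_PER_DRIVE = 5.0    # implied by ~8kW for ~1550 drives

        def pipe_estimate(diameter_inches, rows):
            # rows of 75mm height; 20-24 rows as per the post
            circumference_mm = math.pi * diameter_inches * 25.4
            drives_per_row = int(circumference_mm // DRIVE_PITCH_MM)
            total = drives_per_row * rows
            return drives_per_row, total, total * WATTS_PER_DRIVE / 1000.0

        for d in (17, 19):
            per_row, lo, kw_lo = pipe_estimate(d, rows=20)
            _,       hi, kw_hi = pipe_estimate(d, rows=24)
            print(f'{d}in pipe: {per_row} drives/row, {lo}-{hi} drives, {kw_lo:.1f}-{kw_hi:.1f}kW')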

    Perhaps one in 8 disks could be removed to allow a cable tray - not an unreasonable loss of space.

    These "pipes" could be sited in normal racks, either at the end of a row (requiring one free rack-space beside them) or within a row, taking 3 rack-spaces.

    As a series of freestanding units, they could be mounted in a hexagonal pattern (the closest-packing arrangement for circles) with minimum OH&S clearances around them, which may be 600-750mm.

    This provides roughly 3-4 times the density of drives over the current usual 20-22 shelves of 24 drives (~480-530) per rack, with better heat extraction. At $4-5,000/pa rental per rack-space (or ~$10/drive), it's a useful saving.

    With current drive sizes of 600-1000Gb/drive, most organisations would get by with one unit of 1-2Pb.

    Update: A semi-circular variant, 40in x 40in, for installation in 3 rack-widths might work as well. It requires a door to create the chimney space/central void - and it could vent directly into a "hot aisle".


    Papers:
    2008: "Cost Model: Dollars per kW plus Dollars per Square Foot of Computer Floor"

    2007: "Data center TCO; a comparison of high-density and low-density spaces"

    2006: "Total Cost of Ownership Analysis for Data Center Projects"

    2006: "Dollars per kW plus Dollars per Square Foot Are a Better Data Center Cost Model than Dollars per Square Foot Alone"

    Thursday, November 10, 2011

    Questions about SSD / Flash Memory

    1. Seagate, in 2010, quote their SSD UER specs as:
      Nonrecoverable read errors, max: 1 LBA per 10^16 bits read
      where a Logical Block Address (LBA) is 512 bytes. Usually called a 'sector'.

      But we know that Flash memory is organised in large blocks (~64KB-class erase blocks, with smaller pages as the minimum read/write unit), not 512-byte sectors.
      Are Seagate saying that errors will be random localised cells, not "whole block at a time"?
      Of course, the memory controller does Error Correction to pick up the odd dropped bits.
    2. Current RAID schemes are antagonistic to Flash Memory:
      The essential problem with NAND Flash (EEPROM) Memory is that it suffers "wear" - after a number of writes and erasures, individual cells (with MLC, cell =/= bit) are no longer programmable. A secondary problem is "Data Retention". With the device powered down, Seagate quote "Data Retention" of 1 year.

      Because Flash Memory wears with writes, components from the same batch will have very similar wear characteristics, and if multiple SSD's are mirrored/RAIDed in a system they will most likely be from the same batch. Evenly spread RAID writes (RAID-5 writes two physical blocks per logical block) will therefore cause a set of SSD's to suffer correlated wear failures. This is not unlike the management of piston engines in multi-engined aircraft: avoid needing to replace more than one at a time. Faults, failures and repair/install errors often show up in the first trip, so replacing all engines together maximises the risk of total engine failure.

      Not only is this "not ideal", it is exactly worst case for current RAID.
      A new Data Protection Scheme is required for SSD's.
      1. Update 23-Dec-2011. Jim Handy in "The SSD Guy" blog [Nov-17, 2011] discusses SSD's and RAID volumes:
        So far this all sounds good, but in a RAID configuration this can cause trouble.  Here’s why.

        RAID is based upon the notion that HDDs fail randomly.  When an HDD fails, a technician replaces the failed drive and issues a rebuild command.  It is enormously unlikely that another disk will fail during a rebuild.  If SSDs replace the HDDs in this system, and if the SSDs all come from the same vendor and from the same manufacturing lot, and if they are all exposed to similar workloads, then they can all be expected to fail at around the same time.

        This implies that a RAID that has suffered an SSD failure is very likely to see another failure during a rebuild – a scenario that causes the entire RAID to collapse.
    3. What are the drivers for the on-going reduction in prices of Flash Memory?
      Volume? Design? Fabrication method (line width, "high-K", ...)? Chip Density?

      The price of SSD's has been roughly halving every 12-18 months for near on a decade, but why?
      Understanding the "why" is necessary to be forewarned of any change to the pattern.
    4. How differently are DRAM and EEPROM fabricated?
      Why is there about a 5-fold price difference between them?
      Prices (Kingston parts from the same store, http://msy.com.au, November 2011):
      DDR3 6Gb 1333MHz   $41    ~$7.00/Gb
      SSD 64Gb           $103   ~$1.60/Gb
      SSD 128Gb          $187   ~$1.45/Gb
      

      It would be nice to know whether there is a structural difference or not, both for designing "balanced systems" and for integrating Flash Memory directly into the memory hierarchy rather than as a pretend block device.
    5. Main CPU's and O/S's can outperform any embedded hardware controller.
      Why do the PCI SSD's not just present a "big blocks of memory" interface, but insist on running their own controllers?
    6. Hybrid SSD-HDD RAID.
      For Linux particularly, is it possible to create multiple partitions per HDD in a set, then use one HDD partition to mirror a complete SSD? The remaining partitions can be set up as RAID volumes in the normal way.
      The read/write characteristics of SSD and HDD are complementary: SSD is blindingly fast for random IO/sec, while HDD's currently offer higher sustained streaming read/write rates.
      Specifically, given 1*128Gb SSD and 3*1Tb HDD, create a 128Gb partition on each HDD. Then mirror (RAID-1) the SSD with one or more of those HDD partitions (RAID-1 isn't restricted to two copies, but can have as many replicas as desired to increase IO performance or resilience/Data Integrity). The remaining 128Gb HDD partitions can be striped, mirrored or RAIDed amongst themselves or to other drives, and the rest of the HDD space can be partitioned and RAIDed to suit demand. (A sketch of one such layout follows these questions.)

      Does it make sense, both performance- and reliability-wise, to mirror SSD and HDD?
      Does the combination yield the best, or worse, of both worlds?

      Is the cost/complexity and extra maintenance worth it?
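
    As a concrete illustration of the layout in question 6: a minimal Python sketch that only prints the mdadm commands it would use. The device names (/dev/sda1 as the 128Gb SSD partition, /dev/sdb../sdd as the three 1Tb HDD's) and partition numbers are hypothetical, and the partitions are assumed to already exist:

        # Print (don't run) an mdadm plan for the hybrid SSD/HDD layout above.
        ssd      = "/dev/sda1"
        hdd_fast = ["/dev/sdb1", "/dev/sdc1"]                # 128Gb HDD partitions mirroring the SSD
        hdd_bulk = ["/dev/sdb2", "/dev/sdc2", "/dev/sdd2"]   # remaining HDD capacity
        # (/dev/sdd1 is left free here for other uses.)

        plan = [
            # 3-way RAID-1: the SSD plus two HDD partitions. Linux md allows any number of
            # mirrors; --write-mostly steers reads to the SSD while writes hit every member.
            ["mdadm", "--create", "/dev/md0", "--level=1",
             f"--raid-devices={1 + len(hdd_fast)}", ssd, "--write-mostly", *hdd_fast],
            # Conventional RAID-5 over the leftover HDD partitions.
            ["mdadm", "--create", "/dev/md1", "--level=5",
             f"--raid-devices={len(hdd_bulk)}", *hdd_bulk],
        ]
        for cmd in plan:
            print(" ".join(cmd))

    Whether mirroring SSD and HDD gives the best or the worst of both worlds will depend heavily on whether reads are actually steered to the SSD and on how writes are buffered.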

    Friday, November 04, 2011

    Enterprise Drives: Moving to 2.5" form factor

    Update [23-Dec-2011]: 2009 IDC figures, discussed in an HP report, cover the migration to 2.5 inch drives in Enterprise Storage: it started in 2004 and was projected to be complete in 2011.

    Elsewhere, a summary of IDC's 2009 Worldwide HDD shipments and revenue:

    • The transition from 3.5in. to 2.5in. performance-optimized form factor HDDs will be complete by 2012.
    • Growing interest in new storage delivery models such as storage as a service, or storage in the cloud is likely to put greater storage capacity growth demands on Internet datacenters.
    • The price per gigabyte of performance-optimized HDD storage will continue to decline at a rate of approximately 25% to 30% per year.
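
    If that 25-30%/year decline held, $/Gb would fall to roughly a quarter or less over five years; a short illustration (the starting price is arbitrary):

        start = 1.00   # $/Gb at year 0 (arbitrary)
        for rate in (0.25, 0.30):
            print(f"{int(rate*100)}%/yr:",
                  ", ".join(f"{start * (1 - rate) ** year:.2f}" for year in range(6)))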

    Thursday, November 03, 2011

    Types of 'Flash Memory'

    "Flash Memory", or EEPROM (Electrically Erasable Programmable Read Only Memory), is at the heart of much of the "silicon revolution" of the last 10 years.

    How is it packaged and made available to consumers or system builders/designers?

    Mobile appliances are redefining Communications and The Internet, precisely because of the low-power, high-capacity and longevity - and affordable price -  of modern Flash Memory.

    There are many different Flash Memory configurations: NAND, NOR, SLC, MLC, ...
    This piece isn't about those details/differences, but about how they are packaged and organised.
    What technology does what/has what characteristics is out there on the InterWebs.

    Most Flash Memory chips are assembled into different packaging for different uses at different price points. Prices are "Retail Price (tax paid)" from a random survey of Internet sites:
    • Many appliances use direct-soldered Flash Memory as their primary or sole Persistent Datastore.
      The genesis of this was upgradeable BIOS firmware. Late 1990's?
      Per-Gb pricing not published: approx. derivable from model price differences.
    • Commodity 'cards' used in cameras, phones and more: SD-card and friends.
      Mini-SD and Micro-SD cards are special cases and attract a price premium.
      Some 'high-performance' variants for cameras.
      A$2-3/Gb
    • USB 'flash' or 'thumb drives':
      A$2-3/Gb.
    • High-end camera memory cards: Compact Flash (CF). The oldest mass-use format?
      IDE/ATA compatible interface. Disk drive replacement for embedded systems.
      Fastest cards are 100MB/sec (0.8Gbps). Max is UDMA ATA, 133MB/sec.
      Unpublished Bit Error Rate, Write Endurance, MTBF, Power Cycles, IO/sec.
      A$5-$30/Gb
    • SATA 2.5" SSD (Solid State Drives). Mainly 3Gbps and some 6Gbps interfaces.
      MTBF: 1-2M hours,
      Service Life: 3-5 years at 25% capacity written/day.
      IO/sec: 5,000 - 50,000 IO/sec [max seen: 85k IO/sec]
      BER: "1 sector per 10^15-10^16 bits read"
      sustained read/write speed: 70-400MB/sec (read often slowest)

      Power: 150-2000mW active, 75-500mW idle
      32Gb - 256Gb @ A$1.50-$2.50/Gb.
    • SATA 1.8" SSD. Internal configuration of some 2.5" SSD's.
      Not yet widely available.
    • SAS (Serial Attached SCSI) 2.5" drives.
      not researched. high-performance, premium pricing.
    • PCI "SSD". PCI card presenting as a Disk Device.
      Multiple vendors, usual prices ~A$3-4/Gb. Sizes 256Gb - 1Tb.
      "Fusion-io" specs quoted by Dell. Est A$20-25/Gb. [vs ~$5/Gb direct]
      640GB (Duo)
      NAND Type: MLC (Multi Level Cell)
      Read Bandwidth (64kB): 1.0 GB/s
      Write Bandwidth (64kB): 1.5 GB/s
      Read IOPS (512 Byte): 196,000
      Write IOPS (512 Byte): 285,000
      Mixed IOPS (75/25 r/w): 138,000
      Read Latency (512 Byte): 29 μs
      Write Latency (512 Byte): 2 μs
      Bus Interface: PCI-Express x4 / x8 or PCI Express 2.0 x4
    • Mini-PCIe cards, Intel:  40 and 80Gb. A$3/Gb
      Intel SSD 310 Series 80GB mini PCIe

      * Capacity: 80 GB,
      * Components: Intel NAND Flash Memory Multi-Level Cell (MLC) Technology
      * Form Factor, mini PCIe, mSATA, Interface,
      * Sequential Read - 200 MB/s, Sequential Write - 70 MB/s,
      * Latency - Read - 35 µs, Latency - Write - 65 µs,
      * Lithography - 34 nm
      * Mean Time Between Failures (MTBF) - 1,200,000 Hours
    Roughly, prices increase with size and performance.
    The highest density chips, or leading edge technology,  cost a premium.
    As do high-performance or "specialist" formats:
    • micro-SD cards
    • CF cards
    • SAS and PCI SSD's.

    Wednesday, November 02, 2011

    When is a SATA drive not a drive? When it's Compact Flash

    The CompactFlash Association  has released the new generation of the CF specification, CFast™ [from a piece on 100MB/sec CF cards]
    10-15 years ago, CF cards were used almost exclusively in Digital cameras; now they are used only in "high-end" Digital SLR's (DSLR's), presumably because of cost/availability compared to alternatives like SDHC cards.

    The UDMA based CF standard allows up to 133MB/sec transfer rates.
    The new SATA based standard, CFast, allows 3Gbps (~300MB/sec) transfer rates.
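
    The ~300MB/sec figure follows from SATA's 8b/10b line coding (10 bits on the wire per data byte); a one-line check:

        line_rate_bps = 3_000_000_000            # 3Gbps SATA link
        print(line_rate_bps / 10 / 1e6, "MB/s")  # 8b/10b coding -> 300.0 MB/s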

    In another context, you'd call this an SSD (Solid State Disk), even a "SATA drive".
    There are two problems:
    • common cameras don't currently support CFast™, the SATA based standard, and
    • 'fast' CF cards are slower than most SSD's and attract a price premium of 5-10 times.
    I'm not sure what decision camera manufacturers will make for their Next Generation high-end storage interface; they have 3 obvious directions:
    • CF-card format (43×36×5 mm), SATA interface, CFast™
    • 1.8inch or 2.5inch SSD, SATA interface
    • 34mm ExpressCard. PCIe interface.
    and the less obvious: somehow adapt commodity SDHC cards to high-performance.
    Perhaps in a 'pack' operating like RAID.

    The "Next Generation Interface" is a $64 billion question for high-end camera manufacturers: the choice will stay with the industry for a very long time, negatively affecting sales if made poorly.

    Manufacturers are much better off selecting the same standard (common parts, lower prices for everyone), but need to balance the convenience of "special form factors" with cost. Whilst professional photographers will pay "whatever's needed" for specialist products, their budgets aren't infinite and excessive prices restrict sales to high-end amateurs.

    Perhaps the best we'll see is a transition period of dual- or triple-card cameras (SDHC, CF-card and CFast™), with the possibility of an e-SATA connector for "direct drive connection".

    Update 04-Nov-2011:
    Here's a good overview from the Sandisk site of form-factor candidates to replace the CF card form-factor of (43×36×5 mm):

    SanDisk® Solid State Drives for the Client
     "A variety of form factors, supporting multiple OEM design needs."

    Product / Interface / Form Factor / Measurements:
    • SanDisk U100 SSD / SATA / 2.5": 100.5 mm x 69.85 mm x 7 mm (std allows 9.5mm, 12.5mm, 15mm)
    • SanDisk U100 SSD / SATA / Half-Slim SATA: 54.00 mm x 39.00 mm x 3.08 mm (8-64GB), x 2.88 mm (128-256GB); connector 4.00 mm
    • SanDisk U100 SSD / SATA / mSATA: 30.00 mm x 50.95 mm x 3.4 mm (8-64GB), x 3.2 mm (128-256GB)
    • SanDisk U100 SSD / SATA / mSATA mini: 26.80 mm x 30 mm x 2.2 mm (8GB), x 3.2mm (16-128GB)
    • SanDisk iSSD(TM) / SATA / SATA uSSD: 16 mm x 20 mm x 1.20 mm (8-32GB), x 1.40 mm (64GB), x 1.85 mm (128GB)
    • Standard 1.8 in disk / SATA / 1.8": 54 mm x 71 mm x 8 mm
    • Express Card / PCIe / Express Card: 34mm x 75mm x 5mm or 54mm x 75mm x 5mm
    • Compact Flash / ATA / CF-II: 36mm x 43mm x 5mm

    Tuesday, November 01, 2011

    Flash Memory vs 15,000 RPM drives

    Some I.T. technologies  have a limited life or "use window". For example:
    In 2001, the largest Compact Flash (CF) card for Digital cameras wasn't flash memory, but a hard disk, the 1Gb IBM "microdrive" ($500 claimed price). After Hitachi acquired the business, they produced 4Gb and 6Gb drives,  apparently for sale as late as 2006, with the 4Gb variant used in the now discontinued Apple iPod mini.
    Around 2006, the 1" microdrive hard disks were out-competed by flash memory and became yet another defunct technology.

    Flash memory storage, either SSD or PCI 'drives', for servers now cost A$2-3/Gb for SATA SSD and $3-4/Gb for PCIe cards [A$3/Gb for Intel mini-PCIe cards].

    Currently, Seagate 15k 'Cheetah' drives sell for around $2/Gb, but their 2msec (0.5k IO/sec) performance is no match for the 5KIO/sec of low-end SSD's or the 100-250KIO/sec of PCI flash.

    10,000 RPM 'Enterprise' drives cost less, around $1.50/Gb, whilst 1Tb 7200 RPM (Enterprise) drives come in at $0.25/Gb.
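
    Putting those numbers on one scale makes the gap plain. A minimal sketch; the $/Gb and IO/sec values are taken from the figures above (mid-range values), and the 300Gb notional device size is my assumption, roughly a 15k drive of the era:

        CAPACITY_GB = 300   # notional device size (assumed)

        media = [
            # (name, $_per_Gb, IO_per_sec) - mid-range of the figures quoted above
            ("15k 'Cheetah' HDD",  2.00,     500),
            ("low-end SATA SSD",   2.50,   5_000),
            ("PCI flash",          3.50, 150_000),
        ]
        for name, per_gb, iops in media:
            cost = per_gb * CAPACITY_GB
            print(f"{name:18s} ${cost:6,.0f} per {CAPACITY_GB}Gb, {iops:7,} IO/sec, "
                  f"${cost / iops:6.3f} per IO/sec")

    On a per-IO/sec basis the 15k drive comes out roughly an order of magnitude more expensive than a low-end SSD and two orders of magnitude more expensive than PCI flash, which is the heart of the argument.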

    The only usual criterion on which 15,000 RPM drives beat other media is single-drive transfer rate*.
    Which in an 'Enterprise' RAID environment is not an advantage unless a) you're not paying full price or b) you have very special requirements or constraints, such as limited space.

    I'm wondering if 2010 was the year that 15,000 RPM Enterprise drives joined microdrives in the backlot of obsolete technologies - replaced in part by the same thing, Flash Memory.



    Part of the problem stems from the triple whammy for any specialist technology/device:
    • Overheads are maximised: By definition 'extreme' technologies are beyond "cutting-edge", more "bleeding-edge", meaning research costs are very high.
    • Per Unit fixed costs are maximised: Sales volumes of specialist or extreme devices are necessarily low, requiring those high research costs to be amortised over just a few sales.
      If the technology ever becomes mainstream, it is no longer 'specialist' and research costs are amortised over very large numbers of parts.
    • Highest Margin per Unit: If you want your vendor to stay in business, they have to make suitable profits, both to encourage 'capital' to remain invested in the business and have enough surplus available to fund the next, more expensive, round of research. Profitable businesses can be Low-Volume/High Margin or High-Volume/Low Margin (or Mid/Mid).
    Specialised or 'extreme performance' products aren't just proportionately more expensive; they are necessarily radically more expensive, compounding the problem. When simple alternatives exist that use commodity/mainstream devices (defined as 'least cost per performance-unit', or highest volume used), they are adopted, all other things being equal.

    * There are many characteristics of magnetic media that are desirable or necessary for some situations, such as "infinite write cycles". These may be discussed in detail elsewhere.

    Monday, October 31, 2011

    RAID, Backups and Recovery

    There's an ironclad Law of SysAdmin:
    "Nobody ever asks for backups to be done, only ever 'restores'"
    Discussing RAID and Reliable Persistent Storage cannot be done without reference to the larger context: Whole System Data Protection and Operations.

    "Backups" need more thought and effort than a 1980-style dump-disk-to-tape. Conversely, as various politicians & business criminals have found to their cost with "stored emails", permanent storage of everything is not ideal either, even if you have the systems, space and budget for media and operations.

    "Backups" are more than just 'a second copy of your precious data' (more fully, 'of some indeterminate age'), which RAID alone only partially provides.

    Sunday, October 30, 2011

    Revisiting RAID in 2011: Reappraising the 'Manufactured Block Device' Interface.

    Reliable Persistent Storage is fundamental to I.T./Computing, especially in the post-PC age of "Invisible Computers". RAID, as large Disk Arrays backed up by tape libraries, has been a classic solution to this problem for around 25 years.

    The techniques to construct "high-performance", "perfect" devices with "infinite" data-life from real-world devices with many failure modes, performance limitations and a limited service life vary with cost constraints, available technologies, processor organisation, demand/load and expectations of "perfect" and "infinite".

    RAID Disk Arrays are facing at least 4 significant technology challenges or changes in use as I.T./Computing continues to evolve:
    • From SOHO to Enterprise level, removable USB drives are becoming the media of choice for data-exchange, off-site backups and archives, as their price/Gb falls to multiples below that of tape and optical media.
    • M.A.I.D. (Massive Array of Idle Disks) is becoming more widely used for archive services.
    • Flash memory, at A$2-$4/Gb, packaged as SSD's or direct PCI devices, is replacing hard disks (HDD's) for low-latency random-IO applications. eBay, for example, has announced a new 'Pod' design based around flash memory, legitimising the approach for Enterprises and providing good "case studies" for vendors to use in marketing and sales.
    • As Peta- and Exabyte systems are designed and built, it's obvious that the current One-Big-Box model for Enterprise Disk Arrays is insufficient. IBM's General Parallel File System (GPFS) home page notes that large processor arrays are needed to create/consume the I/O load [e.g. 30,000 file creates/sec] the filesystem is designed to provide. The aggregate bandwidth provided is orders of magnitude greater than can be handled by any SAN (Storage Area Network, sometimes called a 'Storage Fabric').
    Internet-scale datacentres, consuming 20-30MW, house ~100,000 servers and potentially 200-500,000 disks, notably not as Disk Arrays. Over a decade ago, Google solved its unique scale-up/scale-out problems with whole-system replication and GFS (Google File System): a 21st century variant of the 1980's research work at Berkeley by Patterson et al called "N.O.W." (Network of Workstations). A portion of which was the concept written up in 1988 as the seminal RAID paper: "Redundant Arrays of Inexpensive Disks".

    Tuesday, October 25, 2011

    My Ancient Computer Project

    Preamble: If you're really looking for advice on floppy drives: 8", 5¼" and 3½", try these links:
    http://www.retrotechnology.com/herbs_stuff/s_drives_howto.html
    http://www.classiccmp.org/dunfield/
    http://www.deviceside.com/

    The good folk at "Device Side Data" sell a USB adaptor for 5¼ inch floppy drives and a range of relevant cables, power-supplies and enclosures.
    For about US$100, you can have a working  5¼in USB setup, but it's bring-your-own-drive.
    Sourcing a 5¼ inch floppy drive may be tricky: they went out of production 10-15 years ago.
    Sourcing 5¼ inch floppy disks, new or 'slightly used' is probably a greater challenge.



    Some weeks ago a friend (AF) asked if I could copy a 5¼ inch floppy disk for him, leading to this little adventure...

    AF had a 3½ inch floppy that he might have copied everything onto and wanted to check.
    Without specialist equipment, I knew I wouldn't be able to recover all potentially readable data, but I offered to do what I could with a standard drive.

    In the end, AF decided to try something else, so I  didn't get to do his copy.
    But it did make me get my old 386-SX properly set up and networked, so I could move files to/from it and easily copy 5¼in and 3½in floppy disks.
    Before this, I still booted it occasionally, but the real-time clock battery had died and it had no network card or CD-ROM drive.

    This decision led to weeks of farnarkling and some interesting lessons.

    While it's unclear if my 386-SX will survive another 2 decades, the software can live on through tools like QEMU, WINE/Crossover and even DOSBOX. So there is some value in recovering the data both on the hard disk and my collection of floppies.
    The impetus to recover data from unreadable media, 5¼in floppies, is obvious.
    For the readable 3½in floppies, taking a copy now is a good investment of time if I ever want to access the data again: magnetic media does degrade over time. In another 20 years, the coating on those floppies may be flaking off in big lumps.

    386-SX Initial Config:
    • purchased late 1991, ~$3250
    • 386-SX CPU, 20Mhz (selectable to 8Mhz). No FPU. 8Mhz ISA bus, 8-slots.
    • Dual floppy drives, 5¼ in [boot] and 3½ in.
    • IDE disk. 80Mb, WD AC280.
    • single parallel port, dual serial ports, one used for mouse, other for modem.
    • 5Mb RAM [max at time]
    • Super VGA: 1024x768 in 16 colours, 640x480 in 256 colours, 512Kb video memory [K-i-l-o not Mb]
    • 33cm display. [fixed scanrate, can be destroyed @ wrong Hz]
    • mini-tower. pre-ATX power-supply (no 'soft' power switch or 'halt')
    • No sound-card.
    • DOS 5.0 and Windows 3.0
    Current state-of-play is:
    • 386-SX has DOS 6.20, Win 3.11, Networking, CD-ROM and Zip drive all working.
      Hard-drive cloned/backed up and (5) ZIP disks read and copied. [ZIP drive back in its box]
      Single 3½in floppy as A: drive. BIOS only allows booting from first drive.
    • 5¼ in drive working as 2nd drive in a Linux machine (2001, Celeron 667) with built-in networking.
    • Working through copying all 5¼in floppies I can find (a sketch of the copying step follows this list).
      Tally so far [40]: 1 unreadable, 1 with errors.
      Update 04-Nov-2011: 210 floppies read. 40-50 were 5¼in: 8 "no data". 150+ were 3½in: 5 "no data".
    • Now have a USB 3½ in floppy drive, can read those at leisure on newer machines.
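
    For the copying itself, an image of each readable disk is enough: read the raw device once and keep the bytes. A minimal Python sketch for Linux; /dev/fd0 and the output name are assumptions, it needs permission to read the device, and it makes no attempt to recover sectors the controller can't read:

        import sys

        def image_floppy(out_path, device="/dev/fd0"):
            """Copy a floppy, block by block, into an image file."""
            with open(device, "rb") as dev, open(out_path, "wb") as img:
                while True:
                    chunk = dev.read(512 * 36)   # modest chunks; the size is arbitrary
                    if not chunk:
                        break
                    img.write(chunk)

        if __name__ == "__main__":
            image_floppy(sys.argv[1] if len(sys.argv) > 1 else "floppy.img")
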
    The most important lesson for me came about two weeks in:
    • I was fixated on doing everything on my 1991 vintage 386-SX.
      At one point I was running through the options of replacing the motherboard and the various costs which weren't attractive given I was 'just playing'.
    • Since USB became ubiquitous, finding machines/motherboards with floppy drive controllers is increasingly difficult, which means even embedded boards with FDC's are rare and expensive.
    • Then I realised I already had everything I needed in the 2001 vintage Celeron system I had tucked away.
      It's loaded with Fedora Core 3 (support ended in 2004) with a linux 2.4 kernel. Old, but usable.
    • It was perfect for what I wanted to do.
      It had a floppy drive controller and I could transplant the cable (with  5¼in 'slot' connectors) from the 386-SX to the Celeron 667.
     I also learnt a little about floppy drive connectors:
    • 5¼in drives have a 'slot' (card-edge) connector on the drive and a header not unlike a 40-pin IDE connector (34-pin is used).
    • 'Classic' 3½in drives use a socket connector similar to IDE connectors (but 34-pin)
    • The $30 USB 3½in drive I bought uses an incompatible tiny connector (two versions, a plug and a socket, with a conversion cable); I didn't know this variant existed. Noted here to save other people from popping open a put-together-permanently case. I didn't care about the warranty, but the case needs to be firmly shut or drive operation is affected.

    Before starting any work, I had to back up the original 386-SX disk. My memory was that I'd bought a 30Mb 'RLL' drive (a 20Mb ST-506 drive with a modified controller).
    Turns out I really had an 85Mb IDE (now called ATA or PATA), a "WD Caviar® AC280"; not only larger, but it would allow me to connect an IDE CD-ROM drive in there as well. My unused hardware pile has any number of CD-ROM drives.

    Most importantly, an IDE/ATA drive gave me the option of backing up the drive via an IDE/USB interface... Of which I have a number of versions.

    I also had an old system backup of around 30 * 720kb 3½in floppies. DOS 5.0 and Win 3.0.
    Using QEMU on a Linux system, I was able to restore this backup to a virtual disk drive.
    Shuffling all those disks was painful and slow.
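
    For the record, the QEMU side of that restore can be as simple as a blank raw disk image plus a DOS boot floppy image; a sketch (image names and the 100M size are arbitrary, and floppy images can be swapped from the QEMU monitor as the restore asks for the next disk):

        import subprocess

        # Create a blank 100M "hard disk" and boot DOS from a floppy image to restore onto it.
        subprocess.run(["qemu-img", "create", "-f", "raw", "dos_c.img", "100M"], check=True)
        subprocess.run(["qemu-system-i386",
                        "-hda", "dos_c.img",      # the blank virtual C: drive
                        "-fda", "bootdisk.img",   # DOS boot floppy image
                        "-boot", "a"],            # boot from the floppy
                       check=True)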

    Connecting an early IDE drive to a modern(ish) IDE/USB interface didn't work.
    Probably because additional commands were introduced to identify the drive, possibly because this old drive responded to "CHS" (cylinder-head-sector), not LBA (Logical Block Addressing). From the AC280 spec. sheet, the drive electronics did support any reasonable CHS settings, not only the physical layout.

    My next preferred method was to connect a second IDE drive as a 'slave' (D: in DOS), fdisk and format it and copy the original drive contents, then connect this drive to a modern system with IDE/USB interface and back it up.

    The first 3.5" HDD I tried from my unused pile (1Gb Fujitsu) had errors.
    Next drive tried was a 3.5" 4.3Gb. Worked reliably.
    In the final config, I replaced that drive with a slower, quieter 2Gb 2.5in drive, cloning it via an IDE/USB interface.

    The Phoenix BIOS in the 386-SX is very old. Not only doesn't it support LBA drives, it seemed to limit drives to 1023 cylinders - and 15 heads/63 sectors. Around 470Mb.
    A very large fraction of my time was spent fiddling with disk CHS specifications and attempting to get fdisk to ignore the BIOS settings.
    [No, I can't update the firmware, the BIOS is pre "Flash-the-BIOS"].
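
    The 470Mb figure is just that geometry ceiling multiplied out:

        # BIOS limit quoted above: 1023 cylinders x 15 heads x 63 sectors of 512 bytes
        capacity = 1023 * 15 * 63 * 512
        print(capacity, "bytes =", round(capacity / 2**20), "MiB")   # ~472, i.e. "around 470Mb"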

    I could setup multiple (large) partitions on the drive with Linux and the IDE/USB interface, but then would run into troubles under DOS and the 386-SX.
    I tried 'fdisk' from DOS 5.0, 6.2, 6.22, Win-95 and FreeDos on the 386-SX, but all would only see the 470Mb allowed by the BIOS. Extra partitions would be displayed, but couldn't be changed.

    I don't have enough unused 1.44Mb 3½in floppies to back up the entire 85Mb drive, so was very glad my 2nd method worked.

    Getting a working ISA-bus (not PCI) network card was simple: I had two in my "unused bits pile".
    The one I chose I'd bought new, a Netgear NE2000 clone, but I didn't realise that or find the box (with install floppy) for a while. Relied on Windows NE2000 driver and Internet downloads at first.
    The other card didn't have a clear name/identifier, nor did I get a good match from the chip numbers.

    This was another 'surprise': how to get any info on installed ISA cards.
    I also spent 2 or more days fiddling with the IRQ/DMA settings on the NE2000 card. I'd forgotten the problems that PCI made go away. I ended up with IRQ 5 (COM2) and found an IO address range through trial-and-error. I don't know for sure if it's a clash or not.
    I couldn't find a tool that would list for me all the cards + settings in the system.
    Norton's "SI" provides everything but DMA ports.
    MSD (Microsoft Diagnostics) didn't help either.

    Getting the CD-ROM working was a good idea, if a little problematic.
    Using standard Linux tools, I was able to create an ISO image of the original DOS/Windows system and also add some additional tools.
    The problematic part was creating a disk image with Uppercase filenames. Whilst DOS 6.2 (really MSCDEX) reads the root directory correctly, no files or directories can be read/listed.
    Perhaps it is the 'Joliet' option I use that causes this... Had the same difficulty with the FreeDos CD.

    I managed to get non-DOS booting via 3½in floppies:
    Until I tried it, I wasn't aware that the FreeDos boot media uses 'SYSLINUX' (a boot loader from the Linux world) and that FreeDos has its own kernel.sys.

    It took a good deal of searching to find any Linux that would support:
    • vanilla 386
    • no FPU
    • no RAM disk too large to fit in under 12Mb of memory.
    "Floppyfw" recognised the (single) NE2000-compatible network card, but didn't include a shell.
    BG-TLB does include BusyBox, but only supports 'plip' networking over the parallel port (not tried).
    So while I have seen a linux shell prompt and been able to mount the DOS/FAT filesystem, I haven't been able to dual-boot the 386-SX or run it as a Linux only system.

    Whilst FreeDos could read its install CD-ROM, DOS 6.2 was unable to read files/directories contents (the lowercase name problem above).

    Another Linux not-tried was tomsrtbt: "Tom's floppy which has a root filesystem and is also bootable." http://www.toms.net/rb/tomsrtbt.FAQ. It advertises itself as "The most GNU/Linux on 1 floppy disk."
    Which might have worked, but it formats 3½in floppies at the non-standard 1.7Mb, not 1.44Mb, with the caveat "Doing this may destroy your floppy drive".

    One of the 'surprises' I got was being unable to replace the original 3½in floppy drive with a newer drive. The interface and connectors were all the same, but the newer drive wouldn't work in the old system.
    Was it me connecting it incorrectly, a faulty drive or something more?
    Unable to tell and unwilling to devote a bunch of time testing it.

    One of the worst 'surprises' I got was after installing the IOmega (parallel-port) ZIP drive software on the system after a fresh install of Windows 3.11. [Windows 3.11 had decent Networking support and Microsoft still has a good TCP/IP stack available for download.]
    I had it all setup, tested and working and foolishly, in Windows, selected "Optimise settings" and the system hung.
    Whereafter, the system couldn't see any Comms ports, serial or parallel. Which was very problematic because the 386-SX didn't come with a mouse port (DIN or PS/2). I used a serial port for the mouse.
    Windows hung when it booted, leading me to try to revert the ZIP drive install and later to re-install Windows.
    There were countless reboots and 6-8 hours later I gave up and went to bed.
    I had realised/diagnosed that when the machine booted, the BIOS reported "0 serial ports" and "0 parallel ports". The BIOS setup screen only allowed me to select HDD and floppy drive settings.

    First thing in the morning, I powered on the machine and it worked perfectly. Including the original copy of Windows that would hang.
    All I can think of was a power-cycle (off/on) cleared the fault, whereas a 'cold' or 'warm' reset (reset switch or ctrl-alt-del) didn't. In the many reboots, I hadn't thought to power-cycle the machine. [A note for myself and others experiencing weirdness on old hardware.]

    I also have an old Dell Inspiron 7000, dating from 1999. I got it with a removable ZIP drive, figuring I could do backups and bulk-data transfers using it and the parallel-port ZIP drive. While tinkering around, I disassembled the other ZIP drive. It's an IDE/ATAPI drive, but the connectors are non-standard. I was hoping I'd be able to kludge it to work on another system - but not to be.

    One of the BIOS limits, noted above, was it will only boot from the first floppy drive. When I'd configured the machine, I'd made the 5¼in drive "A:". Part of my reconfig was to move that drive and make the 3½in drive "A:", so it could boot from it. And most "floppy disk images" on the net are 1.44M for 3½in drives.
    Then I executed a perfect "rookie mistake": I forgot to make a bootable 3½in system disk before moving the 5¼in drive. And all my system disks were, of course, 5¼in.

    The 386-SX is in a "mini-tower" case with very limited space between the back of devices in the drive bays and the motherboard etc. This makes running cables and changing drives quite time consuming. Especially with older connectors that are loose and can be jiggled off. This was part of the reason to move to a 2½in HDD - much more space. I did need to find a spare 2½in-3½in mounting kit first.
    I couldn't believe the weight of the whole system, let alone just the removable cover. Meaty!

    Early on I replaced the on-board/real-time clock battery so the system would remember the time over reboots. More modern systems use 3V lithium batteries (CR2032) for this. This old motherboard used a 4.5V alkaline battery (mounted off-board with Velcro). About a week in, I used 3 AAA alkaline batteries in a modified carrier (and a cannibalised connector) to craft a replacement. A less pretty way is to load 3 batteries in a cut-to-length tube with wire soldered directly to exposed battery ends. Soldering wires directly onto batteries requires some technique. You may need help if you try it. There could be an explosion risk with alkaline batteries becoming overheated (they are marked "do not dispose of in fire"). Research this properly or get help if you choose to do this.

    The system, including DOS, quite happily accepts dates of 2011. No problems there...

    One of the problems I haven't addressed yet is: "How do I clean the drive heads?"
    Back in the day when I was a sometime 'operator' on mainframes, cleaning tape drive heads was part of the ritual. We used Isopropyl alcohol + cotton swabs - because it didn't leave a residue. The swabs would always come away stained with oxide coating from the tape.
    For these old drives, I've two reasons to want to be able to clean the heads:
    a) these are old drives and may well have an internal dust build-up, and
    b) older disks are likely to shed more of their coating than when new.
    Some media formulations from the late 1980's are known to suffer problems. I've heard first-hand accounts of the work needed to recover period audio-tape recordings due to this problem.
    • 5¼ inch floppy drives load the heads directly in-line with feed-slot.
      It is possible to get a swab in there, though v. difficult to see what's happening.
    • 3½in floppy drives drop the disk to both lock the disk in-place and load the heads, hence the heads, being offset, aren't accessible for swabbing through the feed slot.
    Another thing to investigate...