SteveJ's lab-notes: November 2011

Monday, November 28, 2011

Optical Disks as Dense near-line storage?

A follow-up to an earlier post [search for 'Optical']:

Could Optical Disks be a viable near-line Datastore?
Use a robot arm to pick and load individual disks from 'tubes' into multiple drives.
Something the size of a single filing cabinet drawer would be both cheap and easily contain a few thousand disks. That's gotta be interesting!

Short answer, no...

A 3.5" drive has a form-factor of: 4 in x 5.75 in x 1in. Cubic capacity: 23 in³
A 'tube' of 100 optical disks: 5.5in x 5.5in x 6.5in Cubic capacity: 160-200 in³ [footprint or packed]

A 'tube' of 100, minus all the supporting infrastructure to select a disk and read it, is 7-9 times the volume of a 3.5in hard disk drive, or each Optical Disk must contain at least 7-9% of a HDD to be competitive.

To replace a 1Tb HDD, optical disks must be at least 7% of 1,000Gb, or 70Gb. Larger than even Blu-ray and 15-20 times larger than Single layer DVD's (4.7Gb).
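
The break-even arithmetic above can be restated as a short calculation. This is only a sketch using the dimensions quoted in this post; the function name is made up for illustration, and it compares on volume alone, ignoring the robot arm and drives.

```python
# Volume comparison between a 3.5" HDD and a 'tube' of 100 optical disks,
# using the dimensions quoted in the post (inches).

HDD_VOLUME_IN3 = 4 * 5.75 * 1.0          # 23 cubic inches
TUBE_VOLUME_RANGE_IN3 = (160.0, 200.0)   # packed estimate, footprint estimate
DISKS_PER_TUBE = 100


def breakeven_disk_capacity_gb(hdd_capacity_gb, tube_volume_in3):
    """Capacity each optical disk needs so a tube matches HDDs per unit volume."""
    hdds_displaced = tube_volume_in3 / HDD_VOLUME_IN3
    return hdds_displaced * hdd_capacity_gb / DISKS_PER_TUBE


for tube_volume in TUBE_VOLUME_RANGE_IN3:
    needed = breakeven_disk_capacity_gb(1000, tube_volume)
    print(f"tube of {tube_volume:.0f} in^3: each disk needs about {needed:.0f} Gb")
```

Changing the first argument to 2000 or 3000 shows how quickly the target moves as HDD capacities grow.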

Current per Gb price of 3.5" HDD's is around $0.05-$0.10/Gb, squeezing 4.7Gb DVD's on price as well.

2Tb drives are common, 3Tb are becoming available now (2011). Plus it gets worse.

There are estimates of a maximum possible 3.5" HDD size of 20-40Tb.
To be competitive, Optical disks would need to get up around 1Tb in size and cost under $1.

Around 2005, when 20-40Gb drives reigned, there was a time when 4.7Gb DVD's were both the densest and cheapest storage available. Kryder's Law, a doubling of HDD capacity every 1-2 years, has seen the end of that.

Sunday, November 27, 2011

Journaled changes: One solution to RAID-1 on Flash Memory

As I've posited before, simple-minded mirroring (RAID-1) of Flash Memory devices is not only a poor implementation, but worst-case.

My reasoning is: Flash wears-out and putting identical loads on identical devices will result in maximum wear-rate of all bits, which is bad but not catastrophic. It also creates a potential for simultaneous failures where a common weakness fails in two devices at the one time.

The solution is to not put the same write load on the two devices, but still have two exact copies.
This problem would be an especial concern for PCI-SSD devices internal to a system. These devices can't normally be hot-plugged. Though there are hot-plug standards for PCI devices (e.g. Thunderbolt and ExpressCard), they are not usually options for servers and may offer limited performance.

One solution (I'm not sure whether it's optimal, but it is 'sufficient') is to write blocks as normal to the primary device and maintain the secondary device as a snapshot plus (compressed) journal entries. The snapshot is brought up to date, making it an exact copy again, when the journal space hits a high-water mark, when a timer expires (hourly, 6-hourly, daily, ...), or when the momentum of changes would fill the journal to 95% before the snapshot could be updated.

If the journal fills, the mirror is invalidated and either changes must be halted or the devices go into unprotected operation. Neither is a desirable operational outcome. A temporary, though unprotected, work-around is to write the on-going journal either to the primary device or into memory.
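
To make the scheme concrete, here is a toy in-memory model of it. Everything in it is an assumption for illustration: the class and method names, the dictionaries standing in for devices, the 75% high-water mark and the use of zlib for the compressed journal. It leaves out the timer and "momentum" triggers and only shows the write, sync and recover paths.

```python
import zlib


class JournaledMirror:
    """Toy model: the primary takes every write; the secondary holds
    a snapshot plus a compressed journal of changes since that snapshot."""

    def __init__(self, journal_capacity, high_water=0.75):
        self.primary = {}        # block number -> bytes (live device)
        self.snapshot = {}       # secondary: last synchronised copy
        self.journal = []        # secondary: (block, compressed data) entries
        self.journal_bytes = 0
        self.journal_capacity = journal_capacity
        self.high_water = high_water   # assumed threshold, not from the post
        self.protected = True          # False once the journal has overflowed

    def write(self, block, data):
        self.primary[block] = data
        if not self.protected:
            return               # mirror already invalid; wait for sync()
        entry = zlib.compress(data)
        if self.journal_bytes + len(entry) > self.journal_capacity:
            self.protected = False     # journal full: unprotected operation
            return
        self.journal.append((block, entry))
        self.journal_bytes += len(entry)
        if self.journal_bytes >= self.high_water * self.journal_capacity:
            self.sync()

    def sync(self):
        """Bring the snapshot up to date and empty the journal."""
        if self.protected:
            latest = dict(self.journal)    # later entries replace earlier ones
            for block, entry in latest.items():
                self.snapshot[block] = zlib.decompress(entry)
        else:
            self.snapshot = dict(self.primary)   # overflowed: full copy
        self.journal.clear()
        self.journal_bytes = 0
        self.protected = True

    def recover(self):
        """Rebuild the primary's contents from the secondary alone."""
        if not self.protected:
            raise RuntimeError("journal overflowed; secondary is stale")
        image = dict(self.snapshot)
        for block, entry in self.journal:
            image[block] = zlib.decompress(entry)
        return image


mirror = JournaledMirror(journal_capacity=4096)
for i in range(100):
    mirror.write(i % 8, bytes([i]) * 512)
print(mirror.protected, mirror.recover() == mirror.primary)
```

The point of the sketch is that a block rewritten many times on the primary is written to the snapshot area only once per sync, so the two devices see different write loads while the secondary can still reproduce an exact copy.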

Flash Memory: will filesystems become the CPU bottleneck?

Flash memory with 50+k IO/sec may be too fast for Operating Systems (like Linux) with file-system operations consuming more CPU, even saturating it. They are on the way to becoming the system rate-limiting factor, otherwise known as a bottleneck.

What you can get away with at 20-100 IO/sec, i.e. overhead that consumes 1-2% of CPU, will be a CPU hog at 50k-500k IO/sec, a speed-up of between 500 and 25,000 times.

The effect is the reverse of the way Amdahl speed-up is explained.

Amdahl throughput scaling is usually explained like this:

If your workload has 2 parts (A is single-threaded, B can be parallelised), when you decrease the time taken for 'B' by adding parallel compute-units, the workload becomes dominated by the single-threaded part, 'A'. If you halve the time it takes to run 'B', it doesn't halve the total run time. If 'A' and 'B' parts take equal time (4 units each, total 8), then a 2-times speed up of 'B' (4 units to 2) results in a 25% reduction in run-time (8 units to 6). Speeding 'B' up 4-times is a 37.5% reduction (8 to 5).
This creates a limit to the speed-up possible: If 'B' reduces to 0 units, it still takes the same time to run all the single-threaded parts, 'A'. (4 units here)
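
The same numbers can be reproduced with a few lines of code. This is just the worked example above restated; the function name is mine, and the million-fold speed-up stands in for "B reduces to 0 units".

```python
def run_time(serial_units, parallel_units, speedup):
    """Total run time when only the parallel part is sped up."""
    return serial_units + parallel_units / speedup


baseline = run_time(4, 4, 1)   # 4 units of 'A' plus 4 units of 'B'
for speedup in (2, 4, 1_000_000):
    t = run_time(4, 4, speedup)
    reduction = 100 * (baseline - t) / baseline
    print(f"B sped up {speedup}x: {t:.2f} units, {reduction:.1f}% reduction")
```

However large the speed-up of 'B', the reduction never passes 50% here, because the 4 units of 'A' remain.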

A corollary of this: the rate-of-improvement for each doubling of cost nears zero, if not well chosen.

The filesystem bottleneck is the reverse of this:

If your workload has an in-memory part (X) and wait-for-I/O part (W) both of which consume CPU, if you reduce the I/O wait to zero without reducing the CPU overhead of 'W', then the proportion of useful work done in 'X' decreases. In the limit, the system throughput is constrained by CPU expended on I/O overhead in 'W'.

The faster random I/O of Flash Memory will reduce application execution time, but at the expense of increasing % system CPU time. For a single process, the proportion and total CPU-effort of I/O overhead remains the same. For the whole system, more useful work is being done (it's noticeably "faster"), but because the CPU didn't get faster too, it needs to spend a lot more time on the FileSystem.
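
A back-of-envelope model of this effect, assuming the CPU cost per I/O stays fixed as the device gets faster. The per-I/O figure is an illustrative assumption derived from the "1-2% of CPU at 20-100 IO/sec" range mentioned earlier (I take 1% at 100 IO/sec), not a measurement.

```python
def cpu_cores_for_io(io_per_sec, cpu_seconds_per_io):
    """CPU cores kept busy purely by per-I/O filesystem overhead."""
    return io_per_sec * cpu_seconds_per_io


# Illustrative assumption: 100 IO/sec costs 1% of one CPU,
# i.e. 100 microseconds of CPU time per I/O.
CPU_SECONDS_PER_IO = 0.01 / 100

for rate in (100, 1_000, 10_000, 50_000, 500_000):
    cores = cpu_cores_for_io(rate, CPU_SECONDS_PER_IO)
    print(f"{rate:>7} IO/sec -> {cores * 100:7.0f}% of one CPU")
```

Under that assumption a single CPU is fully consumed by I/O overhead at 10,000 IO/sec, well below what Flash devices can deliver.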

Jim Gray observed that:

CPU's are now mainly idle, i.e. waiting on RAM or I/O.
Level-1 cache is roughly the same speed as the CPU, everything else is much slower and must be waited for.
The time taken to scan a 20Tb disk using random I/O will be measured in days whilst a sequential scan ("streaming") will take hours.
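
A rough sketch of the last observation. The streaming rate, random I/O rate and transfer size below are my own assumed figures for a hypothetical drive, chosen only to show the shape of the gap; they are not taken from Gray.

```python
def scan_hours(capacity_bytes, bytes_per_sec):
    """Hours needed to read the whole device at a given effective rate."""
    return capacity_bytes / bytes_per_sec / 3600


CAPACITY = 20e12            # 20Tb disk, from the observation above
SEQUENTIAL_RATE = 150e6     # assumed streaming rate, bytes/sec
RANDOM_IOPS = 150           # assumed random I/Os per second
IO_SIZE = 64 * 1024         # assumed bytes transferred per random I/O

sequential = scan_hours(CAPACITY, SEQUENTIAL_RATE)
random_io = scan_hours(CAPACITY, RANDOM_IOPS * IO_SIZE)
print(f"sequential scan: {sequential:.0f} hours")
print(f"random scan:     {random_io / 24:.0f} days")
```

Smaller transfers per random I/O widen the gap further, since the effective rate is simply IO/sec times bytes per I/O.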

Reading a "Linux Storage and Filesystem Workshop" (LSF) conference report, I was struck by comments that:

Linux file systems can consume large amounts of CPU doing their work, not just fsck, but handling directories, file metadata, free block chains, inode block chains, block and file checksums, ...

There's a very simple demonstration of this: optical disk (CD-ROM or DVD) performance.

A block-by-block copy (dd) of a CD-ROM at "32x", or approx 3Mb/sec, will copy a full 650Mb in 3-4 minutes. At the 5.5Mb/sec Wikipedia gives for "4x", a 4.7Gb DVD takes about 14 minutes.
Mounting a CD or DVD then doing a file-by-file copy takes 5-10 times as long.
Installing or upgrading system software from the same CD/DVD is usually measured in hours.

The fastest way to upgrade software from CD/DVD is to copy an image with dd to hard-disk, then mount that image. The difference is the random I/O (seek) performance of the underlying media, not the FileSystem. [Haven't tested times or speedup with a fast Flash drive.]

This performance limit may have been something that the original Plan 9 writers knew and understood:

P9 didn't 'format' media for a filesystem: initialised a little and just started writing blocks.
didn't have fsck on client machines, only the fileserver.
the fileserver wrote to three levels of storage: RAM, disk, Optical disk.
RAM and disk were treated as cache, not permanent storage.
Files were pushed to Optical disk daily, creating a daily snapshot of the filesystem at the time. Like Apple's TimeMachine, files that hadn't changed were 'hard-linked' to the new directory tree.
The fileserver had operator activities like backup and restore. The design had no super-user with absolute access rights, so avoided many of the usual admin-related security issues.
Invented 'overlay mounts', managed at user not kernel level, to combine the disparate file-services available and allow users to define their own semantics.

Filesystems have never, until now, focussed on CPU performance, rather the opposite, they've traded CPU and RAM to reduce I/O latency, historically improving system throughput, sometimes by orders-of-magnitude.
Early examples were O/S buffers/caching (e.g. Unix) and the 'elevator algorithm' to optimally reorder writes to match disk characteristics.

This 'burn the CPU' trade-off shows up with fsck as well. An older LSF piece suggested that fsck runs slowly because it doesn't do a single pass of the disk, effectively forced into near worst-case unoptimised random I/O.

On my little Mac Mini with a 300Gb disk, there's 225Gb used. Almost all of which, especially the system files, is unchanging. Most of the writing to disk is "append mode" - music, email, downloads - either blocks-to-a-file or file-to-directory. With transactional Databases, it's a different story.

The filesystem treats the whole disk as if every byte could be changed in the next second - and I pay a penalty for that in complexity and CPU cycles. Seeing my little Mac or an older Linux desktop do a filesystem check after a power fail is disheartening...

I suggest future O/S's will have to contend with:

Flash or SCM with close to RAM performance 'near' the CPU(s) (on the PCI bus, no SCSI controller)
near-infinite disk ("disk is tape", Jim Gray) that you'll only want to access as "seek and stream". It will also take "near infinite" time to scan with random I/O. [another Jim Gray observation]

And what are the new rules for filesystems in this environment?:

two sorts of filesystems that need to interwork:

read/write that needs fsck to properly recover after a failure and
append-only that doesn't need checking once "imaged", like ISO 9660 on optical disks.

"Flash" file-system organised to minimise CPU and RAM use. High performance/low CPU use will become as important as managing "wear" for very fast PCI Flash drives.
'hard disk' filesystem with on-the-fly append/change of media and 'clone disk' rather than 'repair f/sys'.
O/S must seamlessly/transparently:
1. present a single file-tree view of the two f/sys
2. like Virtual Memory, safely and silently migrate data/files from fast to slow storage.

I saw a quote from Ric Wheeler (EMC) from LSF-07 [my formatting]:

the basic contract that storage systems make with the user
is to guarantee that:

the complete set of data will be stored,

bytes are correct and

in order, and

raw capacity is utilized as completely as possible.

I disagree nowadays with his maximal space-utilisation clause for disk. When 2Tb costs $150 (7.5c/Gb) you can afford to waste a little here and there to optimise other factors.
With Flash Memory at $2-$5/Gb, you don't want to go wasting much of that space.

Jim Gray (again!) early on formulated "the 5-minute rule" which needs rethinking, especially with cheap Flash Memory redefining the underlying Engineering factors/ratios. These sorts of explicit engineering trade-off calculations have to be done for the current disruptive changes in technology.

Gray, J., Putzolu, G.R. 1987. The 5-minute rule for trading memory for disk accesses and the 10-byte rule for trading memory for CPU time. SIGMOD Record 16(3): 395-398.
Gray, J., Graefe, G. 1997. The five-minute rule ten years later, and other computer storage rules of thumb. SIGMOD Record 26(4): 63-68.

I think Wheeler's Storage Contract also needs to say something about 'preserving the data written', i.e. the durability and dependability of the storage system.
For how long? With what latency? How to express that? I don't know...

There is also a matter of "storage precision", already catered for with CD's and CD-ROM, Wikipedia states:

The difference between sector size and data content are the header information and the error-correcting codes, that are big for data (high precision required), small for VCD (standard for video) and none for audio. Note that all of these, including audio, still benefit from a lower layer of error correction at a sub-sector level.

Again, I don't know how to express this, how to implement it, or what a good user-interface would be. What is very clear to me is:

Not all data needs to come back bit-perfect, though it is always nice when it does.
Some data we would rather not have, in whole or part, than come back corrupted.
There are many data-dependent ways to achieve Good Enough replay when that's acceptable.

First, the aspects of Durability and Precision need to be defined and refined, then a common File-system interface created and finally, like Virtual Memory, automated and executed without thought or human interaction.

This piece describes FileSystems, not Tabular Databases nor other types of Datastore.
The same disruptive technology problems need to be addressed within these realms.
Of course, it'd be nicer/easier if other Datastores were able to efficiently map to a common interface or representation shared with FileSystems and all the work/decisions happened in Just One Place.

Will that happen in my lifetime? Hmmmm....

Sunday, November 20, 2011

Building a RAID disk array circa 1988

In "A Case for Redundant Arrays of Inexpensive Disks (RAID)" [1988], Patterson et al of University of California Berkeley started a revolution in Disk Storage still going today. Within 3 years, IBM had released the last of their monolithic disk drives, the 3390 Model K, with the line being discontinued and replaced with IBM's own Disk Array.

The 1988 paper has a number of tables where it compares Cost/Capacity, Cost/Performance and Reliability/Performance of IBM Large Drives, large SCSI drives and 3½in SCSI drives.

The prices ($$/MB) cited for the IBM 3380 drives are hard to reconcile with published prices: press releases in Computerworld and IBM Archives for 3380 disks (7.5Gb, 14" platter, 6.5kW) and their controllers suggest $63+/Mb for 'SLED' (Single Large Expensive Disk) rather than the "$18-10" cited in the Patterson paper.

The prices for the 600MB Fujitsu M2316A ("super eagle") [$20-$17] and 100Mb Conner Peripherals CP-3100 [$10-$7] are in-line with historical prices found on the web.

The last table in the 1988 paper lists projected prices for different proposed RAID configurations:

$11-$8 for 100 * CP-3100 [10,000MB] and
$11-$8 for 10 * CP-3100 [1,000MB]

There are no design details given.

In 1994, Chen et al in "RAID: High-Performance, Reliable Secondary Storage" used two widely sold commercial systems as case studies:

NCR 6298 and
StorageTek's Iceberg 9200 Disk Array
1996 Press Release: IBM to exclusively resell Iceberg as RAMAC via Archive.org
1996 RAID product page on Archive.org.

The (low-end) NCR device was more what we'd call a 'hardware RAID controller' now, ranging from 5 to 25 disks. Pricing was $22,000-$102,000. It provided a SCSI interface and didn't buffer. A system diagram was included in the paper.

The StorageTek Iceberg was a high-end device meant for connection to IBM mainframes. It was advertised as starting at 100GB (32 drives) for $1.3M, up to 400Gb for $3.6M, and provided multiple (4-16) IBM ESCON 'channels'.

For the NCR, from InfoWorld 1 Oct 1990, p 19 in Google Books

min config: 5 * 3½in drives, 420MB each.

$22,000 for 1.05Gb storage

Add 20*420Mb to 8.4Gb list $102,000. March 1991.

$4,000/drive + $2,000 controller.

NCR-designed controller chip + SCSI chip

4 RAID implementations: RAID 0,1,3,5.

The StorageTek Iceberg was released in late 1992 with projected shipments of 1,000 units in 1993. It was aimed at replacing IBM 'DASD' (Direct Access Storage Device): exactly the comparison made in the 1988 RAID paper.

The IBM-compatible DASD, which resulted from an investment of $145 million and is technically styled the 9200 disk array subsystem, is priced at $1.3 million for a minimum configuration with 64MB of cache and 100GB of storage capacity provided by 32 Hewlett-Packard 5.25-inch drives.

A maximum configuration, with 512MB of cache and 400GB of storage capacity from 128 disks, will run more than $3.6 million. Those capacity figures include data compression and compaction, which can as much as triple the storage level beyond the actual physical capacity of the subsystem.

Elsewhere in the article more 'flexible pricing' (20-25% discount) is suggested:

with most of the units falling into the 100- to 200GB capacity range, averaging slightly in excess of $1 million apiece.

Whilst no technical reference is easily accessible on-line, more technical details are mentioned in the press release on the 1994 upgrade, the 9220. Chen et al [1994] claim "100,000 lines of code" were written.

More clues come from a feature, "Make Room for DASD" by Kathleen Melymuka (p62) of CIO magazine, 1st June 1992 [accessed via Google Books, no direct link]:

5¼in Hewlett-Packard drives were used. [model number & size not stated]
The "100Gb" may include compaction and compression. [300% claimed later]
(32 drives) "arranged in dual redundancy array of 16 disks each (15+1 spare)
RAID-6 ?
"from the cache, 14 pathways transfer data to and from the disk arrays, and each path can sustain a 5Mbps transfer rate"

The Chen et al paper (pg 175 of ACM Computing Surveys, Vol 26, No 2) gives this information on the Iceberg/9200:

it "implements an extended RAID level-5 and level-6 disk array"
- 16 disks per 'array', 13 usable, 2 Parity (P+Q), 1 hot spare
- "data, parity and Reed-Solomon coding are striped across the 15 active drives of an array"
Maximum of 2 Penguin 'controllers' per unit.
Each controller is an 8-way processor, handling up to 4 'arrays' each, or 150Gb (raw).

Implying 2.3-2.5Gb per drive

The C3010, seemingly the largest HP disk in 1992, was 2.47Gb unformatted and 2Gb formatted (512-byte sectors) [notionally 595-byte unformatted sectors].
The C3010 specs included:

MTBF: 300,000 hrs
Unrecoverable Error Rate (UER): 1 in 10^14 bits transferred
11.5 msec avg seek, (5.5msec rotational latency, 5400RPM)
256Kb cache, 1:1 sector interleave, 1,7 RLL encoding, Reed-Solomon ECC.
max 43W 'fast-wide' option, 36W running.

runs up to 8 'channel programs' and independently transfer on 4 channels (to mainframe).
manages a 64-512Mb battery-backed cache (shared or per controller not stated)
implements on-the-fly compression, cites maximum doubling capacity.

and the dynamic mapping necessary to place variable-sized IBM CKD (count, key, data) blocks onto the fixed blocks used internally.
an extra (local?) 8Mb of non-volatile memory is used to store these tables/maps.

Uses a "Log-Structured File System" so blocks are not written back to the same place on the disk.
Not stated if the SCSI buses are one-per-array or 'orthogonal', i.e. redundancy groups are made up from one disk per 'array'.

Elsewhere, Katz, one of the authors, uses a diagram of a generic RAID system not subject to any "Single Point of Failure":

with dual-controllers and dual channel interfaces.

Controllers cross-connected to each interface.

dual-ported disks connected to both controllers.

This halves the number of unique drives in a system, or doubles the number of SCSI buses/HBA's, but copes with the loss of a controller.

Implying any battery-backed cache (not in diagram) would need to be shared between controllers.

From this, a reasonable guess at aspects of the design is:

HP C3010 drives were used, 2Gb formatted. [Unable to find list prices on-line]

These drives were SCSI-2 (up to 16 devices per bus)
available as single-ended (5MB/sec) or 'fast' differential (10MB/sec) or 'fast-wide' (16-bit, 20MB/sec). At least 'fast' differential, probably 'fast-wide'.

"14 pathways" could mean 14 SCSI buses, one per line of disks, but it doesn't match with the claimed 16 disks per array.

16 SCSI buses with 16 HBA's per controller matches the design.
Allows the claimed 4 arrays of 16 drives per controller (64) and 128 max.
SCSI-2 'fast-wide' allows 16 devices total on a bus, including host initiators. This implies that either more than 16 SCSI

5Mbps transfer rate probably means synchronous SCSI-1 rates of 5MB/sec or asynchronous SCSI-2 'fast-wide'.

It cannot mean the 33.5-42Mbps burst rate of the C3010.
The C3010 achieved transfer rates of 2.5MB/sec asynchronously in 'fast' mode, or 5MB/sec in 'fast-wide' mode.
Only the 'fast-wide' SCSI-2 option supported dual-porting.
The C3010 technical reference states that both powered-on and powered-off disks could be added/removed to/from a SCSI-2 bus without causing a 'glitch'. Hot swapping (failed) drives should've been possible.

RAID-5/6 groups of 15 with 2 parity/check disk overhead, 26Gb usable per array, max 208Gb.
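
The capacity figures follow directly from the array layout guessed at above. A minimal sketch, with a made-up function name and the 2Gb formatted drive size assumed from the C3010 guess:

```python
def usable_gb(arrays, disks_per_array=16, spares=1, parity=2, drive_gb=2):
    """Usable capacity when each array gives up spare and parity drives."""
    data_drives = disks_per_array - spares - parity   # 13 per array
    return arrays * data_drives * drive_gb


print(usable_gb(1))   # one 16-disk array
print(usable_gb(8))   # 2 controllers x 4 arrays each
```

This ignores compression and compaction, which is where the larger "notional" capacities come from.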

RAID redundancy groups are implied to be per (16-disk) 'array' plus one hot-spare.
But 'orthogonal' wiring of redundancy groups was probably used, so how many SCSI buses were needed per controller, in both 1 and 2-Controller configurations?
No two drives in a redundancy group should be connected via the same SCSI HBA, SCSI bus, power-group or cooling-group.
This allows live hardware maintenance or single failures.
How were the SCSI buses organised?
With only 14 devices total per SCSI-2 bus, a max of 7 disks per shared controller was possible.
The only possible configurations that allow in-place upgrades are: 4 or 6 drives per bus.
The 4-drives/bus resolves to "each drive in an array on a separate bus".
For manufacturing reasons, components need standard configurations.
It's reasonable to assume that all disk arrays would be wired identically, internally and with common mass terminations on either side, even to the extent of different connectors (male/female) per side.
This allows simple assembly and expansion, and trivially correct installation of SCSI terminators on a 1-Controller system.
Only separate-bus-per-drive-in-array (max 4-drives/bus) meets these constraints.
SCSI required a 'terminator' at each end of the bus. Typically one end was the host initiator. For dual-host buses, one at each host HBA works.
Max 4-drives per bus results in 16 SCSI buses per Controller (64-disks per side).
'fast-wide' SCSI-2 must have been used to support dual-porting.
The 16 SCSI buses, one per slot in the disk arrays, would've continued across all arrays in a fully populated system.
In a minimum system of 32 drives, there would've been only 2 disks per SCSI bus.

1 or 2 controllers with a shared 64M-512M cache and 8Mb for dynamic mapping.

This would be a high-performance and highly reliable design with a believable $1-2M price for 64 drives (200Gb notional, 150Gb raw):

1 Controller
128Mb RAM
8 ESCON channels
16 SCSI controllers
64 * 2Gb drives as 4*16 arrays, 60 drives active, 52 drive-equivalents after RAID-6 parity.
cabinets, packaging, fans and power-supplies

From the two price-points, can we tease out a little more of the costs [no allowance for ESCON channel cards]:

1 Controller + 32 disks + 64M cache = $1.3M
2 Controllers + 128 disks + 512M cache = $3.6M

As a first approximation, assume that 512M RAM costs half as much as one Controller for a 'balanced' system (so 64M costs 0.0625 of a Controller). Giving us a solvable set of simultaneous equations:

1.0625 Controllers + 32 disks = $1.3M
2.5 Controllers + 128 disks = $3.6M

roughly:

$900,000 / Controller [probably $50,000 high]
$70,000 / 64M cache [probably $50,000 low]
$330,000 / 32 disks ($10k/drive, or $5/MB)
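
For anyone wanting to check the unrounded figures, the two equations can be solved directly. This sketch takes the coefficients exactly as written above (cache expressed in controller-equivalents); the helper name is mine and it uses plain Cramer's rule rather than any library.

```python
def solve_2x2(a1, b1, c1, a2, b2, c2):
    """Solve a1*x + b1*y = c1 and a2*x + b2*y = c2 by Cramer's rule."""
    det = a1 * b2 - a2 * b1
    x = (c1 * b2 - c2 * b1) / det
    y = (a1 * c2 - a2 * c1) / det
    return x, y


# x = cost of one controller, y = cost of one disk.
# Cache is counted in controller-equivalents:
# 512M = 0.5 controller, 64M = 0.0625 controller.
controller, disk = solve_2x2(1.0625, 32, 1_300_000,
                             2.5, 128, 3_600_000)
cache_64m = 0.0625 * controller

print(f"controller: ${controller:,.0f}")
print(f"64M cache:  ${cache_64m:,.0f}")
print(f"32 disks:   ${32 * disk:,.0f}  (${disk:,.0f} per drive)")
```

The exact solution lands close to the rounded figures listed above, with the three components of the minimum configuration summing to $1.3M.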

High-end multi-processor VAX system pricing at the time is in-line with this $900k estimate, but more likely an OEM'd RISC processor (MIPS or SPARC?) was used.
This was a specialist, low-volume device: expected 1st year sales volume was ~1,000.
In 1994, they'd reported sales of 775 units when the upgrade (9220) was released.

Another contemporary computer press article cites the StorageTek Array costing $10/Mb compared to $15/MB for IBM DASD. 100Gb @ $10/Mb is $1M, so congruent with the claims above.

How do the real-world products in 1992 compare to the 1988 RAID estimates of Patterson et al?

StorageTek Iceberg: $10/Mb vs $11-$8 projected.

This was achieved using 2Gb 5¼in drives not the 100Mb 3½in drives modelled
HP sold a 1Gb 3½in SCSI-2 drive (C2247) in 1992. This may have formed the basis of the upgrade 9220 ~two years later.

Using the actual, not notional, supplied capacity (243Gb) the Iceberg cost $15/Mb.
The $15/Mb for IBM DASD compares well to the $18-$10 cited in 1988.

But IBM, in those intervening 5 years, had halved the per-Mb price of their drives once or twice. The 1988 "list price" from the archives of ~$60/Mb is reasonable.

In late 1992, 122Mb Conner CP-30104 were advertised for $400, or $3.25/Mb.
These were IDE drives, though a 120Mb version of the SCSI CP-3100 was sold, price unknown.

The 8.4Gb 25-drive NCR 6298 gave $12.15/Mb, again close to the target zone.
From the Dahlin list, 'street prices' for 420Mb drives at the time, were $1600 for Seagate ST-1480A and $1300 for 425Mb Quantum or $3.05-$3.75/Mb.

The price can't be directly compared to either IBM DASD or StorageTek's Iceberg, because the NCR 6298 only provided a SCSI interface, not an IBM 'channel' interface.

The raw storage costs of the StorageTek Iceberg and NCR are roughly 2.5:1.
Not unexpected due to the extra complexity, size and functionality of the Iceberg.

Friday, November 11, 2011

High Density Racking for 2.5" disks

Datacentre space is expensive: one 2010 article puts construction at $1200/sq ft and rental at $600/pa/sq ft for an approx design heat load of 10-50W/sq ft. Google is reputed to be spending $3,000/sq ft building datacentres with many times this heat load.

There are 3 different measures of area:

"gross footprint". The room plus all ancillary equipment and spaces.
"room". The total area of the room. Each rack, with aisles & work-space uses 16-20 sq ft.
"equipment footprint". The floor area directly under computing equipment. 24"x40", ~7sq ft.

Presumably the $600/pa rental cost is for "equipment footprint".

The 2½ inch drive form-factor is (100.5 mm x 69.85 mm x 7-15 mm).
Enterprise drives are typically 12.5 or 15mm thick. Vertically stacked, 20-22 removable 2½ inch drives can be fitted across a rack, taking a 2RU space, with around 15mm of 89mm unused.

2½ inch drives don't fit well in standard 19" server racks (17" wide, by 900-1000mm deep, 1RU = 1.75" tall), especially if you want equal access (e.g. from the front) to all drives without disturbing any drives. Communications racks are typically 600mm deep, but not used for equipment in datacentres.

With cabling, electronics and power distribution, a depth of 150mm (6") should be sufficient to house 2½ inch drives. Power supply units take additional space.

Of the usable space inside a server rack, 17" wide and 1000mm deep, this would leave 850mm of depth wasted.
Mounting front and back, would still leave 700mm wasted, but create significant heat removal problems, especially in a "hot aisle" facility.

The problem reduces to physical arrangements that maximise exposed area (long and thin rectangles vs the close-to-square 19" Rack, if dual sided, with a chimney) or maximise surface area and minimise floor space - a circle or cylinder.

The "long-thin rectangle" arrangement was popular in mainframe days, often as an "X" or a central-spine with many "wings". It assumes that no other equipment will be sited within the working clearances needed to open doors and remove equipment.

A cylinder, or pipe, must contain a central chimney to remove waste heat. There is also a requirement to plan cabling for power and data. Power supplies can be in the plinth as the central cooling void can't be blocked mid-height and extraction fan(s) need to be mounted at the top.

For a 17" diameter pipe, 70-72 disks can be mounted around the circumference allowing 75 mm height per row, 20-24 rows high allowing for a plinth and normal heights. This leaves a central void of around 7" to handle the ~8kW of power of ~1550 drives.

A 19" pipe would allow 75-80 disks per row and a 9" central void to handle ~9.5kW of ~2000 drives.
Fully populated unit weight would be 350-450kg.
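
The drive counts above can be sanity-checked from the circumference. In this sketch the 19mm pitch (a 15mm drive plus carrier and air gap) and the 5W per drive are my assumptions, picked because they roughly reproduce the figures quoted; drives are counted around the outer circumference only.

```python
import math


def drives_per_row(pipe_diameter_in, pitch_mm):
    """Drives that fit edge-on around the outside of a cylinder."""
    circumference_mm = math.pi * pipe_diameter_in * 25.4
    return int(circumference_mm // pitch_mm)


PITCH_MM = 19         # assumed: 15mm drive plus carrier and air gap
WATTS_PER_DRIVE = 5   # assumed: roughly what ~8kW for ~1550 drives implies

for diameter in (17, 19):
    per_row = drives_per_row(diameter, PITCH_MM)
    for rows in (20, 24):
        total = per_row * rows
        print(f'{diameter}" pipe, {rows} rows: {per_row} per row, '
              f"{total} drives, ~{total * WATTS_PER_DRIVE / 1000:.1f} kW")
```

Because the drives sit radially, the spacing at the inner edge is tighter than at the outer face, so a real design would need to check clearance at the inner radius as well.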

Perhaps one in 8 disks could be removed allowing a cable tray, a not unreasonable loss of space.

These "pipes" could be sited in normal racks either at the end of row requiring one free rack-space beside them, or in a row, taking 3 rack-spaces.

As a series of freestanding units, they could be mounted in a hexagonal pattern (the closest-packing arrangement for circles) with minimum OH&S clearances around them, which may be 600-750mm.

This provides 4-5 times the density of drives over the current usual 20 shelves of 24 drives (480) per rack, with better heat extraction. At $4-5,000/pa rental per rack-space (or ~$10/drive), it's a useful saving.

With current drive sizes of 600-1000Gb/drive, most organisations would get by with one unit of 1-2Pb.

Update: A semi-circular variant 40inx40in for installation in 3 Rack-widths might work as well. Requires a door to create the chimney space/central void - and it could vent directly into a "hot aisle".

Papers:
2008: "Cost Model: Dollars per kW plus Dollars per Square Foot of Computer Floor"

2007: "Data center TCO; a comparison of high-density and low-density spaces"

2006: "Total Cost of Ownership Analysis for Data Center Projects"

2006: "Dollars per kW plus Dollars per Square Foot Are a Better Data Center Cost Model than Dollars per Square Foot Alone"

Thursday, November 10, 2011

Questions about SSD / Flash Memory

Seagate, in 2010, quote their SSD UER specs as:
Nonrecoverable read errors, max: 1 LBA per 10^16 bits read
where a Logical Block Address (LBA) is 512 bytes. Usually called a 'sector'.

But we know that Flash memory is organised as 64Kb blocks (min read/write unit).
Are Seagate saying that errors will be random localised cells, not "whole block at a time"?
Of course, the memory controller does Error Correction to pick up the odd dropped bits.
Current RAID schemes are antagonistic to Flash Memory:
The essential problem with NAND Flash (EEPROM) Memory is that it suffers "wear" - after a number of writes and erasures, individual cells (with MLC, cell =/= bit) are no longer programmable. A secondary problem is "Data Retention". With the device powered down, Seagate quote "Data Retention" of 1 year.

Flash Memory wears with writes, and batches of components will likely have very similar wear characteristics. If multiple SSD's are mirrored/RAIDed in a system, they will most likely be from the same batch, so evenly spread RAID writes (RAID-5 writes two physical blocks per logical block) will cause a set of SSD's to suffer correlated wear failures. This is not unlike the management of piston engines in multi-engined aircraft: avoid needing to replace more than one at a time. Faults, Failures and Repair/Install Errors often show up in the first trip, so replacing all engines together maximises the risk of total engine failure.

Not only is this "not ideal", it is exactly worst case for current RAID.
A new Data Protection Scheme is required for SSD's.

Update 23-Dec-2011. Jim Handy in "The SSD Guy" blog [Nov-17, 2011] discusses SSD's and RAID volumes:
So far this all sounds good, but in a RAID configuration this can cause trouble. Here’s why.

RAID is based upon the notion that HDDs fail randomly. When an HDD fails, a technician replaces the failed drive and issues a rebuild command. It is enormously unlikely that another disk will fail during a rebuild. If SSDs replace the HDDs in this system, and if the SSDs all come from the same vendor and from the same manufacturing lot, and if they are all exposed to similar workloads, then they can all be expected to fail at around the same time.

This implies that a RAID that has suffered an SSD failure is very likely to see another failure during a rebuild – a scenario that causes the entire RAID to collapse.

What are the drivers for the on-going reduction in prices of Flash Memory?
Volume? Design? Fabrication method (line width, "high-K", ...)? Chip Density?

The price of SSD's has been roughly halving every 12-18 months for near on a decade, but why?
Understanding the "why" is necessary to be forewarned of any change to the pattern.
How differently are DRAM and EEPROM fabricated?
Why is there about a 5-fold price difference between them?
```
Prices (Kingston from same store, http://msy.com.au, November 2011):
DDR3  6Gb 1333Mhz   $41    $7/Gb
SSD   64Gb          $103   $1.61/Gb
SSD   128Gb         $187   $1.46/Gb
```
It would be nice to know if there was a structural difference or not for designing "balanced systems", or integrating Flash Memory directly into the memory hierarchy, not as a pretend block device.
Main CPU's and O/S's can outperform any embedded hardware controller.
Why do the PCI SSD's not just present a "big block of memory" interface, but insist on running their own controllers?
Hybrid SSD-HDD RAID.
For Linux particularly, is it possible to create multiple partitions per HDD in a set, then use one HDD-partition to mirror a complete SSD? The remaining partitions can be set up as RAID volumes in the normal way.
The read/write characteristics of SSD and HDD are complementary: SSD is blindingly fast for random IO/sec, while HDD's currently stream reads/writes at higher sustained rates.
Specifically, given 1*128Gb SSD and 3*1Tb HDD, create a 128Gb partition on all HDD's. Then mirror (RAID-1) the SSD and one or more of the HDD's (RAID-1 isn't restricted to two copies, but can have as many replicas as desired to increase IO performance or resilience/Data Integrity). Remaining 128Gb HDD partitions can be striped, mirrored or RAIDed amongst themselves or to other drives. The remaining HDD space can be partitioned and RAIDed to suit demand.
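
A minimal sketch of the layout described above, worked through for the 1*128Gb SSD plus 3*1Tb HDD case. The device names (`/dev/md0`, `/dev/sda1`, `/dev/sdb1`) are hypothetical, and the command string is only built, never run. It uses mdadm's `--write-mostly` flag so the md driver prefers reading from the SSD; whether that gives a sensible result in practice is exactly the open question here.

```python
def hybrid_layout(ssd_gb, hdd_gb, hdd_count, hdd_mirror_copies=1):
    """Partition plan: one SSD-sized partition per HDD, some of them mirroring the SSD."""
    if not 1 <= hdd_mirror_copies <= hdd_count:
        raise ValueError("need between 1 and hdd_count HDD mirror copies")
    if ssd_gb >= hdd_gb:
        raise ValueError("each HDD must be larger than the SSD")
    return {
        "mirror_members": 1 + hdd_mirror_copies,            # SSD plus HDD partitions
        "spare_ssd_sized_partitions": hdd_count - hdd_mirror_copies,
        "remaining_gb_per_hdd": hdd_gb - ssd_gb,
        "remaining_gb_total": hdd_count * (hdd_gb - ssd_gb),
    }


def mdadm_mirror_command(md_dev, ssd_part, hdd_parts):
    """Build (not run) an mdadm RAID-1 create command with HDD members write-mostly."""
    members = 1 + len(hdd_parts)
    return " ".join(
        ["mdadm", "--create", md_dev, "--level=1", f"--raid-devices={members}",
         ssd_part, "--write-mostly", *hdd_parts]
    )


plan = hybrid_layout(ssd_gb=128, hdd_gb=1000, hdd_count=3)
print(plan)
# Hypothetical device names: sda1 is the SSD, sdb1 the 128Gb partition on one HDD.
print(mdadm_mirror_command("/dev/md0", "/dev/sda1", ["/dev/sdb1"]))
```

That leaves two spare 128Gb partitions and 872Gb on each HDD for ordinary RAID volumes.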

Does it make sense, both performance- and reliability-wise, to mirror SSD and HDD?
Does the combination yield the best, or worst, of both worlds?
Is the cost/complexity and extra maintenance worth it?

Friday, November 04, 2011

Enterprise Drives: Moving to 2.5" form factor

Update [23-Dec-2011]: In 2009 IDC discussed, in a HP report, the migration to 2.5 inch drives in Enterprise Storage. It started in 2004 and was projected to be complete in 2011.

Elsewhere, a summary of IDC's 2009 Worldwide HDD shipments and revenue:

The transition from 3.5in. to 2.5in. performance-optimized form factor HDDs will be complete by 2012.
Growing interest in new storage delivery models such as storage as a service, or storage in the cloud is likely to put greater storage capacity growth demands on Internet datacenters.
The price per gigabyte of performance-optimized HDD storage will continue to decline at a rate of approximately 25% to 30% per year.
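
To put the 25% to 30% per year decline into the same terms as a "halving time", here is a small sketch. It assumes a constant compound annual decline, which is my simplification, not something IDC states:

```python
import math


def halving_time_years(annual_decline):
    """Years for price/Gb to halve, given a fractional annual decline (0.25 = 25%/yr)."""
    return math.log(0.5) / math.log(1.0 - annual_decline)


def price_after(start_price, annual_decline, years):
    """Price after `years` of compound decline."""
    return start_price * (1.0 - annual_decline) ** years


for decline in (0.25, 0.30):
    print(f"{decline:.0%}/yr -> halves every {halving_time_years(decline):.2f} years")
```

That works out to a halving roughly every 1.9 to 2.4 years for performance-optimized HDD's.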

Types of 'Flash Memory'

"Flash Memory", or EEPROM (Electrically Erasable Programmable Read Only Memory), is at the heart of much of the "silicon revolution" of the last 10 years.

How is it packaged and made available to consumers or system builders/designers?

Mobile appliances are redefining Communications and The Internet, precisely because of the low-power, high-capacity and longevity - and affordable price - of modern Flash Memory.

There are many different Flash Memory configurations: NAND, NOR, SLC, MLC, ...
This piece isn't about those details/differences but about how they are packaged and organised.
What technology does what/has what characteristics is out there on the InterWebs.

Most Flash Memory chips are assembled into different packaging for different uses at different price points. Prices are "Retail Price (tax paid)" from a random survey of Internet sites:

Many appliances use direct-soldered Flash Memory as their primary or sole Persistent Datastore.
The genesis of this was upgradeable BIOS firmware. Late 1990's?
Per-Gb pricing not published: approx. derivable from model price differences.
Commodity 'cards' used in cameras, phones and more: SD-card and friends.
Mini-SD and Micro-SD cards are special cases and attract a price premium.
Some 'high-performance' variants for cameras.
A$2-3/Gb
USB 'flash' or 'thumb drives':
A$2-3/Gb.
High-end camera memory cards: Compact Flash (CF). The oldest mass-use format?
IDE/ATA compatible interface. Disk drive replacement for embedded systems.
Fastest cards are 100MB/sec (0.8Gbps). Max is UDMA ATA, 133MB/sec.
Unpublished Bit Error Rate, Write Endurance, MTBF, Power Cycles, IO/sec.
A$5-$30/Gb
SATA 2.5" SSD (Solid State Drives). Mainly 3Gbps and some 6Gbps interfaces.
MTBF: 1-2M hours,
Service Life: 3-5 years at 25% capacity written/day.
IO/sec: 5,000 - 50,000 IO/sec [max seen: 85k IO/sec]
BER: "1 sector per 10^15-16 bits read"
sustained read/write speed: 70-400MB/sec (read often slowest).

Power: 150-2000mW active, 75-500mW idle
32Gb - 256Gb @ A$1.50-$2.50/Gb.
SATA 1.8" SSD. Internal configuration of some 2.5" SSD's.
Not yet widely available.
SAS (Serial Attached SCSI) 2.5" drives.
not researched. high-performance, premium pricing.
PCI "SSD". PCI card presenting as a Disk Device.
Multiple vendors, usual prices ~A$3-4/Gb. Sizes 256Gb - 1Tb.
"Fusion-io" specs quoted by Dell. Est A$20-25/Gb. [vs ~$5/Gb direct]

640GB (Duo)
NAND Type: MLC (Multi Level Cell)
Read Bandwidth (64kB): 1.0 GB/s
Write Bandwidth (64kB): 1.5 GB/s
Read IOPS (512 Byte): 196,000
Write IOPS (512 Byte): 285,000
Mixed IOPS (75/25 r/w): 138,000
Read Latency (512 Byte): 29 μs
Write Latency (512 Byte): 2 μs
Bus Interface: PCI-Express x4 / x8 or PCI Express 2.0 x4
Mini-PCIe cards, Intel: 40 and 80Gb. A$3/Gb
Intel SSD 310 Series 80GB mini PCIe

* Capacity: 80 GB,
* Components: Intel NAND Flash Memory Multi-Level Cell (MLC) Technology
* Form Factor, mini PCIe, mSATA, Interface,
* Sequential Read - 200 MB/s, Sequential Write - 70 MB/s,
* Latency - Read - 35 µs, Latency - Write - 65 µs,
* Lithography - 34 nm
* Mean Time Between Failures (MTBF) - 1,200,000 Hours

Roughly, prices increase with size and performance.
The highest density chips, or leading edge technology, cost a premium.
As do high-performance or "specialist" formats:

micro-SD cards
CF cards
SAS and PCI SSD's.
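
The SATA SSD service-life figure quoted above (3-5 years at 25% capacity written/day) can be turned into an implied number of full-drive writes. This is only arithmetic on the quoted figure. The 128Gb capacity is an example size I've picked, and real endurance depends on write amplification and the controller, which this ignores:

```python
def full_drive_writes(years, fraction_written_per_day):
    """Number of times the whole capacity is written over the service life."""
    return years * 365 * fraction_written_per_day


def total_written_gb(capacity_gb, years, fraction_written_per_day):
    """Total host writes over the service life, in Gb."""
    return capacity_gb * full_drive_writes(years, fraction_written_per_day)


for years in (3, 5):
    cycles = full_drive_writes(years, 0.25)
    print(f"{years} years: {cycles} full-drive writes, "
          f"{total_written_gb(128, years, 0.25)} Gb on a 128Gb drive")
```

So the quoted service life corresponds to only a few hundred complete overwrites of the drive.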

Wednesday, November 02, 2011

When is a SATA drive not a drive? When it's Compact Flash

The CompactFlash Association has released the new generation of the CF specification, CFast™ [from a piece on 100MB/sec CF cards]

10-15 years ago, CF cards were used almost exclusively in Digital cameras; now they are used only in "high-end" Digital SLR's (DSLR), presumably because of cost/availability compared to alternatives like SDHC cards.

The UDMA based CF standard allows up to 133MB/sec transfer rates.
The new SATA based standard, CFast, allows 3Gbps (~300MB/sec) transfer rates.
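
The "3Gbps (~300MB/sec)" conversion is not a straight divide-by-8: SATA uses 8b/10b encoding, so every data byte costs 10 bits on the wire. A quick sketch of the arithmetic (protocol overhead beyond the encoding is ignored):

```python
def sata_payload_mb_per_sec(line_rate_gbps):
    """Peak payload rate for a SATA link: 10 line bits carry one data byte (8b/10b)."""
    return line_rate_gbps * 1000 / 10


UDMA_MAX_MB_PER_SEC = 133  # the UDMA CF limit quoted above

for gbps in (1.5, 3, 6):
    mb = sata_payload_mb_per_sec(gbps)
    print(f"SATA {gbps}Gbps -> {mb:.0f}MB/sec ({mb / UDMA_MAX_MB_PER_SEC:.1f}x UDMA CF)")
```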

In another context, you'd call this an SSD (Solid State Disk), even a "SATA drive".
There are two problems:

common cameras don't currently support CFast™, the SATA based standard, and
'fast' CF cards are slower than most SSD's and attract a price premium of 5-10 times.

I'm not sure what decision camera manufacturers will make for their Next Generation high-end storage interface. They have 3 obvious directions:

CF-card format (43×36×5 mm), SATA interface, CFast™
1.8inch or 2.5inch SSD, SATA interface
34mm ExpressCard. PCIe interface.

and the less obvious: somehow adapt commodity SDHC cards to high-performance.
Perhaps in a 'pack' operating like RAID.

The "Next Generation Interface" is a $64 billion question for high-end camera manufacturers: the choice will stay with the industry for a very long time, negatively affecting sales if made poorly.

Manufacturers are much better off selecting the same standard (common parts, lower prices for everyone), but need to balance the convenience of "special form factors" with cost. Whilst professional photographers will pay "whatever's needed" for specialist products, their budgets aren't infinite and excessive prices restrict sales to high-end amateurs.

Perhaps the best we'll see is a transition period of dual- or triple-card cameras (SDHC, CF-card and CFast™), with the possibility of an e-SATA connector for "direct drive connection".

Update 04-Nov-2011:
Here's a good overview from the SanDisk site of form-factor candidates to replace the CF card form-factor (43×36×5 mm):

SanDisk® Solid State Drives for the Client
"A variety of form factors, supporting multiple OEM design needs."

Product Name	Interface	Form Factor	Measurements
SanDisk U100 SSD	SATA	2.5"	100.5 mm x 69.85 mm x 7 mm std allows 9.5mm, 12.5mm, 15mm
SanDisk U100 SSD	SATA	Half-Slim SATA	54.00 mm x 39.00 mm x 3.08 mm (8-64GB), x 2.88 mm (128-256GB), Connector 4.00 mm
SanDisk U100 SSD	SATA	mSATA	30.00 mm x 50.95 mm x 3.4 mm (8-64GB), x 3.2 mm (128-256GB)
SanDisk U100 SSD	SATA	mSATA mini	26.80 mm x 30 mm x 2.2 mm (8GB), x 3.2mm (16-128GB)
SanDisk iSSD(TM)	SATA	SATA uSSD	16 mm x 20 mm x 1.20 mm (8 GB-32 GB) x 1.40 mm (64 GB) x 1.85 mm (128 GB)
Standard 1.8 in disk	SATA	1.8"	54 mm x 71 mm x 8 mm
Express Card	PCIe	Express Card	34 mm x 75 mm x 5 mm or 54 mm x 75 mm x 5 mm
Compact Flash	ATA	CF-II	36mm x 43 mm x 5mm

Tuesday, November 01, 2011

Flash Memory vs 15,000 RPM drives

Some I.T. technologies have a limited life or "use window". For example:

In 2001, the largest Compact Flash (CF) card for Digital cameras wasn't flash memory, but a hard disk, the 1Gb IBM "microdrive" ($500 claimed price). After Hitachi acquired the business, they produced 4Gb and 6Gb drives, apparently for sale as late as 2006, with the 4Gb variant used in the now discontinued Apple iPod mini.

Around 2006, the 1" microdrive hard disks were out-competed by flash memory and became yet another defunct technology.

Flash memory storage for servers, either SSD or PCI 'drives', now costs A$2-3/Gb for SATA SSD and $3-4/Gb for PCIe cards [A$3/Gb for Intel mini-PCIe cards].

Currently, Seagate 15k 'Cheetah' drives sell for around $2/Gb, but their 2msec (0.5k IO/sec) performance is no match for the 5KIO/sec of low-end SSD's or the 100-250KIO/sec of PCI flash.
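
The 2msec figure lines up with the average rotational latency of a 15,000 RPM spindle (half a revolution). A sketch of that arithmetic follows. Treating the 2msec as pure rotational latency is my assumption, and seek time is ignored, so these are upper bounds on random IO/sec per drive:

```python
def avg_rotational_latency_ms(rpm):
    """Average wait for the sector to come around: half a revolution, in milliseconds."""
    return 0.5 * 60_000 / rpm


def max_random_iops(service_time_ms):
    """Upper bound on IO/sec for one outstanding request at the given service time."""
    return 1000 / service_time_ms


for rpm in (15_000, 10_000, 7_200):
    latency = avg_rotational_latency_ms(rpm)
    print(f"{rpm} RPM: {latency:.2f} msec -> at most {max_random_iops(latency):.0f} IO/sec")
```

Even with seek time left out, no spindle speed on this list gets within an order of magnitude of the 5KIO/sec of a low-end SSD.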

10,000 RPM 'Enterprise' drives cost less, around $1.50/Gb, whilst 1Tb 7200 RPM (Enterprise) drives come in at $0.25/Gb.

The only usual criterion on which 15,000 RPM drives beat other media is single-drive transfer rate^*.
In an 'Enterprise' RAID environment that is not an advantage unless a) you're not paying full price or b) you have very special requirements or constraints, such as limited space.

I'm wondering if 2010 was the year that 15,000 RPM Enterprise drives joined microdrives in the backlot of obsolete technologies - replaced in part by the same thing, Flash Memory.

Part of the problem stems from the triple whammy for any specialist technology/device:

Overheads are maximised: By definition 'extreme' technologies are beyond "cutting-edge", more "bleeding-edge", meaning research costs are very high.
Per Unit fixed costs are maximised: Sales volumes of specialist or extreme devices are necessarily low, requiring those high research costs to be amortised over just a few sales.
If the technology ever becomes mainstream, it is no longer 'specialist' and research costs are amortised over very large numbers of parts.
Highest Margin per Unit: If you want your vendor to stay in business, they have to make suitable profits, both to encourage 'capital' to remain invested in the business and have enough surplus available to fund the next, more expensive, round of research. Profitable businesses can be Low-Volume/High Margin or High-Volume/Low Margin (or Mid/Mid).
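
The triple whammy can be put into a toy price model. Every number here is invented purely to show the shape of the effect: the research cost, volumes, build cost and margins are hypothetical, not figures for any real drive:

```python
def unit_price(research_cost, units_sold, build_cost, margin_pct):
    """Price per unit: amortised research plus build cost, marked up by margin_pct percent."""
    amortised_research = research_cost / units_sold
    return (amortised_research + build_cost) * (100 + margin_pct) / 100


# Hypothetical: same research bill and build cost, very different volumes and margins.
specialist = unit_price(research_cost=100_000_000, units_sold=100_000,
                        build_cost=50, margin_pct=50)
commodity = unit_price(research_cost=100_000_000, units_sold=10_000_000,
                       build_cost=50, margin_pct=10)

print(f"specialist: ${specialist:.2f}  commodity: ${commodity:.2f}  "
      f"ratio: {specialist / commodity:.1f}x")
```

With identical build costs, the low-volume part still ends up more than twenty times the price, almost all of it amortised research.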

Specialised or 'extreme performance' products aren't proportionately more expensive, they are necessarily radically more expensive, compounding the problem. When simple alternatives are available to use commodity/mainstream devices (defined as 'least cost per-perf. unit' or highest volume used), then they are adopted, all other things being equal.

^* There are many characteristics of magnetic media that are desirable or necessary for some situations, such as "infinite write cycles". These may be discussed in detail elsewhere.
