Monday, October 31, 2011

RAID, Backups and Recovery

There's an ironclad Law of SysAdmin:
"Nobody ever asks for backups to be done, only ever 'restores'"
Discussing RAID and Reliable Persistent Storage cannot be done without reference to the larger context: Whole System Data Protection and Operations.

"Backups" need more thought and effort than a 1980s-style dump-disk-to-tape. Conversely, as various politicians & business criminals have found to their cost with "stored emails", permanent storage of everything is not ideal either, even if you have the systems, space and budget for media and operations.

"Backups" are more than just 'a second copy of your precious data' (more fully, 'of some indeterminate age'), which RAID alone partially provides.

A rough taxonomy of 'backups' is:
  • Operating System (O/S) software, modulo installed packages.
  • O/S configuration.
  • O/S logs.
  • Middleware (Database, web server, ..) software.
  • Middleware configuration
  • Middleware data.
  • Middleware logs.
  • Application {Software, Configuration, Data, Logs}
There are, minimally, two distinct types of data:
  • Transactional, eg. Database, comprised of a transaction log and the database.
  • Files.
"Snapshots" provide perfect point-in-time recovery for transactional data (the "roll forward log" on one snapshot, older database snapshot on another).
Versioning systems provide perfect event-in-time recovery for textual and other file data.
Whilst filesystems can be journalled, the potential to capture-for-replay every filesystem transaction isn't normally available due to volumes and complexity.

Since the destruction of the World Trade Centre buildings in 2001, every organisation with critical I.T. systems and a need for strong Data Protection/Reliable Persistent Storage understands that a minimum requirement for 'second copy data' is a Remote Duplicate Copy. This implies at least a secure second facility.

Two critical design and operational parameters are:
  • "Delta-T", the difference in timestreams between the live system and Duplicate Copies.
  • "Sufficient Distance" for the second facility to not be affected by designed-for "events".
 "Disaster Recovery" and its Planning is all about managing these and related factors, along with time-to-restoration and time-to-recovery.

Can "Zero Delta-T" and "Zero TTR" systems be built at all? For all datastore sizes? Economically?
Is there a relationship between "Delta-T" and system cost?
Are there better organisations/designs?

Banks and airlines have attempted this for decades. Insisting on completed transactions being committed to Remote Storage adds significant latency to the process. In a single-threaded application, this limits the processing rate. To achieve high-rate applications, (real-time) parallelism must be embraced with all its complexity and problems.
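
The latency argument can be made concrete with a back-of-envelope sketch. The assumptions are mine, not figures from any bank or airline: signals travel through fibre at roughly two-thirds of the speed of light, every commit waits for exactly one round trip to the remote site, and any fixed per-commit overhead is supplied by the caller as the hypothetical parameter overhead_s.

```python
# Rough upper bound on the commit rate of a single-threaded application
# that must wait for a remote site to acknowledge each transaction.

SPEED_OF_LIGHT_KM_S = 299_792.458
FIBRE_FACTOR = 2 / 3  # assumption: signal speed in fibre is about 2/3 of c


def round_trip_seconds(distance_km: float, overhead_s: float = 0.0) -> float:
    """Minimum round-trip time to a site distance_km away, plus fixed overhead."""
    one_way = distance_km / (SPEED_OF_LIGHT_KM_S * FIBRE_FACTOR)
    return 2 * one_way + overhead_s


def max_serial_commits_per_second(distance_km: float, overhead_s: float = 0.0) -> float:
    """Each commit waits one full round trip, so the rate is 1 / round-trip time."""
    return 1.0 / round_trip_seconds(distance_km, overhead_s)


if __name__ == "__main__":
    for km in (10, 100, 1000):
        rate = max_serial_commits_per_second(km, overhead_s=0.001)
        print(f"{km:>5} km: at most {rate:,.0f} commits/s")
```

Under these assumptions a second facility 1000 km away caps a strictly serial application at about 100 commits per second before any disk or software time is counted, which is why high-rate systems are pushed towards parallelism.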

Sunday, October 30, 2011

Revisiting RAID in 2011: Reappraising the 'Manufactured Block Device' Interface.

Reliable Persistent Storage is fundamental to I.T./Computing, especially in the post-PC age of "Invisible Computers". RAID, as large Disk Arrays backed up by tape libraries, has been a classic solution to this problem for around 25 years.

The techniques to construct "high-performance", "perfect" devices with "infinite" data-life from real-world devices with many failure modes, performance limitations and a limited service life vary with cost constraints, available technologies, processor organisation, demand/load and expectations of "perfect" and "infinite".

RAID Disk Arrays are facing at least 4 significant technology challenges or changes in use as I.T./Computing continues to evolve:
  • From SOHO to Enterprise level, removable USB drives are becoming the media of choice for data-exchange, off-site backups and archives as their price/Gb falls to a fraction of that of tape and optical media.
  • M.A.I.D. (Massive Array of Idle Disks) is becoming more widely used for archive services.
  • Flash memory, at A$2-$4/Gb, packaged as SSDs or direct PCI devices, is replacing hard disks (HDDs) for low-latency random-IO applications. eBay, for example, has announced a new 'Pod' design based around flash memory, legitimising the approach for Enterprises and providing good "case studies" for vendors to use in marketing and sales.
  • As Peta- and Exabyte systems are designed and built, it's obvious that the current One-Big-Box model for Enterprise Disk Arrays is insufficient. IBM's General Parallel File System (GPFS) home page notes that large processor arrays are needed to create/consume the I/O load [e.g. 30,000 file creates/sec] the filesystem is designed to provide. The aggregate bandwidth provided is orders of magnitude greater than can be handled by any SAN (Storage Area Network, sometimes called a 'Storage Fabric').
Internet-scale datacentres, consuming 20-30MW, house ~100,000 servers and potentially 200-500,000 disks, notably not as Disk Arrays. Over a decade ago, Google solved its unique scale-up/scale-out problems with whole-system replication and GFS (Google File System): a 21st century variant of the 1980's research work at Berkeley by Patterson et al called "N.O.W." (Network of Workstations). A portion of which was the concept written up in 1988 as the seminal RAID paper: "Redundant Arrays of Inexpensive Disks".

The 1988 RAID paper by Patterson et al and a follow-on paper in 1993/4 "RAID: High-Performance, Reliable Secondary Storage" plus  'NOW' talks/papers contain a very specific vision:
  • RAID arrays of low-end/commodity drives, not even mid-range drives.
  • Including a specific forecast for "1995-200" of 20Mb 1.3" drives being used to create 10Tb arrays (50,000 drives. Inferring a design from a diagram, taking a modest 5-6 racks).
RAID as a concept and product was hugely successful, as noted in the 1994 Berkeley paper, with a whole industry comprising many products and vendors springing up within 5 years. IBM announced the demise of the SLED (Single Large Expensive Disk) with the 3390 Model 9 in 1993 as a direct consequence of RAID arriving.

These were all potent victories, but the commercial reality was, and still is, very different to that initial vision of zillions of PC drives somehow lashed together:
  • Enterprise Disk Arrays are populated with the most expensive, 'high-spec' drives available. This has been the case from the beginning for the most successful and enduring products.
  • The drives are the cheapest element of Enterprise Disk Arrays:
    the supporting infrastructure and software dominate costs.
  • Specialised "Storage Fabrics" (SANs) are needed to connect the One-Big-Box to the many servers within a datacentre. These roughly double costs again.
  • Vendor "maintenance" and upgrade costs, specialist consultants and dedicated admins create the OpEx (Operational Expenditure) costs dwarfing the cost of replacement drives.
  • Large Disk Array organisations are intended to be space-efficient (ie. low overhead (5-10%) for Redundancy), but because of the low-level "block device" interface presented, used and usable filesystem space is often 25% of notional space. Committed but unused space, allocated but unused "logical devices" and backup/recovery buffers and staging areas are typical causes of waste. This has led directly to the need for "de-duplication", in-place compression and more.
  • With these many layers and complexity, Disk Array failure, especially when combined with large database auto-management, has led to spectacular public service-failures, such as the 4-week disruption/suspension of Microsoft's "Sidekick" sold by T-Mobile in 2009. There is nothing publicly documented about the many private failures, disruptions and mass data losses that have occurred. It is against the commercial interests of the only businesses that could collect these statistics, the RAID and database vendors, to do so. Without reliable data, nothing can be analysed or changed.
  • The resultant complexity of RAID systems has created a new industry: Forensic Data Recovery. Retrieving usable data from drives in a failed/destroyed Disk Array is not certain and requires specialist tools and training.
  • To achieve "space efficiency", admins routinely define very large RAID sets (group size 50) versus the maximum group of 20 disks proposed in 1994. These overly large RAID groups, combined with link congestion, have led to two increasing problems:
    • RAID rebuild times for a failed drive have gone from minutes to 5-30 hours. During rebuilds, Array performance is 'degraded' as the common link to drives is congested: all data on all drives in the RAID set needs to be read, competing with "real" load.
    • Single parity RAID is no longer sufficient for a 99+% probability of a rebuild succeeding. RAID 6, with dual parity (each parity calculated differently), has been recommended for over 5 years. These extra parity calculations and writes create further overhead and reduce normal performance more.
  • Whilst intended to provide very high 'reliability', in practice large Disk Arrays and their supporting 'fabrics' are a major source of failures and downtime. The management of complex and brittle systems has proven to exceed the abilities of many paid-professional admins. As noted above, the extent and depth of these problems are unreported. Direct observation and anecdotal evidence suggest that the problems are widespread, on the point of being universal.
  • At the low-end, many admins still prefer internal hardware RAID controllers in servers. As a cost-saving measure, they often also select fixed, not hot-swap, drives. This leads to drive failures not being noticed or corrected for months on end, completely obviating the data-protection offered by RAID. These controllers are relatively unsophisticated, which creates severe problems in the case of power loss: "failed" drive state can be forgotten, as the controller relies on read errors to prompt data reconstruction via parity.
    Instant, irreversible corruption of complete hosted filesystems results: the worst result possible.
  • Low-cost "appliances" have started to appear offering RAID of 2-8 drives. They use standard ethernet and iSCSI or equivalent instead of a dedicated "storage fabric". SOHO and SME's are increasingly using these appliances, both as on-site working storage and for off-site backups and archives. These are usually managed as "Set and Forget", avoiding many of the One-Big-Box failure modes, faults and administration errors.
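
The rebuild-time and single-parity points in the list above can be put into rough numbers. This is a sketch under stated assumptions, not measured data: a 1 TB drive, an effective rebuild rate of 30 MB/s while competing with "real" load, and an unrecoverable read error (URE) rate of 1 in 10^14 bits, with errors treated as independent. Only the group sizes of 50 and 20 come from the text.

```python
import math


def rebuild_hours(drive_bytes: float, rebuild_bytes_per_s: float) -> float:
    """Time to rewrite one replacement drive at a given effective rate."""
    return drive_bytes / rebuild_bytes_per_s / 3600


def p_rebuild_reads_clean(surviving_drives: int, drive_bytes: float,
                          ure_per_bit: float) -> float:
    """Probability that every bit on every surviving drive reads without a URE.

    With single parity, one unreadable sector during the rebuild means lost data.
    Computes (1 - p) ** bits via log1p for numerical stability.
    """
    bits = surviving_drives * drive_bytes * 8
    return math.exp(bits * math.log1p(-ure_per_bit))


if __name__ == "__main__":
    DRIVE = 1e12   # assumption: 1 TB drives
    RATE = 30e6    # assumption: 30 MB/s effective rebuild rate under load
    URE = 1e-14    # assumption: URE rate of 1 per 10^14 bits read

    print(f"rebuild time: {rebuild_hours(DRIVE, RATE):.1f} hours")
    for group in (20, 50):
        p = p_rebuild_reads_clean(group - 1, DRIVE, URE)
        print(f"group of {group}: P(clean single-parity rebuild) = {p:.1%}")
```

With these inputs the rebuild takes a little over 9 hours, and the chance of reading every surviving drive cleanly falls from roughly one in five for a group of 20 to about one in fifty for a group of 50. The exact figures depend entirely on the assumed rates, but the trend is what motivates dual parity.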
Questions to consider and investigate:
  • Reliable Persistent Storage solutions are now needed for a very large range of scale: from 1 disk in a household to multiple copies of the entire Public Internet in both Internet Search companies and Archival services, like "The Wayback Machine". There are massive collections of raw data from experiments, exploration, simulations and even films to be stored/retrieved. Some of this data is relatively insensitive to data-corruption; other data needs to be bit-perfect.
    A "one-size-fits-all" approach is clearly not sufficient, workable or economic.
  • Like Security and encryption, every site, from household, to service provider, to government agencies, requires multiple levels of "Data Protection" (DP). Not every bit needs "perfect protection", nor "infinite datalife", but identifying protection levels may not be easy or obvious.
  • Undetected errors are now certain with the simple RAID "block device" interface.
    Filesystem and object-level error detection techniques, with error-correction hooks back to the Storage system, are required as the stored data-time product and total IO-volume increases (exponentially) beyond the ability of simple CRC/checksum systems to detect all errors.
  • There may be improvements in basic RAID design and functionality that apply across all scales and DP levels, from single disk to Exabyte. Current RAID designs work in only a single dimension, duplication across drives, whilst at least one other dimension is available, and already used in optical media (CD-ROMs): data duplication along tracks. "Lateral" and "Longitudinal" data duplication seem adequate descriptions of the two techniques.
  • Specific problems and weaknesses/vulnerabilities of RAID systems need analysis to understand if improvements can be made within existing frameworks or new techniques are required.
  • Scale and Timing are everything. The 1988 solutions and tradeoffs were perfect and correct at the time. Solutions and approaches created for existing and near-term problems and technologies cannot be expected to be Universal or Optimal for different technologies, or even current technologies for all time.
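
As a small illustration of the file/object-level error detection mentioned in the list above, here is a minimal sketch that records a strong hash per file and later reports any file whose content no longer matches. It is detection only: the "error-correction hooks back to the Storage system" are not modelled, and the function names and the choice of SHA-256 are my assumptions, not part of any RAID product.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """SHA-256 of a file, read in chunks so large files don't fill memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(root: Path) -> Dict[str, str]:
    """Map each file's path (relative to root) to its digest."""
    return {
        str(p.relative_to(root)): file_digest(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify(root: Path, manifest: Dict[str, str]) -> List[str]:
    """Return relative paths that are missing or whose content has changed."""
    bad = []
    for rel, digest in manifest.items():
        p = root / rel
        if not p.is_file() or file_digest(p) != digest:
            bad.append(rel)
    return bad
```

A manifest built when the data is written and re-checked periodically catches silent corruption that the block device itself never reports; the storage layer would then need a second copy or parity to repair what was found.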
Why did the 1988 and 1994 Patterson et al papers get some things so wrong?

Partly because of human perceptual difficulties dealing with super-linear changes (exponential or higher growth exceeds our perceptual systems' abilities), and also because "The Devil is in the Detail".
Like software, a "paper design" can be simple and elegant but radically incomplete. Probing a design's limits, assumptions, constraints and blind-spots requires implementation and real-world exposure. The more real-world events and situations any piece of software is exposed to, the deeper the level of understanding of the problem-field that is possible.

Tuesday, October 25, 2011

My Ancient Computer Project

Preamble: If you're really looking for advice on floppy drives: 8", 5¼" and 3½", try these links:

The good folk at "Device Side Data" sell a USB adaptor for 5¼ inch floppy drives and a range of relevant cables, power-supplies and enclosures.
For about US$100, you can have a working  5¼in USB setup, but it's bring-your-own-drive.
Sourcing a 5¼ inch floppy drive may be tricky: they went out of production 10-15 years ago.
Sourcing 5¼ inch floppy disks, new or 'slightly used' is probably a greater challenge.

Some weeks ago a friend (AF) asked if I could copy a 5¼ inch floppy disk for him, leading to this little adventure...

AF had a 3½ inch floppy that he might have copied everything onto and wanted to check.
Without specialist equipment, I knew I wouldn't be able to recover all potentially readable data, but I offered to do what I could with a standard drive.

In the end, AF decided to try something else, so I  didn't get to do his copy.
But it did make me get my old 386-SX properly set up, networked so I could move files to/from it and with a way to easily copy 5¼ in and 3½ in floppy disks.
Before this, I still booted it occasionally, but the real-time clock battery had died and it had no network card or CD-ROM drive.

This decision led to weeks of farnarkling and some interesting lessons.

While it's unclear if my 386-SX will survive another 2 decades, the software can live on through tools like QEMU, WINE/Crossover and even DOSBOX. So there is some value in recovering the data both on the hard disk and my collection of floppies.
The impetus to recover data from unreadable media, 5¼in floppies, is obvious.
For the readable 3½in floppies, taking a copy now is a good investment of time if I ever want to access the data again: magnetic media does degrade over time. In another 20 years, the coating on those floppies may be flaking off in big lumps.

386-SX Initial Config:
  • purchased late 1991, ~$3250
  • 386-SX CPU, 20MHz (selectable to 8MHz). No FPU. 8MHz ISA bus, 8-slots.
  • Dual floppy drives, 5¼ in [boot] and 3½ in.
  • IDE disk. 80Mb, WD AC280.
  • single parallel port, dual serial ports, one used for mouse, other for modem.
  • 5Mb RAM [max at time]
  • Super VGA: 1024x768 16-col, 640x480 256-col. 512Kb [K-i-l-o not Mb]
  • 33cm display. [fixed scanrate, can be destroyed @ wrong Hz]
  • mini-tower. pre-ATX power-supply (no 'soft' power switch or 'halt')
  • No sound-card.
  • DOS 5.0 and Windows 3.0
Current state-of-play is:
  • 386-SX has DOS 6.20, Win 3.11, Networking, CD-ROM and Zip drive all working.
    Hard-drive cloned/backed up and (5) ZIP disks read and copied. [ZIP drive back in its box]
    Single 3½in floppy as A: drive. BIOS only allows booting from first drive.
  • 5¼ in drive working as 2nd drive in a Linux machine (2001, Celeron 667) with built-in networking.
  • Working through copying all  5¼ in floppies I can find.
    Tally so far [40]: 1 unreadable, 1 with errors.
    Update 04-Nov-2011: 210 floppies read, 40-50 5¼ in: 8 "no data". 150+ 3½in, 5 "no data"
  • Now have a USB 3½ in floppy drive, can read those at leisure on newer machines.
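
For anyone wanting to do the same bulk copying on a Linux machine, here is a minimal sketch of imaging a floppy sector by sector so that one bad sector doesn't abort the whole copy. It is not the tool I used. The device name /dev/fd1 for a second drive and the zero-fill policy are assumptions, and specialist tools do far more (retries, reverse reads).

```python
SECTOR = 512  # floppy sector size in bytes


def image_device(src_path: str, dst_path: str, total_bytes: int,
                 sector: int = SECTOR) -> list:
    """Copy total_bytes from src to dst one sector at a time.

    Sectors that raise an I/O error or come back short are padded with
    zero bytes and their sector numbers returned, so the image keeps its
    layout and the damage is recorded.
    """
    bad = []
    with open(src_path, "rb", buffering=0) as src, open(dst_path, "wb") as dst:
        for n in range(total_bytes // sector):
            try:
                src.seek(n * sector)
                data = src.read(sector)
            except OSError:
                data = b""
            if len(data) != sector:
                bad.append(n)
                data = data.ljust(sector, b"\0")
            dst.write(data)
    return bad


if __name__ == "__main__":
    # Assumption: a 360 KB 5¼ in disk (720 sectors) in the second drive.
    bad_sectors = image_device("/dev/fd1", "disk001.img", 720 * SECTOR)
    print(f"{len(bad_sectors)} unreadable sectors: {bad_sectors}")
```

A 1.2 MB 5¼ in disk is 2400 sectors and a 1.44 MB 3½ in disk is 2880. Keeping the list of bad sectors with each image makes a tally like the one above easy to reproduce later.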
The most important lesson for me came about two weeks in:
  • I was fixated on doing everything on my 1991 vintage 386-SX.
    At one point I was running through the options of replacing the motherboard and the various costs which weren't attractive given I was 'just playing'.
  • Since USB became ubiquitous, finding machines/motherboards with floppy drive controllers is increasingly difficult, which means even embedded boards with FDC's are rare and expensive.
  • Then I realised I already had everything I needed in the 2001 vintage Celeron system I had tucked away.
    It's loaded with Fedora Core 3 (support ended in 2004) with a linux 2.4 kernel. Old, but usable.
  • It was perfect for what I wanted to do.
    It had a floppy drive controller and I could transplant the cable (with  5¼in 'slot' connectors) from the 386-SX to the Celeron 667.
I also learnt a little about floppy drive connectors:
  • 5¼in drives have a 'slot' (card-edge) connector on the drive and a header not unlike a 40-pin IDE connector (34-pin is used).
  • 'Classic' 3½in drives use a socket connector similar to IDE connectors (but 34-pin)
  • The $30 USB 3½in drive I bought uses an incompatible tiny connector (two versions, a plug and a socket with a conversion cable). Previously, I didn't know this variant existed. Noted to save other people from popping open put-together-permanently cases. I didn't care about the warranty, but the case needs to be firmly shut or drive operation is affected.

Before starting any work, I had to back up the original 386-SX disk. My memory was that I'd bought a 30Mb 'RLL' drive (a 20Mb ST-506 drive with a modified controller).
Turns out I really had an 85Mb IDE (now called ATA or PATA), a "WD Caviar® AC280", not only larger, but it would allow me to connect an IDE CD-ROM drive in there as well. My unused hardware pile has any number of CD-ROM drives.

Most importantly, an IDE/ATA drive gave me the option of backing up the drive via an IDE/USB interface... Of which I have a number of versions.

I also had an old system backup of around 30 * 720kb 3½ floppies. DOS 5.0 and Win 3.0.
Using QEMU on a Linux system, I was able to restore this backup to a virtual disk drive.
Shuffling all those disks was painful and slow.

Connecting an early IDE drive to a modern(ish) IDE/USB interface didn't work.
Probably because additional commands were introduced to identify the drive, possibly because this old drive responded to "CHS" (cylinder-head-sector), not LBA (Logical Block Addressing). From the AC280 spec. sheet, the drive electronics did support any reasonable CHS settings, not only the physical layout.

My next preferred method was to connect a second IDE drive as a 'slave' (D: in DOS), fdisk and format it and copy the original drive contents, then connect this drive to a modern system with IDE/USB interface and back it up.

The first 3.5" HDD I tried from my unused pile (1Gb Fujitsu) had errors.
Next drive tried was a 3.5" 4.3Gb. Worked reliably.
In the final config, I replaced that drive with a slower, quieter 2Gb 2.5in drive, cloning it via an IDE/USB interface.

The Phoenix BIOS in the 386-SX is very old. Not only doesn't it support LBA drives, it seemed to limit drives to 1023 cylinders - and 15 heads/63 sectors. Around 470Mb.
A very large fraction of my time was spent fiddling with disk CHS specifications and attempting to get fdisk to ignore the BIOS settings.
[No, I can't update the firmware, the BIOS is pre "Flash-the-BIOS"].
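
The "around 470Mb" figure follows directly from the geometry limits. A short sketch of the arithmetic, assuming the usual 512-byte sectors (the function names are just for illustration):

```python
SECTOR_BYTES = 512  # assumption: standard 512-byte sectors


def chs_capacity_bytes(cylinders: int, heads: int, sectors_per_track: int,
                       sector_bytes: int = SECTOR_BYTES) -> int:
    """Total bytes addressable with a given cylinder/head/sector geometry."""
    return cylinders * heads * sectors_per_track * sector_bytes


def chs_to_lba(c: int, h: int, s: int, heads: int, sectors_per_track: int) -> int:
    """Standard CHS to LBA mapping; cylinders and heads count from 0, sectors from 1."""
    return (c * heads + h) * sectors_per_track + (s - 1)


if __name__ == "__main__":
    size = chs_capacity_bytes(1023, 15, 63)
    print(f"{size} bytes = {size / 2**20:.0f} MiB = {size / 10**6:.0f} MB")
```

For 1023 cylinders, 15 heads and 63 sectors this gives 494,968,320 bytes, which is about 472 when counted in units of 2^20 bytes and about 495 in millions of bytes. Anything on the disk beyond that last CHS address is simply out of reach for a BIOS that only speaks CHS.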

I could setup multiple (large) partitions on the drive with Linux and the IDE/USB interface, but then would run into troubles under DOS and the 386-SX.
I tried 'fdisk' from DOS 5.0, 6.2, 6.22, Win-95 and FreeDos on the 386-SX, but all would only see the 470Mb allowed by the BIOS. Extra partitions would be displayed, but couldn't be changed.

I don't have enough unused 1.44Mb 3½in floppies to back up the entire 85Mb drive, so was very glad my 2nd method worked.

Getting a working ISA-bus (not PCI) network card was simple: I had two in my "unused bits pile".
The one I chose I'd bought new, a Netgear NE2000 clone, but I didn't realise that or find the box (with install floppy) for a while. Relied on Windows NE2000 driver and Internet downloads at first.
The other card didn't have a clear name/identifier, nor did I get a good match from the chip numbers.

This was another 'surprise': how to get any info on installed ISA cards.
I also spent 2 or more days fiddling with the IRQ/DMA settings on the NE2000 card. I'd forgotten the problems that PCI made go away. I ended up with IRQ 5 (COM2) and found an IO address range through trial-and-error. I don't know for sure if it's a clash or not.
I couldn't find a tool that would list for me all the cards + settings in the system.
Norton's "SI" provides everything but DMA ports.
MSD (Microsoft Diagnostics) didn't help either.

Getting the CD-ROM working was a good idea, if a little problematic.
Using standard Linux tools, I was able to create an ISO image of the original DOS/Windows system and also add some additional tools.
The problematic part was creating a disk image with Uppercase filenames. Whilst DOS 6.2 (really MSCDEX) reads the root directory correctly, no files or directories can be read/listed.
Perhaps it is the 'Joliet' option I use that causes this... Had the same difficulty with the FreeDos CD.
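
One way to test the filename theory before burning another disc is to check the source tree against the strictest ISO 9660 naming rules (Level 1), which is the safest target for an old DOS reader. This is a simplified sketch: it only checks names (uppercase A-Z, digits and underscore, 8.3 for files, 8 characters for directories) and says nothing about whether Joliet itself is the culprit.

```python
import re
from pathlib import Path
from typing import Iterator

# ISO 9660 Level 1 (simplified): only A-Z, 0-9 and '_' are allowed;
# files are up to 8 characters plus an optional extension of up to 3,
# directories are up to 8 characters with no extension.
_FILE_RE = re.compile(r"[A-Z0-9_]{1,8}(\.[A-Z0-9_]{1,3})?")
_DIR_RE = re.compile(r"[A-Z0-9_]{1,8}")


def is_level1_file_name(name: str) -> bool:
    return _FILE_RE.fullmatch(name) is not None


def is_level1_dir_name(name: str) -> bool:
    return _DIR_RE.fullmatch(name) is not None


def non_conforming(root: Path) -> Iterator[Path]:
    """Yield every path under root whose own name breaks the Level 1 rules."""
    for p in sorted(root.rglob("*")):
        ok = is_level1_dir_name(p.name) if p.is_dir() else is_level1_file_name(p.name)
        if not ok:
            yield p
```

If non_conforming() returns nothing for the tree and DOS still can't list directories, the names are probably not the problem and the image options are the next thing to vary.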

I managed to get non-DOS booting via 3½in floppies:
Until I tried it, I wasn't aware that the FreeDos boot disk borrows a Linux tool: it uses 'SYSLINUX' as the boot loader and seems to load its own kernel.sys (FreeDos itself is not Linux-based).

It took a good deal of searching to find any Linux that would support:
  • vanilla 386
  • no FPU
  • no RAM disk for under 12Mb.
"Floppyfw" recognised the (single) NE2000-compatible network card, but didn't include a shell.
BG-TLB does include BusyBox, but only supports 'plip' networking over the parallel port (not tried).
So while I have seen a linux shell prompt and been able to mount the DOS/FAT filesystem, I haven't been able to dual-boot the 386-SX or run it as a Linux only system.

Whilst FreeDos could read its install CD-ROM, DOS 6.2 was unable to read files/directories contents (the lowercase name problem above).

Another Linux not-tried was tomsrtbt: "Tom's floppy which has a root filesystem and is also bootable." It advertises itself as "The most GNU/Linux on 1 floppy disk."
Which might have worked, but it formats 3½in floppies at the non-standard 1.7Mb, not 1.44Mb, with the caveat "Doing this may destroy your floppy drive".

One of the 'surprises' I got was being unable to replace the original 3½in floppy drive with a newer drive. The interface and connectors were all the same, but the newer drive wouldn't work in the old system.
Was it me connecting it incorrectly, a faulty drive or something more?
Unable to tell and unwilling to devote a bunch of time testing it.

One of the worst 'surprises' I got was after installing the IOmega (parallel-port) ZIP drive software on the system after a fresh install of Windows 3.11. [Windows 3.11 had decent Networking support and Microsoft still has a good TCP/IP stack available for download.]
I had it all set up, tested and working and foolishly, in Windows, selected "Optimise settings" and the system hung.
Whereafter, the system couldn't see any Comms ports, serial or parallel. Which was very problematic because the 386-SX didn't come with a mouse port (DIN or PS/2). I used a serial port for the mouse.
Windows hung when it booted, leading me to try to revert the ZIP drive install and later to re-install Windows.
There were countless reboots and 6-8 hours later I gave up and went to bed.
I had realised/diagnosed that when the machine booted, the BIOS reported "0 serial ports" and "0 parallel ports". The BIOS setup screen only allowed me to select HDD and floppy drive settings.

First thing in the morning, I powered on the machine and it worked perfectly. Including the original copy of Windows that would hang.
All I can think is that a power-cycle (off/on) cleared the fault, whereas a 'cold' or 'warm' reset (reset switch or ctrl-alt-del) didn't. In the many reboots, I hadn't thought to power-cycle the machine. [A note for myself and others experiencing weirdness on old hardware.]

I also have an old Dell Inspiron 7000, dating from 1999. I got it with a removable ZIP drive, figuring I could do backups and bulk-data transfers using it and the parallel-port ZIP drive. While tinkering around, I disassembled the other ZIP drive. It's an IDE/ATAPI drive, but the connectors are non-standard. I was hoping I'd be able to kludge it to work on another system - but not to be.

One of the BIOS limits, noted above, is that it will only boot from the first floppy drive. When I'd configured the machine, I'd made the 5¼in drive "A:". Part of my reconfig was to move that drive and make the 3½in drive "A:", so it could boot from it. And most "floppy disk images" on the net are 1.44M for 3½in drives.
Then I executed a perfect "rookie mistake": I forgot to make a bootable 3½in system disk before moving the 5¼in drive. And all my system disks were, of course, 5¼in.

The 386-SX is in a "mini-tower" case with very limited space between the back of devices in the drive bays and the motherboard etc. This makes running cables and changing drives quite time consuming. Especially with older connectors that are loose and can be jiggled off. This was part of the reason to move to a 2½in HDD - much more space. I did need to find a spare 2½in-3½in mounting kit first.
I didn't believe the weight of the whole system, let alone just the removable cover. Meaty!

Early on I replaced the on-board/real-time clock battery so the system would remember the time over reboots. More modern systems use 3V lithium batteries (CR2032) for this. This old motherboard used a 4.5V alkaline battery (mounted off-board with Velcro). About a week in, I used 3 AAA alkaline batteries in a modified carrier (and a cannibalised connector) to craft a replacement. A less pretty way is to load 3 batteries in a cut-to-length tube with wire soldered directly to exposed battery ends. Soldering wires directly onto batteries requires some technique. You may need help if you try it. There could be an explosion risk with alkaline batteries becoming overheated (they are marked "do not dispose of in fire"). Research this properly or get help if you choose to do this.

The system, including DOS, quite happily accepts dates of 2011. No problems there...

One of the problems I haven't addressed yet is: "How do I clean the drive heads?"
Back in the day when I was a sometime 'operator' on mainframes, cleaning tape drive heads was part of the ritual. We used Isopropyl alcohol + cotton swabs - because it didn't leave a residue. The swabs would always come away stained with oxide coating from the tape.
For these old drives, I've two reasons to want to be able to clean the heads:
a) these are old drives and may well have an internal dust build-up, and
b) older disks are likely to shed more of their coating than when new.
Some media formulations from the late 1980's are known to suffer problems. I've heard first-hand accounts of the work needed to recover period audio-tape recordings due to this problem.
  • 5¼ inch floppy drives load the heads directly in-line with feed-slot.
    It is possible to get a swab in there, though v. difficult to see what's happening.
  • 3½in floppy drives drop the disk to both lock the disk in-place and load the heads, hence the heads, being offset, aren't accessible for swabbing through the feed slot.
Another thing to investigate...