Sunday, September 15, 2013

Reasonably Trustworthy Messaging (RTM)

PJ of Groklaw got spooked by the Lavabit founder responding to PRISM by saying "if you knew what I did, you wouldn't use the Internet/email".

This is the start of a design for a reasonably trustworthy messaging system, in the same way that PGP was only pretty good privacy.

I'd like to combine 3 tools/concepts on top of the obvious measures.

  • SFTP as a file delivery mechanism.
  • ACSnet (later MHSnet) was a Store & Forward system that separated content and control while passing files to a content-handler on the far end.
  • Bespoke classified message systems contain two useful concepts:
    • Messages have an urgency and separate security classification.
      • these are used in routing & queuing decisions.
    • Every message is tracked & acquitted via multiple sequence numbers:
      • Per-link or channel sequence number
      • end-end sequence numbers
Obvious measures:
  • File splitting
    • Everything is a 2KB or 4KB block
    • I'd prefer an "M of N" redundancy system to allow some data to be lost. 
    • There's a "batch reassembly" file implied by this.
    • SFTP needs to have some means of grouping batched files together.
      • Either a named control file, or
      • a directory name, a sequence number in clear or encrypted.
  • PGP/GPG encryption
    • Encrypt everything
    • Compress first, for text files especially.
      • Ideally, force different coding tables per file.
  • Content-based file naming
    • Files only referred to by content: its hash-sum, SHA-1 or MD5 (old)
    • Queued blocks can be jumbled and transmitted in any order, being reassembled into correct order later.
      • This implies "message headers" contain the hash-keys of contents.
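The file-splitting and content-based naming ideas above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `split_into_blocks` and `reassemble` are hypothetical, SHA-1 is used because the list mentions it, and encryption, compression, padding of the short final block and "M of N" redundancy are all left out.

```python
import hashlib
import random

BLOCK_SIZE = 4096  # the post suggests 2KB or 4KB blocks


def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Split data into fixed-size blocks, each named by its SHA-1 hex digest.

    Returns (header, store): header is the ordered list of block names
    (the "message header"), store maps each name to the block's bytes.
    """
    header = []
    store = {}
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        name = hashlib.sha1(block).hexdigest()
        header.append(name)
        store[name] = block
    return header, store


def reassemble(header, received):
    """Rebuild the original data from blocks received in any order."""
    store = {hashlib.sha1(block).hexdigest(): block for block in received}
    missing = [name for name in header if name not in store]
    if missing:
        raise ValueError(f"missing or corrupted blocks: {missing}")
    return b"".join(store[name] for name in header)


# 10,240 bytes of non-repeating data, so it splits into 3 distinct blocks.
message = b"".join(i.to_bytes(4, "big") for i in range(2560))
header, store = split_into_blocks(message)

jumbled = list(store.values())
random.shuffle(jumbled)          # blocks may be transmitted in any order
restored = reassemble(header, jumbled)
```

Because a block's name is its hash, a corrupted block simply fails to match any name in the header, so the same check covers both integrity and ordering.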
There are old programs, like "fetchmail" (or 'fdm' now), that know how to pull mail from many sources and present to the Mail User Agent in a form it can handle, like the traditional Unix mailbox format.

The ACSnet software was able to accept emails from "sendmail", so a simple SMTP daemon is also needed here to accept outbound messages from Mail Clients, allowing 

An extra layer of obfuscation and aggregation, like Tor or VPNs, makes the task of matching inbound/outbound packets harder.

One of the central notions is that no system, apart from the end-points, ever has the unencrypted files/messages.
  • Messages are encrypted with the public-key of the next-hop destination before sending.
  • Messages queued to a link/channel should be kept as-received (encrypted for current system), only being decrypted before transmission.
  • Small blocks mean encryption can be performed in-memory, without any intermediate results touching a disk. Preventing swapping and VM page-writes may be a challenge, but only one layer is lost.
  • SFTP/SSH use per-session keys to encrypt transfers.
    • With content re-encrypted per hop, there is no common plain-text allowing keys to be compromised.
Solution

That's what I'm aiming for, a low user-intervention system that users can reasonably trust:
  • Messages and files sent, when delivered, can be shown to be intact and complete.
  • No messages/files can be read "in clear" on the wire or on any intermediate system.
    • Messages/files only appear unencrypted on the source and destination system.
    • Simple tools, like USB drives, can be used to transfer files to/from off-line systems.
  • Users can get confirmations of delivery and/or acceptance back from each step along the way.
  • Data is sufficiently obscured so traffic analysis yields minimal useful metadata:
    • Using SFTP in a Store+Forward mode removes the normal headers
    • Any SFTP service can be used as a relay, if the per-hop encryption can't be organised on the server and exposing one layer of encryption is an acceptable risk.
Use Case

Alice and Bob want to exchange Company-Confidential information.
They both have access to the same SFTP server, or preferably a service that also supports re-encryption.
Or they each have access to SFTP servers hosted by co-operating trustworthy providers.
Alice and Bob both have off-line computers that they load/download encrypted files onto.
Alice and Bob both have internet-connected computers with the RTM App, PGP/GPG & SFTP installed, plus the transfer host PGP/GPG private keys.
Alice and Bob have exchanged their email Public PGP/GPG keys, both only have their private email keys on their off-line system.
Alice and Bob have exchanged their internet-host PGP/GPG public keys with their upstream server.

Alice, on her off-line system, writes one or more messages to Bob (and others) from her usual Mail Client and possibly queues some file transfers with another App.

The RTM software creates a directory of encrypted files which Alice then copies to a USB drive.

On her on-line system, Alice loads the USB drive and starts the RTM software that uses SFTP to transfer files to their shared server. The RTM software creates a control file for the batch and encrypts all files with the public key of the next hop.

The next-hop system receives the files and decrypts the control file, queuing blocks for whichever next hop link is to be used and creates & encrypts control files for batches.

After zero or more hops, Bob's SFTP server receives blocks batched for him from Alice and others and queues them for Bob, ready to encrypt them with his transfer host public key.

Bob, in his own time, starts RTM on his transfer system and downloads the blocks queued for him, decrypting them to copy to his USB drive. These blocks were encrypted for Bob by Alice and others using the email public key he shared with them. They cannot be decrypted on the transfer host.

Bob takes his USB drive to his off-line system and starts RTM to upload the encrypted blocks, decrypt them, check hash-keys, reassemble the sent files and then deliver them to the appropriate 'handler': email, file transfer or other specific tool.

Bob can then check email on his off-line machine using his favourite Mail Client. He can choose to save/distribute the decrypted files transferred by whatever means he needs.

Alice and Bob do not have to use off-line systems with air-gaps. If they accept the risk, they can run both the transfer host and "in-clear" system as Virtual Machines in the same system. A private, internal network can be used to transfer encrypted blocks between the two VM's.

If the system hosting the two VM's is compromised, the most an attacker can do is monitor the display.

I haven't discussed the various passwords/pass-phrases that would be needed in operation.
The system should be simple to install and configure and be mostly self-administering.


Monday, February 04, 2013

Storage: New Era Taxonomies

There are 3 distinct consumer facing market segments that must integrate seamlessly:
  • portable/mobile devices: "all flash".
  • Desktop (laptop) and Workstations
  • Servers and associated Storage Arrays.
We're heading into new territory in Storage:
  • "everything" is migrating on-line. Photos, Documents, Videos and messages.
    • but we don't yet have archival-grade digital storage media.
      • Write to a portable drive and data retention is 5 years: probable, 10 years: unlikely. That's a guess on my part, real-life may well be much worse.
    • Currently householders don't understand the problem.
      • Flash drives are not (nearly) permanent or error-free.
      • Most people have yet to experience catastrophic loss of data.
      • "Free" cloud storage may only be worth what you pay for it.
  • Disk Storage (magnetic HDD's) is entering its last factor-10 increase.
    • We should expect 5-10TB/platter for 2.5" drives as an upper bound.
    • Unsurprisingly, the rate of change has gone from "doubling every year" to 35%/year to 14%/year.
    • As engineers approach hard limits, the rate of improvement is slower and side-effects increase.
    • Do we build the first maximum-capacity HDD in 2020 or a bit later?
  • Flash memory is getting larger, cheaper and faster to access, but itself is entering an end-game.
    • but retention is declining, whilst wear issues may have been addressed, at least for now.
    • PCI-attached Flash, the minimum latency config, is set to become standard on workstations and servers.
      • How do we use it and adequately deal with errors?
  • Operating Systems, General Business servers and Internet-scale Datacentres and Storage Services have yet to embrace and support these new devices and constraints. 
David Patterson, author with Gibson & Katz of the 1989 landmark paper on "RAID", noted that every storage layer trades cost-per-byte with throughput/latency.
When a layer is no longer cheaper than a faster layer, consumers discard it. Tapes were once the only high-capacity long-term storage option.

My view of FileSystems and Storage:
  • high-frequency, low-latency: PCI-flash.
  • high-throughput, large-capacity: read/write HDD.
  • Create-Once Read-Maybe snapshot and archival: non-update-in-place HDD.
    • 'Create' not 'Write'-Once. Because latent errors can only be discovered actively, one of the tasks of Archival Storage systems is regularly reading and rewriting all data.
What size HDD will become the norm for read/write and Create-once HDD's?
I suspect 2.5" because:

  • Watts-per-byte are low because aerodynamic drag increases with nearly the fifth power of platter diameter and around the cube of rotational speed.
    • A 3.5" disk has to spin at around 1700rpm to match a 5400rpm 2.5" drive in power/byte, and ~1950rpm to match a 7200rpm 2.5" drive.
    • All drives will use ~2.4 times the power to spin at 7200rpm vs 5400rpm.
    • Four 2.5" drives provide around the same capacity as a single 3.5" drive
      • The area of a 2.5" platter is half that of a 3.5" platter.
      • 2.5" drives are half the thickness of 3.5" drives (25.4mm)
      • 3.5" drives may squeeze 5 platters, 25% better than 2.5".
  • Drives are cheap, but four smaller drives will always be more expensive than a single larger drive.
    • Four sets of heads will always provide:
      • higher aggregate throughput
      • lower-latency
      • more diversity, hence more resilience and recovery options
      • "fewer eggs in one basket". The impact of a failure is limited to a single drive.
    • In raw terms, the cheapest, slowest, most error-prone storage will always be 3.5" drives. But admins build protected storage, not raw.
      • With 4TB 3.5" drives, 6 drives will provide 16TB in a RAID-6 config.
        • Note the lack of hot-spares.
      • With 1TB 2.5" drives, RAID-5 is still viable.
        • 24 drives as two sets of 11 drives + hot-spare, provide 20TB.
        • For protected storage, 3.5" drives only offer at best 3.2 times the density per drive, with many-fold less throughput and worse latency.
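The arithmetic behind several of these bullets can be checked with a short sketch. It assumes the cubic power-versus-rpm scaling stated above, treats usable RAID capacity as data drives times drive size, and uses a hypothetical helper name, `raid_usable_tb`. It does not attempt to reproduce the 3.5" rpm-matching figures.

```python
def raid_usable_tb(drives_per_set, parity_per_set, drive_tb, sets=1):
    """Usable capacity in TB: data drives times drive size, hot-spares excluded."""
    return sets * (drives_per_set - parity_per_set) * drive_tb


# Spin power scales with roughly the cube of rotational speed.
power_ratio = (7200 / 5400) ** 3            # about 2.4 times

# Platter area scales with the square of the diameter.
area_ratio = (2.5 / 3.5) ** 2               # about half

# Six 4TB 3.5" drives in RAID-6 (two parity drives, no hot-spare).
raid6_35 = raid_usable_tb(6, 2, 4.0)

# Twenty-four 1TB 2.5" drives: two RAID-5 sets of 11, plus a hot-spare each.
raid5_25 = raid_usable_tb(11, 1, 1.0, sets=2)

# Protected capacity per physical drive, and the resulting ratio.
per_drive_35 = raid6_35 / 6
per_drive_25 = raid5_25 / 24
density_ratio = per_drive_35 / per_drive_25

print(f"7200 vs 5400 rpm power ratio: {power_ratio:.2f}")
print(f"2.5in vs 3.5in platter area:  {area_ratio:.2f}")
print(f"RAID-6, 6 x 4TB:   {raid6_35:.0f} TB usable")
print(f"RAID-5, 24 x 1TB:  {raid5_25:.0f} TB usable")
print(f"per-drive protected density ratio: {density_ratio:.1f}")
```

Note that the 3.2 figure is per drive. Per unit of volume, with four 2.5" drives fitting in the space of one 3.5" drive, the same numbers favour the smaller drives.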



Here are some of my take-aways from LCA 2013 in Canberra.

* We're moving towards 1-10TB of PCI-Flash or other Storage Class Memory being affordable and should expect it to be a 'normal' part of desktop & server systems. (Fusion IO now 'high-end')

  - Flash isn't that persistent, does fade (is that with power on?).
    - How can that be managed to give decade long storage?
  - PCI-Flash/SCM could be organised as one or all  of these:
     - direct slow-access memory [need a block-oriented write model]
     - fast VM page store
     - File System. Either as
        - 'tmpfs' style separate file system
        - seamlessly integrated & auto-managed, like AAPL's Fusion LVM
     - massive write-through cache (more block-driver?)

  - there was a talk on Checkpoint/Restart in the Kernel, especially for VirtMach, it allows live migration and the potential for live kernel upgrades of some sort...
    - we might start seeing 4,000 days uptime.
    - PCI-Flash/SCM would be the obvious place to store CR's and as source/destination for copies
    - nobody is talking about error-detection and data preservation for this new era: essential to explicitly detect/correct and auto-manage.

  - But handling read-errors and memory corruption wasn't talked about..
    - ECC won't be enough to *detect* let alone correct large block errors.
    - Long up-times mean we'll want H/A CPU's as well to detect compute errors.
      Eg. triplicated CPU paths with result voting.

    - the 'new era' approach to resilience/persistence has been whole-system replication and 'network' (ethernet/LAN) connection, and away from expensive internal replication for H/A.


==> As we require more and more of 'normal' systems, they start to need more and more Real-time and H/A components.

==> For "whole-system" replication, end-end error detection of all storage transfers starts to look necessary. i.e. an MD5 or other checksum generated by the drive or Object store and passed along the chain into PCI-Flash and RAM: and maybe kept for rechecking.

==> With multiple levels of storage with latency between them and very high compute rates in CPU's, we're heading into the same territory that Databases addressed (in the 80's?) with Transactions and ACID.


* Log-structured File Systems are a perfect fit for large Flash-based datastores.
  - but log-structured FS may also be perfect for:
    - write-once data, like long-term archives (eg. git repos)
    - shingled write (no update-in-place) disks, effectively WORM.

==> I think we need an explicit Storage Error Detect/Correct layer between disks and other storage to improve BER from 1 in 10^14 or 10^16 to more like 1 in 10^25 or 10^30. [I need to calculate what numbers are actually needed.] Especially as everything gets stored digitally and people expect digital archives to be like paper and "just work" over many decades.

Thursday, January 17, 2013

Storage: FileSystems, Block/Object Storage and Physical Disk Management in 21st Century Systems

The central social contract filesystem and storage layers have with users is:
  • Don't lose data
  • Make it easy to get data in and out, preferably verifiably correct.
  • Performance is nice, but can never take precedence over preserving data and replaying it correctly.
The approaches and paradigms that worked for Unix in 1970 won't work now. It was a world of 5-10MB drives @ 1-10Mbps, 1MHz CPU's without cache and "off-line" storage was 6250bpi 9-track 0.5in tape (2400' ~120Mb).

Even nearly 20 years later in 1988, the year of the Patterson/Gibson/Katz RAID paper, streaming the full contents of a drive for a rebuild (100MB 5.25" SCSI drives) was ~100 seconds and ~1000 seconds for 1GB 8" Fujitsu Eagle drives preferred by the first Storage Arrays.

What's changed is the relative capacity and speeds of storage devices, the demands of "average users" and some additional layers of storage, like cache and Flash memory.

The old approaches are creaking and becoming more & more complex in attempts to handle performance (rate), volume and size. One "fast" filesystem, ReiserFS, was popular for a time but notorious with users for corrupting disks and losing data. Breaking the contract loses users...

The 10 TB/platter 2.5" drives expected by 2020 will only read 2-3 times faster than current 1TB drives (250-400MB/sec). That's 40,000 seconds to stream the whole drive: 10-12 hours. Increasingly, Jim Gray's millennial observation, "Tape is Dead, Disk is the new Tape" (meaning disks are good at streaming, poor at random I/O), is driving Storage designs. Enterprise Class Storage Arrays cannot compete with Flash memory for random I/O and, to cover increasingly long drive rebuild times (4-150 hours), have adopted slower, more inefficient/complex parity schemes.
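As a rough check of the streaming figures, here is a minimal sketch that divides capacity by sustained transfer rate. It assumes decimal units (1TB = 10^12 bytes, 1MB = 10^6 bytes) and a hypothetical helper name, `stream_seconds`. It ignores seeks, verification passes and any competing I/O, so real rebuilds will be slower.

```python
TB = 10**12
MB = 10**6


def stream_seconds(capacity_bytes, rate_bytes_per_sec):
    """Best-case time to read or write a whole drive end to end."""
    return capacity_bytes / rate_bytes_per_sec


# A 10TB drive at the low and high ends of the quoted 250-400MB/sec range.
slow = stream_seconds(10 * TB, 250 * MB)
fast = stream_seconds(10 * TB, 400 * MB)

print(f"10TB @ 250MB/sec: {slow:,.0f} s = {slow / 3600:.1f} hours")
print(f"10TB @ 400MB/sec: {fast:,.0f} s = {fast / 3600:.1f} hours")
```

The 40,000 second figure corresponds to the 250MB/sec end of the range. At 400MB/sec the same drive streams in about 25,000 seconds, roughly 7 hours.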

We now have chips with 3 levels of cache, soon with on-chip DRAM, on-board DDR3 DRAM, PCIe Flash, SATA/SAS Flash and HDD drives and soon "no update-in-place" Shingled-Write drives.
SCM, Storage Class Memories, like Flash, are hoped to provide the path to higher capacity devices, but to date, there are no obvious commercial technologies.

This is in the context of at least 4 types of compute devices, each with different demands for Storage and data recovery and protections.
  • Mobile: smartphones and tablets. Not usually "content creators" but "viewers". Software from Firmware and vendor App Stores. Auto-sync config and data to "Cloud" or desktop.
  • Laptop/low-end Desktop: Limited "content creation". Restores via vendor products, erratic/random backups and data protection.
  • "Power user" Workstations: Professional platforms for content creation. Dedicated Storage Appliances, with problematic & erratic data protection.
  • Servers:
    • SOHO/SME, small ISP: single servers or small farms. Nil or problematic data protection.
    • SMP servers, business server farms: SAN's + Storage Arrays, H/A, multiple-sites, fail-over, ...
    • Clusters and large arrays: special filesystems, lots of storage, fast networks.
    • Internet-scale Data Centres: purpose built hardware and storage solutions.
On my average Mac desktop there are over 2M files after 5 years, a scale not anticipated by the original Unix Directory-Inode-Links-Blocks design.

It's now possible and feasible for individuals to follow Gordon Bell and digitally record their entire lives. This is more than storing random snaps from smartphones, but creating a usable, accessible store.
In 10M recording seconds per year, individuals can create 100k files/year, 1TB at low data rates (100KB/sec) and view 1-10M files/web-pages.

This load for even the current 1-2B smartphone users (not the 6B cell phone services), whilst potentially being a boon for Network Operators & Storage vendors, requires new services and new approaches.  Especially:
  • Strong User Identification with many roles per individual, for work, interests and personal life.
  • Single Federated views of individual-Identity storage.
  • New Search, Indexing, tagging and annotation tools.
  • Integrated "point-in-time" file browsing and scanning.
  • Internet-scale data de-duplication and peer-peer Storage.
We already have definitive solutions for "point-in-time"  recovery of:
  • Text files via Version Control Systems like SVN, CVS, RCS, ...
  • Relational Databases with full-DB snapshot and "roll-forward" transaction logs,
  • but other important binary data types, {DB's, images, videos, sound, PDF-docs, geo-data, machine control, ...}, aren't born with verifiable digital signatures, nor their own change logs.
Metadata, both system generated like timestamps, Geo-locn and user GUID, and user-supplied data, like tags and text, are as important as the data changes.

Backups and Version Control Systems typically offer 3 sorts of versioning. A combination of these methodologies will be used at various levels:

  • Full Backup. 100% replication of all bits.
  • Incremental: Store only the bits changed since last Incremental.
    • Notionally, the minimum storage required.
    • Slowest to recover: all Incrementals must be applied sequentially, in order.
    • Most prone to error and data loss.
      • If one delta-file is deleted or corrupted, the entire set is useless.
  • Differential: Store all bits changed since last Full Backup.
    • Each differential is larger than the last, potentially up to the size of a Full Backup.
    • Fastest to recover.
    • Simplest to manage
    • Robust against errors and deletions, if the dataset was stored.
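The difference between Incremental and Differential recovery can be shown with a toy model. This sketch assumes a dataset is just a dictionary of named values, ignores deletions, and uses hypothetical helper names (`changes`, `restore_incremental`, `restore_differential`). It is not how any particular backup product works.

```python
def changes(old, new):
    """Entries added or changed going from old to new (deletions ignored)."""
    return {k: v for k, v in new.items() if k not in old or old[k] != v}


def restore_incremental(full, incrementals):
    """Apply every incremental, in order, on top of the full backup."""
    state = dict(full)
    for delta in incrementals:
        state.update(delta)
    return state


def restore_differential(full, differentials):
    """Apply only the most recent differential on top of the full backup."""
    state = dict(full)
    if differentials:
        state.update(differentials[-1])
    return state


# Four daily states of a small dataset; day 0 is the Full Backup.
days = [
    {"a": 1, "b": 1},
    {"a": 2, "b": 1},
    {"a": 2, "b": 3, "c": 1},
    {"a": 4, "b": 3, "c": 1},
]
full = days[0]
incrementals = [changes(days[i - 1], days[i]) for i in range(1, len(days))]
differentials = [changes(full, days[i]) for i in range(1, len(days))]

latest_from_incrementals = restore_incremental(full, incrementals)
latest_from_differentials = restore_differential(full, differentials)

# Lose the middle backup of each kind and try again.
damaged_incremental = restore_incremental(full, [incrementals[0], incrementals[2]])
damaged_differential = restore_differential(full, [differentials[0], differentials[2]])
```

With the middle incremental missing, the restore silently produces the wrong state. With the middle differential missing, the latest state is still recovered, at the cost of each differential carrying more data than the matching incremental.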

Work on non-Relational Databases is occurring, but there are important challenges for relational Databases in providing a continuous-timeline view of storage, more than the current transactional/data-warehouse duality/conversion:
  • limited data storage formats can be supported, "importing and conversion" 
  • indexing of data is a separate activity and stored/accessed differently.
  • Schemas and Database names have to survive changes.
  • Semantics of individual fields are as important as
The Wayback Machine, a.k.a. The Internet Archive, gives us a working model and informs us that people can tolerate a) retrieval delays, b) some datasets unavailable and c) some data loss.

It costs a lot less for "Best Effort" rather than "Guaranteed" storage services, suggesting multiple approaches, cost structures and service offerings in the marketplace. Hopefully consumers won't be inveigled to over-pay or complacently rely on inappropriate low-cost providers.

Will current Consumer Protection laws need to be extended to this area??
If you share data within a group (Family and Friends) and some people don't maintain their part of the archive - losing data for people that rely on them, do current laws apply or will new law be needed?

Will this lead to new businesses of "Archive Auditing"?

There are currently three "drop-dead" problems for these services, ignoring the current "unsupported file format" and "ancient system & run-time" issues:
  • Currently, there is no archival quality digital media.
    Hard Disks, Flash memory and CD/DVD's have limited lifetimes. They cannot be left on a shelf and be expected to work a couple of decades on... Data must be constantly scanned, rebuilt and migrated to new storage systems.
    • Acid-free paper and microforms will store documents for over 100 years.
    • Colour film is still the only archival media for movies and still images.
    • No good magnetic media exist for medium-long term storage of sound recordings.
  • Vendor longevity and professional misconduct or negligence, even systemic corruption.
    • When an Archival Storage Service goes bust, how do the owners of the data recover their data? Not over network links and if the facilities are locked and powered-off by administrators or sheriffs, not physically either.
    • There are around the world, just a few Telcos or Power Utilities that are 100 years old. Can we really expect profitable Storage to start now and last 5 times longer than Google without any commercial upsets? I'd argue "no".
    • Rogue admins and managers are the least of the problem, though they'll exist and cause problems.
    • Expecting ordinary, fallible owners, workers and managers to always resist temptation, bribery and sloth/negligence is more than naive and simplistic. Mistakes will happen, security breaches will occur and ordinary folk doing boring jobs will take shortcuts.
    • Valuable resources will always attract those wishing to steal it. These sorts of facilities must begin by never storing anything of value. Organised crime's only access must be via the individual users' system/device, not in a single, centralised resource.
  • Legal access issues: a whole new area of lucrative International Law awaits us...
    • Who has the right to look at data?
    • Can data "in default" (unpaid fees) be sold? To whom? At what price?
    • Can a Vendor move data from the Jurisdiction of origin, with or without permission?
    • Can Vendors share data across facilities in different Jurisdictions?
    • Can Storage custodians be forced to grant local Law Enforcement Offices access to individual or bulk data?
There are now three distinct views of the filesystem provided in the abstract model for user applications:
  • Current files
  • snapshots
  • archives
The O/S has to provide these services for each of those dimensional slices through the storage:
  • map names (paths) to inodes. Subsumes a "mount device/mount-point" model.
  • inodes (the immutable file, with metadata)
  • datablock link map, which reduces to start/end for contiguous allocation.
  • data blocks and free block list
  • Physical drive management, like LVM.
Systems have to address four different aspects of real-world storage access:
  • availability and connection paths
  • errors and rereads
  • erasures and failures
  • durability and longevity of data sets (protection and archive)
Overlaid on this are 4-5 distinct access patterns, similar to metal-working "temperatures":
  • "white-hot" region: read/write access on-board (RAM and PCIe Flash)
  • "red-hot" region: read/write access to direct-connect updatable HDD's
  • cool region: write once access to Big, Slow HDD's, probably non-update-in-place.
  • "blue" (cold) region: write once, seldom read HDD's. No update-in-place, append-only.
  • "black" (frozen) region: remote and archival storage. Rarely Accessed, Critical when needed.
There is a direct correspondence between different temperature regions and the filesystem abstraction they are providing.
  • Archives are read-only and live only in cool, cold and frozen regions.
  • Snapshots may be in a "red-hot" region, but otherwise in cool and cold regions.
    • Files are only ever moved to Archive from the Snapshot areas.
  • Current files will be migrated from, or cached into, the high-speed read/write regions on demand.
    • The link between Snapshots and Current files is: Snapshot[0] == Current filesystem.
My thesis is that the traditional Unix filesystem and O/S structure of Directories-Inodes-Block_maps-Data_blocks cannot serve all these demands well, but that we already have very good tools to handle them.

Schemes to handle inodes, Block_Maps & Linking and Block access for each "temperature" storage can be designed well for the specific trade-offs and performance expectations.

The major problem appears to me to be mapping File Names to Inodes:
  • It either requires very high performance and low-latency for the hottest I/O region, or
  • requires very large namespaces for snapshots and archives.
  • Indexes for Current & Snapshot views may be stored in low-latency storage, but the volume of names stored in long-term Archives means they cannot.
Neither of which is well served by the traditional "directory in a block", backed by O/S cache model.
But both are robustly handled by Database systems, albeit differently organised, indexed and tuned.

What is missing in normal systems is:

  • Filesystem or storage layer of "What's Changed?" (Deltas) via md5sums or change messages.
  • Swapping snapshot views between "Delta" and "Full" filesystem views:
    • 'rsync' identifies changed files, but users have to create full filesystem images themselves.
    • Apple's TimeMachine creates a full filesystem image at a point-in-time, but provides no "Delta" interface beyond a single file or directory.
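A "What's Changed?" interface of the kind described above could be as simple as comparing two checksum manifests. The sketch below assumes snapshots are available as dictionaries of path to file contents and uses MD5 because the list mentions md5sums. The names `manifest` and `whats_changed` are hypothetical.

```python
import hashlib


def manifest(files):
    """Map each path to the MD5 hex digest of its contents."""
    return {path: hashlib.md5(data).hexdigest() for path, data in files.items()}


def whats_changed(old, new):
    """Compare two manifests and report the Delta between them."""
    return {
        "added": sorted(p for p in new if p not in old),
        "removed": sorted(p for p in old if p not in new),
        "modified": sorted(p for p in new if p in old and old[p] != new[p]),
    }


snapshot_1 = manifest({
    "notes.txt": b"first draft",
    "photo.jpg": b"\xff\xd8 image bytes",
    "old.log": b"to be deleted",
})
snapshot_0 = manifest({
    "notes.txt": b"second draft",
    "photo.jpg": b"\xff\xd8 image bytes",
    "new.csv": b"a,b,c",
})

delta = whats_changed(snapshot_1, snapshot_0)
```

Only the manifests need to be kept in low-latency storage to answer the question. The file contents themselves can sit in whichever "temperature" region the snapshot lives in.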
There are two implications that fall out of this analysis:

  • Consumers will demand "Open" storage standards allowing them to swap devices, systems and Storage Vendors, not be locked into Proprietary standards, especially single-vendor solutions, and
  • a software solution model based on the Apache web-server or Linux kernel: co-operative Open Source backed by the GNU license. This allows all vendors to avoid license and patent issues, share work, leverage prior work, support and develop common standards, whilst also allowing market-differentiation by offering specific tools or hardware/software combinations.
The current Unix-like approaches of filesystems, O/S supported directory scanning (name to inode mapping), LVM handling {data protection, logical and physical volumes}, independent snapshot/archive facilities, independent hot-plug media and manual setup and operation of Archival stores cannot provide an Identity-keyed Federated Storage & Archive system.

Not all data stores or vendors will provide the same grade of service. Features can be borrowed from:

  • NTP (Network Time Protocol): stratum level of server. Just how good are they?
  • IP Routing: "cost of routes". Preferentially choose the faster, cheaper services.
The main features required in an Identity-keyed Federated Storage & Archive system are:
  • Data access limited by Identity (data privacy as part of "Security")
    • Multiple Identities per user, based on role or use.
    • Multiple Users and Identities per Device.
    • Master Identity access to specified data, for work and families.
  • Automatic implementation of Policies
  • Addition and management of user-managed hot-plug media
  • Automatic integration across all single-Identity devices of local disk, local network storage, peer storage and multiple Vendor services
  • Policies set as targets:
    • Cost
    • Maximum size of store
    • Maximum data recovery time
    • Minimum and Maximum times between recovery points:
      • every minute for the last 36 hours
      • every hour for the last fortnight
      • every day for the last year
      • every week for the last decade
      • every month after that
    • normal performance: access rate, I/O per sec
    • By datatype, Data Resilience and Longevity (Probability Data Loss per period, Maximum data loss event size)
    • Warnings, Alerts and Alarms.
    • Default and specified Data Destruction dates
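The recovery-point schedule in the policy list above maps naturally onto a small lookup table. This is a sketch under assumptions: a fortnight is taken as 14 days, a year as 365 days, a decade as 3650 days and "every month" as 30 days, and the names `RETENTION_TIERS` and `recovery_point_interval` are hypothetical.

```python
from datetime import timedelta

# (maximum age of the data, spacing between recovery points kept at that age)
RETENTION_TIERS = [
    (timedelta(hours=36), timedelta(minutes=1)),   # every minute for 36 hours
    (timedelta(days=14), timedelta(hours=1)),      # every hour for a fortnight
    (timedelta(days=365), timedelta(days=1)),      # every day for a year
    (timedelta(days=3650), timedelta(weeks=1)),    # every week for a decade
]
MONTHLY = timedelta(days=30)                       # every month after that


def recovery_point_interval(age):
    """Return the spacing between recovery points for data of the given age."""
    for max_age, interval in RETENTION_TIERS:
        if age <= max_age:
            return interval
    return MONTHLY


for label, age in [
    ("1 hour", timedelta(hours=1)),
    ("2 days", timedelta(days=2)),
    ("30 days", timedelta(days=30)),
    ("2 years", timedelta(days=730)),
    ("20 years", timedelta(days=7300)),
]:
    print(f"{label:>8} old: keep one recovery point per {recovery_point_interval(age)}")
```

A pruning job would walk the stored snapshots from oldest to newest and discard any that fall closer together than the interval for their age.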

Wednesday, January 02, 2013

Storage: Specifying Data Resilience and Data Protection



In Communications theory, there are two distinct concepts:

  • Errors [you get a signal, but noise has induced one or more symbol errors], and
  • Erasures [you lost the signal for a time or it was swamped by noise]

Erasures are often in "bursts", so techniques are needed to not just recover/correct a small number of symbols, but

This is the theory behind the Reed-Solomon [Galois Field] encoding for CD's and later DVD's.
It uses redundant symbols to recreate the data, needing twice as many symbols to correct errors as to recreate erasures. A [24,28] RS code encodes 24 symbols into 28, with 4 symbols/bytes of redundancy. This can be used to correct up to 2 errors (2*2 symbols used), or 1 error plus 2 erasures, or up to 4 erasures.
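The trade-off between errors and erasures can be written as a single inequality: a Reed-Solomon code with n total symbols and k data symbols can recover a codeword when 2 * errors + erasures <= n - k. The sketch below applies that bound to 24 data symbols encoded into 28, so n - k = 4. The function name `rs_can_correct` is hypothetical, and no actual encoding or decoding is performed.

```python
def rs_can_correct(n, k, errors, erasures):
    """True if a Reed-Solomon (n, k) code can recover from this damage.

    Each error (unknown position) costs two redundant symbols,
    each erasure (known position) costs one.
    """
    return 2 * errors + erasures <= n - k


N, K = 28, 24   # 24 data symbols encoded into 28

for errors, erasures in [(2, 0), (1, 2), (0, 4), (2, 1), (0, 5)]:
    verdict = "ok" if rs_can_correct(N, K, errors, erasures) else "too much damage"
    print(f"{errors} errors + {erasures} erasures: {verdict}")
```

This is why knowing where the damage is matters: the same four redundant symbols repair twice as many erased symbols as corrupted ones.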

The innovation in CD's was applying 2 R-S codes successively, but between them using Cross Interleaving to handle burst errors by spreading a single L1 [24,28] frame across a whole 2352 sector [86?,98]. Only 1 byte of an erased L1 frame would appear in any single L2 sector.

DVD's use a related but different form of combining two R-S codes: Internal/External Parity.

CD-ROM's apply an L3 R-S Product Code on top of the L1&L2 RS codes + CIRC to get more acceptable Bit Error Rates (BER's) of ~10^15, vs 10^9. Data per frame goes down to 2048 bytes (2KB) from 2352 bytes.

With Hard Disks, and Storage in general, the last two big advances were:

  • RAID [1988/9, Patterson, Gibson, Katz]
  • Snapshots [Network Appliance, early 1990's]

RAID-3/4/5 was notionally about catering for erasures caused by the failure of a whole drive or component, such as a controller or cable.
This was done with low overhead by using the computationally cheap and fast XOR operation to calculate a single parity block.
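A minimal sketch of single-parity protection, assuming equal-length blocks held as Python bytes and a hypothetical helper name, `xor_blocks`. It shows the erasure case only: the position of the lost block is known, which is what a failed drive gives you.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte position by byte position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three data "drives" and one parity "drive".
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# Drive 1 fails: rebuild its block from the survivors plus parity.
rebuilt = xor_blocks([data[0], data[2], parity])
```

XOR parity says nothing about which block is wrong if a drive returns bad data without reporting a failure, which is the conflation of errors and erasures noted below.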

But in use, the ability to correct both errors and erasures with parity blocks has been conflated...

RAID-3/4/5 is now generally thought to be about Error Correction, not Failure Protection.

The usual metrics quoted for HDD's & SSD's are:
 - MTBF (~1M hours) or Annualised Failure Rate (AFR) 0.6-0.7%
 - BER (unrecoverable Bit Error Rate) 1 in 10^15
 - Size, Avg seek time, max/sustained transfer rate.

Operational Questions, Drive Reliability:

 - For a fleet, per 1000 drives, average drives fail per year?
    [1 year = ~8700 hrs, = ~8.7M drive-hours/year/1000 drives; at 1M hrs MTBF = ~8.7 drive fails/year]
   Alternatively, AFR: 0.6-0.7% * 1000, = 6-7 drives/1000/year

 - What's the minimum wall-clock time to rebuild a full drive?
    [Size / sustained transfer rate: 4TB @ 150MB/sec write = 7.5Hrs ]

 - what's the likelihood of a drive fail during a rebuild?
    7.5 hrs / 1M hrs = 0.001% [???] per drive.
   - for RAID-set of 10, (7.5/1M)*10 = ~0.01%

 - probability data loss in rebuild (N = 10):
    Transfer * BER: 4TB * 10 drives = 32 * 10^12 bits * 10 = 3.2 * 10^14 bits
     3.2 * 10^14 bits * (1 error per 10^15 bits) = 0.32 expected unrecoverable errors
   = roughly 32% [suggests further protection is needed against data loss]
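The operational questions above are easy to recompute. This sketch uses the quoted figures (1M hour MTBF, 1 in 10^15 BER, 4TB drives, 150MB/sec writes, a set of 10) and assumes decimal units, constant failure rates and independent bit errors. All variable names are illustrative only.

```python
import math

HOURS_PER_YEAR = 8760
MTBF_HOURS = 1_000_000
BITS_PER_ERROR = 10**15          # unrecoverable BER of 1 in 10^15
DRIVE_BYTES = 4 * 10**12         # 4TB
WRITE_RATE = 150 * 10**6         # 150MB/sec sustained
RAID_SET = 10

# Fleet failures: drive-hours per year divided by MTBF.
fails_per_1000 = 1000 * HOURS_PER_YEAR / MTBF_HOURS

# Minimum wall-clock rebuild time for one drive.
rebuild_hours = DRIVE_BYTES / WRITE_RATE / 3600

# Chance of a further drive failure during the rebuild window.
p_fail_one = rebuild_hours / MTBF_HOURS
p_fail_set = 1 - (1 - p_fail_one) ** RAID_SET

# Unrecoverable read errors while transferring the whole set.
bits_read = DRIVE_BYTES * 8 * RAID_SET
expected_errors = bits_read / BITS_PER_ERROR
p_read_error = 1 - math.exp(-expected_errors)

print(f"drive fails per 1000 drives per year: {fails_per_1000:.2f}")
print(f"rebuild time: {rebuild_hours:.1f} hours")
print(f"drive fail during rebuild, per drive: {p_fail_one:.5%}")
print(f"drive fail during rebuild, set of {RAID_SET}: {p_fail_set:.4%}")
print(f"expected unrecoverable errors: {expected_errors:.2f}")
print(f"chance of at least one: {p_read_error:.1%}")
```

The 0.32 figure is an expected count of errors. Converted to the probability of at least one error it is about 27%, which does not change the conclusion that a second drive failure is far less likely to ruin a rebuild than an unrecoverable read error.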

Data Protection questions. I don't know how to address these...

 - If we store data in RAID-6/7 units of 10-drive-equivalents
    with a lifetime of 5 years per set:
  - In a "lifetime" (60 year = 12 sets),
    what's the probability of Data Loss?

 - How many geographically separated replicas do we need to
    store data 100 years?


I think I know how to specify Data Protection: the same way (%) as AFR.

What you have to build for is Mean-Years-Between-Dataloss
and I guess that implies the degree of Dataloss: 1 byte, 1 block (4KB), 1MB?
As well as complete failure of a dataset-copy.

Typical AFR's are 0.4%-0.7%, as quoted by drive manufacturers based on
accelerated testing.

We know from those 2008(?) studies of large cohorts of drives, this is
optimistic by an order of magnitude...

An AFR of 1 in 10^6 results in a 99.99% 100YR-F-R.
(1 - .0000010) ^ 100

AFR of 1 in 10^5 is 99.9% 100YR-FR (CFR? Century Failure Rate)


AFR of 1 in 10^4 is 99.0% CFR.
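The century figures above follow from compounding the annual rate, as in the (1 - AFR) ^ 100 expression. A minimal sketch, assuming the failure probability is the same and independent every year, with a hypothetical function name `survival`:

```python
def survival(afr, years=100):
    """Probability of no failure over the period, given an annual failure rate."""
    return (1 - afr) ** years


for label, afr in [
    ("1 in 10^6", 1e-6),
    ("1 in 10^5", 1e-5),
    ("1 in 10^4", 1e-4),
    ("0.7% (typical drive)", 0.007),
]:
    print(f"AFR {label}: {survival(afr):.4%} chance of surviving 100 years")
```

At a typical single-drive AFR of 0.7% the same formula gives only about a 50% chance of getting through a century, which shows how far a storage system has to improve on its raw components.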

So we have to estimate a few more probabilities:
 - site suffering natural disaster or fire etc.
 - site suffering war damage or intentional attack
 - country or economy crumbling [ every 40-50 yrs a depression ]
 - company surviving (Kodak lasted 100yrs)
 - admins doing their job competently and fully.
 - managers not scamming (selling disks, not providing service)

Are there more??