Thursday, January 17, 2013

Storage: FileSystems, Block/Object Storage and Physical Disk Management in 21st Century Systems

The central social contract that filesystem and storage layers have with users is:
  • Don't lose data
  • Make it easy to get data in and out, preferably verifiably correct.
  • Performance is nice, but can never take precedence over preserving data and replaying it correctly.
The approaches and paradigms that worked for Unix in 1970 won't work now. It was a world of 5-10MB drives @ 1-10Mbps, 1MHz CPUs without cache, and "off-line" storage was 6250bpi 9-track 0.5in tape (2400' ~120Mb).

Even nearly 20 years later in 1988, the year of the Patterson/Gibson/Katz RAID paper, streaming the full contents of a drive for a rebuild took ~100 seconds for 100MB 5.25" SCSI drives and ~1000 seconds for the 1GB 8" Fujitsu Eagle drives preferred by the first Storage Arrays.

What's changed is the relative capacity and speeds of storage devices, the demands of "average users" and some additional layers of storage, like cache and Flash memory.

The old approaches are creaking and becoming more & more complex in attempts to handle performance (rate), volume and size. One "fast" filesystem, ReiserFS, was popular for a time but notorious with users for corrupting disks and losing data. Breaking the contract loses users...

The 10 TB/platter 2.5" drives expected by 2020 will only read 2-3 times faster than current 1TB drives (250-400MB/sec). That's 40,000 seconds to stream the whole drive: 10-12 hours. Increasingly, Jim Gray's millennial observation, "Tape is Dead, Disk is the new Tape" (meaning disks are good at streaming, poor at random I/O), is driving Storage designs. Enterprise Class Storage Arrays cannot compete with Flash memory for random I/O and, to cover increasingly long drive rebuild times (4-150 hours), have adopted slower, more inefficient/complex parity schemes.
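
The streaming times quoted above are simply capacity divided by sustained transfer rate. A minimal sketch of that arithmetic follows; the function name and decimal unit constants are mine, and the 1MB/sec rate for the 1988 drive is only what the ~100 second figure implies, not a measured number.

```python
MB = 10**6
TB = 10**12

def stream_seconds(capacity_bytes, rate_bytes_per_sec):
    """Seconds needed to read a whole drive sequentially at a sustained rate."""
    return capacity_bytes / rate_bytes_per_sec

# 1988: a 100MB drive at an assumed ~1MB/sec sustained rate.
print(stream_seconds(100 * MB, 1 * MB))

# Projected 10TB drive at the low and high ends of the quoted 250-400MB/sec range.
for rate in (250 * MB, 400 * MB):
    seconds = stream_seconds(10 * TB, rate)
    print(f"{rate // MB}MB/sec: {seconds:,.0f} seconds, {seconds / 3600:.1f} hours")
```

Capacity has grown by five orders of magnitude while the time to read a whole drive has grown by more than two, which is the gap that drives the long rebuild times.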

We now have chips with 3 levels of cache, soon with on-chip DRAM, on-board DDR3 DRAM, PCIe Flash, SATA/SAS Flash and HDD drives and soon "no update-in-place" Shingled-Write drives.
SCM, Storage Class Memories, like Flash are hoped to provide the path to higher capacity devices, but to date, there are no obvious commercial technologies.

This is in the context of at least 4 types of compute devices, each with different demands for Storage and data recovery and protections.
  • Mobile: smartphones and tablets. Not usually "content creators" but "viewers". Software from Firmware and vendor App Stores. Auto-sync config and data to "Cloud" or desktop.
  • Laptop/low-end Desktop: Limited "content creation". Restores via vendor products, erratic/random backups and data protection.
  • "Power user" Workstations: Professional platforms for content creation. Dedicated Storage Appliances, with problematic & erratic data protection.
  • Servers:
    • SOHO/SME, small ISP: single servers or small farms. Nil or problematic data protection.
    • SMP servers, business server farms: SAN's + Storage Arrays, H/A, multiple-sites, fail-over, ...
    • Clusters and large arrays: special filesystems, lots of storage, fast networks.
    • Internet-scale Data Centres: purpose built hardware and storage solutions.
On my average Mac desktop there are over 2M files after 5 years, a scale not anticipated by the original Unix Directory-Inode-Links-Blocks design.

It's now possible and feasible for individuals to follow Gordon Bell and digitally record their entire lives. This is more than storing random snaps from smartphones, but creating a usable, accessible store.
In 10M recording seconds per year, individuals can create 100k files/year, 1TB at low data rates (100KB/sec) and view 1-10M files/web-pages.
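
The life-recording figures above can be checked with a few lines of arithmetic. This is only a sketch of the numbers in the text; the constant names are mine, and the per-file averages are derived values, not claims from any study.

```python
RECORDING_SECONDS_PER_YEAR = 10_000_000  # figure from the text
DATA_RATE_BYTES_PER_SEC = 100_000        # 100KB/sec, the "low data rate"
FILES_PER_YEAR = 100_000

# Total volume recorded in a year.
bytes_per_year = RECORDING_SECONDS_PER_YEAR * DATA_RATE_BYTES_PER_SEC

# Derived averages: how big, and how long, a typical file would be.
avg_file_bytes = bytes_per_year // FILES_PER_YEAR
avg_file_seconds = RECORDING_SECONDS_PER_YEAR // FILES_PER_YEAR

print(f"{bytes_per_year / 10**12:.0f}TB per year")
print(f"{avg_file_bytes / 10**6:.0f}MB and {avg_file_seconds} seconds per file")
print(f"{RECORDING_SECONDS_PER_YEAR / 3600:.0f} recording hours per year")
```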

This load for even the current 1-2B smartphone users (not the 6B cell phone services), whilst potentially being a boon for Network Operators & Storage vendors, requires new services and new approaches.  Especially:
  • Strong User Identification with many roles per individual, for work, interests and personal life.
  • Single Federated views of individual-Identity storage.
  • New Search, Indexing, tagging and annotation tools.
  • Integrated "point-in-time" file browsing and scanning.
  • Internet-scale data de-duplication and peer-peer Storage.
We already have definitive solutions for "point-in-time" recovery of:
  • Text files via Version Control Systems like SVN, CVS, RCS, ...
  • Relational Databases with full-DB snapshot and "roll-forward" transaction logs,
  • but other important binary data types, {DB's, images, videos, sound, PDF-docs, geo-data, machine control, ...}, aren't born with verifiable digital signatures, nor their own change logs.
Metadata, both system-generated, like timestamps, geo-location and user GUID, and user-supplied, like tags and text, are as important as the data changes.

Backups and Version Control Systems typically offer 3 sorts of versioning. A combination of these methodologies will be used at various levels:

  • Full Backup. 100% replication of all bits.
  • Incremental: Store only the bits changed since last Incremental.
    • Notionally, the minimum storage required.
    • Slowest to recover: all Incrementals must be applied sequentially, in order.
    • Most prone to error and data loss.
      • If one delta-file is deleted or corrupted, the entire set is useless.
  • Differential: Store all bits changed since last Full Backup.
    • Each differential is larger than the last, potentially up to the size of a Full Backup.
    • Fastest to recover.
    • Simplest to manage
    • Robust against errors and deletions, if the dataset was stored.
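
The recovery trade-offs listed above can be sketched by working out which stored backups a restore has to read. This is a minimal illustration, assuming a schedule uses a single scheme after each Full Backup (all Incrementals or all Differentials, never mixed); `restore_chain` and `usable` are hypothetical names, not part of any backup product.

```python
def restore_chain(kinds, target):
    """Return the indexes of the backups needed to restore backup `target`.

    `kinds` is a chronological list of "full", "incr" or "diff".
    Assumes everything after a Full Backup uses one scheme only.
    """
    if kinds[target] == "full":
        return [target]
    fulls = [i for i in range(target) if kinds[i] == "full"]
    if not fulls:
        raise ValueError("no Full Backup before the target")
    base = fulls[-1]
    if kinds[target] == "diff":
        # A Differential holds everything changed since the Full Backup.
        return [base, target]
    # Incrementals must all be applied, in order, on top of the Full Backup.
    return list(range(base, target + 1))

def usable(chain, lost):
    """A restore works only if none of the backups it needs has been lost."""
    return not set(chain) & set(lost)

incremental_week = ["full"] + ["incr"] * 6
differential_week = ["full"] + ["diff"] * 6

print(restore_chain(incremental_week, 6))   # seven backups to read
print(restore_chain(differential_week, 6))  # two backups to read
print(usable(restore_chain(incremental_week, 6), lost={3}))
print(usable(restore_chain(differential_week, 6), lost={3}))
```

Losing the backup from day 3 breaks every later Incremental restore, while a Differential restore of day 6 never reads it.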

Work on non-Relational Databases is occurring, but there are important challenges for relational Databases in providing a continuous-timeline view of storage, more than the current transactional/data-warehouse duality/conversion:
  • limited data storage formats can be supported, "importing and conversion" 
  • indexing of data is a separate activity and stored/accessed differently.
  • Schemas and Database names have to survive changes.
  • Semantics of individual fields are as important as
The Wayback Machine, a.k.a. The Internet Archive, gives us a working model and informs us that people can tolerate a) retrieval delays, b) some datasets unavailable and c) some data loss.

It costs a lot less for "Best Effort" rather than "Guaranteed" storage services, suggesting multiple approaches, cost structures and service offerings in the marketplace. Hopefully consumers won't be inveigled to over-pay or complacently rely on inappropriate low-cost providers.

Will current Consumer Protection laws need to be extended to this area?
If you share data within a group (Family and Friends) and some people don't maintain their part of the archive - losing data for people that rely on them, do current laws apply or will new law be needed?

Will this lead to new businesses of "Archive Auditing"?

There are currently three "drop-dead" problems for these services, ignoring the current "unsupported file format" and "ancient system & run-time" issues:
  • Currently, there is no archival quality digital media.
    Hard Disks, Flash memory and CD/DVD's have limited lifetimes. They cannot be left on a shelf and be expected to work a couple of decades on... Data must be constantly scanned, rebuilt and migrated to new storage systems.
    • Acid-free paper and microforms will store documents for over 100 years.
    • Colour film is still the only archival media for movies and still images.
    • No good magnetic media exist for medium-long term storage of sound recordings.
  • Vendor longevity and professional misconduct or negligence, even systemic corruption.
    • When an Archival Storage Service goes bust, how do the owners of the data recover their data? Not over network links and if the facilities are locked and powered-off by administrators or sheriffs, not physically either.
    • Around the world, there are just a few Telcos or Power Utilities that are 100 years old. Can we really expect profitable Storage to start now and last 5 times longer than Google without any commercial upsets? I'd argue "no".
    • Rogue admins and managers are the least of the problem, though they'll exist and cause problems.
    • Expecting ordinary, fallible owners, workers and managers to always resist temptation, bribery and sloth/negligence is more than naive and simplistic. Mistakes will happen, security breaches will occur and ordinary folk doing boring jobs will take shortcuts.
    • Valuable resources will always attract those wishing to steal them. These sorts of facilities must begin by never storing anything of value. Organised crime's only access must be via the individual users' system/device, not in a single, centralised resource.
  • Legal access issues: a whole new area of lucrative International Law awaits us...
    • Who has the right to look at data?
    • Can data "in default" (unpaid fees) be sold? To whom? At what price?
    • Can a Vendor move data from the Jurisdiction of origin, with or without permission?
    • Can Vendors share data across facilities in different Jurisdictions?
    • Can Storage custodians be forced to grant local Law Enforcement Officers access to individual or bulk data?
There are now three distinct views of the filesystem provided in the abstract model for user applications:
  • Current files
  • snapshots
  • archives
The O/S has to provide these services for each of those dimensional slices through the storage:
  • map names (paths) to inodes. Subsumes a "mount device/mount-point" model.
  • inodes (the immutable file, with metadata)
  • datablock link map, which reduces to start/end for contiguous allocation.
  • data blocks and free block list
  • Physical drive management, like LVM.
Systems have to address four different aspects of real-world storage access:
  • availability and connection paths
  • errors and rereads
  • erasures and failures
  • durability and longevity of data sets (protection and archive)
Overlaid on this are 4-5 distinct access patterns, similar to metal-working "temperatures":
  • "white-hot" region: read/write access on-board (RAM and PCIe Flash)
  • "red-hot" region: read/write access to direct-connect updatable HDD's
  • cool region: write once access to Big, Slow HDD's, probably non-update-in-place.
  • "blue" (cold) region: write once, seldom read HDD's. No update-in-place, append-only.
  • "black" (frozen) region: remote and archival storage. Rarely Accessed, Critical when needed.
There is a direct correspondence between different temperature regions and the filesystem abstraction they are providing.
  • Archives are read-only and live only in cool, cold and frozen regions.
  • Snapshots may be in a "red-hot" region, but otherwise in cool and cold regions.
    • Files are only ever moved to Archive from the Snapshot areas.
  • Current files will be migrated from, or cached into, the high-speed read/write regions on demand.
    • The link between Snapshots and Current files is: Snapshot[0] == Current filesystem.
My thesis is that the traditional Unix filesystem and O/S structure of Directories-Inodes-Block_maps-Data_blocks cannot serve all these demands well, but that we already have very good tools to handle them.

Schemes to handle inodes, Block_Maps & Linking and Block access for each "temperature" storage can be designed well for the specific trade-offs and performance expectations.

The major problem appears to me to be mapping File Names to Inodes:
  • It either requires very high performance and low-latency for the hottest I/O region, or
  • requires very large namespaces for snapshots and archives.
  • Indexes for Current & Snapshot views may be stored in low-latency storage, but the volume of names stored in long-term Archives means they cannot.
Neither is well served by the traditional "directory in a block" model, backed by the O/S cache.
But both are robustly handled by Database systems, albeit differently organised, indexed and tuned.
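
As a rough illustration of handing the name-to-inode mapping to a Database, here is a minimal sketch using SQLite from the Python standard library. The table layout, the function names and the example path and inode numbers are all my own assumptions, with snapshot 0 standing for the Current filesystem as described earlier; a real system would organise, index and tune the hot and archive namespaces quite differently.

```python
import sqlite3

def open_name_index(db_path=":memory:"):
    """Open (or create) a name index: one row per (snapshot, path) binding."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS names ("
        " snapshot INTEGER NOT NULL,"   # 0 = Current filesystem (assumption)
        " path TEXT NOT NULL,"
        " inode INTEGER NOT NULL,"
        " PRIMARY KEY (snapshot, path))"
    )
    return conn

def bind(conn, snapshot, path, inode):
    """Record that `path` names `inode` in the given snapshot."""
    conn.execute(
        "INSERT OR REPLACE INTO names VALUES (?, ?, ?)", (snapshot, path, inode)
    )

def lookup(conn, path, snapshot=0):
    """Map a path to an inode number, or None if the name does not exist."""
    row = conn.execute(
        "SELECT inode FROM names WHERE snapshot = ? AND path = ?",
        (snapshot, path),
    ).fetchone()
    return row[0] if row else None

def history(conn, path):
    """Every (snapshot, inode) binding of one path: a point-in-time view."""
    return conn.execute(
        "SELECT snapshot, inode FROM names WHERE path = ? ORDER BY snapshot",
        (path,),
    ).fetchall()

conn = open_name_index()
bind(conn, 0, "/home/u/notes.txt", 1042)  # current version
bind(conn, 1, "/home/u/notes.txt", 977)   # an older snapshot
print(lookup(conn, "/home/u/notes.txt"))
print(history(conn, "/home/u/notes.txt"))
```

The same query interface serves both the small, latency-sensitive Current namespace and a very large archive namespace; only the storage and indexing behind it need to differ.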

What is missing in normal systems is:

  • A filesystem or storage-layer view of "What's Changed?" (Deltas), via md5sums or change messages.
  • Swapping snapshot views between "Delta" and "Full" filesystem views:
    • 'rsync' identifies changed files, but users have to create full filesystem images themselves.
    • Apple's TimeMachine creates a full filesystem image at a point-in-time, but provides no "Delta" interface beyond a single file or directory.
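
A "What's Changed?" interface built on md5sums could look something like the sketch below. It assumes each Full view is available as a mapping of path to file contents; `manifest` and `delta` are hypothetical names, and the in-memory dictionaries stand in for real snapshot storage.

```python
import hashlib

def manifest(files):
    """Summarise a Full view as {path: md5 hex digest of the contents}."""
    return {path: hashlib.md5(data).hexdigest() for path, data in files.items()}

def delta(old, new):
    """Compare two manifests and report the Delta view between them."""
    return {
        "added": sorted(p for p in new if p not in old),
        "removed": sorted(p for p in old if p not in new),
        "changed": sorted(p for p in new if p in old and new[p] != old[p]),
    }

yesterday = manifest({"a.txt": b"one", "b.txt": b"two", "c.txt": b"three"})
today = manifest({"a.txt": b"one", "b.txt": b"TWO", "d.txt": b"four"})
print(delta(yesterday, today))
```

Only the manifests need to be compared, so the Delta view can be produced without reading either Full image again.
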
There are two implications that fall out of this analysis:

  • Consumers will demand "Open" storage standards allowing them to swap devices, systems and Storage Vendors, not be locked into Proprietary standards, especially single-vendor solutions, and
  • a software solution model based on the Apache web-server or Linux kernel: co-operative Open Source backed by the GNU license. This allows all vendors to avoid license and patent issues, share work, leverage prior work, support and develop common standards, whilst also allowing market-differentiation by offering specific tools or hardware/software combinations.
The current Unix-like approaches of filesystems, O/S supported directory scanning (name to inode mapping), LVM handling {data protection, logical and physical volumes}, independent snapshot/archive facilities, independent hot-plug media and manual setup and operation of Archival stores cannot provide an Identity-keyed Federated Storage & Archive system.

Not all data stores or vendors will provide the same grade of service. Features can be borrowed from:

  • NTP (Network Time Protocol): stratum level of server. Just how good are they?
  • IP Routing: "cost of routes". Preferentially choose the faster, cheaper services.
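
One way these two borrowed ideas might combine is sketched below: each store advertises a stratum-like grade and a route-like cost, and the client picks the cheapest store that is good enough. The `Store` fields, the grading scale and the example stores are all hypothetical; no existing protocol defines them for storage.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Store:
    name: str
    stratum: int   # hypothetical grade: 1 is best, larger numbers are weaker
    cost: float    # hypothetical relative cost: lower is preferred

def choose(stores, max_stratum):
    """Pick the cheapest store whose grade is at least as good as required."""
    eligible = [s for s in stores if s.stratum <= max_stratum]
    if not eligible:
        return None
    # Ties on cost go to the better-graded store.
    return min(eligible, key=lambda s: (s.cost, s.stratum))

stores = [
    Store("vendor-guaranteed", stratum=1, cost=10.0),
    Store("vendor-best-effort", stratum=3, cost=2.0),
    Store("peer-storage", stratum=4, cost=0.5),
]
print(choose(stores, max_stratum=3).name)
print(choose(stores, max_stratum=1).name)
```
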
The main features required in an Identity-keyed Federated Storage & Archive system are:
  • Data access limited by Identity (data privacy as part of "Security")
    • Multiple Identities per user, based on role or use.
    • Multiple Users and Identities per Device.
    • Master Identity access to specified data, for work and families.
  • Automatic implementation of Policies
  • Addition and management of user-managed hot-plug media
  • Automatic integration across all single-Identity devices of local disk, local network storage, peer storage and multiple Vendor services
  • Policies set as targets:
    • Cost
    • Maximum size of store
    • Maximum data recovery time
    • Minimum and Maximum times between recovery points:
      • every minute for the last 36 hours
      • every hour for the last fortnight
      • every day for the last year
      • every week for the last decade
      • every month after that
    • normal performance: access rate, I/O per sec
    • By datatype, Data Resilience and Longevity (Probability Data Loss per period, Maximum data loss event size)
    • Warnings, Alerts and Alarms.
    • Default and specified Data Destruction dates
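
The recovery-point schedule in the policy list above can be expressed as a small table of age limits and spacings, which a system could use to thin out older recovery points. This is a sketch under my own assumptions: times are in seconds, a year is taken as 365 days and a month as 30 days, and `spacing_for_age` and `thin` are hypothetical names.

```python
MINUTE = 60
HOUR = 60 * MINUTE
DAY = 24 * HOUR
WEEK = 7 * DAY
MONTH = 30 * DAY    # assumption: a nominal 30-day month
YEAR = 365 * DAY    # assumption: ignores leap years

# (maximum age of a recovery point, required spacing at that age)
TIERS = [
    (36 * HOUR, MINUTE),  # every minute for the last 36 hours
    (14 * DAY, HOUR),     # every hour for the last fortnight
    (YEAR, DAY),          # every day for the last year
    (10 * YEAR, WEEK),    # every week for the last decade
]
FINAL_SPACING = MONTH     # every month after that

def spacing_for_age(age):
    """Spacing between recovery points the policy asks for at a given age."""
    for max_age, spacing in TIERS:
        if age <= max_age:
            return spacing
    return FINAL_SPACING

def thin(points, now):
    """Keep the newest recovery point in each policy-sized time bucket."""
    kept, seen = [], set()
    for t in sorted(points, reverse=True):
        spacing = spacing_for_age(now - t)
        bucket = (spacing, t // spacing)
        if bucket not in seen:
            seen.add(bucket)
            kept.append(t)
    return kept

now = 10 * DAY
# Two hours of per-minute recovery points, all about three days old.
points = [now - 3 * DAY - k * MINUTE for k in range(120)]
print(len(points), "points thinned to", len(thin(points, now)))
```

Recovery points three days old fall in the hourly tier, so the per-minute points collapse to one per clock hour they touch.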
