Monday, October 31, 2011

RAID, Backups and Recovery

There's an ironclad Law of SysAdmin:
"Nobody ever asks for backups to be done, only ever 'restores'"
Discussing RAID and Reliable Persistent Storage cannot be done without reference to the larger context: Whole System Data Protection and Operations.

"Backups" need more thought and effort than a 1980-style dump-disk-to-tape. Conversely, as various politicians & business criminals have found to their cost with "stored emails", permanent storage of everything is not ideal either, even if you have the systems, space and budget for media and operations.

"Backups" are more than just 'a second copy of your precious data' (more fully, 'of some indeterminate age'), which RAID alone partially provide.

A rough taxonomy of 'backups' is (sketched in code after the list):
  • Operating System (O/S) software, modulo installed packages.
  • O/S configuration.
  • O/S logs.
  • Middleware (Database, web server, ..) software.
  • Middleware configuration.
  • Middleware data.
  • Middleware logs.
  • Application {Software, Configuration, Data, Logs}
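
To make that taxonomy concrete, here is a minimal sketch of it as a backup manifest, in Python. Every layer name, path and method below is an illustrative assumption, not any real tool's schema; the point is that each category carries its own backup method and its own restore order.

    # Illustrative backup manifest built from the taxonomy above.
    # All paths and method names are assumptions for this sketch.
    BACKUP_MANIFEST = {
        "os": {
            "software":      {"method": "package-list", "paths": ["/var/lib/dpkg"]},
            "configuration": {"method": "file-copy",    "paths": ["/etc"]},
            "logs":          {"method": "file-copy",    "paths": ["/var/log"]},
        },
        "middleware": {
            "software":      {"method": "package-list", "paths": []},
            "configuration": {"method": "file-copy",    "paths": ["/etc/postgresql"]},
            "data":          {"method": "dump",         "paths": ["/var/lib/postgresql"]},
            "logs":          {"method": "file-copy",    "paths": ["/var/log/postgresql"]},
        },
        "application": {
            "software":      {"method": "file-copy",    "paths": ["/opt/app"]},
            "configuration": {"method": "file-copy",    "paths": ["/opt/app/etc"]},
            "data":          {"method": "dump",         "paths": ["/opt/app/data"]},
            "logs":          {"method": "file-copy",    "paths": ["/opt/app/log"]},
        },
    }

    def restore_plan(manifest):
        """A restore rebuilds the stack bottom-up:
        O/S first, then middleware, then application."""
        for layer in ("os", "middleware", "application"):
            for part, spec in manifest.get(layer, {}).items():
                yield layer, part, spec["method"], spec["paths"]

The useful property is that 'restore the database' and 'restore /etc' become different operations on different schedules, which a single 1980-style dump-disk-to-tape job conflates.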
There are, minimally, two distinct types of data:
  • Transactional, e.g. a database, comprising a transaction log and the database proper.
  • Files.
"Snapshots" provide perfect point-in-time recovery for transactional data (the "roll forward log" on one snapshot, older database snapshot on another).
Versioning systems provide perfect event-in-time recovery for textual and other file data.
Whilst filesystems can be journalled, capturing every filesystem transaction for later replay isn't normally practical because of the data volumes and complexity involved.
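
A sketch of the transactional case: point-in-time recovery means restoring the newest snapshot at or before the target time, then rolling the transaction log forward to that instant. The record formats below are assumptions for the illustration, not any particular database's layout.

    # Point-in-time recovery, sketched: restore the newest snapshot at or
    # before the target time, then roll the transaction log forward.
    def recover_to(target_time, snapshots, txn_log):
        """snapshots: list of (time, state dict); txn_log: list of (time, key, value)."""
        # 1. Newest snapshot at or before the target time.
        base_time, base_state = max(
            (s for s in snapshots if s[0] <= target_time), key=lambda s: s[0]
        )
        state = dict(base_state)  # work on a copy, never the snapshot itself
        # 2. Roll forward: re-apply every logged transaction after the
        #    snapshot, up to and including the target time.
        for t, key, value in txn_log:
            if base_time < t <= target_time:
                state[key] = value
        return state

    # Recover the store as it stood at t=17 from a snapshot taken at t=10:
    snaps = [(0, {}), (10, {"acct": 100})]
    log = [(12, "acct", 150), (15, "acct", 90), (20, "acct", 500)]
    assert recover_to(17, snaps, log) == {"acct": 90}

The snapshot bounds how much log must be replayed, so snapshot frequency trades storage against recovery time.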

Since the destruction of the World Trade Centre buildings in 2001, every organisation with critical I.T. systems and a need for strong Data Protection/Reliable Persistent Storage has understood that a minimum requirement for 'second copy data' is a Remote Duplicate Copy. This implies at least a secure second facility.

Two critical design and operational parameters are:
  • "Delta-T", the difference in timestreams between the live system and Duplicate Copies.
  • "Sufficient Distance" for the second facility to not be affected by designed-for "events".
 "Disaster Recovery" and its Planning is all about managing these and related factors, along with time-to-restoration and time-to-recovery.

Can "Zero Delta-T" and "Zero TTR" systems be built at all? For all datastore sizes? Economically?
Is there a relationship between "Delta-T" and system cost?
Are there better organisations/designs?

Banks and airlines have attempted this for decades. Insisting that completed transactions be committed to Remote Storage adds significant latency to the process. In a single-threaded application, this limits the processing rate. To achieve high transaction rates, (real-time) parallelism must be embraced, with all its complexity and problems.
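
A back-of-envelope sketch of that latency ceiling, using the standard approximation that light in fibre travels at about two-thirds of c. The distance and the 100-way parallelism figure are illustrative assumptions.

    LIGHT_IN_FIBRE_KM_S = 200_000  # roughly 2/3 of c in glass

    def max_single_thread_tps(distance_km, processing_ms=0.0):
        """One transaction cannot commit until the remote acknowledgement
        returns, so rate <= 1 / (round-trip time + local processing)."""
        rtt_s = 2 * distance_km / LIGHT_IN_FIBRE_KM_S
        return 1.0 / (rtt_s + processing_ms / 1000.0)

    # A 1,000 km "sufficiently distant" second site: ~10 ms RTT at best,
    # so a single thread tops out near 100 transactions per second...
    print(f"{max_single_thread_tps(1000):.0f} tps single-threaded")
    # ...while 100 independent streams could, in principle, do 100x that.
    print(f"{100 * max_single_thread_tps(1000):.0f} tps with 100-way parallelism")

Distance buys safety but costs latency; only parallelism buys the rate back, and no amount of money makes the round trip faster than light.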
