Sunday, November 27, 2011

Journaled changes: One solution to RAID-1 on Flash Memory

As I've posited before, simple-minded mirroring (RAID-1) of Flash Memory devices is not just a poor implementation, it is the worst case.

My reasoning is: Flash wears out, and putting identical loads on identical devices results in the maximum wear rate on all bits, which is bad but not catastrophic. It also creates the potential for simultaneous failures, where a common weakness fails in two devices at the same time.

The solution is to not put the same write load on the two devices, but still have two exact copies.
This problem is a particular concern for PCI-SSD devices internal to a system. These devices can't normally be hot-plugged. Though there are hot-plug standards for PCI devices (e.g. Thunderbolt and ExpressCard), they are not usually options for servers and may offer limited performance.

One solution (I'm not sure whether it's optimal, but it is 'sufficient') is to write blocks as normal to the primary device and maintain the secondary device as a snapshot plus (compressed) journal entries. The snapshot is brought up to date, i.e. made an exact copy of the primary, when the journal space hits a high-water mark, when a timer expires (hourly, 6-hourly, daily, ...), or when the momentum of changes would fill the journal to 95% before the snapshot could be updated.
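To make the idea concrete, here is a toy in-memory sketch in Python. It is an illustration under my own assumptions, not a device implementation: the class name `JournaledMirror`, the 75% high-water default, the use of `zlib` for compression, and the choice to write only the latest version of each block when refreshing the snapshot are all hypothetical.

```python
import zlib


class JournaledMirror:
    """Toy model: the primary takes every block write directly; the
    secondary holds a snapshot plus a compressed journal of changes."""

    def __init__(self, num_blocks, block_size=16,
                 journal_capacity=4096, high_water=0.75):
        blank = bytes(block_size)
        self.primary = [blank] * num_blocks
        self.snapshot = [blank] * num_blocks  # data area on the secondary
        self.journal = []                     # (block_no, compressed data)
        self.journal_bytes = 0
        self.journal_capacity = journal_capacity
        self.high_water = high_water          # assumed threshold
        self.snapshot_block_writes = 0        # wear counter for the snapshot

    def write(self, block_no, data):
        # Primary: written in place, as normal.
        self.primary[block_no] = data
        # Secondary: append a compressed journal entry instead.
        entry = zlib.compress(data)
        self.journal.append((block_no, entry))
        self.journal_bytes += len(entry)
        if self.journal_bytes >= self.high_water * self.journal_capacity:
            self.checkpoint()

    def checkpoint(self):
        """Bring the snapshot up to date, then empty the journal.
        Only the last journaled version of each block is written."""
        latest = {}
        for block_no, entry in self.journal:
            latest[block_no] = entry
        for block_no, entry in latest.items():
            self.snapshot[block_no] = zlib.decompress(entry)
            self.snapshot_block_writes += 1
        self.journal.clear()
        self.journal_bytes = 0

    def secondary_view(self):
        """What the secondary would present if the primary failed now:
        the snapshot with the journal replayed over it, in order."""
        blocks = list(self.snapshot)
        for block_no, entry in self.journal:
            blocks[block_no] = zlib.decompress(entry)
        return blocks
```

In this sketch, `secondary_view()` always equals the primary's contents, so the pair is still a mirror, yet the two data areas see different write patterns: six writes that touch only two distinct blocks cost the snapshot area two block writes at the next checkpoint.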

If the journal fills, the mirror is invalidated and either changes must be halted or the devices go into unprotected operation. Neither is a desirable operational outcome. A temporary, though unprotected, work-around is to write the on-going journal either to the primary device or into memory.
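The decision between carrying on, refreshing the snapshot and handling a full journal could be expressed as a small policy function. This is a sketch only: the function name `journal_action`, its parameters and the 75% high-water default are my assumptions; the 95% limit is the figure used above.

```python
def journal_action(used, capacity, fill_rate, refresh_seconds,
                   high_water=0.75, limit=0.95):
    """Decide what to do with the journal on the secondary device.

    used, capacity   -- journal space in bytes
    fill_rate        -- recent rate of journal growth, bytes/second
    refresh_seconds  -- estimated time to bring the snapshot up to date

    Returns 'full', 'refresh' or 'ok'.
    """
    if used >= capacity:
        # Mirror invalidated: halt changes, run unprotected, or divert
        # the on-going journal to the primary device or to memory.
        return "full"
    # "Momentum": where the journal will be by the time a refresh
    # started now could finish.
    projected = used + fill_rate * refresh_seconds
    if used >= high_water * capacity or projected >= limit * capacity:
        return "refresh"
    return "ok"
```

For example, a 1000-byte journal that is only half used, but growing at 50 bytes/second with a 10-second refresh time, already returns `'refresh'`, because the projection reaches the journal's capacity before the refresh would complete.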

Implementation Outline:

With existing devices, a set of blocks on both devices needs to be allocated for the journal. Whilst the journal area won't be 'written to' on the primary device, it needs to be there:
  • so that identical data areas are available on both devices
  • so that the primary and secondary devices can be swapped, or another device designated the new primary, either as an additional device or as a replacement for the secondary.
I'd prefer additional chips be added to the Flash devices specifically for journaling. NOR chips, expensive but not as prone to wear, could even be used. [If similar speed.]

Better techniques of dealing with the journal as a set of version-changes with unique keys (e.g. sequence number) would allow a device to be removed from a mirror and rejoined with minimal updates, avoiding a slow and expensive full-copy. This edge-case would benefit from writing the journal to the primary. One of the most annoying behaviours of RAID systems is "popping" a drive for a second or two (as in "is this the right drive? Oops, no"), then having to wait hours for a full rebuild to complete. Even if nothing was changed on the volumes in that short time...
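A rough sketch of the sequence-number idea follows, assuming the journal is kept on the primary and holds a bounded number of entries. The names `SequencedJournal` and `rejoin`, and the policy of discarding the oldest entry when the journal is full, are hypothetical.

```python
class SequencedJournal:
    """Bounded journal whose entries carry a unique, increasing key."""

    def __init__(self, max_entries):
        self.entries = []      # (seq, block_no, data), oldest first
        self.next_seq = 1
        self.max_entries = max_entries

    def append(self, block_no, data):
        seq = self.next_seq
        self.next_seq += 1
        self.entries.append((seq, block_no, data))
        if len(self.entries) > self.max_entries:
            self.entries.pop(0)          # oldest entry is discarded
        return seq

    def entries_after(self, last_seq):
        """Entries newer than last_seq, or None if some of them have
        already been discarded (only a full copy can help then)."""
        if last_seq == self.next_seq - 1:
            return []                    # nothing changed since removal
        oldest = self.entries[0][0] if self.entries else self.next_seq
        if last_seq + 1 < oldest:
            return None
        return [e for e in self.entries if e[0] > last_seq]


def rejoin(device_blocks, last_seq, journal):
    """Bring a returning device up to date from the journal if possible.
    last_seq is the last sequence number the device had applied."""
    missed = journal.entries_after(last_seq)
    if missed is None:
        return "full-copy", last_seq
    for seq, block_no, data in missed:
        device_blocks[block_no] = data
        last_seq = seq
    return "incremental", last_seq
```

With this, a drive "popped" for a couple of seconds rejoins after replaying zero or a handful of entries; the hours-long rebuild is only needed when the device has been away long enough for its last applied sequence number to fall off the end of the journal.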

Scaling:

RAID-1 both protects against device failure and improves read I/O performance. Write performance is limited by the slowest device.

Mirroring can also be used as an operational technique to create full-copies of large/critical filesystems or databases on a live system with no downtime.
A 3rd or 4th volume is added to a mirror, synchronised, then, after ensuring content-consistency, "split off" and used separately, typically as the base for a test/conversion environment or for backups. Because the disk, or set of disks, can be loaded onto a truck/plane, very high effective bandwidths are possible for the price of a courier. It can be faster than volume 'snapshots' and over-the-wire replication, not to mention a fraction of the cost. Airlines are known to have moved data-centres between continents this way, whilst maintaining their 24x7 booking and flight systems.
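The "effective bandwidth" of a courier is simple arithmetic. The figures in the note below are purely illustrative assumptions of mine, and the function name is hypothetical; it also ignores the time taken to synchronise and split the mirror at one end and to load the data at the other.

```python
def effective_bandwidth_gbit_per_s(capacity_bytes, transit_hours):
    """Average transfer rate, in Gbit/s, of shipping a volume
    door to door in the given number of hours."""
    bits = capacity_bytes * 8
    seconds = transit_hours * 3600
    return bits / seconds / 1e9
```

As an illustration, 100 TB of disks delivered in 24 hours works out to roughly 9 Gbit/s sustained for the whole day.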

This (block-exact + snapshot-and-journal) model can be scaled up to N-replicas by replicating either or both types of replica. For different operational requirements, different combinations would be preferred. All combinations have uses/advantages in specific instances and won't be enumerated.
