Sunday, May 04, 2014

RAID-0 and RAID-3/4 Spares

This piece is not based on an exhaustive search of the literature. It addresses a problem that doesn't seem to have been addressed for RAID-0 and the related RAID-3/4 (a single parity drive): the handling of spares.

Single parity drives seem to have been deemed impractical early on because a dedicated parity drive apparently constitutes a deliberate system bottleneck. RAID-3/4 has no bottleneck for streaming reads and writes: for full-stripe writes, performance equals, not approaches, the raw write performance of the array, identical to RAID-0 (stripe). For random writes, the 100-150 times speed differential between sequential and random access on modern drives can be leveraged with a suitable buffer to remove the bottleneck. The larger the buffer, the more likely the pre-read of old data and parity, needed to calculate the new parity, can be avoided. Avoiding the full revolution forced by the read/write-back cycle triples the array throughput.
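
To make the arithmetic concrete, here is a minimal Python sketch (toy block values, not any particular implementation) of the two write paths: a full-stripe write computes parity from the new data alone, while a small random write must pre-read the old data and old parity.

    from functools import reduce

    def xor_blocks(blocks):
        # XOR a list of equal-length byte blocks together.
        return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

    # Full-stripe write: parity comes from the new data alone, no pre-read needed.
    def full_stripe_parity(new_data_blocks):
        return xor_blocks(new_data_blocks)

    # Small random write: old data and old parity must be pre-read first,
    # the read/write-back cycle a large buffer tries to avoid.
    def updated_parity(old_parity, old_data, new_data):
        return xor_blocks([old_parity, old_data, new_data])

    # Toy 4-byte "blocks", purely illustrative.
    d0, d1, d2 = b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\x0f\x0f\x0f\x0f"
    p = full_stripe_parity([d0, d1, d2])
    new_d1 = b"\x11\x22\x33\x44"
    assert updated_parity(p, d1, new_d1) == full_stripe_parity([d0, new_d1, d2])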

Multiple copies of the parity drive (RAID-1) can be kept to mitigate the very costly failure of a parity drive: all blocks on every drive must be reread to recreate a failed parity drive. For large RAID groups, and given the very low price of small drives, this is not expensive.

With the availability of affordable, large SSDs, naive management of a single parity drive also removes the bottleneck for quite large RAID groups. The SSD can be backed by a log-structured recovery drive, trading on-line random IO performance for rebuild time.
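
One way to read that suggestion, purely as a sketch of my own (hypothetical record format and names): parity writes land on the SSD in place for fast random IO, and each update is also appended as a (block number, parity block) record to a sequential log on a cheap drive; if the SSD fails, replaying the log rebuilds it, paying in rebuild time rather than in on-line random IO.

    import struct

    BLOCK_SIZE = 4096                 # assumed parity block size for this sketch
    RECORD_HDR = struct.Struct("<Q")  # 64-bit little-endian parity block number

    def log_parity_update(log_file, block_no, parity_block):
        # Append one parity update to the sequential recovery log.
        assert len(parity_block) == BLOCK_SIZE
        log_file.write(RECORD_HDR.pack(block_no) + parity_block)

    def rebuild_parity_ssd(log_file):
        # Replay the log in order; the last record for each block number wins.
        parity = {}
        while True:
            hdr = log_file.read(RECORD_HDR.size)
            if len(hdr) < RECORD_HDR.size:
                break
            (block_no,) = RECORD_HDR.unpack(hdr)
            parity[block_no] = log_file.read(BLOCK_SIZE)
        return parity  # block number -> most recent parity block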

Designing Local and/or Global Spares for large (N=64..512) RAID sets is necessary to reduce overhead, improve reconstruction times and avoid unnecessary partitioning, which limits recovery options and causes avoidable data loss events.

In 1991, IBM lodged a patent for Distributed Spares: just as RAID-5/6 distributes the parity disk(s), the spare disk(s) are distributed, in diagonal stripes, across all disks in a RAID group. There is a mapping from logical to physical drives where, as in RAID-5/6, blocks from a single logical drive land on different physical drives in adjacent stripes, 'rotated' around, if you will. If a physical drive fails, all drives in the RAID group take part in the reconstruction, with the mapping of logical drives onto the rotating blocks being modified. The scheme allows multiple spare drives.
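
A minimal Python sketch of a rotated logical-to-physical mapping in the spirit of distributed spares (not the patented layout itself): each stripe rotates which physical drive holds each logical column, so the spare column's blocks end up spread across every physical drive.

    def physical_drive(stripe, logical_col, n_drives):
        # Physical drive holding `logical_col` of `stripe` under simple rotation.
        return (logical_col + stripe) % n_drives

    # 5 physical drives; logical columns 0-3 are data/parity, column 4 is the
    # distributed spare. Its blocks land on a different physical drive each stripe.
    N_DRIVES, SPARE_COL = 5, 4
    spare_locations = [physical_drive(s, SPARE_COL, N_DRIVES) for s in range(N_DRIVES)]
    print(spare_locations)  # [4, 0, 1, 2, 3]: every drive donates some spare space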

Holland and Gibson presented a paper in 1992, "Parity Declustering for Continuous Operation in Redundant Disk Arrays", intended to maximise reconstruction speed and so minimise reconstruction times. Their algorithm criteria:
  1. Single failure correcting.
  2. Distributed reconstruction. 
  3. Distributed parity. 
  4. Efficient mapping. 
  5. Large write optimization. 
  6. Maximal parallelism.
Holland and Gibson create a smaller logical array that is mapped onto the physical array, trading parity overhead for faster reconstruction and shorter periods of degraded performance. This mapping is not dissimilar to the IBM distributed spare. By adding unused, 'spare', physical drives to the parity declustering algorithm, the same outcome can be achieved as with IBM's distributed spares.
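
A toy illustration of the idea (a simple round-robin layout, not Holland and Gibson's balanced block designs): a logical stripe of width 4 is placed on a rotating subset of 7 physical drives, so no single surviving drive has to supply every block a reconstruction needs.

    def declustered_stripe(stripe, stripe_width, n_physical):
        # Physical drives holding the blocks of one logical stripe (round-robin).
        start = (stripe * stripe_width) % n_physical
        return [(start + i) % n_physical for i in range(stripe_width)]

    STRIPE_WIDTH, N_PHYSICAL = 4, 7   # e.g. 3 data + 1 parity over 7 drives
    for s in range(7):
        print(s, declustered_stripe(s, STRIPE_WIDTH, N_PHYSICAL))
    # Each physical drive appears in only some stripes, so after a failure the
    # reconstruction reads are shared across all surviving drives.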

The per-drive unused/spare blocks of IBM's "Distributed Spares" and the spares possible in a Parity Declustering logical/physical mapping can be collected and laid out on drives in many ways, depending on how many 'associated' blocks are stored consecutively. For both schemes, only single 'associated' blocks are used.

On a drive with a total of M spare (or parity) blocks, the number of 'associated' blocks allocated consecutively, N, can vary from N=1 to N=M, but with the proviso that only exact divisors of M are possible: M/N must leave no remainder.

N=1 is the IBM and Parity Declustering case.
N=M is what we would now recognise as a disk partition.
Intermediate values of N (1 < N < M) we might describe as "chunks", "stripes", "extents" or allocation units.

We end up with k extents per drive, where k = M/N.
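
Put another way, the legal extent sizes on a drive with M spare blocks are exactly the divisors of M, as a few lines of illustrative Python make clear:

    def legal_extent_sizes(m_spare_blocks):
        # Values of N (consecutive 'associated' blocks) that divide M exactly.
        return [n for n in range(1, m_spare_blocks + 1) if m_spare_blocks % n == 0]

    M = 12
    for n in legal_extent_sizes(M):
        k = M // n
        print(f"N={n:2d} consecutive blocks -> k={k:2d} extents")
    # N=1 is the IBM / Parity Declustering case; N=M (k=1) is a plain partition.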

Microsoft, in its "Storage Spaces" product, uses a fixed allocation unit of 256MB. This notion of (physical) "extents", familiar from Logical Volume Managers, is the same as that posited here.

In RAID-3/4, the disks holding only data are exactly a RAID-0 (striped) set. For this reason, any spare handling scheme for one will work for both, with the caveat that RAID-0 has no redundancy to recover lost data. The Holland/Gibson logical/physical geometry remapping is still useful in spreading reads across all drives during a reconstruction, although it is no longer "Parity Declustering".

A RAID-0 (striped) set can be managed locally by a low-power controller, typified by Embedded RAID Controllers found in low-end hardware RAID cards. These RAID-0 (striped) sets can be globally managed with shared parity drives, providing much larger RAID groups. These cards are capable enough to provide:
  • Large-block (64kB to 256kB) error detection and correction, such as Hamming Codes or Galois Field "Erasure Codes".
  • Management of distributed spare space.
By separating out and globally managing Parity Drives, additional techniques become available, such as mirroring parity drives, SSD parity and/or log-structured updates. Larger IO buffers acting as read-back caches can significantly improve random IO performance when writes show a degree of locality (updating the same blocks).
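
As a sketch of that read-back caching idea (my own illustration, with hypothetical names): parity deltas from random writes accumulate in memory, repeated updates to the same parity block are XOR-merged, and the cache is flushed to the parity drive in block order so the drive sees mostly sequential IO.

    class ParityWriteCache:
        # Coalesce random parity updates and flush them in block order.

        def __init__(self):
            self.pending = {}  # parity block number -> accumulated XOR delta

        def add_update(self, block_no, xor_delta):
            old = self.pending.get(block_no)
            if old is None:
                self.pending[block_no] = xor_delta
            else:
                # Two updates to the same parity block collapse into one XOR.
                self.pending[block_no] = bytes(a ^ b for a, b in zip(old, xor_delta))

        def flush(self, read_parity, write_parity):
            # Apply deltas in ascending block order (mostly sequential on disk).
            for block_no in sorted(self.pending):
                delta = self.pending[block_no]
                old = read_parity(block_no)
                write_parity(block_no, bytes(a ^ b for a, b in zip(old, delta)))
            self.pending.clear()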

The Embedded RAID Controllers can handle distributed spares by storing one logical drive per partition. By distributing the logical drives across many physical drives, reconstruction, when needed, can be run utilising all drives in a RAID group.

Notes on RAID-3 vs RAID-4:

Table 1 of Section 4.1, pg 13, of the SNIA technical definitions document lays down the industry-standard definitions (below) of the different primary RAID types. Wikipedia provides an interpretation of these definitions, with per-'byte' and per-'block' calculation of parity.

RAID-3 and RAID-4 use a single parity drive, computed with XOR over all drives in a parity volume group. [SNIA uses the term "disk group", pg 12, para 3.1.4. "Parity Volume Group" is a non-standard term.] Additional parity, like the Q-parity used in RAID-6, could be added to recover from multiple failed drives.

RAID-3: Striped array with non-rotating parity, optimized for long, single-threaded transfers. [parity computed by byte, sector or native disk block, maximising streaming performance and scheduling IO across all disks]

RAID-4: Striped array with non-rotating parity, optimized for short, multi-threaded transfers. [parity is computed on larger blocks, striped across the drives, not on whole 'extents']

What's not generally commented on is that RAID-3 and RAID-4, like all RAID types, have no read-write-back penalty for large streaming transfers. Each stripe written in full requires no pre-read of data or parity to recalculate the parity block(s). This means the usual objection to RAID-3/4, a single "hot spot" on the parity drive, does not limit write performance for streaming transfers, only for intensive random writes.

The loss of a parity drive forces a recalculation of all parity, and so a rescan of all data on all drives in the affected parity volume group. The system load and impact of the rescan are very high, and while it is underway there is no Data Protection for the Parity Volume Group, suggesting that for larger groups, mirroring the parity drive is desirable.

For moderate parity group sizes, taking the group off-line and rebuilding in sustained streaming mode would be desirable.
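
That streaming rebuild is conceptually just an XOR of every data drive, stripe by stripe; a minimal sketch with hypothetical read/write callables:

    from functools import reduce

    def rebuild_parity(read_data_block, write_parity_block, n_data_drives, n_stripes):
        # Recreate a failed parity drive by XOR-ing every data drive per stripe.
        for stripe in range(n_stripes):
            blocks = [read_data_block(drive, stripe) for drive in range(n_data_drives)]
            parity = reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)
            write_parity_block(stripe, parity)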

As disk size and the number of drives per parity volume group increase, the volume of data to rescan increases linearly with each. Scan time for a whole array, especially in sustained streaming, may be limited by congestion on shared connections, such as Fibre Channel or SAS.

RAID-3/4's most useful characteristic is that the data drives form a RAID-0 volume, with striping, not concatenation. The parity volume(s) exist in isolation from the base data volume and can be stored and managed independently.
