Increasing drives per enclosure from the 15-45 typical for 3.5" drives to 1,000 requires a deep rethink of target market, goals and design.
Not least is dealing with drive failures. With an Annualised Failure Rate (AFR) of 0.4%-0.75% now quoted by drive vendors, dealing with 5-15 drive failures per unit, per year, is a given. In practice, failure rates run at least twice the vendor-quoted AFR, not least because conditions inside systems can be harsh and other components and connectors fail too, not just the drives. Drives have a design life of 5 years at an expected duty-cycle; consumer-grade drives aren't expected to run 24/7 like the more expensive enterprise drives. Failure rates measured on large fleets in service also increase with age, and considerably so towards end of life.
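The arithmetic behind those failure counts is worth making explicit. A minimal sketch, using the vendor-quoted AFR range from the text and a 2x multiplier for the observed in-service rates; the function name is my own:

```python
def expected_failures(drives: int, afr: float, field_multiplier: float = 1.0) -> float:
    """Expected drive failures per year for a fleet of `drives` drives.

    afr is the Annualised Failure Rate as a fraction (0.004 = 0.4%);
    field_multiplier scales vendor AFR up to observed in-service rates.
    """
    return drives * afr * field_multiplier

drives = 1000
for afr in (0.004, 0.0075):            # vendor-quoted 0.4%-0.75% range
    vendor = expected_failures(drives, afr)
    field = expected_failures(drives, afr, field_multiplier=2.0)
    print(f"AFR {afr:.2%}: vendor ~{vendor:.1f}/yr, in-service ~{field:.1f}/yr")
```

At 1,000 drives this reproduces the 5-15 failures per unit, per year quoted above once the in-service multiplier is applied.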
It isn't enough to say "we're trying to minimise per-unit cost": all designs do that, just against different criteria.
What matters is the constraints you're working within and the parameters being optimised.
Competing design & operational dimensions start with:
- Cost (CapEx, OpEx and TCO),
- Durability, and
- Connectivity: the external interface and internal fabric.
What type and speed of external interface is needed? SAS, Fibre Channel, InfiniBand, or Ethernet?
If Ethernet: 1Gbps, 10Gbps or 40-100Gbps? The infrastructure cost of high-bandwidth external interfaces goes far beyond the price of the NICs: patch-cables, switches and routers in the data-centre, and more widely, are all affected.
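To see why interface choice dominates, consider the aggregate streaming rate of the enclosure against link speeds. A sketch, assuming an illustrative 160 MB/s sustained streaming rate per drive (not a figure from the text):

```python
from math import ceil

def links_needed(drives: int, mb_per_s_per_drive: float, link_gbps: float) -> int:
    """Number of external links needed to carry the full aggregate streaming rate."""
    total_gbps = drives * mb_per_s_per_drive * 8 / 1000   # MB/s -> Gbps
    return ceil(total_gbps / link_gbps)

for link in (1, 10, 40, 100):
    n = links_needed(1000, 160, link)
    print(f"{link:>3} Gbps Ethernet: {n} links for full aggregate streaming rate")
```

Even at 100Gbps, a dozen or more links are needed to avoid throttling 1,000 streaming drives, which is exactly why the enclosure's target workload must be pinned down before the interface is chosen.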
Is the internal fabric designed to be congestion-free, with zero-contention access to all drives, or does it meet other criteria?
There are at least 4 competing parameters that can be optimised:
- raw capacity (TB),
- (random) IO/second, both read and write,
- streaming throughput, and
- data protection (RAID resilience).
Close attention to design detail is needed to reduce per-IO overhead to microseconds, necessary for high aggregate random IO/sec: 1,000x 5400 RPM HDDs can support 180k IO/sec, i.e. one IO completing roughly every 6 microseconds across the enclosure.
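The per-IO interval follows directly from the aggregate rate. A quick check of the arithmetic in the text (180 IOPS per drive is simply the stated 180k aggregate divided across 1,000 drives):

```python
# Aggregate IOPS for the enclosure, and the implied interval between
# IO completions that the controller and fabric must keep up with.
drives = 1000
iops_per_drive = 180                      # implied by the 180k aggregate figure
aggregate_iops = drives * iops_per_drive
interval_us = 1_000_000 / aggregate_iops  # microseconds between IO completions
print(f"aggregate: {aggregate_iops:,} IO/s, one IO every {interval_us:.2f} us")
```

Note this ~5.6 us figure is the enclosure-wide completion interval, not the latency of any single drive, which for a 5400 RPM HDD is measured in milliseconds; it is the controller path that must operate at microsecond granularity.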
These considerations are driven mostly by marketing concerns, not engineering.
That forces a discipline on the engineering design team: know exactly who the target market is, what they need, what they value, and what price/capability trade-offs they'll wear.
Data protection design falls broadly into three areas:
- Spare policy and overhead,
- Parity block organisation overhead, and
- Rebuild time and performance degradation during rebuilds.
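Rebuild time, the third area above, is easy to estimate to first order. A naive sketch, assuming an illustrative 20 TB drive and a sustained 150 MB/s rebuild rate (both figures my own, not from the text), and ignoring the slowdown from serving foreground IO during the rebuild:

```python
def rebuild_hours(capacity_tb: float, rebuild_mb_per_s: float) -> float:
    """Hours to rewrite one failed drive's capacity at a sustained rate."""
    return capacity_tb * 1_000_000 / rebuild_mb_per_s / 3600  # TB -> MB, s -> h

hours = rebuild_hours(20, 150)
print(f"~{hours:.0f} hours to rebuild one drive")
```

With 5-15 failures per unit per year and rebuilds running more than a day each, the enclosure spends a meaningful fraction of its life in a degraded state, which is why rebuild-time performance degradation is a first-class design parameter.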
- RAID-5 carries a significant write penalty for both streaming and random IO: each small write becomes a read-modify-write of data and parity blocks,
- RAID-6 burns CPU cycles on the Galois Field calculations needed for its 'Q' parity blocks, and
- RAID-1 and RAID-10 are simple, low-CPU solutions, but cost more in capacity and don't offer protection against all dual-drive failures.
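The capacity-versus-write-cost trade-off between these layouts can be sketched numerically. The group sizes (8+1, 8+2) are illustrative assumptions, and the small-write IO counts are the classic textbook figures, not measurements:

```python
def usable_fraction(data_drives: int, parity_drives: int) -> float:
    """Fraction of raw capacity available for data in one RAID group."""
    return data_drives / (data_drives + parity_drives)

# (usable fraction, physical IOs per logical small write)
layouts = {
    "RAID-10 (mirror)": (usable_fraction(1, 1), 2),  # write both copies
    "RAID-5 (8+1)":     (usable_fraction(8, 1), 4),  # read data+parity, write both
    "RAID-6 (8+2)":     (usable_fraction(8, 2), 6),  # P and Q parity both updated
}
for name, (frac, penalty) in layouts.items():
    print(f"{name}: {frac:.0%} usable, {penalty}x small-write IO amplification")
```

The table makes the tension concrete: the layouts that protect best per drive of overhead are exactly the ones that amplify small writes and burn CPU, so the "right" choice again depends on which of the four parameters above the product is optimising.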