Monday, April 21, 2014

Storage: First look at Hardware block diagram

Stuffing 500-1000 2.5" drives in an enclosure is just the start of a design adventure.

The simplest decision is choosing between fixed and hot-plug drive mounting. There's a neat slide-out tray system for 3.5" drives that allows hot-plug access to densely, vertically packed drives; it could be adapted to 2.5" drives.


There seem to be roughly three hardware approaches, each with different performance goals and different MTDL (Mean Time to Data Loss) targets:

  • Single board, 'Backblaze' style: capacity-optimised, low external bandwidth.
  • High Availability: dual motherboards, dual-ported local Embedded RAID Controllers, zero-contention internal fabric, high-performance external interfaces, additional Flash memory and NVRAM.
  • Mid-scale server: single board, fast external interfaces, Embedded RAID Controllers, low-contention internal fabric.

Low-end, Capacity-Focused, low-bandwidth external interface

The Backblaze Storage Pod 4.0 has moved away from SATA Port Multipliers to 40-port SATA cards (HighPoint Rocket 750s), reducing system cost and increasing system throughput, though the pods are still limited to 1Gbps external interfaces. Backblaze could invest $500 in dual-port twisted-pair 10Gbps NICs, though this would require matching Top-of-Rack switches and higher-capacity datacentre links, switches and routers. Their product offering, delivered over the wider Internet, is based on capacity, not performance.

The advantage of such a system is that everything is done in software, in one place.
The load on sub-systems is limited by the 1Gbps external interface.

The HDDs can be organised as a single RAID group or as multiple groups, at differing RAID levels, as desired.
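As a rough illustration of that choice, here's a sketch of usable capacity under a few RAID-6 groupings. The layout (45 x 4TB drives, Pod-like) is an assumption for the arithmetic, not Backblaze's published configuration:

    # Sketch: usable capacity under different RAID-6 groupings.
    # Assumed layout: 45 x 4TB drives; not Backblaze's actual config.
    DRIVES = 45
    DRIVE_TB = 4.0

    def raid6_usable_tb(group_size):
        """Usable TB with RAID-6 groups of group_size
        (each group gives up 2 drives to parity)."""
        groups = DRIVES // group_size
        return groups * (group_size - 2) * DRIVE_TB

    for size in (45, 15, 9):
        usable = raid6_usable_tb(size)
        overhead = 1 - usable / (DRIVES * DRIVE_TB)
        print(f"RAID-6 groups of {size:2d}: {usable:5.0f} TB usable, "
              f"{overhead:5.1%} parity overhead")

Fewer, larger groups lose less capacity to parity, but rebuild windows grow and MTDL suffers; that tension is what the grouping choice trades off.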

Backblaze don't seem to run hot-spares or hot-plug drives. They have a "Break/Fix" maintenance schedule, costing around 15 minutes of technician time ($15-$25) per replacement, including a presumed device outage, which means buying Enterprise drives gives them no TCO benefit. It's unclear whether they run a daily or weekly replacement schedule.

Even though Backblaze limit external bandwidth to approximately one drive's performance, they still benefit from a high-performance internal fabric: for data scrubbing, reconfiguration, data migration and RAID rebuilds.

Of the ~$10,000 Backblaze pay per box of 180TB, around 60% is spent directly on drives.
Backblaze build 25 units/month, saving 10%-30% compared to one-off prices.
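Working those figures through:

    # The cost split above, worked through (figures quoted in this post).
    box_cost = 10_000       # ~$ per pod
    capacity_tb = 180
    drive_share = 0.60      # ~60% spent directly on drives

    print(f"Drive spend per pod: ${box_cost * drive_share:,.0f}")
    print(f"All-in cost: ${box_cost / capacity_tb:.2f}/TB")

That's roughly $6,000 of drives per pod and an all-in cost around $55/TB.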

High-End, multi-redundant systems

These are going to be built from high-spec items, with hot-pluggable everything and high-speed external interfaces. Expect "exotic" solutions and custom-built components and subsystems. Under half the cost of the system will be in disk drives.

In the light of this cost structure, high-end systems can afford many more spare drives, even enough to cover a full lifetime's projected failures. This over-provisioning would only marginally increase the retail price.
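A back-of-envelope sketch of what "a full lifetime's projected failures" might amount to; the failure rate and service life here are assumed figures, not measured data:

    # Back-of-envelope: spares to cover a lifetime of projected failures.
    drives = 1000
    afr = 0.04      # assumed ~4% annualised failure rate
    years = 5       # assumed service life

    spares = drives * afr * years
    print(f"Expected lifetime failures: ~{spares:.0f} of {drives} drives")
    # ~20% extra drives; with drives under half of system cost, that's
    # under a 10% increase in the build cost of the system.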

For a single random-I/O-optimised enclosure (1,000 x 5mm drives) achieving 250k IO/sec with 4kB blocks, a minimum of 1GB/sec (10Gbps) is required for the external interface. With any degree of streaming I/O, 40Gbps will easily be reached.
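The sizing arithmetic, with the streaming side based on an assumed ~100MB/s sequential rate per drive:

    # Interface sizing: random 4kB IOPS converted to line rate,
    # plus a streaming comparison.
    iops = 250_000
    block_bytes = 4 * 1024
    random_gbps = iops * block_bytes * 8 / 1e9
    print(f"Random I/O: {random_gbps:.1f} Gbps")   # ~8.2 Gbps -> a 10Gbps port

    streaming_drives = 50      # assumed: a small fraction of the enclosure
    per_drive_gbps = 0.8       # assumed ~100MB/s sequential per drive
    print(f"Streaming: {streaming_drives * per_drive_gbps:.0f} Gbps")  # 40 Gbps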

When performance is being optimised, zero-contention internal links are required. Multiple 10Gbps schemes are available, allowing 8-10 drives to share a single link. This suggests a tiered internal fabric with switching capability, especially if dual controllers are expected to access all disks.
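A rough cut at how the edge tier falls out of those link-sharing numbers, taking the top of the 8-10 drives-per-link range:

    # Edge-tier sizing from the drives-per-link figure above.
    drives = 1000
    drives_per_link = 10
    edge_links = -(-drives // drives_per_link)    # ceiling division
    print(f"{edge_links} x 10Gbps edge links, "
          f"{edge_links * 10} Gbps into the switch tier")
    # Dual controllers reaching every disk double the port count,
    # which pushes the design towards a real switching tier.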

Considering the possible streaming performance of 300 to 500 drives at around 1Gbps each suggests 100Gbps external interfaces as a minimum. This makes for a very expensive datacentre infrastructure, but that's the price of high performance.
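The oversubscription arithmetic behind that minimum:

    # Aggregate streaming bandwidth vs a 100Gbps external link.
    for drives in (300, 500):
        internal_gbps = drives * 1.0    # ~1Gbps per streaming drive
        print(f"{drives} drives: {internal_gbps:.0f} Gbps internal, "
              f"{internal_gbps / 100:.0f}:1 over 100Gbps external")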

Mid-scale systems

Building fast systems with good redundancy and low-contention fabrics, while using mostly commodity parts, will be a point of differentiation between solution providers.

There is a lot of scope for novel and innovative designs for tiered storage and low contention internal fabrics.
