Friday, November 25, 2011

Flash Memory: will filesystems become the CPU bottleneck?

Flash memory delivering 50,000+ IO/sec may be too fast for Operating Systems like Linux, with file-system operations consuming ever more CPU, even saturating it. Filesystems are on the way to becoming the system rate-limiting factor, otherwise known as the bottleneck.

What you can get away with at 20-100 IO/sec, where it consumes maybe 1-2% of the CPU, will be a CPU hog at 50k-500k IO/sec, a speed-up of three to four orders of magnitude.
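A back-of-envelope sketch of that scaling, taking the 1-2% figure above and assuming the per-I/O CPU cost stays fixed (both are assumptions, not measurements):

  # If servicing one I/O costs a fixed slice of CPU time, CPU load grows
  # linearly with the I/O rate. Figures here are illustrative assumptions.
  cpu_share_at_100_iops = 0.02                 # ~2% of one CPU at 100 IO/sec (assumed)
  cpu_per_io = cpu_share_at_100_iops / 100     # fraction of a CPU-second per I/O

  for iops in (100, 50_000, 500_000):
      print(f"{iops:>7,} IO/sec needs {iops * cpu_per_io:6.2f} CPUs of filesystem overhead")
  #     100 IO/sec needs   0.02 CPUs
  #  50,000 IO/sec needs  10.00 CPUs
  # 500,000 IO/sec needs 100.00 CPUs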

The effect is the reverse of the way Amdahl speed-up is explained.

Amdahl throughput scaling is usually explained like this:
If your workload has 2 parts (A is single-threaded, B can be parallelised), then as you decrease the time taken for 'B' by adding parallel compute-units, the workload becomes dominated by the single-threaded part, 'A'. If you halve the time it takes to run 'B', it doesn't halve the total run time. If the 'A' and 'B' parts take equal time (4 units each, total 8), then a 2-times speed-up of 'B' (4 units to 2) results in a 25% reduction in run-time (8 units to 6). Speeding 'B' up 4-times is a 37.5% reduction (8 to 5).
This creates a limit to the speed-up possible: If 'B' reduces to 0 units, it still takes the same time to run all the single-threaded parts, 'A'. (4 units here)
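The same worked example, written out as a trivial calculation:

  # Amdahl's example above: A (serial) = 4 units, B (parallelisable) = 4 units.
  def run_time(serial, parallel, speedup):
      return serial + parallel / speedup

  A, B = 4.0, 4.0
  base = run_time(A, B, 1)                      # 8 units
  for s in (2, 4, float("inf")):
      t = run_time(A, B, s)
      print(f"{s}x faster B: {t:g} units, {(base - t) / base:.1%} less run-time")
  # 2x -> 6 units (25%), 4x -> 5 units (37.5%), infinitely fast B -> 4 units (the 50% cap)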

A corollary of this: the rate of improvement for each doubling of cost approaches zero if the effort isn't well chosen, i.e. isn't spent where the time actually goes.

The filesystem bottleneck is the reverse of this:
If your workload has an in-memory part ('X') and a wait-for-I/O part ('W'), both of which consume CPU, then reducing the I/O wait to zero without reducing the CPU overhead of 'W' means a growing share of the machine's effort goes to 'W' rather than to useful work in 'X'. In the limit, system throughput is constrained by the CPU expended on I/O overhead in 'W'.
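A minimal model of this, with the per-request CPU costs assumed purely for illustration:

  # Per request: X = useful in-memory CPU work, W_cpu = filesystem CPU overhead,
  # W_wait = time blocked on the device. Shrink W_wait (faster media) and the
  # CPU saturates while its overhead share stays fixed.
  X, W_cpu = 1.0, 0.5                            # ms of CPU per request (assumed)

  for W_wait in (10.0, 1.0, 0.1, 0.0):           # spinning disk -> flash -> SCM
      wall = X + W_cpu + W_wait                  # per-request wall-clock, one at a time
      cpu_busy = (X + W_cpu) / wall              # how busy the CPU is
      overhead = W_cpu / (X + W_cpu)             # share of CPU busy time that is overhead
      print(f"wait {W_wait:4.1f} ms: CPU {cpu_busy:5.1%} busy, "
            f"{overhead:.0%} of its busy time in the filesystem")
  # As the wait disappears the CPU saturates, and a fixed third of it is burned
  # in the filesystem; that overhead becomes the throughput ceiling.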

The faster random I/O of Flash Memory will reduce application execution time, but at the expense of increasing the percentage of system CPU time. For a single process, the proportion and total CPU-effort of I/O overhead remain the same. For the whole system, more useful work is being done (it's noticeably "faster"), but because the CPU didn't get faster too, it has to spend a lot more of its time in the FileSystem.
Jim Gray observed that:
  • CPUs are now mainly idle, i.e. waiting on RAM or I/O.
    Level-1 cache is roughly the same speed as the CPU, everything else is much slower and must be waited for.
  • The time taken to scan a 20TB disk using random I/O will be measured in days, whilst a sequential scan ("streaming") will take hours (rough arithmetic below).
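A rough check of that claim, using disk figures I've assumed (roughly 100MB/sec streaming, roughly 8ms per random 4KB read):

  disk_bytes = 20e12                 # 20 TB
  stream_rate = 100e6                # ~100 MB/s sequential (assumed)
  random_iops = 125                  # ~8 ms per random 4 KB read (assumed)
  block = 4096

  sequential_hours = disk_bytes / stream_rate / 3600
  random_days = (disk_bytes / block) / random_iops / 86400
  print(f"sequential scan: ~{sequential_hours:.0f} hours")   # ~56 hours
  print(f"random-I/O scan: ~{random_days:.0f} days")         # ~450 days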
Reading a "Linux Storage and Filesystem Workshop" (LSF) conference report, I was struck by comments that:
  • Linux file systems can consume large amounts of CPU doing their work: not just fsck, but handling directories, file metadata, free-block chains, inode block chains, block and file checksums, ...
There's a very simple demonstration of this: optical disk (CD-ROM or DVD) performance.
  • A block-by-block copy (dd) of a CD-ROM at "32x", or approx 3MB/sec, will copy a full 650MB in 3-4 minutes. Wikipedia's figures for a full 4.7GB DVD at "4x" (5.5MB/sec) work out to roughly 14 minutes.
  • Mounting a CD or DVD then doing a file-by-file copy takes 5-10 times as long.
  • Installing or upgrading system software from the same CD/DVD is usually measured in hours.
The fastest way to upgrade software from CD/DVD is to copy an image with dd to hard-disk, then mount that image. The difference is the random I/O (seek) performance of the underlying media, not the FileSystem. [Haven't tested times or speedup with a fast Flash drive.]
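A sketch of that comparison; the device and mount paths (/dev/cdrom, /mnt/cdrom) and output locations are assumptions, and you'd need read permission on the device:

  # Compare a raw block-by-block image copy of an optical disc with a
  # file-by-file walk of the mounted filesystem.
  import os, shutil, time

  def image_copy(device="/dev/cdrom", image="/tmp/disc.iso", bs=1 << 20):
      """Equivalent of `dd if=/dev/cdrom of=/tmp/disc.iso bs=1M`: pure sequential reads."""
      with open(device, "rb") as src, open(image, "wb") as dst:
          shutil.copyfileobj(src, dst, length=bs)

  def tree_copy(mountpoint="/mnt/cdrom", dest="/tmp/disc-files"):
      """File-by-file copy: every directory and inode lookup costs seeks and CPU."""
      shutil.copytree(mountpoint, dest)

  for label, fn in (("dd-style image", image_copy), ("file-by-file", tree_copy)):
      t0 = time.time()
      fn()
      print(f"{label}: {time.time() - t0:.0f} s")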

This performance limit may have been something that the original Plan 9 writers knew and understood:
  • P9 didn't 'format' media for a filesystem: it initialised a little metadata and just started writing blocks.
  • didn't have fsck on client machines, only the fileserver.
  • the fileserver wrote to three levels of storage: RAM, disk, Optical disk.
    RAM and disk were treated as cache, not permanent storage.
    Files were pushed to Optical disk daily, creating a daily snapshot of the filesystem at the time. Like Apple's TimeMachine, files that hadn't changed were 'hard-linked' to the new directory tree.
  • The fileserver had operator activities like backup and restore. The design had no super-user with absolute access rights, so avoided many of the usual admin-related security issues.
  • Invented 'overlay mounts', managed at user not kernel level, to combine the disparate file-services available and allow users to define their own semantics.
Filesystems have never, until now, had to focus on CPU efficiency; rather the opposite: they've traded CPU and RAM to reduce I/O latency, historically improving system throughput, sometimes by orders of magnitude.
Early examples were O/S buffering/caching (e.g. the Unix buffer cache) and the 'elevator algorithm', reordering writes to match disk characteristics.
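A toy sketch of the elevator idea, not any particular kernel's implementation: serve pending requests in one sweep of the head rather than in arrival order.

  # 'Elevator' (SCAN) scheduling: service everything on the way up from the
  # current head position, then everything below on the way back down,
  # minimising total seek distance.
  def elevator_order(pending_blocks, head):
      up = sorted(b for b in pending_blocks if b >= head)
      down = sorted((b for b in pending_blocks if b < head), reverse=True)
      return up + down

  requests = [98, 183, 37, 122, 14, 124, 65, 67]   # a common textbook example
  print(elevator_order(requests, head=53))
  # [65, 67, 98, 122, 124, 183, 37, 14]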

This 'burn the CPU' trade-off shows up with fsck as well. An older LSF piece suggested that fsck runs slowly because it can't make a single sequential pass over the disk; it is effectively forced into near worst-case, unoptimised random I/O.

On my little Mac Mini with a 300GB disk, there's 225GB used, almost all of which, especially the system files, is unchanging. Most of the writing to disk is "append mode" - music, email, downloads - either blocks-to-a-file or files-to-a-directory. With transactional Databases, it's a different story.

The filesystem treats the whole disk as if every byte could be changed in the next second - and I pay a penalty for that in complexity and CPU cycles. Seeing my little Mac or an older Linux desktop do a filesystem check after a power fail is disheartening...

I suggest future O/S's will have to contend with:
  • Flash or SCM with close-to-RAM performance 'near' the CPU(s) (on the PCI bus, no SCSI controller)
  • near-infinite disk ("disk is tape", Jim Gray) that you'll only want to access as "seek and stream". It will also take "near infinite" time to scan with random I/O. [another Jim Gray observation]
And what are the new rules for filesystems in this environment?
  • two sorts of filesystems that need to interwork:
    • read/write that needs fsck to properly recover after a failure and
    • append-only that doesn't need checking once "imaged", like ISO 9660 on optical disks.
  • 'Flash' file-system organised to minimise CPU and RAM use; high performance at low CPU cost will become as important as managing "wear" for very fast PCI Flash drives.
  • 'hard disk' filesystem with on-the-fly append/change of media and 'clone disk' rather than 'repair f/sys'.
  • O/S must seamlessly/transparently:
    1. present a single file-tree view of the two f/sys
    2. like Virtual Memory, safely and silently migrate data/files from fast to slow storage.
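By way of illustration, here is a minimal sketch of point 2; the tier paths and the age threshold are purely invented, and a real implementation would live in the kernel or a union/overlay layer so the visible path never changes:

  # VM-style tiering: files not touched recently drift from the fast (flash)
  # tier to the slow (disk) tier.
  import os, shutil, time

  FAST, SLOW = "/fast-tier", "/slow-tier"        # hypothetical mount points
  AGE_LIMIT = 7 * 86400                          # demote after a week untouched (assumed)

  def demote_cold_files(now=None):
      now = now or time.time()
      for dirpath, _dirs, files in os.walk(FAST):
          for name in files:
              src = os.path.join(dirpath, name)
              if now - os.stat(src).st_atime > AGE_LIMIT:
                  dst = os.path.join(SLOW, os.path.relpath(src, FAST))
                  os.makedirs(os.path.dirname(dst), exist_ok=True)
                  shutil.move(src, dst)          # the overlay view stays the same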
I saw a quote from Ric Wheeler (EMC) from LSF-07 [my formatting]:
 the basic contract that storage systems make with the user is to guarantee that:
  •  the complete set of data will be stored,
  •  bytes are correct and
  •  in order, and
  •  raw capacity is utilized as completely as possible.
I disagree nowadays with his maximal space-utilisation clause for disk. When 2TB costs $150 (7.5c/GB) you can afford to waste a little here and there to optimise other factors.
With Flash Memory at $2-$5/GB, you don't want to go wasting much of that space.

Jim Gray (again!) early on formulated "the 5-minute rule", which needs rethinking, especially with cheap Flash Memory redefining the underlying engineering factors and ratios. These sorts of explicit engineering trade-off calculations have to be redone for the current disruptive changes in technology.
  • Gray, J., Putzolu, G.R. 1987. The 5-minute rule for trading memory for disk accesses and the 10-byte rule for trading memory for CPU time. SIGMOD Record 16(3): 395-398.
  • Gray, J., Graefe, G. 1997. The five-minute rule ten years later, and other computer storage rules of thumb. SIGMOD Record 26(4): 63-68.
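The 1997 formula is easy to restate; plugging in my own guesses at 2011-ish prices and access rates (the device prices, IOPS and DRAM price below are assumptions, not quotes) shows how far Flash moves the break-even point:

  # Break-even reference interval, as in Gray & Graefe 1997:
  # (PagesPerMBofRAM / AccessesPerSecondPerDevice) * (PricePerDevice / PricePerMBofDRAM)
  def break_even_seconds(pages_per_mb, ops_per_sec, device_price, ram_price_per_mb):
      return (pages_per_mb / ops_per_sec) * (device_price / ram_price_per_mb)

  PAGES_PER_MB = 256             # 4 KB pages
  RAM_PER_MB = 0.01              # ~$10/GB DRAM (assumed)

  disk = break_even_seconds(PAGES_PER_MB, 120, 150.0, RAM_PER_MB)      # 2 TB drive, ~120 IOPS
  flash = break_even_seconds(PAGES_PER_MB, 50_000, 400.0, RAM_PER_MB)  # PCI Flash card (guessed price)
  print(f"disk:  cache a page if re-used within ~{disk / 60:.0f} minutes")   # hours, not 5 minutes
  print(f"flash: cache a page if re-used within ~{flash / 60:.1f} minutes")  # back near minutes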
I think Wheeler's Storage Contract also needs to say something about 'preserving the data written', i.e. the durability and dependability of the storage system.
For how long? With what latency? How to express that? I don't know...

There is also the matter of "storage precision", already catered for with CDs and CD-ROMs. Wikipedia states:
The difference between sector size and data content are the header information and the error-correcting codes, that are big for data (high precision required), small for VCD (standard for video) and none for audio. Note that all of these, including audio, still benefit from a lower layer of error correction at a sub-sector level.
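To make that concrete, the standard sector layouts (figures from the CD specifications, quoted from memory, so treat as approximate) show the trade between precision and payload:

  # Every CD sector carries 2352 bytes off the media; how much is payload versus
  # header/error-correction depends on the precision required.
  RAW = 2352
  modes = {
      "Mode 1 data (CD-ROM)": 2048,   # full EDC/ECC: bit-perfect or nothing
      "Mode 2 Form 2 (VCD)":  2324,   # lighter protection, small glitches tolerable
      "Audio (Red Book)":     2352,   # no sector-level ECC, errors interpolated
  }
  for name, payload in modes.items():
      print(f"{name}: {payload} of {RAW} bytes payload "
            f"({RAW - payload} bytes of header/ECC, {payload / RAW:.1%} efficiency)")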
Again, I don't know how to express this, how to implement it, nor what a good user-interface would look like. What is very clear to me is:
  • Not all data needs to come back bit-perfect, though it is always nice when it does.
  • Some data we would rather not have, in whole or part, than come back corrupted.
  • There are many data-dependent ways to achieve Good Enough replay when that's acceptable.
First, the aspects of Durability and Precision need to be defined and refined, then a common File-system interface created and finally, like Virtual Memory, automated and executed without thought or human interaction.
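If I had to guess at a starting point, a per-file 'contract' might look something like the sketch below. It is purely hypothetical: nothing like this exists in current filesystem APIs, and every name and field here is invented.

  # Hypothetical per-file storage hints: how durable must the data be, and how
  # precise must replay be?
  from dataclasses import dataclass
  from enum import Enum

  class Precision(Enum):
      BIT_PERFECT = "every byte must come back exactly"
      GOOD_ENOUGH = "minor corruption tolerable (audio, video)"
      BEST_EFFORT = "return what survives, flag the gaps"

  @dataclass
  class StorageContract:
      retain_years: float          # how long the data must survive
      precision: Precision         # what 'correct' means on read-back
      max_read_latency_ms: float   # how quickly it must come back

  # e.g. family photos versus a scratch build directory
  photos  = StorageContract(retain_years=50,  precision=Precision.BIT_PERFECT, max_read_latency_ms=5000)
  scratch = StorageContract(retain_years=0.1, precision=Precision.BEST_EFFORT, max_read_latency_ms=1)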

This piece describes FileSystems, not Tabular Databases nor other types of Datastore.
The same disruptive technology problems need to be addressed within these realms.
Of course, it'd be nicer/easier if other Datastores were able to efficiently map to a common interface or representation shared with FileSystems and all the work/decisions happened in Just One Place.

Will that happen in my lifetime? Hmmmm....
