Sunday, January 08, 2012

Revolutions End II and The Memory Wall

The 2011 ITRS report uses, for the first time, the terms "ultimate Silicon scaling" and "Beyond CMOS". The definitive industry report is telling us that the end of the Silicon Revolution is in sight, but that won't be the end of the whole story. Engineers are very clever people and will find ways to keep the electronics revolution moving along, albeit at a much gentler pace.

In 2001, the ITRS report noted that CPUs would be hitting a Power Wall: they'd need to forgo performance (frequency) to fit within a constrained power envelope. Within a few years, Intel was shipping multi-core CPUs. Herb Sutter wrote about this in "The Free Lunch Is Over".

In the coming 2011 ITRS report, they write explicitly about "Solving the Memory Wall".
From 1987 through the Pentium 4 era, CPU clock frequency (the inverse of 'cycle time') increased faster than DRAM speed, with the gap widening by roughly 40% per year (~7%/year improvement for DRAM against ~50%/year for CPU chip frequency).
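
As a quick check on that arithmetic, here is a minimal sketch of how the two growth rates compound into a gap. The 50%/year and 7%/year figures are the ones quoted above; the function names are mine and purely illustrative.

```python
# Sketch: how two compounding growth rates produce a widening gap.
# The 50%/year and 7%/year figures come from the post; the function
# names are illustrative only.

def relative_growth(fast_rate, slow_rate):
    """Annual growth of the ratio fast/slow for two annual growth rates."""
    return (1 + fast_rate) / (1 + slow_rate) - 1


def gap_after(years, fast_rate=0.50, slow_rate=0.07):
    """Factor by which the fast quantity pulls ahead of the slow one."""
    return (1 + relative_growth(fast_rate, slow_rate)) ** years


if __name__ == "__main__":
    print(f"gap grows {relative_growth(0.50, 0.07):.1%} per year")
    for years in (5, 10, 15):
        print(f"after {years:2d} years the gap is {gap_after(years):.0f}x")
```

Note that the gap is the ratio of the two rates (1.50 / 1.07), not their difference, which is why it comes out near 40% rather than 43%.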

This is neatly solved, by trading latency for bandwidth, with caches.
The total memory bandwidth demand of multi-core CPUs doesn't just scale with the chip frequency (now ~5%/year growth), but also with the total number of cores accessing the cache (core counts grow at approx 40%/year).
Cache optimisation, the maximisation of cache "hit ratio", requires the largest cache possible. Hence Intel now has 3 levels of cache, with the "last level cache" being shared globally (amongst all cores).

The upshot of this is simple: to maintain good cache hit-ratios, cache size has to scale with the total demand for memory access, i.e. N-cores * chip freq.
To avoid excessive processor 'stall', waiting for the cache to be filled from RAM, the hit-ratio also has to increase as the speed differential increases: a higher chip frequency requires a faster average memory access time, which contributes a second factor of chip freq.
So the scaling of cache size is: (N-cores) * (chip freq)²
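
To put rough numbers on this, here is a small sketch. The growth rates (40%/year for cores, 5%/year for frequency) are the figures used above. The average-memory-access-time formula is the standard textbook model, which I'm assuming here rather than taking from the ITRS report, and the cycle counts are made up for illustration.

```python
# Sketch of the cache-scaling argument. Growth rates (40%/year cores,
# 5%/year frequency) are the post's figures; the AMAT model and the cycle
# counts below are assumptions added for illustration.

def cache_growth_per_year(core_growth=0.40, freq_growth=0.05):
    """Yearly growth factor of cache size if it scales as N-cores * freq^2."""
    return (1 + core_growth) * (1 + freq_growth) ** 2


def amat(hit_ratio, hit_time, miss_penalty):
    """Average memory access time, in CPU cycles (textbook model)."""
    return hit_time + (1 - hit_ratio) * miss_penalty


def required_hit_ratio(target_amat, hit_time, miss_penalty):
    """Hit ratio needed to hold the average access time at target_amat."""
    return 1 - (target_amat - hit_time) / miss_penalty


if __name__ == "__main__":
    print(f"cache size factor per year: {cache_growth_per_year():.4f}")
    # Hypothetical numbers: 1-cycle hit, 2-cycle target average, and a RAM
    # miss penalty that grows as the CPU/DRAM speed differential widens.
    for penalty in (100, 200, 400):
        ratio = required_hit_ratio(2.0, 1.0, penalty)
        print(f"miss penalty {penalty:3d} cycles -> hit ratio {ratio:.4f}")
```

The second loop is the point of the paragraph above: each doubling of the miss penalty halves the miss rate you can tolerate, so the cache has to work harder even if the core count stays fixed.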

The consequence:
cache memory has grown to dominate CPU chip layout, and its share will only increase.
But it's a little worse than that...
The capacity growth of DRAM has slowed to a doubling every 3-4 years.
In 2005, the ITRS report for the first time dropped DRAM as its "reference technology node", replacing it with Flash memory and CPUs.

DRAM capacity growth is falling behind total CPU chip memory demands.
Amdahl posited another law, for "Balanced Systems": each MIPS of processing requires 1MB of memory.
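
A rough sketch of the size of that shortfall follows. It assumes that memory demand tracks N-cores * chip freq (40%/year and 5%/year, as above) and that DRAM capacity doubles every 3 or 4 years. The function names, and the idea of expressing the deficit as a single ratio, are mine.

```python
# Sketch: DRAM capacity growth versus CPU memory demand.
# Assumptions (labelled, not from the ITRS report): demand grows as
# N-cores * chip freq, and Amdahl's rule is applied as 1 MB per MIPS.

def annual_rate_from_doubling(years_to_double):
    """Convert 'doubles every N years' into an annual growth rate."""
    return 2 ** (1 / years_to_double) - 1


def demand_growth(core_growth=0.40, freq_growth=0.05):
    """Annual growth of memory demand, taken as N-cores * chip freq."""
    return (1 + core_growth) * (1 + freq_growth) - 1


def balanced_memory_mb(mips, mb_per_mips=1.0):
    """Memory a 'balanced' system needs under Amdahl's rule of thumb."""
    return mips * mb_per_mips


def shortfall_after(years, years_to_double, core_growth=0.40, freq_growth=0.05):
    """Ratio of demand growth to DRAM capacity growth after `years`."""
    demand = (1 + demand_growth(core_growth, freq_growth)) ** years
    supply = 2 ** (years / years_to_double)
    return demand / supply


if __name__ == "__main__":
    print(f"demand grows {demand_growth():.0%} per year")
    for doubling in (3, 4):
        rate = annual_rate_from_doubling(doubling)
        gap = shortfall_after(10, doubling)
        print(f"DRAM doubling every {doubling} years = {rate:.0%}/year; "
              f"10-year shortfall {gap:.1f}x")
```

Under these assumptions DRAM capacity grows at something like 19-26% per year against demand growth in the high 40s, so a system that was "balanced" in Amdahl's sense drifts steadily out of balance.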

Another complicating factor is bandwidth limitations for "off-chip" transfers - including memory.
This is called "the Pin Bottleneck" (because external connections are notionally by 'pins' on the chip packaging). I haven't chased down the growth pattern of off-chip pins. The 2011 ITRS design report discusses it, along with the Memory Wall, as a limiting factor and a challenge to be solved.

As CPU memory demand, the modern version of "MIPS", increases, system memory sizes must scale similarly or the system becomes memory limited. That isn't a show-stopper in itself, because we invented Virtual Memory (VM) quite some time back to "impedance match" application memory demands with available physical memory.

The next performance roadblock is VM system performance, or VM paging rates.
VM systems have typically used Hard Disk (HDD) as their "backing store", but whilst HDD capacity has grown faster than any other system component (doubling every year since ~1990), latency, seek and transfer times have improved comparatively slowly, falling, relatively, behind CPU cycle times and memory demands by perhaps 50%/year (??).

For systems using HDD as their VM backing store, throughput will be adversely affected, even constrained, by the increasing RAM deficit.

There is one bright point in all this: Flash Memory has been doubling in capacity as fast as CPU memory demand, and improving in both speed (latency) and bandwidth.

So much so, that there are credible projects to create VM systems tailored to Flash.

So our commodity CPUs are evolving to look very similar to David Patterson's IRAM (Intelligent RAM): a single chip with RAM and processing cores.

Just how the chip manufacturers respond is the "$64-billion question".

Perhaps we should be reconsidering Herb Sutter's thesis:
Programmers have to embrace parallel programming and learn to create large, reliable systems with it to exploit future hardware evolution.