Monday, December 17, 2012

Storage: Active spares in RAID volumes

If a spare HDD sits in a chassis powered up and spinning, then the best use of the power you're already burning is to put the drive to work.

Sunday, November 18, 2012

Cross compiling i386-ELF Linux kernel on OS/X, Snow Leopard

This is NOT a tutorial on 'C', Software Development, the GNU toolchains, including 'make', or building a Linux kernel. That is assumed knowledge.

There's already good documentation on the Web for building Android (ARM) kernels on OS/X, and some tantalising, though incomplete, comments on building i386 kernels: "it took a while to setup, then was OK"...

Although OS/X runs on x86 (and x86_64), it won't build an x86 Linux kernel natively. You still need to cross-compile because Linux uses ELF (Executable and Linkable Format) and OS/X uses its own multi-CPU format, "Mach-O".

This environment variable must be set:
CROSS_COMPILE=i386-elf- [or the full path to your gcc tools, but only the common prefix]
Optionally, for a 32-bit build, you can also set:
I also added these directories to the beginning of my PATH so that gcc (HOSTCC) and i386-elf-gcc (the cross-compiler) are found:
The Linux kernel Makefile uses some trickery to provide verbose, quiet and silent modes; the default is "quiet". If, for debugging, you need to see the commands issued, set this additional environment variable:
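Pulling those pieces together, a sketch of the environment; the /opt/local path is from my Macports layout and may differ on your system. KBUILD_VERBOSE=1 is the setting that makes the build echo each command.

```shell
# Cross-build environment sketch for an i386 ELF kernel on OS/X.
# /opt/local/bin is where Macports put the i386-elf-* tools for me.
export CROSS_COMPILE=i386-elf-     # common prefix of the cross tools
export PATH=/opt/local/bin:$PATH   # catch gcc (HOSTCC) and i386-elf-gcc first
export KBUILD_VERBOSE=1            # have kbuild print every command it runs
```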
I chose to use the Macports-built 'gcc', not the OS/X-supplied compiler. Because the Macports i386-elf version of gcc has incorrect paths compiled in, I needed the two additional i386-elf directories.

Note: I had to make a symbolic link for i386-elf-gcc. The port "i386-elf-gcc @4.3.2_1" installed the full set of tools (as, ld, nm, strip, ...) into /opt/local/bin, but didn't install the shortname ('gcc'), only the long version name: i386-elf-gcc-4.3.2, which the Linux kernel Makefile doesn't cater for.
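The link itself is one command. The sketch below demonstrates it in a scratch directory; on the real system the directory is /opt/local/bin and you'd need sudo.

```shell
# Demonstrate the short-name link the kernel Makefile expects.
# On the real system: cd /opt/local/bin && sudo ln -s i386-elf-gcc-4.3.2 i386-elf-gcc
bindir=$(mktemp -d)
touch "$bindir/i386-elf-gcc-4.3.2"            # stand-in for the installed long name
ln -s i386-elf-gcc-4.3.2 "$bindir/i386-elf-gcc"
ls -l "$bindir/i386-elf-gcc"
```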

The 'Macports' project provides many GNU tools pre-built, with source. Generally, it's a good first thing to try. I reported multiple faults and found them unresponsive and less than helpful. YMMV.

The command is 'port', after the BSD tool of the same name. BSD delivered pure-source bundles; Macports does not. While Macports notionally updates itself, I had trouble with a major upgrade, which was initially available only as source and is now available as a binary upgrade.

There seems to be a bias towards newer OS/X environments. "Snow Leopard", Darwin 10.8.0, is now old. "Lion" and "Mountain Lion" have replaced it...

Ben Collins, 2010, has good notes, a working elf.h, and gcc-4.3.3 and binutils from Ubuntu Jaunty.
A comment suggests GNU sed is necessary, not the standard OS/X sed. I made this change.

Plattan Mattan's 2010 piece on using 'ports' to cross-compile to ARM is useful; it uses the OS/X gcc as HOSTCC.
The page suggests installing ports: 'port install libelf git-core'.
Then using git to clone the kernel source.

I installed ports:
gcc43 @4.3.6_7 (active)
i386-elf-binutils @2.20_0 (active)
i386-elf-gcc @4.3.2_1 (active)
libelf @0.8.13_2 (active)
Other useful pages:
Building GCC toolchain for ARM on Snow Leopard (using Macports)
Building i386-elf cross compiler and binutils on OS/X (from source).

I got a necessary hint from Alan Modra about i386-elf-as (the assembler) processing "/" as comments, not "divide" in macro expansions, as expected in the kernel source:
For compatibility with other assemblers, '/' starts a comment on the i386-elf target.  So you can't use division.  If you configure for i386-linux (or any of the bsds, or netware), you won't have this problem.
Do NOT select "a.out" as an executable file format in your .config. The i386 processor isn't defined for it, so the compile fails with "SEGMNT_SIZE" not defined.

I downloaded a bzipped tar file of linux- ("Full Source") for my testing. It's always a good idea to check the MD5 of any download, if available.
I wanted a stable, older kernel to test with. Your needs will vary.
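Checking a download's MD5 is one line. A sketch, with a stand-in filename; note OS/X ships `md5` where Linux has `md5sum`:

```shell
# Checksum a (stand-in) downloaded tarball; compare the printed value
# against the published MD5 by eye or with grep.
cd "$(mktemp -d)"
printf 'pretend kernel tarball' > linux-demo.tar.bz2
md5sum linux-demo.tar.bz2        # on OS/X: md5 linux-demo.tar.bz2
```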

I didn't run into the "malloc.h" problem noted by Plattan; it seemed to come with libelf.

I made three sets of changes to the standard Linux Makefile, shown as the '+' lines in the diff below (which can be applied as a patch):
mini-too:linux- steve$ diff -u ../saved/Makefile.dist Makefile
--- ../saved/Makefile.dist 2012-08-21 04:45:22.000000000 +1000
+++ Makefile 2012-11-20 14:10:46.000000000 +1100
@@ -231,7 +231,7 @@
 HOSTCC       = gcc
 HOSTCXX      = g++
-HOSTCFLAGS   = -Wall -Wmissing-prototypes -Wstrict-prototypes -O2 -fomit-frame-pointer
+HOSTCFLAGS   = -Wall -Wmissing-prototypes -Wstrict-prototypes -O2 -fomit-frame-pointer -idirafter /opt/local/include 
 # Decide whether to build built-in, modular, or both.
@@ -335,10 +335,10 @@
     -Wbitwise -Wno-return-void $(CF)
+AFLAGS_MODULE   = $(MODFLAGS)  -Wa,--divide 
 LDFLAGS_MODULE  = -T $(srctree)/scripts/
+AFLAGS_KERNEL =  -Wa,--divide 
 CFLAGS_GCOV = -fprofile-arcs -ftest-coverage
@@ -354,6 +354,7 @@
      -fno-strict-aliasing -fno-common \
      -Werror-implicit-function-declaration \
      -Wno-format-security \
+     -isystem /opt/local/i386-elf/include -idirafter /opt/local/lib/gcc/i386-elf/4.3.2/include/ -idirafter /usr/include -idirafter /usr/include/i386 \
I created the required "elf.h", not supplied in port libelf, in /opt/local/include, specified above in HOSTCFLAGS:
mini-too:linux- steve$ cat /opt/local/include/elf.h 
/* @(#) $Id: $ */

#ifndef _ELF_H
#define _ELF_H
#include <libelf/gelf.h>
/* */

#define R_ARM_NONE        0
#define R_ARM_PC24        1
#define R_ARM_ABS32       2
#define R_MIPS_NONE       0
#define R_MIPS_16         1
#define R_MIPS_32         2
#define R_MIPS_REL32      3
#define R_MIPS_26         4
#define R_MIPS_HI16       5
#define R_MIPS_LO16       6

/* from /opt/local/libexec/llvm-3.1/include/llvm/Support/ELF.h */
/* or */

#define R_386_NONE      0
#define R_386_32        1
#define R_386_PC32      2
#define R_386_GOT32     3
#define R_386_PLT32     4
#define R_386_COPY      5
#define R_386_GLOB_DAT  6
#define R_386_JMP_SLOT  7 /* was R_386_JUMP_SLOT */
#define R_386_RELATIVE  8
#define R_386_GOTOFF    9  
#define R_386_GOTPC     10 
#define R_386_32PLT     11 
#define R_386_TLS_TPOFF 14 
#define R_386_TLS_IE    15 
#define R_386_TLS_GOTIE 16 
#define R_386_TLS_LE    17 
#define R_386_TLS_GD    18 
#define R_386_TLS_LDM   19 
#define R_386_16        20 
#define R_386_PC16      21 
#define R_386_8         22
#define R_386_PC8       23 
#define R_386_TLS_GD_32 24 
#define R_386_TLS_GD_PUSH       25         
#define R_386_TLS_GD_CALL       26      
#define R_386_TLS_GD_POP        27         
#define R_386_TLS_LDM_32        28      
#define R_386_TLS_LDM_PUSH      29         
#define R_386_TLS_LDM_CALL      30      
#define R_386_TLS_LDM_POP       31         
#define R_386_TLS_LDO_32        32      
#define R_386_TLS_IE_32 33                 
#define R_386_TLS_LE_32 34                 
#define R_386_TLS_DTPMOD32      35
#define R_386_TLS_DTPOFF32      36
#define R_386_TLS_TPOFF32       37
#define R_386_TLS_GOTDESC       39
#define R_386_TLS_DESC_CALL     40      
#define R_386_TLS_DESC  41                 
#define R_386_IRELATIVE 42 
#define R_386_NUM       43 

#endif /* _ELF_H */
I didn't try specifying all variables on the command line when invoking make. This might work, though it's incomplete (only 2 of the 3 changes):
$ make ARCH=x86 CROSS_COMPILE=i386-elf- HOSTCFLAGS="-idirafter /Users/steve/src/linux/linux-" AFLAGS_KERNEL="-Wa,--divide"
It took me some time to figure out how to browse the git repository on-line for my specific kernel, to investigate the change history of a specific file. It's worth taking the time to learn this.

I was not able to figure out how to get the GNU make on OS/X to show me the full commands it was about to execute. "make -d" spits out a mountain of detail on what make itself is doing, but not the commands. "make -n" is for dry runs and prints commands, though I'm not sure that's exactly what is run later. See KBUILD_VERBOSE.

It also seems impossible to ask gcc to tell you what directory/ies it's using for the system include files.
Whilst the Macports gcc works, it uses /usr/include, the default OS/X gcc directory and creates some additional header files.

Current status: 19-Nov-2012. Failing in drivers/gpu with include files missing. Which "CC" is being used?
  CC      drivers/gpu/drm/drm_auth.o
In file included from include/drm/drmP.h:75,
                 from drivers/gpu/drm/drm_auth.c:36:
include/drm/drm.h:47:24: error: sys/ioccom.h: No such file or directory
include/drm/drm.h:48:23: error: sys/types.h: No such file or directory
Final status: 20-Nov-2012. Builds; booting the kernel remains untested.
  BUILD   arch/x86/boot/bzImage
Root device is (14, 1)
Setup is 12076 bytes (padded to 12288 bytes).
System is 3763 kB
CRC d747d6db
Kernel: arch/x86/boot/bzImage is ready  (#1)
  Building modules, stage 2.
  MODPOST 2 modules
  CC      arch/x86/kernel/test_nx.mod.o
  LD [M]  arch/x86/kernel/test_nx.ko
  CC      drivers/scsi/scsi_wait_scan.mod.o
  LD [M]  drivers/scsi/scsi_wait_scan.ko

real 11m5.195s
user 8m31.722s
sys 1m50.243s
A number of header errors (missing or duplicate & incompatible declarations) were not solved, but sidestepped by unselecting the problem areas in the .config file (details below).

Additional changes:
Create the config file with:
make defconfig
This produces a .config that can be edited or patched.

Summary of subsystem items unselected in .config afterwards:
# CONFIG_SUSPEND is not set
# CONFIG_ACPI is not set
# CONFIG_INET_LRO is not set
# CONFIG_AGP is not set
# CONFIG_DRM is not set
You might try saving the patch below (a 'diff -u' of the above defconfig result against mine) and applying it to the .config file as a patch: patch .config defconfig.patch
--- ../saved/defconfig 2012-11-19 23:20:04.000000000 +1100
+++ .config 2012-11-19 23:32:49.000000000 +1100
@@ -1,7 +1,7 @@
 # Automatically generated make config: don't edit
 # Linux kernel version:
-# Mon Nov 19 18:57:25 2012
+# Mon Nov 19 15:53:57 2012
 # CONFIG_64BIT is not set
@@ -390,7 +390,6 @@
 # CONFIG_HZ_100 is not set
@@ -401,7 +400,6 @@
-# CONFIG_KEXEC_JUMP is not set
@@ -418,45 +416,11 @@
 # CONFIG_PM_VERBOSE is not set
+# CONFIG_SUSPEND is not set
 # CONFIG_PM_RUNTIME is not set
-# CONFIG_ACPI_DEBUG is not set
-# CONFIG_ACPI_PCI_SLOT is not set
-# CONFIG_ACPI_SBS is not set
+# CONFIG_ACPI is not set
 # CONFIG_SFI is not set
-# CONFIG_APM is not set
 # CPU Frequency scaling
@@ -479,11 +443,8 @@
 # CPUFreq processor drivers
-# CONFIG_X86_PCC_CPUFREQ is not set
 # CONFIG_X86_POWERNOW_K6 is not set
 # CONFIG_X86_POWERNOW_K7 is not set
-# CONFIG_X86_POWERNOW_K8 is not set
 # CONFIG_X86_GX_SUSPMOD is not set
 # CONFIG_X86_SPEEDSTEP_ICH is not set
@@ -491,7 +452,6 @@
 # CONFIG_X86_P4_CLOCKMOD is not set
 # CONFIG_X86_CPUFREQ_NFORCE2 is not set
 # CONFIG_X86_LONGRUN is not set
-# CONFIG_X86_LONGHAUL is not set
 # CONFIG_X86_E_POWERSAVER is not set
@@ -513,9 +473,7 @@
-# CONFIG_DMAR is not set
@@ -528,7 +486,6 @@
 # CONFIG_PCI_STUB is not set
 # CONFIG_PCI_IOV is not set
 # CONFIG_ISA is not set
 # CONFIG_MCA is not set
@@ -556,7 +513,6 @@
@@ -564,7 +520,7 @@
 # Executable file formats / Emulations
 # CONFIG_BINFMT_AOUT is not set
@@ -610,7 +566,7 @@
+# CONFIG_INET_LRO is not set
 # CONFIG_INET_DIAG is not set
 # CONFIG_TCP_CONG_BIC is not set
@@ -671,11 +627,11 @@
@@ -853,13 +809,6 @@
 # CONFIG_MTD is not set
 # CONFIG_PARPORT is not set
-# Protocols
 # CONFIG_BLK_DEV_FD is not set
 # CONFIG_BLK_CPQ_DA is not set
@@ -950,7 +899,6 @@
 # CONFIG_SATA_SIL24 is not set
@@ -969,7 +917,6 @@
 # CONFIG_SATA_VIA is not set
 # CONFIG_SATA_INIC162X is not set
-# CONFIG_PATA_ACPI is not set
 # CONFIG_PATA_ALI is not set
 # CONFIG_PATA_ARTOP is not set
@@ -1062,7 +1009,6 @@
 # CONFIG_EQUALIZER is not set
 # CONFIG_TUN is not set
 # CONFIG_VETH is not set
-# CONFIG_NET_SB1000 is not set
 # CONFIG_ARCNET is not set
@@ -1364,7 +1310,6 @@
@@ -1372,7 +1317,6 @@
 # CONFIG_INPUT_CM109 is not set
 # Hardware I/O ports
@@ -1420,7 +1364,6 @@
 # CONFIG_SERIAL_8250_CS is not set
@@ -1464,8 +1407,6 @@
 # CONFIG_NSC_GPIO is not set
 # CONFIG_CS5535_GPIO is not set
 # CONFIG_RAW_DRIVER is not set
-# CONFIG_HPET_MMAP is not set
 # CONFIG_TCG_TPM is not set
 # CONFIG_TELCLOCK is not set
@@ -1475,7 +1416,6 @@
 # CONFIG_I2C_CHARDEV is not set
 # I2C Hardware Bus support
@@ -1500,11 +1440,6 @@
 # CONFIG_I2C_VIAPRO is not set
-# ACPI drivers
-# CONFIG_I2C_SCMI is not set
 # I2C system bus drivers (mostly embedded / system-on-chip)
 # CONFIG_I2C_OCORES is not set
@@ -1625,12 +1560,6 @@
 # CONFIG_SENSORS_LIS3_I2C is not set
-# ACPI drivers
-# CONFIG_SENSORS_ATK0110 is not set
-# CONFIG_SENSORS_LIS3LV02D is not set
@@ -1713,42 +1642,19 @@
 # Graphics support
-# CONFIG_AGP_ALI is not set
-# CONFIG_AGP_ATI is not set
-# CONFIG_AGP_AMD is not set
-# CONFIG_AGP_NVIDIA is not set
-# CONFIG_AGP_SIS is not set
-# CONFIG_AGP_SWORKS is not set
-# CONFIG_AGP_VIA is not set
+# CONFIG_AGP is not set
-# CONFIG_DRM_TDFX is not set
-# CONFIG_DRM_R128 is not set
-# CONFIG_DRM_RADEON is not set
-# CONFIG_DRM_I810 is not set
-# CONFIG_DRM_I830 is not set
-# CONFIG_DRM_I915_KMS is not set
-# CONFIG_DRM_MGA is not set
-# CONFIG_DRM_SIS is not set
-# CONFIG_DRM_VIA is not set
-# CONFIG_DRM_SAVAGE is not set
+# CONFIG_DRM is not set
 # CONFIG_VGASTATE is not set
 # CONFIG_FB_DDC is not set
@@ -1773,13 +1679,11 @@
 # CONFIG_FB_VGA16 is not set
 # CONFIG_FB_UVESA is not set
 # CONFIG_FB_VESA is not set
 # CONFIG_FB_N411 is not set
 # CONFIG_FB_HGA is not set
 # CONFIG_FB_S1D13XXX is not set
 # CONFIG_FB_NVIDIA is not set
 # CONFIG_FB_RIVA is not set
-# CONFIG_FB_I810 is not set
 # CONFIG_FB_LE80578 is not set
 # CONFIG_FB_MATROX is not set
 # CONFIG_FB_RADEON is not set
@@ -2241,30 +2145,12 @@
 # CONFIG_STAGING is not set
-# CONFIG_ACER_WMI is not set
-# CONFIG_ASUS_LAPTOP is not set
-# CONFIG_TC1100_WMI is not set
-# CONFIG_MSI_LAPTOP is not set
-# CONFIG_SONY_LAPTOP is not set
-# CONFIG_ACPI_WMI is not set
-# CONFIG_ACPI_ASUS is not set
-# CONFIG_ACPI_CMPC is not set
 # Firmware Drivers
 # CONFIG_EDD is not set
 # CONFIG_DELL_RBU is not set
 # CONFIG_DCDBAS is not set
@@ -2604,7 +2490,6 @@
-# CONFIG_IMA is not set
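The mechanics of applying such a saved diff can be sketched on a scratch file; everything here is a one-line stand-in for the real .config, which you would patch in the kernel source tree after 'make defconfig'.

```shell
# Apply a saved unified diff to .config (scratch stand-in).
cd "$(mktemp -d)"
printf 'CONFIG_ACPI=y\n' > .config
cat > defconfig.patch <<'EOF'
--- defconfig
+++ .config
@@ -1 +1 @@
-CONFIG_ACPI=y
+# CONFIG_ACPI is not set
EOF
patch .config defconfig.patch   # same invocation as in the text
grep 'CONFIG_ACPI' .config      # the option is now unselected
```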

Saturday, August 11, 2012

Man-in-the-Middle Port Protection

I'm wondering if this idea could work as a generic solution to the "OMG, the printer's been hacked!" problem.

There is a very large class of important legacy IP devices that can be easily compromised and either used as "zombies" in a bot-net or as relay devices, or that, like infrastructure controllers, are high-value targets for disruptive hackers.

What they share in common is:
  • their software can't be fixed or upgraded to current "hardness" levels and/or support current security protocols, like 802.1x, and
  • full replacement, or "fork-lift upgrade", is deemed unwarranted or infeasible.
The "Man-in-the Middle" (MitM) single ethernet port protection solution comes in two parts:
  • a dongle or wall-wart to connect in front of the printer, perhaps with Power-over-Ethernet (PoE) and
  • a central server/filter/firewall that the dongle(s) connect back to.
The dongle needs two ethernet connections and the software has to learn two things:
  • up-stream/down-stream traffic flows, which side is the printer, which is the network, and
  • the IP number + MAC address of the device.
Its job is to transparently take over the IP number and MAC address of the printer.

First it must gain itself a second IP number for its work and authenticate itself to the network if 802.1x is in use.

Having done that, it establishes a secure tunnel, via SSL or SSH, back to the central security device where the real security measures are taken.
The central security device can be implemented as a cloud of local devices, centrally managed.

The device can be mimicked either locally or centrally, depending on network configuration, latency and traffic-volume concerns, and the devices available. Note that because the protected device is behind a dongle, it will never be seen 'bare' on the network again, so conflicts will not arise. To the printer or device-under-control, no change in the network or environment will be discernible. To all other devices on the network, the printer or device-under-control will appear to have had a firmware upgrade, but will otherwise be identical.

The first option is for all network traffic destined for the printer to be shunted back to the central secure device, modulo trivial ipfilter rules. This includes preventing any unauthorised outbound traffic, or even directing all outbound packets through the central security device for Intrusion Detection analysis.

Once the traffic is back at the central secure device, it can be properly inspected and cleaned, then turned around on the same SSL/SSH tunnel.

Second option: the central secure device assumes both the IP number and MAC address of the device-under-control by advertising its IP and the same MAC address. It can also provide 802.1x-client facilities.

The central secure device then forwards only valid traffic to the dongle at the printer over SSL/SSH, and response traffic is tunnelled back and inspected over the same path.

The difference is where the impersonated IP number/MAC address now appears in the network:
  • either exactly where it always has been, or
  • in a central location.
A 2-port version of the Raspberry Pi would be able to do this, plus some tinkering in a firewall appliance.

If very short UTP leads are used on the MitM dongle, it will be very difficult to remove from the printer or device under control.

The Use Case is simple:
  • get new secure server
  • install 802.1x certificates for printers/devices-under-control in the secure server
  • install 802.1x certs in dongles, setup in DHCP, don't have to be static IP numbers.
  • register SSH or SSL keys of the dongle with the secure server
  • check locally the dongle works and correctly controls a test device
  • go to printer, install dongle, check it works. Requires normal traffic and a scan for vulnerabilities.
  • there might be some outage as the double shuffle happens; the MAC address may now appear elsewhere in the network
  • not sure how the real-time swap might affect existing IP connections. If the disruption in traffic flow is under the TCP disconnect window, it won't be noticed. It's not an option to leave existing connections uninspected; they could be a botnet control channel.
Do these things already exist? I don't know.
I can't believe something this simple hasn't already been built.

There are 2½ variants that might also be interesting.
For variant #2, I'm not sure what can already be done in switch software.

1. Plug 2-port MitM dongle directly into switch, capture all packets at the switch, not at the printer.

1A. Plug a multi-port MitM dongle into switch so that it acts like a switch itself, having one up-stream link via the switch, and provides multiple ports for controlled devices to be connected to. Needs the host switch to allow multiple MAC's per port.

Problem with remote dongle:
The printer/device-under-control can be relocated (physically, or its connection moved) and so lose protection/filtering.
2. High-end Switches have "Port mirroring" software, can that be used or modified for the MitM packet redirection?
  • Port-mirroring sends copies of ingress and egress packets to another port, even on another switch.
  • The remote Filter/Firewall (FF) needs two ports, A & B: one ingress, one egress.
    • Network egress traffic is redirected to Port-A of the FF, even on another VLAN.
    • Network Ingress traffic is instead received from Port-A of the FF.
    • Port-B traffic of the FF is sent back as egress traffic of the controlled port, and
    • ingress traffic of the controlled port is sent to Port-B traffic of the FF.
  • This achieves logically what the physical dongle+physical patch leads achieved, passing all traffic via an MitM Filter/Firewall.
2A. Multi-port filter/firewall.
  • Can switch software multiplex multiple MAC addresses onto two ports, serving multiple devices under control with the same 2-port hardware, or
  • does it need a pair of ports on the filter/firewall for each device under control, or
  • 1 upstream link to capture all redirected traffic, and 1 ethernet port for each device under control.

Thursday, June 07, 2012

The New Storage Hierarchy and 21st-Century Databases

The new hardware capabilities and cost/performance characteristics of storage and computer systems mean there has to be a radical rethink of how databases work and are organised.

The three main challenges I see are:
  • SSD and PCI Flash memory with "zero" seek time,
  • affordable Petabyte HDD storage, and
  • object-based storage replacing "direct attach" devices.
These technical tradeoff changes force these design changes:
  • single record access time is no longer dominated by disk rotations, old 'optimisations' are large, costly, slow and irrelevant,
  • the whole "write region" can be held in fast memory changing cache requirements and design,
  • Petabyte storage allows "never delete" of datasets, which poses new problems:
    • how does old and new data get physically organised?
    • what logical representations can be used to reduce queries to minimal collections?
    • how does the one datastore support conflicting use types? [real-time transactions vs data warehouse]
    • How are changed Data Dictionaries supported?
    • common DB formats are necessary as the lifetime of data will cover multiple products and their versions. 
  • Filesystems and Databases have to use the same primitives and use common tools for backups, snapshots and archives.
    • As do higher order functions/facilities:
      • compression, de-duplication, transparent provisioning, Access Control and Encryption
      • Data Durability and Reliability [RAID + geo-replication]
  • How is security managed over time with unchanging datasets?
  • How is Performance Analysis and 'Tuning' performed?
  • Can Petabyte datasets be restored or migrated at all?
    • DB's must continue running without loss or performance degradation as the underlying storage and compute elements are changed or re-arranged.
  • How is expired data 'cleaned' whilst respecting/enforcing any legal caveats or injunctions?
  • What data are new Applications tested against?
    • Just a subset of "full production"? [doesn't allow Sizing or Performance Testing]
    • Testing and Developing against "live production" data is extremely unwise [unintended changes/damage] or a massive security hole. But when there's One DB, what to do?
  • What does DB roll-back and recovery mean now? What actions should be expected?
    • Is "roll-back" or reversion allowable or supportable in this new world?
    • Can data really be deleted in a "never delete" dataset?
      • Is the Accounting notion of "journal entries" necessary?
    • What happens when logical inconsistencies appear in geo-diverse DB copies?
      • can they be detected?
      • can they ever be resolved?
  • How do these never-delete DB's interface or support corporate Document and Knowledge Management systems?
  • Should summaries ever be made and stored automatically, given the many privacy and legal data-retention laws, regulations and policies around?
  • How are conflicting multi-jurisdiction issues resolved for datasets with wide geo-coverage?
  • How are organisation mergers accomplished?
    • Who owns what data when an organisation is de-merged?
    • Who is responsible for curating important data when an organisation disbands?
XML is not the answer: it is a perfectly good self-describing data-interchange format, but not an internal DB format.
Redesign and adaption is needed at three levels:
  • Logical Data layout, query language and Application interface.
  • Physical to Logical mapping and supporting DB engines.
  • Systems Configuration, Operations and Admin.
We now live in a world of VM's, transparent migration and continuous uninterrupted operations: DB's have to catch up.

They also have to embrace the integration of multiple disparate data sources/streams as laid out in the solution Jerry Gregoire created for Dell in 1999 with his "G2 Strategy":
  • Everything should be scalable through the addition of servers.
  • Principal application interface should be a web browser.
  • Key programming with Java or Active X type languages.
  • Message brokers used for application interfacing.
  • Technology selection on an application-by-application basis.
  • Databases should be interchangeable.
  • Extend the life of legacy systems by wrapping them in a new interface.
  • Utilize "off the shelf systems" where appropriate.
  • In-house development should rely on object-based technology - new applications should be made up of proven object puzzle pieces.
Data Discovery, Entity Semantics with range/limits (metadata?) and Rapid/Agile Application development are critical issues in this new world.

Tuesday, February 14, 2012

charging a Nokia phone (C2-01) from USB. Need enough power.

[This piece is a place-marker for people searching on "how to" charge from USB.
 Short answer: plug in any modern phone and it's supposed to "Just Work".]

A couple of weeks ago I bought an unlocked Nokia C2-01 from local retailer Dick Smith's.

I wanted bluetooth, 3G capability and got a direct micro-USB (DC-6) connector too.
I bought a bluetooth "handsfree" for my car as well. It came with a cigarette-lighter charger with a USB socket and a USB mini (not micro) cable.

I remembered that all mobiles sold in the European Union were mandated to use a USB charger [MoU in 2009, mandate later] and thought I'd be able to use the car-charger for everything: phone, camera, ...

The supplied 240V external phone charger worked well for the C2-01.
But I couldn't get it to charge from the in-car USB charger.

It turns out the handsfree's charger could only supply 400mA, not the 500mA of the USB standard. I bought another in-car USB charger from Dick Smith's: it works fine with both.

What had confused me was that the phone wouldn't charge when I tested it with my (old) powered USB hub. Is the hub old and tired, or was the phone already fully charged? I need to test that properly.

I hadn't tried it with my Mac Mini, jumping to the unwarranted conclusion that "this phone doesn't do USB charging". When tested, it worked fine directly with the Mac...

There's one little wrinkle.
Devices like the iPad that charge from USB draw more than 500mA (700mA?), which Macs can supply, but which is more than the standard. [This is why some USB adaptors for portable HDD's have two USB-A connectors.]

I know a higher current has been specified for USB, but I can't remember the variant. Is it just the new "USB 3", or the current "USB 2" as well?

Sunday, February 12, 2012

shingled write disks: bad block 'mapping' not A Good Idea

Shingled-write disks can't update sectors in-place. Plus they are likely to have sectors larger than the current 2KB. [8KB?]

The current bad-block strategy of rewriting blocks in another region of the disk is doubly flawed:
  • extremely high-density disks should be treated as large rewritable Optical Disks. They are great at "seek and stream", but have exceedingly poor GB/access/sec ratios. Forcing the heads to move whilst streaming data affects performance radically and should be avoided to achieve consistent/predictable good performance.
  • Just where and how should the spare blocks be allocated?
    Not the usual "end of the disk", which forces worst case seeks.
    "In the middle" forces long seeks, which is better, but not ideal.
    "Close by", i.e. for every few shingled-write bands or regions, include spare blank tracks (remembering they are 5+ shingled-tracks wide).
My best strategy, in-place bad-block identification and avoidance, is two-fold:
  • It assumes large shingled-write bands/regions: 1-4GB.
  • Use a 4-16GB Flash memory as a full-region buffer, and perform continuous shingled-writes in a band/region. This allows the use of CD-ROM style Reed-Solomon product codes to cater for/correct long burst errors at low overhead.
  • After write, reread the shingle-write band/region and look for errors or "problematic" recording (low read signal), then re-record. The new write stream can put "synch patterns" in the not to be used areas, the heads spaced over problematic tracks or the track-width widened for the whole or part of the band/region.
This moves the cost of bad-blocks from read-time to write-time. It potentially slows the sequential write speed of the drive, but are you writing to a "no update-in-place" device for speed? No. You presumably also want the best chance possible of retrieving the data later on.

Should the strategy be tunable for the application? I'm not sure.
Firmware size and complexity must be minimal for high-reliability and low defect-rates. Only essential features can be included to achieve this aim...

Monday, February 06, 2012

modern HDD's: No more 'cylinders'

Another rule busted into a myth. How does this affect File Systems, like the original Berkeley Fast File System?

There's a really interesting piece of detail in a 2010 paper on "Shingled Writes".

Cylinder organisations are no longer advantageous.
It's faster to keep writing on the one surface than to switch heads.

With very small feature/track sizes, the time taken for a head-switch is large. The new head isn't automatically 'on track', it has to find the track... "settling time".

Bands consist of contiguous tracks on the same surface.
 At first glance, it seems attractive to incorporate parallel tracks on all surfaces (i.e., cylinders) into bands.

 However, a switch to another track in the same cylinder takes longer than a seek to an adjacent track:
 thermal differences within the disk casing may prevent different heads from hovering over the same track in a cylinder.

 To switch surfaces, the servo mechanism must first wait for several blocks of servo information to pass by to ascertain its position before it can start moving the head to its desired position exactly over the
desired track.

 In contrast, a seek to an adjacent track starts out from a known position.

 In both cases, there is a settling time to ensure that the head remains solidly over the track and is not oscillating over it.

 Because of this difference in switching times, and contrary to traditional wisdom regarding colocation within a cylinder, bands are better constructed from contiguous tracks on the same surface.

Saturday, February 04, 2012

Intra-disk Error Correction: RAID-4 in shingled-write drives

High density shingled-write drives cannot succeed without especial attention being paid to Error Correction, not just error detection.
Sony/Philips realised this when developing the Compact Digital Audio Disk (CD) around 1980, and then again in 1985 with the "Yellow Book" CD-ROM standard for data-on-CD. The intrinsic bit error rate of ~1 in 10^5 becomes "infinitesimal", to quote one tutorial, with burst errors of ~4,000 bits corrected by the two lower layers.

Error rates and sensitivity to defects increase considerably as feature sizes reach their limit. The 256Kbit DRAM chips took years to come into production after 64Kbit chips because manufacturing yields were low. Almost every chip worked well enough, but had some defects causing it to be failed in testing. The solution was to overbuild the chips and swap defective columns with spares during testing.

Shingled-write disks, with their "replace whole region, never update-in-place", allow for a different class of Error Protection. RAID techniques with fixed parity disks seem a suitable candidate when individual sectors are never updated. Network Appliance very successfully leveraged this with their WAFL file system.

That shingled-write disks require good Error Correction should be without dispute.
What type of ECC (Error Correcting Code) to choose is an engineering problem based on the expected types of errors and the level of Data Protection required. I've previously written that for backup and archival purposes, the probable main uses of shingled-write disks, bit error rates of 1 in 10^60 should be a minimum.

One of the advantages of shingled-write disks, is that each shingled-write region can be laid down in one go from a Flash memory buffer.
It can then be re-read and rewritten catering for the disk characteristics found:
  • excessive track cross-talk,
  • writes affected by excessive head movement (external vibration),
  • individual media defects or moving contamination,
  • areas of poor media, and
  • low signal or high signal-to-noise ratio due to age, wear or production variations.
Depending on the application, multiple rewrites may be attempted.
It would even be possible, given spare write-regions, for drives to periodically read and rewrite all data to new areas. This is fraught: the extra "duty cycle" will decrease drive life, and if the drive finds uncorrectable errors while the attached host(s) aren't addressing it, what should be done?

Reed-Solomon encoding is well proven in Optical Disks (CD, CD-ROM and DVD) and is probably in use now for 2KB-sector disks.
Reed-Solomon codes can be "tuned" to the application: the amount of parity overhead can be varied, and other techniques, like scrambling, can be combined with them in Product Codes.

R-S codes have a downside: complexity of encoders and decoders.
[This can mean speed and throughput as well. Some decoding algorithms require multiple passes to correct all errors.]

For a single platter shingled-write drive, Error Correcting codes (e.g. Reed-Solomon) are the only option to address long burst errors caused by recording drop-outs.

For multi-platter shingled-write disks, another option is possible:
 RAID-4, or block-wise parity (XOR) on a dedicated drive (in this case, 'surface').

2.5 in drives can have 2 or 3 platters, i.e. 4 or 6 surfaces.
Dedicating one surface to parity gives 25% and 16.7% overhead respectively, higher than the ~12.5% Reed-Solomon overhead in the top layer of CD-ROMs.
With 4 platters, or 8 surfaces, overhead is 12.5%, matching that of CD-ROM, layer 3.

XOR parity generation and checking is fast, efficient and well understood; this is its attraction.
But despite a large overhead, it:
  • can at best only correct a single sector in error, fails on two dead sectors in the sector set,
  • relies on the underlying layer to flag drop-outs/erasures, and
  • relies on the CRC check to be perfect.
If the raw bit error rate is 1 in 10^14, then with 2KB sectors the probability of any sector having an uncorrected error is 6.25 x 10^-9.
The probability of two sectors in a set being in error is:
6.25 x 10^-9 * 6.25 x 10^-9 = 4 x 10^-17

This is well below what CD-ROM achieves.
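A quick sketch checking the arithmetic above (the per-sector figure is taken as given from the text, and the two sector failures are assumed independent):

```python
# Back-of-envelope check of the RAID-4 double-failure probability above.
# The per-sector error figure is the post's; failures assumed independent.
p_sector = 6.25e-9           # probability a given 2KB sector has an uncorrected error
p_double = p_sector ** 2     # two independent failed sectors in one parity set
print(f"p(double sector failure) = {p_double:.1e}")   # ~3.9e-17
```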
But, to give intra-disk RAID-4 its due:
  • corrects a burst error of 16,000 bits - four times the CD limit,
  • will correct every fourth sector on each surface, and
  • is deterministic in speed, whereas Reed-Solomon decoding algorithms can require multiple passes to fully correct all data.
I'm thinking the two schemes could be used together and would complement each other.
Just how, I'm not yet sure. A start would be to group together 5-6 sectors with a shared ECC, in an attempt to limit the number of ganged failed sector reads in a RAID'd sector set.
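To make the RAID-4 idea concrete, here is a minimal sketch of block-wise XOR parity across surfaces. The sector contents and the erasure flag are purely illustrative; a real controller would work on whole sectors with CRCs:

```python
# Minimal sketch of intra-disk RAID-4: one surface holds the XOR
# of the data surfaces, so a single flagged-bad sector is recoverable.

def parity(sectors):
    """XOR all sectors together to form the parity sector."""
    out = bytearray(len(sectors[0]))
    for s in sectors:
        for i, b in enumerate(s):
            out[i] ^= b
    return bytes(out)

def recover(sectors, parity_sector, lost_index):
    """Rebuild a single flagged-bad sector from the survivors plus parity.
    Relies on the lower layer telling us WHICH sector dropped out."""
    survivors = [s for i, s in enumerate(sectors) if i != lost_index]
    return parity(survivors + [parity_sector])

data = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]   # 3 data surfaces
p = parity(data)
assert recover(data, p, lost_index=1) == data[1]  # single erasure: recoverable
```

Note that, exactly as the bullet list says, this only works when the lost sector is identified by the layer below; two dead sectors in the set defeat it.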

Multiple arms/actuators and shingled-write drives

Previously, I've written on last-gen (Z-gen) shingled-write drives and mentioned multiple independent arms/actuators:
  • separating "heavy" heads (write + heater) from lightweight read-only, and
  • using dual sets of heads, either read-only or read-write.
There is some definitive work by Dr. Sudhanva Gurumurthi of the University of Virginia and his students on using multiple arms/actuators in current drives, especially those with variable angular velocity - not Constant Angular Velocity, but nearer the Constant Linear Velocity approach used in Optical drives.
E.g. "Intra-disk Parallelism" thesis and "Energy-Efficient Storage Systems" page.

The physics are good and calculations impressive, but where is the commercial take-up?
Extra arms/actuators and drive/head electronics are expensive, plus need mounting area on the case.
What's the "value proposition" for the customer?

Both manufacturers and consumers have to be convinced it's a worthwhile idea and there is some real value. Possibly the problem is two-fold:
  • extra heads don't increase capacity, only reduce seek time (needed for "green" drives) - a hard sell.
  • would customers prefer two sets of heads mounted in two drives, with double the capacity, the flexibility to mirror data and the ability to replace individual units?
Adopting dual-heads in shingled-write drives might be attractive:
  • shingled-write holds the potential to double or more the track density with the same technology/parts [similar to the 50% increase early RLL controllers gave over MFM drives].
    Improving drive $/GB is at least an incentive to produce and buy them.
  • We've no idea how sensitive to vibration drives with these very small bit-cells will be.
    Having symmetric head movement will cancel most vibration harmonics, helping settling inside the drive and reducing impact externally.
To appreciate the need:

Ed Grochowski, in 2011, compared DRAM, Flash and HDD, calculating bitcell sizes for a 3.5 in disk and the 750GB platters used in 3TB drives (max 5 platters in a 25.4 mm thick drive).

The head lithography is 37nm, tracks are 74nm wide and now with perpendicular recording, 13nm long.
The outside track is 87.5mm diameter, or 275 mm in length, holding a potential 21M bitcells and yielding 2-2.5MB of usable data. With 2KB sectors, that's ~1,000 sectors/track maximum.

The inner track is 25.5mm diameter, 80mm in length: 29% of the outside track length.
The 31mm wide write-area contains up to 418,000 tracks with a total length of ~5 km.
Modern drives group tracks in "zones" and vary rotational velocity. The number of zones, and how closely they approximate "Constant Linear Velocity" like early CD drives, isn't discussed in vendors' data sheets.

Grochowski doesn't mention clocking, sector overheads (headers, sync bits, CRC/ECC) or inter-sector gaps.
Working backwards from the total track length, the track 'pitch' is around 150nm, leaving a gap of roughly a full track width between tracks.
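The geometry arithmetic can be reproduced in a few lines (all figures as quoted above; this is only a back-of-envelope sketch, not Grochowski's actual calculation):

```python
from math import pi

# Rough reproduction of the disk geometry arithmetic above
# (Grochowski's 2011 figures as quoted; all values approximate).
outer_d_mm, inner_d_mm = 87.5, 25.5
bit_len_nm, track_w_nm = 13, 74
write_area_mm = 31

outer_len_mm = pi * outer_d_mm                 # ~275 mm outer track
inner_len_mm = pi * inner_d_mm                 # ~80 mm inner track
bitcells = outer_len_mm * 1e6 / bit_len_nm     # ~21M cells on the outer track
tracks = write_area_mm * 1e6 / track_w_nm      # ~418,000 tracks at 74 nm width

print(f"outer track: {outer_len_mm:.0f} mm, {bitcells/1e6:.0f}M bitcells")
print(f"inner/outer length ratio: {inner_len_mm/outer_len_mm:.0%}")
```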

I've not seen mention of bearing runout and wobble, which require heads to constantly adjust tracking - a major issue with Optical disks. Control of peripheral disk dimensions, and the ensuing problems, is discussed.

As tracks become thinner, seeking to, and staying on, a given track becomes increasingly difficult. These are extremely small targets to find on a disk and tracking requires very fine control needing both very precise electronics and high-precision mechanical components in the arms and actuators.

Dr Gurumurthi notes in "HDD basics" that this "settling-time" becomes more important with smaller disks and higher track density.

This, as well as thinner tracks, is the space that shingled-writing is looking to exploit.
The track width and pitch become the same, around 35nm for a 4-fold increase in track density using current heads, less inter-region gaps and other overheads.

Introducing counter-balancing dual heads/actuators may be necessary to successfully track the very small features of shingled-write disks. A 2-4 times capacity gain would justify the extra cost/complexity for manufacturers and customers.

Wednesday, February 01, 2012

Z-gen hard disks: shingled writes

We are approaching the limits of magnetic hard disk drives, probably before 2020, with 4-8TB per 2.5 inch platter.

One of the new key technologies proposed is "shingled writes", where new tracks partially overwrite  an adjacent, previously written track, making the effective track-width of the write heads much smaller. Across the disk, multiple inter-track blank (guard) areas are needed to allow the shingling to start and finish, creating "write regions" (super-tracks?) as the smallest recordable area, instead of single tracks. The cost of discarding the guard distance between tracks is higher cross-talk, requiring more aggressive low-level Error Correction and Detection schemes.

In the worst case, a single sector update, the drive has to read the whole write-region into local memory, update the sector, then rewrite the whole write-region. These multi-track writes, at one disk revolution per track, are not only slow and make the drive unavailable for the duration, but require additional internal resources, including enough memory to store a whole region. Current drive buffers would limit regions to a few tens of MB, which may not yield a worthwhile capacity improvement.

The shingled-write technique has severe limitations for random "update-in-place" usage:
  • write-regions either have to be very small with many inter-region gaps/guard areas, considerably reducing the areal recording density and obviating its benefits, or
  • have relatively few very large write-regions to achieve 90+% theoretical maximum areal recording density at the expense of update times in the order of 10-100 seconds and significant on-drive memory. This substantially increases cost if SRAM is used and complexity if DRAM is used.
Clearly, shingled writes are not an optimum solution for drives used for random writes and update-in-place; they are perhaps the worst solution for this sort of workload.

A tempting solution is to adopt a "log structured" approach, such as that used in the SSD FTL (Flash Translation Layer), and map logical sectors to physical locations:
write sector updates to a log, never update in place, and securely maintain a logical-to-physical sector map.

Contiguous logical sectors are initially written physically adjacent, but over time, as sectors are updated multiple times, contiguous logical sectors will be spread widely across the disk - radically slowing streaming read rates, leaving many "dead" sectors that reduce the effective capacity, and requiring active consolidation, or "compaction".

The drive controllers still have to perform logical-to-physical sector mapping, optimally order reads and reassemble the contiguous logical stream.
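A toy sketch of the log-structured mapping just described. The class name, region size and in-memory structures are illustrative only, not how a real drive controller would be built:

```python
# Toy sketch of log-structured sector mapping: updates are appended to
# the current write-region, never updated in place, and a
# logical-to-physical map tracks the latest copy of each sector.

class ShingledLog:
    def __init__(self, sectors_per_region=4):
        self.regions = [[]]                 # physical log, grouped by region
        self.map = {}                       # logical sector -> (region, offset)
        self.spr = sectors_per_region

    def write(self, logical, data):
        region = self.regions[-1]
        if len(region) == self.spr:         # region full: start a new one
            self.regions.append([])
            region = self.regions[-1]
        region.append(data)
        self.map[logical] = (len(self.regions) - 1, len(region) - 1)

    def read(self, logical):
        r, off = self.map[logical]
        return self.regions[r][off]

d = ShingledLog()
d.write(7, b"v1")
d.write(7, b"v2")          # update appends; old copy becomes a "dead" sector
assert d.read(7) == b"v2"
```

The superseded `b"v1"` copy is exactly the "dead sector" problem above: it occupies space until a compaction pass reclaims it.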

Methods to ameliorate disruption of spatial proximity must trade space for speed:
  • either allow low-density (non-shingled) sets of tracks in the inter-region gaps specifically for sector updates, or
  • leave an update area at the end of every write-region.
    Larger update areas lower effective capacity/areal density, whilst smaller areas are saturated more quickly.
Both these update-expansion area approaches have a "capacity wall", or an inherent hard-limit:
 what to do when the update-area is exhausted?

Pre-emptive strategies, such as predictive compaction, must be scheduled for low activity times to minimise performance impact, requiring the drive to have significant resources, a good time and date source and to second-guess its load-generators. Embedding additional complex software in millions of disk drives that need to achieve better than "six nines" reliability creates an administration, security and data-loss liability nightmare for consumers and vendors alike.

The potential for "unanticipated interactions" is high. The most probable are defeating Operating System block-driver attempts to optimally organise disk I/O requests and duplicating housekeeping functions like compaction and block relocation, resulting in write avalanches and infinite cascading updates triggered when drives near full capacity.

In RAID applications, the variable and unpredictable I/O response time would trigger many false parity recovery operations, offsetting the higher capacity gained with significant performance penalties.

In summary, trying to hide the recording structure from the Operating System, with its global view and deep resources, will be counter-effective for update-in-place use.

Shingled-write drives are not suitable for high-intensity update-in-place uses such as databases.

There are workloads that are a very good match to large-region update whole-disk structures:
  • write-once or very low change-rate data, such as video/audio files or Operating System libraries etc,
  • log files, when preallocated and written in append-only mode,
  • distributed/shared permanent data, such as Google's compressed web pages and index files,
  • read-only snapshots,
  • backups,
  • archives, and
  • hybrid systems designed for the structure, using techniques like Overlay mounts with updates written to more volatile-friendly media such as speed-optimised disks or Flash Memory.
Shingled-write drives with non-updateable, large write-regions are a perfect match for an increasingly important HDD application area: "Seek and Stream".

There are already multiple classes of disk drives:
  • cost-optimised drives,
  • robust drives for mobile applications,
  • capacity-optimised Enterprise drives,
  • speed-optimised Enterprise drives, and
  • "green" or power-minimised variants of each class.
Pure shingled-write drives could be considered a new class of drive:
  • capacity-optimised, write whole-region, never update. Not unlike CD-RW or DVD-RAM.
With Bit-Patterned-Media (BPM), another key technology needed for Z-gen drives, a further refinement is possible for write whole-region drives:
  • continuous spiral tracks per write-region, as used by Optical drives.
Lastly, an on-drive write-buffer of Flash memory, sized for 1 or 2 write-regions, would, I suspect, improve drive performance significantly and allow additional optimisations or Forward Error Correction in the recording electronics/algorithms.

For a Z-gen drive with 4TB/platter and 4Gbps raw bit-rates, 2-8Gb write-regions may be close to optimal. Around 1,000 regions per drive would also fit nicely with CLV (Constant Linear Velocity) and power-reducing slow-spin techniques.
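The arithmetic behind those figures, as a sanity check:

```python
# Sanity check of the region sizing above: a 4 GB region at a 4 Gb/s
# raw rate streams in about 8 seconds, and ~1,000 such regions
# cover a 4 TB platter.
region_GB, rate_Gbps, platter_TB = 4, 4, 4
stream_s = region_GB * 8 / rate_Gbps          # GB -> Gb, then divide by Gb/s
regions = platter_TB * 1000 / region_GB
print(f"{stream_s:.0f} s per region, {regions:.0f} regions per platter")
```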

A refinement would be to allow variable size regions to precisely match the size of data written, in much the same way that 1/2 inch tape drives wrote variable sized blocks. This technique allows the Operating System to avoid wasted space or complex aggregation needed to match file and disk recording-unit sizes. This is not quite the "Count Key Data" organisation of old mainframe drives (described by Patterson et al in 1988 as "Single Large Expensive Drives").

Like Optical disks, particularly the CD-ROM "Mode 1" layer, additional Forward Error Correction can be cheaply built into the region data to achieve protection from burst-errors and unrecoverable bit-error rates in excess of 1 in 10^60, both on-disk and for off-disk transfers.

For 100-year archival data stored on disks, the data has to be moved and recreated every 5-7 years, forcing errors to be crystallised each time. Petabyte RAID'd collections using drives with 1 in 10^16 bit error rates only achieve a 99.5% probability of successful rebuild with RAID-6. Data migrations are effectively RAID rebuilds. Twenty consecutive rebuilds have a 10% probability of complete data loss - an unacceptably high rate. Duplicating the systems reduces this to a 1% probability of complete data loss, but at a 100% overhead. The modern Error Correction techniques suggested here require modest (10-20%) overheads and would improve data protection to less than 0.1% data loss, though not obviating the problems of failing hardware.
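Checking that arithmetic, with each migration treated as an independent rebuild at the quoted 99.5% success rate:

```python
# The archival arithmetic above: each 5-7 year migration is, in effect,
# a RAID rebuild with a 99.5% probability of success.
p_rebuild = 0.995
rebuilds = 20                                # roughly a century of migrations
p_loss = 1 - p_rebuild ** rebuilds           # ~10% chance of complete loss
p_loss_dup = p_loss ** 2                     # two independent copies: ~1%
print(f"single system: {p_loss:.1%} loss risk; duplicated: {p_loss_dup:.1%}")
```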

In a world of automatic data de-duplication and on-disk compression, data protection/preservation becomes a very high priority. Storage efficiency brings the cost of "single point of failure". This adds another impetus to add good Error Correction to write-regions.

A 4GB region, written at 4Gbps would stream in 8-10 seconds. Buffering an unwritten region in local Flash Memory would allow fast access to the data both before and during the commit to disk operation, given sufficient excess Flash read bandwidth.
Note: this 4-8Gbps bandwidth is an important constraint on the Flash Memory organisation. Unlike SSDs, because access is direct and writes are sequential, no FTL is required, but bad-block and worn-block management are still necessary.

4GB, approximately the size of a DVD, is known to be a useful and manageable size with a well-understood file system (ISO 9660) available; it would match shingled-disk write-regions well. Working with these region-sizes and file system builds on well-known, tested and understood capabilities, allowing rapid development and a safe transition.

Using low-cost MLC Flash Memory with a life perhaps of 10,000 erase cycles would allow the whole platter (1,000 regions) to be rewritten at least 10 times. Allowing a 25% over-provisioning of Flash may improve the lifetime appreciably as is done with SSD's, which could be a "point of difference" for different drive variants.
Specifically, the cache suggested is write-only, not a read-cache. The drive usage is intended to be "Seek and Stream", which does not benefit from an on-drive read-cache. For servers and disk-appliances, 4GB of DRAM cache is now an insignificant cost and the optimal location for a read cache.
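The endurance arithmetic above, in a few lines (cycle counts as quoted; a rough sketch, ignoring wear-levelling details):

```python
# MLC endurance arithmetic: a region-sized Flash buffer rated for
# ~10,000 erase cycles can stage every region of a 1,000-region
# platter about 10 times over; 25% over-provisioning stretches that.
erase_cycles, regions_per_platter = 10_000, 1_000
platter_rewrites = erase_cycles / regions_per_platter            # 10.0
with_overprovision = erase_cycles * 1.25 / regions_per_platter   # 12.5
print(platter_rewrites, with_overprovision)
```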

Provisioning enough Flash Memory for multiple uncommitted regions, even 2 or 3, may also be a useful "point of difference" for either Enterprise or Consumer applications. Until this drive organisation is simulated and Operating and File Systems are written and trialled/tested against them, real-world requirements and advantages of larger cache sizes are uncertain.

Depending on the head configuration, the data could be read-back after write as a whole region and re-recorded as necessary. The recording electronics perhaps adjusting for detected media defects and optimising recording parameters for the individual surface-head characteristics in the region.

If regions are only written at the unused end of drives, such as for archives or digital libraries, maximum effective drive capacity is guaranteed, there is no lost space. Write-once, Read-Many is an increasingly common and important application for drives.
A side-effect is that individual drives will, like 1/2 inch tapes of old, vary in achieved capacity, though of the same notional size - and you can't know by how much until the limit is reached. Operating and File Systems have dealt with "bad blocks" for many decades and can potentially use that approach to cope with variable drive capacity, though it is not a perfect match. Artificially limiting drive capacity to the "Least Common Denominator", either by the consumer or vendor, is also likely. "Over-clocking" of CPUs shows that some consumers will push the envelope and attempt to subvert/overcome any arbitrary hardware limits imposed. If any popular Operating System can't cope easily with uncertain drive capacity and variable regions, this will limit uptake in that market, though experience suggests not for long if there is an appreciable price or capacity/performance differential.

When re-writing a drive, the most cautious approach is to first logically erase the whole drive and then start recording again, overwriting everything. The most optimistic approach is to logically erase 2 or 3 regions: the region you'd like to write, plus enough of a physical cushion that defects etc. don't cause an unintended region overwrite.

This suggests two additional drive commands are needed:
  • write region without overwriting next region (or named region)
  • query notional region size available from "current position" to next region or end-of-disk.
This raises an implementation detail beyond me:
Are explicit "erase region" or "free region" operations required?
Would they physically write to every raw bit-location in a region or not?
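A hypothetical sketch of the two proposed commands. All names and semantics here are my own invention for illustration, not any real drive command set:

```python
# Hypothetical model of the two extra drive commands proposed above:
#   1. write a region without overwriting what follows it
#   2. query the notional space from "current position" to end-of-disk

class ShingledDrive:
    def __init__(self, capacity_blocks):
        self.capacity = capacity_blocks
        self.write_head = 0                  # next free block ("current position")
        self.region_starts = []

    def query_available(self):
        """Notional region size available from here to end-of-disk."""
        return self.capacity - self.write_head

    def write_region(self, blocks, guard=2):
        """Append a region; refuse rather than overwrite following data."""
        needed = len(blocks) + guard         # data plus inter-region guard gap
        if needed > self.query_available():
            raise IOError("region would overwrite following data")
        self.region_starts.append(self.write_head)
        self.write_head += needed
        return self.region_starts[-1]

d = ShingledDrive(capacity_blocks=100)
d.write_region([b"x"] * 10)
assert d.query_available() == 88             # 10 data + 2 guard consumed
```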

On Heat-Assisted Magnetic Recording (HAMR), drive vibration, variable speed and multiple heads/arms.

HAMR is the other key technology (along with BPM and shingled-writes) being explored/researched to achieve Z-gen capacities.
It requires heating the media, presumably over the Curie Point, to erase the existing magnetic fields. Without specific knowledge, I'm guessing those heads will be bigger and heavier than current heads, and considerably larger and heavier than the read heads needed.

Large write-regions, with wide guard areas between them, would seem to be very well suited to HAMR and its implied low-precision heating element(s). Relieving the heating elements of the same precision requirements as the write and read heads may make the system easier to construct and control, and hence record more reliably - though this is pure conjecture on my part.

Dr. Sudhanva Gurumurthi and his students have extensively researched and written about the impact of drive rotational velocity, power-use and multiple heads. From the timing of their publications and the release of slow-spin and variable-speed drives, it's reasonable to infer that Gurumurthi's work was taken up by the HDD manufacturers.

Being used for "Seek and Stream", not high-intensity Random I/O, HDDs will exhibit considerably less head/actuator movement, resulting in much less generated vibration if nothing else changes. This improves operational reliability and greatly lowers induced errors by removing most of the drive-generated vibration. At the very least, less dampening will be needed in high-density storage arrays.

Implied in the "Seek and Stream" is that I/O characteristics will be different, either:
  • nearly 100% write for an archive or logging drive with zero long seeks, or
  • nearly 100% read for a digital library or distributed data with moderate long seeking.
In both scenarios, seeks would reduce from the current 250-500/sec to 0.1-10/sec.
For continuous spiral tracks, head movement is continuous and smooth for the duration of the streaming read/write, removing entirely the sudden impulses of track seeks. For regions of discrete, concentric tracks, the head movements contain the minimum impulse energy. Good both for power-use and induced vibration.

Drawing on Gurumurthi's work on multiple heads compensating for the performance of slow-spin drives, this head/actuator arrangement for HAMR with shingled-write may be beneficial:
  • Separate "heavy head", either heating element, write-head or combined heater-write head.
  • Dual light-weight read heads, mounted diagonally from each other and at right-angles to the "heavy head".
Because write operations are infrequent in both scenarios, the "heavy head" will be normally unloaded, even leaving the heating elements (if lasers aren't used) normally off. The 2-10 second region-write time, possibly with 1 or 2 rewrite attempts, means 10 msec heater ramp-up would not materially affect performance.

A single write head can only achieve half the maximum raw transfer rate of dual read heads. Operating Systems have not had to deal with this sort of asymmetry before and it could flush out bugs due to false assumptions.

Separating the read and "heavy" heads reduces the arm/bearing engineering requirements and actuator power for the usually dominant case - reading. By slamming around lighter loads, lower impulses are produced. Because the drives are not attempting high-intensity random I/O, lower seek performance is acceptable. The energy used in accelerating/decelerating any mass is proportional to velocity². Reducing the arm seek velocity to 70% halves the energy needed and the impulse energy needing to be dissipated. (Lower g-forces also reduce the amplitude of the impulse, though I can't remember the exact relationship.)
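The velocity-squared claim, checked:

```python
# E ∝ v²: running the arm at ~70% of the original seek velocity
# roughly halves the kinetic energy per seek.
v_ratio = 0.70
energy_ratio = v_ratio ** 2
print(f"energy ratio: {energy_ratio:.2f}")   # ~0.49
```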

With "Seek and Stream" mode of operation, for a well tuned/balanced system, the dominant time factor is "streaming". The raw I/O transfer rate is of primary concern. The seek rate, especially for read, can be scaled back with little loss in aggregate throughput.

Optimising these factors is beyond my knowledge.

By using dual opposing read heads, impulses can be further reduced by synchronising the major seek movements of the read heads/arms.
As well, both heads can read the same region simultaneously, doubling the read throughput. This could be as simple as having each head read alternate tracks or, with a spiral track, the second head starting halfway through the read area - though to achieve maximum bandwidth, the requesting initiator may have to cope with two parallel streams and join the fragments in its buffer. Not ideal, but attainable.

Assuming shingled-writes, dual spiral tracks would allow simple interleaving of simultaneous read streams, but would need either two write heads similarly diagonally opposed, or a single device assembled with two heads offset by a track width and possibly staggered in the direction of travel. Would a single laser heating element suffice for two write heads? This arrangement sounds overly complicated, difficult to consistently manufacture to high precision, and expensive.

For a single spiral track with dual read-heads, a dual spiral can be simulated, though achieving full throughput requires more local buffer space.
The controller moves the heads to adjacent tracks and reads a full track from each into a first set of buffers; it then concatenates the buffers and streams the data. After the first track, the heads leap-frog and read into an alternate set of buffers, which are then concatenated and streamed while the heads leap-frog again and switch back to the first set of buffers, and so on.

This scheme doesn't need to buffer an exact track, just something larger than the longest track, at a small loss of speed. If a 1MB "track" size is chosen, then 4MB of buffer space is required. Data can begin streaming from the first byte of track 0, though only after both buffers are full can full-speed transfers happen.
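A toy simulation of the leap-frog reading scheme described above (track contents are illustrative byte strings, not real disk data; the parallel reads are, of course, sequential here):

```python
# Toy model of leap-frog dual-head reading: two heads read adjacent
# "tracks" in parallel into a buffer pair, the pair is concatenated
# and streamed, then both heads jump forward two tracks.

def leapfrog_read(tracks):
    """tracks: list of per-track byte strings, in logical order."""
    out = []
    for pair_start in range(0, len(tracks), 2):
        # both heads read one track each (simultaneously on real hardware)
        head_a = tracks[pair_start]
        head_b = tracks[pair_start + 1] if pair_start + 1 < len(tracks) else b""
        out.append(head_a + head_b)     # concatenate the buffer pair
        # heads now leap-frog forward two tracks for the next pair
    return b"".join(out)

tracks = [b"T0", b"T1", b"T2", b"T3"]
assert leapfrog_read(tracks) == b"T0T1T2T3"   # logical stream reconstructed
```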

It's also possible to de-interleave the data when written: reorder before writing, with alternate sectors offset by half the write-buffer size (2MB for a 4MB buffer). On reading, directly after the initial seek to the same 4MB segment at offsets zero and 2MB, the heads read alternate sectors, which can then be interleaved easily and output at full bandwidth. When the heads reach the end of a segment (4MB), they jump to the next segment and start streaming again. Some buffering will be required because of the variable track size and geometric head offsets. I'm not sure if either scheme is superior.

In summary:
  • shingled-write drives form a new class of "write whole-region, never update" capacity-optimised drives. As such, they are NOT "drop-in replacements" for current HDDs, but require some tailoring of Operating and File Systems.
  • abandon the notional single-sector organisation for multi-sector variable blocking, similar to old 1/2 inch tape.
  • large write-regions (2-8GB) of variable size with small inter-region gaps maximise achievable drive capacity and minimise file system lost-space due to disk and file system size mismatches. If regions are fixed-sector organised, lost space will average around a half-sector, under 1/1000th overhead.
  • Appending regions to disks is the optimal recording method.
  • Optimisation techniques used in Optical Drives, such as continuous spiral tracks and CD-ROM's high resilience Error Correction, can be applied to fixed-sectors and whole shingled-write regions.
  • Integral high-bandwidth Flash Memory write caches would allow optimal region recording at low cost, including read-back and location optimised re-recording.
  • Shingled-writes would benefit from purpose-designed BPM media, but could be usefully implemented with current technologies to achieve higher capacities, though perhaps exposing individual drive variability.
  • shingled-writes and large, "never updated" regions work well with HAMR, BPM, separated read/write heads and dual light-weight read heads.

Sunday, January 08, 2012

Revolutions End II and The Memory Wall

The 2011 ITRS report for the first time uses the terms, "ultimate Silicon scaling" and "Beyond CMOS". The definitive industry report is highlighting for us that the end of the Silicon Revolution is in sight, but that won't be the end of the whole story. Engineers are very clever people and will find ways to keep the electronics revolution moving along, albeit at a much gentler pace.

In 2001, the ITRS report noted that CPUs would be hitting a Power Wall: they'd need to forgo performance (frequency) to fit within a constrained power envelope. Within 2 years, Intel was shipping multi-core CPUs. Herb Sutter wrote about this in "The Free Lunch is Over".

In the coming 2011 ITRS report, they write explicitly about "Solving the Memory Wall".
Since around 1987, CPU clock frequency (also 'cycle time') has been increasing faster than DRAM cycle times: by roughly 40% per year (7%/year for DRAM and ~50%/year for CPU chip frequency).

This is neatly solved, by trading latency for bandwidth, with caches.
The total memory bandwidth needs for multi-core CPU's doesn't just scale with the chip frequency (5%/year growth), but with the total number of cores accessing the cache (number of cores grow at approx 40%/year).
Cache optimisation, the maximisation of cache "hit ratio", requires the largest cache possible. Hence Intel now has 3 levels of cache, with the "last level cache" being shared globally (amongst all cores).

The upshot of this is simple: to maintain good cache hit-ratios, cache size has to scale with the total demand for memory access, i.e. N-cores * chip-freq.
To avoid excessive processor 'stall' waiting for the cache to be filled from RAM, the hit-ratio has to increase as the speed differential increases: an increased chip frequency requires a faster average memory access time.
So the scaling of cache size is: (N-cores) * (chip-freq)²
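That scaling relation as a tiny function (units are arbitrary; it only shows the shape of the growth):

```python
# Cache-size scaling from above: demand scales with N_cores * freq,
# and the required hit-ratio adds roughly another factor of freq.
def cache_scale(n_cores, freq_ghz):
    return n_cores * freq_ghz ** 2

# doubling cores at fixed frequency doubles the required cache...
assert cache_scale(8, 3.0) == 2 * cache_scale(4, 3.0)
# ...but doubling frequency at fixed cores quadruples it
assert cache_scale(4, 6.0) == 4 * cache_scale(4, 3.0)
```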

The result is:
cache memory has grown to dominate CPU chip layout, and this will only increase.
But it's a little worse than that...
The capacity growth of DRAM has slowed to a doubling every 3-4 years.
In 2005, the ITRS report for the first time dropped DRAM as its "reference technology node", replacing it with Flash memory and CPU's.

DRAM capacity growth is falling behind total CPU chip memory demands.
Amdahl posited another law for "Balanced Systems": that each MIP required 1MB of memory.

Another complicating factor is bandwidth limitations for "off-chip" transfers - including memory.
This is called "the Pin Bottleneck" (because external connections are notionally by 'pins' on the chip packaging). I haven't chased down the growth pattern of off-chip pins. The 2011 ITRS design report discusses it, along with the Memory Wall, as a limiting factor and a challenge to be solved.

As CPU memory demand (the modern version of "MIPS") increases, system memory sizes must similarly scale, or the system becomes memory-limited. That isn't a show-stopper in itself, because we invented Virtual Memory (VM) quite some time back to "impedance match" application memory demands with available physical memory.

The next performance roadblock is VM system performance, or VM paging rates.
VM systems have typically used Hard Disk (HDD) as their "backing store", but whilst capacity has grown faster than any other system component (doubling every year since ~1990), latency, seek and transfer times have improved comparatively slowly - falling behind CPU cycle times and memory demands by perhaps 50%/year.

For systems using HDD as their VM backing store, throughput will be adversely affected, even constrained, by the increasing RAM deficit.

There is one bright point in all this: Flash Memory has been doubling in capacity as fast as CPU memory demand, while improving in both speed (latency) and bandwidth.

So much so, that there are credible projects to create VM systems tailored to Flash.

So our commodity CPUs are evolving to look very similar to David Patterson's iRAM (Intelligent RAM): a single chip with RAM and processing cores.

Just how the chip manufacturers respond is the "$64-billion question".

Perhaps we should be reconsidering Herb Sutter's thesis:
programmers have to embrace parallel programming and learn to create large, reliable systems with it, to exploit future hardware evolution.