The three main challenges I see are:
- SSD and PCI Flash memory with "zero" seek time,
- affordable Petabyte HDD storage, and
- object-based storage replacing "direct attach" devices.
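To put the first of these in perspective, here is a back-of-envelope comparison of random-read latency. The figures are rough, typical-of-the-era assumptions, not measurements:

```python
# Back-of-envelope random-read latency comparison (rough assumed figures, not measurements).
avg_seek_ms = 8.0                          # assumed average seek on a 7200 rpm SATA drive
half_rotation_ms = 0.5 * 60_000 / 7200     # average rotational delay, ~4.2 ms at 7200 rpm
hdd_read_ms = avg_seek_ms + half_rotation_ms

flash_read_ms = 0.1                        # ~100 microseconds for a flash random read

print(f"HDD   random read ~ {hdd_read_ms:.1f} ms")
print(f"Flash random read ~ {flash_read_ms:.2f} ms")
print(f"Ratio             ~ {hdd_read_ms / flash_read_ms:.0f}x")
```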
Each of these changes long-standing assumptions and raises new questions:
- single-record access time is no longer dominated by disk rotations, so the old rotation-centric 'optimisations' are large, costly, slow and irrelevant,
- the whole "write region" can be held in fast memory, changing cache requirements and design,
- Petabyte storage allows "never delete" datasets, which pose new problems:
- how does old and new data get physically organised? [a partitioning sketch follows this list]
- what logical representations can be used to reduce queries to minimal collections?
- how does the one datastore support conflicting usage types? [real-time transactions vs data warehouse]
- How are changes to Data Dictionaries supported over time?
- common DB formats are necessary as the lifetime of data will cover multiple products and their versions.
- Filesystems and Databases have to use the same primitives and use common tools for backups, snapshots and archives.
- As do higher order functions/facilities:
- compression, de-duplication, transparent provisioning, Access Control and Encryption
- Data Durability and Reliability [RAID + geo-replication]
- How is security managed over time with unchanging datasets?
- How are Performance Analysis and 'Tuning' performed?
- Can Petabyte datasets be restored or migrated at all?
- DBs must continue running without data loss or performance degradation as the underlying storage and compute elements are changed or re-arranged.
- How is expired data 'cleaned' whilst respecting/enforcing any legal caveats or injunctions?
- What data are new Applications tested against?
- Just a subset of "full production"? [doesn't allow Sizing or Performance Testing]
- Testing and Developing against "live production" data is either extremely unwise [unintended changes/damage] or a massive security hole. But when there is only the One DB, what to do?
- What does DB roll-back and recovery mean now? What actions should be expected?
- Is "roll-back" or reversion allowable or supportable in this new world?
- Can data really be deleted in a "never delete" dataset?
- Is the Accounting notion of "journal entries" necessary? [a sketch follows this list]
- What happens when logical inconsistencies appear in geo-diverse DB copies?
- can they be detected? [one detection approach is sketched after this list]
- can they ever be resolved?
- How do these never-delete DBs interface with, or support, corporate Document and Knowledge Management systems?
- Should summaries ever be made and stored automatically under the many privacy and legal data-retention laws, regulations and policies in force?
- How are conflicting multi-jurisdiction issues resolved for datasets with wide geo-coverage?
- How are organisation mergers accomplished?
- Who owns what data when an organisation is de-merged?
- Who is responsible for curating important data when an organisation disbands?
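On the physical-organisation and "minimal collections" questions above: the usual answer today is time-range partitioning, where new data lands in a hot partition (ideally on the fast flash tier), old partitions become effectively immutable, and a query's time predicate prunes away every partition it cannot touch. A minimal in-memory sketch of the idea; the month-per-partition scheme and class names are illustrative assumptions, not any particular product's design:

```python
from collections import defaultdict
from datetime import date

class TimePartitionedTable:
    """Rows are physically grouped by month; a query with a date range
    only scans the partitions that overlap that range."""
    def __init__(self):
        self._partitions = defaultdict(list)          # (year, month) -> rows

    def insert(self, when: date, row: dict):
        self._partitions[(when.year, when.month)].append((when, row))

    def query(self, start: date, end: date):
        # Partition pruning: months outside the range are never read at all.
        hit = [p for p in self._partitions
               if (start.year, start.month) <= p <= (end.year, end.month)]
        for p in hit:
            for when, row in self._partitions[p]:
                if start <= when <= end:
                    yield row

table = TimePartitionedTable()
table.insert(date(2009, 1, 15), {"id": 1})
table.insert(date(2012, 6, 3), {"id": 2})
print(list(table.query(date(2012, 1, 1), date(2012, 12, 31))))   # only the 2012 partition is scanned
```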
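On the "journal entries" question: in a never-delete store, "deleting" plausibly becomes appending a reversing entry and filtering it out at read time, exactly as accountants have always done. A minimal sketch under that assumption; the `JournalledStore` class is illustrative, not a real engine:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass(frozen=True)
class Entry:
    """One immutable journal entry; corrections are new entries, never edits."""
    key: str
    value: object
    reversing: bool = False                # True marks a reversing ('delete') entry
    at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

class JournalledStore:
    """Append-only store: nothing is ever removed, reads filter the journal."""
    def __init__(self):
        self._journal: list[Entry] = []

    def put(self, key, value):
        self._journal.append(Entry(key, value))

    def reverse(self, key):
        # The accounting-style 'delete': append a reversing entry instead of erasing.
        self._journal.append(Entry(key, None, reversing=True))

    def current(self, key):
        # Latest non-reversed view of the key; None if it was logically 'deleted'.
        for e in reversed(self._journal):
            if e.key == key:
                return None if e.reversing else e.value
        return None

    def history(self, key):
        # The full audit trail survives, which is what legal retention usually wants.
        return [e for e in self._journal if e.key == key]

store = JournalledStore()
store.put("invoice:42", {"amount": 100})
store.reverse("invoice:42")               # 'deleted' logically, retained physically
print(store.current("invoice:42"))        # None
print(len(store.history("invoice:42")))   # 2 -- both entries are still there
```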
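On detecting inconsistencies between geo-diverse copies: one established approach (Merkle-tree style anti-entropy, as used by Dynamo-family stores) is to exchange compact digests of each replica rather than the data itself, and only reconcile the ranges whose digests disagree. A toy sketch of the digest-comparison step, assuming each replica can be viewed as a key-to-value mapping:

```python
import hashlib
import zlib

def bucket_digests(replica: dict, buckets: int = 16) -> dict:
    """Summarise a replica as `buckets` digests so two sites can compare
    a handful of hashes instead of shipping the whole dataset."""
    acc = {b: hashlib.sha256() for b in range(buckets)}
    for key in sorted(replica):                          # sorted => deterministic digests
        b = zlib.crc32(key.encode()) % buckets           # stable bucket choice across sites
        acc[b].update(f"{key}={replica[key]!r};".encode())
    return {b: h.hexdigest() for b, h in acc.items()}

def diverging_buckets(site_a: dict, site_b: dict, buckets: int = 16) -> list:
    """Bucket numbers whose contents differ between two replicas;
    only those buckets need key-by-key reconciliation."""
    da, db = bucket_digests(site_a, buckets), bucket_digests(site_b, buckets)
    return [b for b in range(buckets) if da[b] != db[b]]

# Two copies that have quietly drifted apart:
sydney = {"cust:1": "alice", "cust:2": "bob"}
london = {"cust:1": "alice", "cust:2": "robert"}
print(diverging_buckets(sydney, london))   # the bucket holding cust:2 differs
```

Detection is the easier half; deciding which copy is "right" remains an application-level judgement, which is why the resolution question above stays open.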
Redesign and adaptation are needed at three levels:
- Logical Data layout, query language and Application interface.
- Physical to Logical mapping and supporting DB engines.
- Systems Configuration, Operations and Admin.
They also have to embrace the integration of multiple disparate data sources/streams, as laid out in the "G2 Strategy" Jerry Gregoire created for Dell in 1999:
- Everything should be scalable through the addition of servers.
- The principal application interface should be a web browser.
- Key programming with Java or ActiveX-type languages.
- Message brokers used for application interfacing (see the sketch below).
- Technology selection on an application-by-application basis.
- Databases should be interchangeable.
- Extend the life of legacy systems by wrapping them in a new interface.
- Utilize "off-the-shelf" systems where appropriate.
- In-house development should rely on object-based technology: new applications should be made up of proven object "puzzle pieces".
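The message-broker point is worth a concrete illustration: applications publish to named topics instead of calling each other directly, so databases, legacy systems and new front ends can be swapped without the others noticing. A toy in-process sketch of that decoupling; a real deployment would use an actual broker product (RabbitMQ, Kafka and the like), and the `Broker` class here is purely illustrative:

```python
from collections import defaultdict

class Broker:
    """Toy in-process message broker: publishers and subscribers only
    know topic names, never each other."""
    def __init__(self):
        self._subscribers = defaultdict(list)      # topic -> list of handler callbacks

    def subscribe(self, topic: str, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, message: dict):
        for handler in self._subscribers[topic]:
            handler(message)

broker = Broker()

# The 'legacy' order system and the new web front end never import each other:
broker.subscribe("orders.created", lambda m: print("legacy ERP booked", m["id"]))
broker.subscribe("orders.created", lambda m: print("web UI notified for", m["id"]))

broker.publish("orders.created", {"id": "D-1001"})   # both consumers react independently
```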