HDFS Design and Synergies with VDM

The Hadoop Distributed File System (HDFS) uses divide-and-conquer techniques under the covers to distribute data and processing. According to Tom White's "Hadoop: The Definitive Guide" (O'Reilly), the design of HDFS is driven by three primary objectives:

Accommodate very large data

Optimize for throughput, streaming data efficiently: read whole files as opposed to specific records, and optimize the total access time as opposed to getting to the first record fast

Use commodity hardware
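The throughput objective above can be illustrated with a rough back-of-the-envelope model. This is only a sketch: the seek latency, transfer rate, record count and record size are assumed figures chosen for illustration, not measurements of HDFS, and the function names are hypothetical.

```python
# Assumed hardware figures, for illustration only.
SEEK_SECONDS = 0.010        # assumed average latency to position at a location
BYTES_PER_SECOND = 100e6    # assumed sustained sequential transfer rate


def sequential_scan_seconds(file_bytes):
    """Time to stream a whole file: one positioning step, then pure transfer."""
    return SEEK_SECONDS + file_bytes / BYTES_PER_SECOND


def record_at_a_time_seconds(record_count, record_bytes):
    """Time to fetch every record individually, paying one seek per record."""
    return record_count * (SEEK_SECONDS + record_bytes / BYTES_PER_SECOND)


records = 1_000_000      # assumed number of records in the file
record_bytes = 1_000     # assumed size of each record

scan = sequential_scan_seconds(records * record_bytes)
random_access = record_at_a_time_seconds(records, record_bytes)

print(f"whole-file scan:      {scan:10.2f} s")
print(f"record-by-record:     {random_access:10.2f} s")
```

Under these assumptions the individual-record path reaches the first record quickly, but reading everything that way is dominated by seek time, which is why a design aimed at total access time favors streaming whole files.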

HDFS is not currently designed or optimized for:

Direct low-latency access to specific records or areas of a file

Lots of small files - the overhead associated with each file open and read is relatively high. The same amount of data stored in 1,000 files rather than 10 files incurs 100 times more per-file overhead, which can be quite significant.

File maintenance limitations

Concurrent writes are not supported. Only one writer at a time can write to a file.

Updates in the middle of a file are not supported. All writing is limited to appending at the end of a file.
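The small-files point in the list above is simple arithmetic, sketched here. The per-file overhead value is an assumed placeholder, not a measured HDFS figure, and the function name is hypothetical; only the ratio between the two cases matters.

```python
def per_file_overhead_seconds(file_count, overhead_per_file=0.05):
    """Total fixed cost of opening and starting to read file_count files.

    overhead_per_file is an assumed value in seconds, for illustration only.
    """
    return file_count * overhead_per_file


many_small = per_file_overhead_seconds(1000)  # data split across 1,000 files
few_large = per_file_overhead_seconds(10)     # same data held in 10 files
ratio = many_small / few_large

print(f"1,000 files: {many_small:.1f} s of overhead")
print(f"   10 files: {few_large:.1f} s of overhead")
print(f"ratio: {ratio:.0f}x")
```

Because the overhead is paid per file and not per byte, the ratio depends only on the file counts, whatever the actual per-file cost turns out to be.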

VDMETL is designed with the following assumptions and goals:

The target is typically a DBMS or other environment that requires data to be cleansed and transformed. XML and conventional table structures are supported.

Large amounts of data stored in large fixed-format files that do not change, but can be replaced to address upstream corrections

Efficient history loads that require massive amounts of data to be processed in short time windows

Efficient reloads and corrections with minimal disruption of production capacity

Fast and easy mechanism for implementing small incremental changes to the transformation/cleansing rules in the provisioning processes

Openness and simplicity - use standard open Unix tools and platforms - very much in line with Hadoop and HDFS

Convention over configuration - avoid complexity by standardizing and enforcing naming and other conventions