The Hytormo Project

We propose a hybrid row-and-column storage model, called HYTORMO, together with data storage and query processing strategies. First, HYTORMO is designed and implemented for deployment in large-scale environments so that it can manage big medical data. Second, the data storage strategy combines vertical partitioning with hybrid storage to create data storage configurations that reduce storage space demand and increase query performance for given workloads. To produce a data storage configuration, we propose two database design approaches: (1) expert-based design and (2) automated design.
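The vertical-partitioning idea above can be sketched as follows. The attribute names and column groups here are illustrative assumptions, not the actual DICOM groupings produced by HYTORMO:

```python
# Illustrative sketch of vertical partitioning: a wide table is split into
# column groups, each of which could then be stored in a row- or
# column-oriented layout. Attributes and groupings are hypothetical.

rows = [
    {"PatientID": "P1", "PatientName": "Doe", "Modality": "CT", "SliceThickness": 1.0},
    {"PatientID": "P2", "PatientName": "Roe", "Modality": "MR", "SliceThickness": 2.5},
]

# Frequently co-accessed attributes are grouped together; each group keeps
# the key (PatientID) so the original table can be rebuilt via joins.
column_groups = {
    "demographics": ["PatientID", "PatientName"],
    "acquisition":  ["PatientID", "Modality", "SliceThickness"],
}

partitions = {
    name: [{attr: row[attr] for attr in attrs} for row in rows]
    for name, attrs in column_groups.items()
}

print(partitions["demographics"])
```

Each partition now holds only its own column group, so queries that touch a subset of attributes read less data.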

In the former approach, experts (e.g., database designers) manually create data storage configurations by grouping the attributes of DICOM data and selecting a suitable storage layout for each column group. In the latter approach, we propose a hybrid automated design framework, called HADF. HADF relies on similarity measures (between attributes) that take into account the combined impact of both workload- and data-specific information to automatically group attributes into column groups and select suitable storage layouts. Finally, we propose an efficient query processing strategy built on top of HYTORMO. It uses both inner joins and left-outer joins for join operations between vertically partitioned tables, preventing the data loss that would occur if only inner joins were used. Furthermore, an Intersection Bloom filter is applied to remove irrelevant tuples from the input tables of join operations. This reduces disk I/O, network communication, and CPU cost, and thus improves query performance.
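The two query-processing ideas above can be sketched together: a Bloom filter built on one table's join keys prunes irrelevant tuples from the other table before a left-outer join is performed. The filter sizing, hashing scheme, and table contents below are illustrative assumptions, not HYTORMO's actual implementation:

```python
# A minimal Bloom filter: no false negatives, possible false positives.
import hashlib

class BloomFilter:
    def __init__(self, size=1024, num_hashes=3):
        self.size, self.num_hashes = size, num_hashes
        self.bits = bytearray(size)

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))

# Two vertically partitioned tables sharing the PatientID join key.
left = [{"PatientID": "P1", "PatientName": "Doe"},
        {"PatientID": "P2", "PatientName": "Roe"}]
right = [{"PatientID": "P1", "Modality": "CT"},
         {"PatientID": "P9", "Modality": "MR"}]  # P9 matches nothing in left

# Build the filter on the left table's keys and prune the right table:
# right tuples whose key cannot appear in left contribute nothing to a
# left-outer join, so dropping them early saves I/O, network, and CPU.
bf = BloomFilter()
for row in left:
    bf.add(row["PatientID"])
pruned_right = [r for r in right if bf.might_contain(r["PatientID"])]

# Left-outer join: unmatched left tuples are kept with a NULL-padded value
# instead of being silently dropped, as an inner join would do.
by_key = {r["PatientID"]: r for r in pruned_right}
joined = [{**l, "Modality": by_key.get(l["PatientID"], {}).get("Modality")}
          for l in left]
print(joined)
```

Note that pruning the right table is always semantics-preserving for a left-outer join; pruning the left table would risk dropping exactly the unmatched tuples the left-outer join is meant to keep.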

Resources used

1 master node (Spark):
  4 CPUs (Xeon E5-2630, 2.4 GHz)
  RAM: 10 GB
  Disk: 300 GB

8 worker nodes (Spark), each with:
  2 CPUs (Xeon E5-2630, 2.4 GHz)
  RAM: 6 GB
  Disk: 250 GB

Software Used

Hadoop 2.7.1, Hive 1.2.1, and Spark 1.6.0.

Dataset used

Real DICOM datasets are collected, extracted, and stored in HYTORMO.

Queries used

Three types of query workloads will be applied:

OLAP workloads.

OLTP workloads.

Mixed OLAP and OLTP workloads.

Expected results

We provide experimental evaluations to validate the benefits of the proposed methods on real DICOM datasets. We expect a reduction in storage space demand and an improvement in workload execution time.