Accessing data stored on a disk drive takes
on average three orders of magnitude longer than accessing data stored in main
memory. It is therefore not surprising that the time for moving blocks of data
from the disk to main memory usually dominates query execution time in DBMSs
[20].

The storage manager is a component of a
database that determines how data is written on and read from the disk drive.
We define its primary purpose to be the strategic placement of data on the
disk drive such that the average query execution time is minimized. Of
course, execution time depends on other factors as well. Without prejudice to
its primary purpose, the storage manager shall store data as compactly as
possible on the disk.

In Section 2.2, we presented the mechanical
properties of disk drives and showed how the time to service a read or write
request can be broken down into data access time and data transfer time.
Ignoring head and track switch time and multiple zone recording, data transfer
time is just a constant determined by the rotational speed of the disk and the
number of sectors per track. That leaves the storage manager with the
optimization of data access time; that is, seek time and rotational latency.

To minimize seek time, data items that are
frequently accessed together should be placed in the same cylinder, or in
cylinders that are as close as possible.

Data placement for minimizing rotational
latency is more complex. In general, data items that are frequently accessed
together should be placed in blocks that are as close as possible on the
circumference. If the items are placed in the same cylinder in different
tracks, offsets for head switch time must be added. If the items are placed in
different cylinders, offsets for seek time or track switch time must be added.
Figure 5 illustrates the complexity of finding rotational latency-minimizing
placements. It shows several locations of a data item $b$ that, given the placement of a data item $a$ and assuming that $b$ is accessed after $a$, all may yield zero
rotational latency.

Ideally, the storage manager analyzes the
disk drive and derives a cost model to evaluate different placements of sets of
data items based on access time. Sets of data items that are frequently
accessed together are assigned a higher weight in the model.

In practice, storage managers in modern
database systems typically do not tackle the specific placement in cylinders,
surfaces, and tracks. Instead, they implicitly assume a model of linear storage
based on logical block addresses (see Section 2.2 and especially Figure 2).
Data that is frequently accessed together is placed either in the same block or
in blocks that are as close as possible by their disk addresses. While this
might not always yield an optimal data placement, it greatly reduces
complexity. The model is hardware-independent, as the mapping of sectors to
logical block addresses is supplied by the disk drive.

The storage manager should utilize existing
information to find an optimal placement. In a relational database, for
instance, all records in a table could be considered items that should be
stored together. For several types of data, the storage manager may be able to
identify deeper structure. Integrity constraints and foreign key relationships
may provide hints.

Graph data by its nature is highly
structured and has abundant potential for storage optimization. The most basic
desideratum for a storage manager for graph data should be to

(1) store vertices together with incident
edges.

We have seen in Sections 3.2 and 4.2 that
graph algorithms frequently access the labels and edges of a vertex together. A
storage manager for graph data should therefore store them as close as
possible. As close as possible means in the same block. If it is not possible
to store them in the same block (e.g., because the vertex has more edges than
can fit in a block), they should be stored in blocks that are adjacent by their
disk addresses.

Henceforth, we take storing a vertex to
mean storing its labels and edges together. We use $s(v)$ to
denote the number of bytes that the encoding of an arbitrary vertex $v$ uses on the disk, and we use
$b$ to denote the number of bytes in a
block that can hold vertex data.

At the beginning of this section, we
proposed that, without prejudice to the overriding objective of finding an optimal
data placement, the storage manager shall place data as compactly on the disk
as possible. Applied to desideratum (1), the storage manager may store several
vertices in the same block, but it may not break up a vertex into two blocks if $s(v) \le b$.

In Section 4.1.1, we argued that vertex
labels and edges should be represented in separate tables in a relational
database. Here, we are thinking about how the two tables should be stored on
the disk, which is a decision of the storage manager and hidden from the user.
Desideratum (1) suggests that the edges table should be
stored grouped on column a and interleaved with the vertices table. This may adversely affect the execution time of simple SELECT queries over a single table, but it can increase the execution speed of
queries that involve graph navigation.

Graph navigation is at the heart of most
graph algorithms. The second desideratum is therefore to

(2) store vertices together with vertices
in their neighborhood.

For most graphs, it will not be possible to
place each vertex together with its neighbors in the same block or in adjacent
blocks. An optimal placement is then one in which as many vertices as possible
are stored as close as possible. Let us define an optimization problem:

Let $G = (V, E)$ be
the graph that we are trying to store on the disk. If $E$ contains unordered pairs of vertices, convert
them to symmetric directed pairs. Let us assume that $s(v) \le b$ for all $v \in V$.
Moreover, let us assume that the disk has enough free space to store the entire
graph in consecutive blocks, and let these blocks be numbered $0, 1, \ldots$. Let $\varphi: V \to \{0, 1, \ldots\}$ be a function that maps the vertices
of $G$ to these blocks. We
use $\varphi^{-1}(i)$ to denote the set $\{v \in V \mid \varphi(v) = i\}$, and we use $s(\varphi^{-1}(i))$ to
denote the sum $\sum_{v \in \varphi^{-1}(i)} s(v)$.

We propose that an optimal placement by
desideratum (2) solves the following problem:

Minimize $c_1(\varphi) = \sum_{(u,v) \in E} |\varphi(u) - \varphi(v)|$ subject to $s(\varphi^{-1}(i)) \le b$ for all $i$.

Put differently, let the cost of a
particular placement be the number of edges that cross block boundaries,
each weighted by the distance between the blocks involved. An optimal placement by desideratum
(2) minimizes this cost.
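As a concrete illustration, this cost (call it $c_1$ here) can be computed directly from its definition. The following sketch assumes the graph is given as a list of directed edges and the placement as a mapping from vertices to block numbers; all names are illustrative.

```python
# Sketch: evaluate the cost c1 of a placement. The graph is a list of
# directed edges; the placement maps each vertex to its block number.

def c1(edges, placement):
    """Sum of block distances |phi(u) - phi(v)| over all edges."""
    return sum(abs(placement[u] - placement[v]) for u, v in edges)

# Example: a path a-b-c-d stored in two blocks of two vertices each.
edges = [("a", "b"), ("b", "c"), ("c", "d")]
good = {"a": 0, "b": 0, "c": 1, "d": 1}   # only (b, c) crosses blocks
bad  = {"a": 0, "b": 1, "c": 0, "d": 1}   # every edge crosses blocks

print(c1(edges, good))  # -> 1
print(c1(edges, bad))   # -> 3
```

The example shows why grouping neighbors pays off: the same graph in the same number of blocks incurs three times the cost under the second placement.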

It is easy to see that if $G$ is connected, $c_1(\varphi)$ can only be minimal if $\varphi$ maps $V$ to consecutive blocks. Even if $G$ contains multiple components, some
queries (e.g., a SELECT over all vertices) benefit if the components are stored in
groups of blocks adjacent to each other. Let us therefore redefine $\varphi$ to be a surjective function $\varphi: V \to \{0, \ldots, k-1\}$, where $k$ is not known in advance.

Our optimization problem is related to a
problem that first appeared in [9], known as the minimum linear arrangement
(MinLA) problem. For a given graph $G = (V, E)$,
the MinLA problem is to find a bijective function $\psi: V \to \{0, \ldots, |V|-1\}$ that
minimizes the cost $\sum_{(u,v) \in E} |\psi(u) - \psi(v)|$.
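The definition can be checked by brute force on toy inputs. The sketch below enumerates all $|V|!$ bijections of a small vertex set, which is only feasible for a handful of vertices; all names are illustrative.

```python
# Sketch: brute-force MinLA for a tiny graph, directly following the
# definition. Infeasible beyond toy sizes; heuristics are needed in
# practice.

from itertools import permutations

def la_cost(edges, psi):
    """Cost of arrangement psi: sum of |psi(u) - psi(v)| over edges."""
    return sum(abs(psi[u] - psi[v]) for u, v in edges)

def min_la(vertices, edges):
    """Return (cost, psi) of an optimal linear arrangement."""
    best_cost, best_psi = None, None
    for perm in permutations(vertices):
        psi = {v: pos for pos, v in enumerate(perm)}
        cost = la_cost(edges, psi)
        if best_cost is None or cost < best_cost:
            best_cost, best_psi = cost, psi
    return best_cost, best_psi

# A 4-cycle: the optimal arrangement costs 6 (e.g., the order a, b, c, d).
cycle = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")]
print(min_la(["a", "b", "c", "d"], cycle)[0])  # -> 6
```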

The MinLA problem has been proved to be
NP-hard, and its decision problem to be NP-complete [8]. The MinLA problem
would be equivalent to our optimization problem if we knew which vertices to
put together in blocks, but not the order in which to put the blocks on the
disk.

Several heuristic algorithms have been
proposed that find near-optimal solutions to MinLA problems. Among the most
successful are spectral sequencing [11], multilevel-based algorithms [16, 22], a
divide-and-conquer algorithm [2], and simulated annealing [19, 18]. Petit [19]
compiled a suite of experiments with several of these algorithms. For the
largest graph, with 10,240 vertices and 30,380 edges, execution times varied
between 4 seconds (spectral sequencing) and 12.83 hours (simulated annealing).

MinLA algorithms can be applied to our
placement problem in two ways. Let $G = (V, E)$ be the graph that
we are trying to store on the disk:

[A] Create a list of possible mappings $\varphi_1, \varphi_2, \ldots$
from $V$ to consecutive
blocks. Convert each mapping to a graph: each block is a vertex, and each edge $(u, v) \in E$ becomes an edge between the blocks that $u$ and $v$ are mapped to. The graphs may contain loops
and parallel edges.

Solve the MinLA
problem for each graph. The mapping $\varphi_j$ for the graph with the overall lowest cost, together with its ordering function $\psi$, defines a function $\varphi = \psi \circ \varphi_j$ that solves our placement problem.

[B] Solve the MinLA problem for $G$ to obtain an arrangement $\psi$. Set two integer variables $i$ and $j$ to
0. While $j < |V|$, do the following:

• Set $v$ to be the vertex with $\psi(v) = j$.

• If $s(\varphi^{-1}(i)) + s(v) > b$, increment $i$.

• Set $\varphi(v) = i$ and increment $j$.
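The loop above amounts to first-fit packing of the vertices in arrangement order. A minimal sketch, assuming the arrangement is given as a list of vertices in $\psi$-order, with `size(v)` and `b` standing for the vertex size $s(v)$ and the block capacity; all names are illustrative:

```python
# Sketch of approach [B]: turn a linear arrangement (a list of vertices
# in psi-order) into a block placement phi by filling blocks first-fit
# along the arrangement.

def arrangement_to_placement(order, size, b):
    phi = {}
    i = 0          # current block number
    used = 0       # bytes already placed in block i
    for v in order:
        if used + size(v) > b:   # vertex does not fit: open next block
            i += 1
            used = 0
        phi[v] = i
        used += size(v)
    return phi

# Example: five vertices of 40 bytes each, 100-byte blocks.
order = ["a", "b", "c", "d", "e"]
phi = arrangement_to_placement(order, lambda v: 40, 100)
print(phi)  # -> {'a': 0, 'b': 0, 'c': 1, 'd': 1, 'e': 2}
```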

Suppose the MinLA algorithm were capable of
finding an optimal linear arrangement for every graph. Then approach [A] would be
guaranteed to find a function $\varphi$ that solves our
placement problem. Unfortunately, the number of possible mappings of $V$ to blocks increases
exponentially with $|V|$. A modern computer could
not even enumerate all mappings for a large graph. Approach [A] is
therefore not useful for a storage manager.

Approach [B] is not guaranteed to find an
optimal $\varphi$. However, the resulting $\varphi$ is likely to be close to optimal. The
MinLA algorithm must be capable of dealing with very large graphs, since it is
applied directly to the input graph.

Figure 6 shows two linear arrangements for
an example graph. The arrangement shown under (a) is the MinLA. The
edge-weighted graph below each arrangement illustrates how the algorithm
described above under [B] would convert the arrangement into a disk
representation. We call such a graph a block graph. Each vertex in a
block graph corresponds to one block on the disk. The block graph contains an
edge $(i, j)$, $i \ne j$, if there is an edge $(u, v) \in E$ with $\varphi(u) = i$ and $\varphi(v) = j$. The weight of edge $(i, j)$ is the cardinality of the set $\{(u, v) \in E \mid \varphi(u) = i, \varphi(v) = j\}$. The block graph preserves enough
information to calculate $c_1(\varphi)$. Notice that the
placement derived from the optimal arrangement in (a)
is inferior to the placement shown in (b).
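Building the weighted block graph from a placement is a single pass over the edges; a minimal sketch with illustrative names:

```python
# Sketch: build the weighted block graph for a placement phi. An edge
# (i, j), i != j, gets weight equal to the number of graph edges (u, v)
# with phi(u) = i and phi(v) = j. Intra-block edges are not represented.

from collections import Counter

def block_graph(edges, phi):
    weights = Counter()
    for u, v in edges:
        i, j = phi[u], phi[v]
        if i != j:
            weights[(i, j)] += 1
    return weights

edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]
phi = {"a": 0, "b": 0, "c": 1, "d": 1}
print(block_graph(edges, phi))  # -> Counter({(0, 1): 2})
```

From this structure, $c_1$ can be recovered as the sum of each weight multiplied by the block distance $|i - j|$, which is why the block graph preserves enough information to calculate it.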

Suppose we could find a placement $\varphi$ that
solved the minimization problem for $c_1$. Would it be
optimal? $c_1$ assumes that accessing blocks $i$ and $i+1$ together
takes exactly half of the time of accessing blocks $i$ and $i+2$ together.
In practice, deviations from this assumption may be substantial.

Suppose that $(u, v) \in E$, $u \ne v$, $\varphi(u) = i$,
and $\varphi(v) = i + 1$. By $c_1$,
this placement is optimal if $u$ and $v$ cannot be placed in the same block. Assume a
graph traversal algorithm that prints the labels of the vertices it finds explores edge $(u, v)$. After retrieving block $i$ from the disk, it is quite
possible that block $i + 1$ has
just passed under the disk head when the algorithm returns from printing the
label of $u$. In the worst
case, the algorithm has to wait for a whole rotation of the disk until block $i + 1$ passes under the disk head
again.

The read-ahead functionality of the disk
drive may mitigate the problem. However, read-ahead works only in the forward
direction. If $(u, v) \in E$ and $\varphi(u) > \varphi(v)$,
there is no guarantee that block $\varphi(v)$ is in the disk’s cache after block $\varphi(u)$ has been read. $c_1$ does not differentiate between edge $(u, v)$ and edge $(v, u)$.

Moreover, read-ahead is usually only
performed when the disk drive’s request queue is empty. If the algorithm
requests several different blocks at the same time, read-ahead data may only be
available for the last block read.

A storage manager that makes placement
decisions based on $c_1$ should implement its own
buffer and buffer management. Whenever a block $i$ is read, a uniform number of blocks before
and after block $i$ should
be retrieved from the disk.

Instead of relying only on caching and
buffering, we suggest refining the placement. The third and fourth desiderata
for a storage manager for graph data are to

(3) minimize the total edge weight in the
block graph, and to

(4) minimize the number of edges in the
block graph.

Again, we can express the desiderata as
optimization problems over $G$ and $\varphi$:

Minimize $c_2(\varphi) = |\{(u, v) \in E \mid \varphi(u) \ne \varphi(v)\}|$, and minimize $c_3(\varphi) = |\{(i, j) \mid i \ne j, \exists (u, v) \in E: \varphi(u) = i, \varphi(v) = j\}|$, each subject to $s(\varphi^{-1}(i)) \le b$ for all $i$.

Notice the similarity between $c_1$ and $c_2$. $c_1$ penalizes edges that join vertices in
different blocks by the distance between the blocks. $c_2$ penalizes
such edges uniformly. Put differently, $c_1$ rewards
vertices that are stored close to most of their neighbors, while $c_2$ rewards vertices that are stored in
the same block with most of their neighbors.
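Both costs can be evaluated directly from a placement. In the sketch below, `c2` counts the edges that cross blocks and `c3` counts the distinct ordered block pairs joined by at least one edge; all names are illustrative.

```python
# Sketch: the costs c2 (number of crossing edges) and c3 (number of
# distinct block-graph edges) for a placement phi.

def c2(edges, phi):
    return sum(1 for u, v in edges if phi[u] != phi[v])

def c3(edges, phi):
    return len({(phi[u], phi[v]) for u, v in edges if phi[u] != phi[v]})

edges = [("a", "c"), ("b", "c"), ("a", "d"), ("b", "e")]
phi = {"a": 0, "b": 0, "c": 1, "d": 1, "e": 2}
print(c2(edges, phi))  # -> 4  (all four edges cross block boundaries)
print(c3(edges, phi))  # -> 2  (only block pairs (0, 1) and (0, 2) occur)
```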

Minimizing $c_1$ tends
to leave short edges between blocks; minimizing $c_2$ tends
to create clusters of vertices within blocks. Clusters are useful, as they may
enable graph algorithms to run several iterations based on the data in just a
few blocks. DFS algorithms benefit especially, as they need only a single
unexplored neighbor in each iteration to continue.

Minimizing $c_3$ channels
the edges in $G$ into as
few edges in the block graph as possible. The function $c_3$ ignores the number of
edges in $G$ that span
two blocks. The reasoning is the following: reading a block from the
disk takes a certain amount of time, regardless of how many vertices in the
block are needed by the graph algorithm. It is therefore sensible to try to
reduce the number of blocks in which the neighbors of the vertices of a block are
stored.

Minimizing $c_2$ is
a special case of the graph partitioning problem, a fundamental and
well-researched problem in graph theory. Given a graph $G = (V, E)$,
its objective is to divide $V$ into $k$ disjoint subsets $V_1, \ldots, V_k$,
such that the number of edges that join vertices from different partitions is
minimal, and $|V_i| \le (1 + \varepsilon) \frac{|V|}{k}$ for all $i$. $k$ and $\varepsilon$ are given. Each set $V_i$ is
called a partition.

The graph partitioning problem has been
formulated in several versions. A version for edge-weighted graphs minimizes
not the number but the total weight of the edges that cross partitions. A version
for vertex-weighted graphs uses the following balancing constraint: $w(V_i) \le (1 + \varepsilon) \frac{w(V)}{k}$, where $w(v)$ is
the weight of vertex $v$ and $w(V_i) = \sum_{v \in V_i} w(v)$.
In another version, the maximum partition size can be specified per partition.
In [15], the problem is extended to graphs where each vertex has multiple
weights.

The problem is relevant to many
applications, including parallel scientific computing, task scheduling, and
VLSI design. Graph partitioning is used, for instance, in scientific
simulations, where computation is performed iteratively on each element of a
physical 2D or 3D mesh model (e.g., a model of an airfoil), to map the mesh
elements to processors such that the load is roughly equal and inter-processor
communication is minimized.

The graph partitioning problem is
NP-complete [8], but because of its relevance, a large number of heuristic
algorithms have been developed. A detailed survey with examples can be found in
[23].

The graph partitioning problem has
traditionally been approached with recursive bisection; that is, by first
dividing $V$ into two
partitions, then dividing each of these partitions into two partitions, and so
on. In the mid-90s, Karypis and Kumar [13] proposed a multilevel algorithm that
achieves a $k$-way
partitioning in one run. This has since become the second standard approach to
graph partitioning.
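The recursive scheme can be sketched independently of the bisection heuristic used at each step. In the sketch below, the two-way split `bisect(vertices, fraction)` is a hypothetical placeholder (in practice it would be, e.g., a Kernighan-Lin or spectral bisection step); the toy heuristic shown ignores the edges entirely and merely cuts a list proportionally.

```python
# Sketch: recursive bisection driver. The split heuristic is a
# hypothetical parameter; only the recursion structure is the point.

def recursive_partition(vertices, k, bisect):
    """Divide vertices into k parts by repeated two-way splits."""
    if k == 1:
        return [vertices]
    half = k // 2
    left, right = bisect(vertices, half / k)  # left gets half/k of the load
    return (recursive_partition(left, half, bisect)
            + recursive_partition(right, k - half, bisect))

# Toy split heuristic: cut a list proportionally (ignores edges).
def naive_bisect(vs, fraction):
    cut = round(len(vs) * fraction)
    return vs[:cut], vs[cut:]

parts = recursive_partition(list("abcdefgh"), 4, naive_bisect)
print(parts)  # -> [['a', 'b'], ['c', 'd'], ['e', 'f'], ['g', 'h']]
```

Passing the fraction `half / k` rather than always 0.5 keeps the parts balanced when $k$ is not a power of two.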

Because most of the proposed algorithms are
complex, several authors have published open source ready-to-use tools that
partition arbitrary graphs. The two most widely used tools are Chaco version
2.0 (Hendrickson and Leland, published 1995, approx. 30,000 lines of C code
[10]) and Metis version 4.0 (Karypis and Kumar, published 1998, approx. 22,000
lines of C code [14]).

The optimization problem for $c_2$ is not a partitioning problem because $k$, the number of blocks in the
disk representation, is not known. To use an existing partitioning algorithm, $k$ has to be guessed. Given $k$, a partitioning algorithm
for vertex-weighted graphs can be used. The weight of each vertex $v$ is given by $s(v)$. To get $b$ on
the right side of the balancing constraint, the parameter $\varepsilon$ has to be set to $\frac{k b}{s(V)} - 1$, where $s(V) = \sum_{v \in V} s(v)$. $k$ must be chosen such that $\varepsilon$ is non-negative. Choosing $k$ such that $\varepsilon = 0$ means that the algorithm cannot
leave any block with free space. This might not be possible. In general,
several different values for $k$ should be tried.
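The arithmetic above can be made concrete. A minimal sketch, assuming the balancing constraint has the form $(1 + \varepsilon)\,s(V)/k \le b$ on its right side; the numbers and names are illustrative:

```python
# Sketch: choose the imbalance parameter eps for a k-way partitioner so
# that the right side of the balancing constraint equals the block
# capacity b: (1 + eps) * s(V) / k = b, hence eps = k * b / s(V) - 1.

import math

def epsilon_for(k, b, total_size):
    eps = k * b / total_size - 1
    if eps < 0:
        raise ValueError("k too small: the graph cannot fit in k blocks")
    return eps

total = 8000                   # s(V): total bytes of all vertices
b = 1000                       # block capacity in bytes
k_min = math.ceil(total / b)   # smallest feasible k (eps = 0 here)
print(k_min, epsilon_for(k_min, b, total))  # -> 8 0.0
print(epsilon_for(10, b, total))            # -> 0.25
```

With $k = k_{\min}$ the partitioner has no slack at all, which illustrates why several larger values of $k$ should be tried.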

Unfortunately, most partitioning algorithms
have been designed for a relatively small number of partitions. Chaco
implements only bisection, quadrisection, and octasection; that is, splits into $k = 2$, $4$, and $8$ parts. Metis (algorithm kmetis, default
parameters) has a memory complexity that grows with $k$,
which makes it impracticable to obtain a result for large $k$. For reference, a database with 5000 blocks
at 16 KB per block can hold merely 80 MB of data.

Finally, existing tools should always be
used with care. Metis, for instance, contains several hard-coded parameter
cases that allow individual partitions to violate the balancing
constraint; that is, to grow too large. For our purpose, the balancing constraint
has to be strict.

We have yet to tackle the optimization
problem for $c_3$. Both Chaco and Metis implement
procedures that try to reduce the number of partitions to which each partition is
linked. Unfortunately, these features are not well documented. However, this
shows that the minimization problems for $c_2$ and $c_3$
can be addressed together.

Figure 7 shows three placements of an
example graph with 16 vertices into disk blocks. It shows the block graph for
each placement, as well as the cost according to $c_1$, $c_2$, and $c_3$.
Boldface indicates the lowest value for each function.

The example graph is undirected. For simplicity,
the block graphs are shown as undirected graphs as well. To determine the
values of the cost functions, each edge is treated as two symmetric directed
edges with the same weight.

Which of the three placements is
preferred? And how does it relate to the functions $c_1$, $c_2$, and $c_3$?
These questions are difficult, and we cannot answer them definitively. Which
placement is preferred depends on a number of factors, including the type of
query that is run most often, the characteristics of the graph, and the
available buffer memory. We might prefer alternative (b) for BFS queries
because, regardless of which vertex is the source vertex, there will only be two
disk accesses: one to retrieve the block in which the source vertex is stored,
and one to retrieve all remaining blocks. We might prefer alternative (c) if
block 2 can be kept in the buffer permanently.

A storage manager should combine the cost
functions in an intelligent way. Unfortunately, $c_1$, $c_2$, and $c_3$ are
based on different scales and are difficult to compare. $c_1(\varphi) \ge c_2(\varphi) \ge c_3(\varphi)$ always
holds. The next chapter describes how G-Store uses the functions.