4. The computer readable storage medium of claim 1 wherein the data
redistribution schedule table specifies a degree of parallelism
characterizing the number of source tables that can be simultaneously
redistributed.

Description:

FIELD OF THE INVENTION

[0001] This invention relates generally to digital data storage and
retrieval in a shared-nothing system. More particularly, this invention
relates to a technique for expanding a shared-nothing system with minimal
service interruption.

BACKGROUND OF THE INVENTION

[0002]FIG. 1 illustrates a shared-nothing network 100 known in the art.
The shared- nothing network or architecture 100 includes a master node
102 and a set of shared- nothing nodes 104_A through 104_N. Each
shared-nothing node 104 has its own private memory, disks and
input/output devices that operate independent of any other node in the
architecture 100. Each node is self sufficient, sharing nothing across
the network. Therefore, there are no points of contention across the
system and no sharing of system resources. The advantage of this
architecture is that it is highly scalable.

[0003] Database systems store data in tables distributed across
shared-nothing nodes. Data is stored by assigning each datum (e.g.,
record or row) to one of the nodes. Data is typically assigned to nodes
according to one of two principles. One approach is hash distribution,
which uses a hash function to map data to nodes. Another approach is to
assign data to nodes in a round-robin or random manner.

[0004] The part of the datum for which the hash is computed is referred to
as a distribution key. The distribution key can be a compound key, i.e.,
consisting of several columns of a row. Hash distribution results in
uniform data distribution and the co-location of records with the same
distribution key (i.e., records with the same distribution key are
assigned to the same node). Co-location of data is frequently exploited
in join operations where data from different database tables are joined.
Join operations are usually the most costly operation in a query
workload. By selecting frequently used join columns as distribution keys,
joins can be performed on a per-node basis without having to redistribute
the data among nodes between processing steps. All rows of a table are
distributed using the same distribution keys. Individual tables generally
differ in choice of distribution key.

[0005] Nodes are added to a shared-nothing system to accommodate more data
or additional query workloads. When new nodes are added to a system, data
needs to be redistributed. Data redistribution commonly entails the
examination and positional reassignment of individual datum. Reassessing
each row of a large data store can take a significant amount of time,
e.g., reassigning tens or hundreds of terabytes of data may take several
days. Consequently, it is common practice to schedule downtime of several
days when a node is added to a shared-nothing system. It would be
desirable to minimize the downtime associated with the expansion of a
shared-nothing data store.

[0007] The invention is more fully appreciated in connection with the
following detailed description taken in conjunction with the accompanying
drawings, in which:

[0008]FIG. 1 illustrates a shared-nothing network architecture known in
the art.

[0009]FIG. 2 illustrates a master node of a shared-nothing network
implemented in accordance with an embodiment of the invention.

[0010]FIG. 3 illustrates a shared-nothing node of a shared-nothing
network implemented in accordance with an embodiment of the invention.

[0011]FIG. 4 illustrates processing operations associated with an
expansion evaluator implemented in accordance with an embodiment of the
invention.

[0012] FIG. 5 illustrates a schedule table implemented in accordance with
an embodiment of the invention.

[0013]FIG. 6 illustrates table redistribution operations performed in
accordance with an embodiment of the invention.

[0014] Like reference numerals refer to corresponding parts throughout the
several views of the drawings.

DETAILED DESCRIPTION OF THE INVENTION

[0015]FIG. 2 illustrates a master node 202 configured to implement
operations of the invention. The master node 202 includes standard
components, such as one or more CPUs 210_A through 210_N. The CPUs are
attached to input/output devices 212 via a bus 214. The input/output
devices 212 may include standard components, such as a keyboard, mouse,
display, printer and the like. A network interface circuit 216 is also
connected to the bus 214, allowing the master node 202 to operate in a
networked environment.

[0016] A memory 220 is also connected to the bus 214. The memory 220
stores executable instructions to implement embodiments of the invention.
In particular, the memory 220 stores an expansion evaluator 222. The
expansion evaluator 222 includes executable instructions to evaluate the
addition of one or more nodes to a shared-nothing system. Based upon the
performance attributes of the existing and added nodes, the expansion
evaluator 222 forms a data redistribution schedule table with parameters
224, which is stored in memory 220. A user may access the table 224 to
reconfigure the data redistribution task and/or to track the progress of
the data redistribution task.

[0017]FIG. 3 illustrates a shared nothing node 304 implemented in
accordance with an embodiment of the invention. The shared nothing node
304 also includes standard components, such as a set of CPUs 310_A
through 310_N, which are connected to a bus 314. A set of input/output
devices 312 are also connected to the bus 314. A network interface
circuit 316, also connected to the bus 314, provides connectivity to a
network.

[0018] A memory 320 is also connected to the bus 314. The memory 320
includes data and executable instructions to implement operations of the
invention. In particular, the memory 320 includes a redistribution module
322 with executable instructions to redistribute data within a
shared-nothing database segment 324. The redistribution module 322
distributes data in accordance with the data redistribution schedule
table.

[0019]FIG. 4 illustrates processing operations associated with an
embodiment of the invention. A new shared-nothing data store
configuration is evaluated 402. This may be implemented with the
expansion evaluator 222. A data redistribution schedule table with
specified parameters is then formed 404. This operation may also be
implemented with the expansion evaluator 222. Data is then redistributed
within the shared-nothing system 406. Redistribution modules 322 on
individual nodes may be used to implement this operation.

[0020] FIG. 5 illustrates a data redistribution schedule table 500
configured in accordance with an embodiment of the invention. The table
500 includes a column 502 specifying different tables that need to be
migrated. The table 500 also includes a column 504 specifying the
priority that should be ascribed to the migration of the table. Thus, for
example, a frequently used table may be assigned a high priority.

[0021] The table 500 also specifies a degree of parallelism 506. The
degree of parallelism characterizes the number of source tables that can
be simultaneously redistributed. The table 500 may also have one or more
time columns 508. The time column may specify a start time for a data
migration task. The time column 508 may also specify a duration time that
a task may be performed (e.g., from midnight to six in the morning on
consecutive days until completed).

[0022] FIG. 5 illustrates that the table 500 may include a progress column
510. The progress information characterizes the percent completion of a
data migration task. Finally, the table 500 may also include a completion
column 512 to indicate whether a data migration task has been completed.

[0023] As previously indicated, the expansion evaluator 222 may be used to
form the table 500. Preferably, the table is available for editing by an
administrator. For example, the table is preferably editable by an
administrator to alter parameters, such as priority, parallelism and
time. Regular query tools, such as SQL may be used to alter parameters of
the table.

[0024]FIG. 6 illustrates operations performed by a redistribution module
322. In one embodiment, the redistribution module 602 creates a target
table corresponding to an existing table 602. For example, the target
table is formed on a newly added node of the system. The target table is
a clone of the original table definition, but does not contain any data
and is not visible to other transactions. The source table is then locked
604. This prevents updates, but reads are permitted. Data is read from
the source table and is redistributed to the target table 606. That is,
data is read row-by-row and is inserted into the target table. When the
data exchange is complete, the source and target tables are atomically
exchanged. That is, the system sees the new target table instead of the
original source table.

[0025] The distribution policy of a table encodes the type of distribution
to be used e.g., hash or round robin. In the case of a hash distribution,
the distribution policy defines the distribution key. Preferably, the
distribution of each table is recorded in the database catalog (a table
and partitions of a table are referred to as a table for simplicity). In
the case of hash distribution, the datum (one or more columns of a row)
is used to compute a hash value. Hash values are mapped to nodes, e.g.,
based on a modulo operation. Preferably, the distribution policy is
transparent to the client application, in which case distributed data can
be accessed with the same queries by a client. Depending on the
distribution, the query processor may choose different execution plans
and exploit co-location due to hash distribution, where applicable.
Depending on the query workload it may not be possible to process
randomly distributed data as efficiently as hash distributed data, both
in terms of response time and resource footprint, e.g., memory.

[0026] The data migration of the invention may be implemented with minimal
downtime. In one embodiment, the system is offline only during an
initialization stage. In the initialization stage, a data directory is
formed and information about all user tables from all existing databases
is defined on a new node. This process captures metadata for each table
in an expansion schema for status tracking. As part of the initialization
process, the policy for all tables is set to random distribution (i.e.,
hash distribution policies are nullified). Users can continue to access
the database after initialization is complete and the system is back
online. While online, data integrity is maintained and normal operations,
such as ETL jobs, user queries and reporting can be supported. Each table
or partition is unavailable for write operations only when the table is
in flight, i.e., when the table is being redistributed.

[0027] Thus, the invention supports expansion of a shared-nothing database
with minimal downtime. The invention redistributes individual tables
according to a new mapping to spread data uniformly across original and
added nodes. Data redistribution may be performed while the database
system is fully operational. Data that has not been redistributed remains
accessible, although access efficiency may be reduced. Preferably,
fault-tolerance of the system is maintained during the redistribution by
maintaining at least two copies of each datum.

[0028] The data redistribution schedule table with specified parameters
allows an administrator to tailor data migration tasks. For example, the
table may be used to allow an administrator to give priority to
frequently used tables so that they are redistributed first. The table
may also be set to perform data redistribution tasks at specific times,
e.g., only during off-peak hours. A degree of parallelism may also be
defined to specify the number of tables to be redistributed at once.

[0029] An embodiment of the present invention relates to a computer
storage product with a computer readable storage medium having computer
code thereon for performing various computer-implemented operations. The
media and computer code may be those specially designed and constructed
for the purposes of the present invention, or they may be of the kind
well known and available to those having skill in the computer software
arts. Examples of computer-readable media include, but are not limited
to: magnetic media such as hard disks, floppy disks, and magnetic tape;
optical media such as CD-ROMs, DVDs and holographic devices;
magneto-optical media; and hardware devices that are specially configured
to store and execute program code, such as application-specific
integrated circuits ("ASICs"), programmable logic devices ("PLDs") and
ROM and RAM devices. Examples of computer code include machine code, such
as produced by a compiler, and files containing higher-level code that
are executed by a computer using an interpreter. For example, an
embodiment of the invention may be implemented using JAVA®, C++, or
other programming language and development tools. Another embodiment of
the invention may be implemented in hardwired circuitry in place of, or
in combination with, machine-executable software instructions.

[0030] The foregoing description, for purposes of explanation, used
specific nomenclature to provide a thorough understanding of the
invention. However, it will be apparent to one skilled in the art that
specific details are not required in order to practice the invention.
Thus, the foregoing descriptions of specific embodiments of the invention
are presented for purposes of illustration and description. They are not
intended to be exhaustive or to limit the invention to the precise forms
disclosed; obviously, many modifications and variations are possible in
view of the above teachings. The embodiments were chosen and described in
order to best explain the principles of the invention and its practical
applications, they thereby enable others skilled in the art to best
utilize the invention and various embodiments with various modifications
as are suited to the particular use contemplated. It is intended that the
following claims and their equivalents define the scope of the invention.