High-Performance Parallel Database Processing and Grid Databases- P1

Parallel databases are database systems that are implemented on parallel computing
platforms. High-performance query processing therefore concerns the processing of
database queries and transactions using parallelism techniques applied to an
underlying parallel computing platform in order to achieve high performance.

Copyright © 2008 by John Wiley & Sons, Inc. All rights reserved.
Published by John Wiley & Sons, Inc., Hoboken, New Jersey.
Published simultaneously in Canada.
No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or
by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as
permitted under Sections 107 or 108 of the 1976 United States Copyright Act, without either the prior
written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to
the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400, fax
978-646-8600, or on the web at www.copyright.com. Requests to the Publisher for permission should be
addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ
07030, (201) 748-6011, fax (201) 748-6008.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in
preparing this book, they make no representations or warranties with respect to the accuracy or
completeness of the contents of this book and specifically disclaim any implied warranties of
merchantability or fitness for a particular purpose. No warranty may be created or extended by sales
representatives or written sales materials. The advice and strategies contained herein may not be suitable
for your situation. You should consult with a professional where appropriate. Neither the publisher nor
author shall be liable for any loss of profit or any other commercial damages, including but not limited to
special, incidental, consequential, or other damages.
For general information on our other products and services please contact our Customer Care
Department within the U.S. at 877-762-2974, outside the U.S. at 317-572-3993 or fax 317-572-4002.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print,
however, may not be available in electronic formats.
Library of Congress Cataloging-in-Publication Data:
Taniar, David.
High-performance parallel database processing and grid databases / by David
Taniar, Clement Leung, Wenny Rahayu.
p. cm.
Includes bibliographical references.
ISBN 978-0-470-10762-1 (cloth : alk. paper)
1. High performance computing. 2. Parallel processing (Electronic computers)
3. Computational grids (Computer systems) I. Leung, Clement H. C. II. Rahayu,
Johanna Wenny. III. Title.
QA76.88.T36 2008
004’ .35—dc22
2008011010
Printed in the United States of America

Preface
Databases have grown exponentially in size, and this growth is expected to
accelerate as storage costs fall steadily and storage capacity rises rapidly.
Many years ago, a terabyte database was considered large; nowadays such
databases are sometimes regarded as small, and the daily volume of data added
to some databases is measured in terabytes. In the future, petabyte and exabyte
databases will be common.
With such volumes of data, it is evident that the sequential processing paradigm
will be unable to cope; for example, even assuming a data rate of 1 gigabyte per
second, reading through a petabyte database will take over 10 days. To manage
such volumes effectively, it is necessary to allocate multiple resources to the
task, very often on a massive scale. Processing databases of such astronomical
proportions requires an understanding of how high-performance systems and
parallelism work. Besides the massive volume of data to be processed, some data
is distributed across the globe in a Grid environment. These massive data
centers are also part of the emergence of Cloud computing, in which data access
has shifted from local machines to powerful servers hosting web applications and
services, making data access across the Internet through standard web browsers
pervasive. This adds another dimension to such systems.
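As a rough check of the scan-time estimate, the sketch below assumes decimal units (1 petabyte = 10^15 bytes) and a sustained rate of 1 gigabyte per second, the rate at which a single pass over a petabyte takes more than 10 days. The function name `scan_time_days` and the `workers` parameter are illustrative only; the model assumes the data is split evenly across devices and ignores skew, coordination, and communication costs.

```python
# Back-of-the-envelope scan time under idealized assumptions:
# decimal units, a sustained read rate, and perfectly even partitioning.
SECONDS_PER_DAY = 86_400
PETABYTE = 10**15  # bytes (decimal definition, an assumption)
GIGABYTE = 10**9   # bytes


def scan_time_days(db_bytes: float, rate_bytes_per_s: float, workers: int = 1) -> float:
    """Days needed to read db_bytes once when `workers` devices,
    each sustaining rate_bytes_per_s, scan equal shares in parallel."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return db_bytes / (rate_bytes_per_s * workers) / SECONDS_PER_DAY


# One device reading sequentially: 10**6 seconds, a little over 11.5 days.
sequential = scan_time_days(PETABYTE, GIGABYTE)

# One hundred devices scanning disjoint partitions: 10**4 seconds, under 3 hours.
parallel = scan_time_days(PETABYTE, GIGABYTE, workers=100)

print(f"sequential: {sequential:.2f} days")
print(f"parallel (100 workers): {parallel * 24:.2f} hours")
```

Under these assumptions the scan time falls in direct proportion to the number of devices, which is the ideal case; real systems fall short of it.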
Parallelism in databases has been around since the early 1980s, when
many researchers in this area aspired to build large special-purpose database
machines, that is, databases employing dedicated specialized parallel hardware.
Several projects emerged, including Bubba and Gamma, but these came and
went. Commercial DBMS vendors, however, quickly realized the importance
of supporting high performance for large databases, and many of them have
incorporated parallelism and grid features into their products. Their commitment
to high-performance systems and parallelism, as well as grid configurations,
shows the importance and inevitability of parallelism.
In addition, while traditional transactional data is still common, we see
growth in new application domains, broadly categorized as
data-intensive applications. These include data warehousing and online analytic
processing (OLAP) applications, data mining, genome databases, and multimedia
databases manipulating unstructured and semistructured data. It is therefore
critical to understand the underlying principles of data parallelism before
specialized and new application domains can be properly addressed.
This book is written to provide a fundamental understanding of parallelism in
data-intensive applications. It features not only the algorithms for database
operations but also quantitative analytical models, so that performance can be
analyzed and evaluated more effectively.
The present book brings into a single volume the latest techniques and principles
of parallel and grid database processing. It provides a much-needed, self-contained
advanced text for database courses at the postgraduate or final-year undergraduate
levels. In addition, for researchers with a particular interest in parallel databases
and related areas, it will serve as an indispensable and up-to-date reference.
Practitioners contemplating building high-performance databases, or seeking a
good understanding of parallel database technology, will also find this book valuable
for the wealth of techniques and models it contains.
STRUCTURE OF THE BOOK
This book is divided into five parts. Part I gives an introduction to the topic,
including the rationale behind the need for high-performance database processing,
as well as basic analytical models that will be used throughout the book.
Part II, consisting of three chapters, describes parallelism for basic query
operations. These include parallel searching, parallel aggregation and sorting,
and parallel join. These operations are the foundation of query processing, since
complex queries can be decomposed into combinations of these atomic operations.
Part III, consisting of the next four chapters, focuses on more advanced query
operations. This part covers groupby-join operations, parallel indexing, parallel
object-oriented query processing (in particular, collection join), and query
scheduling and optimization.
Whereas the previous two parts deal with parallelism of read-only queries, the next
part, Part IV, concentrates on transactions, also known as write queries. We use
the grid environment to study transaction management. In grid transaction
management, the focus is mainly on grid concurrency control, atomic commitment,
durability, and replication.
Finally, Part V introduces other data-intensive applications, including data
warehousing, OLAP, business intelligence, and parallel data mining.
ACKNOWLEDGMENTS
The authors would like to thank the publisher, John Wiley & Sons, for agreeing
to embark on this exciting journey. In particular, we would like to thank Paul
Petralia, Senior Editor, for supporting this project. We would also like to thank
Whitney Lesch and Anastasia Wasko, Assistants to the Editor, for their endless
efforts to ensure that we remained on track from start to completion. Without their
encouragement and reminders, we would not have been able to finish this book.
We also thank Bruna Pomella, who proofread the entire manuscript, for
commenting on ambiguous sentences and correcting grammatical mistakes.
Finally, we would like to express our sincere thanks to our respective
universities, Monash University, Victoria University, Hong Kong Baptist University,
La Trobe University, and RMIT, where the research presented in this book was
conducted. We are grateful for the facilities and time that we received during the
writing of this book. Without these, the book would not have been written in the
first place.
David Taniar
Clement H.C. Leung
Wenny Rahayu
Sushant Goel