Harvard Computer Science Technical Reports for 1994

We develop a model-based approach to reasoning, in
which the knowledge base is represented as a set of
models (satisfying assignments) rather than a logical
formula, and the set of queries is restricted. We
show that for every propositional knowledge base (KB)
there exists a set of {\em characteristic models}
with the property that a query is true in KB if and
only if it is satisfied by the models in this set.
We fully characterize a set of theories for which the
model-based representation is compact and provides
efficient reasoning. These include cases where the
formula-based representation does not support
efficient reasoning. In addition, we consider the
model-based approach to {\em abductive reasoning} and
show that for any propositional KB, reasoning with
its model-based representation yields an abductive
explanation in time that is polynomial in its size.
Some of our technical results make use of the {\em
Monotone Theory}, a recently introduced characterization
of Boolean functions.
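
Such a set supports a simple decision procedure. As a toy
illustration (a minimal Python sketch with a hypothetical encoding of
models as assignments and queries as predicates), KB entails a query
exactly when every characteristic model satisfies it:

\begin{verbatim}
def entails(models, query):
    """models: iterable of dicts mapping variable -> bool;
    query: a predicate over such assignments.
    KB |= query iff every characteristic model satisfies query."""
    return all(query(m) for m in models)

# Hypothetical KB whose characteristic models over x, y are:
models = [{"x": True, "y": True}, {"x": True, "y": False}]
print(entails(models, lambda m: m["x"]))            # True:  KB |= x
print(entails(models, lambda m: m["x"] or m["y"]))  # True:  KB |= x v y
print(entails(models, lambda m: m["y"]))            # False: y not entailed
\end{verbatim}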

Inherent to our approach is the notion of {\em restricted
queries}: a wide class of queries for which reasoning is
very efficient and exact, even when the model-based
representation KB provides only an approximate
representation of the ``world''.

Moreover, we show that the theory developed here
generalizes the model-based approach to reasoning
with Horn theories, and captures
even the notion of reasoning with Horn approximations.
Our results characterize the Horn theories for which the
Horn-approximation approach is useful, and explain and
place in a wider context the previously observed phenomena
regarding the relative sizes of the formula-based and
model-based representations of KB.

We introduce a new framework for the study of
reasoning. The Learning (in order) to Reason
approach developed here combines the interfaces to
the world used by known learning models with the
reasoning task and a performance criterion suitable
for it. In this framework the intelligent agent is
given access to her favorite learning interface, and
is also given a grace period in which she can
interact with this interface and construct her
representation KB of the world $W$. Her reasoning
performance is measured only after this period, when
she is presented with queries $\alpha$ from some
query language, relevant to the world, and has to
answer whether $W$ implies $\alpha$.
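
The protocol can be caricatured in a few lines (a toy Python sketch;
the interface, the example world, and the decision rule are all
hypothetical simplifications):

\begin{verbatim}
import random

def learn_to_reason(sample_world, queries, grace_period=1000):
    # Grace period: draw examples (models of the world W) from the
    # learning interface and store them as the representation KB.
    kb = [sample_world() for _ in range(grace_period)]
    # Reasoning phase: answer "does W imply alpha?" for each query,
    # here simply by checking alpha against every stored example.
    return {name: all(alpha(m) for m in kb)
            for name, alpha in queries.items()}

# Toy world over variables (x, y): x always holds, y varies.
world = lambda: (True, random.random() < 0.5)
print(learn_to_reason(world, {"x": lambda m: m[0], "y": lambda m: m[1]}))
# typically prints {'x': True, 'y': False}
\end{verbatim}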

The
approach is meant to overcome the main computational
difficulties in the traditional treatment of
reasoning which stem from its separation from the
``world''. First, by allowing the reasoning task to
interface with the world (as in the known learning
models), we avoid the rigid syntactic restriction on
the intermediate knowledge representation. Second,
we make explicit the dependence of the reasoning
performance on the input from the environment. This
is possible only because the agent interacts with the
world when constructing her knowledge representation.

We show how previous results from learning
theory and reasoning fit into this framework and
illustrate the usefulness of the Learning to Reason
approach by exhibiting new results that are not
possible in the traditional setting. First, we give
a Learning to Reason algorithm for a class of
propositional languages for which there are no
efficient reasoning algorithms, when represented as a
traditional (formula-based) knowledge base. Second,
we exhibit a Learning to Reason algorithm for a class
of propositional languages that is not known to be
learnable in the traditional sense.

We present a brief description of a finite element solver
implemented on the Connection Machine CM-5 system. A more detailed
presentation of the issues involved in such an implementation can be
found in [1,2].

In this paper we extend previous work on automatic motion synthesis
for physically realistic 2D articulated figures in three ways. First,
we describe an improved motion-synthesis algorithm that runs
substantially faster than previously reported algorithms. Second, we
present two new techniques for influencing the style of the motions
generated by the algorithm. These techniques can be used by an
animator to achieve a desired movement style, or they can be used to
guarantee variety in the motions synthesized over several runs of the
algorithm. Finally, we describe an animation editor that supports the
interactive concatenation of existing, automatically generated motion
controllers to produce complex, composite trajectories. Taken
together, these results suggest how a usable, useful system for
articulated-figure motion synthesis might be developed.

Motion synthesis is the process of automatically generating
visually plausible motions that meet goal criteria specified by a
human animator. The objects whose motions are synthesized are often
animated characters that are modeled as articulated figures or
mass-spring lattices. Controller synthesis is a technique for motion
synthesis that involves searching in a space of possible controllers
to generate appropriate motions. Recently, automatic
controller-synthesis techniques for 2D articulated figures have been
reported. An open question is whether these techniques can be
generalized to work for 3D animated characters. In this paper we
report successful automatic controller synthesis for 3D articulated
figures and mass-spring models that are subject to nonholonomic
constraints. These results show that the 3D motion-synthesis problem
can be solved in some challenging cases, though much work on this
general topic remains to be done.

The recursive spectral bisection (RSB) algorithm was proposed
by Pothen {\em et al.} [1] as the basis for computing small vertex
separators for sparse matrices. Simon [2] applied this algorithm to mesh
decomposition and showed that spectral bisection compared favorably with
other decomposition techniques. Since then, the RSB algorithm has been
widely accepted in the scientific community because of its robustness and
its consistency in the high-quality partitionings it generates. The major
drawback of the RSB algorithm is its high computing cost, as noted in [2],
caused by the need for solving a series of eigenvalue problems. It is
often stated that an unstructured mesh can be decomposed after it is
generated, and the decomposition reused for the different calculations
performed on that mesh. However, a new partitioning must be computed
if adaptive mesh refinement is required. The mesh also has to be
re-decomposed if the number of processing nodes available to the user
changes between two calculations. To prevent mesh decomposition
from becoming a significant computational bottleneck, we have developed
an efficient data-parallel implementation of the RSB algorithm in the
CM Fortran language [3]. In this paper, we present only an abbreviated
description of the parallel implementation of the RSB algorithm, followed
by two decomposition examples. Details of the implementation can be found
in [4].
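
For orientation, the spectral bisection step itself can be sketched in
a few lines of serial, dense-matrix Python (the contribution of the
paper is the data-parallel implementation; this toy version only shows
the underlying computation):

\begin{verbatim}
import numpy as np

def spectral_bisect(adj):
    """adj: symmetric 0/1 adjacency matrix. Returns two halves."""
    laplacian = np.diag(adj.sum(axis=1)) - adj
    eigvals, eigvecs = np.linalg.eigh(laplacian)
    fiedler = eigvecs[:, 1]          # eigenvector of the second-
    order = np.argsort(fiedler)      # smallest eigenvalue; split
    half = len(order) // 2           # at its median entry
    return order[:half], order[half:]

# A 6-vertex path graph splits into its two natural halves.
n = 6
adj = np.zeros((n, n))
for i in range(n - 1):
    adj[i, i + 1] = adj[i + 1, i] = 1
print(spectral_bisect(adj))          # e.g. ([0 1 2], [3 4 5])
\end{verbatim}

Recursive application of this step to each half yields the RSB
decomposition.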

The objective of this paper is to analyze the impact of data
mapping strategies on the performance of finite element applications.
First, we describe a parallel mesh decomposition algorithm based on
recursive spectral bisection used to partition the mesh into element
blocks. A simple heuristic algorithm then renumbers the mesh nodes.
Experiments with large three-dimensional meshes demonstrate the
efficiency of these mapping strategies and assess the performance of a
finite element program for fluid dynamics.

Efficient data motion has been of critical importance in high
performance computing almost since the first electronic computers were
built. Providing sufficient memory bandwidth to balance the capacity of
processors has led to memory hierarchies and to banked and interleaved
memories.
With the rapid evolution of MOS technologies, microprocessor and memory
designs, it is realistic to build systems with thousands of processors and a
sustained performance of a trillion operations per second or more. Such
systems require tens of thousands of memory banks, even when locality of
reference is exploited. Using conventional technologies, interconnecting
several thousand processors with tens of thousands of memory banks is
feasible only through some form of sparse interconnection network.
Efficient use of locality of reference and network bandwidth is critical.
We review these issues in this paper.

Can stochastic search algorithms outperform existing deterministic
heuristics for the NP-hard problem Number Partitioning if given a
sufficient, but practically realizable amount of time? In a thorough
empirical investigation using a straightforward implementation of one
such algorithm, simulated annealing, Johnson et al. (1991) concluded
tentatively that the answer is "no." In this paper we show that the
answer can be "yes" if attention is devoted to the issue of problem
representation (encoding). We present results from empirical tests of
several encodings of Number Partitioning with problem instances
consisting of multiple-precision integers drawn from a uniform
probability distribution. With these instances and with an appropriate
choice of representation, stochastic and deterministic searches can --
routinely and in a practical amount of time -- find solutions several
orders of magnitude better than those constructed by the best
heuristic known (Karmarkar and Karp, 1982), which does not employ
searching. The choice of encoding is found to be more important than
the choice of search technique in determining search efficacy. Three
alternative explanations for the relative performance of the encodings
are tested experimentally. The best encodings tested are found to
contain a high proportion of good solutions; moreover, in those
encodings, the solutions are organized into a single "bumpy funnel"
centered at a known position in the search space. This is likely to be
the only relevant structure in the search space because a blind search
performs as well as any other search technique tested when the search
space is restricted to the funnel tip. We also show how analogous
representations might be designed in a principled manner for other
difficult combinatorial optimization problems by applying the
principles of parameterized arbitration, parameterized constraint, and
parameterized greediness.
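
For reference, the Karmarkar-Karp set-differencing heuristic mentioned
above fits in a few lines of Python (an illustrative implementation,
not the paper's code):

\begin{verbatim}
import heapq

def karmarkar_karp(numbers):
    """Partition difference found by largest-first differencing."""
    heap = [-x for x in numbers]        # max-heap via negation
    heapq.heapify(heap)
    while len(heap) > 1:
        a = -heapq.heappop(heap)        # largest remaining
        b = -heapq.heappop(heap)        # second largest
        heapq.heappush(heap, -(a - b))  # commit them to opposite sides
    return -heap[0]

print(karmarkar_karp([10, 8, 7, 6, 5]))
# prints 2, although the optimal split {10,8} vs {7,6,5} differs by 0;
# this is the kind of gap that search over good encodings can close
\end{verbatim}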

We present a system for generating parsers based directly on the
metaphor of parsing as deduction. Parsing algorithms can be
represented directly as deduction systems, and a single deduction
engine can interpret such deduction systems so as to implement the
corresponding parser. The method generalizes easily to parsers for
augmented phrase structure formalisms, such as definite-clause
grammars and other logic grammar formalisms, and has been used for
rapid prototyping of parsing algorithms for a variety of formalisms
including variants of tree-adjoining grammars, categorial grammars,
and lexicalized context-free grammars.
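
The following toy (a hypothetical encoding, far simpler than the
system described) conveys the metaphor: parse items are derived facts,
and a generic agenda-driven engine interprets CKY-style inference
rules for a grammar in Chomsky normal form:

\begin{verbatim}
def parse(words, unary, binary, start="S"):
    """unary: word -> nonterminals (A -> word);
    binary: (B, C) -> nonterminals (A -> B C).
    Item (A, i, j) means A derives words[i:j]."""
    chart, agenda = set(), []
    for i, w in enumerate(words):              # axioms from the input
        agenda += [(a, i, i + 1) for a in unary.get(w, ())]
    while agenda:
        item = agenda.pop()
        if item in chart:
            continue
        chart.add(item)
        b, i, j = item
        for (c, j2, k) in list(chart):         # combine to the right
            if j2 == j:
                agenda += [(a, i, k) for a in binary.get((b, c), ())]
        for (c, i2, j2) in list(chart):        # combine to the left
            if j2 == i:
                agenda += [(a, i2, j) for a in binary.get((c, b), ())]
    return (start, 0, len(words)) in chart

unary = {"fish": {"N", "V"}, "people": {"N"}}
binary = {("N", "V"): {"S"}}
print(parse(["people", "fish"], unary, binary))   # True
\end{verbatim}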

We present three different methods for finding solutions to the 2D
translation-only {\em containment} problem: find translations for $k$
polygons that place them inside a given polygonal container without
overlap. Both the container and the polygons to be placed in it may
be non-convex. First, we provide several exact algorithms that
improve results for $k=2$ or $k=3$. In particular, we give an
algorithm for three convex polygons and a non-convex container with
running time in ${\rm O}(m^3 n \log mn)$, where $n$ is the number of
vertices in the container, and $m$ is the total number of vertices of
the $k$ polygons. This is an improvement of a factor of $n^2$ over
previous algorithms. Second, we give an approximation algorithm for
$k$ non-convex polygons and a non-convex container based on
restriction and subdivision of the configuration space. Third, we
develop a MIP (mixed integer programming) model for $k$ non-convex
polygons and a non-convex container.

We present an extension to a hybrid graph-bisection algorithm
developed by Bui et al. that uses vertex coalescing and the
Kernighan-Lin variable-depth algorithm to minimize the size of the cut
set. In the original heuristic technique, one iteration of vertex
coalescing is used to improve the performance of the original
Kernighan-Lin algorithm. We show that by performing vertex coalescing
recursively, substantially greater improvements can be achieved for
standard random graphs of average degree in the range [2.0,5.0].
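
As a rough illustration (not the authors' implementation), one level
of vertex coalescing can be realized by contracting a random matching;
recursive application yields the coarsened graphs on which the
Kernighan-Lin pass then runs:

\begin{verbatim}
import random

def coalesce(edges, n):
    """Contract a random matching of an n-vertex graph.
    Returns (new_edges, new_n, vertex mapping)."""
    matched, merge = set(), {}
    order = list(edges)
    random.shuffle(order)
    for u, v in order:                  # greedily pick a matching
        if u not in matched and v not in matched:
            matched |= {u, v}
            merge[v] = u                # coalesce v into u
    names, new_id = {}, {}
    for v in range(n):                  # compact the surviving names
        root = merge.get(v, v)
        names.setdefault(root, len(names))
        new_id[v] = names[root]
    new_edges = {tuple(sorted((new_id[u], new_id[v])))
                 for u, v in edges if new_id[u] != new_id[v]}
    return new_edges, len(names), new_id

print(coalesce({(0, 1), (1, 2), (2, 3), (3, 0)}, 4))
# e.g. ({(0, 1)}, 2, {0: 0, 1: 0, 2: 1, 3: 1})
\end{verbatim}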

This thesis introduces Programmable Reduced Instruction Set Computers
(PRISC) as a new class of general-purpose computers. PRISC use RISC
techniques as a base, but in addition to the conventional RISC
instruction resources, PRISC offer hardware programmable resources
which can be configured based on the needs of a particular
application. This thesis presents the architecture, operating system,
and programming language compilation techniques which are needed to
successfully build PRISC. Performance results are provided for the
simplest form of PRISC -- a RISC microprocessor with a set of
programmable functional units consisting of only combinational
functions. Results for the SPECint92 benchmark suite indicate that an
augmented compiler can provide a performance improvement of 22% over
the underlying RISC computer with a hardware area investment less than
that needed for a 2 kilobyte SRAM. In addition, active manipulation of
the source code leads to significantly higher local performance gains
(250%-500%) for general abstract data types such as short-set vectors,
hash tables, and finite state machines. Results on end-user
applications that utilize these data types indicate a performance gain
from 32%-213%.

Given a two-dimensional, non-overlapping layout of convex and
non-convex polygons, {\em compaction} refers to a simultaneous motion
of the polygons that generates a more densely packed layout. In
industrial two-dimensional packing applications, compaction can
improve the material utilization of already tightly packed layouts.
Efficient algorithms for compacting a layout of non-convex polygons
are not previously known.

This dissertation offers the first
systematic study of compaction of non-convex polygons. We start by
formalizing the compaction problem as that of planning a motion that
minimizes some linear objective function of the positions. Based on
this formalization, we study the complexity of compaction and show it
to be PSPACE-hard.

The major contribution of this dissertation
is a position-based optimization model that allows us to calculate
directly, via linear programming, new polygon positions that
constitute a locally optimal solution of the objective. This model yields
the first practically efficient algorithm for translational
compaction--compaction in which the polygons can only translate. This
compaction algorithm runs in almost real time and improves the
material utilization of production quality human-generated layouts
from the apparel industry.
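
A toy instance of the position-based idea (a one-dimensional
illustration using SciPy's linear-programming routine; the
dissertation's model handles non-convex polygons in two dimensions):
once the non-overlap orderings are fixed, leftward compaction becomes
a linear program over the positions:

\begin{verbatim}
from scipy.optimize import linprog

w1, w2, W = 3.0, 4.0, 10.0     # piece widths and container length
# Variables: left edges x1, x2. Minimize x1 + x2 (slide leftwards)
# subject to x1 + w1 <= x2 (fixed ordering) and x2 + w2 <= W.
res = linprog(c=[1, 1],
              A_ub=[[1, -1]],  # encodes x1 - x2 <= -w1
              b_ub=[-w1],
              bounds=[(0, None), (0, W - w2)])
print(res.x)                   # [0. 3.]: both pieces packed left
\end{verbatim}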

Several algorithms are derived
directly from the position-based optimization model to solve related
problems arising from manual or automatic layout generation. In
particular, the model yields an algorithm for separating overlapping
polygons using a minimal amount of motion. This separation algorithm
together with a database of human-generated markers can automatically
generate markers that approach human performance.

Additionally,
we provide several extensions to the position-based optimization
model. These extensions enable the model to handle small rotations,
to offer flexible control of the distances between polygons, and to
find optimal solutions to the two-dimensional packing of non-convex
polygons.

This dissertation also includes a compaction algorithm
based on existing physical simulation approaches. Although our
experimental results showed that it is not practical for compacting
tightly packed layouts, this algorithm is of interest because it shows
that the simulation can be sped up significantly if geometric
constraints replace physical constraints. It also reveals the
inherent limitations of physical simulation algorithms in compacting
tightly packed layouts.

Most of the algorithms presented in this
dissertation have been implemented on a SUN ${\rm SparcStation}^{\rm
TM}$ and have been included in a software package licensed to a CAD
company.

This paper demonstrates that scalability and competitive
efficiency can be achieved for unstructured grid finite element
applications on distributed memory machines, such as the Connection Machine
CM-5 system. The efficiency of finite element solvers is analyzed through
two applications: an implicit computational aerodynamics application and an
explicit solid mechanics application. Scalability of mesh decomposition and
data mapping strategies are also discussed. Numerical examples
supporting these claims, for problems with more than fourteen million
variables, are presented.

The statistical query learning model can be viewed as
a tool for creating (or demonstrating the existence of) noise-tolerant
learning algorithms in the PAC model. The complexity of a statistical
query algorithm, in conjunction with the complexity of simulating SQ
algorithms in the PAC model with noise, determines the complexity of
the noise-tolerant PAC algorithms produced. Although roughly optimal
upper bounds have been shown for the complexity of statistical query
learning, the corresponding noise-tolerant PAC algorithms are not
optimal due to inefficient simulations. In this paper we provide both
improved simulations and a new variant of the statistical query model
in order to overcome these inefficiencies.

We improve the time
complexity of the classification noise simulation of statistical query
algorithms. Our new simulation has a roughly optimal dependence on
the noise rate. We also derive a simpler proof that statistical
queries can be simulated in the presence of classification noise.
This proof makes fewer assumptions on the queries themselves and
therefore allows one to simulate more general types of queries.
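
The essence of such a simulation can be stated compactly (a sketch of
the standard noise correction, not necessarily the construction used
in the paper): with noise rate eta, the noisy expectation of a query
chi is linear in the true one, and the label-independent part is
directly estimable, so the true value can be recovered:

\begin{verbatim}
import random

def corrected_sq(chi, noisy_examples, eta):
    """Recover E[chi(x, f(x))] from classification-noise examples:
    E_noisy = (1 - 2*eta) * E[chi(x, f(x))] + eta * S, where
    S = E[chi(x, 0) + chi(x, 1)] does not depend on the labels."""
    m = len(noisy_examples)
    e_noisy = sum(chi(x, y) for x, y in noisy_examples) / m
    s = sum(chi(x, 0) + chi(x, 1) for x, _ in noisy_examples) / m
    return (e_noisy - eta * s) / (1 - 2 * eta)

# Target f(x) = x over x uniform in {0, 1}; flip labels w.p. eta.
eta, data = 0.2, []
for _ in range(100000):
    x = random.choice([0, 1])
    data.append((x, x if random.random() > eta else 1 - x))
print(corrected_sq(lambda x, l: l, data, eta))   # approx 0.5
\end{verbatim}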

We
also define a new variant of the statistical query model based on
relative error, and we show that this variant is more natural and
strictly more powerful than the standard additive error model. We
demonstrate efficient PAC simulations for algorithms in this new model
and give general upper bounds on both learning with relative error
statistical queries and PAC simulation. We show that any statistical
query algorithm can be simulated in the PAC model with malicious
errors in such a way that the resultant PAC algorithm has a roughly
optimal tolerable malicious error rate and sample complexity.

Finally, we generalize the types of queries allowed in the statistical
query model. We discuss the advantages of allowing these generalized
queries and show that our results on improved simulations also hold
for these queries.

Massively parallel processors introduce new demands on software systems
with respect to performance, scalability, robustness and portability. The
increased complexity of the memory systems and the increased range of
problem sizes for which a given piece of software is used pose serious
challenges to software developers. The Connection Machine Scientific
Software Library,
CMSSL, uses several novel techniques to meet these challenges. The CMSSL
contains routines for managing the data distribution and provides data
distribution independent functionality. High performance is achieved through
careful scheduling of arithmetic operations and data motion, and through the
automatic selection of algorithms at run-time. We discuss some of the
techniques used, and provide evidence that CMSSL has reached the goals of
performance and scalability for an important set of applications.

The concept of block-cyclic order elimination can be applied to
out-of-core $LU$ and $QR$ matrix
factorizations on distributed memory architectures
equipped with a parallel I/O system. This elimination scheme provides
load-balanced computation in both the factor and solve phases and further
optimizes the use of the network bandwidth
to perform I/O operations. Stability of the $LU$ factorization is
enforced by full column pivoting.
Performance results are presented for the Connection Machine system CM-5.

ROMM is a class of Randomized, Oblivious, Multi-phase, Minimal routing
algorithms. ROMM routing offers a potential for improved performance
compared to fully randomized algorithms under both light and heavy loads.
ROMM routing also offers close to best-case performance for many common
permutations. These claims are supported by extensive simulations of
binary cube networks for a number of routing patterns. We show that
$k \times n$ buffers per node suffice to make $k$-phase ROMM routing free
from deadlock and livelock on $n$-dimensional binary cubes.
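
A sketch of two-phase ROMM path selection on a binary cube
(illustrative Python only; the encoding is hypothetical): route
minimally to a random intermediate node inside the subcube spanned by
source and destination, then minimally onward, fixing one differing
address bit per hop:

\begin{verbatim}
import random

def romm_path(src, dst):
    """Minimal two-phase path between hypercube nodes (int labels)."""
    diff = [i for i in range((src ^ dst).bit_length())
            if (src ^ dst) >> i & 1]
    # Phase 1 target: a uniformly random node of the minimal subcube.
    phase1 = [i for i in diff if random.random() < 0.5]
    phase2 = [i for i in diff if i not in phase1]
    path, node = [src], src
    for i in sorted(phase1) + sorted(phase2):   # one bit per hop
        node ^= 1 << i
        path.append(node)
    return path

print(romm_path(0b0000, 0b1011))
# e.g. [0, 1, 9, 11]; every path has minimal (Hamming) length
\end{verbatim}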

In the Spacetime Constraints paradigm of animation, the animator
specifies what a character should do, and the details of the motion
are generated automatically by the computer. Ngo and Marks recently
proposed a technique of automatic motion synthesis that uses a
massively parallel genetic algorithm to search a space of motion
controllers that generate physically realistic motions for 2D
articulated figures. In this paper, we describe an empirical
study of evolutionary computation algorithms and standard
function optimization algorithms that were implemented in lieu
of the massively parallel GA in order to find a substantially
more efficient search algorithm that would be viable on serial
workstations. We discovered that simple search algorithms based
on the evolutionary programming paradigm were most efficient in
searching the space of motion controllers.
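
For flavor, a minimal loop in the evolutionary-programming style the
study found effective (all parameters hypothetical): mutation-only
variation with truncation survivor selection, shown on a toy fitness
rather than a motion-controller space:

\begin{verbatim}
import random

def evolve(fitness, dim, pop_size=20, gens=200, sigma=0.1):
    pop = [[random.uniform(-1, 1) for _ in range(dim)]
           for _ in range(pop_size)]
    for _ in range(gens):
        # Each parent yields one Gaussian-mutated child (no crossover).
        children = [[x + random.gauss(0, sigma) for x in p] for p in pop]
        # Keep the best pop_size of parents plus children.
        pop = sorted(pop + children, key=fitness, reverse=True)[:pop_size]
    return pop[0]

# Toy fitness: negative squared distance to the point (0.5, -0.2).
target = (0.5, -0.2)
best = evolve(lambda v: -sum((a - b) ** 2 for a, b in zip(v, target)), 2)
print(best)   # approaches [0.5, -0.2]
\end{verbatim}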

O(N) algorithms for N-body simulations enable the simulation of particle
systems with up to 100 million particles on current Massively Parallel
Processors (MPPs). Our optimization techniques mainly focus on minimizing the
data movement through careful management of the data distribution and the data
references, both between the memories of different nodes, and within the
memory hierarchy of each node. We show how the techniques can be expressed
in languages with an array syntax, such as Connection Machine Fortran (CMF).
All CMF constructs used, with one exception, are included in High Performance
Fortran.

The effectiveness of our techniques is demonstrated on an implementation
of Anderson's hierarchical O(N) N-body method for the Connection Machine
system CM-5/5E. Communication accounts for about 10-20% of the total
execution time, with the average efficiency for arithmetic operations
being about 40% and the total efficiency (including communication)
being about 35%. For the CM-5E, a performance in excess of 60 Mflop/s
per node (peak 160 Mflop/s per node) has been measured.

An agent's ability to understand an utterance depends upon its ability
to relate that utterance to the preceding discourse. The agent must
determine whether the utterance begins a new segment of the discourse,
completes the current segment, or contributes to it. The intentional
structure of the discourse, comprised of discourse segment purposes
and their interrelationships, plays a central role in this process
(Grosz and Sidner, 1986). In this thesis, we provide a computational
model for recognizing intentional structure and utilizing it in
discourse processing. The model specifies how an agent's beliefs
about the intentions underlying a discourse affect and are affected
by its subsequent discourse. We characterize this process for both
interpretation and generation and then provide specific algorithms for
modeling the interpretation process.

The collaborative planning framework of SharedPlans (Lochbaum, Grosz,
and Sidner, 1990; Grosz and Kraus, 1993) provides the basis for our
model of intentional structure. Under this model, agents are taken to
engage in discourses and segments of discourses for reasons that
derive from the mental state requirements of action and collaboration.
Each utterance of a discourse is understood in terms of its
contribution to the SharedPlans in which the discourse participants
are engaged. We demonstrate that this model satisfies the
requirements of Grosz and Sidner's (1986) theory of discourse
structure and also simplifies and extends previous plan-based
approaches to dialogue understanding. The model has been implemented
in a system that demonstrates the contextual role of intentional
structure in both interpretation and generation.

The O(N) hierarchical N-body algorithms and Massively Parallel
Processors allow particle systems of 100 million particles or more to
be simulated in acceptable time. We describe a data parallel
implementation of Anderson's method and demonstrate both efficiency
and scalability of the implementation on the Connection Machine CM-5/5E
systems. The communication time for large particle systems amounts to
about 10-25% of the total execution time, and the overall efficiency
is about 35%. On a CM-5E the
overall performance is about 60 Mflop/s per node, independent of the
number of nodes.

The Connection Machine Scientific Software Library (CMSSL) is a library of
scientific routines designed for distributed memory architectures. The BLAS
of the CMSSL have been implemented as a two-level structure to exploit
optimizations local to nodes and across nodes. This paper presents the
implementation considerations and performance of the Local BLAS, or BLAS
local to each node of the system. A wide variety of loop structures and
unrollings have been implemented in order to achieve a uniform and high
performance, irrespective of the data layout in node memory. The CMSSL is
the only existing high-performance library capable of supporting both the
data parallel and message passing modes of programming a distributed memory
computer. The implications of implementing BLAS on distributed memory
computers are considered in this light.

We present a simple method for developing parallel and systolic
programs from data dependence. We derive sequences of parallel
computations and communications based on data dependence and
communication delays, and minimize the communication delays and
processor idle time. The potential applications for this method
include supercompiling, automatic development of parallel programs,
and systolic array design.

Christopher Small and Margo Seltzer. 1994. ``VINO: An Integrated Platform for Operating System and Database Research.''

In 1981, Stonebraker wrote:

Operating system services in many existing systems are either too
slow or inappropriate. Current DBMSs usually provide their own and
make little or no use of those offered by the operating system.

The standard operating system model has changed little since that time,
and we believe that, at its core, it is the {\em wrong} model for DBMS and
other resource-intensive applications. The standard model is inflexible,
uncooperative, and irregular in its treatment of resources.

We describe the design of a new system, the VINO kernel, which addresses the
limitations of standard operating systems. It focuses on three key ideas:

David Mazieres and Michael D. Smith. 1994. ``Abstract Execution in a Multi-Tasking Environment.''

Tracing software execution is an important part of understanding
system performance. Raw CPU power has been increasing at a rate far
greater than memory and I/O bandwidth, with the result that the
performance of client/server and I/O-bound applications is not scaling
as one might hope. Unfortunately, the behavior of these types of
applications is particularly sensitive to the kinds of distortion
induced by traditional tracing methods, so that current traces are
either incomplete or of questionable accuracy. Abstract execution is
a powerful tracing technique which was invented to speed the tracing
of single processes and to store trace data more compactly. In this
work, abstract execution was extended to trace multi-tasking
workloads. The resulting system is more than 5 times faster than
other current methods of gathering multi-tasking traces, and can
therefore generate traces with far less time distortion.

Igor Najfeld and Timothy F. Havel. 1994. ``Derivatives of the Matrix Exponential and their Computation.''

Matrix exponentials and their derivatives play an important role
in the perturbation analysis, control and parameter estimation of
linear dynamical systems. The well-known integral representation
of the derivative of the matrix exponential $\exp(t{\bf A})$ in the
matrix direction ${\bf V}$,
$\int_0^t \exp((t-\tau){\bf A})\,{\bf V}\exp(\tau{\bf A})\,{\rm d}\tau$,
enables us to derive a number of new properties of this derivative,
along with spectral, series and exact representations. Many of these
results extend to arbitrary analytic functions of a matrix argument,
for which we have also derived a simple relation between the gradients
of their entries and the directional derivatives in the elementary
directions. Based on these results, we construct and optimize two
new algorithms for computing the directional derivative. We have
also developed a new algorithm for computing the matrix exponential,
based on a rational representation of the exponential in terms of
the hyperbolic function ${\bf A}\coth({\bf A})$, which is more
efficient than direct Pad\'e approximation. Finally, these results are
illustrated by an application to a biologically important parameter
estimation problem which arises in nuclear magnetic resonance
spectroscopy.
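
The integral representation above can be cross-checked numerically via
the standard block-matrix identity for the derivative (our
illustration, not the paper's algorithm): the directional derivative
appears as the top-right block of the exponential of an augmented
matrix.

\begin{verbatim}
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)
n, t = 4, 0.7
A = rng.standard_normal((n, n))
V = rng.standard_normal((n, n))

# int_0^t exp((t-tau)A) V exp(tau A) dtau = top-right block of
# expm(t * [[A, V], [0, A]]).
block = np.block([[A, V], [np.zeros((n, n)), A]])
deriv = expm(t * block)[:n, n:]

h = 1e-6                               # central finite difference
fd = (expm(t * (A + h * V)) - expm(t * (A - h * V))) / (2 * h)
print(np.max(np.abs(deriv - fd)))      # tiny: the two computations agree
\end{verbatim}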

Most contemporary implementations of the Berkeley Fast File System
optimize file system throughput by allocating logically sequential
data to physically contiguous disk blocks. This clustering is effective
when there are many contiguous free blocks on the file system. But the
repeated creation and deletion of files of varying sizes that occurs
over time on active file systems is likely to cause fragmentation of
free space, limiting the ability of the file system to allocate data
contiguously and therefore degrading performance.

This paper presents empirical data and an analysis of allocation and
fragmentation in the SunOS 4.1.3 file system (a derivative of the
Berkeley Fast File System). We have collected data from forty-eight
file systems on four file servers over a period of ten months. Our
data show that small files are more fragmented than large files, with
fewer than 35% of the blocks in two-block files being allocated
optimally, but more than 80% of the blocks in files larger than 256
kilobytes being allocated optimally. Two factors are responsible for
this difference in fragmentation: an uneven distribution of free space
within file system cylinder groups, and a disk allocation algorithm
that frequently allocates the last block of a file discontiguously
from the rest of the file.

Performance measurements on replicas of active file systems show that
they seldom perform as well as comparable empty file systems but that
this performance degradation is rarely more than 10-15%. This decline
in performance is directly correlated to the amount of fragmentation
in the files used by the benchmark programs. Both file system
utilization and the amount of fragmentation in existing files on the
file system influence the amount of fragmentation in newly created
files. Characteristics of the file system workload also have a
significant impact on file system fragmentation and performance, with
typical news server workloads causing extreme fragmentation.

A necessary condition for the establishment, on a substantial basis,
of a parallel software industry would appear to be the availability
of technology for generating transportable software, i.e.,
architecture-independent software that delivers scalable performance
for a wide variety of applications on a wide range of multiprocessor
computers. This paper describes H-BSP -- a general purpose parallel
computing environment for developing transportable algorithms. H-BSP
is based on the Bulk Synchronous Parallel Model (BSP), in which a
computation involves a number of supersteps, each having several
parallel computational threads that synchronize at the end of the
superstep. The BSP Model deals explicitly with the notion of
communication among computational threads and introduces parameters
$g$ and $L$ that quantify the ratio of communication throughput to
computation throughput, and the synchronization period, respectively.
These two parameters, together with the number of processors and the
problem size, are used to quantify the performance and, therefore, the
transportability of given classes of algorithms across machines having
different values for these parameters. This paper describes
the role of unbundled compiler technology in facilitating the
development of such a parallel computer environment.
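
Under the standard BSP cost accounting (a hedged illustration; the
exact accounting in H-BSP may differ), a superstep performing at most
$w$ local operations per thread and realizing an $h$-relation costs
$w + g \cdot h + L$, and a program is scored by summing its
supersteps:

\begin{verbatim}
def bsp_cost(supersteps, g, L):
    """supersteps: list of (w, h) pairs; g, L: machine parameters."""
    return sum(w + g * h + L for w, h in supersteps)

# Two supersteps on a machine with g = 4 and L = 100 (units of
# basic operations; values hypothetical).
print(bsp_cost([(1000, 50), (500, 10)], g=4, L=100))   # 1940
\end{verbatim}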