Harvard Computer Science Technical Reports for 1991

We report on the status of code development for a simulation of
Quantum Chromodynamics (QCD) with dynamical Wilson fermions on the
Connection Machine model CM-2. Our original code, written in \*Lisp,
gave performance in the near-GFLOPS range. We have rewritten the
most time-consuming parts of the code in the low-level programming
system CMIS, including the matrix multiply and the communication.
Current versions of the code run at approximately 3.6 GFLOPS for the
fermion matrix inversion, and we expect the next version to reach or
exceed 5 GFLOPS.

In a {\em generalized shuffle permutation} an address
$(a_{q-1} a_{q-2} \ldots a_0)$ receives its content from an
address obtained through a cyclic shift on a subset of the $q$
dimensions used for the encoding of the addresses. Bit-
complementation may be combined with the shift. We give an
algorithm that requires $\frac{K}{2} + 2$ exchanges for
$K$ elements per processor, when storage dimensions are part
of the permutation, and concurrent communication on all ports of
every processor is possible. The number of element exchanges in
sequence is independent of the number of processor dimensions
$\sigma_{r}$ in the permutation. With no storage dimensions in
the permutation our best algorithm requires $(\sigma_{r} + 1)
\lceil \frac{K}{2 \sigma_{r}} \rceil$ element exchanges. We
also give an algorithm for $\sigma_{r} = 2$, or when the real
shuffle consists of a number of cycles of length two, that
requires $\frac{K}{2} + 1$ element exchanges in sequence
when there is no bit complement. The lower bound is $\frac{K}
{2}$ for both real and mixed shuffles with no bit
complementation. The minimum number of communication start-ups is
$\sigma_{r}$ for both cases, which is also the lower bound.
The data transfer time for communication restricted to one port
per processor is $\sigma_{r} \frac{K}{2}$, and the
minimum number of start-ups is $\sigma_{r}$. The analysis
is verified by experimental results on the Intel iPSC/1, and for
one case also on the Connection Machine.
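The address arithmetic behind a generalized shuffle can be made concrete with a small Python sketch (the helper name is mine, and the report's exchange algorithms themselves are not reproduced here): a chosen subset of an address's bit positions is cyclically shifted, optionally followed by bit complementation.

```python
def shuffle_source(addr, dims, complement=0):
    """Illustrative generalized shuffle on an address.

    `dims` lists the bit positions (dimensions) taking part in the
    permutation, in cycle order: the bit at dims[i] moves to
    dims[(i + 1) % len(dims)]; bits outside `dims` are unchanged.
    `complement` is a mask of bits to flip after the shift.
    """
    bits = [(addr >> d) & 1 for d in dims]
    shifted = bits[-1:] + bits[:-1]          # cyclic shift by one position
    out = addr
    for d, b in zip(dims, shifted):
        out = (out & ~(1 << d)) | (b << d)   # write the permuted bit back
    return out ^ complement                  # optional bit-complementation
```

With `dims` covering all $q$ positions this is the ordinary shuffle; with a two-element subset it degenerates to a swap of two dimensions, i.e., a cycle of length two.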

We examine the computational complexity of cartographic label
placement, a problem derived from the cartographer's task of
placing text labels adjacent to map features in such a way as to
minimize overlaps with other labels and map features.
Cartographic label placement is one of the most time-consuming
tasks in the production of maps. Consequently, several attempts
have been made to automate the label-placement task for some or
all classes of cartographic features (punctual, linear, or areal
features), but all previously published algorithms for the most
basic task---point-feature-label placement---either exhibit worst-
case exponential time complexity, or incorporate incomplete
heuristics that may fail to find an admissible labeling even when
one exists. The computational complexity of label placement is
therefore a matter of practical significance in automated
cartography. We show that admissible label placement is NP-
complete, even for very simple versions of the problem. Thus, no
polynomial time algorithm exists unless $P=NP$. Similarly, we show
that optimal label placement can be solved in polynomial time if
and only if $P=NP$, and this result holds even if we require only
approximately optimal placements. The results are especially
interesting because cartographic label placement is one of the few
combinatorial problems that remains NP-hard even under a geometric
(Euclidean) interpretation. The results are of broader practical
significance, as they also apply to point-feature labeling in non-
cartographic displays, e.g., the labeling of points in a scatter
plot.

The design of geometric models can be a painstaking and time-
consuming task. Typical CAD packages are often primitive or
deficient in the means they offer for placing geometry in a design
or for subsequently modifying the geometry. Modification in
particular can tax a user's patience when it requires many
deletion, creation, or perturbation operations to effect a
conceptually simple change in the design. One area of research
that attempts to address these deficiencies involves the use of
constraints on the geometry as a means of both specifying and
controlling its shape. The form in which the constraint
information is demanded from the user determines the ease of use
of any geometric system that is based on constraints, and is one
of the basic problems to be addressed in the design of such a
system. I present a constraint-based geometric editor that allows
the manipulation of both constraints and geometry using the
direct-manipulation paradigm, which is well established as being of
central importance in many easy-to-use systems. When using the
editor, constraints are presented to the user graphically, in the
context of the geometric design, and may be created, destroyed,
and manipulated interactively along with the geometry. The
constraints may either be created explicitly, or implicitly as a
side effect of creating geometry. In addition, constraints are
used in a novel way to facilitate interactive creation and
positioning of geometry in three-space, despite the limitations of
commonly-available two-dimensional display and input devices.
Lastly, whenever geometry is modified using direct manipulation, a
solver is called which updates the geometry in accordance with the
existing constraints. All of these features contribute to the
ease of use of my system. I also present a solver that addresses
another basic problem inherent in constraint-based systems;
namely, the need to efficiently obtain a solution that
instantiates the geometry so as to satisfy the constraints. I
provide a robust and efficient solver that is $O(n^2)$ in the size
of the geometric problem being solved and present the mathematics
that support it. In addition, I present a new algorithm that
partitions the free-form geometry and constraint network into a
number of pieces that may be independently solved. For many
networks, this algorithm yields a solution to the entire network
in close to linear time.

TR-07-91

Alfredo De Santis and Giuseppe Persiano. 1991. ``Tight Upper and Lower Bounds on the Path Length of Binary Trees.''

The {\em external path length} of a tree $T$ is the sum of the
lengths of the paths from the root to each external node. The
{\em maximal path length difference,} $\triangle$, is the
difference between the length of the longest and shortest such
path. We prove tight lower and upper bounds on the external path
length of binary trees with $N$ external nodes and prescribed
maximal path length difference $\triangle$. In particular, we give
an upper bound that, for each value of $\triangle$, can be exactly
achieved for infinitely many values of $N$. This improves on the
previously known upper bound that could only be achieved up to a
factor proportional to $N$. We then use the upper bound to give a
simple upper bound on the path length of Red-Black trees which is
asymptotically tight. We also present, as a preliminary result,
an elementary proof of the known upper bound. We finally prove a
lower bound which can be exactly achieved for each value of $N$
and $\triangle \leq N/2$.
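For concreteness, both quantities can be computed directly from a tree. In the sketch below (mine, not the report's), an internal node is a `(left, right)` pair and `None` marks an external node.

```python
def external_depths(tree, depth=0):
    """Depths of all external nodes of a binary tree.

    An internal node is a (left, right) pair; an external node is None.
    """
    if tree is None:
        return [depth]
    left, right = tree
    return external_depths(left, depth + 1) + external_depths(right, depth + 1)

def path_stats(tree):
    """External path length and maximal path length difference (triangle)."""
    depths = external_depths(tree)
    return sum(depths), max(depths) - min(depths)
```

A complete tree with four external nodes has external path length 8 and difference 0; making the tree more skewed raises the difference, which is what the report's bounds parameterize.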

We develop a general framework for deriving abstract domains from
concrete semantic domains in the context of first-order recursive
schemes and prove several theorems which ensure the correctness
(safety) of abstract computations. The abstract domains, which we
call {\em Weak Hoare powerdomains}, subsume the roles of both
the abstract domains and the collecting interpretations in the
abstract interpretation literature.

In this paper we consider abstract interpretation of PCF programs.
The main development is the extension of {\em weak powerdomains}
to higher types. In the classical abstract interpretation
approach, abstract domains are constructed explicitly and the
abstract semantics is then related to the concrete semantics. In
the approach introduced here, abstract domains are {\em derived}
directly from concrete domains. The conditions for deriving the
domains are intended to be as general as possible while still
guaranteeing that the derived domain has sufficient structure so
that it can be used as a basis for computing correct information
about the concrete semantics. We prove three main theorems, the
last of which ensures the correctness of abstract interpretation
of PCF programs given safe interpretations of the constants. This
generalizes earlier results obtained for the special case of
strictness analysis.

We provide performance models for several primitive operations on
data structures distributed over memory units interconnected by a
Boolean cube network. In particular, we model single-source and
multiple-source concurrent broadcasting or reduction, concurrent
gather and scatter operations, shifts along several axes of multi-
dimensional arrays, and emulation of butterfly networks. We also
show how the processor configuration, data aggregation, and the
encoding of the address space affect the performance for two
important basic computations: the multiplication of arbitrarily
shaped matrices, and the Fast Fourier Transform. We also give an
example of the performance behavior for local matrix operations
for a processor with a single path to local memory, and a set of
registers. The analytic models are verified by measurements on
the Connection Machine model CM-2.

Free choice among join orderings is one of the most powerful
optimizations in a conventional optimizer. But the freedom is
limited to Select/Project/Join queries. In this paper, we extend
this freedom to queries that include outerjoins. Unlike previous
work, these results are not limited to queries possessing a ``nice
structure,'' or queries that are nicely represented in relational
calculus. Our theoretical results concern query
``simplification'' and reassociation using a generalized
outerjoin. We show how the necessary computation can be added
rather easily to the join-order generation of a conventional query
optimizer.

A model of plan recognition in discourse must be based on intended
recognition, distinguish each agent's beliefs and intentions from
the other's, and avoid assumptions about the correctness or
completeness of the agents' beliefs. In this paper, we present an
algorithm for plan recognition that is based on the SharedPlan
model of collaboration and that satisfies these constraints.

Large, robust massively parallel programs that are understandable
(and therefore maintainable) are not yet a reality. Such programs
require a programming methodology that minimizes the conceptual
differences between the program and the domain addressed by the
program, encourages reusability, and still produces robust
programs that are readily maintained and reasoned about. This
paper proposes the parallel object-oriented model. The model is
constructed from an object-oriented methodology augmented by
constructs and semantics for parallel processing, and satisfies
the requirements for building large parallel applications. It
presents a unique way of representing object references and of
managing concurrent access to objects. The methodology may be
extended for a wide range of computing platforms and application
areas.

We show a general positive result that allows us to boost the
expressiveness of projection-closed classes of Boolean functions
that are identifiable with membership and equivalence queries.
Such classes include monotone DNF formulas, read-once formulas,
conjunctions of Horn clauses, switch configurations, $\mu$-formula
decision trees, and read-twice DNF formulas. We show that when
representations from such classes (rather than single literals)
are tested at the leaves of a read-once formula, or on the
internal nodes of a $\mu$-decision tree, the resulting
representation class is still identifiable with membership and
equivalence queries. The additional overhead in time and queries
is polynomial.

The computational task of protein-structure prediction is believed
to require exponential time, but previous arguments as to its
intractability have taken into account only the size of a
protein's conformational space. Such arguments do not rule out
the possible existence of an algorithm, more selective than
exhaustive search, that is efficient and exact. (An {\em
efficient} algorithm is one that is guaranteed, for all possible
inputs, to run in time bounded by a function polynomial in the
problem size. An {\em intractable} problem is one for which no
efficient algorithm exists.) Questions regarding the possible
intractability of problems are often best answered using the
theory of NP-completeness. In this treatment we show the NP-
hardness of two typical mathematical statements of empirical
potential energy function minimization for macromolecules. Unless
all NP-complete problems can be solved efficiently, these results
imply that a function-minimization algorithm can be efficient for
protein-structure prediction only if it exploits protein-specific
properties that prohibit the simple geometric constructions that
we use in our proofs. Analysis of further mathematical statements
of molecular structure prediction could constitute a systematic
methodology for identifying sources of complexity in protein
folding, and for guiding development of predictive algorithms.

All-to-all personalized communication is a class of permutations
in which each processor sends a unique message to every other
processor. We present optimal algorithms for concurrent
communication on all channels in Boolean cube networks, both for
the case with a single permutation, and the case where multiple
permutations shall be performed on the same local data set, but on
different sets of processors. For $K$ elements per
processor our algorithms give the optimal number of element
transfers, $K/2$. For a succession of all-to-all personalized
communications on disjoint subcubes of $\eta$ dimensions
each, our best algorithm yields $\frac{K}{2} + \sigma - \eta$
element exchanges in sequence, where $\sigma$ is the
total number of processor dimensions in the permutation. An
implementation on the Connection Machine of one of the algorithms
offers a maximum speed-up of 50% compared to the previously best
known algorithm.

For matrix multiplication on hypercube multiprocessors with the
product matrix accumulated in place a processor must receive about
$P^{2}/\sqrt{N}$ elements of each input operand, with
operands of size $P \times P$ distributed evenly over $N$
processors. With concurrent communication on all ports, the
number of element transfers in sequence can be reduced to
$P^{2}/(\sqrt{N} \log N)$ for each input operand. We present a
two-level partitioning of the matrices and an algorithm for the
matrix multiplication with optimal data motion and constant
storage. The algorithm has sequential arithmetic complexity
$2P^{3}$, and parallel arithmetic complexity $2P^{3}/N$.
The algorithm has been implemented on the Connection Machine
model CM-2. For the performance on the 8K CM-2, we measured about
1.6 Gflops, which would scale up to about 13 Gflops for a 64K full
machine.

We present a new algorithm for conversion between binary code and
binary-reflected Gray code that requires approximately $\frac
{2K}{3}$ element transfers in sequence for $K$ elements
per node, compared to $K$ element transfers for previously
known algorithms. For a Boolean cube of $n = 2$ dimensions
the new algorithm degenerates to yield a complexity of $\frac
{K}{2} + 1$ element transfers, which is optimal. The new
algorithm is optimal within a factor of $\frac{1}{3}$ for
any routing strategy. We show that the minimum number of element
transfers for minimum path length routing is $K$ with
concurrent communication on all channels of every node of a
Boolean cube.
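The binary-reflected Gray code itself is defined by a simple bit identity. The sketch below shows the local, single-word conversion only; the report's contribution concerns the inter-node data remapping, whose cost is measured in element transfers, and is not reproduced here.

```python
def binary_to_gray(b):
    """Binary-reflected Gray code of b: consecutive codes differ in one bit."""
    return b ^ (b >> 1)

def gray_to_binary(g):
    """Inverse conversion: prefix-XOR over the bits of g."""
    b = 0
    while g:
        b ^= g
        g >>= 1
    return b
```

On a Boolean cube, relabeling node addresses with `binary_to_gray` places consecutive array indices on neighboring nodes, which is why the binary-to-Gray remapping arises in the first place.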

An all-to-all broadcast routing algorithm that allows concurrent
communication on all channels of the Connection Machine Boolean
cube network is described. Explicit routing formulas are given for
both the physical broadcast between processors, and the virtual
broadcast within processors. Implementation issues are addressed
and timings for the physical and virtual broadcast are given for
the Connection Machine system CM-2. The peak data transfer rate
for the physical broadcast on a 64K CM-2 is 4.1 Gbytes/sec, and
the peak rate for the virtual broadcast is about 20 Gbytes/sec.
Reshaping of arrays is shown experimentally to reduce the
broadcast time by a factor of up to 7 by reducing the amount of
local data motion. Finally, we also show how to exploit symmetry
for computation of an interaction matrix using the all-to-all
broadcast function. Further optimizations are suggested for
$N$-body type calculations. Using the all-to-all broadcast function,
a peak rate of 5 Gflops/s has been achieved for the $N$-body
computations in 32-bit precision on a 64K CM-2.
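Exploiting the symmetry of the interaction matrix amounts to evaluating each pairwise interaction once and applying it with opposite signs to both particles. The serial 1D sketch below is purely illustrative (my names and scoring, not the CM-2 code), but it shows the halving of work that the symmetry buys.

```python
def accelerations(pos, mass, g=1.0, eps=1e-9):
    """Pairwise gravitational accelerations in 1D.

    Each i < j interaction is computed once and applied with
    opposite signs (f_ji = -f_ij), halving the pairwise work.
    `eps` softens the denominator to avoid division by zero.
    """
    n = len(pos)
    acc = [0.0] * n
    for i in range(n):
        for j in range(i + 1, n):
            r = pos[j] - pos[i]
            f = g * r / (abs(r) ** 3 + eps)  # interaction per unit mass product
            acc[i] += mass[j] * f            # pull of j on i
            acc[j] -= mass[i] * f            # equal and opposite
    return acc
```

Because every interaction is applied antisymmetrically, total momentum is conserved exactly, a useful sanity check on any such implementation.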

Network diagrams are a familiar graphic form that can express many
different kinds of information. The problem of automating
network-diagram layout has therefore received much attention. Previous
research on network-diagram layout has focused on the problem of
aesthetically optimal layout, using such criteria as the number of
link crossings, the sum of all link lengths, and total diagram
area. In this paper we propose a restatement of the network-
diagram-layout problem in which layout-aesthetic concerns are
subordinated to perceptual-organization concerns. We describe a
notation for describing the visual organization of a network
diagram. This notation is used in reformulating the layout task
as a constrained-optimization problem in which constraints are
derived from a visual-organization specification and optimality
criteria are derived from layout-aesthetic considerations. Two
new heuristic algorithms are presented for this version of the
layout problem: one algorithm uses a rule-based strategy for
computing a layout; the other is a massively parallel genetic
algorithm. We demonstrate the capabilities of the two algorithms
by testing them on a variety of network-diagram-layout problems.

S. Lennart Johnsson. 1991. ``Minimizing the Communication Time for Matrix Multiplication on
Multi-Processors.''

We present a few algorithms that allow concurrency in
communication on multiple channels of multi-processors to be
exploited for the multiplication of matrices of arbitrary shapes.
For multi-processors configured as $n$-dimensional Boolean
cubes our algorithms offer a speedup of the communication over
previous algorithms for square matrices and square cubes by a
factor of $\frac{n}{2}$. We show that configuring $N$
processors as a three-dimensional array may reduce the
communication complexity by a factor of $\sqrt[6]{N}$
compared to the two-dimensional partitioning. The best two-
dimensional configuration of the multi-processor nodes has a ratio
between the number of rows and columns equal to the ratio between
the number of rows and columns of the product matrix. The optimum
three-dimensional configuration has a ratio between the lengths of
the machine axes equal to the ratio between the lengths of the
three axes in matrix multiplication. For product matrices of
extreme shape a one-dimensional partitioning may be optimum. All
presented algorithms use standard communication functions.

We describe an implementation of the Cooley-Tukey complex-to-
complex FFT on the Connection Machine. The implementation is
designed to make effective use of the communications bandwidth of
the architecture, its memory bandwidth, and storage with
precomputed twiddle factors. The peak data motion rate that is
achieved for the interprocessor communication stages is in excess
of 7 Gbytes/s for a Connection Machine system CM-200 with 2048
floating-point processors. The peak rate of FFT computations local
to a processor is 12.9 Gflops/s in 32-bit precision, and 10.7
Gflops/s in 64-bit precision. The same FFT routine is used to
perform both one- and multi-dimensional FFT without any explicit
data rearrangement. The peak performance for one-dimensional FFT
on data distributed over all processors is 5.4 Gflops/s in 32-bit
precision and 3.2 Gflops/s in 64-bit precision. The peak
performance for square, two-dimensional transforms, is 3.1
Gflops/s in 32-bit precision, and for cubic, three dimensional
transforms, the peak is 2.0 Gflops/s in 64-bit precision. Certain
oblong shapes yield better performance. The number of twiddle
factors stored in each processor is $\frac{P}{2N} + \log_{2}
N$ for an FFT on $P$ complex points uniformly distributed
among $N$ processors. To achieve this level of storage
efficiency we show that a decimation-in-time FFT is required for
normal order input, and a decimation-in-frequency FFT is required
for bit-reversed input order.

We investigate, within the PAC learning model, the problem of
learning nonoverlapping perceptron networks. These are loop-free
neural nets in which each node has only one outgoing weight. We
give a polynomial time algorithm that PAC learns any
nonoverlapping perceptron network using examples and membership
queries. The algorithm is able to identify both the architecture
and the weight values necessary to represent the function to be
learned. Our results shed some light on the effect of the overlap
on the complexity of learning in neural networks.

TR-27-91

Jon Christensen, Joe Marks, and Stuart Shieber. 1991. ``Labeling Point Features on Maps and Diagrams Using Simulated
Annealing.''

A major factor affecting the clarity of graphical displays is the
degree to which text labels obscure display features (including
other labels) as a result of spatial overlap. Point-feature label
placement (PFLP) is the problem of placing text labels adjacent to
point features on a map or diagram in order to maximize
legibility. This problem arises for all kinds of informational
graphics, though it is most often associated with automated
cartography. In this paper we present a comprehensive treatment
of the PFLP problem. First, we summarize some recent results
regarding the computational complexity of PFLP. These results show
that optimal PFLP is NP-hard. Second, we survey previously
reported algorithms for PFLP. Third, we describe a stochastic-
optimization method for PFLP, based on simulated annealing.
Finally, we present the results of an empirical comparison of the
known algorithms for PFLP. Our results indicate that the
simulated-annealing approach to PFLP is superior to all existing
methods, regardless of label density.
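A minimal sketch of a simulated-annealing loop for point-feature label placement (illustrative objective and cooling schedule of my own choosing, not the paper's exact method): each point has a small set of candidate label rectangles, the cost counts overlapping label pairs, and a random reassignment is accepted if it helps or, with a temperature-dependent probability, even if it hurts.

```python
import math
import random

def anneal_labels(rects_per_point, steps=20000, t0=1.0, cooling=0.9995, seed=0):
    """rects_per_point[i] lists candidate label rectangles (x0, y0, x1, y1)
    for point i; returns (chosen candidate index per point, final cost)."""
    rng = random.Random(seed)

    def overlaps(a, b):
        return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

    def conflicts(choice):
        rects = [cands[c] for cands, c in zip(rects_per_point, choice)]
        return sum(overlaps(rects[i], rects[j])
                   for i in range(len(rects)) for j in range(i + 1, len(rects)))

    choice = [0] * len(rects_per_point)
    cost, t = conflicts(choice), t0
    for _ in range(steps):
        i = rng.randrange(len(choice))
        old = choice[i]
        choice[i] = rng.randrange(len(rects_per_point[i]))  # random move
        new_cost = conflicts(choice)
        if new_cost <= cost or rng.random() < math.exp((cost - new_cost) / t):
            cost = new_cost          # accept, possibly an uphill move
        else:
            choice[i] = old          # reject and restore
        t *= cooling                 # geometric cooling schedule
    return choice, cost
```

Recomputing the full conflict count per move keeps the sketch short; a practical implementation would update the cost incrementally for only the moved label.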

We study the cost of implementing {\em linearizable} read/write
objects for shared-memory multiprocessors and under various
assumptions on the available timing information. We take as cost
measure the {\em worst-case response time} of performing an
operation in distributed implementations of virtual shared memory
consisting of such objects and supporting linearizability. It is
assumed that processes have clocks that run at the same rate as
real time and all messages incur a delay in the range $[d-u,d]$
for some known constants $u$ and $d$, $0 \leq u \leq d$. In the
{\em perfect clocks} model, where processes have perfectly
synchronized clocks and every message incurs a delay of exactly
$d$, we present a family of optimal linearizable implementations,
parameterized by a constant $\eta$, $0 \leq \eta \leq 1$, for
which the worst-case response times for read and write operations
are $\eta d$ and $(1-\eta)d$, respectively. The parameter
$\eta$ may be appropriately chosen to account for the relative
frequencies of read and write operations. Our main result is the
first known linearizable implementation for the {\em imperfect
clocks} model where clocks are not initially synchronized and
message delays can vary, i.e., $u > 0$; it achieves worst-case
response times of less than $4u+b$ ($b>0$ is an arbitrarily small
constant) and $d+3u$ for read and write operations, respectively.
This implementation uses novel synchronization techniques to
exploit the lower bound on message delay time and achieve bounds
on worst-case response times that depend on the message delay
uncertainty $u$. For a wide range of values of $u$, these bounds
improve previously known ones for implementations that support
consistency conditions even weaker than linearizability.