Hai (Paul) Liu

I'm interested in functional programming and programming language research in general. I did my PhD with Professor Paul Hudak at Yale University. I was part of the Yale Haskell Group. I'm now a research scientist at Intel Labs. I can be reached at hai dot liu at aya dot yale dot edu.

PC member of the Haskell Symposium 2015.
PC member of 26th Symposium on Implementation and Application of Functional Languages (IFL'14).
PC member of The 3rd ACM SIGPLAN Workshop on Functional High-Performance Computing (FHPC'14).
PC member of the Haskell Implementors Workshop 2014 (HIW'14).

Causal commutative arrows (CCA) extend arrows with additional
constructs and laws that make them suitable for modelling domains such
as functional reactive programming, differential equations and
synchronous dataflow.
Earlier work has revealed that a syntactic transformation of CCA
computations into normal form can result in significant performance
improvements, sometimes increasing the speed of programs by orders of
magnitude.
In this work we reformulate the normalization as a type class instance
and derive optimized observation functions via a specialization to
stream transformers to demonstrate that the same dramatic improvements
can be achieved without leaving the language.
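The idea can be sketched in a few lines of Haskell (a minimal illustration of the normal form, not the paper's full type-class machinery): a normalized CCA computation is just an initial state together with a pure step function, and the observation function specialized to streams is a tight loop.

```haskell
-- Observe a normalized CCA computation (initial state i, pure step
-- function f) as a stream transformer over lists.
runCCNF :: s -> ((a, s) -> (b, s)) -> [a] -> [b]
runCCNF i f = go i
  where
    go _ []       = []
    go s (x : xs) = let (y, s') = f (x, s) in y : go s' xs

-- Example: a unit-delayed running sum expressed in normal form.
runningSum :: [Int] -> [Int]
runningSum = runCCNF 0 (\(x, s) -> (s, s + x))
```

Because the step function is pure and first-order, the compiler can unbox the state and fuse the loop, which is the source of the reported speedups.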

Deep neural networks (DNNs) have undergone a surge in popularity with
consistent advances in the state of the art for tasks including image
recognition, natural language processing, and speech recognition. The
computationally expensive nature of these networks has led to the proliferation
of implementations that sacrifice abstraction for high performance. In this
paper, we present Latte, a domain-specific language for DNNs that provides a
natural abstraction for specifying new layers without sacrificing performance.
Users of Latte express DNNs as ensembles of neurons with connections between
them. The Latte compiler synthesizes a program based on the user specification,
applies a suite of domain-specific and general optimizations, and emits
efficient machine code for heterogeneous architectures. Latte also includes a
communication runtime for distributed memory data-parallelism. Using networks
described in Latte, we demonstrate 3-6x speedup over Caffe (C++/MKL) on the
three state-of-the-art ImageNet models executing on an Intel Xeon E5-2699 v3
x86 CPU.

In light of recent hardware advances, general-purpose computing on
graphics processing units (GPGPU) is becoming increasingly
commonplace, and needs novel programming models due to GPUs'
radically different architecture. For the most part, existing
approaches to programming GPUs within a high-level programming
language choose to embed a domain-specific language (DSL) within a
host metalanguage and then implement a compiler that maps programs
written within that DSL to code in low-level languages such as
OpenCL or CUDA. An alternative, underexplored, approach is to
compile a restricted subset of the host language itself directly
down to OpenCL/CUDA. We believe more research should be done to
compare these two approaches and their relative merits. As a step
in this direction, we implemented a quick proof of concept of the
alternative approach. Specifically, we extend the Repa library
with a computeG function to offload a computation to the
GPU. As long as the requested computation meets certain
restrictions, we compile it to OpenCL 2.0 using the recently added
feature for shared virtual memory. We can successfully run nine
benchmarks on an Intel integrated GPU. We obtain the expected
performance from the GPU on six of those benchmarks, and are close
to the expected performance on two more. In this paper, we
describe an offload primitive for Haskell, how to extend Repa to
use it, how to implement that primitive in the Intel Labs Haskell
Research Compiler, and evaluate the approach on nine benchmarks,
comparing to two different CPUs, and for one benchmark to
hand-written OpenCL code.

Papers on functional language implementations frequently set the
goal of achieving performance "comparable to C", and sometimes
report results comparing benchmark results to concrete C
implementations of the same problem. A key pair of questions for
such comparisons is: what C program to compare to, and what C
compiler to compare with? In a 2012 paper, Satish et al. compare
naive serial C implementations of a range of throughput-oriented
benchmarks to best-optimized implementations parallelized on a
six-core machine and demonstrate an average 23x (up to 53x)
speedup. Even accounting for thread parallel speedup, these
results demonstrate a substantial performance gap between naive
and tuned C code. In this current paper, we choose a subset of the
benchmarks studied by Satish et al. to port to Haskell. We measure
performance of these Haskell benchmarks compiled with the standard
Glasgow Haskell Compiler and with our experimental Intel Labs
Haskell Research Compiler and report results as compared to our
best reconstructions of the algorithms used by Satish et al.
Results are reported as measured both on an Intel Xeon E5-4650
32-core machine, and on an Intel Xeon Phi co-processor. We hope
that this study provides valuable data on the concrete performance
of Haskell relative to C.

The Glasgow Haskell Compiler (GHC) is a well supported optimizing
compiler for the Haskell programming language, along with its own
extensions to the language and libraries. Haskell's lazy semantics
imposes a runtime model which is in general difficult to implement
efficiently. GHC achieves good performance across a wide variety
of programs via aggressive optimization taking advantage of the
lack of side effects, and by targeting a carefully tuned virtual
machine. The Intel Labs Haskell Research Compiler uses GHC as a
frontend, but provides a new whole-program optimizing backend by
compiling the GHC intermediate representation to a relatively
generic functional language compilation platform. We found that
GHC's external Core language was relatively easy to use, but
reusing GHC's libraries and achieving full compatibility were
harder. For certain classes of programs, our platform provides
substantial performance benefits over GHC alone, performing 2x
faster than GHC with the LLVM backend on selected modern
performance-oriented benchmarks; for other classes of programs,
the benefits of GHC's tuned virtual machine continue to outweigh
the benefits of more aggressive whole program optimization.
Overall we achieve parity with GHC with the LLVM backend. In this
paper, we describe our Haskell compiler stack, its implementation
and optimization approach, and present benchmark results comparing
it to GHC.

We begin with a functional reactive programming (FRP) model in which every
program is viewed as a signal function that converts a stream of input values
into a stream of output values. We observe that objects in the real world --
such as a keyboard or sound card -- can be thought of as signal functions as
well. This leads us to a radically different approach to I/O: instead of
treating real-world objects as being external to the program, we expand the
sphere of influence of program execution to include them within. We call this
virtualizing real-world objects. We explore how virtual objects (such as GUI
widgets) and even non-local effects (such as debugging and random number
generation) can be handled in the same way.

The key to our approach is the notion of a resource type that assures that a
virtualized object cannot be duplicated, and is safe. Resource types also
provide a deeper level of transparency: by inspecting the type, one can see
exactly what resources are being used. We use arrows, type classes, and type
families to implement our ideas in Haskell, and the result is a safe,
effective, and transparent approach to stream-based I/O.
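A hypothetical sketch of the idea (not the paper's actual machinery, which uses type families to enforce that resources are not duplicated): a signal function can carry a phantom type naming the resources it touches, so the type alone reveals what a program may access.

```haskell
-- A signal function tagged with a phantom resource type r; the a -> b
-- part is modeled crudely as a list transformer.
newtype SF r a b = SF { runSF :: [a] -> [b] }

-- Resource tags for virtualized real-world objects.
data Keyboard
data SoundCard

-- Composition unions the resource tags at the type level, so a
-- pipeline's type lists every resource it uses. (The paper instead
-- requires r1 and r2 to be disjoint, via type families.)
(>>>>) :: SF r1 a b -> SF r2 b c -> SF (r1, r2) a c
SF f >>>> SF g = SF (g . f)

-- A virtualized keyboard source would then have a type like
--   keyboard :: SF Keyboard () Char
```

Inspecting the type `SF (Keyboard, SoundCard) a b` shows exactly which resources the composed computation uses, which is the transparency property the abstract refers to.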

Arrows are a popular form of abstract computation. Being more general
than monads, they are more broadly applicable, and in particular are a
good abstraction for signal processing and dataflow computations.
Most notably, arrows form the basis for a domain specific language
called Yampa, which has been used in a variety of concrete
applications, including animation, robotics, sound synthesis, control
systems, and graphical user interfaces.

Our primary interest is in better understanding the class of abstract
computations captured by Yampa. Unfortunately, arrows are not
concrete enough to do this with precision. To remedy this situation
we introduce the concept of commutative arrows that capture a
non-interference property of concurrent computations. We also add an
init operator that captures the causal nature of arrow effects,
and identify its associated law.
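The init operator and the CCA laws can be sketched concretely using list-based stream functions as a model (the SF type and instance details below are illustrative, not taken from the paper):

```haskell
import Prelude hiding (init, id, (.))
import Control.Category
import Control.Arrow

-- Stream functions: a simple concrete model of signal processing.
newtype SF a b = SF { runSF :: [a] -> [b] }

instance Category SF where
  id          = SF (\xs -> xs)
  SF g . SF f = SF (g . f)

instance Arrow SF where
  arr f        = SF (map f)
  first (SF f) = SF (\ps -> let (xs, ys) = unzip ps in zip (f xs) ys)

-- The init operator: a unit delay seeded with an initial value.
class Arrow arrow => ArrowInit arrow where
  init :: b -> arrow b b

instance ArrowInit SF where
  init i = SF (i :)   -- emit the seed, then the delayed stream

-- The CCA laws, informally:
--   commutativity: first f >>> second g  ==  second g >>> first f
--   product:       init i *** init j    ==  init (i, j)
```

On finite lists this delay grows the output by one element; conceptually the streams are infinite, so the model is only a sketch.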

To study this class of computations in more detail, we define an
extension to arrows called causal commutative arrows (CCA), and
study its properties. Our key contribution is the identification of a
normal form for CCA called causal commutative normal form
(CCNF). By defining a normalization procedure we have developed an
optimization strategy that yields dramatic improvements in performance
over conventional implementations of arrows. We have implemented this
technique in Haskell, and conducted benchmarks that validate the
effectiveness of our approach. When compiled with the Glasgow Haskell
Compiler (GHC), the overall methodology can result in significant
speed-ups.

Arrows are a popular form of abstract computation. Being more general
than monads, they are more broadly applicable, and in particular are a
good abstraction for signal processing and dataflow computations.
Most notably, arrows form the basis for Yampa, a functional
reactive programming (FRP) language embedded in Haskell. Our primary
interest is in better understanding the class of abstract computations
captured by Yampa. Unfortunately, arrows are not concrete enough to
do this with precision, for lack of domain-specific knowledge.

In this thesis, we present a more constrained class of arrows called
causal commutative arrows (CCA) that introduces an init
operator to capture the causal nature of arrow effects, as well as
two additional laws. Our key contribution is the identification of a
normal form for CCA, and by defining a normalization procedure we have
developed an optimization strategy that yields dramatic improvements
in performance over conventional implementations of arrows.

To study this abstract class of computation more concretely, we
explore three different yet related applications of CCA, namely
synchronous dataflow, ordinary differential equations, and functional
reactive programming. For each application, we develop an
arrow-based DSL that is an instance of CCA, and we show its significant
advantages in improving a program's run-time behavior, such as
eliminating insidious space leaks and boosting performance by orders
of magnitude.

We propose a programming paradigm called compress-and-conquer
(CC) that leads to optimal performance on multicore platforms. Given
a multicore system of p cores and a problem of size n, the problem
is first reduced to p smaller problems, each of which can be solved
independently of the others (the compression phase). From the
solutions to the p problems, a compressed version of the same
problem of size O(p) is deduced and solved (the global
phase). The solution to the original problem is then derived from the
solution to the compressed problem together with the solutions of the
smaller problems (the expansion phase).
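The three phases above can be sketched in Haskell for prefix sum, with list chunks standing in for the p cores (ccScan and chunksOf are illustrative names, not the paper's API):

```haskell
-- Split a list into chunks of at most k elements.
chunksOf :: Int -> [a] -> [[a]]
chunksOf _ [] = []
chunksOf k xs = let (h, t) = splitAt k xs in h : chunksOf k t

-- Compress-and-conquer prefix sum over p chunks (p >= 1 assumed).
ccScan :: Num a => Int -> [a] -> [a]
ccScan p xs = concat (zipWith (\off c -> map (+ off) c) offsets locals)
  where
    chunks  = chunksOf ((length xs + p - 1) `div` p) xs
    -- compression: solve each chunk independently with the
    -- best-known sequential algorithm (here, scanl1)
    locals  = map (scanl1 (+)) chunks
    -- global phase: scan the p chunk totals, a problem of size O(p)
    totals  = map last locals
    offsets = scanl (+) 0 totals
    -- expansion: add each chunk's global offset back (zipWith above)
```

In the real implementation the p local scans run on separate cores; here they are just independent list traversals, which is enough to see why each phase can reuse the sequential algorithm.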

The CC paradigm reduces the complexity of multicore programming by
allowing the best-known sequential algorithm for a problem to be used
in each of the three phases. In this paper we apply the CC paradigm
to a range of problems including scan, nested scan, difference
equations, banded linear systems, and linear tridiagonal systems. The
performance of CC programs is analyzed, and their optimality and
linear speedup are proven. Characteristics of the problem space
subject to CC are formally examined, and we show that its
computational power subsumes that of scan, nested scan, and mapReduce.

The CC paradigm has been implemented in Haskell as a modular,
higher-order function, whose constituent functions can be shared by
seemingly unrelated problems. This function is compiled into
low-level Haskell threads that run on a multicore machine, and
performance benchmarks confirm the theoretical analysis.

We study a number of embedded DSLs for autonomous ordinary differential
equations (autonomous ODEs) in Haskell. A naive implementation based on the
lazy tower of derivatives is straightforward but has serious time and
space leaks due to the loss of sharing when handling cyclic and infinite data
structures. In seeking to fix this problem, we explore a number of DSLs
ranging from shallow to deep embeddings, and middle-grounds in between. We
advocate a solution based on arrows, an abstract notion of computation
that offers both a succinct representation and an effective implementation.
The combinator style of arrows happens to capture
both sharing and recursion elegantly. We further relate our arrow-based DSL to
a more constrained form of arrows called causal commutative arrows, the
normalization of which leads to a staged compilation technique improving ODE
performance by orders of magnitude.
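The lazy tower of derivatives can be sketched as follows (names are illustrative). For the autonomous ODE y' = y with y(0) = 1, every derivative equals y itself, so the tower at 0 is a cyclic list; preserving the sharing in that cycle is precisely what the naive implementation loses.

```haskell
-- The tower of derivatives of y at 0 for y' = y, y(0) = 1:
-- a cyclic, infinite list in which every derivative is 1.
expTower :: [Double]
expTower = 1 : expTower

-- Summing the truncated Taylor series over a tower recovers y(x);
-- for expTower this approximates exp x.
taylor :: [Double] -> Double -> Int -> Double
taylor ds x n = sum (take n (zipWith3 (\d xk k -> d * xk / k) ds powers facts))
  where
    powers = iterate (* x) 1      -- x^0, x^1, x^2, ...
    facts  = scanl (*) 1 [1 ..]   -- 0!, 1!, 2!, ...
```

Here the cycle in expTower is shared by construction; in a DSL where towers are built by interpreting user-level recursion, that sharing can silently degrade into recomputation, which is the leak the paper addresses.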

Arrows are a popular form of abstract computation. Being more general
than monads, they are more broadly applicable, and in particular are a
good abstraction for signal processing and dataflow computations.
Most notably, arrows form the basis for a domain specific language
called Yampa, which has been used in a variety of concrete
applications, including animation, robotics, sound synthesis, control
systems, and graphical user interfaces.

Our primary interest is in better understanding the class of abstract
computations captured by Yampa. Unfortunately, arrows are not
concrete enough to do this with precision. To remedy this situation
we introduce the concept of commutative arrows that capture a
kind of non-interference property of concurrent computations. We also
add an init operator, and identify a crucial law that captures
the causal nature of arrow effects. We call the resulting
computational model causal commutative arrows.

To study this class of computations in more detail, we define an
extension to the simply typed lambda calculus called causal
commutative arrows (CCA), and study its properties. Our key
contribution is the identification of a normal form for CCA called
causal commutative normal form (CCNF). By defining a
normalization procedure we have developed an optimization
strategy that yields dramatic improvements in performance over
conventional implementations of arrows. We have implemented this
technique in Haskell, and conducted benchmarks that validate the
effectiveness of our approach. When combined with stream fusion, the
overall methodology can result in speed-ups of greater than two orders
of magnitude.

The implementation of conceptually continuous signals in
functional reactive programming (FRP) is studied in detail. We show
that recursive signals in standard implementations using streams and
continuations lead to potentially serious time and space leaks under
conventional call-by-need evaluation. However, by moving to the
level of signal functions, and structuring the design around
arrows, this class of time and space leaks can be avoided.
We further show that the use of optimal reduction can also
avoid the problem, at the expense of a much more complex evaluator.

Reliability is a critical requirement of the Internet. The availability
and resilience of the Internet under failures can have significant
global effects. However, in the current Internet routing architecture,
achieving the high level of reliability demanded by many mission critical
activities can be costly. In this paper, we first propose a
novel solution framework called reliability as an interdomain service
(REIN) that can be incrementally deployed in the Internet and
may improve the redundancy of IP networks at low cost. We then
present robust algorithms to efficiently utilize network redundancy
to improve reliability. We use real IP network topologies and traffic
traces to demonstrate the effectiveness of our framework and
algorithms.

Here are some software projects that I've worked on. They may be school or work related, or just of personal interest. Only open source projects are listed here although some are yet to be made public. The list of commercial ones can be found in my resume.

Site Tool is a minimalistic approach to personal document and web page writing, or to put it simply, pandoc + make + darcs. It is sort of like a personal Wiki, but without involving any web server, or the clunky edit box in a browser. It is an early effort towards my ideal of the web-age human-computer interface: a digital notebook where one continuously writes, reads, and cross-links, and on top of which, one shares such efforts with others.

CCA is a pre-processor and optimizer for Causal Commutative Arrows, which is a more constrained class of Arrows with two additional laws: commutativity and product. It implements the normalization algorithm presented in our paper using Template Haskell to provide staged compilation of generic CCA arrows, with speedups sometimes over two orders of magnitude. The pre-processor is based on Paterson's arrowp preprocessor but specialized to deal with CCA. The latest development version can be found in its darcs repository.

Euterpea
is a new Haskell library for computer music applications developed at the Yale Haskell Group. It is a descendant of Haskore and HasSound, and is intended for both educational purposes as well as serious computer music development. Euterpea is a wide-spectrum library, suitable for high-level music representation, algorithmic composition, and analysis; mid-level concepts such as MIDI; and low-level audio processing, sound synthesis, and instrument design. The name Euterpea is derived from Euterpe, who was one of the nine Greek Muses (goddesses of the arts), specifically the Muse of Music.
My contributions include real-time MIDI I/O, as well as a Musical User Interface that enables a set of computer-music-specific GUI widgets such as keyboards, guitar frets, knobs, sliders, and so on. Yale ITS had an article about our work behind this. The software is under active development and can be obtained from its darcs repository.

HWiki
is a custom Wiki engine written in Haskell to provide online book editing and sharing, and PDF publication to e-ink reader devices including iRex iLiad, Sony Reader, Amazon Kindle DX and a few others. It runs as a SCGI program, serves an AJAX interface (borrowed from Orchid), and relies on Pandoc to parse Markdown syntax. Notable features include automatic segmentation of big files into smaller chunks, and a built-in special-purpose revision control system that supports millions of small files, because neither git nor darcs was up to the task. The software has also been licensed (under the GPL) to a commercial user.

LambdaINet
is an experimental Interaction Net evaluator written in Haskell that implements optimal evaluation for Lambda calculus based on Lambdascope. It features an interactive graphical user interface that helps in understanding optimal as well as other reduction strategies. Not included at the moment is a prototypical implementation of a fine-grain parallel reduction system (complete with a concurrent garbage collector) for Interaction Nets that is both lock-free and wait-free. Few implementations of optimal evaluation exist, and I'm proud that this is one of them.

GLFW is a Haskell binding to the GLFW OpenGL framework. It provides an alternative to GLUT for OpenGL based Haskell programs. I initially started this project as an effort to port SOE software to a modern cross-platform graphics interface. The latest version is a cabal install away from Hackage DB, and its development is now coordinated through the mailing list and darcs repository.

The following are past projects no longer being maintained. Source code is distributed as-is, and I do not know whether it still compiles or works at all.

AwkiAwki is a WikiWiki clone written in awk by Oliver Tonnhofer. I've made numerous extensions to the original to support attachments, code inlining, header index, user authentication, comment posting, etc. The latest version of my branch is made available as source. I've stopped developing it further since I moved on to the Haskell based HWiki.

Incremental Garbage Collector is a patch for Lua 4.0 that implements a tri-color algorithm so as to reduce apparent GC pauses during Lua script execution, and to meet critical requirements for applications like games. The patch was written and released in 2002, and obsoleted by Lua 5.1 when the latter came out in 2005 with a built-in incremental collector.

GEEP is a multi-channel communication protocol to replace TCP, or put another way, a reliable data transport layer on top of UDP, developed at GIME International in 2001. It features minimal connection maintenance overhead, built-in multi-channel control within the same connection, selective acknowledgement to reduce network load, and enhanced Vegas transmission control mechanisms to speed up transfer by fully utilizing available bandwidth. GEEP has served well as the data protocol between game clients and servers, as well as in a generic software auto-update service.

GIME is a generic engine that supports multi-threading, internationalization, database access, and multiple interfaces including web, telnet, and graphical clients, on which massive multi-user interactive applications can be built with little effort. It was the first open technology developed at GIME International back in 2001. Its SourceForge repository provides the backend server (written in Pike), frontend GUI tools (written in Lua and C), and a demo groupware built with GIME (also with Roxen webserver). There was also a later unreleased version written in Haskell which served a WAP game for cellphones, but that was more of an experiment than a real product.

SGZ MUD is a text-based Chinese MUD (Multi-User Dungeon) with a background set in the Three Kingdoms era. It started in 1998 after I localized a version of Lima Mudlib in Chinese, and soon became a rare gem among popular Chinese MUDs at the time due to its unique strategy play. The enthusiasm about the game led us to start a company to develop it further, eventually becoming the GIME engine and the Century of Three Kingdoms MMOG. SGZ mudlib (excluding player data) is made available as LPC source code, and you need a version of the MudOS driver (or a precompiled i386 Linux binary) to run it.