
General Chair ACM SIGMETRICS 2013, CMU, June 2013

RESEARCH

I am interested in the performance analysis and design of computer
systems, particularly distributed systems. I work on finding
analytical models which capture the important characteristics of a
computer system and allow me to redesign the system to improve its
performance (response time).

I believe that much of the conventional wisdom on which we base system
designs is not well understood and is sometimes false, leading to
inferior designs. Many of our existing beliefs stem from queueing
research in the 1960s and 1970s -- the great era for system performance
modeling. Unfortunately, back then we did not have all the analytical
and computational tools available today. Consequently, some
questions were answered only under the approximation of
Markovian workloads (exponentially distributed job sizes), which we now
know to be unrepresentative of many real-world workloads (which show
much greater variability and often heavy tails). Also, many
multi-server system models and scheduling/routing schemes
were not analytically tractable at that time.

My research revisits these classic questions in system design,
armed with today's new queueing and computational techniques, as well
as a new perspective on real-world workloads, performance metrics, and
implementation experience.
I work on deriving new fundamental theorems in system design, many of
which seem counterintuitive in light of age-old beliefs. I then
incorporate these theorems into implementations of
Web servers, database servers, and distributed server
systems.

Here are a few examples of commonly-held beliefs that my research challenges:

Thousands of "load balancing" heuristics do exactly that -- they
aim to balance the load among the existing hosts. But is load balancing
necessarily a good thing?

Ever notice that the optimal scheduling policy,
Shortest-Remaining-Processing-Time-First (SRPT), is rarely used in
practice? There's a fear that the big jobs will "starve", or be
treated unfairly as compared with Processor-Sharing (PS). But is
this fear borne out in reality?
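One way to probe this fear is a tiny time-stepped simulation (a sketch, not a production scheduler) that runs the same job trace under preemptive SRPT and under PS. The job set and time step below are illustrative, not taken from any real workload:

```python
def simulate(jobs, policy, dt=0.01):
    """Time-stepped single-server simulation.

    jobs   -- list of (arrival_time, size) tuples
    policy -- "SRPT" or "PS"
    Returns the (approximate) completion time of each job.
    """
    remaining = [size for _, size in jobs]
    done = [None] * len(jobs)
    t = 0.0
    while any(d is None for d in done):
        active = [i for i, (a, _) in enumerate(jobs)
                  if a <= t and done[i] is None]
        if not active:
            t += dt          # idle until the next arrival
            continue
        if policy == "SRPT":
            # the whole quantum goes to the job with least remaining work
            i = min(active, key=lambda j: remaining[j])
            remaining[i] -= dt
            if remaining[i] <= 1e-9:
                done[i] = t + dt
        else:  # PS: the quantum is shared equally among active jobs
            share = dt / len(active)
            for i in active:
                remaining[i] -= share
                if remaining[i] <= 1e-9:
                    done[i] = t + dt
        t += dt
    return done


# illustrative trace: one big job (size 3.0) among smaller ones
jobs = [(0.0, 3.0), (0.1, 1.0), (0.2, 0.5), (3.0, 2.0)]
for pol in ("SRPT", "PS"):
    comp = simulate(jobs, pol)
    resp = [c - a for c, (a, _) in zip(comp, jobs)]
    print(pol, "mean response time:", sum(resp) / len(resp))
```

SRPT's total response time is never worse than PS's on the same trace (SRPT is sample-path optimal for total response time), and on this particular trace even the big size-3.0 job finishes earlier under SRPT than under PS -- a small illustration of why the starvation fear deserves scrutiny.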

To minimize mean waiting time in a server farm, research suggests that each job should be sent to the server at which it will experience the least wait. That seems good from the job's perspective, but is the greedy strategy best for the system as a whole?

Given a choice between a single machine with speed s, or n
identical machines each with speed s/n, which would you choose?
Think again ...
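Under purely Markovian assumptions the comparison can be made exactly: with Poisson arrivals and exponential job sizes, the single fast machine (an M/M/1 queue) always gives lower mean response time than the farm of n slow machines (an M/M/n queue, analyzed via the Erlang-C formula). A back-of-the-envelope check, with job sizes normalized to mean 1 and illustrative parameters:

```python
from math import factorial

def mm1_resp(lam, s):
    """Mean response time in an M/M/1 queue: arrival rate lam,
    one server of speed s (service rate s for mean-size-1 jobs)."""
    assert lam < s
    return 1.0 / (s - lam)

def mmn_resp(lam, s, n):
    """Mean response time in an M/M/n queue: n servers of speed s/n each,
    total capacity s, using the Erlang-C waiting probability."""
    mu = s / n            # per-server service rate
    a = lam / mu          # offered load
    rho = lam / s         # utilization (same as the M/M/1 above)
    assert rho < 1
    top = a**n / factorial(n)
    bottom = (1 - rho) * sum(a**k / factorial(k) for k in range(n)) + top
    c = top / bottom      # Erlang-C: probability an arrival must wait
    return c / (s - lam) + 1.0 / mu   # mean wait + mean service time

# illustrative numbers: total speed 1, load 0.5, farm of 4 slow machines
print(mm1_resp(0.5, 1.0))     # single fast machine
print(mmn_resp(0.5, 1.0, 4))  # four machines at quarter speed
```

The catch -- and the reason to "think again" -- is that this clean answer is tied to the exponential job-size assumption; with highly variable job sizes under FCFS, the many-slow-machines option can come out ahead.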

Migrating active jobs is generally considered too expensive.
Killing jobs midway through execution and restarting them from scratch
later is even worse! Says who?

Cycle stealing (using another machine's idle cycles to do your work) is
a central theme in distributed systems and has been the topic of thousands
of papers. But can we quantify when cycle stealing pays off, as a function of switching costs and threshold parameters?