Thursday, July 25, 2013

I can't tell you how you should read a research paper. Probably a different and customized style would work best for you. But I can tell you how I read a research paper. And if you think some of the things below make sense, you can try them for yourself and see if they work for you as well.

So, how do I read a paper? The first answer that pops to my mind is "I fight with it". I strive to completely understand and "Grok" the paper by arguing with it. I resist to believe the claims in the paper at face value, I try to poke holes at it. I believe that the paper has to earn my trust, and I only accept the parts that I am obliged to accept.

My algorithm for reading a paper

0. Print the paper
1. Ask questions and argue with the paper
2. Write a review
3. Fight with the paper
4. Sleep on it and come back to it

Print the paper

Ask questions

I am a slow and deliberate reader. I ask a lot of questions while reading the paper. Why did the paper define it this way? Was this the only way to define it? Was this the best way to define it? Do I believe it? Why should I believe it? I continuously have some questions in my mind going through the paper. If nothing else, I try to guess where the next paragraph will be (or should be) leading the paper. This is called critical reading.

I get very detail-oriented when reading the paper. I highlight, I underline, I cross over sentences. I mark important paragraphs with stars. I use the margins to take notes about my arguments and about some connections that occur to me. Obviously, there will be several places I won't understand in my first reading. I mark these with WDYM (what do you mean). I hypothesize (make up guesses) about things I don't understand, and revise my hypothesis as I read further in to the paper.

I am a bit ashamed of this, but I often write provocative things on the margin, like "This claim is stupid", "This is just bullshit". It is silly, but I do this to keep myself engaged with the paper and be emotionally involved as well. This way, I never fall asleep reading a paper.

Write a review

"Writing is nature's way of telling you how sloppy your thinking is."

I write a short review of the paper to test how much I understand of it. Since I already mark enough text from the paper and write enough notes in the margins while reading the paper critically, it becomes easy to start writing my personalized review of the paper using these.

I am often surprised by how little I understood the paper in my first reading. When writing the review, since I try to explain things myself in my own way, I often realize that I didn't really understand the paper.

To elaborate this point, I will borrow from my advice to undergraduate students:
There are different levels of knowing/understanding. The first level is "knowing by word". With passive learning (like passively reading), you get the most limited form of understanding/knowing. It is ephemeral, and you are not able to reconstruct that on your own. The second level is "knowing by first hand experience". At this level, you are able to understand/witness and if necessary rediscover the knowledge on your own. Finally, the third level is "knowing by heart". At this level, you internalized the knowledge so that it becomes part of your nature. You grok it.

Fight with the paper

Now that I somewhat understand the paper, it is time for me to think about it more in depth to understand it better. How would I do it? Could I find a better/simpler solution to this problem? What would be the major disadvantages of the proposed solution? Or, even, considering the problem the paper addressed, was this the right question to ask in the first place?

I find that this exercise also has benefits for my own writing/research. Trying to poke holes in others' papers helps me become a better writer and write stronger/better-defended papers. (If you are a graduate student, this will help you write a well-defended strong thesis, and you can get by fighting a small snake in the snake fight portion of your thesis defense.)

It is now time for me to read the paper a second time. This time I try to test/verify my major objections about the paper and also try to understand the places in the paper I had marked as WDYM (or sometimes as WDYFM).

Sleep on it and come back to it

At this point, I might have been overly negative and hard on the paper, and now is the time to empathize with the paper and put it back in context. It is important not to confuse critical reading with being hypercritical about a paper and dismissing the contributions made by the paper.

So I sleep on it, I take time off and go on with other things (life?) for a day or so. But during this time, my mind comes back unconsciously to the paper and reconsiders my judgments about it. And often, as a result I come to respect the paper and appreciate it more. Even though I fought with the paper and argued with it fiercely, now I try to respect and truly learn from the paper.

Maybe this quote from "Ender's game" explains this best. Ender says "In the moment when I truly understand my enemy, understand him well enough to defeat him, then in that very moment I also love him."

Questions-Answers

How do other researcher read papers?
Unfortunately, I don't know too much about how other researchers read. It would be nice to learn tricks and approaches used by others. But I guess these things are not discussed much explicitly.

I think my approach to paper reading was shaped strongly by my PhD advisor Anish Arora. He would also write on the paper, underline carefully, and argue in the margins. He was also extremely meticulous about paper reviewing for conference committees. He would insist that we understand every aspect of the paper, and sometimes understand its contributions better than the authors that wrote the paper :-)

I also had a chance to observe my postdoc advisor Nancy Lynch reading papers. She is a very detail-oriented person as well. She was able to find even the tiniest smallest mistakes in the papers with ease. She once told me that her mind worked like a debugger when reading a paper, and these bugs jumped at her. The way she worked with students was that she would dedicate herself solely on a student/paper for the duration of an entire week. That week, she would avoid thinking or listening other works/students. This is because, she wanted to immerse and keep every parameter about the paper she is working on in her mind, and grok it.

After reading this post, Ted Herman shared some of his useful tricks. While reading, he would think about how he would rewrite the paper in a way that would have cleared up the misunderstandings. He would also ask these questions of the paper: "Why is this interesting? What are the surprises? Should we even be surprised?"

How much time do I spend reading a paper this way?
Many many hours. Of course if the subject is easy or I know the domain well, I can read a paper in a more relaxed manner. But I don't remember getting a lot of benefit from such readings. I get the best return on investment on the papers I had to struggle with and truly grok.

Isn't this wasteful for bad papers?
Yes, it is. I do this for reading papers that are important for my research and/or reading good quality papers that appeared at good venues. Or for papers I am stuck with reviewing for a conference or journal.

Sunday, July 21, 2013

While there has been good progress and wide availability of the devices (smartphones, tablets, sensors) to fulfill the ubiquitous computing vision, the-state-of-the-art in software and integration is lagging far behind. Consider DARPA's 2009 network grand challenge on the occasion of the 40th anniversary of the Internet. The challenge was to accurately find 10 weather balloons deployed in arbitrary locations of the U.S. within a day. There was an award of $40,000 for the team that would first report the locations of the 10 balloons accurately, and this challenge was solved within 9 hours. The winning team employed social networks and a multilevel incentive structure, but had to prepare, campaign, and publicize aggressively for an entire month before the challenge day.

This points to a big gap between the potential of the smartphones and the reality of the smartphone software today. Why are the existing apps so limited and person-centric? Why can we not have an app that is able to solve similar active collaboration and coordination problems automatically?

We argue that the reason for this gap is the lack of an infrastructure to task/utilize these devices for collaboration and coordination. In the absence of such an infrastructure, the state-of-the-art today is for each device to connect to Internet to download/upload data and accomplish an individual task that does not require collaboration and coordination. In contrast, providing an infrastructure for publish/subscribe and tasking of these devices would enable any device to utilize the data published by several devices in a region, as well as to task several devices in a region to acquire the needed data, if the data is not already being published to the infrastructure.

In order to task/utilize ubiquitous devices for collaboration, we propose a ubiquitous computing middleware, Eywa, which provides an open publish-subscribe infrastructure for smartphones, and paves the way for crowdsourced sensing and collaboration applications. Eywa differs from previous work as its emphasis is on active and focused crowdsourcing and it serves as an enabler for users/devices to task other devices automatically to collect the needed data.

Here is a very raw/immature idea I have been toying with recently. I decided to write it down and share in order to get feedback about whether this is worth exploring further.

Serializability

Serializability of reads and writes (sequential consistency) is an important feature in distributed systems. A significant application of serializability is in transaction processing in distributed databases (e.g., Spanner).

Paxos serialization versus serializability

Paxos is a method for fault-tolerant state machine replication. In order to achieve fault-tolerant state machine replication, Paxos employs consensus to serialize operations at a leader and apply the operations at each replica in this exact serialized order (dictated by the leader).

While serialization is not an end in Paxos but rather a means, Paxos is seen nowadays as a defacto method for serialization rather than a method for fault-tolerant state machine replication. Maybe the ZooKeeper implementation of a Paxos-based lock-service help promote this confusion and blur the lines between Paxos's true goal (fault-tolerant replication) with its side effect (serialization).

In fact Paxos serialization is overkill, it is too strong. Paxos will serialize operations in a total order, which is not necessarily needed for sequential consistency. Today in many applications where knowing the total order and replicated-logging of that order is not important, Paxos is still (ab)used.

Serialization is not and should not be tightly-related to a replication solution.

OK, so what? (What is wrong with using a drug for its side effect?)

This confusion may permeate the misconception that serialization is inherently a reactive process. Many people have now been thinking about serialization from the Paxos perspective and as an inherently on-demand process.

(Probably there are other problems that arise due to this confusion, but for now I am only picking on this one.)

Proactive serialization

Serialization can actually be proactive and anticipatory. The workers does not need to make a serialization request to a Paxos leader. By getting rid of such a request (albeit in a fraction of the cases), latency can be reduced and throughput can be improved.

By adopting the proactive serialization philosophy, a lock service can provide the locks to some workers proactively so those workers can go ahead. Embracing this philosophy, a lock-service master can anticipate a future request and provide some locks ahead of time. The benefit of this is to eliminate the need for an explicit request (again in a fraction of the cases) and improve throughput and reduce latency.

A couple paragraphs above I said that "Today in many applications where knowing the total order and replicated-logging of that order is not important, Paxos is still (ab)used." Picking up on that thread I will claim further (without much evidence for now) that in those cases, proactive serialization could do a better job.

What are some ways to achieve proactive serialization?

How can proactive serialization be achieved? One approach is to employ machine learning. The lock service learns the patterns of requests for locks over time and then starts distributing these locks speculatively before they are requested.

Employing presignaling can be another approach to achieve proactive serialization. The nodes can anticipate which locks they will be needing in the next round or minute and ask for those locks in bulk speculatively. The lock service can choose to honor some of these requests and achieve proactive serialization to improve performance.

Another approach to achieve proactive serialization can be to do a static analysis of workers' programs/actions and come up with patterns/sequence of accesses. When the lock service observes one of the detected patterns, it can give the locks in advance to the respective workers that are mentioned in the rest of the pattern.

Did I pick on Paxos unfairly?

Actually, none of these proactive approaches are ruled out in Paxos. The Paxos leader itself could be making these proactive/speculative lock giving requests in advance. So, did I really need to pick on Paxos? Maybe not.

However, Paxos serialization in total order is still overkill and can cause problems in a Paxos-based solution. Google Spanner used a smart solution of assigning separate conflict domains (at the tablet level) to Paxos groups. This way it was able to provide locks in parallel across separate conflict domains.

If we do not tightly-couple serialization with a Paxos-based replication service, we can find more efficient and new solutions for serialization.

Thursday, July 4, 2013

The Spanner paper by Google (appeared in OSDI'12) is cryptic and hard to
understand. When I first read it, I thought I understood the main idea, and that the
benefit of TrueTime was to enable lock-free read-only transactions in Spanner.
Then, I slowly realized things didn't check; it was possible to achieve
lock-free read-only transactions without TrueTime as well. I did another read, and thought for some time, and had a better understanding of how TrueTime
benefits Spanner, and how to improve its shortcomings.

I will first provide a summary of the Spanner work (borrowing sentences and
figures from the Spanner paper), and then talk about what TrueTime is actually
good for.

In a nutshell

Spanner is Google's scalable, multi-version, globally-distributed, and
synchronously-replicated database. Spanner supports non-blocking reads in the past, lock-free read-only transactions, and atomic schema changes. In order to
support externally-consistent distributed transactions at global scale,
it uses a novel TrueTime API that exposes clock uncertainty.

From NoSQL to NewSQL!

The need to support semi-relational tables and synchronous replication in
Spanner has been motivated by the popularity of Megastore. Many applications at
Google (e.g., Gmail, Picasa, Calendar, Android Market, and AppEngine) chose to
use Megastore because of its semi-relational data model and synchronous
replication, despite its poor write throughput.

Spanner evolved from a Bigtable-like versioned key-value store into a temporal
multi-version database. Data is stored in semi-relational tables, and Spanner
provides a SQL-based query language and supports general-purpose long-lived
transactions (e.g, for report generation —on the order of minutes). The
Spanner team believes it is better to have application programmers deal with
performance problems due to overuse of transactions as bottlenecks arise, rather
than always coding around the lack of transactions.

Spanner distributed database

Data is versioned, and each version is automatically timestamped with its commit
time by the TrueTime API. Spanner provides externally consistent reads and
writes, and globally-consistent reads across the database at a timestamp.
External consistency (or equivalently, linearizability) is defined as follows:
if a transaction T1 commits before another transaction T2 starts, then T1's
commit timestamp is smaller than T2's. Using TrueTime Spanner is able to assign
globally-meaningful commit timestamps to transactions, which reflect the
serialization order.

2 TrueTime API

TT.now() returns a TTinterval that is guaranteed to contain the absolute time
during which TT.now() was invoked. In other words, TrueTime guarantees that for
an invocation tt = TT.now(), tt.earliest < tabs(enow) < tt.latest, where enow is the invocation event and tabs(enow) is the absolute time of
event enow. The instantaneous error bound is denoted as ε, which is
half of the interval’s width.

Google keeps uncertainty small (bounded by around 6ms) by using multiple modern
clock references (GPS and atomic clocks). TrueTime is implemented by a set of
time master machines per datacenter and a time slave daemon per machine. The
majority of masters have GPS receivers with dedicated antennas. The remaining
masters (Armageddon masters) are equipped with atomic clocks.

Every daemon polls a variety of masters to reduce vulnerability to errors from
any one master. Between synchronizations, a daemon advertises a slowly
increasing time uncertainty. ε is derived from conservatively applied
worst-case local clock drift. The daemon's poll interval is currently 30
seconds, and the current applied drift rate is set at 200 μ sec/second, which
together account for the bound on uncertainty at 6ms.

The TrueTime API directly exposes clock uncertainty, and the guarantees on
Spanner's timestamps depend on the bounds that the implementation provides. If
the uncertainty is large, Spanner slows down to wait out that uncertainty.

3 Spanner Implementation

A zone has 1 zonemaster and 100 to 1000s of spanservers. Zonemaster assigns data
to spanservers; spanserver serves data to clients. Location proxies help clients
to locate the spanservers assigned to serve their data. The universe master
displays status information about all the zones for interactive debugging. The
placement driver handles automated movement of data across zones on the
timescale of minutes.

Each spanserver is responsible for 100 to 1000 tablets. A tablet implements a
bag of the following mappings: (key:string, timestamp:int64) → string. To support replication, each spanserver implements a single Paxos state
machine on top of each tablet.

At every replica that is a leader, each spanserver implements: a lock table
(mapping ranges of keys to lock states) to implement concurrency control, and a
transaction manager to support distributed transactions. If a transaction
involves only one Paxos group (as is the case for most transactions), it can
bypass the transaction manager, since the lock table and Paxos together provide
transactionality.

If a transaction involves more than one Paxos group, those groups' leaders
coordinate to perform 2-phase commit. One of the participant groups is chosen as
the coordinator: the participant leader of that group will be referred to as the
coordinator leader, and the slaves of that group as coordinator slaves.

4 Concurrency control

The Spanner implementation supports read-write transactions, read-only
transactions, and snapshot reads. Standalone writes are implemented as
read-write transactions; non-snapshot standalone reads are implemented as
read-only transactions. A snapshot read is a read in the past that executes
without locking.

4.1 Read-Write transactions

Read-write transactions use 2-phase locking and 2-phase commit. First, the
client issues reads to the leader replica of the appropriate group, which
acquires read locks and then reads the most recent data. When a client has completed all reads
and buffered all writes, it starts 2-phase commit. Read-write transactions can be assigned commit timestamps by the coordinator
leader at any time when all locks have been acquired, but before any locks have
been released. For a given transaction, Spanner assigns it the timestamp that
the coordinating leader assigns to the Paxos write that represents the
transaction commit. To wait out the uncertainty in TrueTime,
there is a Commit Wait: The coordinator leader ensures that clients cannot see
any data committed by Ti until TT.after(si) is true.

4.2 Read-only transactions

Reads in a read-only transaction execute at a system-chosen timestamp without
locking, so that incoming writes are not blocked. A read-only transaction
executes in 2 phases: assign a timestamp sread, and then execute the
transaction's reads as snapshot reads at sread. The snapshot reads can
execute at any replicas that are sufficiently up-to-date.

Serving reads at a timestamp.
Every replica tracks a value called safe time tsafe which is the maximum
timestamp at which a replica is up-to-date. A replica can satisfy a read at a
timestamp t, if t ≤ tsafe. We define tsafe = min(tPaxos, tTM), where each Paxos state machine has a safe time tPaxos and each
transaction manager has a safe time tTM.

tPaxos is the timestamp of the highest-applied Paxos write. Because
timestamps increase monotonically and writes are applied in order, writes will
no longer occur at or below tPaxos with respect to Paxos.

tTM is ∞ at a replica if there are zero prepared (but safe not
committed) transactions—that is, transactions in between the two phases of
2-phase commit. Otherwise, for every participant group g, over all
transactions Ti prepared at g, tTM= mini (spreparei,g)-1. In
other words, tTM denotes the request timestamp of the earliest prepared but
not committed transaction.
Assigning timestamps to read-only transactions.
The simple assignment of sread = TT.now().latest to a read-only transaction
preserves external consistency. However, such a timestamp may require the
execution of the data reads at sread to block if tsafe has not
advanced sufficiently. To reduce the chances of blocking, Spanner should assign
the oldest timestamp that preserves external consistency. (External consistency
constraint dictates that you cannot use an older version of variable to read,
and you cannot assign a timestamp earlier than pending read-write transaction on
any of the variables involved in the read-only transaction). Spanner implements
a simpler choice when multiple Paxos groups are involved. The client avoids a
negotiation round, and just has its reads execute at sread=
TT.now().latest, which may wait for safe time to advance.

4.3 Refinements

tTM as defined above has a weakness, in that a single prepared transaction
prevents tsafe from advancing. Such false conflicts can be removed by
augmenting tTM with a fine-grained mapping from key ranges to
prepared-transaction timestamps. When a read arrives, it only needs to be
checked against the fine-grained safe time for key ranges with which the read
conflicts.

tPaxos is also advanced by heartbeats to help tsafe advance at the
replicas. (This does not require high precision clock synchronization, and NTP
easily suffices for this.)

5 TrueTime, what is it good for?

The paper does not directly discuss what TrueTime buys for Spanner. It says
this in the conclusions: "One aspect of our design stands out: the linchpin of
Spanner's feature set is TrueTime. We have shown that reifying clock uncertainty
in the time API makes it possible to build distributed systems with much
stronger time semantics." What does this mean exactly? What TrueTime buys
Spanner is left unclear in the paper.

After re-reading the paper only with this question in mind, I was left more
puzzled. In my first read of the paper, I thought TrueTime enabled lock-free
reads in Spanner. After the second reading, I realized that lock-free reads
could be implemented without TrueTime, only by using version numbers, because
read-only transactions were also serialized by coordinating leaders and Paxos
groups. TrueTime wasn't speeding up read-only transactions either. Even using
TrueTime, read-only transaction still needs to wait to hear/learn the commit
information from any variable/tablet-overlapping pending/prepared read-write transaction.

Maybe TrueTime benefited by making Spanner implementation simple, but Section
4.2.4 lists several unavoidable implementation hardships even with TrueTime. It
looks like using version numbers wouldn't be significantly more complicated.
Also, for schema change and paxos leader replacement (which the paper claims
TrueTime simplified a lot), the NTP synchronization (several tens of ms accuracy)
easily suffices. We could have easily avoided the more precise TrueTime
implementation with GPS and atomic clocks for these.

TrueTime and consistent-cuts in the past

After I was almost convinced that TrueTime was not good for anything, I realized
this: TrueTime benefits snapshot reads (reads in the past) the most! By just
giving a time in the past, the snapshot read can get a consistent cut read of
all the variables requested at that given time. This is not an easy feat to
accomplish in a distributed system without using TrueTime and high-precision
synchronized clocks, as it would require capturing and recording causality
relationships across many different versions of the variables involved so that a consistent cut
can be identified for all the variables requested in the snapshot read. That
would certainly be highly prohibitive to store in the multiversion database and
very hard to query as well.
TrueTime provides a convenient and succinct way of encoding and accessing past consistent-cuts of the Spanner multiversion database.

But can we find any clever solutions to circumvent that problem without resorting
to the high-precision clocks in TrueTime?

My close friend and frequent-collaborator Sandeep Kulkarni at Michigan State
University had proposed HybridClocks in 2001. HybridClocks also exposed clock
uncertainty ε in physical clocks, but it also used logical clocks in addition to
the physical clocks to capture the finer-grain causality relationships that
falls into the ε gray area.

Using HybridClocks it can be possible to relieve the atomic clock requirement in
Spanner and use NTP instead. Even with ε at 100 msecs, we can still track
finer-grain dependencies between variables using HybridClocks. The good news is
that the size of the HybridClocks can be kept very succinct and can be recorded
in Spanner database as timestamp to enable snapshot reads easily.

HybridClocks idea may also help speed up Spanner's high-precision clock-based
implementation. In Spanner ε determines the rate of read-write
transactions on a tablet. This is because, the coordinating leaders delay
read-write transactions to commit with at least ε time apart in order to
ensure that there is no uncertainty in past snapshot reads. It is possible to
avoid this wasteful waiting by adding some logical clocks information to
TrueTime as prescribed in HybridClocks.