The 24th ACM Symposium on Operating Systems Principles is about to kick off in beautiful Nemacolin Woodlands this morning. Several of us are here and will be live-blogging the talks and Q&A sessions. We've got exciting papers on multi-core scalability, fast transactions, synchronization and many other topics coming up, and that is just the first session!

- The real bottlenecks may be in the interface design.
- Scalable commutativity rule - whenever interface operations commute, they can be implemented in a way that scales.
- Example with File Descriptors. If create could return any FD then create could scale.
- One contented cache line can massively affect the scalability of an application. Single read of a contended cache line can take up to 25x more than a syscall.
- Commuter's commutativity rule is sensitive to state as well.
- Commuter allows users to systematically evaluate the scalability of an interface.
- Analyzed Linux 3.8 and found that 68% of the cases are conflict-free.
- Built sv6 - a POSIX-like operating system in which file system and virtual memory follow the commutativity rule. 99% of the test cases are conflict-free in sv6.

Speedy Transactions in Multicore In-Memory Databases

- Goal: extremely high throughput in-memory relational database.
- fully serializable transactions.
- can recover from crash
- Built a DB and found out that a lot of time was spent transaction commit.
- No need for global TIDs for recovery.
- Shared memory contention only occurs when transactions conflict.
- Uses time-based epochs to avoid doing a serialization memory write per transaction. (i.e. assign each txn a sequence number and a epoch)
- With epochs serialization is just a shared memory read.
- Pre-commit execution similar to OCC (Optimistic concurrency control).
- Commit protocol:
- phase 1: acquire lock on all the records in the write set.
- phase 2: validate records in read set. (i.e. abort if record's TID changed or lock held by another transaction).
- phase 3: pick TID and perform writes.
- They use Masstree, a fost non-transactional B-Tree
- used TPC-C to evaluate because it has large transactions and YCSB-like benchmark.

Q: In your TPC-C you only have one thread per warehouse. You also
A: Doesn't meant zero-contention. About 10% of the warehouses communicate with other warehouses. Wrt to the network we don't believe that network overhead is such a big issue. It is only around 20-30%.

Q: Epochs seem to be increasing latency. Wouldn't you be able to avoid epochs by explicitly tracking dependencies among transactions?
A: We tried in an initial version but it proved to be quite expensive.

Q: Have you compared with a scalable counter for TIDs rather than fetch and get?
A: It's not something I've run experiments for. That's something we'll think about.

Q: Have you explored timestamped counters (vector clock)?
A: The problem is that TID will be the size of the number of cores.
Q: I mean hardware timespatemped counters.
A: No, we haven't.

Everything You Always Wanted to Know about Synchronization but Were Afraid to Ask

Tudor David, Rachid Guerraoui, Vasileios Trigonakis (EPFL)

- Synchronization of the biggest bottlenecks in scalability.
- Scalability issues: is it the hardware, workload, appliction or the algorithm
itself?
- Scalability of sync is mainly a property of the hardware.
- Observations:
- crossing sockets is a killer
- sharing within a socket is necessary but not sufficient
- intra-socket uniformity matters
1) Crossing sockets
- measured the latency on multi-sockets: within socket 40ns, +40ns per hop.
- depending on architecture crossing sockets can be up to 8x more expensive.
- Xeon has a step decrease in performance.
2) Sharing within socket is necessary but not sufficient
- data within socket: - Operaton (40ns) or 120ns if it requires a broadcast to find the data that may potentially be in the same socket.
- On Xeon the latency is always around 20-40 ns.
3) Distance of single-socket
- Niagara: 23 ns while on Tilera 1 hop 40ns and +2 ns per hop.
- max-scalability of Niagara is 3.8x compared to 2.3x on Tilera => uniformity of Niagara provides better scalability.
4) Simple locks are powerfull
- in 25/32 out of the test cases single spin locks perform the best.

Q: Does analysis point to any interesting directions to improve hardware level sync?
A: Our results, indicate that it could be interesting to have hwd locks.

Q: Did you ever see latencies increase to do contention on the inter-connect?
A: We didn't really observe this because they're mainly sync bound. There's no stress on the interconnect. But the problem you've mentioned can definitely happen.

Session 2: Time is of the essence

Dandelion: A Compiler and Runtime for Heterogeneous Systems

This is all about improving programmability for heterogeneous distributed systems. Prediction: CPU+GPU clusters will become more prevalent in the future, but yet they are currently hard to program for. With Dandelion, the goal is the holy grail: have the programmer write straight-line programs in a high-level language and have the automatically translated into the appropriate jobs for accelerators and cluster nodes. In the current prototype, the work is a little less ambitious: it auto-parallelizes data-parallel sections of LINQ programs for GPUs and CPUs. The Dandelion compiler will generate several nested data-flow graphs at different scales: cluster-level ones, machine-level ones and accelerator-level ones.

Consider a running example: k-means clustering. Expressed as 3-line LINQ query involving 4 operators and one UDF. However, some operators (such as GroupBy) are not trivially parallelizable on GPUs, which use a SIMD paradigm rather than control flow threads with traditional lock-based synchronization. For generating good GPU code, a number of quantities (e.g. the number of groups) need to be known. In order to address this, they do a multi-stage GroupBy. It turns out that the implementation actually generalizes to a generic data-flow graph! This is very handy, as it makes writing a LINQ->GPU compiler a matter of iterating over the data-flow graph and emitting the correct code.

Translation is at the .NET byte-code level. However, a problematic point is dynamic memory allocation (not supported well on GPUs): if possible, they transform it to stack-based allocation, otherwise they fall back to CPU execution. Invocation is lazy, i.e. code is only generated when computations are executed.

Two key questions in the evaluation: (1) Was programmability improved? (2) What is the cost of doing so, and what is the performance of the resulting system? For the first, only minimal programmer annotations are required (more details in paper). The performance of Dandelion's auto-generated code is within 50% of hand-coded CUDA implementations with optimizations applied; and about 17x faster than a serial C++ CPU implementation. Does this generalize to things other than k-means? Yep, seems to also work with other workloads (pagerank, skyserver, terasort), although not always (e.g. skyserver is faster on CPUs). Much related work, most closely related is LINQits in ISCA 2013. -- (ms)

Q: The distributed equivalent. What is Dandelion's tactic to parallelize over cluster? A: Similar to Dryad's way of parallelizing.

Q: Why are you doing the translation at the bytecode level and not at a higher level? A: We can get the AST and because it's a shared format targeted by all the .net languages.

Sparrow: Distributed, Low Latency Scheduling

Tasks in compute clusters are becoming shorter and shorter: queries on in-memory data sets take only hundreds of milliseconds. This results in a new challenge for schedulers: they must make very rapid decisions in milliseconds, and must yet achieve a good quality of placement. As many query workloads are based fan-out-fan-in paradigms, stragglers have a major effect on job makespan. All of this should be done fault-tolerantly, and support a high job throughput. Current centralized schedulers can deal with thousands of scheduling decisions per second, but this is not sufficient for the sort of very short tasks. The main advantage of centralized schedulers is that they have a global view of the system state; nevertheless, they fail to scale and provide fault-tolerance for the low-latency workloads targetted by Sparrow. The latter is a completely decentralized scheduler based on a probing approach; they show that it makes good placement decisions despite having no global view of the cluster.

In Sparrow, tasks are queued at the worker nodes. Schedulers simply randomly pick workers to assign tasks to. Comparing uniformly random placement to an omniscient, infinitely fast scheduler in simulation, it becomes clear that scheduling delay grows quickly as a function of cluster load. Other sampling approaches (e.g. sample two machines and pick the less loaded one) work better, but is still limited by the per-task sampling yielding outlier results for large jobs. Instead of probing individually for each task, they sample a number of machines (multiple of the task count) on a per-job basis (batch sampling). However, queue length is a poor predictor of delay at the workers in the presence of heterogeneous task runtimes. The solution is late-binding: the workers only respond to probes when they are ready to accept a task. With all of these features, Sparrow achieves performance only 11% worse than an omniscient scheduler.

Both per-job and per-task constraints are supported in Sparrow. However, with per-task contraints, batch sampling is no longer possible (late-binding works, though). So how does this perform in a real setting, rather than in simulation? Comparing to Spark's centralized scheduler, throughput on a synthetic no-op workload collapses for Spark at ~1300 ms task length; Sparrow continues scaling. For TPC-H, Sparrow's performance is much improved by the more elaborate features like batch sampling and late binding. Again, Sparrow is within 12% of the ideal modelled scheduler. Sparrow is also resilient to scheduler failure.

So when does it NOT work well? At very high load (around 90-95%), Sparrow's response time grows very quickly. However, high loads in clusters are not very common with the industry settings they investigated. It also does not work well with some policies that require global visibility.

Mobile apps often pull data in from servers. Minimizing the user-perceived response delay is important, as it has a strong correlation with with user satisfaction. There are many components in the user-perceived delay, one of which is the "server delay" (effectively the back-end processing). If the time budget for the back-end processing is increased, higher quality results can be provided to users. These budgets are typically fixed today, and do not take the variance of communication delay to the mobile device into account. Consequently, in this work, they take the position that the server should adjust its deadlines depending on information about the communication delay to the user device.

However, measuring the elapsed time between request origin and arrival at a server is hard, and so is estimating the response time until the response will reach the device. Timecard simplifies this task for developers, and exposes a simple two-call API to obtain the previously mentioned numbers. The way they achieve this is by attaching a transaction context (TC) to each interaction between threads. The TC gets passed along callbacks, RPCs and even HTTP requests. Another challenge is clock synchronization on the phone: they need a reference clock shared between the server and the phone in order to figure out delays. The solution is to send probes from the app to pull clock references; however, this is complicated by the fact that radio energy state transitions in the phone add extra delay to the probes. There is some trick that I did not catch to allow them to side-step this problem.

Predicting downlink delay is another challenge -- but they simply assume that the latency will dominate and take some TCP window size information into account. Now, middlebox window sizes also matter, and they build a decision tree model of the middlebox behaviour. They have lots of large data sets from big providers and divide these into training and evaluation data sets, and inform the model using these (?).

Evaluation: modified two existing services (mobile ads service and Twitter analysis service). With timecard deployed, the CDF of the end-to-end delay is much closer to the desired vertical line than without using Timecard. Good news: Timecard only has 0.1% runtime overhead, less than 1% memory and CPU overhead, and negligible battery impact.

Session 3: Seed corn

Fast Dynamic Binary Translation for the Kernel

Piyus Kedia, Sorav Bansal (IIT Delhi)

Dynamic binary translation at kernel-level is much harder than at user-level, as the frequency of interrupts and exceptions is much higher. For example, the Interrupt Descriptor Table (IDT) needs to point to the DBT dispatcher component in order to translate the ISRs. The dispatcher needs to now convert interrupt data on the stack to native values, emulate precise exceptions (ensuring all previous exceptions have been handled) and emulate precise interrupts (ensure interrupts are only handled at boundaries of native instructions, rather than in the middle of translation). VMware's software virtualization system can do kernel binary translation, but has huge overheads (~130x for Apache). Most of this overhead actually comes from exceptions and interrupts. Another existing piece of work is Dynamo-Rio Kernel (DRK), which similarly has overheads of factors of hundreds. BTKernel, their work, by contrast has only around 2x overhead.

The key insight in BTKernel is that one can be a bit more optimistic most of the time, as the OS kernel mostly does not require precise interrupts and exception handling. They handle special cases when this isn't true differently, but generally just take the seatbelts off. One example is the compile-time generated exception tables for page faults inside the kernel (e.g. in copy_from_user()): these contain ranges of PC addresses that are permitted to page fault. Of course, with dynamically translated binaries, these address ranges are no longer valid, and the exception table needs to be modified to reflect the addresses inside the DBT's code cache.

Their strategy requires that blocks inside the DBT's code cache must be immutable and are not evicted without sufficient handling, since there can now exist pointers into the code cache. They also do some neat tricks to improve memory layout and calling optimizations. In their performance analysis, they find that BTKernel sometimes performs not only much better than the competing systems, but sometimes even makes code *faster* than native. They explain the latter with better i-cache affinity.

Q: Are there any negative implications of the immutability restrictions on translated code blocks? For example, you couldn't do Transmeta-style successive optimization.A: No really thought about this, but you could make multiple copies of the block in question, and changes that do not affect the block functionality are okay (?)

Q: <missed>A: <missed>

Q: Why is there a speedup over native code? What evidence did you use to attribute it to the i-cache?A: <missed>

VirtuOS: An Operating System with Kernel Virtualization

Ruslan Nikolaev, Godmar Back (Virginia Polytechnic Institute)

Bugs, exploits or crashes inside the kernel (e.g. in device drivers) can cause a machine to crash completely. VirtuOS addresses this problem by virtualizing parts of the kernel, and isolating them in their own virtual machines. Ideally, would obviously like to do this at good performance close to a monolithic, non-compartmentalized system. The slices that they break the kernel up into are called network/storage service domains and primary domains (each of which runs a full Linux kernel, but only uses certain parts). Only primary domains can run user processes. User processes are dynamically linked against a modified libc that ends up forwarding syscalls to the correct domain via a shared memory region. The service domain, after dealing the with the request, modifies the queue of threads waiting to run, inserting the user thread at the head. More detail, but all fairly standard techniques.

Evaluation: code fully compatible, can run unmodified binaries, few lines of code modified. The show using a timeline of throughput that failure of the network domain does not affect a concurrently running storage domain. The overheads introduced by the extra memory copies on the system call path compete with the benefit of VirtuOS's system call dispatch not having exception overhead. As one would expect, for small storage accesses, the latter benefit dominates, while the VirtuOS's performance degrades as the data volume increases.

From L3 to seL4: What Have We Learnt in 20 Years of L4 Microkernels?

Kevin Elphinstone, Gernot Heiser (NICTA & UNSW)

20th anniversary of L4-based micro-kernels at this SOSP. Why is this significant? The 1993 paper showed that single-threaded IPC performance could be improved by 20x, which was a big deal back then. L4 is still known for its very high IPC performance. How was this achieved? By strong adherence to the "minimality principle", permitting concepts inside a micro-kernel ONLY if they absolutely need to be in there. The L4 source code is still fairly concise, and this remained true across all of the L4 variants. L4 has been deployed in billions of devices, and is the only formally verified OS kernel in existence.

Consider some of the original defining characteristics and how they evolved. One key principle of L4 was the synchronous IPC model, where a thread would block after sending a message. One extra feature was "long IPC", which however required concurrency inside the kernel code, and invocation of user-mode page fault handler. Long IPC largely existed in order to provide POSIX support, but was abandoned for lack of necessity. Another IPC-related feature are timeouts -- these were actually overloaded to provide timed sleep functionality (by receiving from a non-existent thread). However, nobody really know what the right values for timeouts were, and some L4 variants (OKL4, seL4) ended up abandoning this concept. Synchronous IPC has some general issues, so some people introduced asynchrony into L4 (permitting message sending without blocking). While original micro-kernels had the problem that asynchronous IPCs require the kernel to buffer data, but L4 variants side-stepped this problem by limiting the buffering to a single word and introduced concepts like the seL4 asynchronous notifications. Instead of thread IDs, modern L4 kernels use capability-protected endpoints as these avoid covert channels and reduce redundant IPC duplication. Capabilities also replaced hierarchical delegation models for protection. L4 also abandoned multiple kernel stacks using virtual TCBs and lazy scheduling. Even more lessons in the paper.

Looking at the overall score board, few design principles were actually retained as-is; several were modified, and many abandoned. However: minimality is a proven success in L4, IPC speed still matters and capabilities are the way to go. The "last interesting problem at the kernel level" for the community is a clean abstraction of time.

Q: Talked about user-mode page handler. When you have a page fault, do you have to lock the entire address space of the process?A: Every address space has only one page fault handler, so this isn't a problem.