Early last fall, I started working on a project called js.js with two other graduate students, Naga Katta and Stephen Beard. We started using a public Github repository from the start, and at the beginning of January, the author of three.js found and tweeted a link to our repository to his many followers. Soon after, a post wound up on Hacker News. Unfortunately, we weren’t very far along in the project yet, and weren’t really ready to show anyone our work, so there was a lot of confusion and negative criticism.

We’ve reached the point where js.js performs reasonably well and has a decent amount of functionality implemented. In this post, I’ll outline what js.js is, how it works, demonstrate a sample application that uses it, and show results of a performance analysis.

js.js is a JavaScript interpreter (which runs in JavaScript) that allows an application to execute a third-party script inside a completely isolated, sandboxed environment. An application can, at runtime, create and interact with the objects, properties, and methods available from within the sandboxed environment, giving it complete control over the third-party script. js.js supports the full range of the JavaScript language, is compatible with major browsers, and is resilient to attacks from malicious scripts.

Our initial prototype implementation of the js.js runtime has been created by compiling the SpiderMonkey JavaScript interpreter to LLVM bytecode using the Clang compiler and then using Emscripten to translate the LLVM bytecode to JavaScript.

Emscripten

Emscripten, a project by Alon Zakai, is an LLVM-to-JavaScript compiler. It takes LLVM bitcode (compiled with an LLVM frontend like Clang) and compiles that into JavaScript, which can be run on the web. There is some great technical documentation on how this works on the Emscripten wiki, so I won’t go into too much detail, but I’ll give an example of this process.

This is where Emscripten comes in. It takes this LLVM bitcode and generates JavaScript instructions. Rather than trying to emulate the LLVM, it actually translates operations that it can into their JavaScript equivalent. Here’s what it looks like after translation:

To see this working in actions, here’s a jsfiddle showing this function calculating the 20th number in the Fibonacci sequence. Notice an important thing happening in this translation: the LLVM bitcode is not getting emulated. It has actually translated addition, assignment, and comparison operators into their equivalent JavaScript form.

js.js

To create js.js, we ran Emscripten on SpiderMonkey, the JavaScript engine used in Firefox. SpiderMonkey comprises about 300,000 lines of C and C++ code. Much of our effort was spent patching SpiderMonkey to get it to compile in Emscripten’s environment, a limited subset of libc. We also had to disable all assembly routines and just-in-time (JIT) compiling features of SpiderMonkey, since assembler is not available in JavaScript.

Once we had the SpiderMonkey API available, we wrote a wrapper script that makes it much easier to use the library from JavaScript. The js.js API wrapper is about 1000 lines of JavaScript and allows you to create sandboxed environments and execute code in them. The following is an example that shows how to use the API to run 1+1 in a sandbox and get the result as a number:

Performance Evaluation

We wanted to quantify the performance overhead at the microbenchmark level and the macrobenchmark level. For the former, we timed each of the js.js functions, and for the latter, we used the SunSpider JavaScript benchmark.

Microbenchmark

To quantify the performance of each the js.js API function, the following table shows the mean across 10 runs of execution time (in milliseconds) of each function called by the above 1+1 example:

Operation

Mean Execution Time (ms)

libjs.min.js load

84.9

NewRuntime

25.2

NewContext

35.8

GlobalClassInit

15.5

StandardClassesInit

60.1

Execute 1+1

70.6

DestroyContext

33.3

DestroyRuntime

1.8

The execution time here isn’t great, but it’s within the range that the library is usable. It takes about 220ms of setup time to get an execution environment up and running.

Macrobenchmark

We also wanted to compare the performance of js.js with native JavaScript execution. We took the SunSpider JavaScript benchmark and ran it natively using the SpiderMonkey js shell (with the JIT turned off). We then ran the benchmarks again using js.js. The following graph shows the median factor of slowdown for running each benchmark inside js.js in both Firefox and Chrome:

This benchmark shows that on average, running code inside js.js is about 200 times slower than native execution. Considering that JavaScript is being run instead of native x86, two orders of magnitude is not terrible. Depending on the scripts being executed in the sandbox, the performance overhead might be acceptable. We’ve also looked into where most of the execution time is going, and we found that the interpreter loop is being converted into a single JavaScript function that is thousands of lines long. JIT compilers don’t handle this case very well, so we’ve been brainstorming ideas on how to break up this interpreter loop into separate functions, that could help improve performance.

Demo

As a demo, we took the JavaScript used to render the Twitter button and ran it inside js.js, giving the script virtual access to a DOM. This allows us to run the script, while maintaining complete control over what the sandbox can do to the real DOM.

Conclusion

This project has been a lot of fun to work, and we recently had a demo paper about js.js accepted to WebApps 2012. For more details, check out the js.js paper. The source code for js.js is available on GitHub, so if you’re interested in the project, you’re welcome to fork it and try it out. Pull requests accepted!

ACM recently released the ACM Authorizer service. It allows authors of ACM papers to link to ACM’s digital library from their website. Users who click the link will receive access to the author’s paper free of charge.

The main concerns I had about changing my website to link to this new service are:

If ACM’s website is down, my paper is no longer accessible

An independent copy of my paper’s PDF would not longer be linked in Google’s search index

There is an easy workaround to this problem. You can keep your markup linking to a local PDF of your paper and use javascript to rewrite the link.

I chose to use a class for my ACM links because a class-based lookup is very fast. I put the ACM digital library identifier into the id attribute of the element, but it has a prefix because HTML id’s are not allowed to start with a number (so my HTML still validates).

Now, it’s easy to add some javascript to rewrite the link on page load:

The first script tag loads jQuery from google’s CDN. The second script tag finds all a elements with a CSS class of “acm_authorizer” and changes their href attribute to point to the ACM digital library — getting the ACM ID from the id attribute of the element.

With this, users who visit the site with JavaScript enabled will be directed to ACM, but the markup of the page points to a copy of the paper on my site, keeping it in the index of search engines. You can see this in action at my website.

I recently attended OSDI 2010 where I sat through about 30 presentations on systems-related topics. I was surprised that there were so many occurrences of things that I think should be avoided when giving a presentation.

In this post, I’m going to outline several things that, in my opinion, you should avoid when giving a talk at a conference. Keep in mind that this is just my opinion and you’re welcome to disagree, but I think you’re wrong and I’ll explain why.

Note that the examples below are all taken from the OSDI slides that were posted after the conference. If you’re an author of one of these presentations, please don’t take it as criticism of your work or you. The criticism is only aimed at your stylistic choices when creating your presentation.

Dark Backgrounds

This is probably the worst decision you can make. Now it’s definitely true that dark backgrounds can look nice. In fact, they can look particularly good on the screen where you created the presentation, which is why I think it’s an attractive choice for a lot of people.

The problem with this is two-fold. First, a more minor reason, is that it’s difficult to keep things consistent when using a dark background. Often you include things like images, tables, diagrams, and graphs. Many of these will have white backgrounds, so to get them to fit with your presentation, it’s extra work that you have to do.

Second, the major reason not to use a dark background is that it makes your slides much harder to read when displayed on a projector. The way that the human eye perceives blackness is relative to what’s around it. When you project black text on a white background, the black color of the text looks very black because of the large, bright, patch of white around it. When you project white text on a black background, the background looks washed out because there isn’t much light around it. This effect is also amplified because projectors can’t actually display a real black color. Black is the absence of light, so a projector simply doesn’t project any light at the pixels that are supposed to be black. Since conferences are always in brightly lit rooms, the black background ends up looking like a pale gray.

Ugly Fonts

This presentation uses the Chalkboard font. It turns what should be a technical diagram into what looks like a 4-year-old’s drawing. It’s very distracting. Comic Sans also made an appearance at OSDI. Please use reasonable fonts. You can also see the poor choice of a dark background here as well.

Outline Slides

I’ll admit that outline slides can have their place. For example, if you’re giving an hour-long talk, it can be nice to give an overview to your audience to tell them what you’re going to cover. When you only have 22 minutes, however, wasting time telling us what you’re going to talk about instead of just talking about it is silly. In this particular talk, they went back to the outline slide each time they switched sections. Even if it only takes a few seconds each time, you probably wasted a full minute just telling me what you were going to talk about. Especially when you’re throwing darts at your bullet points. No clue what they were thinking.

Underlined Text

This is not a deal breaker, but I still advise against it. Underlined text usually looks pretty ugly, as it does here. Avoid it.

Header/Footer on Every Slide

Sometimes these can be tastefully done, but many people abuse it. In the first example, the presentation had a giant bar that says “Facebook” on every single slide. Perhaps it’s company policy; I don’t know, but it looks obnoxious and I don’t need to be reminded every single slide that this presentation is from Facebook. In the second example, it even includes the name of the conference in the footer. I often forget which conference I’m at in the middle of a talk, so I’m glad it was there. But really, you don’t need a header and/or footer on every slide. It’s distracting and unnecessary.

Walls of Text

I think people are wising up to this, but there were still a few cases. If you put too much text on a slide, then one of two things will happen: either people only pay attention to what you say – making the slide ineffective, or they spend time reading the slide instead of listening to you. Instead, use the text on your slide to support and enhance what you’re saying. The first example here just has too much text. The second example has a lot of code. Both make it difficult to simultaneously listen to the speaker and read the slides.

Conclusion

Again, this was just my opinion, but if you can avoid these things, please do. It will enhance your presentation and make it so people don’t get distracted.

Introduction

I’m currently working on a project called Sirikata, an open, programmable, scalable, secure, and extensible virtual world.

One aspect of this virtual environment is application-level messaging. Users can create objects in our world, and these objects can communicate with each other. To get an idea of what application-level messaging looks like, we wanted to take a look at one of the world’s biggest virtual worlds: World of Warcraft (WoW).

In WoW, users can create an “Addon”: a user-interface modification component that can add enhancements to the game. To communicate, these addons (written in Lua) have access to a function called SendAddonMessage.

WoW SendAddonMessage Function

Sends a message to the hidden addon channel.

SendAddonMessage("prefix", "text", "type", "target")

prefix: String – Message prefix, can be used as your addon identifier.

target: String – Used only for “WHISPER” communication – the player to whisper to.

Example:

SendAddonMessage( "CTRA", "User Data: XYZ", "RAID" );

Notes:

Calling this function results in the event CHAT_MSG_ADDON being invoked on:

<target>’s client if <type> is “WHISPER” or

all clients in the <type> chat channel, otherwise

The combined length of message and prefix can be at most 254 characters (255th is the tab character used to delimit the two)

Length above 254 will disconnect you.

Example Addons

EPGP

EPGP is an Effort/Gear reward system and addon for World of Warcraft.

It is a DKP system, described below:

Dragon kill points or DKP are a semi-formal score-keeping system (loot system) used by guilds in massively multiplayer online games. Players in these games are faced with large scale challenges, or raids, which may only be surmounted through the concerted effort of dozens of players at a time. While many players may be involved in defeating a boss, the boss will reward the group with only a small number of items desired by the players. Faced with this scarcity, some system of fairly distributing the items must be established. Used originally in the massively multiplayer online role-playing game Everquest, dragon kill points are points that are awarded to players for defeating bosses and redeemed for items that those bosses would ‘drop’. At the time most of the bosses faced by the players were dragons, hence the name.

EPGP uses the messaging system to send information between users of the addon about the values of loot. For example, if an item drops for the first time, the “loot master” might assign a value to the item. This value is then shared with other people using the epgp addon.

Properties

small message size

very infrequently sent

must be delivered

Gatherer

Gatherer is an addon for herbalists, miners and treasure hunters in World of Warcraft. It’s main purpose is to track the closest plants, deposits and treasure locations on your minimap.

In WoW, players can gather resources. The resources are spread across the world and spawn in what seems to be a randomized fashion. In fact, though, the spawn locations are actually preset, so this addon tracks the location where users find resources to make it easier to find later.

Gatherer uses the messaging system to share the location of found resources between players. For example, when you find a resource, you can have it automatically share the location with your guild members. This way, if many people in your guild use the addon, you will get a nice compiled database of resource locations.

Properties

small message size

medium send frequency

not important if messages are delayed

not important if some messages are completely lost

Recount

Recount is a graphical damage meter.

In WoW, players fight bad guys. To do this, they have to deal damage. A metric for evaluating players is to look at the amount of damage they deal per second (DPS).

Every action that players within your party take is logged, which includes actions that result in damage being dealt. However, you only get log information for players that are within a short range of you. Because of this, in some encounters, some players might be too far away from you for their actions to be logged.

To make the meter more accurate, Recount uses the messaging system to aggregate logging data about damage being done.

Properties

Medium size messages

Heavy traffic load during encounters (about 10 seconds to 6 minutes)

Latency effects instantaneous accuracy during an encounter

Syncing algorithm still works if some messages are lost

Data Collection

To study WoW, I wrote an AddOn called AppMsgLogger. This AddOn collects three pieces of information:

A timestamp every time the user logs in or out

For every addon message received (the CHAT_MSG_ADDON event), a timestamp, the zone the user is in, the channel it was received on (PARTY, GUILD, RAID, etc), and the length of the message in bytes

A list of the AddOns the user is running

I had several players run this AddOn for me and send me the data they collected. As a result, I have data consisting of 49 hours of time played across eight players.

Results

Out of the 173,456 messages received, here is the breakdown of the channel they were sent on:

Battleground: 262 (0.15%)

Whisper: 1,240 (0.71%)

Party: 3,185 (1.84%)

Guild: 21,439 (12.36%)

Raid: 147,330 (84.94%)

So, only less than 1% of the traffic is unicast (Whisper) while the majority is broadcast traffic related to combat (Raid) and social organization (Guild).

Next, I wanted to look at the breakdown of message size. Here’s a histogram plotting message size frequency:

You can see in this histogram that the vast majority of messages are small (< 50 bytes), with a small cluster of messages near the maximum size of 254 bytes. This is probably because most addons use a library called AceComm which splits messages longer than 254 bytes into multiple messages.

Next, I wanted to see the distribution of messages across the different addons sending them. I matched each prefix string to its corresponding addon and plotted their message frequency:

Note the log-log scale. What we see here is that a small number of addons send the majority of messages, with a long tail of addons that send a very small number of messages.

Next, I wanted to look at how bursty the message traffic was. To get an idea, I took a single user’s 24-hour play time (over multiple sessions), strung it together, and calculated the average byte rate of each 1-minute interval (click for full size).

What you can see in this graph is that there is considerable variability: some periods show a high rate (up to 14 Kbit/s) while others show a rate of almost zero.

To put these numbers in context, I added a background color that alternates every time the user switched zones. For zones that were longer than 10 minutes, I also added the name of the zone in the center of its colored region. As expected, high-rate periods are seen during raids (Icecrown) while the user was engaged in combat. Low-rate periods are seen when the user was presumably idling in capital cities (Orgrimmar, Dalaran).

The last thing I wanted to see was what percentage of messages were wasteful. Since messages are broadcast to their corresponding channels (GUILD, RAID, etc), users in the channel will receive all messages sent to the channel regardless of whether they have an addon actually listening to a given prefix string. Since I had a mapping of prefixes to the sending addon, I calculated the percentage of messages that were received by each user with a prefix string for an addon that the user wasn’t currently running. The percentages for the users that I collected data from are as follows: 58.9%, 28.2%, 15.9%, 12.6%, 9.5%, 86.1%, 83.9%, 17.0%. Since the sample size is low and variability is high, I’d be hesitant to draw any conclusions from those numbers, but it leads me to believe that a significant amount of traffic that is broadcast is wasteful.

Conclusion

I hope you’ve found this analysis useful. We’ll be using some of this information as we continue to design and implement Sirikata.

I’m back in Princeton after spending the week in Indianapolis, Indiana for the first ACM Symposium on Cloud Computing (SOCC 2010). I’m posting here with a brief summary of each talk from Day 2 of the conference as well as some of my thoughts. These are my reactions to the presentations only, as I haven’t read most of the papers.

Keynote 2

Jason gave a great presentation about the architecture at Facebook and some of the lessons their engineering team has learned in the last several years. In the last three years, Facebook has seen a factor of 20 growth in the number of users they support – from 1 million users in 2007 to 400 million users in 2010. With this high rate of growth, the infrastructure team has had to keep the site running in addition to innovating.

The traditional website scaling technique is to have all of a user’s data stored in a single server so that data can be partitioned easily and therefore scale. The social network that Facebook has can’t use this method because the data is interconnected and there is no way to partition the user graph. Instead, Facebook uses a number of techniques – aggressive caching, a multi-tiered infrastructure, and innovative APIs for developers that allow for applications to easily scale.

The three tiered structure that Facebook uses consists of:

a web server front end running PHP that processes user requests. Not much to say here except that Facebook now compiles PHP to C++ to speed up execution

a caching layer that uses memcached which is simple, fast, and reliable. They have 10’s of TBs of memory for caching!

a database layer that uses MySQL. They have 8,000 server-years of run time experience without data loss or corruption. Since most requests are cached, the traffic to MySQL stays low, but they require a 3:1 ratio of the innodb table size to RAM for each server in order to keep up with request rate.

One interesting caveat here is that they are double caching most data. It’s hard to provide enough caching space to only cache in the memcache layer or the database layer, so the double-caching approach is the simplest option right now for Facebook, but looking into reducing this in the future.

Jason also talked about their Hadoop/Hive/Scribe system used to run MapReduce queries as well as a system called TAO that they use to provide a simple API to their developers for writing applications. Rather than developers having to write database and memcache queries, TAO is an API-aware layer built on top of memcache that provides easy to use abstractions.

Overall, a great talk with nice insight into Facebook’s infrastructure.

The Hermes project comes out of MSR where they are looking into optimizing the storage of e-mail messages in Microsoft Exchange. The normal implementation for Exchange is to take a new user account and locate their messages on any server in the Exchange cluster that has space available. By doing this, it splits load across the Exchange servers well.

The problem is that when two users that are on different physical servers receive the same message in their mailbox, the message has to be stored twice – once on each server. Hermes looks into trying to cluster users to minimize the amount of storage redundancy and network traffic between servers.

An example system was given that has 128K users on 68 exchange servers (1800 users/server). The sample size was 22 weeks of e-mail which contained 337M messages. They calculated that there is 3TB of extra storage required per week to store messages. By using a clustering algorithm to cluster together users that frequently e-mail each other, Hermes was able to save 55TB of storage over the 22 weeks of e-mail. Since moving user mailboxes is expensive, they suggest partitioning once and then re-partitioning once a week.

This was nice work, and seems like it solved a big problem with Exchange, although I question the architecture of Exchange to begin with. Services like Gmail do a much better job of storing e-mail messages and don’t run into problems like this.

Keith talked about the current state of how (non-computer) scientists use cloud services to do their work. Many scientists use things like Hadoop to process large datasets (e.g. Large Hadron Collider). There are three main types:

Large scale synchronous workloads – generally run today on supercomputers

Medium scale synchronous workloads – can use local clusters

Large scale asynchronous workloads – these can be parallelized well so can be run on cloud services

He talked about how the interfaces to cloud services are not standardized, so once chosen, scientists are locked in to a single provider. The suggestion was that cloud services should be more standardized and make it easier to share and collaborate among scientists.

This work focused on trying to provide a way for non-experts to write Internet services that scale well. Non-experts now have access to large scale cloud platforms like Amazon, Google, and Microsoft, so hardware is no longer the bottleneck when needing to scale an application. Instead, it’s difficult to architect an application.

The current techniques for scaling Internet services are things like tiering, partitioning, replication, data duplication and de-normalization, pre-computation, and others. These techniques are notoriously difficult to build and deploy correctly. It would be nice for non-experts to have a way to write applications using these techniques easily.

In comes Fluxo – a compiler that takes input in the form of a restricted programming language and translates it to run on Azure. The restricted programming language simplifies program analysis and allows for the use of the common techniques listed above. Fluxo is profile driven, so it collects metrics, analyzes the program, transforms it, and repeats.

In an evaluation, pre-computation can get benefits of up to 60% in performance, and caching can decrease latency by up to 50%. Seems like a nice idea – I can see small to medium size companies finding a platform like this useful.

Data-centric parallel programming uses schema-free systems like MapReduce and Hadoop to run efficiently. Relational databases, on the other hand, are schema bound, have well-defined properties, are flexible and optimizable, but don’t scale well. This work looks at extending the capabilities of MapReduce. The notion of first-order and second-order function is introduced. First-order functions are user code, while second order functions are things like the Map and Reduce steps today.

PACTS extends the MapReduce paradigm with some additional functions. For example:

Cross (two inputs) – each combination of records from the two inputs is built and is independently processable

Match (two inputs) – each combination of records with equal keys from the two inputs is built

CoGroup (many inputs) – pairs with identical keys are grouped for each input, group of all inputs with identical keys are processed together

PACT programs are compiled to parallel data-flows to be executed in Nephele. Nephele is an execution engine for parallel data-flow programs. It provides the process scheduling, communication, and fault-tolerance mechanisms needed to execute the workload. Example given of PACT workloads are TCP-H aggregation and K-Means Clustering. I’m not too familiar with running MapReduce workloads, so not sure of the impact this work will have.

This work looked at the tradeoff between RDBMS’s and NoSQL-type databases. RDBMS’s have powerful declarative query languages, have a query optimizer that decides the best execution plan, has automatic parallelism, and is performance-opaque. Contrastingly, NoSQL has a simple query interface (get/set), but complex queries have to be implemented at the application layer and are non-trivial to architect.

The argument is made to use a system called PIQL which is designed to be a middle ground between the two extremes. PIQL is a scale-independent declarative language that only allows developers to write queries with a data-size independent upper bound. PIQL is able to say no to certain queries, where as an RDBMS can’t.

Most applications have natural cardinality constraints. Example given was that a Facebook user has an upper limit of 5000 friends. PIQL lets developers specify these constraints in their data model so the optimizer can use them to decide on an optimized execution plan. If a query would be too slow, PIQL rejects it. PIQL is designed to run on top of existing key/value data-stores and it scales well. I think this is the right direction for providing easier ways to write NoSQL-type applications. I’m interested in looking more at what their API provides.

Shivnath described some of the difficulties with using MapReduce today. The lifecycle of a MapReduce job is that the programer first provides the Map and Reduce functions, kicks off the job, which gets split into many Map waves, followed by many Reduce waves, until it finishes.

The complexity here is that the user has to specify things like the number of maps, the number of reduces, and memory usage. There are over 190 parameters that a user can specify in Hadoop, and for many, it isn’t clear how they impact performance. As an experiment, a 50GB terasort was run on EC2. By changing parameters, the running time drastically changed, and certain parameters depend on each other, so individual parameters can’t be tuned separately.

The argument was that a better approach should be used for default parameters in Hadoop. There has been a lot of work in traditional RDBMS’s and these techniques should be leveraged to try and tune MapReduce workloads.

This work looks at how to properly evaluate a NoSQL cloud database. There are many systems out there – PNUTS, BigTable, Megastore, Azure, Cassandra, CouchDB, and many more, but the tradeoffs between systems are not clear. It would be nice if there was a standard benchmark that could be used to analyze the performance of each system so that developers can choose the appropriate solution to meet their needs.

In comes YSCB – an open source, standard benchmark tool for evaluating NoSQL systems. YCSB focuses on performance (latency and throughput) and scalability (adding servers, latency as system scales, and elasticity). The YCSB tool is written in Java, which provides easily extendable workloads.

A test setup was run on six server-class machines, and systems evaluated are Cassandra, HBase, sharded MySQL, and PNUTS. Some of the example workloads given were (more in paper):

A – Update heavy workload that had 50:50 ratio of reads to updates

B – Read heavy workload with 95:5 ratio of reads to updates

E – Short range scans of 1 to 100 records of size 1KB

Elasticity – Run a read-heavy workload while adding servers from 2 to 6.

The graphs showed a clear distinction between different systems and provided insight to which systems would be best in different situations. It was also noted that these tests can be used to improve the database system, as happened with Cassandra which improved dramatically in subsequent version releases.

This looks like a great tool for evaluating the plethora of NoSQL systems out there. It would be nice to have a standard setup (e.g. on Emulab) so that the comparison between systems is fair and repeatable.

Stefan pointed out that when we install privileged software today (like drivers) we have to take their reliability on faith. Third-party drivers run with the highest privilege in the operating system, so if they have bugs or malicious code, serious problems can occur. The analogy made was to cars – we wouldn’t give our car keys to a random driver without some guarantee of the car’s safety.

It would be nice to be able to test a driver by exploring all possible code paths and look for common bugs and problems. Testing as a Service (TaaS) is a system for doing just that. TaaS can be used by home users by uploading a driver to a website and evaluating it, by developers who can integrate testing into their IDE’s for convenience, and as a commercial certification service for software.

This seems like a nice idea, although I question how much traction it could get. A lot of Windows drivers today don’t get certified because companies like to cut corners and produce a product for the least amount of money possible. It seems like even if this service was provided, companies would still cut corners and just skip using it to save money. Being able to test as a home user is not that useful, as a failed test would just give you the option of not installing the software, the same option you have today. Perhaps if a big player like Microsoft started requiring certification for driver installation, it could get the traction it needed.

Keynote 3

This was an interesting keynote. I wasn’t very familiar with exactly what Salesforce provided to its customers before this talk. I knew that a lot of companies used their services, but I didn’t realize the scale they were working at. Salesforce provides cloud services for 75,000 customers running 150,000 custom applications and serving 17,500 websites.

They provide many services, including a query language, API’s, OLAP, batch processing, search, mobile, content management, packaging, and many others. What’s interesting about Salesforce’s architecture, though, is that each customer (or “tenant” as they call them) is not given its own architecture. Instead, all tenants run on a single platform, supported by a single code base.

Each username is mapped to a tenant, which is mapped to an “instance”. An instance is a cluster of hardware including a SAN, application servers, load balancers, object storage, etc. To provide isolation, each tenant runs as a separate Java thread. They have a three-tier caching layer, first caching locally within each Java thread, then caching with remote requests, and finally providing an application-layer service to users running on memcache.

Underneath the cache is a persistence layer that uses Oracle. Every customer’s data is stored together in a table, with static (compile time) and dynamic (run time) query analysis that enforces that a customer can only read their own data. All data is stored as a string in the database, so to provide indexes, a separate table has a column for each possible type which can then be indexed.

Salesforce only maintains a single code base for their infrastructure. To provide compatibility with previous releases, functions are annotated with API version numbers, such that if a customer is still using an old version, they don’t get access to changes in the API. When customers package their own code, they are required to submit unit tests that cover most of their code paths, so when new versions are released, Salesforce uses customer tests to verify compatibility with old API versions. An analytics component is provided that does real-time calculations instead of offline processing like most other analytics systems.

This was a nice talk. The presentation was a little dry (Rob didn’t seem very enthusiastic about their services!) but it was still interesting.

Data Services

G-Store addresses the weakness of most NoSQL databases that they don’t provide transactions over multiple keys. Many NoSQL databases provide test-and-set type semantics for a single key, but can’t provide any guarantees about multiple keys. To do this, users have to implement any transactional logic in the application layer.

G-Store is a layer that runs on top of a key/value store and provides additional functions. Users can submit a list of keys to add to a group. When a group is created, a single node is assigned as the controller for this set of keys. Since a single node controls the group, it can execute transactions over the group. Once a user is finished, the group can be deleted.

This work is similar to our own CRAQ work, which can provide multi-key transactions over a chain. CRAQ only provides static group membership, in that it doesn’t support explicitly moving keys around after their initial creation, but by leveraging Sinfonia-type mini-transactions, it provides the same service as G-Store. I was disappointed that we weren’t cited.

Google Fusion Tables is a service that allows data management for the web era. Google wants to provide data storage that can integrate seamlessly with the web, is searchable, embeddable, extensible, easy to use, provides incentives for sharing data, and facilitates collaboration.

The example given was biking trails in Europe. There is a website that allows users to upload GPS data of bike trails. They want to allow data like this to be integrated into the Google Maps API for visualization and allow users to query it easily, e.g. find trail lengths < 25 miles.
Google Fusion Tables allows users to extend data tables in order to collaborate. Tables can be merged. For example, a user might merge the bike trails table with a sponsors table and a results table. Users can have discussions about the data, set permissions, and perform other types of operations.
The architecture to support these operations has a front end dispatcher that receives requests, a query processing component, a backend that does plan execution, and a replicated storage layer running on BigTable. This seems like a nice, useful API. Looking forward to checking it out.

This work tries to address some shortcomings in Hadoop. In MapReduce, there is a two stage computation: the Map step and the Reduce step. Between processing steps, there is a large amount of intermediate data that has to get shipped from server to server. For example, Google’s indexing application runs a chain of 24 MapReduce jobs. In many cases, this intermediate data is larger than either the input data or output data.

If a MapReduce node fails, the computation it was supposed to provide has to be rerun on another node. The problem is that to rerun the computation, the intermediate data that was sent to the failed node has to be sent to the new node as input. If it’s not the first step in the chain, this intermediate data is no longer available – it was the output of the previous step. To recover it, the previous step has to be rerun as well. This is a problem when chaining multiple steps together, because the failure will cascade and cause every previous step to have to be rerun.

This work extends Hadoop by replicating this intermediate data. Unfortunately, this isn’t a simple thing to do because most MapReduce jobs are network-bound (verified experimentally), meaning that adding additional replication slows down the job. An observation is made, though, that often the bottleneck is at the inter-rack switches. The network capacity within a rack is not saturated. To take advantage of this, replication can be done at the rack level.

Failures really do occur often in large MapReduce clusters (verified from industry reports), and an experimental evaluation shows that when using rack-level replication, jobs can be sped up significantly when failures occur. When no failures occur, there is only a small overhead associated with the replication. This seems like a nice addition to Hadoop, and I hope to see it added to the main branch.

This work looks at hardware failures in large scale datacenters. Microsoft has 100K+ servers that are geographically distributed around the world running heterogeneous hardware. They observe lots of failures and have to deal with them with a support staff. Since a person has to diagnose hardware faults and perform the necessary fix, failures come at a high cost.

The dataset used here was from Microsoft’s online services. This runs things like Messenger, Hotmail, etc. The dataset was over a 14 month period, looking at 100K+ servers, and contained a work log from their datacenter. The work log records which server fails along with what component had to be replaced (e.g. disk, dimm, raid controller, etc).

The analysis they performed looked at characteristics in aggregate, failure rates across individual servers and datacenters, and tried to see if failures could be predicted. The method used was a decision tree. Some interesting takeaways:

For every 100 machines, 9.5 machines recorded a failure annually

For every 100 machines, 20 failures were recorded annually (so 2 repairs per machine)

I really have nothing to say about this talk. I couldn’t understand anything from the presentation at all. I verified it wasn’t just me – others that I asked didn’t understand it, and there were no audience questions either. This is a great example of how not to give a talk.

This work looked at trying to provide a distributed storage service that is power-proportional. This means that as the load on the system increases, the power consumption should also increase proportionally. This is not the case today – idling servers still use a large percentage of the required power for peak load.

One way to reduce power consumption is just to turn off servers during times of low demand, but traditional distributed storage services use methods like consistent hashing to get good load balancing, so if you shut off a server, the data on that server is no longer available.

Instead, what Rabbit does is store a primary replica of every key in a small number of nodes. A secondary replica goes to a larger number of nodes, a third replica to an even larger set of nodes, and a fourth replica to the largest number of nodes. By doing this, you can turn off 95% of your servers without losing availability. If you have 5 servers on, each serves 20% of requests. If you have 100 servers on, each serves 1% of requests.

This allows for fast, fine-grained scaling by simply powering off machines. An example benchmark of a 100GB terasort shows that read throughput scales linearly with percentage of maximum power used. However, write performance decreases because all data has to be written to a small number of nodes. A method is used from previous work to offload writes to get better write performance.

RACS is a method for combining multiple cloud platforms together for data storage, such that a single vendor doesn’t have to be relied on. Today, when choosing a cloud datastore, customers are often locked in because of the high cost of switching providers. To transfer a large amount of data from vendor A to vendor B, the customer must pay for the bandwidth cost of the transfer twice – once for each provider.

What RACS does is use erasure coding across many cloud platforms to reduce the costs associated with switching vendors. The notation RACS(k,n) is used, where it stripes data over n providers such that any k-subset contains a complete copy of the data. This can tolerate up to (n-k) failures before data is lost.

An evaluation used an FTP trace from the Internet Archive over 1.5 years. The cost of three providers over time were very similar. Using RACS is more expensive: RACS(1,2) is much more expensive, while RACS(8,9) is only slightly more expensive. However, the cost of switching providers is very expensive when not using RACS, on the order of two-months of operating costs, compared to much cheaper cost when using RACS. This is interesting work – I hope they implement more cloud platforms in addition to their current prototype.

Web applications often experience periods of spikes in traffic that are unpredictable. For example, Wikpedia, after the death of Michael Jackson, saw 13% of total requests to the single page of Michael Jackson. The first contribution of the paper was to characterize these spikes. There are two main types – workload volume spikes consisting of the time to peak, duration, magnitude, and maximum slope; and data hotspots consisting of the number of hotspots, spacial locality, and entropy.

The conclusion was that there is no typical type of spike because they vary widely. The next contribution was to try and generate workload spikes using the 7 characteristics listed above. They were able to create a generative spike model using two parameters N, the number of hotspots, and L, a clustering parameter.

After using the generative model, they were able to match real world examples they collected with their counterpart from the output of the generative model, validating that the model produces valid workload spikes.

Conclusion

SOCC turned out to be a great conference, and was definitely a success overall. The next SOCC conference will be co-hosted with SOSP in 2011 in Portugal, so I’m looking forward to that if I’ll be able to attend!

I’m currently in Indianapolis, Indiana for the first ACM Symposium on Cloud Computing (SOCC 2010). I’m posting here with a brief summary of each talk at the conference as well as some of my thoughts. These are my reactions to the presentations only, as I haven’t read most of the papers.

Jeff Dean gave a keynote about different services running at Google and general principles on how to build large-scale services. The talk was roughly divided into three parts. The first part was about Google’s data centers that house a few hundreds of clusters. Each cluster has thousands of machines with one or a handful of configurations. Each machine (at least) runs GFS and Colossus (next-gen GFS), and a cluster scheduling daemon.

The second part was about Google’s back-end services including MapReduce, BigTable, and Spanner. A few interesting notes about these systems are:

BigTable now has a dedicated team that manages BigTable service clusters. Because of this, there has been a lot of work on fair-share scheduling and performance isolation.

BigTable has something called coprocessors, which are basically “arbitrary code that runs next to each tablet in table”.

The third part was about experiences and design patterns for building large-scale services. There were too many design patterns to post here (Jeff said he would post slides on his website), but here are a few which I find interesting:

Use back-of-the-envelope calculation whenever possible before choosing a design

Design for growth, but don’t overdo it (e.g., 5x – 10x OK, but not 100x growth), and

Canary requests: some requests crash every single process in the same service, so try a few machines first.

Overall, it was an interesting talk about quite a broad set of topics. There are only a few places where you can accumulate wisdom about building truly large-scale systems, and it is always interesting to see what they are doing to cope with the scale.

This work addresses the issue of how to create applications that run in the cloud. Machines today have many cores (16 today, up to 1000 in five years), and parallelizing across many cores is very difficult. They created a new system called Fos (Factored Operating System) which runs on top of Xen and splits the OS into separate services. The kernel is a microkernel and all services run on top (e.g. File System). The communication between components in the system is done with message passing. There are no shared locks.

Because of this abstraction, each component can be adjusted elastically to handle demand. For example, if an application starts accessing the File System component at too high of a rate, the underlying system can spawn another File System component on another core or another machine and start transparently redirecting requests to the new component.

The implementation of Fos is fairly complete – some applications include a web server, slide viewer, video transcoder (ffmpeg), and busy box. In fact, it was revealed at the end of the talk that the presentation was running on Fos, and it didn’t crash! The system’s website is running on the Fos web server.

I’m not sure how this work will play out. I could see this becoming a standard approach as servers start having thousands of cores, but I’m not sure how applications here would be able to cope with network latency involved in inter-machine message passing.

Virtualization

This work looks at trying to increase the performance of shared file systems that are used by virtual machine clusters. In the traditional setup, a datacenter has a SAN server that hosts files for VMs in a cluster to access. The SAN is set up such that it is highly reliable and provides high throughput.

The problem with a SAN is that it’s expensive and it doesn’t scale very well to hundreds or thousands of servers accessing it simultaneously. It’s also limited by network bandwidth. VMware’s Lithium technology does away with the SAN and instead arranges the VM hosts into a peer-to-peer network and uses local storage to store files, replicating data for redundancy and performance.

The system still preserves the features required of a VM file server, e.g. cloning, snapshots, etc. It uses a branching method similar to some source control systems like Mercurial for quick copying of large files.

When compared to a SAN, Lithium doesn’t perform as well with a small number of hosts, but as the number of VM hosts increases, Lithium scales linearly while the SAN maxes out at a constant throughput. This approach seems like a great idea, and hope to see it pushed to production in future VMware releases.

The presentation of this work was quite confusing, but the basic idea is that when doing fair scheduling for resources in a VM, ill effects are often encountered. For example, as is well known, TCP doesn’t react very well to congestion in the network (e.g. why TCP doesn’t perform very well over wireless networks), so when a VM host artificially limits a guest’s access to the network, the TCP congestion window will drop dramatically and negatively impact performance.

To try and solve this problem, DVT takes an approach of never sharply decreasing a guest’s share of the network. For example, if host X is receiving 100% of network access, but hosts W, Y and Z suddenly request access, rather than the traditional approach of reducing X immediately to 25%, DVT instead slowly reduces it over time. To keep access fair, DVT will eventually reduce X to lower than 25% to make up for the increased share it got previously.

By doing this, DVT increases the performance of VM guests by up to 25% since the TCP congestion window doesn’t drop sharply. It was mentioned that future work will look at applying the same method to disk access, although it isn’t clear to me how slowly reducing disk access instead of sharply reducing it would increase performance in an application.

This work asked the question “Can we tell how much power a VM guest is consuming?”. I was wondering what the motivation for measuring this was throughput the talk until it was finally mentioned at the end. I’ll start with the motivation first instead – the reasoning given was mainly to use knowledge of a VM guest’s consumption to provision your datacenter power accordingly. Other usages are to charge your cloud users according to power consumption (although I don’t buy this, as I don’t see how it would differ from billing with current methods – cpu, memory, storage, bandwidth), and to track which VMs are consuming the most power so you can target power reduction for “green” initiatives.

To answer this question is a two step process. First, they measure the power consumption of each component in the system when a context switch between VM guests happens by using OS-level events. Next, they have to figure out which VM guest is using each component. Rather than doing this live, they monitor a VM guest for a period of time to determine its power consumption and then use that value for future calculations. The reason for this is that different loads may use different internal power for the same externally visible power state, so learning a VM guest’s power profile over time is a more accurate measurement of power consumption.

There were a lot of technical details glossed over in this talk, and there were many formulas on the slides that weren’t explained or accompanied by variable descriptions, so I found the presentation somewhat confusing. I’m sure reading the paper would make this more clear.

This work targets large data applications. These are things like web analytics, graph mining, log analysis, and PageRank, which use massive amounts of data. An insight here is that these applications have to continually process on the order of TBs of new data per day and they are stateful, but the running time is proportional to the total amount of state, not proportional to the amount of new data.

Continuous Bulk Processing (CBP) provides users with an additional API of a Translate() function and a RouteBy() function similar to the map and reduce stages of MapReduce. Current systems only do “outer” grouping, while CBP allows for “inner” grouping so that only the state that needs to be accessed is shipped around. In an evaluation, this inner grouping method reduced running time by up to 53%.

Perhaps I’m not familiar enough with MapReduce, but the presentation went too fast for me to follow the details of the Translate and RouteBy API, so see the paper for details.

Comet is a system that tries to reduce the running time of large data-intensive applications that have work in common. The example given was a set of four jobs: one that computes the top ten hottest Chinese pages daily, another that computes the top ten hottest English pages daily, and corresponding jobs that compute the top ten hottest Chinese and English pages weekly. The first two jobs have the same input data, but have different filters in place, so the first step of each of those jobs is the same.

The flow of the system is: query series, normalization, logical optimization, physical optimization, execution plan, and then execution. By doing these optimization steps, Comet is able to reduce the amount of work done by 52% for jobs that have commonalities with previous jobs.

The evaluation showed that computing the top ten pages weekly was able to take advantage of the top ten daily calculation, but the top ten pages for the week don’t necessarily overlap with the top ten pages of each day, so it’s not clear how this works. This question was asked in the Q&A, but the author wasn’t able to answer. The presentation of this work was very confusing, and it was clear that the rest of the audience didn’t understand either. I’m sure reading the paper would make more sense.

This work addresses how to reduce the running time of large scientific experiments. The example given here was an application that takes astronomical images, does feature extraction to identify the celestial objects in the images, and then runs a Friends of Friends algorithm, which is used in astronomy to cluster celestial objects. MapReduce-type systems like Hadoop are a great fit for workloads like this, but it is hard to express complex algorithms and get good performance. As an example, this algorithm when first implemented took 14 hours to run, and after a week of working, they were able to reduce the running time to 70 minutes.

The reason these types of algorithms can take a long time to run is that the same amount of input data doesn’t always have the same running time (e.g. if the cluster of celestial objects is more dense, takes longer), so a static partitioning scheme doesn’t get good performance. An alternative is to use micro partitions, which reduces the impact of skew, but there is additional framework overhead and to find the sweet spot, the algorithm must be run many times, which is undesirable.

The SkewReduce algorithm takes a sampling approach to figure out the best partitioning scheme. In evaluation, this SkewReduce algorithm was able to reduce the running time of algorithms by 2-8 times. This seems like a nice scheduling algorithm, and I hope this finds its way into the main branch of Hadoop. A person who works at Google shared a similar optimization that Google uses, but they do their optimizations in the Reduce stage rather than the Map stage.

I will attempt to get my writeup for Day 2 posted late tomorrow.
[Update: Day 2 posted.]

We’ve recently been working hard on Firecoral – a browser-based, peer-to-peer content distribution network for web caching. I’ll be presenting a short talk on Firecoral at the 8th International Workshop on Peer-to-Peer Systems (IPTPS) on April 21st in Boston, MA.

Peer-to-peer content distribution has been inarguably successful for large file distribution (e.g. BitTorrent), but P2P services have been restricted to stand-alone applications, not transparently incorporated into Web browsing and seamlessly running over HTTP. CoralCDN has served as a web content distribution network for the past five years, but its deployment has been limited to PlanetLab and demand quickly outgrew capacity.

Firecoral’s goal is to scale web content distribution to end users by allowing mutually distrustful users to share their web browser caches, yet ensure the authenticity of content and enable users to preserve privacy by expressing flexible content sharing policies.

To achieve these goals, we have built a Firefox extension that uses subscription-based XPath queries to extract URLs from web sites and redirect them to a proxy server running inside the browser. For example, all external links from Slashdot are redirected to Firecoral. Once a URL is received by the proxy, a tracker is queried for other users caching the same URL, and the content is fetched from peers instead of the origin web server. We ensure data integrity with cryptographic signatures from a trusted service.

Architecting distributed systems can be very difficult. Arguably the hardest part of programming a distributed application is getting node coordination correct. I’ll define a node in this context as a service running on a single server which communicates with other nodes and together make up your distributed application.

What I mean by coordination here is some act that multiple nodes must perform together. Some examples of coordination:

Group membership

Locking

Publisher/Subscriber

Ownership

Synchronization

One or more of these primitives show up in all distributed systems, so implementing them correctly is extremely important. While developing CRAQ, I originally implemented a very simple group membership service, but it didn’t provide reliability, replication, or scalability and was therefore a single point of failure. I went looking for an out-of-the-box coordination service and came across ZooKeeper.

ZooKeeper is an open-source, reliable, scalable, high-performance coordination service for distributed applications – just what I was looking for. It provides a metadata warehouse that is indexed like a file system and its primitive operations allows programmers to implement any of the common coordination examples I mentioned above. It has client bindings for Java and C, which made it relatively easy for me to integrate into CRAQ’s source code which is written in Tame.

The only thing I wish ZooKeeper had was support for wide-area deployments. I’m hopeful that this might eventually get added (maybe even by me!) some day.

I highly recommend ZooKeeper for anyone who is considering building a distributed system. It simplifies one of the hardest parts of implementing a distributed application.