Thursday, November 19, 2009

E. Michael Maximilien has a post, "Extreme Agility at Facebook", on blog@CACM. The post reports on a talk at OOPSLA by Robert Johnson (Director of Engineering at Facebook) titled "Moving Fast at Scale".

Here is an interesting excerpt on very frequent deployment of software and how it reduces downtime:

Facebook developers are encouraged to push code often and quickly. Pushes are never delayed and applied directly to parts of the infrastructure. The idea is to quickly find issues and their impacts on the rest of system and surely fixing any bugs that would result from these frequent small changes.

Second, there is limited QA (quality assurance) teams at Facebook but lots of peer review of code. Since the Facebook engineering team is relatively small, all team members are in frequent communications. The team uses various staging and deployment tools as well as strategies such as A/B testing, and gradual targeted geographic launches.

This has resulted in a site that has experienced, according to Robert, less than 3 hours of down time in the past three years.
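To make the idea of a gradual, targeted launch concrete, here is a minimal sketch of a percentage-based rollout gate. It is not Facebook's tooling, which the post does not describe in detail. The function names, the hashing scheme, and the region check are all assumptions made for illustration.

```python
import hashlib
from typing import Optional, Set


def rollout_bucket(user_id: str, feature: str) -> int:
    """Map a (feature, user) pair to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def is_enabled(
    user_id: str,
    feature: str,
    percent: int,
    user_region: Optional[str] = None,
    allowed_regions: Optional[Set[str]] = None,
) -> bool:
    """Return True if this user should see the feature.

    Hypothetical policy: if allowed_regions is given, only users in those
    regions are eligible; eligible users are then admitted by bucket.
    """
    if allowed_regions is not None and user_region not in allowed_regions:
        return False
    return rollout_bucket(user_id, feature) < percent
```

Because the bucket is derived from a hash rather than a random draw, a user who sees the feature at 10% still sees it when the rollout widens to 50%, which keeps the experience stable while problems are being watched for.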

Monday, November 16, 2009

An upcoming paper, "The Case for RAMClouds: Scalable High-Performance Storage Entirely in DRAM" (PDF), makes some interesting new arguments for shifting most databases to serving entirely out of memory rather than off disk.

The paper looks at Facebook as an example and points out that, due to aggressive use of memcached and caches in MySQL, the memory they use already is about "75% of the total size of the data (excluding images)." They go on to argue that a system designed around in-memory storage, with disk used only for archival purposes, would be much simpler, more efficient, and faster. They also look at examples of smaller databases and note that, with servers getting to 64GB of RAM and higher and most databases just a couple of terabytes, it doesn't take that many servers to get everything in memory.
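The server-count arithmetic is easy to check. The sketch below assumes a 2 TB database (my reading of "a couple terabytes") and decimal units. The `usable_fraction` parameter is a hypothetical knob for memory lost to the OS, indexes, and other overhead; none of these numbers come from the paper.

```python
import math

GB = 10**9
TB = 10**12


def servers_needed(dataset_bytes: int, ram_per_server_bytes: int,
                   usable_fraction: float = 1.0) -> int:
    """Number of servers required to hold the whole dataset in RAM."""
    usable_bytes = ram_per_server_bytes * usable_fraction
    return math.ceil(dataset_bytes / usable_bytes)


# Assumed example: a 2 TB database on servers with 64 GB of RAM each.
all_ram_usable = servers_needed(2 * TB, 64 * GB)
half_ram_usable = servers_needed(2 * TB, 64 * GB, usable_fraction=0.5)
print(all_ram_usable, half_ram_usable)
```

Even with only half of each server's memory available for data, this assumed database fits on a few dozen machines.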

An excerpt from the paper:

Developers are finding it increasingly difficult to scale disk-based systems to meet the needs of large-scale Web applications. Many people have proposed new approaches to disk-based storage as a solution to this problem; others have suggested replacing disks with flash memory devices.

In contrast, we believe that the solution is to shift the primary locus of online data from disk to random access memory, with disk relegated to a backup/archival role ... [With] all data ... in DRAM ... [we] can provide 100-1000x lower latency than disk-based systems and 100-1000x greater throughput .... [while] eliminating many of the scalability issues that sap developer productivity today.

One subtle but important point the paper makes is that the slow speed of current databases has made web applications both more complicated and more limited than they should be. From the paper:

Traditional applications expect and get latency significantly less than 5-10 μs ... Because of high data latency, Web applications typically cannot afford to make complex unpredictable explorations of their data, and this constrains the functionality they can provide. If Web applications are to replace traditional applications, as has been widely predicted, then they will need access to data with latency much closer to what traditional applications enjoy.

Random access with very low latency to very large datasets ... will not only simplify the development of existing applications, but they will also enable new applications that access large amounts of data more intensively than has ever been possible. One example is ... algorithms that must traverse large irregular graph structures, where the access patterns are ... unpredictable.

The authors point out that data accesses currently need to be heavily optimized and carefully ordered, and that applications must conservatively fetch extra data in case it is later needed, all things that mostly go away if you are using a database where access has microsecond latency.
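A rough calculation shows why latency dominates when accesses depend on each other, as in a graph traversal where each lookup decides the next one. The figures below are assumptions for illustration (about 10 ms per random disk access and about 10 μs per in-memory access), not measurements from the paper.

```python
def traversal_time_seconds(dependent_lookups: int, latency_seconds: float) -> float:
    """Total time when each lookup must wait for the previous one to finish."""
    return dependent_lookups * latency_seconds


DISK_LATENCY = 10e-3  # assumed: ~10 ms per random disk access
DRAM_LATENCY = 10e-6  # assumed: ~10 microseconds per in-memory access

lookups = 1000  # hypothetical number of dependent lookups for one page view
disk_time = traversal_time_seconds(lookups, DISK_LATENCY)
dram_time = traversal_time_seconds(lookups, DRAM_LATENCY)
print(f"disk: {disk_time:.3f} s, memory: {dram_time:.3f} s")
```

Under these assumptions, a thousand dependent lookups take on the order of ten seconds from disk but about ten milliseconds from memory. That is why disk-backed applications batch, reorder, and prefetch, and why that work becomes much less necessary at microsecond latency.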

While the authors do not go as far as to argue that memory-based databases are cheaper, they do argue that they are cost competitive, especially once developer time is taken into account. It seems to me that you could go a step further here and argue that very low latency databases bring such large productivity gains to developers and benefits to application users that they are in fact cheaper, but the paper does not try to do that.

If you don't have time to read the paper, slides (PDF) from a talk by one of the authors are also available and are very quick to skim.

If you can't get enough of this topic, please see my older post, "Replication, caching, and partitioning", which argues that big caching layers, such as memcached, are overdone compared to having each database shard serve most data out of memory.

Thursday, November 12, 2009

Paul Graham has a fantastic article up, "What Startups Are Really Like", with the results of what happened when he asked all the founders of the Y Combinator startups "what surprised them about starting a startup."

A brief excerpt summarizing the findings:

Unconsciously, everyone expects a startup to be like a job, and that explains most of the surprises. It explains why people are surprised how carefully you have to choose cofounders and how hard you have to work to maintain your relationship. You don't have to do that with coworkers. It explains why the ups and downs are surprisingly extreme. In a job there is much more damping. But it also explains why the good times are surprisingly good: most people can't imagine such freedom. As you go down the list, almost all the surprises are surprising in how much a startup differs from a job.

There are 19 surprises listed in the essay. Below are excerpts from some of them:

Be careful who you pick as a cofounder ... [and] work hard to maintain your relationship.

Startups take over your life ... [You will spend] every waking moment either working or thinking about [your] startup.

It's an emotional roller-coaster ... How low the lows can be ... [though] it can be fun ... [But] starting a startup is fun the way a survivalist training course would be fun, if you're into that sort of thing. Which is to say, not at all, if you're not.

Persistence is the key .... [but] mere determination, without flexibility ... may get you nothing.

You have to do lots of different things ... It's much more of a grind than glamorous.

When you let customers tell you what they're after, they will often reveal amazing details about what they find valuable as well what they're willing to pay for.

You can never tell what will work. You just have to do whatever seems best at each point.

Expect the worst with deals ... Deals fall through.

The degree to which feigning certitude impressed investors .... A lot of what startup founders do is just posturing. It works.

How much of a role luck plays and how much is outside of [your] control ... Having skill is valuable. So is being determined as all hell. But being lucky is the critical ingredient ... Founders who succeed quickly don't usually realize how lucky they were.

Definitely worth reading the entire article if you are at all considering a startup.

For my personal take on some surprises I hit, please see my earlier post on Starting Findory.

Tuesday, November 10, 2009

An amusingly titled paper to be presented at the CCS 2009 conference, "Your Botnet is My Botnet: Analysis of a Botnet Takeover" (PDF), contains some not-so-funny data on how sophisticated hijacking computers has now become, the data they are able to collect, and the profits that fuel the development of more and more dangerous botnets.

Extended excerpts from the paper, focusing on the particularly scary bits:

We describe our experience in actively seizing control of the Torpig (a.k.a. Sinowal, or Anserin) botnet for ten days. Torpig ... has been described ... as "one of the most advanced pieces of crimeware ever created." ... The sophisticated techniques it uses to steal data from its victims, the complex network infrastructure it relies on, and the vast financial damage that it causes set Torpig apart from other threats.

Torpig has been distributed to its victims as part of Mebroot. Mebroot is a rootkit that takes control of a machine by replacing the system's Master Boot Record (MBR). This allows Mebroot to be executed at boot time, before the operating system is loaded, and to remain undetected by most anti-virus tools.

Victims are infected through drive-by-download attacks ... Web pages on legitimate but vulnerable web sites ... request JavaScript code ... [that] launches a number of exploits against the browser or some of its components, such as ActiveX controls and plugins. If any exploit is successful ... an installer ... injects a DLL into the file manager process (explorer.exe) ... [that] makes all subsequent actions appear as if they were performed by a legitimate system process ... loads a kernel driver that wraps the original disk driver (disk.sys) ... [and] then overwrite[s] the MBR of the machine with Mebroot.

Mebroot has no malicious capability per se. Instead, it provides a generic platform that other modules can leverage to perform their malicious actions ... Immediately after the initial reboot ... [and] in two-hour intervals ... Mebroot contacts the Mebroot C&C server to obtain malicious modules ... All communication ... is encrypted.

The Torpig malware ... injects ... DLLs into ... the Service Control Manager (services.exe), the file manager, and 29 other popular applications, such as web browsers (e.g., Microsoft Internet Explorer, Firefox, Opera), FTP clients (CuteFTP, LeechFTP), email clients (e.g., Thunderbird, Outlook, Eudora), instant messengers (e.g., Skype, ICQ), and system programs (e.g., the command line interpreter cmd.exe). After the injection, Torpig can inspect all the data handled by these programs and identify and store interesting pieces of information, such as credentials for online accounts and stored passwords. ... Every twenty minutes ... Torpig ... upload[s] the data stolen.

Torpig uses phishing attacks to actively elicit additional, sensitive information from its victims, which, otherwise, may not be observed during the passive monitoring it normally performs ... Whenever the infected machine visits one of the domains specified in the configuration file (typically, a banking web site), Torpig ... injects ... an HTML form that asks the user for sensitive information, for example, credit card numbers and social security numbers. These phishing attacks are very difficult to detect, even for attentive users. In fact, the injected content carefully reproduces the style and look-and-feel of the target web site. Furthermore, the injection mechanism defies all phishing indicators included in modern browsers. For example, the SSL configuration appears correct, and so does the URL displayed in the address bar.

Consistent with the past few years' shift of malware from a for-fun (or notoriety) activity to a for-profit enterprise, Torpig is specifically crafted to obtain information that can be readily monetized in the underground market. Financial information, such as bank accounts and credit card numbers, is particularly sought after. In ten days, Torpig obtained the credentials of 8,310 accounts at 410 different institutions ... 1,660 unique credit and debit card numbers .... 297,962 unique credentials (username and password pairs) .... [in] information that was sent by more than 180 thousand infected machines.

The paper estimates the value of the data collected by this sophisticated piece of malware to be between $3M and $300M per year on the black market.

Saturday, November 07, 2009

Findory was my first startup and a nearly five-year effort. Its goal of personalizing information was almost laughably ambitious, a joy to pursue, and I learned much.

I learned that cheap is good, but too cheap is bad. It does little good to avoid burning too fast only to starve yourself of what you need.

I re-learned the importance of a team, one that balances the weaknesses of some with the strengths of others. As fun as learning new things might be, trying to do too much yourself costs the startup too much time in silly errors born of inexperience.

I learned the necessity of good advisors, especially angels and lawyers. A startup needs people who can provide expertise, credibility, and connections. You need advocates to help you.

And I learned much more, some of which is detailed in the other posts in the Starting Findory series.

I hope you enjoyed these posts about my experience trying to build a startup. If you did like this Starting Findory series, you might also be interested in my Early Amazon posts. They were quite popular a few years ago.

Wednesday, November 04, 2009

A recent paper from SIGIR, "The Wisdom of the Few: A Collaborative Filtering Approach Based on Expert Opinions from the Web" (PDF), has a very useful exploration into the effectiveness of recommendations using only a small pool of trusted experts.

The results suggest that using a small pool of a couple hundred experts, possibly your own experts or experts selected and mined from the web, has quite a bit of value, especially in cases where big data from a large community is unavailable.

A brief excerpt from the paper:

Recommending items to users based on expert opinions .... addresses some of the shortcomings of traditional CF: data sparsity, scalability, noise in user feedback, privacy, and the cold-start problem .... [Our] method's performance is comparable to traditional CF algorithms, even when using an extremely small expert set .... [of] 169 experts.

Our approach requires obtaining a set of ... experts ... [We] crawled the Rotten Tomatoes web site -- which aggregates the opinions of movie critics from various media sources -- to obtain expert ratings of the movies in the Netflix data set.

The authors certainly do not claim that using a small pool of experts is better than traditional collaborative filtering.

What they do say is that using a very small pool of experts works surprisingly well. In particular, I think it suggests a good alternative to content-based methods for bootstrapping a recommender system. If you can create a high quality pool of experts, even a fairly small one, you may have good results starting with that while you work to gather ratings from the broader community.
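To give a feel for how recommendations from a small expert pool might work, here is a minimal sketch of nearest-neighbor collaborative filtering where the neighbors are drawn only from the experts. It is not the paper's exact formulation. The similarity measure (plain cosine over shared items), the `min_similarity` and `min_experts` thresholds, and the toy ratings are all assumptions for illustration.

```python
from math import sqrt
from typing import Dict, Optional

Ratings = Dict[str, float]  # item -> rating


def mean_rating(ratings: Ratings) -> float:
    return sum(ratings.values()) / len(ratings)


def cosine_similarity(a: Ratings, b: Ratings) -> float:
    """Cosine similarity computed over the items both parties rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = sqrt(sum(a[i] ** 2 for i in shared))
    norm_b = sqrt(sum(b[i] ** 2 for i in shared))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def predict(user: Ratings, experts: Dict[str, Ratings], item: str,
            min_similarity: float = 0.0, min_experts: int = 1) -> Optional[float]:
    """Predict the user's rating for item from similar experts only.

    Returns None when too few sufficiently similar experts rated the item.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    used = 0
    for expert in experts.values():
        if item not in expert:
            continue
        sim = cosine_similarity(user, expert)
        if sim <= 0 or sim < min_similarity:
            continue
        # Use each expert's deviation from their own average rating.
        weighted_sum += sim * (expert[item] - mean_rating(expert))
        weight_total += sim
        used += 1
    if used == 0 or used < min_experts:
        return None
    return mean_rating(user) + weighted_sum / weight_total


# Toy data: one user and two hypothetical critics.
user = {"A": 4, "B": 2}
experts = {
    "critic1": {"A": 5, "B": 1, "C": 5},
    "critic2": {"A": 1, "B": 5, "C": 1},
}
print(predict(user, experts, "C"))
```

The user's own ratings are needed only to pick which experts to trust. The ratings that drive the prediction all come from the expert pool, which is what makes this attractive for bootstrapping before community data exists.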