Advisor, Board Member, Professor, Former Tech Exec, …

Monthly Archives: February 2012

Facebook is building an amazing, enduring business based on data. Data about
people, and the connections between people. And it is about real people – unlike
(say) Twitter, Facebook has maintained a razor-sharp focus on ensuring it has real
people with known identities.

Facebook. Data that's meaningful, but it's unstructured.

Facebook is my favorite web entry point. But there’s something important they’re not doing that is about me and my connections. (And I should say at this point that I’m happy to share my personal data with Facebook.)

We create and manage structured data, and we care about manipulating it.
Facebook isn’t a great place to manage that data. Here’s an example:
I’ve been a runner for nearly 20 years. For the first 10, I used my Palm Pilot or some
other device with a spreadsheet to track my runs: time, distance, how I felt. I then
had it compute total time for the year, number of runs, total distance for the year,
average time per kilometer, and so on. It was pretty motivational — can I beat the number of runs I did last year? Am I slowing down? Am I running enough miles? It’s the kind of foundation that companies such as fitbit are based on.
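To make the spreadsheet idea concrete, here is a minimal sketch of those yearly totals in Python. The `Run` record, the `yearly_summary` helper, and the sample runs are all hypothetical and made up for illustration; they are not my actual log.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """One logged run (hypothetical record layout)."""
    year: int
    minutes: float
    km: float
    felt: str


def yearly_summary(runs, year):
    """Compute the totals a running spreadsheet would show for one year."""
    selected = [r for r in runs if r.year == year]
    total_minutes = sum(r.minutes for r in selected)
    total_km = sum(r.km for r in selected)
    return {
        "runs": len(selected),
        "total_minutes": total_minutes,
        "total_km": total_km,
        # Average pace; None when there is no distance to divide by.
        "minutes_per_km": total_minutes / total_km if total_km else None,
    }


# Made-up sample data, purely for illustration.
runs = [
    Run(2011, 50.0, 10.0, "good"),
    Run(2011, 27.0, 5.0, "tired"),
    Run(2012, 44.0, 8.0, "strong"),
]
summary_2011 = yearly_summary(runs, 2011)
print(summary_2011)
```

Because every run has the same fields, questions like "did I beat last year's number of runs?" become a comparison of two such summaries.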

There’s lots of other structure in life: calendars, weather, budgets, exercise, television schedules, sporting calendars, airline travel, and so on. Facebook isn’t a great place to store that data as it applies to you and your friends, or to manipulate it. How much did I spend last month? How many miles did I fly last year? Where did I travel in 2011? How many baseball games did I attend in the past 5 years? What time on average do my friends go to bed? Do I eat more or fewer calories than my average friend in the eBay list? What are the top ten songs my friends from school listen to?

Fitbit. The power of structured data -- my steps, active minutes, and more.

Structured data is inherently easier to manipulate, understand, and monetize with ads. It’s easier to find patterns (Hugh seems to travel to Australia every Christmas). It’s easier to predict from (Hugh has run nearly 200 miles in the past few months, it’s time for new sneakers). It’s easier to sell a substitution (Hugh, did you know that Virgin also flies that route, and they do it cheaper?).

You could argue they’re edging in that direction: the recent changes to the profile page have been working toward more structure. Not that they’ve made this transition well (and I’m not denying it is hard). A couple of years ago they massacred my unstructured text about favorite movies, music, TV shows, and so on by trying to force it into a structured schema. They completely destroyed my profile page – the music was too obscure, I’d made a few
jokes in my hobbies, and so on, and it didn’t map into any neat schema. Perhaps they should have started with a little more structure in the beginning.

But it was a move towards more structure. The timeline is moving in that direction too – though I am not sure that’s the fundamental reason for it, it’s probably about trying to simply get more data of any type. Sure, they’ve always had birth dates for the purpose of birthday reminders — a good example of what can be done with structured data. And there’s some other basic structured data too – I’m not arguing they have none.

Adding structured data, and ways to manipulate it, is something Facebook needs. And it’s a hole in the online social world. The challenge is how to do it right: blending a structured experience into an inherently simple, unstructured stream of text and media is probably not easy. Particularly when you want to provide search over it all — maybe a topic for another time…


I’m on my way back from the UK after the PHP UK Conference. I delivered the keynote for the conference’s second day, and spent time on a panel about PHP at Scale. Update: Here’s the video of the session for your viewing pleasure.

PHP at Scale panel: Nikolay, Hugh, Rasmus, Ian. Rasmus and I are reading the tweets for the session (which are also on the screen behind us)

I came unprepared to add anything substantive – I didn’t have any interesting challenges I’d encountered in scaling PHP, nor could I see any reason why any robust web scripting language wouldn’t scale. We’d decided to continue to use Java at eBay, primarily because we didn’t want to sink the cost to change without a clear benefit. I’m a PHP fan, but Java is another fine choice.

The good news was that Rasmus and Nikolay couldn’t see any reasons why PHP was particularly challenging. What we saw were general challenges in scaling applications to large traffic volumes – and that’s pretty much where the panel discussion went.

The audience at the PHP at Scale panel. Ciaran Rooney is introducing the panel.

Here are a few key points from the discussion.

Rasmus made the point that the easiest way to scale a bottleneck is to remove it. Here’s an eBay example – we have very flat, denormalized database tables, we precompute joins and store the results. Why? We had challenges scaling database joins at our scale – we typically execute 75 billion data requests in the process of serving over 2 billion pages each day. So, we removed many of the joins, precomputing them once, and stored the results in the database; of course, the tradeoff is additional space, often more queries, and update challenges, but the serving bottleneck is fundamentally gone.
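As a rough illustration of precomputing a join, here is a small sketch using Python's built-in `sqlite3` module. The `sellers`, `items`, and `item_view` tables and the helper names are hypothetical and tiny; this shows the general pattern only, not eBay's schema or tooling.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sellers (seller_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE items (item_id INTEGER PRIMARY KEY, seller_id INTEGER, title TEXT);
    -- Flat, denormalized table: the join result is stored, not recomputed per request.
    CREATE TABLE item_view (item_id INTEGER PRIMARY KEY, title TEXT, seller_name TEXT);
""")
conn.execute("INSERT INTO sellers VALUES (1, 'alice')")
conn.executemany("INSERT INTO items VALUES (?, ?, ?)",
                 [(10, 1, "lamp"), (11, 1, "desk")])


def rebuild_item_view(conn):
    """Run the join once, off the serving path, and store the result."""
    conn.execute("DELETE FROM item_view")
    conn.execute("""
        INSERT INTO item_view (item_id, title, seller_name)
        SELECT i.item_id, i.title, s.name
        FROM items i JOIN sellers s ON s.seller_id = i.seller_id
    """)
    conn.commit()


def serve_item(conn, item_id):
    """Serving path: a single-table lookup with no join."""
    return conn.execute(
        "SELECT title, seller_name FROM item_view WHERE item_id = ?",
        (item_id,),
    ).fetchone()


rebuild_item_view(conn)
row = serve_item(conn, 10)

# The update challenge: the stored copy is stale until it is rebuilt.
conn.execute("UPDATE sellers SET name = 'alice2' WHERE seller_id = 1")
stale = serve_item(conn, 10)
rebuild_item_view(conn)
fresh = serve_item(conn, 10)
```

The sketch also shows the tradeoff: `item_view` duplicates data, and a change to a seller's name is not visible to the serving path until the precomputed rows are refreshed.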

There was general agreement that PHP scripts should never be the bottleneck that you’re having trouble scaling. Producing web pages with a PHP script should take a few milliseconds – the I/O overhead should be near zero, the memory use low, and they shouldn’t read large amounts of data to produce pages that contain less data. The heavy lifting should be behind the scenes – use the right tools for managing and crunching data, storing and restoring sessions, and complex processing. Build services in other languages, make them fast, and get those to scale. That’s where the complexity should be – the PHP script should be a very thin, light layer.

Perhaps unsurprisingly, Rasmus thought Facebook’s HipHop was an odd directional choice. HipHop converts PHP into C++, so you can compile the code, and typically leads to a 50% memory reduction and other benefits. I’d been thinking the same – if you’re having trouble scaling PHP, you’re probably trying to do too much in the web tier, and should be doing more in scalable services behind the scenes. So, perhaps Facebook is doing too much in its web app – that wouldn’t be a surprise, as it happens to everyone when systems grow at a high development velocity. It’s the solution that’s quirky – why not refactor what’s in the web app and what’s in the scalable backend service? Why solve it by making it attractive to build complexity into the web app? Maybe it was just to save operations costs.

We also talked about the general problems of scaling from a startup to a large, major web property. We touched on building experimental frameworks, so that you can configure A/B tests that help you make decisions about what to ship to customers. There were mentions of tooling for supporting multiple markets and languages, and large character set languages. Scaling a session store (which you use to restore state about a user when they return) is very challenging – and Rasmus and I had both seen the pain of that at extreme scale. Interestingly, Nikolay was a fan of storing state in browser cookies – that may work for WordPress, but it wouldn’t work at eBay’s scale because of the sheer amount of data and the substantial site speed implications. (If you’re interested, here are my views on site speed.)
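One common building block of an experimentation framework is stable assignment of users to test variants. Here is a minimal sketch, assuming a hash-based scheme; the `assign_variant` function and the experiment name are hypothetical, and this is not a description of any system discussed on the panel.

```python
import hashlib


def assign_variant(experiment, user_id, variants=("control", "treatment")):
    """Map a user to a variant deterministically.

    Hashing the experiment name together with the user id keeps a user in
    the same variant on every visit, without storing any assignment state.
    """
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(variants)
    return variants[bucket]


# The same user gets the same answer on every call.
first = assign_variant("new_checkout_button", "user-42")
second = assign_variant("new_checkout_button", "user-42")
print(first, first == second)
```

Including the experiment name in the hash means a user's bucket in one test does not determine their bucket in another.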

We also spent some of the discussion on scaling teams – an important part of working at scale. I advocated that hiring smart, driven engineers is really the key – it’s not about hiring people with particular skills, you’ll find that smart people can figure out most anything. We talked about keeping teams independent, so that each team feels like it’s a small startup. We also talked about needing a couple of key folks who have that special blend of operations and engineering skills to debug what’s happening on the live site at scale – it’s a very special skill, perhaps it’s an emerging role that we’ll all formally understand with the emergence of the “DevOps” mentality. In any case, we’d all worked with one or more of these people – the ones that have the inquisitiveness, breadth, skills, and persistence to nail down very subtle bugs at amazing scale (and I argued that you can help these folks significantly by encouraging your developers to log pretty much every event that happens in their code). Rasmus made a few blunt comments about young startups needing to hire adults who’ve been through it before…
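On the logging point, here is one way "log every event" might look in practice: a minimal sketch using Python's standard `logging` and `json` modules. The `log_event` helper, the logger name, and the event fields are hypothetical, and writing one JSON object per line is my assumption about a convenient format, not something the panel prescribed.

```python
import io
import json
import logging


def log_event(logger, event, **fields):
    """Emit one event as a single JSON line so it is easy to search later."""
    logger.info(json.dumps({"event": event, **fields}, sort_keys=True))


# Log to an in-memory stream here; a real service would use a file or a collector.
stream = io.StringIO()
logger = logging.getLogger("checkout")
logger.setLevel(logging.INFO)
logger.addHandler(logging.StreamHandler(stream))
logger.propagate = False

log_event(logger, "cart_updated", user_id="u1", items=3)
log_event(logger, "payment_started", user_id="u1")

lines = stream.getvalue().splitlines()
```

Each line parses back into a dictionary, which is what lets someone debugging the live site filter events by user or by type.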

I’m expecting there’ll be a video of the session up on the web sometime. I’ll update this post when that happens.

What’s your experience at scale? Do you have other key themes? Is there anything inherently challenging about scaling PHP (that doesn’t apply to, say, Java or Python)?