Performance issues on the AO3

Published: 2012-01-17 17:05:30 -0500

As many users will no doubt have noticed, the AO3 has been experiencing some performance issues since the start of the year. When we posted on 5th January, we were expecting those problems to ease once the holiday rush was over. However, that hasn't turned out to be the case. We're working on ways of dealing with the performance issues, but we wanted to keep you updated with what's going on while we do that.

Why the slowdowns?

In the past month, over 2,000 new users have created accounts on the Archive. At the same time, the number of people reading on the Archive - with or without accounts - has been steadily growing. This has been part of a general trend, as you can see if you look at the graph showing the number of visits to the Archive since November:

We're always much busier on Sundays, but the number of visits has been gradually going up each week since November (and the same holds true for the preceding months). However, before December we were hovering around the 135,000 level for visitor numbers at peak times. You can see that the visitor numbers began to climb more dramatically in December, peaking on 2nd January when we had 182,958 visitors. Crucially, after that spike it didn't drop back down to anything like the levels it had been at previously: we're now at more than 150,000 visits on a regular day, and more than 165,000 on Sundays, our busiest day. Wow!

We were expecting a big spike over the holidays, when there are lots of challenges and lots of people with a little spare time for reading and creating. However, we hadn't expected site usage to remain quite so high after the holidays were over! The increases mean that the site is now under a holiday load every day, which is one reason things have been running a little slowly.

The other reason for the slowdowns is that the increase in our number of registered users, together with the holiday challenge season, has produced a big increase in the number of works. In fact, 11,516 new works have been posted since the end of December already! More data in our databases means more work for things like sorting, searching, etc - this means that sometimes the database just doesn't serve up the result you need in time, and the unicorn which is waiting to get that result gives up and goes away (yes, really - our servers are assisted by unicorns :D).

We've been expecting this general effect for a while now, and we've been working towards implementing things to deal with it; however, we weren't expecting quite such a big jump in site usage in the past month!

What are you doing about this?

The Accessibility, Design & Technology and Systems Committees had a special meeting on Saturday to discuss ways of dealing with the immediate problem, as well as longer term plans. It can be tricky to test for high load situations before they actually occur, but once they do occur there's lots of data we can gather to help us address the most crucial issues. (We're also working on implementing more tools which will help us test this stuff before it comes up.)

Short term

More caching: We already cache pages (or sections of pages) across the site - this means we store a copy which we can serve up directly, instead of creating the page every time someone wants to use it. If something changes, then the cache is expired and a new, updated copy is created. Hitherto, we've focused on caching chunks of information which are unlikely to change rapidly: for example, on any works index the 'blurbs' which show the information about each work are cached. However, some of the heaviest load is caused by rapidly changing pages like the works index. We're moving towards more caching of whole pages, so that a new copy of the works index (for example) will be created every five minutes rather than generated each time someone asks for it. This means things like works indexes will be a little slower to update - when you add a new work, it won't appear on the list until the cache expires - but that five-minute delay will massively reduce the load on our servers.

More indexes: We have a few places in our databases - for example the tables for the skins - which could use more indexes. Indexes speed things up because the server can just search through those rather than the whole table. So, we're hunting out places where more indexes are needed, and implementing them. :)
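As a toy illustration of why indexes help (using SQLite here for convenience - the real table and column names on the Archive will differ):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE skins (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT)"
)
conn.executemany(
    "INSERT INTO skins (author_id, title) VALUES (?, ?)",
    [(i % 100, f"skin {i}") for i in range(1000)],
)

def plan(sql):
    """Ask the database how it intends to run a query."""
    return " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

query = "SELECT title FROM skins WHERE author_id = 42"
before = plan(query)  # without an index: a scan of the whole table

conn.execute("CREATE INDEX index_skins_on_author_id ON skins (author_id)")
after = plan(query)   # with an index: a direct lookup in the index
```

Before the index exists, the query plan is a full table scan; afterwards it's an index search, which stays fast no matter how big the table grows.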

Medium term

Bad queries must die: We have a few queries which are very long and complicated, and take a long time to run. We need to rewrite these bits of the code to make them simpler and faster! In many cases this will be quite complicated (or else we would have done it already), but it's a priority to help us speed things up.
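A small (entirely made-up) example of the kind of rewrite involved - the same question asked two ways, where one shape is much kinder to the database:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE works (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE taggings (work_id INTEGER, tag TEXT);
INSERT INTO works VALUES (1, 'Fic A'), (2, 'Fic B'), (3, 'Fic C');
INSERT INTO taggings VALUES (1, 'Sherlock'), (3, 'Sherlock');
""")

# Slow shape: a correlated subquery runs once for every row of `works`.
slow = """SELECT title FROM works
          WHERE (SELECT COUNT(*) FROM taggings
                 WHERE taggings.work_id = works.id
                   AND taggings.tag = 'Sherlock') > 0"""

# Faster shape: a single join the database can plan in one pass.
fast = """SELECT DISTINCT works.title FROM works
          JOIN taggings ON taggings.work_id = works.id
          WHERE taggings.tag = 'Sherlock'"""

slow_rows = sorted(r[0] for r in conn.execute(slow))
fast_rows = sorted(r[0] for r in conn.execute(fast))
assert slow_rows == fast_rows  # same answer, very different cost
```

The hard part in practice is that the slow queries are usually tangled up with other features, so rewriting them safely takes care.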

New filters for great justice: The filters that are implemented on our index pages are not really optimal considering the size of the site now - the limitations of that code are the reason we have to have a 1000 work cap on the number of works returned. We have been working on this for a long time - we need to completely throw out what we have and implement a system which works better for the site as it is now. Again, this is really complicated, which is why it's taken us a long time to achieve it even though we knew it was important - the good news is that we have now done quite a lot of work on this area and the first round of changes should be out in the next few months.

Long term

Long term, we're going to be moving to a setup which allows us to distribute our site load across more servers. This will involve database sharding - putting different bits of the database on separate servers - so it will take quite a lot of planning and expertise. If you're a user of Livejournal or Dreamwidth, you might be aware that your journal is hosted on a certain 'cluster' - we'd be moving to a similar system. We want to make sure we do this right, but based on the way the site is growing we think this is now high priority, and our Systems team are working to figure out the right ways forward.
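The core of a cluster assignment like that can be sketched in a few lines (a hash-based sketch only - the real design involves much more, and the server names are invented):

```python
import hashlib

# Hypothetical shard layout: each "cluster" is its own database server.
SHARDS = ["db1.example.org", "db2.example.org", "db3.example.org"]

def shard_for(user_id: int) -> str:
    """Pick a shard deterministically, so a user's data always lives
    on the same server (like an LJ/Dreamwidth cluster assignment)."""
    digest = hashlib.sha256(str(user_id).encode()).digest()
    return SHARDS[int.from_bytes(digest[:8], "big") % len(SHARDS)]

# The same user always maps to the same cluster...
assert shard_for(12345) == shard_for(12345)
# ...while different users spread out across the available servers.
buckets = {shard_for(uid) for uid in range(1000)}
assert buckets == set(SHARDS)
```

The planning work is in everything around this: moving existing data, handling queries that span shards, and adding servers later without reshuffling everyone.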

Summary

We know it's really frustrating when the site runs slow or is timing out on you: many apologies. We really appreciate users' patience while we deal with the issues. As you'll see from the above, there are some immediate things we can do to ease the problems, and we also have a good sense of where we need to go from here. So, while these changes need to be implemented as a matter of urgency, we feel confident we will be able to tackle the problems. If you have expertise in the areas of performance, scalability and database management, we would very much welcome additional volunteers.

As we move forward on dealing with problem spots on the site, we may implement some changes which are visible to users: the caching on the index pages and the changes to browsing and searching are two of the most obvious. We'll let you know about this as we go along - we think the effect will be beneficial for everyone, but do be prepared for a few changes! You can keep up with status and deploy news on our Twitter @AO3_Status.

While the growth in the site means we're facing some problems a little sooner than we expected, we're really excited about the fact so many people want to read and post to the AO3. Thanks to everyone for your fannish energy - and apologies for the fact we sometimes slow you down a little.


Very grateful for the update. Meanwhile, oddly (or not?), the site's not timing out for me...just running slowly. I'm happy to put up with that as long as I know the issues are being dealt with. Take care!


Thanks! A lot of the issues are things that have to do with the volume of data we now have in our database and how we're accessing it on different pages, so you may notice that some pages load very quickly at the same time others (like the listings for big fandoms or the tag wrangling areas) load more slowly or won't load at all. It's really a combination of traffic and volume (although Sherlock hasn't been helping with the traffic this week!), and we're doing our best to address all the issues at play.

Elz, AD&T


Your site has, by far, the best set of features and usability that I've experienced yet in a fanfic host. The value you bring to such communities is more than worth the wait, and while I have been receiving 502's and lags fairly regularly, they usually clear up within a minute or two, so I have very few complaints.


I love this post, and hope to see lots more like it -- explaining problems and the plans to work on them on both a broader and a more granular level -- whenever there's a Systems (or any other OTW aspect) item of concern -- or of accomplishment.


Thank you for this! And many congrats to your growing site! I appreciate all your efforts in making this better. I stumbled onto this site by accident and realized how much quality fiction there is here :D It's amazing. Thank you again!


Denise
Thu 19 Jan 2012 11:08PM EST

You could set up the works page to speed up access without having a delay in updates: memcache the list, then invalidate the memcache key when a new work is uploaded (and memcache the result). Not much of a performance improvement for fandoms that have lots of works being uploaded constantly, but it would improve speed when you have a lot of people just browsing.

In general, for something like the AO3, it makes a lot of sense to memcache the result of just about any query, with instant invalidation of the key whenever something happens that would affect that query (like, if someone adds a tag to a work, invalidate the memcache key for that tag feed). You pretty much can't go wrong with stuffing lots of things into memcache, as long as you clear the keys every time something happens that would change the results.
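In code, that invalidate-on-write pattern looks roughly like this (a sketch with a plain dict standing in for memcached; all the names are illustrative):

```python
# A dict stands in for memcached; a key is dropped the moment
# a write changes the data behind it.
cache = {}
works_by_fandom = {"Sherlock": ["work-1", "work-2"]}

def works_index(fandom):
    key = f"works:{fandom}"
    if key not in cache:                        # miss: hit the "database"
        cache[key] = list(works_by_fandom.get(fandom, []))
    return cache[key]

def post_work(fandom, work):
    works_by_fandom.setdefault(fandom, []).append(work)
    cache.pop(f"works:{fandom}", None)          # instant invalidation

assert works_index("Sherlock") == ["work-1", "work-2"]
post_work("Sherlock", "work-3")                 # the write clears the key
assert works_index("Sherlock") == ["work-1", "work-2", "work-3"]
```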


Thank you! This is really useful. And I was about to ask you if you were interested in volunteering with us, but then I looked again at your name and realised that you, er, probably have your hands full. <3

Lucy, AD&T / Communications


Denise
Tue 24 Jan 2012 04:44PM EST


I've been noticing lately--today in particular, but on and off since Christmas--that pages are loading with the JavaScript and the background images noticeably lagging behind. This is the first time I've ever seen this kind of loading lag at a level noticeable by a human.

Have you considered using sprites for the images?

A page showing 20 work blurbs is likely loading 15 to 20 separate images--each one an HTTP request to the same server competing with the 17 scripts that also need to load.

There are cool tools now to make generating the sprite image maps and the CSS a lot less of a chore than it used to be.


Yes, if you go to any index page (which happens when you click on a fandom tag) then you're likely to hit a 502 error, because these are the pages with the big performance issues. We know it's really frustrating - we're hoping to get on top of it soon.


My biggest complaint, with all due respect to the unicorns, is the inability to go to Fandoms and pick a fandom such as Sherlock to look at and browse through. Every single time, no matter what time of day or night, I get a "not found" response. I have to wait for people to recommend a story or find a link on another site that leads me back here. *sadface* But I understand the growing pains, so I am trying to be patient. Thanks for all you do!

Last Edited Sat 21 Jan 2012 08:02PM EST


Yes, unfortunately, the index pages (which include all the fandom pages) are the biggest offenders in terms of performance problems, so they're the most likely place to get a 502. We're really sorry for the inconvenience. (However, our new release might help you out a little bit - you can now subscribe to feeds for fandoms, such as Sherlock.)