3 Some figures
The Wikimedia Foundation:
- Was founded in June 2003, in Florida
- Currently has 9 employees; the rest is done by volunteers
- Yearly budget of around $2M, supported mostly through donations
- Supports the popular Wikipedia project, but also 8 others: Wiktionary, Wikinews, Wikibooks, Wikiquote, Wikisource, Wikiversity, Wikispecies, Wikimedia

4 Some figures
Wikipedia:
- 8 million articles spread over hundreds of language projects (English, Dutch, ...)
- 110 million revisions
- 10th busiest site in the world (source: Alexa)
- Exponential growth: doubling every 4-6 months in terms of visitors / traffic / servers

13 Squid caching
- Up to 40 GB of disk caches per Squid server
- Disk seek I/O limited: the more disk spindles, the better!
- 8 GB of memory, half of that used by Squid
- Up to 4 disks per server (1U rack servers)
- Hit rates: 85% for Text, 98% for Media, since the use of CARP
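CARP (Cache Array Routing Protocol) sends each URL to one specific cache in the array, so the caches do not all store copies of the same objects. The sketch below illustrates the general idea with highest-random-weight hashing. It is not the actual CARP hash function, and the server names and the use of SHA-1 are assumptions made for illustration.

```python
import hashlib


def score(url: str, server: str) -> int:
    """Combine the URL and the server name into a pseudo-random score."""
    digest = hashlib.sha1(f"{server}|{url}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def pick_cache(url: str, servers: list[str]) -> str:
    """Route the URL to the server with the highest score for that URL."""
    return max(servers, key=lambda server: score(url, server))


# Hypothetical cache names, for illustration only.
caches = ["squid1", "squid2", "squid3", "squid4"]
print(pick_cache("http://example.org/wiki/Main_Page", caches))
```

With this scheme, removing a cache only remaps the URLs that were assigned to it; every other URL keeps its cache and therefore its cached copy.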

16 Squid cache invalidation
- Wiki pages are edited at an unpredictable rate
- Only the latest revision of a page should be served at all times, in order not to hinder collaboration
- Invalidation through expiry times is not acceptable; explicit cache purging needs to be done
- Implemented using the UDP-based HTCP protocol: on edit, application servers send out a single message containing the URL to be invalidated, which is delivered over multicast to all subscribed Squid caches
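The purge-on-edit flow can be sketched roughly as follows. This is not the HTCP wire format (RFC 2756 defines the real packet layout); the length-prefixed message, the multicast group address and the function names are hypothetical stand-ins. The sketch only shows the shape of the idea: one UDP datagram per edit, sent to a multicast group.

```python
import socket
import struct

MULTICAST_GROUP = "239.0.0.1"  # hypothetical group address
PORT = 4827                    # assumed port for this sketch


def build_purge_message(url: str) -> bytes:
    """Simplified stand-in message: 2-byte big-endian length, then the URL."""
    encoded = url.encode("utf-8")
    return struct.pack("!H", len(encoded)) + encoded


def parse_purge_message(data: bytes) -> str:
    """Inverse of build_purge_message, as a subscribed cache might run it."""
    (length,) = struct.unpack("!H", data[:2])
    return data[2:2 + length].decode("utf-8")


def send_purge(url: str, group: str = MULTICAST_GROUP, port: int = PORT, ttl: int = 1) -> None:
    """Send a single datagram; the network delivers it to every group member."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP) as sock:
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, ttl)
        sock.sendto(build_purge_message(url), (group, port))
```

An application server would call `send_purge(page_url)` once after saving an edit. It does not need to know how many caches exist or where they are.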

17 The Wiki software
- All Wikimedia projects run on a MediaWiki platform
- Open Source software (GPL)
- Designed primarily for use by Wikipedia/Wikimedia, but also used by many outside parties
- Arguably the most popular wiki engine out there
- Written in PHP
- Very scalable, very good localization
- Storage primarily in MySQL, other DBMSes supported

21 MediaWiki optimization
We try to optimize by...
- not doing anything stupid
- avoiding expensive algorithms, database queries, etc.
- caching every result that is expensive and has temporal locality of reference
- focusing on the hot spots in the code (profiling!)
If a MediaWiki feature is too expensive, it doesn't get enabled on Wikipedia
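The "cache every expensive result" point can be illustrated with a minimal in-process sketch. MediaWiki itself is written in PHP and uses shared object caches; the function below and its parameters are hypothetical and only show the pattern of keying a cached result on everything that determines it.

```python
from functools import lru_cache

calls = {"render": 0}  # counts how often the expensive path actually runs


@lru_cache(maxsize=1024)
def render_page(title: str, revision_id: int) -> str:
    """Stand-in for an expensive operation such as parsing wikitext."""
    calls["render"] += 1
    return f"<h1>{title}</h1><!-- rev {revision_id} -->"


render_page("Main_Page", 1)
render_page("Main_Page", 1)  # served from the cache
render_page("Main_Page", 2)  # new revision, so it is rendered again
```

Because the revision id is part of the key, an edit naturally produces a new cache entry rather than requiring the old one to be found and removed.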


23 MediaWiki profiling

24 Persistent data
Persistent data is stored in the following ways:
- Metadata, such as article revision history, article relations (links, categories etc.), user accounts and settings are stored in the core databases
- Actual revision text is stored as blobs in External storage
- Static (uploaded) files, such as images, are stored separately on the image server; metadata (size, type, etc.) is cached in the core database and object caches

25 Core databases
- Separate database per wiki (not separate server!)
- One master, many replicated slaves
- Read operations are load balanced over the slaves, write operations go to the master
- The master is used for some read operations in case the slaves are not yet up to date (lagged)
- Runs on ~15 DB servers with 4-16 GB of memory, 6x GB disks and 2 CPUs each
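The routing rule described above can be sketched as a small function. The class, the lag threshold and the server names are assumptions for illustration, not MediaWiki's actual load balancer.

```python
import random
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    lag_seconds: float = 0.0  # replication delay behind the master


def choose_server(master, replicas, is_write, max_lag=5.0, rng=random):
    """Writes go to the master; reads go to a sufficiently fresh replica.

    If every replica is lagged beyond max_lag, the read falls back to the master.
    """
    if is_write:
        return master
    fresh = [r for r in replicas if r.lag_seconds <= max_lag]
    return rng.choice(fresh) if fresh else master


master = Server("db-master")
replicas = [Server("db1", 0.2), Server("db2", 30.0), Server("db3", 1.0)]
print(choose_server(master, replicas, is_write=False).name)  # db1 or db3
```

Random choice among the fresh replicas is the simplest balancing policy; a weighted choice by server capacity would fit the same structure.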

29 Text compression
- All revisions of all articles are stored: every Wikipedia article version since day 1 is available
- Many articles have hundreds, thousands or tens of thousands of revisions
- Most revisions differ only slightly from previous revisions
- Therefore subsequent revisions of an article are concatenated and then compressed, achieving very high compression ratios of up to 100x
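The benefit of concatenating before compressing can be demonstrated with a small experiment. This sketch uses zlib and synthetic revisions; the separator byte and the sample text are assumptions, and the ratio it achieves says nothing about the figures on the slide.

```python
import zlib

SEPARATOR = b"\x00"  # assumed never to occur inside revision text


def compress_separately(revisions):
    """Compress each revision on its own."""
    return [zlib.compress(r) for r in revisions]


def compress_together(revisions):
    """Concatenate all revisions, then compress the result once."""
    return zlib.compress(SEPARATOR.join(revisions))


def decompress_together(blob):
    """Recover the individual revisions from a concatenated blob."""
    return zlib.decompress(blob).split(SEPARATOR)


# Synthetic article: 20 revisions that each differ only by a short edit.
base = " ".join(
    f"Sentence {i} of the article mentions topic {i * 7}." for i in range(100)
).encode("utf-8")
revisions = [base + f" Edit number {n}.".encode("utf-8") for n in range(20)]

separate_size = sum(len(blob) for blob in compress_separately(revisions))
together_size = len(compress_together(revisions))
print(separate_size, together_size)
```

Compressed together, each later revision is encoded mostly as back-references to the previous one, which is why near-identical revisions cost so little extra space.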

31 Media storage
- New API between media storage server and application servers, based on HTTP
- Methods: store, publish, delete and generate thumbnail
- New file / directory layout structure, using content hashes for file names
- Files with the same name/URL will have the same content, so no invalidation is necessary
- Migration to some distributed, replicated setup
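Content-hash file naming can be sketched as below. The choice of SHA-1 and the two-level directory prefix are assumptions for illustration; the slide does not specify the actual layout.

```python
import hashlib
from pathlib import PurePosixPath


def storage_path(content: bytes, extension: str) -> PurePosixPath:
    """Derive the file's path from its content, not from its uploaded name."""
    digest = hashlib.sha1(content).hexdigest()
    # Hypothetical layout: <first hex char>/<first two hex chars>/<digest>.<ext>
    return PurePosixPath(digest[0], digest[:2], f"{digest}.{extension}")


print(storage_path(b"example image bytes", "png"))
```

Since a changed file gets a different digest and therefore a different URL, a cached copy under the old URL can never be stale.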
