5 Answers

Scaling on the backend

In a very simple setup, one DNS entry goes to one IP which belongs to one server. Everybody the world over goes to that single machine. With enough traffic, that becomes too much to handle long before you get to be YouTube's size. In a simple scenario, we add a load balancer. The job of the load balancer is to distribute traffic across various back-end servers while appearing to the outside world as a single server.
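As a minimal sketch of the idea (server IPs are hypothetical), a round-robin balancer hands incoming requests to each back-end in turn:

```python
from itertools import cycle

# Hypothetical pool of back-end servers sitting behind one public address.
BACKENDS = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]

class RoundRobinBalancer:
    """Hands out back-ends in rotation so no single machine takes all traffic."""

    def __init__(self, backends):
        self._pool = cycle(backends)

    def pick(self):
        return next(self._pool)

lb = RoundRobinBalancer(BACKENDS)
# Six requests are spread evenly: each back-end sees exactly two of them.
picks = [lb.pick() for _ in range(6)]
```

Real load balancers use smarter policies (least connections, health checks, session affinity), but the principle is the same: one front-end address, many machines behind it.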

With as much data as YouTube has, it would be too much to expect all servers to be able to serve all videos, so we have another layer of indirection to add: sharding. In a contrived example, one server is responsible for everything that starts with "A", another owns "B", and so on.
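In practice you'd shard on a stable hash of the video ID rather than its first letter, so the load spreads evenly. A sketch, with hypothetical shard names:

```python
import hashlib

SHARDS = ["videos-1", "videos-2", "videos-3", "videos-4"]  # hypothetical servers

def shard_for(video_id: str) -> str:
    """Map a video ID to the one server responsible for it.

    A stable hash (not Python's per-process hash()) keeps the mapping
    consistent across machines and restarts.
    """
    digest = hashlib.md5(video_id.encode()).digest()
    index = int.from_bytes(digest[:4], "big") % len(SHARDS)
    return SHARDS[index]

# Every lookup for the same ID lands on the same shard.
assert shard_for("dQw4w9WgXcQ") == shard_for("dQw4w9WgXcQ")
```

The catch with plain modulo sharding is that changing the shard count remaps almost every key, which is why large systems tend toward consistent hashing instead.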

Moving the edge closer

Eventually, though, the bandwidth just becomes intense and you're moving a LOT of data into one room. So, now that we're super popular, we move it out of that room. The two technologies that matter here are Content Delivery Networks (CDNs) and anycasting.

Where I've got these big static files being requested all over the world, I stop pointing direct links at my hosting servers. What I do instead is put up a link to my CDN server. When somebody asks to view a video, they ask my CDN for it. The CDN is responsible for either already having the video, fetching a copy from the hosting server, or redirecting the client elsewhere. Which of those happens varies with the architecture of the network.

How is that CDN helpful? Well, one IP may actually belong to many servers that are in many places all over the world. When your request leaves your computer and goes to your ISP, their router maps the best path (shortest, quickest, least cost... whatever metric) to that IP. Often for a CDN, that will be on or next to your closest Tier 1 network.
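That "best path" selection can be sketched as picking the point of presence with the lowest metric. The PoP names and latency numbers below are made up for illustration:

```python
# Hypothetical measured costs (here, latency in ms) from one client to each PoP.
POP_METRICS = {"us-east": 18, "us-west": 74, "eu-west": 95}

def best_pop(metrics: dict) -> str:
    """Pick the PoP with the lowest routing metric, roughly as a router or a
    CDN's mapping system would. Real metrics may mix hop count, latency,
    link cost, and commercial policy."""
    return min(metrics, key=metrics.get)

assert best_pop(POP_METRICS) == "us-east"
```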

So, I requested a video from YouTube. The actual machines it could be served from included iad09s12.v12.lscache8.c.youtube.com and tc.v19.cache5.c.youtube.com. Those hostnames show up in the source of the webpage I'm looking at and were provided by some form of indexing server. Now, from Maine I found that tc.v19 server to be in Miami, Florida. From Washington, I found the tc.v19 server to be in San Jose, California.

anycasted IP addresses

A single IP could be handled by any number of Autonomous Systems (a Network on the internet) simultaneously. For instance, many of the root DNS servers as well as Google's 8.8.8.8 DNS server are anycasted at many points around the globe. The idea is that if you're in the US, you hit the US network and if you're in the UK, you hit the UK network.

media coming from different server

Just because you're on www.youtube.com, that doesn't mean that all the content has to come from the same server. Right on this site, static resources are served from sstatic.net instead of serverfault.com.
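One way to picture this split: the HTML comes from the main domain, but asset URLs are rewritten to point at a dedicated static host. The host name here is hypothetical:

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical cookie-less static/CDN domain, in the spirit of sstatic.net.
STATIC_HOST = "static.example-cdn.net"

def to_static(url: str) -> str:
    """Repoint a static-asset URL at the dedicated static host,
    keeping the path, query, and fragment unchanged."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, STATIC_HOST, parts.path,
                       parts.query, parts.fragment))

print(to_static("https://www.example.com/img/logo.png"))
# -> https://static.example-cdn.net/img/logo.png
```

Serving static files from a separate domain also means browsers don't send the main site's cookies with every image request, which saves bandwidth at scale.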

multiple internet connections

I assure you, YouTube has more than one internet connection. Notwithstanding all the other techniques, even if YouTube really were a single site with a single server, it could in theory have connections to every single other network to which it was serving video. In the real world that's not possible of course, but consider the idea.

Any or all of these ideas (and more!) can be used to support a Content Delivery Network. Read up on that article if you'd like to know more.

"it could in theory have connections to every single other network to which it was serving video. In the real world that's not possible of course, but consider the idea." Why is it not possible in the real world? You can subscribe to many internet providers
–
user1034912 Mar 13 '12 at 5:58

You really want to have independent connections to more than thirty-five thousand separate networks? It's not practical.
–
MikeyB Mar 13 '12 at 6:02

You are wrong to imagine that YouTube (aka Google) has only one server; this infographic might help illustrate the scale of the system that backs that service.

Even if you only have one point of presence you can absolutely have more than one server behind a single name, and even a single IP, using tools like load balancers.

Google, though, has an awful lot of points of presence, and uses tools like anycast - a technique for publishing the same IP at multiple places on the Internet and routing people to the closest server pool announcing it - to back the infrastructure.

How does google put a million servers worldwide? Do they rent the servers? Wouldn't it be hard for them to maintain data security managing all those third party servers?
–
user1034912 Mar 13 '12 at 5:37


They own every single one of them. Seriously, they buy them - well, build them, these days. This costs as much as you would imagine in some ways, but less in others.
–
Daniel Pittman Mar 13 '12 at 5:39

@user1034912 - yes, it's staggering. But this is Google, so why the hell not? There are thousands of datacenters worldwide, Google happens to operate a tiny fraction of them.
–
tombull89 Mar 13 '12 at 8:59

I'll touch on the network side of things a bit: Google has a Point of Presence (PoP) in 73 unique datacenters around the world (not including their own). They are a member of 69 unique Internet exchanges. Google is in more datacenters and Internet exchange points than any other network listed on peeringdb.

Google's total Internet exchange capacity is >1.5Tbps. That capacity is reserved for networks exchanging more than 100Mbps of traffic with Google but less than, I'd guess, around 2-3Gbps. Once you have 'sufficient volume', you are moved to private peering (PNI).
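Those tiers can be sketched as a simple decision function. The thresholds below follow the rough figures mentioned above (100Mbps and ~2-3Gbps), but the exact cut-offs are my assumptions, not published policy:

```python
# Assumed thresholds, loosely following the tiers described above.
IX_MIN_BPS = 100e6    # ~100 Mbps: below this, plain transit is typical
PNI_MIN_BPS = 2.5e9   # ~2-3 Gbps: above this, move to a private interconnect

def interconnect_for(traffic_bps: float) -> str:
    """Pick an interconnect type for a given traffic volume with a peer."""
    if traffic_bps < IX_MIN_BPS:
        return "transit"
    if traffic_bps < PNI_MIN_BPS:
        return "public IX peering"
    return "private peering (PNI)"

assert interconnect_for(50e6) == "transit"
assert interconnect_for(500e6) == "public IX peering"
assert interconnect_for(5e9) == "private peering (PNI)"
```

The economics drive this: exchange ports are shared and cheap per peer, while a PNI dedicates a port to one network, which only pays off at high volume.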

In addition to Internet Exchange peering and private peering (with AS15169), YouTube also operates a transit network: AS43515, and another network which I assume is for paid peering/overflow, AS36040. Google also operates Google Global Cache servers, for ISPs to deploy even more locally within their network. (Data from peeringdb, bgp.he.net).

Based on my experience, I believe YouTube uses much more than just IP geolocation or anycast to choose a location to serve video from.

So to actually answer your question, from a network perspective, in order to scale like YouTube you have to make a massive investment in your network - from the fiber in the ground to the WDM gear, and the routers. You have to get the content and the network as close as possible to your users. This usually means peering, IXs, and maybe a bit of transit. You have to be able to intelligently tell users where to get the content from to keep traffic as evenly distributed and cheap as possible. And of course, you have to have the massive server infrastructure to store, process, convert, and deliver 4 billion views a day!

If you are curious about the server side, I wrote a blog post which breaks down some of the recently released datacenter images.