Networking @Scale, May 2016 — Recap

Last year, we held our first Networking @Scale, our invitation-only, one-day technical conference for engineers working on large-scale networking solutions. We received a tremendous amount of interest and positive feedback. So, this year, we decided to go even bigger with a two-day event. We hosted this second Networking @Scale on May 10 and 11 and had speakers from Akamai, AT&T, Comcast, Facebook, Google, the Jet Propulsion Laboratory (JPL), Microsoft, and Netflix. This year's event reinforced the incredible range of network challenges we all face as a community.

Day one of Networking @Scale covered the breadth of networking — showing how operators continue to innovate across the whole networking stack, including the data plane, control plane, and the management/operations plane, as well as across various network domains:

Google iterated multiple times on its B4 wide area network to improve availability and other critical aspects of the service.

AT&T relied heavily on sophisticated EVPN configuration and corresponding automation to roll out and manage services daily for its customers.

Both Netflix and Facebook had to develop significant distributed systems higher up the networking stack in order to scale their application services and leverage their underlying global networks.

The Jet Propulsion Laboratory (JPL) manages a network that is literally out of this world; it described how network operations apply when talking to rovers on Mars, where science depends critically on successful long-range networked communications.

Facebook gave a deep dive on Terragraph, its new 60 GHz, multi-node wireless system focused on bringing high-speed internet connectivity to dense urban areas, and Open/R, a new extensible distributed network application platform.

Day two of Networking @Scale went in-depth into IPv6. Presentations highlighted how the industry needs to collectively push IPv6 and how operators can better take advantage of all that IPv6 offers.

Comcast showed its IPv6 buildout and adoption, highlighting how support for legacy IPv4 will become a service on top of its native IPv6 network.

Akamai shared its experience in analyzing and visualizing the IPv6 address space, giving us all a lesson in how extensive and varied the deployments and uses are for IPv6.

Facebook gave deep dives into its experiences converting its backbone and data centers to IPv6, as well as how it worked deep within the Linux kernel itself to make sure it's ready for IPv6 at scale.

Videos and talk descriptions from the event are posted below. If you're interested in joining the next event or have topic ideas for next time, please join the Networking @Scale Facebook page. Also, we're always looking for speakers!

Day 1

Albert Greenberg set the tone for network operations @scale by describing the PingMesh and NetBouncer systems used across the Microsoft network. Finding a network problem across millions of hosts and many global data centers is like the proverbial "needle in a global, networked haystack." Albert even showed how the NetBouncer system takes the solution to the next level, with automatic remediation of the problem found by PingMesh. This is critical to allow applications to function properly across the network in spite of ever-present, individual network issues. The underlying motivation and theory behind PingMesh are available in a SIGCOMM paper by Microsoft.
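The core idea behind an all-pairs probing system like PingMesh can be sketched in a few lines. This is a toy illustration, not Microsoft's implementation: the host names, the `rtt_fn` callback, and the fixed threshold are all invented for the example (the real system runs agents on every server and applies far more sophisticated statistical analysis).

```python
import itertools

def probe_matrix(hosts, rtt_fn):
    """Probe every ordered host pair and record RTTs (hypothetical helper;
    the real system schedules probes from agents on every server)."""
    return {(a, b): rtt_fn(a, b) for a, b in itertools.permutations(hosts, 2)}

def suspect_links(matrix, threshold_ms=5.0):
    """Flag pairs whose RTT exceeds a threshold -- a crude stand-in for the
    statistical analysis the real system performs on the latency matrix."""
    return sorted(pair for pair, rtt in matrix.items() if rtt > threshold_ms)

# Toy RTTs in milliseconds: host "c" is slow to reach from everywhere.
rtts = {("a", "b"): 0.4, ("b", "a"): 0.5,
        ("a", "c"): 9.0, ("c", "a"): 8.7,
        ("b", "c"): 9.3, ("c", "b"): 9.1}
matrix = probe_matrix(["a", "b", "c"], lambda x, y: rtts[(x, y)])
print(suspect_links(matrix))  # every pair involving "c"
```

Because every host probes every other host, a faulty link or device shows up as a consistent pattern across many pairs rather than a single noisy measurement, which is what makes triangulating the "needle in the haystack" tractable.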

Djordje Tujkovic began by outlining the Connectivity Lab at Facebook and its mission to connect the world. He then went deep into the underlying motivations, science, and challenges that come with providing fiber-like speeds wirelessly and how Facebook is tackling them with Terragraph. Using WiGig-based radios in the unlicensed 60 GHz spectrum is the starting point for Facebook to provide high-speed internet to dense urban areas. Terragraph had to overcome challenges in wireless signal propagation stemming from oxygen absorption in the atmosphere, changing weather conditions, and foliage and other street-level obstructions.

Google's B4 wide area network was first revealed several years ago. The outside observer might have thought, "Google's B4 is finished. I wonder what they're going to do next." Turns out, once any network is in production @scale, there's a continued need to make it better. Subhasree Mandal covered the reality of how Google iterated multiple times on different parts of B4 to improve its performance, availability, and scalability. Several of the challenges and solutions that Subhasree detailed sat squarely at the intersection of networking and distributed systems. B4 was covered in a SIGCOMM 2013 paper from Google.

Facebook, Federico Larumbe — "Scaling Facebook Live"

Federico Larumbe went up the stack and described the challenges with distributed systems and Facebook Live. Facebook Live enables people to share their experiences and perspectives in real time with those who matter to them — whether they're someone who wants to broadcast to friends and family, or a public figure who wants to connect with fans around the world. Federico showed the unique challenges of Live traffic, with its sudden and massive bursts in bandwidth, and then covered how Facebook addressed challenges like the thundering herd problem (using request coalescing), anticipating the sudden bandwidth increases (with predictive cubic splines), and handling the varied available bandwidth that each client has with adaptive bit rate encoding.
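The thundering-herd mitigation mentioned above can be illustrated with a minimal request-coalescing sketch. This is an assumed design for illustration, not Facebook's code: when many clients miss on the same key at once, only one "leader" fetch goes to the origin, and the concurrent requests wait for its result.

```python
import threading

class CoalescingCache:
    """Minimal request-coalescing sketch: the first request for a missing key
    fetches it from the origin; concurrent requests for the same key wait on
    the in-flight fetch instead of stampeding the origin."""
    def __init__(self, fetch):
        self.fetch = fetch          # origin fetch function
        self.cache = {}
        self.inflight = {}          # key -> threading.Event for the pending fetch
        self.lock = threading.Lock()
        self.origin_calls = 0

    def get(self, key):
        with self.lock:
            if key in self.cache:
                return self.cache[key]
            event = self.inflight.get(key)
            if event is None:       # we are the leader for this key
                event = self.inflight[key] = threading.Event()
                leader = True
            else:
                leader = False
        if leader:
            self.origin_calls += 1
            value = self.fetch(key)
            with self.lock:
                self.cache[key] = value
                del self.inflight[key]
            event.set()
            return value
        event.wait()                # follower: block until the leader finishes
        return self.cache[key]

cache = CoalescingCache(fetch=lambda k: f"segment:{k}")
threads = [threading.Thread(target=cache.get, args=("live-42",)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(cache.origin_calls)  # 1: eight concurrent requests, one origin fetch
```

The same shape applies whether the "origin" is an encoding server or an upstream cache tier: the edge absorbs the burst, and the origin sees one request per segment.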

Last year, we learned about high-frequency financial trading from JPMorgan Chase and the nanoseconds that are important to that type of networking. This year, we went to the other extreme as we let Matt Damon (aka Luther Beegle) from the Jet Propulsion Laboratory take us off-planet by explaining the network operations involved in talking to the Mars rovers. When you have 24 minutes of round-trip time and your signal bounces through multiple satellite dishes and satellites in the Deep Space Network, then proper planning, monitoring, and error handling is critical. The science teams have only short windows to work in each day in terms of sending and receiving data using technology that was prepped a decade ago because of mission preparation times and long launch windows. (They also measure their throughput in late '80s-style kilobits per second.) It's inspiring to see what the science teams have accomplished a world away.
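The numbers in this talk are easy to sanity-check: the round-trip time is pure light delay over the Earth-Mars distance, which varies with the planets' orbits. A quick back-of-the-envelope calculation (the distances below are approximate orbital figures, not from the talk):

```python
C_KM_PER_S = 299_792.458  # speed of light in vacuum

def mars_rtt_minutes(distance_km):
    """Round-trip light-time for a given Earth-Mars distance."""
    return 2 * distance_km / C_KM_PER_S / 60

# Earth-Mars distance ranges from roughly 55 million to 400 million km.
print(f"{mars_rtt_minutes(55e6):.1f} min")   # near opposition: ~6 min
print(f"{mars_rtt_minutes(216e6):.1f} min")  # mid-range: ~24 min, as in the talk

# And at late-'80s-style data rates, even small payloads take a while:
print(f"{8 * 1e6 / 128_000:.1f} s")          # 1 MB at 128 kbps: ~62.5 s
```

With delays and rates like these, there is no interactive debugging; every command window has to be planned and validated in advance.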

Facebook, Petr Lapukhov — "Open/R: The Joy of Packet Routing"

Facebook announced the Open/R modular routing platform at Networking @Scale, and Petr Lapukhov outlined why Facebook felt compelled to develop a new distributed application platform instead of leveraging existing routing protocol implementations. Routing protocols were largely written in the '80s and '90s and assumed minimal compute power and little supporting software on network devices. Back then, there wasn't much of a software ecosystem to leverage for a network device. Fast-forward about 30 years, and Facebook needed to be able to move fast and innovate for Terragraph and other portions of the network. Open/R leverages modern open source libraries like ZeroMQ and Thrift to handle the mechanical encoding and data-transfer functions — making the actual routing algorithms easy to write as software modules on top. As a result, Open/R can easily support both centralized and distributed models of operation, and Facebook uses both, for Terragraph and for specific internal backbone applications in the Facebook production network.
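When the platform handles state synchronization and message encoding, a routing "module" can be little more than the algorithm itself. As a toy illustration (this is ordinary Dijkstra SPF, not Open/R code — the real platform syncs a link-state KV store over ZeroMQ/Thrift and the module consumes that view):

```python
import heapq

def shortest_paths(adj, source):
    """Dijkstra SPF over a link-state view: adj maps node -> {neighbor: metric}.
    A stand-in for the kind of routing module a platform like Open/R lets you
    write on top of its state-sync layer."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for nbr, metric in adj.get(node, {}).items():
            nd = d + metric
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                heapq.heappush(heap, (nd, nbr))
    return dist

topology = {"a": {"b": 1, "c": 4}, "b": {"c": 1}, "c": {}}
print(shortest_paths(topology, "a"))  # {'a': 0, 'b': 1, 'c': 2}
```

Run centrally over the full topology, this computes paths for the whole network; run on each node over a synchronized view, it behaves like a distributed IGP — which is exactly the flexibility the talk described.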

AT&T, Chris Chase — "Scaling Overlays and EVPN as Universal Overlay"

Chris Chase covered how AT&T operates one of the largest networks in the world in terms of access service customers, providing both wired and mobile access, supporting consumers and businesses, and enabling data/voice/cloud services. The numbers are incredible: 16 million broadband connections, 6,000 PE routers, 300,000 Wi-Fi hot spots, and more. AT&T leverages a sophisticated configuration of EVPN and BGP in order to overlay this huge number of services on its shared network. Chris shared the details of the specific configurations and topologies that make this all scale appropriately.

Netflix recently announced its expansion into the global market, and Faisal Siddiqi covered the global caching system, EVCache, that was developed to handle global and cross-region replication requirements. During EVCache development, the team had to handle many distributed systems requirements, such as reliability, low latency, and asynchrony (but interestingly, not strong global consistency). Faisal also dived into the multiple challenges they faced as they developed and iterated on EVCache — such as understanding how to best scale Apache Kafka and deal with AWS packets-per-second limits.
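The trade-off Faisal described — eventual, not strongly consistent, cross-region replication — can be modeled in a few lines. This is a toy model for illustration, not Netflix's code: the `outbox` queue stands in for the Kafka replication topic, and `drain_to` stands in for the asynchronous replication consumers.

```python
from collections import deque

class RegionCache:
    """Toy model of EVCache-style cross-region replication: writes apply
    locally, then a replication message is queued (Kafka in the real system)
    and applied asynchronously in the remote region."""
    def __init__(self):
        self.data = {}
        self.outbox = deque()   # stand-in for the Kafka replication topic

    def set(self, key, value):
        self.data[key] = value              # local write is immediate
        self.outbox.append((key, value))    # replication happens later

    def drain_to(self, remote):
        """Stand-in for the async replication consumer."""
        while self.outbox:
            key, value = self.outbox.popleft()
            remote.data[key] = value

us_east, eu_west = RegionCache(), RegionCache()
us_east.set("user:1", "profile-v2")
print(eu_west.data.get("user:1"))  # None: replication hasn't run yet
us_east.drain_to(eu_west)
print(eu_west.data.get("user:1"))  # profile-v2
```

The window between the two reads is exactly the inconsistency EVCache accepts in exchange for low-latency local writes — and why throughput limits like Kafka scaling and AWS packets-per-second caps become the operational concerns.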

Day 2

Facebook, Paul Saab — Welcome to Day 2 — IPv6 @Scale

Paul kicked off the day by sharing some insights into IPv6 around the world — starting with the challenge of the IPv6-based Wi-Fi for the Networking @Scale conference itself! Shortly before Networking @Scale, we wanted to run the Wi-Fi exactly the way mobile networks will be run in the very near future: a pure IPv6 network with NAT64 and DNS64 to reach the IPv4 internet. As projects often go, we were running short on time, and the day before, it was proposed that maybe we defer the IPv6 Wi-Fi until next year. We paused a moment and realized, "Wait, if we don't do it this year, then we'll be doing exactly the same thing that so many have done for IPv6 in general: push it off!" So the corporate networking team jumped into the problem, and several of us crossed out the old IPv4 SSID/password on the badges with a Sharpie and printed out placards with the IPv6 info. For us, this story represents our experiences with IPv6: Done is better than perfect, so just get started on it.

Our IPv6 Wi-Fi revealed that attendees had no issues accessing the internet, though some had problems with company VPN gateways that did not work on networks relying on NAT64 and DNS64 to reach IPv4-only endpoints. Our hope is that the networking-savvy attendees go back to their companies and not only push for IPv6 but also insist that applications work on IPv6-only networks, since that is the future of mobile networking.
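The NAT64/DNS64 mechanism behind the conference Wi-Fi is simple at its core: for an IPv4-only destination, the DNS64 resolver synthesizes an IPv6 address by embedding the 32-bit IPv4 address in the well-known `64:ff9b::/96` prefix (RFC 6052), and the NAT64 gateway translates traffic sent to that prefix. The synthesis step can be shown with Python's standard `ipaddress` module:

```python
import ipaddress

def dns64_synthesize(ipv4_str, prefix="64:ff9b::"):
    """Synthesize the IPv6 address a DNS64 resolver would return for an
    IPv4-only destination: the NAT64 /96 prefix with the 32-bit IPv4
    address in the low-order bits (RFC 6052)."""
    v4 = int(ipaddress.IPv4Address(ipv4_str))
    v6 = int(ipaddress.IPv6Address(prefix)) | v4
    return str(ipaddress.IPv6Address(v6))

print(dns64_synthesize("192.0.2.33"))  # 64:ff9b::c000:221
```

An IPv6-only client never sees an A record; it connects to the synthesized address, and the NAT64 gateway does the rest. Applications that hard-code IPv4 literals or do their own DNS handling — like the VPN clients that failed at the conference — bypass this machinery and break.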

Paul also shared the rapid increases some providers are seeing around the world, especially in mobile devices — sometimes 10x growth in the number of IPv6 devices in a matter of months. On the other side of the supply-demand equation, content providers themselves are still not moving as fast with IPv6 support. This disparity — where content lives on IPv4, but people access the network via IPv6 — means users aren't getting the better performance observed on IPv6, and content providers aren't taking advantage of the flexibility and scalability that IPv6 provides. The move to IPv6 is no longer about IPv4 address exhaustion; it's about providing better networking and innovative services to our customers.

Comcast, John Brzozowski — "IPv6@Comcast"

John Brzozowski has been a long-time IPv6 advocate, and he gave an overview of the advanced state of IPv6 on Comcast's network. Today, IPv6 serves the majority of Comcast's business needs, and IPv6 usage has grown to over 25 percent of its internet-facing traffic. One extremely interesting revelation from John was that IPv6 will effectively become the underlay for all services in Comcast's network, including IPv4 itself — effectively, Comcast plans to implement IPv4 as a service. This is mainly to support legacy content and endpoints, but going native with IPv6 has greatly simplified Comcast's operations and sidestepped the complexity of running the two protocols as separate offerings.

Facebook, William Collier-Byrd — "Leveraging IPv6 in the Facebook Backbone"

Will started off a trifecta of talks from the Facebook team about the history and rollout of IPv6 in the Facebook internal network. Facebook first started adopting IPv6 on its edge and backbone around the time of World IPv6 Day back in 2011; we realized an IPv4 renumbering would be painful and temporary, so the team went with IPv6 adoption — still challenging, but at least a long-term solution. Will covered the choice of an IPv6-ready IGP (the team passed on OSPF in favor of IS-IS), and then covered all the operational issues the team ran into, primarily around management and visibility for IPv6. Implementations were inconsistent in how they interpreted IPv6 addresses, which led to bounced BGP sessions; show commands, MIBs, and other debug utilities were also woefully lacking. NetFlow, one of the fundamental tools for flow visibility, had significant bugs that made it so unstable and unreliable that Facebook had to develop its own host-based flow monitoring as a workaround. The overall lessons here are (1) if you haven't already, just get started, and (2) think about your addressing plan and architecture, but then make sure you do it quickly, as you'll learn so much more in the actual doing (see No. 1).

Todd brought the Facebook IPv6 story into the data center and the different stages of IPv6 deployment. Facebook's current data center fabric is all IPv6, but a few years ago, the team faced an interesting and unexpected dilemma: We had run out of 10.0.0.0/8 space! Assigning large prefixes to each rack made all the tooling and summarization easier, but it was wasteful. Facebook tried to model IPv6 like IPv4 where possible, but it turns out it wasn't really possible because of the lack of proper support throughout the protocols. Further, Facebook decided to allocate a /64 network per rack, which seems a little excessive but winds up being efficient in terms of routing table lookups in ASICs and ECMP implementation for IPv6. Finally, Todd covered the challenges on the management plane with IPv6; traceroute, ping, SNMP, SSH, and other tools all initially had significant bugs with their IPv6 implementation.
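The scale of a /64-per-rack plan is worth seeing in numbers: unlike IPv4, where carving generous per-rack prefixes exhausted 10.0.0.0/8, a single /48 holds 65,536 rack-sized /64s. A quick illustration with Python's standard `ipaddress` module (the `2001:db8::/32` addresses are the RFC 3849 documentation prefix, used here purely for illustration):

```python
import ipaddress

# Carve per-rack /64s out of a hypothetical /48 pod prefix.
pod = ipaddress.ip_network("2001:db8:1::/48")
racks = list(pod.subnets(new_prefix=64))
print(len(racks))          # 65536 /64s fit in one /48
print(racks[0], racks[1])  # 2001:db8:1::/64 2001:db8:1:1::/64
```

Beyond sheer headroom, the fixed /64 boundary is what makes the plan cheap in hardware: routing lookups and ECMP hashing only ever need to consider the top 64 bits for infrastructure routes, which maps cleanly onto ASIC table implementations.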

Akamai has a unique viewpoint on global internet activity, as it stands between the end users of the internet and the major content providers to serve up many of the photos and videos for users. From this vantage point, Dave Plonka and his research group inspired us with their studies on how IPv6 addresses are being used across the internet, especially as a result of the great flexibility that is possible in the large IPv6 address space. For example, Akamai sees half a billion unique IPv6 addresses per day and 10 billion per month. They analyzed the stability of these addresses over multiple days and weeks, and shared how data structures like Patricia tries are useful in the classification and analysis of these addresses. Dave also shared stunning "treemap" visualizations of the IPv6 address space and introduced the concept of IPv6 address "dendrochronology." Finally, his takeaways were that IPv6 changes everything you know about IP geolocation, client reputation, threat mitigation, and monitoring/logging — there's so much more to do!
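Why prefix-based aggregation matters at this scale: billions of ephemeral addresses (privacy extensions rotate the low 64 bits constantly) often collapse into a far smaller set of stable /64s. As a crude stand-in for the trie-based classification the talk described — not Akamai's methodology — one can bucket observed addresses by covering prefix:

```python
import ipaddress
from collections import Counter

def classify_by_prefix(addrs, plen=64):
    """Bucket observed IPv6 addresses by their covering prefix -- a crude
    stand-in for the Patricia-trie aggregation described in the talk."""
    buckets = Counter()
    for a in addrs:
        net = ipaddress.ip_network(f"{a}/{plen}", strict=False)
        buckets[net] += 1
    return buckets

# Hypothetical observations (documentation prefix): two addresses share a /64.
seen = ["2001:db8:0:1::10", "2001:db8:0:1::20", "2001:db8:0:2::1"]
for net, count in sorted(classify_by_prefix(seen).items()):
    print(net, count)
```

A real Patricia trie does this aggregation without fixing the prefix length in advance, letting the structure of the observed addresses themselves reveal the stable prefixes — which is what makes it suited to an address space this sparse and varied.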

Facebook, Martin Lau — "Productionizing IPv6 in the Linux Kernel"

Martin Lau from the Facebook kernel team shared how the Linux kernel had significant scalability issues (up until recently) in supporting IPv6. He showed how a simple command like printing the IPv6 routing table was taking more than 10 seconds for a seemingly small table, and then explained the details of why this was happening. (Spoiler: The kernel was creating clone routes for every unique destination IPv6 address in order to record MTU per destination, resulting in sometimes tens or hundreds of thousands of route table entries. This slowed down any access to the table and caused TCP lockups when garbage collection ran on the table.) Martin then reviewed a number of other IPv6 issues for which the Facebook team has upstreamed fixes to make the Linux kernel more scalable for IPv6.
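The clone-route blowup is easy to picture with a toy model (this is illustrative Python, not kernel code, and the 1,500-byte PMTU and interface name are invented): one configured /64 route fans out into one cached entry per unique peer the host talks to, because that is where the old kernel stored per-destination state such as path MTU.

```python
# Toy model of per-destination clone routes: the configured table has a
# single /64 entry, but the kernel cloned a cache entry per unique peer
# address to hold per-destination state like path MTU.
routing_table = {"2001:db8::/64": "eth0"}   # the "real" table: one entry
clones = {}
for host in range(50_000):                  # 50k distinct peers...
    dest = f"2001:db8::{host:x}"
    clones[dest] = {"dev": "eth0", "pmtu": 1500}
print(len(routing_table), len(clones))      # 1 configured route, 50000 clones
```

Any operation that walks or garbage-collects the table now scales with the number of peers rather than the number of configured routes — which is exactly why a "small" table took 10+ seconds to print on a busy server.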

A big thank-you to all the speakers who presented and to all the attendees. We look forward to an even bigger and better Networking @Scale next time, and we hope to see you there! We also encourage everyone to post questions, comments, and follow-ups in the Networking @Scale Attendees Facebook group.