9:10 am–11:00 am

Few companies invest in SRE before there is a raging operational fire on their hands. As a result, SREs often start out as firefighters, desperately trying to keep the company alive for one more day. Once we put out the fires and our site is safe, we can begin evolving—but where to?

In this talk we share the five distinct stages of SRE evolution. Moreover, we'll cover the transition of roles and responsibilities between site reliability and software engineers over time, the relationship dynamics that impede progress, and the shifts in mindset that must occur along the way.

Join us, and we’ll help you transform the reactive growth of your SRE team into directed evolution.

Ubiquitous compute power has created frequent impedance mismatches between client capabilities and server capacity in distributed systems. Careful client behaviour design protects the server from unintended load and enables safe recovery after outages. These techniques improve resiliency both in microservice environments (where they protect microservices from each other) and in more traditional client-server environments (where a large number of clients such as mobile phone apps might be stacked against a comparatively small number of servers).

Attendees will learn how to identify types of requests that are potentially unsafe. They will learn about the effects of unsafe client behaviour on the server, demonstrated with pseudocode samples and simulations of the behaviour. They will learn how to modify client behaviour with techniques like back-off and jitter to achieve better results on the server side.
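As a rough illustration of the back-off and jitter techniques mentioned above, here is a minimal Python sketch of capped exponential backoff with "full jitter" (an illustrative example, not the speakers' code):

```python
import random

def backoff_delay(attempt, base=0.1, cap=30.0):
    """Capped exponential backoff with "full jitter": sleep a random
    duration drawn from [0, min(cap, base * 2**attempt)]. Randomizing
    the delay spreads retries out over time and prevents clients from
    retrying in lockstep after an outage (a "thundering herd")."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

# Delays grow on average with each attempt but never exceed the cap.
delays = [backoff_delay(a) for a in range(10)]
```

A client would sleep for `backoff_delay(attempt)` seconds before retry number `attempt`, resetting the counter on success.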

11:00 am–11:30 am

Break with Refreshments

Grand Ballroom Foyer

11:30 am–11:55 am

Monitoring—a.k.a. figuring out what production code is doing—is extremely important for an SRE organization. Monitoring services the right way can have a profound impact on how we do SRE. Modern software systems can be incredibly complex: code running on thousands of machines, depending on services we don't control, and running on user devices. Observing the behavior of such systems means we have to change how we think about monitoring.

This talk will go over what a modern monitoring infrastructure for running software at scale looks like:

Asking the right questions—how to decide what to monitor

Types of data we want to collect and what answers it can help us find

A look at how we build services at Facebook

Collecting, storing and querying monitoring data at scale

When things go wrong—what makes for a good alarm and what makes a bad one

Putting it all together—debugging an outage using data

As an attendee, you will come out of the talk with fresh ideas about logging and monitoring. You will hear how we tackle these problems at Facebook, and why we do things the way we do.

Nikola has spent the last two years at Facebook as a Production Engineer, based in Dublin, Ireland. Prior to that, he worked as a Software Engineer in several industries, ranging from small internet startups to large telcos. Due to this prolonged exposure to production software, he's developed an acute need to measure and verify.

We all know that monitoring is one of the most important topics in the field of DevOps, but sometimes we also suffer from it, through problems such as alarm storms and the high cost of deployment. In this talk, we will share how Alibaba deals with these problems.

At Alibaba, hundreds of major KPIs have been defined to measure the running status of the business. In our system, we have created a CMDB for business monitoring, called Hammurabi, which records business function points along with their priority levels and stakeholders. We map the business monitoring onto this CMDB. Thus, when an alarm comes in, we can quickly confirm the business impact based on the trend of the business indicator and mount an emergency response. An intelligent business monitoring method based on time-series analysis is already in use, and has recently improved the accuracy of our alarms from a baseline of 20% to 80%.

A case study that includes Taobao and Youku will be shared to show how business monitoring contributes to the optimization of operations and the business.
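As a minimal illustration of baseline-based business monitoring (a deliberately simple stand-in for the time-series methods the talk describes):

```python
import statistics

def detect_anomalies(series, window=24, k=3.0):
    """Flag points deviating more than k standard deviations from the
    mean of the preceding `window` points. Real systems model
    seasonality and trend; this sketch only shows the shape of the
    idea: learn a baseline, then score deviations from it."""
    flags = [False] * min(window, len(series))
    for i in range(window, len(series)):
        hist = series[i - window:i]
        mu = statistics.mean(hist)
        sigma = statistics.pstdev(hist) or 1e-9  # avoid divide-by-zero
        flags.append(abs(series[i] - mu) > k * sigma)
    return flags
```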

Striking a balance between feature delivery and reliability is a challenge many organizations face. Error Budgets, the practice of blocking feature releases when a service fails to meet Service Level Objectives, is an effective way of achieving the right balance. However, getting buy-in to adopt Error Budgets can be difficult as many see the process as heavy handed. This talk explores the agile approach Atlassian took in adopting Error Budgets to reap benefits quickly and avoid impacting dev velocity. Come hear about what compelled us to give Error Budgets a try, how we went about it, the challenges we faced, the results to date, and where we are going next with Error Budgets.

For more than a decade, Gui Vieiro has positioned organizations for success by growing and coaching technologists, developing technology strategies, and establishing practices that lead to operational excellence. Joining Atlassian in 2016 gave Gui the opportunity to enter the world of SRE and drive reliability initiatives for the benefit of millions of customers worldwide.

11:55 am–1:25 pm

Luncheon

Waterfront & Riverfront Ballrooms
Sponsored by Baidu

1:25 pm–2:50 pm

Many companies want to grow a site reliability engineering team, but first need to ask "Is my company ready for SRE?" Taking the lid off one of the most mature SRE organizations, I'll describe the cultural tenets that provide the foundation for the high-trust and inclusive environment necessary for any SRE team to exist and evolve.

Todd Palino is a Senior Staff Site Reliability Engineer at LinkedIn, tasked with keeping Zookeeper, Kafka, and Samza deployments fed and watered. He is responsible for architecture, day-to-day operations, and tools development, including the creation of an advanced monitoring and notification system. Previously, Todd was a Systems Engineer at Verisign, developing service management automation for DNS, networking, and hardware management, as well as managing hardware and software standards across the company.

In his spare time, Todd is the developer of the open source project Burrow, a Kafka consumer monitoring tool, and can be found sharing his experience on both SRE and Apache Kafka at industry conferences and tech talks. He is also the co-author of Kafka: The Definitive Guide, now available from O'Reilly Media. When everything else is not keeping him busy, you'll find him out on the trails, training for his next marathon.

The goal of a Site Reliability Engineer is to create a reliable, scalable, performant service. To drive the right results, you need the right measures. Enter Service Level Objectives.

A good SLO starts with your customers. Next, understand the role your service plays in your business. Then turn that understanding into numbers. Dive into the data to discover where diminished reliability and performance harm user experience and business results. Only then can you set good objectives.

Good SLOs will mirror your users' and business needs. When your users hurt, you want to know it. By feeling your users' pain you can discover the levels of service that hit the sweet spot of reliability, product success, and operability. It's empathy, but measured and quantified.

This talk will walk you step by step through finding, gathering, and understanding the inputs that inform good SLOs. We will review different classes of SLO and when to use them, with real examples to guide you. When you leave this talk, you'll be armed with practical advice to measure, validate, and improve the reliability of your products.
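The error-budget arithmetic behind an availability SLO can be sketched in a few lines (an illustrative example, not Indeed's implementation):

```python
def error_budget_remaining(slo_target, total_requests, failed_requests):
    """Fraction of the error budget left in a window. The budget is the
    number of failures the SLO permits: (1 - slo_target) * total.
    A 99.9% SLO over 1,000,000 requests allows 1,000 failures, so
    250 failures leaves 75% of the budget."""
    allowed = (1 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed

remaining = error_budget_remaining(0.999, 1_000_000, 250)
```

A negative result means the SLO has been violated for the window, which is typically the trigger for slowing feature releases.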

Ketan Gangatirkar is the Vice President of Engineering for Indeed's Job Seeker products. For the last 9 years, he's been helping millions of people get jobs. He has broken Indeed's site in dozens of different and creative ways over the years and has finally learned what not to do. For a time, he was responsible for the SRE organization at Indeed, helping the company evolve from centralized operations to a faster, more independent, and more scalable model, so that people like Ketan can't break the site anymore.

Chris enjoys all the weird bits of computing that fall between building software users enjoy and running distributed systems reliably.
All his programs are made from organic, hand-picked, artisanal keypresses.

Application metrics and distributed traces are immensely powerful for developers, but are difficult to automatically retrieve. Based off of the same technology used at Google, OpenCensus is an open source project that aims to make the collection and submission of app metrics and traces easier for developers.

In this talk you will learn about:

The benefits of traces and metrics, and how we use them at Google

The case for a common instrumentation implementation

An architectural overview of OpenCensus, including integrations and exporters

Introspection via z-pages

Our vision of the future for instrumentation

While the Census project originates from Google, it has evolved into an open source collaboration between multiple cloud and APM vendors and the OSS community, and already supports Prometheus, Zipkin, Stackdriver, and SignalFx.

One of the most important tasks for SREs is troubleshooting the problems causing KPI degradation, such as decreases in PV, advertisement income, or feed click rate.

Many of these problems only affect a portion of the incoming traffic. If the on-call engineers can learn the characteristics of the affected portion, such as the traffic source area, browser type, or access network standard, then diagnosis is accelerated.

Therefore, we mark a set of tags on each user request. When a failure happens, we look for the common points among the faulty requests. This generates a huge amount of tagged data, which increases the search scope and leads to low efficiency in troubleshooting; automatic analysis is therefore imperative.

In this talk, we will present our work in Baidu where we apply machine learning techniques to recommend the tags most relevant to the failure. This approach adopts unsupervised anomaly detection and entropy-based dimension reduction techniques, which can automatically recommend key data features for troubleshooting. The proposed approach has been validated by hundreds of real cases. It significantly speeds up the troubleshooting procedure when compared to traditional approaches.
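A toy sketch of the entropy idea: a tag whose values are highly concentrated among faulty requests (low entropy) is a promising troubleshooting lead. This is illustrative only; Baidu's system also applies anomaly detection and contrasts faulty traffic against normal traffic.

```python
import math
from collections import Counter

def tag_entropy(values):
    """Shannon entropy (bits) of a tag's value distribution."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def rank_tags(faulty_requests, tags):
    """Rank tags so those whose values are most concentrated among the
    faulty requests come first: if nearly all faulty requests share one
    browser type, that tag's entropy is low and it ranks highly."""
    scores = {t: tag_entropy([r[t] for r in faulty_requests]) for t in tags}
    return sorted(tags, key=lambda t: scores[t])
```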

Kafka at LinkedIn processes over 3 trillion messages a day across more than 2,000 Kafka brokers. At such a scale, maintaining a balanced workload on Kafka clusters as they go through irregular traffic patterns and hardware failures is a daunting task. SREs at LinkedIn expend significant time and effort handling these curveballs and making sure hardware resources are utilized evenly, which made it evident that intelligent automation was crucial to scale any further. This talk outlines LinkedIn's approach to solving this problem with the help of Kafka Cruise Control.

2:50 pm–3:20 pm

Break with Refreshments

3:20 pm–5:15 pm

Every SRE team attempting to manage, mitigate or eliminate the risks facing their system will encounter two fundamental problems:

As humans our intuitive judgement about risk is unreliable.

The work required to address all potential risks far outstrips our available time and resources.

The CRE team (Customer Reliability Engineering—a group of Google SREs who partner with cloud customers to implement SRE practices in their application and across the cloud provider/customer relationship) battles these challenges every day in our interactions with customers. We have drawn on Google’s deep experience managing reliable systems, and the broader field of risk management techniques to develop a process that allows us to communicate an objective ranking of risks and their expected cost to a system. This ranking and the associated cost data can then be used as an input to team and business decision making.

This talk will cover the development of our process, explain how anyone can apply it to any system today and demonstrate how the resulting ranking and costs provide objective, consistent data which can take the tension and subjectivity out of often tense discussions around work priorities and focus (e.g. more features or more reliability?).
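One way to make such a ranking concrete is to estimate each risk's expected "bad minutes" per year from its frequency, time to detect, time to repair, and user impact. This is a sketch in the spirit of the process described; the numbers and the exact model here are hypothetical.

```python
def expected_bad_minutes_per_year(freq_per_year, ettd_min, ettr_min, impact_fraction):
    """Expected user-impacting downtime per year for one risk:
    how often it happens, times how long it lasts (estimated time to
    detect plus estimated time to repair), times the fraction of users
    affected. Comparable numbers replace gut-feel prioritization."""
    return freq_per_year * (ettd_min + ettr_min) * impact_fraction

# Hypothetical risk register, ranked by expected annual cost:
risks = {
    "bad config push": expected_bad_minutes_per_year(12, 5, 25, 1.0),
    "zone outage": expected_bad_minutes_per_year(0.25, 2, 120, 0.33),
}
ranked = sorted(risks, key=risks.get, reverse=True)
```

The ranked list can then be compared directly against the error budget an SLO allows.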

Matt began his SRE career with Google in Dublin in 2007, shifted to London in 2012, and since 2016 has worked remotely from Cambridge in New Zealand.
During this time Matt has worked on and led a range of diverse SRE teams, from Google's internal corporate infrastructure through to the Internet-facing load-balancing infrastructure that keeps Google fast and always available.
In his current role with the Customer Reliability Engineering team, he is pioneering the application of SRE practices across organisations to address the challenges posed by today's world, where the traditional boundaries between platforms and their customers are being blurred.

SREs are software engineers with a broad skill set who work with systems in general. Depending on the type of work and the team, much of our time is spent correlating incident data to determine the causes of issues. While we use ELK, Splunk, etc. to visualize our logs, it's an essential skill to parse log files by hand and visualize them to make useful observations quickly. Many times, we end up writing APIs and command-line shortcuts to accelerate our debugging. We can use some of the techniques I'll show to visualize this data quickly.

Data is in abundance. Knowing where to focus requires a set of skills, and visualizing data is one of the most essential.
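For example, a throwaway terminal histogram is often the fastest way to see where log data is concentrated (an illustrative sketch):

```python
from collections import Counter

def ascii_histogram(values, width=40):
    """Render counts of `values` as a quick terminal bar chart --
    the kind of disposable visualization that is often faster than
    loading logs into a full dashboarding tool."""
    counts = Counter(values)
    peak = max(counts.values())
    lines = []
    for key, n in counts.most_common():
        bar = "#" * max(1, round(n / peak * width))
        lines.append(f"{key:>8} {bar} {n}")
    return "\n".join(lines)

# e.g. HTTP status codes pulled out of an access log:
statuses = ["200"] * 120 + ["500"] * 15 + ["404"] * 30
chart = ascii_histogram(statuses)
```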

SRE is often perceived as an emergency response function—dealing with incidents and restoring system health. While it is true that there is much useful work done in this space, reactive processes impact only the MTTR, not the MTBF. Once the low-hanging fruit of detection and remediation improvements is gone, improvement takes more and more investment. At this point, I'd argue, it is time to start preventative work.

We saw some reduction in incident rates through establishing a post-mortem process, but these often involved only the poor souls who happened to have been called upon to help triage and fix the issue. It dawned on me that attempting to evangelise to engineers with little influence on the balance of functional/non-functional development effort was going to have limited success.

By embarking on a campaign evangelising reliability (similar to the way forest fire prevention or health promotion campaigns work) and targeting the right level within the organisation, we're seeing positive cultural and behavioural changes and better operational morale, and we believe it will result in fewer severe outages.

Operating containerized infrastructure brings with it a new set of challenges. How do you instrument containers? How do you evaluate API endpoint performance? How do you identify bad actors in your infrastructure?

The Istio service mesh enables instrumentation of APIs without code change. Istio provides service latencies for free; how can you make sense of all that data? With math, that's how. I will demonstrate the use of mathematical techniques to ask and answer business queries. I'll show how to create RED (Rate, Errors, Duration) dashboards that provide insight into API performance; they are essential for meeting service level objectives. I'll also show how to monitor at scale cost-effectively with histograms, which preserve metric fidelity and enable statistical analysis.

This talk is targeted at K8s developers and SREs who are faced with the challenge of reporting to business decision makers. Attendees will come away with the know-how to answer the questions posed in this description, an understanding of their infrastructure performance, and the ability to determine whether they are under- or over-provisioned.
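To illustrate why histograms preserve fidelity: percentiles can be estimated from bucketed counts after the fact, and bucket counts can be merged across hosts, which pre-computed percentiles cannot. A minimal sketch using linear interpolation within a bucket (bucket bounds here are hypothetical):

```python
def quantile_from_histogram(buckets, q):
    """Estimate the q-th quantile from a latency histogram given as
    (upper_bound_ms, count) pairs in ascending order, interpolating
    linearly within the bucket containing the target rank."""
    total = sum(c for _, c in buckets)
    target = q * total
    lower = 0.0
    seen = 0
    for upper, count in buckets:
        if seen + count >= target and count > 0:
            frac = (target - seen) / count
            return lower + frac * (upper - lower)
        seen += count
        lower = upper
    return lower

# 50 requests under 10 ms, 40 between 10-100 ms, 10 between 100-1000 ms:
buckets = [(10, 50), (100, 40), (1000, 10)]
p95 = quantile_from_histogram(buckets, 0.95)
```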

Fred implemented the first external metrics adapter for the Istio service mesh to monitor Docker-based services using Circonus. He is actively involved in connecting with Circonus' users and engineers at a technical level, as well as developing code bridges between Circonus and external systems. Fred is a recovering Perl and C programmer, and these days likes to hack in Go and is learning Lua. He is a 2013 White Camel award winner, an Apache Software Foundation member, and an engineer at Circonus.

Changes and updates are a major source of service faults. At Baidu, around 54% of faults are introduced by changes. As a result, progressive rollout becomes imperative to improve service stability. Progressive rollout divides the deployment process into several stages, each of which deploys the change to only a subset of the instances. Checks are applied between consecutive stages to detect faults. If a fault is detected, the deployment is terminated and rolled back.
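The staged deploy-check-rollback loop can be sketched as follows (a minimal illustration; `deploy`, `check`, and `rollback` stand in for real deployment and verification hooks):

```python
def progressive_rollout(instances, stage_sizes, deploy, check, rollback):
    """Deploy a change in stages: push to a subset of instances, run a
    health check over everything deployed so far, and roll everything
    back on the first failed check. Returns True on full rollout."""
    done = []
    i = 0
    for size in stage_sizes:
        stage = instances[i:i + size]
        for inst in stage:
            deploy(inst)
        done.extend(stage)
        i += size
        if not check(done):          # fault detected between stages
            for inst in reversed(done):
                rollback(inst)
            return False
    return True
```

The quality of `check` is exactly the problem the rest of this abstract addresses: hand-written rules produce either missed faults or false alarms.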

Intuitively, we can build a rollout system that lets development engineers specify checking rules for each stage. Surprisingly, however, the devs are not good at this, even though they are the creators of the modules. The reliability engineers are therefore forced to add rules on stability indicators, but this leads to numerous false alarms, stalling the release procedure frequently. As a result, we turned to machine-learning-based methods. To obtain satisfying results, the algorithm must be able to learn the "normal" changes of each indicator and quantitatively measure current changes to decide whether there is a fault.

In this talk, we will present several real cases to demonstrate the dilemma we confront in rollout checking, and how the machine learning algorithm works.

Pingping Xue is a Senior SRE in the SRE Department of Baidu. She has worked on release efficiency and stability for four years, helping to construct the progressive rollout mechanisms and avoid many release faults. Her work improved Baidu's release efficiency and accelerated product iteration significantly.

Yu Chen is a Data Architect in the IOP group of Baidu's Cloud Unit. His work focuses on service stability issues, including alerting and diagnosis. Previously, he worked at Microsoft Research Asia. His research interests are distributed systems, consensus protocols, search ranking, and query recommendation.

What is considered "good communication" differs between cultures. In some cultures, "good communication" means being as explicit as possible, and the responsibility for conveying the message lies with the person communicating. In other cultures, "good communication" is more implicit, and the responsibility for understanding the message falls on the person receiving it.

When I started working with Americans, I often heard "What do you mean?" in meetings. In India, this is considered rude, as the question is perceived as a challenge to what was said; however, the person is really just looking for more information. It would have been better to say, "I didn't quite understand." In America, the former question is considered "good communication" because the person is being direct and explicit, but in India it would be perceived as rude.

Americans are also taught to give negative feedback in a positive frame. In many European countries, good feedback is direct feedback. When an American manager gives negative feedback to a European employee, there is a high probability that the employee won't recognize the feedback, since they are used to receiving it directly. This may leave the manager wondering why the employee has not improved even after receiving the feedback.

Hailing from Kerala, India, I have around 15 years of experience in the IT industry. Based out of Bangalore, India, I have led the Grid SRE team at Yahoo! and the Data SRE team at Intuit. Currently, I lead all of LinkedIn's Data SRE teams in Bangalore. I am deeply passionate about SRE and anything data.

Over the years, Cloudflare has built a huge network: today we have over 5,000 servers in 120 data centers around the world and operate a dozen workloads in our infrastructure, including Nginx, Kafka, Mesos, ClickHouse, Prometheus, ElasticSearch, Hadoop, etc. But for all the diversity in software stacks, our hardware fleet still shared one common thing—CPU architecture: we, like most other SaaS companies around the world, exclusively used x86-based CPUs.

Recently we began exploring a second CPU architecture for our hardware. The obvious choice was ARM, because it is the second most popular architecture in the world thanks to the boom in the mobile and IoT markets. Initially we wanted to evaluate the cost-effectiveness of running a different CPU architecture in our data centers and how easily we could avoid vendor lock-in. The recently published Meltdown and Spectre attacks gave us even more reasons to pursue this goal.

After preliminary tests and some synthetic benchmarks showed promising results, the question came: so what's next? The answer: we need to put this into the wild...

The talk provides an overview of potential steps and pitfalls of adopting a second CPU architecture in your cloud.

Ignat is a systems engineer at Cloudflare working mostly on platform and hardware security. Ignat’s interests are cryptography, hacking, and low-level programming. Before Cloudflare, Ignat worked as senior security engineer for Samsung Electronics’ Mobile Communications Division. His solutions may be found in many older Samsung smart phones and tablets.

Randomized load balancing is a common strategy to distribute requests across a server farm. When M requests are randomly assigned to N servers, every server is on average responsible for M/N requests. But in practice the distribution is not uniform: how many requests does the busiest server receive, relative to the average? This is the "peak to average load ratio". It is an important quantity that describes how much capacity is “wasted” when a system is provisioned for peak load. The closer this ratio is to one, the more uniform the utilization of servers is.

Applied to the "requests to servers" scenario, the first theorem gives a closed-form expression for the number of requests hitting the busiest server (with high likelihood). One somewhat surprising conclusion is that the peak-to-average ratio gets worse if the number of requests and the number of servers grow proportionally. The second theorem analyzes how effective small, fast caches can be, and how effective they remain as a system scales in size.
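The peak-to-average ratio is easy to observe in a small simulation (illustrative; the talk's theorems make this precise):

```python
import random

def peak_to_average(m_requests, n_servers, seed=0):
    """Randomly assign M requests to N servers and report the ratio of
    the busiest server's load to the average load M/N. A ratio of 1.3
    means provisioning for peak requires 30% more capacity than the
    average load would suggest."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    load = [0] * n_servers
    for _ in range(m_requests):
        load[rng.randrange(n_servers)] += 1
    return max(load) / (m_requests / n_servers)

# Even at 100 requests per server on average, the busiest server is
# noticeably above the mean:
r = peak_to_average(10_000, 100)
```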

Chaos engineering is the practice of conducting thoughtful, planned experiments designed to reveal weaknesses in our systems. This talk will share how you can get started practicing Chaos Engineering in your organization.

Ana is a Software Engineer living in San Francisco. She is currently working as a Chaos Engineer at Gremlin, helping companies avoid outages by running proactive chaos engineering experiments. She last worked at Uber where she was an engineer on the SRE and Infrastructure teams specifically focusing on chaos engineering and cloud computing. Catch her tweeting at @Ana_M_Medina mostly about traveling, diversity in tech and mental health.

Thursday, 7 June 2018

8:00 am–9:00 am

Continental Breakfast

Grand Ballroom Foyer

9:00 am–10:55 am

When you first build services in one data center, deployments and changes are easy. As the number of services increases, you spend more and more time on service management and governance.

When we package all the services into one cloud offering and need to deploy it in hundreds of data centers, we face new problems. We need to evaluate the capacity, data center, and even network requirements of each service, and then initialize the new data center based on that evaluation. Normally this requires experts from the data center, network, and software development teams for one or two months. How can all of this be done automatically, based only on the capacity planning artifacts? In this talk, I'll share some of how we build and ship Apsara Stack, a dedicated cloud offering from Alibaba, to data centers from different vendors.

Specifically:

Full data center capacity planning based on measures from the users' perspective.

A unified deploy model for network, operating systems and various cloud products.

Maintaining lightweight, easy-to-change production configuration instead of providing different toolsets.

Validating and testing all changes automatically whenever a deployment changes.

Xiaoxiang is the Technical Lead of the Apsara Infrastructure Team at Alibaba, responsible for the technical infrastructure of the Alibaba eCommerce platform and Alibaba Cloud. He studied Electronic Engineering at Tsinghua University, Beijing.

Don't have time to write automated tests for your infrastructure code? Don't see the point? Or don't know where to start? This talk is for you.

Now that we're writing code to manage our infrastructure with tools like Puppet, Chef, Ansible, etc., we are effectively developing software. One of the wonderful aspects of this is that we have the world of software development quality best practices to draw on in order to achieve a high rate of change while not compromising on reliability. Writing tests for infrastructure code (and having them execute automatically as part of a continuous integration pipeline) is a key element of this, and is the focus of this talk.

But how do you get started on this? What are some tools to help? How should we think about this problem? This talk will provide an overview of the different types of tests that can be written, from small unit tests to integration and acceptance testing. It will focus on integration testing where existing monitoring checks can come in handy, or at least provide a crossover or an entry point. In some cases the tests can also be used as checks in the monitoring system.
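A unit test for infrastructure code can be as small as pinning down the contract of a config-rendering function. The example below is hypothetical and not tied to any specific tool:

```python
def render_vhost(domain, upstream_port):
    """Tiny stand-in for infrastructure code: render an nginx-style
    virtual host block (hypothetical example)."""
    if not 0 < upstream_port < 65536:
        raise ValueError(f"invalid port: {upstream_port}")
    return (
        f"server {{\n"
        f"  server_name {domain};\n"
        f"  location / {{ proxy_pass http://127.0.0.1:{upstream_port}; }}\n"
        f"}}\n"
    )

# A unit test pins down the contract before any real server is touched:
def test_render_vhost():
    cfg = render_vhost("example.com", 8080)
    assert "server_name example.com;" in cfg
    assert "proxy_pass http://127.0.0.1:8080;" in cfg
    try:
        render_vhost("example.com", 0)
        assert False, "expected ValueError for an invalid port"
    except ValueError:
        pass
```

Run in CI, tests like this catch a whole class of config typos before they ever reach a machine.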

Shopify is one of the largest commerce web sites in the world, with over 500,000 merchants including Kylie Jenner and Kanye West. In 2017, we made the decision to move from primarily co-located data centres to the cloud.

This talk will dig into why we made the decision to abandon the DC, one that may interest other companies considering the same move. We'll carefully talk through each step of the process—how we planned, managed and executed the migration.

We'll also dive deep into the tooling we built to make this possible: a tool for performing zero-downtime shard failovers; a live shop mover to migrate shops between shards; among others. These tools are what allowed us to successfully perform the migration with almost no downtime for our merchants.

We'll also go into our performance tuning methodologies and capacity planning process. And of course, no project executes as planned, so we'll also share some of the problems we encountered and lessons learned along the way.

Scott Francis is a senior production engineer lead at Shopify, focusing primarily on reliability, scalability, and performance. He'll take any opportunity to jump into gdb or debug a core dump. He enjoys cooking and sometimes dog walking in what little free time he has.

This talk goes a bit beyond traditional SRE tasks, shedding light on the story of LinkedIn Lite, which is now the default mobile web experience in developing countries. We start by describing the product constraints and how, as SREs, we help debug, monitor, and often contribute code to the product to make sure it's highly reliable, available, and performant.

We describe our hybrid model of SSR (Server-Side Rendering) and CSR (Client-Side Rendering), which makes the application load fast (under 6 seconds) and perform smoothly even on low-end devices. We also share our experiences implementing a Progressive Web Application, its merits, and the risks involved, and evaluate whether this level of performance is possible with JavaScript frameworks.

We then dive into the Android app, whose size is half that of the default Hello World application on Android. The talk also emphasizes how SREs can contribute to production-level mobile application code. Finally, we look at how to monitor and optimize "lite" applications (applications which are primarily webview-based).

Anoop is a Site Reliability Engineer on the LinkedIn India Products SRE team, which handles products developed in India such as LinkedIn Lite, LinkedIn Placements, and a number of relevance-related services.
He is also one of the major contributors to the LinkedIn Lite Android app, which is less than 1 MB in size.

Break with Refreshments

11:25 am–11:50 am

In this talk, I will discuss the different aspects of incident management and how we've built automation around each part at Xero. This includes: transforming manual alerts into automatic notifications through an issue report pipeline; a chat bot that streamlines incident coordination by facilitating effective communication and providing guidance through the process; and how we extract data from each incident for postmortem review. I'll also discuss how our tools have evolved and the lessons we learned on the way.

Karthik is a Senior Site Reliability Engineer at Xero. He's been based in the Auckland (New Zealand) office since 2016. Previously, he worked as a computer systems researcher and an enterprise server infrastructure consultant, both in New Zealand and the United Kingdom.

It's very easy to launch a debugger on your dev box, attach to the right process and step through code. However, things are different when you need to debug an issue in production that's getting tens of thousands of requests per second. What if the issue reproduces only in production? How do you debug without affecting production traffic? What techniques can you use in your development to make it easier to debug issues? Does your application use tracing? What debug logs are written out to aid in analysis?

This talk will cover:

Challenges with debugging in production

Various approaches that are used in the industry

Examples from Bing & Cortana incidents and steady state problems to illustrate the techniques

Kumar works at Microsoft and has been in the online services world for several years. He currently runs the Bing and Cortana Live site/SRE team. For the last several years, he has focused on growing the culture around live site quality, incident response and management, service hardening, availability, performance, capacity, SLA metrics, DRI/SRE development and educating teams on how to build services that run at scale.

1:20 pm–2:45 pm

Have you thought that your model trained on a Monday might not work on Saturday? Or that the model that you trained on users in Florida might not work for all Spanish-speaking users? In this talk, we present lessons learned from deploying and productionizing ML systems across various products at Google.

Carlos Villavieja is a Computer Architect/Researcher working as a Software/Site Reliability Engineer at Google.
He works on Storage optimizations and his interests vary from micro-architecture to machine learning.

Humans are slow, unreliable, and hard to train. Azure has saved many millions of downtime minutes by using a knowledgeable and intelligent bot that enhances and automates impact assessment, mitigation, and problem management in your incident management.

You will learn about how to effectively run your outages and the strategy that we used to ensure that our solution was what our users wanted and would in fact lead to the immense time and cost savings that we predicted. We will share our guiding principles and lessons learned along the way.

Cezar Guimaraes is a Site Reliability Engineer Lead on the Microsoft Azure team. He has more than 15 years of experience and has worked at Microsoft for 12 years as a Software Engineer. Currently, he is working on Azure to identify and resolve problems that stand in the way of service uptime through engineering solutions such as bots and intelligence/autonomous engines.

An SRE team is responsible for availability, performance, efficiency, change management, emergency response, and capacity planning. As the need for high-availability systems grows, demands for and from SREs grow further. To support these demands, SREs need an efficient and effective ecosystem around them to ensure we deliver what we commit, and commit what is needed, in a timely manner.

Isha Ganeriwal is presently working at LinkedIn, India, as a Senior Technical Program Manager for the SRE organization. Isha has more than 10 years of experience in the field of project/program management and has worked with data analytics and report engineering teams in the past as a program manager. She has been associated with LinkedIn's SRE organization since last year and has had the opportunity to take a close look at SRE day-to-day functions and work through the evolution of SREs. Her vision is to create an efficient and effective operating system for SREs.

Client isolation is an important consideration for the reliability of Google Maps. We want to avoid becoming overloaded where possible and degrade gracefully otherwise. Following some user-visible outages in the area, Geo SRE began working on implementing client isolation.

But these things are never easy. There are multiple points in the stack where you can drop traffic so where should you do it and what are the tradeoffs? Are all requests created equal? What if your system changes partway through? Why doesn't exempting important traffic make sense? What other unexpected benefits does client isolation give you? All this and more!

Attendees will learn what goals you can set for traffic management, how to identify characteristics of your traffic and system architecture to leverage, and the strategies that can be used to design the solution.
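
A common building block for client isolation is per-client admission control, so one misbehaving caller cannot consume capacity reserved for everyone else. The sketch below is a minimal token-bucket rate limiter; the names and quota values are illustrative assumptions, not Google Maps' actual implementation:

```python
import time
from collections import defaultdict

class TokenBucket:
    """Per-client token bucket: each client gets an isolated quota,
    so one misbehaving caller cannot starve the others."""
    def __init__(self, rate, burst):
        self.rate = rate          # tokens refilled per second
        self.burst = burst        # maximum bucket size
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        # Refill proportionally to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # over quota; only this client is rejected

# Hypothetical per-client quotas.
buckets = defaultdict(lambda: TokenBucket(rate=10, burst=20))

def handle(client_id, request):
    if not buckets[client_id].allow():
        return "429 Too Many Requests"  # other clients are unaffected
    return "200 OK"
```

Dropping traffic this early, at admission, is only one of the points in the stack the talk contrasts; the same bucket structure can sit in a proxy, a load balancer, or the backend itself.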

2:45 pm–3:15 pm

Break with Refreshments

3:15 pm–5:10 pm

A monitoring system is vital for service stability and availability. To support Baidu's massive services and machine fleet, the number of metrics collected has grown to 1 billion. These metrics must be stored in a reliable and efficient database that supports real-time insertion of new data and a variety of queries, from aggregation and alerting to reports and visualization, at diverse time granularities.

Our time-series database (TSDB) consists of three layers: a memory database based on Redis stores hot data, HBase stores warm data, and HDFS stores cold data. To achieve efficient insertion, we extensively apply batching and asynchronous methods on the write path, in addition to HBase's high-throughput writes. To improve read performance, we designed a specialized data model and embedded a multi-layer down-sampling mechanism into HBase. The memory database incorporates compression techniques to serve real-time, frequent, small queries while keeping memory consumption at a reasonable level. All data are backed up to a separate HDFS cluster to support offline analysis.

In this talk, we will explore the challenges of large scale time-series processing, and introduce our practice of building TSDB. We will also share some successful experiences, such as retention policy, and trade-offs between cost and performance.
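
The three-layer routing and down-sampling described above can be sketched as follows. The age cut-offs, function names, and averaging aggregation are illustrative assumptions, not Baidu's actual configuration:

```python
import time

# Hypothetical age cut-offs for the three storage tiers.
HOT_WINDOW = 3600 * 2          # last 2 hours served from memory (Redis)
WARM_WINDOW = 3600 * 24 * 30   # last 30 days served from HBase

def choose_tier(ts, now=None):
    """Route a query for timestamp `ts` to hot, warm, or cold storage."""
    now = time.time() if now is None else now
    age = now - ts
    if age <= HOT_WINDOW:
        return "redis"
    if age <= WARM_WINDOW:
        return "hbase"
    return "hdfs"

def downsample(points, granularity):
    """Pre-aggregate raw (ts, value) points into fixed-width buckets,
    as a multi-layer down-sampling scheme would store alongside raw
    data so that coarse-granularity queries avoid scanning raw points."""
    buckets = {}
    for ts, value in points:
        key = ts - ts % granularity  # bucket start time
        buckets.setdefault(key, []).append(value)
    return {k: sum(v) / len(v) for k, v in buckets.items()}
```

A query spanning tiers would be split by age, each fragment served at the coarsest granularity that satisfies it.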

Software Fault Isolation, or SFI, is a way of preventing errors or unexpected behavior in one program from affecting others. Sandboxes, processes, containers, and VMs are all forms of SFI. SFI is a deeply important part of not only operating systems, but also browsers, and even server software.

The ways in which SFI can be implemented vary widely. Operating systems take advantage of hardware capabilities, like the MMU (Memory Management Unit). Others, like processes and containers, use facilities provided by the operating system kernel to provide isolation. Some types of sandboxing even use a combination of the compiler and runtime libraries in order to provide safety.

Each of the methods of implementing SFI has advantages and disadvantages, but we don't often think of them as different options toward a similar end goal. When we consider the growing prevalence of things like edge computing and the "Internet of Things", our common patterns start to falter.

In this talk, we'll focus on how sandboxing compilers work. There are important benefits, but also major pitfalls and challenges to making it both safe and fast. We'll talk about machine code generation and optimization, trap handling, memory sandboxing, and how it all integrates into an existing system. This is all based on a real compiler and sandbox, currently in development, that is designed to run many thousands of sandboxes concurrently in server applications.

Tyler McMullen is CTO at Fastly, where he’s responsible for the system architecture and leads the company’s technology vision. As part of the founding team, Tyler built the first versions of Fastly’s Instant Purging system, API, and Real-time Analytics. Before Fastly, Tyler worked on text analysis and recommendations at Scribd. A self-described technology curmudgeon, he has experience in everything from web design to kernel development, and loathes all of it. Especially distributed systems.

When it comes to high availability (HA) or user experience, people often think about the stability of backend services or product design. Network connectivity, especially Internet connectivity, is neglected, perhaps because of the impression that the network is usually stable. Our observations show that network failures are far from scarce, at least in China. Every week, we detect 3-5 PoP failures and 5-10 backbone failures that break connectivity at the province level. Most of these failures can be remedied by modifying DNS settings to bypass the broken path. The remediation depends on two systems: a detection system and a traffic scheduling system.

The detection system must detect failures precisely and promptly. Besides dedicated monitoring agents, we recruit volunteer agents to improve coverage and timeliness. Dealing with their heterogeneity and unpredictable presence is crucial to detection performance.

The traffic scheduling system is responsible for rerouting traffic to a working path. It must consider not only the connectivity of external network links, but also users' experience and the load on target IDCs.

In this talk, we will introduce how to implement and use the above two systems to handle Internet connectivity failures.
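
One way a detection system can tolerate flaky volunteer agents is quorum-style aggregation: declare a path failed only when enough independent vantage points agree. The sketch below is an assumption about how such aggregation might look, not the speakers' implementation; thresholds and names are illustrative:

```python
from collections import defaultdict

def detect_failures(reports, min_agents=3, fail_ratio=0.5):
    """Aggregate probe reports from heterogeneous agents.

    `reports` is a list of (agent_id, path, ok) tuples, where `path`
    identifies a PoP or backbone link. A path is declared failed only
    when enough distinct agents reported on it AND a majority of them
    saw a failure, guarding against a single flaky volunteer agent
    raising a false alarm.
    """
    seen = defaultdict(set)    # path -> agents that probed it
    failed = defaultdict(set)  # path -> agents that saw a failure
    for agent, path, ok in reports:
        seen[path].add(agent)
        if not ok:
            failed[path].add(agent)
    return [
        path for path in seen
        if len(seen[path]) >= min_agents
        and len(failed[path]) / len(seen[path]) >= fail_ratio
    ]
```

The confirmed failure list would then feed the traffic scheduling system, which picks a detour path subject to IDC load.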

At Facebook, we created a new MySQL storage engine called MyRocks (https://github.com/facebook/mysql-5.6). Our objective was to migrate one of our main databases (UDB) from compressed InnoDB to MyRocks and cut the amount of storage and number of servers used in half. In August 2017, we finished converting UDB from InnoDB to MyRocks. The migration was very carefully planned and executed, and it took nearly a year. But that was not the end of the migration: SREs needed to continue to operate MyRocks databases reliably. It was also important to find production issues and mitigate or fix them before they became critical. Since MyRocks was a new database, we encountered several issues while running it in production. In this session, I will introduce several interesting production issues that we faced and how we fixed them. Some were very hard to predict.

Attendees will learn:

What MyRocks is, and why it was beneficial for large services like Facebook

Yoshinori Matsunobu is a Production Engineer at Facebook, leading the MyRocks project and deployment. Yoshinori has been around the MySQL community for over 10 years. He was a senior consultant at MySQL Inc. from 2006 to 2010. Yoshinori has created several useful open source products and tools, including MHA (an automated MySQL master failover tool) and quickstack.

The PV (page view) curve is one of the most important curves for SREs. Every significant drop in the curve is regarded as an incident. Therefore, SREs badly need a good anomaly detection algorithm.

Because PV fluctuates between day and night, detection heavily depends on expected values. A moving average is a naive way to generate them, but it suffers from two problems. First, it lags behind the actual trend, so it misses drops that occur during a rise. Second, it cannot easily differentiate between a drop and the recovery after a rise. Advanced methods such as exponential smoothing have their own shortcomings. When PVs are large, the local fluctuations of the curve are relatively small, rendering a smooth curve. This inspired us to apply linear regression to generate the expected value, but ordinary linear regression is susceptible to abnormal values.

In this talk, we will present a method based on robust linear regression to compute expected values; it resists the impact of anomalies. We will also introduce a statistical hypothesis testing method to detect anomalies, eliminating the need, common in simpler methods, to set different thresholds at different times of day.
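
The talk's exact algorithm isn't given here; as a stand-in, the sketch below uses the Theil-Sen estimator (a standard outlier-resistant regression: the median of pairwise slopes) with a simple median-residual threshold in place of a formal hypothesis test. Function names and the constant `k` are illustrative:

```python
from statistics import median

def theil_sen(xs, ys):
    """Robust line fit: the median of all pairwise slopes, so a few
    anomalous points do not drag the expected value the way an
    ordinary least-squares fit would."""
    slopes = [
        (ys[j] - ys[i]) / (xs[j] - xs[i])
        for i in range(len(xs)) for j in range(i + 1, len(xs))
        if xs[j] != xs[i]
    ]
    slope = median(slopes)
    intercept = median(y - slope * x for x, y in zip(xs, ys))
    return slope, intercept

def is_anomaly(xs, ys, x_new, y_new, k=4.0):
    """Flag y_new as anomalous if it deviates from the robust trend by
    more than k times the median absolute residual (a crude stand-in
    for a proper statistical hypothesis test)."""
    slope, intercept = theil_sen(xs, ys)
    resid = [abs(y - (slope * x + intercept)) for x, y in zip(xs, ys)]
    scale = median(resid) or 1.0  # avoid a zero scale on perfect fits
    return abs(y_new - (slope * x_new + intercept)) > k * scale
```

Because the fit tracks the rising trend rather than lagging it, a drop during a rise produces a large negative residual and is caught, unlike with a moving average.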

Missing, incomplete, or stale/inaccurate documentation hurts development velocity, software quality, and—critically—service reliability. And the frustration it causes can be a major cause of job unhappiness.

SREs often spend 35% of their time on operational work, which leaves only 65% for development. Time spent on documentation comes out of the development budget, which is challenging if there's a perception that creating and maintaining docs is grunt work that may not be recognized or rewarded. To convince fellow engineers and leadership to invest time and resources in documentation, it's essential not only to create good docs but also to gather data that communicates their quality, effectiveness, and value.

Attendees will learn:

How to understand and communicate documentation quality

Functional requirements for SRE documentation

Best practices for creating better, useful documentation

How to better communicate the value of documentation work to your business in order to drive change

Riona is a senior staff technical writer at Google, where she has worked for 11 years, and leads the team that builds g3doc, Google's internal platform for engineering documentation, used by thousands of projects within the company. Before Google, she worked at Amazon and spent almost 10 years as a writer, editor, and program manager at Microsoft.

9:00 am–10:55 am

Participants in the site reliability field come from varied backgrounds and from companies with varying levels of implementation of SRE principles and practices. There are no hard boundaries on this journey, but a phased model of skill acquisition reveals useful signposts along the way to help the traveler.

Using a selection of exemplar values and practices for detailed examination and then extrapolating to a wider set of other practices, I'll explore some of the landmarks that can characterize the approaches used by SRE teams. This can help participants to evaluate where they and their company are operating along the spectrum of practice and can be helpful when looking toward and planning for the next turns in the journey.

Some of the practice areas that I'll cover include incident prevention and handling, postmortems, KPI/SLOs, monitoring, and capacity management.

Kurt Andersen is one of the co-chairs for SREcon18Americas and has been active in the anti-abuse community for over 15 years. He is currently the senior IC for the Product SRE (site reliability engineering) team at LinkedIn. He also works as one of the Program Committee Chairs for the Messaging, Malware, and Mobile Anti-Abuse Working Group (M3AAWG.org). He has spoken at M3AAWG, Velocity, SREcon, and SANOG on various aspects of reliability, authentication, and security.

Distributed teams are an attractive proposition for your customers, your services, your team and you. They provide extended coverage hours, co-location with development teams, increased talent pools and a never-ending sense of fragility.

I've been working in and leading distributed teams in a variety of industries, most recently within SRE at Atlassian. Join me while I reflect on the collaboration techniques that keep distributed teams connected globally while delivering world-class distributed systems to millions of customers across the world. I'll share my insights and reflections on the successes and failures I've experienced while joining, leading, and building SRE teams.

My aim is for new and existing leads of distributed SRE teams to come away with personal reflections, tools, and techniques to aid their teams and, ultimately, help their services and customers reach their full potential.

Paul began his career in finance and operations in the late '90s. He then moved into technology, specifically trading technology and reliability. Next, he flew into the world of hedge funds, ensuring the reliability of high-frequency, ultra-low-latency trading engines and market data from the world's stock exchanges. For the last 5 years, he's been at Atlassian, where he currently leads multiple groups ensuring reliability for the JIRA and Confluence Cloud services used by Atlassian's customers.

Google SRE has developed a special interview format called "Non-Abstract Large Systems Design" or NALSD. The focus of this interview is developing a credible approach for solving a specific problem at large scale. Going beyond coding and algorithm skills, candidates demonstrate their skills in designing for scalability, reliability and robustness, estimating provisioning needs, and managing change. All candidates for SRE positions at Google participate in one NALSD interview as part of their recruiting process.

Attendees will learn why Google has developed this interview format and which aspects of a candidate's skill set are covered in the format. They will see an example of this interview type, and learn how to come up with their own interview questions. Tips and tricks derived from practical experience in conducting this interview type will help attendees avoid common pitfalls when interviewing candidates.

Sebastian Kirsch is a Site Reliability Engineer for Google in Zürich, Switzerland. Sebastian joined Google in 2006 in Dublin, Ireland, and has worked on internal systems such as Google's web crawler and payment processing systems, as well as on external products like Google Maps and Google Calendar. He specializes in the reliability aspects of new Google products and new features of existing products, ensuring that they meet the same high reliability bar as every other Google service.

A system is called scalable if it can take on additional users and requests without noticeable loss of performance. Scaling a data system involves significant movement and replication of data within a cluster, which can put considerable load on a system that is already running hot, affecting the service experience.

10:55 am–11:25 am

Break with Refreshments

11:25 am–11:50 am

How do we make an impact on newcomers to our field, regardless of our position in the industry? Let's put ourselves in the shoes of a newcomer and explore mentoring from a different perspective.

Fresh from an 18-month graduate program at REA Group, my peers and I have been fortunate to find ourselves in an environment where mentoring and pairing are a big part of the culture. This talk is designed to give attendees insight into what a diverse range of mentees actually wish for, to empower people to mentor and help shape the future of tech.

Leoren is a Systems Engineer on the Global Infrastructure team at REA Group in Melbourne. She grew up in an almost zero-tech background in the Philippines, but now has an ever-growing passion for mentoring and for learning all things DevOps. She's also in love with playing online multiplayer games.

Good engineers are goal-driven. We can work relentlessly to reach a metric performance goal for our application, but fail to realize that, over time, the metric, rather than our service, becomes the measure of our success. Very soon there is a tendency to make changes that merely make the metric better. In other words, the 'gaming' of the system begins.

There comes a time in the life of a metric when it needs to change, or to make way for another. This talk will make you comfortable with the idea of letting go, and give examples of how we realized that creating metrics is a journey, not a destination.

Audience Takeaways:

Changing metrics often isn't being shifty; it's a good idea.

Even bulletproof metrics eventually begin to fail over longer time periods.

Engineers 'game' metrics. Shake things up to refocus.

Some metrics flat-line and shouldn't be improved just for the sake of it

Kumar works at Microsoft and has been in the online services world for several years. He currently runs the Bing and Cortana Live site/SRE team. For the last several years, he has focused on growing the culture around live site quality, incident response and management, service hardening, availability, performance, capacity, SLA metrics, DRI/SRE development and educating teams on how to build services that run at scale.

1:20 pm–3:15 pm

“Fail fast, fail often” is a refrain heard throughout the tech industry. We’ve seen organisations that embrace this mantra succeed. When things go wrong in the workplace, we know it’s important not just to tolerate, but to accept and embrace failure.

But what’s the point of embracing failure if we’re not learning anything from it? It’s not that we’re failing to learn for lack of trying: embracing and learning from failure is far more complicated than most people realise.

Have you held a blameless postmortem, but the outcome was the same as the blameful postmortems you held before – “root cause: human error”? Were your blameless postmortem's findings interpreted and twisted elsewhere in your organisation?

Does this sound familiar?

The language we use when talking about failure shapes the outcome of that discussion. It shapes how we treat people involved in incidents. It shapes how capable the organisation is of learning from incidents in the future.

In this talk we’ll cover some common pitfalls when constructing a narrative for What Went Wrong. We’ll learn which cognitive biases taint our perception of events. We’ll discover how to hack our language to minimise blame.

Lindsay Holmwood is an engineering leader based in Australia. He served as the Head of Technology at the Australian federal government's Digital Transformation Agency, where he was responsible for technology strategy, advice, and delivery. He currently works at Envato leading engineering on Envato Elements.
Since bringing DevOps to Australia by running the second-ever DevOpsDays conference in 2010, he has run the longest-running DevOps meetup in the world, in Sydney. He regularly speaks on technology culture, DevOps, digital transformation, and building high-performing teams. He also won third place at the 1996 Sydney Royal Easter Show LEGO building competition.

As systems grow, they gain more components and more ways to fail. Alerts designed for the last iteration of the system can slowly "boil the frog", until suddenly no one has time to help the system scale further because they're constantly firefighting. Alert fatigue sets in and the team burns out.

The way to avoid this is to only page when the SLO is not met, or when the "error budget" is being burned at a rate requiring immediate action.

Perhaps you've moved on from a check-based alerting system and lots of spammy alerts to a time-series-based monitoring system like Prometheus; you've heard about SLOs and error budgets, but they sound like a unicorn dream -- at the very least, you can't visualise how they might even be constructed in a monitoring system. Fear not! In this talk, a well-rested champion of work/life balance, Jamie Wilkinson, will talk about the ideas of alerting on SLOs and error budgets, how the implementation changes as systems scale, and the tools you'll need once the alerts themselves no longer tell you which part is broken.
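
The burn-rate idea above can be sketched concretely. This is a minimal illustration, not the speaker's implementation: the 14.4x threshold follows the commonly cited fast-burn example for a 30-day SLO (a 1-hour window paired with a 5-minute window), and the function names are assumptions:

```python
def burn_rate(errors, requests, slo=0.999):
    """How fast the error budget is being consumed: 1.0 means the
    budget would last exactly the whole SLO period; 14.4 sustained
    means a 30-day budget burns in about two days."""
    if requests == 0:
        return 0.0
    error_ratio = errors / requests
    budget = 1.0 - slo  # allowed error ratio under the SLO
    return error_ratio / budget

def should_page(windows, slo=0.999, threshold=14.4):
    """Multi-window alert: page only when both the long window (the
    burn is sustained) and the short window (it is still happening
    right now) exceed the threshold. `windows` maps window names to
    (errors, requests) counts."""
    long_e, long_r = windows["1h"]
    short_e, short_r = windows["5m"]
    return (burn_rate(long_e, long_r, slo) >= threshold
            and burn_rate(short_e, short_r, slo) >= threshold)
```

The short window is what lets the alert resolve quickly once the incident ends, instead of paging for the rest of the hour.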

Jamie Wilkinson is a Site Reliability Engineer at Google. A contributing author to the "SRE Book," he has presented on contemporary topics at prominent conferences such as linux.conf.au, Monitorama, PuppetConf, Velocity, and SREcon. His interests began in monitoring and automation of small installations, and continue with human factors in automation and systems maintenance on large systems. Despite over 15 years in the industry, he is still trying to automate himself out of a job.

Break with Refreshments

3:45 pm–4:35 pm

When recruiting and onboarding new grads and others who haven't worked in site reliability, how do we build (and become) the engineers we want to work with? Seasoned engineers debug, fix issues in production, engage with clients, automate mundane tasks, and build new tools to streamline their workflows, yet in school new grads are mostly taught only how to build things from scratch, not how to support, maintain, and protect them.

In this talk, I'll share my experience and describe how Facebook has made me bring ideas and people together, not only to realize my potential, but also to make a difference at the company.

New grads have to be able to “drink from the firehose” of new information, learn by doing, and make connections throughout the company. On top of that, there's the Production Engineering role and philosophy, which has learned to embrace and grow people who haven't done exactly this type of work before. At Facebook, the goal is to have impact while doing the things we enjoy. Connecting those two dots is the key.

Espen Roth was born in Copenhagen, Denmark, and grew up in Colorado. During his time at Colorado School of Mines he interned at 3 different companies, held a job in the computer lab, and participated in multiple programming clubs and challenges. Since then, he's been a Production Engineer at Facebook in Menlo Park, California for a year and a half.

A mental model is an explanation of someone's thought process about how something works in the real world. Mental models set an approach to solving problems and can be thought of as 'personal algorithms'.

A trusted SRE must have reasonably good problem-solving and decision-making skills. Unfortunately, these skills do not improve merely by knowing more technology. This talk brings mental models from behavioral psychology into our world and describes a few that help engineers make better, more rational decisions and solve the right problems without unconscious bias.

Mohit is an Engineer on Bing's Live Site Engineering team. By day, he investigates all issues that subtly affect Bing’s availability and performance. Designing systems to proactively improve availability and route around problems is a core mission of the team. In his spare time, he loves long walks, tinkering with hardware, and chasing his goal of reading more books than Bill Gates.