The conference opened with a presentation from Jez Humble & Nicole Forsgren, two of the authors of “Accelerate: The Science of Lean Software and DevOps: Building and Scaling High Performing Technology Organizations” – one of InfoQ’s recommended books for 2018.

Some members of InfoQ's team of practitioner-editors were present and filed a number of stories about the event, but the main focus for this article is the key takeaways and highlights as blogged and tweeted by attendees.

@danielbryantuk: "Most of contemporary AI is about pattern matching signals on the edge, and inductive reasoning" @Grady_Booch #QConSF https://t.co/JoXkxfhuyi

@danielbryantuk: "As a developer you can ask yourself what value AI projects will have for you. One answer is that you can plug in these new frameworks in order take advantage and add value within your overall system" @Grady_Booch #QConSF

@TDJensen: Right now AI and deep learning specifically are valuable as parts of other systems vs. as a stand-alone system #QConSF per Grady Booch, IBM

@danielbryantuk: "When using AI within software you will need to get data scientists involved early. You need to think about data usage and also ethics" @Grady_Booch #QConSF https://t.co/kRhPn6RPSF

@eranstiller: "The human brain runs at about 20 Hz, some run slower" -- @Grady_Booch, a talk about #AI and computation #QConSF https://t.co/VJXW94l8HN

@danielbryantuk: "The disparity in computational effort required for training AI models versus inference of the models will affect software architectures, particularly around inference at the edge" @Grady_Booch #QConSF https://t.co/ZP1wangWI5

@danielbryantuk: "Everything is a system. In complex systems, malfunctions may not be detectable for long periods" @Grady_Booch #QConSF https://t.co/xq8P054UoP

@cinq: Every line of code represent an ethical and moral decision - @Grady_Booch #QConSF

@danielbryantuk: "We are increasingly seeing social and ethical issues getting more attention when building systems. Modern systems can have a big impact on people" @Grady_Booch #QConSF https://t.co/jJt0COUt5v

@gwenshap: "First question I ask is "Do you have a release process and can release on regular cadence?" If not, fix this first." <- @Grady_Booch at #QConSF

@danielbryantuk: "When working with a new system I focus on 'can the system be easily and rapidly deployed' and 'is there a sense of architectural vision'. If you have these two thing right you get rid of 80% of your typical issues" @Grady_Booch #QConSF https://t.co/FkXGoowaYa

@Ronald_Verduin: 'Grow a system through the iterative and incremental release of an executable architecture' Grady Booch #QConSf https://t.co/aFjkTYnYVq

@danielbryantuk: "Software whispers stories to be interpreted and realised by the underlying hardware. You [as developers] are the story tellers" @Grady_Booch on the importance of making your work count #QConSF https://t.co/4P3JwANGAr

He shared his experience as a creator and a person in tech, who also happens to be a black trans-man. It seemed fitting as a final talk, because even if we are software developers, we are all human beings first and foremost. It’s nice to know that even though each of us has our own specific kinds of struggles and challenges, we are not alone in that all of us are struggling together. One of the things that we as people might find ourselves dealing with is at the end of the day, we want to know that what we build is going towards something meaningful to us.

Twitter feedback on this keynote included:

@tsunamino: Purpose is a force: it drives us, it's strong, it's the core of what we are building #QconSF https://t.co/O9PsLeGcyR

@tsunamino: Important keynote message at #QConSF: be skeptical of the data you see out there (eg the number of pool deaths are correlated with years Nic Cage movies come out) https://t.co/jZCDCz16J1

@bridgetkromhout: Correlation is not causation; @nicolefv explains how cherry-picking data isn't enough, as we have logical fallacies and biases. #qconsf https://t.co/lP0q8zeZPW

@danielbryantuk: I always enjoying showing clients these software delivery performance metrics from The DevOps Reports and "Accelerate" -- these pics are from the @nicolefv and @jezhumble #qconsf keynote https://t.co/aPY3lXW6qm

@danielbryantuk: "You can drive cultural change in many ways -- one way is by changing your technical practices" @nicolefv #qconsf https://t.co/urw8rWBeSv

@neilathotep: You can be using mainframes and achieve the outcomes - @jezhumble @QConSF #qconsf https://t.co/06FwcFjBMN

@tsunamino: Meaning the technical stack isn't the most important (cough kubernetes) #QConSF

@tsunamino: If you aren't seeing benefits from moving to *the cloud* you probably aren't hitting all these points #QConSF https://t.co/nof0mm5rS0

@charleshumble: "High performance is possible you just have to execute." Dr. Nicole Forsgren, #qconsf

@bridgetkromhout: Cloud is "the illusion of infinite resources" according to @jezhumble. For me, not worrying about data center cooling or rack-n-stack is magic; I think moving up the stack is powerfully freeing. #qconsf https://t.co/98wLXYkSB5

@danielbryantuk: Several great predictive findings from the 2018 State of DevOps Report for high performing teams, via @jezhumble and @nicolefv at #qconsf Architectural outcomes, adopting the cloud properly, and monitoring and observability are core predictors https://t.co/GwezC1Prjo

@tsunamino: How do you actually measure effective automated testing? These three caused the most issues #QConSF https://t.co/SMF1I2Y1r3

@danielbryantuk: "Continuous testing is not all about automating everything and firing all of the testers" @nicolefv and @jezhumble dropping some testing wisdom at #QConSF https://t.co/c6LiN5qpe6

@tsunamino: Properly scoping projects is key to building features or updating org practices #QConSF https://t.co/5vqCEjVWSi

@randyshoup: Taking big bits of work and splitting them into smaller bits that actually provide value to customers is key to *everything* @jezhumble and @nicolefv #QConSF

@tsunamino: You will not be able to fix your tech stack without also fixing the trust within your culture #QConSF https://t.co/6HmCdKZ6zU

@danielbryantuk: "When something goes wrong, ask yourself that if you had the same information and tools, would you have done the same thing? Don't automatically fire the person seen to be the cause of the issue" @nicolefv and @jezhumble at #qconsf https://t.co/aIAaUkWlz4

@randyshoup: Team *dynamics* are critical; team skills are not. There was *no* correlation at Google between the technical skills of a team and its performance. #psychologicalsafety @nicolefv and @jezhumble #QConSF https://t.co/ikUdZQSzW6

@bridgetkromhout: Leaders set clear goals and direction, then let their teams decide how to implement. @nicolefv & @jezhumble on how high-performing orgs work. #qconsf https://t.co/OHVP0dTyBU

@vllry: @nicolefv @jezhumble On the subject of doing DevOps "right" - buy into the goals and not just the checkboxes of key success indicators, like cloud adoption. Using GCP/AWS/Azure and having a test suite don't inherently mean you're getting value from those things. #QConSF https://t.co/s3b8P1NRYK

@vllry: @nicolefv @jezhumble Enable autonomy to be agile. Teams (and members) should be able to work independently with little coordination and consultation. To enable this: fast change management and architectural isolation. I strongly recommend watching the talk when posted! #QConSF https://t.co/CJkoYt0gFP

@ag_dubs: what was fun 10 years ago isn't fun anymore- are we due for a new programming language paradigm revolution? #21stCenturyLangs #QConSF https://t.co/zFba02IdTg

@ag_dubs: i want to be able to leverage multicore processors from the *language* level #21stCenturyLangs #QConSF https://t.co/DF1UiFq0v3

@ag_dubs: the history of languages and platforms is tending towards simplicity #21stCenturyLangs #qconsf https://t.co/RjTv7Md1J7

@ag_dubs: @golang wants to be the simplicity that both developers and platforms seek #21stCenturyLangs #QConSF https://t.co/mVmNspKFjL

@katfukui: Applying love languages to technology can shape and change how humans use what you build. For ex, use words of affirmation to add love and thought in notification and text-based workflows #QConSF

@ag_dubs: stability really mattered for Java - and @golang is paying attention to that, even as it looks to go 2.0 #QConSF #21stCenturyLangs https://t.co/riqFAHACLn

@ag_dubs: "i like programming in Go because it is bridging the gap between application and systems programming." - @classyhacker closes us out with some inspiring thoughts about the legacy of @golang #21stCenturyLangs #QConSF https://t.co/b4vmFu3lN7

@maheshvra: There is a common concern in adopting #kotlin, that is owned by Jetbrains, not a community or any affiliated open source foundation. Though, it offers good value. What is the future? #21stCenturyLangs #QConSF

@ag_dubs: because passing state into a class is such a common pattern, @kotlin gives u syntax sugar for collapsing that data into a property #21stCenturyLangs #QConSF https://t.co/FfR7VJu2vG

@ag_dubs: toolchains are incredibly important for making a multiple language solution work- and kotlin has great solutions for this interop #21stCenturyLangs #QConSF https://t.co/Y0yxMljQ7M

@ag_dubs: an architecture for a multiplatform application- can we rewrite it *all* in kotlin? @JakeWharton says yes! #21stCenturyLangs #QConSF https://t.co/bamiTy6gOg

@ag_dubs: being able to have a small team target multiple platforms with a single #kotlin codebase gives you lots of options for folks to access your application #21stCenturyLangs #QConSF https://t.co/p42phNPRLt

@ag_dubs: asm.js was a great frontrunner in leading the #webassembly revolution - check out that bitwise or 0, it's functionally a type annotation for integer! #QConSF #21stCenturyLangs https://t.co/Ci0UsL5DhS

@ag_dubs: #webassembly is revolutionary because it's the first open and standardized bytecode #QConSF #21stCenturyLangs https://t.co/pst5mKCB3V

@ag_dubs: #webassembly is going to affect language and compiler design as we go forward- and it's because the web has very interesting and unique constraints! #QConSF #21stCenturyLangs https://t.co/nMUCA1zsCP

@ag_dubs: you may already be using #webassembly ! (and not even know it!) shoutout to @fitzgen's sourcemap impl in @rustwasm ! #QConSF #21stCenturyLangs https://t.co/xhvEHY468b

The most important takeaways from this talk are that “We should think of privacy and fairness by design, at the very beginning, when developing AI models; not just as an afterthought,” and that “Privacy and fairness are not just technical challenges. They have huge societal consequences. That’s why it’s our responsibility to include as many and as diverse a group of people as possible to be our stakeholders.”

@danielbryantuk: Excellent overview of the @AirbnbEng migration to microservices by @jessicamtai at #QConSF "It's worked out well for us, but SOA (and distributed systems) isn't for everyone." https://t.co/OdseHPnNkf

@danielbryantuk: The @AirbnbEng team have clearly thought about their approach and tooling as they've moved to microservices. @jessicamtai at #QConSF giving a shout out to @TwitterEng's Diffy for comparing migrated endpoints https://t.co/tL7D3EjHcc

@danielbryantuk: Great #QConSF real world architecture panel by @randyshoup
- Automation and robotics is being driven by large orgs. Democratizing the tech is important (esp for small projects)
- People outside the industry have high expectations and think humanoid robots will arrive next year https://t.co/0uyU3sDwUw

@danielbryantuk: "Deploying is only the beginning. We're taking time to migrate data correctly while still running at scale. We're also conscious of the second system effect where we are tempted to sneak in new updates along with the migration" @mjdemmer #QConSF https://t.co/dA3n1UF9vs

@sangeetan: Operability is important consideration for internal tools as well. Think how well your entire team will be able to support your users. Elise McCallum @WeAreNetflix #QConSF #DevEx

@sangeetan: You don't always have to replace your system to address gaps. Incorporate best parts of other tools into your own and customize to your use cases to provide optimal experience. CI evolution @WeAreNetflix #QConSF with Elise McCallum

@danielbryantuk: Exploring the PhD research of @GailOllis at #QConSF, looking at the effect of good and bad software development practices. Some of the top responses from interviewees about the biggest impacts https://t.co/cbzi9pCFyg

@danielbryantuk: The dangers of working in isolation within a software system, via @GailOllis at #QConSF https://t.co/CsSB9jU84E

@sangeetan: The moment anything is written, it becomes legacy and has constraints around it. @GailOllis #QConSF #DevEx https://t.co/uyLyMuTd4v

@neilathotep: "I saw him create a legacy system in 3 weeks" @GailOllis on helping developers to help each other #QConSF

@danielbryantuk: "You see a lot of themes in software variable names, which are not helpful when trying to solve an issue -- at least 'wibble' is obviously meaningless" @GailOllis #QConSF https://t.co/DwxgLAuhN4

@sangeetan: Re inventing the wheel, adding elephant to it and a spaceship on top - what others' code can feel like to people who have to work with it @GailOllis #QConSF

@danielbryantuk: "Banking IT has traditionally been about stability and conservatism -- the opposite to what modern organizations do. This is driven by the requirement that when you put your money in the bank you don't expect it to be stolen by child wizards."

@danielbryantuk: The combination of @scsarchitecture and the DITTO architecture has allowed @StarlingBank to build a resilient online banking system

@tsunamino: Or if you want to, let the robots fix their own security vulnerabilities! #QConSF https://t.co/8JzrgSGlU4

@annthurium: Studies show that people pick languages for projects based on what libraries are available, which lines up with npm survey data. @seldo #QConSF https://t.co/ZFSndW2Yok

@tsunamino: Frameworks never really die, they get maintained in legacy products and then people start new products in new frameworks #QConSF https://t.co/DGlUwvlvUx

@annthurium: React isn't a framework so much as an ecosystem. React solves the view problem, but decouples itself from other related problems such as routing, state management, and data fetching that are solved by other tools. @seldo #QConSF https://t.co/YnamPHoLuz

@annthurium: Only 56% of npm users are using a testing framework. 😱 @seldo #QConSF

@tsunamino: Predictions of the future of JavaScript: learn GraphQL and Typescript and WASM because they aren't going away #QConSF https://t.co/lAoCqmy1tk

@tsunamino: The best frameworks are the ones that have the most people using it bc you get a community to help #QConSF https://t.co/aAVeBBIP56

@danielbryantuk: "Adding circuit breakers for mitigating risks related to data propagation added a lot of value, but also a few problems. Genuine scenarios accidentally tripped the circuit. We learned to use knobs to fix this at first" @lavanya_kan #qconsf https://t.co/wGmKEYYI1G

@danielbryantuk: "The Netflix data propagation system is very fast, but this means that bad data is propagated just as rapidly as good data! This can lead to challenges" @lavanya_kan #qconsf https://t.co/k4xm8SNMnf

@danielbryantuk: "Data validation is key to high availability" great thinking points from @lavanya_kan about data changes and safety #qconsf https://t.co/VfzpkPNtDW

@danielbryantuk: Put circuit breaking into a microservices system by default. "This stops a single match creating a system wide fire." Andrew McVeigh #qconsf https://t.co/BAOca0QqU7

@danielbryantuk: "Put an API gateway into a microservices system straight away, e.g. something that handles auth and rate limiting" Andrew McVeigh at #qconsf also mentioning the benefits of @EnvoyProxy Can I also subtly plug @getambassadorio which is an open source gateway also built on Envoy :) https://t.co/eD9kHaxI4t

@randyshoup: Prefer fluent interfaces: By setting data on an entity, you are not conveying why you are doing that. Define a protocol instead of a pile of setters. @VaughnVernon at #QConSF https://t.co/5ZvsRJRAPZ

@danielbryantuk: "Moving into the C language era was like moving into modernity. Every language before was clearly written with a broken Caps Lock key -- it was screamed" @bcantrill #qconsf https://t.co/TiCIZJwbBj

The biggest changes in this space include: performance-driven improvements, such as eBPF and userspace networking; the changing role of operations, and how operators use and deploy operating systems; and emulation and portability. There are also areas with little change so far but with signs that change is on the horizon, for example: operating systems are effectively the "last monolith"; there is a lack of diversity in OSs, and in OS programming languages and contributors; and security has not yet received the full attention it requires….

Modern computer storage and networking have become much faster; moving from 1Gb ethernet to 100Gb over the past decade has resulted in a communication speed increase of two orders of magnitude, and SSD can now commit data at network wire speed. Accordingly, modifications to operating systems have been required in order to keep up with these changes.

The first approach to mitigate the issue of increasingly fast I/O is to avoid the kernel/userspace switch latency. System calls are relatively slow, and so these can be avoided by writing hardware device drivers in userspace. For example, in networking DPDK is the most widely used framework, and for non-volatile memory express (NVMe) there is SPDK.

Another approach is to never leave the kernel. However, it is challenging to code for the kernel, and so eBPF has emerged as a new safe in-kernel programming language. Cormack suggested that eBPF is effectively "AWS Lambda for the Linux kernel", as it provides a way to attach functions to kernel events. eBPF has a limited safe language that uses an LLVM toolchain, and runs as a universal in-kernel virtual machine. The Cilium service mesh project makes extensive use of eBPF for performing many networking functions within the kernel at very high speed.

Twitter feedback on this session included:

@danielbryantuk: "The changes in hardware I/O has driven a lot of change in modern operating systems" @justincormack #qconsf https://t.co/PiIB5J1Xkd

@danielbryantuk: "At some point, hardware has stopped becoming hardware, and instead you talk to the abstraction via an API" @justincormack on the move away from hardware drivers to userspace frameworks #qconsf https://t.co/i2tJfbIF2t

@danielbryantuk: "Userspace can add too much overhead to manipulate I/O, and so the move is now to implement functionality in the kernel -- eBPF is kind of like AWS Lambda for the Linux kernel" @justincormack #qconsf https://t.co/3tvbBTDVJt

@danielbryantuk: Great example of using eBPF within @ciliumproject, via @justincormack at #qconsf https://t.co/tuGU9TDLGI

@danielbryantuk: "Operations has changed quite a lot over the past decade. There has been a move away from the Sun workstation model of the 90s, which was massively impactful throughout its time " @justincormack #qconsf https://t.co/nFKCJpssS1

@danielbryantuk: Great overview of @Linuxkit for building minimal OSs, from @justincormack at #qconsf https://t.co/t9TZUHAaGj

@danielbryantuk: "The stable ABI of Linux made it an emulation target. What does this mean? It's a great abstraction, and increasingly this is being used to add security isolation" @justincormack #qconsf https://t.co/nCzoA0A9C5

@danielbryantuk: "OSs used to be all about multi user, but now (in the cloud) we often have a single user and we care more about tail latency than who is logged in and who owns what file" @justincormack #qconsf https://t.co/G3LwWMNkCu

@danielbryantuk: "We're heading towards an OS monoculture with Linux, which makes it easy to write code, but there are a lot of disadvantages..." @justincormack #qconsf https://t.co/OoFcH6u22F

@tsunamino: Watching @annthurium talk about how to actually change your habits when coding: context matters when you're trying to figure out the right reminder system! #QConSF https://t.co/u57b2a97sE

@tsunamino: Distractions: is it your IDE or are you actually distracted by the internet? #QConSF https://t.co/789C7h75tq

@tsunamino: Pro tips on how to ask for help: set a timer for 15 min and have a backup plan if other people are busy #QConSF https://t.co/ij1ZfoisXB

@danielbryantuk: "Sometimes working in an agile team feels like being a racehorse on a treadmill -- going very fast, but not getting anywhere that provides real value" @hollyjallen #qconsf https://t.co/nullK7FU0B

@jpaulreed: "@SlackHQ lives and breathes this cycle. There is massive executive commitment to it."-@hollyjallen #QConSF https://t.co/bXk6E1ZAwX

@danielbryantuk: "You need at least two things to drive real change: executive dedication to learning, and high trust teams. There are also several other technical practices that help too" @hollyjallen #qconsf https://t.co/kWbM7eUnUD

@neilathotep: Being on call like everything else - learn by doing. @hollyjallen #QConSF

@danielbryantuk: This pain seems familiar from my past experience... @hollyjallen dropping some knowledge on how a product team may often scale faster than ops, but ops is super important. Moving to dev on call requires change management and empathy #qconsf https://t.co/ZGyVHv4EBh

@danielbryantuk: "Ops had to build and keep these crazy big mental models of who knew each part of the system the best, in order to be able to best fix production issues. They effectively became human routers. This needed to change" @hollyjallen #qconsf https://t.co/Q3Bc6NWqE2

@danielbryantuk: "The service engineers provide the platform and best practices. We empower service ownership for product teams" @hollyjallen on the @SlackHQ organisation structure at #qconsf https://t.co/IZn2zcA48l

@danielbryantuk: "We run 'incident lunches' at @SlackHQ, which must use our incident resolution process to order the lunch. We also throw in random constraints too, to simulate the real world. Anyone in the company can attend" @hollyjallen on running gamedays at Slack #qconsf https://t.co/A5FvebO8Qx

@danielbryantuk: "The ops team spent lots of time looking down at the machine level metrics and alerts to detect outages. Giving this up was hard. But ultimately, user-driven alerting was the way forward" @hollyjallen #qconsf https://t.co/BVfaEn1iDs

@danielbryantuk: Excellent shout out to what works for one organisation, won't necessarily work in another. Be careful in what you take away from conference talks. "Copy the questions, not the answers!"

@danielbryantuk: "The Netflix Edge team work as full cycle developers. We are responsible for coding, deployment, and operation, and we rely on centralised teams for specialised tooling and input" @gburrell_greg #qconsf https://t.co/Dh8iK9VtT9

@danielbryantuk: An example of some of the tooling at Netflix that facilitates full cycle developers, via @gburrell_greg at #qconsf https://t.co/gErIyYTHiI

@danielbryantuk: "Creating a team of full cycle developers is *not* a cost cutting measure. It requires proper staffing and training, and commitment and prioritization" @gburrell_greg on the importance of leadership in the Netflix team #qconsf https://t.co/4smGVMoDuj

@danielbryantuk: "The full cycle developer model is not for everyone or for every organisation" @gburrell_greg #qconsf https://t.co/t3JHFz7dRz

@danielbryantuk: "We don't do YOLO deployments at Netflix, but we do deploy independently with good tests and metrics. We also coordinate around some big events" @gburrell_greg #qconsf https://t.co/hCnmfMpEeF

@vllry: Whispers in the Chaos: Monitoring Weak Signals at #QConSF by @jpaulreed Once there's an incident, how do we find the cause? Expertise vastly speeds up pattern matching. Target changes, relevant memories, and proximity to find the source. https://t.co/S5OIhO2oMm

A production-ready application or service is one that can be trusted to serve production traffic…

… We trust it to behave reasonably, we trust it to perform reliably, we trust it to get the job done and to do its job well with very little downtime.

… For the tenet of stability and reliability Kehoe began by arguing for the need for stable development and deployment cycles. In this context stability is all about "having a consistent pre-production experience". Within development this should focus on well-established testing practices, code reviews, and continuous integration. For the deployment practice, engineers should ensure that builds are repeatable, a staging environment (if required) is functioning "like production", and that a canary release process is available….

The discussion of the next tenet, scalability and performance, began with a focus on understanding "growth scales", i.e. how each service scales with business goals and key performance indicators (KPIs). Engineers should be resource aware, knowing what bottlenecks exist within the system and what the elastic scaling options are. Constant performance evaluation is required -- ideally testing this should be part of the CI process -- and so is understanding the production traffic management and capacity planning that is in place.

In regard to the tenet of fault tolerance and disaster recovery, engineers should be aware of, and avoid, single points of failure within design and operation….

The next tenet of readiness discussed was monitoring. Kehoe argued that dashboards and alerting should be curated at the service-level, for resource allocation, and for infrastructure. All alerts should require human action, and ideally present pre-documented remediation procedures and links to associated runbooks. Logging is also an underrated aspect of software development; engineers should write log statements to assist debugging, and the value of these statements should be verified during chaos experiments and game days.

The final tenet to be explored was documentation. There should be one centralised landing page for documentation for each service, and documentation should be regularly reviewed (at least every 3-6 months) by service engineers, SREs, and related stakeholders. Service documentation should include key information, like ports and hostnames, description, architecture diagram, API description, and oncall and onboarding information….

The key learnings of the talk were summarised as: create a set of guidelines for what it means for a service to be "ready"; automate the checking and scoring of these guidelines; and set the expectations between product, engineering, and SRE teams that these guidelines have to be met as part of a "definition of done".
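As a rough illustration of what "automate the checking and scoring of these guidelines" could look like, the Python sketch below scores a service description against a handful of checks. It is not taken from the talk: the `Check` type, the `score` and `is_ready` functions, the field names on the service dictionary, and the thresholds are all hypothetical, chosen only to mirror the tenets described above.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Check:
    """One readiness guideline: a name, the tenet it belongs to, and a predicate."""
    name: str
    tenet: str
    passed: Callable[[dict], bool]


# Hypothetical checks; the service fields (has_canary, replicas, alerts,
# docs_age_days) are assumptions made for this sketch.
CHECKS = [
    Check("canary release available", "stability",
          lambda s: s.get("has_canary", False)),
    Check("no single point of failure", "fault tolerance",
          lambda s: s.get("replicas", 0) >= 2),
    Check("alerts link to runbooks", "monitoring",
          lambda s: bool(s.get("alerts"))
          and all(a.get("runbook") for a in s["alerts"])),
    Check("docs reviewed within 6 months", "documentation",
          lambda s: s.get("docs_age_days", 10**6) <= 183),
]


def score(service: dict, checks=CHECKS):
    """Return (fraction of checks passed, per-check results)."""
    results = {c.name: bool(c.passed(service)) for c in checks}
    fraction = sum(results.values()) / len(checks)
    return fraction, results


def is_ready(service: dict, threshold: float = 1.0) -> bool:
    """Treat a service as 'ready' when its score meets the threshold."""
    fraction, _ = score(service)
    return fraction >= threshold
```

A check like this could run in CI so that the "definition of done" is enforced mechanically; the per-check results make it clear which guideline a failing service still has to meet.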

Handa … opened the talk by discussing that there are several known risks with the migration of data that is associated with a change in functionality or operational requirements, including data loss, data duplication, and data corruption….

With a large-scale geographically distributed customer base, many organizations cannot afford to take systems fully offline during any maintenance procedures, and therefore some variation of a "zero-downtime" migration is the only viable method of change. Variants of this approach that the Netflix team uses within a migration that involves data include: "perceived" zero downtime, "actual" zero downtime, and no migration of state….

Summarizing the talk, Handa shared the key lessons that her team had learned: aim for zero downtime even for data migration; invest in early feedback loops to build confidence; find ways to decouple customer interactions from your services; build independent checks and balances through automated data reconciliation; and canary release new functionality "rollout small, learn, repeat".
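To make the idea of "automated data reconciliation" more concrete, here is a minimal Python sketch of an independent check that compares records in the old and new stores after a migration. It is an assumption-laden illustration, not Netflix's implementation: the `fingerprint` and `reconcile` functions, and the representation of each store as a dictionary keyed by record ID, are hypothetical.

```python
import hashlib
import json


def fingerprint(record: dict) -> str:
    """Hash a record in a canonical form so key order does not matter."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def reconcile(source: dict, target: dict) -> dict:
    """Compare two stores (record ID -> record) and report discrepancies.

    missing:    IDs present in the source but absent from the target (data loss)
    unexpected: IDs present only in the target (possible duplication)
    mismatched: IDs present in both whose contents differ (possible corruption)
    """
    missing = sorted(k for k in source if k not in target)
    unexpected = sorted(k for k in target if k not in source)
    mismatched = sorted(
        k for k in source
        if k in target and fingerprint(source[k]) != fingerprint(target[k])
    )
    return {"missing": missing, "unexpected": unexpected, "mismatched": mismatched}
```

The three report categories line up with the migration risks named earlier (data loss, duplication, and corruption). Running such a check repeatedly during a canary rollout is one way to get the early feedback the lessons call for.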

Twitter feedback on this session included:

@danielbryantuk: "When thinking about data migrations with a high availability system we think about various options: perceived zero downtime; actual zero downtime; and choose another approach to avoid migration" Sangeeta Handa from @netflix at #qconsf https://t.co/jPgXQtdBFB

@vllry: Finishing up the last #qconsf notes edits - "Building Resilience In Production Migrations" by Sangeeta Handa. Naive database migration involves a lot of downtime and hope, and that's usually unacceptable. We need strategies to migrate in the background and test the data. https://t.co/TCzxmbldf6

@aysylu22: "This is what production looks like in my mind. It's a lot messier" @mipsytipsy concludes the debut of Production Readiness track with the talk Yes, I Test In Production (And So Do You) #QConSF https://t.co/PEsaQCWFl3

@koldobsky: If your tools tell you that everything is up, your tools are not good enough - Charity Majors #QConSF

@vllry: @mipsytipsy Only production is production. Reproducing it is asymptotically costly, and incidents will occur that you can't foresee. You need strong tooling and observability to understand production, rather than trusting that prod will always be fine. #QConSF https://t.co/zkDN5E6btt

@eanakashima: Invest in tools that let you protect the 99% of users who are using the system in a reasonable way from the rest -@mipsytipsy #QConSF https://t.co/pi1O8Mzgf2

@mikellsolution: "Software engineers spending too much time in an artificial environment is dangerous" #qconsf @mipsytipsy on the importance of testing in Prod.

@neilathotep: @mipsytipsy on "testing in production" - because when it comes down to it, you are always testing. #QConSF https://t.co/WomTmGFRP8

@tsunamino: SNAP benefits used to only be able to check their balance by calling (and entering a 16 digit card number every time), saving receipts, or using a website no one knew about #qconsf https://t.co/5a47A675l5

@tsunamino: They made an app that scrapes the website but made it easy to read and use #QConSF

@tsunamino: NOOOOOO. If you don't remember your security question, the error message is the same as if you answered the question incorrectly #QConSF https://t.co/hODADsnRQa

@tsunamino: If you have a tiny team and not everyone can code, making it easy to use JSON can empower non-coders to keep services going #QConSF JSON can turn into code here https://t.co/1WJJI7Z7hX

@tsunamino: Some states (ahem TEXAS) doesn't have a web portal so they built a way to call and translate the automated phone trees #QConSF https://t.co/8ltiaUDTpb

@tsunamino: The Puerto Rico web portal gives you an error message that says "if you aren't using IE6-9, this web portal will close" so they changed the user agent to IE9 in the app and it was ok?? #QConSF

@tsunamino: FreshEBT had interesting user stories to work with: sharing a smartphone with another user, running out of data, and being afraid of running out of benefits at checkout #QConSF https://t.co/DvbLnHKYaa

@tsunamino: Making sure to focus on each one of these categories made it possible to scale with minimal engineers #QConSF https://t.co/T8fk5kvPoD

@Ronald_Verduin: Cool the keynote is started at #QConSf. Videos of the sessions can be shared up to 50 colleagues so let me know:-) https://t.co/jOJ3u6n6ud

@tsunamino: Pretty cool that #qconsf allows you to iterate on the feedback you provide speakers and also delete and anonymize your history!

@santiagopoli_: I want to congratulate the organizers of the @QConSF for a great conference! Keep the great work going people, you are awesome. Today: workshops #QConSF

@yusbel: #QConSF conference end, very valuable conference for those that practice architecture, my number one reason for returning was the opportunity to ask anything to the guys that make the magic happen. Now workshop begin, making the abstract concrete. #TD https://t.co/H3SAnoVbBM

@unfunky_ufo: Had an amazing first trip to SF this week. Thanks to the organizers of #QConSF for putting together a really solid conference.

@danielbryantuk: Yet another thoroughly enjoyable #QConSF in the books. Many thanks to all of the speakers, sponsors, attendees and organisers! See you in London for the next event :-)

@fabiale: I really liked #qconsf and all the language talks. One of them made me realise how cool #kotlin is.

@Ronald_Verduin: Okay #QConSF is a wrap. Lots of new information after 3 days conference and 2 days workshops. Cool I could be here, thanks #InfoSupport

I liked that the topics covered in the talks and keynotes were thoughtfully curated, and are as varied as possible to represent most of the range of human experience, and that the staff took care of the needs of the attendees. I learned a lot and made some new friends. It’s a refreshing experience to have a conference for developers, by developers. And I thank the QConSF staff and speakers for this experience.

AWESOME/10, would attend again.

@NotMyself: While at #qconsf I had a chance to see a demo of @ballerinalang, which tags itself as a "Cloud Native Programming Language." Seemed kinda neat, might have to play with it when I get home.

@pavelmicka_xyz: The main themes of #qconsf: microservice architecture works + success stories. Challenges: its distributed system, circuit breakers & resilient design are a necessity. Recommendations: proper component ownership is the way to go (no throwing the app over the wall).

InfoQ produces QCons in 6 cities around the globe. Our focus on practitioner-driven content is reflected in the fact that the program committee that selects the talks and speakers is itself comprised of technical practitioners from the software development community. Our next QCon will be in London March 4-8, 2019. We will return to San Francisco April 15-17, 2019 with QCon.ai, our event focussed on machine learning and AI for developers.