Minimizing customer impact is a key factor in successfully rolling out frequent code updates. Learn how to leverage the AWS cloud so you can minimize bug impact, test your services in isolation with canary data, and easily roll back changes. Learn to love deployments, not fear them, with a blue/green architecture model. This talk walks you through the reasons it works for us and how we set up our AWS infrastructure, including package repositories, Elastic Load Balancing load balancers, Auto Scaling groups, internal tools, and more to help orchestrate the process. Learn to view thousands of servers as resources at your command to help improve your engineering environment, take bigger risks, and not spend weekends firefighting bad deployments.



18.
When do we deploy?
•Teams deploy end of sprint releases together
•Hot fixes and upgrades are frequently performed via rolling-restart deployments
•Early on, deployments took an entire day
–Lack of automation
•Deploys today generally take 45 minutes
–Everyone has run a deployment

19.
Sustaining engineer
•Every team member including QA has run deployments
•Builds confidence, understanding, and redundancy
•Ensures documentation is up to date and everything that can be automated is
Sustaining engineers earn a badge-of-honor shirt after their tour of duty

36.
Launching new cluster
[Diagram: the blue cluster (sensors → external service ELB load balancer → termination servers → content routers → processors reading the active Kafka topics) alongside a newly launched green cluster whose processors read the inactive topics, all sharing the data plane: Kafka, DynamoDB, Redis, Amazon RDS, Amazon Redshift, Amazon Glacier, Amazon S3]
•Green cluster is launched
•Termination servers are kept out of the ELB load balancer by failing health checks
•Content Routers write to the “active” topics
•Processors in green read from the “inactive” topics
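The "failing health checks" trick above can be sketched as a health endpoint that deliberately returns 503 until the cluster is promoted; the ELB then never routes traffic to green's termination servers. This is an illustrative sketch, not the talk's actual implementation — all names here are hypothetical.

```python
# Hypothetical sketch: a health-check endpoint that fails on purpose
# while this cluster is marked inactive, so the ELB keeps these
# termination servers out of rotation.
from http.server import BaseHTTPRequestHandler, HTTPServer

CLUSTER_ACTIVE = False  # flipped to True when this cluster is promoted


def health_status(active: bool) -> int:
    """200 keeps the instance in the ELB pool; 503 keeps it out."""
    return 200 if active else 503


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            status = health_status(CLUSTER_ACTIVE)
            self.send_response(status)
            self.end_headers()

    def log_message(self, fmt, *args):
        pass  # keep the sketch quiet


# Serving would look like: HTTPServer(("", 8080), HealthHandler).serve_forever()
```

Promoting the cluster is then just flipping `CLUSTER_ACTIVE` and letting the ELB's health checks bring the instances into rotation.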

42.
Testing the new stuff
[Diagram: blue and green clusters side by side against the shared data plane (Kafka, DynamoDB, Redis, Amazon RDS, Amazon Redshift, Amazon Glacier, Amazon S3); the integration test suite connects directly to a green termination server while content routers write to the active topics and green processors read the inactive topics]
•Test customer(s) are canaried
•Integration test suite is run by connecting to a termination server directly
•Tests pass; then we canary real customers

44.
Canary customers
•Canary information is stored in ZooKeeper
•Fortunately, we dogfood our own tech
•This affords us the ability to use ourselves as canaries for new code
•The inactive processing cluster is set to read from the .inactive topics
•The standard Kafka topics with .inactive appended
•The ingestion layer has a watcher on that znode and routes any canaried customer to the .inactive topics
•Ex. regular traffic goes to foo.bar, canary traffic goes to foo.bar.inactive
•When we are ready to test real traffic, we mark several customers as canaries and start the monitoring process to determine any issues
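The routing rule above is essentially a suffix decision per customer. A minimal sketch (hypothetical names, not the actual ingestion-layer code):

```python
# Canaried customers get ".inactive" appended to their Kafka topic,
# so their traffic flows to the green cluster's consumers.
def route_topic(base_topic: str, customer: str, canaries: set) -> str:
    """Return the Kafka topic a customer's traffic should be written to."""
    if customer in canaries:
        return base_topic + ".inactive"
    return base_topic
```

For example, `route_topic("foo.bar", "acme", {"acme"})` yields `foo.bar.inactive`, while a non-canaried customer keeps `foo.bar`.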

49.
IT tests run
•Integration tests are run
–~3000 tests in total
–Test customer must be “canaried”
•If any tests fail, we triage and determine if it is still possible to move forward
•Deployment proceeds only when we are passing 100%—no exceptions!

52.
Trust, but verify!
[Diagram: blue and green clusters both running against the shared data plane (Kafka, DynamoDB, Redis, Amazon RDS, Amazon Redshift, Amazon Glacier, Amazon S3); green processors still read the inactive topics]
•Monitor green services
•Verify health of the cluster by inspecting graphical data and log outputs
•Rerun tests with load

54.
Logging and error checking
•Every server forwards its relevant logs to Splunk
•Several dashboards have been set up with common things to watch for
•Raw logs are streamed in near real-time and we watch specifically for log-level ERROR
•This is one of our most important steps, as it gives us the most insight into the health of the system as a whole
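The ERROR-watching step above amounts to a level filter over the streamed raw logs. A minimal sketch (not the actual Splunk setup), assuming the conventional "TIMESTAMP LEVEL message" line layout:

```python
# Surface only ERROR-level lines from a stream of raw log lines,
# the signal we watch most closely during a deployment.
def error_lines(log_stream):
    for line in log_stream:
        parts = line.split(maxsplit=2)
        # parts[1] is the log level in "TIMESTAMP LEVEL message" layout
        if len(parts) >= 2 and parts[1] == "ERROR":
            yield line
```

In practice the equivalent filter is a saved Splunk search feeding the dashboards mentioned above.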

56.
Moving customers over
[Diagram: sensors connect through the external service ELB load balancer; both clusters' content routers now write to the active topics; shared data plane (Kafka, DynamoDB, Redis, Amazon RDS, Amazon Redshift, Amazon Glacier, Amazon S3) as before]
•Flip all customers back away from canary
•Activate green cluster
•Event processors and consuming services in blue and green now write to and consume the “active” topics
•We are in a state of active-active for a few minutes

57.
Each node in the data processing layer has a watcher on a particular znode which tells the environment whether it is active (use standard Kafka topics) or inactive (append .inactive to the topics)
[Diagram: ingestion feeding Kafka; one processing cluster consumes the active topic, the other the inactive topic (active-active)]
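The znode watcher described above can be sketched as a callback that remaps a processor's subscribed topics when the cluster-state znode changes. In production this would be registered via a ZooKeeper client (e.g. a kazoo DataWatch); here the watch delivery is stubbed out so the topic-selection logic stands alone, and all names are illustrative.

```python
# Map the znode payload (b"active" / b"inactive") to Kafka topic names.
def topics_for(mode: bytes, base_topics):
    suffix = "" if mode == b"active" else ".inactive"
    return [t + suffix for t in base_topics]


class Processor:
    """A data-plane node that re-subscribes when the cluster state flips."""

    def __init__(self, base_topics):
        self.base_topics = base_topics
        # A freshly launched (green) cluster starts out inactive.
        self.subscribed = topics_for(b"inactive", base_topics)

    def on_znode_change(self, data, _stat=None):
        # In the real system this callback fires via the ZooKeeper watcher
        # on the cluster-state znode.
        self.subscribed = topics_for(data, self.base_topics)
```

Flipping a cluster to active is then a single znode write; every watching node re-subscribes on its own.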

58.
When we are ready to make the switch, we start by making the new cluster active and enter an active-active state where both processing clusters are doing work. This is where it is paramount that code be forward compatible, since two different code bases will be doing work simultaneously.
[Diagram: ingestion feeding Kafka; green is told "switch to active!" and both clusters' processors now consume the active topics (active-active)]

59.
However, blue and green are fully partitioned and there is no intercommunication between the clusters. This allows for things like changes in serialization for inter-service communication.
[Diagram: both clusters consuming the active topics from Kafka, fed by ingestion, with no cross-cluster communication]

60.
Flipping the switch
[Diagram: green now serves traffic through the external service ELB load balancer; blue's processors read from the inactive topic to drain it; shared data plane (Kafka, DynamoDB, Redis, Amazon RDS, Amazon Redshift, Amazon Glacier, Amazon S3) as before]
•We deactivate Blue, which forces Termination Servers in Blue to fail health checks and all Blue sensors disconnect
•Blue processors switch to read from the “inactive” topic
•Once all consumers of the “inactive” topic have caught up to the head of the stream, Blue can be decommissioned
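The "caught up to the head" check above reduces to comparing committed offsets against end offsets per partition. A sketch with hypothetical names (in practice both sides come from the Kafka consumer API, e.g. end offsets and committed positions per topic-partition):

```python
# Blue may be decommissioned once every consumer of the inactive topic
# has reached the head (end offset) of its partition.
def caught_up(end_offsets: dict, committed: dict) -> bool:
    """True when every partition's committed offset has reached the head."""
    return all(committed.get(tp, 0) >= head for tp, head in end_offsets.items())
```

A deploy script can poll this until it returns true, then tear blue down.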

61.
Out with the old…
[Diagram: only the green cluster remains: sensors → external service ELB load balancer → termination servers → content routers → processors on the active topics, with the shared data plane (Kafka, DynamoDB, Redis, Amazon RDS, Amazon Redshift, Amazon Glacier, Amazon S3)]
•Green is now the active cluster
•If we need to roll back code, we have a snapshot of the repository in Amazon S3
•We haven’t had to roll back code… yet

74.
Data plane migrations
•Migrations applied to the database are forward only
•We have past experience with two-way migrations, but the costs outweighed the benefits
•Code must be forward compatible in case rollbacks are necessary
•Database schemas are only modified via migrations, even in development and integration environments
•We use an in-house migration service (based on Flyway) to parallelize the process
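The forward-only model above can be sketched as a Flyway-style runner that applies pending versions strictly in order and never runs anything backwards. This is illustrative only; the in-house service also parallelizes the work, which is omitted here.

```python
# Minimal forward-only migration runner: apply pending versions in
# ascending order, record each one, never undo.
def run_migrations(migrations, applied_versions, apply_fn):
    """migrations: {version: sql}; applied_versions: set of versions
    already recorded; apply_fn: callable executing one migration's SQL."""
    for version in sorted(migrations):
        if version in applied_versions:
            continue  # already applied; forward-only means we never revisit
        apply_fn(version, migrations[version])
        applied_versions.add(version)
    return applied_versions
```

Because rollbacks are never run, the code deployed alongside a migration must tolerate both the old and new schema, which is exactly the forward-compatibility requirement above.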

75.
Final Thoughts
•Blue/green deployments can be done in many ways
•Our requirement of never losing customer data made this the best solution for us
•The automation and tooling around our deployment system were built over many months and took a lot of work (built by two people – hi, Dennis!)
•But it is completely worth it, knowing we have a very reliable, fault-tolerant system