On this week’s podcast, Wes Reisz talks with Ben Kehoe of iRobot. Ben is a Cloud Robotics Research Scientist who works on using the Internet to allow robots to do more and better things. AWS and, in particular, Lambda are a core part of iRobot’s cloud-enabled robots. The two discuss iRobot’s cloud architecture. Key lessons from the podcast include thoughts on logging, deploying, unit/integration testing, service discovery, minimizing the cost of service-to-service calls, and Conway’s Law.

Key Takeaways

AWS services such as Kinesis, Lambda, and IoT Gateway were key components in allowing iRobot to build out everything they needed for Internet-connected robots in 2015.

Cloud-enabled Roombas talk to the cloud via the IoT Gateway (which is MQTT) and are able to perform large file uploads using mutually authenticated certificates signed via an iRobot Certificate Authority. The entire system is event-driven, with Lambda being used to perform actions based on the events that occur.

When you’re using serverless, you are using managed infrastructure rather than building your own. That means you have to accept the limitations of that infrastructure where they exist. For example, until recently Lambda didn’t have an SQS integration, so you had to find inventive ways to make things work the way you wanted.

Serverless is all about the total cost of ownership. It’s not just about development time, but about all of the areas needed to support operating the environment.

iRobot takes an approach of unit testing functions locally but does integration testing on a deployed set of functions. A library called Placebo helps engineers record events sent to the cloud and then replay them for local unit tests.

For logging/tracing, iRobot packages up information that a function uses into a structured record that is sent to CloudWatch. They then pipe that into SumoLogic to be able to trace executions. Most of the difficulties that happen tend to happen closer to the edge.

iRobot uses Red/Black deployments to have a completely separate stack when deploying. In addition, they flatten (or inline) their function calls on deployment. Both of these are used as cost optimization techniques to prevent Lambdas calling Lambdas when not needed.

Looking towards the future of serverless, there is still work to be done to offer the same feature set that more traditional applications can use with service meshes.

Sponsored by Goldman Sachs

Goldman Sachs Engineers don’t just make things—we make things possible. Our engineers are innovators and problem-solvers, building solutions in risk management, big data, mobile and more. Interested? Find out how you can make things possible at goldmansachs.com/careers.

Show Notes

What does a cloud robotics research scientist do?

1:15 It’s a very buzz-wordy title, and would only get more buzz-wordy if I were to add serverless to it.

1:25 It’s using the internet to allow robots to do more and better things.

2:50 They were dealing with the connectivity, back-end, firmware update delivery processes - all of those pieces.

3:00 We realised even before launch that while the provider had been chosen a few years before, it didn’t have the scale or extensibility needed.

3:15 We had decided that we wanted to own the back end on the cloud and own that capability as part of core technology of iRobot.

3:25 We wanted to choose a different connectivity provider that manages the connections back to the cloud.

3:40 It turned out that AWS IoT that was launching was a good choice for the connectivity layer.

3:50 While iRobot is historically a device company, and while we have experience with networking robotics and some cloud connected robotics, we hadn’t really done cloud-connected robots at the scale of Roomba before.

4:00 We really didn’t want to be in the business of building scalable cloud applications - we make robots, and we wanted to focus on that.

4:10 AWS Lambda had just come out, and AWS IoT is a serverless offering - there are no scale knobs to tune.

4:20 We looked at the services available from AWS and thought we could build this without having to run any servers or any containers.

4:35 We were able to go all in because switching off the turnkey provider didn’t require porting any existing applications.

4:50 It allowed us to take on a very complex project and deliver it in a timely fashion and support it all with a small team.

What did it look like afterwards?

5:20 We use around 30 different AWS services in production as part of that single line of business.

5:30 That is probably a large number of services used to apply to a single business problem.

5:40 There are many organisations which likely use a lot of AWS services across many different business solutions.

5:45 We made a decision early on that we were not going to be afraid of those services, and that we would become comfortable with CloudFormation custom resources.

6:50 Devices have a bidirectional persistent connection to the cloud - so we can push data down to the robot as well as receiving events from them.

7:05 Events are mostly related to device life-cycle, so when it starts a cleaning mission, it starts transmitting some data about what it is doing.

7:15 At the end of the cleaning cycle, it sends a report that says what it has done.

7:20 It also sends a request to upload a map, which is too large to transmit via MQTT.

7:35 It doesn’t use the MQTT to authenticate; rather, it goes by a certificate-authenticated TLS connection to the cloud.

7:45 Those certificates are signed by iRobot certificate authority - which we have registered with AWS - so any robots that are connecting can be connected to the cloud.

8:10 We don’t have to go through a step of batch-sending all of the robot identities to AWS; we send the authority that signs them once, and the manufacturing process is unaffected.

8:25 When we switched over, there were robots in the field whose cryptographic identities we hadn’t recorded at the factory, and just having a chain from manufacturers in China to US-East-1 could be problematic.

8:45 That was a key choice in selecting AWS IoT, since they allowed you to bring your own device identities, which since then has become a feature of more IoT cloud providers.

9:00 What the robots don’t have is standard AWS credentials, so they can’t upload directly to S3.

9:10 This is where pre-signed URLs come in - these are temporary URLs which have the credentials baked in - so we can send those URLs to the robots and tell them to upload the file.

9:25 The robot does an HTTP PUT to the URL (which is opaque to it) and uploads the file.
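As a rough sketch of the cloud side of this flow, the upload URL might be generated along these lines. The function name `make_upload_url`, the default 15-minute expiry, and the bucket and key layout are assumptions for illustration, not details from the podcast; `generate_presigned_url` is the boto3 S3 client method, and the client is passed in so the sketch makes no assumption about how credentials are configured.

```python
def make_upload_url(s3_client, bucket, key, expires_seconds=900):
    """Return a temporary URL that accepts an HTTP PUT to s3://bucket/key.

    `s3_client` is expected to behave like boto3.client("s3"). The URL
    stops working once `expires_seconds` have elapsed.
    """
    return s3_client.generate_presigned_url(
        ClientMethod="put_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=expires_seconds,
    )
```

With boto3 installed, this could be called as `make_upload_url(boto3.client("s3"), "example-map-uploads", "missions/1234/map.bin")`, with the resulting URL sent down to the robot over its existing connection.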

9:30 Based on that, we can receive events from S3 to determine that we have received that file, and update the mission record so that the app knows there is a map to fetch.
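A minimal sketch of a function reacting to such an S3 event could look like the following. The `Records` / `s3` / `bucket` / `object` layout is the standard S3 notification format; the `missions` dict standing in for the mission store, the `map_location` field, and the key layout `missions/<mission_id>/map.bin` are all hypothetical.

```python
from urllib.parse import unquote_plus


def uploaded_objects(event):
    """Yield (bucket, key) for each object in an S3 notification event."""
    for record in event.get("Records", []):
        s3 = record["s3"]
        # Object keys arrive URL-encoded in S3 event notifications.
        yield s3["bucket"]["name"], unquote_plus(s3["object"]["key"])


def handler(event, context, missions=None):
    """Mark the matching mission record as having a map available.

    `missions` is a hypothetical dict standing in for the mission store,
    and the key layout "missions/<mission_id>/map.bin" is assumed.
    """
    missions = {} if missions is None else missions
    for bucket, key in uploaded_objects(event):
        mission_id = key.split("/")[1]
        missions.setdefault(mission_id, {})["map_location"] = f"s3://{bucket}/{key}"
    return missions
```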

9:50 All of that behaves in an event-driven way, and in IoT, many things are event driven.

9:55 It’s therefore a natural fit for an event driven architecture and therefore a serverless architecture.

What are some of the gotchas in an event-driven infrastructure?

11:20 There’s a couple of things to think about when you’re serverless.

11:35 You need to be able to accept the limitations of these services as they exist.

11:40 A consequence is that (for example) until recently, AWS Lambda didn’t support SQS.

11:45 If you wanted to integrate these, you had to go through a process of setting up a recurring event to drive a Lambda to probe the queue and trigger operations on any messages found.

11:55 That worked and was a pattern that you could build up into a CloudFormation custom resource.
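The polling workaround described above might be sketched as follows, as the body of a Lambda fired by a recurring schedule. This is an illustration under assumptions rather than iRobot's code: `drain_queue`, `process` and `max_batches` are hypothetical names, while `receive_message` and `delete_message` are the boto3 SQS client calls.

```python
def drain_queue(sqs_client, queue_url, process, max_batches=10):
    """Poll an SQS queue and process whatever messages are found.

    `sqs_client` is expected to behave like boto3.client("sqs");
    `process` is a hypothetical callback taking the message body.
    Returns the number of messages handled.
    """
    handled = 0
    for _ in range(max_batches):
        response = sqs_client.receive_message(
            QueueUrl=queue_url,
            MaxNumberOfMessages=10,
        )
        messages = response.get("Messages", [])
        if not messages:
            break  # nothing returned; wait for the next scheduled run
        for message in messages:
            process(message["Body"])
            # Delete only after processing so failed messages are retried.
            sqs_client.delete_message(
                QueueUrl=queue_url,
                ReceiptHandle=message["ReceiptHandle"],
            )
            handled += 1
    return handled
```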

12:05 It certainly wasn’t elegant, and so there’s a lot about going to war with the services that you have instead of the ones that you want.

12:10 You don’t end up having as much of that when you have more control of the software that you’re putting on the cloud.

12:20 We’re using DynamoDB as an operational store - even for data that is relational.

12:30 We then do the join on the client side, in the Lambda - because there isn’t any HTTP support for RDS, and so maintaining a database connection in a highly scaled out Lambda is going to choke your database.

12:45 The cold start time is going to be high because you have to set up a database connection.

12:50 Those are architectural choices which - at the small scale, narrowly scoped to just that choice - are limiting.
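To illustrate what a client-side join inside a function can look like, here is a minimal in-memory sketch. It assumes the items have already been fetched from two separate tables as plain dicts; the table contents and field names (`robot_id`, `name`, `robot_name`) are hypothetical.

```python
def join_missions_with_robots(missions, robots):
    """Join mission items to robot items on robot_id, in memory.

    Both arguments are lists of plain dicts, as if already fetched from
    two separate tables. Field names here are hypothetical.
    """
    robots_by_id = {robot["robot_id"]: robot for robot in robots}
    joined = []
    for mission in missions:
        robot = robots_by_id.get(mission["robot_id"])
        if robot is None:
            continue  # inner join: skip missions with no matching robot
        joined.append({**mission, "robot_name": robot["name"]})
    return joined
```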

13:05 Serverless is looking at the total cost of ownership across all of the system; not just the development time and the bill, but your operations, your time to market and so on.

13:35 If Amazon RDS ends up getting an HTTP interface, we will reconsider.

13:40 This happened with Athena - it originally started with just a JDBC connection, then it gained an HTTP connection.

14:30 In the ideal case, the amount of code you’re writing is minimal, so the function development isn’t the primary focus when you’re creating serverless architecture.

14:40 You need to focus on the services; how you bring it all together.

14:45 If you’re using Lambda to handle all of the computation but accessing databases and message queues and authentication that’s running on Kubernetes, that’s less serverless than a system that’s using managed services.

15:10 In the overall system, the code should ideally just be the business logic that you’re doing differently from everybody else.

15:30 The starting point of creating a service for us is starting in CloudFormation, not starting in Lambda; figuring out the building blocks that you need to wire together.

15:40 Once you know how it is structured, and where you’re using Lambda, you can then decide what code goes in there.

15:45 Once you’re in Lambda, we test locally but integration testing occurs on the deployed system.

16:00 Wanting to be able to use any service that we need means that we can’t rely on everything being locally mocked - so we run integration tests in the cloud.

16:15 On the other hand, for unit testing we can use Placebo with our Python code, which hooks into the AWS SDK.

16:35 This allows you to record and replay AWS calls with your credentials, and records the state of the database, S3 etc at that time.

17:10 That means your unit tests can run locally without needing to mock out services, because it’s using the SDK calls to intercept requests.
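The record-and-replay idea can be illustrated with a small standard-library sketch. This is not Placebo's API or implementation, only a toy version of the same technique: responses are captured to JSON files once, then served from disk in later test runs. The class name `RecordReplay` and its `call` method are hypothetical.

```python
import json
from pathlib import Path


class RecordReplay:
    """Wrap a callable so its results can be recorded once and replayed later."""

    def __init__(self, data_dir, mode):
        self.data_dir = Path(data_dir)
        self.mode = mode  # "record" or "playback"

    def call(self, name, func, **kwargs):
        path = self.data_dir / f"{name}.json"
        if self.mode == "playback":
            # No network access: return the response captured earlier.
            return json.loads(path.read_text())
        response = func(**kwargs)
        self.data_dir.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps(response))
        return response
```

In record mode the wrapped function is really invoked and its response saved; in playback mode the function is never called, which is what lets a unit test run locally without credentials.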

17:45 Both of those work - it’s developer preference - but they both do local unit testing and remote integration testing.

20:30 The developers are then using local tools to perform the deployments.

21:05 Often you want your long lived data resources to be separate from your short lived resources, even in the same service.

21:20 You’d want your template that contains the function code to be separate from the template that defines the database, because the database doesn’t change frequently - or may be defined externally.

21:35 We allow developers to define - in a single template - all of the resources that they depend on, and when we deploy it, say that the resources may be defined elsewhere through a reference.

27:20 If you look into the code repository, there is very clear isolation between the different functionality.

27:40 The functionality that we provide for Roomba today, we don’t charge for.

28:20 There was a very interesting transition from the turnkey provider - they have a fixed cost, per device, per year.

28:35 AWS is all pay-per-use - and we went to the cost folks with predictions of what it was likely to be.

28:45 They’re used to buying parts and hardware and know the costs years in advance - but pay-as-you-go is a paradigm shift for a hardware manufacturer for cost forecasting.

How does the deployment flattening work?

29:45 Every service provides an SDK - we’re Python across the whole thing.

29:55 Normally the SDK would be a thin client and talk back to the HTTP interface.

30:00 We can make that a thick client that includes most of the logic behind the API, and at deployment time, the thick SDK gets pulled over into the caller and is deployed with the service so that it can access the resources directly.

30:20 The separation of ownership is still present - that service still owns the data, but its logic is shifted into the calling function itself.
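One way to picture the thin and thick client arrangement is the sketch below. It is an illustration under assumptions, not iRobot's SDK: the class names, the `get_mission` operation, the `transport` callable and the dict used as a data store are all hypothetical. The point is only that callers see the same interface whichever client is chosen at deployment time.

```python
class ThinClient:
    """Default SDK: forwards each call to the owning service's API."""

    def __init__(self, transport):
        self._transport = transport  # hypothetical callable(operation, payload)

    def get_mission(self, mission_id):
        return self._transport("get_mission", {"mission_id": mission_id})


class ThickClient:
    """Flattened SDK: runs the service's logic in the caller's own function."""

    def __init__(self, table):
        self._table = table  # hypothetical dict standing in for the data store

    def get_mission(self, mission_id):
        return self._table[mission_id]


def make_client(flatten, transport=None, table=None):
    """Pick the client at deployment time; callers use the same interface."""
    return ThickClient(table) if flatten else ThinClient(transport)
```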

30:45 When the app makes a call to the cloud, it’s talking to an API which is then talking to a wrapper around the service in the cloud.

30:50 I wouldn’t recommend it as a normal pattern - we have unique requirements because of our business and our scale that make it a wise choice for us.

31:00 We’re looking at our next generation functionality - where it’s going to be in the hot path - we still want to put it in this model where we’re avoiding Lambdas calling other Lambdas.

31:20 While you don’t pay for the Lambda function if it’s not running, you do pay for it if it’s waiting for something else.
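The cost effect of one function waiting on another can be shown with simple arithmetic. The sketch below counts billed milliseconds only and ignores memory size, per-request charges and invocation overhead; the function names and the example durations are illustrative assumptions.

```python
def billed_ms_nested(caller_work_ms, callee_work_ms):
    """Caller invokes callee synchronously and waits for the result."""
    caller_billed = caller_work_ms + callee_work_ms  # billed while it waits
    callee_billed = callee_work_ms
    return caller_billed + callee_billed


def billed_ms_flattened(caller_work_ms, callee_work_ms):
    """Callee logic is inlined into the caller, so the work is billed once."""
    return caller_work_ms + callee_work_ms
```

For a caller doing 100 ms of its own work and a callee doing 400 ms, the nested form bills 900 ms in total while the flattened form bills 500 ms.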

31:30 However, the monolith has an impact on our cadence - it’s still a model that’s working for us and is mature.

31:50 As we want to implement and prototype more features, we need to be able to deploy experimental or testing features alongside our monolith.

32:05 We’re looking for how to deploy them independently, and where the rough edges are for authentication, service deployment and discovery - all those problems we will need to solve.

What is the reverse Conway’s law?

32:20 Conway’s law says that the architecture of the software you build reflects the communication structure of your organisation.

33:00 When you change from traditional architectures to serverless architectures, you want to change the communication patterns to be more event driven.

33:15 What that implies is that if you don’t think about how to make your organisation communication patterns match what you want to do now in your software, you won’t be successful in changing.

33:30 You’re going to end up trying to fit traditional architectural patterns into a serverless world, and there will be an impedance mismatch.

33:40 You can build synchronous serverless services, but it’s a more natural fit to think about event-driven services.

33:50 Serverless architecture is ephemeral, and events are also ephemeral.

34:00 It’s useful to think in those patterns of how to pass information between services in an event-driven fashion, rather than thinking of writing to a database and periodically polling that database.

34:20 So instead of thinking about it in those terms, think of it as changes to the database being a stream of events which can be used to notify someone that something has happened.
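As a sketch of treating database changes as a stream of events, a function subscribed to a DynamoDB stream might reduce each batch to simple change notifications. The podcast does not say which mechanism iRobot uses, so DynamoDB Streams is an assumption here; the record fields (`eventName`, `dynamodb.Keys`, `dynamodb.NewImage`) follow the stream record format, and the output shape is hypothetical.

```python
def changes_from_stream(event):
    """Turn a DynamoDB Streams batch into simple change notifications."""
    changes = []
    for record in event.get("Records", []):
        data = record["dynamodb"]
        changes.append({
            "type": record["eventName"],  # INSERT, MODIFY or REMOVE
            "keys": data["Keys"],
            # Absent for REMOVE, or when the stream view type omits it.
            "new": data.get("NewImage"),
        })
    return changes
```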

34:25 This broadens what you can consider an API - if you have a serverless infrastructure, there will be an API gateway somewhere.

34:35 There also may be a Kinesis stream of events that you can hook into.

34:50 I like Step Functions as a model for workflow orchestration - state as a service, which goes hand-in-hand with stateless computation.

35:00 I like the idea that there’s federated state machines, where perhaps these state machines or workflows are part of the API between services and events that can be worked on, or be fired out.

35:30 These state machines can then be connected to each other to compute in a distributed fashion, while still being surrounded by a bounded context.

36:15 Service meshes get this model from being able to deploy a side-car alongside the code.

36:20 In serverless, you don’t get the option to run side-cars - functions are only running when they are invoked.

36:30 That means if you put something inside your function to be a side-car, it’s only going to run when your service is running.

36:35 That means it’s going to spend most of its time catching up to what happened while it was off.

36:40 All of these things that are being solved in Kubernetes around a service mesh - authentication, authorisation, service discovery - aren’t solved yet in serverless infrastructure.

36:55 AWS users generally now have to think about multi-account environments - what does authentication and authorisation look like between accounts?

37:05 Who is defining authorisation policy - is it the caller, or the called service?

37:10 Doing discovery as a service using AWS’ parameter stores - how does that work across accounts?

37:40 Right now, you can cobble it together with various services, but it’s not something that happens out of the box, and it’s not plug-and-play.

37:50 At the same time, exactly what it should look like is an open question.

38:00 We in the community need to do more work on what it should look like - that should be the next big step.

About QCon

QCon is a practitioner-driven conference designed for technical team leads, architects, and project managers who influence software innovation in their teams. QCon takes place 8 times per year in London, New York, Munich, San Francisco, São Paulo, Beijing, Guangzhou & Shanghai. QCon New York is at its 9th Edition and will take place Jun 15-19, 2020. 140+ expert practitioner speakers, 1000+ attendees and 18 tracks will cover topics driving the evolution of software development today. Visit qconnewyork.com to get more details.

More about our podcasts

You can keep up-to-date with the podcasts via our RSS Feed, and they are available via
SoundCloud,
Apple Podcasts,
Spotify,
Overcast
and Google Podcasts.
From this page you also have access to our recorded show notes. They all have clickable links that will take you directly to that part of the audio.