I've been watching the rise and maturing of AWS lambda and similar offerings with excitement. I've also shipped several microservices in both node and Java that are entirely serverless, making use of API gateway, lambda, dynamo db, sqs, kinesis, and others.

For the simple case, I found the experience to be great. Deployment was simple and made use of shell scripts and the excellent AWS CLI.

I've been hesitant to build anything serious with it, though. The primary concern has been visibility into the app: its operation can be quite opaque when deployed that way.
Further exacerbating the issue, we've lost CloudWatch logs and other reporting a few times due to configuration issues and improper error handling, problems that would have been much easier to identify and diagnose on a real server.

Have you shipped anything serious with a serverless architecture? Has scaling and cost been favorable? Did you run into any challenges? Would you do it again?

• Eventually you need to promote a function to a legitimate service for cost savings (e.g. GCF → App Engine Node service)

• You need a buffer such as Pub/Sub (e.g. when GCF outscales the services it calls)

• Multi-repo/project layout is best for deployment speed, but needs extra dev/CI tooling to simplify boilerplate

• Minimizing costs can be creative/tricky compared to legacy services

• GCF is great for automatic stats and logging (Stackdriver) and "it just works" configs, compared to Lambda

• Don't go cloud functions everything, just the parts that are a good fit
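The buffer point above deserves a sketch. When a function publishes to Pub/Sub instead of calling a downstream service directly, the downstream consumer receives a message envelope with a base64-encoded data field. This is a minimal illustration of that envelope shape, assuming JSON payloads; the function names are mine, not part of any Google client library:

```python
import base64
import json

def to_pubsub_message(payload, attributes=None):
    """Wrap a payload the way Pub/Sub hands it to a subscriber:
    the data field is base64-encoded bytes (JSON-encoded here)."""
    return {
        "data": base64.b64encode(json.dumps(payload).encode()).decode(),
        "attributes": attributes or {},
    }

def from_pubsub_message(message):
    """Decode a Pub/Sub-style message back into the original payload."""
    return json.loads(base64.b64decode(message["data"]))
```

The point of the indirection: the function returns as soon as the message is accepted, and the queue absorbs the burst instead of the downstream service.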

We've gotten good uptime using cloud functions, but we're always pushing for more nines. Since the functions tie together a bunch of backend Pub/Sub queues and services/stores, a brief cold start or queue backup has no notable impact on the overall system latency or throughput.

BTW, the coolest feature of AWS Lambda I've found is tying it to SES/SNS for inbound and outbound email routing. I've been running my personal email for years through a Lambda function for a few cents a year.

Overall the space is rapidly evolving and we'll see lots more features on Azure Functions, AWS Lambda, and Google Cloud Functions. See our learnings [1].

I'm curious about that as well. Does he have a Lambda that writes the email he receives to a spool file somewhere? What does that workflow look like? What's the advantage over just using a regular email client or Gmail?

The advantage is collecting email from disparate addresses (e.g. admin@your-domain.com) and forwarding them all to your preferred Gmail accounts. You can't set up ACM or other certificate services without something like this to prove domain ownership. Previously I paid ~$100/yr to get all these mail forwarding routes set up. You can also capture inbound mail directly to a queue (e.g. unsubscribe@your-domain.com) and feed it to a Lambda to take action.
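The core of such a setup is just a routing decision inside the Lambda that SES invokes. Here's a minimal sketch of that decision, with a hypothetical routing table; the addresses and function name are mine, and the actual SES event parsing and forwarding calls are omitted:

```python
# Hypothetical routing table: inbound address -> forwarding target.
# A value of None means "don't forward; hand off to other processing
# (e.g. push to a queue)".
ROUTES = {
    "admin@your-domain.com": "you@gmail.com",
    "unsubscribe@your-domain.com": None,
}

def route_recipient(recipient, default="catchall@gmail.com"):
    """Decide where an inbound SES message should go."""
    recipient = recipient.lower().strip()
    if recipient in ROUTES:
        return ROUTES[recipient]
    # Unknown addresses on the domain fall through to a catch-all.
    return default
```

In practice the Lambda would read the raw message from S3 (where the SES receipt rule stored it) and re-send it via SES to the resolved target.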

Developer time is the highest cost. Building and deploying a function in an hour minimizes time to market and opportunity cost. Lifting and shifting to a dedicated service like App Engine is a "good problem to have" once the function's usage exceeds certain thresholds. And if you never reach those thresholds (e.g. only a few thousand calls a day), you never spent the time/money standing up dedicated infrastructure.

That just isn't true. For many companies the cost of their IT infrastructure is a very small fraction of their annual turnover. It all depends on the economics of your particular case whether or not this makes sense. For some companies it does, for some it doesn't.

That point comes way further along, assuming you're talking about bare metal and not simply cloud instances.

For my workloads, serverless functions would end up costing 4-5x more than operating the infrastructure myself. Given that my infrastructure costs exceed the salary of some number of developers, that's a very real difference.

I have spent the last year and a half building a completely serverless production service on Lambda, API Gateway, and DynamoDB (along with the standard auxiliary services like CW, SNS, Route53, S3, CF, X-Ray, etc.). It was a lot of work establishing new patterns for many of the operational aspects, particularly custom CW metrics and A/B deployments with Lambda traffic shifting, but in the end everything is set up nicely and I'm quite pleased with the end result. We're starting to ramp up traffic now by orders of magnitude (with many more to come) and it's soooooo awesome knowing the stack is pretty much bombproof. Another super-nice thing is all internal authentication and networking being controlled by IAM rather than security groups/VPC/traditional networking - that aspect alone eliminates a tremendous number of headaches.

My biggest complaints are probably DynamoDB eventual consistency (unavoidable when using GSIs), occasional CloudFormation funkiness (though no urgent prod issues yet, thankfully), CodeDeploy CW alarm rollback jankiness (which doesn't tell you which alarm triggered a rollback!!), and Lambda cold starts. But none of these are too terribly concerning and I have faith they'll get incrementally better over time, hopefully.

The biggest cautionary tip I have is that we run all our Lambdas with the max 3GB memory, both for peace of mind and because the underlying EC2 instances have significantly faster CPU. We were seeing weird timeouts and latency initially with <1GB memory, so I'd be hesitant to run the service if the extra cost of the largest memory allocation were a concern; for us it is not.

Another cost concern I should also mention is that we mitigate cold starts by running multiple canaries using scheduled lambdas (in addition to the standard canary role of generating a baseline of metrics and immediately detecting/alarming on end-to-end issues). We are effectively maintaining a constant warm pool which, in theory anyway, greatly decreases the chances customer traffic will hit cold starts. I'm not intimately involved with the financial aspects but I suspect achieving the same effect with EC2 would be significantly cheaper, at least with respect to infrastructure costs. I would guess, though, that the developer time savings achieved by massively reduced ops burden and overall system simplicity are probably comparable to the increased infrastructure cost, and very possibly hugely outweighing it.

Your setup, although appealing from an AWS ecosystem perspective, sounds a bit expensive to me on first reading. Of course everything depends on the specifics, but Lambda and DynamoDB are expensive at scale. I wonder how it compares cost-wise to a more traditional solution.

I've felt the same way. Serverless is most appealing when you're starting out and have low traffic. It enabled us at ipdata.co to have the most global infrastructure possible with the lowest latencies at an insignificant cost.

At some point I believe when we're big enough we might switch to using servers in all the regions where we currently run APIG+Lambda.

The advantage of Serverless to me seems that it already forces you in a somewhat sane design and separation of concerns so all work out into the lambda functions should be easily translatable into a different architecture.

Yes, very much this. A lot of the effort was extremely in-depth planning for scalability with respect to both unbounded traffic growth/adoption and expanding the team. In fact I coded an initial prototype in about a month that could have run quite happily on a single instance and subsequently been broken out into a typical LB/autoscaling group/DB architecture without much trouble. Now we have about 7 core microservices which are independently scalable and deployable, many of which we anticipate handing off to dedicated teams as we expand and hire.

This is one reason we (FaunaDB) offer on-premise as well as managed cloud options. So you can run the database on any machines you want. When ease-of-use matters most, small scale apps are cheaper on cloud. With high transaction volumes, you can pre-purchase cloud capacity or run on your own iron.

It really wouldn't be that bad - at least not any worse than any other migration of a massively complex project from one platform to another. We deliberately kept our implementation flexible enough to be able to move off Lambda if necessary. Our entire stack can be containerized using Docker. All database interactions are behind interfaces that allow us to swap DB implementations if needed (even to relational ones). All custom CW metric publishing is centralized in a single object that can just as easily publish somewhere else. All our APIs are defined using Swagger which is portable to lots of tooling. The worst part would be replacing IAM with whatever networking/permission model the other platform had, but even that could be approached programmatically to reduce the difficulty.

Edit: We also broke the stack into a number of independent microservices, each with their own API, DB, and dedicated CI pipeline. This would allow us to incrementally migrate chunks in parallel without disrupting the entire service.

At one point I was highly concerned with lock-in, but I'm becoming less so. In the last 7 or 8 years the companies I've worked for have exclusively used AWS, and prices have come down and stayed competitive. Giving up platform independence also allows you to take advantage of platform-specific features, which can make development a lot faster.

I still think platform neutrality is a good goal, but I'm starting to view it like I do database neutrality. It's great in theory to be able to swap out postgres with mysql and vice versa, but you miss out on a lot of features of postgres that aren't portable. And in practice, I've never swapped out postgres for something else. Just some thoughts.

I wouldn't see it as that black and white. If you do it wrong, of course you will have a hard time moving to a different vendor. But if you approach it right and put the proper abstractions in place, you can switch to any other cloud provider or in-house solution without too much hassle. It's all about what cost you want to pay, and when. Do you want to invest upfront in all the development time for frameworks and infrastructure to support your core business API, or do you just want to get the MVP out and invest a little more once you've established a solid user base?

Yes, it's definitely expensive. But, as I stated, cost isn't much of a concern in our particular situation, at least not now. I'll also point out that the entire team is about 1/3 the size of other teams running comparable non-serverless services in production so there is a massive cost savings with respect to developer salaries. There are also multiple viable ways to incrementally migrate to more cost-effective implementations that I've detailed in other comments.

We were recently on the receiving end of a massive HTTP GET Flood DDoS and although we did not experience any downtime as a result of it, I ended up finding out about it a few days later when billing alarms started going off.

We were wary of limiting paid users. Even with lambda's max concurrent function executions limit, when the function completes in a few milliseconds, the number of invocations per second can still be high.
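The arithmetic behind that concern is worth spelling out: the concurrency limit caps simultaneous executions, not request rate, so fast functions translate a modest cap into a very high throughput ceiling. A quick sketch of the bound (the function name is mine):

```python
def max_requests_per_second(concurrency_limit, avg_duration_ms):
    """Rough upper bound on sustained invocations/sec: each concurrent
    slot can be reused once the previous invocation finishes."""
    return concurrency_limit * (1000.0 / avg_duration_ms)
```

For example, a cap of 100 concurrent executions with a 5ms function still admits around 20,000 invocations per second, which is why the concurrency limit alone is a blunt instrument for limiting spend.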

It's also possible to use API Gateway canary deployments and dedicated "preprod" stages as well - the benefit being you can, in theory, use traffic shifting lifecycle hooks to deploy a new version completely isolated from customer traffic and test it before incrementally rolling out your main A/B deployment. I created an experimental proof-of-concept for this but haven't had the time to flesh it out for production usage, but would very much like to at some point.

tl;dr The answer to your question is yes, there is a good story around CI with Lambda using the AWS ecosystem. However Lambda alone is not something that should be relied on for long-term or definitive versioning on its own.

Edit: thinking about your question more, it seems you are assuming that individual developers will directly edit Lambda function code in the console and you want to track versions or trigger deployments based on that activity. You absolutely should not ever be editing Lambda code manually directly in the console outside of one-off experiments/prototypes that are completely unrelated to dev/test/production. Always keep your Lambda function code under source control and deploy using zips uploaded to S3 and CloudFormation (the serverless framework[1] provides good tooling for this, though we don't use it - instead we use SAM and our own internal tools).

Broadly speaking, it started out as a greenfield/experimental project with buy-in from senior management that is now going through the initial phase of productization and productionization. Although the AWS bill sounds expensive the whole project was effectively done by 3 developers including myself (2 backend/full stack and 1 frontend). There were also some deliberate management decisions that greatly prolonged implementation time - we could have easily shaved 8 months off that figure taking a more straightforward path.

That is still a year long, 1/4M or more speculative investment. I would not moan too loudly about mgmt interference prolonging the project - that was a rare piece of long term willingness to invest for future technical gains - rare in my experience. (But imo the only way)

Yes, it was/is a rare opportunity and I'm very grateful I had/have it. (That being said, the potential upsides are in the range of tens to hundreds of millions of dollars, even with only minimal success - so there is definitely a very real business incentive to invest in the project.) The prolongment I'm referring to isn't to do with the project being put on hold or anything like that. Essentially what happened was it was decided we'd build an "alpha" version of the stack to 'validate' the value of the project even though it was plainly clear what we had to do. The alpha stack nominally was supposed to be a cheap, quick version of the real architecture (which we'd already designed) whose supposed savings were gained by substituting manual processes for some of the APIs rather than actually building them. The end result was a. confirming what we already knew in that yes, the proposed functionality is fundamentally useful (very obvious from the outset) and b. a huge diversion of time and effort doing throwaway work that was about 65% as much work as just doing it the right way would have been, with the additional burden of having to perform the manual "API" functions, operational overhead of maintaining that stack while implementing the real one, burden of having to migrate data from the alpha stack to new stack once it was ready, and work to deprecate/tear it down once it was completely out-of-use. The overall wasted time was easily 8 months.

Lambdas aren't stateless or ephemeral either. Anything that occurs on code load will persist in that container (i.e., if you initialize something at the module level in Node, it will persist between calls to the same container). This can cause all kinds of weirdness if you aren't aware of it: for instance, I've seen a dev read some data at load time, perform destructive operations on it as part of data transformations in the code, and then wonder why he was getting non-deterministic results back. And there's half a gig of temp space on each container you can write to as well.
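The same pitfall exists in every Lambda runtime, not just Node; here's a minimal Python sketch of it (the config values are made up). Module-level state initialized at cold start survives across invocations on a warm container:

```python
# Loaded once, at cold start; shared by every invocation this
# container serves afterwards.
CONFIG = {"retries": 3}

def handler(event, context=None):
    # BUG: mutating module-level state. The change leaks into later,
    # unrelated invocations served by the same warm container.
    if event.get("aggressive"):
        CONFIG["retries"] = 10
    return CONFIG["retries"]
```

Calling the handler with `{"aggressive": True}` once changes the result for every subsequent call on that container, which is exactly the kind of non-determinism described above. The fix is to copy (or never mutate) anything initialized at load time.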

While the definition of what constitutes 'serverless' is pretty ambiguous, no one includes ephemeral state as a system requirement; otherwise you'd have something useless.

DynamoDB is generally viewed as serverless because there's no management of an underlying VM, and for some definitions because it can scale out horizontally automatically, without downtime, to meet demand (as compared with RDS, or another managed database solution that can only scale vertically).

When you create a DB with DynamoDB, you're just telling AWS "I need a database" and it gives you one. No need to worry about deciding how much CPU power, RAM, or storage you'll need for it.

> It seems that every every "hosted" solution is now being dubbed "serverless". :\

Got a good example of something being called "serverless" that you don't think should be? I mean, yes, DynamoDB, Lambda, etc. are all running on servers. But the idea is that you don't manage them. No packages to keep up to date. No worrying about whether or not the instance size you chose is big enough. No dealing with autoscaling to meet demand when a million reddit users hit your app.

DynamoDB, Aurora, Kinesis, etc. all existed BEFORE Lambda, and no one called them "serverless" until Lambda; now everything is called that.

Meanwhile Kinesis requires you to specify the number of shards you have to use, so there is management even if they're not called "servers".

> No need to worry about deciding how much CPU power, RAM, or storage you'll need for it.

You realize you specify the RAM for Lambda functions, which correlates with CPU.

And with Dynamo you specify RCUs and WCUs and you enable autoscaling which adds more...

I'm not trying to be pedantic about "the cloud being just someone else's servers". I mean that "serverless" to me is a very explicit thing about Lambda and writing stateless code, and every existing hosted multi-tenant service shouldn't just be dubbed that.

DDB is serverless in the sense that you don't have to worry about scalability issues, but you do have to worry about design issues[1] if you want to take advantage of everything DynamoDB has to offer. I find Dynamo's autoscaling very problematic: it's slow to kick in and you can only scale down 4 times in 24h, which is not cost-effective if you have spikes.

We've shipped something simple and non-mission-critical in production (URL rewriting for ad placements).

It has been pretty much set-and-forget. Last anyone had to even look at it was almost 3 years ago, and afaik it's still working (our ad sales team would be complaining loudly if it weren't).

For something peripheral like that, it's nice not to have to run servers for it or devote any energy to keeping it running.

In terms of both server costs and upkeep costs, the economics have been highly favorable.

I'm not sure I'd use it yet for something mission-critical or that shipped changes frequently. My recollection is that when we did have to adjust it, debugging was a bear. Though tooling for that may have improved in the last 30 months.

You front production with an API gateway and version your APIs. You turn migrations into a business validation and testing process rather than a technical dependency; using the load balancer / gateway as the control lever.

In this way, serverless can actually be WAY better than traditional methods for dealing with frequent changes. You can have many valid endpoints, but only one “production” endpoint that changes based on business rules (or even split for A/B testing)

Example: there's a website for all the weather radar stations in Canada that lets you see the last two hours of 10-minute radar snapshots. I wanted a dump of them to try some ML algorithms, but even after requests and emails there simply wasn't one.

So I set up a lambda to run every hour, load the website for all 31 weather stations and save all 6 images from the last hour to S3.

It took me an hour to setup. I've never gotten around to making that ML project, but lambda has kept on chugging away for me, GBs of data saved away. The only real cost is the s3 storage, still under $2/month.
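The heart of a scraper like that is just generating the snapshot URLs for each hourly run; the fetch-and-put-to-S3 part is boilerplate. Here's a sketch of the URL generation under an entirely hypothetical URL scheme (the real radar site's paths and station IDs will differ):

```python
from datetime import datetime, timedelta

# Hypothetical URL pattern for 10-minute radar snapshots.
BASE = "https://example.org/radar/{station}/{stamp}.gif"

def snapshot_urls(station, end, count=6):
    """URLs for the last `count` 10-minute snapshots ending at `end`,
    newest first."""
    return [
        BASE.format(
            station=station,
            stamp=(end - timedelta(minutes=10 * i)).strftime("%Y%m%d%H%M"),
        )
        for i in range(count)
    ]
```

A scheduled (hourly CloudWatch Events) Lambda would loop over the 31 stations, fetch each URL, and `put_object` the bytes to S3 keyed by station and timestamp.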

I use a Raspberry Pi with Raspbian for things like these. It's a really cheap one-time cost (ignoring power costs, which are fairly minor), and you can setup all sorts of things on it. If you don't need high availability, running a small system at home can often be a good solution.

The initial learning curve for running your own system is a bit higher, but I think it's well worth learning. Once you have the system up and running you start finding all sorts of uses for it. So if you need it for just one thing it might not be worth it, but as you start doing more things with it I think it really pays off.

This is brilliant! I hadn't yet found any reason to mess with Raspberry Pis, but the way you've described it makes a lot of sense.
They'd definitely be a maintenance overhead and uptime wouldn't be guaranteed, but I can imagine the sense of self-sufficiency you'd get would be very satisfying.

If you are into Python I highly recommend Zappa. It turns your Flask app endpoints into Lambda functions + API Gateway. The big benefit is that it's trivial to test locally: before it gets transformed, you can just do 'flask run' and use Postman to test the endpoints.

I can't name any names for obvious reasons or give you more hints about what industry this company is in, but I just did DD on a very impressive outfit that ran their entire company on Google's cloud platform. It held about 500T of data and held up amazingly well under load.

I was super impressed with how they had set this all up and they were extremely well aware of all the limitations and do's and dont's of that particular cloud implementation.

Obviously there is the lock-in problem, if you ever decide to move you have a bit of work ahead, so build some abstraction layers in right from day 1 to avoid hitting all your code if that time should ever roll around.

Well, given that it is serverless the infrastructure is operated by the provider, in this case Google.

That leaves the company to use the various APIs.

So you use Google Cloud Functions to ingest data and do all preliminary processing, store the data in one of the various persistent storage options (Spanner, Bigtable, whatever is best suited for the job), then optionally use background functions or containers for further processing or presentation.

Given that a reasonably short while ago I did not yet see Google as a serious contender in this space I'm actually surprised how far they have come.

You can basically create an enterprise class application dealing with vast amounts of data and never even know on what silicon (or where...) your processes are running.

Of course you still have to give some parameters, such as in which DC you want to run your stuff but on the whole it is about as painless as it can be.

My uptime monitoring project uses AWS Lambda heavily, almost exclusively https://apex.sh/ping/ — it has been great. I've processed 3,687,727,585 "checks" (requests really) with it, and I only had roughly 1 hour of downtime two years ago in a single region. Since then it has been stable.

I have 14 or so regions, so doing the same thing with EC2 would have considerable overhead, though I can still imagine many cases where Lambda would not be cost effective. Its integration with Kinesis is fantastic as well; stream processing almost cannot be easier. And while people say Kafka is more cost-effective, with a bit of batching you can get a long way with Kinesis.

99% of our SaaS analytics frontend is backed by AWS Lambda. I love not worrying about underlying infrastructure. We have close to 150 Lambdas running our API. We don't use API Gateway; instead we use Apigee. For logging we built a module that logs to Kinesis, then to S3 and Elasticsearch. We hardly ever look at CloudWatch; those logs get expensive after a while, so we only keep 3 days. We use Node, Python, and Java depending on needs. It's a good idea to benchmark your Lambdas to determine the resource size: a little bump can make a dramatic difference in execution time, but past some point you're just wasting $$$.

We've actually found them excellent to work with, fairly cheap and scaling very efficiently.

The main issues we've had are internal disagreements about how to pass app settings in efficiently. Originally we put all the app settings in the ARM templates used to deploy them. Then we put them as variables in VSTS. Finally we decided to put them into variable groups, which are within task groups, which are used by releases. It's a bit of a weird chain of dependencies, but now all of our parameters are located in one place.

I've found functions apps have a slow startup time, but once they are going, they perform pretty fast.

I think we currently don't have any functions in time-sensitive streams (they're rather new, so they are used for new features such as Oyster automated refunds, which can be applied a few days later if need be), so when I say we haven't had any performance issues with them, it has to be taken with a pinch of salt.

I think if we were using microservices, they would have substantially more logic in them than the individual functions have, so they would have a lower network overhead. They wouldn't scale automatically, so we'd have to have quite a few machines on all the time, which would cost a bit more.

The main benefit we enjoy from functions is the ability to change code with zero downtime, low risk and minimum disruptions to the whole service.

We've found the interfaces in Azure also to be great - we can write end-to-end tests that can poll service bus to check when all messages are delivered to assert against any final case.

At ipdata.co we use the same APIG+Lambda setup replicated in 11 regions to have the lowest latencies globally. We had to do some extra work to get API keys and rate limiting working but it was worth it. Our setup averages ~44ms response times - https://status.ipdata.co/.

We wrote about our setup in detail on the Highscalability blog [1]

A few things have changed since we wrote that article;

- We implemented custom authorizers, which have helped lower our costs; the auth caching means authentication only happens once per x minutes, and all subsequent requests are much faster.

- We use redis and a couple of Kinesis consumers running on a real server to sync data across all our regions. This setup has been battle tested and has successfully processed more than a hundred million API calls in a single day in near real time. [Use pipes and mget in redis for speed]

Here are some answers to a few specific things you raise in your question;

1. Use Sentry for Lambda for error handling. The logs you get are incredibly detailed and have single-handedly given us the greatest visibility into our application, more so than any other tool we've tried (like AWS X-Ray).

2. CloudWatch Logs are tough. You might want to consider piping your logs to an Elasticsearch cluster, though that might be a bit costly if you use AWS's hosted Elasticsearch Service.

3. We use terraform for deploying our lambda functions and other resources. I'd strongly recommend it.

We at ReadMe recently launched Build (https://readme.build), a tool for deploying and sharing APIs! It uses serverless under the hood, which makes it fast and easy to spin up your tasks in the cloud. Services can be consumed with code (Node, Ruby, Python), or via integrations with Slack, Google Sheets, etc. All you need is the one API key we provide to you.

We use it internally when fetching usage metrics, receiving notifications for new sign-ups, and to monitor page changes on our enterprise app (https://readme.io), and use these endpoints frequently from Slack channels.

AWS Lambda performs wonders. It's enabled us to make it as simple as humanly possible to create, deploy, and share functions. It requires little prior knowledge to start tinkering with, has a growing community to provide support, and handles smoothly in a production setting.

Instead of hitting our ingresses/load balancer, we made it so that webhooks hit a Cloud Function, which then transforms them into Cloud Pub/Sub messages.

We listen to the cloud pubsub from a worker.

1) We don't manage it. We receive quite a lot of webhooks and it's nice to offload that.
2) All of our webhooks are async. We just have 1 worker that handles it all, instead of provisioning a bunch of pods.
3) Managing cloud functions is dope, since you can make them autodeploy from git.

10/10, would use again. Not sure about building a whole app around it though.

For our processing we handled 200k uploads in one hour on ayvri when we were building scenes for the Wings for Life event (the world's largest organized run).

When just relying on serverless triggering an event from S3, the cost was high due to the volume of scaling, time spent spinning up new services, etc. We built a queuing system which manages load and spins up new instances based on the load in the queue. This resulted in a much faster response and a SIGNIFICANT reduction in cost.
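The scaling decision in a setup like that boils down to sizing the worker pool against the queue backlog. Here's a hedged sketch of that core calculation (the function name, thresholds, and the idea of a per-worker throughput figure are my assumptions, not ayvri's actual implementation):

```python
import math

def desired_workers(queue_depth, per_worker_throughput,
                    min_workers=1, max_workers=50):
    """How many instances to run so the current backlog drains in
    roughly one polling interval, clamped to a sane range."""
    needed = math.ceil(queue_depth / per_worker_throughput) if queue_depth else 0
    return max(min_workers, min(max_workers, needed))
```

A small scheduler polls the queue depth (e.g. SQS `ApproximateNumberOfMessages`), calls something like this, and launches or terminates instances to match, which smooths out the per-event spin-up cost that made the naive S3-trigger approach expensive.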

For the ayvri website, some pages are slow due to the Lambdas not being warm, and I'm surprised users haven't complained. The important stuff is kept warm, and we're working on scaling that out for more responsiveness across the site.

As far as visibility into the app, I'm not going to pretend this is a solved problem. At the moment, we have most of the visibility we need via cloudwatch, and we have built some of our own analytics.

We had one instance where there was an issue with db connectivity which we were not able to resolve. We have put it down to a short networking issue between services. It lasted for 5 minutes one Sunday morning and then went away. So we had enough visibility to know the service wasn't available, but failed at the deeper understanding of where the problem was.

If you have further questions I can help with, feel free to reach out.

I will say that I bought into serverless and went whole hog. I probably don't recommend that. We jump through some hoops we wouldn't need to if we had run our website on an EC2 instance with CloudFormation managing the scaling.

However, we have a few of our services which can come under high load quickly, and we don't need to scale up the entire site to serve those, such as our track processing. We believe Serverless was the correct decision for those processes.

I think the marketing is aimed mainly at technology management, not developers. Obviously, they have to evangelize developers, too, to a certain extent because they need a critical mass with familiarity so that the managers aren't saying “that’s nice, but who do we hire to build stuff on it”, but developers aren't the ultimate target of the marketing for serverless or most other cloud technologies.

To my understanding, it means you write endpoints and only endpoints. You don't have to configure an OS, or even write the code that says "bind to port X and start". You just say "when you get this request, do this". Which is a neat abstraction! But there's still a server; you just don't have to manage it directly.

I've heard the analogy used that "servers are to serverless what wires are to wireless", the main point being that they're still there, but they're no longer something you manage or directly interact with.

I'm also starting to see the term LaaS (Logic as a Service) used as an alternative to "Serverless" here and there.

I think the analogy falls apart because in the part of the system that's referred to as "wireless", there is literally no wire. The serverless comparison would be more like if they used a lot of zip ties and some well-placed rugs so you never saw the wires going directly to your computer and called that "wireless".

Not just that, but "serverless" typically involves integrating a muddle of AWS/GCP/Azure services, locking you in. Portable software is the analog to "wireless" here, as it gives you the freedom to... move.

Virtually all async processing we do is achieved by the main service (running as a normal microservice - no serverless stuff) pushing into an SQS queue, then a Lambda function running every minute pulls from the queue to e.g. report policies to our underwriters, issue policy documents, capture payments, etc.

Essentially anything which happens after the hot path to get a response to the user - all this stuff can take a little while, isn’t really time sensitive, often needs to be tried multiple times, etc.
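The consumer side of that SQS pattern usually amounts to a small dispatch layer: each queued message names the job it carries, and the Lambda routes it to the right handler. A hedged sketch of that dispatch (the task names, payload fields, and registry are hypothetical stand-ins for the jobs described above; actual SQS polling via boto3 is omitted):

```python
import json

# Hypothetical task handlers for the kinds of post-response work described.
def report_policy(payload):
    return ("reported", payload["policy_id"])

def issue_documents(payload):
    return ("issued", payload["policy_id"])

TASKS = {"report_policy": report_policy, "issue_documents": issue_documents}

def handle_message(body):
    """Dispatch one queued message. Unknown tasks raise, so the message
    stays on the queue and retries instead of silently disappearing."""
    msg = json.loads(body)
    task = TASKS.get(msg["task"])
    if task is None:
        raise ValueError(f"unknown task {msg['task']!r}")
    return task(msg["payload"])
```

Raising on failure is the important design choice: SQS's visibility timeout and redrive policy then give you the "often needs to be tried multiple times" behavior for free, with a dead-letter queue as the backstop.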

The Cloudfront distribution we have for our API passes all requests and responses through a Lambda@Edge function before/after the request hits our real system.

If you’re not aware, Lambda@Edge runs in the Cloudfront PoPs, so super low latency and can reject/respond to requests without going back to our real server.

We use it basically as middleware and one way to protect our real backend from potential bad actors:

- enabling the use of persistent HTTP/2 connections on a single hostname, even though our underlying services are all on separate hostnames (basically some URL rewriting)

- enforcing minimum mobile app versions (as a regulated company, we eventually have to break very old versions of our mobile app, as they contain copy which is no longer correct/true)

- even calculating and returning insurance pricing with ultra-low-latency without having the latency of going all the way to our real servers in eu-west-1 - we’ll do a blog post about this at some point
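Two of those middleware jobs can be sketched roughly like this (Python for readability, even though Lambda@Edge itself has historically been Node-only; the header name, version cut-off, and path rewrite are all invented for the example; the event layout is CloudFront's Records[0].cf.request shape):

```python
# Sketch of a viewer-request edge middleware: reject too-old app versions
# and rewrite URLs before the request reaches the origin.
MIN_APP_VERSION = (2, 4, 0)  # assumed cut-off for the example

def parse_version(s):
    return tuple(int(p) for p in s.split("."))

def handler(event, context):
    request = event["Records"][0]["cf"]["request"]
    headers = request["headers"]

    version_hdr = headers.get("x-app-version")  # header name is an assumption
    if version_hdr and parse_version(version_hdr[0]["value"]) < MIN_APP_VERSION:
        # Respond at the edge; the origin never sees the request.
        return {"status": "426", "statusDescription": "Upgrade Required"}

    # URL rewriting: expose one hostname, route to per-service paths.
    if request["uri"].startswith("/quotes/"):
        request["uri"] = "/pricing-service" + request["uri"]
    return request
```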

Overall I must say, I’m incredibly pleased with Lambda and Lambda@Edge.

Only criticisms are that the GUI is an incredible pain in the ass to use (we don’t use any off-the-shelf serverless framework), and they’re very slow at supporting new Node versions.

Lambda does now support Node 8 (with async/await support etc.), though it took months. But Lambda@Edge still only supports Node 6 which has been quite difficult to continue supporting across our monorepo.

Never done any API/HTTP stuff with lambda, but I've built many ETL pipelines using it.

A few years ago we tried the Kinesis -> Lambda thing, but it failed during a large traffic event. This was due to the way the Kinesis poller works, and was made worse by the fact that we had two lambdas running off the same Kinesis stream.

The main issue was you can't control the Kinesis poller - meaning we ran into resource contention during this high traffic event and the iterator age fell behind quite drastically. So we abandoned it in favour of EMR + Flink + Apache Beam.

Other than that, though, the S3 -> Lambda stuff works perfectly, and has been running for 3+ years with no issues.

We did. We are building our entire company, SQQUID, on a 100% serverless architecture. Scalability is awesome; in fact we had to do extra work to serialize some operations so as not to bring down other major corporations' server stacks. Cost is a fraction of a traditional app scaling setup.

The best part is that no devops is needed. We use the Serverless Framework. The biggest downside is cold starts for frontend response time, but this hasn't been a terrible issue yet. We have considered moving these 20 API endpoints to a Node.js server, which would resolve the issue, but we haven't had the time to do it yet.

I have some python based lambdas for simple service/user story monitoring. The couple problems I have with lambdas, which mean I will _never_ use them for a proper application:

- The language choices available do not meet my needs
- Impossible to create a prod-like setup; all our services run on k8s, so minikube works great locally
- You lose any sort of control over architectural decisions (for better or worse)
- Poor code structure/quality/re-usability leads to a poor developer experience. I enjoy most new technologies I pick up; all I got from Lambda was frustration.

For near realtime systems that scale it is right up there with the fastest application servers. In fact, if you take the auto-scaling properties into account it probably beats those servers because it can do it seamlessly up to incredible number of requests / sec without missing a beat. If you want low latency you can replicate your offering in as many zones as you feel like.

People start worrying about throughput, latency and error rates when they become high enough (or low enough) to measure.

My personal biggest worry is that if your Google account should die for whatever reason your company and all its data goes with it. That's the one thing that I really do not like about all this cloud business, it feels very fragile from that point of view.

> In fact, if you take the auto-scaling properties into account it probably beats those servers because it can do it seamlessly up to incredible number of requests / sec without missing a beat.

Autoscaling is one of those things that's easy to name but hard to actually achieve. I've had some involvement with an autoscaler for a few months and it's been educational, to say the least.

In particular people tend to forget that autoscaling is about solving an economic problem: trading off the cost of latency against the cost of idleness. I call this "hugging the curve".

No given autoscaler can psychically guess your cost elasticity. Lambda and others square this circle by basically subsidising the runtime cost -- minute-scale TTLs over millisecond-scale billing. I'm not sure how long that will last. Probably they will keep the TTLs fixed and rely on Moore's Law to reduce their variable costs over time.

Cloudwatch logs have been a big game changer versus on-disk logs for us. Getting into larger clusters (and larger log files), figuring out what’s happening in log files became somewhat arduous. https://github.com/jorgebastida/awslogs for viewing / tailing / searching is a lot easier. It’s also fairly straightforward to get logs streamed through to ELK hosted within AWS, if you’re interested in that angle.

We haven’t had issues but we keep our time windows short for any searching through logs (and we don’t use json logs so that’s an interesting difference). We use other systems for longer range log analysis.

I have a production service running on AWS Lambda and I haven't run into any major challenges. The Lambda service is responsible for authenticating to a downstream third party service and proxying requests along with an access token. I would consider this a simple use case.

CloudWatch has provided me all the visibility necessary to troubleshoot issues. I think the important thing here is to have a good logging strategy (logs are only as good as what you put in them). In my case, I made sure info messages were logged for the start and end of use cases (e.g., "resetting password", "password reset successfully"), warn messages for non-fatal errors (e.g. "username not found"), and error messages for fatal errors (e.g. "unable to connect to database").
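That tiered strategy can be sketched as follows (the function names and the email-sending helper are invented; only the level policy comes from the comment):

```python
# Sketch of a tiered logging strategy: info for use-case boundaries,
# warning for non-fatal conditions, error for fatal ones.
import logging

log = logging.getLogger("password-service")

def reset_password(username, users, send_email):
    log.info("resetting password")
    if username not in users:
        log.warning("username not found")   # non-fatal: caller gets a clean failure
        return False
    try:
        send_email(users[username])
    except ConnectionError:
        log.error("unable to send reset email")  # fatal for this request
        raise
    log.info("password reset successfully")
    return True
```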

The only frustrating limitation I've run into is when the Lambda function times out before receiving a response from the downstream service. At one point, the downstream service was having major performance issues and response times were crazy high. This meant I couldn't get a response code and had to run the downstream calls locally to troubleshoot.
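One common mitigation, not necessarily what was done here, is to budget the downstream call's timeout from the Lambda context so your own code logs the timeout instead of being killed mid-request; get_remaining_time_in_millis is the real Lambda context API, while the 500 ms safety buffer is an arbitrary assumption:

```python
# Give the downstream HTTP call a timeout a bit shorter than the time
# Lambda has left, so the function can observe and log the timeout
# itself rather than being terminated before it gets a response code.
def downstream_timeout_seconds(context, buffer_ms=500):
    remaining_ms = context.get_remaining_time_in_millis()
    # Never return a non-positive timeout; clamp to a tiny floor.
    return max(0.1, (remaining_ms - buffer_ms) / 1000.0)
```

The result can be passed straight to e.g. `urllib.request.urlopen(url, timeout=...)`.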

Performance is not great (most requests are in the 400-500 ms range), but it's more than adequate for my use case. A large portion of the response time is likely due to the downstream service, but there are cold starts that spike response time way out of normal range.

Overall, I'm really happy with AWS Lambda and it's definitely top of my list when taking on a new project. I'm really interested in experimenting with AWS Mobile Hub in the future. It doesn't get much better than one-stop serverless shopping.

Disclaimer: this is all AWS related as this is the cloud I'm using. I haven't tried Google Cloud Functions or the Azure equivalent.

I've been working with Lambda a lot more lately and it is not so bad... but also not great.
I'm saying this because I found it hard to have a git-first (or GitOps) workflow that works well in AWS: it looks like everything is made to be changed manually. CloudFormation is slow with some resources (if you need CloudFront it will take tens of minutes), and CodePipeline has a pretty terrible UX. CodePipeline is cheap and it works, for sure, but it's not a good system for pipelines, as restarting, terminating steps, and getting the output of steps just don't work in a decent way (I want to see the output in the steps, not jump to CloudWatch). Pretty much every other system outside of AWS is better than that, but the integration with Lambda and API Gateway is not as good, unfortunately. If you know of a better system for CI/CD with AWS Lambda outside of CodePipeline, I'd be interested to try it.

In a similar way, most of the serverless frameworks I've tried are written for a workflow that is executed from CLI which is great to start and attractive for developers but not good enough for a company that aims at full reproducibility of setups and "hands off" operations. Source code change should trigger changes in the Lambda/API Gateway setup all the time and it would be great if devs don't have to trigger changes manually.

Apart from those issues, I think Lambda is definitely promising and I see the company I'm working for right now using it more and more. The developer experience is still lacking IMO but I'm confident we'll get there at some point.

I've written and shipped numerous sites using Zappa for Python which makes deploying on Lambda/API gateway very simple. https://www.storjdash.com is entirely Lambda based (sorry for no real home page....you can read about StorJ at https://storj.io/)

We run a service that manages deployments for Serverless Framework applications - https://seed.run and it is completely serverless. It’s been great not worrying about the infrastructure. Would definitely do it again.

Yes I have: Cloud Custodian (https://github.com/capitalone/cloud-custodian) relies heavily on its event-driven serverless policies to enable compliance across a large AWS fleet, filling in the gaps that are otherwise missing in IAM. Currently there are on the order of thousands of lambdas deployed across hundreds of accounts, and it definitely does exactly what it needs to do with little maintenance. Monitoring is done with a combination of CloudWatch, Datadog, and PagerDuty, so getting alerted on failing or errored invocations is completely built into our workflow.

We’re currently using Xamarin and Azure as a backend, though not serverless yet, as stuff like Cosmos pricing is still too expensive. But using Xamarin, which is terrible on its own, we’re also splitting our clients up into two languages. I’d like it if we could adopt Flutter and AngularDart for clients, and I’d really like to run some of the backend on something like Firebase / Cloud Firestore.

I’m not sure if Google is a good option in terms of privacy and EU legislation, though. I have my lawyers looking into it currently, but to be honest I’d love it if Microsoft made an Azure alternative and fully embraced Dart for Azure.

Developing a SaaS product based on completely proprietary stack that you can't even host yourself is VERY dangerous!!! Just yesterday there was a report that twitter bought a company and immediately closed their API access to all customers.

What will you do, if Amazon decide to close your AWS account for some reason? What if they discontinue one of the services you use?

Most of the stuff in GCP won't be closed like that, as it's used by enterprise customers (or at least there will be proper notice, a migration path, etc). I agree that at the end of the day you have little control over the infrastructure; it's impossible for a lot of companies to maintain it themselves, which is why they are going to the cloud. If you are very worried, you can just use the VMs (EC2 or GCP's VMs, etc) and not use other services.

I think the other, more important point mentioned would be: "what happens in case of a ban/restriction?" Not with AWS, but we've previously had accounts locked/closed "accidentally" - thankfully they were non-mission-critical.

We're really passionate about static sites with cloud functions for dynamic functionality. After using Snipcart on a few sites and feeling we could do better, we actually built out our own e-commerce solution as a drop in product for static sites. It's all baked on firebase and cloud functions and we're loving it. It's super fun to work with and costs dollars to run. I'm usually very averse to the "build" end of build or buy scenarios but we couldn't be happier with the end result.

You are right that Cloudwatch logs are a hassle. So we pipe all of the log events into Scalyr (and log JSON objects, which Scalyr parses into searchable objects).

In terms of error handling, Lambda retries once on exception. So we raise exceptions in truly exceptional cases (e.g. - some weather in the cloud prevents a file from being downloaded or uploaded). We have Cloudwatch alerts that notify the team for every true exception. Happens less than once a day.

In pseudo-exceptional cases (e.g. a user emails an invalid image), we simply log to Scalyr with an attribute that identifies that the event was pseudo-exceptional, and then set up Scalyr alerts to email us if the volume of those events goes above x per hour.
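A sketch of that split (the event shape and names are invented; only the raise-versus-log-with-attribute policy comes from the comment):

```python
# Sketch of the exception-splitting policy: raising makes Lambda retry
# and trips the error alert; pseudo-exceptional events are logged with
# a marker attribute and alerted on by volume instead.
import logging

log = logging.getLogger("image-pipeline")

def handler(event, context):
    if not event.get("image_bytes"):
        # Pseudo-exceptional: user error, don't retry, alert on volume.
        log.warning("invalid image from user",
                    extra={"pseudo_exceptional": True})
        return {"status": "rejected"}
    if event.get("storage_down"):
        # Truly exceptional: raise so Lambda retries and the alert fires.
        raise RuntimeError("could not upload thumbnail")
    return {"status": "processed"}
```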

tl;dr - Cloudwatch + Scalyr with good alerts and thoughtful separation of exceptions from pseudo-exceptions is my recommendation!

Deep Learning classifier services via AWS Chalice. It's trivial. However, the 50MB/250MB Lambda limit is a pain; one has to cut down TensorFlow, Keras, etc. significantly before deployment (doable but difficult) or do some S3 tricks with /tmp. I wish they allowed increasing this limit for extra money. It's cheaper than EC2 Elastic Beanstalk though.
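The /tmp trick can be sketched like this (the downloader is injected so the caching logic is testable; in Lambda it would be boto3's s3.download_file, and the path and filename are assumptions):

```python
# Keep only code in the deployment package and pull the oversized model
# from S3 into /tmp on cold start, reusing it across warm invocations.
import os

MODEL_PATH = "/tmp/model.bin"  # /tmp persists across warm invocations

def ensure_model(download, path=MODEL_PATH):
    """download(path) fetches the artifact; only called on cold start."""
    if not os.path.exists(path):
        download(path)
    return path
```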

I wouldn't do that again; I can waste time on more interesting things than hacking around artificial limits of an architecture that will be changed at some point anyway.

Yes, at my last company we needed to generate Open Graph thumbnails that composited several images together. We decided to use a serverless architecture that basically shelled out to an ImageMagick command and then pushed the thumbnail to S3, which was served via a CDN. The main problem we ran into was a lack of processing power, but the new options from AWS solved our issues. I'd definitely do it again.

I think this gets at the core of why I think "100% serverless" isn't the right move for many projects.

If you decide you're never going to boot a machine and manage it yourself, you're locked into the exact set of choices your cloud has made available for you. When your project has a need that's not covered, you're stuck.

I'm not talking about vendor lock in here, purely about the reduced flexibility within a given vendor if you choose to never manage a server yourself.

This is totally correct, but I think it's also correct to say there is a huge vendor lock-in component to serverless. People make the claim that you can engineer your application to be vendor agnostic, but even if that's the case, you're still dumping a ton of time/money into AWS-specific tooling to get your application going, and almost none of that experience is transferable. Nothing from AWS API Gateway is applicable anywhere else, for example, and that's frankly one of the most awkward of all of the AWS services I've ever used (and so the costs are even higher than if it weren't). That's not to say that there won't be some form of serverless in the future that doesn't have this enormous vendor-specific lock-in cost. But that serverless is not here now.

I did a project using the first version of Azure Functions. It was early days with growing pains. I'll probably use Lambda on my next project. My general take is that serverless is a black box just like any computer - you need to figure out the rules of that black box and then accept those rules or tell the people who can open the box to fix something if it's broken.

This setup gets us an average of 70ms API response time and less than 200ms worst-case rendering time. More than 90% of cache misses never hit the worst case, as we can serve stale content in those cases. Lots of room for improvement too. =)

I'm curious on what's missing from Cloudflare Workers to allow you to remove the API Gateway usage. We're actively looking for more advanced use cases so we can make sure we prioritize upcoming features. Reply here or send me an email at <username> at cloudflare.com.

Anyone have a recommendation how to get started with AWS serverless options? There are so many AWS services and it isn’t intuitive when you login and see all the stuff you can do. Coming from Heroku where you do it all yourself, knowing when and how to break up functionality among AWS tools is not always clear.

We’re running a collaborative document editing service for mind maps (www.mindmup.com) entirely using Lambda and associated services (such as Kinesis and API Gateway). We started migrating from Heroku in 2016 and went all in around February 2018. My anecdotal evidence is that we’re a lot more productive this way.

"Serverless" (by the current way it is approached) is more of the likes of "Cloud computing". Someone's server is your server. It is not difficult to create real serverless apps today, that works disconnected and fetches cached data when there is new data available.

Not "serious", but I can definitely recommend it for simple (transformative as in webhook -> api) endpoints you don't want to care about hosting/maintaining a server for
Low volume stuff is even free (at least on googles cloud)

I've been shipping serverless ever since it was launched on AWS. More recently I am using Google.

The biggest thing I am struggling with right now is how to appropriately split projects for module size. This may be more of an issue with Firebase Functions than AWS, because it's a lot easier to create separate projects with AWS than with Firebase and constrain each project. Google Firebase Functions very much assumes a big-package architecture. We could break it up into multiple Firebase projects, but that separation creates a lot of annoyances.

It'd be awesome if you could split packages up so the resources don't have to be shared within the various functions.

Additionally I found it to be an anti-pattern to use any DB that requires a connection pool vs HTTP based commands. It's annoying as heck to manage connection pools with serverless and seems downright buggy or broken. If you want to support it you need to centralize it with something like PGpool which seems like a big anti-pattern. I hate dynamodb but am loving Google's offering (firebase firestore or datastore).
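The serverless-friendly pattern is roughly this (the factory injection is for testability; in real code the factory would build an HTTP session or an HTTP-API datastore client, never a connection pool):

```python
# State that must survive between invocations lives at module scope.
# An HTTP client is cheap and stateless enough to lazily create once
# per warm container; a connection pool assumes a long-lived process
# that serverless doesn't guarantee.
_client = None

def get_client(factory):
    """Lazily build one client per warm container."""
    global _client
    if _client is None:
        _client = factory()
    return _client
```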

I've shipped significant work on all 3 now, with the least focus on the newest bit (cloud functions). Since you asked, here are my opinions:

# Cloud Databases

These are almost always a slam dunk unless you or someone else on your team has a deep understanding of MySQL or Postgres [2]. They often have unique interfaces with different constraints, but you can work around these constraints, and the freedom to scale these products quickly and not worry as much about maintenance can be an enormous boon for a small team. This is fundamentally different from something like AWS RDS, where you do in fact sort of "have a server" and "configure that server". These other services have distribution built into their protocol.

Of the modern selection, DynamoDB and Firebase come to mind as particularly useful and spectacular products for key value and graph stores (DynamoDB is surprisingly good at it!). If you're using GCE, Spanner is some kind of powerful sorcerous storm that does your bidding if you pay Google; it's really surreal the problems it can just magically solve (it's the sort of thing where it's so good your success with it disappears until you have to replicate it elsewhere and realize how much your code relied on it).

# Cloud Streams

I've been using these nonstop for about 6 years now, with most time logged on SQS. For some reason a lot of people object to streaming architecture on grounds of backpressure [3], or "want to run their own because of performance" and end up hooking zookeeper and Kafka into their infrastructure.

For small or growing products, You Will Almost Certainly Not Overload SQS or Kinesis. You Just Won't, Unless You're Twitter or Segment. Write your system such that you can swap streaming backends, and be prepared to solve obnoxious replay problems when moving to a faster and less helpful queue.

Lots of folks are convinced they need to run their own RabbitMQ service so that they "can see what's going on." Given how incredibly reliable SQS has been for me since its introduction, I'm disinclined to believe that. While RabbitMQ is a fine product, I'd rather just huck stuff on SQS, obey sound design principles, and only transition to faster queues later.
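The swap-friendly design can be sketched as a tiny interface (the interface and in-memory adapter are illustrative; an SQS adapter would wrap send_message/receive_message behind the same two methods):

```python
# Code against a minimal queue interface so moving from SQS to
# Kinesis/Kafka later is a new adapter, not a rewrite.
from collections import deque

class Queue:
    def send(self, msg): raise NotImplementedError
    def receive(self): raise NotImplementedError  # returns None when empty

class InMemoryQueue(Queue):
    """Drop-in adapter for local tests; production would use SQS."""
    def __init__(self):
        self._q = deque()
    def send(self, msg):
        self._q.append(msg)
    def receive(self):
        return self._q.popleft() if self._q else None
```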

# Cloud Functions (Cλ)

Firstly, these solutions work fine. I've only shipped on Lambda, and I will say I was underwhelmed. There are two reasons for this: cost and options. Cloud Functions with API Gateway is just about the most expensive way you can serve an API in the world of CSPs right now. The hidden request costs are (or were when I set this up; I shipped, then tore it down looking in horror at my spend) just stupid. As for options, it's very obnoxious how these environments (GAE, Lambda, etc.) can only bless specific environments rather than giving us a specification over I/O or shm we could bind to. I want to ship Haskell in some cases and it's stupid what I have to do to enable that [4].

Much has been said about how spaghetti-like these solutions are, but I think this is more of a tooling issue. If you can actually specify Cλ endpoints in a single file, then you can write a uni-repo for a family of endpoints that share common libraries, build for those, and terraform/script them into deployment. This is actually probably more principled than how most folks cram endpoints into a single fat binary. It also makes things like partial rollouts on an API a heck of a lot more easy to implement.

But still, out of the trinity of CSP products, Cλ is by far the least exciting to me. I seldom ship API endpoints there. I usually use it for small cron jobs or data collection jobs where I'm confident I won't end up with 4 running instances because a looped call is timing out.

[0]: I'm experimenting with writing these mega posts with classical footnotes as opposed to making them epic journeys to slog through my prose style.

[1]: I hate myself more every time I say the word cloud even knowing it's the lingo folks will understand the most. They're service products. Let's all sink into despair together.

[2]: And by "deep" I mean, "Good enough to have a reputation suitable for a professional consultant and attract desperate clients."

[3]: To which I say, "Look, if you wanna pretend that the only possible architecture is a spiderweb of microservices that positively push backpressure up to the client and pretend that introspect-able queues don't give your services equivalent confirmation, that's a game you can play. I think it's disrespectful to folks who have equivalent backpressure schemes because they have similarly refined infrastructure for understanding their queue volume. Both methods are similar, and have different strengths. Needham's duality is real and it's exactly the same here as it is on one single computer."

[4]: It's 2018, we have containers, and if you support Java with its slower startup times you surely could support lightning fast Rust or Haskell executables as well. Get with it, Amazon!

Lambda et al have some serious shortcoming and a lot more work needs to be put into these serverless platforms. The approach they're taking I don't think will last. It really needs a redesign/restructure.

I think serverless is the future... but not today... in 5-10 years. That sounds like a long way off, but it's not... it'll pass in no time. And maybe they'll improve it enough by then to make it viable.

I wouldn't build anything serious with it, unless you're ok rewriting it a few years from now.

Listen, I just have to say this for everybody else out there: there is no serverless design. It's not like the fucking things are running in JavaScript on the clients. There are servers. The whole name is stupid. Come up with a better name for the service, man. It's not serverless. God help us.