> Redis streams aren’t exciting for their innovativeness, but rather that they bring building a unified log architecture within reach of a small and/or inexpensive app. Kafka is infamously difficult to configure and get running, and is expensive to operate once you do. Pricing for a small Kafka cluster on Heroku starts at $100 a month and climbs steeply from there. It’s tempting to think you can do it more cheaply yourself, but after factoring in server and personnel costs along with the time it takes to build working expertise in the system, it’ll cost more.

This. Precisely this. Redis Streams, if done right, will be a killer feature. Redis is already pretty much omnipresent and has a great managed cloud story (e.g. ElastiCache). And it's cheap.

The unified log pattern - and event sourcing based on that - is a very powerful pattern.

We are doing some work at my job right now inspired by this approach (in no small part, that LinkedIn article). I hope we can open-source some of this tech, or at least start blogging about the concepts as nobody really talks about it, and the mechanics are a little tricky to get right.

We currently think the best approach is to bring ideas from DDD (specifically the idea of a bounded context and a domain), a unified log which we call the GEL (for Global Event Log), domain-specific event logs, event-sourcing with CQRS and then some quite lightweight means to produce "projections" (different structural representations of data).

We therefore have a few pieces that get names and become easy to reason about: every event goes to the GEL. Domains have systems that listen to the GEL for events they are interested in and map those events into commands on a command service. Clients can issue commands directly, as well. Commands alter projections and produce events back to the GEL. Clients and domains can query a domain's query service.
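The flow described above can be sketched in a few lines. This is a minimal in-memory stand-in, not the commenter's actual system: the `GlobalEventLog` and `AccountDomain` classes, and the `payment_received`/`credited` event names, are all hypothetical illustrations of events going to the GEL, a domain mapping them to commands, and commands altering a projection and emitting events back.

```python
from collections import defaultdict

class GlobalEventLog:
    """Minimal in-memory stand-in for the GEL: an append-only list of events."""
    def __init__(self):
        self.events = []

    def append(self, event):
        self.events.append(event)
        return len(self.events) - 1  # offset of the new event

    def read_from(self, offset):
        return self.events[offset:]

class AccountDomain:
    """A domain that listens to the GEL, maps events it is interested in
    into commands, and maintains a projection (current balances)."""
    def __init__(self, gel):
        self.gel = gel
        self.offset = 0
        self.balances = defaultdict(int)  # the projection

    def handle_command(self, command):
        # Commands alter projections and produce events back to the GEL.
        if command["type"] == "credit":
            self.balances[command["account"]] += command["amount"]
            self.gel.append({"type": "credited",
                             "account": command["account"],
                             "amount": command["amount"]})

    def poll(self):
        # Map interesting GEL events into commands on the command service.
        for event in self.gel.read_from(self.offset):
            self.offset += 1
            if event["type"] == "payment_received":
                self.handle_command({"type": "credit",
                                     "account": event["account"],
                                     "amount": event["amount"]})

gel = GlobalEventLog()
gel.append({"type": "payment_received", "account": "a1", "amount": 50})
domain = AccountDomain(gel)
domain.poll()
```

Clients would issue commands by calling `handle_command` directly; queries go against `balances` without touching the log.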

It's not really rocket science, these ideas are as old as the oldest mainframe you've ever heard of, but it affords incredible flexibility and unrivalled performance.

We're using Kinesis for our GEL. It has a read-performance issue, but not one big enough yet for us to look at alternatives - the zero-admin cost is attractive to us right now.

For me, now, MVC and CRUD style applications look like really odd anti-patterns. I'm amazed we got as far as we have with them.

Yes ... it's a very powerful pattern and Redis streams look promising; building event sourced systems is still perceived as difficult, but I think a lot of that is because people are picking overly heavy, complicated plumbing for it when it's not really needed to get started. This has great potential to help with that.

Not gp, but maybe because without book-keeping you lose information in a CRUD style app? The append-only (until/unless you truncate it) log gives you a lot of durability and insight into the system via replaying events from a snapshot + log combination. At least, that's my understanding
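The snapshot + log replay idea mentioned above can be shown concretely. This is a generic sketch (the event shapes and the key/value state are made up for illustration): state is rebuilt by loading a snapshot and replaying only the events appended after it.

```python
# An append-only log of events plus a periodic snapshot. State can be
# rebuilt at any time by loading the snapshot and replaying later events.
log = [
    {"id": 1, "type": "set", "key": "x", "value": 1},
    {"id": 2, "type": "set", "key": "y", "value": 2},
    {"id": 3, "type": "delete", "key": "x"},
    {"id": 4, "type": "set", "key": "z", "value": 9},
]

snapshot = {"last_id": 2, "state": {"x": 1, "y": 2}}  # taken after event 2

def apply(state, event):
    if event["type"] == "set":
        state[event["key"]] = event["value"]
    elif event["type"] == "delete":
        state.pop(event["key"], None)
    return state

def rebuild(snapshot, log):
    # Start from the snapshot, then replay everything newer than it.
    state = dict(snapshot["state"])
    for event in log:
        if event["id"] > snapshot["last_id"]:
            apply(state, event)
    return state

state = rebuild(snapshot, log)
```

A CRUD app that had only kept the final row values would have lost the fact that `x` ever existed; the log preserves it.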

Paul are you based in London? I've been thinking about dusting off my Unified Log London meetup (https://www.meetup.com/unified-log-london/) and it sounds like you would have a ton to talk about there. Email address is in my profile!

My company uses Kinesis for stream consumption, but everything is stored in both Redshift and Glacier (we expire redshift data after a year or two but Glacier is kept forever). This is our canonical data store.

Smaller subsets of events are kept in Redshift/Postgres/MongoDB databases, depending on the purpose and how they'll be queried. These keep anywhere from 48 hours to 6 months of data depending on the purpose; sometimes it's just the raw events filtered in a certain way, and sometimes a projection based on those events. Either way, if we catch a bug in our logic we can re-run it for the past year of data or whatever.

We didn't use Redis Streams, but NuGet.org is built on this concept. NuGet.org is the public repository for NuGet, the .NET package manager.

We have a public unified log (JSON blobs, behind a CDN) containing all events about package publishes, edits, and deletes. The rest of our endpoints have background jobs reading this log and updating different "views" of the package data. Each event's unique ID is a timestamp of when it was added to the log, much like Redis Streams.
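The mechanics of timestamp-ID cursoring can be sketched as follows. This is not NuGet.org's actual implementation - the `publish`/`View` names and event shapes are hypothetical - but it shows the pattern: each event's ID is the time it was appended, and each background job remembers the highest ID it has processed.

```python
import time

catalog = []  # the public, append-only log

def publish(event):
    # Each event's unique ID is the timestamp at which it was appended;
    # the max() guard keeps IDs strictly increasing even on fast machines.
    last = catalog[-1]["id"] if catalog else 0.0
    entry = {"id": max(time.time(), last + 1e-6), **event}
    catalog.append(entry)
    return entry

class View:
    """A background job that maintains one 'view' of package data."""
    def __init__(self):
        self.cursor = 0.0   # last event ID this view has processed
        self.packages = {}

    def update(self):
        for entry in catalog:
            if entry["id"] > self.cursor:
                if entry["type"] == "publish":
                    self.packages[entry["package"]] = entry["version"]
                elif entry["type"] == "delete":
                    self.packages.pop(entry["package"], None)
                self.cursor = entry["id"]

publish({"type": "publish", "package": "Newtonsoft.Json", "version": "11.0.2"})
view = View()
view.update()
```

Because each view only depends on `catalog` and its own cursor, any number of views (or external replicators) can be added later and catch up from the beginning.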

We have found this is a powerful concept which has helped us build a very reliable infrastructure where almost all public endpoints are static JSON blobs served by a CDN. The only compute we need for customer API calls is a search service.

We hope one day that our unified log (called the "catalog") will be used for custom client needs and for package replication (e.g. corporations behind a firewall that can't talk to NuGet.org directly).

Having worked with both Kafka and Redis, this certainly rings true for me:

"Redis streams aren’t exciting for their innovativeness, but rather than they bring building a unified log architecture within reach of a small and/or inexpensive app. Kafka is infamously difficult to configure and get running, and is expensive to operate once you do. [...] Redis on the other hand is probably already in your stack."

My current job is maintaining Kafka infrastructure, dozens of clusters. I literally don’t understand how it got so popular. It is by far the least reliable software I have ever experienced. We hit so many bugs and weird edge cases that essentially cripple the entire cluster and stop all producing. Zookeeper in addition has been a constant source of headaches.

Great read. I agree that Kafka is overkill, if it's even necessary, for most applications. I have been using NATS Streaming [1], which has the same properties as Redis Streams, for some time and it is wonderful.

Can anyone who has done it both ways characterize the tradeoffs vs. using AMQP (say, Rabbit) as the bus? I've been excited about this feature in Redis, and this is a very interesting use case for it. But when I read the part toward the end about consumer checkpointing to prevent premature log truncation, I was thinking message-queue acknowledgement semantics basically deal with that for you. I'm sure it's more costly, but if you need to know whether something's been consumed, then you need to know.

The point of this kind of log architecture is that producers never need to know if a consumer has consumed something, because the log is designed to allow as many consumers as you like to reliably work through the log from any given position.

As a producer, you just need to know that your message has definitely been delivered to the log. What consumes it is none of your business.
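The distinction is easy to see in a toy model. This generic sketch (names are illustrative) shows why the producer never tracks acknowledgements: each consumer owns its own position, and any number of them can read the same log independently, from anywhere.

```python
# In a log, the producer only appends; each consumer tracks its own position
# and can start from anywhere, independently of every other consumer.
log = []

def produce(message):
    log.append(message)  # the producer's job ends here

class Consumer:
    def __init__(self, start=0):
        self.position = start

    def consume(self):
        # Read everything since our last position, then advance it.
        seen = log[self.position:]
        self.position = len(log)
        return seen

produce("a"); produce("b"); produce("c")

fast = Consumer()          # reads from the very beginning
late = Consumer(start=2)   # joins later, starting at position 2
```

Contrast with a queue, where delivering a message to one consumer typically removes it for everyone else.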

> The point of this kind of log architecture is that producers never need to know if a consumer has consumed something, because the log is designed to allow as many consumers as you like to reliably work through the log from any given position.

Right I think we're saying the same thing. In the article checkpointing was introduced to enable safely managing the size of the log, i.e. it appears that in this use case _somebody_ needs to know whether the consumers are caught up in order to adhere to at-least-once semantics. It just seemed to me that if you reach the point where you're rolling your own to solve that problem there are other possible solutions.
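The "somebody needs to know" part reduces to a simple rule, sketched here with made-up consumer names: each consumer checkpoints the highest offset it has processed, and the log may be safely truncated only up to the minimum checkpoint, preserving at-least-once semantics for the slowest consumer.

```python
# Each consumer periodically checkpoints the highest offset it has processed.
checkpoints = {"emailer": 120, "indexer": 95, "analytics": 110}

def safe_truncation_point(checkpoints):
    # Everything at or below this offset has been seen by every consumer,
    # so discarding it cannot violate at-least-once delivery.
    return min(checkpoints.values())

def truncate(log_entries, checkpoints):
    cutoff = safe_truncation_point(checkpoints)
    return [(offset, entry) for offset, entry in log_entries if offset > cutoff]
```

A broker with per-consumer acknowledgements is effectively maintaining these checkpoints for you; rolling your own just makes the bookkeeping explicit.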

I haven't read the LinkedIn piece yet but I intend to. It was also linked in the article. Thanks for linking it here.

I'm finishing up an application that's built around a Kafka Streams topology. Kafka Streams is quite nice and I am excited to put it into production to see how it works.

All mutations run through the Kafka Streams application. A Relay webapp client communicates with a graphql-java server which resolves GraphQL queries against a gRPC service layer. This service layer either queries Kafka Streams state stores, other state stores, or writes mutations to Kafka, wherein they are processed by the Kafka Streams topology. Another Kafka Streams topology indexes topics to Elasticsearch. Having separate topologies seems to make it easier to reset them independently in case I want to replay state. All of this is glued together with Protocol Buffers, which are a great complement to both Kafka and Elasticsearch.

It is really nice to see Redis join this scene and to see more people getting excited about building applications this way. Looking forward to checking this out.

I was thinking that too, then I realized he was wrapping the code in `DB.transaction` blocks, which roll back on any type of exception, so that all those staged/checkpoint entries only persist if the block executes cleanly.

People are coming out with their own cases where they've implemented this pattern, and I'm sure there are a lot more that haven't bothered to note their cases. That's because it's actually the simplest (and possibly most obvious) solution whenever you have disjoint producers and consumers. I use it myself through the disk. I have a time-based directory structure, and within that compressed blobs of data that are timestamped and have whatever other uniquely identifying information is required.[1] In fact, I have no less than seven distinct sets of data like this, some JSON, some HTML from scraped pages, etc.
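A filesystem log like the one described is easy to sketch. This is a generic illustration under stated assumptions (a `YYYY/MM/DD` layout, gzipped JSON blobs named with a timestamp plus a key), not the commenter's actual code:

```python
import gzip
import json
import os
import tempfile
from datetime import datetime, timezone

# A time-based directory layout (YYYY/MM/DD) containing compressed blobs,
# each named with a timestamp plus uniquely identifying information.
root = tempfile.mkdtemp()

def store(key, record, ts=None):
    ts = ts or datetime.now(timezone.utc)
    directory = os.path.join(root, ts.strftime("%Y/%m/%d"))
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, f"{ts.strftime('%H%M%S%f')}-{key}.json.gz")
    with gzip.open(path, "wt") as f:
        json.dump(record, f)
    return path

def load_day(day):
    # A separate process can replay a whole day's blobs in timestamp order.
    directory = os.path.join(root, day.strftime("%Y/%m/%d"))
    for name in sorted(os.listdir(directory)):
        with gzip.open(os.path.join(directory, name), "rt") as f:
            yield json.load(f)

now = datetime.now(timezone.utc)
store("user-42", {"event": "signup"}, ts=now)
records = list(load_day(now))
```

Because producers only write new files and consumers only read, the two sides never coordinate - which is exactly the disjoint producer/consumer property described above.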

One set of processes retrieves and stores the data in the normalized structure, and another set reads and processes it. If there's ever a problem in the processing, the system can fail until someone looks into the error and adds an exception or fixes the processing. This is useful when you can't be sure the remote side will continue to work as expected (I've found APIs to be only slightly more stable than random web scraping - sure, they may version their APIs, but that doesn't help when they don't communicate new versions, or when they deprecate the old ones and you're hit with a dead API endpoint out of the blue).

Not only is the processing asynchronous, but the development is as well: I've added data sources that I knew I would eventually want to process (adding a data source is fairly easy), only to get around to writing the processor six months later, and the nature of the system means it's trivial to process the old data. In fact, in normal operation the system would simply note it is that far out of date and process the old data. The only consideration is whether you want to optimize the back-processing by aggressively caching, since that can greatly speed up the process.

I really think the pattern is the natural conclusion most people would come to when presented with the right requirements. The real art comes from recognizing when your current problem maps to a pattern such as this when it doesn't appear to initially.

Anyone who has worked with Redis at scale knows exactly how the persistent-queue approach of Kafka is far superior for any robust system that relies on the queue for truth and scales out (aka not your tiny CRUD app).

And Kafka is hard to deploy? Please... If you deploy a Redis cluster without knowing the pitfalls of distributed systems, you'll have the same problems as with Kafka. Also, soon enough you'll be able to deploy brokers over JBODs without ZK - what will the argument be then, I wonder...

Regardless of complexity, what I find interesting about deploying Redis for this use case is the following: if the consistency provided by Redis is sufficient for the use case - or even more, when a relaxed level of consistency and durability is enough for the business requirements - then, Redis being an in-memory system, and because of the specific design of the stream inside Redis (a radix tree of blobs), you get a very high number of operations per second per process, with memory usage that is very compact compared to the other data structures. This makes it possible to scale certain use cases using a small percentage of the resources needed with other systems.

About the use cases where strong guarantees are needed: for instance, Disque (http://github.com/antirez/disque) provides strong delivery guarantees in the face of failures, being a pure AP system with synchronous replication of messages. For Redis 4.2 I'm moving Disque to a Redis module. To do this, Redis modules are getting a fully featured Cluster API. This means it will be possible, for instance, to write a Redis module that orchestrates N Redis masters, using Raft or any other consensus algorithm, as a single distributed system. This will also make it easy to model strong guarantees.

"Pricing for a small Kafka cluster on Heroku costs $100 a month and climbs steeply from there. It’s temping to think you can do it more cheaply yourself, but after factoring in server and personnel costs along with the time it takes to build working expertise in the system, it’ll cost more."

If your project can afford Kafka, use Kafka. This article is about achieving the same pattern in any project that already has Redis.

I think Gepsen's argument was that you're not achieving the same pattern; rather, you're achieving a subset of the pattern where several important considerations in dealing with a distributed system have been ignored and/or are unsupported. Once you add those features back in, you basically arrive at Kafka and all the complexity involved in standing it up, so Redis isn't really an apples-to-apples comparison. While it certainly might work for certain use cases (and indeed appears to do so), many of the use cases that Kafka is necessary for would end up being just as complicated for Redis to support.

I think this is largely a rephrasing of one of the findings from NoSQL vs. relational DBs: for a subset of the problems relational DBs have been used to solve, NoSQL stores are perfectly capable of replacing them, but not every problem a relational DB handles can be handled by a NoSQL store without investing significant time and effort layering more complex systems on top - at which point you've basically implemented a poorly optimized, ad-hoc relational DB. In this case Redis, with the new stream type, is capable of handling a subset of the problems that previously required Kafka, but it isn't itself a replacement for Kafka in all situations, since several important features of Kafka aren't available in Redis.

The system I work on right now isn't really fit for a unified log design. But there are definitely some critical parts of the system that could use a high-performance centralized event log and could be made simpler and better (faster/more reliable). The ability to decouple producer and consumer, fanning in/out on both sides, and batching are all very attractive properties. But I live in a transactional world, and even though in quite a few cases I can make the consumer idempotent, the publishing of an event becomes tricky. If you are not using the log as the only store but in addition to a traditional transactional data store (databases, for example), how can you ensure that events are published only, and surely, for a successful transaction?
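One common answer is the staged-record (transactional outbox) idea: write the event into an outbox table inside the same transaction as the business write, and have a separate publisher drain the outbox after commit. A minimal sketch using SQLite (table and event names are made up for illustration):

```python
import sqlite3

# The event row commits or rolls back together with the business write,
# so the publisher can only ever see events from successful transactions.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER)")
db.execute("CREATE TABLE outbox (id INTEGER PRIMARY KEY, payload TEXT)")
db.execute("INSERT INTO accounts VALUES ('a1', 100)")

def transfer_in(account, amount):
    with db:  # one transaction: both writes commit or neither does
        db.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?",
                   (amount, account))
        db.execute("INSERT INTO outbox (payload) VALUES (?)",
                   (f"credited:{account}:{amount}",))

def publish_pending(log):
    # A separate publisher drains the outbox and pushes events to the log.
    rows = db.execute("SELECT id, payload FROM outbox ORDER BY id").fetchall()
    for row_id, payload in rows:
        log.append(payload)
        db.execute("DELETE FROM outbox WHERE id = ?", (row_id,))
    db.commit()

log = []
transfer_in("a1", 25)  # succeeds: balance and outbox row commit together

try:
    with db:
        db.execute("UPDATE accounts SET balance = balance + 10 WHERE id = 'a1'")
        db.execute("INSERT INTO outbox (payload) VALUES ('credited:a1:10')")
        raise RuntimeError("simulated failure")  # rolls back both writes
except RuntimeError:
    pass

publish_pending(log)  # only the committed event is ever published
```

If the publisher crashes between appending to the log and deleting the outbox row, the event is published twice on restart - which is why the idempotent consumers mentioned above matter.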

The staged log record idea is interesting, but it makes the database bottleneck bigger, especially if a lot of different transactions (spread across many tables) are now bottlenecked on the same staged-log table.

I have seen it happen in such contexts, though (in Oracle). I assume the difference is that the database has much more freedom in how it writes to the WAL (batching and such), but in the table itself it has to do some degree of coordination and locking.

The other big issue is that even if the insert into the shared table is fast (it's a simpler structure), the transaction - and hence the commit - can be long due to the complexity of the application transaction.

I will be happy to be corrected though if we are doing something wrong.

By my (admittedly limited) understanding of MVCC, one long running transaction that includes an insert will not block other inserts to the same table. The worst you'd see in that situation is if you were using a sequence for IDs, an ID might be grabbed from the sequence at the beginning of transaction A, then another ID grabbed and inserted into the table by transaction B, and then when transaction A finally commits, the order of the IDs won't match the chronological order of the commits.
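That ID-vs-commit-order scenario can be simulated in a few lines (a generic illustration, not tied to any particular database): transaction A grabs its ID first but commits last, so ID order disagrees with chronological commit order.

```python
from itertools import count

# Simulating the scenario above: IDs are handed out when transactions start,
# but rows become visible in commit order.
sequence = count(1)
committed = []  # rows in the order their transactions committed

id_a = next(sequence)   # transaction A starts, grabs ID 1
id_b = next(sequence)   # transaction B starts, grabs ID 2
committed.append(id_b)  # B commits first
committed.append(id_a)  # A commits later
```

The IDs are still unique and gap-free, but "highest ID I've seen so far" is not a safe cursor for a consumer tailing the table: it would skip row 1 after seeing row 2.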

For simpler setups, you can just search for "kafka" in the AWS marketplace and snag a bitnami instance. Larger clusters take more administrative overhead; I've had to set a few of these up manually and it's not fun or particularly intuitive.

Kafka requires only a single ZooKeeper instance to get started. Kafka's official quickstart actually walks you through running a local deployment of Kafka that has 1x Kafka broker and 1x ZK instance: https://kafka.apache.org/documentation/#quickstart

Yeah, and it's not obvious whether there's any way to get Kafka up and running without ZooKeeper, or what their interaction is. You can get ZooKeeper and Kafka running on one machine, but it's such a pain, and there's not much in the docs that really helps.

Also, since you talked above about the "lack of documentation" to get up and running with a small Kafka deployment of only 1 broker and 1 ZK instance: this is covered in the Kafka quickstart, which is probably what most people will read when starting out with Kafka. See https://kafka.apache.org/documentation/#quickstart.