Kafka intra-cluster replication support

Details

Description

Currently, Kafka doesn't have replication. Each log segment is stored in a single broker. This limits both the availability and the durability of Kafka. If a broker goes down, all log segments stored on that broker become unavailable to consumers. If a broker dies permanently (e.g., disk failure), all unconsumed data on that node is lost forever. Our goal is to replicate every log segment to multiple broker nodes to improve both the availability and the durability.
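As a rough illustration of the goal, replica placement could look like the following sketch (hypothetical: the broker ids, partition counts, and round-robin strategy are illustrative assumptions, not the actual design):

```python
# Hypothetical sketch: spread each partition's replicas across brokers
# round-robin, so a single broker failure doesn't lose any partition.
# This is an illustration of the goal, not the actual assignment algorithm.

def assign_replicas(num_partitions, brokers, replication_factor):
    """Return {partition: [broker ids holding a replica]}."""
    assignment = {}
    for p in range(num_partitions):
        # Start each partition at a different broker to spread load.
        assignment[p] = [
            brokers[(p + r) % len(brokers)]
            for r in range(replication_factor)
        ]
    return assignment

layout = assign_replicas(num_partitions=4, brokers=[0, 1, 2], replication_factor=2)
print(layout)  # {0: [0, 1], 1: [1, 2], 2: [2, 0], 3: [0, 1]}
```

With a replication factor of 2, every partition survives the permanent loss of any one broker.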

Activity

Sharad Agarwal
added a comment - 18/Oct/11 08:09
Sorry for coming late to this.
I read the design document, and it looks like quite a complex feature to implement and maintain going forward. The sheer amount of complexity in managing the replicas and partitions bothers me. I am wondering why we can't use HDFS, which has been hardened over a number of years. Obviously, I may be missing some subtle things. It would be great if folks could shed some light on this.

Jun Rao
added a comment - 18/Oct/11 15:33
Sharad,
Good question. HDFS is a great system and was the very first thing we thought about when looking at Kafka replication. The pros are that (1) we can offload the replication complexity to another system and (2) HDFS can recover from various data failures very effectively. Some of the cons are:
1. HDFS only provides data redundancy, not computational redundancy. If at any given point in time there is only one broker that can serve the data, availability is not going to be high. When a broker is down, we need to elect another broker to take over its data. Even though the data doesn't have to be physically moved, this process may require a little bit of recovery for each partition and may take some time to complete. In that window, some partitions become unavailable. Further, data logically moved to the new broker is initially cold.
2. HDFS is currently not a highly available system. The namenode is a SPOF, and you need something like the Avatar namenode to make it HA. It's not clear when that feature is going to be generally available and widely used.
3. Using HDFS brings in another complex system, and it's not clear how easy it is to operate, especially for an online system.
4. HDFS is not a true POSIX file system. Its append/truncate support is relatively new. This may force us to redesign some of the things that currently require in-place updates (e.g., during recovery).
5. HDFS manages its data at the block level. Kafka replication can manage data at the partition level (a partition can be three orders of magnitude bigger than a block). This means we can manage much less metadata and therefore potentially have a simpler design and implementation.
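Point 5 can be made concrete with a back-of-the-envelope calculation (the sizes below are illustrative assumptions, not Kafka or HDFS defaults): with a 64 MB block and a 64 GB partition, partition-level replication tracks roughly a thousand times fewer objects for the same data.

```python
# Back-of-the-envelope sketch for point 5: how many replication units
# must be tracked at block vs. partition granularity.
# The sizes are illustrative assumptions, not Kafka or HDFS defaults.

def metadata_units(total_bytes, unit_bytes):
    """Number of replication units needed to cover total_bytes."""
    return -(-total_bytes // unit_bytes)  # ceiling division

GB = 1024 ** 3
MB = 1024 ** 2

data = 10 * 1024 * GB     # 10 TB of log data across the cluster
block = 64 * MB           # hypothetical HDFS block size
partition = 64 * GB       # hypothetical Kafka partition size

blocks = metadata_units(data, block)          # block-level entries
partitions = metadata_units(data, partition)  # partition-level entries

print(blocks, partitions, blocks // partitions)  # 163840 160 1024
```

Three orders of magnitude fewer metadata entries is what makes a simpler, centralized design plausible.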

Taylor Gautier
added a comment - 18/Oct/11 16:02
I must agree completely with Jun here. The beauty of Kafka lies in its simplicity. Adding another piece to the puzzle, such as HDFS, would break this and diminish the value. I would even argue that, if possible, Kafka might consider removing ZooKeeper as a dependency, or at least making it optional.
I would also add that it's not clear that HDFS would actually exhibit the write/read performance that Kafka achieves using NIO. And because transfers are basically zero-copy, Kafka's memory and CPU overhead is incredibly low for what it does.
These two main factors, incredible performance and very low overhead on commodity components, are Kafka's strengths. Adding HDFS would eliminate them.
I would suggest that any replication strategy should focus on non-guaranteed delivery. The Kafka clients already do not provide a guaranteed delivery mechanism, and as such Kafka should only be used in settings where some amount of message loss during a failure event is tolerable. Minimizing this loss is a reasonable goal, but it should not compromise the simplicity and performance of Kafka in any way.

Taylor Gautier
added a comment - 18/Oct/11 16:05
To put it another way, in terms of what I want from replication: currently, if I have a failure event, I will lose the history of all of my messages. I would like Kafka to preserve as many of those messages as possible in a failure event. It's OK if not every message that appeared to be delivered was actually delivered.
This is a classic CAP tradeoff: does Kafka provide C or A? I propose it continue to focus on A.

Chris Burroughs
added a comment - 18/Oct/11 16:22
> I would even argue that if possible Kafka might consider removing Zookeeper as a dependency - or at least make it optional.
It's already optional (enable.zookeeper=false), but you lose a lot if you disable it. Taylor, maybe you could elaborate on the mailing list or in another ticket on what subset of functionality you would be willing to give up in order to not use ZK?

Jun Rao
added a comment - 18/Oct/11 16:33
We do plan to offer both async and sync replication. In async mode, the latency should still remain low since the client doesn't wait for the data to reach all replicas. However, a small amount of data may not have been replicated to the followers at the time of a failure and will be lost. Sync mode gives you more or less the opposite tradeoff. This could be useful for people who want to use Kafka as a traditional messaging system like ActiveMQ.
This is another difference from HDFS, which only has a sync replication mode.
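The async/sync tradeoff can be sketched as a toy model (all names here are hypothetical, not Kafka's actual protocol or API): in sync mode the producer's ack waits on every follower, while in async mode the leader acks after its local append and replication happens afterward.

```python
# Toy model of the async vs. sync replication tradeoff described above.
# Broker, produce, and the sync flag are hypothetical names, not Kafka's API.

class Broker:
    def __init__(self, broker_id):
        self.broker_id = broker_id
        self.log = []

    def append(self, message):
        self.log.append(message)

def produce(message, leader, followers, sync):
    """Append to the leader's log; return once the message is 'acked'."""
    leader.append(message)
    if sync:
        # Sync mode: replicate to every follower before acking.
        # Higher latency, but the message survives a leader failure.
        for f in followers:
            f.append(message)
        return "acked-by-all"
    # Async mode: ack immediately; followers catch up later, so a
    # leader failure in this window can lose the message.
    return "acked-by-leader-only"

leader = Broker(0)
followers = [Broker(1), Broker(2)]
print(produce("m1", leader, followers, sync=True))   # acked-by-all
print(produce("m2", leader, followers, sync=False))  # acked-by-leader-only
print(len(leader.log), [len(f.log) for f in followers])  # 2 [1, 1]
```

In the sketch, "m2" exists only on the leader at ack time, which is exactly the small window of potential loss the comment describes.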

Sharad Agarwal
added a comment - 19/Oct/11 06:56
Thanks Jun for the comments.
> HDFS only provides data redundancy, but not computational redundancy.
If the data resides in HDFS, theoretically it can be served by any broker. The default could be to serve from the broker which has the hot data (the data being written to).
> The namenode is a SPOF
True, but the namenode going down doesn't make you lose the data; yes, the cluster is not accessible for that period. However, if Kafka has acks and producer-side spooling (which should be there anyway, IMO, for data durability), no data would be lost.
> The append/truncate support is relatively new.
It's been used by the HBase folks for quite some time.
> HDFS manages its data at block level.
That doesn't really matter, as users of HDFS hardly care about blocks. They have a file view of things.
All that said, I don't want to derail this work with any kind of debate here. I was just thinking about how to get production-quality replication quickly. I look forward to having replication in. Thanks!

Jay Kreps
added a comment - 19/Oct/11 16:43
Hey Sharad, your comments are all correct. I think using HDFS would certainly require the least implementation effort, and it provides a mature replication system tested at large scale. The downside is that HDFS is fairly complex in its own right and has a number of drawbacks for high-availability, low-latency cases (the SPOF is one, but not the only one). Also, many use cases do not need replication, but supporting HDFS and the local FS efficiently probably means two pretty different implementations. We felt that this kind of multi-subscriber log is a really important abstraction in its own right for systems of all kinds, so our thought was to just suck it up and do the full implementation, since we thought the end result would be better.

Jun Rao
added a comment - 14/Nov/11 00:33 - edited
The dependencies of the sub-jiras look like the following:
48
49
47 <-- 46,44/45 <-- 43,42,41
This means that initially, 47, 48, and 49 can be worked on independently.
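Under one reading of the arrows above ("X <-- Y" meaning Y depends on X; the grouping of 44/45 and the flat edge list below are my interpretation, not an official breakdown), the independently workable set falls out of a simple dependency check:

```python
# Sketch: which sub-jiras can start immediately, given the dependency
# chain above. "X <-- Y" is read as "Y depends on X". The exact edge
# list is my interpretation of the comment, not an official breakdown.

deps = {
    48: [], 49: [], 47: [],                       # no prerequisites
    46: [47], 44: [47], 45: [47],                 # blocked on 47
    43: [46, 44, 45], 42: [46, 44, 45], 41: [46, 44, 45],
}

ready = sorted(jira for jira, prereqs in deps.items() if not prereqs)
print(ready)  # [47, 48, 49] -- matches "can be worked on independently"
```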

Jun Rao
added a comment - 05/Jan/12 17:52
Attached v2 of the detailed design doc. Made 2 minor changes:
1. The v1 design had 2 separate ZK paths for a broker, one for registered and one for alive. We simplified it to just 1 ZK path for a live broker. The implication is that if a topic is created while a broker is down, no partition will be assigned to that broker. Since topic creation is infrequent, this is likely not a big issue.
2. Use the broker id as the replica id for each partition, instead of using an explicit replica id.

Neha Narkhede
added a comment - 01/Mar/12 19:10
Can the design be moved to the Kafka wiki instead of a non-editable PDF attached here? That would make it much easier to discuss missing details in the current design.

Neha Narkhede
added a comment - 01/Mar/12 21:59
Cool. So I was thinking that if the original design were on a wiki too, it would be much easier to point to sections of it and make discussions easier. If you have a text copy of it, would you mind pasting it into a Kafka JIRA page? It would be very useful.

Neha Narkhede
added a comment - 12/Mar/12 22:46
Moving the Kafka replication design docs to a wiki. This includes both the high-level and the low-level design details. The following changes are made on the wiki:
1. The state machine will be maintained and changed only by the leader for a partition. The leader coordinates each state change by requesting that the followers act on state change requests. This ensures that we don't have a split-brain problem during state changes among the replicas for a partition.
2. More details are added for the various algorithms.
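Point 1 can be sketched as a toy model (class and method names are hypothetical illustrations, not the actual design): only the replica recognized as leader may drive a state change, and followers reject requests from anyone else.

```python
# Toy model of point 1: the partition leader alone drives state changes,
# and followers refuse requests from any other broker, which is the
# split-brain guard described above. All names here are hypothetical.

class Replica:
    def __init__(self, broker_id, leader_id):
        self.broker_id = broker_id
        self.leader_id = leader_id   # who this replica believes is leader
        self.state = "leader" if broker_id == leader_id else "follower"

    def request_state_change(self, from_broker, new_state):
        """Apply a state change only if it comes from the known leader."""
        if from_broker != self.leader_id:
            return False  # reject: not from the leader (split-brain guard)
        self.state = new_state
        return True

leader_id = 0
replicas = [Replica(b, leader_id) for b in (0, 1, 2)]

# The leader coordinates a state change across all replicas.
results = [r.request_state_change(from_broker=0, new_state="in-sync")
           for r in replicas]
# A non-leader trying to impose a state change is rejected.
rogue = replicas[1].request_state_change(from_broker=2, new_state="leader")
print(results, rogue)  # [True, True, True] False
```

Because every change funnels through the single recognized leader, two replicas can never apply conflicting state transitions for the same partition in this model.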