This wiki contains some ideas on improving the Kafka wire format. These changes could either be breaking, or be introduced as new requests alongside the existing ones, with the old requests removed after one release. The goal would be to do some or all of the following:

Known Proposals

Each proposal below lists the affected API(s), the new field(s), the related JIRA, and a discussion.

Add correlation id to all requests

API: All requests
New field: correlation_id:int32
Related JIRA: KAFKA-49

A field to make it possible to multiplex requests over a single socket. This field would be set by the client and would be returned by the server with the response. This would allow a client to make multiple requests on a socket and receive responses asynchronously and know which response was for which request.

Reduce duplication in APIs

APIs: ProduceRequest, MultiProducerRequest, FetchRequest, MultiFetchRequest
New field: none
Related JIRA: none

Currently we have both ProduceRequest and MultiProducerRequest, and FetchRequest and MultiFetchRequest. The Multi*Request is just the single-request version repeated N times. There are a few problems with this: (1) ProduceRequest and FetchRequest are just special cases of the general Multi*Request format with no real benefit to them (the reason for their existence is largely historical); (2) having both means more API formats to maintain, evolve, and test. We should get rid of the single topic/partition APIs and rename the existing Multi*Requests to ProduceRequest and FetchRequest to keep the naming clean.

Reduce repetition of topic name in Multi* APIs

APIs: MultiProducerRequest, MultiFetchRequest
New field: none
Related JIRA: none

Currently the form of the APIs for the Multi* requests looks something like this: [(topic, partition, messages), (topic, partition, messages), ...]. (Here square brackets denote a variable-length list and parentheses denote a tuple or record.) This format is driven by the fact that the Multi* requests are really just a bunch of repeated single topic/partition ProduceRequests. This is really inefficient, though, as a common case is that we are producing a bunch of messages for different partitions under the same topic (i.e. if we are doing key-based partitioning). It would be better for the format to be [(topic, [(partition, messages), ...]), (topic, [(partition, messages), ...]), ...]. This would mean that each topic name is only given once per request no matter how many partitions within that topic are being produced to.
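
As an illustration, the regrouping from the flat per-partition form into the nested per-topic form can be sketched in Python (the helper name is hypothetical; the real clients operate on serialized bytes, not tuples):

```python
from collections import defaultdict

def group_by_topic(flat_entries):
    """Regroup [(topic, partition, messages), ...] into
    [(topic, [(partition, messages), ...]), ...] so that each
    topic name would be sent only once per request."""
    grouped = defaultdict(list)
    for topic, partition, messages in flat_entries:
        grouped[topic].append((partition, messages))
    return list(grouped.items())

flat = [("logs", 0, b"m1"), ("logs", 1, b"m2"), ("metrics", 0, b"m3")]
print(group_by_topic(flat))
# [('logs', [(0, b'm1'), (1, b'm2')]), ('metrics', [(0, b'm3')])]
```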

Support "long poll" fields in fetch request

API: (Multi-)FetchRequest
New fields: max_wait:int32, min_size:int32
Related JIRA: KAFKA-48

Add two fields to the fetch request which cause the request not to return immediately. Currently fetch requests always return immediately, potentially with no data for the consumer. It is hence up to the consumer to continually poll for updates. This is not desirable. A better approach would be for the consumer request to block until either (1) min_bytes are available in total amongst all the topics being requested or (2) max_wait milliseconds have gone by. This would greatly simplify implementing a high-throughput, high-efficiency, low-latency consumer.
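
A rough sketch of the intended broker-side blocking behavior, assuming a hypothetical `available_bytes` callback standing in for the broker's internal bookkeeping (this is not the actual broker implementation, which would use notification rather than polling):

```python
import time

def wait_for_data(available_bytes, min_bytes, max_wait_ms, poll_interval_ms=10):
    """Block until at least min_bytes are available across the requested
    topics, or max_wait_ms has elapsed, whichever comes first.
    Returns the byte count observed at the moment of unblocking."""
    deadline = time.monotonic() + max_wait_ms / 1000.0
    while True:
        n = available_bytes()
        if n >= min_bytes or time.monotonic() >= deadline:
            return n
        time.sleep(poll_interval_ms / 1000.0)
```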

Add producer acknowledgement count and timeout

API: (Multi-)ProduceRequest
New fields: required_acks: int8, replication_timeout: int32
Related JIRA: KAFKA-49

Currently the produce requests are asynchronous with no acknowledgement from the broker. We should add an option to have the broker acknowledge. The original proposal was just to have a boolean "acknowledgement needed", but we also need a field to control the number of replicas to block on, so a generalization is to allow required_acks to be an integer between 0 and the number of replicas. 0 yields the current async behavior, whereas values greater than 1 mean that in addition to blocking on the master we also block on some number of replicas.
The replication timeout is the time in ms after which the broker will respond with an error if the required number of acknowledgements has not yet been received.
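
The intended semantics might be sketched like this (the return values are illustrative labels, not real protocol error codes):

```python
def produce_ack_outcome(required_acks, acks_received, timed_out):
    """Decide what the broker sends back for a produce request.
    required_acks == 0 keeps today's fire-and-forget behavior;
    otherwise the request succeeds once enough acks arrive, or
    fails once the replication timeout fires."""
    if required_acks == 0:
        return "no-response"      # current async behavior
    if acks_received >= required_acks:
        return "success"
    if timed_out:
        return "timeout-error"    # replication_timeout elapsed
    return "pending"              # keep waiting
```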

Add offset to produce response

API: ProduceResponse
New field: message_set_offset: int64
Related JIRA: KAFKA-49

As discussed in KAFKA-49 it would be useful for the acknowledgement from the broker to include the offset at which the message set is available on the broker.

Separate request id and version

API: All requests
New field: version_id: int16

Currently we have a single int32 that identifies both the api and the version of the api. This is slightly more confusing than splitting the request id and the version id into two 16-bit fields. This isn't a huge win but it does make the intention clearer when bumping the version number versus adding a new request entirely.

Add a client id

API: All requests
New field: client_id: string

Currently we can only correlate client applications to server requests via the TCP connection. This is a pain. It would be good to have a shared logical id for each application so that we can track metrics by client, log it with errors, etc.

Add replica id to fetch request

API: FetchRequest
New field: replica_id: int32

This replica id allows the broker to count the fetch as an acknowledgement for all previous offsets on the given partition. This should be set to -1 for fetch requests from non-replicas outside the cluster.
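
On the leader side, the bookkeeping this enables might look roughly like the following sketch (the dict and function names are hypothetical, not the broker's actual data structures):

```python
def record_fetch(replica_offsets, replica_id, partition, fetch_offset):
    """Leader-side bookkeeping: a fetch at fetch_offset from a follower
    implicitly acknowledges every earlier offset on that partition.
    replica_id == -1 marks an ordinary consumer outside the cluster,
    which must not affect replication accounting."""
    if replica_id == -1:
        return
    replica_offsets[(replica_id, partition)] = fetch_offset
```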

Open Questions

Can we do a one-time incompatible refactoring for this?

Pros: no need to keep the old stuff working while adding the new stuff

Cons: hard to roll out; requires updating all clients in other languages at the same time.

One thought on this is that it is probably not too hard to make most of the above changes as new request types and map the old request types to the new. However if we are changing the request id and version id scheme then this will likely not be possible.

If we want to do a 0.7.1 release we will need to figure out a sequencing and branching strategy so that backwards-incompatible changes do not block it.

Are there any other fields needed for replication or other use cases we know about?

Currently the multi-* responses give only a single error. I wonder if this is sufficient or whether they potentially need more. For example, if you send a produce request to the wrong partition we need to tell you the right partition, which would be different for each partition.

Request Details

This section gives the proposed format for the produce and fetch requests after all the above refactorings.

To aid understanding I will use the following notation:

int8, int16, int32, and int64 will be signed integers of the given bit width

string is an int16 giving the size N of the string followed by N bytes of UTF-8 characters.

message_set denotes the existing message set format

[] denote a variable-length list prefixed by an int16

{} denote the fields of a record. These aren't stored; they are just used for grouping.

// denote comments

<x> denotes that x is a type that will be defined separately

So, as an example, a list of records, each of which has a name and an id field, would be denoted by:

[{id:int32, name:string},...]
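
Under these conventions, that example could be serialized like the following Python sketch (network byte order; `encode_records` is a hypothetical helper for illustration, not part of any Kafka client):

```python
import struct

def encode_string(s):
    """string: int16 length N followed by N bytes of UTF-8."""
    data = s.encode("utf-8")
    return struct.pack(">h", len(data)) + data

def encode_records(records):
    """[{id:int32, name:string}, ...]: int16 element count,
    then each record's fields in order."""
    out = struct.pack(">h", len(records))
    for rec in records:
        out += struct.pack(">i", rec["id"]) + encode_string(rec["name"])
    return out

buf = encode_records([{"id": 7, "name": "ab"}])
# 2-byte count + 4-byte id + 2-byte length + 2 bytes "ab" = 10 bytes
assert len(buf) == 10
```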

Common fields

The following fields are common to all requests:

Request Fields

1. size (int32): The size of this request, not counting these 4 bytes. This is mandatory and required by the network layer; it must be the first field in the request.

2. request_type_id (int16): An id for the API being called (e.g. FetchRequest, ProduceRequest, etc.).

3. version_id (int16): A version number for the request format. This number starts at 0 and increases every time a protocol change is made for this API.

4. correlation_id (int32): An id that can be set by the client and will be returned untouched by the server in the response.

5. client_id (string): A user-defined identifier for the client, used for logging and statistics (e.g. to aggregate statistics across many client machines in a cluster).

Response Fields

1. size (int32): The size of this response, not counting these 4 bytes. This is mandatory and required by the network layer; it must be the first field in the response.

2. correlation_id (int32): The id set by the client in the request, returned untouched in the response.

3. version_id (int16): A version number that indicates the format of the response message.

4. error (int16): The id of the (request-level) error, if any occurred.

ProduceRequest

{
  size: int32 // the size of this request
  request_type_id: int16 // the request id
  version_id: int16 // the version of this request
  correlation_id: int32 // an id set by the client that will be returned untouched in the response
  client_id: string // an optional non-machine-specific identifier for this client
  required_acks: int8 // the number of acknowledgements required from the brokers before a response can be made
  ack_timeout: int32 // the time in ms to wait for acknowledgement from replicas
  data: [<topic_data_struct>] // the data for each of the topics, defined below
}
topic_data_struct =>
{
  topic: string // the topic name
  partition_data: [<partition_data_struct>] // the data for each partition in that topic, defined below
}
partition_data_struct =>
{
  partition: int32 // the partition id
  messages: message_set // the actual messages for that partition (same as existing)
}
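
Putting the notation together, a toy encoder for this layout might look as follows. This is a sketch only: the message_set payload is treated as opaque, pre-encoded bytes, and the request_type_id value passed in would be a placeholder, not a real protocol constant.

```python
import struct

def _string(s):
    """string: int16 length followed by UTF-8 bytes."""
    b = s.encode("utf-8")
    return struct.pack(">h", len(b)) + b

def encode_produce_request(request_type_id, version_id, correlation_id,
                           client_id, required_acks, ack_timeout, data):
    """Encode the ProduceRequest body per the layout above, where data is
    [(topic, [(partition, message_set_bytes), ...]), ...]."""
    body = struct.pack(">hhi", request_type_id, version_id, correlation_id)
    body += _string(client_id)
    body += struct.pack(">bi", required_acks, ack_timeout)
    body += struct.pack(">h", len(data))          # int16-prefixed list
    for topic, partitions in data:
        body += _string(topic) + struct.pack(">h", len(partitions))
        for partition, messages in partitions:
            # message_set appended as opaque bytes; the real message set
            # format (and its framing) is simplified away here
            body += struct.pack(">i", partition) + messages
    # the size prefix does not count its own 4 bytes
    return struct.pack(">i", len(body)) + body
```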

ProduceResponse

{
  size: int32 // the size of this response
  correlation_id: int32 // an id set by the client returned untouched in the response
  version_id: int16 // the version of this response
  error: int16 // the id of the error that occurred (if any)
  errors: [int16] // per-partition errors, one for each message set sent (or all -1 if none)
  offsets: [int64] // the offsets for each of the message sets supplied, in the order given in the request
}

The errors and offsets arrays MUST contain one entry for each message set given in the request, and the order must match the order in the request. That is, the Nth offset in the offsets array corresponds to the Nth message set in the request.
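
The pairing rule can be made concrete with a small sketch (a hypothetical client-side helper, assuming the counts line up as the text requires):

```python
def match_produce_response(sent_message_sets, errors, offsets):
    """Pair the Nth error and offset with the Nth message set sent.
    Raises if the counts disagree, since the response would then be
    ambiguous."""
    if not (len(sent_message_sets) == len(errors) == len(offsets)):
        raise ValueError("response arrays must match request order/length")
    return [
        {"sent": ms, "error": err, "offset": off}
        for ms, err, off in zip(sent_message_sets, errors, offsets)
    ]
```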

FetchRequest

{
  size: int32 // the size of this request
  request_type_id: int16 // the request id
  correlation_id: int32 // an id set by the client returned untouched in the response
  version_id: int16 // the version of this request
  client_id: string // an optional non-machine-specific identifier for this client
  replica_id: int32 // the node id of the replica making the request or -1 if this client is not a replica
  max_wait: int32 // the maximum time to wait for a "full" response to accumulate on the server
  min_bytes: int32 // the minimum number of bytes accumulated to consider a response ready for sending
  topic_offsets: [<offset_data>]
}
offset_data =>
{
  topic: string
  partitions: [int32]
  offsets: [int64] // one offset per partition, in the same order as the partitions list
}

FetchResponse

{
  size: int32 // the size of this response
  correlation_id: int32 // an id set by the client returned untouched in the response
  version_id: int16 // the version of this response
  error: int16 // global error for this request (if any)
  data: [<topic_data_struct>] // the actual data requested (in the same format as defined for the produce request)
}


23 Comments

correlation id: Not sure if we need this. It seems that it's simpler if we allow each client to make just one outstanding request. If so, we don't need a correlation id.

max_wait/min_size: I can definitely see that max_wait is very useful. There are a couple of problems with min_size. First, it's a bit hard to implement, especially with multifetch. Second, it's not clear how useful it is. I feel it's much simpler for consumers to just reason about delay in time.

required_acks: For generality, we need this field. However, I suggest that we use 0 to mean waiting for acks from all replicas that are currently synced up. The async behavior probably doesn't need a response.

We will need a way to distinguish between a FetchRequest made by a follower replica and one made by a consumer. One way to do that is to add a flag field in the fetch request. By default, the flag is 0. The follower replica can set the flag to a non-zero value.

For a multifetch request, currently there is an error code per response, in addition to the error code for the whole multifetch. We can probably do the same thing for multiproduce.

About removing fetch and keeping only multifetch, one minor concern is that sync produce makes most sense on a single topic (probably with a single message too). That's probably the main use case when a response needs to be sent to the producer. If we have both fetch and multifetch, one thing that we can do is to only allow fetch to send data synchronously and get a response back.

Agree with Jun on points 1, 2 and 3. Regarding point 4 Jun, I'm not sure I see the distinction between a replica follower and a regular consumer. Followers only fetch when catching up to the primary replica, this behaviour is the same as an old-fashioned consumer, no? Maybe I'm missing something.

One point of my own is:

The new ProduceRequest has required_acks at the top level, meaning that the entire request needs an ack from that many brokers. What if the replica_count for some topic "A" in the request has < required_ack replicas but topic "B" meets the requirement?

For 4, from the consumer's perspective, the fetch request made by followers is no different from a regular consumer. However, the leader has to maintain the freshness of each replica based on fetch requests from the follower. So, the leader has to know whether a fetch request is initiated by a follower. In fact, I think we need to add a field "replica id" in the fetch request. A follower will set the field to be the id of the replica it is responsible for. Regular consumers can simply set the field to -1.

Your second question is related to 5. One possibility is to allow each individual request to fail.

Related to required_acks, we probably need to specify a timeout when we fail a request if not enough required acks are obtained within the timeout. The question is whether that timeout should be specified on the broker or on the producer.

If we do end up with a correlation ID, shouldn't it be included in the FetchRequest? Currently, it's in the FetchResponse but not in the request, which I think doesn't really make sense...

A general question about required_acks and replication in general (sorry if it's off-topic). What is supposed to happen if, for example, a topic is expected to have a replication factor of 3 but only 2 brokers are able to record a given message (the original one + 1 replica)? I guess the ProduceResponse should include an error that indicates that the ProduceRequest has failed, but will the two replicas be deleted from the brokers that did record the message successfully? It might be tricky to ensure those deletions while maintaining high throughput. The problem is that if those messages are not deleted from the 2 replicas that have them, then the producer might retry sending the same message again and the brokers would then contain duplicate messages.

1. It's actually there. It just happened to be on the same line as the previous field. I fixed this.
2. That's a good question. If the client insists on 3 successful writes from each of the 3 replicas but only 2 are successful, I think the contract should be the following. The producer request will return with an error code. This just means that the message is published, but under replicated. The under replicated messages will be copied to other replicas over time (when brokers are up). The producer, when getting an error, can take appropriate actions on its own, such as stopping sending new requests or issuing a warning, etc.

3. For required_acks, how about we use -1 to designate no ack and 0 to designate ack from all current in-sync replicas? A positive value will designate the # of replicas with successful writes. Related to that, if an ack is needed, do we need to specify the timeout in the request?

I think that makes sense, but it should be reversed, right? 0 should mean 0 and -1 can mean "all in sync".

Let's think through the use case a bit for the "all in sync" option. I am not sure if I understand why I would want this. If I want to ensure 2x replication before I consider the value written that is because I want that much redundancy. It is not clear to me when my requirement would be for all the in sync replicas to have it.

One issue with having a timeout is that if the number of required replicas is 3 but only 2 replicas are available, then every single request will time out after blocking for the maximum time, which is actually fairly bad. I tend to think that we should only block on the minimum of the requested acks and the number of in-sync replicas. Blocking on non-in-sync replicas seems like a bad idea, since it will lengthen the request time with very little chance of leading to a successful request.

1. The correlation id enables the client to multiplex requests; it doesn't require it. I don't recommend we implement this yet in the Scala client, BUT we may well want to do this. Once we add produce requests that block on other replicas this will become somewhat slow, so tying up a full connection is not desirable. We will have to see how this plays out, but adding this now just enables this style of client.

2. The goal of min_size is to allow high throughput with low latency. Currently, polling on a topic getting a continuous trickle of messages will lead to very small fetches. To work around this, the only ability the consumer has is to basically sleep for some guessed period of time and hope that some data accumulates. But setting this sleep time correctly is impossible. This is a way to allow the consumer to precisely specify the minimum size they want and the maximum time they will wait. This does add to implementation complexity, so I am open to the idea of passing on this, but I think it would be nice to have.

3. required_acks--isn't it a bit unintuitive to have 0 acks mean all in-sync replicas rather than that no acks are required? Here is how I see it: you really need to specify a timeout for the acknowledgement and the number of acknowledgements you want. I don't think your replication requirement is likely to depend on whether replicas are in sync or not.

4. I agree we need a replica_id field in the fetch request which would be set to -1 for consumers outside the cluster.

5. I agree we need both per-fetch error code and a global. I can't think of a use for the global error right now, though, can anyone else?

6. I agree that the multi-* requests are less intuitive than the single topic/partition variety. I think the idea here, though, is to make the network-level api as general as possible even if it is less intuitive.

Jun--The timeout should definitely, definitely be on the client not the server. Different applications have different latency requirements and we definitely don't want to get into the business of tuning this for each user.

Felix--Added the correlation id to FetchRequest, that was a typo.

Jun, Felix--The contract should definitely be just that we could not meet the replication requirements you gave but at least the master got the message. If a replica times out or crashes it is impossible for us to know if this replica got the message or not. Ideally we should tell the client sorry, we could only replicate to N replicas which is less than the M replicas you required, but right now we don't have a way to send back N so it might be fine just to say that we couldn't meet the required replicas in the time limit.

One more thing I forgot that I think we should add is an optional client id. This would make it easier to correlate requests, exceptions, and metrics collected on the server side with the client that sent the request. We can do this now on a per-connection basis, but it can be a huge pain to correlate connections to applications. This token would allow that.

1. Can we have the global error code come after the size in the *Response protocol? This aligns with how the current wire format works, unless someone sees something wrong with this?
2. I think we need to add a fetch_size at the topic/offset level to the FetchRequest to align with the current log APIs and for general ease of use.
3. Also, I think we need to add an error_code per fetch (topic/offset level) to FetchResponse in addition to the global request-level error. Because of this, I'm not sure how much of the data structures can be reused.

So the proposed FetchResponse would be:

fetch_response =>
{
  size: int32 // the size of this response
  error: int16 // global error for this request (if any)
  correlation_id: int32 // an id set by the client returned untouched in the response
  version_id: int16 // the version of this response
  data: [<topic_data_struct>] // the actual data requested (in the same format as defined for the produce request)
}
topic_data_struct =>
{
  topic: string // the topic name
  partition_data: [<partition_data_struct>] // the data for each partition in that topic, defined below
}
partition_data_struct =>
{
  partition: int32 // the partition id
  error: int32 // the error code (if any) for this particular fetch
  offset: int64 // the offset of this fetched message
  messages: message_set // the actual messages for that partition (same as existing)
}
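
A consumer of this proposed layout would check the per-partition error before touching the messages, roughly like this sketch (the field names follow the structs above; the helper itself is hypothetical):

```python
def successful_partitions(topic_data):
    """Walk the proposed per-partition fetch response layout, keeping
    only fetches whose partition-level error code is 0 (no error)."""
    ok = []
    for topic in topic_data:
        for pd in topic["partition_data"]:
            if pd["error"] == 0:
                ok.append((topic["topic"], pd["partition"], pd["messages"]))
    return ok
```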

Why include both a correlation_id and a request_id? If the response contains the request_id then the client could pipeline the requests just as easily.

So for the ProduceResponse, it might be easier if you create another structure to group the error and offset per message set, or maybe just add an id to the message_set to map the errors/offsets in the response. Otherwise it seems a bit confusing: the errors are per partition but the offsets are per message set, and you have to assume they're in the order you sent. If the count for either isn't the same as expected for some reason, how does the client know what happened or whether anything was posted?

The request id just identifies the type of the request (e.g. fetch request, produce request, etc). The correlation id is meant to be an identifier unique for each request entered for that client (i.e. a counter).

The intention for errors and offsets is that a value must be specified for each message set sent (0==no error) and the order is the same as the order the message sets were sent in. This could be done either as two arrays as above:

Question regarding the FetchRequest format, specifically the topic_offsets field. Can we model the offset_data structure like the topic_data_struct/partition_data_struct of the produce requests? Though I realize introducing another new data structure adds complexity, I feel it adds clarity; rather than matching an offset to a partition based purely on indices into the array, that information is captured in a trivial class (or maybe even a simple tuple). It also protects against any potential transmission problems in case the offset or partition array gets truncated. Just a thought, I'm open to discussion.

Regarding acking in ProduceRequest: it would probably need a similar field in the ProduceResponse to reflect the number of actual successful acks.

errors: [int16] // per-partition errors, one for each message set sent (or all -1 if none)
offsets: [int64] // the offsets for each of the message sets supplied, in the order given in the request
acks: [int16] // per-partition ack counts, one for each message set sent, or -1 if acks were not requested.

This goes in line with previous discussions here on what to do when the desired number of acks cannot be satisfied (including timeout) and delegates the decision to the producer application, where it belongs.

Not sure if returning acks is useful to the clients, and it probably unnecessarily exposes internal implementation details. As long as the error code indicates an error, the client will probably always do the same thing (e.g., retry the send) independent of the actual ack value.

Sending the client_id in every message is a bit excessive since it shouldn't change during the lifetime of a connection (right?).
How about having an (optional) Hello/Register message sent on the initialization of the connection?

For simplicity, a Hello message must only be accepted as the first message on a new connection, and not sending a Hello message makes the broker generate its own unique client_id ("192.168.0.3:23593", ...).

If the Hello handshake is made required, then the version_id from the generic response and reply headers could be moved here instead, thus only checking the version once for each connection. Albeit checking an int16 for each request isn't that much work in context.

Regardless of Hello message or not; what if the client_id collides with an existing connection?

Assume the new connection is from a new instance of the same application, replacing a malfunctioning/dying connection that is yet to be torn down by TCP timeout. This closes the old connection.

Or, reject the new connection; probably not preferable because of the point above.

Or, append something unique to the client_id and let both connections persist.

It is true that a client id typically doesn't change within a connection. However, having an extra handshake complicates the protocol. Most requests are likely non-trivial, so it's not clear to me whether the increase in complexity outweighs the benefit of saving the client id overhead.

I kind of like keeping the client_id on each request. If we wanted encrypted transmission over the socket, the client_id value could be turned into the "password" for the symmetrically encrypted data in the transmission, which we could encrypt with the first client id (with certificates and private keys). It opens up a lot of possibilities in implementation, lots of good uses.

There is too much discussion of conditionals here. If we want to hold off until 0.8.1 or 0.9, or however we want to plan for releases ahead, it is better to have that discussion on JIRA.

Moving some fields into a connection handshake would certainly be possible. That is true of the versioning and the client id. I don't have a strong opinion either way. As an optimization it is probably not major. The only downside is that the versioning information has to be maintained with the connection which would somewhat couple the network layer and the api layer.

Our intention for client_id was not actually a unique per-connection identifier, which I think TCP already provides. Rather the problem we have right now is that in a highly multi-tenant environment it is difficult to aggregate metrics or trace errors to a logical application. The client id was meant to be a logical identifier potentially used across many servers, e.g. "foobar-front-end". This then becomes an aggregation level for metrics, quotas, as well as being helpful in error reporting.

I could go either way on returning the ack count. I think the important distinction from the client point of view, though, is whether the message is (1) known to have been committed, (2) written at least once but not to sufficient servers to be guaranteed, (3) definitely not written, or (4) unknown. Case (4) would arise from a socket timeout or network error in which no response is received. We should think through whether we have correctly captured at least these cases.

Anonymous

Okay Jay, et al.,

thanks for clarifying the background of these fields.
For the Hello semantics I don't have a use case, so it was mostly a good-practice, future-proofing idea that didn't catch on.
And you are all right that proper error indication is sufficient status reporting for acks.

Regarding performance: that's a numbers game, so it all depends on the volume of messages being fetched and produced, but it's most likely not an issue.