Exactly once semantics for Flume

Details

Type: Bug

Status: Open

Priority: Major

Resolution:
Unresolved

Affects Version/s:
None

Fix Version/s:
None

Component/s:
None

Labels:
None

Description

Currently Flume guarantees only at-least-once semantics. This JIRA is meant to track exactly-once semantics for Flume. My initial idea is to attach UUID event IDs to events at the original source (using a config to mark a source as an original source) and to identify destination sinks. At the destination sinks, use a unique ZK znode to track each event. If an event has already been seen (and deduplication is configured), drop the duplicate.

This might need some refactoring, but my belief is we can do this in a backward compatible way.

Hari Shreedharan
added a comment - 20/Aug/13 23:49 There are cases which we'd have to handle - like what happens if the update to ZK fails, or if the agent dies before ZK is updated but after the transaction is committed to the channel.

Edward Sargisson
added a comment - 23/Aug/13 17:29 I would recommend thinking carefully about how to manage ZK's garbage. ZK can run happily for quite a while but then slows down dramatically unless given a clean-up.

Hari Shreedharan
added a comment - 23/Aug/13 22:32 Yep, that is an important aspect of it. We need to come up with a way of handling that. One approach I can think of is to have a configurable period for which the event will not be duplicated. After this period, we go in and clean up older event uuids. Since duplication is primarily due to timeouts etc somewhere in the pipeline, once all agents in the pipeline are up, the event should reach HDFS sinks pretty quickly.

Hari Shreedharan
added a comment - 23/Aug/13 22:43 Here is an initial stage algorithm I can think of:
1. Insert uuid at the first agent when the event is received (or when the event is created at the client SDK).
2. At the destination HDFS (presumably we should be able to support this in all sinks, including custom ones by abstracting this out into another library), at the time of take, do a create for FLUME/<uuid>.
3. If the create succeeds, this agent "owns" that event and writes it out to HDFS.
4. If the create fails, it means another agent will eventually write the event out - so drop the event.
5. After a configured time period for which we guarantee that the event will not duplicate, delete the path which was created (where to do this is a good question - presumably any agent should be able to do it).
This algorithm seems to guarantee that an event will eventually be written since there is at least one agent that will not drop it, and barring HDFS reporting false failures (actually writing the events but throwing exceptions) and HDFS timeouts (due to which we don't know if the event really got written or not), this algorithm should not cause duplicates.
Also, this algorithm assumes that an agent which dies will eventually come back up and will be able to access the old disk on which it held its file channel (this agent has to eventually come back up to write the event out - it does not matter when it comes back, but it needs to - this brings up the question of what happens if the agent comes back up and tries to write an event out because it took it but never committed the transaction - we need to handle the ownership case).
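
The steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: `InMemoryRegistry` is a hypothetical in-memory stand-in for ZooKeeper (a real implementation would issue a create against the ZK ensemble), and `take_and_write` and the `written` list are made-up names standing in for the sink's take and the HDFS write.

```python
import time


class InMemoryRegistry:
    """Hypothetical stand-in for ZooKeeper: create() succeeds only once per path."""

    def __init__(self, clock=time.monotonic):
        self._created_at = {}
        self._clock = clock

    def create(self, path):
        # Mirrors create-if-absent: False means the path already exists.
        if path in self._created_at:
            return False
        self._created_at[path] = self._clock()
        return True

    def purge_older_than(self, max_age_seconds):
        # Step 5: delete paths whose no-duplicate window has elapsed.
        cutoff = self._clock() - max_age_seconds
        expired = [p for p, t in self._created_at.items() if t <= cutoff]
        for path in expired:
            del self._created_at[path]
        return len(expired)


def take_and_write(registry, event_id, body, written):
    """Write the event only if this caller wins the create for FLUME/<uuid>."""
    if registry.create("FLUME/" + event_id):
        written.append(body)  # this agent "owns" the event
        return True
    return False  # another agent owns it, so drop the event
```

Note that once a path has been purged, a late duplicate with the same uuid would be written again, which is why the configured period has to exceed the longest delay expected in the pipeline. The sketch also does not address the ownership question raised above (an agent that created the path but died before committing).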

Hari Shreedharan
added a comment - 28/Aug/13 01:55 Copying over the discussion from the dev@ list:
Hi Arvind,
Thanks for your reply. You are right that doing the global state check and update in the sink will require each sink to explicitly support it. We can, of course, have this implementation live in an abstract class which is inherited, but yes, this would also mean that code changes are needed.
It makes sense to check state in the channels, pretty much the same way as in the sinks. What is a bit concerning is that we will need to do this check at every agent that the event passes through, and probably make some changes in the channel interface to get rid of race conditions (not sure if that is the case, but I think we will need to). Given that an event is likely to pass through 2-3 tiers, each event gets delayed by the time taken by that many ZK round-trips. I am open to this as well, especially considering that it is likely to be a better OOB experience for many users (the ones who have their own custom sinks). Would it suffice to check at the sinks at the terminal agent to make sure that an event gets written out only once?
Thinking about this, having a once-only delivery at the channel level also opens up some possibilities with regards to being able to do some sort of processing on events. Having a guarantee of seeing an event exactly once allows us to do some event processing like counters etc. That seems like a good side effect to have.
Either way, I am glad we agree on the aspect of checking a global state manager to verify that events are deduped.
Thanks,
Hari
On Tuesday, August 27, 2013 at 2:12 PM, Arvind Prabhakar wrote:
Hi Hari,
Thanks for bringing this up for discussion. I think it will be tremendously
beneficial to Flume users if we can extend a once-only guarantee. Your
initial suggestion seems reasonable of having a Sink trap the events and
reference a global state to drop duplicates. Rather than pushing this
functionality to Sinks is there any other way by which we can make it more
generally available? The reason I raise this concern is because otherwise
this becomes a feature of a particular sink and not every sink will have
the necessary implementation opportunity to get this.
Alternatively what do you think about this being done at the channel level?
Since we normally do not see custom implementations of channels, an
implementation that works with the channel will likely be more useful for
the broader community of Flume users.
Regards,
Arvind
On Sun, Aug 25, 2013 at 9:07 AM, Hari Shreedharan <hshreedharan@cloudera.com> wrote:
Hi Gabriel,
Thanks for your input. The part where we use replicating channel selector
to purposefully replicate - we can easily make it configurable whether to
delete duplicate events or not. That should not be difficult to do.
The 2nd point where multiple agents/sinks could write the same event can
be solved by namespacing the events into different namespaces. So each sink
checks one namespace for the event, and multiple sinks can belong to the
same namespace - this way, if multiple sinks are going to write to the
same HDFS cluster, then if a duplicate occurs we can easily drop it.
Unfortunately, this also does not work around the whole
HDFS-writing-but-throwing issue.
I agree updating ZK will hit latency, but that is the cost to build once
only semantics on a highly flexible system. If you look at the algorithm,
we actually go to ZK only once per event (to create, there are no updates) - this
can even happen per batch if needed to reduce ZK round trips (though I am
not sure if ZK provides a batch API).
The two phase commit approach sounds good, but it might require interface
changes which can now only be made in Flume 2.x. Also, if we use a single
UUID combined with several flags we might be able to work around duplicates caused
by this replication.
Thanks,
Hari
On Sunday, August 25, 2013 at 7:24 AM, Gabriel Commeau wrote:
Hi Hari,
I deleted my comment (again). The mailing list is probably a better
avenue to discuss this - sorry about that!
I can find at least one other way duplicate events can occur, and so what
I provided helps to reduce duplicate events but is not sufficient to
guarantee exactly once semantics. However, I still think that using a
2-phase commit when writing to multiple channels would benefit Flume.
This should probably be a different ticket though.
Concerning the algorithm you offered, the case of replicating channel
selector should probably be handled, by creating a new UUID for each
duplicate message.
I hope this helps.
Regards,
Gabriel
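
The namespacing idea from Hari's reply above could look something like the sketch below. It is only an illustration: `NamespacedRegistry` is a hypothetical in-memory stand-in for the ZK state, the (namespace, uuid) key layout is an assumption rather than something settled in this thread, and the namespace names are made up.

```python
class NamespacedRegistry:
    """Hypothetical in-memory stand-in for ZK state keyed by (namespace, uuid)."""

    def __init__(self):
        self._seen = set()

    def claim(self, namespace, event_id):
        # True only for the first sink in this namespace to claim the event.
        key = (namespace, event_id)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True


registry = NamespacedRegistry()
# Two sinks that share a namespace: only the first claim wins.
first = registry.claim("hdfs-cluster-1", "uuid-1")
second = registry.claim("hdfs-cluster-1", "uuid-1")
# A sink in a different namespace still gets its own copy of the event.
other = registry.claim("hbase-cluster-1", "uuid-1")
```

With this layout a purposeful replication to two destinations is not treated as a duplicate, while a retry towards the same destination is dropped.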

Arvind Prabhakar
added a comment - 29/Aug/13 05:12 (continuing the discussion here instead of email)
Thanks Hari. In the spirit of keeping processing components pluggable, it would make sense to have this de-dupe logic pluggable itself. One benefit of doing so would be the choice of different implementations that could provide different degrees of guarantees. For example, the ZK based approach over the entire pipeline could provide a complete once-only delivery guarantee but, as you pointed out, could add latency to delivery. Alternatively there could be locally optimized implementations of this approach that act on subsets of the event stream and thus benefit partitioned deployments where events cannot cross wires.
Another use-case to consider would be to locally optimize for multiple channels within the same Agent. That way an Agent that has a File Channel setup as the primary channel and a Memory Channel setup as a fall-back channel in case the primary is full - would need local deduping without having to store state in ZK.

Hari Shreedharan
added a comment - 29/Aug/13 06:49 Yep, that is what I was thinking about. I was planning to keep these as interfaces which are pluggable, and having no requirement of once-only can easily be done by implementing a pass through dedupe logic. Local dedupe can easily be implemented and we can simply suggest that users configure a ZK based dedupe at the final channel(s). This allows low latency, and local dedupe too.
This brings me to another point - having local dedupe can actually allow for some interesting stuff. We could use local dedupes to allow for once only processing of events entering an agent.
Putting in a once-only guarantee allows for being able to do some interesting processing on events. For example, if we know that events will always arrive only once (even if there are multiple channels), we could use that to make "accurate" counts of events/event-types.
In fact, if we are able to somehow do some processing on the sink side (after dedupe), we could do some simple event processing while still moving the events through. I am thinking whether it makes sense to do something like allowing sinks/sink-based new component to do some processing on events picked up from the channel, and then allow it to write events out to a channel. This sort of creates a workflow that could look like this:
AvroSource->Channel->Sink->Channel->Sink->Channel->HDFS.
This allows for rolling back some failed processing without losing data (assuming the sink actually duplicates data and does not modify based on references returned by a memory channel). This is sort of how classical processing systems work (with processing code, separated by queues). Allowing the sinks to pull from multiple channels would even allow us to do cartesian product like processing too - like pseudo-joins.
Doing this, combined with once-only delivery would allow us to quite reliably do some simple event processing (I agree, the definition of "simple" is different for different people).
Thoughts?
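
A minimal sketch of the pluggable dedupe idea, assuming a single-method interface. The class names, the `is_first_delivery` method and the bounded in-memory window are all hypothetical choices for illustration, not an agreed Flume API; a ZK based implementation would expose the same method and consult ZooKeeper instead.

```python
from collections import OrderedDict


class PassThroughDeduper:
    """No once-only requirement: every event is treated as new."""

    def is_first_delivery(self, event_id):
        return True


class LocalDeduper:
    """Agent-local dedupe that remembers the most recent event ids in memory."""

    def __init__(self, capacity=10000):
        self._recent = OrderedDict()
        self._capacity = capacity

    def is_first_delivery(self, event_id):
        if event_id in self._recent:
            return False
        self._recent[event_id] = None
        if len(self._recent) > self._capacity:
            self._recent.popitem(last=False)  # evict the oldest id
        return True


def deliver(events, deduper):
    """Return the bodies of (event_id, body) pairs seen for the first time."""
    return [body for event_id, body in events if deduper.is_first_delivery(event_id)]
```

The local variant only catches duplicates that arrive at the same agent within its window, which is why a ZK based dedupe at the final channel(s) would still be needed for an end-to-end guarantee.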

Gabriel Commeau
added a comment - 02/Sep/13 07:12 Hi Hari & team,
May I suggest the following idea: instead of assigning a UUID to the events, which I assume would be arbitrary if not random, what about enforcing ordering of events? Each "ingest" agent/client (i.e. first tier) would have a unique identifier (e.g. a random UUID, or host name + agent name), and a local counter, which would increment for every event generated/ingested by that agent. Consequently, each event has an "ingest" ID and a counter value. In ZooKeeper, instead of having a long list of UUIDs for the events that have recently gone through once, we'd only have as many Z-nodes as ingest agents/clients (let it be N), each containing the highest counter value of events successfully passed through from the corresponding ingest agent/client. If an event has successfully been processed (i.e. the first time), the dedup channel advances the ZK counter to the counter value of that event. If the ZK counter is equal to or greater than the counter value of the event, it's a duplicate.
The advantage is that a batch of M events successfully processed can be acknowledged in k <= N ZooKeeper operations rather than M, which is usually much larger than N. The drawback is that if some events get stuck in process, the dedup channels will be waiting on them, and so we'd need a way to resend these events - and therefore, a channel seems like the appropriate place to do that. The "ingest" channel can clear the events that have a counter value below the ZK counter, as they have successfully been through once.
On another note, what about grouping the dedup channels for exactly-once semantics? We would define a namespace for this group of channels, and would guarantee that the events come exactly once in that group of channels; but an event could come twice in 2 distinct groups of channels - say one that goes to HDFS and one to HBase for instance. The ZK structure detailed above can be duplicated for each namespace (which would be a parent Z-node); and in order to clear the events, the "ingest" channel needs to check all existing namespaces.
I hope this helps.
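
A rough sketch of the counter scheme described above, for illustration only. `CounterRegistry` is a hypothetical in-memory stand-in for the per-ingest-agent Z-nodes, counters are assumed to start at 1, and events from one ingest agent are assumed to be acknowledged in order (the stuck-event case mentioned above is not handled).

```python
class CounterRegistry:
    """Hypothetical stand-in: one highest-counter value per ingest agent."""

    def __init__(self):
        self._high_water = {}

    def is_duplicate(self, ingest_id, counter):
        # Duplicate if the stored counter is equal to or greater than the event's.
        return self._high_water.get(ingest_id, 0) >= counter

    def acknowledge_batch(self, batch):
        """Advance each agent's counter once per batch; return the update count."""
        highest = {}
        for ingest_id, counter in batch:
            highest[ingest_id] = max(counter, highest.get(ingest_id, 0))
        updates = 0
        for ingest_id, counter in highest.items():
            if counter > self._high_water.get(ingest_id, 0):
                self._high_water[ingest_id] = counter
                updates += 1
        return updates
```

For a batch of M events coming from k distinct ingest agents, `acknowledge_batch` performs at most k updates, which corresponds to the k <= N ZooKeeper operations mentioned above.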