I am testing with apache-flume-1.4.0-bin. I made a naive Python script for the exec source to do throttling by calling sleep(), but sleep() doesn't work when the script is called by the exec source. Any ideas about this, or do you have a simple solution for throttling other than writing a custom source?
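For what it's worth, the usual reason sleep()-based throttling "doesn't work" under the exec source is stdout buffering: when stdout is a pipe rather than a tty, Python block-buffers it, so events reach Flume in big bursts no matter how you sleep. A minimal sketch of such a feeder script (file name and delay are illustrative) that flushes after every line — running it with `python -u` has the same effect:

```python
#!/usr/bin/env python
# Hypothetical throttled feeder for a Flume exec source.
import sys
import time

def emit_throttled(lines, delay=0.1, out=sys.stdout):
    """Write one line at a time, flushing and sleeping between lines."""
    for line in lines:
        out.write(line.rstrip("\n") + "\n")
        # Critical: defeat pipe block-buffering so Flume sees each line
        # as soon as it is written, not in large buffered bursts.
        out.flush()
        time.sleep(delay)

if __name__ == "__main__":
    with open(sys.argv[1]) as f:
        emit_throttled(f)
```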

I've set up something similar with the spooling directory source. I have a script scheduled on the app server that creates an incremental file every minute and drops it in the spool directory for processing. The use case is web logs that roll over daily, but we want events 'near' real time. We didn't want to use the exec source because it gives no delivery guarantee; with a spooling source, if the flume agent stops processing, the incremental files simply stay in the spool dir until it's back up.
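A minimal agent config for that pattern might look like this (the agent name, component names, host, and paths are all placeholders, not from the original setup):

```properties
# Hypothetical agent: spooling directory source -> file channel -> avro sink.
a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.type = spooldir
# Directory where the scheduled job drops the per-minute incremental files.
a1.sources.r1.spoolDir = /var/spool/flume/weblogs
a1.sources.r1.channels = c1

# A file channel preserves the delivery guarantee across agent restarts.
a1.channels.c1.type = file

a1.sinks.k1.type = avro
a1.sinks.k1.hostname = collector.example.com
a1.sinks.k1.port = 4545
a1.sinks.k1.channel = c1
```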

Yes, I am curious what you mean as well. When testing I dropped a few 15 GB files in the spoolDir and, while they processed slowly, they did complete. In fact, my only issue with that test was that the last-hop HDFS sinks couldn't keep up, and I had to add a couple more to keep upstream channels from filling up.

If it happened at the last hop in your test, it could presumably happen at the first hop too. Maybe the network is not fast in my test. I got "ChannelException: The channel has reached it's capacity." on either the agent side (first hop) or the collector side (last hop sinking to Hadoop).
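That exception means a channel filled faster than its sinks could drain it. One knob worth checking before anything else is the channel sizing; a sketch for a memory channel (the values are illustrative, not recommendations):

```properties
a1.channels.c1.type = memory
# Maximum number of events the channel can hold (the default is only 100).
a1.channels.c1.capacity = 100000
# Events per put/take transaction (default 100).
a1.channels.c1.transactionCapacity = 1000
```

A bigger channel only buys time, though: if the sinks are persistently slower than the source, the channel will still eventually fill.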

I would start by logging performance metrics from your flume agent, which will let you determine which component is falling behind. It's likely a sink, though, so the first thing you can do is add an extra sink where you think it's falling behind. You can always add additional sinks pointing at the same channel; each will take events as fast as it can, and since the channel is transactional they won't ever attempt to send the same data (barring transaction failures).
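One concrete way to get those metrics: start the agent with JSON monitoring enabled (e.g. `-Dflume.monitoring.type=http -Dflume.monitoring.port=34545`) and poll the `/metrics` endpoint. A sketch that reports how full each channel is — the host and port are assumptions for illustration:

```python
# Read Flume's HTTP metrics JSON and report channel fill levels.
import json
from urllib.request import urlopen

def channel_fill(metrics):
    """Map channel name -> ChannelFillPercentage from Flume's metrics JSON,
    whose top-level keys look like 'CHANNEL.c1', 'SINK.k1', etc."""
    return {
        name.split(".", 1)[1]: float(stats["ChannelFillPercentage"])
        for name, stats in metrics.items()
        if name.startswith("CHANNEL.")
    }

if __name__ == "__main__":
    with urlopen("http://localhost:34545/metrics") as resp:
        metrics = json.load(resp)
    for channel, pct in sorted(channel_fill(metrics).items()):
        print("%s: %.1f%% full" % (channel, pct))
```

A channel whose fill percentage keeps climbing points at the sink downstream of it as the bottleneck.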

I'm not sure whether just adding another avro sink pointing at the same host is a viable way to improve performance; I would think just bumping up the transaction capacity would help there. I know there's a recent thread with some research suggesting the fileChannel is the actual bottleneck, though, so it may work.

In my case I just added another HDFS sink to each channel on my last hop, and that was enough to clear the bottleneck.
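A last-hop config along those lines might look like the sketch below (names and the HDFS path are placeholders, not the original setup); distinct file prefixes keep the two sinks from colliding on file names:

```properties
# Two HDFS sinks draining the same channel in parallel.
a1.sinks = k1 k2

a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://namenode/flume/weblogs
a1.sinks.k1.hdfs.filePrefix = events-k1

a1.sinks.k2.type = hdfs
a1.sinks.k2.channel = c1
a1.sinks.k2.hdfs.path = hdfs://namenode/flume/weblogs
a1.sinks.k2.hdfs.filePrefix = events-k2
```

Because channel takes are transactional, the two sinks never drain the same event (barring transaction failures).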

On 2013/08/21, at 1:15, Paul Chavez wrote:
> Yes, I am curious what you mean as well. When testing I had dropped a few 15GB files in the spoolDir and while they processed slowly they did complete. In fact, my only issue with that test was the last hop HDFS sinks couldn't keep up and I had to add a couple more to keep upstream channels from filling up.
