I came upon an issue with ScribeSource, though it's theoretically applicable to any EventDrivenSource whose event generating thread(s) die. Simple put, sending a bad packet to the thrift(scribe protocol) port will result in it trying to allocate space for some arbitrarily large packet resulting in an OOMException which kills the thread(incidentally I thought this would be an issue in avro too, but it throws an exception before making excessive allocation requests).

As far as flume is concerned, the component is still alive. stop() was never called, so even monitoring the component state using jmx will not notice anything wrong. This situation occurs from user error, but there is potential for other errors leaving a zombie component. I think it would be more user friendly to be able to recover from such errors.

I'm thinking of adding a StatusPollable interface that EventDrivenSources can optionally implement(because we can't change the interface without a version change). If implemented, the EventDrivenSourceRunner would schedule a regular poll to check the state. Upon failure it could either call stop() to signal it broke. With autoRestartPolicy, the source would then get restarted by its supervisor.

Would appreciate any opinions before I put together a patch/post an issue.

Yes I can definitely see the issue. It sucks that we'd have to add yetanother thread. An alternative which wouldn't require another threadwould be to check the optional interface in the supervisor,approximately here:

However, I am not sold on the supervisor being the best place to fixthis as I am not sure that other lifecycle components would need this.

Brock

On Wed, Jan 16, 2013 at 7:45 PM, Juhani Connolly<[EMAIL PROTECTED]> wrote:> I came upon an issue with ScribeSource, though it's theoretically> applicable to any EventDrivenSource whose event generating thread(s) die.> Simple put, sending a bad packet to the thrift(scribe protocol) port will> result in it trying to allocate space for some arbitrarily large packet> resulting in an OOMException which kills the thread(incidentally I thought> this would be an issue in avro too, but it throws an exception before making> excessive allocation requests).>> As far as flume is concerned, the component is still alive. stop() was never> called, so even monitoring the component state using jmx will not notice> anything wrong. This situation occurs from user error, but there is> potential for other errors leaving a zombie component. I think it would be> more user friendly to be able to recover from such errors.>> I'm thinking of adding a StatusPollable interface that EventDrivenSources> can optionally implement(because we can't change the interface without a> version change). If implemented, the EventDrivenSourceRunner would schedule> a regular poll to check the state. Upon failure it could either call stop()> to signal it broke. With autoRestartPolicy, the source would then get> restarted by its supervisor.>> Would appreciate any opinions before I put together a patch/post an issue.

Hmm, overriding the implementation of getLifecycleState provided by AbstractSource could work. It would be going against the convention that has been maintained in all other components(that I can think of)

On 01/17/2013 01:20 PM, Brock Noland wrote:> Hi,>> Yes I can definitely see the issue. It sucks that we'd have to add yet> another thread. An alternative which wouldn't require another thread> would be to check the optional interface in the supervisor,> approximately here:>> https://github.com/apache/flume/blob/trunk/flume-ng-core/src/main/java/org/apache/flume/lifecycle/LifecycleSupervisor.java#L240>> However, I am not sold on the supervisor being the best place to fix> this as I am not sure that other lifecycle components would need this.>>>> Brock>> On Wed, Jan 16, 2013 at 7:45 PM, Juhani Connolly> <[EMAIL PROTECTED]> wrote:>> I came upon an issue with ScribeSource, though it's theoretically>> applicable to any EventDrivenSource whose event generating thread(s) die.>> Simple put, sending a bad packet to the thrift(scribe protocol) port will>> result in it trying to allocate space for some arbitrarily large packet>> resulting in an OOMException which kills the thread(incidentally I thought>> this would be an issue in avro too, but it throws an exception before making>> excessive allocation requests).>>>> As far as flume is concerned, the component is still alive. stop() was never>> called, so even monitoring the component state using jmx will not notice>> anything wrong. This situation occurs from user error, but there is>> potential for other errors leaving a zombie component. I think it would be>> more user friendly to be able to recover from such errors.>>>> I'm thinking of adding a StatusPollable interface that EventDrivenSources>> can optionally implement(because we can't change the interface without a>> version change). If implemented, the EventDrivenSourceRunner would schedule>> a regular poll to check the state. Upon failure it could either call stop()>> to signal it broke. With autoRestartPolicy, the source would then get>> restarted by its supervisor.>>>> Would appreciate any opinions before I put together a patch/post an issue.>>

Why limit it to the sources? If there is going to be a change to onecomponent's lifecycle, then I see no reason not to change every component'slifecycle.

Sinks and Channels could very well have this problem; so what about givingeach LifecycleAware component a takePulse method (or something); or toavoid creating a new method, add a new lifecycle state 'crashed' or suchwhich, when detected, causes a restart of the component. Then componentswould just need to override getLifecycleState (if this method is polledregularly; I don't know. maybe use a listener for when there's a statechange) to detect if it has crashed/needs to be restarted.

> Hmm, overriding the implementation of getLifecycleState provided by> AbstractSource could work. It would be going against the convention that> has been maintained in all other components(that I can think of)>>> On 01/17/2013 01:20 PM, Brock Noland wrote:>>> Hi,>>>> Yes I can definitely see the issue. It sucks that we'd have to add yet>> another thread. An alternative which wouldn't require another thread>> would be to check the optional interface in the supervisor,>> approximately here:>>>> https://github.com/apache/**flume/blob/trunk/flume-ng-**>> core/src/main/java/org/apache/**flume/lifecycle/**>> LifecycleSupervisor.java#L240<https://github.com/apache/flume/blob/trunk/flume-ng-core/src/main/java/org/apache/flume/lifecycle/LifecycleSupervisor.java#L240>>>>> However, I am not sold on the supervisor being the best place to fix>> this as I am not sure that other lifecycle components would need this.>>>>>>>> Brock>>>> On Wed, Jan 16, 2013 at 7:45 PM, Juhani Connolly>> <juhani_connolly@cyberagent.**co.jp <[EMAIL PROTECTED]>>>> wrote:>>>>> I came upon an issue with ScribeSource, though it's theoretically>>> applicable to any EventDrivenSource whose event generating thread(s) die.>>> Simple put, sending a bad packet to the thrift(scribe protocol) port will>>> result in it trying to allocate space for some arbitrarily large packet>>> resulting in an OOMException which kills the thread(incidentally I>>> thought>>> this would be an issue in avro too, but it throws an exception before>>> making>>> excessive allocation requests).>>>>>> As far as flume is concerned, the component is still alive. stop() was>>> never>>> called, so even monitoring the component state using jmx will not notice>>> anything wrong. This situation occurs from user error, but there is>>> potential for other errors leaving a zombie component. I think it would>>> be>>> more user friendly to be able to recover from such errors.>>>>>> I'm thinking of adding a StatusPollable interface that EventDrivenSources>>> can optionally implement(because we can't change the interface without a>>> version change). If implemented, the EventDrivenSourceRunner would>>> schedule>>> a regular poll to check the state. Upon failure it could either call>>> stop()>>> to signal it broke. With autoRestartPolicy, the source would then get>>> restarted by its supervisor.>>>>>> Would appreciate any opinions before I put together a patch/post an>>> issue.>>>>>>>>>>

Sink, Source and Channel all extend LifecycleAware, so the function is available to all components already.

I was more questioning whether it was reasonable to start including logic to determine the state. That being said, I think the precedent of just returning the state set by start/stop is more one of habit, so on further thought I don't see it as being unreasonable. I'm going to give fixing ScribeSource with it a poke.

As to new lifecycle states, I took a pass at reworking the lifecycle model with guavas service implementation in the past, but it took some very significant changes and didn't get the momentum/interest necessary to keep working on it. That ticket is here https://issues.apache.org/jira/browse/FLUME-966 . Brock might be working on it now though the issue doesn't appear to have had attention since october.

On 01/17/2013 03:01 PM, Connor Woodson wrote:> Why limit it to the sources? If there is going to be a change to one> component's lifecycle, then I see no reason not to change every component's> lifecycle.>> Sinks and Channels could very well have this problem; so what about giving> each LifecycleAware component a takePulse method (or something); or to> avoid creating a new method, add a new lifecycle state 'crashed' or such> which, when detected, causes a restart of the component. Then components> would just need to override getLifecycleState (if this method is polled> regularly; I don't know. maybe use a listener for when there's a state> change) to detect if it has crashed/needs to be restarted.>> Just my thoughts,>> - Connor>>> On Wed, Jan 16, 2013 at 9:08 PM, Juhani Connolly <> [EMAIL PROTECTED]> wrote:>>> Hmm, overriding the implementation of getLifecycleState provided by>> AbstractSource could work. It would be going against the convention that>> has been maintained in all other components(that I can think of)>>>>>> On 01/17/2013 01:20 PM, Brock Noland wrote:>>>>> Hi,>>>>>> Yes I can definitely see the issue. It sucks that we'd have to add yet>>> another thread. An alternative which wouldn't require another thread>>> would be to check the optional interface in the supervisor,>>> approximately here:>>>>>> https://github.com/apache/**flume/blob/trunk/flume-ng-**>>> core/src/main/java/org/apache/**flume/lifecycle/**>>> LifecycleSupervisor.java#L240<https://github.com/apache/flume/blob/trunk/flume-ng-core/src/main/java/org/apache/flume/lifecycle/LifecycleSupervisor.java#L240>>>>>>> However, I am not sold on the supervisor being the best place to fix>>> this as I am not sure that other lifecycle components would need this.>>>>>>>>>>>> Brock>>>>>> On Wed, Jan 16, 2013 at 7:45 PM, Juhani Connolly>>> <juhani_connolly@cyberagent.**co.jp <[EMAIL PROTECTED]>>>>> wrote:>>>>>>> I came upon an issue with ScribeSource, though it's theoretically>>>> applicable to any EventDrivenSource whose event generating thread(s) die.>>>> Simple put, sending a bad packet to the thrift(scribe protocol) port will>>>> result in it trying to allocate space for some arbitrarily large packet>>>> resulting in an OOMException which kills the thread(incidentally I>>>> thought>>>> this would be an issue in avro too, but it throws an exception before>>>> making>>>> excessive allocation requests).>>>>>>>> As far as flume is concerned, the component is still alive. stop() was>>>> never>>>> called, so even monitoring the component state using jmx will not notice>>>> anything wrong. This situation occurs from user error, but there is>>>> potential for other errors leaving a zombie component. I think it would>>>> be>>>> more user friendly to be able to recover from such errors.>>>>>>>> I'm thinking of adding a StatusPollable interface that EventDrivenSources>>>> can optionally implement(because we can't change the interface without a>>>> version change). If implemented, the EventDrivenSourceRunner would>>>> schedule>>>> a regular poll to check the state. Upon failure it could either call

What I was trying to say is that the ScribeSource should not be responsiblefor restarting itself (just from my understanding of your idea; it breaksthe existing paradigm as components should not have to control their ownlifecycle). Going from Brock's link, I feel the most dynamic solution wouldbe possible to just add in a lifecyle state RESTART and place it in thatswitch statement; when that state is reached, it tries to stop then restartthe component and then sets the desired state to START (or STOP if itcouldn't start it again, or set error=true if it couldn't stop it).

And in a way to prevent overriding getLifecycleState to return RESTART,there could either be an inherited function from AbstractSource to call fora restart, there could maybe be an option to set it to RESTART on crash, orsomething else.

I'll admit though that I know little about the lifecycle system, so I haveno idea if this idea is any better.

> Sink, Source and Channel all extend LifecycleAware, so the function is> available to all components already.>> I was more questioning whether it was reasonable to start including logic> to determine the state. That being said, I think the precedent of just> returning the state set by start/stop is more one of habit, so on further> thought I don't see it as being unreasonable. I'm going to give fixing> ScribeSource with it a poke.>> As to new lifecycle states, I took a pass at reworking the lifecycle model> with guavas service implementation in the past, but it took some very> significant changes and didn't get the momentum/interest necessary to keep> working on it. That ticket is here https://issues.apache.org/**> jira/browse/FLUME-966 <https://issues.apache.org/jira/browse/FLUME-966> .> Brock might be working on it now though the issue doesn't appear to have> had attention since october.>>> On 01/17/2013 03:01 PM, Connor Woodson wrote:>>> Why limit it to the sources? If there is going to be a change to one>> component's lifecycle, then I see no reason not to change every>> component's>> lifecycle.>>>> Sinks and Channels could very well have this problem; so what about giving>> each LifecycleAware component a takePulse method (or something); or to>> avoid creating a new method, add a new lifecycle state 'crashed' or such>> which, when detected, causes a restart of the component. Then components>> would just need to override getLifecycleState (if this method is polled>> regularly; I don't know. maybe use a listener for when there's a state>> change) to detect if it has crashed/needs to be restarted.>>>> Just my thoughts,>>>> - Connor>>>>>> On Wed, Jan 16, 2013 at 9:08 PM, Juhani Connolly <>> [EMAIL PROTECTED].**jp <[EMAIL PROTECTED]>>>> wrote:>>>> Hmm, overriding the implementation of getLifecycleState provided by>>> AbstractSource could work. It would be going against the convention that>>> has been maintained in all other components(that I can think of)>>>>>>>>> On 01/17/2013 01:20 PM, Brock Noland wrote:>>>>>> Hi,>>>>>>>> Yes I can definitely see the issue. It sucks that we'd have to add yet>>>> another thread. An alternative which wouldn't require another thread>>>> would be to check the optional interface in the supervisor,>>>> approximately here:>>>>>>>> https://github.com/apache/****flume/blob/trunk/flume-ng-**<https://github.com/apache/**flume/blob/trunk/flume-ng-**>>>>> core/src/main/java/org/apache/****flume/lifecycle/**>>>> LifecycleSupervisor.java#L240<**https://github.com/apache/**>>>> flume/blob/trunk/flume-ng-**core/src/main/java/org/apache/**>>>> flume/lifecycle/**LifecycleSupervisor.java#L240<https://github.com/apache/flume/blob/trunk/flume-ng-core/src/main/java/org/apache/flume/lifecycle/LifecycleSupervisor.java#L240>>>>> >>>>>>>>>>>>> However, I am not sold on the supervisor being the best place to fix

Yeah I'd prefer to avoid overriding getLifeCycle to do this sincegetters are typically lightweight. As such, I feel it violates thegeneral contract behind getters.

In regards to channels/sinks/pollable sources, I don't think they needthis behavior as sinks and poll-able sources both have threads pollingthem at regular intervals. We have two robust channels memory and filewhich don't require this behavior as well. Given this, I feel likethis is limited to EventDrivenSources which don't have a threadpolling them.

On second thought, FLUME-1630 got rid of one layer of threads in theflume agent, perhaps adding a single threaded scheduled executor toEventDrivenSource is not a bad solution.

Brock

On Wed, Jan 16, 2013 at 11:02 PM, Connor Woodson <[EMAIL PROTECTED]> wrote:> What I was trying to say is that the ScribeSource should not be responsible> for restarting itself (just from my understanding of your idea; it breaks> the existing paradigm as components should not have to control their own> lifecycle). Going from Brock's link, I feel the most dynamic solution would> be possible to just add in a lifecyle state RESTART and place it in that> switch statement; when that state is reached, it tries to stop then restart> the component and then sets the desired state to START (or STOP if it> couldn't start it again, or set error=true if it couldn't stop it).>> And in a way to prevent overriding getLifecycleState to return RESTART,> there could either be an inherited function from AbstractSource to call for> a restart, there could maybe be an option to set it to RESTART on crash, or> something else.>> I'll admit though that I know little about the lifecycle system, so I have> no idea if this idea is any better.>> - Connor>>> On Wed, Jan 16, 2013 at 10:21 PM, Juhani Connolly <> [EMAIL PROTECTED]> wrote:>>> Sink, Source and Channel all extend LifecycleAware, so the function is>> available to all components already.>>>> I was more questioning whether it was reasonable to start including logic>> to determine the state. That being said, I think the precedent of just>> returning the state set by start/stop is more one of habit, so on further>> thought I don't see it as being unreasonable. I'm going to give fixing>> ScribeSource with it a poke.>>>> As to new lifecycle states, I took a pass at reworking the lifecycle model>> with guavas service implementation in the past, but it took some very>> significant changes and didn't get the momentum/interest necessary to keep>> working on it. That ticket is here https://issues.apache.org/**>> jira/browse/FLUME-966 <https://issues.apache.org/jira/browse/FLUME-966> .>> Brock might be working on it now though the issue doesn't appear to have>> had attention since october.>>>>>> On 01/17/2013 03:01 PM, Connor Woodson wrote:>>>>> Why limit it to the sources? If there is going to be a change to one>>> component's lifecycle, then I see no reason not to change every>>> component's>>> lifecycle.>>>>>> Sinks and Channels could very well have this problem; so what about giving>>> each LifecycleAware component a takePulse method (or something); or to>>> avoid creating a new method, add a new lifecycle state 'crashed' or such>>> which, when detected, causes a restart of the component. Then components>>> would just need to override getLifecycleState (if this method is polled>>> regularly; I don't know. maybe use a listener for when there's a state>>> change) to detect if it has crashed/needs to be restarted.>>>>>> Just my thoughts,>>>>>> - Connor>>>>>>>>> On Wed, Jan 16, 2013 at 9:08 PM, Juhani Connolly <>>> [EMAIL PROTECTED].**jp <[EMAIL PROTECTED]>>>>> wrote:>>>>>> Hmm, overriding the implementation of getLifecycleState provided by>>>> AbstractSource could work. It would be going against the convention that>>>> has been maintained in all other components(that I can think of)>>>>>>>>

What would happen is:- source start. State=start- OOM exception happens- getLifecycleState called, override calls server.isServing(). This is a lightweight call. If it returns false, it either calls stop() or local lifeCycleState is set to ERROR (I'm not actually sure how this is handled). The latter is probably the correct call in this case, but I'm not entirely sure if the supervisor will ever restart it then.

The supervisor would still be doing any restarting. I guess it's really a matter of who detects the bad state(getLifecycleState or a scheduled runnable) and what that call is allowed to do(switch state to ERROR or call stop()). When the source breaks, changing its status to accurately reflect this is not breaking the paradigm(though calling stop() may be)

It looks to me like switching the state to ERROR should be sufficient... Monitor should then try to start the source again. However not calling stop() before that may cause resource leakage? Could someone confirm the behavior in regards to ERROR?

On 01/17/2013 04:02 PM, Connor Woodson wrote:> What I was trying to say is that the ScribeSource should not be responsible> for restarting itself (just from my understanding of your idea; it breaks> the existing paradigm as components should not have to control their own> lifecycle). Going from Brock's link, I feel the most dynamic solution would> be possible to just add in a lifecyle state RESTART and place it in that> switch statement; when that state is reached, it tries to stop then restart> the component and then sets the desired state to START (or STOP if it> couldn't start it again, or set error=true if it couldn't stop it).>> And in a way to prevent overriding getLifecycleState to return RESTART,> there could either be an inherited function from AbstractSource to call for> a restart, there could maybe be an option to set it to RESTART on crash, or> something else.>> I'll admit though that I know little about the lifecycle system, so I have> no idea if this idea is any better.>> - Connor>>> On Wed, Jan 16, 2013 at 10:21 PM, Juhani Connolly <> [EMAIL PROTECTED]> wrote:>>> Sink, Source and Channel all extend LifecycleAware, so the function is>> available to all components already.>>>> I was more questioning whether it was reasonable to start including logic>> to determine the state. That being said, I think the precedent of just>> returning the state set by start/stop is more one of habit, so on further>> thought I don't see it as being unreasonable. I'm going to give fixing>> ScribeSource with it a poke.>>>> As to new lifecycle states, I took a pass at reworking the lifecycle model>> with guavas service implementation in the past, but it took some very>> significant changes and didn't get the momentum/interest necessary to keep>> working on it. That ticket is here https://issues.apache.org/**>> jira/browse/FLUME-966 <https://issues.apache.org/jira/browse/FLUME-966> .>> Brock might be working on it now though the issue doesn't appear to have>> had attention since october.>>>>>> On 01/17/2013 03:01 PM, Connor Woodson wrote:>>>>> Why limit it to the sources? If there is going to be a change to one>>> component's lifecycle, then I see no reason not to change every>>> component's>>> lifecycle.>>>>>> Sinks and Channels could very well have this problem; so what about giving>>> each LifecycleAware component a takePulse method (or something); or to>>> avoid creating a new method, add a new lifecycle state 'crashed' or such>>> which, when detected, causes a restart of the component. Then components>>> would just need to override getLifecycleState (if this method is polled>>> regularly; I don't know. maybe use a listener for when there's a state>>> change) to detect if it has crashed/needs to be restarted.>>>>>> Just my thoughts,>>>>>> - Connor>>>>>>>>> On Wed, Jan 16, 2013 at 9

I don't know if this is within the scope of this change, but sinkssometimes will need to be restarted; for instance, I had an HDFS sink crashfrom an Out of Memory error (caused by the JIRA I filed; I wasn't planningon using a large heap) and it doesn't automatically restart; I don't knowif there's a nice way to detect of a sink/source has crashed, but if so itwould be nice to have them auto-restart when they go down.

> What I described isn't really taking control of lifecycle.>> What would happen is:> - source start. State=start> - OOM exception happens> - getLifecycleState called, override calls server.isServing(). This is a> lightweight call. If it returns false, it either calls stop() or local> lifeCycleState is set to ERROR (I'm not actually sure how this is handled).> The latter is probably the correct call in this case, but I'm not entirely> sure if the supervisor will ever restart it then.>> The supervisor would still be doing any restarting. I guess it's really a> matter of who detects the bad state(getLifecycleState or a scheduled> runnable) and what that call is allowed to do(switch state to ERROR or call> stop()). When the source breaks, changing its status to accurately reflect> this is not breaking the paradigm(though calling stop() may be)>> It looks to me like switching the state to ERROR should be sufficient...> Monitor should then try to start the source again. However not calling> stop() before that may cause resource leakage? Could someone confirm the> behavior in regards to ERROR?>>> On 01/17/2013 04:02 PM, Connor Woodson wrote:>>> What I was trying to say is that the ScribeSource should not be>> responsible>> for restarting itself (just from my understanding of your idea; it breaks>> the existing paradigm as components should not have to control their own>> lifecycle). Going from Brock's link, I feel the most dynamic solution>> would>> be possible to just add in a lifecyle state RESTART and place it in that>> switch statement; when that state is reached, it tries to stop then>> restart>> the component and then sets the desired state to START (or STOP if it>> couldn't start it again, or set error=true if it couldn't stop it).>>>> And in a way to prevent overriding getLifecycleState to return RESTART,>> there could either be an inherited function from AbstractSource to call>> for>> a restart, there could maybe be an option to set it to RESTART on crash,>> or>> something else.>>>> I'll admit though that I know little about the lifecycle system, so I have>> no idea if this idea is any better.>>>> - Connor>>>>>> On Wed, Jan 16, 2013 at 10:21 PM, Juhani Connolly <>> [EMAIL PROTECTED].**jp <[EMAIL PROTECTED]>>>> wrote:>>>> Sink, Source and Channel all extend LifecycleAware, so the function is>>> available to all components already.>>>>>> I was more questioning whether it was reasonable to start including logic>>> to determine the state. That being said, I think the precedent of just>>> returning the state set by start/stop is more one of habit, so on further>>> thought I don't see it as being unreasonable. I'm going to give fixing>>> ScribeSource with it a poke.>>>>>> As to new lifecycle states, I took a pass at reworking the lifecycle>>> model>>> with guavas service implementation in the past, but it took some very>>> significant changes and didn't get the momentum/interest necessary to>>> keep>>> working on it. That ticket is here https://issues.apache.org/**>>> jira/browse/FLUME-966 <https://issues.apache.org/**jira/browse/FLUME-966<https://issues.apache.org/jira/browse/FLUME-966>>>>> .>>>>>> Brock might be working on it now though the issue doesn't appear to have>>> had attention since october.>>>>>>>>> On 01/17/2013 03:01 PM, Connor Woodson wrote:>>>>>> Why limit it to the sources? If there is going to be a change to one

NEW: Monitor These Apps!

All projects made searchable here are trademarks of the Apache Software Foundation.
Service operated by Sematext