We are setting up a flume cluster but we are facing some issues related to heap size (out of memory). Is there a standard configuration for a standard load?

If there is, could you suggest what it would be for the load stats given below?

Also, we are not sure what topology to go ahead with in our use case.

We basically have two web servers, each of which can generate logs at a rate of 2000 entries per second, with each entry around 137 bytes in size.
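(For scale, that works out to roughly 2000 x 137 B = 274 KB/s per server, or just under 1 GB of raw log data per web server per hour.)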

Currently we use rsyslog (writing to a tcp port), to which a php script writes these logs. We also run a local flume agent on each web server; these local agents listen on that tcp port and put the data directly into hdfs.

So localhost:tcpport is the flume source and hdfs is the flume sink.
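For reference, a minimal sketch of what such a single-hop agent config might look like (the agent name, port, and paths below are illustrative placeholders, not the actual values from this setup):

    a1.sources = syslog-in
    a1.channels = ch1
    a1.sinks = hdfs-out

    # rsyslog forwards log lines to this local tcp port
    a1.sources.syslog-in.type = syslogtcp
    a1.sources.syslog-in.host = localhost
    a1.sources.syslog-in.port = 5140
    a1.sources.syslog-in.channels = ch1

    # in-memory buffer between source and sink
    a1.channels.ch1.type = memory
    a1.channels.ch1.capacity = 10000
    a1.channels.ch1.transactionCapacity = 1000

    # write events straight into hdfs as plain text
    a1.sinks.hdfs-out.type = hdfs
    a1.sinks.hdfs-out.channel = ch1
    a1.sinks.hdfs-out.hdfs.path = hdfs://namenode:8020/logs/web
    a1.sinks.hdfs-out.hdfs.fileType = DataStream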

I am confused between three approaches:

Approach 1: Web Server, RSyslog & Flume Agent on the same machine, and a Flume collector running on the Namenode in the hadoop cluster to collect the data and dump it into hdfs.

Approach 2: Web Server & RSyslog on the same machine, and a Flume collector (listening on a remote port for events written by rsyslog on the web server) running on the Namenode in the hadoop cluster, to collect the data and dump it into hdfs.

Approach 3: Web Server, RSyslog & Flume Agent on the same machine, and all agents writing directly to hdfs.

Also, we are using hive, so we are writing directly into partitioned directories. We want an approach that allows us to write into hourly partitions.
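For the hourly-partition requirement, the HDFS sink can write into time-bucketed directories via escape sequences in hdfs.path; a sketch, assuming a hypothetical Hive warehouse path and that events carry a timestamp header (or that hdfs.useLocalTimeStamp is enabled):

    a1.sinks.hdfs-out.hdfs.path = /user/hive/warehouse/weblogs/dt=%Y-%m-%d/hr=%H
    a1.sinks.hdfs-out.hdfs.useLocalTimeStamp = true
    # round timestamps down so events land in clean one-hour buckets
    a1.sinks.hdfs-out.hdfs.round = true
    a1.sinks.hdfs-out.hdfs.roundValue = 1
    a1.sinks.hdfs-out.hdfs.roundUnit = hour

Hive would then just need the corresponding partitions added (e.g. via ALTER TABLE ... ADD PARTITION) as each hour's directory appears.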

Are you using the memory channel? You mention you are getting OOMEs, but you don't even say what heap you are setting on the flume jvm.

Don't run an agent on the namenode. Occasionally you will see folks installing an agent on one of the datanodes in the cluster, but it's not typically recommended. It's fine to install the agent on your webserver, but perhaps a more scalable approach would be to dedicate two servers to flume agents. This will allow you to load balance your writes into the flume pipeline at some point. As you scale, you will not want every agent writing to hdfs, so at some point you may consider adding a collector tier that will aggregate the flow and reduce the number of connections going into your hdfs cluster.
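To illustrate that shape: each web-tier agent can spread its writes across dedicated collector agents using a load-balancing sink group of avro sinks (the hostnames and port below are hypothetical):

    a1.sinks = c1 c2
    a1.sinkgroups = g1
    a1.sinkgroups.g1.sinks = c1 c2
    a1.sinkgroups.g1.processor.type = load_balance
    a1.sinkgroups.g1.processor.selector = round_robin
    a1.sinkgroups.g1.processor.backoff = true

    # two collector-tier agents, each running an avro source on 4545
    a1.sinks.c1.type = avro
    a1.sinks.c1.channel = ch1
    a1.sinks.c1.hostname = collector1.example.com
    a1.sinks.c1.port = 4545

    a1.sinks.c2.type = avro
    a1.sinks.c2.channel = ch1
    a1.sinks.c2.hostname = collector2.example.com
    a1.sinks.c2.port = 4545

Only the collectors would then own HDFS sinks, which keeps the number of connections into the cluster fixed as the web tier grows.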

Yes, I am using the memory channel, and that's because I want it to be more reliable and not miss any events/messages. As I've read in the flume documentation, the memory channel is fast but there is a chance of missing events if the in-memory buffer fills up.

I am sorry for not mentioning the heap settings, but I was running it with the default jvm settings, which I later increased to 1GB; after that I did not get the OOME. But then again, I am not sure what the right setting is, or maybe it is more of a trial-and-error setting depending on our data load and environment.

So, as per your suggestion, I need to consider having two dedicated machines running flume agents for the two web servers, and one for a collector? We have just started working on flume, and I think your suggestion really makes sense because we are pretty sure it is going to scale.
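For what it's worth, the stock flume-ng startup script in many versions defaults to a very small heap (as little as -Xmx20m), which would explain OOMEs under default settings. The heap is typically set in conf/flume-env.sh; a sketch with illustrative sizes:

    # conf/flume-env.sh
    export JAVA_OPTS="-Xms1g -Xmx1g"

A rough sizing rule for the memory channel is channel capacity x average event size, plus JVM overhead; at 137-byte events, even a million-event channel is only about 140 MB, so a 1 GB heap leaves ample headroom.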

Also, we are using rsyslog to log to a tcp port on localhost, with flume listening on that tcp port on the same machine. Is that a good and reliable design? We tried the exec source with a tail -F command on the log file, but I guess that's not a very dependable way (as also mentioned in the flume documentation), since it fetches all the rows from the file again if flume restarts. I am also a little skeptical of the logrotate cron that rotates the logs, as I did a few tests and found a lot of problems with it.

Whereas the rsyslog tcp option can spool data to local disk if the tcp queue gets full, so even if flume goes down we don't lose the data.
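That behavior comes from rsyslog's disk-assisted action queues; a sketch in rsyslog's newer config syntax (the port matches the hypothetical 5140 used above, and the sizes are placeholders):

    # forward logs to the local flume syslog source; spill to disk if tcp blocks
    action(type="omfwd" target="127.0.0.1" port="5140" protocol="tcp"
           queue.type="LinkedList" queue.filename="flume_fwd"
           queue.maxDiskSpace="1g" queue.saveOnShutdown="on"
           action.resumeRetryCount="-1")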

One more thing: I installed cloudera manager a week back, but I have done all my testing using flume from the command line. I want to know if I could use cloudera manager to install and manage flume instances on the new machines. It'd be great to have one UI to manage all the agents and collector nodes and even change their configurations.

We are very much beginners in this field, so any suggestions or recommendations are welcome. Thanks for your help :)

Mohit

Memory channel is not reliable, meaning if the flume agent goes down or is restarted while there are events in the channel, then this data will be lost. For reliability please use the file channel.

Are you using the multiport syslog source? This is definitely a better option than the exec source.
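A sketch combining those two suggestions, extending the earlier example config with hypothetical ports and local paths (the file channel needs dedicated disk directories for its checkpoint and data files):

    a1.channels = fch

    # one source can now serve multiple syslog tcp ports
    a1.sources.syslog-in.type = multiport_syslogtcp
    a1.sources.syslog-in.host = 0.0.0.0
    a1.sources.syslog-in.ports = 5140 5141
    a1.sources.syslog-in.channels = fch

    # durable channel: buffered events survive an agent crash or restart
    a1.channels.fch.type = file
    a1.channels.fch.checkpointDir = /var/lib/flume/checkpoint
    a1.channels.fch.dataDirs = /var/lib/flume/data
    a1.channels.fch.capacity = 1000000
    a1.channels.fch.transactionCapacity = 1000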


Jeff,

I am using an upstream agent with a spooling directory source and a memory channel, and the downstream agent uses a memory channel and an HDFS sink. If my downstream agent goes down for any reason, are the entries lost in the downstream agent's memory channel still preserved in the memory channel / file directory of the upstream agent?

No. If you need to guarantee delivery of events please use a file channel.
https://blogs.apache.org/flume/entry/apache_flume_filechannel

So, this basically means that Flume's transactional model is also unreliable. That would have to mean that the downstream agent is sending an ack to the upstream agent before it actually persists the event.

All the best, Chris
