I have a situation where I'm seeing duplicated data downstream before the demux process. It appears this happens during high system loads and we are still using the 0.3.0 series.

So, we have validated that there is a single, unique entry in our source file which then shows up a random number of times before we see it in demux. So, it appears that there is duplication happening somewhere between the agent and collector.

Has anyone else seen this? Any ideas as to why we are seeing this during high system loads, but not during lower loads?

This is expected in Chukwa archives. When an agent is unable to post to the collector, it retries posting the same data to another collector, or retries with the same collector when no other collector is available. In a high-load situation, a collector may have written the data without a proper acknowledgement getting back to the agent. The Chukwa philosophy is to retry until an acknowledgement is received; duplicated data is filtered after the data has been received.

The duplication filtering in Chukwa 0.3.0 depends on loading the data into MySQL: rows with the same primary key update the same row, which removes duplicates. It would be possible to build a duplicate-detection process prior to demux that filters data based on sequence id + data type + csource (host), but this hasn't been implemented because the primary-key update method works well for my use case.

In Chukwa 0.5, we treat duplication the same as in Chukwa 0.3: any duplicated row in HBase is replaced, based on Timestamp + HBase row key.

regards,
Eric

On Thu, Oct 21, 2010 at 8:22 PM, Matt Davies <[EMAIL PROTECTED]> wrote:

On Fri, Oct 22, 2010 at 8:48 AM, Eric Yang <[EMAIL PROTECTED]> wrote:

This isn't quite right. There is support in 0.3 and later versions for doing de-duplication at the collector, in the manner Eric describes. It works as a filter in the writer pipeline.

You need the following in your configuration:

<property>
<name>chukwaCollector.writerClass</name>
<value>org.apache.hadoop.chukwa.datacollection.writer.PipelineStageWriter</value>
</property>

<property>
<name>chukwaCollector.pipeline</name>
<value>org.apache.hadoop.chukwa.datacollection.writer.Dedup,org.apache.hadoop.chukwa.datacollection.writer.SeqFileWriter</value>
</property>

See http://incubator.apache.org/chukwa/docs/r0.3.0/collector.html for background

--Ari

--
Ari Rabkin [EMAIL PROTECTED]
UC Berkeley Computer Science Department

I've been playing with several ideas on where to put the correction in our system. Upon investigation, it seems that two separate demux operations see the duplicate record, so doing some sort of distinct in demux seems unreliable for our use.

It appears you are putting data into a database and using the DB to enforce the uniqueness constraint. Do you see any way we could do a dedup operation after demux (within the Chukwa environment) if we write our data straight into HDFS?

I could see writing a simple MR job to go and figure this stuff out for me, but it seems very inelegant and introduces more delay before I can utilize the data.

Any other thoughts?

"Eric Yang" <[EMAIL PROTECTED]> said:

Note, the Dedup collector is only good for a single collector. If you use multiple collectors, it will not help.

Regards,
Eric

On 10/22/10 9:21 AM, Matt Davies <[EMAIL PROTECTED]> wrote:
> Thank you for the insight.

Eric, in Chukwa 0.5 is HBase the final store instead of HDFS? What format will the HBase data be in (e.g. a ChukwaRecord object? Something user-configurable?)

Sent from my iPhone

On Oct 22, 2010, at 8:48 AM, Eric Yang <[EMAIL PROTECTED]> wrote:


Eric, I'm also curious about how the HBase integration works. Do you have time to write something up on it? I'm interested in the possibility of extending what's there to write my own custom data into HBase from a collector, while said data also continues through to HDFS as it does currently.

On Fri, Oct 22, 2010 at 5:21 PM, Corbin Hoenes <[EMAIL PROTECTED]> wrote:

+1

I imagine it is just another pipelinable class loaded into the collector? If so, Bill's scenario would work.

Sent from my iPhone

On Oct 23, 2010, at 12:59 PM, Bill Graham <[EMAIL PROTECTED]> wrote:


HBase only supports bytes. What to store in the cell is decided by the demux parser. Chukwa data is currently stored as byte strings by the parsers that I implemented. The user has full control of the data type stored in each HBase column by customizing the demux parser.

regards,
Eric

On Fri, Oct 22, 2010 at 5:21 PM, Corbin Hoenes <[EMAIL PROTECTED]> wrote:

There is an architecture diagram that describes the new setup. Your existing parser should work with Chukwa 0.5, and by adding Chukwa annotations to the parser, it will stream data into the HBase table. I recommend taking a look at the SystemMetrics demux parser; it's a good example to follow for updating your existing parser to work with HBase.

In the default chukwa-collector-conf.xml.template, there is a section for HBase configuration. Uncomment it, and comment out the default SeqFileWriter. Restart the collector, and data should appear in HBase.
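For illustration (the writer class names below are taken from elsewhere in this thread; the exact contents of the template may differ), the uncommented section should end up along these lines:

<property>
<name>chukwaCollector.writerClass</name>
<value>org.apache.hadoop.chukwa.datacollection.writer.PipelineStageWriter</value>
</property>

<property>
<name>chukwaCollector.pipeline</name>
<value>org.apache.hadoop.chukwa.datacollection.writer.hbase.HBaseWriter</value>
</property>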

regards,
Eric

On Sat, Oct 23, 2010 at 12:59 PM, Bill Graham <[EMAIL PROTECTED]> wrote:

Yes, you are right. It should work automatically after the annotation is added to his demux parser.

regards,
Eric

On Sat, Oct 23, 2010 at 1:27 PM, Corbin Hoenes <[EMAIL PROTECTED]> wrote:

Thanks Eric, this is helpful. I dug around in the following files and I think I have a handle on what's happening, but I could use some clarifications:

oahc.datacollection.adaptor.SyslogAdaptor
oahc.extraction.demux.processor.mapper.SysLog
oahc.datacollection.writer.hbase.OutputCollector
conf/hbase.schema
conf/chukwa-collector-conf.xml.template

To make sure I'm clear, let me know if this is accurate:

1. SyslogAdaptor sends syslog message byte arrays as the chunk body, bound to the dataType for that facility.

2. In the collector configs, this config says to write data to HBase only:

<property>
<name>chukwaCollector.pipeline</name>
<value>org.apache.hadoop.chukwa.datacollection.writer.SocketTeeWriter,org.apache.hadoop.chukwa.datacollection.writer.hbase.HBaseWriter</value>
</property>

If I also wanted to write data to HDFS, would I just need to add ",org.apache.hadoop.chukwa.datacollection.writer.SeqFileWriter" as a third item in the chain?
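Something like this is what I have in mind (untested; just splicing the SeqFileWriter from the 0.3 Dedup example onto the end of the chain):

<property>
<name>chukwaCollector.pipeline</name>
<value>org.apache.hadoop.chukwa.datacollection.writer.SocketTeeWriter,org.apache.hadoop.chukwa.datacollection.writer.hbase.HBaseWriter,org.apache.hadoop.chukwa.datacollection.writer.SeqFileWriter</value>
</property>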

3. In the collector configs, all packages beneath the package configured in hbase.demux.package are checked for the annotated classes (it would be useful to have this also take a comma-separated list at some point, for extensibility). What about the data being sent indicates that the SysLog processor should be used?

4. The collector, via HBaseWriter, writes the data to the SystemMetrics/SysLog table/family in HBase per the annotations. Looking at OutputCollector, it appears the following data is set:

- key is taken as '[source]-[ts]' from the ChukwaRecordKey
- column family seems to be taken as the reduceType (i.e. dataType), but I thought that was set by the annotation in SysLog. Which is it?
- column name/value is every field name and value in the ChukwaRecord.

This last part is throwing me off though, since I can't see where field names and values are set on your ChukwaRecord. Can you clarify? It seems like the record was just the entire byte array payload of the syslog message.

Btw, the documentation is a big help, thanks, but one bit of feedback is that the "Configure Log4j syslog appender" section is confusing w.r.t. which nodes you're speaking of. I assume you're talking about the Hadoop nodes being monitored, but is there anything about this approach that limits it to monitoring Hadoop nodes only? Either way, which nodes are being discussed and which Hadoop cluster needs to be rebooted should be clarified.

thanks,
Bill

On Sat, Oct 23, 2010 at 8:34 PM, Eric Yang <[EMAIL PROTECTED]> wrote:

> 1. SyslogAdaptor sends syslog message byte arrays as the chunk body
> bound to the dataType for that facility.

Yes. A syslog message looks like this:

<142>This is a log entry

The facility is derived from the leading 3-digit priority value (priority = facility number * 8 + severity; here 142 = 17 * 8 + 6, i.e. facility LOCAL1). The SyslogAdaptor manually maps the 24 existing facilities onto data types that make sense to Chukwa. For example, when a syslog message arrives at a SyslogAdaptor running on port 9095 and its facility LOCAL1 is mapped to HADOOP, the chunk data is stamped as HADOOP for demux. This mapping is added in chukwa-agent-conf.xml.
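A sketch of what such a mapping might look like (the property key below is a guess for illustration only, not the verified SyslogAdaptor name; check the SyslogAdaptor source for the real key):

<!-- hypothetical property key: map facility LOCAL1 on port 9095 to the HADOOP data type -->
<property>
<name>syslog.adaptor.port.9095.facility.LOCAL1</name>
<value>HADOOP</value>
</property>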

> 2. [config snipped] If I also wanted to write data to HDFS, would I just need to add
> ",org.apache.hadoop.chukwa.datacollection.writer.SeqFileWriter" as a
> third item in the chain?

> 3. In the collector configs, all packages beneath the package configured in
> hbase.demux.package would be checked for the annotated classes. What about
> the data being sent indicates that the SysLog processor should be used?

HBaseWriter reads chukwa-demux-conf.xml if it is available in the collector's conf directory. Hence, the mapping of data type to parser is the same as for demux on HDFS.

> 4. The collector via HBaseWriter writes the data to the SystemMetrics/SysLog
> table/family in HBase per the annotations. [...] column family seems to be
> taken as the reduceType (i.e. dataType), but I thought that was set by the
> annotation in SysLog. Which is it? [...] I can't see where field names and
> values are set on your ChukwaRecord. Can you clarify?

This is currently set to the reduceType. The annotation for column does nothing at this moment. In the future, it would be nice to have the reduce type map to the annotation, but that would make the demux process look more like ORM entity-bean code, and I am not sure that is something we want Chukwa to do. It is nicer to have Apache Gora handle ORM for HBase, so Chukwa doesn't detour from its original objective.

SystemMetrics writes to the SystemMetrics table. Hadoop logs streamed through SyslogAdaptor are mapped to HADOOP. I have not tested the HADOOP parser to see if Hadoop log processing is working; this is on my TODO list. In theory, it should work. ;)

The annotation in SyslogAdaptor only defines which data type it is; it does not define which parser processes the data. That is done by the demux configuration. I think the default behavior of mapping data type to demux parser probably threw you off into assuming the data is processed by oahc.extraction.demux.processor.mapper.SysLog. Instead, you need to make sure there is configuration in the agent that maps the facility name to a data type of your choice, and configure demux to invoke the proper parser. Say you are sending /var/log/messages with SyslogAdaptor, mapping the facility name to SysLog, and configuring demux to use SysLog: the logs will appear in the HBase SystemMetrics table, in the SysLog column family, with a column called "body" which contains all your log entries. The buildGenericRecord method creates the default record with the body field.
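Assuming the demux mapping follows the same property convention as the other Chukwa conf files (an assumption; check chukwa-demux-conf.xml for the authoritative shape), the data-type-to-parser mapping might look like:

<!-- hypothetical sketch: map the SysLog data type to the SysLog demux parser -->
<property>
<name>SysLog</name>
<value>org.apache.hadoop.chukwa.extraction.demux.processor.mapper.SysLog</value>
</property>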

There is some cleanup work to do to decouple the entity bean from our parser; then demux will look nice and neat. We should change the serialization of ChukwaRecord to Avro; then it will make a lot of sense, and be easier to annotate columns. For now, I only got the bare minimum working.

Any log file written by SyslogAppender can be streamed over to SyslogAdaptor. The only two required pieces are to write a demux parser which can process your log file, and to map the facility name to the demux parser. For Hadoop, the modification to log4j.properties should apply to all nodes (namenode, jobtracker, datanode, tasktracker, secondary namenode), so all logs can be streamed over and processed. However, there is a lot of data, and the current Chukwa parsers are not written to pick up all the details. When log4j.properties is changed, you will need to restart the cluster for the changes to take effect. Hope this helps.

Regards,
Eric
