This is kind of a generic HDFS question, but it does relate to flume, so hopefully someone can provide feedback.

I have a flume configuration that sinks to HDFS using timestamp headers. I would like to set up a post-processor using Oozie to pull the data into Hive as it lands in HDFS, doing some cleaning and compression along the way.
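For context, the relevant part of the sink configuration looks roughly like this (property names are from the Flume HDFS sink docs; the agent, channel, and sink names are placeholders):

```
agent.sinks.hdfsSink.type = hdfs
agent.sinks.hdfsSink.channel = memChannel
# %Y/%m/%d escapes are resolved from the timestamp header on each event
agent.sinks.hdfsSink.hdfs.path = /flume/events/%Y/%m/%d
agent.sinks.hdfsSink.hdfs.rollInterval = 300
```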

However, I am running into an issue: if I inadvertently read a .tmp file, the flume agent that is writing to it stops sinking with an HDFS error.

The flume docs state "The file in use will have the name mangled to include ".tmp" at the end. Once the file is closed, this extension is removed. This allows excluding partially complete files in the directory." but I cannot figure out how to exclude files based on extension via either Pig or Hive.
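As a workaround in the meantime, one option is to build an explicit input list that skips in-progress files before handing paths to the Pig or Hive job. A minimal sketch of the filtering logic (assuming `.tmp` is the only in-use suffix in play):

```python
def completed_files(paths):
    """Return only the files Flume has finished writing.

    The HDFS sink renames a file to drop the '.tmp' suffix once the
    file is closed, so anything still carrying the suffix is in
    progress and unsafe to read.
    """
    return [p for p in paths if not p.endswith(".tmp")]


listing = ["events.1356.log", "events.1357.log.tmp", "events.1358.log"]
print(completed_files(listing))  # the .tmp file is skipped
```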

In general I should not need to exclude anything, since I can reasonably assume the directory is done being written to. But if flume is delayed, or my upstream app agent is slow to start the data flow, the directory could still be being written to when the Oozie coordinator materializes a job.
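One way to avoid racing the writer is Oozie's dataset done-flag: the coordinator will not consider an instance ready until a marker file exists in the directory. A rough sketch (the dataset name, dates, and `_DONE` marker are illustrative; something upstream would have to write the marker once the period is closed out):

```xml
<dataset name="flumeEvents" frequency="${coord:days(1)}"
         initial-instance="2012-12-27T00:00Z" timezone="UTC">
  <uri-template>${nameNode}/flume/events/${YEAR}/${MONTH}/${DAY}</uri-template>
  <!-- the coordinator waits for this marker before materializing the action -->
  <done-flag>_DONE</done-flag>
</dataset>
```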

It seems like this should be easy, but I'm not having any luck searching for a solution. Any insight or advice is appreciated.

thank you,
Paul Chavez

We recently committed https://issues.apache.org/jira/browse/FLUME-1702 to trunk. This will be available in the next release of Flume. This should help in the Pig case, not sure about Hive though.

Hari

On Thursday, December 27, 2012, Paul Chavez wrote:

> (quoted text trimmed)
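A hedged sketch of how the change referenced above might be used (the `hdfs.inUsePrefix` property name is taken from the FLUME-1702 patch; agent and sink names are placeholders). Hadoop's default input path filter skips files whose names begin with `_` or `.`, which is presumably why this helps the Pig/MapReduce case:

```
# mark in-progress files as hidden so FileInputFormat-based readers skip them
agent.sinks.hdfsSink.hdfs.inUsePrefix = _
```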

