Hi Roberto,
Setting all the roll intervals to 0 disables file rolling in the HDFS sink entirely. Try setting
hdfs.rollCount to the number of messages you want to roll the file on (i.e. the number of
messages per file). Bear in mind that setting this low will result in higher HDFS overhead
(many small files).
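For example, a minimal sketch against your configuration below (the agent/sink names match yours; the roll count of 10000 is just an illustrative value, not a recommendation):

```properties
# Roll (close and start) a new HDFS file after 10000 events;
# 0 would disable count-based rolling entirely.
my_agent.sinks.my_sink.hdfs.rollCount = 10000
# Keep time- and size-based rolling disabled, as in your current config
my_agent.sinks.my_sink.hdfs.rollInterval = 0
my_agent.sinks.my_sink.hdfs.rollSize = 0
```

Note that hdfs.batchSize only controls how many events are written to HDFS per flush; it does not affect when files are rolled, which is governed by the hdfs.roll* properties above.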
--
Chris Horrocks
On Wed, Nov 16, 2016 at 10:35 am, Roberto Coluccio <'roberto.coluccio@eng.it'> wrote:
Hello folks,
I'm testing a Flume agent defined by a topology made of :
JMS source (Tibco implementation) -> memory channel -> hdfs sink
The JMS source has:
- my_agent.sources.my_source.batchSize = 100
The memory channel has:
- my_agent.channels.my_channel.capacity = 100
The HDFS sink has:
- my_agent.sinks.my_sink.hdfs.batchSize = 100
- my_agent.sinks.my_sink.hdfs.rollCount = 0
- my_agent.sinks.my_sink.hdfs.rollInterval = 0
- my_agent.sinks.my_sink.hdfs.idleTimeout = 0
I don't understand how/why new files on HDFS are created/closed. In fact, when I:
1. launch the agent (JMS queue empty)
2. push a new text message on the JMS queue
a new file is created by the HDFS sink, but not yet closed (as I expect). BUT, when I
3. push again a new text message on the JMS queue
then, regardless of how much time I waited before performing step 3, the HDFS sink closes the
previously open file, then opens a new one for the new incoming message consumed from the
queue and processed through the channel.
This way, files always contain one and only one message. I was expecting that number
to be 100, according to the configuration mentioned above.
Any hints?
Best regards,
Roberto