Create a SpoolDirectory Source and Client

Details

Description

The proposal is to create a small executable client which reads logs from a spooling directory and sends them to a Flume agent, then performs cleanup on the directory (either by deleting or moving the logs). It would make the following assumptions:

Files placed in the directory are uniquely named

Files placed in the directory are immutable

The problem this is trying to solve is that there is currently no way to do guaranteed event delivery across Flume agent restarts when the data is being collected through an asynchronous source (and not directly from the client API). Say, for instance, you are using an exec("tail -F") source. If the agent restarts, whether due to an error or intentionally, tail may pick up at a new location and you lose the intermediate data.

At the same time, there are users who want at-least-once semantics, and expect those to apply as soon as the data is written to disk from the initial logger process (e.g. apache logs), not just once it has reached a flume agent. This idea would bridge that gap, assuming the user is able to copy immutable logs to a spooling directory through a cron script or something.

The basic internal logic of such a client would be as follows:

Scan the directory for files

Choose a file and read through it, sending events to an agent

Close the file and delete it (or rename, or otherwise mark completed)

That's about it. We could add sync-points to make recovery more efficient in the case of failure.
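
The three steps above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patch attached to this issue: the class name `SpoolDirectorySketch`, the method `drainOnce`, the `.COMPLETED` suffix, and the `Consumer<String>` standing in for a connection to an agent are all hypothetical, and it assumes Java 8 or later with newline-delimited UTF-8 files.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Consumer;

final class SpoolDirectorySketch {
    // Hypothetical marker appended to a file once every line has been sent.
    static final String DONE_SUFFIX = ".COMPLETED";

    /** Makes one pass over the directory and returns the number of files processed. */
    static int drainOnce(Path spoolDir, Consumer<String> sender) throws IOException {
        // Step 1: scan the directory for files that are not yet marked completed.
        List<Path> pending = new ArrayList<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(spoolDir)) {
            for (Path p : entries) {
                boolean done = p.getFileName().toString().endsWith(DONE_SUFFIX);
                if (Files.isRegularFile(p) && !done) {
                    pending.add(p);
                }
            }
        }
        Collections.sort(pending); // deterministic order for the sketch

        for (Path file : pending) {
            // Step 2: read the file line by line, handing each line to the sender.
            try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
                String line;
                while ((line = reader.readLine()) != null) {
                    sender.accept(line);
                }
            }
            // Step 3: mark the file completed by renaming it. This relies on the
            // "uniquely named" assumption: the move fails if the target exists.
            Files.move(file, file.resolveSibling(file.getFileName() + DONE_SUFFIX));
        }
        return pending.size();
    }
}
```

Because a file is only renamed after its last line has been handed off, a crash partway through means the whole file is read again on the next pass. That gives at-least-once behaviour with possible duplicates, which is the cost the sync-points mentioned above would reduce.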

A key question is whether this should be implemented as a standalone client or as a source. My instinct is actually to do this as a source, but there could be some benefit to not requiring an entire agent in order to run this, specifically that it would become platform independent and you could stick it on Windows machines. Others I have talked to have also sided on a standalone executable.

NO NAME
added a comment - 06/Nov/12 06:33 This patch addresses a bug in the way that file timestamps are treated in the unit tests.
Due to varying time granularity in filesystems, tests had inconsistent results. This should fix that error.

Brock Noland
added a comment - 02/Nov/12 11:53 On a mac, java sets java.io.tmpdir (where that temp directory is created) to a very weird location, like:
/private/var/folders/b4/b44x97M0GFydt3jCKcowsU+++TI/-Tmp-/
so it's possible something is hosed up in that respect.

Mike Percy
added a comment - 02/Nov/12 09:33 Aaron Baff: Would love to get enhancements on top of this work after it's committed. You may want to file a JIRA for that.
NO NAME: I am still getting a unit test error on my Mac. I'll try to dig into it more tomorrow. This is the stack trace:
-------------------------------------------------------------------------------
Test set: org.apache.flume.client.avro.TestSpoolingFileLineReader
-------------------------------------------------------------------------------
Tests run: 16, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.241 sec <<< FAILURE!
testBehaviorWithEmptyFile(org.apache.flume.client.avro.TestSpoolingFileLineReader) Time elapsed: 0.007 sec <<< FAILURE!
java.lang.AssertionError
at org.junit.Assert.fail(Assert.java:92)
at org.junit.Assert.assertTrue(Assert.java:43)
at org.junit.Assert.assertTrue(Assert.java:54)
at org.apache.flume.client.avro.TestSpoolingFileLineReader.testBehaviorWithEmptyFile(TestSpoolingFileLineReader.java:396)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:45)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:42)
at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28)
at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:30)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:263)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:68)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:47)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:231)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:60)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:229)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:50)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:222)
at org.junit.runners.ParentRunner.run(ParentRunner.java:300)
at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:236)
at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:134)
at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:113)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.maven.surefire.util.ReflectionUtils.invokeMethodWithArray(ReflectionUtils.java:189)
at org.apache.maven.surefire.booter.ProviderFactory$ProviderProxy.invoke(ProviderFactory.java:165)
at org.apache.maven.surefire.booter.ProviderFactory.invokeProvider(ProviderFactory.java:85)
at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:103)
at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:74)

Aaron Baff
added a comment - 01/Nov/12 18:17 My fairly naive implementation of a Source to take files from a directory and process them in parallel. Not providing a patch as I'm intending this to be more of an inspiration to Patrick or Mike if they see anything useful in it. Or anyone else for that matter.

Mike Percy
added a comment - 31/Oct/12 01:10 @Aaron makes sense. I think we can add that functionality to this in the next iteration.
@Philz Agreed on the Guava thing. Adding a deserialization API on top of this is in scope for FLUME-1633.

Philip Zeyliger
added a comment - 31/Oct/12 01:02 Cool stuff!
Not sure if Flume already uses Guava, but if it does, I recommend Charsets.UTF_8 instead of calling Charset.forName("UTF-8") multiple times.
It might be useful to comment explicitly that this is designed to read new-line delimited data. There is no way currently to configure this to read the entire file as a single record.
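
To make the line-delimited point concrete, here is a small sketch contrasting the two reading modes. It is only an illustration and does not reflect the patch's API: the class and method names are hypothetical, and it uses the JDK's `StandardCharsets.UTF_8` constant (Java 7+) in place of Guava's `Charsets.UTF_8` so that it needs no extra dependency.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

final class LineVersusWholeFileSketch {
    /** One record per line: the newline-delimited behaviour described above. */
    static List<String> asLines(Path file) throws IOException {
        return Files.readAllLines(file, StandardCharsets.UTF_8);
    }

    /** The whole file as a single record, line terminators included. */
    static String asSingleRecord(Path file) throws IOException {
        return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
    }
}
```

Both helpers load the file into memory, so they only suit files of modest size.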

Aaron Baff
added a comment - 30/Oct/12 23:58 You can always write a Source that builds on top of this. I've written one which takes a file or directory, reads in all files from the directory (recursive is an option), and submits them to a ThreadPoolExecutor whose total thread count you configure. Worked quite well, and allows for the Sink to run slow, unlike something like a `cat` EXEC Source, which will just lose records if the Channel/Sink can't keep up.
Now, it doesn't monitor the directory for new files, and doesn't rename them or look for a specific pattern, but the latter two wouldn't be too hard to add. Possibly add a monitor that scans every X seconds for new files matching the correct pattern and puts them on the Executor to pull in.
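
For readers who want a feel for the approach described in this comment, here is a rough sketch. It is not the attached implementation: `ParallelFileReaderSketch`, `readAll`, and the `Consumer<String>` standing in for the channel are hypothetical names, and it assumes Java 8 or later. The consumer is called from several threads at once, so it must be thread-safe.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class ParallelFileReaderSketch {
    /** Reads every file under root with a fixed-size pool; returns total lines sent. */
    static long readAll(Path root, boolean recursive, int threads, Consumer<String> sink)
            throws IOException, InterruptedException, ExecutionException {
        // Collect the files first; a depth of 1 means only direct children of root.
        List<Path> files;
        try (Stream<Path> walk = Files.walk(root, recursive ? Integer.MAX_VALUE : 1)) {
            files = walk.filter(Files::isRegularFile).collect(Collectors.toList());
        }

        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Long>> results = new ArrayList<>();
            for (Path file : files) {
                // One task per file, so a single file is never split across readers.
                Callable<Long> task = () -> {
                    long count = 0;
                    try (BufferedReader reader =
                             Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
                        String line;
                        while ((line = reader.readLine()) != null) {
                            sink.accept(line); // blocks if the sink is slow; nothing is dropped
                            count++;
                        }
                    }
                    return count;
                };
                results.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> result : results) {
                total += result.get(); // rethrows any read failure from the task
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }
}
```

Like the implementation described above, this sketch does not watch for new files or rename processed ones; a periodic rescan could be layered on with a `ScheduledExecutorService`.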

Mike Percy
added a comment - 30/Oct/12 18:56 Hans, that is out of scope of this JIRA but it could potentially be done on top of this work in the future using file locks or something like that. Obviously with concurrency it gets more complicated.

Hans Uhlig
added a comment - 30/Oct/12 17:54 Is there any ability to scale this horizontally without having event duplication? One file per client, obviously, but multiple file readers for faster overall I/O (the specific use case is high-bandwidth NAS/SAN drives).

NO NAME
added a comment - 14/Aug/12 22:58 This patch is ready for review. It creates both a new source (SpoolDirectorySource) and adds a spooling directory capability to the existing avro client.
Includes extensive unit tests - probably best to start with those.

NO NAME
added a comment - 03/Aug/12 22:11 One story would be to add this as an option to the AvroCLIClient. Right now you can either read from stdin or a single file, a natural extension would be to watch a directory and read from files dropped in that directory.