We need to automate the movement of files from a Windows fileshare to HDFS. The file types/formats are varied.

To solve this, we wrote a Python script running on an edge node which pulls the files from the Windows share to local disk and then copies them to HDFS. This is bad because it requires temporary use of local disk space on the edge node, and we want to avoid that.

It's been suggested that I should use Flume... but this seems to be geared more towards pulling logs that are being actively written to on remote boxes AS OPPOSED to just copying static files to HDFS. Agree/Disagree?

Our tentative (and likely semi-permanent) solution was to install the pysmb Python package, which lets Python connect to Samba and Windows network fileshares. The main benefits for us are speed (compared to SFTP) and the ability to pull files (as opposed to having them pushed to us).
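
For context, here's a stripped-down sketch of the kind of pull we're doing with pysmb. The hostnames, share name, and credentials below are placeholders, not our real config:

```python
from smb.SMBConnection import SMBConnection

# Placeholder credentials and hostnames -- substitute your own.
conn = SMBConnection("svc_user", "password", "edge-node", "WINFILESERVER",
                     use_ntlm_v2=True, is_direct_tcp=True)
conn.connect("winfileserver.example.com", 445)  # 445 for direct TCP, 139 for NetBIOS

# List the share, then pull one file down to a local handle.
for f in conn.listPath("SHARE_NAME", "/incoming"):
    print(f.filename, f.file_size)

with open("/tmp/report.csv", "wb") as local_file:
    conn.retrieveFile("SHARE_NAME", "/incoming/report.csv", local_file)

conn.close()
```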

Another team installed a Tomcat Java RESTful service to which they push files, and its speed is comparable to the Python solution. I didn't want to go that route because, again, we wanted edge-node-resident jobs to pull files rather than have remote jobs push them.

It occurred to me that I should also mention that we are using MapR, which allows us to copy directly to HDFS without running a "hadoop fs" command. In other words, I can write files to HDFS as if it were mounted directly (because it is), so a file doesn't have to land 'locally' before being copied to HDFS.
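
To illustrate: since the MapR filesystem is NFS-mounted on the edge node, pysmb can write straight into a path under the mount, so nothing lands on local disk first. This is only a sketch, and the /mapr/my.cluster.com prefix is an assumption about the mount layout:

```python
from smb.SMBConnection import SMBConnection

conn = SMBConnection("svc_user", "password", "edge-node", "WINFILESERVER",
                     use_ntlm_v2=True, is_direct_tcp=True)
conn.connect("winfileserver.example.com", 445)

# This "local" path is really cluster storage exposed through the MapR NFS mount;
# /mapr/my.cluster.com is a placeholder for however your mount is laid out.
with open("/mapr/my.cluster.com/data/landing/report.csv", "wb") as hdfs_file:
    conn.retrieveFile("SHARE_NAME", "/incoming/report.csv", hdfs_file)

conn.close()
```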

Yeah, that's what I've been testing out... two problems there: first, I don't know of a Hadoop-distributed solution for copying from a Samba mount. Second, local disk space on the edge node is low, and I don't know of any way to copy directly from the Samba mount to HDFS without first writing to the local file system.

While we're talking, could you tell me whether a Flume source box (a non-Hadoop box) would need an agent/daemon installed? That's how it works, right? Sorry for my ignorance... it's more of a time crunch than laziness.

Indeed, mounting the share on an edge node is something I've considered. I don't have root, and it's a little painful to get the powers that be to make changes to their cluster, so I'm looking at all the good options :)

I agree with the WebHDFS comment. If you are willing to write Java code, you can use the HDFS libraries directly, without having to install Hadoop, as long as all you want to do is file system operations. This avoids extra file copies: http://wiki.apache.org/hadoop/HadoopDfsReadWriteExample. Nearly all of our source data comes from Windows, and we write it using the Java libs.
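
If Java isn't an option, the same idea works over the WebHDFS REST API from plain Python. A rough sketch of the two-step CREATE call follows; the NameNode host/port, user, and paths are placeholders, and 50070 is just the classic default HTTP port:

```python
import requests

namenode = "http://namenode.example.com:50070"   # placeholder host and port
hdfs_path = "/data/landing/report.csv"
user = "svc_user"

# Step 1: ask the NameNode where to write. It answers with a 307 redirect
# to a DataNode; allow_redirects=False lets us capture the Location header.
resp = requests.put(
    f"{namenode}/webhdfs/v1{hdfs_path}",
    params={"op": "CREATE", "overwrite": "true", "user.name": user},
    allow_redirects=False,
)
datanode_url = resp.headers["Location"]

# Step 2: stream the file body to the DataNode URL.
with open("report.csv", "rb") as f:
    requests.put(datanode_url, data=f)
```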

A lot of people don't realize that you can use old-school UNIX semantics with dfs -get and dfs -put... namely, that a "-" can stand in for stdin/stdout. In other words, "| hdfs dfs -put - /some/path" will write stdin to /some/path on HDFS.
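
For instance, here's a rough sketch of combining that with the pysmb pull mentioned above, so the file streams from the share into HDFS without being staged on the edge node's local disk. The server, share, and path names are placeholders:

```python
import subprocess
from smb.SMBConnection import SMBConnection

conn = SMBConnection("svc_user", "password", "edge-node", "WINFILESERVER",
                     use_ntlm_v2=True, is_direct_tcp=True)
conn.connect("winfileserver.example.com", 445)

# "hdfs dfs -put - <path>" reads from stdin, so we can pipe the SMB file
# straight into HDFS; pysmb just writes each chunk into the pipe.
put = subprocess.Popen(
    ["hdfs", "dfs", "-put", "-", "/data/landing/report.csv"],
    stdin=subprocess.PIPE,
)
conn.retrieveFile("SHARE_NAME", "/incoming/report.csv", put.stdin)
put.stdin.close()
put.wait()

conn.close()
```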

Something else: I'm not a big fan of Flume's push methodology, but if it works for you, it works for you. A pull methodology (such as the one used by Kafka) has the benefit of being more controlled, and in my experience it's usually easier to monitor and to correct mistakes in the data pipeline.