Lua-MapReduce

Introduction

Lua MapReduce implementation based on MongoDB. It differs from
ohitjoshi/lua-mapreduce
in the way the processes communicate. To provide fault tolerance and to
reduce the complexity of the communication protocol, this implementation
relies on MongoDB, so all the data is stored in auxiliary MongoDB
collections.

Documentation

Performance notes

Word-count example using Europarl v7 English data,
with 1,965,734 lines and 49,158,635 running words. The data has been split
into 197 files with a maximum of 10,000 lines per file. The task is executed
on one machine with four cores. The machine runs a MongoDB server, a
lua-mapreduce server and four lua-mapreduce workers. Note that this comparison
is not fair, because the data could be stored in the local filesystem.
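
The file count above follows from the corpus size and the per-file limit. The snippet below is a minimal Python sketch of that arithmetic and of one way to split a line stream into chunks; it is not part of lua-mapreduce, and the helper name `chunk_lines` is hypothetical.

```python
import math
from itertools import islice


def chunk_lines(lines, max_lines=10_000):
    """Yield successive lists holding at most max_lines lines each."""
    iterator = iter(lines)
    while True:
        chunk = list(islice(iterator, max_lines))
        if not chunk:
            return
        yield chunk


# Number of files needed for 1,965,734 lines at 10,000 lines per file.
print(math.ceil(1_965_734 / 10_000))  # 197
```

Each chunk could then be written to its own file, so that every file becomes an independent unit of work.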

Looking at these numbers, it is clear that the best option is to work in main
memory and on the local filesystem, as in the naive Lua implementation, which
needs only 26 seconds (real time) but uses local disk files. The map-reduce
approach takes 49 seconds (real time) with four workers and 146 seconds (real
time) with only one worker. These last two numbers can be compared with the
naive shell script implementation using pipes, which takes 146 seconds (real
time). In conclusion, the preliminary lua-mapreduce implementation, using four
workers, MongoDB for communication and GridFS for auxiliary storage, is up to
3 times faster (49 versus 146 seconds) than a shell script implementation
using pipes. Both implementations sort the data in order to aggregate the
results. In the future, a larger task will be chosen to compare this
implementation with raw map-reduce in MongoDB and/or Hadoop.
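
The word-count task measured above can be pictured as three phases: map, sort, and reduce. The following is a minimal single-process sketch in Python, meant only to illustrate those phases and the role of sorting in the aggregation. It does not use or reproduce the lua-mapreduce API, and the names `map_words`, `reduce_counts` and `word_count` are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def map_words(line):
    """Map phase: emit a (word, 1) pair for every whitespace-separated word."""
    for word in line.split():
        yield (word, 1)


def reduce_counts(word, values):
    """Reduce phase: sum all partial counts emitted for one word."""
    return (word, sum(values))


def word_count(lines):
    """Run the map, sort, and reduce phases in a single process."""
    # Map: collect the pairs emitted for every input line.
    pairs = [pair for line in lines for pair in map_words(line)]
    # Sort by key so that equal words become adjacent.
    pairs.sort(key=itemgetter(0))
    # Reduce: aggregate each run of identical keys.
    return dict(
        reduce_counts(word, (count for _, count in group))
        for word, group in groupby(pairs, key=itemgetter(0))
    )


if __name__ == "__main__":
    sample = ["the quick brown fox", "the lazy dog", "the end"]
    for word, total in sorted(word_count(sample).items()):
        print(word, total)
```

In a distributed setting the map and reduce calls would run in separate worker processes, with the intermediate pairs kept in shared storage rather than in a local list.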

Last notes

This software is in development. More documentation will be added to the
wiki pages as time permits. Collaboration is open, and all
contributions are welcome.