Clone the repo

After cloning the repo, let’s have a look at the README file to learn how to use it.

$ less README.md

Install dependencies

The Python script needs a few dependencies. Installing them system-wide can conflict with other Python libraries, so we will create a virtual env and install the dependencies there.

Create and activate the virtual env:

$ virtualenv -p python2.7 venv
$ source venv/bin/activate

Note: install virtualenv first if it is not already installed:

$ pip install virtualenv
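
The "install only if missing" check from the note can be scripted. This is a minimal sketch, not part of the repo: `ensure_cmd` is a hypothetical helper name, and it assumes a POSIX shell where `command -v` is available.

```shell
# Run the install command only when NAME is not already on the PATH.
# Usage: ensure_cmd NAME INSTALL_COMMAND [ARGS...]
# (ensure_cmd is a hypothetical helper, not something the repo provides.)
ensure_cmd() {
    name=$1
    shift
    if command -v "$name" >/dev/null 2>&1; then
        echo "$name already installed"
    else
        # NAME was not found, so run whatever install command was passed in.
        "$@"
    fi
}

# Example:
# ensure_cmd virtualenv pip install virtualenv
```

Passing the install command as arguments keeps the helper reusable for tools other than virtualenv.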

Install dependencies:

$ pip install -r requirements.txt

Run the script

Let’s print some sample data to the terminal:

$ python apache-fake-log-gen.py -n 20

Now let’s create a gzip-compressed log file:

$ python apache-fake-log-gen.py -n 20 -o GZ

To generate multiple log files, we just have to run the command multiple times.
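
Repeating the command by hand gets tedious, so a small loop can do it. The sketch below uses a hypothetical helper, `repeat_run`, and a hypothetical `PAUSE_SECONDS` variable. The pause is there only in case the generator names its output files by timestamp, which is an assumption here and worth checking against the README.

```shell
# Run a command COUNT times, pausing between runs.
# Usage: repeat_run COUNT COMMAND [ARGS...]
# PAUSE_SECONDS (hypothetical, default 1) sets the pause after each run.
repeat_run() {
    count=$1
    shift
    i=0
    while [ "$i" -lt "$count" ]; do
        "$@" || return 1          # stop at the first failing run
        i=$((i + 1))
        sleep "${PAUSE_SECONDS:-1}"
    done
}

# Example: five compressed log files of 20 lines each
# repeat_run 5 python apache-fake-log-gen.py -n 20 -o GZ
```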

Copy data to S3

To copy the data to S3, we first need to decide on the path where we will keep our data.

It’s always a good idea to keep the raw data separate from the processed/cleaned data. It’s also a good idea to organize the data into date-based subdirectories/paths.
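
As an illustration of that layout, here is a minimal sketch that builds a date-partitioned prefix for raw data. The `raw/YYYY/MM/DD` layout, the helper name `raw_log_prefix`, and the bucket name `my-example-bucket` are all assumptions for the example, not necessarily the path used in this walkthrough.

```shell
# Build a date-partitioned S3 prefix for raw logs.
# Usage: raw_log_prefix BUCKET [YYYY-MM-DD]   (the date defaults to today)
# The raw/YYYY/MM/DD layout is an assumed convention for illustration.
raw_log_prefix() {
    bucket=$1
    day=${2:-$(date +%Y-%m-%d)}
    # Turn 2019-03-07 into 2019/03/07 so each date part becomes its own path level.
    path=$(printf '%s' "$day" | sed 's|-|/|g')
    printf 's3://%s/raw/%s/\n' "$bucket" "$path"
}

# Example (hypothetical bucket and file name; requires the AWS CLI):
# aws s3 cp access_log.gz "$(raw_log_prefix my-example-bucket)"
```

A matching `processed/` prefix alongside `raw/` would keep cleaned data separate under the same date scheme.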
I have selected this path