Setting up Piwik log analytics on OVH shared hosting

Analytics help you see how users interact with your website. They can show which topics visitors are most interested in, or at which step of your sales funnel most users give up on buying your product. I’ve recently set up Piwik to keep an eye on the traffic on my website using its log analytics feature, automatically importing the previous day’s log into the tool every night. There were 4 main steps to getting it working:

Hosting Piwik

Writing a script to import the log into Piwik

Writing a script to download the logs automatically

Setting up a Cron job to run the combination of both scripts every night

Piwik, not Google Analytics

The most famous tool in this area is Google Analytics. While powerful, it sends data from the visitors of my site to a 3rd party (Google) whose business is selling ads by tracking people around the web. That makes me a bit wary regarding users’ privacy.

To keep control of that data, a strong alternative is Piwik. It’s open-source and free to host on your own servers. This puts you in control of the collected data and of how your users’ privacy is respected (Do Not Track browser setting, IP anonymisation…).

There are different ways to go about it. Similar to Google Analytics, you can embed a piece of JavaScript on your pages which will notify Piwik when users visit them. Or you can feed Piwik your server logs. As my website has no other client-side interactions to track, and doesn’t load much content via JavaScript, log analytics is enough. That means one less script to download for the users, no tracking cookie, and, additionally, I’m able to see requests for non-existent pages. This might allow me to detect when something breaks. At the moment, though, it mostly shows me where attackers are poking for vulnerabilities in WordPress or its plugins.

Hosting Piwik

The first step is obviously to get Piwik running, ideally hosted on a separate vhost and with a separate database from your main website. After creating a new site and database on OVH’s hosting admin panel, Piwik’s installation is pretty straightforward following their user guide.

With Piwik ready to receive data, let’s work backward and look first at how to feed it the logs.

Importing the logs in Piwik

Along with the PHP file providing the UI and the endpoint for the JS tracker, Piwik also provides a Python script for importing logs. All our import script needs to do is run that command with the appropriate options.

The script requires some kind of authentication to import the data into Piwik: either the login/password combination you use to log in to the UI, or an auth token that you can generate from the Piwik UI. Either way, these are kept out of the script to make it portable, and the appropriate flags (--login & --password, or --token-auth) are stored in a separate .piwikauth file.

The tool also needed a bit of help to parse the logs. The log format of OVH’s shared hosting doesn’t exactly match any of the well-known log formats supported by the import tool. This prevented the tool from extracting the host information from the logs, so a specific regex had to be provided with --log-format-regex.

With the host information at hand, the tool can now create a Piwik site for each host encountered in the log, thanks to the --add-sites-new-hosts option. If you’re only interested in specific hosts, you can filter the logs with the --hostname option.

Lastly, Piwik’s import tool ignores static file downloads, HTTP redirects, and HTTP errors by default. It provides --enable-<XYZ> flags to include them in the import, though. Redirects added too much noise to the stats, but HTTP errors are definitely something I’m interested in. And I think downloads could give a view of the RSS feed audience (by tracking a query parameter on images loaded from RSS, for example), so I kept them too.
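To make this concrete, here is a minimal sketch of what such an import script could look like. It is not the exact script running on this site. The file name import-log.sh, the Piwik URL, the path to import_logs.py, the choice of --enable-static for downloads and, above all, the regex are assumptions. The regex supposes that an OVH log line looks like a combined Apache log with the host name in second position, so check it against a few lines of your own logs before relying on it.

```shell
#!/bin/sh
# import-log.sh (hypothetical name): feed one log file to Piwik's import tool.
# Every default below is an assumption; override it through the environment.

PYTHON="${PYTHON:-python}"
PIWIK_URL="${PIWIK_URL:-https://piwik.example.com}"
IMPORT_TOOL="${IMPORT_TOOL:-$HOME/piwik/misc/log-analytics/import_logs.py}"
# File holding the auth flags, e.g. a single line: --token-auth=<your token>
AUTH_FILE="${AUTH_FILE:-.piwikauth}"

# Assumed line layout: ip host user [date tz] "METHOD path PROTO" status length "referrer" "user agent"
# The named groups are the ones the import tool is expected to look for.
OVH_LOG_REGEX='(?P<ip>\S+)\s+(?P<host>\S+)\s+\S+\s+\[(?P<date>.*?)\s+(?P<timezone>.*?)\]\s+"\S+\s+(?P<path>.*?)\s+\S+"\s+(?P<status>\S+)\s+(?P<length>\S+)\s+"(?P<referrer>.*?)"\s+"(?P<user_agent>.*?)"'

import_log() {
    log_file="$1"
    # Read the auth flags from their own file so they never live in this script.
    auth_flags=$(cat "$AUTH_FILE") || return 1
    # $auth_flags is left unquoted on purpose: the file may hold several flags.
    "$PYTHON" "$IMPORT_TOOL" \
        --url="$PIWIK_URL" \
        $auth_flags \
        --log-format-regex="$OVH_LOG_REGEX" \
        --add-sites-new-hosts \
        --enable-static \
        --enable-http-errors \
        "$log_file"
}

# Run the import only when a log file is given on the command line.
if [ "$#" -gt 0 ]; then
    import_log "$1"
fi
```

It would be called as ./import-log.sh path/to/access.log, from the folder containing .piwikauth. Leaving out --enable-http-redirects is what keeps the redirects out of the stats.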

But to import logs into Piwik… well… we need logs. Let’s see how to collect them.

Downloading OVH logs

For shared hosting, OVH provides access to gzipped versions of the Apache logs. Of course, they are not publicly accessible. In the “Statistics and logs” section of the hosting admin panel, you can create a login/password to grant access to them.

From there, the highly structured URLs make it easy to collect the logs for a given date.

As for the Piwik import script, the credentials are kept out of the script. You’ll need to put the necessary curl credentials flag (--user login:password) in a separate .curlauth file.
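As an illustration, a download script along these lines could do the job. Treat it as a sketch under assumptions rather than the script used here: the file name download-log.sh, the base URL and the exact URL layout (a folder per month, a file per day) are placeholders to compare with the links shown in your own “Statistics and logs” section. It also assumes .curlauth holds a line such as --user login:password, which curl can read through its --config option.

```shell
#!/bin/sh
# download-log.sh (hypothetical name): fetch the gzipped log of one day.
# LOG_BASE_URL and the URL layout are assumptions; adapt them to the links
# your hosting admin panel actually shows.

CURL="${CURL:-curl}"
LOG_BASE_URL="${LOG_BASE_URL:-https://logs.example.net}"
# File holding the credentials, e.g. a single line: --user login:password
CURL_AUTH_FILE="${CURL_AUTH_FILE:-.curlauth}"

# Build the log URL. Arguments: domain, day (DD), month (MM), year (YYYY).
log_url() {
    printf '%s/%s/logs-%s-%s/%s-%s-%s-%s.log.gz\n' \
        "$LOG_BASE_URL" "$1" "$3" "$4" "$1" "$2" "$3" "$4"
}

# Download the log. Arguments: domain, day, month, year, destination file.
download_log() {
    destination="$5"
    # --fail makes curl exit with an error on HTTP 4xx/5xx responses instead
    # of saving the error page as if it were a log file.
    "$CURL" --fail --silent --show-error \
        --config "$CURL_AUTH_FILE" \
        --output "$destination" \
        "$(log_url "$1" "$2" "$3" "$4")"
}

# Run the download only when all five arguments are given.
if [ "$#" -eq 5 ]; then
    download_log "$@"
fi
```

For example, ./download-log.sh example.com 19 09 2016 /tmp/example.com.log.gz would request the log for 19 September 2016 and store it in /tmp.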

The different parts are ready; time to combine them.

Setting up a Cron job

Cron is a tool that schedules scripts to run at regular intervals. That’s what we’ll use to run the import every night. To make things simpler, one last script is needed to combine the two previous ones and make the magic happen.
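Here is a rough sketch of how that combining script could be written. The names download-log.sh and import-log.sh, their arguments, the domain and the working folder are hypothetical, and date -d yesterday is GNU date syntax, so this assumes a Linux host. It also assumes the import tool can read gzipped logs directly; if yours cannot, decompress the file before importing it.

```shell
#!/bin/sh
# import-yesterdays-log.sh: download yesterday's log, then import it.
# The two helper script names and their arguments are assumptions.

DOMAIN="${DOMAIN:-example.com}"
WORK_DIR="${WORK_DIR:-/tmp}"
DOWNLOAD_SCRIPT="${DOWNLOAD_SCRIPT:-./download-log.sh}"
IMPORT_SCRIPT="${IMPORT_SCRIPT:-./import-log.sh}"

import_yesterdays_log() {
    # Ask for the date once so the three parts always belong to the same day.
    stamp=$(date -d yesterday +%d-%m-%Y) || return 1
    day=${stamp%%-*}
    rest=${stamp#*-}
    month=${rest%%-*}
    year=${rest#*-}
    log_file="$WORK_DIR/$DOMAIN-$stamp.log.gz"

    # Stop at the first failure so a failed download is never imported.
    "$DOWNLOAD_SCRIPT" "$DOMAIN" "$day" "$month" "$year" "$log_file" || return 1
    "$IMPORT_SCRIPT" "$log_file" || return 1

    # Optional cleanup: the log can always be downloaded again.
    rm -f "$log_file"
}

# Run only when this file is executed under its own name (as Cron would),
# not when its function is loaded from another script.
case "${0##*/}" in
    import-yesterdays-log.sh) import_yesterdays_log ;;
esac
```

Keeping the two helpers as separate commands means each one can still be run by hand for a specific date or log file when something needs to be re-imported.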

Now that all the scripts are ready, the last steps start with uploading them to the server. Ideally, you want them in a folder that’s not accessible from the internet (or at least with no access to the .*auth files). Make sure they can be executed by running chmod a+x *.sh inside the folder where they’re stored. And finally, in the hosting admin panel, you’ll need to schedule a new job to run every day, using the path to the import-yesterdays-log.sh script.

Note: You might want to test things out before setting up the Cron job. Quick catch there: unfortunately, it’s a bit more convoluted than SSHing to the server and running the last script from the command line. OVH seems to block network access for scripts run from an SSH connection on their shared hosting. All is not lost, though! To try the script out, you can create a small PHP page that runs the import script and shows its output.