The latest 3.x branch of cal-heatmap comes with one big new feature: vertical orientation.

Implementing it was not easy, since related features, like positioning the label on the side, had to come along.

A quick summary of all the new features in v3:

Calendar vertical orientation

Label can be positioned on all four sides

Label can be rotated (-90deg/+90deg)

New x_ subdomain variants. Prepend x_ to a subdomain (e.g. x_day, x_month) to rotate its reading order. This is most useful with a vertical orientation.

New offset option for the label, for more control over its x/y position.

The id option was replaced by itemSelector. It takes a CSS3-compliant selector string, so you can now use something like body > p:first-child > [title="calendar"]. If the selector matches more than one element, the first one is used. itemSelector can also take a DOM Element object directly.

New domainMargin option to add a margin around each domain.

New domainDynamicDimension option, to let the domain width and height fit its content, since not all months have the same number of weeks. Disabling it gives all domains the same width, which may leave gaps between domains.

You can now display a text/date inside subDomain cells.

highlightToday was replaced by highlight, which takes an array of dates, so you can highlight any subDomain cell.

More legend options, so that you can place it either above or below the calendar, and align it left, center or right. The legend size is now independent of cellSize, and can be set via legendCellSize.

Oh, and all references to scale were renamed to legend.

The browsing option was removed: browsing is now always enabled, but cal-heatmap will no longer create the next/previous buttons for you.

Two new methods, next() and previous(), were added to browse the calendar.

If you don’t like the idea of using javascript to navigate the calendar, two new previousSelector and nextSelector options were added, to attach the next() and previous() methods directly to a DOM Element’s click event.

Domain Highlighting: you can control the background color of any domain. As opposed to the subDomain highlight option, domain highlighting is passive and controlled solely via CSS.

With this batch of new features and options, almost all options were renamed for consistency. Some, like scale (now legend), were renamed outright; others were just camelCased.

See the new shiny website for details about each feature.
I’m pretty satisfied with it, as each option is now properly documented and illustrated with a simple example. You can learn a lot more there than in this post.

A recommendation system seems to be a must for today’s websites, where you want to keep visitors on your site by offering content that will hold them there.

Anyone can build a basic recommendation engine by joining a few tables in a relational database, recommending an item A based on another item B by looking for similarities between the two (common tags/categories, common keywords in the name and description, etc.).

If I said I watched The Dark Knight, the two most obvious recommendations you would give me are The Dark Knight Rises and Batman Begins. Lots of common keywords, tags and staff, and an obvious “sequel/prequel” relation between these movies.

So far, we were dealing with item-to-item recommendation. It’s the easiest kind of recommendation you can implement: you’re just dealing with similarities between two entities.

Now, recommend me a third movie … Superman? Because I like superhero movies? Or Inception, because I like the cast? You can’t really decide without knowing my preferences. In this user-to-item relation, you have to know all my previously watched movies and my behavior (most viewed genres, actors, themes, etc.) before reaching a conclusion.

So, let’s try Superman … Which one ? The first, second, or the third ?

Here comes the machine learning system: you feed it user and item data, as well as their relations (likes, ratings, views, etc.), and it will predict the future based on various algorithms.

Apache Mahout is one of the most popular free machine learning libraries, written in Java. It’s used by some big names such as Amazon, Foursquare, Twitter, Yahoo, etc. It uses Hadoop for storage, can be scaled, and can process a lot of data. Installing and managing these tools can be intimidating and frustrating, but PredictionIO handles all these petty tasks for us. In the end, you just have to install PredictionIO and start it; all the Hadoop and Mahout machinery is hidden from you.

PredictionIO, an open source Machine Learning server

PredictionIO is a “one package” tool that installs and sets up all the dependencies automatically, then starts a Tomcat server exposing a REST API, the only gateway to your machine learning server. You can learn more about the server structure here.

The dashboard will be available at http://localhost:9000. You’re free to use another port by editing ADMIN_PORT in bin/common.sh. On my system, port 9000 was already taken by php5-fpm.

This dashboard is the main advantage of using PredictionIO over a vanilla Hadoop+Mahout installation, as it provides a neat web interface to organize and set up your engines. The REST API can also be consumed by anyone, regardless of your preferred programming language. PHP, Ruby, Python and Java SDKs are already available, offering the basic functions. You’re free to write your own, or implement more functions on top of the existing ones.

The dashboard is password protected, and you can create a user account easily with

bin/users

After login, you’ll be asked to create your first App.

You’ll obtain an App Key, used to authenticate all API calls.

The next step is to create an engine. An engine predicts a relation between 2 entities. If you have some posts, movies, books, etc., one engine can only deal with 2 entities: user-movie, or user-book, or movie-book, etc. Although engines can deal with multi-item relations, sticking to a two-entity relation raises the accuracy of the prediction.

And as the users and items data are shared among all engines, you’re not losing anything.

There are 2 kinds of engines:

Item recommendation engine

Items similarity prediction engine

As of version 0.4, only the Item Recommendation Engine is available. No ETA was given for the other, more interesting engine.

Each engine can be fine-tuned by choosing a different prediction algorithm.

The engine is now ready to predict the future. But before that, you need to input some user, item, and behavioral data to train the machine. The more data you add, the more accurate your predictions will be.

The PredictionIO docs have some tutorials on building a recommendation engine.

As far as I know, the only way to input data into PredictionIO is through the API, so when adding a million records, have fun with the for loop …
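That for loop can live in a small shell script around curl. A sketch, written as a dry run that prints one command per record; the endpoint path and field names (users.json, pio_appkey, pio_uid) are placeholders from memory, not guaranteed to match PredictionIO’s actual API, so check the official docs before using them:

```shell
#!/bin/sh
# Bulk-import sketch: build one curl POST per user id.
# API_URL and the pio_* field names below are assumptions, not verified API specs.
API_URL="http://localhost:8000/users.json"
APP_KEY="YOUR_APP_KEY"

build_import_cmd() {
    # Build the curl invocation importing one user id.
    printf 'curl -s -X POST -d pio_appkey=%s -d pio_uid=%s %s' "$APP_KEY" "$1" "$API_URL"
}

# Dry run: print each command instead of executing it.
# Pipe the output to sh (or drop the printf wrapper) once the server is up.
for uid in user1 user2 user3; do
    build_import_cmd "$uid"
    printf '\n'
done
```

In practice you would read the ids from a file (`while read uid; do … done < users.txt`) instead of a hardcoded list.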

Alternatives

PredictionIO is still young and in development. There aren’t many free all-in-one machine learning servers out there.

The only other one I found is Myrrix, a similar product also based on Apache Mahout, but packaged as a single .jar file.

Usage could not be easier: you just download and run the .jar, and your machine learning server is online. It also uses a REST API for adding/editing data and getting predictions.

A server in Myrrix corresponds to an engine in PredictionIO. So, to have multiple engines, you’ll end up running multiple Myrrix servers on different ports. Each server is isolated, so the user data in the user-movie server cannot be shared with the user-book server.

Myrrix is also in development, and still in beta. Its website is very complete, with tons of examples, tutorials and use cases.

There are already a lot of tutorials on the internet on how to install awstats for nginx. I didn’t find any for the configuration I wanted, so I’ll write one, for my own records.

I have some custom needs; let’s suppose I have 3 domains:

master-domain.com

alpha.com

beta.com

And I want stats for the last 2 domains. master-domain.com is used as the master domain of the server, with awstats available at awstats.master-domain.com, instead of having alpha.com/awstats and beta.com/awstats. The idea is to group all the server scripts/tools (phpmyadmin, zabbix, etc.) under master-domain.com.

We also want to password protect the stats, but with different credentials for each vhost.

This file should be empty by default. It’s used to set the settings shared by all your awstats configs.

We teach awstats the meaning of each field when parsing the log. The last token (%otherquot) tells awstats: “that string here does not mean anything.”

Creating a configuration file for each vhost

Awstats is picky about configuration files: you should have one config file per vhost, named following the convention awstats.domain.tld.conf, and placed inside the /etc/awstats/ directory.

So, for the vhosts alpha.com and beta.com, you should create these two files:

awstats.alpha.com.conf

awstats.beta.com.conf

The official method

There is already a model configuration file inside the /etc/awstats/ directory: awstats.conf. The documentation says to clone that file when creating your own config files, with

Then you just edit these files to your needs… a method I’m not fond of. If you take a look at awstats.conf, you’ll see that it’s a very complete conf file, with plenty of comments and all the available settings, all of that for just * suspense music * … 1500 lines.
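For the record, the cloning itself is one cp per vhost. A sketch using a scratch directory so it’s safe to try as-is; on a real server you would replace $CONF_DIR with /etc/awstats:

```shell
#!/bin/sh
# Clone the model config once per vhost, as the docs suggest.
# A scratch directory stands in for /etc/awstats so this is harmless to run.
CONF_DIR=$(mktemp -d)
: > "$CONF_DIR/awstats.conf"   # stand-in for the real 1500-line model file

for vhost in alpha.com beta.com; do
    cp "$CONF_DIR/awstats.conf" "$CONF_DIR/awstats.$vhost.conf"
done

ls "$CONF_DIR"
```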

I’m personally not interested in having multiple conf files of 1500 lines each, with each file differing by just 4 lines.

The DRY method

If you ls the /etc/awstats folder, you’ll see that there are 2 files there by default:

awstats.conf

awstats.conf.local

awstats.conf is the main conf file, the origin of all the other conf files. Awstats will also fall back to this file if no other config file exists.

awstats.conf.local is an empty file. It’s the parent of all the other config files. If you have some rules shared among all your configs, put them here.

What I do is copy all the contents of awstats.conf into awstats.conf.local, and put only the important rules inside each vhost config, so they’re shorter and easier to read.

What to put in the conf files

Let’s create the conf file for alpha.com.

vi /etc/awstats/awstats.alpha.com.conf

We start with an empty file; insert the following lines:

# Path to your nginx vhost log file
LogFile="/var/log/nginx/access.alpha.com.log"
# Domain of your vhost
SiteDomain="www.alpha.com"
# Directory where to store the awstats data
DirData="/var/lib/awstats/"
# Other aliases: other domains/subdomains that point to the same site
HostAliases="www.alpha.com"

By default, awstats stores all its data inside /var/lib/awstats/. You could change that to another directory, or have a subdirectory for each vhost, like /var/lib/awstats/alpha.com/.

But even if you use the default setting, you have to set it in each config, as it cannot be inherited from awstats.conf.local.

You’re free to add more settings if some of your vhosts require additional customization.
Repeat the same steps for each vhost.

Tune the global settings

Set LogFormat to the NCSA combined format (if you didn’t use the optional step of formatting the nginx log):

LogFormat = 1

You could also enable some plugins, like GeoIP (which requires additional steps besides uncommenting the line).

Computing data

Awstats is now configured for each vhost. We will now tell it to read the log files and generate stats from them. It’s a boring operation that should be done regularly (e.g. once a day, every 6 hours, etc.) depending on your needs. The longer you wait, the bigger the log file grows, and the more time it will take to process. It all depends on your website traffic.

To compute the data, a perl script is available in /usr/share/doc/awstats/examples. awstats_updateall.pl will compute the stats for each available config. It’s easy, just run:

/usr/share/doc/awstats/examples/awstats_updateall.pl now -awstatsprog=/usr/lib/cgi-bin/awstats.pl

The -awstatsprog flag tells the script where to find the awstats.pl script, because awstats_updateall.pl is just a wrapper that executes awstats.pl for each of your configs.
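For example, a crontab entry (crontab -e) running the computation every 6 hours could look like this, with the same paths as above; adjust them to your setup (and mind the rotation caveats discussed next):

```
# Compute awstats data every 6 hours
0 */6 * * * /usr/share/doc/awstats/examples/awstats_updateall.pl now -awstatsprog=/usr/lib/cgi-bin/awstats.pl
```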

The obvious solution to run this script regularly is a cron job. The drawback is that nginx logs are rotated with logrotate: every X days, the log file is archived (and renamed), and a new log file is created. If you use a cronjob to compute the stats:

Just before the log rotation, you’ll lose all data between the computation and the rotation, as the file is renamed and no longer accessible to awstats.

After the rotation, you’ll also lose all data between the computation and the next rotation.

At the rotation, you’ll experience some weird things.

Solution #1

We could prevent the data loss by telling awstats to always parse 2 log files: the regular one, and the latest archived log.

Logrotate renames files using the convention filename.1, filename.2, etc. At each rotation, all suffixes are incremented, and filename becomes filename.1. A new filename is created, so the newest archive is always filename.1.

You’ll never lose data because of the rotation, since you’ll parse the rotated file too.
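awstats can read its input from a piped command, and ships with a logresolvemerge.pl helper that merges several log files. A sketch for alpha.com; the helper’s path may differ on your distribution:

```
# In awstats.alpha.com.conf: parse the current log and the latest rotated one
LogFile="/usr/share/awstats/tools/logresolvemerge.pl /var/log/nginx/access.alpha.com.log /var/log/nginx/access.alpha.com.log.1 |"
```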

Solution #2

Execute the computation just before the rotation, using logrotate’s prerotate hook, which runs before the log file is renamed. This is especially useful if your computation interval equals the rotation interval (e.g. you rotate every day at midnight, and you also compute every day at midnight).

Edit the logrotate config for nginx:

vi /etc/logrotate.d/nginx

I like to rotate logs every day, to keep them light. By default, nginx rotates logs weekly.
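A sketch of such a logrotate config, switching to daily rotation and running the awstats update in the prerotate hook so the log is parsed before being renamed. The other directives are common defaults, to adjust to taste:

```
/var/log/nginx/*.log {
        daily
        rotate 14
        missingok
        compress
        delaycompress
        sharedscripts
        prerotate
                /usr/share/doc/awstats/examples/awstats_updateall.pl now -awstatsprog=/usr/lib/cgi-bin/awstats.pl
        endscript
        postrotate
                [ -f /var/run/nginx.pid ] && kill -USR1 $(cat /var/run/nginx.pid)
        endscript
}
```

sharedscripts makes each hook run once for all matched logs instead of once per file; the postrotate block is the usual signal telling nginx to reopen its log files.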

You can also run

/usr/share/doc/awstats/examples/awstats_updateall.pl now -awstatsprog=/usr/lib/cgi-bin/awstats.pl

directly in the shell, if you don’t want to wait for the log rotation at midnight.

You could use a regular cronjob on a single log file if you compute more than once a day, and use the logrotate hook just for the computation near midnight.

Building the html reports

awstats_updateall.pl computes new stats, but does not build the html pages. Awstats comes with 2 options:

Build the static html page yourself

Use cgi to build the page dynamically

I’ll use the dynamic option, explained below. There are already plenty of articles on the internet explaining how to build static pages, if that’s the way you want to go.

Exposing awstats

Now that awstats is configured and loaded with data, let’s make it viewable from the internet.

Let’s create the subdomain where awstats will live : awstats.master-domain.com, linked to /var/www/awstats.

Assuming the subdomain is already pointed at your server (creating the subdomain is outside the scope of this post), you just have to create the nginx virtual host for awstats.master-domain.com.

How you create it is your own choice; there are multiple ways (single conf file, ‘sites-enabled’ à la apache, etc.).
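One common way to serve the awstats CGI under nginx is through fcgiwrap; that choice, along with the socket and htpasswd paths below, is an assumption on my side for a Debian-like setup, not the only option. A minimal vhost sketch with basic auth for the password protection:

```nginx
server {
    listen 80;
    server_name awstats.master-domain.com;
    root /var/www/awstats;

    location ~ ^/cgi-bin/.*\.pl$ {
        # Password protection; generate the file with htpasswd
        auth_basic           "Stats";
        auth_basic_user_file /etc/nginx/awstats.htpasswd;

        gzip off;
        include       fastcgi_params;
        fastcgi_pass  unix:/var/run/fcgiwrap.socket;
        fastcgi_param SCRIPT_FILENAME /usr/lib/cgi-bin/awstats.pl;
    }
}
```

Each vhost’s stats are then reached through the config query parameter, e.g. awstats.master-domain.com/cgi-bin/awstats.pl?config=alpha.com. Per-vhost credentials would need separate location blocks (or a map on that query argument), which I leave out of this sketch.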

CakeResque 3.0 is fresh out of the oven. The most important feature of this version is support for scheduled jobs.

In addition to queuing a job for later execution, you can now specify when to queue the job.

Scheduling jobs

Queuing a job on a future date

You can now specify when to queue a job with CakeResque::enqueueAt(). This function takes 5 arguments:

CakeResque::enqueueAt($time, $queue, $class, $args, $track);

The last 4 arguments are the same as those of the basic CakeResque::enqueue(). The new argument is the first one: the date when you want to queue the job. It can be a DateTime object, or simply an integer timestamp.

Example

Limitations

By the worker polling time

Scheduling a job for time X (or after Y seconds) does not guarantee that the job will run at that time. It only means that the job will be added to the specified queue at that time. When it will be executed depends on the worker polling the queue.

Example

Let’s suppose we have a queue, with a worker polling it every 15 seconds. Say you started the worker precisely at second 00, so the worker polls the queue each minute at 00, 15, 30 and 45 seconds.

If you schedule a job for 14:05:04, the job will only be added to the queue at 14:05:04, and it has to wait until 14:05:15 for the worker to execute it.

By the Scheduler Worker polling time

Another point to know: before the job is added to the queue, it ‘sleeps’ in a special queue. A special worker, the Scheduler Worker, polls that queue to check for due jobs, and adds them to the right queue. The Scheduler Worker also has its own pause between pollings (which you can define yourself), set to 3 seconds by default.

Example

Continuing the previous example, the job is scheduled for 14:05:04. The Scheduler Worker, which polls every 3 seconds (at 03, 06, 09 seconds, etc.), will add it to the queue at 14:05:06, where the regular worker will execute it at 14:05:15. Of course, you can lower the Scheduler Worker’s polling interval to 1 second, depending on your needs.

The Scheduler Worker

The Scheduler Worker is a special worker used to move jobs to the right queue when they’re due. It must be running to handle scheduled jobs, else these jobs will never be pushed to the right queue, and thus never be executed.

To start the Scheduler Worker, run in the terminal:

cake CakeResque.CakeResque startscheduler

You can also set the polling interval with the -i flag.

cake CakeResque.CakeResque startscheduler -i 3

Unlike the regular worker start command, the interval flag is the only flag accepted by the startscheduler command. The default value is 3 seconds, which you can edit in the bootstrap, under CakeResque.Scheduler.Worker.interval.

You can only have one Scheduler Worker; attempting to start another will fail. If you use the load command to start your workers, the Scheduler Worker will be started automatically.

This worker can be paused, resumed and stopped like any other worker, with the usual command.

Settings

Refer to the bootstrap file for the Scheduler Worker settings; it has its own section there.

The scheduler is disabled by default. Enable it only if you use it, else the Scheduler Worker will just be a burden.

Other improvements

The stats command

The stats command has been updated with new information. It now displays the number of jobs in each queue, and notifies you when a queue is not monitored by any worker.

It also warns you when there are scheduled jobs but the Scheduler Worker is not running.

Job tracking

You can track job status like usual. Scheduled jobs are labeled ‘scheduled’.

Using background jobs inside a PHP framework with php-resque is a little different, as the framework imposes its own conventions. Let’s see how to create background jobs in CakePHP, with the CakeResque plugin.

CakeResque is a CakePHP plugin for creating background jobs that can be processed offline later.

CakeResque is more than a wrapper for using php-resque within CakePHP. Where it really shines is how it manages the dirty job of creating and stopping workers via the cake console.