Andreu Sancho Homepage

Monday, May 30, 2016

Today’s mid- and high-end computers come with tremendous hardware, mostly used in video games and other media software, that can be exploited for advanced computation, that is, High Performance Computing (HPC). This is a hot topic in Deep Learning, as modern graphics cards come with huge stream-processing power and large, fast memory. The most successful example is Nvidia’s CUDA platform. In summary, CUDA can significantly speed up the fitting of large neural nets (for instance, from several hours to just a few minutes!).

However, the drawbacks come when setting up the environment: it is non-trivial to install the requirements and get everything running, and personally I had a little trouble the first time, as many packages need to be manually compiled and installed in a specific order. The purpose of this entry is to document what I did to set up Theano and Keras with HPC using an Nvidia graphics card (in my case a GT730) on GNU/Linux. To do so, I will assume a clean Debian 8 Jessie install and the use of anaconda for python.

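The first thing to do is to install the prerequisite packages (gcc, g++, gfortran, build-essential, linux-headers, git, and automake) after updating the apt-get package index, assuming we are already logged in as root. On Debian the headers package is named after the running kernel, so the commands look roughly like this (adjust the package names to your system):

# apt-get update

# apt-get install gcc g++ gfortran build-essential linux-headers-$(uname -r) git automake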

These are the minimum requisites in order to proceed. Without all these software packages we could not compile Theano and Keras. Next, we configure git in two easy steps:

$ git config --global user.name YOUR_USER_NAME

$ git config --global user.email YOUR_USER_EMAIL

Now we start downloading the prerequisites for Theano. We start with OpenBLAS, an efficient linear algebra library:

$ mkdir git
$ cd git

$ git clone https://github.com/xianyi/OpenBLAS

Become root, enter the git/OpenBLAS folder and run the following two lines:

# make FC=gfortran

# make PREFIX=/usr/local install

After this step we can proceed with the installation of the graphics card driver and the CUDA toolkit. This is one of the most critical parts, so be very careful. First we need to download the package from the Nvidia web page https://developer.nvidia.com/cuda-toolkit, selecting CUDA 7.5, Linux, x86_64, Ubuntu, 14.04, runfile (local). Yes, we will use the Ubuntu 14.04 file; that causes no trouble. After downloading the file, we have to blacklist the nouveau driver, otherwise the Nvidia driver won’t work. To do so, as root, we need to do the following:

# gedit /etc/modprobe.d/nvidia.conf

And enter the following:

blacklist nouveau

blacklist lbm-nouveau

blacklist nvidia-173

blacklist nvidia-96

blacklist nvidia-current

blacklist nvidia-173-updates

blacklist nvidia-96-updates

alias nvidia nvidia_current_updates

alias nouveau off

alias lbm-nouveau off

Save and exit. Then, as root run the following:

# update-initramfs -u

Now we are almost ready to install the drivers. To proceed with the installation we have to kill the X session. Switch to console mode by pressing CTRL + ALT + F1 and log in. After logging in with your user, log in as root. Then do the following:

# telinit 3

This switches to the classic console-only mode (that is, no X session), and we can proceed with the installation. Now enter the directory where the CUDA installer is. I assume it is in ~/Downloads/ and the file is called “cuda_7.5_linux.run” (this may differ, so check the actual name!):

# cd /home/YOUR_USER_HOME/Downloads

# chmod +x cuda_7.5_linux.run

# ./cuda_7.5_linux.run

And follow the instructions: basically, accept the license and say yes to everything. Before proceeding we have to modify the .bashrc configuration for your user (not for root but for your user!):

$ vim ~/.bashrc

And add the following lines at the end of the file:

export PATH=/usr/local/cuda-7.5/bin:$PATH

export LD_LIBRARY_PATH=/usr/local/cuda-7.5/lib64:$LD_LIBRARY_PATH

Save and exit. After this step, reboot the machine. We now have the drivers and the CUDA toolkit installed and ready. Next we have to install anaconda and the rest of the packages. You can download the installer from the following web page: http://continuum.io/downloads selecting Linux 64 bit for Python 2.7 (Python 3.5 is also available, but we stick to 2.7 for this tutorial). After this, install anaconda (assuming the file is named “Anaconda-Linux-x86_64.sh” and that it is stored in your Downloads folder):

$ cd Downloads

$ bash Anaconda-Linux-x86_64.sh

You will see that anaconda writes into your .bashrc file. At this point I recommend either rebooting (the easy solution) or sourcing the updated .bashrc file in every open terminal:

$ cd

$ . .bashrc

(yes, type cd and press ENTER then type . .bashrc and press ENTER).

After this, you will have set anaconda as the default python. You can check it by typing:

$ python --version

If everything is OK you will see a version string saying that this is the python build distributed with anaconda. Now we update the packages:

$ conda update conda

$ conda update anaconda

$ conda install pydot

At this point we can proceed with Theano and Keras. It is crucial to get the latest versions from the git repositories, otherwise they won’t work (at least in my case). So we proceed with the required packages:

$ cd git

$ git clone https://github.com/Theano/Theano

$ git clone https://github.com/fchollet/keras

We enter the folders and install the packages in the following order: first install Theano, then configure Theano (the .theanorc file) and finally install Keras.

Friday, March 11, 2016

Sometimes practitioners are forced to go beyond the standard methods in order to
gain more accuracy with their models. When the goal is to push accuracy further,
ensembling is a good starting point. However, the trick lies in getting enough
generalization from the feature space. In this regard, ensemble generalization
(not to be confused with classic or "standard" ensemble methods such as Random
Forest or Gradient Boosting) is the right path to follow, however complex. The
idea is to combine predictions from "base learners" to train a second-stage
regressor, using these predictions as metafeatures. The trick is to use a J-fold
cross-validation scheme and always use the same data partitions and seed. This
kind of ensemble is often called stacking, as we "stack" layers of classifiers.

Let’s work through an example: suppose that we have three base learners (GBM,
ET, and RF) and an LM as the level 2 learner. First we divide the training data
into J folds, for example 4; recall that these 4 folds are stratified and
disjoint. Then we train each model using the traditional cross-validation
scheme, that is, train with 3 folds and predict on the remaining one (this works
best if the predictions are in the form of probabilities). These out-of-fold
predictions are stored and will be used for training the level 2 model. Figure 1
depicts this process.
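The level 1 step described above can be sketched with scikit-learn. This is only an illustration under my own assumptions: the data set is synthetic, the seed and the number of trees are arbitrary, and scikit-learn's classifiers stand in for GBM, ET, and RF.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import (ExtraTreesClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import StratifiedKFold

SEED = 42  # one seed for the partitions and for every learner (arbitrary value)
J = 4      # number of folds, as in the example

# Synthetic binary problem, used only to make the sketch runnable.
X, y = make_classification(n_samples=200, n_features=10, random_state=SEED)

base_learners = [
    GradientBoostingClassifier(n_estimators=50, random_state=SEED),  # GBM
    ExtraTreesClassifier(n_estimators=50, random_state=SEED),        # ET
    RandomForestClassifier(n_estimators=50, random_state=SEED),      # RF
]

# Build the stratified, disjoint partitions once and reuse them for every learner.
splitter = StratifiedKFold(n_splits=J, shuffle=True, random_state=SEED)
folds = list(splitter.split(X, y))

# One metafeature column per base learner, filled with out-of-fold probabilities.
meta_features = np.zeros((X.shape[0], len(base_learners)))
for col, learner in enumerate(base_learners):
    for train_idx, holdout_idx in folds:
        learner.fit(X[train_idx], y[train_idx])  # train on J - 1 folds
        proba = learner.predict_proba(X[holdout_idx])[:, 1]
        meta_features[holdout_idx, col] = proba  # predict the held-out fold
```

Every row of meta_features is predicted by models that never saw that row during training, which is what keeps the level 2 model from learning on leaked labels.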

Figure 1. Ensemble generalization (also known as Stacking) training scheme. The idea is to "stack" multiple layers for generalizing further (in this example we use two layers), and use a J-fold cross-validation scheme for avoiding bias (in this example J = 4).

After training the level 2 algorithm, we can proceed with the final predictions.
To do so, we retrain the base learners, this time using the whole training set.
We do this because it can gain up to 20% in accuracy. It is important to
highlight that we have to ensure that the random seeds are the same as in the
J-fold training! Afterwards, for each test example we predict with the base
learners and collect the predictions. These are the input of the level 2
algorithm, which performs the final prediction.
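Putting the whole procedure together, a minimal end-to-end sketch could look like the following. Again, these are my assumptions and not a reference implementation: the data is synthetic, the held-out test split is made up for the example, and logistic regression plays the role of the level 2 LM.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import (ExtraTreesClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (StratifiedKFold, cross_val_predict,
                                     train_test_split)

SEED = 42  # reused everywhere: partitions, J-fold training, and final refit

# Synthetic data with a held-out test set (illustrative only).
X, y = make_classification(n_samples=300, n_features=10, random_state=SEED)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=SEED)

base_learners = [
    GradientBoostingClassifier(n_estimators=50, random_state=SEED),
    ExtraTreesClassifier(n_estimators=50, random_state=SEED),
    RandomForestClassifier(n_estimators=50, random_state=SEED),
]

# Same 4 stratified folds for every learner, because the seed is fixed.
cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=SEED)

# Level 1 (training): out-of-fold probabilities become the metafeatures.
train_meta = np.column_stack([
    cross_val_predict(m, X_train, y_train, cv=cv, method="predict_proba")[:, 1]
    for m in base_learners
])

# Level 2: a linear model trained on the metafeatures.
level2 = LogisticRegression().fit(train_meta, y_train)

# Refit every base learner on the whole training set (same seeds as before),
# then collect their predictions for the test examples.
for m in base_learners:
    m.fit(X_train, y_train)
test_meta = np.column_stack(
    [m.predict_proba(X_test)[:, 1] for m in base_learners])

final_pred = level2.predict(test_meta)
print("stacked accuracy: %.3f" % (final_pred == y_test).mean())
```

Here cross_val_predict does the fold bookkeeping; it fits a fresh copy of each learner per fold, so the explicit refit on the full training set is still needed before predicting the test examples.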

Wednesday, December 16, 2015

The growing volume of data generated in industrial and scientific applications has fostered practitioners’ interest in mining large amounts of unlabelled data in the form of continuous, high-speed, and time-changing streams of information. An appealing field is association stream mining, which models dynamically complex domains via rules without assuming any a priori structure. Unlike the related field of frequent pattern mining, its goal is to extract interesting associations among the features that make up such data, adapting these to the ever-changing dynamics of the environment in a purely online fashion, without the typical offline rule generation. These rules are well suited to extracting valuable insight that helps in decision making.

It is a pleasure to detail Fuzzy-CSar, an online genetic fuzzy system (GFS) designed to extract interesting, quantitative rules from streams of samples. It evolves its internal model online, being able to quickly adapt its knowledge in the presence of drifting concepts.