The basic Nearest Neighbor (NN) algorithm is simple and can be used for classification or regression. NN is a non-parametric approach, and the intuition behind it is that similar examples should have similar outputs.

Given a training set, all we need to do to predict the output for a new example is to find the “most similar” example in the training set.

A slight variation of NN is k-NN where, given an example we want to predict, we find the k nearest samples in the training set.
The basic Nearest Neighbor algorithm does not handle outliers well, because it has high variance, meaning that its predictions can vary a lot depending on which examples happen to appear in the training set. The k Nearest Neighbor algorithm addresses these problems.

To do classification, after finding the k nearest samples, take the most frequent of their labels.
For regression, we can take the mean or median of the k neighbors, or we can solve a linear regression problem on the neighbors.
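The two prediction rules above can be sketched in a few lines of Python. This is only an illustration, not the implementation used for the experiments: the function names are made up for the example, the training set is assumed to be a list of `(features, output)` pairs, and Euclidean distance is assumed.

```python
from collections import Counter
from math import dist  # Euclidean distance, available from Python 3.8


def k_nearest(train, query, k):
    """Return the k (features, output) pairs closest to `query`."""
    return sorted(train, key=lambda pair: dist(pair[0], query))[:k]


def knn_classify(train, query, k):
    """Classification: most frequent label among the k nearest neighbors."""
    labels = [label for _, label in k_nearest(train, query, k)]
    return Counter(labels).most_common(1)[0][0]


def knn_regress(train, query, k):
    """Regression: mean of the outputs of the k nearest neighbors."""
    outputs = [y for _, y in k_nearest(train, query, k)]
    return sum(outputs) / len(outputs)


# Toy data, invented for the example.
toy = [((1.0, 1.0), "a"), ((1.2, 0.9), "a"), ((5.0, 5.0), "b")]
print(knn_classify(toy, (1.1, 1.0), k=3))
```

Sorting the whole training set for every query is the simplest thing that works; it is fine for small datasets like the ones used here.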

Nonparametric methods are still subject to underfitting and overfitting, just like parametric methods.
In this case, 1-nearest neighbor overfits, since it reacts too much to the outliers. A high k, on the other hand, would underfit.
As usual, cross-validation can be used to select the best value of k.

Distance

The very word “nearest” implies a distance metric. How do we measure the distance from a query point $x_q$ to an example point $x_j$?

Typically, distances are measured with a Minkowski distance or $L^p$ norm, defined as:

$$L^p(x_j, x_q) = \left( \sum_i |x_{j,i} - x_{q,i}|^p \right)^{1/p}$$

With $p = 2$ this is Euclidean distance and with $p = 1$ it is Manhattan distance.
With Boolean attribute values, the number of attributes on which the two points differ is called the Hamming distance.

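For our purposes we will adopt Euclidean distance ($p = 2$). Since our dataset is made of two attributes, the distance between a query point $x_q$ and an example point $x_j$ reduces to $\sqrt{(x_{q,1} - x_{j,1})^2 + (x_{q,2} - x_{j,2})^2}$.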
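As a minimal sketch of these distances in Python (the function names are mine, and points are assumed to be equal-length tuples of numbers):

```python
def minkowski(a, b, p=2):
    """Minkowski distance between two points; p=2 is Euclidean, p=1 is Manhattan."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


def hamming(a, b):
    """Number of attributes on which two points differ."""
    return sum(x != y for x, y in zip(a, b))


print(minkowski((0, 0), (3, 4)))        # Euclidean
print(minkowski((0, 0), (3, 4), p=1))   # Manhattan
print(hamming((1, 0, 1), (1, 1, 0)))
```

With only two attributes the Euclidean case is just the Pythagorean formula, so a dedicated two-argument function would work equally well.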

Weighted distance

Instead of computing an average of the neighbors, we can compute a weighted average of the neighbors. A common way to do this is to weight each of the neighbors by a factor of $1/d$, where $d$ is its distance from the test example.
The weighted average of neighbors is then $\frac{\sum_{i=1}^{k} y_i / d_i}{\sum_{i=1}^{k} 1 / d_i}$, where $y_i$ is the output and $d_i$ is the distance of the $i$-th neighbor.
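A small sketch of the inverse-distance weighting just described. Note that this is the plain 1/d scheme, not the Gaussian weighting adopted for the actual implementation; the `eps` term is my own addition to avoid dividing by zero when a neighbor coincides with the query.

```python
def weighted_average(neighbors, eps=1e-12):
    """Inverse-distance weighted mean of (distance, output) pairs."""
    weights = [1.0 / (d + eps) for d, _ in neighbors]
    total = sum(w * y for w, (_, y) in zip(weights, neighbors))
    return total / sum(weights)


# A neighbor at distance 1 counts twice as much as one at distance 2.
print(weighted_average([(1.0, 10.0), (2.0, 40.0)]))
```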

For our implementation, we chose to use a weighted distance following a paper [1] which proposes another improvement to the basic k-NN, where the weights of the nearest neighbors are assigned according to a Gaussian distribution.

If we are happy with an implementation that takes $O(N)$ execution time, where $N$ is the number of training examples, then that is the end of the story. If not, there are possible optimizations using indexes based on additional data structures, e.g. k-d trees or hash tables, which I might write about in the future.

Strength and Weakness

k Nearest Neighbor estimation was proposed sixty years ago, but because of the need for large memory and computation, the approach was not popular for a long time. With advances in parallel processing and with memory and computation getting cheaper, such methods have recently become more widely used.
Unfortunately, it can still be quite computationally expensive on a large training dataset, as we need to compute the distance to every sample.
Some indexing (e.g. k-d tree) may reduce this cost.

Also, when we consider low-dimensional spaces and have enough data, NN works very well in terms of accuracy, as there are enough nearby data points to get a good answer. As the number of dimensions rises, the algorithm performs worse, because the distance measure becomes less meaningful when the dimension of the data increases significantly.

On the other hand, k-NN is quite robust to noisy training data, especially when a weighted distance is used.

Dataset

To test our k-NN implementation we will perform experiments using a version of the automobile dataset from the UC Irvine Repository. The problem will be to predict the miles per gallon (mpg) of a car, given its displacement and horsepower. Each example in the dataset corresponds to a single car.

Number of Instances: 291 in the training set, 100 in the test set
Number of Attributes: 2 continuous input attributes, one continuous output

Predict

Using the data in the training set, we predicted the output for each example in the test set for three values of $k$, and reported the squared error on the test set.
As we can see, the test error goes down as $k$ increases.

I finally found some time to do some machine learning. It is something I have always wanted to start practicing, as it is pretty clear that it is the future of complex problem solving.
Indeed, for some tasks, we do not have an algorithm we can write and execute, so we make it up from the data.

A typical example of a problem ML tries to solve is classification. It can be expressed as the ability, given some input data, to assign a ‘class label’ to a sample.

To make things clearer, let’s look at an example. Imagine we performed an analysis on samples of objects and collected their specs. Now, given this information, we would like to know if an object is window glass (from a vehicle or building) or not window glass (containers, tableware, or headlamps). Unfortunately, we do not have a formula which, given these values, will provide us with the answer.

Someone who has handled glass might be able to tell whether it is window glass or not just by looking at it or touching it. That is because they have acquired experience by looking at many examples of different kinds of glass. That is exactly what happens with machine learning. We say that we ‘train’ the algorithm to learn from known examples.

We provide a ‘training set’ where we specify both the input specs of each sample and its category. The algorithm goes through the examples, learns the distinctive features of window glass, and can then infer the class of a given uncategorized example.

We will use a dataset titled ‘Glass Identification Database’, created by B. German from Central Research Establishment Home Office Forensic Science Service.
The original dataset classified the glass into 7 classes: 4 types of window
glass classes, and 3 types of non-window glass classes. Our version
treats all 4 types of window glass classes as one class, and all 3 types of
non-window glass classes as one class.

Naive Bayes classifier

One of the simplest yet most effective algorithms to try on a classification problem is Naive Bayes.
It is a probabilistic method based on Bayes’ theorem, with the naive assumption of independence between the input attributes.

We define C as the class we are analyzing and x as the input data or observation.
The following equation, which is Bayes’ theorem, gives the probability of class C given the observation x. It is equal to the probability of class C (without considering the input) multiplied by the probability of the observation given the class C, divided by the probability of the observation:

$$P(C \mid x) = \frac{P(C)\,P(x \mid C)}{P(x)}$$

P(C) is also called the ‘prior probability’ because it is the knowledge we have as to the value of C before looking at the observables x.
We also know that P(C = 0) + P(C = 1) = 1.

P(x | C) is called the class likelihood, which is the probability that an event belonging to C has the associated observation value x.
In statistical inference, we make a decision using the information provided by a sample. In this case, we assume that the sample is drawn from some distribution that obeys a known model, for example, Gaussian.
Part of this task is to generate the Gaussian that describes our data, so we can use the probability density function to compute the probability for a given attribute [2].
As already mentioned, every attribute will be treated as independent from the others.

Finally, P(x), also called the evidence, is the probability that an observation x is seen, regardless of the class C of the example.

The above equation gives the ‘posterior probability’, which is the probability of class C after having seen the observation x.

At this point, given the posterior probability of several classes, we are able to decide which one is the most likely. It is interesting to notice that the denominator would be the same for all the classes, so we can simplify the calculation by comparing only the numerator of the Bayes’ theorem.

Read data

First things first: we want to read our dataset so we are able to perform analysis on it. It is a CSV file, so we could use the csv Python library, but I personally prefer to use something more powerful like pandas.

pandas.read_csv will read our CSV file into a DataFrame, which is a two-dimensional tabular data structure with labeled axes. In this way, our dataset will be damn easy to manipulate.
I also decided to label my columns so everything will be much clearer.

Now that we have our dataset in memory we want to split it into two parts: the training set and the test set. The former will be used to train our ML model, while the latter to check how accurate the model is.

The following code will split the data dividing the dataset in chunks (based on the number of blocks_num) and choose as a test set the chunk at position test_block which will also be removed from the training set.
If nothing is provided apart from the dataset, the function will just use the same data for both training and test sets.
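The split described above could look roughly like this. It is a sketch on a plain list of rows rather than a DataFrame; `split_dataset` is a name I made up, while `blocks_num` and `test_block` are the parameters mentioned in the text.

```python
def split_dataset(rows, blocks_num=1, test_block=0):
    """Split `rows` into (train, test), using chunk `test_block` as the test set.

    With the default blocks_num=1 the same data is returned for both sets.
    """
    if blocks_num <= 1:
        return list(rows), list(rows)
    n = len(rows)
    # Chunk boundaries; integer arithmetic spreads any remainder across chunks.
    start = test_block * n // blocks_num
    end = (test_block + 1) * n // blocks_num
    test = rows[start:end]
    train = rows[:start] + rows[end:]
    return train, test


rows = list(range(10))
print(split_dataset(rows, blocks_num=5, test_block=1))
```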

Prior

Estimating the P(C) of a given training sample is pretty straightforward.
Prior probabilities are based on previous experience, in this case, the percentage of a class in the dataset.

We want to count the frequency of each class and get the ratio by dividing by the number of examples. The code to do so is extremely concise, also because pandas library makes the calculation of frequencies trivial.
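For illustration, here is the same computation in plain Python rather than pandas (the function name and the toy labels are assumptions for the example):

```python
from collections import Counter


def class_priors(labels):
    """Relative frequency of each class label in the training set."""
    counts = Counter(labels)
    return {label: n / len(labels) for label, n in counts.items()}


print(class_priors(["window", "window", "window", "other"]))
```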

Mean and variance

To calculate the ‘pdf’ (probability density function) we need to know what the distribution that describes our data looks like. To do that we need to compute the mean and the variance (or, alternatively, the standard deviation) of each attribute for every single class.
Since we have 9 attributes and 2 classes in our dataset, we will end up having 18 mean-variance pairs.

Again for this task we can use the helper functions provided by pandas, where we select the column of interest and call their mean() and std() methods.
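A rough equivalent without pandas, using the standard `statistics` module. The function name and data layout are assumptions for the example; `stdev` is the sample standard deviation, which is also what the pandas `std()` method computes by default.

```python
from collections import defaultdict
from statistics import mean, stdev


def class_stats(rows, labels):
    """Return {label: [(mean, std) for each attribute]}."""
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    return {
        # zip(*members) turns the rows of a class into per-attribute columns
        label: [(mean(col), stdev(col)) for col in zip(*members)]
        for label, members in by_class.items()
    }


rows = [(1.0, 10.0), (3.0, 14.0), (5.0, 5.0), (7.0, 5.0)]
labels = ["a", "a", "b", "b"]
print(class_stats(rows, labels))
```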

Gaussian Probability Density Function

The function to compute the ‘pdf’ is just a static method that takes as input the value of the attribute and the description of the Gaussian (mean and variance) and returns a probability according to the ‘pdf’ equation.
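The Gaussian density itself is a one-liner. Here is a sketch as a free function taking the variance directly (the original is described as a static method, so the exact signature is an assumption):

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a Gaussian with the given mean and variance, evaluated at x."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


print(gaussian_pdf(0.0, 0.0, 1.0))
```

Strictly speaking the returned value is a density rather than a probability, so it can exceed 1 when the variance is small.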

Predict

Now that we have everything in place, it is time to predict our classes.

Basically, the following iterates through the test set and, for each sample, calculates the probability of every class using Bayes’ theorem. The only difference here is that we use log probabilities, since the probabilities for each class given an attribute value are small and their product could underflow.

As a result, we need to take as a prediction the class with the highest probability. If two or more classes end up having the same probability we decided to take the class which comes earlier in reverse alphabetical order, but this was not really needed for the given dataset.
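A sketch of this prediction step for a single sample, under some assumptions of mine: `priors` maps each class to its prior, `stats` maps each class to a list of `(mean, variance)` pairs (one per attribute), and the evidence P(x) is dropped because it is the same for every class.

```python
import math


def log_gaussian(x, mean, var):
    """Logarithm of the Gaussian density, computed without exponentiating."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)


def predict(sample, priors, stats):
    """Return the class with the highest log posterior (up to a constant)."""
    scores = {}
    for label, prior in priors.items():
        score = math.log(prior)
        for x, (mean, var) in zip(sample, stats[label]):
            score += log_gaussian(x, mean, var)  # sum of logs = log of product
        scores[label] = score
    # max() keeps the first maximum, so iterating in reverse alphabetical
    # order breaks ties in favour of the label that comes first in that order.
    return max(sorted(scores, reverse=True), key=scores.get)


priors = {"a": 0.5, "b": 0.5}
stats = {"a": [(0.0, 1.0)], "b": [(5.0, 1.0)]}
print(predict((0.2,), priors, stats))
```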

Accuracy

Once we obtain the predictions, we can compare them to the class values present in the test dataset and calculate the ratio of correct ones over the total number of predictions. This measure is called accuracy and allows us to estimate the quality of the ML model used.

In our tests, we obtained a 90% accuracy using the same dataset for both training and test.

Cross validation

Now that we know how to perform a prediction, let’s look at the data again. Does it really make any sense to train an algorithm on something and then test it on the same data? Probably not.
We want to have two different sets then, but this is not always possible when you do not have enough data.

Our example dataset contains 200 records. Ideally, we would like to squeeze it as much as we can and perform a test on all 200 samples, but then we would not have anything left to train the model.

The way ML people do this is called cross validation. The dataset is divided into chunks (as shown before), say 5 for example; the model is trained on 4 of the 5 chunks and the remaining chunk is used for the test. This operation is repeated as many times as the number of chunks, so that the test is performed on every chunk.
Finally, the accuracy values collected for every repetition are averaged.
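The procedure can be sketched generically as follows. `train_fn` and `predict_fn` are hypothetical callbacks standing in for whatever model is being evaluated, and rows and labels are assumed to be plain lists.

```python
def accuracy(predictions, truth):
    """Fraction of predictions that match the true labels."""
    correct = sum(p == t for p, t in zip(predictions, truth))
    return correct / len(truth)


def cross_validate(rows, labels, folds, train_fn, predict_fn):
    """Average accuracy over `folds` train/test splits."""
    n = len(rows)
    scores = []
    for i in range(folds):
        start, end = i * n // folds, (i + 1) * n // folds
        # Train on everything except chunk i, test on chunk i.
        model = train_fn(rows[:start] + rows[end:], labels[:start] + labels[end:])
        predictions = [predict_fn(model, row) for row in rows[start:end]]
        scores.append(accuracy(predictions, labels[start:end]))
    return sum(scores) / len(scores)


# Toy model: always predict the most frequent training label.
majority = lambda rows, labels: max(set(labels), key=labels.count)
constant = lambda model, row: model
print(cross_validate(list(range(10)), ["a"] * 8 + ["b"] * 2, 5, majority, constant))
```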

Again, even using 5-fold cross validation we obtained the same accuracy equal to 90%.

Zero-R classifier

Zero-R classifier simply predicts the majority class (the class that is most frequent in the training set).
Sometimes a not-very-intelligent learning algorithm can achieve high accuracy on a particular learning task simply because the task is easy. For example, it can achieve high accuracy in a 2-class problem if the dataset is very imbalanced.
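Zero-R is short enough to write out in full. This is a sketch with made-up function names and toy labels, not the numbers reported for the glass dataset.

```python
from collections import Counter


def zero_r_fit(train_labels):
    """Return the majority class of the training set."""
    return Counter(train_labels).most_common(1)[0][0]


def zero_r_accuracy(train_labels, test_labels):
    """Accuracy obtained by always predicting the training majority class."""
    majority = zero_r_fit(train_labels)
    return sum(label == majority for label in test_labels) / len(test_labels)


labels = ["window"] * 3 + ["other"]
print(zero_r_fit(labels), zero_r_accuracy(labels, labels))
```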

Running a Zero-R classifier on our dataset, just as a comparison with Naive Bayes, achieved 74.5% accuracy.

Comparing the two, we see that our Naive Bayes model, at 90%, is considerably more accurate than this simplistic baseline.

Popular implementation

One of the most popular Python libraries implementing ML algorithms for classification, regression and clustering is scikit-learn.
The library also has a Gaussian Naive Bayes classifier implementation and its API is fairly easy to use. You can find the documentation and some examples here: http://scikit-learn.org/…/sklearn.naive_bayes.GaussianNB.html

This implementation is definitely not production ready, even though it obtains the same predictions as scikit-learn, since what is actually happening under the hood is the same.
It has not been engineered much, as its only purpose was to play with Naive Bayes. Still, looking at a simple implementation is often easier and more effective. You can find the whole source code and the dataset used here: https://github.com/amallia/GaussianNB


Sorted integers compression with Elias-Fano encoding
https://www.antoniomallia.it/sorted-integers-compression-with-elias-fano-encoding
2018-01-06

In the previous post we discovered how to compress a set of integers by representing it as a bitmap and then compressing the latter using a succinct representation.

This post instead is about compressing monotone non-decreasing lists of integers using Elias-Fano encoding. It may sound like a niche algorithm that solves an infrequent problem, but it is not.
Inverted indexes [1], the most common data structure used by search engines to index their data, are made of lists of increasing integers corresponding to the documents of the collection. I might write again in the future about inverted indexes in a more comprehensive way; if this topic interests you, please let me know with a comment.

Elias-Fano encoding was proposed independently by Peter Elias and Robert Mario Fano during the 70s, but its usefulness has been rediscovered only recently. The Elias-Fano representation is an elegant encoding scheme to represent a monotone non-decreasing sequence of n integers from the universe [0 . . . m) occupying $2n + n \lceil \log_2 \frac{m}{n} \rceil$ bits, while supporting constant time access to the i-th element.

If we compare the space requirement of Elias-Fano encoding with the theoretical lower bound, we realize that this structure is close to the bound, which is why it has been dubbed a quasi-succinct index [2].

In the Elias-Fano representation each integer is first binary encoded using $\lceil \log_2 m \rceil$ bits. Each binary representation of the elements is split in two: the higher part, consisting of the first (left to right) $\lceil \log_2 n \rceil$ bits, and the lower part, with the remaining $\lceil \log_2 m \rceil - \lceil \log_2 n \rceil \approx \lceil \log_2 \frac{m}{n} \rceil$ bits.
The concatenation of the lower part of each element of the list is the actual stored representation and trivially takes $n \lceil \log_2 \frac{m}{n} \rceil$ bits. The higher part, instead, is a unary representation, specifically a bit-vector of size $n + 2^{\lceil \log_2 n \rceil} \approx 2n$ bits.
It is constructed starting from an empty bit-vector: we add a 0 as a stop bit for each possible value representable with the bits of the higher part, and we add a 1 for each value actually present, positioning it before the correct stop bit. This makes clearer why we use roughly 2n bits: one bit set to 1 for each of the n elements and one 0 bit for each of the possible distinct values obtainable with $\lceil \log_2 n \rceil$ bits. Finally, the Elias-Fano representation is the bit-vector resulting from the concatenation of the higher and the lower part.

Figure 1: An example of Elias-Fano encoding of a sorted integer sequence.

As an example, let's take the sorted list {2,3,5,7,11,13,24} as shown in Figure 1. In this case we know that m (the universe of the list) is equal to 24 and to represent all the elements in fixed-length binary we need 5 bits per element.
Then we want to split the binary representation of each element in two parts, the higher and the lower. Since we have 7 elements in total, we will use 3 bits for the higher part and 2 for the lower one as explained previously. If we consider 2 => 0b00010 we will have 000 and 10 respectively.
We repeat this process for every element of the list and we concatenate all the lower parts together.
Regarding the higher bits, since we use 3 bits per element we can imagine having $2^3 = 8$ buckets, and we associate with each bucket a counter corresponding to the cardinality of that bucket. For 2 we will increment the 000 bucket. To the same bucket goes 3, while 5 will increment 001, and so on. There might be cases where the counter of a bucket is equal to zero, as it is for 100 in Figure 1.
Finally, we use unary encoding to represent the buckets’ counters, specifically we append as many 1-bits as the counter value of each bucket followed by a 0-bit.
In the case of the 000 bucket we will add 2 set-bits and an unset one to separate the following bucket.
The final Elias-Fano encoding is obtained by concatenating higher and lower bits just obtained.
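To make the construction concrete, here is a small Python sketch that builds the two parts as strings of '0' and '1' characters. This is purely for readability: it is not how my Go implementation stores the bits, and the function name and the explicit `high_bits`/`low_bits` parameters are assumptions for the example.

```python
def elias_fano_encode(values, high_bits, low_bits):
    """Return (upper, lower) bit strings for a sorted list of integers.

    Assumes low_bits >= 1 and that every value fits in high_bits + low_bits bits.
    """
    mask = (1 << low_bits) - 1
    # Lower part: the low bits of every element, concatenated.
    lower = "".join(format(v & mask, f"0{low_bits}b") for v in values)
    # Upper part: one counter per bucket, written in unary (c ones, then a zero).
    counts = [0] * (1 << high_bits)
    for v in values:
        counts[v >> low_bits] += 1
    upper = "".join("1" * c + "0" for c in counts)
    return upper, lower


upper, lower = elias_fano_encode([2, 3, 5, 7, 11, 13, 24], high_bits=3, low_bits=2)
print(upper, lower)
```

For the example list the upper part has 15 bits (7 ones and 8 stop bits) and the lower part has 14 bits.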

Query

Now, we show how to get an element given the information we have. Interestingly, with this type of encoding, we can have random access for both Access and NextGEQ operations in nearly constant time.

Access

Access(i) is the operation of retrieving the element at position i from the original list of elements.
To get the lower part we can simply jump to the corresponding bits, since we know the length stored for each element. To compute the higher part we need to perform select_1(i) - i, where select_1(i) is the operation that returns the position of the i-th set bit (counting both from zero); there are techniques to perform it in nearly constant time [3].
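A sketch of Access on the string-based representation of the example list. The select here is a naive linear scan, so it only illustrates the arithmetic and not the constant-time behaviour; names and bit strings are assumptions for the example.

```python
def select1(bits, i):
    """Position of the i-th set bit (0-based), found with a linear scan."""
    seen = -1
    for pos, bit in enumerate(bits):
        if bit == "1":
            seen += 1
            if seen == i:
                return pos
    raise IndexError(i)


def access(upper, lower, low_bits, i):
    """Rebuild the i-th element from its higher and lower parts."""
    high = select1(upper, i) - i  # zeros before the i-th one = bucket number
    low = int(lower[i * low_bits:(i + 1) * low_bits], 2)
    return (high << low_bits) | low


# Upper and lower parts of {2,3,5,7,11,13,24} with 3 high and 2 low bits.
upper, lower = "110110101000100", "10110111110100"
print([access(upper, lower, 2, i) for i in range(7)])
```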

NextGEQ

Another interesting operation is NextGEQ(x), which returns the next integer of the sequence that is greater than or equal to x.
We retrieve the position p by performing select_0(hx) − hx, where hx is the bucket to which x belongs.
At this point, we start scanning the elements from position p and stop at the first one greater than or equal to x. The scan traverses at most the size of the bucket.
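A naive sketch of NextGEQ on the same string-based representation. Instead of a real select_0 it counts the 1-bits that precede bucket hx, which is the position p of the first element that could qualify (the number of elements in smaller buckets); all names are assumptions for the example.

```python
def decode_at(upper, lower, low_bits, i):
    """Decode the i-th element (recomputes select naively on every call)."""
    ones = [pos for pos, bit in enumerate(upper) if bit == "1"]
    low = int(lower[i * low_bits:(i + 1) * low_bits], 2)
    return ((ones[i] - i) << low_bits) | low


def next_geq(upper, lower, low_bits, x):
    """Smallest element >= x, or None if there is none."""
    n = upper.count("1")
    hx = x >> low_bits  # bucket of x
    zeros_seen, p = 0, 0
    for bit in upper:
        if zeros_seen == hx:  # reached the start of bucket hx
            break
        if bit == "0":
            zeros_seen += 1
        else:
            p += 1  # one more element living in a smaller bucket
    for i in range(p, n):
        value = decode_at(upper, lower, low_bits, i)
        if value >= x:
            return value
    return None


upper, lower = "110110101000100", "10110111110100"
print(next_geq(upper, lower, 2, 6), next_geq(upper, lower, 2, 14))
```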

Conclusion

Elias-Fano is a very effective encoding algorithm, as it allows random access to the sequence in constant time without decoding it. As highlighted in the academic literature [4], Elias-Fano demonstrates its power in particular in list intersection, outperforming other forms of compression.

Implementations

I would like to point out my Golang implementation (https://github.com/amallia/go-ef) of Elias-Fano, which is still in early stage. Feel free to get involved in the development.

A very good implementation is the one from Facebook present in Folly (https://github.com/facebook/folly/blob/master/folly/experimental/EliasFanoCoding.h).

Leave me a comment if you have written your own implementation and I will be more than happy to add it to the list.

On-the-fly encoding and decoding of bitmaps
https://www.antoniomallia.it/on-the-fly-encoding-and-decoding-of-bitmaps
2017-12-12

A bitmap, also referred to as bit-vector or bit-array, is a sequence of 0s and 1s which typically encodes a more complex object.

A common example of this is a set of numbers, where each element is indicated as a set bit in a bitmap of length equal to the greatest element plus one (as we count from zero), commonly referred to as the universe. As an example, the set {3,5,21,4,23,12} can be represented as 101000000001000000111000, where - counting right to left - we have a 1-bit at the positions corresponding to the elements of the initial set.
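As a quick illustration in Python (the function names are mine), the bitmap for a set can be built by setting one bit per element of an integer and printing it in binary:

```python
def set_to_bitmap(values):
    """Bitmap string for a set of non-negative integers, position 0 on the right."""
    universe = max(values) + 1
    word = 0
    for v in values:
        word |= 1 << v
    return format(word, f"0{universe}b")


def contains(bitmap, i):
    """Membership test: look at the bit at position i, counting from the right."""
    return bitmap[len(bitmap) - 1 - i] == "1"


bitmap = set_to_bitmap({3, 5, 21, 4, 23, 12})
print(bitmap, contains(bitmap, 12), contains(bitmap, 13))
```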

The importance of bitmaps is irrefutable, which is why I recently started investigating the most effective techniques to compress them.
Being able to reduce their memory usage means being able to store more data or, possibly, fit it in a lower level of the cache hierarchy which immediately translates to faster access.

The technique I would like to discuss sets the base for more complex ones, which I will try to cover in a future blog post. The most important property of the following compression algorithm is the ability to query the bitmap without fully decompressing it. Considering the set we saw in the previous example, this would be extremely appealing, as we would be able to tell if an element i is present or not just by looking at the bit at position i.
The compression I am going to present falls into the category of data structures called succinct data structures, which allow efficient query operations while using an amount of space that is close to the information-theoretic lower bound.

Now we split the bitmap into fixed-length blocks. In the previous example, the bitmap was 24 bits long; if we split it into blocks of 3 bits each we obtain eight blocks, four of which are distinct.

101 000 000 001 000 000 111 000

The idea is to code each block independently from the others, using a pair of values <Ci,Oi> for the i-th block.
The first element of the pair is the cardinality of the block, also referred to as population count or just popcount, while the second is the offset in the table that contains all the distinct permutations (so combinations) of the bits in that block [1].

Let’s say we want to encode the first block 101. Calculating C is trivial as we need to count the number of bits set to 1. This can also be done in hardware by most of the modern CPUs (I will come back to this topic again in the future), but for now, we can rely on the following naive implementation.
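A naive population count, sketched here in Python with the block held as an integer (the representation and the function name are assumptions for the example):

```python
def popcount(block):
    """Count the bits set to 1 by testing the lowest bit and shifting right."""
    count = 0
    while block:
        count += block & 1
        block >>= 1
    return count


print(popcount(0b101))
```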

Now let's imagine we have a table containing all the ordered permutations of the previous block. If we iterate over the rows of this table and stop when we reach the entry that matches our block, we have computed the offset for that block.
In our example, the offset of the block 101 in the following table is 1.

Offset  Block
0       011
1       101
2       110

In this way we can encode our block with the two integers C = 2 and O = 1.

Whenever we would like to decode the block from the given C and O we need to select the appropriate table of combinations using C and then move to the index O of that table to retrieve the original representation.
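For small blocks the table-based scheme can be written down directly. The sketch below builds the table with `itertools.combinations`, assuming blocks are held as integers and rows are sorted in ascending order with offsets counted from zero; the function names are hypothetical.

```python
from itertools import combinations


def permutation_table(block_size, count):
    """All blocks of `block_size` bits with exactly `count` bits set, ascending."""
    table = [
        sum(1 << pos for pos in positions)
        for positions in combinations(range(block_size), count)
    ]
    return sorted(table)


def encode_block(block, block_size):
    """Return the pair (C, O) for a block."""
    count = bin(block).count("1")
    return count, permutation_table(block_size, count).index(block)


def decode_block(count, offset, block_size):
    """Recover the block from its pair (C, O)."""
    return permutation_table(block_size, count)[offset]


print([format(b, "03b") for b in permutation_table(3, 2)])
print(encode_block(0b101, 3))
```

Building the table on every call is wasteful, of course; it only serves to show what the pair of integers refers to.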

In this setting, if we are only interested in the i-th bit, we have to decode the entire block and apply a suitable mask to extract it.

Cost analysis

So far we have seen that we need to store a pair of integers for each block, so for $\lceil m/b \rceil$ blocks, where m is the original bitmap length and b is the fixed block size. We know that the population count of a block cannot be greater than the block size itself, since there cannot be more than b set bits in a block. Then we can state that each C coefficient can be stored in $\lceil \log_2 (b + 1) \rceil$ bits.
Regarding the offsets we know that they are indexes in a table, but the table size depends on two factors: the blocks size and the number of set bits in it.
We know that the former is the same for each block, but the population count can vary.

There are two lucky cases where the cardinality gives us enough information to infer the offset:

if C is 0, as we know that the only permutation is the one where all the bits are unset.

if C is equal to the size of the block, where the block has all the bits set.

In these two cases we can store the offset implicitly and so we do not sacrifice any extra space. For all the other possibilities we can always store the offset in $\lceil \log_2 \binom{b}{C} \rceil$ bits.

For instance, the original bitmap we used in the previous example used 24 bits of actual data in its uncompressed form. To store the cardinality of each block we would need 2 bits per block, for a total of 16 bits. Then, the blocks containing all-zeros or all-ones are encoded implicitly, while the two blocks with C = 1 and C = 2 need 2 additional bits each. This sums up to 20 bits used to represent our uncompressed 24-bit bitmap, with a saving of 4 bits, or ~17% of the initial size.
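The arithmetic above can be checked with a few lines of Python. This is a sketch under stated assumptions: the bitmap is given as a string, each cardinality is stored in ceil(log2(b + 1)) bits, each offset in ceil(log2 C(b, C)) bits, and the function name is made up.

```python
from math import ceil, comb, log2


def compressed_size(bitmap, b):
    """Bits needed to store all (C, O) pairs for blocks of size b."""
    blocks = [bitmap[i:i + b] for i in range(0, len(bitmap), b)]
    class_bits = ceil(log2(b + 1))
    total = 0
    for block in blocks:
        c = block.count("1")
        # comb(b, c) is 1 for all-zeros and all-ones blocks, so the offset costs 0 bits.
        total += class_bits + ceil(log2(comb(b, c)))
    return total


bitmap = "101000000001000000111000"
size = compressed_size(bitmap, 3)
print(size, len(bitmap) - size)
```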

On-the-fly generation of ordered binary permutations

At this point we know how to encode and decode a block; what is still missing is a way to generate the lookup table of the permutations. The answer is that we do not generate it at all: the ordered binary permutations are computed on-the-fly [2].

For small blocks it would actually be doable and probably also convenient, but if the block gets bigger then it is just not feasible. What is needed is an algorithm to compute offsets for a given block and being able to go back from the offset to the original block representation in a reasonably effective way.

Compute offsets

The aim of the computation of the offset is to find the index of a block, given its size and population count, in a table listing all the possible permutations. Moreover, it needs to be deterministic and without the overhead of an actual table.
Basically, the following algorithm iterates over every bit in the block, starting from the leftmost. If the bit is unset it moves to the next bit (so the block size decreases but the cardinality does not vary); if it is set we increase the offset by a quantity equal to $\binom{n}{count}$ (now both block size and cardinality decrease by one), where n is the position of the bit, counted from the right starting at zero, and count is the number of set bits still to be encountered, including the current one.
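Here is a sketch of one way to compute the offset without a table, using binomial coefficients. It assumes the block is an integer, bits are scanned from the leftmost one, and positions are counted from the right starting at zero; with these conventions it yields the rank of the block among all blocks with the same popcount in ascending order. The function name is mine.

```python
from math import comb


def compute_offset(block, block_size):
    """Rank of `block` among blocks of the same size and popcount."""
    count = bin(block).count("1")  # set bits still to encounter
    offset = 0
    for n in range(block_size - 1, -1, -1):  # n = position of the current bit
        if (block >> n) & 1:
            # Skip all the blocks that have a 0 here and `count` ones below it.
            offset += comb(n, count)
            count -= 1
    return offset


print([compute_offset(b, 3) for b in (0b011, 0b101, 0b110)])
```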

Decode offsets on the fly

Now we need to reverse the encoding process.
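We start with n at the position of the leftmost bit (counting from the right, starting at zero). If $offset \ge \binom{n}{count}$, then the first bit of the block was a 1 and we decrement both n and count and subtract $\binom{n}{count}$ from the offset; otherwise it was a 0 and we decrement only n.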
Every iteration the block size decreases by one. We can stop when we have processed the whole block or when the count reaches 0 as it means that the remaining bits are all unset.
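The reverse direction can be sketched as follows, under these assumptions: the block is rebuilt as an integer, positions are counted from the right starting at zero, and offsets rank blocks of equal popcount in ascending order. The function name is mine.

```python
from math import comb


def decode_offset(offset, block_size, count):
    """Rebuild the block with `count` set bits that has the given offset."""
    block = 0
    for n in range(block_size - 1, -1, -1):  # n = position of the current bit
        if count == 0:
            break  # the remaining bits are all unset
        if offset >= comb(n, count):
            # The bit at position n was a 1.
            offset -= comb(n, count)
            block |= 1 << n
            count -= 1
    return block


print([format(decode_offset(o, 3, 2), "03b") for o in range(3)])
```

Stopping the loop as soon as the requested position has been reached would give the single-bit variant mentioned below.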

Since most of the times we are interested in a single bit of the block and blocks can be quite long and so slower to decode, we can perform two forms of optimization. The former is to stop the iteration when we reach the given position, the latter is to perform a binary search instead of the linear scan we already described. The combination of the two solutions is ideal, indeed a linear scan is still faster when the position we are interested in is within the first elements.

Conclusion

I feel this is a nice and elegant way to compress a bitmap while keeping the ability to decode a block in constant time, as on-the-fly decoding only depends on the block size, which is fixed. I am also sure that further improvements for faster decoding are possible with the use of SIMD instructions.

Feel free to get in touch if you want to share any feedback or have any ideas about the topic and would like to dig more into it.

New website and personal blog
https://www.antoniomallia.it/new-blog
2017-08-22

This is my first post in my new personal blog.
I will use this place to write about technology, programming and, maybe, to keep you updated about my personal life too.

I recently started using Twitter more often, but I realized that sometimes it is not enough when you want to express a longer or more complex concept. So, in those cases I will write here, but please keep following me on Twitter if you are interested in my brief opinions too.