How to Fail a Coding Interview

Some time ago I attended a coding interview for a Data Scientist position
at a start-up. I felt well-prepared and confident, having practiced lots of
programming puzzles, coded various Machine Learning techniques from scratch,
and gained several years of programming experience. What could go wrong?

This post tells the story of a very unusual coding interview and is my attempt to
analyze its results and derive some useful insights even from a failure.

Publication Date: 22 March 2019

The Best Format to Save Pandas Data Frame

When working on data analysis projects, I usually use Jupyter notebooks and
the great pandas library to process and move my data around. To store the data
between sessions, I usually use binary formats that preserve data types
and encode their content efficiently.

However, there are plenty of binary formats for storing data frames on disk.
How can we know which one is better for our purposes? Well, we can try a few of
them and compare! In this post, I run a little benchmark to understand which
format is the best for short-term storage of pandas data.

Publication Date: 14 March 2019

How to Build a Flexible CLI with Standard Python Library

Python is quite often used to write CLI-based utilities and automation tools. There are plenty of third-party libraries that make this process easy and straightforward. However, I’ve recently realized that I very often use the good old argparse when writing my snippets, and there is also a lot of legacy code that relies on this package. That’s why I’ve decided to create a single reference point for myself showing how to use it. In this post, we are going to take a close look at the library and gradually build a simple CLI that generates plots with the matplotlib library.
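To give a flavor of what such a CLI can look like, here is a minimal sketch that uses only argparse from the standard library. The option names (--function, --points, --output) and the build_parser helper are assumptions made for illustration, not the interface built in the post, and the plotting step itself is left out.

```python
import argparse


def build_parser():
    """Build a parser for a hypothetical plotting utility."""
    parser = argparse.ArgumentParser(
        prog="plot",
        description="Generate a simple plot (illustrative sketch).",
    )
    # Restrict the choice of function to a small, known set.
    parser.add_argument("-f", "--function", choices=["sin", "cos"],
                        default="sin", help="function to plot")
    # argparse converts the value to int and reports invalid input itself.
    parser.add_argument("-n", "--points", type=int, default=100,
                        help="number of points to sample")
    parser.add_argument("-o", "--output", default="plot.png",
                        help="path of the image file to write")
    return parser


# Passing an explicit list instead of reading sys.argv keeps the example
# easy to try from a notebook or an interactive session.
args = build_parser().parse_args(["--function", "cos", "-n", "10"])
print(args.function, args.points, args.output)
```

The parsed values arrive as attributes of a namespace object, so the plotting code only has to read args.function, args.points and args.output.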

Publication Date: 30 January 2019

Deep Learning Model Training Loop

PyTorch is a fantastic and easy-to-use Deep Learning framework. It provides you
with all the fundamental tools to build a machine learning model: CUDA-driven
tensor computations, optimizers, neural network layers, and so on. However, to train a model
one needs to assemble all these things into a data processing pipeline. Recently the developers
released the 1.0 version of the framework, and I’ve decided it is a good time to try my hand
at writing a generic training loop implementation. In this post, I describe this process
and share some interesting observations about its development.

Publication Date: 09 December 2018

Building Simple Recommendation System with PyTorch

Recently I’ve started watching the fast.ai lectures, a great
online course on Deep Learning. In one of his lectures, the author
discusses building a simple neural-network-based recommendation system
and applies it to the MovieLens dataset. The lecture relies on
the library developed by the author to run the training process. However,
I strongly wanted to learn more about the PyTorch framework that sits
under the hood of the author’s code. In this post, I describe the process of
implementing and training a simple embeddings-based collaborative filtering
recommendation system using PyTorch, Pandas, and Scikit-Learn.

Publication Date: 17 August 2018

Classifying Quantized Dataset with Random Forest Classifier (Part 2)

In this post, we’re going to finish the work started in the previous one and
finally classify the quantized version of the
Wrist-worn Accelerometer Dataset.
There are many ways to classify datasets with numerical features, but
the Decision Tree is one of the most intuitive ones and is simple in
its underlying implementation. We are going to build a Decision Tree
classifier using the Numpy library and generalize it to
a Random Forest, an ensemble of randomly generated trees that is
less prone to data noise.

Publication Date: 27 June 2018

Using K-Means Clustering to Quantize Dataset Samples (Part 1)

Clustering algorithms are used to analyze data in an unsupervised fashion, in
cases when labels are not available or to get new insights about the dataset.
K-Means is one of the oldest clustering algorithms: it was developed
several decades ago but is still applied in Machine Learning tasks. One of the ways
to use this algorithm is vector quantization, a process that
reduces the dimensionality of the analyzed data. In this post, I’m going to
write a simple implementation of K-Means and apply it to the Wrist-worn Accelerometer Dataset.

Publication Date: 19 May 2018

Dogs Breeds Classification with Keras

Modern deep learning architectures show quite good results in various fields of
artificial intelligence. One of them is image classification. In this post,
I am going to see whether one can achieve accurate classification of images by
applying out-of-the-box ImageNet pre-trained deep
models from the Keras Python package.

The post was originally published on Medium. Then an updated and improved version
was ported to the Able.bio platform.

Publication Date: 09 April 2018

Generators-Based Data Processing Pipeline

Generators represent the powerful concept of gradually consumed (and possibly infinite)
streams of data. Almost every Python 3.x developer has encountered this kind of
lazy iteration (when using the range or zip built-ins, for example). However, generators can
be used not only as sources of data but also as coroutines organized into
data transformation chains. The post shows how to build a generator pipeline
that preprocesses images stored on external storage.
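As a rough illustration of the idea, the sketch below chains generator-based coroutines into a small transformation pipeline. It processes plain integers rather than images, and all the names (coroutine, keep_even, scale, collect, run_pipeline) are hypothetical helpers written for this example, not code from the post.

```python
import functools


def coroutine(func):
    """Prime a generator-based coroutine so it is ready to receive values."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        gen = func(*args, **kwargs)
        next(gen)  # advance to the first `yield`
        return gen
    return wrapper


@coroutine
def keep_even(target):
    """Forward only even numbers to the next stage."""
    while True:
        item = yield
        if item % 2 == 0:
            target.send(item)


@coroutine
def scale(factor, target):
    """Multiply each received item and pass it on."""
    while True:
        item = yield
        target.send(item * factor)


@coroutine
def collect(results):
    """Final stage: store everything that reaches the end of the chain."""
    while True:
        item = yield
        results.append(item)


def run_pipeline(source):
    results = []
    # Stages are wired from the last one to the first.
    pipeline = keep_even(scale(10, collect(results)))
    for item in source:
        pipeline.send(item)
    return results


print(run_pipeline(range(6)))  # [0, 20, 40]
```

Each stage knows only about the stage it sends data to, so steps can be added, removed or reordered without touching the others.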

Each Python object has a set of magic methods that can be overridden to
customize instance creation and behavior. One of the most widely used is
__init__, which initializes a newly created instance.
But there is one more magic method taking part in object creation, called __new__,
which actually creates the class’s instance. The post explains how to use
this method to dynamically switch the implementation of class methods.
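One possible way to do this, shown here only as a sketch, is to let the base class’s __new__ pick a concrete subclass at instantiation time. The Serializer classes and the fmt argument below are hypothetical and are not taken from the post.

```python
import json


class Serializer:
    """Base class whose __new__ selects the concrete implementation."""

    def __new__(cls, fmt="json"):
        if cls is Serializer:
            # Called as Serializer(...): choose a subclass by format name.
            impl = _IMPLEMENTATIONS[fmt]
            return super().__new__(impl)
        # Called on a subclass directly: create it as usual.
        return super().__new__(cls)

    def __init__(self, fmt="json"):
        # __init__ still runs, because the returned object is an
        # instance of (a subclass of) Serializer.
        self.fmt = fmt

    def dumps(self, data):
        raise NotImplementedError


class JsonSerializer(Serializer):
    def dumps(self, data):
        return json.dumps(data)


class ReprSerializer(Serializer):
    def dumps(self, data):
        return repr(data)


# Looked up at call time, so it can be defined after the subclasses.
_IMPLEMENTATIONS = {"json": JsonSerializer, "repr": ReprSerializer}

serializer = Serializer("json")
print(type(serializer).__name__)    # JsonSerializer
print(serializer.dumps({"a": 1}))   # {"a": 1}
```

The caller always writes Serializer(...), while the object that comes back carries the method implementations of whichever subclass was selected.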

Publication Date: 27 February 2018

Deep Learning Machine Software: Ubuntu, CUDA, and TensorFlow

Recently I’ve decided to build a simple deep learning machine with a single
GTX 1080Ti GPU, running Ubuntu 16.04. The machine’s assembly process
was quite straightforward. But while deploying the required software, a few
minor issues arose. It would be helpful to have instructions listing
the performed actions in case the system ever requires re-deployment.