
The network is not your friend. Machines will lose connectivity, sometimes permanently and sometimes only temporarily. The long tail of packet delivery times is merciless: latency can spike for minutes at a time, tricking you into thinking that your machines have lost connectivity. It is very hard to tell the difference between a loss of connectivity and a temporary spike in latency.

Everything will fail. In a distributed system with an unreliable network, each and every component, and probably each and every line of code, will inevitably fail. What happens when your machines lose power? Lose network connectivity? A disk dies? Another service fails? You run out of memory? Byzantine failures? (They happen!) Cascading failures? Race conditions?

Dynamic typing will slowly become a nightmare. As your code base grows into the tens of thousands of lines, how will you verify that your functions are always called with appropriately typed parameters? If you have a service-oriented architecture, how do you make sure that only appropriately typed parameters are passed into services when any service may suffer a Byzantine failure? If a race condition that presents itself only monthly in a package you depend on modifies the type of one of your objects, how does your system respond?

Reasoning about time will become very difficult. Causality is very difficult to establish when you cannot establish an order of events. Did the NFS server crash first, or did the worker pool? What caused what? Check the logs and you will see the same timestamps everywhere.

Refactoring a large, service-oriented system is hard. Even with industry-standard abstractions, changing the interface to a service is painful. If you make a change to one interface, how long will it take to update the entire codebase?

The only real test is production. However great your test coverage is, no test is like the real test of production. In production, things happen that you can’t even conceive of beforehand. Mistakes will be made. Incorrect configs and broken code will inevitably get pushed. Users will find every race condition.

Do not aim for 100%. No distributed system will work 100% of the time. Proof: earth hit by asteroid. So the best you can do is pick a target. Mine is 99%; I allow my system to fail 1 in 100 times in production. For a commercial system, you would pick something higher: 1 failure in 1,000 or 1 failure in 100,000.

TL;DR

tl;dr With mechanisms to track and free pointers, we can implement a simple garbage tracker / collector for functionalC. Garbage collection has already been added to lists, closures, and lifted types.

This is the first part of a series on implementing the garbage collector.

Background

After I posted the first version of functionalC to GitHub, there were quite a few complaints about its lack of memory management / garbage collection. So, I added a garbage tracker to functionalC.

I write tracker, rather than collector, because our simple system will just track allocated objects and free them when instructed to. It does not (yet) do reference counting and is blissfully unaware of the program’s execution. (More on that in part 2.)

Implementing the Tracker – A List of References

The first step in implementing our garbage tracker is keeping track of all allocated memory. To do this, we need a few things:

1. A way to name the type of each tracked object
2. A way to store a pointer to a tracked object, together with its type

Our TYPE enum works for 1. and our ref struct for 2. We will store groups of references as lists, hence the ref *next field.

Now, putting these together, the structure of our garbage tracker is simple: it just stores lists of references for tracking.

struct gc {
    ref *marked;        // these will not be freed unless unmarked
    ref *last_marked;   // append should be O(1)
    ref *unmarked;      // these will be freed when gc_collect is called
    ref *last_unmarked; // we want append to be O(1)
    void (*destructor_table[TYPE_COUNT])(void *);
    // TYPE_COUNT is the number of all tracked types
};

Our unmarked list stores the references that are ready for collection. When a user marks an object for safekeeping, it is moved over to the marked, safe list. More on that later.

I have also added an optimization here: we keep track of the last element in the unmarked list so that appending to the unmarked list is O(1). This substantially improves the performance of the collector over a list with O(n) appends.

Implementing the Collector – Destructors & Collection

Once we are tracking pointers and their associated types, we need to know how to destroy an object of a certain type. To do this, we maintain an array, indexed by type, of function pointers to destructors.

Easy Improvements – Better Destructors & Marking

The first major opportunity for improvement in this collector is better destructors. Right now, a lot of overhead is created because lists are tracked element by element. Fixing this is not too hard and would significantly lower the overhead per tracked byte of memory.

Another feature that I am currently working on is the ability to mark an object for safekeeping. When an object is marked, it is still tracked by the garbage tracker, but it will not be freed until it is unmarked. The interface for this should be simple:
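One plausible shape for that interface (the names here are my guesses, not necessarily what functionalC will ship):

```c
struct gc; /* as defined above */

/* Move ptr from the unmarked list to the marked list; it survives
 * calls to gc_collect until it is unmarked again. */
void gc_mark(struct gc *gc, void *ptr);

/* Move ptr back to the unmarked list, making it collectible again. */
void gc_unmark(struct gc *gc, void *ptr);
```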

The garbage tracker is working, but a full-fledged collector is a long way off. The tracker is unaware of program execution and does not count references. Users must manually invoke the collector. Going forward, I will improve on these issues.

Lists, closures, and lifted types all have garbage collection built in.

TL;DR

tl;dr You can implement many functional programming techniques in C. Map, filter, closures, and lifted types are all very possible. Function pointers are worth understanding. Code.

Functional C

I was having a conversation with a lover of Haskell who lamented C’s lack of functional constructs. Eager to show him that C does in fact support functional programming, I decided to write a small, functional C library. I posted the code here and Hacker News seems to have enjoyed it.

The following is a brief explanation about how to go about building a functional C library. It walks through the basics and leaves the implementation details to the code itself. Here is a more technical README.

Function Pointers

The basis of writing functional C is an understanding of the function pointer. A good rule of thumb in C is that everything is and/or has a memory location (think of the & operator). Functions are no different; they are sequences of instructions that reside in memory. Since this is the case, they can be passed into another function just like any other memory location. Anyone who has used qsort or signals in C has used a function pointer before, to specify a comparison or a handler function, respectively.

Closures

To close a function, one binds it to the environment it is to be executed in. E.g., if I have a function that takes two integers, a and b, and returns their sum, I can bind 2 to the value b to get a new function, called addtwo, that adds two to any number I put into it.

To implement closures, we need a few things:

A way to store environments and their variables

A way to bind environments to functions to create closures

A way to call a closure and execute its function within its environment

To store our environments, we use lists of environment variables that record each value along with its size in memory. To store a closure, we store a function pointer and an environment.

Environments store variables with their size metadata in order to permit polymorphic C functions. I’ll write more about this later. To simplify lifting a standard variable into an environment for binding to a function, I’ve written some helper functions:

Machine Learning: Building Image Based Search

tl;dr It’s not too bad to build a simple search engine where you put in a picture and get back both similar images and potential labels. The distance between the gist descriptors of two images provides a useful way to determine their similarity. Therefore, it is simple to build an image classifier from gists. Code.

super tl;dr

Computers can tell which of these are cats.

Image Based Search

For hack@UChicago’s winter (un)hackathon, two friends (Falcon, Jeremy) and I decided to build an image-based search engine. Essentially, you upload a picture and we return two things:

10 most similar images

Keywords labeling the image and their associated likelihoods

To build this search engine, you’ll need a couple of things:

Some labelled data. I have included the data we used here. It comes from the tiny image dataset. (These are very noisy labels, but the robustness of the gist allows successful classification even with this noise.)

A way to describe the image. We use gists. Python lib for their computation.

A way to find the closest gists to an input image. Here’s our code.

But what’s the gist of an image?

To determine the similarity of images, we used gist descriptors, an image description technique that comes from MIT. A gist descriptor is a sequence of numbers that describes an image in its entirety. It summarizes the “naturalness, openness, roughness, expansion, ruggedness” of an image through a series of mathematical manipulations focusing heavily on the discrete Fourier transform.

How do you find how ‘close’ two images are?

Each gist is a vector of 960 floating point numbers. Taking the simple Euclidean distance between two gists gives a measure of how far apart the corresponding images are.

How is a search actually done?

We precompute the gists for the training data and load these into memory when the server starts up. When someone uploads an image, we calculate its gist. We find the distance from this gist to each of the others and sort the gists from closest to furthest. We return the 10 closest images.

And the labels?

As for the labels, we use a weighted average of the keywords associated with the 100 closest images. We return the labels in sorted order with their corresponding likelihood.

Comments and extra resources

I hope this demystifies machine learning and image recognition a little bit. I hope to continue my series on machine learning and machine learning infrastructure in the coming weeks.

If you want to build other datasets, here is my python library to extract data from the tiny image dataset by keyword: PyTinyImage

tl;dr Subsequent calls to fstat may return stale file info such as size, last accessed time, etc., no matter what file system you are using. Always open and close a file in order to poll its size. E.g. code

In order to prep for a coding interview, I was re-implementing the *nix tail command. As a refresher, the tail command prints the last 10 lines of a file and, if passed the -f flag, will continue to watch the file, printing to stdout anything written to it.

To account for the -f flag, we poll the size of the file, calling fstat periodically to determine the file’s current size. i.e.

While the idea behind this code is fine, it contains a subtle problem: fstat may cache the file attributes and return the same values on subsequent calls even though the underlying file has been modified.