How learning data structures and algorithms makes you a better developer

By Najaf Ali

"I'm a huge proponent of designing your code around the data, rather than the
other way around, and I think it's one of the reasons git has been fairly
successful […] I will, in fact, claim that the difference between a bad
programmer and a good one is whether he considers his code or his data
structures more important."
- Linus Torvalds

Learning about data structures and algorithms makes you a stonking good programmer.

Data structures and algorithms are patterns for solving problems. The more
of them you have in your utility belt, the greater the variety of problems you'll
be able to solve. You'll also come up with more elegant solutions to new
problems than you otherwise would.

You'll understand, in depth, how your computer gets things done. This informs
any technical decision you make, regardless of whether or not you're using a
given algorithm directly: everything from memory allocation in the depths of
your operating system, to the inner workings of your RDBMS, to how your networking
stack manages to send data from one corner of the Earth to another. All computers
rely on fundamental data structures and algorithms, so understanding them better
means understanding the computer better.

Cultivate a broad and deep knowledge of algorithms and you'll have stock
solutions to large classes of problems. Problem spaces that you had difficulty
modelling before often slot neatly into well-worn data structures that elegantly
handle the known use-cases. Dive deep into the implementation of even
the most basic data structures and you'll start seeing applications for them in
your day-to-day programming tasks.

You'll also be able to come up with novel solutions to the somewhat fruitier
problems you're faced with. Data structures and algorithms have the habit of
proving themselves useful in situations that they weren't originally intended
for, and the only way you'll discover these on your own is by having a deep and
intuitive knowledge of at least the basics.

But enough with the theory: have a look at some examples.

Figuring out the fastest way to get somewhere

Let's say we're creating software to figure out the shortest distance from one
international airport to another. Assume we're constrained to the following routes:

Based on that graph of destinations and the distances between them, how can we
find the shortest distance from, say, Helsinki to London? Dijkstra's
algorithm is the classic answer: it's guaranteed to find the shortest
path, and it does so efficiently.

If you ever came across this problem and knew that Dijkstra's algorithm was
the solution, you'd probably never have to implement it from scratch. Just
knowing about it would point you to a library implementation that solves the
problem for you.

If you did dive deep into the implementation, you'd be working through one of
the most important graph algorithms we know of. You'd know that in practice it's
a little resource-intensive, so an extension called A* is often used in
its place. These algorithms get used everywhere from robot guidance to routing
TCP packets to GPS pathfinding.
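To make this concrete, here's a minimal sketch of Dijkstra's algorithm in
Python using a priority queue. The route graph and the distances in it are
invented for illustration, since the diagram above isn't reproduced here:

```python
import heapq

def dijkstra(graph, start, goal):
    """Return (distance, path) for the shortest route from start to goal."""
    # Priority queue of (distance so far, node, path taken to get there).
    queue = [(0, start, [start])]
    visited = set()
    while queue:
        dist, node, path = heapq.heappop(queue)
        if node == goal:
            return dist, path
        if node in visited:
            continue
        visited.add(node)
        for neighbour, weight in graph.get(node, {}).items():
            if neighbour not in visited:
                heapq.heappush(queue, (dist + weight, neighbour, path + [neighbour]))
    return float("inf"), []

# Hypothetical route graph; the distances (in km) are made up for the example.
routes = {
    "Helsinki": {"Stockholm": 400, "Berlin": 1100},
    "Stockholm": {"Oslo": 420, "London": 1430},
    "Berlin": {"Amsterdam": 580},
    "Amsterdam": {"London": 360},
    "Oslo": {"London": 1150},
    "London": {},
}

distance, path = dijkstra(routes, "Helsinki", "London")
# → 1830 km, via Helsinki -> Stockholm -> London
```

A* differs mainly in adding a heuristic estimate of the remaining distance
(say, straight-line distance to the goal) to each entry's priority, which steers
the search toward the goal instead of expanding outward in every direction.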

Figuring out the order to do things in

Let's say you're trying to model courses on a new Massive Open Online Courses
platform (like Udemy or Khan Academy). Some of the courses depend on each
other. For example, a user has to have taken Calculus before she's eligible for
the course on Newtonian Mechanics. Courses can have multiple dependencies.
Here are some examples of what that might look like written out in YAML:

Given those dependencies, as a user, I want to be able to pick any course and
have the system give me an ordered list of courses that I would have to take to
be eligible. So if I picked calculus, I'd want the system to return the list:

arithmetic -> algebra -> trigonometry -> calculus

Two important constraints on this that may not be self-evident:

At every stage in the course list, the dependencies of the next course must be
met.

We don't want any duplicate courses in the list.

This is an example of resolving dependencies and the algorithm we're looking
for to solve this problem is called topological sort (tsort). Tsort works on a
dependency graph like we've outlined in the YAML above. Here's what that would
look like in a graph (where each arrow means requires):

What topological sort does is take a graph like the one above and find an
ordering in which all the dependencies are met at each stage. So if we took a
sub-graph that only contained radioactivity and its dependencies, then ran
tsort on it, we might get the following ordering:

This meets the requirements set out by the use case we described above. A user
just has to pick radioactivity and they'll get an ordered list of all the
courses they have to work through before they're allowed to take it.

We don't even need to go into the details of how topological sort works before
we put it to good use. Your programming language of choice probably has an
implementation of it in the standard library. In the worst case, your Unix
system almost certainly has the tsort utility installed by default; run
man tsort and have a play with it.
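If you do want to see how it works, here's a minimal sketch of topological
sort in Python (this is Kahn's algorithm, one of the standard ways to do it),
run against the course dependencies from the example above:

```python
from collections import deque

def tsort(deps):
    """deps maps each course to the courses it requires. Returns an
    ordering in which every course appears after all of its dependencies."""
    # For each course, track who depends on it and how many of its own
    # dependencies are still unmet.
    dependants = {course: [] for course in deps}
    unmet = {course: len(required) for course, required in deps.items()}
    for course, required in deps.items():
        for dep in required:
            dependants[dep].append(course)
    # Start with the courses that have no prerequisites at all.
    queue = deque(course for course, n in unmet.items() if n == 0)
    order = []
    while queue:
        course = queue.popleft()
        order.append(course)
        for dependant in dependants[course]:
            unmet[dependant] -= 1
            if unmet[dependant] == 0:
                queue.append(dependant)
    if len(order) != len(deps):
        raise ValueError("dependency cycle detected")
    return order

# The course dependencies from the example above.
courses = {
    "arithmetic": [],
    "algebra": ["arithmetic"],
    "trigonometry": ["algebra"],
    "calculus": ["trigonometry"],
}

# tsort(courses) → ["arithmetic", "algebra", "trigonometry", "calculus"]
```

Each course appears exactly once, and by the time it appears all of its
dependencies already have: exactly the two constraints from the use case.
(Recent Pythons even ship this in the standard library as
graphlib.TopologicalSorter.)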

Other places tsort gets used

Tools like make allow you to declare task dependencies. Topological sort is
used under the hood to figure out what order the tasks should be executed in.

Any programming language that has a require directive, indicating that the
current file requires the code in a different file to be run first. Here
topological sort can be used to figure out what order the files should be
loaded in so that each is only loaded once and all dependencies are met.

Project management tools with Gantt charts. A Gantt chart is a graph that
outlines all the dependencies of a given task and gives you an estimate of
when it will be complete based on those dependencies. I'm not a fan of Gantt
charts, but it's highly likely that tsort will be used to draw them.

Squeezing data with Huffman coding

Huffman coding is an algorithm used for lossless data compression. It works by
analyzing the data you want to compress and creating a binary code for each
character. More frequently occurring characters get shorter codes, so e might
be encoded as 111 while x might be 10010. The codes are constructed so that
they can be concatenated without a delimiter and still be decoded unambiguously.
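Here's a minimal sketch of building a Huffman code table in Python. It
repeatedly merges the two least frequent subtrees until one tree remains;
this is the textbook construction, not the exact variant DEFLATE uses:

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Map each character to a bit string; frequent characters get shorter codes."""
    # Heap entries: (frequency, tiebreak, {char: code so far}).
    # The tiebreak keeps tuple comparison away from the dicts.
    heap = [(freq, i, {char: ""})
            for i, (char, freq) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        # Merge the two least frequent subtrees into one.
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {char: "0" + code for char, code in left.items()}
        merged.update({char: "1" + code for char, code in right.items()})
        heapq.heappush(heap, (f1 + f2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

codes = huffman_codes("beekeeper")
# "e" occurs five times out of nine characters, so it gets the shortest code.
```

Because every character sits at a leaf of the tree, no code is a prefix of
any other, which is what lets the codes be concatenated without delimiters.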

Huffman coding is used along with LZ77 in the DEFLATE algorithm which is used by
gzip to compress things. gzip is used all over the place, in particular for
compressing files (typically anything with a .gz extension) and for http
requests/responses in transit.

Knowing how to implement and use Huffman coding has a number of benefits:

You'll know why a larger compression context results in better compression
overall (i.e. the more data you compress together, the better the compression
ratio). This is one of the proposed benefits of SPDY: that you get better
compression across multiple HTTP requests/responses.
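You can see this effect directly with Python's zlib module (the DEFLATE
implementation gzip builds on). The requests below are invented for the
example, but the pattern holds for any data that shares structure:

```python
import zlib

# Ten hypothetical HTTP requests that share most of their header text.
requests = [
    (f"GET /articles/{i} HTTP/1.1\r\n"
     f"Host: example.com\r\n"
     f"Accept: text/html\r\n"
     f"User-Agent: demo\r\n\r\n").encode()
    for i in range(10)
]

# Compressing each request on its own vs. all ten in one stream.
separately = sum(len(zlib.compress(r)) for r in requests)
together = len(zlib.compress(b"".join(requests)))
# "together" comes out smaller: the repeated headers compress away
# when the compressor sees all ten requests in one context.
```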

You'll know that if you're compressing your javascript/css in transit anyway,
running a minifier on them buys you much less than you might expect. Same goes
for PNG files, which already use DEFLATE internally for compression.

If you ever find yourself trying to forcibly decipher encrypted information,
you may realize that since repeating data compresses better, the compression
ratio of a given bit of ciphertext can help you determine its block cipher
mode of operation.

Picking what to learn next is hard

Being a programmer involves learning constantly. To operate as a web developer
you need to know markup languages, high-level languages like Ruby or Python,
regular expressions, SQL and JavaScript. You need to know the fine details of
HTTP, how to drive a Unix terminal, and the subtle art of object-oriented
programming. It's difficult to navigate that landscape effectively and
choose what to learn next.

I'm not a fast learner, so I have to choose what to spend time on very carefully.
As much as possible, I want to learn skills and techniques that are evergreen,
that is, ones that won't be rendered obsolete in a few years' time. That means
I'm hesitant to learn the JavaScript framework of the week or untested
programming languages and environments.

As long as our dominant model of computation stays the same, data structures and
algorithms that we use today will be used in some form or another in the future.
You can safely spend time on gaining a deep and thorough knowledge of them and
know that they will pay dividends for your entire career as a programmer.