Introduction

Nowadays is not uncommon that web applications include full text search features. There are already well known solutions working out-of-the-box that provide the needed functionalities, such as ElasticSearch or Apache Solr.

Having used ElasticSearch at work a couple of times I wondered how it achieved fast searches and what mechanism empowered that, so reading up a little on the topic, the Inverted Index appears as the cornerstone of full text search algorithms.

Thus, what better way to understand how something works than writing my own toy one?

What is an Inverted Index?

The Inverted Index is the data structure used to support full text search over a set of documents. It is constituted by a big table where there is one entry per word in all the documents processed, along with a list of the key pairs: document id, frequency of the term in the document.

How does it work?

Let's imagine we have two documents with different texts:

The big sharks of Belgium drink beer.

Belgium has great beer. They drink beer all the time.

In order to create the Inverted Index, each text is sliced into different units or terms. The rule is to use whitespace as the natural separator between words, although it can be changed.

Additionally, per each term there is a list of pairs (document id, occurrences), showing the document's ID where the term is found, and the number of times the term appears in the text.

Therefore, the Inverted Index after processing the previous two documents would be:

Term

Appearances (DocId, Frequency)

The

(1, 1)

big

(1, 1)

sharks

(1, 1)

of

(1, 1)

Belgium

(1, 1) (2, 1)

drink

(1, 1)

beer

(1, 1) (2, 2)

has

(1, 1)

great

(1, 1)

They

(1, 1)

all

(1, 1)

the

(1, 1)

time

(1, 1)

As seen, the term Belgium appears once in both documents, while the term beer appears once in the first and twice in the second one. Whenever a search is issued, the index will be looked up and the corresponding documents retrieved automatically.

This in turn makes processing the documents (indexing) and thus creating & updating the index a slow process, since each document needs to be parsed, sliced and analyzed. Conversely, once the index is created search becomes a really cheap operation since it only entails looking up an entry in a table.

As it happens with everything, this mechanism is not a silver bullet and it has it's quirks and drawbacks, being some of them:

Case: The words Belgium and belgium are indexed as two different terms in the table, while a user writing "belgium" in lowercase letters most likely would want to see all the occurrences regardless case.

Stopwords: keywords considered irrelevant and deprived on any additional sense, like prepositions, articles or conjunctions.

Stemming: Reducing derivations of a word to their common root (e.g: shark and sharks to shark).

Synonyms: Gathering terms that share a common meaning to a single one, in order to prevent an explosion in Inverted Indexe's size. Terms like car, automobile, plane and truck could be mapped into a single vehicle word in the index.

Example

The Inverted Index can be understood as a simple key/value dictionary where per each term we store a list of appearances of those terms in the documents and their frequency.

Thus, an Appearance class represents a single Appearance of a term in a document:

class Appearance:
"""
Represents the appearance of a term in a given document, along with the
frequency of appearances in the same one.
"""
def __init__(self, docId, frequency):
self.docId = docId
self.frequency = frequency
def __repr__(self):
"""
String representation of the Appearance object
"""
return str(self.__dict__)

The Database class is a fake in-memory DB used to persist the documents after they have been indexed.

class InvertedIndex:
"""
Inverted Index class.
"""
def __init__(self, db):
self.index = dict()
self.db = db
def __repr__(self):
"""
String representation of the Database object
"""
return str(self.index)
def index_document(self, document):
"""
Process a given document, save it to the DB and update the index.
"""
# Remove punctuation from the text.
clean_text = re.sub(r'[^\w\s]','', document['text'])
terms = clean_text.split(' ')
appearances_dict = dict()
# Dictionary with each term and the frequency it appears in the text.
for term in terms:
term_frequency = appearances_dict[term].frequency if term in appearances_dict else 0
appearances_dict[term] = Appearance(document['id'], term_frequency + 1)
# Update the inverted index
update_dict = { key: [appearance]
if key not in self.index
else self.index[key] + [appearance]
for (key, appearance) in appearances_dict.items() }
self.index.update(update_dict)
# Add the document into the database
self.db.add(document)
return document
def lookup_query(self, query):
"""
Returns the dictionary of terms with their correspondent Appearances.
This is a very naive search since it will just split the terms and show
the documents where they appear.
"""
return { term: self.index[term] for term in query.split(' ') if term in self.index }

In order to test the execution of the index, I just create a couple documents and perform some searches.

Wow time flies! Three months since the last review of my time spent during sabbatical have gone by. I thought it would be useful to take a whole afternoon to reflect on the steps I've taken and also think about the direction I want to take.

Things I've done & learnt so far

I got my A2 motorbike permit :D

I learnt to make a couple new recipes with toasts.

I finally threw in the beach a couple Chinese lanterns with a friend in our hometown holiday, we used to forget every year!

I completed the PADI Open Water Diver Course, it is the first begginer's certification in scuba diving. There are some others worldwide recognized as SSI. It was also the first time I did scuba diving and I loved it.

I tried Surf for the first time but it was not the most pleasant experience. I have been snowboarding for years and thought I could transfer some knowledge but they turn to be two quite different worlds.

I have done Yoga for the first time. I really enjoyed it!

I learnt how to say "Rock, Paper, Scissors" in Chinese and I am quite proud of it.

I learnt how Singapore went for a fisherman's village to the technological & economical hub it is today.

I learnt about the arrival of Islam to Malaysia.

I learnt about the history of Bali and it's colonization past with the Dutch.

I started a small programming project involving MindMaps, but don't really devote much time to it.

Things I would like to do

I think it is better to keep this to myself, there are so many things that I am afraid that if I share all them and don't do at least the majority, I come out as a failure. Better this way.

Conclusion

This months have given me quite a good perspective about life, myself and traveling. I've come to realize that success means different things to different people and that traveling as a means to achieve some kind of inner peace is often irrevelant, as that peace must come from whithin. I am definitely still working on that.

Introduction

Besides the day-by-day work, I like to always have an ongoing side project to keep my skills sharp. These don't have to be necessarily useful or directly linked to my current job, but serve as a distraction to throw some spare hours into.

Normally those projects have a bunch of common traits:

They are purposely small.

They focus on a single defined task.

I try to add at least one element of novelty on them (language, framework, programming paradigm).

Although this not always happens, I need to see the point of working on it, otherwise boredom easily arises.

I remember during college I started writing an ELF file parser (much like the readelf binary) and realized later that it was way better to use memory mapping instead of opening the file with file descriptors.

Another example was a simply e-shop webapp using Ember.js & Laravel 5. The reason for choosing Laravel was because I was borrowing a friend's hosting so I was limited to a LAMP environment, and despite having a shell available an thus being able to install node.js or whatever I needed (without root, extra difficulty), the backend was simple enough to stick with PHP.

The point of giving these examples is to state that over the course of the years I've had better or worse ideas (many completely useless), but I've greatly enjoyed my invested time.

Then, the moment of complete despair comes when I run out of them, when I dismiss all those waiting on my Possible Projects backlog and can't find anything worth starting. If it is not useful enough, it is way too complex to invest so many hours; if it is interesting, it might have no direct application, and thus creating anxiety (e.g: ReasonML is very cool, but I should really learn Typescript properly..); if it is real-life "useful", it may be utterly boring (e.g: writing better unit tests in python/selenium).

It is not uncommon to spend a considerable amount of time thinking on which project to embark next. Since this is a recurring problem I have eventually decided to sketch out a mind map describing the process I go through when looking for something new to do.

I have divided the potential Idea Triggers in six groups, colored in different shades of green. Once a choice is made, the Possible Actions are in different shades of orange, all but the "Choose Stack" choice, which I'll explain later.

Each Idea Trigger represents a specific area where you want to focus when it comes to find an idea for a project, and each Possible Action is the action you take accordingly. Notably, not all combinations make a lot of sense.

Idea Triggers

Useful for your current job

You want to invest your free time into skills that have a direct impact in the job you are currently pursuing, so the contribution pays off directly. The potential areas I tend to pay attention to are:

Identify stack & languages used at work: as a developer we use a cluster of technologies we are may not be very acquainted with (you can't know everything).

Your backend might be an array of microservices running in Docker containers written in Ruby, Python and Go, each of them running in an Nginx server, using RabbitMQ for message passing, one of them using Redis for some caching and the frontend written in React.js & Typescript. It is unlikely even as a "Fullstack Developer" that you have well versed expertise in all these technologies, hence making each one of the bolded words a possible challenge to tackle.

Identify repetitive or annoying tasks that can be automated: in a developer's (or any IT job) daily tasks there's often plenty of room for automating tasks, providing you a good ground for identifying weak spots in the processes (e.g. inefficient build process) that you'd otherwise neglect, given this examples from my own experience:

npm link'ing multiple repositories at once in a single project or having nested npm link inside a single repo. You can write a tool for that!

Cloning the same docker barebone repository and having to change over and over the code here and there to customize it. You can have a docker template and extend it with custom code!

Doing some Excel columns manipulation manually, over and over again. You can write a script for that!

And I am sure there are plenty of these examples you are already starting to realize..

Identify a process related to your job you don't quite yet understand: you might be a frontend developer who writes commits, pushes them and forgets what are the further steps until the feature is shipped to the final product. You know someone writes the Q&A automated testing, and that a thing called TravisCI perform automatic tests & building, but never scratched the time to understand it. This can be a good entry point to learn steps in the software shipping process you are never directly involved.

Useful for a prospective job

You might be looking for a new job that can be either related to your current one, or diverts a little, or tackles an entirely new domain you have never worked with. Despite computer science related jobs normally share a common trunk of knowledge, there are so many specializations that it would become impossible to become proficient in all of them.

However, it is not uncommon for a developer to slide from one to another. Let's take a few examples:

You are a frontend developer working in a web application, you realize there are many security holes in the code & architecture and you decide to fix them. You become increasingly interested in multiple ways attackers can take advantage of a web application and therefore you start to know it's correspondent security measures (XSS, SQL injection, RFI, LFI, etc.. vs. CSP, CORS, Character filtering, etc..). After some joyful playing you decide you find more enjoyment and purpose in web security and want to jump to a Security Consultant/Architect position.

You are a backend developer and your next task is to find out why the DB queries are so slow. You start to enjoy data modeling and database administration and start to research alternative databases: NoSQL, GraphQL, Triplestores, etc.. and decide you want to move into a DBA role.

You are a developer who is bored of waiting so much for the CD/CI cycle and decide to create your own set of scripts and try to tweak Travis, Jenkins or whatever your organization has running. You realize you love scripting and DevOps and want to jump into a role to do that.

The list can go on, those are just some examples that popped into mind to illustrate the leap that might come after curiosity brings us outside our regular role.

Hot Topic

There is a constant stream of new technologies, languages, frameworks, etc.. Most of them fade away, but good ideas tend to stick around and often they originate from old computer science papers that are rescued from their dusty bookshelves (functional programming?).

In your daily job you may just perform mundane development tasks, but some topics in hacker news or lobste.rs appear quite often and start to resonate with you. If they are around for that long, they might be worth giving a look. Topics like (my personal list):

Blockchain/Ethereum

IPFS

WebAssembly

Natural Language Processing

Machine Learning (applied to whatever)

Serverless

I personally find this Idea Trigger one of the hardest to follow, since I need to see an immediate practical application to them and also enjoy playing with. As an example I really like the idea of WebAssembly but I don't see real projects made with it yet and I am concerned with the potential security implications it may bring along (like Flash did).

Specific tech

Stricty related to the previous Hot Topic section, but instead of paying attention to technologies that appear often in the media and thus having a direct job finding correlation, the Specific Tech section encompasses the technologies you feel curious about and have been staying in the back of your head for a long time, despite not having necessarily a direct application, more like a general "usefulness" vibe.

Some items of my personal list are:

Lisp

ReasonML

Elm

Exploit development

Reactive programming

Malware analysis

Advanced Shell Scripting (sed, awk, etc)

Understand the OS booting process and how programs are loaded into memory

Again, although enjoyment is paramount, sticking to these projects is hard since I often don't see a direct return of investment in time & effort.

Solve a real problem

By far the easiest to stick to. If you are lucky enough to come accross a real problem you realize you could automate or build a software solution for it, you hit a gold mine. You have the motivation to craft something new with the added advantage of being useful for yourself or a group of people.

An example of this could be that you participate in a cultural center in your neighborhood, and you notice they are having some trouble managing the activities, letting people know about the offer and providing a way of proposing new ones. Hence, you roll up your sleeves and decide to create a web application to cover those use cases.

To get inspiration for people who solve real problems you can check Indie Hackers, they have plenty of nice projects to showcase.

Interesting Topic

Instead of looking for a given technology to learn, this method has the inverse approach. I make a list of my different areas of interest besides computers, for example:

Privacy

Freedom of Speech

Motorbikes

Learning new spoken languages

Different forms of education

And from that list I would look for a project that tackles a given topic or a combination of several.

Possible Actions

Per each of the Idea Triggers there are one or many possible ways to start building knowledge as you can see in the overview picture. Those are the Possible Actions to be taken.

Online Course

A simple and straightforward way of getting up and running with a topic is to follow an online course. This is good if you want filtered, curated and structured content, since other people have taken care of that for you. The downsides of this approach are that you still have to filter whether it is useful and worth following, and not all worthy courses are for free.

Normally you can either opt for the regular MOOCs offered by multiple universities and platforms online or access a more specific courses platform. Some platforms I've used in the past are:

MOOC

Other Platforms

Github Project

Github provides and excellent repository to hunt for ideas, especially if you don't want to start a project from scratch and would be better off by contributing to an already started one. By clicking in "explore" you can find the latest trending repositories in the Open Source community, as well as being able to filter them by the language, framework and topic you're interested in.

Let's say I am initially interested in security and I want to learn Python by applying it to that topic. By the time I am writing the article, I see there are some interesting projects that fall into this category I would be quite thrilled to contribute to:

wifiphisher: a security tool that allows to create a rogue access point and force clients to authenticate & phish credentials.

As seen, Github provides a great "open source supermarket" where you can search through a wide range of different options and hopefully find one you resonate with and are willing to contribute.

Related Jobs

If you are clear about the topic or technology you want to jump to, a great way of finding a pragmatical use for that knowledge would be to look for actual startups & companies that make use of them, or simply focus on the topic you are interested in (e.g: Farming, Privacy, E-Health, you name it).

There is a big offer of job portals out there, so I will just give a small example: imagine you are sure you want to learn the Haskell language, and you are very interested in making a career in health care. A great way of starting is to look in StackOverflow and in AngelList for jobs with haskell or functional programming keywords and if then start to fine tune the search.

Luckily there will be some interesting projects to look at, and might give you ideas for yours.

Code Challenges

Participating in code challenges, code jams and whatnot is a good exercise to get comfortable with the quirks of the language, but they normally don't represent real problems or situations you might encounter in a job. It can be cool to know how to implement a double linked list, or a breadth-first tree lookup, but most of those problems have been solved time ago and are now part of standar libraries.

This is not to undervalue their utility, they have proven to be incredibly good to keep my skills sharp in many times, but I'd rather do projects that try to solve a real problem, even though it is a well-known solution (say a host port-scanner).

There are plenty of them, but some of the code challenge websites I've used to practice are:

This principle can be applied to anything you want to understand: if you're able to explain it means you get it.

Misc

There are many other ways to start a side project, such as reading books, but those have simply proven not to be useful for me at the beginning. When I take a book about a topic it has to be one I am fairly familiar with and towards which I already profess genuine interest. Alas, I lose interest quite fast.

Conclusion

I am sure I am leaving a lot of good information there, any remarks and constructive criticism is more than welcome in the comment section.

Introduction

I never quite got what was the requestAnimationFrame function I kept seeing while reading web articles since I never really dove deep in graphics in the browser.

Still, I think it will be a good exercise and reminder for the future to write a small article to structure and understand the basic concept and usage of the function.

Overview

For the human brain to be fooled into thinking that a sequence of images flow into movement (animate), you need to provide at least 24 images (frames) per second, otherwise you'll start to notice cuts. Most computer screens have an "image refresh" (or frames/per/second) rate of 60fps, meaning in turn that you will see 60 images in the course of one second.

If we divide the slice of time for perception (a second) into the refresh rate of the screen it gives us the time slot available to perform an animation (both calculating and painting) for a single frame. We use miliseconds since setTimeout, setInterval use them to measure time.

1000ms / 60 fps = 16.67ms/frame

The smoothness of an animation depends on whether each frame of the animation is executed within that 16.67ms time frame.

An example animation taken from here but slightly modified includes a green square being moved horizontally from left to right using the setInterval function (Demo here).

As seen, I've rounded the 16.67ms to 17ms, and you can see the continuity and somewhat fluidity of the animation. What will happen if we change the interval timer to say, 100ms, is that animation will feel choppy and stumbling. Additionally, there is no real guarantee that the setInterval function will comply with the passed timeout.

What is requestAnimationFrame()

It is a javascript API function that allows the user to perform animation-based loops more efficiently than the usual alternatives: for-loop or setTimeout & setInterval functions.

Intensive graphics animations run with requestAnimationFrame API will get some preferent treatment and be optimized to create a smoother feel in the browser, compared to the previously mentioned counterparts:

Animations will only run when visible: the browser will drastically throttle animation if it isn't executed in the active tab, hence saving up a ton of cpu cycles.

Browser with setTimeout will update the screen arbitrarily, trying to match the animation's redraw with the whole screen repaint, losing cpu cycles in the process.

It does not require an update interval, hence adapting to the refresh rate of the computer. Animations are refreshed when possible, not when told.

It will group all the animations into a single repaint, saving a lot of cpu cycles too.

Battery friendly: as a consequence of the previous reasons.

Animations can be of any kind, either by manipulating the DOM, using WebGL, using CSS transitions, or using the canvas HTML element.

Cancelling requestAnimationFrame() execution

As stated previously, animations can be cancelled by making a call to the cancelAnimationFrame API and passing the frame id returned by requestAnimationFrame. Check this pen to see the animation cancelled after one second. Then, comment out the setTimeout line and see how the green square moves way forward.

Slowing down the animation

The frame rate of an animation using requestAnimationFrame can also be adjusted by throttling the call to the API inside a setTimeout call and dividing a second between the frame rate desired, slowering it down, like shown in this pen.

Overview

I've had the intention of writing an article about PGP for a long time, but every time I started reading about the topic I almost immediately brushed the idea off. I believe this is spurred by the fact that understanding it is not quite simple.

Just as monads, once you get the hang of it it becomes evident and you automatically lose the ability to communicate the concept to others.

Therefore, finally I decided to wrap my head around the matter and write a small article to try to simplify this concepts mostly for myself and hopefully for whomever reads it.

What is PGP?

The simplest way I think of it is as an information's encryption technology. By information I mean as stated by the wikipedia article: emails, files, text, directories and whole disk partitions.

It provides a framework to exchange data securely among peers, as well as guaranteeing both the authenticity and integrity of the message.

How does it work?

There are three main concepts I split PGP into:

Encryption/Decryption

Authenticity & Integrity

Web of Trust

1. Encryption/Decryption

Encrypting a message is the act of codifying it to make it unreadable to an external observer other than the intended recipient. The encoded message is called ciphertext. Decrypting is the inverse process where an unreadable message becomes accessible (converting from ciphertext back to plain text).

PGP uses several techniques to achieve that goal: hashing, data compression and symmetric/asymmetric cryptography. Explaining each one of them separately would go beyond the scope of the article. Moreover, they are no simple concepts to fathom, there is quite a bit of literature about them.

The process of data encryption & decryption has multiple steps to be taken into account. The best way to understand it is with a simple diagram where the flow of information being encrypted and sent from a Sender to a Receiver is depicted.

Compression: initially the message is compressed. This is done with a multiple purpose in mind: to reduce the amount of bytes to be transmitted, thus increasing performance, and additionally to enhance the cryptographic strength. Messages can have patterns that can be spotted in the text after encryption; compression helps on preventing those patterns to leak.

A new Session Key is generated per each one of the messages sent; this is a random string created by using multiple sources of entropy (randomness) in your computer, for example mouse movements. This key is then used to encrypt the compressed message data using symmetric cryptography, creating a compressed ciphertext.

Afterwards, the Session Key alone is encrypted with the Receiver's public key (asymmetric cryptography) and attached to the compressed ciphertext, generating the whole encrypted message. The Session Key is later discarded.

If the whole message is encrypted using the Receiver's public key (asymmetric), the size of the resulting message would spike, as well as dramatically increasing the computation time. As instead, by only encrypting the session key the process is faster and the final payload way smaller.

Inversely, what is left for the receiver is to use their private key to decrypt the session key, and use that session key to decrypt the entire message.

2. Authenticity & Integrity

PGP also provides capabilities to verify whether the received message has been sent by a specific sender. To give an example, how do you know that an email you received from personitrust@trust.com is actually from that person and not some impostor?

The solution PGP provides is by creating a digital signature of the message and sending it along so the receiver can verify it's origin. The process overview is like this:

Then, the Receiver will use the Sender's public key to verify the integrity of the message's hash that was signed with the Sender's private key.

Because there is a unique association between public and private key, if the sender uses a certain private key to sign a message and you verify the signature using the corresponding public, then the signature verification will succeed only if the message has not been altered.

There might be however some cases where such statement does not hold (See more).

3. Web of Trust

Public keys can be obtained in multiple manners: by downloading them from the site of the interested party, by email, using an untrusted key server, etc.. PGP stores all the public keys we need to communicate in a file called the keyring.

You can sign several public keys, as so can others, establishing a network of trust among public key holders where for example you may accept a document signed by some external party whose public key has been signed enough times from several other sources you've deposited your confidence in.

Example

Sending an encrypted message using GPG

GPG is the free implementation for PGP, while the original one is protected. Lets think of the simple scenario where a Sender wants to send a "Hello Amigo!" message to a Receiver.

Now we ideally we send this encrypted.msg gibberish text to the Receiver who is in posession of their private key, and they just need to type the key's password he established when they created it (receiver).