Data Structures for Beginners: Arrays, HashMaps, and Lists

When we develop software, we need to store data in memory. Depending on how you want to manipulate the information, you might choose one data structure over another. There are many kinds of data structures, such as arrays, maps, sets, lists, trees, graphs, etc. Choosing the right data structure for a task can be tricky, so this post covers the trade-offs to help you use the right tool for the job!

In this post, we are going to focus on linear data structures: Arrays, Lists, Sets, Stacks, Queues, and so on.

Note: Trees in general, and binary search trees in particular, will be covered in the next post, along with graph data structures.

Primitive Data Types

Primitive data types are the most basic elements, upon which all the other data structures are built. Some primitives are:

Integers. E.g., 1, 2, 3, …

Characters. E.g., 'a', 'b', '1', '*'

Booleans. E.g., true or false.

Float (floating points) or doubles. E.g., 3.14159, 1483e-2.

Array

Arrays are collections of zero or more elements. They are one of the most used data structures because of their simplicity and fast information retrieval.

You can think of an array as a drawer where you can store things on the bins.

Array is like a drawer that stores things on bins

When you want to retrieve something, you can go directly to the bin number (O(1)). However, if you forgot what a bin contains, you will have to open them one by one (O(n)) to verify their contents until you find what you are looking for. The same happens with an array.

Depending on the programming language, arrays have some differences. In dynamic languages like JavaScript and Ruby, an array can contain different data types: numbers, strings, objects, and even functions. In typed languages like Java/C/C++, you have to predefine the array's size and data type. JavaScript automatically increases the size of the array when needed.

Arrays built-in operations

Depending on the programming language, the implementation would be slightly different.

For instance, in JavaScript, we can append to the end with push and to the beginning with unshift. We also have pop and shift to remove elements from an array. Let's describe some common operations that we are going to use throughout this post.
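A minimal sketch of those operations, wrapped in two helper names (insertToTail and insertToHead) used for illustration:

```javascript
// Sketch of common array insertions in JavaScript.
const array = [2, 5, 1];

// Append to the tail: Array.push is O(1).
function insertToTail(arr, element) {
  arr.push(element);
  return arr;
}

// Append to the head: Array.unshift is O(n),
// since every existing element shifts one position.
function insertToHead(arr, element) {
  arr.unshift(element);
  return arr;
}

insertToTail(array, 9); // [2, 5, 1, 9]
insertToHead(array, 0); // [0, 2, 5, 1, 9]
```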

What do you think the runtime of the insertToHead function is? It looks the same as the previous one, except that we are using unshift instead of push. But there's a catch! The unshift algorithm makes room for the new element by moving all existing ones to the next position in the array, so it has to iterate through all the elements.

The Array.unshift runtime is O(n).

Access an element in an array

If you know the index for the element that you are looking for, then you can access the element directly like this:
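A quick sketch of direct index access, which runs in constant time:

```javascript
const array = ['a', 'b', 'c', 'd'];

// Accessing by index is O(1): the position is computed directly,
// no iteration needed.
function access(arr, index) {
  return arr[index];
}

access(array, 2); // 'c'
```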

Deleting an element from an array

So we use our search function to find the element's index, which is O(n). Then we use the JS built-in splice function, which also has a running time of O(n). Is the total O(2n)? Remember that constants don't matter as much in Big O notation.

We take the worst case scenario:

Deleting an item from an array is O(n).
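A sketch of deletion by value, assuming a linear search helper like the one mentioned above:

```javascript
// Find the index of an element by scanning the array: O(n).
function search(arr, element) {
  for (let index = 0; index < arr.length; index++) {
    if (element === arr[index]) return index;
  }
  return -1;
}

// Delete by value: search is O(n), then splice shifts the
// remaining elements, also O(n).
function remove(arr, element) {
  const index = search(arr, element);
  if (index !== -1) arr.splice(index, 1);
  return arr;
}

const array = [0, 1, 2, 3];
remove(array, 1); // [0, 2, 3]
```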

Array operations time complexity

We can sum up the array's time complexity as follows:

Array Time Complexities

| Operation | Worst |
| --------- | ----- |
| Access (Array.[]) | O(1) |
| Insert head (Array.unshift) | O(n) |
| Insert tail (Array.push) | O(1) |
| Search (for value) | O(n) |
| Delete (Array.splice) | O(n) |

HashMaps

HashMaps go by many names: HashTable, HashMap, Map, Dictionary, Associative Array, and so on. The concept is the same, while the implementation might change slightly.

Hashtable is a data structure that maps keys to values

Going back to the drawer analogy, bins have a label rather than a number.

HashMap is like a drawer that stores things on bins and labels them

In this example, if you are looking for a toy, you don’t have to open the bin 1, 2, and 3 to see what’s inside. You go directly to the bin labeled as “toys”. That’s a huge gain! Search time goes from O(n) to O(1).

The numbers were the array indexes, and the labels are the keys to the HashMap's values. Internally, the keys get translated into indexes using a hash function.

There are at least two ways to implement hashmap:

Array: using a hash function to map a key to an array index. Worst: O(n), Average: O(1)

Binary Search Tree: using a self-balancing tree to look up values. Worst and Average: O(log n)

We are going to cover Trees & Binary Search Trees in the next post, so don't worry about that one for now. The most common implementation of Maps uses an array and a hash function, so that's the one we are going to focus on.

HashMap implemented with an array

As you can see in the image, each key gets translated into a hash code. Since the array size is limited (e.g., 10), we have to loop back through the available buckets using the modulus function. In the buckets, we store the key/value pair, and if there's more than one in the same bucket, we use a collection to hold them.

Now, let's cover each of the HashMap components in detail, starting with the hash function.

Hash Function

The first step to implementing a HashMap is to have a hash function. This function maps every key to an index in the underlying array where its value is stored.

A perfect hash function is one that assigns a unique index to every key.

Ideal hashing algorithms allow constant-time access/lookup. However, it's hard to achieve a perfect hash function in practice. You might have the case where two different keys yield the same index: a collision.

Collisions in hashmaps are unavoidable when using an array-like underlying data structure. One way to deal with collisions is to store multiple values in the same bucket. When we try to access a key's value and find various values in its bucket, we iterate over them, which is O(n). However, in most implementations, the hash adjusts its size dynamically to avoid too many collisions. So we can say that the amortized lookup time is O(1). We will explain what we mean by amortized runtime later in this post with an example.

This Map allows us to set a key and a value and then get the value using the key. The key part is the hash function; let's look at multiple implementations to see how it affects the performance of the Map.
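As a starting point, here is a sketch of what a NaiveHashMap might look like; using the key's length as the hash and only three buckets are both deliberate flaws, matching the issues discussed below:

```javascript
// A deliberately naive HashMap: the hash is just the key's length,
// the bucket array has only 3 slots, and collisions overwrite each other.
class NaiveHashMap {
  constructor(initialCapacity = 3) {
    this.buckets = new Array(initialCapacity);
  }

  set(key, value) {
    this.buckets[this.getIndex(key)] = value; // collisions overwrite!
  }

  get(key) {
    return this.buckets[this.getIndex(key)];
  }

  hash(key) {
    return key.toString().length; // terrible hash: many duplicates
  }

  getIndex(key) {
    return this.hash(key) % this.buckets.length;
  }
}

const map = new NaiveHashMap();
map.set('cat', 2);
map.set('dog', 1); // same hash as 'cat', so it overwrites it!
map.get('cat'); // 1 😱
```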

Can you tell what’s wrong with NaiveHashMap before expanding the answer below?

What is wrong with `NaiveHashMap` is that...

1) The hash function generates many duplicates. E.g.,

hash('cat') // 3
hash('dog') // 3

This will cause a lot of collisions.

2) Collisions are not handled at all. Both cat and dog will overwrite each other at position 3 of the array (bucket #1).

3) Size of the array: even with a better hash function we would get duplicates, because the array has a size of 3, which is less than the number of elements we want to fit. We want an initial capacity that is well beyond what we need.

Did you guess any? ☝️

Improving Hash Function

The primary purpose of a HashMap is to reduce the search/access time of an Array from O(n) to O(1).

For that we need:

A proper hash function that produces as few collisions as possible.

An array that is big enough to hold all the required values.

Let's give our hash function another shot. Instead of using the length of the string, let's sum each character's ASCII code.
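A sketch of that improved hash function:

```javascript
// Sum each character's code instead of using the string's length.
// Different strings now (usually) produce different hash codes.
function hash(key) {
  let hashValue = 0;
  const stringKey = `${key}`; // make sure non-strings hash too
  for (let index = 0; index < stringKey.length; index++) {
    hashValue += stringKey.charCodeAt(index);
  }
  return hashValue;
}

hash('cat'); // 99 + 97 + 116 = 312
hash('dog'); // 100 + 111 + 103 = 314
```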

This DecentHashMap gets the job done, but there are still some issues. We are using a decent hash function that doesn't produce duplicate values, and that's great. However, we have two values in bucket #0 and two more in bucket #1. How is that possible?

Since we are using a limited bucket size of 2, we use the modulus % to loop back through the number of available buckets. So, even if the hash codes are different, all values must fit within the size of the array: bucket #0 or bucket #1.

Take notice that after we add the 12th item, the load factor gets beyond 0.75, so a rehash is triggered and doubles the capacity (from 16 to 32). Also, you can see how the number of collisions improves from 2 to 0!

This implementation is good enough to help us to figure out the runtime of common operations like insert/search/delete/edit.

To sum up, the performance of a HashMap is determined by:

A hash function that produces a different output for every key.

A bucket array big enough to hold the data.

We nailed both 🔨. We have a decent hash function that produces different outputs for different data: two distinct inputs will never return the same code. Also, we have a rehash function that automatically grows the capacity as needed. That's great!

Insert element on a HashMap runtime

Inserting an element in a HashMap requires two things: a key and a value. We could use the DecentHashMap data structure that we developed, or use the built-in Map as follows:
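For example, with the JavaScript built-in Map:

```javascript
// Inserting key/value pairs with the built-in Map.
const map = new Map();
map.set('first', 1);
map.set('second', 2);
map.set('first', 111); // setting an existing key replaces its value

map.get('first'); // 111
map.size; // 2
```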

If there's no collision, then each bucket holds only one value and the access time is O(1). But we know there will be collisions. If the initial capacity is too small and the hash function is terrible, like NaiveHashMap.hash, then most of the elements will end up in a few buckets: O(n).

HashMap access operation has a runtime of O(1) on average and worst-case of O(n).

Advanced Note: Another idea to reduce the time to get elements from O(n) to O(log n) is to use a binary search tree instead of an array. Actually, Java’s HashMap implementation switches from an array to a tree when a bucket has more than 8 elements.

Edit/Delete element on a HashMap runtime

Editing (HashMap.set) and deleting (HashMap.delete) key/value pairs have an amortized runtime of O(1). In the case of many collisions, we could face an O(n) as a worst case. However, with our rehash operation, we can mitigate that risk.

HashMap edit and delete operations have a runtime of O(1) on average and a worst case of O(n).

HashMap operations time complexity

We can sum up the HashMap time complexity as follows:

HashMap Time Complexities

| Operation | Worst | Amortized | Comments |
| --------- | ----- | --------- | -------- |
| Access/Search (HashMap.get) | O(n) | O(1) | O(n) is an extreme case when there are too many collisions |
| Insert/Edit (HashMap.set) | O(n) | O(1) | O(n) only happens with a rehash when the Hash is 0.75 full |
| Delete (HashMap.delete) | O(n) | O(1) | O(n) is an extreme case when there are too many collisions |

Sets

Sets are very similar to arrays. The difference is that they don’t allow duplicates.

How can we implement a Set (an array without duplicates)? We could use an array and check if an element is there before inserting a new one, but the running time of that check is O(n). Can we do better than that? We developed a Map that has an amortized runtime of O(1)!

Set Implementation

We could use the JavaScript built-in Set. However, if we implement it ourselves, it's easier to reason about the runtimes. We are going to use the optimized HashMap with rehash functionality.
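A sketch of such a MySet; for brevity, it delegates to the built-in Map here, which plays the role of our optimized HashMap:

```javascript
// A Set backed by a map: the key doubles as the value, and the
// map's unique keys guarantee there are no duplicates.
class MySet {
  constructor() {
    this.hashMap = new Map(); // stand-in for the post's DecentHashMap
  }

  add(value) {
    this.hashMap.set(value, value); // re-adding is a no-op
  }

  has(value) {
    return this.hashMap.has(value); // amortized O(1)
  }

  get size() {
    return this.hashMap.size;
  }

  delete(value) {
    return this.hashMap.delete(value); // true if the value was present
  }

  *[Symbol.iterator]() {
    yield* this.hashMap.keys(); // getting all entries is O(n)
  }
}
```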

We use HashMap.set to add the set elements without duplicates. We use the key as the value, and since hash map keys are unique, we are all set.

Checking if an element is already there can be done with hashMap.has, which has an amortized runtime of O(1). Most operations are amortized constant time, except getting all the entries, which is O(n).

Note: The JS built-in Set.has has a runtime of O(n), since it uses a regular list of elements and checks each one at a time. You can see the Set.has algorithm here

Here some examples how to use it:

```javascript
const assert = require('assert');
// const set = new Set(); // Using the built-in
const set = new MySet(); // Using our own implementation

set.add('one');
set.add('uno');
set.add('one'); // should NOT add this one twice

assert.equal(set.has('one'), true);
assert.equal(set.has('dos'), false);
assert.equal(set.size, 2);
// assert.deepEqual(Array.from(set), ['one', 'uno']);

assert.equal(set.delete('one'), true);
assert.equal(set.delete('one'), false);
assert.equal(set.has('one'), false);
assert.equal(set.size, 1);
```

You should be able to use MySet and the built-in Set interchangeably for these examples.

Set Operations runtime

From our Set implementation using a HashMap, we can sum up the time complexity as follows (very similar to the HashMap's):

Set Time Complexities

| Operation | Worst | Amortized | Comments |
| --------- | ----- | --------- | -------- |
| Access/Search (Set.has) | O(n) | O(1) | O(n) is an extreme case when there are too many collisions |
| Insert/Edit (Set.add) | O(n) | O(1) | O(n) only happens with a rehash when the Hash is 0.75 full |
| Delete (Set.delete) | O(n) | O(1) | O(n) is an extreme case when there are too many collisions |

Linked Lists

Linked List is a data structure where every element is connected to the next one.

The linked list is the first data structure that we are going to implement without using an array. Instead, we are going to use a node which holds a value and points to the next element.

node.js

```javascript
class Node {
  constructor(value) {
    this.value = value;
    this.next = null;
  }
}
```

When we have a chain of nodes where each one points to the next, we have a Singly Linked List.

Singly Linked Lists

For a singly linked list, we only have to worry about every element having a reference to the next one.

We start by constructing the root or head element.

linked-list.js

```javascript
class LinkedList {
  constructor() {
    this.root = null;
  }
  // ...
}
```

There are 4 basic operations that we can do in every Linked List:

addLast: appends an element to the end of the list (tail)

removeLast: deletes the element at the end of the list (tail)

addFirst: Adds an element to the beginning of the list (head)

removeFirst: Removes an element from the start of the list (head/root)

Adding/Removing an element at the end of a linked list

There are two primary cases. 1) If the list's first element (root/head) doesn't exist yet, we make the new node the head of the list.
2) Otherwise, if the list already has elements, we have to iterate until we find the last one and append the new node to the end.
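Both cases can be sketched like this (bundling the Node and LinkedList classes from above so the snippet runs on its own):

```javascript
class Node {
  constructor(value) {
    this.value = value;
    this.next = null;
  }
}

class LinkedList {
  constructor() {
    this.root = null;
  }

  // Append at the tail. Case 1: empty list, the new node becomes
  // the head. Case 2: walk to the last node (O(n)) and append.
  addLast(value) {
    const node = new Node(value);
    if (this.root) {
      let currentNode = this.root;
      while (currentNode.next) {
        currentNode = currentNode.next;
      }
      currentNode.next = node;
    } else {
      this.root = node;
    }
  }
}
```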

As expected, the runtime for removing/adding the first element of a linked list is always constant: O(1)

Removing an element anywhere from a linked list

Removing an element anywhere in the list leverages removeLast and removeFirst. However, if the removal is in the middle, we link the previous node to the one after the node being removed. Since nothing references the removed node anymore, it is effectively removed from the list:
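The middle-removal case boils down to re-pointing a single reference; a tiny sketch:

```javascript
class Node {
  constructor(value) {
    this.value = value;
    this.next = null;
  }
}

// Build a -> b -> c
const a = new Node('a');
const b = new Node('b');
const c = new Node('c');
a.next = b;
b.next = c;

// Remove b: link a directly past it. Nothing points to b anymore,
// so it is out of the list (and eligible for garbage collection).
a.next = b.next; // a -> c
```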

Notice that every time we add/remove the last element, the operation takes O(n)…

But we can reduce addLast/removeLast from O(n) to a flat O(1) if we keep a reference to the last element!

We are going to add the last reference in the next section!

Doubly Linked Lists

When each node points only to the next one, we have a Singly Linked List. When each node has a link to both the next and the previous element, we have a Doubly Linked List.

Doubly linked list nodes have two references (next and previous). We are also going to keep track of the list's first and last elements.
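A sketch of the doubly linked node and the list skeleton (the property names first and last are illustrative):

```javascript
// Each node keeps two references: forward and backward.
class Node {
  constructor(value) {
    this.value = value;
    this.next = null;
    this.previous = null;
  }
}

// The list tracks both ends, so either end is reachable in O(1).
class DoublyLinkedList {
  constructor() {
    this.first = null; // head
    this.last = null;  // tail
  }
  // ...
}
```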

Adding and removing elements at either end of a doubly linked list has a constant runtime, O(1)

Adding and removing from the end of a list

Adding and removing from the end of the list is a little trickier. In the Singly Linked List, both operations took O(n) since we had to loop through the list to find the last element. Now, we have the last reference:

With a doubly linked list, we no longer have to iterate through the whole list to get the second-to-last element. We can use this.last.previous directly, which is O(1).
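A sketch of addLast and removeLast taking advantage of the last reference (a self-contained version, with the node class inlined):

```javascript
class Node {
  constructor(value) {
    this.value = value;
    this.next = null;
    this.previous = null;
  }
}

class DoublyLinkedList {
  constructor() {
    this.first = null;
    this.last = null;
  }

  // O(1): append using the last reference, no iteration.
  addLast(value) {
    const node = new Node(value);
    if (this.last) {
      node.previous = this.last;
      this.last.next = node;
    } else {
      this.first = node;
    }
    this.last = node;
  }

  // O(1): this.last.previous gives the second-to-last node directly.
  removeLast() {
    if (!this.last) return undefined;
    const value = this.last.value;
    this.last = this.last.previous;
    if (this.last) {
      this.last.next = null;
    } else {
      this.first = null; // list became empty
    }
    return value;
  }
}
```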

Remember that for the Queue we had to use two arrays? We can change that implementation and use a doubly linked list instead, which gives us O(1) insertion at the start and deletion at the end.

Adding an element anywhere in a linked list

Adding an element anywhere in the list leverages our addFirst and addLast functions, as you can see below:

The first element in is the last to get out (LIFO), like a Stack. We can also implement a stack using a linked list instead of an array; the runtime will be the same.

That’s all!

Queues

A Queue is a data structure where the first element in is also the first one out, a.k.a. First-In, First-Out (FIFO).
It's like a line of people at the movies: the first to get in line is the first to get in.

We could implement a Queue using an array, very similar to how we implemented the Stack.

Queue implemented with Array(s)

A naive implementation would be this one using Array.push and Array.shift:
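A sketch of that naive version (the method names add and remove are illustrative):

```javascript
// Naive queue on a single array: add with push, remove with shift.
class Queue {
  constructor() {
    this.input = [];
  }

  add(element) { // O(1)
    this.input.push(element);
  }

  remove() { // O(n): shift moves every remaining element forward
    return this.input.shift();
  }
}

const queue = new Queue();
queue.add('a');
queue.add('b');
queue.remove(); // 'a' — first in, first out
```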

When we remove something for the first time, the output array is empty, so we insert the contents of input backward, like ['b', 'a']. Then we pop elements from the output array. As you can see, with this trick we get the output in the same order of insertion (FIFO).
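The input/output trick just described can be sketched as:

```javascript
// Two-array queue: push into `input`; when `output` runs dry, move
// everything over in reverse order so pop yields FIFO order.
class Queue {
  constructor() {
    this.input = [];
    this.output = [];
  }

  add(element) { // O(1)
    this.input.push(element);
  }

  remove() { // amortized O(1)
    if (!this.output.length) {
      // Refill backwards: O(n), but only when output is empty.
      while (this.input.length) {
        this.output.push(this.input.pop());
      }
    }
    return this.output.pop(); // O(1)
  }
}

const queue = new Queue();
queue.add('a');
queue.add('b');
queue.remove(); // 'a'
queue.add('c');
queue.remove(); // 'b'
```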

What’s the runtime?

If the output already has some elements, then the remove operation is constant, O(1). When the output array needs to be refilled, it takes O(n) to do so. After the refill, every operation is constant again. The amortized time is O(1).

We can achieve a Queue with purely constant-time operations if we use a LinkedList. Let's see how in the next section!

Queue implemented with a Doubly Linked List

We can achieve the best performance for a queue using a linked list rather than an array.
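A minimal sketch of a linked-list-backed queue where both operations are O(1); the method names add and remove are illustrative:

```javascript
class Node {
  constructor(value) {
    this.value = value;
    this.next = null;
    this.previous = null;
  }
}

// Queue backed by a doubly linked list: enqueue at the tail,
// dequeue at the head, both in constant time.
class Queue {
  constructor() {
    this.first = null;
    this.last = null;
  }

  add(value) { // O(1): append at the tail using the last reference
    const node = new Node(value);
    if (this.last) {
      node.previous = this.last;
      this.last.next = node;
    } else {
      this.first = node;
    }
    this.last = node;
  }

  remove() { // O(1): detach from the head
    if (!this.first) return undefined;
    const value = this.first.value;
    this.first = this.first.next;
    if (this.first) {
      this.first.previous = null;
    } else {
      this.last = null; // queue became empty
    }
    return value;
  }
}
```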

Adrian Mejia is a full-stack web developer located in Boston.
Currently working at Cisco as a Software Engineer.
Adrian enjoys writing posts about programming and technology.
Also, he likes to travel ✈️ and biking 🚴‍. Find out more here.