Implementing a dynamically sized array in C

By Najaf Ali

One of the biggest stumbling blocks on moving to C from a high-level scripting
language comes when you realize you need to know how big an array needs to be
before you use it. In this article I'm going to go through a quick
implementation of a dynamically sized array in C.

Not only will this give us a useful bit of library code in C, it will also help to illustrate the sort of
performance/memory overhead in higher level languages
have to pay down to give users access to built-in variable-length arrays.

Basic C arrays

Here's a quick refresher on what a basic C array looks like, allocated on the
stack:

Allocates memory on the stack1. Specifically it allocates 3000 * sizeof(int)
bytes on the stack frame of the main function call. Memory allocated in this
way is automatically freed when execution reaches the end of the current
block (in this case, the end of main).

Makes a pointer to the allocated memory available in the variable my_array.
We can index into my_array with subscripts e.g. my_array[271] would give us
the 270th 272nd element. This by the way, would do exactly the same
thing as *(my_array + 271).

Based on the above, it's clear that C arrays are for the most part syntactic sugar
working with its underlying memory handling operations. Some other notable
factoids:

There's no easy way to automatically resize an array as its contents expand.
You can call realloc on an array you've allocated on the heap to make it
larger (more on this later) but ideally the resizing of the array would be
automatic.

You can index out of array bounds. There is no
bounds checking in C arrays. In other words, there's nothing stopping you from
reading a value out of my_array[5000]. Since all this code does is read the
memory 5000 elements2 after the location of the pointer my_array, indexing
out of an arrays bounds will simply retrieve some as-yet unallocated location
in virtual memory. This is most likely not what you intended, and can have
potentially dangerous consequences.

When we can afford a few extra clock cycles and the requisite space, it would be
nice to abstract these problems away in some sort of underlying data structure.
A dynamic array a.k.a a vector does exactly that.

N.B. A vector doesn't solve all of the problems that we have when dealing with collections.
It's quite good for lists which are append-only, but if insertion/deletion is a
common operation then it's possible that a linked list is a better choice.
Linked lists come with their own set of problems that won't quite fit into this
article.

Defining a Vector interface

In this example we're just going to create a dynamically sized array of
integers. Here's what the definition of a vector interface might look like:

#define VECTOR_INITIAL_CAPACITY 100 is a constant we'll use later to set a
starting capacity of our vectors internal array. 100 is an arbitrary value,
and if we were writing this for real we might think a little harder about what
an optimal initial capacity might be.

typedef struct { ... } Vector is the definition of our Vector type.

size is the current size of the vector,

capacity is the total size of the underlying array

data is a pointer to the first element of the underlying array itself.

vector_init is a function that initializes a vector struct. It sets size
to 0, capacity to VECTOR_INITIAL_CAPACITY and allocates an appropriate
amount of memory (vector->capacity * sizeof(int)) for the underlying data array.

vector_append appends the given value to the vector. If the underlying
data array is full, then calling this function should cause vector->data to
expand to accept this value. Increments vector->size.

vector_get returns a value out of a vector at the given index. If the
index is below 0 or greater than vector->size - 1, this function should complain
about the index being out of bounds.

vector_set sets the value at the given index to the given value. If the
index is greater than the vector->size, this function should expand the
vector until it is big enough to contain the index and set the value at that
index. It should zero-fill all values in between. vector->size should be
incremented accordingly.

vector_double_capacity_if_full doubles the underlying data array capacity
if vector->size >= vector->capacity. We'll find out later that changing the
size of the array is expensive, so in order to minimize the number of times
we need to resize, we double the capacity each time.

vector_free frees the memory allocated for the data array. We leave
freeing of the Vector struct itself to client code (so they can use any sort
of pointer they like, be it stack or heap, and then clean up after
themselves).

This is a very simple plan for the implementation and glosses over a lot of
issues like error handling, but it should be enough to get us to a basic working
version of our vector.

Implementing the Vector

Here's what an implementation of the interface we defined above might look like:

Using our Vector

In this usage example, we keep things simple and just pass around the pointer to
a variable called vector allocated on the stack:

// vector-usage.c
#include <stdio.h>
#include "vector.h"
int main() {
// declare and initialize a new vector
Vector vector;
vector_init(&vector);
// fill it up with 150 arbitrary values
// this should expand capacity up to 200
int i;
for (i = 200; i > -50; i--) {
vector_append(&vector, i);
}
// set a value at an arbitrary index
// this will expand and zero-fill the vector to fit
vector_set(&vector, 4452, 21312984);
// print out an arbitrary value in the vector
printf("Heres the value at 27: %d\n", vector_get(&vector, 27));
// we're all done playing with our vector,
// so free its underlying data array
vector_free(&vector);
}

Tradeoffs

Working at this low a level helps to highlight a number of tradeoffs that we don't
necessarily see when coding at break-neck speed in high-level languages like Ruby.
For example:

Resizing the vector (specifically, calling realloc()) might be expensive. From the man page:

The realloc() function tries to change the size of the allocation pointed to by ptr to size, and returns ptr. If there is not enough room to enlarge the memory allocation pointed to by ptr, realloc() creates a new allocation, copies as much of the old data pointed to by ptr as will fit to the new allocation, frees the old allocation, and returns a pointer to the allocated memory.

If we ever find ourselves in the situation where there's no room for the data
array to expand into, we may be hit with the expensive copy operation. In order
to avoid excessive calls to realloc then, we double the capacity of the array each
time we expand it.

This strategy has the downside of potentially allocating space we
never use, so there's a trade-off to be made between wasted space and
performance overhead.

Also, Vector only ever holds integers. If we were to hold an array of
void pointers instead of integers, our vector could be used with values of
arbitrary types. This would of course incur the overhead of dereferencing
a pointer every time we wanted to read a value out of the array, so there's a
trade-off there between flexibility and an (albeit small) performance hit.

Related Posts

Footnotes

1 "Allocating memory on the stack" really just means taking the
current position of the stack pointer, assigning it to the variable name and
then incrementing the stack pointer by the size of the type of the variable.
It's an extremely quick operation compared with allocating memory on the heap
with malloc and friends. Unlike with malloc however you lose any stack allocated
memory once you pop your current stack frame (i.e. execution reaches the end of the
current function).

2 By elements we mean the size of the type of the array. If an array
holds elements of type int which each occupy four bytes, then
integer_array[5000] means the location in memory of integer_array
offset by (5000 * 4) bytes.