Introduction

Whether you are learning Rust or C, it is important that you have
an understanding of how various data types are represented in memory.
For example, when learning C, if you are not able to
sketch on a piece of paper how the array ‘a’ is represented in memory,
you will have a hard time understanding the meaning of expressions like
*a, **a etc (array decaying
rules make this tricky in C):

main(){char*a[]={"abc","def","ijk"};}

In C, it is possible to learn more about the representation of the
array simply by printing out the value of various expressions involving
‘a’. One can also think of using ‘gdb’ to peek into memory. Finally, it
is not too difficult to read and understand the assembly code generated
by the C compiler.

Things are not that easy in Rust. References in Rust are not the free
spirits you encounter in the land of C - they have a lot of restrictions
imposed on them (for an excellent reason, the safe execution of your code).
But these restrictions will have to be lifted if you wish to write very
low level Rust code (say operating system kernels, embedded systems) - Rust
does indeed provide an ‘unsafe’ mechanism using which you are free to do
all the wild and crazy things you can do with pointers in C. Needless to say:

this should be used with extreme caution if you do not wish your Rust code
to blow up the way your C code sometimes does! Unless you are writing very
low level code, you will seldom have the need to use ‘unsafe’.

In this article, we will use the ‘unsafe’ mechanism in Rust purely for
educational purpose - to understand how Rust lays out objects of different
types in memory.

Getting the address of a variable

If you expect the following program to print the address of ‘a’,
you are in for surprise:

fnmain(){leta=10;letb=&a;println!("{:?}, {:?}",b,b+1);}

You will see the value of ‘a’ and ‘a+1’ (that is,
10 and 11) in the output.

Even though ‘b’ contains address of ‘a’, when you try
to access ‘b’, Rust will automatically derefer the pointer
and give you the pointed-to value (ie, Rust will automatically
do *b when you make use of ‘b’ in an expression like ‘b+1’ - we need not
write *b explicitly).

This behaviour is very convenient because that is what you
really want to do with a pointer: access the pointed-to object.

It is also safe; if you are able to perform arithmetic on the
memory address and then derefer the modified address (say do
something like *(b+1)), there is no guarantee that the modified
address (b+1) is pointing to a valid memory location.

But what if you really want to see the address of ‘a’?

fnmain(){leta=10;letb=&a;println!("{:?}",bas*consti32);}

This is the output I got on my system (x86_64 Linux):

0x7ffeebf83a24

[Edit(14/09/2016)] You can use

println!("{:p}",&a)

to see the
address of ‘a’. Thanks Manish for pointing this out!

Rust has a “raw pointer” type, written as “* const T” (where T is some
type like say i32). None of the safety checks associated with ordinary
Rust references are applicable to raw pointers. In the above example,
we are asking Rust to interpret ‘b’ as a raw pointer, which will let us
see the actual address stored in ‘b’.

Here is another way to write the same program:

fnmain(){leta=10;letb:*consti32=&aas*consti32;println!("{:?}",b);}

The explicit declaration of the type of ‘b’:

letb:*consti32=&aas*consti32;

is not really required as Rust can easily infer the type; we
need to only write:

letb=&aas*consti32;

Dereferencing raw pointers

Here is a fun program in C:

intmain(){int*p=0;intk=*p;}

This will of course give you a segfault as you are trying to derefer the
null pointer.

Let’s try to write an equivalent one in Rust:

fnmain(){letp=0as*consti32;}

This will compile and run without any problem; you can store 0 in a
raw pointer.

Now let’s try this one:

fnmain(){letp=0as*consti32;letk=*p;}

We are now trying to access the memory location whose address is 0; you will
find that the code will not compile. We get an error which says:

Rust does not allow us to derefer raw pointers because it is a dangerous
operation with the potential for undefined behaviours; a raw pointer is
not guaranteed to contain valid memory addresses.

If you are really stubborn, Rust will let you do this:

fnmain(){letp=0as*consti32;unsafe{letk=*p;}}

The above program will compile properly; and you get a wonderful
segfault when you run it!

Any code which derefers a raw pointer should be explicitly marked
as “unsafe”; the big promise of Rust is memory safety without using
a garbage collector - you don’t get that safety in unsafe blocks.

Why does Rust need unsafe?

Say you are writing code which runs on a microcontroller. Your code
needs to write some data to an I/O port. Most often, I/O ports are
memory mapped, that is, you can access the I/O port by reading from
or writing to specific memory locations. If you have an I/O port
which is mapped to address 0x7134af23 and you want to write 1 byte
of data:

intmain(){unsignedchar*c=(unsignedchar*)0x7134af23;*c='A';}

Unless you have raw pointers and “unsafe” blocks, Rust code will not
be able to do low level
manipulations like this which are essential for writing operating systems
and embedded systems code. Rust is targeted to be a replacement for C, so
it should be capable of doing everything that is possible in C. As the dangerous
parts of the code are explicitly marked as “unsafe”, it becomes easy to
identify and more thoroughly audit such blocks of code. Redox OS,
an operating system written in Rust, has demonstrated that it is
possible to write a large percentage of the kernel code in safe rust.

Looking at the memory layout of 32 bit integers

Here is a C program which declares a 32 bit integer and tries to read
it back as four independent bytes:

Using the “transmute” function

The “transmute” function takes in an object of type u32 and
returns a view of that object as an array of four 8 bit
values, which gets copied to b. You can think of it as a
kind of copy operation which copies all the bits of an object
of type T1 to an object of type T2. Both T1 and T2 must have
the same size and alignment.

Warning: Use of the “transmute” function
is discouraged. As Rust beginners, most
of us will have absolutely no need for this operation in the code that we write.
I am using it here purely for educational purpose.

Understanding Vector Layout

The figure below shows how a vector is mapped to memory; we have
a triplet of values: (pointer,capacity,length) on the stack. The
space for the contents of the vector is allocated on the heap. The
“length” attribute specifies the actual number of elements in the
vector, “capacity” represents the actual amount of space allocated
(which may
be larger than the number of elements in the vector) on the heap and “pointer”
points to the heap location where the contents of the vector are
stored.

Here is a program which reads the stack locations representing
a vector and converts the data into three u64 values:

The program changes the stack representation of the vector;
the pointer value is now 0 and capacity and length are both 1.
The program will segfault when you try to print v[0] because
it will try to derefer 0.

Once again: these are horrible tricks which shouldn’t be employed
in production code!

Exercise: You can write a program to understand how Strings
are represented in memory.

The transmute operation requires that you know the exact
size of the object to be transmuted. Here is how you can find
out the size of a Rust data type:

A variable of type Color can assume only one of 3 possible values
at any point in time. Each value needs 4 bytes of storage, but because
you are storing only one value at any point in time, the compiler allocates
only 4 bytes instead of 12.

But now there is a problem. How does Rust represent the fact that what
is stored in this common 4 byte storage area is a “Red”. Or, a “Green”, or
a “Blue”? It will be impossible to execute “match” operations without
this knowledge.

The solution is simple. Allocate another 4 byte location as a “tag”. If the
tag value is 0, the next 4 byte location represents a Red, if the tag
is 1, it is a Green and so on … (only 1 byte of the tag space will be used,
the other 3 bytes are most probably used as a padding to meet the alignment
restriction for the 32 bit integer Red/Green/Blue data field which comes next).

The tag field value is 1 (ie, Green) and the data field value
is hex 12.

Exercise: Write similar programs to find out the memory layout
of product types: tuples and structures.

Conclusion

All the basic Rust data types have a simple representation in
memory; the common theme is to allocate everything on the stack
without any layers of indirection. The only exception is
when you need data structures which grow dynamically; in this
case, the dynamic part is allocated on the heap and the stack
allocated part will have a pointer to this heap allocated area plus
some other information (like length, capacity etc).

When a data structure is passed to a function, you have two options:
either copy the stack allocated part to the function or to pass
a simple pointer to the beginning of the stack allocated part.