There's a question that always comes up when people pick up the
Rust programming language: why are there two
string types? Why are there String and &str?

My Declarative Memory Management
article answers the question partially, but there is a lot more to say about
it, so let's run a few experiments and see if we can conjure up a thorough
defense of Rust's approach over, say, C's.

We're using the standard C11 main function signature, which takes the
number of arguments (argc, for argument count) as an int, and an “array”
of “strings” (argv, for argument vector) as a char**, or char *[].

Then we use the printf format specifier %s to print each argument
as a string - followed by \n, a newline. And sure enough, it prints each
argument on its own line.

Before proceeding, let's make sure we have a proper understanding of what's
going on.

Although our Node.js program behaves as expected, we can see that “é” is
also different from the other letters, and that the upper-case counterpart
of “c3 a9” is “c3 89”.

Our C program didn't work - it couldn't work, because it was only seeing “c3”
and “a9” individually, when it should have considered them as a single, uh, “Unicode
scalar value”.

Why is “é” encoded as “c3 a9”? It's time for a very quick UTF-8 encoding course.

A very quick UTF-8 primer

So, characters like “abcdefghijklmnopqrstuvwxyz”,
“ABCDEFGHIJKLMNOPQRSTUVWXYZ”, “0123456789”, “!@#$%^&*()”, etc., all
have numbers.

For example, the number for “A” is 65. Why is that so? It's a convention! All
a computer knows about is numbers, and we often use bytes as the smallest
unit, so, a long time ago, someone just decided that if a byte has the value
65, then it refers to the letter “A”.

Since ASCII is a 7-bit encoding, it has 128 possible values: from 0 to 127
(inclusive). But, on modern machines at least, a byte is 8 bits, so there are
another 128 possible values.

Great, everyone thought - we can just stuff “special characters” in there:
accented letters, box-drawing characters, that sort of thing.

It's not… just ASCII, it's ASCII plus 128 characters of our choice. Of
course, there are a lot of languages out there, so not every language's
non-ASCII characters can fit in those additional 128 values, and there were
several alternative interpretations of any value greater than 127.

Those interpretations were named “codepages”. The set just described is
Codepage 437, also known as CP437, OEM-US,
OEM 437, PC-8, or DOS Latin US.

It's sorta adequate for languages like French, if you don't care about
capital letters. It's not adequate at all for Eastern European languages,
and doesn't even begin to cover Asian languages.

So, Japan came up with its own
thing, where they replaced ASCII's
backslash with a yen sign, the tilde with an overline (sure, why not), and
introduced double-byte characters, because 128 extra characters sure wasn't
enough for them.

Cool bear's hot tip

Wait, replacing backslash? Does that mean… in file paths… ?

…yep.

And for languages with smaller alphabets, people used other code pages
like Windows-1252 for years,
and most text in the Western world was still sorta-kinda ASCII - so-called
“extended ASCII”.

But eventually, the world collectively started to put their affairs in order
and settled on UTF-8, which:

Looks like ASCII (not extended) for ASCII characters, and uses the same space.

Allows for a lot more characters - over a million of them - using multi-byte sequences.

Of course, before that happened, people asked “isn't two bytes
enough?” (UCS-2), and “what about sequences of two two-byte
units?” (UTF-16), and “surely four bytes is
okay” (UCS-4), but eventually, for important
reasons like compactness, and keeping most C programs half-broken instead of
completely broken, everyone adopted UTF-8.

So, yeah, ASCII plus multi-byte character sequences - how does it even work? Well, it's
the same basic principle: each character has a number. In Unicode, the number for “é”
is 0xE9 - we usually write codepoints like so: “U+00E9”.

And 0xE9 is 233 in decimal, which is greater than 127, so, it's not ASCII, and we need
to do multi-byte encoding.

How does UTF-8 do multi-byte encoding? With bit sequences!

If a byte starts with 0, it's a single-byte sequence (plain ASCII)

If a byte starts with 110 it means we'll need two bytes

If a byte starts with 1110 it means we'll need three bytes

If a byte starts with 11110 it means we'll need four bytes

If a byte starts with 10, it means it's a continuation of a multi-byte character sequence.

So, for “é”, which has codepoint U+00E9, its binary representation is “11101001”, and
we know we're going to need two bytes, so we should have something like this:

110xxxxx 10xxxxxx

We can see that two-byte UTF-8 sequences give us 11 bits of storage:
5 bits in the first byte, and 6 bits in the second byte. We only need to fit 8 bits,
so we fill them from right to left - first the last 6 bits (101001):

110xxxxx 10101001

Then the remaining 2 bits (11):

110xxx11 10101001

The rest is padding, filled with zeroes:

11000011 10101001

We're done! 0b11000011 is 0xC3, and 0b10101001 is 0xA9.

Which corresponds to what we've seen earlier - “é” is “c3 a9”.

Back to C

So, uh, our C program. If we want to really separate characters,
we have to do some UTF-8 decoding.

There! Simple! None of that String and &str business. In fact, there's a
remarkable lack of Rust code for an article about Rust string handling, and
we're about ten minutes in already!

Cool bear's hot tip

Binary literals, e.g. 0b100101010, are not standard C - they're a GNU
extension. Normally you'd see hexadecimal literals, e.g. 0xDEADBEEF, but
then it would be much harder to see what's going on, since UTF-8 deals with
individual bits.

Does our program work?

$ gcc print.c -o print
$ ./print "eat the rich"
e a t t h e r i c h

So far so good!

$ ./print "platée de rösti"
p l a t é e d e r ö s t i

Nice!

$ ./print "23€ ≈ ¥2731"
2 3 € ≈ ¥ 2 7 3 1

Cool!

$ ./print "text 🤷 encoding"
t e x t 🤷 e n c o d i n g

Alright!

Well I don't know what everyone is complaining about, UTF-8 is super easy to
implement, it only took us a few minutes and it is 100% correct and accurate
and standards-compliant and it will work forever on all inputs and always do
the right thing

…or will it?

Here comes the counter-example… I can feel it in my bones.

Consider the following string:

$ echo "noe\\u0308l"
noël

It's just Christmas in French! Surely our program can handle that, no sweat:

$ ./print $(echo "noe\\u0308l")
n o e ̈ l

Uh oh.

Turns out U+0308 is a “combining diaeresis”, which is fancy talk for “just slap two dots
on the previous character”.

In fact, we can slap more of them on if we want, for extra Christmas cheer.

Cool bear's hot tip

Combinations of multiple scalar values that end up showing as a single
“shape” are called “grapheme clusters”, and you should read Henri Sivonen's
It’s Not Wrong that “🤦🏼‍♂️”.length ==
7 if you want to learn more about them.
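Here's a sketch of the kind of Rust program we're talking about - the exact
code matters less than the shape (print each character of the first
argument, space-separated):

fn main() {
    // skip(1) ignores the program name, next() takes the first real argument
    let arg = std::env::args()
        .skip(1)
        .next()
        .expect("should have one argument");

    // chars() yields Unicode scalar values, not bytes
    for c in arg.chars() {
        print!("{} ", c);
    }
    println!();
}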

Turbo explanation of the above: std::env::args() returns an Iterator of
strings. skip(1) ignores the program name (which is usually the first
argument), and next() gets the next element in the iterator (the first “real”
argument).

By that point we have an Option<String> - there might be a next argument, or
there might not be. If there isn't, .expect(msg) stops the program by printing
msg. If there is, we now have a String!

Our naive UTF-8 decoder first read C3 and was all like “neat, a 2-byte sequence!”,
and then it read the next byte (which happened to be the null terminator), and decided
the result should be “à”.

So, instead of stopping, it read past the end of the argument, right into
the environment block, finding the first environment variable, and now you can see
the places I cd to frequently (in upper-case).

Now, this seems pretty tame in this context… but what if it wasn't?

What if our C program was used as part of a web server, and its output was shown
directly to the user? And what if the first environment variable wasn't CDPATH, but
SECRET_API_TOKEN?

Then it would be a disaster. And it's not a hypothetical, it happens all
the time.

Cool bear's hot tip

By the way, our program is also vulnerable to buffer overflow attacks: if the input
decodes to more than 1024 scalar values, it could overwrite other variables,
potentially variables that are involved in verifying someone's credentials…

So, our C program will happily do dangerous things (which is very on-brand), but
our Rust program panics early if the command-line arguments are not valid UTF-8.

What if we want to handle that case gracefully?

Then we can use OsStr::to_str, which returns an Option - a value that is
either something or nothing.
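Something like this sketch, say - the printed messages are mine, but
OsStr::to_str is the real deal:

fn main() {
    // args_os() hands us OsString values - no UTF-8 guarantee attached
    let arg = std::env::args_os()
        .skip(1)
        .next()
        .expect("should have one argument");

    // to_str() checks: Some(&str) if the bytes are valid UTF-8, None otherwise
    match arg.to_str() {
        Some(s) => println!("valid UTF-8: {}", s),
        None => println!("not valid UTF-8: {:?}", arg),
    }
}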

In Rust, provided you don't explicitly work around it with unsafe, values
of type String are always valid UTF-8.

If you try to build a String with invalid UTF-8, you won't get a String,
you'll get an error instead. Some helpers, like std::env::args(), hide the
error handling because the error case is very rare - but it still checks
for it, and panics if it happens, because that's the safe thing to do.
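We can see that check happen with String::from_utf8 - a quick sketch:

fn main() {
    // 0xC3 announces a two-byte sequence, but nothing follows it,
    // so this is invalid UTF-8 - we get an Err, not a String
    let result = String::from_utf8(vec![b'n', b'o', 0xC3]);
    assert!(result.is_err());
}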

By comparison, C has no string type. It doesn't even have a real character type.
char is… an ASCII character plus an extra bit - effectively, it's usually just
a signed 8-bit integer: int8_t.

There is absolutely no guarantee that anything in a char* is valid UTF-8,
or valid anything for that matter. There is no encoding associated with a char*,
which is just an address in memory. There is no length associated with it either,
so computing its length involves scanning for the null terminator, one byte at a time.

Null-terminated strings are also a serious security concern. Not to mention
that NUL is a valid Unicode character,
so null-terminated strings cannot represent all valid UTF-8 strings.
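Rust strings carry their length separately, so NUL is a character like any
other - a quick sketch:

fn main() {
    // U+0000 is a perfectly valid Unicode scalar value,
    // and a Rust string is happy to contain it
    let s = "null\0in the middle";
    assert_eq!(s.len(), 18);
    assert_eq!(s.chars().count(), 18);
}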

See? Easy! None of that String / &str nonsense. No lifetimes, no nothing.

Ah. Deep breath. Simpler times.

Okay, back to reality. First of all, that's not really the length of
a string. It's… the number of bytes it takes to encode it in UTF-8.

So, for example:

$ ./woops "née"
length of "née" = 4

And also:

$ ./woops "🐈"
length of "🐈" = 4

But in all fairness, that was to be expected. We didn't spend half
the article implementing a half-baked UTF-8 decoder and encoder just
to be surprised that, without one, we can't count characters properly.
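For what it's worth, Rust makes the exact same choice: String::len() counts
bytes, and counting scalar values is a separate, explicit operation. A quick
sketch:

fn main() {
    let s = "née";
    // len() is the number of bytes in the UTF-8 encoding...
    assert_eq!(s.len(), 4);
    // ...while chars() yields Unicode scalar values
    assert_eq!(s.chars().count(), 3);
}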

Also, that's not what's bothering me right now.

What's bothering me right now is that the compiler does nothing to prevent
us from writing a len() that zeroes out every byte of its input while
counting it.

And, you know, len() is right. By the time it's done… the length of the
string is zero. (It even “works” on non-ASCII inputs!)

This would pass unit tests. And if no one bothered to look at the len
function itself - say, if it was in a third-party library, or worse, a
proprietary third-party library, then it would be… interesting… to
debug.

Now it compiles again. And it runs. And it doesn't fail at runtime - it
silently overwrites our input string, just the same.

Even -Wall, -Wextra and -Wpedantic don't warn us about this. They warn
us about argc being unused. Which, fair enough, not passing an argument
to ./woops definitely ends up reading from unmapped memory addresses and
crashes right now.

And if this is in a proprietary library, you're lulled into a false sense
of security, because you look at the header file and you see this:

int len(const char *s);

But, granted - that's a contrived example. You'd have to be a pretty evil
vendor to ship a len function that mutates its input. Unless you do it
accidentally. Which you'd never do, right? Unless you do. In which case,
well, you shouldn't have. Obviously.

Right?
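For contrast, here's a sketch of the same signature in Rust - and there,
it's not documentation, it's a contract the compiler enforces:

// A shared reference means read-only access: a len() that takes
// a &str cannot modify the string's bytes, or it won't compile.
fn len(s: &str) -> usize {
    s.len()
}

fn main() {
    let s = String::from("eat the rich");
    println!("length = {}", len(&s));
    println!("still intact: {}", s);
}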

Okay, so let's go with a more realistic example: a function that turns a
string uppercase.

See, arg points to somewhere in memory that is set up at process startup.
Again, the details are out of scope, but what I can tell you with confidence
is that it hasn't been allocated with malloc, and it shouldn't be freed
with free.

The result of strdup, however, definitely needs to be freed by calling
free.

In fact, valgrind can easily confirm that our program is
leaking memory right now.

If we built our program with -g, it'd even show us line numbers in our “.c”
sources. It does show line numbers in glibc's “.c” sources, because I
installed glibc debug symbols recently, for reasons - but yeah, whoa, that's
a lot of output.

So anyway, silly me, I freed upp right before printing it - my fingers
slipped! Luckily, this never happens in real life, right? Haha.

It would be cool if the compiler could tell me about it at compile-time, but
I guess that's impossible, because free() is just a regular function, so who's
to say what it actually does? What if we defined our own free function that
doesn't invalidate its argument? Checkmate, static analysts.
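(It isn't impossible, for what it's worth - it's exactly what the Rust
compiler does. A sketch, with the offending line commented out so this
compiles:)

fn main() {
    let upp = String::from("PLATÉE DE RÖSTI");
    drop(upp); // the moral equivalent of free(upp)

    // uncommenting the next line gives, at compile time:
    // error[E0382]: borrow of moved value: `upp`
    // println!("{}", upp);
}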

How nice. We even use const in all the right places! I think! Except maybe argv!
Who knows? The compiler sure doesn't seem to care much. I guess casting non-const to const
is pretty harmless. Fair enough, GCC, fair enough.

Now our program 100% does what it should: for each character of src, we
convert it to uppercase, and then store it into dst, which was allocated by
the caller - so it's, well, clearer that it's the caller's job to free it.

But um. Speaking of null terminators… what happened to it exactly? I don't
remember setting a null terminator in dst.

Oh, haha. It turns out that, the way we wrote our loop, we read the null
terminator and pass it to toupper(), eventually writing it to dst. So we
made two mistakes (iterating too far, and not writing a null terminator) but
they, kinda, canceled each other out.

Lucky for us, toupper has no way to return an error, and just returns 0 for 0,
right? Or maybe 0 is what it returns on error? Who knows! It's a C API!
Anything is possible.

Again, I don't usually make those mistakes - I'm a programmer. Humans are
not known for being fallible. This is clearly just a fluke. It would never
happen if I was actually writing production code, much less if I worked for
a large company.

And if I did, it probably wouldn't result in remote code execution or
something nasty like that. It'd probably just display a playful panda
that asks our users to wait until some of our engineers tend to the problem.

Right?

And even if I was working for a large company, and was writing production
code, some tool probably would have caught that. Obviously Valgrind didn't,
but you know, some other tool. I'm sure there's a tool. Somewhere.

Unless it's a perfectly legitimate use of the C language and standard
library, just… an incorrect one. And in that case no amount of tooling will
help us. But that's why we have teams! Well, if we have a team. Surely a
colleague would have caught it during code review. I thought it looked a bit
suspicious. Didn't I say I felt funny about the code? I could've sworn.

Anyway, let's fix it. It won't look as C-ish as before, but it's
for the sake of robust code, so just go with it.
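Here's a sketch of the shape it takes - the function name is ours, but
char::to_uppercase and String::push are doing exactly what they say:

fn uppercase(src: &str) -> String {
    let mut dst = String::new();
    for c in src.chars() {
        // to_uppercase() returns an iterator, because uppercasing
        // one character can yield several - 'ß' uppercases to "SS"
        for upper in c.to_uppercase() {
            dst.push(upper);
        }
    }
    dst
}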

char::to_uppercase() returning an Iterator is great for performance. Our C UTF-8
implementation was eager: it always decoded (or re-encoded) an entire string.
Rust's standard library uses iterators everywhere instead, which makes things lazy.

If we were to iterate through the characters of a string, convert them to
uppercase, and check whether the result contains “S”, then even if the input
was a very large string, we'd only ever have a handful of characters in
memory at a time, never the full thing. Also, we'd stop decoding as soon as
we found the “S”. It's a very flexible design!
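As a sketch (the helper name is invented, the laziness is not):

fn contains_uppercase_s(s: &str) -> bool {
    s.chars()                         // decodes UTF-8 on demand
        .flat_map(char::to_uppercase) // uppercases on demand
        .any(|c| c == 'S')            // stops at the first match
}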

But… that's not quite what the C code did, is it? When we had the C version of uppercase
take a src and a dst, it returned… void!
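We can match that shape in Rust too - a sketch where the caller owns the
destination:

fn uppercase(src: &str, dst: &mut String) {
    for c in src.chars() {
        for upper in c.to_uppercase() {
            dst.push(upper);
        }
    }
}

fn main() {
    let mut dst = String::new();
    uppercase("platée de rösti", &mut dst);
    println!("{}", dst); // PLATÉE DE RÖSTI
}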

Of course, we're still not 100% at parity with the C version. We didn't have
to call malloc, because String::new and String::push worry about
allocation for us. We didn't have to call free, because a String going
out of scope frees it automatically.

We didn't have to make an “n” variant of our function, because dst is a
mutable reference to a growable string, so it's impossible for us to write
past the end of it.

And finally, we couldn't mishandle the null terminator, because there is
none.

So, wait, growable string and all… does that mean we could pre-allocate
a String of reasonable size, and just re-use it for multiple uppercase
calls?
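We can - String::with_capacity, plus a clear() between calls, gets us most
of the way there. But the moment we start handling byte indices ourselves -
say, slicing the input to process it in chunks - those indices had better
fall on character boundaries. A sketch of how that goes wrong:

fn main() {
    let arg = std::env::args()
        .skip(1)
        .next()
        .expect("should have one argument");

    // string slices take *byte* offsets - if byte 2 falls in the
    // middle of a multi-byte character, this panics at runtime
    println!("{}", &arg[..2]);
}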

$ cargo run --quiet -- "🙈🙉🙊💥"
thread 'main' panicked at 'byte index 2 is not a char boundary; it is inside '🙈' (bytes 0..4) of `🙈🙉🙊💥`', src/libcore/str/mod.rs:2069:5
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace.

That is… that is just excellent. Not just the fact that it panics - which
is the safe thing to do - but look at that message. I'm just completely in
awe. No, I hadn't tried it before writing this article but DARN look at that
thing.

Closing words

Anyway, this article is getting pretty long.

Hopefully it's a nice enough introduction to the world of string handling in
Rust, and it helps explain why Rust has both String and &str.

The answer, of course, is safety, correctness, and performance. As usual.

We did encounter roadblocks when writing our last Rust string manipulation
program, but they were mostly compile-time errors, or panics. Not once have we:

Read from an invalid address

Written to an invalid address

Forgotten to free something

Overwritten some other data

Needed an additional sanitizer tool to tell us where the problem was

And that, plus the amazing compiler messages, and the community, is the
beauty of Rust.