Thursday, April 14, 2016

Names Considered Useful

Software is written for people to understand; variable names should
be chosen accordingly. People need to comb through your code and
understand its intent in order to extend or fix it. Too often,
variable names waste space and hinder comprehension. Even
well-intentioned engineers often choose names that are, at best, only
superficially useful. This document is meant to help engineers choose
good variable names. It artificially focuses on code reviews because
they expose most of the issues with bad variable names. There are, of
course, other reasons to choose good variable names (such as improving
code maintenance). This document is also a work-in-progress, please
send me any constructive feedback you might have on how to improve it.

Why Name Variables?

The primary reason to give variables meaningful names is so that a
human can understand them. Code written strictly for a computer could
just as well have meaningless, auto-generated names1:

All engineers would recognize the above code is needlessly difficult
to understand, as it violates two common guidelines: 1) don't
abbreviate, and 2) give meaningful names. Perhaps surprisingly, these
guidelines can be counter-productive. Abbreviation isn't always bad,
as will be discussed later. And meaningful is vague and subject to
interpretation. Some engineers think it means that names should always
be verbose (such as MultiDictionaryLanguageProcessorOutput). Others
find the prospect of coming up with truly meaningful names daunting,
and give up before putting in much effort. Thus, even when trying to
follow the above two rules, a coder might write:

Reviewers could, with effort, understand the above code more easily
than the first example. The variable names are accurate and readable.
But they're unhelpful and waste space, because:

processElements

most code "processes" things (after all, code
runs on a "processor"), so process is seven wasted characters
that mean nothing more that "compute". Elements isn't much
better. While suggestive that the function is going to operate on
the collection, that much was already obvious. There's even a
bug in the code that this name doesn't help the reader spot.

numResults

most code produces "results" (eventually); so, as
with process, Results is seven wasted
characters. The full variable name, numResults is
suggestive that it might be intended to limit the
amount of output, but is vague enough to impose a
mental tax on the reader.

collection

wastes space; it's obvious that it's a collection
because the previous tokens were
Collection<Integer>.

num

simply recapitulates the type of the object (int)

result, count

are coding cliches; as with numResults they
waste space and are so generic they don't help the reader
understand the code.

However, keep in mind the true purpose of variable names: the reader
is trying to understand the code, which requires both of the
following:

What was the coder's intent?

What does the code actually do?

To see how the longer variable names that this example used are
actually a mental tax on the reader, here's a re-write of the function
showing what meaning a reader would actually glean from those names:

The naming changes have almost made the code worse than the
auto-generated names, which, at least, were short. This rewrite shows
that coder's intent is still mysterious, and there are now more
characters for the reader to scan. Code reviewers review a lot of
code; poor names make a hard job even harder. How do we make code
reviewing less taxing?

On Code Reviews

There are two taxes on code reviewers' mental endurance: distance
and boilerplate. Distance, in the case of variables, refers to how
far away a reviewer has to scan, visually, in order to remind
themselves what a variable does. Reviewers lack the context that
coders had in mind when they wrote the code; reviewers must
reconstruct that context on the fly. Reviewers need to do this
quickly; it isn't worth spending as much time reviewing code as it
took to write it2. Good variable names eliminate the problem of distance
because they remind the reviewer of their purpose. That way they don't
have to scan back to an earlier part of the code.

The other tax is boilerplate. Code is often doing something
complicated; it was written by someone else; reviewers are often
context-switching from their own code; they review a lot of code,
every day, and may have been reviewing code for many years. Given all
this, reviewers struggle to maintain focus during code reviews.
Thus, every useless character drains the effectiveness of code
reviewing. In any one small example, it's not a big deal for code to
be unclear. Code reviewers can figure out what almost any code does,
given enough time and energy (perhaps with some follow-up questions to
the coder). But they can't afford to do that over and over again, year
in and year out. It's death by 1,000 cuts.

A Good Example

So, to communicate intent to the code reviewer, with a minimum of
characters, the coder could rewrite the code as follows:

Let's analyze each variable name change to see why they make the code
easier to read and understand:

printFirstNPositive

unlike processElements, it's now clear
what the coder intended this function to do (and there's a
fighting chance of noticing a bug)

n

obvious given the name of the function, no need for a more
complicated name

c

collection wasn't worth the mental tax it imposed, so at
least trim it by 9 characters to save the reader the mental
tax of scanning boilerplate characters; since the function is
short, and there's only one collection involved, it's easy to
remember that c is a collection of integers

skipped

unlike results, now self-documents (without a
comment) what the return value is supposed to be. Since
this is a short function, and the declaration of
skipped as an int is plain to see, calling it
numSkipped would have just wasted 3 characters

i

iterating through a for loop using i is a
well-established idiom that everyone instantly understands.
Give that count was useless anyway, i is preferable since
it saves 4 characters

maybePositive

num just meant the same thing int did,
whereas maybePositive is hard to misunderstand and may help one
spot a bug

It's also easier, now, to see there are two bugs in the code. In the
original version of the code, it wasn't clear if that the coder
intended to only print positive integers. Now the reader can notice
that there's a bug, because zero isn't positive (so n should be
greater than 0, not greater-than-or-equals). (There should also be
unit tests). Furthermore, because the first argument is now called
maxToPrint (as opposed to, say, maxToConsider), it's clear the
function won't always print enough elements if there are any
non-positive integers in the collection. Rewriting the function
correctly is left as an exercise for the reader.

Naming Tenets (Unless You Know Better Ones)

As coders our job is to communicate to human readers, not
computers.

Don't make me think. Names should communicate the coder's
intent so the reader doesn't have to try to figure it out.

Code reviews are essential but mentally taxing. Boilerplate must
be minimized, because it drains reviewers' ability to concentrate on
the code.

We prefer good names over comments but can't replace all comments.

Cookbook

To live up to these tenets, here are some practical guidelines to use
when writing code.

Don't Put the Type in the Name

Putting the type of the variable in the name of the variable imposes a
mental tax on the reader (more boilerplate to scan over) and is often
a poor substitute for thinking of a better name. Modern editors like
Eclipse are also good at surfacing the type of a variable easily,
making it redundant to add the type into the name itself. This
practice also invites being wrong; I have seen code like this:

Set<Host>hostList = hostSvc.getHosts(zone);

The most common mistakes are to append Str or String to the name,
or to include the type of collection in the name. Here are some
suggestions:

Bad Name(s)

Good Name(s)

hostList, hostSet

hosts, validHosts

hostInputStream

rawHostData

hostStr, hostString

hostText, hostJson, hostKey

valueString

firstName, lowercasedSKU

intPort

portNumber

More generally:

Pluralize the variable name instead of including the name of a
collection type

If you're tempted to add a scalar type (int, String, Char) into your
variable name, you should either:

Explain better what the variable is

Explain what transformation you did to derive the new variable
(lowercased?)

Use Teutonic Names Most of The Time

Most names should be Teutonic, following the spirit of languages like
Norwegian, rather than the elliptical vagueness of Romance languages
like English. Norwegian has more words like tannlege (literally
"tooth doctor") and sykehus (literally "sick house"), and fewer
words like dentist and hospital (which don't break down into other
English words, and are thus confusing unless you already know their
meaning). You should strive to name your variables in the Teutonic
spirit: straightforward to understand with minimal prior knowledge.

Another way to think about Teutonic naming is to be as specific as
possible without being incorrect. For example, if a function is
hard-coded to only check for CPU overload, then name it
overloadedCPUFinder, not unhealthyHostFinder. While it may be used
to find unhealthy hosts, unhealthyHostFinder makes it sound more
generic that it actually is.

The exceptions to the Teutonic naming convention are covered later on
in this section: idioms and short variable names.

It's also worth noting that this section isn't suggesting to never
use generic names. Code that is doing something truly generic should
have a generic name. For example, transform in the below example is
valid because it's part of a generic string manipulation library:

Avoid Over-used Cliches

In addition to not being Teutonic, the following variable names have
been so horribly abused over the years that they should never be used,
ever.

val, value

result, res, retval

tmp, temp

count

str

The moratorium on these names extends to variations that just add in a
type name (not a good idea anyway), such as tempString, or intStr,
etc.

Use Idioms Where Meaning is Obvious

Unlike the cliches, there are some idioms that are so
widely-understood and unabused that they're safe to use, even if by
the letter of they law they're too cryptic. Some examples are (these
are Java/C specific examples, but the same principles apply to all
languages):

Warning: idioms should only be used in cases where it's obvious
what they mean.

May Use Short Names Over Short Distances When Obvious

Short names, even one-letter names, are preferable in certain
circumstances. When reviewers see a long name, they tend to think they
should pay attention to them, and then if the name turns out to be
completely useless, they just wasted time. A short name can convey the
idea that the only useful thing to know about the variable is its
type. So it is okay to use short variable names (one- or two-letter),
when both of the following are true:

The distance between declaration and use isn't great (say within
5 lines, so the declaration is within the reader's field of
vision)

There isn't a better name for the variable than its type

The reader doesn't have too many other things to remember at that
point in the code (remember, studies show people can remember about
7 things).

Here's an example:

voidupdateJVMs(HostServices, FilterobsoleteVersions){// 'hs' is only used once, and is "obviously" a HostServiceList<Host>hs = s.fetchHostsWithPackage(obsoleteVersions);
// 'hs' is used only within field of vision of readerWorkfloww = startUpdateWorkflow(hs);
try{
w.wait();
}finally{
w.close();
}}

But this takes up more space with no real gain; the variables are all
used so close to their source that the reader has no trouble keeping
track of what they mean. Furthermore, the long name updateWorkflow
signals to the reviewer that there's something significant about the
name. The reviewer then loses mental energy determining that the name
is just boilerplate. Not a big deal in this one case, but remember,
code reviewing is all about death by 1,000 cuts.

Remove Thoughtless One-Time Variables

One-time variables (OTVs), also known as garbage variables, are
those variables used passively for intermediate results passed between
functions. They sometimes serve a valid purpose (see the next
cookbook items), but are often left in thoughtlessly, and just clutter
up the codebase. In the following code fragment, the coder has just
made the reader's life more difficult:

Again, because the distances are all short and the meanings are all
obvious, short names can be used.

Use Longer OTVs to Explain Confusing Code

Sometimes you need OTVs simply to explain what code is doing. For
example, sometimes you have to use poorly-named code that someone else
(ahem) wrote, so you can't fix the name. But you can use a meaningful
OTV to explain what's going on. For example,

23 comments:

Nice post, thanks.I find many similarities with advices in Clean Code by Robert C. Martin. It's a very nice read, I recommend it.There are some slides with key points of his book: http://fr.slideshare.net/hebel/clean-code-vortrag032009pdf

Actually I read the first code and even found an error (but also used the second one to be sure): in line 5 it should be"int a6 = a2.get(a3);"not"int a6 = a2.get(i1);"(a3 in place of i1). So yes this kind of programs is not only difficult to debug, but also to write properly!

In about 1975, the BASIC language only allowed variables to have a single letter and a number (such as a1). DEC basic (digital equipment Corp) provided for a 3rd character signifying the data type. a1% was an integer, a1$ was a string, etc. The company I founded in 1972 developied some pretty sophisticated and commercialy successful software using these naming restrictions. Of course, they were hard to debug and we were all thankful when DEC enabled longer names some years later.

The method name uses "N", but the param for N is named "maxToPrint". Confusing and also violates his/her wasted space rule.

Instead:

int printFirstNPositive(int n, Collection c)

I think the short variable names "n" and "c" are fine since they are part of the method signature and have unambiguous meaning, and also because the implementation is so short (so no problem remembering that n and c are the inputs.

Some useful tips here. A few minor objections regarding your "Teutonic" suggestion (although I agree with your basic intent): English is not a Romance language, it is Germanic. Also, "Teutonic" is antiquated and carries certain questionable overtones. A better way to express your intention here would be to point out how Germanic languages such as modern German create compound nouns that precisely describe a given thing: "Zahnarzt" is literally "tooth doctor," to use one of your examples. This is not a trait unique to Germanic languages, however. The thing about English is not that it is "elliptical," but rather that it has been strongly influenced by French, Latin, and other languages. If you do not speak French, then an English word of French origin may seem vague, but it's really just a word you don't understand as well as you might understand others.

I disagree with the *never use these variable names again* assertion. More fitting is the rule regarding short naming in obvious context; tmp is a favored name, just like i, j, k, ii, nn, t, tt, etc. This behavior derives from days gone by writing Fortran IV code. Hence, the use of such names has long been recognised to be as an index. I hold that no index deserves even a full name; three characters at best. tmp, val and str are such names.

Great post! You covered pretty much everything there is to say about naming variables!

Two remarks.

Instead of the variable name 'c' in your rewritten example I would have written 'cs'. The idea is that you pluralize even one-letter names when they refer to a collection of elements. It becomes especially useful if your code mixes both variables that refer to a single element of a collection and variables that refer to the collections themselves (e.g. 'x' and 'xs'). I picked up this convention from the Haskell community and I'm fond of it since then.

As other commenters have pointed out I find the "overused" variable names 'result' and 'tmp' (or sometimes just 't') useful sometimes. The canonical example for a valid use case for 'tmp' is the swap function where a variable 'tmp' is literally introduced to be a *temporary* reference. I tend to use the name 'result' in a function that returns a collection and imperatively builds up the resulting collection by feeding elements into 'result' (e.g. imperatively written 'filter' function). It's especially useful if this function has many variables or operates on several collections since you see at a glance which variable refers to the resulting collection even though you could have named it 'r' if only considering your "distance" argument.

Just a small fix. You said and I agree that types shouldn't be on the variable name because it is already explained when declaring the variable. But this is exclusive to static typed languages. If you are working with a dynamic typed language, it would make more sense to put the type of the variable in the name, as you don't know what it is until you make a method call on that object.

Also here is a small error:

// BADint port; // TCP port number// GOODint tcpPortNumber;

With your explanation it would be better just to name it as tcpPort. You already know it is a number because it is declared as int.

Thanks for all the comments and bug fixes - I made a few minor updates attempting to address the more mechanical problems people pointed out. Regarding the use of the term Teutonic, I think it might be more accurate to say Agglutinative. I noticed this in a talk by Kevlin Henney (see slide 40 of http://www.slideshare.net/Kevlin/seven-ineffective-coding-habits-of-many-programmers-42301681 for example). Because he used it to refer to the common anti-pattern of gluing together low-information-content words (his example was "validateCustomerValue"), I'd be hesitant to use the term here, for fear of causing confusion.

I only have one difference of opinion, which is when CamelCasing acronyms and initialisms, only the first letter should be capitalized. It's not too bad if there's only one in the name, but when I see names such as AWSEC2HTTPClient, my head explodes.