Cleaning Up Your Data With Go: Part 1

Overview

One of the most important aspects of any application is validating its input. The most basic approach is to fail if the input doesn’t satisfy the requirements. However, in many cases this is not enough. In many systems, data collection is separate from data analysis: the data may come from a survey or an old dataset, and rejecting it outright is not an option.

In these cases, it is necessary to go over the entire dataset before analysis, detect invalid or missing data, fix what can be fixed, and flag or remove data that can’t be salvaged. It is also useful to provide statistics about the quality of the data and what kinds of errors were encountered.

In this two-part series you’ll learn how to use Go’s text facilities, slice and dice CSV files, and ensure your data is spotlessly clean. In part one, we’ll focus on the foundation of text processing in Go—bytes, runes, and strings—as well as working with CSV files.

Text in Go

Before we dive into data cleaning, let’s start with the foundation of text in Go. The building blocks are bytes, runes, and strings. Let’s see what each one represents and what the relationships are between them.

Bytes

Bytes are 8-bit numbers. Each byte can represent one of 256 possible values (2 to the power of 8). Each character in the ASCII character set can be represented by a single byte. But bytes are not characters. The reason is that Go, as a modern language, supports Unicode, which has far more than 256 separate characters. Enter runes.
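As a small illustration of the difference between bytes and characters, here is a minimal sketch. The helper name rawBytes is made up for this example; it only converts a string to its underlying bytes.

```go
package main

import "fmt"

// rawBytes returns the bytes that make up s.
// (Hypothetical helper, introduced only for this illustration.)
func rawBytes(s string) []byte {
	return []byte(s)
}

func main() {
	// ASCII characters occupy exactly one byte each.
	fmt.Println(rawBytes("Go")) // [71 111]

	// A non-ASCII character needs more than one byte in UTF-8.
	fmt.Println(len(rawBytes("∆"))) // 3
}
```

The single character ∆ takes three bytes, which is why a byte count cannot be used as a character count.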

Runes

A rune in Go is another name for the int32 type. This means that each rune can represent more than four billion separate values (2 to the power of 32), which is good enough to cover the entire Unicode character set.

In the following code you can see that the rune ‘∆’ (alt-J on the Mac) is just an int32. To print the character it represents to the screen, I have to convert it to a string.
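A minimal sketch of that idea follows. The helper name describeRune is made up for illustration; everything else is the standard fmt package.

```go
package main

import "fmt"

// describeRune returns the numeric code point of r and its printable form.
// (Hypothetical helper, introduced only for this illustration.)
func describeRune(r rune) (int32, string) {
	return r, string(r)
}

func main() {
	r := '∆'
	fmt.Printf("%T\n", r) // int32

	n, s := describeRune(r)
	fmt.Println(n) // 8710
	fmt.Println(s) // ∆
}
```

Because rune is an alias for int32, the value can be returned as an int32 without any conversion.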

Strings

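A string literal is a sequence of UTF-8 characters enclosed in double quotes. It may contain escape sequences, each of which is a backslash followed by an ASCII character, such as \n (newline) or \t (tab). These sequences have special meanings. Here is the full list of single-character escapes:

\a   alert or bell
\b   backspace
\f   form feed
\n   line feed or newline
\r   carriage return
\t   horizontal tab
\v   vertical tab
\\   backslash
\'   single quote (valid only within rune literals)
\"   double quote (valid only within string literals)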

Sometimes you may want to store literal bytes directly in a string, regardless of escape sequences. You could escape each backslash, but that’s tedious. A much better approach is to use raw strings that are enclosed in backticks.

Here is an example of a string with a \t (tab) escape sequence, which is represented once as is, then with the backslash escaped, and then as a raw string:
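One way this could look is sketched below. The sample text "a\tb" and the helper name tabSamples are assumptions chosen for the illustration.

```go
package main

import "fmt"

// tabSamples returns the same text written three ways.
// (Hypothetical helper, introduced only for this illustration.)
func tabSamples() (asIs, escaped, raw string) {
	asIs = "a\tb"     // \t is interpreted as a tab character
	escaped = "a\\tb" // \\ produces a literal backslash, followed by t
	raw = `a\tb`      // raw string: nothing is interpreted
	return
}

func main() {
	asIs, escaped, raw := tabSamples()
	fmt.Println(asIs)    // a, a tab, then b
	fmt.Println(escaped) // a\tb
	fmt.Println(raw)     // a\tb

	fmt.Println(escaped == raw)      // true
	fmt.Println(len(asIs), len(raw)) // 3 4
}
```

The escaped and raw versions are identical four-byte strings, while the first one holds a real tab and is only three bytes long.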

While strings are essentially read-only slices of bytes, when you iterate over a string with a for-range statement, you get a rune in each iteration. This means each iteration may consume one or more bytes. This is easy to see with the for-range index, which is the byte offset at which the rune starts. Here is a crazy example. The Hebrew word “שלום” means “Hello” (and peace). Hebrew is also written right to left. I’ll construct a string that mixes the Hebrew word with its English translation.

Then, I’ll print it rune by rune, including the byte index of each rune within the string. As you’ll see, each Hebrew rune takes two bytes, while the English characters take one byte, so the total length of this string is 16 bytes, even though it has four Hebrew characters, three symbols, and five English characters (12 characters). Also, the Hebrew characters will be displayed from right to left:
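Here is a minimal sketch of that loop. The exact string is an assumption: the " = " separator was chosen so that the counts match the description above (four Hebrew characters, three symbols, five English characters). The helper names message and runeOffsets are made up for illustration.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// message mixes the Hebrew word with its English translation.
// The " = " separator is an assumption made for this sketch.
func message() string {
	return "שלום" + " = hello"
}

// runeOffsets returns the byte index at which each rune of s starts.
func runeOffsets(s string) []int {
	var offsets []int
	for i := range s {
		offsets = append(offsets, i)
	}
	return offsets
}

func main() {
	s := message()

	// Print each rune with the byte index where it starts.
	for i, r := range s {
		fmt.Printf("%2d: %c\n", i, r)
	}

	fmt.Println(len(s))                    // 16 (bytes)
	fmt.Println(utf8.RuneCountInString(s)) // 12 (runes)
}
```

The index jumps by two for each Hebrew rune and by one for everything else, so the offsets start 0, 2, 4, 6, 8, 9, 10 and so on.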

All these nuances can be extremely important when you have a dataset to clean up with weird quotes and a mix of Unicode characters and symbols.

When printing strings and byte slices, there are several format specifiers that work the same on both. The %s format prints the bytes as is, %x prints two lowercase hexadecimal characters per byte, %X prints two uppercase hexadecimal characters per byte, and %q prints a double-quoted string escaped with Go syntax.

To print a literal % sign from a format string, just double it. To separate the bytes when using %x or %X, you can add a space, as in “% x” and “% X”. Here is the demo:
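A sketch of such a demo, using the sample string "J\to" (an arbitrary choice) and a made-up helper named formatAll:

```go
package main

import "fmt"

// formatAll renders s with %s, %x, %X, "% x" and %q, in that order.
// (Hypothetical helper, introduced only for this illustration.)
func formatAll(s string) []string {
	return []string{
		fmt.Sprintf("%s", s),
		fmt.Sprintf("%x", s),
		fmt.Sprintf("%X", s),
		fmt.Sprintf("% x", s),
		fmt.Sprintf("%q", s),
	}
}

func main() {
	s := "J\to"
	for _, out := range formatAll(s) {
		fmt.Println(out)
	}

	// Byte slices format the same way as strings.
	fmt.Printf("%x\n", []byte(s)) // 4a096f

	// A doubled percent sign prints a single one.
	fmt.Printf("100%%\n") // 100%
}
```

For this input, %x yields 4a096f, %X yields 4A096F, "% x" yields 4a 09 6f, and %q yields "J\to" with the tab shown as an escape sequence.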

Reading and Writing CSV Files

Data can arrive in many ways and formats. One of the most common formats is CSV (comma-separated values). CSV is a compact plain-text format that is simple to produce and parse. The files typically have a header line with the names of the fields or columns, followed by rows of data where each row contains a value per field, separated by commas.

Here is a little snippet from a UFO sightings dataset (really). The first row (header) contains the column names, and the other lines contain the data. You can see that often the “Colors Reported” column is empty:
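To make the header-plus-rows layout concrete, here is a minimal sketch that parses CSV text with the standard encoding/csv package and counts the empty values in one column. The rows below are invented placeholders, not records from the real dataset, and every column name other than "Colors Reported" is an assumption. The helper name countEmpty is also made up for illustration.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// sample holds invented placeholder rows; only the "Colors Reported"
// column name is taken from the article.
const sample = `City,Colors Reported,Shape Reported,State
Exampleville,,DISK,XX
Sampletown,RED,LIGHT,YY
Testburg,,OVAL,ZZ
`

// countEmpty reports how many data rows have an empty value in column.
func countEmpty(data, column string) (int, error) {
	records, err := csv.NewReader(strings.NewReader(data)).ReadAll()
	if err != nil {
		return 0, err
	}
	if len(records) == 0 {
		return 0, fmt.Errorf("no header row")
	}

	// Locate the column by its name in the header row.
	col := -1
	for i, name := range records[0] {
		if name == column {
			col = i
			break
		}
	}
	if col < 0 {
		return 0, fmt.Errorf("column %q not found", column)
	}

	// Count the empty values in the data rows.
	empty := 0
	for _, row := range records[1:] {
		if row[col] == "" {
			empty++
		}
	}
	return empty, nil
}

func main() {
	n, err := countEmpty(sample, "Colors Reported")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(n) // 2
}
```

By default the csv reader rejects rows whose field count differs from the header's, so indexing row[col] is safe here.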

Writing this chunk of CSV data to a file involves some string operations as well as working with files. Before we dive into the main logic, here are the mandatory parts: the package definition, the imports, and the data string (note the use of const).

The main() function creates a file called “ufo-sightings.csv”, checks that there is no error, and then creates a buffered writer w. The next line defers a call that flushes the contents of the buffer to the file; a deferred call runs when the surrounding function returns. Then, main() uses the Split() function of the strings package to break the data string into individual lines.

Then, inside the for-loop, the leading and trailing whitespace is trimmed from each line. Empty lines are skipped, and non-empty lines are written to the buffer, followed by a newline character. That’s it. The buffer is flushed to the file when main() returns.
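The steps above can be sketched roughly as follows. This is not the original listing: the data rows are invented placeholders, the column names other than "Colors Reported" are assumptions, and the logic is split into two made-up helpers (cleanLines and writeLines) so that each step can be checked on its own. The sketch also returns the result of Flush() explicitly instead of deferring it, so that a write error is not silently lost.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// data holds invented placeholder rows with some stray whitespace
// and an empty line, to give the cleanup something to do.
const data = `
City,Colors Reported,Shape Reported,State
Exampleville,,DISK,XX
   Sampletown,RED,LIGHT,YY

Testburg,,OVAL,ZZ
`

// cleanLines splits data into lines, trims surrounding whitespace,
// and drops lines that end up empty.
func cleanLines(data string) []string {
	var lines []string
	for _, line := range strings.Split(data, "\n") {
		line = strings.TrimSpace(line)
		if line == "" {
			continue
		}
		lines = append(lines, line)
	}
	return lines
}

// writeLines writes each line to the file at path, followed by a newline.
func writeLines(path string, lines []string) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()

	w := bufio.NewWriter(f)
	for _, line := range lines {
		if _, err := w.WriteString(line + "\n"); err != nil {
			return err
		}
	}
	// Flush before the deferred Close runs and report any error.
	return w.Flush()
}

func main() {
	if err := writeLines("ufo-sightings.csv", cleanLines(data)); err != nil {
		fmt.Println("error:", err)
		os.Exit(1)
	}
}
```

With the placeholder data above, cleanLines returns four lines: the header and three rows, with the indentation and the blank lines removed.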

Conclusion

Go has strong facilities to deal with text of all shapes and encodings. In this part of the series, we looked into the basics of text representation in Go, text processing using the strings package, and dealing with CSV files.

In part two, we will put what we’ve learned into practice to clean up messy data in preparation for analysis.