Creating a basic Julia project for loading and saving data [Tutorial]

In this article, we take a look at the common Iris dataset using simple statistical methods. Then we create a simple Julia project to load and save data from the Iris dataset.

This article is an excerpt from a book written by Adrian Salceanu titled Julia Programming Projects. In this book, you will develop and run a web app using Julia and the HTTP package among other things.

To start, we’ll load the Iris flowers dataset from the RDatasets package and manipulate it using standard data analysis functions. Then we’ll look more closely at the data by employing common visualization techniques. Finally, we’ll see how to persist and (re)load our data.

But, in order to do that, first, we need to take a look at some of the language’s most important building blocks.

Here are the external packages used in this tutorial and their specific versions:

Using simple statistics to better understand our data

Now that it’s clear how the data is structured and what is contained in the collection, we can get a better understanding by looking at some basic stats.

To get us started, let’s invoke the describe function:

julia> describe(iris)

The output is as follows:

This function summarizes the columns of the iris DataFrame. If a column contains numerical data (such as SepalLength), it will compute the minimum, median, mean, and maximum. The number of missing and unique values is also included. The last column reports the type of data stored in the column.

A few other stats are available, including the 25th and 75th percentiles, and the first and last values. We can ask for them by passing an extra stats argument, in the form of an array of symbols:

julia> describe(iris, stats=[:q25, :q75, :first, :last])

The output is as follows:

Any combination of stats labels is accepted. The full list of options is :mean, :std, :min, :q25, :median, :q75, :max, :eltype, :nunique, :first, :last, and :nmissing.

In order to get all the stats, the special :all value is accepted:

julia> describe(iris, stats=:all)

The output is as follows:

We can also compute these individually by using Julia’s Statistics package. For example, to calculate the mean of the SepalLength column, we’ll execute the following:
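A minimal sketch of that computation, assuming the iris DataFrame is loaded and using the same column-indexing style as the rest of this tutorial:

```julia
julia> using Statistics

julia> mean(iris[:SepalLength])  # average sepal length, roughly 5.84
```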

Using a similar nested loop built around the cor() function, we can iterate over each column of the dataset, with the exception of Species (the last column, which is not numeric), and generate a basic correlation table. The table shows strong positive correlations between SepalLength and PetalLength (87.17%), SepalLength and PetalWidth (81.79%), and PetalLength and PetalWidth (96.28%). There is no strong correlation between SepalLength and SepalWidth.
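The correlation script itself was not reproduced here; a sketch of it, mirroring the covariance loop shown next but swapping in cor(), would look like this:

```julia
julia> for x in names(iris)[1:end-1]
           for y in names(iris)[1:end-1]
               # pairwise Pearson correlation between the numeric columns
               println("$x \t $y \t $(cor(iris[x], iris[y]))")
           end
           println("--------------------------------------------")
       end
```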

We can use the same script, but this time employ the cov() function to compute the covariance of the values in the dataset:

julia> for x in names(iris)[1:end-1]
           for y in names(iris)[1:end-1]
               println("$x \t $y \t $(cov(iris[x], iris[y]))")
           end
           println("--------------------------------------------")
       end

Loading and saving our data

Julia comes with excellent facilities for reading and storing data out of the box. Given its focus on data science and scientific computing, support for tabular file formats (CSV, TSV) is first class.

Let’s extract some data from our initial dataset and use it to practice persistence and retrieval from various backends.

We can reference a section of a DataFrame by defining its bounds through the corresponding columns and rows. For example, we can define a new DataFrame composed only of the PetalLength and PetalWidth columns and the first three rows:
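Such a slice can be sketched like this, indexing by a row range and an array of column names:

```julia
julia> iris[1:3, [:PetalLength, :PetalWidth]]
# a 3×2 DataFrame holding rows 1–3 of the two petal columns
```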

The generic indexing notation is dataframe[rows, cols], where rows can be a number, a range, or an Array of boolean values where true indicates that the row should be included:

julia> iris[trues(150), [:PetalLength, :PetalWidth]]

This snippet will select all 150 rows, since trues(150) constructs an array of 150 elements that are all initialized to true. The same logic applies to cols, with the added benefit that columns can also be accessed by name.

Armed with this knowledge, let’s take a sample from our original dataset. It will include some 10% of the initial data and only the PetalLength, PetalWidth, and Species columns:

julia> test_data = iris[rand(150) .<= 0.1, [:PetalLength, :PetalWidth, :Species]]

What just happened here? The secret in this piece of code is rand(150) .<= 0.1. It does a lot: first, it generates an array of 150 random Float values between 0 and 1; then, it compares the array, element-wise, against 0.1 (which represents 10% of 1); and finally, the resultant Boolean array is used to filter out the corresponding rows from the dataset. It’s really impressive how powerful and succinct Julia can be!

In my case, the result is a DataFrame with 10 rows, but your data will be different since we’re picking rows at random (and it’s quite possible you won’t get exactly 10 rows either).

Saving and loading using tabular file formats

We can easily save this data to a file in a tabular file format (one of CSV, TSV, and others) using the CSV package. We’ll have to add it first and then call the write method:
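A minimal sketch, assuming the file name test_data.csv (any path will do; note that the exact CSV.read signature has varied between CSV.jl versions):

```julia
pkg> add CSV

julia> using CSV

julia> CSV.write("test_data.csv", test_data)  # persist the sample to disk

julia> td = CSV.read("test_data.csv")         # load it back into a table
```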

Just specifying the file extension is enough for Julia to understand how to handle the document (CSV, TSV), both when writing and reading.

Working with Feather files

Feather is a binary file format that was specially designed for storing data frames. It is fast, lightweight, and language-agnostic. The project was initially started in order to make it possible to exchange data frames between R and Python. Soon, other languages added support for it, including Julia.

Support for Feather files does not come out of the box, but is made available through the package of the same name. Let’s go ahead and add it and then bring it into scope:
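A sketch of the Feather workflow, assuming a file named test_data.feather:

```julia
pkg> add Feather

julia> using Feather

julia> Feather.write("test_data.feather", test_data)  # binary, column-oriented storage

julia> td = Feather.read("test_data.feather")         # round-trip back into a DataFrame
```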

Saving and loading using MongoDB

Next, we need to make sure to add the Mongo package. The package also has a dependency on LibBSON, which is automatically added. LibBSON is used for handling BSON, which stands for Binary JSON, a binary-encoded serialization of JSON-like documents. While we’re at it, let’s add the JSON package as well; we will need it. I’m sure you know how to do that by now; if not, here is a reminder:

pkg> add Mongo, JSON

At the time of writing, Mongo.jl support for Julia v1 was still a work in progress. This code was tested using Julia v0.6.

Easy! Let’s let Julia know that we’ll be using all these packages:

julia> using Mongo, LibBSON, JSON

We’re now ready to connect to MongoDB:

julia> client = MongoClient()

Once successfully connected, we can reference a dataframes collection in the db database:

julia> storage = MongoCollection(client, "db", "dataframes")

Julia’s MongoDB interface uses dictionaries (a data structure called Dict in Julia) to communicate with the server. For now, all we need to do is to convert our DataFrame to such a Dict. The simplest way to do it is to sequentially serialize and then deserialize the DataFrame by using the JSON package. It generates a nice structure that we can later use to rebuild our DataFrame:

julia> datadict = JSON.parse(JSON.json(test_data))

Thinking ahead, to make any future data retrieval simpler, let’s add an identifier to our dictionary:

julia> datadict["id"] = "iris_test_data"

Now we can insert it into Mongo:

julia> insert(storage, datadict)

In order to retrieve it, all we have to do is query the Mongo database using the “id” field we’ve previously configured:
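A sketch of that lookup, assuming Mongo.jl’s find accepts a query dictionary (the exact API may differ across versions of the package, which predates Julia v1):

```julia
julia> result = first(find(storage, Dict("id" => "iris_test_data")))
# the first matching BSON document, from which the DataFrame can be rebuilt
```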

In this tutorial, we looked at the Iris dataset and worked on loading and saving the data in a simple Julia project. To learn more about machine learning recommendations in Julia and testing models, check out the book Julia Programming Projects.