Installing tools for writing Haskell code

After you’ve finished the install instructions, stack should be in your path. ghci is the REPL (read-eval-print loop) for Haskell, though as often as not, you’ll use stack ghci to invoke a REPL that is aware of your project and its dependencies.

What we’re going to make

We’re going to write a little csv parser for some baseball data. I don’t care a whit about baseball, but it was the best example of free data I could find.

Project layout

There’s not a prescribed project layout, but there are a few guidelines I would advise following.

One is that Edward Kmett’s lens library is not only a fantastic library in its own right, but is also a great resource for people wanting to see how to structure a Haskell project, write and generate Haddock documentation, and organize your namespaces. Kmett’s library follows Hackage guidelines on what namespaces and categories to use for his libraries.

There is an alternative namespacing pattern demonstrated by Pipes, a streaming library. It uses a top-level eponymous namespace. Pandoc is another popular project worth looking at for examples of how to organize non-trivial Haskell projects.

Once we’ve finished laying out our project, it’s going to look like this:

I’m also going to add the gitignore from GitHub’s gitignore repository plus some additions for Haskell so we don’t accidentally check in unnecessary build artifacts or other things inessential to the project.

This should go into a file named .gitignore at the top level of your bassbull project.

Building and interacting with your program

One thing to note is that for a module to work as a main-is target for GHC, it must have a function named main and itself be named Main. Most people make little wrapper Main modules to satisfy this, sometimes with argument parsing and handling done via libraries like optparse-applicative.

For now, we’ve left Main very simple, making it just a putStrLn of the string "Hello". To validate that everything is working, let’s build and run this program.

Then we install our dependencies by building our project. This can take some time on the first run, but Stack will cache and share dependencies across your projects automatically.

We did the stack setup just in case you didn’t already have GHC installed. Note that you’ll only have to do this once for a particular version of GHC. If this succeeds, we should get a binary named bassbull. To run this, do the following.

If everything is in place, let’s move on to writing a little csv processor.

Writing a program to process csv data

One thing to note before we begin is that you can fire up a project-aware Haskell REPL using Stack’s GHCi command. The benefit of doing so is that you can write and type-check code interactively as you explore new and unfamiliar libraries or just to refresh your memory about existing code.

First, we’re importing our dependencies. Qualified imports let us give names to the namespaces we’re importing and use those names as a prefix, such as BL.ByteString. This is used to refer to values and type constructors alike. In the case of import Data.Csv where we didn’t qualify the import (with qualified), we’re bringing everything from that module into scope. This should be done only with modules that have names of things that won’t conflict with anything else. Other modules like Data.ByteString and Data.Vector have a bunch of functions that are named identically to functions in the Prelude and should be qualified.
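As a minimal sketch of the same import style, the snippet below uses Data.List and Data.Char from base as stand-ins for the ByteString, Vector and Cassava modules, so it loads in GHCi without any extra packages. The names total and shout are made up for illustration.

```haskell
-- Qualified: everything from Data.List must be written with the L. prefix.
import qualified Data.List as L
-- Unqualified: every export of Data.Char comes straight into scope.
import Data.Char

-- L.foldl' is the strict left fold from Data.List.
total :: [Int] -> Int
total = L.foldl' (+) 0

-- toUpper needs no prefix because its import was not qualified.
shout :: String -> String
shout = map toUpper
```

With the qualified import, a name like L.foldl' cannot collide with anything from the Prelude, which is the reason for qualifying modules such as Data.ByteString and Data.Vector.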

Here we’re creating a type alias for BaseballStats. I made it a type alias for a few reasons. One is so I could put off talking about algebraic data types! I made it a type alias of the 4-tuple specifically because the Cassava library already understands how to translate CSV rows into tuples and our type here will “just work” as long as the columns that we say are Int actually are parseable as integral numbers. Haskell tuples are allowed to have heterogeneous types and are defined primarily by their length. The parentheses and commas are used to signify them. For example, (a, b) would be both a valid value and type constructor for referring to 2-tuples, (a, b, c) for 3-tuples, and so forth.
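A sketch of what such an alias can look like. To keep it free of extra packages, String stands in for the lazy ByteString fields, and the meaning of each column (player, year, team, at bats) is an assumption made for illustration, as is the sample row.

```haskell
-- A type alias: BaseballStats is just another name for this 4-tuple.
-- Assumed column meanings: player id, year, team, at bats.
type BaseballStats = (String, Int, String, Int)

-- A made-up row; any 4-tuple of the right shape is a BaseballStats.
exampleRow :: BaseballStats
exampleRow = ("player01", 2004, "TEAM", 12)

-- Tuples of other lengths use the same parentheses-and-commas syntax,
-- both in the type and in the value.
pair :: (Int, String)
pair = (1, "one")
```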

We need to read in a file so we can parse our CSV data. We called the lazy ByteString namespace BL using the qualified keyword in the import. From that namespace we used BL.readFile which has type FilePath -> IO ByteString. You can read this in English as “I take a FilePath as an argument and I return a ByteString after performing some side effects.”

We’re binding over the IO ByteString that BL.readFile "batting.csv" returns. csvData has type ByteString due to binding over IO. Remember our tuples that we signified with parentheses earlier? Well, () is a sort of tuple too, but it’s the 0-tuple! In Haskell we usually call it unit. It can’t contain anything; it’s a type that has a single value - (), that’s it. It’s often used to signify we don’t return anything. Since there’s usually no point in executing functions that don’t return anything, () is often wrapped in IO. Printing a string is a good example of the result type IO (), as it does its work and returns nothing. In Haskell you can’t actually “return nothing;” the concept doesn’t even make sense. Thus we use () as the idiomatic “I got nothin’ for ya” type and value. Usually if something returns () you won’t even bother to bind it to a name, you’ll just ignore it.
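To make the binding and the unit type concrete, here is a small sketch that uses the Prelude’s readFile (type FilePath -> IO String) in place of BL.readFile. The function names fileLength and announce are hypothetical.

```haskell
-- Read a file and report how many characters it holds.
fileLength :: FilePath -> IO Int
fileLength path = do
  contents <- readFile path   -- contents :: String, bound out of IO String
  return (length contents)

-- An action we run only for its effect, so its result type is IO ().
announce :: Int -> IO ()
announce n = putStrLn ("Read " ++ show n ++ " characters")

-- Chaining the two: the Int is bound to a name, the () is simply ignored.
reportFile :: FilePath -> IO ()
reportFile path = do
  n <- fileLength path
  announce n
```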

Here we’re using a let expression to bind the expression fmap (V.foldr summer 0) v to the name summed so that the expressions that follow it can refer to summed without repeating all the same code.

First we fmap over the Either String (V.Vector BaseballStats). This lets us apply (V.foldr summer 0) to V.Vector BaseballStats. We partially applied the Vector folding function foldr to the summing function and the number 0. The number 0 here is our “start” value for the fold. Generally in Haskell we don’t use recursion directly. Instead we use higher order functions and abstractions, giving names to common things programmers do in a way that lets us be more productive. One of those very common things is folding data. You’re going to see examples of folding and the use of fmap from Functor in a bit.
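The shape of that expression can be sketched without Cassava or Vector. In the snippet below a plain list stands in for V.Vector, String stands in for the ByteString fields, and the describe function and its messages are assumptions for illustration.

```haskell
type BaseballStats = (String, Int, String, Int)

-- Add the fourth field of a record to the running total.
summer :: BaseballStats -> Int -> Int
summer (_, _, _, atBats) n = n + atBats

-- v plays the role of the decode result: an error message or the records.
describe :: Either String [BaseballStats] -> String
describe v =
  let summed = fmap (foldr summer 0) v   -- summed :: Either String Int
  in case summed of
       Left err    -> "Could not parse: " ++ err
       Right total -> "Total atBats was: " ++ show total
```

The let gives the folded result a name once, and both branches of the case can then refer to it.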

We say V.foldr is partially applied because we haven’t applied all of the arguments yet. Haskell has something called currying built into all functions by default which lets us avoid some tedious work that would require a “Builder” pattern in languages like Java. Unlike previous code samples, these examples are using my interactive ghci REPL.

This lets us apply some, but not all, of the arguments to a function and pass around the result as a function expecting the rest of the arguments.
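A short sketch of partial application, written as a source file you can load into GHCi. The names sumList, addOne and atLeastThree are made up for the example.

```haskell
-- foldr applied to two of its three arguments; it still expects a list.
sumList :: [Int] -> Int
sumList = foldr (+) 0

-- (+) applied to one of its two arguments.
addOne :: Int -> Int
addOne = (+) 1

-- A partially applied function can be passed around like any other value.
atLeastThree :: [Int] -> [Int]
atLeastThree = map (max 3)
```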

Fully explaining the fmap in let summed = fmap (V.foldr summer 0) v would require explaining Functor. I don’t want to belabor specific concepts too much, but I think a quick demonstration of fmap and foldr would help here. This is also a transcript from my interactive ghci REPL. I’ll explain Either, Right, and Left after the REPL sample. The :type or :t command is a command to my ghci REPL, not part of the Haskell language. It’s a way to request the type of an expression.

Either in Haskell is used to signify cases where we might get values of one of two possible types. Either String Int is a way of saying, “you’ll get either a String or an Int”. This is an example of sum types. You can think of them as a way to say or in your type, where a struct or class would let you say and. Either has two constructors, Right and Left. Culturally in Haskell Left signifies an “error” case. This is partly why the Functor instance for Either maps over the Right constructor but not the Left. If you have an error value, you can’t keep applying your happy path functions. In the case of Either String Int, String would be our error value in a Left constructor and Int would be the happy-path “yep, we’re good” value in the Right constructor. Also, Haskell has type inference. You don’t have to declare types explicitly like I did in the example from my REPL transcript - I did so for the sake of explicitness.
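The behaviour of fmap on Either can be sketched as follows, with explicit type signatures for clarity. The values and their names are made up for illustration.

```haskell
addOne :: Int -> Int
addOne x = x + 1

goodValue :: Either String Int
goodValue = Right 1

badValue :: Either String Int
badValue = Left "could not parse"

-- fmap applies addOne inside Right: the result is Right 2.
incremented :: Either String Int
incremented = fmap addOne goodValue

-- fmap leaves a Left alone: the result is still Left "could not parse".
untouched :: Either String Int
untouched = fmap addOne badValue
```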

Here we have the list type, signified using the [] brackets and whatever type is inside our list, in this case Int. With Either we have two possible types and Functor only lets us map over one of them, so the Functor instance for Either only applies our function over the happy path values. With the type [a] there’s only one type inside of it, so it’ll get applied regardless…or will it? What if I have an empty list?
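A small sketch that answers the question, using a hypothetical addOne function:

```haskell
addOne :: Int -> Int
addOne x = x + 1

-- Every element gets the function applied.
bumped :: [Int]
bumped = fmap addOne [1, 2, 3]   -- [2, 3, 4]

-- With no elements there is nothing to apply it to, and we get [] back.
bumpedEmpty :: [Int]
bumpedEmpty = fmap addOne []     -- []
```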

Conveniently not only does fmap let us avoid manually pattern matching the Left and Right cases of Either, but it lets us not bother to manually recurse our list or pattern-match the empty list case. This helps us prevent mistakes as well as clean up and abstract our code. In a less happy alternate universe, we would’ve had to write the following code, written in typical code file style rather than for the REPL this time:
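A sketch of what that hand-written code could look like. addOne and numberWeWanted are the names used in the surrounding explanation; incrementEither, incrementList and errorValue are assumed names.

```haskell
addOne :: Int -> Int
addOne x = x + 1

-- What fmap addOne does for Either, written out by hand.
incrementEither :: Either e Int -> Either e Int
incrementEither (Right numberWeWanted) = Right (addOne numberWeWanted)
incrementEither (Left errorValue)      = Left errorValue

-- What fmap addOne does for lists, with the recursion and the
-- empty-list case spelled out.
incrementList :: [Int] -> [Int]
incrementList []       = []
incrementList (x : xs) = addOne x : incrementList xs
```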

We use parens on the left-hand side here to pattern match at the function declaration level on whether our Either e Int is Right or Left. Parentheses wrap (addOne numberWeWanted) so we don’t try to erroneously pass two arguments to Right when we mean to pass the result of applying addOne to numberWeWanted, to Right. If our value is Right 1 this is returning Right (addOne 1) which reduces to Right 2.

As we process the CSV data we’re going to be doing so by folding the data. This is a general model for understanding how you process data that extends beyond specific programming languages. You might have seen fold called reduce. Here are some examples of folds and list/string concatenation in Haskell. We’re switching back to REPL demonstration again.
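A few folds, written here as definitions in a source file rather than as a REPL transcript. All of the names are made up for the example.

```haskell
-- Summing: start at 0 and add each element.
total :: Int
total = foldr (+) 0 [1, 2, 3]                    -- 6

-- Strings are lists of characters, so (++) concatenates them.
joined :: String
joined = foldr (++) "" ["hello", " ", "world"]   -- "hello world"

-- Folding with (:) and [] rebuilds the same list.
copied :: [Int]
copied = foldr (:) [] [1, 2, 3]                  -- [1, 2, 3]

-- concat flattens a list of lists by one level.
flattened :: [Int]
flattened = concat [[1, 2], [3]]                 -- [1, 2, 3]
```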

Last, we stringify the summed up count using show, then concatenate that with a string to describe what we’re printing, then print the whole shebang using putStrLn. The $ is just so everything to the right of the $ gets evaluated before whatever is to the left. To see why I did that, remove the $ and build the code. Alternatively, I could’ve used parentheses in the usual fashion. That would look like the following.
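A sketch of the two spellings side by side. The function names and the message text are assumptions, and summed is taken to be a plain Int here for simplicity.

```haskell
-- With ($): everything to its right becomes putStrLn's single argument.
report :: Int -> IO ()
report summed = putStrLn $ "Total atBats was: " ++ show summed

-- The same thing with explicit parentheses.
reportParens :: Int -> IO ()
reportParens summed = putStrLn ("Total atBats was: " ++ show summed)
```

Without the $ or the parentheses, the expression would parse as (putStrLn "Total atBats was: ") ++ show summed, which tries to append a String to an IO action and fails to type-check.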

What instance Show Integer is telling us is that Integer has implemented Show. This means we should be able to use show on something with that type. We can specialize the type of show to Integer in a few passes.
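Those passes can be sketched in comments, ending with a definition that pins show to the specialized type. The name showInteger is made up.

```haskell
-- show :: Show a => a -> String               the general type
-- show :: Show Integer => Integer -> String   substitute Integer for a
-- show :: Integer -> String                   the instance exists, so the
--                                             constraint is satisfied

-- The signature forces the specialized type; the definition is just show.
showInteger :: Integer -> String
showInteger = show
```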

Next we’ll look at summer. summer is the function we are folding our Vector with. You can hang where clauses off of functions which are a bit like let but they come last. where clauses are more common in Haskell than let clauses, but there’s nothing wrong with using both.

Our folding function here takes two arguments: the tuple record (we’ll have many of those in the vector of records), and the sum of our data so far.

Here n is the sum we’re carrying along as we fold the Vector of BaseballStats.
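A sketch of a folding function hung off a where clause. A list stands in for the Vector, String stands in for the ByteString fields, and totalAtBats and the field names are assumptions for illustration.

```haskell
type BaseballStats = (String, Int, String, Int)

totalAtBats :: [BaseballStats] -> Int
totalAtBats records = foldr summer 0 records
  where
    -- First argument: the current record. Second argument: the sum so far.
    summer (_name, _year, _team, atBats) n = n + atBats
```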

Next we’ll make our extraction of the ‘at bats’ from the tuple more compositional. If you’d like to play with this further, consider rewriting our example program at the end of this article into using a Haskell record instead of a tuple. I used a tuple here because Cassava already understands how to parse them, sparing me having to write that code.

Here we can use something called eta reduction to remove the explicit record and sum values to make it point-free. Since our function is really just about composing the extraction of the fourth value from the tuple and summing that value with the summed up atBat values so far, this makes the code quite concise.

So, for example, if we write multiplyByTwo . addOne we’re adding one, then passing that result to the multiplyByTwo function. In the csv parser code, first fourth gets applied to the r argument, then (+) is composed so that it is applied to the result of fourth r and the value n.
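A sketch of the eta reduction and the composition described here. fourth, summer, addOne and multiplyByTwo follow the names in the text; summerExplicit and addThenDouble are made up, and String stands in for the ByteString fields.

```haskell
type BaseballStats = (String, Int, String, Int)

-- Extract the fourth component of any 4-tuple.
fourth :: (a, b, c, d) -> d
fourth (_, _, _, d) = d

-- With the record r and the running sum n written out.
summerExplicit :: BaseballStats -> Int -> Int
summerExplicit r n = fourth r + n

-- Point-free: ((+) . fourth) r n is (+) (fourth r) n, i.e. fourth r + n.
summer :: BaseballStats -> Int -> Int
summer = (+) . fourth

addOne :: Int -> Int
addOne x = x + 1

multiplyByTwo :: Int -> Int
multiplyByTwo x = x * 2

-- Composition runs right to left: add one first, then multiply by two.
addThenDouble :: Int -> Int
addThenDouble = multiplyByTwo . addOne
```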

Streaming

We can improve upon what we have here. Currently we’re going to use as much memory as it takes to store the entirety of the csv file in memory, but we don’t really have to do that to sum up the records!

Since we’re just adding the current records’ “at bats” with the sum we’ve accumulated so far, we only really need to read one record into memory at a time. By default Cassava will load the csv into a Vector for convenience, but fortunately it has a streaming module so we can stream the data incrementally and fold our result without loading the entire dataset at once.

The core here is the Records datatype Cassava gives us via the Streaming module. You can read more about the Records datatype on Hackage. Records is a sum type that you could read out in English like so:

data Records a -> Records is a datatype that takes a type variable a

Cons (...) | Nil (...) -> It is a sum type of two possible constructors, Cons or Nil (note the list-like nomenclature). This is a way of saying a Records a is always either Cons or Nil.

Cons (Either String a) (Records a) -> the Cons data constructor is a product of Either String a and Records a. We’re saying Cons is always Either String a and Records a. Also, this Cons resembles the cons-cells in Lisp, Haskell, ML, etc. The library has the following comment about it: “A record or an error message, followed by more records.”

Nil (Maybe String) BL.ByteString -> the Nil data constructor is a product of Maybe String and BL.ByteString. The library has the following comment: “End of stream, potentially due to a parse error. If a parse error occured, the first field contains the error message. The second field contains any unconsumed input.”

What the Records type is doing for us is letting us process the records like a lazy list, but with a little extra context in the Nil case.
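To make the reading above concrete, here is a local stand-in for the type, not Cassava’s own definition: it follows the description but uses String where the library uses BL.ByteString, so it needs no extra packages. The summarize function is hypothetical.

```haskell
-- A stand-in that mirrors the description: a record or an error followed
-- by more records, or the end of the stream with an optional error
-- message and the unconsumed input.
data Records a
  = Cons (Either String a) (Records a)
  | Nil (Maybe String) String

-- Walk the stream like a list: count the rows that parsed and pick up
-- the extra context carried by Nil.
summarize :: Records a -> (Int, Maybe String)
summarize (Cons (Right _) rest) =
  let (n, err) = summarize rest in (n + 1, err)
summarize (Cons (Left _) rest) = summarize rest
summarize (Nil err _)          = (0, err)
```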

Because Haskell has abstractions like the Foldable typeclass, we can talk about folding a dataset without caring about the underlying implementation! We could’ve used the foldr from Foldable on our Vector, a List, a Tree, a Map - not just Cassava’s streaming API. foldr from Foldable has the type: Foldable t => (a -> b -> b) -> b -> t a -> b. Note the similarity with the foldr for the list type, (a -> b -> b) -> b -> [a] -> b. What we’ve done is abstracted the specific type out and made it into a generic interface.
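A minimal sketch of that generality, using only containers from base. The name sumAll is made up.

```haskell
-- One definition that works for any Foldable container of Ints.
sumAll :: Foldable t => t Int -> Int
sumAll = foldr (+) 0

fromList :: Int
fromList = sumAll [1, 2, 3]   -- 6

-- Maybe is Foldable too: it holds zero or one element.
fromJust :: Int
fromJust = sumAll (Just 4)    -- 4

fromNothing :: Int
fromNothing = sumAll Nothing  -- 0
```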

In case you’re wondering what the Foldable instance is doing under the hood:
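The following is an illustrative instance for a local stand-in type, not Cassava’s source. How the real instance treats rows that failed to parse is not shown here; this sketch assumes they are skipped, and it uses String in place of BL.ByteString.

```haskell
-- Local stand-in for the streaming type described earlier.
data Records a
  = Cons (Either String a) (Records a)
  | Nil (Maybe String) String

-- Defining foldr is enough for a Foldable instance; sum, length and the
-- rest come for free.
instance Foldable Records where
  foldr f z (Cons (Right x) rest) = f x (foldr f z rest)
  foldr f z (Cons (Left _) rest)  = foldr f z rest  -- assumption: skip bad rows
  foldr _ z (Nil _ _)             = z
```

Because foldr only looks at one Cons cell at a time, a lazily produced stream can be folded without holding every record in memory at once.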

We’re also going to add a library and shift over some code so that our package is exposed as a proper library rather than only working as an executable. We’re exposing a single module named Bassbull. With an hs-source-dirs of src and an exposed module named Bassbull, Cabal will expect a file to exist at src/Bassbull.hs.

There’s not too much here. We’re importing Bassbull, which is the library module we’ve exposed. This is also a Main module with its own main file because we execute our test suite as a binary just like we do with executables.

stack test is just a shortcut for building tests specifically, then running the executable produced to see test output.

You aren’t limited to building the tests binary and running your tests in that manner. You can also pass stack ghci an argument to make it load your tests. This can be faster as the REPL uses an interpreter and can reload your code very quickly - much more quickly than doing a full build & execution run.

The above will then give you a REPL which can see anything the build target named tests in your Cabal file can see. You can then run the main function or individual test suites - if you bother to split them out.

Tests are useful and important in Haskell, although I often find I need far fewer of them. Often my process for working on an existing Haskell project will involve working on the code I’m changing with Emacs and a REPL instantiated via stack ghci. As my code starts passing the type-checker, I start running the tests as another layer of assurance that I’m doing the right thing.

I like having a lot of feedback and help from my computer when writing code!

Making your Haskell packages available to the Haskell community

Hackage is the main community repository of Haskell packages and will usually be where you look to find libraries you need.

Mostly you’ll find libraries and the occasional executable utility, but utilities should also be exposing library APIs that make their functionality accessible via Haskell code. This is not only more useful to other people but enforces good practices and more modular projects.

For more information on building a package for upload to Hackage, see this tutorial.

How I work

When I’m working with Haskell code, I interact with my code in a few ways. One is that I’m writing the code itself in Emacs. I’ll also have a terminal with a REPL open, usually via stack ghci as I am almost always working on a specific project.

I use :reload in the REPL to type-check my changes. flycheck will give me type errors, but I sometimes like to see them in the REPL too.

Sometimes I’ll use eta-reduction to refactor code. You can see an example of this in this code review on StackExchange. Making code point-free makes the most sense when it’s primarily about composing functions rather than about applying them.

If code still type-checks after some cleaning, I’ll run the tests. If tests pass, I move on unless I’m suspicious about test coverage. If tests break or I want more coverage, I write more tests until I’m satisfied. When that’s done, I return to step #1 in this loop for the next unit of work I want to perform.

My diagnosis process when something isn’t working:

If I can’t get something to type-check, I’ll break it down into sub-expressions, query the types of those sub-expressions, and make certain they are what I expected.

If I have expressions I am trying to combine and I’m trying to make their types make sense, but I haven’t implemented them yet, I will use undefined and work with only application, composition, and monadic variations thereof to figure out how to get where I’m going before I’ve implemented anything. You can see a good example of this in this GitHub gist. I wrote the solution @ifesdjeen displays in his final comment.

If I have a function expecting arguments I can’t figure out how to satisfy, I will sometimes use typed holes or a similar trick with implicit parameters to see what type I need to provide.

Since Haskell functions are pure and lazy, I can replace references to functions with their contents with a high degree of confidence that it will not change the semantics of my program. To that end, sometimes it’s easier to understand what’s going on by inlining the code by hand and seeing what your code turns into.

If something type-checks but doesn’t work, I’ll run the tests. If the coverage isn’t catching it, I add it. This is less common for me in Haskell than you’d think. If I can frame the test as an assertion about some property the code should satisfy like with QuickCheck I will do so. You can learn more about using QuickCheck in Real World Haskell.
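The undefined technique from the list above can be sketched like this. All three function names are hypothetical; the point is that the file type-checks even though nothing is implemented yet.

```haskell
-- Placeholder implementations: only the types matter at this stage.
fetchRow :: Int -> IO String
fetchRow = undefined

parseAtBats :: String -> Either String Int
parseAtBats = undefined

-- This compiles, so the pieces fit together. Calling it would crash,
-- because the placeholders are still undefined.
pipeline :: Int -> IO (Either String Int)
pipeline n = fmap parseAtBats (fetchRow n)

-- Typed holes work in a similar spirit: replacing parseAtBats with _ in
-- pipeline makes GHC report the type it expects at that position.
```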

Emacs

vim

Sublime Text 2/3

My personal dotfiles

Wrapping up

This is the end of our little journey in playing around with Haskell to process CSV data. Learning how to use abstractions like Foldable, Functor or use techniques like eta reduction takes practice! I have a guide for learning Haskell which has been compiled based on my experiences learning and teaching Haskell with many people over the last year or so.

If you are curious and want to learn more, I strongly recommend you do a course of basic exercises first and then explore the way Haskell enables you to think about your programs in terms of abstractions. Once you have the basics down, this can be done in a variety of ways. Some people like to attack practical problems, some like to follow along with white papers, some like to hammer out abstractions from scratch in focused exercises & examples.

More than anything else, my greatest wish would be that you develop a richer and more rewarding relationship with learning. Haskell has been a big part of this in my life.

Special thanks to Daniel Compton and Julie Moronuki for helping me test & edit this article. I couldn’t have gotten it together without their help.

Chris Allen

Haskell

Coder, Teacher, Author

Chris is a long-time FP and Lisp user who discovered a love of learning and types when he found Haskell. Aside from releasing multiple Haskell projects, such as Bloodhound and Blacktip, he took to teaching Haskell to spread the love and created the Learn Haskell guide, which he is turning into a book.