A taste of OCaml Batteries Included

I don’t know about you, but I have the feeling that many people are interested in OCaml Batteries Included but don’t dare try it yet, because of its alpha status and because it isn’t yet available for their favorite Linux distribution. Well, that’s probably a healthy level of caution.

Of course, just reading the manual is probably not the best way of getting a feel for OCaml Batteries Included.

So I’ve decided to do something about it. From time to time, I’ll post here a few samples of what you can do with OCaml Batteries Included and how you can do it.

For today, let’s start with displaying the contents of a file. You know, Unix’s cat or MS-DOS’s type.

Now, for details. The first line, open System, IO, File, opens three modules: System (which contains all system-related functions, including input/output, file management, etc.), IO (the submodule of System containing all the operations on inputs and outputs), and File (the submodule of System containing everything needed to open files for reading, writing, etc.).

Let’s move on to the last line. Function iter is defined in module Standard, which means that you don’t need to open any module to use it. This function is a general imperative loop on enumerations, the equivalent of a for-each loop in some languages. Perhaps I should detail what enumerations are: they are a read-and-forget data structure used pervasively in OCaml Batteries Included, where they replace streams. In contrast to lists, arrays, etc., enumerations are built lazily and discarded as they are read, which makes them quite convenient for loops or for accessing possibly huge sets of data. Depending on your background, you may think of them either as streams (in OCaml and most languages) or as iterators (in Python, JavaScript and other dynamic languages). Oh, and for functional-minded people, don’t worry, your usual functional loops are available on enumerations, too. You can fold, map or unfold at will.
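To make “read-and-forget” a bit more concrete, here is a minimal sketch in plain OCaml, with no Batteries involved. The type enum and the functions range, iter, map and fold are defined here purely for illustration; they are not the Batteries implementation, only a toy model of a structure that is computed on demand and consumed as it is read.

```ocaml
(* A hypothetical, minimal "read-and-forget" enumeration: each call yields
   the next element, or None once the enumeration is exhausted. *)
type 'a enum = unit -> 'a option

(* Enumerate the integers lo..hi, producing each one only when asked. *)
let range lo hi : int enum =
  let next = ref lo in
  fun () ->
    if !next > hi then None
    else begin
      let v = !next in
      incr next;
      Some v
    end

(* Imperative loop: apply f to every remaining element. *)
let rec iter f (e : 'a enum) =
  match e () with
  | None -> ()
  | Some x -> f x; iter f e

(* Lazy map: nothing is computed until the result is read. *)
let map f (e : 'a enum) : 'b enum =
  fun () ->
    match e () with
    | None -> None
    | Some x -> Some (f x)

(* Functional loop: fold over every remaining element. *)
let rec fold f acc (e : 'a enum) =
  match e () with
  | None -> acc
  | Some x -> fold f (f acc x) e
```

Once fold or iter has walked through such an enumeration, its elements are gone: asking for one more simply yields None. That is the price paid for never holding the whole data set in memory.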

So, what does iter do? Well, if asked, OCaml will tell you that it has type ('a -> unit) -> 'a Enum.t -> unit. In other words, for any type 'a, this function takes as first argument a function (let’s call it f), as second argument an enumeration of elements of type 'a and returns nothing. Function f itself should take as argument an element of type 'a and return nothing. In other words, iter takes a function which works on one element of type 'a and turns it into a function which works on a whole enumeration of elements of type 'a. Yep, it’s called a loop.
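If you would like to play with a type of this shape without installing anything, the standard library’s Seq.iter has the analogous type ('a -> unit) -> 'a Seq.t -> unit. The sketch below uses it as a stand-in for the Batteries iter; buf, record_line and record_all are names made up for this example.

```ocaml
(* Seq.iter : ('a -> unit) -> 'a Seq.t -> unit *)

(* Where the example collects its output, so that it is easy to inspect. *)
let buf = Buffer.create 16

(* A function that works on ONE element... *)
let record_line (s : string) : unit =
  Buffer.add_string buf s;
  Buffer.add_char buf '\n'

(* ...turned by partial application into a function that works on a WHOLE
   sequence of elements. *)
let record_all : string Seq.t -> unit = Seq.iter record_line

let () = record_all (List.to_seq ["a"; "b"; "c"])
```

Note that Seq.t is a persistent lazy sequence rather than a read-and-forget enumeration, so this only illustrates the type, not the memory behaviour.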

Before looking at the definition of the function, let’s take a look at the enumeration passed to iter: args (). Well, if we look at the documentation (for instance by using our on-line help), we may read

args(): An enumeration of the arguments passed to this program through the command line.

So, args () is your usual pair argc, argv (if you come from C) or args [] (if you come from Java). Unlike in Java, you don’t have to declare them in your program if you don’t use them and, unlike in both, it’s an enumeration, which makes more sense than an array, since you don’t need to modify the arguments and since you always move forward among them. Still, if you need it as an array, it’s available in a package. No more on this for the moment.
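For comparison, plain OCaml exposes the command line as the array Sys.argv. A rough stand-in for args () can be sketched as a lazy sequence over that array. The helper name args_of is hypothetical, and skipping element 0 (the program name) is an assumption of this sketch, since the documentation quoted above does not say either way.

```ocaml
(* Hypothetical stand-in for args (): a lazy view of an argv-style array.
   Element 0 (the program name) is skipped, which is an assumption. *)
let args_of (argv : string array) : string Seq.t =
  let rec from i () =
    if i >= Array.length argv then Seq.Nil
    else Seq.Cons (argv.(i), from (i + 1))
  in
  from 1

(* The same view over the arguments of the running program. *)
let args () = args_of Sys.argv
```

Taking the array as a parameter keeps args_of easy to try out in the toplevel, for instance with args_of [|"cat"; "a.txt"; "b.txt"|].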

What’s left? Oh, yes, the function. As the code indicates, (fun x -> copy (open_in x) stdout) is an anonymous function (also known as a “lambda” in a few languages). This function takes an argument x, the name of a file. Function copy, defined in module IO, takes two arguments, an input (a source of data) and an output (a sink of data), and copies the whole contents of the input into the output. The first argument here is open_in x, that is, the result of applying function open_in to argument x. Function open_in, defined in module File, opens a file for reading. The result of open_in x is therefore an input which lets us read the contents of file x. The second argument of copy is stdout, that is, the standard output, which is usually the screen. In other words, (fun x -> copy (open_in x) stdout) is a function which takes as argument the name of a file, opens that file and prints its contents on the screen. Note that everything is done lazily, so the contents are never completely present in memory. In other words, this works on files of theoretically unlimited length.

Bottom line: this utility reads all the files whose names are given on the command line and prints their content on the screen. In three lines of code.
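For readers who want something to compare against, here is a sketch of the same utility using only the standard library. It is not Batteries code: copy below is a local function written for this example (not the IO.copy described above), and cat_file and cat are likewise names of this sketch. The file is copied through a fixed-size buffer, so memory use stays bounded whatever the size of the file.

```ocaml
(* Copy everything from ic to oc through a fixed-size buffer. *)
let copy ic oc =
  let buf = Bytes.create 4096 in
  let rec loop () =
    let n = input ic buf 0 (Bytes.length buf) in
    if n > 0 then begin
      output oc buf 0 n;
      loop ()
    end
  in
  loop ()

(* Print one file on the given output, then close the file. *)
let cat_file oc name =
  let ic = open_in_bin name in
  copy ic oc;
  close_in ic

(* Print every file of the list on standard output. *)
let cat names = List.iter (cat_file stdout) names
```

A program entry point would then call cat on the command-line arguments, for instance cat (List.tl (Array.to_list Sys.argv)). It is noticeably longer than the Batteries version, which is rather the point of this post.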

Indeed, your plain OCaml version is better than my plain OCaml version. Note that the Batteries version also semi-leaks file descriptors, by design, as this is a 3-liner program. It’s not a full leak, as file descriptors are reclaimed at garbage collection and/or when the program exits. To ensure that channels are closed in a timely manner, we may rewrite the program so that each file is closed as soon as it has been copied.
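The general idea of closing a channel as soon as you are done with it can be sketched in plain OCaml with a higher-order function. The helper below borrows the name with_file_in, which comes up later in this thread, but it is a hypothetical standard-library version written for this example, not the Batteries function. It relies on Fun.protect, so it assumes OCaml 4.08 or later.

```ocaml
(* Hypothetical helper: open a file, hand the channel to f, and close the
   channel on the way out, whether f returns normally or raises. *)
let with_file_in (name : string) (f : in_channel -> 'a) : 'a =
  let ic = open_in_bin name in
  Fun.protect ~finally:(fun () -> close_in_noerr ic) (fun () -> f ic)

(* Example use: the size of a file in bytes. The channel is closed as soon
   as the callback returns. *)
let file_size name = with_file_in name in_channel_length
```

The channel never escapes the callback, so there is nothing left for the garbage collector to clean up later.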

For information, according to my measurements, your stream-functional Stream version is about 2 to 3 times slower than the stream-functional Batteries version, while your imperative version is about 3 times faster, due to faster input/output. I’m measuring native versions, with Batteries compiled in debug mode.

The Batteries version is faster than the Stream version thanks to faster higher-level libraries. It is slower than the imperative version because the latter uses lower-level libraries, which are much less generic. One difference, among many, is that our I/O infrastructure permits things such as transparent transcoding of text, transparent compression to gzip/decompression from gzip, etc.

The speed is interesting but I am still concerned with the correctness.

You say that the file handles are closed when they are garbage collected: so you have implemented this yourself by wrapping them in an object and specifying a finalizer?

You say that the file handles are closed when the program exits but I do not believe that. Specifically, OCaml is buggy wrt calling finalizers when programs complete (they are often never called) but I do not believe you can rely upon the OS to close them either (IIRC, Windows does not).

Finally, you may like to study the design of IDisposable and IEnumerable from .NET because they do something similar. The IDisposable interface presents a Dispose function that can be used to clean up a resource deterministically but which is also used as a finalizer when the object is collected. The IEnumerable interface is akin to impure streams and can call Dispose() on a handle when traversal of the stream is complete. F# provides a “use” equivalent of “let” that calls Dispose() automatically when the value goes out of scope.

The transcoding and decompression of streams on-the-fly is a fantastic idea.

You say that the file handles are closed when they are garbage collected: so you have implemented this yourself by wrapping them in an object and specifying a finalizer?

File handles (and, more generally, inputs and outputs, which do not need to map to file handles) are closed:

when they are garbage collected
when the program ends
when leaving the scope of a higher-order function such as with_file_in
for input from files, when the end of file has been reached (unless the corresponding option has been chosen)
for an output wrapping an underlying output (e.g. transcoding), when the underlying output has been closed
for an input wrapping an underlying input (e.g. transcoding), when the underlying input has been closed
when the user closes the handle manually

Hopefully, this should cover all situations.
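Two of these rules (closing at end of file, and closing a wrapper together with the input it wraps) can be sketched in plain OCaml. Everything below is hypothetical and much simpler than the real thing: the record type input and the functions of_channel and uppercase are made up for this illustration, with upper-casing standing in for a transformation such as transcoding.

```ocaml
(* Hypothetical abstract input: a way to read, and a way to close. *)
type input = {
  read_char : unit -> char option;  (* None at end of input *)
  close : unit -> unit;             (* safe to call more than once *)
}

(* Wrap a standard channel; the input closes itself at end of file. *)
let of_channel ic =
  let closed = ref false in
  let close () =
    if not !closed then begin
      closed := true;
      close_in_noerr ic
    end
  in
  let read_char () =
    if !closed then None
    else
      match input_char ic with
      | c -> Some c
      | exception End_of_file -> close (); None
  in
  { read_char; close }

(* A wrapping input: it shares the close function of the underlying input,
   so the two are always closed together. *)
let uppercase inner =
  { read_char =
      (fun () -> Option.map Char.uppercase_ascii (inner.read_char ()));
    close = inner.close }
```

Reading uppercase (of_channel ic) to the end closes ic without the caller doing anything, and a manual call to close afterwards is harmless.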

Specifically, OCaml is buggy wrt calling finalizers when programs complete (they are often never called).

That’s actually not a bug, it’s part of the definition of finalization: I don’t know of a single programming language in which finalizers are guaranteed to be called.

F# provides a “use” equivalent of “let” that calls Dispose() automatically when the value goes out of scope.

I’ve been working on something similar. For the moment, we’re using higher-order functions for this purpose, but something more complex and more robust may be added to Batteries at some point, under the guise of a Camlp4 syntax extension.

The transcoding and decompression of streams on-the-fly is a fantastic idea.

which does lose some clarity. However, even then, I grant you that it remains clear. Something which is less clear to me is whether/when file descriptors are actually closed. Could you answer that question? The PHP manual couldn’t help me there.

Note that, when it works, your PHP version seems faster than the Batteries version and slower than the low-level OCaml version. However, when tested against a 500 MB file, the result is
Fatal error: Allowed memory size of 134217728 bytes exhausted (tried to allocate 742842369 bytes) in /tmp/test.php on line 5

So I’m afraid that your PHP version, no matter how clear and how fast, is not really serious competition for either of the OCaml versions.

For information, if you prefer foreach, OCaml Batteries Included lets you write the loop in that style as well.
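The foreach style is just iter with its arguments swapped, so that the loop body comes last. Here is a sketch over the standard library’s Seq. The function foreach below is defined locally for the example and is only a stand-in for whatever Batteries provides; seen is a made-up variable that records what the loop visited.

```ocaml
(* A locally defined foreach: the sequence first, the loop body second. *)
let foreach (s : 'a Seq.t) (f : 'a -> unit) : unit = Seq.iter f s

(* Records the names visited by the loop, most recent first. *)
let seen = ref []

let () =
  foreach (List.to_seq ["a.txt"; "b.txt"]) (fun name ->
    seen := name :: !seen)
```

With the body as the last argument, a multi-line anonymous function reads much like the body of a foreach loop in PHP.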

However, none of this is the point. The programming language community is quite aware that a few dynamic languages (I’m thinking of Python, Ruby and PHP) have extensive libraries which permit the development of simple tools in a matter of a few lines of code. This “batteries included” approach is probably the biggest reason behind the success of these languages.

On the other side of the spectrum, we have functional languages such as OCaml or Haskell, both of which are sometimes erroneously described as dynamic, because they can do most of the interesting things which are possible in Python, Ruby and PHP, as well as plenty which aren’t (I’m thinking type-safety and pattern-matching, but also functors, type-classes, local modules, static analysis of exception-safety, provably correct code, compile-time code generation, embedded domain-specific languages, syntax customization, etc.). Despite their many qualities, these languages have never been taken seriously, in large part because of the lack of a library which would actually make them useful.

So the point is the following: if we can write small programs in OCaml which are nearly as concise as their counterparts in Python, Ruby or PHP, and if the library scales up to large programs, then we have achieved our objectives. Because, when we reach this point, we will have most of what makes Python, Ruby or PHP so attractive, and plenty of things that these languages are missing.

We certainly haven’t reached that point, mind you, but we’re getting there. The many versions of cat which appear on this page are examples of what we can do now. And I believe they already show large progress.
