Sarah and Alaric Snell-Pym living in interesting times

I want a Scheme that lets me apply advanced programming language techniques - lightweight higher-order functions and hygienic macros rather than boilerplate code, continuations rather than a fixed set of predefined control flow mechanisms, symbols rather than enumerated types, functional programming rather than getting tangled with too much state, dynamically-scoped parameters rather than God objects - to my day-to-day tasks. I'm a professional programmer; for a living, I've written code in Java, C, C++, PHP, Perl, Python, Ruby, SQL, AWK, shell and JavaScript, and I'd love to have been able to use Scheme for all of the above. I'm limited more by the usual commercial pressure than by any technical issues with Scheme or the qualities of my favourite implementation, Chicken, so my wishes for R7RS are relatively minor in terms of changing the semantics of the language. What I really want is a Scheme report that will unite the Scheme community, so we can continue to have a wide array of innovative implementations that all have their own strengths and weaknesses - but with much better portability of libraries between them, so they really do start to feel like one language with multiple implementations rather than separate languages.

So I feel that things like module systems and access to networking need to be standardised, so each implementation doesn't gratuitously have its own syntax for doing the same thing. But these things need to be optional, so implementations are not constrained to be large in order to earn the name "R7RS Scheme".

So I thought I'd step up and propose a solution.

The Problem

Most of the arguing is about definitions, at heart; different people have different definitions of "simple" or "necessary", so they argue about whether things are simple or necessary for various applications, rather than getting down to it and just stating their requirements for their use of Scheme.

So let me take a stab at a few major stakeholders:

Education

All that SICP requires is a very basic Scheme, in terms of language features; but for more advanced courses, it'd be nice to have macros (so they can be taught in their own right as a programming tool, and as a lightweight introduction to some of the kinds of things that compilers have to deal with). Hygienic low-level macros would be good as a simple basis emphasising the code-is-data-is-code side of things, while high-level macros can be useful as a simple way of getting started on template-based code rewriting rules. I don't think unhygienic macros have much place, as they break the lexical scoping that works so well elsewhere in Scheme; they're easily broken in confusing ways.
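To make the hygiene point concrete, here is a small sketch using the template-based syntax-rules system. The macro name my-or and the variable names are invented for this illustration; the code assumes nothing beyond R5RS-level Scheme.

```scheme
;; A two-argument "or" written as a template macro. The expansion
;; introduces a temporary variable called tmp.
(define-syntax my-or
  (syntax-rules ()
    ((_ a b)
     (let ((tmp a))
       (if tmp tmp b)))))

;; The caller also happens to use a variable called tmp. A naive textual
;; macro would expand (my-or #f tmp) into
;;   (let ((tmp #f)) (if tmp tmp tmp))
;; capturing the caller's tmp and yielding #f. A hygienic expander renames
;; the macro's own tmp, so the caller's binding is the one that is seen.
(define (demo)
  (let ((tmp 5))
    (my-or #f tmp)))

(display (demo))   ; prints 5 under a hygienic expander
(newline)
```

This is the kind of confusing breakage an unhygienic system invites: the macro works until somebody picks the wrong variable name.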

But key for education is to be able to have a read-eval-print loop so students can experiment, which places certain constraints on how your module system can work. Also nice for education is a language design that doesn't make it too hard for implementations to give meaningful introspection into the state of execution, so that errors can be explained with reference to the dynamic and lexical scope in which they occurred. Macros can sometimes make this hard, so the macro system shouldn't preclude carrying source position information through code transformed by macros, and it would be good if smart IDEs could switch between original and expanded views of the source code as it's traced through. This seems to suggest a "syntax object"-style hygienic macro system that can easily carry metadata, and a macro/module system that lets you try out macro expansions "on the fly", rather than needing a whole-program analysis.

Research

Scheme's good for research: it has such a minimal core language that it's easy to implement. That means that if you're working on programming language implementation techniques, implementing the kernel of Scheme and then dropping a load of portable libraries on top to generate a full-featured language is an attractive prospect compared to implementing C++.

Having a maximum-bang-per-buck runtime language core (eg, lambda+cond+call/cc+dynamic-wind+basic data types) - which can then host an off-the-shelf implementation of define-syntax defined in terms of the base language, and then host an off-the-shelf implementation of an arbitrary set of helper libraries in order to get all the high-level goodies we want - means you have the minimal work required in implementing, to get a maximum result in final power. Which is great for programming-language implementation research. And it's also good for Scheme as a whole, since it means there'll be lots of innovation in implementations.
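As a rough sketch of what "hosting the goodies on a small core" can look like, here is an early-exit search built from nothing but lambda, for-each and call/cc. The procedure name find-first is made up for this example.

```scheme
;; Early exit from a traversal, built on call/cc alone: the core needs
;; no built-in "break" or "return" statement.
(define (find-first pred items)
  (call-with-current-continuation
    (lambda (return)
      (for-each (lambda (x)
                  ;; Calling return abandons the rest of the for-each.
                  (if (pred x) (return x)))
                items)
      ;; Only reached if no element satisfied pred.
      #f)))

(display (find-first even? '(1 3 4 5 6)))  ; prints 4
(newline)
```

The implementer only has to get call/cc right once; control constructs like this one can then live in portable library code.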

Mainstream programmers

These are the guys who want to write real applications. These days, they fall into four main camps: GUI programming, systems programming, Web programming, and embedded-scripting-language.

What all four have in common is that they want access to the resources of their platform, and they want a large library of good existing code to use.

These are two separate problems. The resources of the platform vary between implementations, as the implementations run on different platforms. They're also a lot of work for an implementer to cover. If those research folks can't just get by with lambda+cond+call/cc+dynamic-wind+basic data types and have to go and implement file I/O, directory manipulation, sockets, and so on in order to meet the language spec, that'll be a problem for them; but the mainstream programmers may need all of those features.

So we clearly need some separation of optional features, which is why the R7RS proposal to have little and large Schemes appeals to me; but more on that later.

Mainstream programmers benefit strongly from good compilers that can generate efficient code, so they needn't be paying an onerous penalty for the benefits of Scheme.

Also, these guys write big software - so they need a decent module system, that lets them combine code from different authors without any problems with clashing names.

And they want portability. So they can pick and choose different Scheme implementations, but most importantly, so they can pick and choose from Scheme libraries; library portability is more important than app portability, otherwise a Scheme implementation is judged on the set of libraries within its walled garden, rather than the essentials of the implementation itself, thereby reducing the scope for innovation in implementations.

Embedded programmers

These people are a small niche, but a good one. Mainstream programming is largely bound by backwards compatibility concerns, which can be quite limiting; but embedded developers can Do It Right since they're usually working from scratch. So they're a good market for highly pragmatic languages like Scheme.

They want a very minimal system, like the researchers, so they can be sure of fitting their base into their restricted memory footprint; they may want a read-eval-print loop in development, like the educators, so they can experiment with hardware devices (the data sheets never quite tell you everything you need to know...); but they also want to be able to bring in libraries of portable code to get the job done quickly, and they want to access platform-specific features (often at a very low level).

Conclusion

All these groups benefit each other. Research based around Scheme provides innovative implementations. Education based around Scheme produces programmers who know Scheme. Mainstream and embedded programming using Scheme attracts other developers, and raises the profile of the language, and makes an education in Scheme more useful in the job market, and provides more useful libraries.

My proposal

I'm confident all these needs (and more) can be reconciled. Here's how.

The small language / big language issue

Basic data types: symbols, integers, vectors, a primitive way to make disjoint types (eg, a tagging system) and procedures to operate on them. It's possible to implement the full numeric tower, structs, and object-oriented classes of various kinds on top of this lot, so we leave those out at this level.
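To give a feel for how structs might be layered on a tagging system, here is a minimal sketch that fakes tagging with vectors. The helper make-tagged-type and everything derived from it are hypothetical names, and the code assumes the implementation provides error. Note that objects built this way still answer true to vector?, which is exactly why a genuine tagging primitive belongs in the core.

```scheme
;; Each call creates a fresh tag. The tag is a newly allocated pair, so
;; eq? distinguishes it from every other object in the system.
;; Returns a list of three procedures: (constructor predicate accessor).
(define (make-tagged-type name field-count)
  (let ((tag (list name)))
    (list
      ;; constructor: stores the tag in slot 0, fields after it
      (lambda fields
        (if (not (= (length fields) field-count))
            (error "wrong number of fields for" name))
        (list->vector (cons tag fields)))
      ;; predicate: true only for vectors carrying this exact tag
      (lambda (obj)
        (and (vector? obj)
             (> (vector-length obj) 0)
             (eq? (vector-ref obj 0) tag)))
      ;; accessor: field indices start at 0
      (lambda (obj index)
        (vector-ref obj (+ index 1))))))

(define point-type (make-tagged-type 'point 2))
(define make-point (car point-type))
(define point?     (cadr point-type))
(define point-ref  (caddr point-type))

(define p (make-point 3 4))
;; (point? p)            => #t
;; (point? (vector 3 4)) => #f
;; (point-ref p 1)       => 4
```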

Strings should carefully punt on the issue of representation. Much has been written about this, but the sad fact is, the obvious ways of handling strings just don't work when Unicode is a possibility. Unicode codepoints are not characters. UTF-8 bytes, or even UTF-16 double-bytes, are not Unicode codepoints. If you want to be able to deal with Unicode without horrible surprises, then you need to ditch "characters" and instead just have strings, some of which cannot be split any smaller (yet may still be a series of Unicode codepoints, such as a letter plus a few combining characters). But our string library should still be implementable in terms of plain ASCII; we can't mandate the mappings between strings and bytes.

Low-level hygienic macros, as we can build high-level macros on top of them. Of course, hygienic macro systems let us consciously break hygiene when we need to.

read, write, and display, but no more I/O than that, and no way of altering the standard input or output ports. Note that this is I/O purely in terms of strings, not bytes.

error, to signal fatal errors portably, but not necessarily a way to catch them.

A basic module system. And I mean very basic. But one that a more advanced module system can be backwards-compatible with. This is essential at this level, as it provides the means for extension into a big language later.

The kind of module system I have in mind can really just be boiled down to an include-file system; if you say (require <name>), the system uses some implementation-dependent mechanism to find the definition for the named module (a file of the same name with .scm appended found in a system search path including the current directory being highly recommended on platforms where that makes sense). This <name>.scm file must contain a module declaration of the form:

(module <name>
...list of definitions interspersed with expressions that are evaluated
for their side effects alone...)

And "definitions" are either require, define, define-syntax, or some expression that macro-expands to the above, or groups of the above contained within begin (which is an essential feature to let a single defining-macro expand to a group of definitions).

The presence of a require in the list of definitions ensures that the bindings exposed in the required module are available at least after the require itself (and maybe before), and the side-effects of the expressions in the module occur at the require at the latest (and maybe before, particularly if the same module is required twice via different modules in the same program!)

Meanwhile, a "top-level program" is defined as a similar list of definitions interspersed with expressions that are evaluated for their side effects alone; it doesn't need wrapping in module. And a read-eval-print loop session is defined similarly to a top-level program, except that the result of each expression is written to the standard output, and there's a guaranteed dynamic-wind to restore the REPL continuation if it's bypassed. error is just defined as a system-provided continuation that is guaranteed to skip the rest of the top-level program the user has supplied, or the current expression in a REPL.
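The "error is just a continuation" idea can be sketched in a few lines. Everything here is illustrative: my-error stands in for the proposed error, and run-program, abort-program, trace-log and note are hypothetical names, with a program modelled as a list of thunks.

```scheme
;; Holds the escape continuation for the program currently running.
(define abort-program #f)

;; Stand-in for the proposed error: jumps out of the running program.
(define (my-error message)
  (abort-program (list 'error message)))

;; Runs each step in order. Returns 'done on normal completion, or
;; (error <message>) if a step called my-error, in which case the
;; remaining steps are skipped.
(define (run-program steps)
  (call-with-current-continuation
    (lambda (k)
      (set! abort-program k)
      (for-each (lambda (step) (step)) steps)
      'done)))

;; Record which steps actually ran.
(define trace-log '())
(define (note x) (set! trace-log (cons x trace-log)))

(define result
  (run-program
    (list (lambda () (note 'first))
          (lambda () (my-error "something broke"))
          (lambda () (note 'never-reached)))))
;; result    => (error "something broke")
;; trace-log => (first)
```

A REPL could use the same trick per expression rather than per program, which is all the base language needs to promise.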

A very minimal system might just provide a set of inbuilt libraries that are there in the initial namespace, so require-ing them is a no-op, a require mechanism that is just file inclusion (perhaps checking the included file contains a single module with the same name as we're expecting, if we can be bothered), and a module macro that just expands to its contents.
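The "module macro that just expands to its contents" could be as small as the following sketch. It is named simple-module here (a hypothetical name) purely to avoid clashing with whatever module form the host implementation already has; the counter module is likewise invented.

```scheme
;; The module name is accepted and ignored; the body is spliced into the
;; enclosing top level as if it had been written there directly.
(define-syntax simple-module
  (syntax-rules ()
    ((_ name body ...)
     (begin body ...))))

(simple-module counter
  (define count 0)
  (define (next!)
    (set! count (+ count 1))
    count))

(display (next!))  ; prints 1
(newline)
(display (next!))  ; prints 2
(newline)
```

Because nothing is hidden or renamed, this gives no protection against name clashes; it only guarantees that code written against the module syntax still loads.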

The distinguishing thing about this base system is that it isn't incompatible with a large mainstream-programming Scheme, yet is easy to implement, and provides enough tools for a lot of portable high-level libraries to be built on top of it (such as advanced data structures, control-flow abstractions, and the like).

The key thing it lacks is access to platform-specific functions, such as file systems, networks, processes, and the like. But the require mechanism sets the stage.

Libraries

I don't think there really needs to be a "big scheme" language; there just needs to be "small scheme", and then optional libraries. But I hear cries that this might hamper interoperability; what if each implementation supports a different set of libraries?

Well, there are two prongs to solving this problem. One is that most of the libraries defined in the R7RS standard will be simple to implement, and reference implementations of them in terms of the R7RS core scheme should be provided in an appendix. Implementations would be free to use other implementations (for example, implementing map with super duper multithreading on massively multicore systems), but there'd really be no excuse for not providing them.
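For a sense of scale, a reference implementation of map in terms of core features might look something like this sketch. The names ref-map, ref-map1 and any-null? are hypothetical, chosen so the built-in map is not shadowed, and stopping at the shortest list is a choice made for this example.

```scheme
;; Single-list case: plain structural recursion.
(define (ref-map1 f items)
  (if (null? items)
      '()
      (cons (f (car items))
            (ref-map1 f (cdr items)))))

;; True if at least one of the argument lists is exhausted.
(define (any-null? lists)
  (cond ((null? lists) #f)
        ((null? (car lists)) #t)
        (else (any-null? (cdr lists)))))

;; General case: apply f across the heads of all lists, then recur on
;; the tails, stopping as soon as any list runs out.
(define (ref-map f list1 . more-lists)
  (let loop ((lists (cons list1 more-lists)))
    (if (any-null? lists)
        '()
        (cons (apply f (ref-map1 car lists))
              (loop (ref-map1 cdr lists))))))

(display (ref-map + '(1 2 3) '(10 20 30)))  ; prints (11 22 33)
(newline)
```

An implementation with clever parallel hardware could swap in something faster, but this version costs almost nothing to bundle.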

Even a tiny embedded system could provide them all, as optional libraries; if the developer requires more libraries than will fit alongside their application in the target device, then that's their problem - if we just provided them with core Scheme, they'd be able to include those libraries by hand through simple textual inclusion anyway. We might as well make the process easy and elegant, and then let them decide how to spend their memory budget!

On the other hand, platform-dependent libraries such as filesystems, the POSIX process model, and BSD sockets, depend on the platform actually supporting them and the implementation having implemented them. I think these interfaces should be standardised, as then lots of very useful libraries implementing protocols such as HTTP, or GUI toolkits, can be shared between implementations. But these things HAVE to be optional. We can't hamper researchers with having to implement them, and we can't hamper implementers with not being able to provide a decent Scheme system on limited or unusual platforms.

So I think we'll classify implementations as providing "R7RS Core Scheme" (just the essentials described above, plus at least bundled copies of the reference implementations of the portable libraries), or "full R7RS Scheme", with all the platform-dependent libraries as well; and embedded or work-in-progress implementations might be "R7RS Core Scheme plus the network and filesystem libraries".

But where do we draw the line between R7RS and the SRFIs? Personally, I think a useful set of basic SRFIs should be mandated by "full R7RS Scheme", purely for convenience; things that aren't heavyweight to implement, and can fundamentally affect the clarity and elegance of most code, ought to be made part of the "full Scheme" standard, purely so that most code can rely on using them.

Platform-dependent libraries

I think a good set of libraries for mainstream developers would be:

Networking

Basic POSIX process model (fork&exec, signals)

The concept of an I/O stream, with low-level operations to read and write strings from it (with a choice of encodings, or "platform default"), or binary data (as integers)

A sturdy module system, made by adding syntax to module and require forms, to provide support for handling name clashes through renaming (such as prefixing). A module or program that uses these features needs to require it, but it'll most likely be built into the implementation's basic module system, rather than actually loaded as a library.

"Helpful" libraries

These can all be implemented in terms of "Core Scheme". It'd be good to have them in the standard, to provide a "standard profile" upon which other libraries and applications can be developed, meaning that any "Full R7RS Scheme" implementation will support advanced high-level features out of the box, and avoiding external libraries having to be written in a crippled subset language.

External libraries

In Chicken, as there's a very strong foreign function interface, we tend to wrap a lot of third-party C libraries. In my opinion, the design of an FFI shouldn't be in R7RS, as different implementations may have widely varying ways of doing it. There should be an SRFI for a minimal lowest-common-denominator FFI that even pure interpreters can implement (via libffi, for example), but implementations should be free to come up with others that aren't based on the C calling model (implementations on top of the JVM or CLR, for example, need a rather different flavour of FFI), or ones that take advantage of particular aspects of their implementation (Chicken lets you embed arbitrary C code in your Scheme source, as it's compiled to C anyway).

So, yes, things like Chicken eggs that wrap a C library won't be particularly portable between implementations. I can't see a reasonable way of making them so, without constraining implementations unnecessarily.

Third-party libraries

Third-party contributed libraries written in terms of R7RS libraries and other third-party libraries should be portable between implementations - and able to do useful things, with the ability to portably access the network and other such resources.

So it would be nice to have a shared repository of them, or at least a standard way of accessing such a repository so that competing repositories can coexist. Each implementation could then provide its own mechanism for installing libraries from the repositories.

Perhaps the most egalitarian approach would be to globally identify a package by a URL from which it can be downloaded. The download would be a .tar.gz file containing the Scheme source files for a set of modules, plus a metadata file listing the URLs of other packages it requires (along with descriptions and links to online documentation and platform/implementation requirements and the like).
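Purely as an illustration of what such a metadata file might hold, here is a sketch that represents it as an association list. The layout, the field names, the package name and the example.org URLs are all invented for this example; no existing repository format is implied.

```scheme
;; Hypothetical metadata, as it might be read from the package's
;; metadata file with a single call to read.
(define example-metadata
  '((name . "json-parser")
    (description . "Illustrative package entry")
    (documentation . "http://example.org/json-parser/docs")
    (requires . ("http://example.org/packages/string-utils.tar.gz"
                 "http://example.org/packages/streams.tar.gz"))))

;; Returns the list of package URLs this package depends on, or the
;; empty list if the metadata has no requires entry.
(define (package-requirements metadata)
  (let ((entry (assq 'requires metadata)))
    (if entry (cdr entry) '())))

(for-each (lambda (url) (display url) (newline))
          (package-requirements example-metadata))
```

An installer would fetch each listed URL in turn and repeat the process on the metadata it finds there.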

@Andy LeClair: Oh, PLT may well be a very nice implementation with modules and everything.

But it's just one implementation.

The problem with Scheme is that it's a family of languages; there's PLT, there's Chicken, and so on.

If we can bring the family closer together, so that libraries and even some applications can be shared between them where it's reasonable to do so, then we'll still have a family of languages - but without them all duplicating their effort in writing libraries.

If we're to gang up against Java, C++, PHP, Ruby, Perl, Python, and friends, we'll need to be united!

One of the things that came up during the R6RS discussions IIRC was the ability to mutate imported symbols. People who wanted compilability were dead set against it, but people who want Scheme to remain compatible with SICP were in favor of it.

I think the difference between the products of Working Group 1 and Working Group 2 may not be that one contains fewer defined symbols than the other, but that one is more concerned with expressiveness and one with safety and the ability to reason about code.