Comment Syntax to End of Line

Almost all popular languages fail a strict reading of “whitespace is insignificant”. First of all, if the language's comment syntax runs to end of line (for example, Bash, Perl, Ruby, C, C++, Java, Lisp, Haskell, …), then it fails, because the EOL character ends the comment and so carries meaning.
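
A minimal sketch of the point, in Ruby. The three-line program and the helper names (flatten_newlines, evaluate) are made up for illustration; the only claim is that turning newlines into spaces lets the “#” comment swallow the rest of the program.

```ruby
# Hypothetical three-line program; the comment on line 1 ends at the newline.
ORIGINAL = "x = 1 # set x\nx = x + 1\nx"

# Replace every newline with a space (the naive "all whitespace is the same" rewrite).
def flatten_newlines(source)
  source.gsub("\n", " ")
end

# Evaluate a snippet and return the value of its last expression.
def evaluate(source)
  eval(source)
end

puts evaluate(ORIGINAL)                    # prints 2: both assignments run
puts evaluate(flatten_newlines(ORIGINAL))  # prints 1: "#" now comments out the rest
```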

Whitespace Inside Strings

If the language has a string datatype in which a newline inside the literal is significant, then it fails. Basically, you can't replace every newline with a space in the source code and expect the program to behave the same.

All popular languages have strings where whitespace inside is meaningful. However, we could set this criterion aside as a design decision, because otherwise you couldn't have literal text in your source code, and that'd be a major inconvenience. (Whether one could have a language where literal text is not allowed yet is still convenient to read and type, perhaps by automatic preprocessing in the editor that reformats or displays on the fly (for example, Mathematica), is an open question to me. To be researched.)
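
For a concrete case, here is a small Ruby sketch. The constant names are invented for illustration; the two literals differ only in one whitespace character inside the quotes.

```ruby
# Two string literals that differ only in one whitespace character.
WITH_NEWLINE = "line one
line two"
WITH_SPACE = "line one line two"

puts WITH_NEWLINE == WITH_SPACE                  # prints false
puts WITH_NEWLINE.gsub("\n", " ") == WITH_SPACE  # prints true
```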

Which Language is Actually “Whitespace Insignificant”?

Now, if we ignore whitespace inside strings and ignore the comment-to-end-of-line syntax, then many popular languages might qualify as “whitespace insignificant”, but the notion is still not well-defined.

Can we come up with a mathematically precise definition of “whitespace insignificance” for popular languages such as C and Java? What would the definition look like? Would it be just a few sentences, or dozens of special cases?

Ruby Example

Here's a Ruby example.

The following is a syntax error:

# -*- coding: utf-8 -*-
# ruby 1.9
aa = [1,2,3]
aa.each
{ |xx|
p xx
}

But if you remove the newline after “aa.each”, it's OK:

# -*- coding: utf-8 -*-
# ruby 1.9
aa = [1,2,3]
aa.each { |xx|
p xx
}

This means that a newline has significant meaning in Ruby, even without considering newlines inside strings or at the end of to-end-of-line comments.
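
The two snippets can also be checked mechanically. The following is a sketch, not the only way to do it: it assumes the reference Ruby implementation (MRI), whose RubyVM::InstructionSequence.compile raises SyntaxError on code that does not parse. The helper name parses? and the constants are made up for illustration. The code is only compiled, never run.

```ruby
# Returns true if the snippet parses, false on a syntax error.
# RubyVM::InstructionSequence is specific to MRI (assumption of this sketch).
def parses?(source)
  RubyVM::InstructionSequence.compile(source)
  true
rescue SyntaxError
  false
end

# The first snippet, with the block's "{" on its own line.
BROKEN = <<~'RUBY'
  aa = [1,2,3]
  aa.each
  { |xx|
  p xx
  }
RUBY

# Same text with exactly one newline replaced by a space.
FIXED = BROKEN.sub("aa.each\n{", "aa.each {")

puts parses?(BROKEN)  # prints false
puts parses?(FIXED)   # prints true
```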

Why is This Important?

The whitespace-significance issue matters for the simplicity of a language's syntax. It's related to the idea that each character, class of characters, or character sequence has one and only one meaning, regardless of its neighboring characters (that is, not dependent on context; this is different from the concept of “context-free language”). Lisp and Mathematica come close.

If a language's whitespace insignificance is precisely defined by one or more of the criteria above, then its lexical grammar is simple. With such simplicity, you can then have syntactic layers on top of it, or in the editor, that display or reformat the code in a number of ways on the fly (for example, HTML, Mathematica).
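
To make “the lexical grammar is simple” concrete, here is a rough sketch in Ruby of a tokenizer for a hypothetical Lisp-like notation. It is not a real Lisp reader: strings and comments are left out on purpose (the two criteria set aside above), and the names tokens and one_line are invented for illustration. Because whitespace only separates tokens, any two layouts give the same token list, and a display layer can re-lay out the code from the tokens alone.

```ruby
# A token is a parenthesis, or a run of characters that are
# neither whitespace nor parentheses. Whitespace only separates tokens.
def tokens(source)
  source.scan(/[()]|[^\s()]+/)
end

# A trivial "display layer": rebuild a one-line layout from the tokens.
def one_line(source)
  tokens(source).join(" ")
end

COMPACT = "(each aa (lambda (xx) (p xx)))"
SPREAD  = "(each aa\n  (lambda (xx)\n    (p xx)))"

puts tokens(COMPACT) == tokens(SPREAD)  # prints true
puts one_line(SPREAD)
```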

This issue shouldn't be confused with readability or convenience of typing the code. They should be in a different layer.

Almost all languages ignore this. Readability and programmer convenience are mixed into the design of the syntax. The worst examples are unix shell tools and C. At the other extreme, a language such as Python, where whitespace is significant, has a design flaw, because readability is hardcoded into the semantics: indentation carries the block structure, so an auto-formatter or an editor cannot freely re-indent the code or display it in a different layout.

Research TODO

Survey popular languages and give a precise definition of their “whitespace significance” (ignoring line comments and strings).