HTML6: Your JSON ＆ SXML Simplified

Tired of the standards bodies telling us what to do and then changing their attitude? Tired of the SGML/HTML/XML/XHTML/HTML5 changes? Tire no more: here's a new proposal that will make it all better.

Introducing HTML6

HTML6 is based on HTML5, XML, and a rectified LISP syntax. It is inspired by JSON and SXML.
HTML6 is 100% regular at the syntax level, and is neither a valid JavaScript expression nor a valid lisp expression. The syntax can be specified by about 10 short lines of parsing expression grammar.

The aim is a very simple syntax: 100% regular, leaner, and trivial to parse in any language.

Simple Brackets for Tag Delimiters

The standard XML markup bracket is simplified using simple brackets in lisp style. For example, this code:

<h1>HTML6</h1>

Is written as:

〔h1 HTML6〕

The delimiters used are:

Character    Unicode Code Point    Unicode Name
〔            U+3014                LEFT TORTOISE SHELL BRACKET
〕            U+3015                RIGHT TORTOISE SHELL BRACKET

Syntax for XML Attributes

In XML:

<h1 id="xyz" class="abc">HTML6</h1>

In HTML6:

〔h1 「id “xyz”, class “abc”」 HTML6〕

Attributes are enclosed in corner brackets. The items inside are a sequence of name-value pairs, separated by commas. Each value must be quoted with curly double quotes.

Character    Unicode Code Point    Unicode Name
「            U+300C                LEFT CORNER BRACKET
」            U+300D                RIGHT CORNER BRACKET
“            U+201C                LEFT DOUBLE QUOTATION MARK
”            U+201D                RIGHT DOUBLE QUOTATION MARK
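To make the XML-to-HTML6 mapping concrete, here is a minimal Python sketch that converts a small XML fragment into the bracket syntax described above. The function names (element_to_html6, xml_to_html6) are hypothetical, and the sketch makes two simplifying assumptions that the proposal itself does not state: surrounding whitespace in text is trimmed, and the input contains none of the reserved bracket or quote characters, so nothing is escaped.

```python
import xml.etree.ElementTree as ET


def element_to_html6(elem: ET.Element) -> str:
    """Serialize one ElementTree element using HTML6 brackets (sketch only)."""
    parts = [elem.tag]
    if elem.attrib:
        # Attributes go inside corner brackets as comma-separated pairs,
        # each value wrapped in curly double quotes.
        pairs = ", ".join(f"{name} “{value}”" for name, value in elem.attrib.items())
        parts.append(f"「{pairs}」")
    if elem.text and elem.text.strip():
        parts.append(elem.text.strip())
    for child in elem:
        parts.append(element_to_html6(child))
        # Text that follows a child element is stored on the child as its tail.
        if child.tail and child.tail.strip():
            parts.append(child.tail.strip())
    return "〔" + " ".join(parts) + "〕"


def xml_to_html6(source: str) -> str:
    """Parse an XML string and return its HTML6 form (assumption: no escaping needed)."""
    return element_to_html6(ET.fromstring(source))
```

For example, xml_to_html6('<h1 id="xyz" class="abc">HTML6</h1>') returns 〔h1 「id “xyz”, class “abc”」 HTML6〕, the same form as the example above.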

XML Entities and Escape Mechanisms

To include literal tortoise shell bracket characters in data, use &#x3014; and &#x3015;. Other Unicode chars can be written the same way.

The only chars you need to escape are 〔tortoise shell brackets〕, 「corner brackets」, and “double curly quotes”.
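A minimal Python sketch of that escaping rule follows. The name escape_html6 is hypothetical, and it assumes the six characters listed above are the complete reserved set, as the text says.

```python
# The six reserved characters named in the text: two tortoise shell brackets,
# two corner brackets, and two curly double quotes.
RESERVED = "〔〕「」“”"


def escape_html6(text: str) -> str:
    """Replace each reserved character with a hexadecimal character reference."""
    return "".join(
        f"&#x{ord(ch):x};" if ch in RESERVED else ch
        for ch in text
    )
```

For example, escape_html6("a 〔b〕") gives a &#x3014;b&#x3015;, while ordinary characters such as < and & pass through unchanged.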

Unicode; No Named Entities

There are no named entities. For example, &amp; is literal; it should not be rendered as “&”.

Character “entities” are allowed in hexadecimal format ⁖ &#x3b1; for “α”.
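A reader of HTML6 text could apply this rule roughly as in the Python sketch below. The name decode_char_refs is hypothetical. The sketch assumes that decimal references such as &#945; are not part of the syntax and are left literal, which the text implies but does not spell out.

```python
import re

# Only the hexadecimal form &#x...; is recognized; everything else stays literal.
_HEX_REF = re.compile(r"&#x([0-9a-fA-F]+);")


def decode_char_refs(text: str) -> str:
    """Expand hexadecimal character references; leave named entities untouched."""
    def expand(match: re.Match) -> str:
        code_point = int(match.group(1), 16)
        # Values beyond the Unicode range are not characters; keep them as written.
        if code_point > 0x10FFFF:
            return match.group(0)
        return chr(code_point)

    return _HEX_REF.sub(expand, text)
```

With this, decode_char_refs("&#x3b1;") yields α, while &amp; comes back unchanged.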

Treatment of Whitespace

Identical to XML.

Char Encoding; UTF8 Only

Source code must be UTF8 only. Nothing else.

File Name Extension

File name extension is “.html6”.

Semantics

The semantics should follow XHTML5.

Questions ＆ Answers

What's wrong with XHTML/HTML5 exactly?

The politics of the standards bodies change, and their attitude about what is correct also changes unpredictably. Around 2000, we were told that XML and XHTML would change society, or at least make the web correct and valid, and far easier and more flexible to develop for. Now it's a decade later. Sure, the web has improved, but as far as HTML/XHTML and browser rendering go, it's still syntax soup with extreme complexities. 99.99% of web pages are still not valid, and nobody cares. Google doesn't care. Apple doesn't care. Of Google's hundreds of tips to webmasters, none advocates HTML validation. Google Earth itself generates invalid KML. Some 99.9% of the HTML files produced by Google or Apple are not valid HTML. Major browsers still don't agree on their rendering behavior. Web dev is actually far more complex, involving tens or hundreds of technologies that hardly any one person knows in full (Ajax, JSON, lots of XML variations). It's hard to say whether it is better at all than the HTML3 days with “font” and “table” tags and a gazillion tricks. The best practical approach is still trial ＆ error with browsers.

And now HTML5 comes along, from a newfangled hip group primarily from the current big corporations Google and Apple, with an attitude that validation is overrated: an insult to the XML mantra from the w3c, just when there were starting to be more and more sites with correct XHTML, and Microsoft's Internet Explorer was getting on track about correctness.

For a personal story about how the change of standards body attitude affects practical programming, see:

Lisp's SXML is not a stand-alone syntax for the needs of the web. SXML's syntax is designed to be compatible with the lisp language's existing syntax. Lisp syntax (aka sexp) has several syntactical irregularities. It is not 100% nested parens of the form (a b c …). SXML is easy for lispers to adopt, but harder for other languages and communities. (For detail of lisp's syntax irregularities, see: Fundamental Problems of Lisp.)

The following explains how several of lisp's syntaxes for XML break the tree-and-syntax structural correspondence that is inherent in XML.

XML as a textual representation of a tree has a quirk, in that each node has this special thing called “attributes” (aka “properties”). The “attribute” is not a node of the tree, but rather is special info attached to a node. Here's an example in HTML:

<h1 id="xyz" class="abc">A B C</h1>

The standard lisp syntax to represent attributes, adopted from lisp's similar concept of “properties” of lisp's “symbols”, is this:

(h1 :id "xyz" :class "abc" A B C)

The way this works is by creating an extra rule on the first char of a name. If the name starts with :, then that name is considered the name of a property, and the next element is considered its value. This special rule breaks a fundamental principle of XML syntax. That is, the lexical structure of the source code no longer corresponds to the semantic structure. The semantics of the source code changes depending on the first char of an atom.

Another way to represent XML's attribute, adopted in some lisp code based on lisp's “alist” (aka associative array) syntax, is this:

(h1 ((id . "xyz") (class . "abc")) A B C)

This too, has syntactical ambiguity.

From purely lexical analysis, the whole ((id . "xyz") (class . "abc")) can be interpreted as a node by itself, where the first element is again a node.

But also here, it uses lisp's special “cons” syntax (id . "xyz") which is itself ambiguous at the syntax level. It can be considered as a node named “id” with 2 branches . and "xyz", like this:

id
  .
  "xyz"

or it can be considered as a node named “cons” with 2 branches id and "xyz", like this:

cons
  id
  "xyz"

Another common lisp syntax for attributes, from SXML, is this:

(h1 (@ (id "xyz") (class "abc")) A B C)

Here, again a special rule is created. When the first element's name is just “@”, then that parenthesized expression is considered to be a property list, not a node.

So, in conceiving HTML6, the solution for getting rid of the syntax ambiguity between nodes and attributes is to use a special bracket for the properties/attributes of a node. ⁖ 〔h1 「id “xyz”, class “abc”」 A B C〕. This is a purely syntactic solution.

Why use weird Unicode characters for brackets?

Unicode is widely adopted today and is very practical. 〔➤ Unicode Popularity: How Popular is UTF-8?〕 It is the default char set for many langs (⁖ Java, XML, Haskell, GoLang). Unicode also has a lot of proper matching pairs. 〔➤ Matching Brackets in Unicode〕 Today is a good time to adopt the wide range of proper symbols provided in Unicode, instead of relying on the very limited number of ASCII characters of the 1960s.

The straight double quote character " (ASCII 34) is not a matching pair; it has several practical problems when used in a computer language. For example, it needs context to know which quote chars are paired. Also, it is difficult to recover from a missing quote. (This problem is especially pronounced in text editors for syntax highlighting.) A proper matching pair allows programs and editors to determine the quoted string correctly and more easily, which makes it easier to know its position in a tree and to implement features such as navigating the tree in an editor. (For more detail, see: Problems of Symbol Congestion in Computer Languages; ASCII Jam vs Unicode.)

If we used the ASCII brackets () and [] for HTML6, a lot of ugly escaping would be needed in the content text.

The core idea of HTML6 is that the syntax is designed specifically as a 2-dimensional textual representation of a tree, with an added special syntax for XML's concept of attributes.

The advantage of this is that it should be extremely easy to parse. The syntax can be specified in perhaps just 3 lines of parsing expression grammar (PEG), and PEG libraries exist for Perl, Python, Ruby, Lua, C, C#, Java, OCaml/F#, Clojure, … A parser for HTML6 can also be written trivially by hand, without relying on PEG.
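As a rough illustration of the hand-written route, here is a small recursive-descent parser in Python. It is a sketch, not a reference implementation. The grammar it follows is my reading of the examples in this article: an element is 〔name, optional 「attributes」, then text and child elements〕. The function names and the (tag, attributes, children) tuple it returns are hypothetical. Character references are left undecoded, and whitespace inside text is kept as written.

```python
import re

NAME = re.compile(r"[^\s〔〕「」“”,]+")  # a tag or attribute name
TEXT = re.compile(r"[^〔〕]+")           # a run of text between brackets
SPACE = re.compile(r"\s*")


def _skip(src, pos):
    """Return the offset of the first non-whitespace char at or after pos."""
    return SPACE.match(src, pos).end()


def _expect(src, pos, ch):
    if not src.startswith(ch, pos):
        raise ValueError(f"expected {ch!r} at offset {pos}")
    return pos + 1


def _name(src, pos):
    match = NAME.match(src, pos)
    if not match:
        raise ValueError(f"expected a name at offset {pos}")
    return match.group(), match.end()


def _attributes(src, pos):
    """Parse 「name “value”, name “value”」 into a dict."""
    pos = _skip(src, _expect(src, pos, "「"))
    attrs = {}
    while not src.startswith("」", pos):
        key, pos = _name(src, pos)
        pos = _expect(src, _skip(src, pos), "“")
        end = src.find("”", pos)
        if end < 0:
            raise ValueError(f"unterminated value for attribute {key!r}")
        attrs[key] = src[pos:end]
        pos = _skip(src, end + 1)
        if src.startswith(",", pos):
            pos = _skip(src, pos + 1)
        elif not src.startswith("」", pos):
            raise ValueError(f"expected ',' or '」' at offset {pos}")
    return attrs, pos + 1


def _element(src, pos):
    """Parse one 〔…〕 element starting at pos; return (node, next offset)."""
    pos = _expect(src, pos, "〔")
    tag, pos = _name(src, pos)
    pos = _skip(src, pos)
    attrs = {}
    if src.startswith("「", pos):
        attrs, pos = _attributes(src, pos)
        pos = _skip(src, pos)
    children = []
    while not src.startswith("〕", pos):
        if src.startswith("〔", pos):
            child, pos = _element(src, pos)
        else:
            match = TEXT.match(src, pos)
            if not match:
                raise ValueError(f"element {tag!r} is never closed")
            child, pos = match.group(), match.end()
        children.append(child)
    return (tag, attrs, children), pos + 1


def parse_html6(src):
    """Parse a document consisting of a single root element."""
    node, pos = _element(src, _skip(src, 0))
    if src[pos:].strip():
        raise ValueError(f"unexpected text after the root element at offset {pos}")
    return node
```

Because the opening and closing brackets are distinct characters, the parser never needs to look at context to decide whether a bracket opens or closes something, which is the point made above about matching pairs.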

Any thoughts about flaws?

It is probably hopeless for browsers to adopt this. But if you are involved in the standards bodies of XML or HTML5, please consider this, and consider more about correctness and validation. XML is a move in the right direction, with huge consequences in various XML languages and formats (XSLT, XSL, XQuery, o:XML…, Microsoft Office Open XML, etc.). Whatever new features HTML5 has can be expressed as XML with a new DTD (⁖ XHTML 5).

HTML5 was created in part to address the w3c's slowness in responding to industrial changes, and in part to address the verbosity of XML syntax. HTML5 by itself does not introduce any new technical concepts. The force behind HTML5 is almost purely corporate adoption, and mostly existing practices from corporations. But the attitude it brought about seems to be a step backward: towards corporate-sponsored tags (much from Google) and technologies (⁖ much of canvas is from Apple, low-level pixel-drawing garbage in comparison to SVG), odds-and-ends special tags, more special syntaxes, less focus on correctness, and yet another new syntax/format in the HTML/XML/XHTML/DTD-sniffing soup.