I see this as an implementation issue, not a spec issue. It does not affect parsing, and the spec is about how the input is parsed.

An implementation may choose to provide a “safe mode” that renders raw HTML escaped, or as a comment, or as nothing, and suppresses links with URLs that don’t fit some condition. It may even choose to make this mode the default. As I’ve described, it’s easy to do this at the rendering stage, or by altering the AST before rendering.

In my project at work XSS (Javascript injection) exploitable code has been a huge problem. I’ve thought the best way to solve this would have been to not save html but use a language like markdown to prevent XSS from being saved. Looking at the spec for CommonMark I realized that allowing HTML does not appear to be optional, for many site’s this could allow exploits. I’ve also noticed that it can’t be used on this site.

I personally think that html input should be an extension and not part of the default behavior. If that isn’t acceptable maybe it can be spec-ed that a valid parser/renderer can disable this feature, or provide an option to enable only safe renderings (clean unsafe tags/attributes).

@vitaly I don’t really agree, if a spec doesn’t specify something as optional, disable-able, or an extension, then if an implementation doesn’t support it, it isn’t really conforming to that spec. I do agree that it’s managed at the implementation level, but I think that the spec level can define that it is something that is optional and that implementations should make it disable-able. If you were to allow only partial html support that’s also something that the spec would need to define what that means. Does the spec need to define exactly how to disable it in the implementation? no that is an implementation detail in this case.

“Optional” in context of spec is something, that can be omitted. HTML can NOT be omitted if you implement markdown parser. IMHO it’s not good idea to add programming issues into spec. That make things complex without added value. If spec does not prevent to implement configurable parsers - it’s good enouth as is.

Parsing HTML into an AST is much harder than parsing Markdown into an AST

You have made this claim before and have yet to provide any evidence to support your argument. HTML is a subset of markdown, unless you’re naively thinking that you can remove the ability for markdown to support raw HTML, at which point you’ve invalidated the majority of use cases that would reap any benefit from HTML sanitization.

Jack_Douglas:

even the best HTML sanitizers contain bugs

I’ve refuted this point already:

zzzzBov:

If you find that there are bugs in your HTML sanitizer, fix the bugs in the HTML sanitizer. Building a new sanitizer is a sure-fire way to introduce more buggy implementations.

unless you’re naively thinking that you can remove the ability for markdown to support raw HTML

Didn’t I make that 100% clear in the original post? I’ve no idea why you would call that “naive” — I’m not invalidating anything, the topic is about whether support is mandatory or optional in the spec. There are plenty of use cases for markdown where HTML support is not necessary or is only used to work around what is not supported in a particular flavour of markdown.

From the original post with extra emphasis:

Jack_Douglas:

I suggest that all potentially unsafe features of CommonMark are designated Optional. Among others this would include inline HTML and javascript hrefs in links.

The HTML5 syntax spec specifies everything in excrutiating detail, while CommonMark’s bulk is mostly examples. CommonMark is more readable, if you want to get a general idea of what Markdown’s syntax looks like, but HTML5 probably makes it easier for two people to build a compatible implementation without talking to each other.

Apparently not, as I obviously missed it or at least forgot about it in the last 18 months.

Jack_Douglas:

I’ve no idea why you would call that “naive”

It is naive to think that you can make markdown “safe by default” especially where the most impacted use cases are made invalid. What you’d be doing is providing bad defaults because everyone who wanted to use HTML in markdown then has to enable it, at which point the safety is moot.

Jack_Douglas:

There are plenty of use cases for markdown where HTML support is not necessary or is only used to work around what is not supported in a particular flavour of markdown

I don’t disagree that there are use cases where HTML support is not strictly necessary, but I consider those to be a minority and will continue to do so until you can provide evidence that proves otherwise.

I strongly disagree with the idea of making CommonMark safe by default, because I see it as misguided and a waste of time that’s better served by other existing tools.

If you would like to change my mind, you will need to provide evidence that:

disabling the features that you suggest could be optional will not affect 95% of existing markdown usage (I’m willing to be flexible on what constitutes this set, but it obviously can’t be cherry picked)

compelling evidence that markdown sanitization is easier and more effective than existing HTML sanitization libraries

other commonmark implementors are interested in and willing to support the burden of keeping up with changes to the commonmark spec, specifically with regard to the “safe by default” feature, which could require fast action to respond to known security threats.

disabling the features that you suggest could be optional will not affect 95% of existing markdown usage (I’m willing to be flexible on what constitutes this set, but it obviously can’t be cherry picked)

compelling evidence that markdown sanitization is easier and more effective than existing HTML sanitization libraries

I don’t think this is a generally disputed point. jgm described HTML sanitization as “surprisingly hard”, and my suggestion is to parse inline HTML as normal text by default, which is simple. The only difficulty is with other injection points, eg links — where we could white-list schemes to prevent javascript injection (for example). A bit harder but not even close to the difficulties of parsing HTML, never mind sanitizing.

other commonmark implementors are interested in and willing to support the burden of keeping up with changes to the commonmark spec, specifically with regard to the “safe by default” feature, which could require fast action to respond to known security threats

I can’t imagine there would be security threats that ever needed changes to the spec if inline HTML was treated as text, and we white-listed safe link schemes.

I endorse this suggestion. CommonMark should be safe by default. The underlying assumption should be that it runs in a browser and that it has no ability to modify execution context or implement an XSS attack. If you want features that require more lax-execution then those checks should be turned down. But, the default should be security in the context of a Javascript enabled browser.

Html should be an extension, because I really only see 2 use cases for CommonMark

Static sites where only trusted developers can edit things, thus full html is fine

Sites where markdown is used for formatting for many untrusted users, html is not fine

theoretically there may be a 3rd case where CommonMark is currently not sufficient for 2 but I think those are probably just features CommonMark needs to add additional formatting and features to the core. If there are cases where html is needed but not full html, that could be handled by the extension, in its spec.

URI’s of course will need to have safe as part of the core spec.

I would argue that any site that is currently “cleaning” html output and claims any kind of CommonMark compliance is currently in violation of the spec. The spec as written says that html is allowed, and includes no restrictions. So if this site doesn’t allow the following code to render if pasted into this form, then it’s not actually CommonMark compliant as the spec is written.

theoretically there may be a 3rd case where CommonMark is currently not sufficient for 2 but I think those are probably just features CommonMark needs to add additional formatting and features to the core. If there are cases where html is needed but not full html, that could be handled by the extension, in its spec.

Do you want to add a Markdown syntax for the <details> tag? For tables with colspan attributes? For <dl>? For defining anchor points (<a name="point">)? For comments? For defining abbreviations inline? For exponentiation? For nested tables? If the answer to any of these questions is “no”, then we’re going to need embedded HTML, because I’ve seen all of those things used in GitHub.

Markdown’s advantage is that, by supporting a small number of features, it can pretty much assign each feature it’s own special syntax, giving very good support for the 80% part of the 80/20 rule. But the more features Markdown grows, the harder it gets to parse, because all those different syntaxes interact in very ugly ways. HTML, because (almost) all of the tags are syntactically identical, can add new features with very little effect on the parser.

It’s the same reason Discourse’s quotebox syntax is done with BBCode. It’s easier to add a lot of features to a list of English words than to come up with a parseable, typeable ASCII art representation.

Why do you need those things on github? I would question if markdown is really the right tool for the arbitrary user input. But if it’s useful enough… I have seen useful extensions for tables that work well, do they support custom attributes? no but custom attributes also get people in trouble, did you remember to clean your style, onclick, and onmouseover attributes (i’m sure there’s other ones too)? if not I can write an exploit.

Having it as an extension means that people can allow it if they want, and add it, but that commonmark implementation without extension would be safe to have out of the box. Note: hugo, for example was ok with my embedded <script></script> but because it’s static html generation that’s fine… I wouldn’t want this on a site where arbitrary people can do it.

you can either have html and not have it be secure, define only a safe subset of html (hard), or not have html at all. Currently the spec defines the first option.