What’s up with the strange treatment of quotation marks and backslashes by CommandLineToArgvW

The way the CommandLineToArgvW function treats quotation marks and backslashes has raised eyebrows at times. Let's look at the problem space, and then see what algorithm would work.

Here are some sample command lines and what you presumably want them to be parsed as:

| Command line | Result |
|---|---|
| program.exe "hello there.txt" | program.exe hello there.txt |
| program.exe "C:\Hello there.txt" | program.exe C:\Hello there.txt |

In the first example, we want quotation marks to protect spaces.

In the second example, we want to be able to enclose a path in quotation marks to protect the spaces. Backslashes inside the path have no special meaning; they are copied as any other normal character.

So far, the rule is simple: Inside quotation marks, just copy until you see the matching quotation marks. Now here's another wrinkle:

| Command line | Result |
|---|---|
| program.exe "hello\"there" | program.exe hello"there |

In the third example, we want to embed a quotation mark inside a quoted string by protecting it with a backslash.

Okay, to handle this case, we say that a backslash which precedes a quotation mark protects the quotation mark. The backslash itself should disappear; its job is to protect the quotation mark and not to be part of the string itself. (If we kept the backslash, then it would not be possible to put a quotation mark into the command line parameter without a preceding backslash.)

But what if you wanted a backslash at the end of the string? Then you protect the backslash with a backslash, leaving the quotation mark unprotected.

| Command line | Result |
|---|---|
| program.exe "hello\\" | program.exe hello\ |

Okay, so what did we come up with?

We want a backslash before a quotation mark to protect the quotation mark, and we want a backslash before a backslash to protect the backslash (so you can end a string with a backslash). Otherwise, we want the backslash to be given no special treatment.

The CommandLineToArgvW function therefore works like this:

A string of backslashes not followed by a quotation mark has no special meaning.

An even number of backslashes followed by a quotation mark is treated as pairs of protected backslashes, followed by a word terminator.

An odd number of backslashes followed by a quotation mark is treated as pairs of protected backslashes, followed by a protected quotation mark.
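The three rules above can be sketched in a few lines of Python. This is only an illustration of the rules as stated here, not the actual implementation of CommandLineToArgvW: the function name `split_command_line` is made up for this example, and the real function has further special cases (such as the handling of consecutive quotation marks raised in the comments) that are not modeled.

```python
def split_command_line(cmdline: str) -> list[str]:
    """Split a command line using the three backslash rules described above.

    Hypothetical helper for illustration; it does not reproduce every
    special case of the real CommandLineToArgvW.
    """
    args: list[str] = []
    current: list[str] = []
    in_quotes = False
    have_arg = False  # True once the current argument has started, even if empty
    i = 0
    n = len(cmdline)
    while i < n:
        ch = cmdline[i]
        if ch == "\\":
            # Measure the whole run of backslashes before deciding what it means.
            start = i
            while i < n and cmdline[i] == "\\":
                i += 1
            count = i - start
            if i < n and cmdline[i] == '"':
                # Each pair of backslashes yields one literal backslash.
                current.append("\\" * (count // 2))
                if count % 2 == 1:
                    # Odd count: the last backslash protects the quotation mark.
                    current.append('"')
                    i += 1
                # Even count: the quotation mark is left unprotected and is
                # handled by the next loop iteration.
            else:
                # Not followed by a quotation mark: no special meaning.
                current.append("\\" * count)
            have_arg = True
        elif ch == '"':
            in_quotes = not in_quotes
            have_arg = True
            i += 1
        elif ch in " \t" and not in_quotes:
            if have_arg:
                args.append("".join(current))
                current = []
                have_arg = False
            i += 1
        else:
            current.append(ch)
            have_arg = True
            i += 1
    if have_arg:
        args.append("".join(current))
    return args


# The four sample command lines from the tables above.
print(split_command_line(r'program.exe "hello there.txt"'))
print(split_command_line(r'program.exe "C:\Hello there.txt"'))
print(split_command_line(r'program.exe "hello\"there"'))
print(split_command_line(r'program.exe "hello\\"'))
```

Counting the entire run of backslashes before acting on it is what makes the rules awkward: the meaning of the first backslash is not known until the end of the run is reached.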

The backslash rule is confusing, but it's necessary to permit the very important second example, where you can just put quotation marks around a path without having to go in and double all the internal path separators.

Personally, I would have chosen a different backslash rule:

Warning - these are not the actual backslash rules. These are Raymond's hypothetical "If I ran the world" backslash rules.

A backslash followed by another backslash produces a backslash.

A backslash followed by a quotation mark produces a quotation mark.

A backslash followed by anything else is just a backslash followed by that other character.
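As a rough sketch of these hypothetical rules (again, not what CommandLineToArgvW does), the parser only needs to remember whether the previous character was a pending backslash. The name `split_hypothetical` and the whitespace handling are assumptions made for the example.

```python
def split_hypothetical(cmdline: str) -> list[str]:
    """Split using the first set of hypothetical backslash rules.

    Hypothetical helper for illustration only; these are NOT the rules
    that CommandLineToArgvW implements.
    """
    args: list[str] = []
    current: list[str] = []
    in_quotes = False
    started = False   # True once the current argument has begun
    escaped = False   # True if the previous character was a pending backslash
    for ch in cmdline:
        if escaped:
            escaped = False
            if ch in '\\"':
                # \\ produces \ and \" produces "
                current.append(ch)
                continue
            # Anything else: the backslash was literal; then handle ch normally.
            current.append("\\")
        if ch == "\\":
            escaped = True
            started = True
        elif ch == '"':
            in_quotes = not in_quotes
            started = True
        elif ch in " \t" and not in_quotes:
            if started:
                args.append("".join(current))
                current = []
                started = False
        else:
            current.append(ch)
            started = True
    if escaped:
        # A trailing backslash has nothing to protect, so keep it.
        current.append("\\")
    if started:
        args.append("".join(current))
    return args


print(split_hypothetical(r'"hello\"there"'))        # ['hello"there']
print(split_hypothetical(r'"C:\Hello there.txt"'))  # ['C:\\Hello there.txt']
# The leading double backslash collapses into a single one:
print(split_hypothetical(r'"\\server\share\path with spaces"'))
```

Each character is examined exactly once, with no need to count ahead through a run of backslashes.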

I prefer these rules because they can be implemented by a state machine. On the other hand, it makes quoting regular expressions a total nightmare. It also breaks "\\server\share\path with spaces", which is pretty much a deal-breaker. Hm, perhaps a better set of rules would be

Warning - these are not the actual backslash rules. These are Raymond's second attempt at hypothetical "If I ran the world" backslash rules.

Backslashes have no special meaning at all.

If you are outside quotation marks, then a " takes you inside quotation marks but generates no output.

If you are inside quotation marks, then a sequence of 2N quotation marks represents N quotation marks in the output.

If you are inside quotation marks, then a sequence of 2N+1 quotation marks represents N quotation marks in the output, and then you exit quotation marks.

This can also be implemented by a state machine, and quoting an existing string is very simple: Stick a quotation mark in front, a quotation mark at the end, and double all the internal quotation marks.
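Here is a minimal sketch of this second hypothetical scheme, covering both the quoting step and a three-state parser. The names `quote_arg` and `split_doubled_quotes` are invented for the example, and treating unquoted spaces and tabs as argument separators is an assumption; none of this reflects what CommandLineToArgvW actually does.

```python
def quote_arg(arg: str) -> str:
    """Quote a string: wrap it in quotation marks and double the internal ones."""
    return '"' + arg.replace('"', '""') + '"'


def split_doubled_quotes(cmdline: str) -> list[str]:
    """Split using the second set of hypothetical rules (hypothetical helper)."""
    OUTSIDE, INSIDE, INSIDE_AFTER_QUOTE = range(3)
    state = OUTSIDE
    args: list[str] = []
    current: list[str] = []
    started = False  # True once the current argument has begun
    for ch in cmdline:
        if state == INSIDE_AFTER_QUOTE:
            if ch == '"':
                # Two quotation marks in a row inside quotes produce one.
                current.append('"')
                state = INSIDE
                continue
            # The pending quotation mark was the closing one; handle ch as outside.
            state = OUTSIDE
        if state == INSIDE:
            if ch == '"':
                state = INSIDE_AFTER_QUOTE
            else:
                current.append(ch)  # backslashes are ordinary characters here
        else:  # OUTSIDE
            if ch == '"':
                state = INSIDE      # enter quotation marks, generate no output
                started = True
            elif ch in " \t":
                if started:
                    args.append("".join(current))
                    current = []
                    started = False
            else:
                current.append(ch)
                started = True
    if started:
        args.append("".join(current))
    return args


argv = ["program.exe", r"\\server\share\path with spaces", 'say "hi"', "ends with \\"]
cmdline = " ".join(quote_arg(a) for a in argv)
print(cmdline)
print(split_doubled_quotes(cmdline) == argv)  # True
```

Because backslashes are never special, the UNC path and the trailing backslash survive the round trip unchanged.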

But what's done is done, and the first set of backslash rules is what CommandLineToArgvW implements. And since the behavior has been shipped and documented, it can't change.

If you don't like these parsing rules, then feel free to write your own parser that follows whatever rules you like.

It keeps track of opening and closing quotes and removes all quotes except for any quote immediately following a closing quote (and these retained quotes aren't treated as opening quotes). Surely this isn't deliberate.

[No idea if it was deliberate, but at least in the examples you gave, it seems to match my second set of rules… -Raymond]

"If you don't like these parsing rules, then feel free to write your own parser that follows whatever rules you like."

But then you'd create an app that behaves inconsistently with others. (Not that there is total consistency, I guess; maybe that horse has already bolted.)

The DOS parsing rules, like most aspects of DOS, are awful and demonstrate the designers' lack of thought, but they're also unavoidable if you use cmdline tools on Windows (since you cannot change the parsing rules used in 3rd party apps) so I think we just have to live with them, including in our own programs.

Maybe it'd work if someone with influence (i.e. control over the OS or developer tools/libraries) provided and encouraged use of a single new, good parser. But if the only alternative to the DOS one is everyone knocking together their own, I think I'd rather the devil I know.

Using the command-line seems to be a dying art (the number of people I talk to who are clueless about putting paths with spaces inside quotes…) and, after PowerShell solved all the wrong problems, I've given up hope of seeing any real focus on renovations to that aspect of the OS. It is what it is, changing it is a compatibility nightmare and not enough people use it to justify the effort, I guess. Time machine needed, stat!

[Looks like the .NET Framework folks took up your challenge. All .NET programs use a common command line parser. (No idea if it counts as "good" though.) -Raymond]

Read closer. The rules that Raymond originally said he would want match the typical scripting language. However, in the actual rules, \ only escapes if the sequence of backslashes ends with a ". In other words, in Unix, 'foo\bar' encodes 'foobar', but here, 'foo\bar' just encodes 'foo\bar'.

It is unfortunate that the way the CRT parses a command line is different from the way this API parses the command line. If you need to quote a complicated string with quotes and backslashes you have to know if the application used [w]main or [w]WinMain.

I always thought the Unix way of doing it, that is, handing the OS an argv[] array of arguments for the process to start, and the process retrieving that same array, was much cleaner. The shell is the one actually doing the splitting.

A. Added a function that converts an argv array back into a properly quoted and escaped command-line. (i.e., something that undoes CommandLineToArgvW) Maybe this isn't even possible if what asdf said is true.

B. Added a version of CreateProcess that does away with the confusing module-name/command-line business and that instead takes an argv array.

Right now if I want to write a program that forwards command-line arguments to some other program (e.g., sudo- or DropMyRights-like programs would be prime candidates), it's really hard to invoke the other process exactly right. I'd need to tokenize the command-line to process my own options, but then I'd need to rebuild the rest of the command-line to feed to CreateProcess.
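For the three backslash rules described in the article, the rebuilding step can be sketched as follows. This is an illustration under stated assumptions, not an official API: `quote_argument` and `build_command_line` are hypothetical names, and a target program that parses its command line by different rules (the concern raised above) would not necessarily read the result back the same way.

```python
def quote_argument(arg: str) -> str:
    """Quote one argument so the article's three backslash rules recover it.

    Hypothetical helper for illustration; it assumes the receiving program
    follows exactly those rules.
    """
    # Nothing to protect: no whitespace, no quotation marks, and not empty.
    if arg and not any(c in ' \t"' for c in arg):
        return arg
    out = ['"']
    backslashes = 0  # length of the backslash run seen so far
    for ch in arg:
        if ch == "\\":
            backslashes += 1
            continue
        if ch == '"':
            # Double the pending backslashes, then add one to protect the quote.
            out.append("\\" * (backslashes * 2 + 1))
        else:
            # Backslashes not followed by a quotation mark are copied as-is.
            out.append("\\" * backslashes)
        out.append(ch)
        backslashes = 0
    # Trailing backslashes would precede the closing quote, so double them.
    out.append("\\" * (backslashes * 2))
    out.append('"')
    return "".join(out)


def build_command_line(argv: list[str]) -> str:
    """Join already-split arguments back into a single command line string."""
    return " ".join(quote_argument(a) for a in argv)


print(build_command_line(["program.exe", "hello there.txt"]))
print(build_command_line(["program.exe", r"C:\Hello there.txt"]))
print(build_command_line(["program.exe", 'hello"there']))
print(build_command_line(["program.exe", "dir with space\\"]))
```

The only places where backslashes need doubling are immediately before an embedded quotation mark and immediately before the closing one, which is why an ordinary path can be wrapped in quotation marks without touching its separators.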

The timing of this posting is actually quite serendipitous. In my operating systems class, the project I'm currently working on requires us to write a routine to parse the command line into argc and argv[]. I was debating how best to handle quoting (we weren't given specifics), and then I came across your blog post while taking a break from working on the project.

The simplicity of your ideal quoting rules broke the tie in my mind between using backslash and doing VB-style double quotes. Thanks! :)

That said, it seems like there are still some corner cases which you might want to define if you were really going to do something like this. For example, given your ideal rules, what do you do with dangling/mismatched quotes? It seems like you have three choices; have the null terminator be an implicit ending quote, treat the quote as a literal character, or return some kind of error (which is lame):

Oh the joys of Windows' command line parsing. I once had to implement a command-line parser in Delphi (because its built-in one sucks and pretty much just eats all quotes), and since I didn't know about CommandLineToArgvW then (MSDN has a description of how it works with regard to quotes and backslashes; the function itself wouldn't have helped me, since I had to have the program work on Win98), I just wrote a small C program and fed it various quoted strings until I figured out the behaviour – well, at least most of it (I still don't know how exactly it parses multiple quotes, but as I later found out, this isn't documented in MSDN either).

And the primary issue? Using the \ for both escaping mechanism AND path delimiter. If we were using that new-fangled web-notation, there wouldn't be that many issues. Whoever had that idea a few decades ago certainly messed up. Not that there wouldn't be enough issues left, such as regExp inside any other programming language…

@Joshua: Hm, maybe. Possibly a hypothetical argv version of CreateProcess could set internal compatibility flags that could affect the behavior of a subsequent CommandLineToArgvW call? (I haven't thought this through.) IMO it also might be acceptable to give up and ignore quoting differences between the different parsing APIs. Just pick a horse and let it become the standard method going forward. If what asdf said is true, then it's already impossible to forward command-line options reliably to an arbitrary program anyway.

@James: Yes, that's true, but there were other alternatives that would have been better in the long run.

After all, there's no ambiguity using / as both switch and path-delimiter except in the single case of "path relative to the root folder of the current drive", which could have been expressed differently, e.g. ".:/foo/bar" instead of "\foo\bar".

@James: AFAIK, it was IBM that forced \ as path delimiter. CP/M and DOS 1.0 didn't have directories, but their programs used / as the command-line parameter separator. From what I've read, Microsoft actually wanted to use / as path separator (and switch to something else for command-line parameters), but IBM was against it (so instead the people who worked on DOS made it possible to use both \ and / as path separator, and added an undocumented function to change the parameter separator to any other character).

Looks like your example shows that you can create a quote by quoting it (""" -> "), so four of these (12 ") would then become 4 ", which is what your example shows. I like the way you said the quote *behind* the closing quote was displayed, while I took it that the quote *before* the closing quote was shown (sort of like []a-z] in regexps).