Script arguments containing whitespace or a double quote are garbled when passed to R3 scripts. Blanks within arguments are irrecoverably intermingled with the blank used to separate distinct arguments. Other whitespace and double quotes are misinterpreted by the REBOL code to split the joined argument string into separate arguments again.

Consider the %args.r script given in the example code. When calling it with R3 A111 and the 3 arguments "a", "b c", "d" we get the following result:

In the host kit, we see in src/os/host-args.c (lines 252-261) that the real script argument delimitation is already irrecoverably dropped at the host kit level.

A proper fix therefore seems to at least require changing the REBARGS structure (src/include/reb-args.h) to use a REBCHR **args (instead of the current REBCHR *args) to correctly preserve argument delimitation for the arguments being passed from the host kit to the core library.

The args after the script are being combined into a single string by concatenating them with spaces separating them. The problem is that the OS args processing code gets rid of the quotes surrounding args in quotes, but we aren't adding them back when we reconstruct the args string. The changed code above will put quotes around args with spaces in them, which will make the [parse args ""] in the startup code treat them as separate values. We probably don't need to check for other whitespace chars, like tabs, because the operating system converts them to spaces (or at least Windows does - how about Linux?).

(0003197)BrianH1-Oct-2011 20:44

Note: I changed the category to Host-Kit because the bug is in the host-kit source, and fixable there. However, this bug also affects the RT one-exe builds, and it affects all platforms because it's in a portion of the code that is not qualified by any platform-specific defines.

(0003198)abolka1-Oct-2011 21:40

Unfortunately, that "fix" only fixes a part of the problem, that is blanks within arguments. It doesn't address double quotes or other whitespace within arguments.

Mangling the argv into a string only to later unmangle it into separate arguments again and trying not to lose information in the process is probably not the wisest approach. Simply preserving the argv as an array of strings while passing it from the hostkit to libr3 seems like a better idea; it certainly avoids this class of issues completely.

(Updated the summary & description to reflect the whitespace/dquote issue in more detail.)

(0003201)abolka2-Oct-2011 02:37

Alternatively, use PARSE/all with a delimiter and escape the delimiter as well as the escape char in each arg. For example:

In hostkit code:
1. replace all "\" in each arg with "\b"
2. replace all ";" in each arg with "\s"
3. join args with ";"

In startup code:
1. parse/all args ";"
2. replace all "\s" in each arg with ";"
3. replace all "\b" in each arg with "\"

(I'd say this is the more complex option, but YMMV.)

(0003203)BrianH4-Oct-2011 17:40

Replacing a single character with two characters in the host kit code is awkward, because the strings in question are allocated with fixed lengths. Better to join with a rarely used non-printable character in the ASCII range like DEL (to-char 127), then split by that character in the boot code when found in the args. When it's not found that means the --args option was used, so we can fall back on [parse args ""].

We wouldn't be able to use simple parsing for the 7F joined case because of the quotes issue, so we'd have to do a REBOL-language equivalent. This is a bit of a problem because we've been avoiding rule-based parsing in the startup code for speed reasons. We can't just do a REPLACE of " with something else because REPLACE is even slower, and not available in sys/start since it's a non-base mezzanine, and modifying of a string that might not be safe to modify. So to do this we would need to only use a small number of fast natives.

That would mean that in host-args.c you would replace (the original code):
JOIN_STR(args, TXT(" "), 1);
with this:
JOIN_STR(args, TXT("\x7F"), 1);

This is starting to get a bit complex, but it's faster than using REPLACE, and it's non-modifying on the initial args string, something which might be required. Note that the join character is a statement terminator, not a separator, so the final empty string is not an argument.

The main problem with changing the args field from REBCHR* to REBCHR** is that the initial args option would still need to be a string if the --args parameter was specified and not overriden by the leftover command line args after the script name. This is necessary because we need to support that parameter being parsed by user-supplied startup code. In order to really support this different behavior, the native args would need to be some kind of tagged union which could either be a string or an array (block?) of strings. I don't yet know whether such a thing is possible in the native code, even with Carl changing r3lib.

(0003206)abolka4-Oct-2011 23:13

> The main problem with changing the args field from REBCHR* to REBCHR** is that the initial args
> option would still need to be a string if the --args parameter was specified and not overriden by
> the leftover command line args after the script name. This is necessary because we need to support
> that parameter being parsed by user-supplied startup code.

If we pass through arguments unmangled, then --args is not really needed anymore.

Note: Currently, (REBOL-written) startup code has no way to discern between a system/options/args string passed via --args or system/options/args created by the hostkit joining command-line args. This does not change if we switch args to be a block! instead of a string!. (And I do not see a reason why it should change.)

So I am still strongly in favour of switching to a block!-based system/options/args for startup.

The only real reason against this that I can think of is that there may be platforms/hosts which don't really have a notion of multiple distinct arguments (or arguments at all). Even for those cases, passing an empty args block (no arguments) or an args block with a single element (one argument) seems to be a good startup interface.