On Thu, May 31, 2001 at 01:46:01PM +0200, Oren Ben-Kiki wrote:
| > I don't see why we need to forbid its usage in lists...
| > although you may consider it ugly...
| >
| > list: @
| > This is a multi line
| > scalar value.
| > This is the second entry.
|
| First, it is specifically forbidden by the current spec.
It is? It was in a previous spec, but not the recent
one (the one dated the 26th) which was updated in a major
way on the 26th after it was released (my bad).
Where explicitly, as I actually remember working this
through in my head so that it wasn't an exception.
| Second, yes, it is ugly, a prefix ':' would do wonders here.
Hmm. Considering.
| Let's see. YAR needs to satisfy the following requirements:
|
| 1. Always round-trip the file correctly, byte-to-byte identical.
| 2. Subject to (1), use a human-readable representation of a file.
| 3. Carry over any meta-data about the file which is relevant.
Good.
| - Represent each file as a map:
| file-name: %
| permissions: ...
| owner: ...
| group: ...
| character-set: ...
| content: ...
|
| - The "character set" attribute explains how to convert the value of the
| 'content' entry to bytes to write. It should be one of
| ascii/utf-8/utf-16be/utf-16le. If the content is a binary blob, either there
| is no "character set" (the content isn't textual), or it could just say
| "binary" (I know that's not an IANA recognized character set...)
Ok. So we've made an explicit node attribute
called "character set". Interesting. Should
this be added using a special indicator? ^utf-8
| This works, is safe, is extensible, and doesn't require us to bend YAML out
| of shape. The only thing which bothers me is handling newlines when
| transferring between DOS/Windows and the rest of the world. Question: in a
| block, does YAML preserve whether a line ended with a \n or a \r\n? If not
| we are OK.
Yep. I was worrying about this one too. I'm not
sure what the solution is. Normalizing new lines
is a perfectly reasonable thing to do... except
in this case.
| And I'm still worried about the class marker... Can you
| say something about what you mean by "class map"? I
| didn't get it.
Assume you only have an object which is a map, list, or scalar.
The YAML parser/emitter pair could keep a global "classmapping"
which associated all objects created via reading from the
serialized format with its optional class map. Then, when
writing the YAML file, the global variable could be used to
reconstruct the class map attribute. Yes, this opens up
a small can of worms... garbage collection among them.
However, it does make round-tripping possible even when
class isn't supported by the system.
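A minimal sketch of the global class map idea, in Python. Everything here is an assumption for illustration: the names `remember_class`, `lookup_class` and `forget_class` are hypothetical, and no existing YAML parser or emitter is implied. The table holds a reference to each node, which is exactly the garbage-collection cost mentioned above.

```python
# Hypothetical side table: id(node) -> (node, class name).
# Plain maps, lists and scalars cannot carry a class themselves,
# so the loader records it here and the emitter reads it back.
_class_map = {}


def remember_class(node, class_name):
    """Record the class of a freshly loaded node (called by a loader)."""
    # Storing the node keeps it alive, so its id() cannot be reused by
    # another object. The price is that the node is never collected
    # until forget_class() is called.
    _class_map[id(node)] = (node, class_name)
    return node


def lookup_class(node):
    """Return the recorded class name, or None (called by an emitter)."""
    entry = _class_map.get(id(node))
    if entry is not None and entry[0] is node:
        return entry[1]
    return None


def forget_class(node):
    """Drop the entry so the node can be garbage collected."""
    _class_map.pop(id(node), None)
```

A loader would call `remember_class(node, "Point")` when it reads a class marker, and an emitter would call `lookup_class(node)` before writing the node back out.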
Best,
Clark

> I don't see why we need to forbid its usage in lists...
> although you may consider it ugly...
>
> list: @
> This is a multi line
> scalar value.
> This is the second entry.
First, it is specifically forbidden by the current spec.
Second, yes, it is ugly, a prefix ':' would do wonders here. I've promised
not to ask again for allowing it, so I won't :-)
> | I don't like it because:
> |
> | - There's no way to write Unicode blocks.
> | - You are mixing up ASCII text and binary data in the same
> | data type,
> | - And separating ASCII and Unicode text for no good reason.
> | Languages are
> | evolving towards "text = Unicode", as they should, and
> | UTF-8 makes it very
> | easy to deal with Unicode text even in languages such as C.
>
> Yep. I can see these problems. However, then our YAR
> use case must have a ".yar" file which includes the list
> of "text" extensions. Furthermore, it will have to also
> note if the file was stored as UTF-8 or UTF-16. Hmmm.
Let's see. YAR needs to satisfy the following requirements:
1. Always round-trip the file correctly, byte-to-byte identical.
2. Subject to (1), use a human-readable representation of a file.
3. Carry over any meta-data about the file which is relevant.
Solution:
- Represent each file as a map:
file-name: %
permissions: ...
owner: ...
group: ...
character-set: ...
content: ...
- The "character set" attribute explains how to convert the value of the
'content' entry to bytes to write. It should be one of
ascii/utf-8/utf-16be/utf-16le. If the content is a binary blob, either there
is no "character set" (the content isn't textual), or it could just say
"binary" (I know that's not an IANA recognized character set...)
- The yar file creator chooses which syntax format to give the content
according to the following heuristic. In each case, it ensures the encoding
will be such that the file will be re-created byte-to-byte identical to the
original:
If the file contains...
1. ... only 7-bit printable ASCII characters, plus newlines => syntax:
block, charset: ascii.
2. ... only utf-8 printable characters, plus newlines => syntax: block,
charset: utf-8.
3. ... like 1, with the odd 8-bit character (but not 2) => syntax: quoted
string, charset: ascii.
4. ... like 2, with the odd unprintable character => syntax: quoted string,
charset: utf-8.
5. ... what looks to be a utf-16 file with only printable characters plus
newlines => syntax: block, charset: utf-16be/utf-16le.
6. ... like 5 with the odd non-printable character => syntax: quoted string,
charset: utf-16be/utf-16le.
7. ... anything else => syntax: blob, charset: binary (or missing).
This works, is safe, is extensible, and doesn't require us to bend YAML out
of shape. The only thing which bothers me is handling newlines when
transferring between DOS/Windows and the rest of the world. Question: in a
block, does YAML preserve whether a line ended with a \n or a \r\n? If not
we are OK.
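The seven-case heuristic listed earlier in this message could be sketched roughly as follows in Python. Several things are assumptions, not part of the proposal: the function name `choose_format`, the 5% threshold standing in for "the odd character", and detecting UTF-16 only by its byte-order mark. In the UTF-16 cases the two BOM bytes would have to be recorded separately to get a byte-identical round trip; the sketch does not do that.

```python
ODD_LIMIT = 0.05  # assumption: "odd" means at most 5% of the characters


def _bad_count(text):
    """Count characters that are neither printable nor a newline."""
    return sum(1 for ch in text if ch != "\n" and not ch.isprintable())


def _classify(text, charset):
    """Return (syntax, charset) for decoded text, or None if too noisy."""
    bad = _bad_count(text)
    if bad == 0:
        return ("block", charset)
    if bad <= ODD_LIMIT * len(text):
        return ("quoted", charset)
    return None


def choose_format(data):
    """Pick (syntax, charset) for the raw bytes of one file."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        text = None
    if text is not None:
        # Cases 1, 2 and 4: valid ASCII or UTF-8 text.
        result = _classify(text, "ascii" if text.isascii() else "utf-8")
        if result is not None:
            return result
    else:
        # Case 3: printable ASCII with the odd 8-bit byte mixed in.
        high = sum(1 for b in data if b >= 0x80)
        low = bytes(b for b in data if b < 0x80).decode("ascii")
        if high <= ODD_LIMIT * len(data) and _bad_count(low) == 0:
            return ("quoted", "ascii")
    # Cases 5 and 6: UTF-16, recognised here only by its BOM.
    for bom, charset in ((b"\xff\xfe", "utf-16le"), (b"\xfe\xff", "utf-16be")):
        if data.startswith(bom):
            try:
                result = _classify(data[2:].decode(charset), charset)
            except UnicodeDecodeError:
                result = None
            if result is not None:
                return result
    # Case 7: anything else is a binary blob.
    return ("blob", "binary")
```

Note that a carriage return counts as unprintable in this sketch, so a file with \r\n line endings lands in a quoted string rather than a block, which sidesteps the newline question for such files.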
And I'm still worried about the class marker... Can you say something about
what you mean by "class map"? I didn't get it.
Have fun,
Oren Ben-Kiki

On Thu, May 31, 2001 at 09:03:37AM +0200, Oren Ben-Kiki wrote:
| > > Unquoted: Good for single or multi-line folded
| > > content lacking significant whitespace
| > > and having all printables. Good for map
| > > scalars. Good for non-escaped content
| > > (as long as the first character does not
| > > begin with an indicator)
|
| I see the use of "the simplest possible syntax" without any escaping but
| with folding. But I find it rather arbitrary that you can't use it (the
| multi-line version of it) in lists.
I don't see why we need to forbid its usage in lists...
although you may consider it ugly...
list: @
This is a multi line
scalar value.
This is the second entry.
| At any rate, I think we've thrashed this to death. I'd rather use one of the
| above two forms (optional ':' or unifying quoted and unquoted), but I'll go
| with whatever you decide. I promise not to bug you about it again :-)
| > > Blocked: | Good where leading and intermediate
| > > | w h i t e s p a c e
| > > | is important to preserve and also good
| > > | where " $ and other special characters
| > > | need not be escaped.
| > > \
|
| OK. We're keeping it simple - so there's no way at all to break long lines
| in a block.
Right -- 4 simple "styles" (mod unicode) rather than one
big complicated mechanism.
Best,
Clark

On Thu, May 31, 2001 at 09:03:37AM +0200, Oren Ben-Kiki wrote:
| I'm confused. I suggested we make the distinction between
| "text" (Unicode characters) and "binary" (byte array).
| Base64 indicates "binary", everything else is text.
Right. Let's call this option "binary switch"
| Are you suggesting that instead we make the distinction
| between "Unicode" and "byte array which may be either
| binary blob or ASCII text"?
Yes. This was Brian's suggestion, let's call this option
the "unicode switch". It is clear that we need a switch.
Let's go over the use cases...
Python - Has a separate ASCII string and Unicode
string. Unicode strings are specially
marked and conversion from ASCII to Unicode
is relatively easy.
Java - Has byte and String, where string is Unicode.
Perl - (no clue)
C++ - Separate ascii (char) and unicode (wchar_t)
strings.
Python and C/C++ fit more strongly with the "Unicode Switch"
where Java is better with "Binary Switch". Perl?
...
On a related note, consider the YAR use case. For the
Unicode switch things work pretty well... ASCII files
are shown as ASCII and UTF-16 nodes are shown as unicode.
Every once in a while a binary file will show up as
a readable ASCII file. But this will be rare.
However, for the Binary switch... YAR cannot operate
without a list of "known text extensions" where all
other files (regardless of how ASCII they look)
must be treated as binary to avoid possible mangling!
Thus, from a "yar" perspective, I like the Unicode
Switch much better...
| Where the only syntax for Unicode is 'single quoted text'?
|
| I don't like it because:
|
| - There's no way to write Unicode blocks.
| - You are mixing up ASCII text and binary data in the same data type,
| - And separating ASCII and Unicode text for no good reason. Languages are
| evolving towards "text = Unicode", as they should, and UTF-8 makes it very
| easy to deal with Unicode text even in languages such as C.
Yep. I can see these problems. However, then our YAR
use case must have a ".yar" file which includes the list
of "text" extensions. Furthermore, it will have to also
note if the file was stored as UTF-8 or UTF-16. Hmmm.
| As someone working in Israel, and having worked a lot
| with European and Japanese clients, I'm rather sensitive
| to this issue... It is a pain to use second-quality language
| features just because you aren't an English speaker.
I agree here... but let's work through the YAR use
case a bit more; as it seems to be our first really
good application that is easily expressible and
self-contained.
Best,
Clark

Oren Ben-Kiki wrote:
> > Unicode: 'Good for unicode data. It''s just cool
> > to use the "single quote" for this!'
>
> I'm confused. I suggested we make the distinction between "text" (Unicode
> characters) and "binary" (byte array). Base64 indicates "binary", everything
> else is text.
>
> Are you suggesting that instead we make the distinction between "Unicode"
> and "byte array which may be either binary blob or ASCII text"? Where the
> only syntax for Unicode is 'single quoted text'?
To be honest, I lack any real savvy in unicode issues. I merely suggested the
single quote as a preferred syntax, if we were going to separate ascii from
unicode. I probably misunderstood the intent. If it doesn't make sense, then
by all means let's consider all text to be utf8.
>
> I don't like it because:
>
> - There's no way to write Unicode blocks.
> - You are mixing up ASCII text and binary data in the same data type,
> - And separating ASCII and Unicode text for no good reason. Languages are
> evolving towards "text = Unicode", as they should, and UTF-8 makes it very
> easy to deal with Unicode text even in languages such as C.
Although Perl's unicode support isn't the best, I do believe that UTF8 is
the way strings are now assumed to be by default.
> As someone working in Israel, and having worked a lot with European and
> Japanese clients, I'm rather sensitive to this issue... It is a pain to use
> second-quality language features just because you aren't an English speaker.
That makes sense.
I willfully step aside on this issue.
Cheers, Brian
--
perl -le 'use Inline C=>q{SV*JAxH(char*x){return newSVpvf
("Just Another %s Hacker",x);}};print JAxH+Perl'

Brian Ingerson [mailto:briani@...] wrote:
> > Given that we have different types of scalars with
> > different types of constraints the current four
>
> five :)
Sigh :-)
> > have very nice coverage:
> >
> > Unquoted: Good for single or multi-line folded
> > content lacking significant whitespace
> > and having all printables. Good for map
> > scalars. Good for non-escaped content
> > (as long as the first character does not
> > begin with an indicator)
I see the use of "the simplest possible syntax" without any escaping but
with folding. But I find it rather arbitrary that you can't use it (the
multi-line version of it) in lists. Using " forces you to escape any \
characters in it... Is there a chance to allow ':' as an optional prefix for
scalars in lists? That would solve this issue.
Alternatively, it seems we had a notion to allow escaping in Unquoted text
as well. That would unify it with Quoted text - make the " optional, to be
used when there's potential ambiguity, such as multi line text in lists. I'd
rather do that than have a format which is only usable in maps and not in
lists.
Under this proposal, I suggested we switch the block terminating character
to ` instead of \ to remove one case of ambiguity:
is this: \na single line block or an escaped newline?
Of course one could just use surrounding quotes to disambiguate it, I just
thought it would be nice to eliminate one more case of requiring them.
At any rate, I think we've thrashed this to death. I'd rather use one of the
above two forms (optional ':' or unifying quoted and unquoted), but I'll go
with whatever you decide. I promise not to bug you about it again :-)
> > Quoted: "Good for single or multi-line folded
> > content where some of the whitespace is
> > significant and non-printables may be
> > escaped. Good for list scalars. Not good
> > when a lot of whitespace must be escaped
> > or with content with frequent quote usage."
OK.
> > Blocked: | Good where leading and intermediate
> > | w h i t e s p a c e
> > | is important to preserve and also good
> > | where " $ and other special characters
> > | need not be escaped.
> > \
OK. We're keeping it simple - so there's no way at all to break long lines
in a block.
> > Binary: [BASE-64-IS-GOOD-FOR-BINARY-DATA-THAT-CAN-
> OBVIOUSLY-SPAN-MORE-THAN-ONE-LINE]
OK.
> Unicode: 'Good for unicode data. It''s just cool
> to use the "single quote" for this!'
I'm confused. I suggested we make the distinction between "text" (Unicode
characters) and "binary" (byte array). Base64 indicates "binary", everything
else is text.
Are you suggesting that instead we make the distinction between "Unicode"
and "byte array which may be either binary blob or ASCII text"? Where the
only syntax for Unicode is 'single quoted text'?
I don't like it because:
- There's no way to write Unicode blocks.
- You are mixing up ASCII text and binary data in the same data type,
- And separating ASCII and Unicode text for no good reason. Languages are
evolving towards "text = Unicode", as they should, and UTF-8 makes it very
easy to deal with Unicode text even in languages such as C.
As someone working in Israel, and having worked a lot with European and
Japanese clients, I'm rather sensitive to this issue... It is a pain to use
second-quality language features just because you aren't an English speaker.
Have fun,
Oren Ben-Kiki

"Clark C . Evans" wrote:
> At first, I was very weary of the "many ways to do it".
YAML will give rest to your weary soul.
> Given that we have different types of scalars with
> different types of constraints the current four
five :)
> have very nice coverage:
>
> Unquoted: Good for single or multi-line folded
> content lacking significant whitespace
> and having all printables. Good for map
> scalars. Good for non-escaped content
> (as long as the first character does not
> begin with an indicator)
>
> Quoted: "Good for single or multi-line folded
> content where some of the whitespace is
> significant and non-printables may be
> escaped. Good for list scalars. Not good
> when a lot of whitespace must be escaped
> or with content with frequent quote usage."
>
> Blocked: | Good where leading and intermediate
> | w h i t e s p a c e
> | is important to preserve and also good
> | where " $ and other special characters
> | need not be escaped.
> \
>
> Binary: [BASE-64-IS-GOOD-FOR-BINARY-DATA-THAT-CAN-
OBVIOUSLY-SPAN-MORE-THAN-ONE-LINE]
Unicode: 'Good for unicode data. It''s just cool
to use the "single quote" for this!'
If this is YAML; I like it :)
Cheers, Brian
--
perl -le 'use Inline C=>q{SV*JAxH(char*x){return newSVpvf
("Just Another %s Hacker",x);}};print JAxH+Perl'

"Clark C . Evans" wrote:
> On Wed, May 30, 2001 at 01:59:24PM -0700, Brian Ingerson wrote:
> | I'm completely on board with the current proposal (except that
> | I would not allow the absence of double quotes in a multi-line
> | folded stream.)
>
> Ok. This prevents an easy way to enter folded text
> that is not escaped (like HTML), but perhaps can
> make the whole proposal cleaner. Oren, what do
> you think? You seem to want to eliminate one of
> the forms as well -- would reducing the un-quoted
> form to a single-line variant work for you?
> Why not just eliminate all unquoted forms?
Ya know... If we just want to make the rule "Double quotes are optional
unless ambiguity results", that's fine with me.
> After some additional background thought on the unicode
> string proposal, I think I'm in favor.
Good.
> Would it have the
> similar escape sequences as the double quoted string?
Sure. Why not?
(read, "no strong opinion")
> I think the canonical form is where we need to
> spend a bit more time... but unfortunately, I'm
> too busy to push this further... perhaps later
> next week.
I disagree. *Forget* the canonical form for now, we don't really need
it. In fact, it's not even the appropriate time for us to make such
decisions. Let's get a couple reasonable implementations out and decide
canonical forms during a beta period. It's just not right for us to be
the ones determining what the proper ways to use YAML are. Let the users
decide. Power to the people!
Just relax :)
Cheers, Brian
--
perl -le 'use Inline C=>q{SV*JAxH(char*x){return newSVpvf
("Just Another %s Hacker",x);}};print JAxH+Perl'

On Wed, May 30, 2001 at 02:13:52PM -0700, Brian Ingerson wrote:
| "Clark C . Evans" wrote:
| > Also, I'm not sure that I like limiting unquoted streams to
| > a single line. Being able to cut/paste in a HTML text without
| > having to escape the quotes and such is very valuable.
|
| I see your point. But I think disallowing them in lists is confusing.
| I'll go either way here.
At first, I was very weary of the "many ways to do it".
Given that we have different types of scalars with
different types of constraints the current four
have very nice coverage:
Unquoted: Good for single or multi-line folded
content lacking significant whitespace
and having all printables. Good for map
scalars. Good for non-escaped content
(as long as the first character does not
begin with an indicator)
Quoted: "Good for single or multi-line folded
content where some of the whitespace is
significant and non-printables may be
escaped. Good for list scalars. Not good
when a lot of whitespace must be escaped
or with content with frequent quote usage."
Blocked: | Good where leading and intermediate
| w h i t e s p a c e
| is important to preserve and also good
| where " $ and other special characters
| need not be escaped.
\
Binary: [BASE-64-IS-GOOD-FOR-BINARY-DATA]
In short, I see each one as having a particular class
of data it is good at representing. And I think the
normalization rules could nicely choose among them!
The quoted format is the most flexible, but it is
also, probably the least readable.
Is our focus here on "readability"? If so, then I
think all four of the above forms could be important.
Best,
Clark

On Wed, May 30, 2001 at 01:59:24PM -0700, Brian Ingerson wrote:
| > | - Scalar value: as today, allow escaping using \, always allow
| > | it to be multi-line, but the continuation lines must be more
| > | indented than the first line.
| >
| > I understand, however, I think the current division
| > of scalar types is a rather nice balance of concerns.
| > Most programmers will expect \ style escaping
| > within quotes.
|
| I'm completely on board with the current proposal (except that
| I would not allow the absence of double quotes in a multi-line
| folded stream.)
Ok. This prevents an easy way to enter folded text
that is not escaped (like HTML), but perhaps can
make the whole proposal cleaner. Oren, what do
you think? You seem to want to eliminate one of
the forms as well -- would reducing the un-quoted
form to a single-line variant work for you?
Why not just eliminate all unquoted forms?
...
After some additional background thought on the unicode
string proposal, I think I'm in favor. Would it have the
similar escape sequences as the double quoted string?
| > What I'd like to focus on (and what should be added
| > to the spec) is the canonical form. Here
| > are some base line suggestions:
| >
| > 1. Indenting always occurs, i.e. quoted or binary
| > scalars are indented.
|
| Fine.
|
| >
| > 2. The tab setting for indents is 4 characters
|
| Yup.
|
| >
| > 3. When possible the text is word-wrapped to 76
| > characters; leaving for a minimum of 20 characters
| > for scalar's content. Thus, after 14 levels of
| > indentation, text may go beyond 20 characters.
|
| But never on blocks, right?
Right.
| >
| > 4. If leading whitespace occurs on any line within
| > a scalar, then the block format is used.
|
| Good.
|
| >
| > 5. If a character string is longer than 20 characters
| > without having intermediate whitespace, then
| > the quoted format is used.
|
| And also without having any newlines or YAML meta-chars.
| 20 - 30 seems fine.
This one lost me a bit, I'm talking about:
key: this-is-a-long-value-that-can't-be-word-wrapped
canonical form...
key: "this-is-a-long-value-that-can't\
be-word-wrapped"
...
I think the canonical form is where we need to
spend a bit more time... but unfortunately, I'm
too busy to push this further... perhaps later
next week.
Best,
Clark

"Clark C . Evans" wrote:
> Also, I'm not sure that I like limiting unquoted streams to
> a single line. Being able to cut/paste in a HTML text without
> having to escape the quotes and such is very valuable. Further,
> I thought we had agreed on [base64] instead of the back tick.
I see your point. But I think disallowing them in lists is confusing.
I'll go either way here.
Cheers, Brian
--
perl -le 'use Inline C=>q{SV*JAxH(char*x){return newSVpvf
("Just Another %s Hacker",x);}};print JAxH+Perl'

"Clark C . Evans" wrote:
>
> On Wed, May 30, 2001 at 02:56:00PM +0200, Oren Ben-Kiki wrote:
> | Hmmm. If one is serializing a built-in map/list/scalar, you do
> | it normally; when serializing an "Object" it probably gets
> | serialized into a map, with a class attribute. It is only
> | "typed scalars" which are a problem (e.g., a number, a date, etc.).
> | One problem about the proposed class-as-color syntax is that it
> | is rather cumbersome for something like a simple number...
>
> Right... this is half the problem. The other half of the
> problem is that the class should be known before the map's
> content should be loaded. I think I like the class solution
> as it is... I think the round-tripping problems are
> problematic; but perhaps they are not that bad.
Since this is up in the air, I'd like a crack at implementing things as
they stand. I think I can deal with round tripping issues; from Perl's
side anyway.
>
> | Cut&paste implies a tool - an editor or a program. Most editors
> | allow you to trivially indent a group of lines - certainly any
> | editor used for writing YAML had better support it. And a program
> | doing cut&paste can add indentation easily enough.
>
> Ok. I'd rather have it always indented as well.
>
> | I'd rather we stick with strict indentation for everything,
> | including blocks, blobs and double-quoted streams.
>
> Did you consider that an emitter can indent the blob
> and double quoted streams? Thus, indentation isn't
> prevented. And with machine generated YAML (99%),
> the non-indented case is minor. Further, by running
> YAML through a program which puts the file in canonical
> form indenting will definitely be the norm. So... I didn't
> see the harm in allowing this. It doesn't really impact
> the parsing complexity.
Thanks for stating this clearly Clark. I assumed it was understood.
>
> | Separate issue: having both double-quoted and simple scalar
> | values. Here's a proposal to use just one form:
> |
> | - Block: as today, but use ` instead of \ to mark the end line.
>
> I like the \, is there a reason why you wanted to change it?
Agree
>
> | - Scalar value: as today, allow escaping using \, always allow
> | it to be multi-line, but the continuation lines must be more
> | indented than the first line.
>
> I understand, however, I think the current division
> of scalar types is a rather nice balance of concerns.
> Most programmers will expect \ style escaping
> within quotes.
I'm completely on board with the current proposal (except that I would
not allow the absence of double quotes in a multi-line folded stream.)
>
> | - List: as today (no special marker), but allow : as an
> | optional prefix for scalar values.
>
> Well, if you leave the quote type in, then
> the : optional marker won't be needed:
>
> Example:
> @
> Single line scalar
> "Pretty multi line
> text without multiline."
> |
> |Block with leading and
> |trailing new line.
> \
> [Base 64 blob]
Love this!
Um is '[' or ']' used in mime-base64?
>
> | Hmmm. As I see it, we have at least one major issue to resolve (the class
> | issue) plus another where a "dictator hat" may be required (the text
> | syntax). Then we can all start doing development at our own pace...
>
> As for the scalar types... I'm inclined to leave them
> as they are now. It does seem like a nice balance of
> concerns. What I'd like to focus on (and what should
> be added to the spec) is the canonical form. Here
> are some base line suggestions:
>
> 1. Indenting always occurs, i.e. quoted or binary
> scalars are indented.
Fine.
>
> 2. The tab setting for indents is 4 characters
Yup.
>
> 3. When possible the text is word-wrapped to 76
> characters; leaving for a minimum of 20 characters
> for scalar's content. Thus, after 14 levels of
> indentation, text may go beyond 20 characters.
But never on blocks, right?
>
> 4. If leading whitespace occurs on any line within
> a scalar, then the block format is used.
Good.
>
> 5. If a character string is longer than 20 characters
> without having intermediate whitespace, then
> the quoted format is used.
And also without having any newlines or YAML meta-chars.
20 - 30 seems fine.
Cheers, Brian
--
perl -le 'use Inline C=>q{SV*JAxH(char*x){return newSVpvf
("Just Another %s Hacker",x);}};print JAxH+Perl'

On Wed, May 30, 2001 at 01:23:41PM -0700, Brian Ingerson wrote:
| Perl is a little up in the air on Unicode. I believe that recent
| versions use UTF8 by default for *most* string operations. A byte mode
| can be set to default back to the good ole days.
|
| FWIW, I suggest using the single quote for Unicode data. (Get it? *Uni*
| code :-). Using Clark's backtick for binary base-64 data. Using '|,\'
| for blocks. Using double quotes for everything else (folded stream).
| Unquoted streams are a convenience option for single line values only.
Hmm. This is interesting, instead of marking which nodes are
"binary" we mark which nodes are "unicode". Not a bad compromise,
as it is far easier with the language to know if something is
unicode or not.... I like this, it closely mirrors what Python
and C do. It is the unicode which is treated differently,
allowing regular "char" strings and binary to still be used
interchangeably. This has one *big* impact, though. A YAML
document cannot be encoded using UTF-16 (via the BOM), although
leaves can be encoded with UTF-8.
Also, I'm not sure that I like limiting unquoted streams to
a single line. Being able to cut/paste in a HTML text without
having to escape the quotes and such is very valuable. Further,
I thought we had agreed on [base64] instead of the back tick.
I *really* like the unicode idea though... it solves a lot
of problems rather cleanly.
| Classes are a can of worms.
|
| Although I'm not against the color idea, Perl will never embrace the
| idea of having everything be an object. Using getValue() as the normal
| method of retrieving a simple value is just too much work for Perl
| people. A map should just become a hash, plain and simple. I can come up
| with some creative ways of preserving YAML classes in Perl
| transparently, but my plate's too full right now.
A global "class map" could work. It's not pretty...
Alternatively, we can drop the class idea for now.
Along this same line of thought, I was asking myself
if class is equivalent to encoding for scalars...
just wondering. Are they similar constructs?
| > readability. I'd rather we stick with strict indentation for everything,
| > including blocks, blobs and double-quoted streams.
|
| I think it's nice to allow the cut-and-paste and let emitters always
| reformat. But I'll support strict for now.
...
| If you are once again suggesting dropping double quotes, you're
| needlessly going down a road of pain. Double quotes are simple,
| intuitive and will always work for folded text (indented or not).
I think the double quote mechanism stays in... just about
everyone knows how to use them.
| I want to start doing something on this after June 15th or so. I hope we
| can resolve things by then. We were very close last week.
I think we are still very close.
Open Issues
~~~~~~~~~~~
A. Classes
Problem: Round-tripping
1. Drop classes for now
2. Let the implementation worry about them (class map)
B. Unicode vs Binary
1. Introduce a isBinary scalar flag and keep all
scalars unicode. This causes problems with
YAR where we cannot know if a given node is
binary or not, but would like it readable if
it is ASCII.
2. Introduce an isUnicode scalar flag. Limit
encoding to ASCII for regular strings and to
UTF-8 within single quoted strings.
...
I think I'd pick A2 and B2 at this time.
Clark
P.S. One of the problems often cited with UTF-8
is that it is verbose. We could clearly offer
the option that YAML texts are "gzipped" and
require that a parser know how to gunzip.

On Wed, May 30, 2001 at 02:56:00PM +0200, Oren Ben-Kiki wrote:
| Hmmm. If one is serializing a built-in map/list/scalar, you do
| it normally; when serializing an "Object" it probably gets
| serialized into a map, with a class attribute. It is only
| "typed scalars" which are a problem (e.g., a number, a date, etc.).
| One problem about the proposed class-as-color syntax is that it
| is rather cumbersome for something like a simple number...
Right... this is half the problem. The other half of the
problem is that the class should be known before the map's
content should be loaded. I think I like the class solution
as it is... I think the round-tripping problems are
problematic; but perhaps they are not that bad.
| Cut&paste implies a tool - an editor or a program. Most editors
| allow you to trivially indent a group of lines - certainly any
| editor used for writing YAML had better support it. And a program
| doing cut&paste can add indentation easily enough.
Ok. I'd rather have it always indented as well.
| I'd rather we stick with strict indentation for everything,
| including blocks, blobs and double-quoted streams.
Did you consider that an emitter can indent the blob
and double quoted streams? Thus, indentation isn't
prevented. And with machine generated YAML (99%),
the non-indented case is minor. Further, by running
YAML through a program which puts the file in canonical
form indenting will definitely be the norm. So... I didn't
see the harm in allowing this. It doesn't really impact
the parsing complexity.
| Separate issue: having both double-quoted and simple scalar
| values. Here's a proposal to use just one form:
|
| - Block: as today, but use ` instead of \ to mark the end line.
I like the \, is there a reason why you wanted to change it?
| - Scalar value: as today, allow escaping using \, always allow
| it to be multi-line, but the continuation lines must be more
| indented than the first line.
I understand, however, I think the current division
of scalar types is a rather nice balance of concerns.
Most programmers will expect \ style escaping
within quotes.
| - List: as today (no special marker), but allow : as an
| optional prefix for scalar values.
Well, if you leave the quote type in, then
the : optional marker won't be needed:
Example:
@
Single line scalar
"Pretty multi line
text without multiline."
|
|Block with leading and
|trailing new line.
\
[Base 64 blob]
| Hmmm. As I see it, we have at least one major issue to resolve (the class
| issue) plus another where a "dictator hat" may be required (the text
| syntax). Then we can all start doing development at our own pace...
I still don't know what to do about the class issue, but I
think that there are solutions for this. Let us take a
language with only maps, lists, and scalars. An external
"class map" could be put in place by the YAML load/save
mechanism. Thus, as maps/lists/scalars are loaded,
entries in this class map could be made. Then when
the objects are serialized, the class could be written
back out. Therefore, I think that a round-trip ability
is possible, it just may not be the most obvious solution.
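A minimal sketch of the class map idea, in Python. Nothing here is from the
spec: ClassMap, remember() and class_of() are made-up names, and the sketch
assumes the loader hands back ordinary maps/lists/scalars and tracks the
class name on the side, keyed by object identity.

```python
class ClassMap:
    """Side table from loaded nodes to class names (hypothetical helper)."""

    def __init__(self):
        # Keyed by id(); the node is stored too so its id stays valid.
        self._entries = {}

    def remember(self, node, class_name):
        self._entries[id(node)] = (node, class_name)

    def class_of(self, node):
        entry = self._entries.get(id(node))
        return entry[1] if entry is not None else None


# The loader builds plain maps/lists/scalars and records classes separately.
classes = ClassMap()
delivery = {"date": "2000-JAN-10"}
classes.remember(delivery, "date")

# The emitter looks the class up again when the node is written back out.
print(classes.class_of(delivery))
```

An identity-keyed table like this is only reliable for maps and lists.
Immutable scalars may be shared or copied by the language runtime, which is
the "typed scalar" case that remains hard.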
As for the scalar types... I'm inclined to leave them
as they are now. It does seem like a nice balance of
concerns. What I'd like to focus on (and what should
be added to the spec) is the canonical form. Here
are some base line suggestions:
1. Indenting always occurs, i.e. quoted or binary
scalars are indented.
2. The tab setting for indents is 4 characters
3. When possible the text is word-wrapped to 76
characters, leaving a minimum of 20 characters
for the scalar's content. Thus, after 14 levels of
indentation, text may go beyond 76 characters.
4. If leading whitespace occurs on any line within
a scalar, then the block format is used.
5. If a character string is longer than 20 characters
without having intermediate whitespace, then
the quoted format is used.
etc.
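The baseline rules above can be sketched roughly as follows, in Python. The
function names are hypothetical, and the fallback to a "simple" style when
neither rule 4 nor rule 5 applies is an assumption, not something the list
above states.

```python
INDENT = 4        # rule 2: columns per indentation level
LINE_WIDTH = 76   # rule 3: preferred wrap column
MIN_CONTENT = 20  # rule 3: columns always left for scalar content


def content_width(depth):
    """Columns available for scalar content at a given nesting depth."""
    return max(LINE_WIDTH - INDENT * depth, MIN_CONTENT)


def choose_style(text):
    """Pick a scalar style per rules 4 and 5 (the default is assumed)."""
    # Rule 4: leading whitespace on any line forces the block format.
    if any(line[:1] in (" ", "\t") for line in text.split("\n")):
        return "block"
    # Rule 5: a run of more than 20 non-whitespace characters is quoted.
    if any(len(word) > 20 for word in text.split()):
        return "quoted"
    return "simple"


print(content_width(0), content_width(14), content_width(15))
print(choose_style("plain text"), choose_style("  indented line"))
```

At depth 14 the indent uses 56 columns, leaving exactly the 20-column
minimum; any deeper and a line runs past column 76.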
Best,
Clark

Oren Ben-Kiki wrote:
>
Sorry guys. I'm a little busy this week. I started to reply to Oren's
message yesterday, but abandoned it because I couldn't give it the
required brain power. I'll try to comment quickly.
> Clark C . Evans [mailto:cce@...] wrote:
> > On Tue, May 29, 2001 at 07:06:42PM +0200, Oren Ben-Kiki wrote:
> > | Supposing the information model does make this distinction
> >
> > Ok. Let's make the binary/unicode distinction in
> > the information model.
>
> Brian? Can you say a word on how this would work in Perl?
Perl is a little up in the air on Unicode. I believe that recent
versions use UTF8 by default for *most* string operations. A byte mode
can be set to default back to the good ole days.
FWIW, I suggest using the single quote for Unicode data. (Get it? *Uni*
code :-). Using Clark's backtick for binary base-64 data. Using '|,\'
for blocks. Using double quotes for everything else (folded streams).
Unquoted streams are a convenience option for single line values only.
>
> > | === Classes and Color ===
> > ...
> > I'd like to hear brian's perspective on this one. I actually
> > kinda like it, although I'm partial since we had talked about it
> > for a long time on the sml-dev list. We would need to push forward
> > on the API notion of "getValue()" as returning the node with
> > a blank key. Building this mechanism in is very cool since it
> > allows for "schema substitutability". However, I'm not sure
> > how well it works for data serialization....
>
> Hmmm. If one is serializing a built-in map/list/scalar, you do it normally;
> when serializing an "Object" it probably gets serialized into a map, with a
> class attribute. It is only "typed scalars" which are a problem (e.g., a
> number, a date, etc.). One problem about the proposed class-as-color syntax
> is that it is rather cumbersome for something like a simple number...
Classes are a can of worms.
Although I'm not against the color idea, Perl will never embrace the
idea of having everything be an object. Using getValue() as the normal
method of retrieving a simple value is just too much work for Perl
people. A map should just become a hash, plain and simple. I can come up
with some creative ways of preserving YAML classes in Perl
transparently, but my plate's too full right now.
(I'm trying to be a famous Perl guy in other realms, ya know ;)
>
> > (on a technical note, I think "", a blank key, should be the
> > default value for a node, and perhaps "__class__" could be
> > the class name).
>
> I rather like '=' for 'value' - it has the right intuitive semantics.
> '__class__' for type is too verbose for my taste.
Agree.
> > The primary rationale for the [ and " formats breaking
> > the indentation rules is to allow for easy cut and paste.
>
> I don't buy that. Cut&paste implies a tool - an editor or a program. Most
> editors allow you to trivially indent a group of lines - certainly any
> editor used for writing YAML had better support it. And a program doing
> cut&paste can add indentation easily enough.
>
> Besides, if "raw" cut&paste is that important, doesn't it apply to blocks as
> well? we can simply use:
>
> block: |===arbitrary marker line===
> block text,
> raw, not indented at all,
> no prefix for the lines,
> cut & paste into it whichever way you want.
> |===arbitrary marker line===
> \===arbitrary marker line (if no trailing newline)===
>
> Perl people know this as the '<<EOF' approach :-) This would leave simple
> scalars as the only type of value which must be indented - and given that
> most of these are single-line, that hardly matters. You'd end up with files
> where every multi-line value is not indented, which I think will really ruin
> readability. I'd rather we stick with strict indentation for everything,
> including blocks, blobs and double-quoted streams.
I think it's nice to allow the cut-and-paste and let emitters always
reformat. But I'll support strict for now.
>
> Separate issue: having both double-quoted and simple scalar values. Here's a
> proposal to use just one form:
>
> - Block: as today, but use ` instead of \ to mark the end line.
> - Scalar value: as today, allow escaping using \, always allow it to be
> multi-line, but the continuation lines must be more indented than the first
> line.
> - List: as today (no special marker), but allow : as an optional prefix for
> scalar values.
>
> Example:
>
> @
> Ugly multi
> line text
> Pretty single line text with NL\n
> : Ugly single line text w/o NL
> : Pretty multi
> line text
> |multi line
> |block with NL
> `
> |multi line
> `block w/o NL\n
> [base64
> blob]
>
> Everything is always strictly python-indented.
Whatever :(
If you are once again suggesting dropping double quotes, you're
needlessly going down a road of pain. Double quotes are simple,
intuitive and will always work for folded text (indented or not).
>
> > I'm not going to be able to put much more time into
> > this in the next few months, as I'm a key player
> > in a start-up, xgenda.com, which hopes to publicly
> > launch our product by September.
>
> Hmmm. As I see it, we have at least one major issue to resolve (the class
> issue) plus another where a "dictator hat" may be required (the text
> syntax). Then we can all start doing development at our own pace...
I'll volunteer as dictator :)
I want to start doing something on this after June 15th or so. I hope we
can resolve things by then. We were very close last week.
Cheers, Brian
--
perl -le 'use Inline C=>q{SV*JAxH(char*x){return newSVpvf
("Just Another %s Hacker",x);}};print JAxH+Perl'

Clark C . Evans [mailto:cce@...] wrote:
> On Tue, May 29, 2001 at 07:06:42PM +0200, Oren Ben-Kiki wrote:
> | Supposing the information model does make this distinction
>
> Ok. Let's make the binary/unicode distinction in
> the information model.
Brian? Can you say a word on how this would work in Perl?
> | === Classes and Color ===
> ...
> I'd like to hear brian's perspective on this one. I actually
> kinda like it, although I'm partial since we had talked about it
> for a long time on the sml-dev list. We would need to push forward
> on the API notion of "getValue()" as returning the node with
> a blank key. Building this mechanism in is very cool since it
> allows for "schema substitutability". However, I'm not sure
> how well it works for data serialization....
Hmmm. If one is serializing a built-in map/list/scalar, you do it normally;
when serializing an "Object" it probably gets serialized into a map, with a
class attribute. It is only "typed scalars" which are a problem (e.g., a
number, a date, etc.). One problem about the proposed class-as-color syntax
is that it is rather cumbersome for something like a simple number...
> (on a technical note, I think "", a blank key, should be the
> default value for a node, and perhaps "__class__" could be
> the class name).
I rather like '=' for 'value' - it has the right intuitive semantics.
'__class__' for type is too verbose for my taste.
> The primary rationale for the [ and " formats breaking
> the indentation rules is to allow for easy cut and paste.
I don't buy that. Cut&paste implies a tool - an editor or a program. Most
editors allow you to trivially indent a group of lines - certainly any
editor used for writing YAML had better support it. And a program doing
cut&paste can add indentation easily enough.
Besides, if "raw" cut&paste is that important, doesn't it apply to blocks as
well? we can simply use:
block: |===arbitrary marker line===
block text,
raw, not indented at all,
no prefix for the lines,
cut & paste into it whichever way you want.
|===arbitrary marker line===
\===arbitrary marker line (if no trailing newline)===
Perl people know this as the '<<EOF' approach :-) This would leave simple
scalars as the only type of value which must be indented - and given that
most of these are single-line, that hardly matters. You'd end up with files
where every multi-line value is not indented, which I think will really ruin
readability. I'd rather we stick with strict indentation for everything,
including blocks, blobs and double-quoted streams.
Separate issue: having both double-quoted and simple scalar values. Here's a
proposal to use just one form:
- Block: as today, but use ` instead of \ to mark the end line.
- Scalar value: as today, allow escaping using \, always allow it to be
multi-line, but the continuation lines must be more indented than the first
line.
- List: as today (no special marker), but allow : as an optional prefix for
scalar values.
Example:
@
Ugly multi
line text
Pretty single line text with NL\n
: Ugly single line text w/o NL
: Pretty multi
line text
|multi line
|block with NL
`
|multi line
`block w/o NL\n
[base64
blob]
Everything is always strictly python-indented.
> I'm not going to be able to put much more time into
> this in the next few months, as I'm a key player
> in a start-up, xgenda.com, which hopes to publicly
> launch our product by September.
Hmmm. As I see it, we have at least one major issue to resolve (the class
issue) plus another where a "dictator hat" may be required (the text
syntax). Then we can all start doing development at our own pace...
Have fun,
Oren Ben-Kiki

On Tue, May 29, 2001 at 07:06:42PM +0200, Oren Ben-Kiki wrote:
| Supposing the information model does make this distinction
Ok. Let's make the binary/unicode distinction in
the information model.
| === Top Level Production ===
|
| I see we are back to "list of maps, separated by blank lines". Great. The
| API basics given in the YAML spec aren't explicit on how this translates to
| actual calls. I'd expect that the "next()" calls at the top-level of the
| parser will return a map node, one per each "map block" in the file. As for
| the emitter, I'd expect that repeatedly calling "begin()" and "end()" on the
| top-level cursor will emit multiple map blocks, but the text doesn't seem to
| support this. I think that this should be made explicit.
Ok. When I get time I'll make it more clear.
| === Classes and Color ===
|
| Clark made a good case against comments - we can't use them because either
| they aren't part of the information model, and won't round-trip; or they
| round-trip and hence must be a part of the data model. He's right.
|
| The same argument carries over to classes. The current spec states that if a
| class is not recognized, a warning is emitted and the class is ignored. This
| is unacceptable - consider a YAML pretty-printer, it will recognize a very
| small set of classes, if any. Yet it must preserve the class names.
|
| The problem seems a classical place to use the color idiom. We have a piece
| of information - the class name - which we want to attach to any nodes in
| the YAML document, including scalar nodes. Some applications (the pretty
| printer) aren't interested in this information, but must preserve it. Other
| applications (e.g., a YAML-based application server) rely on this
| information.
|
| There's no problem for attaching such information to map nodes. Use some
| special key for the class info - say, '#' - and you are done. For scalar and
| list nodes, the color idiom suggests you wrap them in a map node. Use the
| same key for specifying your "color", and another special key (say, '=') to
| specify the *value* of the node:
|
| delivery: %
| = : 2000-JAN-10
| # : date
|
| The trick is that the random-access APIs should allow you to call
| "getValue()" on the 'delivery' node and obtain the date value. This is
| acceptable to an interface such as XML's DOM. It probably could work in
| languages dynamic enough to hack the type system into doing this implicitly:
| de-serialize 'delivery' into something which *behaves like* the string
| '2000-JAN-10' but is *at the same time* a map with two keys.
|
| Complex? Sure, but Brian said, "don't worry about the implementation" :-)
|
| Seriously, I think there's cause to worry here. The color idiom has turned
| out to be the solution for many practical problems - not the least of which
| is schema evolution. Maybe you guys can come up with a better way to do it
| than wrapping scalar/list nodes in a map; I couldn't find any. But I
| strongly feel that YAML should address this somehow - and that classes are
| the perfect way to "eat our own dog food" in this regard.
I'd like to hear brian's perspective on this one. I actually
kinda like it, although I'm partial since we had talked about it
for a long time on the sml-dev list. We would need to push forward
on the API notion of "getValue()" as returning the node with
a blank key. Building this mechanism in is very cool since it
allows for "schema substitutability". However, I'm not sure
how well it works for data serialization....
(on a technical note, I think "", a blank key, should be the
default value for a node, and perhaps "__class__" could be
the class name).
| Do we really need so many different formats? *4*? I see why we'd want base64
| - that's not really a text format, anyway. I can see why we'd want one
| format for cut-and-paste and one format for allowing escaping. That would be
| the block format and either the string or the simple scalar format. There's
| also the issue of multi-line values allowed in a map but not in a list. So
| we actually have 4.5 formats (3.5 for text, one for blobs).
|
| Do we really need all of these? Can't we make do with just one block format
| and one stream format?
|
| I also strongly dislike the fact that indentation rules are suspended in a
| string value. Why is that? I see the pain - one won't be able to, say,
| quickly skip a sub-tree by simply looking for a properly indented line;
| readability suffers, etc.
|
| I don't see the gain: Cut&paste by itself isn't a good enough answer - if
| you are doing it by hand, any editor will easily indent the new block of
| text, and if you are doing it by a tool, well just add the indentation while
| you are at it.
|
| In E-mail messages Clark said that base64 blocks are indented, but the spec
| says otherwise. Which is right? What's the harm in indenting blobs? We
| already allow long lines in blocks... in fact, there's no longer a way to
| break a long line in a block into two shorter ones.
|
| I realize you had just about enough of this syntax issue, and that I'm fresh
| out of a 10 day vacation from it, but I think that either some strong
| rationale or some more work on this are due. I'll try to see if I can come
| up with any new ideas...
The primary rationale for the [ and " formats breaking
the indentation rules is to allow for easy cut and paste.
| I'd rather you list me as oren@...
Fixed.
...
I'm not going to be able to put much more time into
this in the next few months, as I'm a key player
in a start-up, xgenda.com, which hopes to publicly
launch our product by September.
Best,
Clark

Hi Guys,
I'm back at last from my convention (which was lots of fun, and educational
too). I finally had the time to catch up on the 50 messages in this list
since I left... You two have sure been busy, and doing a great job.
I also went through the YAML spec draft - great going, Clark - first "on its
own" and second after reviewing the messages for context. So, at the risk of
this becoming a long message :-), here's my view of the current state.
=== Rationale ===
A good rationale is harder to write than the original spec. On the other
hand, a spec with a good rationale is much more useful. I think we should
give this some thought. We have the advantage that we don't meet face-to-face, so
this list provides a good record of how each point was decided...
=== Encoding/Binary ===
It seems as though the problem here is that we are trying to handle two
separate "data types" in the same way. A string of (Unicode) characters is a
different beast than a binary blob of bytes.
We have decided that YAML will be "printable" - that is, a YAML file is a
string, using only the "printable" subset of the Unicode characters.
Implicitly this means it may contain characters beyond 7-bit ASCII - but
these are still Unicode characters and not arbitrary binary bytes.
Therefore, when a binary blob is written into YAML format it must be
encoded. Clark suggested using base64 as the universal way to achieve that.
This seems reasonable, with the objection that it is a horribly wasteful way
to encode binary blobs under UTF-16. I wonder if there is a base4096 or
something for that case? I never heard of one...
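To put a rough number on "wasteful", here is a small Python check. It
assumes nothing beyond the standard base64 module; the 768-byte blob is
arbitrary sample data. Base64 produces 4 characters per 3 bytes, and each
of those characters costs two bytes in UTF-16.

```python
import base64

blob = bytes(range(256)) * 3  # 768 arbitrary bytes, a multiple of 3
encoded = base64.b64encode(blob).decode("ascii")

utf8_size = len(encoded.encode("utf-8"))
utf16_size = len(encoded.encode("utf-16-le"))  # little-endian, no BOM

# Ratio of stored bytes to original bytes under each stream encoding.
print(utf8_size / len(blob))   # 4 stored bytes per 3 blob bytes
print(utf16_size / len(blob))  # 8 stored bytes per 3 blob bytes
```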
At any rate, note that you can't achieve the same effect by using something
like "\xXX" or "\uUUUU". These escape sequences still denote text *characters*,
not blob *bytes*. This distinction isn't automatic to old-time C programmers
(like myself) used to "char == byte". But it is essential to make it,
otherwise things get very messy.
Clark suggested that all scalar syntaxes are to be equivalent - it doesn't
matter if one is using base64, quoted string, block or "simple value"; the
in-memory result is the same. As shown above, this can't be true for
*writing* values; a binary blob may only be written in base64. Why,
therefore, not make the same distinction when *reading* values?
Supposing the information model does make this distinction, then Brian's
(wonderful) YAR format becomes possible, regardless of encoding. If a file
contains only or "mostly" printable characters, then emit it as a text
block. Otherwise, emit it as a base64 blob. This requires a two-pass
algorithm through the file, but I think there's no helping that. In the Mac
and Be, if the MIME type is text/*, emit it as a block, otherwise as a
base64 blob.
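A sketch of that "mostly printable" test, in Python. The function name, the
0.95 threshold, and the choice of UTF-8 as the trial decoding are all
assumptions made for illustration; the message above does not fix any of
them.

```python
import base64


def emit_file_content(data, threshold=0.95):
    """Return (style, text) for a file's bytes; names are hypothetical."""
    try:
        text = data.decode("utf-8")  # first pass: is it text at all?
    except UnicodeDecodeError:
        return "base64", base64.b64encode(data).decode("ascii")
    if not text:
        return "block", text
    # Second pass: count printable characters, allowing newline and tab.
    printable = sum(ch.isprintable() or ch in "\n\t" for ch in text)
    if printable / len(text) >= threshold:
        return "block", text
    return "base64", base64.b64encode(data).decode("ascii")


print(emit_file_content(b"Contents of\nthe file.\n")[0])
print(emit_file_content(b"\xff\xfe\x00\x01")[0])
```

Note that a file judged "mostly" printable would still need its stray
control characters escaped somehow to round-trip byte-for-byte, which this
sketch does not attempt.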
How to make this distinction in the data model is an open issue, of course.
In Java, which was born Unicode-aware, there's no problem distinguishing
between a String and a byte[]. I assume something similar may be done in
Python and Perl. C is trickier - I don't quite see how to work it into the
API - but C programmers already know that "char = byte" doesn't mean "string
= blob"; a string ends with a \0, a blob has a length.
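In Python the same split exists as str versus bytes, so an emitter can
dispatch on the type instead of guessing. This is only a sketch:
emit_scalar is a made-up name, the [ ] wrapper follows the base64 notation
proposed elsewhere in this thread, and the quoting is deliberately
simplistic.

```python
import base64


def emit_scalar(value):
    """Emit bytes as a [base64] blob and str as a quoted string."""
    if isinstance(value, bytes):
        return "[" + base64.b64encode(value).decode("ascii") + "]"
    if isinstance(value, str):
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        return '"' + escaped + '"'
    raise TypeError("expected str or bytes, got " + type(value).__name__)


print(emit_scalar(b"\x00\xe2"))
print(emit_scalar("text"))
```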
=== Top Level Production ===
I see we are back to "list of maps, separated by blank lines". Great. The
API basics given in the YAML spec aren't explicit on how this translates to
actual calls. I'd expect that the "next()" calls at the top-level of the
parser will return a map node, one per each "map block" in the file. As for
the emitter, I'd expect that repeatedly calling "begin()" and "end()" on the
top-level cursor will emit multiple map blocks, but the text doesn't seem to
support this. I think that this should be made explicit.
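Roughly what I mean on the parser side, as a Python sketch. It only splits
the text into blank-line-separated blocks and does not parse them into map
nodes; map_blocks is a hypothetical name, and blank lines inside block
scalars are not handled.

```python
def map_blocks(text):
    """Yield each blank-line-separated block as a list of lines."""
    block = []
    for line in text.splitlines():
        if line.strip():
            block.append(line)
        elif block:
            # A blank line closes the current block, if there is one.
            yield block
            block = []
    if block:
        yield block


stream = "name: first\nsize: 1\n\nname: second\nsize: 2\n"
blocks = map_blocks(stream)
print(next(blocks))  # each next() hands back one top-level block
print(next(blocks))
```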
=== Classes and Color ===
Clark made a good case against comments - we can't use them because either
they aren't part of the information model, and won't round-trip; or they
round-trip and hence must be a part of the data model. He's right.
The same argument carries over to classes. The current spec states that if a
class is not recognized, a warning is emitted and the class is ignored. This
is unacceptable - consider a YAML pretty-printer, it will recognize a very
small set of classes, if any. Yet it must preserve the class names.
The problem seems a classical place to use the color idiom. We have a piece
of information - the class name - which we want to attach to any nodes in
the YAML document, including scalar nodes. Some applications (the pretty
printer) aren't interested in this information, but must preserve it. Other
applications (e.g., a YAML-based application server) rely on this
information.
There's no problem for attaching such information to map nodes. Use some
special key for the class info - say, '#' - and you are done. For scalar and
list nodes, the color idiom suggests you wrap them in a map node. Use the
same key for specifying your "color", and another special key (say, '=') to
specify the *value* of the node:
delivery: %
= : 2000-JAN-10
# : date
The trick is that the random-access APIs should allow you to call
"getValue()" on the 'delivery' node and obtain the date value. This is
acceptable to an interface such as XML's DOM. It probably could work in
languages dynamic enough to hack the type system into doing this implicitly:
de-serialize 'delivery' into something which *behaves like* the string
'2000-JAN-10' but is *at the same time* a map with two keys.
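For instance, in Python one could subclass str. This is an illustration of
the idea only; ColoredScalar and its attribute names are invented here, and
a real binding might look quite different.

```python
class ColoredScalar(str):
    """A string that also answers to the '=' and '#' keys (hypothetical)."""

    def __new__(cls, value, class_name):
        obj = super().__new__(cls, value)
        obj.class_name = class_name
        return obj

    def __getitem__(self, key):
        if key == "=":
            return str(self)        # the plain value
        if key == "#":
            return self.class_name  # the "color"
        return super().__getitem__(key)  # normal string indexing

    def keys(self):
        return ["=", "#"]


delivery = ColoredScalar("2000-JAN-10", "date")
print(delivery == "2000-JAN-10", delivery["#"], delivery.keys())
```

An application that ignores classes sees an ordinary string; one that
cares can ask for the '#' key.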
Complex? Sure, but Brian said, "don't worry about the implementation" :-)
Seriously, I think there's cause to worry here. The color idiom has turned
out to be the solution for many practical problems - not the least of which
is schema evolution. Maybe you guys can come up with a better way to do it
than wrapping scalar/list nodes in a map; I couldn't find any. But I
strongly feel that YAML should address this somehow - and that classes are
the perfect way to "eat our own dog food" in this regard.
=== Text syntax ===
Do we really need so many different formats? *4*? I see why we'd want base64
- that's not really a text format, anyway. I can see why we'd want one
format for cut-and-paste and one format for allowing escaping. That would be
the block format and either the string or the simple scalar format. There's
also the issue of multi-line values allowed in a map but not in a list. So
we actually have 4.5 formats (3.5 for text, one for blobs).
Do we really need all of these? Can't we make do with just one block format
and one stream format?
I also strongly dislike the fact that indentation rules are suspended in a
string value. Why is that? I see the pain - one won't be able to, say,
quickly skip a sub-tree by simply looking for a properly indented line;
readability suffers, etc.
I don't see the gain: Cut&paste by itself isn't a good enough answer - if
you are doing it by hand, any editor will easily indent the new block of
text, and if you are doing it by a tool, well just add the indentation while
you are at it.
In E-mail messages Clark said that base64 blocks are indented, but the spec
says otherwise. Which is right? What's the harm in indenting blobs? We
already allow long lines in blocks... in fact, there's no longer a way to
break a long line in a block into two shorter ones.
I realize you had just about enough of this syntax issue, and that I'm fresh
out of a 10 day vacation from it, but I think that either some strong
rationale or some more work on this are due. I'll try to see if I can come
up with any new ideas...
=== E-mail address ===
I'd rather you list me as oren@... and not orenbk@... since
YAML isn't very related to my day job and also because the other address is
more stable - I hope to keep it for a long time to come, while companies
come and go.
Sorry for the long message - I had a lot of catching up to do.
Have fun,
Oren Ben-Kiki

On Sun, May 27, 2001 at 12:51:55PM -0700, Brian Ingerson wrote:
| usr : #(drwxr-xr-x;root;root) %
| bin : #(drwxr-xr-x;root;root) %
| yar : #(-rwxr-xr-x;root;root)
| `base64encoding of yar`
| local : #(drwxr-xr-x;root;root) %
| file : #(-rw-r--r--;ingy;users)
| |Contents of
| |the file.
| \
This will work as long as each value is written
out without any transcoding which could corrupt
the data. Thus, a simple detection could be used
to convert all strings which use control and
out-of-range characters into base64 and then
back into the same character encoding it
was using before. So... this is possible
with a simple auto-detection for base64
encoding of out-of-range files.
However, the disadvantage is that a specific
encoding would have to be picked. Thus,
UTF-8, ISO8859-1, or UTF-16 most likely.
Then, only those text files in the particular
encoding would read correctly. All of the
other text files may read *incorrectly*,
and, perhaps worse, those "binary" files
which happen to fall in range may show up
as gibberish that isn't base64 encoded.
Your thoughts? There are clear good and
bad things about this...
...
This makes me think of two items:
a) Perhaps we need a source encoding
indicator. This could include the
mime type and/or character set.
b) Perhaps keys should be full-fledged
nodes. We might just limit a key
to scalar and/or references?
But as a scalar, it could have
a class (or an encoding indicator,
as given in option a above).
Just rambling... In my use cases, I *know*
which nodes are binary and which are not.
Therefore, I only require that YAML not
corrupt them...
Best,
Clark

On Sun, May 27, 2001 at 08:08:21PM -0700, Brian Ingerson wrote:
| > 3. I think I'm going to loosen the restrictions on
| > key and class to allow any old binary.
|
| > 4. Allow both "quoted" and [binary] values to be
| > used for classes and keys....
Oops ... a complicator.
Consider a two byte binary value, 0x00E2. Say that
this is asked to be written by the YAML emitter
without knowing if it is binary or unicode. The
emitter should probably assume unicode, and thus
a "small a with circumflex" will be emitted. If the stream
is UTF16, no problem: what was in memory is the
same as what gets written. However, if the stream
is UTF8... and the internal character set was UTF16,
this character will be converted into a two byte
UTF-8 sequence which is definitely *not* 0x00E2.
This is a counter to the idea that the emitter
can determine if something should be binary
encoded or not. The answer is... no.
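The effect is easy to reproduce. A short Python check, assuming big-endian
UTF-16 for the in-memory form:

```python
ch = "\u00e2"  # the character whose UTF-16 code unit is 0x00E2

utf16 = ch.encode("utf-16-be")  # the two bytes as held in memory
utf8 = ch.encode("utf-8")       # the same character after transcoding

print(utf16.hex())  # 00e2
print(utf8.hex())   # c3a2
```

Both are two bytes long, but they are different bytes, so a binary value
that passes through the text path does not survive.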
Therefore, due to transcoding between UTF8 and
UTF16... which is completely unavoidable, we are
left with only three options:
a) Our "values" must have a flag to indicate
if they are binary or not.
b) We must only allow 7-BIT ASCII, and
consider everything else binary
requiring base64 encoding! Let
those people using "funny characters"
fend for themselves.
c) Some combination of indicating plus a
default based on 7-BIT ASCII?
Anyway, (b) is out. Option (c) is interesting,
but I have no clue how it'd work. So I think,
unless some stroke of magic comes along, that
option (a) is the only alternative. In other
words, the serialization interface has a binary
flag and this flag is put into the information
model... since it is impossible to "autodetect".
This being said, your "yar" idea is very neat,
but there is no way (barring the possibility of c)
to detect if a given file is truly binary ('cept
on the Macintosh platform, I've heard).
*sigh*
Clark

"Clark C . Evans" wrote:
> Yea, that is very slick. A yar file wouldn't be all that
> much bigger than a tar. Extremely Cool. We should
> ship this as the "example program".
Definitely.
>
> On that same line of thought, I think I'd like to make
> a few changes to our telephone agreement:
>
> 1. Use [ ] pair to designate base 64 encoded.
> This is just cosmetic. It looks better, IMHO.
> (see the specification) Do you agree?
Like it.
> 2. I didn't like the idea of having two categories
> of classes. Thus, I think the "reserved" classes
> will start with a leading period. This seems to
> be very readable: #.int The problem with having
> two classes is that it interferes with the API...
I like the '.' thing.
I just came back to this after reading the whole thing. Perhaps #".int"
needs to be semantically different than #.int. To be deterministic.
>
> Further having ( ) playing a similar role as
> quotes kinda irks me. See below.
>
> 3. I think I'm going to loosen the restrictions on
> key and class to allow any old binary.
This is actually good for Perl. Denter can't even handle that. :)
> 4. Allow both "quoted" and [binary] values to be
> used for classes and keys....
>
> "some:key" : this key is quoted.
> [DKEF543=] : this key is base64 (binary).
Excellent.
> "key" : this and the following
> key : have the same key, and thus are
> illegal when put together in
> a single map.
OK. Illegal is not the Perl way. But I'm not against starting strict.
> first : #class These three
> second : #"class" have the same
> third : #[base-64-of-class] class
Wonderful.
>
> The last two changes don't make things "worse", they
> actually make them a bit easier...
>
> Thoughts?
I think you and I think very much alike :) I really like your changes.
Make it so, Brian
--
perl -le 'use Inline C=>q{SV*JAxH(char*x){return newSVpvf
("Just Another %s Hacker",x);}};print JAxH+Perl'

On Sun, May 27, 2001 at 12:51:55PM -0700, Brian Ingerson wrote:
| I just realized that YAML could be used to "tar" up an entire Unix
| directory structure and transport it. And the content of course would be
| human readable.
|
| usr : #(drwxr-xr-x;root;root) %
| bin : #(drwxr-xr-x;root;root) %
| yar : #(-rwxr-xr-x;root;root)
| `base64encoding of yar`
| local : #(drwxr-xr-x;root;root) %
| file : #(-rw-r--r--;ingy;users)
| |Contents of
| |the file.
| \
|
| Pretty neat eh. I wonder how large it would be after gzip?
Yea, that is very slick. A yar file wouldn't be all that
much bigger than a tar. Extremely Cool. We should
ship this as the "example program". Very Neat.
On that same line of thought, I think I'd like to make
a few changes to our telephone agreement:
1. Use [ ] pair to designate base 64 encoded.
This is just cosmetic. It looks better, IMHO.
(see the specification) Do you agree?
2. I didn't like the idea of having two categories
of classes. Thus, I think the "reserved" classes
will start with a leading period. This seems to
be very readable: #.int The problem with having
two classes is that it interferes with the API...
Further having ( ) playing a similar role as
quotes kinda irks me. See below.
3. I think I'm going to loosen the restrictions on
key and class to allow any old binary.
4. Allow both "quoted" and [binary] values to be
used for classes and keys....
"some:key" : this key is quoted.
[DKEF543=] : this key is base64 (binary).
"key" : this and the following
key : have the same key, and thus are
illegal when put together in
a single map.
first : #class These three
second : #"class" have the same
third : #[base-64-of-class] class
The last two changes don't make things "worse", they
actually make them a bit easier...
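A sketch of how a loader might enforce the "same key" rule across the three
notations, in Python. The token grammar is assumed for illustration
(surrounding quotes or brackets only, no escapes), UTF-8 is assumed for
turning text keys into bytes, and decode_key and check_unique are made-up
names.

```python
import base64


def decode_key(token):
    """Normalize a bare, "quoted" or [base64] key token to bytes."""
    if len(token) >= 2 and token.startswith('"') and token.endswith('"'):
        return token[1:-1].encode("utf-8")
    if token.startswith("[") and token.endswith("]"):
        return base64.b64decode(token[1:-1])
    return token.encode("utf-8")


def check_unique(tokens):
    """Raise ValueError if two tokens in one map denote the same key."""
    seen = set()
    for token in tokens:
        key = decode_key(token)
        if key in seen:
            raise ValueError("duplicate key: " + token)
        seen.add(key)


check_unique(['"some:key"', "key"])  # distinct keys, accepted
try:
    check_unique(['"key"', "key"])   # same key written two ways
except ValueError as err:
    print(err)
```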
Thoughts?
Best,
Clark

I just realized that YAML could be used to "tar" up an entire Unix
directory structure and transport it. And the content of course would be
human readable.
usr : #(drwxr-xr-x;root;root) %
bin : #(drwxr-xr-x;root;root) %
yar : #(-rwxr-xr-x;root;root)
`base64encoding of yar`
local : #(drwxr-xr-x;root;root) %
file : #(-rw-r--r--;ingy;users)
|Contents of
|the file.
\
Pretty neat eh. I wonder how large it would be after gzip?
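A rough Python sketch of the collecting side, just to show the shape. It
builds a nested dict rather than emitting YAML, base64-encodes every file
for simplicity, and leaves out owner and group; yar_tree and the "mode",
"entries" and "content" field names are invented for this example.

```python
import base64
import os
import stat


def yar_tree(path):
    """Build a nested dict for path: mode string plus entries or content."""
    st = os.lstat(path)
    node = {"mode": stat.filemode(st.st_mode)}  # e.g. 'drwxr-xr-x'
    if stat.S_ISDIR(st.st_mode):
        node["entries"] = {
            name: yar_tree(os.path.join(path, name))
            for name in sorted(os.listdir(path))
        }
    elif stat.S_ISREG(st.st_mode):
        with open(path, "rb") as handle:
            node["content"] = base64.b64encode(handle.read()).decode("ascii")
    return node


tree = yar_tree(os.curdir)
print(tree["mode"], sorted(tree.get("entries", {}))[:5])
```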
Cheers, Brian
PS We should keep a list of potential uses for YAML.
--
perl -le 'use Inline C=>q{SV*JAxH(char*x){return newSVpvf
("Just Another %s Hacker",x);}};print JAxH+Perl'