Activity

Scott Carey
added a comment - 07/Mar/12 21:20

I agree with the principle of 'generate strictly, accept liberally' with respect to schema names.
Implementations should restrict names to simple ASCII characters that match the pattern in the spec in almost all cases. Implementations should accept schemas as liberally as possible, up to any name that does not include a '.'. Being able to pass through a schema from another system is a powerful use case.
Later we can add to the specification how to encode '.' in a name for pass through encoding use cases if it ever comes up.

Doug Cutting
added a comment - 06/Mar/12 16:56

Raymie, this is a good approach. The spec language that requires ASCII should be changed from MUST to SHOULD.
One use case that Scott mentioned but your prose does not cover is transmitting schemas from other systems; e.g., Avro schemas might often be generated automatically from Pig or SQL schemas. In these cases accepting liberally permits schemas to pass through Avro losslessly. Strict validation is really only useful when a developer is the schema author. In many (most?) cases Avro might be an underlying tool, used indirectly through an application, and in these cases strict validation is probably not useful.

Raymie Stata
added a comment - 03/Mar/12 16:50

Based on the dialog above, I propose that we turn this into a set of recommendations for Avro implementers and app developers. I've attached a draft of these recommendations; if folks like them, I can turn them into a patch for the spec.
The basic idea is as follows. We cast this as an "interoperability" problem. On the one hand, the current spec for names facilitates language interoperability, and we shouldn't change it. On the other hand, we should apply the Robustness principle: accept liberally, generate strictly. (Looking closely at Scott's and Doug's comments, I think this is what they're really getting at.)
In our context, Avro implementations accept schemas, Avro app developers generate them. Thus, we recommend that Avro implementations, by default, accept schemas liberally. This means accepting arbitrary Unicode strings for names (but with no normalization or other Unicode processing). At the same time, Avro developers who want language interoperability are exhorted to "generate strictly" and, in particular, use only names that follow the strict definition in the Avro spec, because those names are easily handled by almost any language out there.
To help developers "generate strictly," we recommend that Avro implementations provide optional mechanisms to perform schema validation, preferably as early in the dev process as possible (eg, unit-test or even compile time). (For developers who don't care about language interop, they can ignore these mechanisms, and Avro implementations, being liberal by default, should happily consume their input.)
The file I just posted says the above in a little more detail.
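One possible shape for such a validation mechanism, usable from a unit test, might look like the sketch below. The class and method names are invented for illustration and are not Avro APIs; the point is that the parser stays liberal while a developer who wants interop can opt into strictness.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch: an optional strictness check a developer can run at test time,
// while the parser itself stays liberal by default.
public class SchemaNameLint {
    /** Returns the names that do not follow the spec's strict definition. */
    public static List<String> nonConformant(List<String> names) {
        List<String> bad = new ArrayList<>();
        for (String n : names) {
            if (!n.matches("[A-Za-z_][A-Za-z0-9_]*")) {
                bad.add(n);
            }
        }
        return bad;
    }
}
```

A project's unit tests could fail the build whenever `nonConformant` returns a non-empty list, catching interop hazards at the earliest possible point in the dev process.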

Scott Carey
added a comment - 13/Feb/12 22:29

As late as possible also makes transitive equivalence easier. If Hive and Pig escaped the same string differently when writing into an Avro file, then that Avro file will not be shareable between them as easily. This is why I feel that the Avro spec should define how names are escaped. I think the only code points that truly need escaping are the namespace separator and the escape character itself.
A single source string recognized as equivalent in two different external systems should become equivalent escaped strings if each of those systems uses a different Avro language binding to escape and store a schema.

Doug Cutting
added a comment - 13/Feb/12 18:13

> If names are restricted, then consuming schemas from other systems will be difficult.
Good point. The question is where the escaping burden lies: either with adapter layers (e.g., in Pig or Hive) or in the code generation layer. I'd argue that the code generation layer already has to handle reserved words, so adding character escaping is not a significant burden there. It's also safer not to assume that other implementations have correctly escaped all names, i.e., to be tolerant. Finally, escaping as late as possible maximizes legibility through the system.

Scott Carey
added a comment - 12/Feb/12 22:48

I see the wisdom in restricting names to be a simple set of ASCII characters. Until just a few minutes ago the arguments above were convincing me that the [A-Za-z_][A-Za-z0-9_]+ name format was a very useful simplification.
But now I think names should be almost entirely open. Defining "isLetter() or isDigit()" is problematic as pointed out above, so don't even bother with that. How about defining it only with respect to ASCII: the naming rule in the spec would apply to ASCII only, and all other code points would be allowed. Unlike some notion of isLetter(), this does not imply that C or C++ needs a big library like ICU. All implementations must already support UTF-8 in order to support JSON. Languages can define internally how they map messy names to variables, types, or enum symbols.
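For reference, the strict spec pattern is cheap to check. Here is a minimal sketch in Java; the class and method names are invented for illustration and are not Avro APIs.

```java
import java.util.regex.Pattern;

// Sketch only: validates a single name segment (not a dotted fullname)
// against the spec's [A-Za-z_][A-Za-z0-9_]* grammar.
public class NameValidator {
    private static final Pattern SPEC_NAME =
        Pattern.compile("[A-Za-z_][A-Za-z0-9_]*");

    public static boolean isSpecName(String name) {
        return SPEC_NAME.matcher(name).matches();
    }
}
```

Under the ASCII-only proposal above, a name like naïve would simply fall outside this strict check rather than being rejected by the parser.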
If Avro restricts valid names, then it won't be able to convert schemas from other systems into Avro schemas.
For example, how does this relate to https://issues.apache.org/jira/browse/PIG-1339?
If names are restricted, then consuming schemas from other systems will be difficult. Fewer restrictions in Avro make it more compatible and capable.
If there are stringent naming rules in the spec, it would be wise to standardize name mangling from external sources into Avro in the spec.
So I see two options that make sense:
Enforce the restriction in the current spec, add flexibility for reading schemas that do not comply (that may have already been persisted into permanent storage), and add to the spec standardized name mangling for translating schemas from other systems to Avro and back.
Open up the spec for naming to be significantly more flexible. At minimum also allow all code points above 127. Consider opening up even more characters in ASCII as valid names.
There are two kinds of mangling to consider.
"External system" to and from Avro. For example, a valid name in an external system might start with a number. If translated into Avro and Avro does not allow this, it would be very useful if all languages could look at the resulting name and convert it back if required. This should be standardized across Avro. The fewer restrictions in Avro, the easier this translation process is.
Avro to and from language identifiers in an implementation. This is a different issue that is language local. Because it is language local and up to the Avro implementation, this is less of a concern to me than translation from external schema sources. Most languages don't allow a newline in an identifier, but should Avro disallow that? Language implementations need to be prepared to mangle disallowed characters and strings regardless of what Avro specifies.
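As an illustration of the first kind of mangling, here is a hypothetical reversible escaping scheme. The `_uXXXX` convention is invented here, not part of any spec: every character outside the strict set, including '_' itself and a leading digit, is encoded so the original external name can always be recovered.

```java
// Hypothetical reversible name mangling between an external system and Avro.
// Any character outside [A-Za-z] (or a digit in a non-leading position) is
// written as _uXXXX; '_' itself is always escaped so decoding is unambiguous.
public class NameMangler {
    public static String escape(String external) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < external.length(); i++) {
            char c = external.charAt(i);
            boolean plain = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
                         || (c >= '0' && c <= '9' && i > 0); // no leading digit
            if (plain) sb.append(c);
            else sb.append(String.format("_u%04X", (int) c));
        }
        return sb.toString();
    }

    public static String unescape(String avroName) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < avroName.length(); i++) {
            char c = avroName.charAt(i);
            if (c == '_' && i + 5 < avroName.length() && avroName.charAt(i + 1) == 'u') {
                sb.append((char) Integer.parseInt(avroName.substring(i + 2, i + 6), 16));
                i += 5; // skip the consumed _uXXXX sequence
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }
}
```

For example, a SQL column named 2col would round-trip as _u0032col, and a dotted name like order.item as order_u002Eitem. If the Avro spec standardized one such scheme, every language binding would decode an external name back to the same source string.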

Thiruvalluvan M. G.
added a comment - 11/Feb/12 02:53

Maybe warnings aren't right. Maybe we should warn when validation is disabled and throw exceptions when validation is enabled, and then let folks disable validation in tools.
Yes, I think we should, by default, conform to the spec. That is, throw an exception stating that the names are not valid. Since we have so far implemented it loosely, in order to maintain compatibility we should allow non-ASCII names provided the user explicitly asks for relaxed name validation. In such situations we should also generate warnings.
This, while allowing the old schemas to be accepted, would encourage people to create conformant schemas. Now, how does the user express a desire for lenient name validation? The user may not have access to the source code that invokes the parser, so accepting a flag in the parse function may not work. I suggest we use a Java system property to pass this information on.
Having said that, do we know if anyone uses non-ASCII names? Do we know if they are unwilling or unable to fix their schemas? Are we solving a problem that doesn't really exist? Perhaps it is simpler if we just fix the code to conform to the spec and release it. We can do the workaround described above if someone complains.
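A system-property switch along these lines is straightforward. Below is a sketch; the property name avro.lenient.names is invented for illustration and is not an actual Avro flag.

```java
// Sketch: strict by default, lenient when a (hypothetical) system property is set.
public class Names {
    public static String checkName(String name) {
        if (!name.matches("[A-Za-z_][A-Za-z0-9_]*")) {
            if (!Boolean.getBoolean("avro.lenient.names")) {
                throw new IllegalArgumentException("Illegal name: " + name);
            }
            // Lenient mode: accept, but warn so authors can fix their schemas.
            System.err.println("WARNING: name does not conform to the spec: " + name);
        }
        return name;
    }
}
```

Running the JVM with -Davro.lenient.names=true would then accept legacy schemas while still nudging authors toward conformant names, without requiring access to the code that invokes the parser.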

Doug Cutting
added a comment - 10/Feb/12 23:01

I guess I'm okay leaving the spec alone. But I question the wisdom of making implementations less tolerant of non-ASCII or other errors in names. The Java feature that disables validation was added because there are schemas and data out there that do not conform to the spec but which Java should still be able to process. Your patch here fails unit tests that depend on non-ASCII in names and is an incompatible change for users who rely on it.
For interoperability we might make Java more tolerant of schemas with invalid names rather than less. For example, here's a patch that permits any name, but prints warnings for names that don't conform to the spec. It also modifies the compiler to escape characters that are not valid in Java identifiers. (This isn't quite ready to commit, since it comments out some tests of things that used to be parser errors but now we should test that they produce warnings.)
Maybe warnings aren't right. Maybe we should warn when validation is disabled and throw exceptions when validation is enabled, and then let folks disable validation in tools.
Such changes would permit us to generate code and read data regardless of whether the implementation that created the schema and data validated names correctly.
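The compiler-side escaping described above can be sketched with the JDK's own identifier predicates. The helper below is illustrative, not the actual patch; the `$XXXX` escape convention is invented here.

```java
// Sketch: map an arbitrary schema name to a legal Java identifier by escaping
// any character the language does not accept. One-way (lossy) by design.
public class JavaIdentEscaper {
    public static String toJavaIdentifier(String name) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            boolean ok = (i == 0) ? Character.isJavaIdentifierStart(c)
                                  : Character.isJavaIdentifierPart(c);
            if (ok) sb.append(c);
            else sb.append(String.format("$%04x", (int) c)); // '$' is legal in Java identifiers
        }
        return sb.toString();
    }
}
```

Note that Java itself accepts many non-ASCII letters in identifiers, so a name like naïve needs no escaping here; only characters such as '-' or a leading digit do.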

Raymie Stata
added a comment - 10/Feb/12 19:52

Right now, a number of Avro implementations, including the Java one, are out of conformance with the spec – i.e., there are bugs. Also, the implementations disagree significantly with each other, so there's no de facto spec either.
I've submitted a bug fix to bring the Java implementation into conformance with the spec. I still think the spec is fine and we should just fix the implementations.
If you think otherwise, then you should submit a patch to the spec that generalizes names to allow Unicode characters. Such a patch should specify the characters to be allowed in identifiers, discuss normalization, and perhaps address forward compatibility with Unicode changes (an issue discussed in TR31). In my view, it should also give guidance to people writing implementations with code generators; while that advice wouldn't be normative, it would help ensure that the spec is implementable. Indeed, it's generally good practice to develop reference implementations along with specs, so ideally this patch would also include changes to the Java implementation to bring it into conformance with the proposed spec changes.

Doug Cutting
added a comment - 10/Feb/12 18:38

An implementation would be naive to trust that other implementations have validated all names in schemas it receives. Java currently disables validation when reading a schema from a data file, since it's more important to be able to read the data. With generic APIs name validation isn't required, and many applications use only generic APIs.
This would not require support for Unicode identifiers in programming languages. A code generator should escape any character in a name that's not easy for it to represent in an identifier. We'd just be permitting code generators to take advantage of cases where a programming language does support Unicode in identifiers.
> If we went the other way (change the spec), we'd have to answer a bunch of design questions
> (decide what is a "letter," decide on normalization, figure out how to mangle names in various
> languages, etc.), and then implement validation in each language [ ... ]
I disagree. Even if we removed all restrictions on naming I don't think we'd add much burden to implementations. Most implementations don't do code generation. Code generators already need to mangle names. A code generator should already escape rather than die when it sees an unexpected character in a name. (The alternative is an inability to generate code for schemas that someone else controls, a poor choice.)
So I don't see a new interoperability problem this would create. We already have schemas in the wild whose names are invalid.
Perhaps we should change the spec to recommend that names be restricted to ASCII for ease of programming with generated APIs in all languages. And we might check that in compiler, forcing folks to specify --escape-non-ASCII-names if they really want to generate code for a schema whose names contain non-ASCII characters, to discourage the use of non-ASCII in schemas that you do control. In general we could encourage implementations to both not trust that identifiers are all-ASCII and to try to encourage all-ASCII identifiers.

Raymie Stata
added a comment - 10/Feb/12 14:31

I took a look at the non-Java implementations. PHP validates names against the current spec (i.e., ASCII letters only). The rest don't validate names at all. If we declared the current spec correct, then "fixing" all of the implementations would consist of adding (or, for Java, changing) 2-3 lines of simple name-validation code, and some more code to turn validation on and off.
If we went the other way (change the spec), we'd have to answer a bunch of design questions (decide what a "letter" is, decide on normalization, figure out how to mangle names in various languages, etc.), then implement validation in each language (which, as Thiru points out, would include adding an ICU dependency for C/C++, and maybe others (Ruby? PHP?)), and then implement mangling where needed (a lot more than a few-line change).
As a practical matter, this wouldn't get done, and as the universe of Avro users becomes bigger and bigger, fixing this broken corner of the Avro universe will become harder and harder.

Raymie Stata
added a comment - 10/Feb/12 05:34

I've pulled together some documentation on how different languages handle non-ASCII characters in identifiers. You'll see that languages vary greatly in what non-ASCII characters are allowed in identifiers, whether or not they are normalized, and how they are normalized when they are.
One of the goals of Avro is to support specifications that interoperate well across languages. Given all the variability in how different languages handle non-ASCII characters, I stand by what I said earlier: handling Unicode well in Avro is a lot of work, and doing it poorly (as we do now) just creates nasty interop problems.
—
The Unicode consortium has published a recommendation for defining Unicode identifiers: http://www.unicode.org/reports/tr31/
C# follows it almost exactly (but not exactly); Python follows it mostly; Java kind of follows it, but not really; C/C++ ignore it; and, as far as I can tell, neither Ruby nor PHP have given Unicode identifiers much thought at all.
Regarding Python, Python 2.x only allowed ASCII characters in identifiers. It wasn't until Python 3.x that Unicode characters were allowed. Python 3.x follows Unicode TR31. However, while Python calls for NFKC normalization, it does not use the "modified" NFKC normalization recommended in TR31.
C# follows Unicode TR31 exactly (except that it allows identifiers to start with an underscore). Thus, C#'s handling of non-ASCII identifiers is similar to Python's, except that C# calls for NFC rather than NFKC. Also, C# requires that its input arrives in normal form, and states that "The behavior when encountering an identifier not in Normalization Form C is implementation-defined; however, a diagnostic is not required" (presumably a diagnostic would be allowed). Python, on the other hand, says that "identifiers are converted into the normal form NFKC while parsing."
Java makes no reference to TR31, but it does seem to have been inspired by it. However, it's more restrictive than TR31 (and thus C# and Python). For example, while Python (and TR31) allow non-spacing marks, Java does not. Also, unlike TR31/C#/Python, the Java language does not call for normalization, and is rather explicit about this: "Unicode composite characters are different from the decomposed characters. For example, a LATIN CAPITAL LETTER A ACUTE (Á, \u00c1) could be considered to be the same as a LATIN CAPITAL LETTER A (A, \u0041) immediately followed by a NON-SPACING ACUTE (´, \u0301) when sorting, but these are different in identifiers."
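The composed/decomposed point in the Java quote can be demonstrated directly; this sketch uses Python's stdlib `unicodedata` (the behavior is a property of Unicode itself, not of any one language):

```python
import unicodedata

composed = "\u00c1"     # LATIN CAPITAL LETTER A WITH ACUTE, one code point
decomposed = "A\u0301"  # LATIN CAPITAL LETTER A + COMBINING ACUTE ACCENT

# Straight code-point comparison sees two different strings...
print(composed == decomposed)  # False
# ...but NFC normalization makes them identical.
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```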
C/C++ do not come close to TR31 and remain very restrictive. The specification lists just a few sets of non-ASCII letters that can appear in an identifier ( http://www.kuzbass.ru:8086/docs/isocpp/extendid.html#extendid ). These exclude many Unicode letters that are allowed by C#, Python and Java, and exclude other non-letter characters (such as connector punctuation) allowed in those languages. Also, while TR31/C#/Java/Python allow non-Arabic digits in identifiers (e.g., Ethiopic digits), C/C++ do not.
PHP defines a letter as follows: "a letter is a-z, A-Z, and the bytes from 127 through 255 (0x7f-0xff)." It says nothing about Unicode, including anything about normalization. Since much of the time input is presumably in UTF-8, the 0x7f-0xff range implicitly captures everything in Unicode that isn't in the Basic Latin block – this goes way beyond what's allowed by the languages discussed above. In short, they just haven't thought about the problem.
I can't find a language spec for Ruby or much discussion on Unicode variables in that language. More generally, it looks like Ruby's support for Unicode was bad prior to 1.9 (Jan 2009). Here's a discussion of how 1.9 makes it better: http://yokolet.blogspot.com/2009/07/design-and-implementation-of-ruby-m17n.html But there isn't any discussion of variable names.
Here's some summary info on support for Unicode variable-names in many different languages:
http://rosettacode.org/wiki/Unicode_variable_names

Thiruvalluvan M. G. added a comment - 10/Feb/12 03:23

The basic trouble is that Unicode has multiple representations for the same text. For example, see
http://weblogs.java.net/blog/joconner/archive/2006/06/strings_equals.html
Java has invested a lot of effort in supporting international characters. In spite of that we have trouble. In many other languages it is worse.
Restricting Unicode identifiers to letters and digits almost wipes out the use of non-ASCII characters completely. In the example shown in the above article, accents and accented characters are not recognized as letters. As another example, in my native language Tamil, there are 247 "letters". Unicode models them not as individual letters but as symbols (about 60 of them); combining the symbols makes up the letters. When represented in Unicode, only about 30 of the Tamil "letters" pass Java's isLetter() test. Almost no meaningful Tamil word will pass the isLetter() test. The same is true of (at least some) other Indian languages as well.
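The Tamil situation can be checked against Unicode's own category data. This sketch uses Python's `unicodedata` as an analogue of Java's isLetter() test (Python's str.isalpha() accepts the same "letter" categories); it is illustrative only:

```python
import unicodedata

ka = "\u0b95"        # TAMIL LETTER KA
vowel_aa = "\u0bbe"  # TAMIL VOWEL SIGN AA, a combining mark

print(unicodedata.category(ka))        # 'Lo': a letter
print(unicodedata.category(vowel_aa))  # 'Mc': a mark, not a letter

# The syllable "kaa" is the two combined; a letters-only test rejects it.
print((ka + vowel_aa).isalpha())  # False
```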
Moreover, it is better to keep the spec more restrictive to start with and open up later. I'm not sure of the current level of support for non-ASCII Avro names in the current implementations. It is not clear that the effort to make our implementations conformant would bring commensurate benefits, at least for now. For example, in order to properly support this single feature, we may have to make the C++ implementation use a large library like ICU. The vast majority of Avro C++ users just don't need it.

Doug Cutting added a comment - 09/Feb/12 20:50

> And doing Unicode right is a lot of work; doing it poorly will just create a nasty source of interop problems.
I don't see this. Avro already requires that JSON parsers "do Unicode right". Permitting non-ASCII in identifiers only creates problems when generating code. The potential interoperability problem could be that some implementations, when given a schema, would be unable to generate valid code in their programming language for that schema, rendering that schema unreadable by generated code (although it would still be readable by "generic" code). That would be a bug in that implementation.
Code generators already have to mangle names that are reserved words in the generated programming language. If we permit non-ASCII characters in identifiers then implementations might also need to escape non-ASCII characters when generating code. This doesn't seem a huge burden.
It's important that the specification is clear about what characters implementations might expect to see in identifiers so that they know what characters need to be escaped. A conservative implementation might simply escape anything that's not permitted in their programming language.
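A conservative escaping strategy of the kind described might look like the following sketch. The escape scheme and function name here are hypothetical, chosen for illustration; they are not Avro's actual mangling rules.

```python
def mangle(name: str) -> str:
    """Escape characters outside [A-Za-z0-9_] as _uXXXX_.

    Hypothetical scheme for illustration; not Avro's actual mangling."""
    out = []
    for ch in name:
        if ch.isascii() and (ch.isalnum() or ch == "_"):
            out.append(ch)  # safe in essentially every target language
        else:
            out.append("_u%04x_" % ord(ch))  # escape everything else
    return "".join(out)

print(mangle("caf\u00e9"))   # caf_u00e9_
print(mangle("plain_name"))  # plain_name
```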
If the spec is changed we should specify precisely what characters are permitted. Unicode characters have properties. We can use these properties to make the specification precise. One property is 'letter', another is 'number'. Java's isLetterOrDigit() includes these two sets.
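A property-based check like the one suggested might look like this sketch, which approximates Java's Character.isLetterOrDigit() using Python's `unicodedata` categories (letters are the 'L*' categories, decimal digits are 'Nd'):

```python
import unicodedata

def is_letter_or_digit(ch: str) -> bool:
    # Approximates Java's Character.isLetterOrDigit(): any letter
    # category, or a decimal digit in any script.
    cat = unicodedata.category(ch)
    return cat.startswith("L") or cat == "Nd"

print(is_letter_or_digit("A"))       # True
print(is_letter_or_digit("\u09e7"))  # True: BENGALI DIGIT ONE
print(is_letter_or_digit("$"))       # False
```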
Stepping back, it would be good if folks could use their own languages when writing Avro schemas. It should be possible to use, e.g., column names that are in Japanese, Chinese, Hindi, etc.

Raymie Stata added a comment - 09/Feb/12 00:59

The current implementation uses Character.isLetter/OrDigit, rather than Character.isJavaIdentifierStart/Part. Thus, even if you change the spec to agree with what Java allows in identifiers, you'll also have to change the implementation. Also, it's not clear to me that the restrictions on non-ASCII Unicode letters are the same for all languages (e.g., will Character.isJavaIdentifierStart work for all languages? If not, what's the plan?).
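The gap between a "letter" test and an "identifier" test exists in other languages too. As a rough Python analogue of the isLetter vs. isJavaIdentifierStart distinction:

```python
# Underscore is not a letter, yet it may start an identifier;
# a digit is a valid identifier part but not a valid start.
print("_".isalpha())           # False
print("_name".isidentifier())  # True
print("1name".isidentifier())  # False
```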
And what if, some day in the future, we want to support a language that doesn't support Unicode?
A fundamental problem that most programming languages don't address with Unicode is what to do about Unicode normalization. Most are silent on the topic, and most implementations default to straight code-point comparison, which isn't really all that usable.
So, again, changing Avro's current spec to allow (some) Unicode characters will not obviate the need to revisit the various implementations. And doing Unicode right is a lot of work; doing it poorly will just create a nasty source of interop problems.
I think the original intuition here was a good, pragmatic decision: we should restrict letters in identifiers to ASCII letters. We should keep the spec as-is, and change the implementations to agree.

Scott Carey added a comment - 08/Feb/12 18:41

I am fine with making the spec match what we are currently doing, if this is the case. Do all languages have good tests around the use of non-ASCII characters/numbers in names and enum symbols? In particular, I have seen some odd behavior that I did not track down when records with Unicode were written in the JSON encoding, so my impression is that this may not be well tested in all languages.

Doug Cutting added a comment - 08/Feb/12 18:30

Every language that currently implements Avro supports Unicode identifiers. So I wonder if we should instead amend the specification to permit non-ASCII characters?

Scott Carey added a comment - 08/Feb/12 07:09

This looks correct.
We should add a unit test that validates this and breaks without the change. It should hit each branch of the conditions. It may also be useful to encapsulate these conditions in private helper methods to share some of the logic and make it more readable (the condition for subsequent letters is a superset of the first letter).
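A sketch of what such helpers and a branch-covering test might look like, assuming the spec's ASCII name grammar [A-Za-z_][A-Za-z0-9_]* (the function names here are hypothetical, not the actual implementation):

```python
def _is_name_start(ch: str) -> bool:
    # First character: ASCII letter or underscore.
    return ch == "_" or ("A" <= ch <= "Z") or ("a" <= ch <= "z")

def _is_name_part(ch: str) -> bool:
    # Subsequent characters: a superset of the start set (adds digits).
    return _is_name_start(ch) or ("0" <= ch <= "9")

def is_valid_name(name: str) -> bool:
    return bool(name) and _is_name_start(name[0]) and all(
        _is_name_part(c) for c in name[1:]
    )

# One case per branch:
assert is_valid_name("record_1")
assert not is_valid_name("")           # empty
assert not is_valid_name("1record")    # bad first character
assert not is_valid_name("caf\u00e9")  # non-ASCII rejected
```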