"Why is this allowed" seems to be too opinion-based to me. The language designers made a decision, what else is there need to know? Unless you find a statement of the person making that decision, we can only speculate.
– Ingo Bürk Jun 9 '15 at 9:07

184

One interesting thing is at least that OP's IDE obviously gets it wrong and displays incorrect highlighting.
– dhke Jun 9 '15 at 9:09

8 Answers
8

Unicode decoding takes place before any other lexical translation. The key benefit of this is that it makes it trivial to go back and forth between ASCII and any other encoding. You don't even need to figure out where comments begin and end!

As stated in JLS Section 3.3 this allows any ASCII based tool to process the source files:

[...] The Java programming language specifies a standard way of transforming a program written in Unicode into ASCII that changes a program into a form that can be processed by ASCII-based tools. [...]

This gives a fundamental guarantee for platform independence (independence of supported character sets) which has always been a key goal for the Java platform.

Being able to write any Unicode character anywhere in the file is a neat feature, and especially important in comments, when documenting code in non-latin languages. The fact that it can interfere with the semantics in such subtle ways is just an (unfortunate) side-effect.

There are many gotchas on this theme and Java Puzzlers by Joshua Bloch and Neal Gafter included the following variant:

More seriously, this puzzle serves to reinforce the lessons of the previous three: Unicode escapes are essential when you need to insert characters that can’t be represented in any other way into your program. Avoid them in all other cases.

In short then, Java intentionally allows it: the "bug" is in the OP's IDE?
– Bathsheba Jun 9 '15 at 9:15

59

@Bathsheba: It's more in the heads of people. People don't try to understand how Java parsing works, so IDEs sometimes display the code in a wrong way. In the example above, the comment should end with \u000d and the part after it should have code highlights.
– Aaron Digulla Jun 9 '15 at 9:17

61

Another common mistake is to paste Windows paths in the code like // C:\user\... which leads to a compile error since \user isn't a valid Unicode escape sequence.
– Aaron Digulla Jun 9 '15 at 9:18

50

In Eclipse, the code after \u000d is partially highlighted. After pressing Ctrl+Shift+F, the character is replaced with a new line and the rest of the line is wrapped.
– bluelDe Jun 9 '15 at 9:21

20

@TheLostMind If I understand the answer correctly you should be able to reproduce this with block comments as well. \u002A/ should end the comment.
– Taemyr Jun 9 '15 at 11:27
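
The block-comment case can be checked with a minimal sketch like the one below. The class and method names are hypothetical and not from the thread. Since \u002A is the escape for *, the sequence \u002A/ becomes */ before tokenization and closes the comment early, so the assignment after it is ordinary code.

```java
class BlockCommentEscape {
    // Hypothetical demo method: returns "changed" because the assignment
    // on the middle line is code, not comment text.
    static String run() {
        String s = "unchanged";
        /* comment ends here: \u002A/ s = "changed"; /* and a new comment starts */
        return s;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```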

Since this hasn’t been addressed yet, here is an explanation of why the translation of Unicode escapes happens before any other source code processing:

The idea behind it was that it allows lossless translations of Java source code between different character encodings. Today, there is widespread Unicode support, and this doesn’t look like a problem, but back then it wasn’t easy for a developer from a western country to receive some source code from his Asian colleague containing Asian characters, make some changes (including compiling and testing it) and send the result back, all without damaging something.

So, Java source code can be written in any encoding and allows a wide range of characters within identifiers, character and String literals and comments. Then, in order to transfer it losslessly, all characters not supported by the target encoding are replaced by their Unicode escapes.

This is a reversible process and the interesting point is that the translation can be done by a tool which doesn’t need to know anything about the Java source code syntax, as the translation rule does not depend on it. This works because the translation to the actual Unicode characters inside the compiler happens independently of the Java source code syntax as well. It implies that you can perform an arbitrary number of translation steps in both directions without ever changing the meaning of the source code.

This is the reason for another weird feature which hasn’t even been mentioned: the \uuuuuuxxxx syntax:

When a translation tool is escaping characters and encounters a sequence that is already an escaped sequence, it should insert an additional u into the sequence, converting \ucafe to \uucafe. The meaning doesn’t change, but when converting into the other direction, the tool should just remove one u and replace only sequences containing a single u by their Unicode characters. That way, even Unicode escapes are retained in their original form when converting back and forth. I guess, no-one ever used that feature…
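
The round trip described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the class EscapeRoundTrip and its methods toAscii and fromAscii are made-up names, not an existing tool, and the sketch ignores the rule that a backslash preceded by an odd number of backslashes does not start an escape. The comments deliberately avoid spelling out an escape, since javac would translate it even there.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EscapeRoundTrip {
    // A backslash, one or more 'u', then exactly four hex digits.
    private static final Pattern ESCAPE =
            Pattern.compile("\\\\(u+)([0-9a-fA-F]{4})");

    // Hypothetical helper: produce an ASCII-only version of the source text.
    static String toAscii(String source) {
        // Step 1: mark existing escapes by inserting one extra 'u'.
        String marked = ESCAPE.matcher(source).replaceAll("\\\\u$1$2");
        // Step 2: escape every UTF-16 code unit outside the ASCII range.
        StringBuilder out = new StringBuilder();
        for (char c : marked.toCharArray()) {
            if (c < 128) {
                out.append(c);
            } else {
                out.append(String.format("\\u%04x", (int) c));
            }
        }
        return out.toString();
    }

    // Hypothetical helper: undo toAscii.
    static String fromAscii(String ascii) {
        Matcher m = ESCAPE.matcher(ascii);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String us = m.group(1);
            String replacement = us.length() == 1
                    // A single 'u': decode to the actual character.
                    ? String.valueOf((char) Integer.parseInt(m.group(2), 16))
                    // Several 'u's: drop one and keep the escape as text.
                    : "\\" + us.substring(1) + m.group(2);
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // The text contains a real e-acute and a literal, already escaped sequence.
        String original = "caf\u00e9 \\ucafe";
        String ascii = toAscii(original);
        System.out.println(ascii);
        System.out.println(fromAscii(ascii).equals(original));
    }
}
```

Running main prints caf\u00e9 \uucafe followed by true: the accented character was escaped with a single u, the pre-existing escape gained a second u, and converting back restores the original text exactly. Neither helper needs to know where comments or string literals begin and end.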

Yeah, native2ascii was intended to help prepare resource bundles by converting them to ISO Latin-1, as Properties.load was fixed to read Latin-1 only. And there, the rules are different: no \uuu… syntax and no early processing stage. In property files, property=multi\u000aline is indeed the same as property=multi\nline. (This contradicts the phrase “using Unicode escapes as defined in section 3.3 of The Java™ Language Specification” in the documentation.)
– Holger Jun 9 '15 at 18:52
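
The property-file behaviour is easy to check with a small sketch. The class name PropertiesEscape and the helper load are assumptions for illustration; Properties.load(Reader) is the standard API. Both spellings should produce a value containing a real line feed, because the loader splits the input into lines first and decodes escapes afterwards.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

class PropertiesEscape {
    // Hypothetical helper: parse one property line and return its value.
    static String load(String text) {
        Properties p = new Properties();
        try {
            p.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return p.getProperty("property");
    }

    public static void main(String[] args) {
        // The doubled backslashes keep javac from translating the escapes,
        // so the Properties parser sees them as plain text.
        String viaUnicode = load("property=multi\\u000aline");
        String viaNewline = load("property=multi\\nline");
        System.out.println(viaUnicode.equals(viaNewline));
        System.out.println(viaUnicode.equals("multi\nline"));
    }
}
```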

9

Note that this design goal could have been achieved without any of the warts; the easiest way would have been to forbid \u escapes to generate characters in the U+0000–007F range. (All such characters can be represented natively by all the national encodings that were relevant in the 1990s—well, maybe except some of the control characters, but you don't need those to write Java anyway.)
– zwol Jun 9 '15 at 19:28

3

@zwol: well, if you exclude control characters which aren’t allowed within Java source code anyway, you are right. Nevertheless, it would imply making rules more complicated. And today, it’s too late to discuss the decision…
– Holger Jun 9 '15 at 19:34

ah the problem of saving a document in utf8 and not latin or something else. All my databases were broken as well because of this western nonsense
– David 天宇 Wong Jun 17 '15 at 21:21

I'm going to completely ineffectually add the point, just because I can't help myself and I haven't seen it made yet, that the question is invalid since it contains a hidden premise which is wrong, namely that the code is in a comment!

In Java source code \u000d is equivalent in every way to an ASCII CR character. It is a line ending, plain and simple, wherever it occurs. The formatting in the question is misleading; what that sequence of characters actually corresponds to syntactically is:

IMHO the most correct answer is therefore: the code executes because it isn't in a comment; it's on the next line. "Executing code in comments" is not allowed in Java, just like you would expect.
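
To see this in practice, here is a minimal sketch. It is not the code from the question; the class and method names are made up for illustration. The escape is translated to a carriage return before the comment is recognized, so everything after it sits on a new line and is compiled as a normal statement.

```java
class CommentEscapeDemo {
    // Hypothetical demo method: the assignment after the escape is real code.
    static String run() {
        String result = "comment";
        // the escape ends this comment here: \u000d result = "code";
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints "code"
    }
}
```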

Much of the confusion stems from the fact that syntax highlighters and IDEs aren't sophisticated enough to take this situation into account. They either don't process the unicode escapes at all, or they do it after parsing the code instead of before, like javac does.

I agree, this isn't a Java "design error", but it's an IDE bug.
– bvdb Jun 22 '17 at 12:59

1

The question is rather about why code that looks like a comment to someone not familiar with this particular aspect of the language and perhaps without reference to syntax highlighting, is in fact not a comment. Objecting on the basis of the premise of the question being invalid is disingenuous.
– Phil Jun 15 at 5:37

The \u000d escape terminates a comment because \u escapes are uniformly converted to the corresponding Unicode characters before the program is tokenized. You could equally use \u002F\u002F instead of // to begin a comment.
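
As a rough sketch of how uniform the conversion is (the class name EscapedSlashes is hypothetical): U+002F is the slash and U+0061 is the letter a, so the escapes below are indistinguishable from the plain characters by the time the tokenizer runs.

```java
class EscapedSlashes {
    static int value() {
        int \u0061bc = 1;      // declares the variable abc
        \u002F\u002F abc = 2;  (this whole line is a comment)
        return abc;
    }

    public static void main(String[] args) {
        System.out.println(value()); // prints 1
    }
}
```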

This is a bug in your IDE, which should syntax-highlight the line to make it clear that the \u000d ends the comment.

This is also a design error in the language. It can't be corrected now, because that would break programs that depend on it. \u escapes should either be converted to the corresponding Unicode character by the compiler only in contexts where that "makes sense" (string literals and identifiers, and probably nowhere else) or they should have been forbidden to generate characters in the U+0000–007F range, or both. Either of those semantics would have prevented the comment from being terminated by the \u000d escape, without interfering with the cases where \u escapes are useful—note that that includes use of \u escapes inside comments as a way to encode comments in a non-Latin script, because the text editor could take a broader view of where \u escapes are significant than the compiler does. (I am not aware of any editor or IDE that will display \u escapes as the corresponding characters in any context, though.)

There is a similar design error in the C family,1 where backslash-newline is processed before comment boundaries are determined, so e.g.

// this is a comment \
this is still in the comment!

I bring this up to illustrate that it happens to be easy to make this particular design error, and not realize that it's an error until it is too late to correct it, if you are used to thinking about tokenization and parsing the way compiler programmers think about tokenization and parsing. Basically, if you have already defined your formal grammar and then someone comes up with a syntactic special case — trigraphs, backslash-newline, encoding arbitrary Unicode characters in source files limited to ASCII, whatever — that needs to be wedged in, it's easier to add a transformation pass before the tokenizer than it is to redefine the tokenizer to pay attention to where it makes sense to use that special case.

1 For pedants: I am aware that this aspect of C was 100% intentional, with the rationale — I am not making this up — that it would allow you to mechanically force-fit code with arbitrarily long lines onto punched cards. It was still an incorrect design decision.

I wouldn't go as far as saying that it's a design error. I could agree with you that it was a poor design choice, or a choice with unfortunate consequences, but I still think that it works as the language designers intended: It enables you to use any unicode character anywhere in the file, while maintaining ASCII encoding of the file.
– aioobe Jun 9 '15 at 15:29

12

That having been said, I think the choice of processing stage for \u was less absurd than the decision to follow C's lead in using leading zeroes for octal notation. While octal notation is sometimes useful, I've yet to hear anyone articulate an argument why a leading zero is a good way of indicating it.
– supercat Jun 9 '15 at 16:09

3

@supercat The people who threw that feature into C89 were generalizing the behavior of the original K&R preprocessor rather than designing a feature from scratch. I doubt they were familiar with punched card best practices, and I also doubt that the feature has ever been used for its stated purpose, except maybe for one or two retrocomputing exercises.
– zwol Jun 9 '15 at 18:33

7

@supercat I wouldn't have a problem with Java \u as pre-tokenization transformation if it were forbidden to produce characters in the U+0000..U+007F range. It's the combination of "this works everywhere" and "this aliases ASCII characters with syntactic significance" that demotes it from awkward to flat-out wrong.
– zwol Jun 9 '15 at 18:34

4

On your "for pedants": Of course at that time the // single-line comment didn't exist. And since C has a statement terminator that is not a new line, it would mostly be used for long strings, except that as far as I can determine "string literal concatenation" was there from K&R.
– Mark Hurd Jun 16 '15 at 17:39

This was an intentional design choice that goes all the way back to the original design of Java.

To those folks who ask "who wants Unicode escapes in comments?", I presume they are folks whose native language uses the Latin character set. In other words, it is inherent in the original design of Java that folks could use arbitrary Unicode characters wherever legal in a Java program, most typically in comments and strings.

It is arguably a shortcoming in programs (like IDEs) used to view the source text that such programs cannot interpret the Unicode escapes and display the corresponding glyph.

I agree with @zwol that this is a design mistake; but I'm even more critical of it.

\u escape is useful in string and char literals; and that's the only place that it should exist. It should be handled the same way as other escapes like \n; and "\u000A" should mean exactly "\n".

There is absolutely no point in having \uxxxx in comments - nobody can read that.

Similarly, there's no point in using \uxxxx in other parts of the program. The only exception is probably in public APIs that are coerced to contain some non-ASCII chars - when's the last time we've seen that?

The designers had their reasons in 1995, but 20 years later, this appears to be a wrong choice.

(question to readers - why does this question keep getting new votes? is this question linked from somewhere popular?)

I guess, you are not hanging around, where non-ASCII characters are used in APIs. There are people using it (not me), e.g. in Asian countries. And when you are using non-ASCII characters in identifiers, forbidding them in documentation comments makes little sense. Nevertheless, allowing them inside a token and allowing them to change the meaning or boundary of a token are different things.
– Holger Jun 9 '15 at 17:25

15

they can use proper file encoding. why write int \u5431 when you can do int 整
– ZhongYu Jun 9 '15 at 17:29

3

What will you do when you have to compile code against their API and cannot use the proper encoding (assume that there wasn’t widespread UTF-8 support in 1995)? You just have to call one method and don’t want to install the Asian language support pack of your operating system (remember, the nineties) for that single method…
– Holger Jun 9 '15 at 17:34

5

What is much clearer now than 1995 is that you better know English if you want to program. Programming is an international interaction, and almost all resources are in English.
– ZhongYu Jun 9 '15 at 18:16

7

I don’t think that this has changed. Java’s documentation was all-English most of the time as well. There was a Japanese translation maintained for a while but maintaining two languages doesn’t really back up the idea of maintaining it for all the locales of the world (it rather disproved it). And before that, there was no mainstream language with Unicode support in identifiers anyway. So I would guess, somebody thought that localized source code was the next big thing. I would say thankfully, it didn’t take off.
– Holger Jun 9 '15 at 18:24

The compiler not only translates Unicode escapes into the characters they represent before it parses a program into tokens, but it does so before discarding comments and white space.

This program contains a single Unicode escape (\u000a), located in its sole comment. As the comment tells you, this escape represents the linefeed character, and the compiler duly translates it before discarding the comment.

This is platform-dependent. On certain platforms, such as UNIX, it will work; on others, such as Windows, it won’t. Although the output may look the same to the naked eye, it could easily cause problems if it were saved in a file or piped to another program for subsequent processing.

As eloquent as your "answer" might be, it actually is not an answer at all. OP's question was "Why is this allowed" but this here is an explanation of how it works...which OP already provided.
– mmgross Jan 2 at 11:10

2

Do you have any sources to confirm that this is platform dependent? If this is true, I would consider Java to be entirely broken (I do anyway, this is just another nail in the coffin).
– Clearer Feb 7 at 12:48