Comments from Jeff Allen:
Hi Rose,
I can reproduce this, but only by setting my system locale to Chinese.
Any console line with a quote in it seems to be a problem. It is not, as
I assumed, to do with the JLineConsole, since it also occurs with the
plain console. sys.stdin.readline() and raw_input() both read ok: I
believe it's somewhere in the interpreter, but it's odd it would be
sensitive to encoding when dealing with the ascii subset.
I'm on Windows 7 and get the encoding name 'x-mswin-936', but it's
practically the same as you see. And this is on the latest 2.7b2
precursor so progress since beta 1 hasn't fixed it for you.
Yes please, file a bug at http://bugs.jython.org/ then you'll get
updates. Thanks for bringing this to our attention.
Jeff

I'll take this as I've started analysing it. I'm a little puzzled so far. I'm tentatively marking it as dependent on #1066 (CJK Codecs), as at least some of the error messages refer to a missing codec and the interpreter must surely be using it. I notice Jim Baker is following up #1066 and all that.

I poked around at this yesterday. A couple of problems combine here.
The absence of the codec is the larger obstacle:
Jython 2.7b3+ (default:6cee6fef06f0+, Jun 8 2014, 19:49:20)
[Java HotSpot(TM) 64-Bit Server VM (Oracle Corporation)] on java1.7.0_51
Type "help", "copyright", "credits" or "license" for more information.
>>> import codecs
>>> codecs.lookup("cp936")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
LookupError: unknown encoding 'cp936'
And the same happens for "ms936" and "x-mswin-936". A look at aliases.py shows "ms936" and "cp936" should map to the gbk codec, which we don't have yet (#1066).
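For comparison, here is a minimal sketch of how the alias table resolves these three names. It assumes a CPython 3 interpreter with the standard CJK codecs installed, not the Jython build shown above, which lacked them. The helper name `resolve` is purely illustrative.

```python
import codecs


def resolve(name):
    """Return the canonical Python codec name for `name`, or None if unknown."""
    try:
        return codecs.lookup(name).name
    except LookupError:
        return None


# Two names that aliases.py knows about, and the Java canonical name.
for candidate in ("cp936", "ms936", "x-mswin-936"):
    print(candidate, "->", resolve(candidate))
```

On such an interpreter the first two names should resolve to the same codec, while the Java canonical name is not recognised at all.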
Secondly, in a couple of places, when no encoding is explicit, I've chosen to get a name from the console Charset like this:
encoding = Py.getConsole().getEncodingCharset().name();
thinking the "canonical name" would be the one we want. In this case, that returns "x-mswin-936", which wouldn't be recognised in aliases.py even if we had the gbk codec.
Stepping through the part of InteractiveConsole that parses Python, I find that I'm using the canonical name to set the cflags.encoding consulted by the compiler. Oddly, when parsing the quoted string, it reacts to the missing "x-mswin-936" codec in the same way as to an incomplete line, hence the continuation prompt. It is also odd to me that it doesn't notice sooner that the codec is missing: almost as if it only used the Python codec for string parsing. Elsewhere, it is definitely using a Java codec, which of course it has no trouble obtaining by the canonical name.
At present, I would like to see if we could use one codec consistently when parsing. I can see that using the Python codec is preferable in some ways, but this code uses the Java one predominantly (I think). It would be cool if we could make Java codecs into Python ones. Or the other way around.
I'll think about the name confusion too. Amongst the alias names (in Java) for "x-mswin-936" is "ms936" which Python would accept. It's a bit ugly to sort through them until we find a Python-acceptable one, but it may come to that. Perhaps Python (all Pythons) should accept the (Java) canonical codec name for a codec.
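The "sort through the aliases" idea could look roughly like the sketch below. It is written in Python 3 syntax and is not the Jython implementation. The function name `first_python_name` is hypothetical, and the candidate list is typed in by hand. In Jython the list would presumably come from the Java Charset's canonical name and aliases.

```python
import codecs


def first_python_name(java_names):
    """Return the first name in `java_names` that Python's codec registry accepts."""
    for name in java_names:
        try:
            codecs.lookup(name)
        except LookupError:
            continue  # Python does not know this name; try the next one
        return name
    return None  # no candidate is usable from Python


# Hypothetical input: the Java canonical name followed by one of its aliases.
print(first_python_name(["x-mswin-936", "ms936"]))
```

Returning the name rather than the codec keeps the result suitable for a value such as sys.stdout.encoding, which user code may later pass to codecs.lookup() itself.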

Jeff, just wanted to point out that I plan to use the underlying Java codecs and wrap them as Python codecs for #1066 support. Such support will be general. I like the idea of being able to go the reverse direction as well.
Ensuring that names can be used interchangeably sounds like a good goal as well. Jython should accept Java names for codecs, but use the Python name as the canonical name.

Got it. The parser uses a Java codec, so a literal string has already been decoded from the console by the Java x-mswin-936 codec. But a literal string should contain the bytes equivalent to it in the input encoding. So the parser has to reverse the decoding itself, and is trying to do that with the (non-existent) Python codec. But using the Java codec is more respectable, and it fixes the hang on input.
>dist\bin\jython -Dpython.console=
Jython 2.7b3+ (default:6cee6fef06f0+, Jun 9 2014, 23:22:52)
[Java HotSpot(TM) 64-Bit Server VM (Oracle Corporation)] on java1.7.0_51
Type "help", "copyright", "credits" or "license" for more information.
>>> "xx"
'xx'
>>> "畫蛇添足"
'\xae\x8b\xc9\xdf\xcc\xed\xd7\xe3'
>>> u"畫蛇添足"
u'\u756b\u86c7\u6dfb\u8db3'
>>> exit()
This doesn't work with the default JLineConsole as that seems to have no idea about multibyte characters.
Output is still failing, as that really does need the codecs from #1066.
I'll push this small change after tests, and then think how to avoid the non-Python name "x-mswin-936".
On the wrapping issue, Jim: if someone defined a codec in Python, then used it as the source encoding, it would be necessary to be able to create a Java codec from it, since the parser has to use it as the decoding in a Reader. In the present design, that is.
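Going back to the literal-string handling at the top of this comment, the decode-then-re-encode step can be sketched as follows. This is Python 3 syntax, where bytes plays the role of a Python 2 str. Python's gbk codec stands in for Java's x-mswin-936 as an assumption, and `literal_bytes` is an illustrative name, not a function in the Jython source.

```python
def literal_bytes(decoded_text, console_encoding):
    """Re-encode text the reader already decoded, giving the bytes a str literal holds."""
    return decoded_text.encode(console_encoding)


typed = b"\xca\xb9\xd3\xc3"    # bytes arriving from a code page 936 console
text = typed.decode("gbk")     # what the parser's Reader hands over as characters

# The byte-string literal should contain the original console bytes again.
print(literal_bytes(text, "gbk") == typed)
```

The round trip only works when the same encoding is used in both directions, which is why a missing or differently named codec on the re-encoding side breaks it.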

Hi, Jeff,
So glad you found the cause and the fix. What do you mean the output is
still failing?
Thanks,
Rose
> _______________________________________
> Jython tracker <report@bugs.jython.org>
> <http://bugs.jython.org/issue2123>
> _______________________________________
>

On output, this happens:
>>> p = u"Java 蛇"
>>> p
u'Java \u86c7'
>>> print p
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
LookupError: unknown encoding 'x-mswin-936'
If I had an acceptable encoding name here, it would still fail because there is no suitable codec (awaiting #1066). The following works (after the fix) because the bytes in the string are the bytes that the console expects:
>>> print "Java 蛇"
Java 蛇
There is still a difficulty when the code comes from a file. The bytes will be in the encoding used for the file, but this may not match the console. I think this is a Python 2k behaviour, best addressed by using Unicode, which brings us back to #1066.
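The mismatch between file and console encodings can be illustrated with a short sketch in Python 3 syntax. It assumes the source file is saved as UTF-8 and the console expects code page 936, with Python's gbk codec standing in for it. Neither assumption comes from the report.

```python
text = "Java \u86c7"

file_bytes = text.encode("utf-8")     # assumed encoding of the source file
console_bytes = text.encode("gbk")    # what a code page 936 console expects

# The two byte sequences differ, so pushing the file's bytes straight at the
# console cannot display the intended characters.
print(file_bytes == console_bytes)

# Interpreting the file's bytes with the console's encoding does not give the text back.
print(file_bytes.decode("gbk", errors="replace") == text)

# Going through Unicode first produces the bytes the console expects.
print(file_bytes.decode("utf-8").encode("gbk") == console_bytes)
```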

The console hang is now corrected in http://hg.python.org/jython/rev/eef3fcffc58a .
As noted, further disappointments are in store for the user of codepage 936 until 1) #1066 is resolved, and 2) Jython is able to recognise the equivalence of the Java and Python codecs. This is because, eventually, someone will do the equivalent of codecs.lookup(sys.stdout.encoding), and sys.stdout.encoding will contain the name Java chose.
I'll hold this open for now as a reminder about 2). I think we could solve that either by trying the alias names of the Java codec, to find one Python likes, probably here: http://hg.python.org/jython/file/eef3fcffc58a/src/org/python/core/PySystemState.java#l274, or by adding the Java-canonical name to aliases.py.
Even without that, once #1066 is in place, I expect the environment variable PYTHONIOENCODING, set to a name both Python and Java recognise, would work around problem 2).
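The second option above, adding the Java-canonical name to aliases.py, can be tried out at run time without editing the file. The sketch below is in Python 3 syntax and assumes an interpreter that already has the gbk codec. Treating x-mswin-936 as equivalent to gbk is itself an assumption made for illustration.

```python
import codecs
import encodings.aliases

# Keys in the alias table are normalised names: lower case, with hyphens
# written as underscores. Register the alias before the name is first looked
# up, since the encodings package may remember an earlier failed lookup.
encodings.aliases.aliases["x_mswin_936"] = "gbk"

info = codecs.lookup("x-mswin-936")
print(info.name)
```

A permanent fix would put the same entry in aliases.py itself, so that it does not depend on import order.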

Hi, Jeff,
Thanks a lot for your great job. We really appreciate it.
Where can I find the latest fix?
Since you have a multi-byte character environment with the fix, could
you please try out the following use case, which failed for me before, and
see if it is fixed too?
1. Create a py script, test.py. The test.py file defines a function called
create() which simply returns the value of the parameter:
def create(name):
    return name
2. Then start Jython 2.7 and run:
>>> execfile("test.py")
>>> create('\u4f7f\u7528')   <-- input multi-byte characters
u'\u4F7F\u7528'   <-- Expecting the same unicode representing the multi-byte
characters, but instead got something like '\xBB\xC8\xCD\xD1', which is not
correct to me.
If it's not fixed, do I need to file a new issue or just keep using this
issue #?
Thanks a lot,
Rose
On Thu, Jun 12, 2014 at 12:36 AM, Jeff Allen <report@bugs.jython.org> wrote:
>
> Jeff Allen added the comment:
>
> ----------
> title: jython 2.7 beta 1 standalone hangs on multi-byte characters on
> Windows 8 -> Hang on multi-byte characters on Windows
>

Yes, that will work. But you can't print that string out until #1066 is fixed.
Somewhat to my surprise, I can type Chinese characters at the prompt and that works too. Or rather, I can paste them, since I don't know how to type Chinese.
Jython 2.7b3+ (default:1f517f1e5a08, Jun 11 2014, 22:42:31)
[Java HotSpot(TM) 64-Bit Server VM (Oracle Corporation)] on java1.7.0_51
Type "help", "copyright", "credits" or "license" for more information.
>>> execfile("test.py")
>>> create(u'\u4f7f\u7528')
u'\u4f7f\u7528'
>>> print _
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
LookupError: unknown encoding 'x-mswin-936'
>>> u'\u4f7f\u7528'
u'\u4f7f\u7528'
>>> u'使用'
u'\u4f7f\u7528'
>>>
To get hold of this yourself, you would have to build Jython from source (just needs hg, ant and the jdk I think), or wait for beta 4, which Jim expects will also have the codec you need.

The last minor change at http://hg.python.org/jython/rev/44191dd20f5a completes this (I think). I've changed the way we record the console encoding so that we preserve the original name specified, or deduced, rather than the Java-canonical name. I can now get the intended behaviour:
Active code page: 936
>dist\bin\jython -Dpython.console=org.python.core.PlainConsole
Jython 2.7b3+ (default:44191dd20f5a, Jun 24 2014, 22:42:40)
[Java HotSpot(TM) 64-Bit Server VM (Oracle Corporation)] on java1.7.0_51
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.stdin.encoding
'ms936'
>>> sys._jy_console.getEncoding()
u'ms936'
>>> sys._jy_console.getEncodingCharset()
x-mswin-936
>>> u = u'\u756b\u86c7\u6dfb\u8db3'
>>> print u
畫蛇添足
>>> s = "使用"
>>> s
'\xca\xb9\xd3\xc3'
>>> print s
使用
>>> raw_input('畫蛇')
畫蛇
''
>>> raw_input('畫蛇: ')
畫蛇: 添足
'\xcc\xed\xd7\xe3'
>>> raw_input('畫蛇: ').decode("gbk")
畫蛇: 添足
u'\u6dfb\u8db3'
It is an odd quirk that there are two Chinese codecs in Java: specifying ms936 will get you the codec with canonical name x-mswin-936, while specifying cp936 will get you one called GBK. ms936 is what we retrieve from java.io.Console, see http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/687fd7c7986d/src/windows/native/java/io/Console_md.c#l54. Either of these gets the Python GBK codec, and that's the one necessary for print to work.
In test_ntpath and test_macpath, test_nonascii_abspath() complains of an invalid directory name, which I suspect is due to using the default (therefore multibyte) encoding. This seems so far from the original complaint, and may be a fault in the test anyway, that I feel justified not holding the bug open for that.
It is worth observing that the default JLineConsole does not work with multi-byte encoding. One can fix that on the command line by setting -Dpython.console=org.python.core.PlainConsole, or in the Jython registry file.
Rose:
Were you able to build from source? If so, I think your use of Jython with this code page will be a better test than anything I have done. You may find other faults in our MBCS support, but I'm hopeful that they won't be in the console encoding.

Hi, Jeff,
Thanks a lot for handling this. I really appreciate it.
One thing I noticed from the console output:
>>> s = "使用"
>>> s
'\xca\xb9\xd3\xc3'
This is different from Jython 2.2.1. We should expect the same characters
returned when printing s, like we have in Jython 2.2.1:
>>> aa="操作" <-- Multi-byte characters are displayed correctly
>>> print aa
操作
Could you please see if this is still an issue? Appreciate your comments.
Thanks,
Rose

On Windows we do not have the UTF-8 option. This is what CPython does with code page 936:
>python
Python 2.7.6 (default, Nov 10 2013, 19:24:24) [MSC v.1500 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.stdin.encoding
'cp936'
>>> s = "使用"
>>> s
'\xca\xb9\xd3\xc3'
>>> print s
使用
>>>
Jython is now exactly the same, except that Java likes to call the encoding ms936. (Java actually tests the range of the code page number so it can call some of them cp* and some ms*; I assume there's a good reason.)
A str is a sequence of bytes, not characters. When you just type s at the prompt, Python actually prints repr(s), which gives you a "safe" representation, such as you might have written in ascii source code. When you execute print s, it pushes the bytes out through sys.stdout and what you see is the result of the (Windows) console interpreting those bytes, in this case as code page 936. The same bytes would normally come out on my console like this (code page 1252):
>>> s
'\xca\xb9\xd3\xc3'
>>> print s
Ê¹ÓÃ
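The same point can be made as a small sketch in Python 3 syntax, where bytes plays the role of the Python 2 str. Decoding with gbk and cp1252 imitates what a console on each code page does with the bytes, and ascii() is used only so the script prints safely on any terminal.

```python
data = b"\xca\xb9\xd3\xc3"

# The "safe" representation shown when the value is echoed at the prompt.
print(repr(data))

# What a console makes of the same four bytes depends on its code page.
as_936 = data.decode("gbk")       # two Chinese characters
as_1252 = data.decode("cp1252")   # four Western European characters

print(len(as_936), ascii(as_936))
print(len(as_1252), ascii(as_1252))
```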
At a time when CPython only dealt with bytes, Jython chose to allow UTF-16 characters in strings, interchangeably with Java. Since then, Python has evolved to support unicode as a distinct type, and later Jython versions conform to that design.
Bottom line: this aspect of Jython is correct now (probably). Thanks for making us think about it.