So os.listdir has interesting semantics with respect to working with Unicode paths that I was not aware of until now (being US/ASCII centric I suppose ;)
I created the following directory layout:
unicode
└── 首页
(Note that at least on OSX, the tree and ls commands don't work from the command line. cd does. So I manually created the above tree diagram!)
Python 2.7
>>> os.listdir(u"unicode")
[u'\u9996\u9875']
>>> os.listdir("unicode")
['\xe9\xa6\x96\xe9\xa1\xb5']
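The two CPython results above are the same name in two forms: the byte string is the UTF-8 encoding of the unicode one. A minimal sketch to check that, written in Python 3 syntax and assuming the filesystem encoding is UTF-8 (as it is on OS X):

```python
# Python 3 syntax; the literals are copied from the CPython 2.7 session above.
name_unicode = "\u9996\u9875"
name_bytes = b"\xe9\xa6\x96\xe9\xa1\xb5"

# Encoding the unicode name as UTF-8 gives exactly the byte string result.
assert name_unicode.encode("utf-8") == name_bytes
# Decoding the byte string as UTF-8 recovers the two code points.
assert name_bytes.decode("utf-8") == name_unicode

print(len(name_unicode), len(name_bytes))  # 2 6
```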
Trying again with Jython 2.7 trunk:
>>> os.listdir(u"unicode")
[u'\u9996\u9875']
>>> os.listdir("unicode")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
at org.python.core.PyString.<init>(PyString.java:60)
at org.python.core.PyString.<init>(PyString.java:66)
at org.python.core.PyString.createInstance(PyString.java:776)
at org.python.modules.posix.PosixModule.listdir(PosixModule.java:499)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
java.lang.IllegalArgumentException: java.lang.IllegalArgumentException: Cannot create PyString with non-byte value
Similar behavior is seen with, say, glob.glob("unicode/*") vs glob.glob(u"unicode/*"). os.stat works with paths given as Unicode strings, but not with bytestrings:
os.stat('unicode/\xe9\xa6\x96\xe9\xa1\xb5')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
OSError: [Errno 2] No such file or directory: '/Users/jbaker/test/unicode/\xe9\xa6\x96\xe9\xa1\xb5'
There is also the outstanding, and possibly related, bug #2110 about updating our JNR Posix jar.

So this bug has been in Jython since at least 2.5:
$ jython25
Jython 2.5.4 (2.5:5ce837b1a1d8+, Dec 30 2014, 09:01:23)
[Java HotSpot(TM) 64-Bit Server VM (Oracle Corporation)] on java1.7.0_21
Type "help", "copyright", "credits" or "license" for more information.
>>> glob.glob("unicode/*")
['unicode/\u9996\u9875']
>>> os.stat('unicode/\u9996\u9875')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
OSError: [Errno 2] No such file or directory: '/Users/jbaker/test/unicode/\\u9996\\u9875'
Jeff's fixes to ensure that no 16-bit characters sneak into PyString simply made it fail earlier, which is a very good thing indeed.
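To illustrate what "fail earlier" means here, a rough model of the check in Python 3 syntax. This is not the actual PyString code; the function name is hypothetical, and the assumption (based on the "non-byte value" message) is that every character must fit in 8 bits:

```python
def new_pystring(value):
    """Hypothetical model of the PyString constructor check (an assumption)."""
    for ch in value:
        if ord(ch) > 0xFF:
            # A 16-bit character cannot be a byte, so refuse it up front
            # instead of letting a bogus byte string circulate.
            raise ValueError("Cannot create PyString with non-byte value")
    return value


new_pystring("unicode/\xe9\xa6\x96")  # every character is <= 0xFF: accepted

try:
    new_pystring("unicode/\u9996\u9875")  # what listdir tried to build
except ValueError as exc:
    print(exc)
```

In 2.5 the second call would have quietly succeeded, and the problem only surfaced later in os.stat.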
Probably the right thing to do is for os.listdir and similar os functions to emit PyUnicode if any character is > 127; we can see the obvious bug here from PosixModule.java:
public static PyList listdir(PyObject path) {
    String absolutePath = absolutePath(path);
    File file = new File(absolutePath);
    String[] names = file.list();
    if (names == null) {
        // Can't read the path for some reason. stat will throw an error if it can't
        // read it either
        FileStat stat = posix.stat(absolutePath);
        // It exists, maybe not a dir, or we don't have permission?
        if (!stat.isDirectory()) {
            throw Py.OSError(Errno.ENOTDIR, path);
        }
        if (!file.canRead()) {
            throw Py.OSError(Errno.EACCES, path);
        }
        throw Py.OSError("listdir(): an unknown error occurred: " + path);
    }
    PyList list = new PyList();
    PyString string = (PyString) path;
    for (String name : names) {
        list.append(string.createInstance(name));
    }
    return list;
}
The point of string.createInstance(name) is that it will construct a PyString or PyUnicode based on the starting path; but this is also what caused the quiet bug earlier, and now the immediate failure we are seeing with Jeff's fixes.
The alternative is to try to simulate CPython here and return paths encoded using the underlying filesystem encoding (maybe UTF-8, maybe something else). But this is going too far: Java file paths are inherently already in Unicode.
But the most important reason is that we should be able to take os.listdir output and use it with java.io.File and similar Java APIs. Java interoperability remains the most important reason for using Jython, after all.
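A sketch of the rule I have in mind, modeled in Python 3 syntax rather than as the actual PosixModule change. Here bytes stands in for PyString and str for PyUnicode, and the function name is hypothetical:

```python
def listdir_entry(name, path_is_unicode):
    """Pick the result type for one entry name (model only, not Jython code).

    `name` plays the role of the Java String returned by File.list().
    """
    if path_is_unicode:
        return name                    # unicode path in, unicode names out
    if all(ord(ch) < 128 for ch in name):
        return name.encode("ascii")    # pure ASCII: safe as a byte string
    return name                        # non-ASCII: fall back to unicode


print(ascii(listdir_entry("README", path_is_unicode=False)))
print(ascii(listdir_entry("\u9996\u9875", path_is_unicode=False)))
```

Either way the returned name holds the same characters Java uses, so it can be handed straight back to java.io.File.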

A related bug is that one cannot start Jython with the current working directory being Unicode. This one fails in a similar fashion:
$ cd 首页
jimbaker:首页 jbaker$ jython27
Exception in thread "main" Traceback (most recent call last):
File "/Users/jbaker/jythondev/jython27/dist/Lib/site.py", line 62, in <module>
import os
File "/Users/jbaker/jythondev/jython27/dist/Lib/os.py", line 45, in <module>
from posix import *
java.lang.IllegalArgumentException: Cannot create PyString with non-byte value
at org.python.core.PyString.<init>(PyString.java:60)
at org.python.core.PyString.<init>(PyString.java:66)
at org.python.core.Py.newString(Py.java:626)
at org.python.modules.posix.PosixModule.getEnviron(PosixModule.java:896)
at org.python.modules.posix.PosixModule.classDictInit(PosixModule.java:113)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
...

I made it fail ugly as I expected there would be breakage, but didn't know how to look for it. (I found a few things.) The test suite does not stretch far in this direction, so it passes.
Interesting mention of java.io.File. I wonder how that deals with the conversion between what the filesystem might return and Unicode file names?

@Jeff,
It's pretty straightforward - all paths are Unicode in Java. Apparently so are environment variables and their values. So if we look at PosixModule, it's trying to replicate what's specified (more or less) in https://docs.python.org/2/library/os.html#os.listdir by intercepting the returned String and making a PyString/PyUnicode as appropriate.
I believe the right choice for us, when the path is a PyString, is to return a PyString only if the name is ASCII and a PyUnicode otherwise, because we don't actually support encoded strings anyway.
Likewise, we have a similar problem in os.environ, as supported by PosixModule.getEnviron:
private static PyObject getEnviron() {
    PyObject environ = new PyDictionary();
    Map<String, String> env;
    try {
        env = System.getenv();
    } catch (SecurityException se) {
        return environ;
    }
    for (Map.Entry<String, String> entry : env.entrySet()) {
        environ.__setitem__(Py.newString(entry.getKey()), Py.newString(entry.getValue()));
    }
    return environ;
}
https://github.com/jythontools/jython/blob/master/src/org/python/modules/posix/PosixModule.java#L896
Note that Python 3 separates out os.environ and os.environb:
https://docs.python.org/3/library/os.html#os.environ
So when I run on Python 3, os.environ has this entry:
'PWD': '/Users/jbaker/test/unicode/首页'
whereas on Python 2.7:
'PWD': '/Users/jbaker/test/unicode/\xe9\xa6\x96\xe9\xa1\xb5'
Compare Java:
http://docs.oracle.com/javase/7/docs/api/java/lang/System.html#getenv()
And using Jython 2.5 in this same directory:
>>> System.getenv().get("PWD")
u'/Users/jbaker/test/unicode/\u9996\u9875'
And it's all related to that curious entity, surrogateescape.
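For reference, a small sketch of what surrogateescape does in CPython 3 (nothing Jython-specific; the sample bytes are made up). Bytes that cannot be decoded are mapped to lone surrogates on the way in and restored on the way out, so arbitrary byte names round-trip through str:

```python
raw = b"caf\xe9"  # Latin-1 bytes, not valid UTF-8

# The undecodable byte 0xE9 becomes the lone surrogate U+DCE9.
text = raw.decode("utf-8", "surrogateescape")
print(ascii(text))  # 'caf\udce9'

# Encoding with the same error handler restores the original bytes.
assert text.encode("utf-8", "surrogateescape") == raw

# A strict encode refuses the lone surrogate.
try:
    text.encode("utf-8")
except UnicodeEncodeError:
    print("strict encode fails")
```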

The problem with returning encoded strings, vs unicode, is that we cannot simply use the results of Python os operations, including ones that build on them, like glob, with Java IO, since Java IO does not know about encoded strings.
For once, this may be one advantage of the conflation of strings and Unicode strings in Python 2.x. With Jython 3.x, there will also be a seamless transition, since it will be unicode anyway unless bytes are asked for specifically.
The other advantage is that this is then an easy fix, per the diff in my earlier message. Getting similar Java IO seamlessness while returning encoded strings from os.listdir is involved (of unknown scope), and doesn't correspond to anything else we have done. But Jython has always tried to make for a seamless experience between Python and Java; it is why people use Jython.
Boundaries always present challenges like these. Providing unicode simplifies what we need to do.

Arfrever, great question. But the underlying Java platform doesn't support this scenario anyway:
>>> from java.io import File
>>> File("/tmp/somedir").list()
returns None
I wouldn't rule out supporting bytestring filenames, with C support in a future release. But it's a lot of work, and it's not going to be in 2.7.0, per what we can reasonably triage :)

Fixed as of https://hg.python.org/jython/rev/ea036792f304
There are two remaining issues:
1. Supporting non-Unicode paths, if available, through C integration. Moved to #2245
2. Supporting bytes through all filesystem integration points including Java (e.g. java.io.File). Moved to #1839

Arfrever, that's a very interesting difference. When I tested, I was using OS X 10.10.1 (Yosemite) with Java 7 release 21.
However, paths returned with replacement characters are not going to be very useful. This seems like a fundamental limitation of Java because it has decided to use Unicode paths uniformly even if the underlying OS can support a more general path model.
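A quick sketch of why such paths are of little use, in Python 3 syntax. It assumes the names are decoded the way a UTF-8 decoder with replacement would do it, which is only a stand-in for whatever the JVM does, and the sample names are made up:

```python
def decode_lossy(raw):
    """Decode a byte filename, replacing undecodable bytes with U+FFFD."""
    return raw.decode("utf-8", "replace")


first = decode_lossy(b"caf\xe9")
second = decode_lossy(b"caf\xe8")

# Two distinct on-disk names collapse into the same string...
assert first == second == "caf\ufffd"
# ...and re-encoding does not give back either original name.
assert first.encode("utf-8") not in (b"caf\xe9", b"caf\xe8")
```

Once the replacement character is in the name, there is no way to ask the OS for the original file again.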