darthcamaro writes "Some programming languages just move on to major version numbers, leaving older legacy versions (and users) behind, but that's not the plan for Python. Python 2.6 has the key goal of trying to ensure compatibility between Python 2.x and Python 3.0, which is due out in a month's time. From the article: 'Once you have your code running on 2.6, you can start getting ready for 3.0 in a number of ways,' Guido van Rossum said. 'In particular, you can turn on "Py3k warnings," which will warn you about obsolete usage patterns for which alternatives already exist in 2.6. You can then change your code to use the modern alternative, and this will make you more ready for 3.0.'"

If you move to 3.0, unless you have those changes already, it just won't work.

...which is why some heavy python users, myself included, aren't going to use 2.6 or 3.0. I have huge amounts of python in operation, and the very last thing I'm going to do is break any of it with an incompatible language that happens to slightly resemble python (no matter who wrote it, and no matter what they call it, it isn't python if it can't run mundane python code.)

...which is why some heavy python users, myself included, aren't going to use 2.6 or 3.0. I have huge amounts of python in operation, and the very last thing I'm going to do is break any of it with an incompatible language that happens to slightly resemble python (no matter who wrote it, and no matter what they call it, it isn't python if it can't run mundane python code.)

"slightly resemble python"? Python 3.0 code looks just like the Python that's been around for years. Maybe there's some handy new syntax (with), but it's still Python.

This is not about fundamentally changing Python. This is about cleaning up warts, some of which have been around since Python 1.x.

If you're going to modify a language, you *must* do it in a compatible manner, otherwise what you're doing is making a new language that will require an entirely new community. Names notwithstanding, and resemblance beyond incompatibilities notwithstanding.

From what I've seen, the Python devs have put together about the best possible migration path while still actually making the changes that need to be made.

Here's the picture, in case it's not clear: Python 2.6 is just as backwards compatible as the other 2.x releases. Which is to say that porting from 2.5 to 2.6 is pretty trivial. I'd expect any actively used and maintained library to be 2.6 compatible within weeks (and a great many probably didn't break at all).

2.6 lets you use many of 3.0's features that don't break compatibility (and there are many). It also has a warnings mode to help you spot 3.0 incompatible code. And it lets you selectively turn on 3.0 features within a module.

Want to start using the new print function?

from __future__ import print_function

Voila! The print keyword goes away and you have the new print function. Certain bits of new Python 3.0 syntax work now as well:

try:
    1/0
except ZeroDivisionError as e:
    pass

The "as e" bit is new.
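A minimal sketch of the two features together, written so that it runs on Python 3 as-is. On 2.6 the print part would need the future import shown in the first comment. The helper name safe_divide is made up for illustration.

```python
# On Python 2.6, enable the function form first (must be the first statement):
#   from __future__ import print_function
import io

def safe_divide(a, b):
    """Return a / b, or None if b is zero (hypothetical helper)."""
    try:
        return a / b
    except ZeroDivisionError as e:  # "as e" replaces the old "except X, e" form
        print("division failed:", e)
        return None

# print() is an ordinary function, so it accepts keyword arguments.
buf = io.StringIO()
print("a", "b", "c", sep="-", file=buf)
joined = buf.getvalue()
```

Because print is now a function, things like the separator and the output stream are plain arguments rather than special syntax.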

Finally, there's actually a "2to3" tool that makes many of the changes in an automated fashion.

The single biggest change from a compatibility standpoint is that "foo" is a unicode object in 3.0 and a string (set of bytes) in 2.x. You can even prepare for that switch:

from __future__ import unicode_literals

foo = "foo" # this will be unicode
bar = b"bar" # this is a set of bytes
unibar = bar.decode("utf-8") # get a unicode from the bytes
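As a rough illustration of the same text/bytes split, here is a sketch that runs unchanged on Python 3, where plain quotes are already unicode. On 2.6 it would need the unicode_literals import noted in the comment. The sample word is arbitrary.

```python
# On Python 2.6, add at the very top of the module:
#   from __future__ import unicode_literals
text = "caf\u00e9"                 # text literal: 4 code points
data = text.encode("utf-8")        # bytes: the UTF-8 encoding of that text
roundtrip = data.decode("utf-8")   # back to text

print(len(text), len(data))        # code points vs. bytes
```

The length difference is the whole point: one accented letter is a single code point but two bytes in UTF-8.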

They have put *a lot* of thought into how to make this transition. People will gradually shift to 2.6, just as they did with 2.5. And, over time, they will change to using the new features. They'll probably upgrade to 2.7 (yes, there will be one), and use the new features even more. And eventually their code will just be 3.0 code and the switch will be a no brainer.

No. You can go on all you want about "needed to change" and "autofix" and etc, but the bottom line is that this code presently isn't broken, and I am not about to fix code that isn't broken. It makes no sense on any level; financially, time-wise, or strategically. I have better things to do than refactor my code for entirely arbitrary reasons. Perhaps I just place a different value on my time than you do; that's fine. You should, of course, feel free to do whatever you like.

If not, why wouldn't I just wait for 3.0 and then just fix everything ONCE?

Well, first of all, 2.6 and 3.0 come out at the same time and share many of the same new features... so there's no "just wait for 3.0" possible, it's either/or right now.

The advantage is that if you have a big pile of 2.5 code right now, you can slowly turn on the "use 3.0 style" switches in 2.6 and migrate your code one little switch at a time over a long period of time.

That way, a few years from now when they decide to stop supporting new features in the 2.x path and you really "must have" some new featur

Why not just wait for 3.0 to make the changes? That way you'll only have to test everything once.

Because 2.6 and 3.0 have different objectives.

2.6 is simply the next in the 2.x line and one of the new features is the ability to import 3.0 features from __future__. Otherwise, it'll be no bigger a transition than 2.4 to 2.5 was. Existing programs will likely run without any issues.

3.0 is a bigger transition. It will drop a few things now considered mis-features (if we had known then...). Most current programs will break in 3.0 (but often in ways that are trivially fixable).

These kind of compatibility switches are make-or-break. I'm glad there's Python 2.6 to try to ease the problem, but Py3k means that everybody who publishes python software will all of a sudden have to maintain 2 branches, for Python 2.X line and Python 3.X line.

This isn't the same as one software package having "legacy" and "bleeding edge" branches, because that's their own choice. In this case the underlying language is forcing them to choose.

Honestly, I'm not confident in the economics of such transitions, and believe Py3k will die out.

These kind of compatibility switches are make-or-break. I'm glad there's Python 2.6 to try to ease the problem, but Py3k means that everybody who publishes python software will all of a sudden have to maintain 2 branches, for Python 2.X line and Python 3.X line.

No, they don't "have to" maintain two branches. They can choose to, or they can maintain one (which depends on their particular circumstance); if necessary (if it is an app and not a library) they can just distribute the right interpreter with the app.

This isn't the same as one software package having "legacy" and "bleeding edge" branches, because that's their own choice.

Yeah, actually, it is exactly the same as that, at least as long as bug-fixes and maintenance continues on Python 2.x: the "one software package" being the Python interpreter.

And, yeah, if those maintaining python-based projects choose to maintain Python-2.x and Python-3.x based versions, that will also be an instance of exactly what you say it wouldn't be, as it will still be their own choice.

For whatever reason, people fail to understand python natively supports parallel installs. Furthermore, since python's preferred script magic is "#!/usr/bin/env python", rather than "#!/usr/bin/python", the executing script will use the python that it finds in your path. Additionally, you can also tie python to a specific version as "python2.5". Want a different python? Change your path. A script requires a specific version of python? Change the script to require it. It's one line and trivial. It's at the top of the file, so there's no hunting even.
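Besides pinning the interpreter in the first line, a script can also check at startup which interpreter it actually got. A minimal sketch, assuming a hypothetical minimum version of 2.6; the names REQUIRED and interpreter_ok are invented for this example.

```python
import sys

# Hypothetical minimum; use whatever your script actually needs.
REQUIRED = (2, 6)

def interpreter_ok(version_info=sys.version_info, required=REQUIRED):
    """True if the interpreter is at least `required` (major, minor)."""
    return tuple(version_info[:2]) >= tuple(required)

if not interpreter_ok():
    sys.exit("this script needs Python %d.%d or newer" % REQUIRED)
```

This does not choose an interpreter for you; it only fails early with a readable message when the path lookup picked one that is too old.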

New python releases only pose problems for the uninitiated, the ignorant, or the dumb.

For whatever reason, people fail to understand python natively supports parallel installs. Furthermore, since python's preferred script magic is "#!/usr/bin/env python", rather than "#!/usr/bin/python", the executing script will use the python that it finds in your path. Additionally, you can also tie python to a specific version as "python2.5". Want a different python? Change your path. A script requires a specific version of python? Change the script to require it. It's one line and trivial. It's at the top of the file, so there's no hunting even.

Changing my path is not practical. It's too broad. I'd have to write a shell script wrapper for the
application which did 'env PATH=new_python:$PATH the_real_application "$*"' or something.
And it's not just me; I'd have to communicate this to all other users of the system somehow.
And changing one line of a script is not trivial, if I'm not root.

All this may seem like minor things, but it adds up. And no other good language puts me in situations like
that.

New python releases only pose problems for the uninitiated, the ignorant, or the dumb.

Or those of us who have been around for a while, and seen innocent backwards-incompatible changes
become maintenance nightmares...
Ok, maybe not a nightmare in this case, but an inconvenience and annoyance which
will keep being inconvenient and annoying for years, until the last Python 2.x dependency
goes away.

The best way to judge this would probably be to look at what Linux distributions like Debian want to do about Python 3.0.
They ship one Python as the default (2.4 currently, for Debian) but provide others too. I bet even a change from
2.4 to 2.5 is a major migration for them.

I find it really easy to use virtualenv (sometimes together with zc.buildout) to encapsulate applications and modules. In fact, I tend to cuss when a module that I want to try doesn't offer a way to be easily integrated with virtualenv (such as an egg or at least a subversion checkout with a working setup.py package file).

Changing my path is not practical. It's too broad. I'd have to write a shell script wrapper for the application which did 'env PATH=new_python:$PATH the_real_application "$*"' or something. And it's not just me; I'd have to communicate this to all other users of the system somehow. And changing one line of a script is not trivial, if I'm not root.

You have a system admin problem not a python problem. If you can't run system installed software and your admin refuses to help, you have an admin problem. Making

But some popular environments (Windows, Mac, shared web hosting) identify scripts not by their script magic but instead by their file extension. When I used Google to search for python parallel install windows, I got a whole bunch of results about parallel ports and parallel processing. Does a parallel install work in Linux, Solaris, *BSD, and the like, or is there a recommended way to use it with more popular desktop operating systems such as Windows and Mac OS X? And how

It is the way python simply installs. Each python install places its library into a numbered directory (e.g. python2.4, python2.5). The only thing you may have to change is the "python" proper binary, which is copied from or linked to the numbered python binary.
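To see which install a given "python" command resolved to, the interpreter can report its own binary and library directory. A small sketch; describe_install is a made-up name, and the sysconfig module it relies on only exists in Python 2.7 and 3.2 or later.

```python
import sys
import sysconfig

def describe_install():
    """Collect where this interpreter and its standard library live."""
    return {
        "executable": sys.executable,
        "version": "%d.%d" % sys.version_info[:2],
        "stdlib": sysconfig.get_paths()["stdlib"],
    }

info = describe_install()
for key in sorted(info):
    print(key, "=", info[key])
```

Running this under each installed interpreter in turn should show separate library directories per version, which is what keeps the installs from stepping on each other.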

In other words, each python install should have its own directory structure, which ensures one installation doesn't affect the other. The only other issue is which binary you get when you run "python". Typically "python" proper points to the newest i

The only thing you may have to change is the "python" proper binary, which is copied from or linked to the numbered python binary.

So, under Windows, how do I force a specific .py file to use C:\python24 or C:\python25 or C:\python26 or C:\python30 upon double-click, without changing behavior of other .py files installed on the same machine? And how can I make mod_python read the #! line before loading a module?

I can't speak for OSX but the above is true for the other platforms.

Mac OS X should act like FreeBSD. I'm more concerned about 1. Windows, and 2. shared web hosting using mod_python and the like.

Create multiple users, each with its own path. Use runas features. Some people use wrapper scripts to set their path. Most people seem to prefer the first option as they typically don't use the command line in the first place. If you are a command line guy, you'll likely prefer the second option.

A third option is to use cygwin, which does honor the environment's path and magic. Some people hate cygwin. If you're a command line person on windows, you should s

It's possible that some of the python maintainers prefer that, but the distributions sure as hell don't. "Grab a random python binary that you hit first in my path" does not make for a reliable system. It destroys any idea of security (SELinux, setuid, consolehelper, etc. etc.), and I've seen more than a couple of bugs where applications stupidly used it and then someone wanted to try a newer python in

Another common pattern to use for this, as well as for libraries, is the following:

try:
    import one_way_to_do_it
except ImportError:
    import more_common_way_to_do_it
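A concrete instance of that pattern, as a sketch: the json module joined the standard library in 2.6, and simplejson is the third-party package an older 2.x install would fall back to.

```python
try:
    import json                  # standard library from Python 2.6 on
except ImportError:
    import simplejson as json    # third-party fallback for older 2.x

encoded = json.dumps({"version": 2.6})
print(encoded)
```

This only covers differences that show up at import time. A module that uses syntax the running interpreter does not know fails to compile before any try block in it gets a chance to run.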

But how well does a try block work with things that depend on from __future__ statements that Python 2.5.x doesn't recognize, such as the different print syntax and the different string literal syntax ("8bitchars", u"32bitchars" vs. b"8bitchars", "32bitchars")? From Python 2.5.x's definition of a future statement [python.org]:

A future statement must appear near the top of the module. The only lines that can appear before a future statement are:

Uh, it's almost exactly the opposite of what you're saying. You don't have to have a Python 3.x line; you can just deploy your code on Python 2.6, keep your working application working, and do all your new development and testing with Python 3.x warnings turned on. Then your next release is Python 3.0 compatible; or if you somehow fail to finish the Python 3.x upgrades in time for your next release, you don't have to release on Python 3.x, you can just keep using Python 2.6 even though your code is par

Funny thing is that none of my production code base even runs under 2.6. I'm moving stuff from a very old server to new hardware and so far I've had to move 2.1, 2.2, 2.3 and 2.4 over, and some stuff broke when using the newest version of some of the old version. The result is now I have to spend lots of time maintaining programs that should not have to be maintained. I have never seen a project written in Python that meets its time or financial budget and stuff like this makes me want to ban the language

I'm glad there's Python 2.6 to try to ease the problem, but Py3k means that everybody who publishes python software will all of a sudden have to maintain 2 branches, for Python 2.X line and Python 3.X line.

If it was even slightly hard to install 2 versions of Python at the same time, that might be true. However, that's not the case. I see nothing there that will FORCE a developer to maintain two versions of their Python software.

Most will probably stick with 2.x for now, perhaps trying out 3.x or just importing from future and playing with updating their code. By the time 2.8 is out, insisting on at least 2.6 to run your code will be perfectly reasonable. At that point, start importing from future and actuall

I can assure you that the one Python application I use regularly (trac) cannot be upgraded between minor versions without large-scale upgrades to dependent modules. It was an absolute nightmare upgrading from a machine with 2.2 to 2.3... many hours spent tracking down modules that simply didn't work with 2.3.

Coming from the perl world, having to deal with just one dependency nightmare with Python was enough to entice me to stay in the perl world...

Please explain. If you used DBAPI standard interfaces, it's unlikely anything you were using broke or changed. Most DBAPI packages do a pretty good job (all I've seen) of explaining which interfaces comply with the DBAPI spec and which interfaces don't. My guess is you didn't pay attention. That's a coder problem, not a language problem.

Don't mod something down just because you disagree. When I have mod points, I never downmod things out of disagreement. This is a legitimate concern over the python strategy. They have benefited from their flexibility (the language at a given instant I will give is relatively low on quirks as they are rethought and replaced, whereas perl is chock-full of quirks that you must learn to live with), but there is a price.

Nothing is perfect. Nothing is without flaws. To achieve one end, something almost alwa

These changes are NOT earth-shattering. 2.6 is mostly just going to add a few new features, most important being the with statement. Most code written using Python idioms will be fine under 2.6 and 3.0. Now, if you tried to write Java-esque or C-esque code under Python, you might run into issues. Even then, I doubt it. They've been deprecating features for awhile, and 3.0 is probably the point at which they'll be yanked...you've only had a year or two of DeprecationWarnings.

I'm not sure why people whine about a language evolving. Retain backwards compatibility to a fault and you end up with C++, which is crippled by C-isms. You either know your code well enough that you could make the small incremental changes along the way, or you simply don't upgrade.

Python most needs sane standard libraries. It is far too much of a "let's throw this in there" with three different naming conventions and no package organization. It is a shame, because the language itself is pretty powerful in the right hands.

That would bring the same problems as the transition from PHP 4 to PHP 5. How would I deploy my product to end users who have installed Python 3.x as the system-wide handler for .py files? Will Python Software Foundation recommend the use of an extension such as .py2? Conversely, if I do take advantage of Python 3.x, how would I deploy to end users who still use 2.x?

For Windows, it's theoretically possible to write a dispatch program to associate with .py files that looks for a version-specific shebang, and tries to find the appropriate version of Python on the system. In the case of Apache, I think you're stuck with whichever version your copy of mod_python was compiled against.
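The shebang-reading half of such a dispatcher could look roughly like this. It is only a sketch: the function name, the regular expression, and the choice to ignore interpreter arguments are all assumptions made for illustration, and locating the matching install is left out entirely.

```python
import re

# Illustrative pattern: "python", optionally followed by a version such as
# "2" or "2.5", at the end of the first line.
_SHEBANG = re.compile(r"^#!.*?\bpython(\d+(?:\.\d+)?)?\s*$")

def requested_version(first_line):
    """Return e.g. '2.5', '' if no version is given, None if not a Python shebang."""
    match = _SHEBANG.match(first_line.strip())
    if match is None:
        return None
    return match.group(1) or ""

print(requested_version("#!/usr/bin/env python2.5"))
```

A real dispatcher would then map the returned string to an install directory such as C:\python25 and fall back to a default when the string is empty.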

In the case of Apache, I think you're stuck with whichever version your copy of mod_python was compiled against.

So how should publishers of software that runs on leased shared web hosting work around Python version incompatibilities? In PHP, it's normally done by associating .php4 to PHP 4, .php5 to PHP 5, and .php to one or the other based on the hosting provider's preference (or, if you're really lucky, the user's preference in the hosting control panel, like on Go Daddy). But I haven't seen any mention of extensions like .py2 or .py3.

Reading the release, they have decided to really push 16-bit strings (they call this "Unicode" but it really is what is called UTF-16). I think this is a serious mistake.

The proper solution is to use 8-bit strings, but any functions that care (such as I/O) should treat them as being UTF-8. Most functions do not care and thus the treatment of "Unicode" and "bytes" are the same.

The problem with UTF-16 is you cannot losslessly convert a string that *might* be UTF-8 to UTF-16 and then back again. This is because any illegal UTF-8 byte sequences will be lost or altered. This is a MAJOR problem for code that wants to process data that is likely to be text but must not be altered under any circumstances; in effect such programs are forced to be ASCII-only, even though UTF-8 is purposely designed so that such programs could display all the Unicode characters. Note that bad UTF-16 (i.e. with mismatched surrogate pairs) can be losslessly converted to UTF-8 and back.

This has been a real pain so far in our use of Python, and I am quite alarmed to see that they are changing the meaning of plain quotes in 3.0 to "Unicode". This is really a serious step backwards, as we will be forced to tell anybody using our system to put 'b' before all their string constants and I suspect there will be a lot less automatic conversion of these strings to unicode when we want to display them. Note that Qt is also causing a lot of trouble here too.

The problem is that there are three kinds of string-like objects in Python: UTF-16 strings, ASCII strings, and uninterpreted arrays of 8-bit bytes. Python 2.5 sort of supports all 3, with "array of bytes" the least well supported. Since this is a language without declarations, the semantics of this gets messy.

The most common problem was that functions like ".read()" yielded strings, not arrays of bytes. This follows C standard library semantics, but is a bad fit to Python. In 3.0, ".read()" yields an array of bytes, not a string. If the data read is to be converted to a string, "decode" is required. That's the right answer.

This is consistent with modern thinking about data representation. Consider SQL, which makes a similar distinction between "TEXT" and "BLOB".

Interesting. I was afraid they were making all these functions return strings. If they are returning bytes as well it would certainly make things a lot better. However I would expect them to have the same trouble I am having.

Let's assume read returns a string of bytes. What I am worried about is that the following example text will not work as expected:

if file.read()=="utf8 string"...

I expect this will automatically convert the result of file.read() to UTF-16 and then do the comparison. This will

From What's new in Python 3.0 [python.org]:
The str and bytes types cannot be mixed; you must always explicitly convert between them, using the str.encode() (str -> bytes) or bytes.decode() (bytes -> str) methods.
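What that rule means for a comparison like the one above can be sketched on Python 3. The variable raw stands in for the result of a binary-mode read(); no file is involved.

```python
raw = b"utf8 string"     # stand-in for what a binary-mode read() returns
text = "utf8 string"

# No implicit conversion: bytes and str never compare equal...
same_without_decode = (raw == text)
# ...until one side is converted explicitly.
same_after_decode = (raw.decode("utf-8") == text)

# Mixing the two in an operation fails instead of guessing an encoding.
try:
    raw + text
    mixed_error = None
except TypeError as e:
    mixed_error = e

print(same_without_decode, same_after_decode, type(mixed_error).__name__)
```

So the comparison neither converts to an internal encoding nor raises; it is simply false, which is arguably the hardest of the three outcomes to notice in retrofitted code.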

That's the right way to do it, but I agree that as a retrofit to existing code, it's a headache.

Worse, it's a problem that's detected at run time, not compile time, at least with the CPython implementation.

Well in a lot of ways that (not doing any automatic conversion) is the only correct solution if they really want plain quotes to be Unicode and not bytes/utf-8. It will be such a pain to fix existing code, though, that I would not have thought they would do that.

It might be helpful to run your programs through one of the more advanced Python compilers, like Shed Skin or PyPy, if and when they get converted to Python 3.0. They have implicit type analysis, and if you get data from "read" and apply a string operation without conversion, they will usually report that as a compile-time error. So you may get to find most or all of the errors up front. CPython, being a naive interpreter, will happily compile code that will always raise an exception at run time.

I'm using python in an environment with lots of external strings (from the web, from files), and the current mechanism is horrible. I end up with non-ASCII data in strings a lot if I'm not extremely careful with thinking about which string is ASCII and which is uninterpreted bytes, and have spent endless hours debugging silly decoding problems.

If nothing else, having the read() methods return bytes and dealing with strings as unicode objects (regardless of internal encoding, I doubt that the python spec for

Spoken like somebody that's never had to deal with encoding issues. Using UTF-8 internally is fine, but exposing it to the programmer is insane and error-prone. And if the programmer then proceeds to manipulate that raw byte buffer as a string, he's an idiot.

The proper solution is to use 8-bit strings, but any functions that care (such as I/O) should treat them as being UTF-8. Most functions do not care and thus the treatment of "Unicode" and "bytes" are the same.

You might not be aware of this, but computers are used for more than just transmitting text. I don't want my binary streams being rewritten to gibberish because some I/O routine was written to be too clever.
Furthermore, not every system uses UTF-8. Some may even need to send data over a *gasp* network! Good luck getting every other computer in the world to start using UTF-8 immediately.

The problem with UTF-16 is you cannot losslessly convert a string that *might* be UTF-8 to UTF-16 and then back again. This is because any illegal UTF-8 byte sequences will be lost or altered.

If you try to convert bytes that aren't in UTF-8 using a UTF-8 codec, an error will be raised. This behavior is proper -- if you don't know what format your input is in, there's no way to perform text-based operations on it.
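A short sketch of that strict behaviour on Python 3. For comparison it also shows the "surrogateescape" error handler, which was added later (Python 3.1 onward) and is opt-in; it is included here only because it bears on the lossless round-trip complaint.

```python
bad = b"abc\xff"    # 0xFF can never appear in well-formed UTF-8

try:
    bad.decode("utf-8")
    strict_error = None
except UnicodeDecodeError as e:
    strict_error = e

# Python 3.1 and later: undecodable bytes become lone surrogates, so the
# original bytes can be recovered exactly when encoding back.
escaped = bad.decode("utf-8", "surrogateescape")
restored = escaped.encode("utf-8", "surrogateescape")

print(type(strict_error).__name__, restored == bad)
```

The default codec refuses the input outright, while the opt-in handler trades strictness for a byte-exact round trip.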

This has been a real pain so far in our use of Python, and I am quite alarmed to see that they are changing the meaning of plain quotes in 3.0 to "Unicode".

Every developer I know uses Unicode strings already. The new behavior is just one less character to type in front of literals.

This is really a serious step backwards, as we will be forced to tell anybody using our system to put 'b' before all their string constants

Otherwise said as: "We're too stupid to fix the glaring encoding errors in our product, so we'll just use bytes everywhere and pretend it's all working".
Also, Unicode strings in Python are implemented with either UTF-16 or UCS-4 depending on platform.

You might not be aware of this, but computers are used for more than just transmitting text. I don't want my binary streams being rewritten to gibberish because some I/O routine was written to be too clever

Thank you for explaining exactly why I want UTF-8 to be used, while thinking you were arguing against it.

Data is NOT just text. Therefore we should not be mangling it because we think it is text. We have enough trouble with MSDOS inserting \r characters. This crap is a million times worse.

Spoken like somebody that's never had to deal with encoding issues. Using UTF-8 internally is fine, but exposing it to the programmer is insane and error-prone. And if the programmer then proceeds to manipulate that raw byte buffer as a string, he's an idiot.

The compiler will turn "unicode" into the utf-8 encoding. The programmer does not see \xnn sequences of the utf-8 bytes. Try some modern compilers with utf-8 support some day before you say anything stupid again.

So, what if I'm from the UK using an editor that uses ASCII and I insert a £ into my python code or pull one from a data file? That's at code point 163 in ISO/IEC 8859-1... but if it's assumed to be utf-8, it'd be part of a multi-byte character because the first bit is set.

If you actually have the byte 163 in the file, it almost certainly will be an invalid UTF-8 encoding (it would have to be directly preceded by an accented letter in ISO-8859-1 for it to look like legal UTF-8).
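The byte-level picture for the pound sign can be checked directly. A minimal sketch on Python 3; the variable names are arbitrary.

```python
pound = "\u00a3"                          # the pound sign, code point 163

latin1_bytes = pound.encode("latin-1")    # one byte: 0xA3
utf8_bytes = pound.encode("utf-8")        # two bytes: 0xC2 0xA3

# A lone 0xA3 is a continuation byte with no lead byte, so strict UTF-8
# decoding rejects it.
try:
    latin1_bytes.decode("utf-8")
    lone_byte_is_valid = True
except UnicodeDecodeError:
    lone_byte_is_valid = False

# Preceded by 0xC2 (A-circumflex in Latin-1) the pair happens to be valid UTF-8.
accidental = b"\xc2\xa3".decode("utf-8")

print(latin1_bytes, utf8_bytes, lone_byte_is_valid)
```

This is why misread Latin-1 usually surfaces as a decoding error rather than as silently wrong text: only a few byte pairs happen to form legal UTF-8.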

One of the big reasons why I want the strings to remain bytes is because of exactly this. Yes the compiler can convert, but, believe it or not, we really do read text produced by other programs, often with incorrect UTF-8 encoding. Only by leaving it as bytes can we properly analyze this. It is rel

If your editor inserted the UTF-8 encoding of two bytes (0xc2,0xa3 I think) the result should be those same two bytes. However I/O routines when told to print the string should then decode the UTF-8 and produce the pound sign. If the compiler is producing something other than UTF-8 (such as current Python does if you put a 'u' before the quote) then the compiler does the conversion, not the I/O routine. My main argument is that I think this is a job for I/O, not the c

How, when, and by whom is this decision to turn on --with-wide-unicode (UCS-4) made for each platform? What Google keywords should I have used?

Well that obviously varies by the platform. Under Debian GNU/Linux the decision would be made by the maintainer of the python package. But does it really matter? On what platform are you forced to use the python provided by the system vendor, rather than your own package?

On what platform are you forced to use the python provided by the system vendor, rather than your own package?

On platforms that verify digital signatures on executables and where certificates aren't handed out like candy. But still, for applications deployed in Europe and the Americas (not east Asia), UCS-2/UTF-16 is still significantly larger than UTF-8.

Python does not use UTF-16 strings; it uses UCS-2 strings. The difference is that in UCS-2, every character is represented by exactly two bytes, while in UTF-16, some characters, those outside Plane 0, are represented by a "surrogate" pair of two 16-bit code units, totaling four bytes. UCS-2 does not provide any representation for characters outside the BMP. In other words, UCS-2 is a straightforward fixed length encoding, while UTF-16 is a more complex variable-length encoding.

Python can in fact use either of two internal representations for text: UCS-2 or UTF-32 = UCS-4. If you give the option --enable-unicode=ucs4 to configure when building Python, you will get a Python that supports all of Unicode rather than just the BMP.

In fact I am better informed than you are. When not compiled to use UCS-4, Python uses what is properly called UCS-2, with half-baked extensions for treating it as UTF-16. Certain functions know about surrogate pairs, such as those that convert between UTF-8 and the internal representation. However, such basic functions as len do not know about surrogate pairs. Try giving a character outside the BMP as the argument to len. It will return 2, not 1.
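The numbers involved can be reproduced without relying on how a particular build stores strings. In this sketch the code-unit counts are computed by encoding explicitly; what len itself reports depends on the interpreter (2 on a narrow 2.x build as described above, 1 on Python 3.3 and later). The sample character is arbitrary.

```python
ch = "\U0001D11E"   # MUSICAL SYMBOL G CLEF, outside the BMP

code_points = len(ch)                            # build-dependent, see above
utf16_units = len(ch.encode("utf-16-le")) // 2   # 16-bit code units
utf8_bytes = len(ch.encode("utf-8"))
utf32_bytes = len(ch.encode("utf-32-le"))

print(code_points, utf16_units, utf8_bytes, utf32_bytes)
```

Two UTF-16 code units for one code point is exactly the surrogate pair under discussion; whether len should count units or code points is the disagreement.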

The fact that len returns 2 for a non-BMP character indicates that UTF-16 *is* being used. len is returning the number of words that the string occupies. This is a useful number (it indicates how much memory is needed to copy the string). The number of "characters" is completely useless, it causes crashes if you think it has something to do with memory usage, and it is useless for analyzing text unless you believe all the letters in Unicode are like fixed-pitch Latin letters.

Why is it so important that "number of characters" (actually number of Unicode code points) is O(1), but "number of words", "number of sentences", "number of lines", "number of glyphs", and a zillion other possible questions are O(n)?

This is the basic question that everybody here refuses to answer. They just blindly state that "it is really important for it to be fast to figure out the 'number of characters'"

Please give an actual real example of source code where you *use* the "number of characters". You ar

I think the real lesson here is that byte sequences and character sequences are not the same. Every character sequence can be encoded to a byte sequence (by using an appropriate encoding), and every byte sequence can be converted to a character sequence (by means of some decoding), but they are fundamentally different things. I wonder if we wouldn't be better off making this explicit, and providing distinct string (character sequence) and blob (byte sequence) types.

The fact that some code can interpret that byte sequence and draw something on the screen that the user thinks of as "text" is completely irrelevant and should not be a fundamental datatype of a programming language. This should be part of the code that draws the text. Imagine if every other type of data, such as image pixels, or sound samples, had a different IO routine and you could never read a file with the wrong routine because the conversion was lossy.

The real problem is that everybody's mind has been polluted by decades of ASCII where there was no difference between characters and bytes. All I can suggest is to try to think of text as words or sentences. Nobody would suggest that it would be good to make all words use the same amount of storage, or that it is important that you be unable to split a string except at word boundaries. But there has been so much use of ASCII that people think this is important for "characters".

I also believe there is a serious political-correctness problem. Otherwise logical programmers are consumed with guilt because Americans get the "better" short encodings, and therefore feel they have to punish themselves by making the conversion to I18N as painful as possible so that Americans have just as much trouble as anybody else. The fact that they have actually made I18N far harder for everybody and thus actually discouraged it is the ironic result of this guilt.

If "characters" are important, then the combining characters and invisible formatting ones in Unicode mean that UTF-32 and every other way of encoding Unicode is useless as well; they are *all* variable length. It is in fact far preferable to use UTF-8, as this forces programmers to understand variable length right away.

I would also like a really clear explanation as to why "characters" are important, but "words", "sentences", "paragraphs", "lines", and all kinds of other structures that most readers of tex

Reading the release, they have decided to really push 16-bit strings (they call this "Unicode" but it really is what is called UTF-16). I think this is a serious mistake.

The proper solution is to use 8-bit strings, but any functions that care (such as I/O) should treat them as being UTF-8. Most functions do not care and thus the treatment of "Unicode" and "bytes" are the same.
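A small sketch of what "most functions do not care" could look like, written with Python 3's bytes type. The sample data is invented; the point is only that splitting and joining on an ASCII delimiter works on UTF-8 bytes without decoding, and only the display step interprets them.

```python
# Bytes as they might arrive from a file or socket (sample data, made up).
raw = "na\u00efve caf\u00e9 \u65e5\u672c\u8a9e".encode("utf-8")

# Splitting on an ASCII space never cuts a multi-byte sequence, because
# bytes below 0x80 do not occur inside UTF-8 multi-byte sequences.
words = raw.split(b" ")
rejoined = b" ".join(words)

# Only the step that actually needs characters (here, display) decodes.
shown = [w.decode("utf-8") for w in words]
print(shown)
```

This only holds for ASCII delimiters; anything that has to reason about characters, such as case folding or width, still needs a decode somewhere.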

I'm going to try once more, slightly differently. Two other people apparently have tried and failed.

Python 3.0's handling of strings is basically the same as Java's, because it has proven to work quite well there.

For webapps, and the rules may be a little different on the desktop, "best practices" in Python for some time have been that you use unicode objects everywhere internally when you are representing text. When you hit a boundary (a file on disk, the net), you encode that unicode string into whatever encoding makes sense (often UTF-8). So far, so good, I hope?
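Roughly, that practice looks like the sketch below. The function name `shout` and the in-memory streams are stand-ins for a real file or socket; the shape is what matters: decode on the way in, work on text, encode on the way out.

```python
import io

def shout(incoming, outgoing, encoding="utf-8"):
    """Read bytes, upper-case the text, write bytes back out."""
    text = incoming.read().decode(encoding)   # boundary in: bytes -> text
    result = text.upper()                     # internal work happens on text only
    outgoing.write(result.encode(encoding))   # boundary out: text -> bytes

# BytesIO objects stand in for files opened in binary mode.
source = io.BytesIO("caf\u00e9".encode("utf-8"))
sink = io.BytesIO()
shout(source, sink)
print(sink.getvalue())
```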

Python's internal representation of unicode objects is only relevant in that you need it to support whatever code points you care about. I don't think there are any code points that you can represent in UTF-8 that Python will screw up after decoding/encoding. I'm sure there are many people who would be interested to see such a test case.

If you have a bunch of bytes that *might* be UTF-8, you're screwed. "process data that is likely to be text but must not be altered"? What do you mean by text? 7-bit ASCII? UTF-8? And where is the text coming from? Unless you tell Python the encoding of the file, you're going to get bytes out, not unicode objects.

The whole point is that Python unicode objects know how to represent code points. If you get a set of bytes from somewhere you *have* to know what encoding it is in order to be able to treat it as a bunch of text characters. Python unicode objects will not be "bad UTF-16". How they're stored is not generally important. What's important is that Python internally keeps track of the code points and will either successfully convert to whatever encoded sequence of bytes you want or it will raise an exception because the encoding you've chosen doesn't have one of the characters in your string.
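Both outcomes can be seen in a few lines. This is just an illustration in Python 3 syntax with a made-up sample string: UTF-8 can carry every code point in it, while Latin-1 has no euro sign and so the encode fails loudly instead of mangling the text.

```python
text = "caf\u00e9 \u20ac5"  # e with acute, then a euro sign

utf8 = text.encode("utf-8")            # UTF-8 can represent every code point here
round_trip = utf8.decode("utf-8")      # and decoding gives the same text back

try:
    text.encode("latin-1")             # Latin-1 has the accented e but no euro sign
    failed = False
except UnicodeEncodeError as exc:
    failed = True
    print("cannot encode:", exc.reason)
```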

Python 3.0 makes this all clearer. When you talk about a "string", you're talking about a bunch of unicode characters. Anything else is a collection of bytes.

By the way, you can specify what encoding a Python source file is in so that your string literals are all properly decoded.

The proper solution is to do what they did: hide from the programmer what internal format is used for strings. The only time programmers should know about the encoding is when they themselves explicitly select an encoding so that they can turn a bunch of bytes into a string or when they're sending the string out into the world as a bunch of bytes. Encode and decode explicitly at the edges. Internally, hide the implementation details. It's just basic OO.

Hiding is only good if it actually works. Once you leak information about the internal encoding to the program, you have lost. Such as the length of a one-character string sometimes being 2 -- have one program depend on that, and you can never change the supposedly hidden encoding. Of course no one would be stupid enough to return 2 when asked for the length of certain one-character strings...

Here's the thing: that only happens in Python if you go outside the BMP, but even in the best character encoding scheme, unless you normalize, you can't tell if é is U+00E9 (Latin small letter e with acute) or e plus U+0301 (Combining acute accent). So, you can never really trust the length of a Unicode string.
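The two spellings are easy to compare with the standard unicodedata module. A minimal sketch in Python 3 syntax:

```python
import unicodedata

composed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT, two code points

print(len(composed), len(decomposed))   # lengths differ
print(composed == decomposed)           # and the strings compare unequal

# Normalizing both to the same form makes them comparable.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)
print(nfc == composed, nfd == decomposed)
```

So even with a fixed-width encoding, the length depends on which form the text happened to arrive in.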

Would it be better if Python reported the length of non-BMP characters correctly? Yes. But, given how funky Unicode can be, it's an understandable trade-off to make.

The problem with UTF-16 is you cannot losslessly convert a string that *might* be UTF-8 to UTF-16 and then back again. This is because any illegal UTF-8 byte sequences will be lost or altered.

Then set strict conversion, which will raise UnicodeError for any nonconforming byte sequences. My problem with UTF-16 is how it bloats in-memory databases of mostly-ASCII text by a factor of nearly 2 (or 4 if Python is compiled with UTF-32 to handle hieroglyphics and ancient Chinese).
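For illustration, here is what both halves of that look like in Python 3 syntax; the sample bytes and text are made up. Strict decoding raises UnicodeDecodeError (a subclass of UnicodeError) on nonconforming input, and encoding the same ASCII-only text three ways shows the factor of 2 and 4 for the encoded data itself (object overhead not counted).

```python
bad = b"abc\xff\xfedef"  # 0xFF and 0xFE never appear in valid UTF-8

try:
    bad.decode("utf-8", errors="strict")  # "strict" is also the default
    rejected = False
except UnicodeDecodeError as exc:
    rejected = True
    print("invalid byte at offset", exc.start)

# Encoded size of ASCII-only text under each encoding.
sample = "mostly ASCII text " * 100
size = {codec: len(sample.encode(codec))
        for codec in ("utf-8", "utf-16-le", "utf-32-le")}
print(size)
```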

Throwing exceptions on bad UTF-8 strings is great if they are strings you control. It is not useful for strings provided by the outside environment. I can assure you that users want that data copied even if it contains errors, and they only want to see an error message when the data is interpreted.
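For what it's worth, later Python releases (3.1 and up, via PEP 383) added a "surrogateescape" error handler that behaves much like this request. It postdates the release being discussed, so treat the sketch below as an illustration of the idea rather than something 3.0 offers; the sample bytes are made up.

```python
raw = b"report-\xff\xfe.txt"  # not valid UTF-8

# Undecodable bytes are smuggled through as lone surrogates; no exception.
text = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
copied = text.encode("utf-8", errors="surrogateescape")

try:
    text.encode("utf-8")  # strict: the error surfaces only when interpreted
    complained = False
except UnicodeEncodeError:
    complained = True
```

The data is copied unchanged, and the complaint only appears when something insists on treating it as well-formed text.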

The best that could be done with exceptions is make some kind of union of the UTF-16 and the bytes (or perhaps convert the bytes by just padding each out to 16 bits), along with a flag indicating if the data conve

People expect a string to be a sequence of characters. Please notice the first word in that sentence.

"People" are not computers. "people" LOOK at the display. People are not trying to copy the data literally from one place to another or do comparisons of strings or read files that might (horrors) not contain correct UTF-8 data. There is no reason to mangle the data until the very last moment before it is put on the display.

I can quite confirm that if you have more than one way to represent the same sequence

Many essential third party libraries need to be converted for Python 3.0. I need M2Crypto (SSL support) and MySQLdb (MySQL support), neither of which is ready for Python 3.0, and neither of which has been updated in the last year or so.

My guess is that it will be three years before stock mainstream Linux distros come with Python 3.0 and a set of libraries that work with it.

This is quite true, but sort of irrelevant. Even the core developers on Python-dev have been seen to state on more than one occasion that they don't expect Python 3.0 to be the "standard" for a period of time that will stretch to years: one? three? The specifics don't exactly matter.

That's why they've released Python 2.6 and Python 3.0 in parallel (although 3.0 was recently delayed a little, the two have been developed hand in hand); they fully expect to maintain the 2.x line for a while,

3.0rc1 (beta) [python.org] is already available and has been for some time now.
The advantage of 2.6 is not as much its backward-compatibility but its ability to tell you exactly what needs to change (via runtime warnings) for 3.0 without actually breaking your code. I've been using both for months now, so this article isn't exactly hot news.

Technically the correct term would be readier, but that sounds a little awkward to some people. Generally the rule is:
One Syllable=[adjective]er
More than one Syllable=more [adjective]
Unfortunately very few people tend to adhere to this. They usually randomly pick one method or the other, or worse, they use both. (more readier).

I like Python for a whole lot of other reasons too. I am a programming language snob. I used to write device drivers in C, I respected the power of the language and how unforgiving it was. My first reaction to Python was "layout is part of the language? Ha!". But credit to me, I tried it out properly, and fuck me, it's fun! I needed to carry out some very repetitive operations on a web interface and naturally I didn't want to spend hours clicking buttons on a website. I thought to myself, I wonder how hard

(Mind you, their online documentation could be better - PHP's site for example, is so much friendlier).

They're actually hard at work on that problem too. In addition to Python 2.6 being released, the Python documentation is now generated using Sphinx [pocoo.org]. See for example the new tutorial [python.org] output. Big WTF the first time I saw it, but it's a decent improvement with more in the pipeline.