On 28.03.2012 19:43, Peter Daum wrote:
> As it seems, this would be far easier with python 2.x. With python 3
> and its strict distinction between "str" and "bytes", things gets
> syntactically pretty awkward and error-prone (something as innocently
> looking like "s=s+'/'" hidden in a rarely reached branch and a
> seemingly correct program will crash with a TypeError 2 years
> later ...)
It seems that you're mixing things up wrt. the string/bytes
distinction; it's not as "complicated" as it might seem.
1) Strings
s = "This is a test string"
s = 'This is another test string with single quotes'
s = """
And this is a multiline test string.
"""
s = 'c' # This is also a string...
all create string objects. How Python stores them internally is none of
your concern (it's rather complicated anyway, at least with the upcoming
Python 3.3). Processing a string basically means working on the
natural-language characters it contains. Python strings can store
(pretty much) every code point that Unicode allows, including
surrogates. When the interpreter/compiler reads string literals from a
source file, the source encoding defines how the bytes in that file are
interpreted as Unicode code points to construct the string object you
use to access the data. In Python 3 that encoding is UTF-8 unless an
encoding declaration at the top of the file says otherwise.
There is no such thing as a type for a single character; single
characters are simply strings of length 1 (and so indexing also returns
a [new] string object).
Single and double quotes behave identically.
The internal encoding used by the Python interpreter is of no concern
to you.
2) Bytes
s = b'this is a byte-string'
s = b'\x22\x33\x44'
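The above define bytes objects. Think of the bytes type as an array of
8-bit integers: a buffer which you can process as a sequence of
fixed-width integers. Reading from a file opened in binary mode (or from
sys.stdin.buffer) gets you bytes and not a string, because Python cannot
automagically guess what format the input is in. In text mode, Python
decodes the input for you, using the encoding you pass to open() or a
locale-dependent default.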
Indexing the bytes type returns an integer (which is the clearest
distinction between string and bytes).
Being able to input "string-looking" data in source files as bytes is a
debatable "feature" (IMHO; see the first example), simply because it
breaks the semantic difference between the two types in the eye of the
programmer looking at source.
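To make the difference between the two types concrete, here is a small
sketch (the sample text and variable names are just mine, picked for
illustration). It shows what indexing returns for each type, and that
mixing them fails right away instead of guessing an encoding:

```python
text = "caf\u00e9"           # str: a sequence of Unicode code points
data = text.encode("utf-8")  # bytes: a sequence of 8-bit integers

print(type(text[0]), text[0])  # indexing a str gives a str of length 1
print(type(data[0]), data[0])  # indexing bytes gives an int
print(len(text), len(data))    # 4 characters, but 5 bytes in UTF-8

# Mixing the two types raises a TypeError rather than silently guessing
# how the bytes should be interpreted.
try:
    data + "/"
except TypeError as exc:
    print("TypeError:", exc)
```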
3) Conversions
To get from bytes to string, you have to decode the bytes buffer,
telling Python what kind of character data is contained in the array of
integers. After decoding, you'll get a string object which you can
process using the standard string methods. For decoding to succeed, you
have to tell Python how the natural language characters are encoded in
your array of bytes:
b'hello'.decode('iso-8859-15')
To get from string back to bytes (say, because you want to write the
natural language character data you've processed to a file), you have to
encode the data in your string, which gets you an array of 8-bit
integers to write to the output:
'hello'.encode('iso-8859-15')
Most output methods will happily do the encoding for you, using a
default encoding. If that happens to be ASCII, you'll get a
UnicodeEncodeError as soon as a character in your string cannot be
represented in that encoding.
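As a rough sketch of both directions (the sample word is my own choice,
nothing from your program): decoding and re-encoding with the same
codec gives back the original bytes, a different codec gives different
bytes, and ASCII cannot represent the non-ASCII characters at all:

```python
raw = b"gr\xfc\xdfe"              # the word "gruesse" with u-umlaut and sharp s, in ISO-8859-15
text = raw.decode("iso-8859-15")  # bytes -> str

print(text == "gr\u00fc\u00dfe")  # the str now holds the real characters

# Encoding with the same codec gives back the original bytes ...
print(text.encode("iso-8859-15") == raw)

# ... while another codec produces a different byte sequence.
print(len(text.encode("utf-8")))

# ASCII has no representation for the two non-ASCII characters.
try:
    text.encode("ascii")
except UnicodeEncodeError as exc:
    print("UnicodeEncodeError:", exc.reason)
```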
If the above doesn't make the string/bytes distinction and usage
clearer, and you have a C# background, check out the distinction between
byte[] (which the System.IO streams get you), and how you have to use a
System.Text.Encoding-derived class to get at actual System.String
objects to manipulate character data. Python's type system wrt.
character data is pretty much similar, except for missing the "single
character" type (char).
Anyway, back to what you wrote: how are you getting the input data? Why
are "high bytes" in there which you do not know the encoding for?
Generally, from what I gather, you'll decode data from some source,
process it, and write it back using the same encoding which you used for
decoding, which should do exactly what you want and not get you into any
trouble with encodings.
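A minimal sketch of that decode/process/encode cycle, assuming the
input really is ISO-8859-15 (the helper name append_slash and the
default encoding are hypothetical, only there for illustration):

```python
def append_slash(raw: bytes, encoding: str = "iso-8859-15") -> bytes:
    """Decode raw, make sure it ends in '/', and encode it again."""
    text = raw.decode(encoding)    # bytes -> str, once, at the input boundary
    if not text.endswith("/"):
        text = text + "/"          # plain str + str, no TypeError possible
    return text.encode(encoding)   # str -> bytes, using the same encoding


# The high byte 0xe4 passes through unchanged.
print(append_slash(b"/tmp/\xe4"))
```

Since all processing happens on str, the "s=s+'/'" from your example
cannot blow up later; the only places that touch bytes are the decode
and the encode.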
--
--- Heiko.