On Tue, Sep 28, 2010 at 9:29 PM, Michael Foord
<fuzzyman at voidspace.org.uk> wrote:
> On 28/09/2010 12:19, Antoine Pitrou wrote:
>> On Mon, 27 Sep 2010 23:45:45 -0400
>> Steve Holden<steve at holdenweb.com> wrote:
>>> On 9/27/2010 11:27 PM, Benjamin Peterson wrote:
>>>> Tokenize only works on bytes. You can open a feature request if you
>>>> desire.
>>>>
>>> Working only on bytes does seem rather perverse.
>>>
>> I agree, the morality of bytes objects could have been better :)
>>
> The reason for working with bytes is that source data can only be correctly
> decoded to text once the encoding is known. The encoding is determined by
> reading the encoding cookie.
>
> I certainly wouldn't be opposed to an API that accepts a string as well
> though.
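
To make the point about the cookie concrete, here is a minimal sketch
using the public tokenize.detect_encoding function. The source bytes are
a made-up example (Latin-1 text with a coding cookie on the first line),
not taken from any real file.

```python
import io
import tokenize

# Hypothetical file contents: Latin-1 bytes with an encoding cookie.
source_bytes = (
    b"# -*- coding: latin-1 -*-\n"
    b"name = 'caf\xe9'\n"
)

# detect_encoding reads at most two lines looking for a BOM or a cookie.
# It returns the encoding name plus the raw lines it consumed.
encoding, consumed = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)

# Only once the encoding is known can the bytes be decoded to text.
source_text = source_bytes.decode(encoding)
```

Decoding the same bytes as UTF-8 would fail on the \xe9 byte, which is
why the tokenizer wants to see the raw bytes first.
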
A very quick scan of _tokenize suggests it is designed to support
detect_encoding returning None to indicate the line iterator will
return already decoded lines. This is confirmed by the fact that the
standard library uses it that way (via generate_tokens).
An API that accepts a string, wraps a StringIO around it, then calls
_tokenize with an encoding of None would appear to be the answer here.
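
As a rough sketch of what such an API might look like, the helper below
goes through generate_tokens rather than calling the private _tokenize
directly. The name tokenize_str is hypothetical and is not an existing
function in the tokenize module.

```python
import io
import tokenize

def tokenize_str(source):
    """Hypothetical helper: tokenize source code that is already a str."""
    # generate_tokens takes a readline callable that returns str lines,
    # so no encoding detection is involved.
    return tokenize.generate_tokens(io.StringIO(source).readline)

tokens = list(tokenize_str("x = 1\n"))
names = [tokenize.tok_name[tok.type] for tok in tokens]
```

Because the input is already text, the resulting stream has no leading
ENCODING token, unlike the one produced by the bytes-based
tokenize.tokenize.
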
A feature request on the tracker is the best way to make that happen.
Cheers,
Nick.
--
Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia