After looking at the preceding code, the provider of the web service
gave me this advice...
"Instead of trying to create a file, take the unzipped bytes and get a
Unicode string of text from it."

If so, I'm not sure how to do what he's suggesting, or if it's really
different from what I've done.

I find that I am able to unzip the resulting data.zip using the unix
unzip command, but the file inside contains some FFFD characters, as
described in this thread...
http://groups.google.com/group/comp.lang.python/browse_thread/thread/4f57abea978cc0bf?hl=en#
I don't know if the unwanted characters might be the result of my
trying to write and unzip a file, rather than unzipping the bytes.
The file does contain a semblance of what I ultimately want -- it's
not all garbage.

Apologies if it's not appropriate to start a new thread for this. It
just seems like a different topic than how to deal with the resulting
FFFD characters.


webcomm wrote:
> Hi,
> In python, is there a distinction between unzipping bytes and
> unzipping a binary file to which those bytes have been written?
>
> The following code is, I think, an example of writing bytes to a file
> and then unzipping...
>
> decoded = base64.b64decode(datum)
> #datum is a base64 encoded string of data downloaded from a web
> service
> f = open('data.zip', 'wb')
> f.write(decoded)
> f.close()
> x = zipfile.ZipFile('data.zip', 'r')
>
> After looking at the preceding code, the provider of the web service
> gave me this advice...
> "Instead of trying to create a file, take the unzipped bytes and get a
> Unicode string of text from it."
>
Not terribly useful advice, but one presumes he she or it was trying to
be helpful.
> If so, I'm not sure how to do what he's suggesting, or if it's really
> different from what I've done.
>
Well, what you have done appears pretty wrong to me, but let's take a
look. What's datum? You appear to be treating it as base64-encoded data;
is that correct? Have you examined it?

f = open('data.zip', 'wb')

opens the file data.zip for writing in binary. Not as a zip file, you
understand, just as a regular file. I suspect here you really needed

f = zipfile.ZipFile('data.zip', 'w')

Now, of course, you need to remember what zipfiles contain. Which is
other files. So the data you *write* to the zipfile has to be associated
with a filename in the archive. Of course you don't have the data in a
file, you have it in a string, so you would use

f.writestr("somefile.dat", decoded)
f.close()

You have now written a zip file containing a single "somefile.dat" file
with the decoded base64 data in it. Open it with Winzip or one of its
buddies and see if anyone barfs.
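
For reference, a minimal sketch of the writestr approach described above, written for Python 3 and using an in-memory buffer instead of data.zip so nothing is left on disk. The member name "somefile.dat" comes from the post; the payload bytes are a made-up stand-in for the decoded data. Note that this wraps bytes inside a new archive; it does not unpack bytes that already are a zip archive.

```python
import io
import zipfile

# Stand-in for the base64-decoded bytes (hypothetical sample data).
decoded = b"example payload"

# Write the bytes into a new archive under a member name.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("somefile.dat", decoded)

# Re-open the archive to confirm the member round-trips unchanged.
with zipfile.ZipFile(io.BytesIO(buf.getvalue()), "r") as zf:
    names = zf.namelist()
    roundtrip = zf.read("somefile.dat")
```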

> I find that I am able to unzip the resulting data.zip using the unix
> unzip command, but the file inside contains some FFFD characters, as
> described in this thread...
> http://groups.google.com/group/comp.lang.python/browse_thread/thread/4f57abea978cc0bf?hl=en#
> I don't know if the unwanted characters might be the result of my
> trying to write and unzip a file, rather than unzipping the bytes.
> The file does contain a semblance of what I ultimately want -- it's
> not all garbage.
>
But it's certainly not a zip file.
> Apologies if it's not appropriate to start a new thread for this. It
> just seems like a different topic than how to deal with the resulting
> FFFD characters.
>
Don't worry about it.

webcomm wrote:
> Hi,
> In python, is there a distinction between unzipping bytes and
> unzipping a binary file to which those bytes have been written?
>
Python's zipfile module can only read and write zip files; it can't
compress or decompress data as a bytestring.
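
One nuance that may help here: zipfile.ZipFile accepts any seekable file-like object, so bytes already held in memory can be wrapped in io.BytesIO and opened without writing data.zip first. The sketch below is Python 3; unzip_bytes is a hypothetical helper name, and the sample archive is built on the spot only so the example runs by itself. In the scenario from the question, the input would be the result of base64.b64decode(datum).

```python
import base64
import io
import zipfile


def unzip_bytes(zipped):
    """Return {member name: member bytes} for a zip archive held in memory."""
    with zipfile.ZipFile(io.BytesIO(zipped)) as zf:
        return {name: zf.read(name) for name in zf.namelist()}


# Build a sample base64-encoded archive (stand-in for the web service's datum).
_buf = io.BytesIO()
with zipfile.ZipFile(_buf, "w") as zf:
    zf.writestr("data", b"payload bytes")
datum = base64.b64encode(_buf.getvalue())

# Decode, then unzip straight from memory; no temporary file involved.
members = unzip_bytes(base64.b64decode(datum))
```

Either route should give the same member bytes; the in-memory version just skips the temporary file.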
> The following code is, I think, an example of writing bytes to a file
> and then unzipping...
>
> decoded = base64.b64decode(datum)
> #datum is a base64 encoded string of data downloaded from a web
> service
> f = open('data.zip', 'wb')
> f.write(decoded)
> f.close()
> x = zipfile.ZipFile('data.zip', 'r')
>
> After looking at the preceding code, the provider of the web service
> gave me this advice...
> "Instead of trying to create a file, take the unzipped bytes and get a
> Unicode string of text from it."
>
> If so, I'm not sure how to do what he's suggesting, or if it's really
> different from what I've done.
>
If what you've been given is data which has been zipped and then base-64
encoded, then I can't see what you might be doing wrong.
> I find that I am able to unzip the resulting data.zip using the unix
> unzip command, but the file inside contains some FFFD characters, as
> described in this thread...
> http://groups.google.com/group/comp.lang.python/browse_thread/thread/4f57abea978cc0bf?hl=en#
> I don't know if the unwanted characters might be the result of my
> trying to write and unzip a file, rather than unzipping the bytes.
> The file does contain a semblance of what I ultimately want -- it's
> not all garbage.
>
> Apologies if it's not appropriate to start a new thread for this. It
> just seems like a different topic than how to deal with the resulting
> FFFD characters.
>

On Jan 9, 3:15 pm, Steve Holden <> wrote:
> webcomm wrote:
> > Hi,
> > In python, is there a distinction between unzipping bytes and
> > unzipping a binary file to which those bytes have been written?
>
> > The following code is, I think, an example of writing bytes to a file
> > and then unzipping...
>
> > decoded = base64.b64decode(datum)
> > #datum is a base64 encoded string of data downloaded from a web
> > service
> > f = open('data.zip', 'wb')
> > f.write(decoded)
> > f.close()
> > x = zipfile.ZipFile('data.zip', 'r')
>
> > After looking at the preceding code, the provider of the web service
> > gave me this advice...
> > "Instead of trying to create a file, take the unzipped bytes and get a
> > Unicode string of text from it."
>
> Not terribly useful advice, but one presumes he she or it was trying to
> be helpful.
>
> > If so, I'm not sure how to do what he's suggesting, or if it's really
> > different from what I've done.
>
> Well, what you have done appears pretty wrong to me, but let's take a
> look. What's datum? You appear to be treating it as base64-encoded data;
> is that correct? Have you examined it?

On Fri, Jan 9, 2009 at 2:32 PM, webcomm <> wrote:
> On Jan 9, 3:15 pm, Steve Holden <> wrote:
>> webcomm wrote:
>> > Hi,
>> > In python, is there a distinction between unzipping bytes and
>> > unzipping a binary file to which those bytes have been written?
>>
>> > The following code is, I think, an example of writing bytes to a file
>> > and then unzipping...
>>
>> > decoded = base64.b64decode(datum)
>> > #datum is a base64 encoded string of data downloaded from a web
>> > service
>> > f = open('data.zip', 'wb')
>> > f.write(decoded)
>> > f.close()
>> > x = zipfile.ZipFile('data.zip', 'r')
>>
>> > After looking at the preceding code, the provider of the web service
>> > gave me this advice...
>> > "Instead of trying to create a file, take the unzipped bytes and get a
>> > Unicode string of text from it."
>>
>> Not terribly useful advice, but one presumes he she or it was trying to
>> be helpful.
>>
>> > If so, I'm not sure how to do what he's suggesting, or if it's really
>> > different from what I've done.
>>
>> Well, what you have done appears pretty wrong to me, but let's take a
>> look. What's datum? You appear to be treating it as base64-encoded data;
>> is that correct? Have you examined it?
>
> It's data that has been compressed then base64 encoded by the web
> service. I'm supposed to download it, then decode, then unzip. They
> provide a C# example of how to do this on page 13 of
> http://forums.regonline.com/forums/docs/RegOnlineWebServices.pdf
>
> If you have a minute, see also this thread...
> http://groups.google.com/group/comp...7dd4?hl=en&lnk=gst&q=webcomm#5b9eceeee3e77dd4
>

When they say "zip", they're talking about a zlib compressed stream of
bytes, not a zip archive.

You want to base64 decode the data, then zlib decompress it, then
finally interpret it as (I think) UTF-16, as that's what Windows
usually means when it says "Unicode".

On Fri, Jan 9, 2009 at 3:08 PM, Chris Mellon <> wrote:
> On Fri, Jan 9, 2009 at 2:32 PM, webcomm <> wrote:
>> On Jan 9, 3:15 pm, Steve Holden <> wrote:
>>> webcomm wrote:
>>> > Hi,
>>> > In python, is there a distinction between unzipping bytes and
>>> > unzipping a binary file to which those bytes have been written?
>>>
>>> > The following code is, I think, an example of writing bytes to a file
>>> > and then unzipping...
>>>
>>> > decoded = base64.b64decode(datum)
>>> > #datum is a base64 encoded string of data downloaded from a web
>>> > service
>>> > f = open('data.zip', 'wb')
>>> > f.write(decoded)
>>> > f.close()
>>> > x = zipfile.ZipFile('data.zip', 'r')
>>>
>>> > After looking at the preceding code, the provider of the web service
>>> > gave me this advice...
>>> > "Instead of trying to create a file, take the unzipped bytes and get a
>>> > Unicode string of text from it."
>>>
>>> Not terribly useful advice, but one presumes he she or it was trying to
>>> be helpful.
>>>
>>> > If so, I'm not sure how to do what he's suggesting, or if it's really
>>> > different from what I've done.
>>>
>>> Well, what you have done appears pretty wrong to me, but let's take a
>>> look. What's datum? You appear to be treating it as base64-encoded data;
>>> is that correct? Have you examined it?
>>
>> It's data that has been compressed then base64 encoded by the web
>> service. I'm supposed to download it, then decode, then unzip. They
>> provide a C# example of how to do this on page 13 of
>> http://forums.regonline.com/forums/docs/RegOnlineWebServices.pdf
>>
>> If you have a minute, see also this thread...
>> http://groups.google.com/group/comp...7dd4?hl=en&lnk=gst&q=webcomm#5b9eceeee3e77dd4
>>
>
> When they say "zip", they're talking about a zlib compressed stream of
> bytes, not a zip archive.
>
> You want to base64 decode the data, then zlib decompress it, then
> finally interpret it as (I think) UTF-16, as that's what Windows
> usually means when it says "Unicode".
>
> decoded = base64.b64decode(datum)
> decompressed = zlib.decompress(decoded)
> result = decompressed.decode('utf-16')
>

And of course as *soon* as I write that, I read the appendix on the
documentation in full and turn out to be wrong. Ignore me *sigh*.

That's the zip created in this line of my code...
f = open('data.zip', 'wb')

If I open the file it contains as unicode in my text editor (EditPlus)
on Windows XP, there is ostensibly nothing wrong with it. It looks
like valid XML. But if I return it to my browser with python+django,
there are bad characters every other character

> But if I return it to my browser with python+django,
> there are bad characters every other character

Please consider that we might have difficulty guessing what "return it
to my browser with python+django" means. Show actual code.
>
> If I unzip it like this...
> popen("unzip data.zip")
> ...then the bad characters are 'FFFD' characters as described and
> pictured here...http://groups.google.com/group/comp.lang.python/browse_thread/thread/...

Yup, you've somehow pushed your utf_16_le-encoded data through some
decoder that doesn't like '\x00' and is replacing it with U+FFFD whose
name is (funnily enough) REPLACEMENT CHARACTER and whose meaning is
"big fat Unicode version of the question mark".
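
A small Python 3 sketch of the two symptoms, assuming the payload really is UTF-16LE text. Reading it with a single-byte codec leaves a NUL after every ASCII character, while a decoder running in "replace" mode substitutes U+FFFD for any byte it rejects. Python's own UTF-8 codec accepts '\x00', so the last line uses a byte it does reject (0xFF) purely to show where U+FFFD comes from; which decoder produced it in the unzip case is not known from this thread.

```python
# UTF-16LE bytes for a short sample string (made-up sample data).
raw = "<Reg>".encode("utf_16_le")

# Wrong codec: every ASCII character is followed by a NUL.
wrong = raw.decode("latin-1")

# Right codec: the text comes back clean.
right = raw.decode("utf_16_le")

# A rejected byte decoded in "replace" mode becomes U+FFFD.
replaced = b"ab\xffcd".decode("utf-8", errors="replace")
```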
>
> If I unzip it like this...
> getzip('data.zip', ignoreable=30000)
> ...using the function at...http://groups.google.com/group/comp.lang.python/msg/c2008e48368c6543
> ...then the bad characters are \x00 characters.

Hmmm ... shouldn't make a difference how you extracted 'data' from
'data.zip'.

There it is. Thanks.
> u'<Registration><BalanceDue>0.0000</BalanceDue><Stat'
>
>
>
> > But if I return it to my browser with python+django,
> > there are bad characters every other character
>
> Please consider that we might have difficulty guessing what "return it
> to my browser with python+django" means. Show actual code.

I did stop and consider what code to show. I tried to show only the
code that seemed relevant, as there are sometimes complaints on this
and other groups when someone shows more than the relevant code. You
solved my problem with decode('utf_16_le'). I can't find any
description of that encoding on the WWW... and I thought *everything*
was on the WWW.

I didn't know the data was utf_16_le-encoded because I'm getting it
from a service. I don't even know if *they* know what encoding they
used. I'm not sure how you knew what the encoding was.
> Please consider reading the Unicode HOWTO at http://docs.python.org/howto/unicode.html

Probably wouldn't hurt, though reading that HOWTO wouldn't have given
me the encoding, I don't think.

On Jan 11, 6:15 am, webcomm <> wrote:
> On Jan 9, 6:07 pm, John Machin <> wrote:
>
> > Yup, it looks like it's encoded in utf_16_le, i.e. no BOM as
> > God^H^H^HGates intended:
>
> > >>> buff = open('data', 'rb').read()
> > >>> buff[:100]
>
> > '<\x00R\x00e\x00g\x00i\x00s\x00t\x00r\x00a\x00t\x00i\x00o\x00n\x00>\x00<\x00B\x00a\x00l\x00a\x00n\x00c\x00e\x00D\x00u\x00e\x00>\x000\x00.\x000\x000\x000\x000\x00<\x00/\x00B\x00a\x00l\x00a\x00n\x00c\x00e\x00D\x00u\x00e\x00>\x00<\x00S\x00t\x00a\x00t\x00'
> > >>> buff[:100].decode('utf_16_le')
>
> There it is. Thanks.
>
> > u'<Registration><BalanceDue>0.0000</BalanceDue><Stat'
>
> > > But if I return it to my browser with python+django,
> > > there are bad characters every other character
>
> > Please consider that we might have difficulty guessing what "return it
> > to my browser with python+django" means. Show actual code.
>
> I did stop and consider what code to show. I tried to show only the
> code that seemed relevant, as there are sometimes complaints on this
> and other groups when someone shows more than the relevant code. You
> solved my problem with decode('utf_16_le'). I can't find any
> description of that encoding on the WWW... and I thought *everything*
> was on the WWW.

Try searching using the official name UTF-16LE ... looks like a blind
spot in the approximate matching algorithm(s) used by the search
engine(s) that you tried :-(
> I didn't know the data was utf_16_le-encoded because I'm getting it
> from a service. I don't even know if *they* know what encoding they
> used. I'm not sure how you knew what the encoding was.

Actually looked at the raw data. Pattern appeared to be an alternation
of 1 "meaningful" byte and one zero ('\x00') byte: => UTF16*. No BOM
('\xFE\xFF' or '\xFF\xFE') at start of file: => UTF16-?E. First byte
is meaningful: => UTF16-LE.
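
The same reasoning can be sketched as a rough heuristic in Python 3. guess_utf16_variant is a hypothetical helper, the 100-byte sample size and the 0.9/0.1 thresholds are arbitrary assumptions, and the check only makes sense for mostly-ASCII text such as this XML.

```python
import codecs


def guess_utf16_variant(buf):
    """Guess a UTF-16 flavour for mostly-ASCII text; return a codec name or None."""
    # A BOM settles it: the generic codec reads the BOM and picks the byte order.
    if buf.startswith(codecs.BOM_UTF16_LE) or buf.startswith(codecs.BOM_UTF16_BE):
        return "utf_16"
    sample = buf[:100]
    even = sample[0::2]  # bytes at offsets 0, 2, 4, ...
    odd = sample[1::2]   # bytes at offsets 1, 3, 5, ...
    if not even or not odd:
        return None
    # Fraction of zero bytes at even and at odd offsets.
    even_zeros = even.count(b"\x00") / len(even)
    odd_zeros = odd.count(b"\x00") / len(odd)
    if odd_zeros > 0.9 and even_zeros < 0.1:
        return "utf_16_le"  # meaningful byte first, zero byte second
    if even_zeros > 0.9 and odd_zeros < 0.1:
        return "utf_16_be"  # zero byte first, meaningful byte second
    return None


# Sample bytes matching the first 100 bytes shown earlier in the thread.
buff = "<Registration><BalanceDue>0.0000</BalanceDue><Stat".encode("utf_16_le")
guess = guess_utf16_variant(buff)
```

A guess like this is only a starting point; decoding a sample with the guessed codec and eyeballing the result is still the real check.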
> > Please consider reading the Unicode HOWTO at http://docs.python.org/howto/unicode.html
>
> Probably wouldn't hurt,

Definitely won't hurt. Could even help.
> though reading that HOWTO wouldn't have given
> me the encoding, I don't think.
