Zipping and unzipping Excel xlsx files

If you have a file with an .xlsx extension that was last saved by
Microsoft Excel, the file is stored in Office Open XML (OpenXML) format, a
zipped, XML-based file format developed by Microsoft for spreadsheets, charts,
presentations and word processing documents. Because the file is a zip
archive, you can rename it to change the extension to .zip and then extract
its contents as you would with any other zip file.
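
Renaming is not strictly required just to look inside the file: since an .xlsx
file is a zip archive, a zip library can open it directly. A minimal sketch,
assuming Python 3 and its standard zipfile module; the function name
list_package_entries is illustrative and not part of any library:

```python
import zipfile


def list_package_entries(path):
    """Return the names of all entries stored in a zip-based file.

    Works on an .xlsx file as-is; the extension does not need to be .zip.
    """
    with zipfile.ZipFile(path) as archive:
        return archive.namelist()


# Example use (Example.xlsx is the workbook discussed in this article):
#     for name in list_package_entries("Example.xlsx"):
#         print(name)
```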

Files can be stored within a zip file using different algorithms, which
determine the level of compression and how quickly the zip file is produced.
If you are using a Linux or Apple OS X system, you can see the choices
available for compressing files and directories into a zip file from the
command line, e.g. from a Terminal window on an OS X system, by looking at the
man page for the zip program, which you can display by issuing the command
man zip. The following options are available for file compression:

-Z cm
--compression-method cm
    Set the default compression method. Currently the main methods
    supported by zip are store and deflate. Compression method can
    be set to:

    store - Setting the compression method to store forces zip to
    store entries with no compression. This is generally faster
    than compressing entries, but results in no space savings. This
    is the same as using -0 (compression level zero).

    deflate - This is the default method for zip. If zip determines
    that storing is better than deflation, the entry will be stored
    instead.

    bzip2 - If bzip2 support is compiled in, this compression method
    also becomes available. Only some modern unzips currently support
    the bzip2 compression method, so test the unzip you will be
    using before relying on archives using this method (compression
    method 12).

    For example, to add bar.c to archive foo using bzip2 compression:

        zip -Z bzip2 foo bar.c

    The compression method can be abbreviated:

        zip -Zb foo bar.c

-#
(-0, -1, -2, -3, -4, -5, -6, -7, -8, -9)
    Regulate the speed of compression using the specified digit #,
    where -0 indicates no compression (store all files), -1 indicates
    the fastest compression speed (less compression) and -9
    indicates the slowest compression speed (optimal compression,
    ignores the suffix list). The default compression level is -6.

    Though still being worked, the intention is this setting will
    control compression speed for all compression methods. Currently
    only deflation is controlled.
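
The same store/deflate choice and the 0 to 9 levels can be tried from Python.
The sketch below is only an illustration, assuming Python 3.7 or later (where
zipfile.ZipFile accepts a compresslevel argument) and a Python build with zlib
available; write_entry is a hypothetical helper name, not part of the zip
utility or of any library:

```python
import zipfile


def write_entry(archive_path, name, data, method, level=None):
    """Write one entry with the given method and return its compressed size.

    method is zipfile.ZIP_STORED (like zip -0) or zipfile.ZIP_DEFLATED.
    level is the deflate level 0-9 and has no effect with ZIP_STORED.
    """
    with zipfile.ZipFile(archive_path, "w", compression=method,
                         compresslevel=level) as archive:
        archive.writestr(name, data)
    with zipfile.ZipFile(archive_path) as archive:
        return archive.getinfo(name).compress_size


# Example use:
#     data = b"spreadsheet " * 1000
#     write_entry("stored.zip", "a.txt", data, zipfile.ZIP_STORED)
#     write_entry("fast.zip", "a.txt", data, zipfile.ZIP_DEFLATED, level=1)
#     write_entry("best.zip", "a.txt", data, zipfile.ZIP_DEFLATED, level=9)
```

With ZIP_STORED the returned size equals the length of the data; with
ZIP_DEFLATED it should be smaller for repetitive input such as the example
above.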

Excel uses the "deflate" compression method, as can be seen by running the
zipinfo command, which lists detailed information about zip archives, on the
following Example.xlsx file produced by Microsoft Excel for Mac 2011 on a
MacBook Pro laptop. The Example workbook has two worksheets with the default
names of sheet1 and sheet2.

The "defS" in the sixth column indicates the compression method used.
There are six methods known at present: storing (no compression),
reducing, shrinking, imploding, tokenizing (never publicly released),
and deflating. In addition, there are four levels of reducing (1
through 4); four types of imploding (4K or 8K sliding dictionary,
and 2 or 3 Shannon-Fano trees); and four levels of deflating
(superfast, fast, normal, maximum compression). zipinfo represents
these methods and their sub-methods as follows: stor; re:1,
re:2, etc.; shrk; i4:2, i8:3,
etc.; tokn; and defS, defF,
defN, and defX.
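
For deflated entries, the sub-method that zipinfo reports is recorded in bits 1
and 2 of each entry's general purpose bit flag, as defined in the ZIP format
specification (APPNOTE). The following is a rough sketch of how the same
labels could be derived in Python. It is not how zipinfo itself is
implemented, it handles only the stor and deflate cases, and the names
method_label and describe are illustrative:

```python
import zipfile

# Bits 1 and 2 of the general purpose bit flag, read as (flag_bits >> 1) & 3:
#   0 = normal, 1 = maximum, 2 = fast, 3 = super fast
_DEFLATE_LABELS = {0: "defN", 1: "defX", 2: "defF", 3: "defS"}


def method_label(compress_type, flag_bits):
    """Return a zipinfo-style label for an entry's compression method."""
    if compress_type == zipfile.ZIP_STORED:
        return "stor"
    if compress_type == zipfile.ZIP_DEFLATED:
        return _DEFLATE_LABELS[(flag_bits >> 1) & 0b11]
    # Other methods (bzip2, LZMA, legacy methods) are not decoded here.
    return "method %d" % compress_type


def describe(path):
    """Return (entry name, label) pairs for every entry in a zip file."""
    with zipfile.ZipFile(path) as archive:
        return [(info.filename,
                 method_label(info.compress_type, info.flag_bits))
                for info in archive.infolist()]
```

Calling describe("Example.xlsx") would then give one pair per entry in the
workbook.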

If I rename the Example.xlsx file to Example.zip, extract its contents, and
then rezip the resulting directory with the zip utility, I see the following:

I.e., I see that, by default, the deflation method is being used for files.
If I then check the .zip file produced by the zip utility with the
zipinfo utility, I see "defN" listed in the sixth column for the
compression method.

Deflation is the default compression method, but using the zip utility on an
OS X system I can set the compression level to 1 using -Z deflate -1. I then
see the compression method listed as "defF".

Using -2 for the level of deflation also results in the deflation method
shown by zipinfo being "defF". If -3 is used, "defN" is shown; "defN" is also
shown by zipinfo if -4 is used as an argument to the zip command. If I issue
the command zip -r -n Example.zip Example/*, where n is a number between 0
and 9, on an OS X system, I see the following methods listed when I check the
resulting zip file with zipinfo.

Number   Method
0        stor
1        defF
2        defF
3        defN
4        defN
5        defN
6        defN
7        defN
8        defX
9        defX

You might expect that you could recompress the folder where the files
were extracted to produce a new zip file and then rename that to be a .xlsx
file which you could then open with Excel. That won't work, however. If you
right-click on the directory containing the extracted files in the
Finder application and choose Compress dirname, where
dirname is the relevant directory name, that will produce a new zip
file. E.g., if the file you started from was Example.xlsx and you
renamed it to Example.zip and then extracted the contents of the
zip file to the directory Example, you would now have an
Example.zip file again. You can right-click on it and rename the
extension back to .xlsx, but if you try to open that file with Excel you will
see the message "file format is not valid". You will have the same problem
if you use the command line zip
program, e.g., if you used zip -r Example.zip Example/*.
The zip file produced by those methods on an Apple OS X system does
not match what Excel is expecting.
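
One thing worth checking in a rejected archive is the entry names. Both methods
above store each entry under a leading Example/ directory, whereas the Open
Packaging Conventions used by .xlsx files place [Content_Types].xml at the
root of the archive. Whether that difference is what causes the error here is
an assumption, not something the tests in this article establish. A small
sketch for inspecting an archive, with illustrative function names:

```python
import zipfile


def has_root_content_types(path):
    """Return True if [Content_Types].xml sits at the root of the archive."""
    with zipfile.ZipFile(path) as archive:
        return "[Content_Types].xml" in archive.namelist()


def top_level_names(path):
    """Return the sorted first path component of every entry."""
    with zipfile.ZipFile(path) as archive:
        return sorted({name.split("/", 1)[0] for name in archive.namelist()})
```

If top_level_names returns only the single name Example, every entry is
nested under that directory.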

If you create a zip file with the Python zipfile
module, the deflate method is used by default and you will see
"defN" if you check the zip file created with that module. You can
create a zip file of a directory from the command line with that
module using the command python -m zipfile -c zipfilename directory,
where zipfilename is the name you
wish to give to the zip file and directory is the directory
you wish to compress into a zip file. E.g., python -m zipfile
-c Example.zip Example/. You will also not be able to open the zip file
produced by that method with Excel by simply renaming the file to have a .xlsx
rather than a .zip extension. You will again get the "file format is not
valid" message if you try to open the file in Excel.

However, if you use the Python
shutil module to
create the zip file from the directory and then rename the zip file to a .xlsx
file, you can open it in Excel. The
zipdir.py Python script can be
used to create a zip file that can be opened with Excel after changing the file
extension.
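
The zipdir.py script itself is not reproduced here. As a rough idea of what a
shutil-based approach can look like, the sketch below uses
shutil.make_archive with its root_dir argument; it is not the author's
script, and the function name directory_to_xlsx is hypothetical:

```python
import os
import shutil


def directory_to_xlsx(directory, xlsx_path):
    """Zip the contents of `directory` and save the result as `xlsx_path`.

    Note: this overwrites any existing file named like xlsx_path but with
    a .zip extension, and xlsx_path should not be inside `directory`.
    """
    base_name = os.path.splitext(xlsx_path)[0]
    # root_dir makes the entry names relative to `directory`, so the files
    # are stored without a leading directory component in their names.
    zip_path = shutil.make_archive(base_name, "zip", root_dir=directory)
    # Give the archive its final .xlsx name.
    os.replace(zip_path, xlsx_path)
    return xlsx_path


# Example use:
#     directory_to_xlsx("Example", "Example-rebuilt.xlsx")
```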

I noticed that when I produce zip files from a directory on an OS X system
using methods other than the Python shutil module, the directories
are shown with the "stor" compression method, i.e., they are stored with no
compression employed. When I compress the directory with
zipdir.py
E.g., zipinfo shows the following information for a zip file produced
by python -m zipfile -c Example.zip Example/: