Convert SpreadSheets to CSV files with Python and pyuno, Part 1v2

Some months back I developed some pyuno code for converting
spreadsheets into CSV files from the command line.
Pyuno being the Python interface to the OpenOffice runtime.
One of the enhancement suggestions I got was to add the ability
to extract all the sheets and/or a specific sheet rather than
always extracting the first sheet.
The following update to the code does just that.

By way of refreshing our memory, the converter program
takes pairs of input and output files, for example:

$ python ssconverter.py file1.xls file1.csv file2.ods file2.csv

Each input file is a spreadsheet and it is converted into the corresponding
output file as a CSV file.
The new version has been enhanced so that you can append a sheet
name or number to the end of the input file name to specify a
particular sheet to extract, for example:

If the format specifier is %d, then the sheet number is formatted
into the output file name, if %s is specified then the sheet name is used.
The %d specifier can include zero pad and width specifiers (eg %04d).

The new code has three main changes over the the original code.
The first change is to check for an input file sheet specification:

The regular expression search checks to see if the input file name
can be broken into a file name part and a sheet name part.
If it can, and only if the original unbroken input file name does
not refer to an existing file, the pieces are used as the
input file and sheet specification.
Checking to see if the original unbroken input file name refers
to an existing file allows you to specify input files that have colons
and at signs in their names, and it continues to work even in the case where
you have an input file with a colon or at sign in its name and you want
to extract a particular sheet, for example:

$ python ssconverter.py this:month.xls:sales output.csv

will correctly recognize that the sheet name is "sales" and
that the file name is "this:month.xls".
Since the regular expression search is be default greedy the
break will occur at the last colon and since "this:month.xls:sales"
does not refer to an existing file the match will be used.

The second main change, is related to how the document is loaded.
After getting the rest of the code to the point where I thought
it ought to be working I discovered while testing that OpenOffice
was always converting the first sheet.
After some flailing at the code failed to fix the problem I
started looking at a few other examples of pyuno code and the only
thing I really noted was that many did not use the Hidden property
when loading the document.
So, I gave that a shot and sure enough it fixed the problem.
The accompanying comment has a few more notes:

# Sheet activation does not work properly when Hidden is specified.# Although the sheet does become the active sheet, it's not the sheet that# gets saved if the spreadsheet is loaded with Hidden=True.## Removing Hidden=True doesn't seem to change anything: nothing appears# on the screen regardless of the Hidden value.## document = self.desktop.loadComponentFromURL(inputUrl, "_blank", 0, ooutils.oo_properties(Hidden=True))document=self.desktop.loadComponentFromURL(inputUrl,"_blank",0,ooutils.oo_properties())

The third change is the one where the sheet or sheets are extracted.
If a particular sheet is specified the sheet is activated first
then saved to the output file:

# Activate the sheet to be converted.ifre.search(r'^\d+$',inputSheet):sheet=sheets.getByIndex(int(inputSheet)-1)else:sheet=sheets.getByName(inputSheet)controller.setActiveSheet(sheet)outputUrl=uno.systemPathToFileUrl(os.path.abspath(outputFile))document.storeToURL(outputUrl,props)

The regular expression here checks to see if the sheet name
is a number or a string (sheet number or sheet name).

If all the sheets are being saved then the sheets are activated
one by one, the output file name is formatted and the sheet is saved:

# Use the sheet number if the format is %d, otherwise the sheet name.dfmt=re.search(r'%[0-9]*d',outputFile)sfmt=re.search(r'%s',outputFile)ifdfmtorsfmt:i=0whilei<sheets.getCount():# Activate the sheet.sheet=sheets.getByIndex(i)controller.setActiveSheet(sheet)# Create output file name.ifdfmt:ofile=outputFile%(i+1)else:ofile=outputFile%sheet.getName().replace(' ','_')ifverbose:print" %s"%ofile# Save the sheet to the output file.outputUrl=uno.systemPathToFileUrl(os.path.abspath(ofile))document.storeToURL(outputUrl,props)i+=1

The code in its entirety appears below:

1#!/usr/bin/python 2# 3# Convert spreadsheet to CSV file. 4# 5# Based on: 6# PyODConverter (Python OpenDocument Converter) v1.0.0 - 2008-05-05 7# Copyright (C) 2008 Mirko Nasato <mirko@artofsolving.com> 8# Licensed under the GNU LGPL v2.1 - or any later version. 9# http://www.gnu.org/licenses/lgpl-2.1.html 10# 11 12importos 13importre 14importooutils 15 16importuno 17fromcom.sun.star.taskimportErrorCodeIOException 18 19 20 21classSSConverter: 22""" 23 Spreadsheet converter class. 24 Converts spreadsheets to CSV files. 25 """ 26 27def__init__(self,oorunner=None): 28self.desktop=None 29self.oorunner=None 30 31 32defconvert(self,inputFile,outputFile,verbose=False): 33""" 34 Convert the input file (a spreadsheet) to a CSV file. 35 36 The input file name can contain a sheet specification to specify a particular sheet. 37 The sheet specification is either a number or a sheet name. 38 The sheet specification is appended to the file name separated by a colon 39 or an at sign: ":" or "@". 40 41 If the output file name contains a %d or %s format specifier, then all the sheets 42 in the input file are converted, otherwise only the first sheet is converted. 43 44 If the output file name contains a %d format specifier then the sheet number 45 is used when formatting the output file name. 46 The format can contain a width specifier (eg %02d). 47 48 If the output file name contains a %s specifier then the sheet name is used 49 when formatting the output file name. 50 """ 51 52# Start openoffice if needed. 53ifnotself.desktop: 54ifnotself.oorunner: 55self.oorunner=ooutils.OORunner() 56 57self.desktop=self.oorunner.connect() 58 59# Check for sheet specification in input file name. 60match=re.search(r'^(.*)[@:](.*)$',inputFile) 61ifos.path.exists(inputFile)ornotmatch: 62inputUrl=uno.systemPathToFileUrl(os.path.abspath(inputFile)) 63inputSheet='1'# Convert fist sheet. 64else: 65inputUrl=uno.systemPathToFileUrl(os.path.abspath(match.group(1))) 66inputSheet=match.group(2) 67 68 69# NOTE: 70# Sheet activation does not work properly when Hidden is specified. 71# Although the sheet does become the active sheet, it's not the sheet that 72# gets saved if the spreadsheet is loaded with Hidden=True. 73# 74# Removing Hidden=True doesn't seem to change anything: nothing appears 75# on the screen regardless of the Hidden value. 76# 77# document = self.desktop.loadComponentFromURL(inputUrl, "_blank", 0, ooutils.oo_properties(Hidden=True)) 78document=self.desktop.loadComponentFromURL(inputUrl,"_blank",0,ooutils.oo_properties()) 79 80try: 81props=ooutils.oo_properties(FilterName="Text - txt - csv (StarCalc)") 82# 83# Another useful property option: 84# FilterOptions="59,34,0,1" 85# 59 - Field separator (semicolon), this is the ascii value. 86# 34 - Text delimiter (double quote), this is the ascii value. 87# 0 - Character set (system). 88# 1 - First line number to export. 89# 90# For more information see: 91# http://wiki.services.openoffice.org/wiki/Documentation/DevGuide/Spreadsheets/Filter_Options 92 93# To convert a particular sheet, the sheet needs to be active. 94# To activate a sheet we need the spreadsheet-view, to get the spreadsheet-view 95# we need the spreadsheet-controller, to get the spreadsheet-controller 96# we need the spreadsheet-model. 97# 98# The spreadsheet-model interface is available from the document object. 99# The spreadsheet-view interface is available from the controller.100#101controller=document.getCurrentController()102sheets=document.getSheets()103104# If the output file name contains a %d or %s format specifier, convert all sheets.105# Use the sheet number if the format is %d, otherwise the sheet name.106dfmt=re.search(r'%[0-9]*d',outputFile)107sfmt=re.search(r'%s',outputFile)108109ifdfmtorsfmt:110i=0111whilei<sheets.getCount():112# Activate the sheet.113sheet=sheets.getByIndex(i)114controller.setActiveSheet(sheet)115116# Create output file name.117ifdfmt:118ofile=outputFile%(i+1)119else:120ofile=outputFile%sheet.getName().replace(' ','_')121122ifverbose:print" %s"%ofile123124# Save the sheet to the output file.125outputUrl=uno.systemPathToFileUrl(os.path.abspath(ofile))126document.storeToURL(outputUrl,props)127i+=1128129else:130# Activate the sheet to be converted.131ifre.search(r'^\d+$',inputSheet):132sheet=sheets.getByIndex(int(inputSheet)-1)133else:134sheet=sheets.getByName(inputSheet)135136controller.setActiveSheet(sheet)137outputUrl=uno.systemPathToFileUrl(os.path.abspath(outputFile))138document.storeToURL(outputUrl,props)139finally:140ifdocument:document.close(True)141142143if__name__=="__main__":144fromsysimportargv145fromos.pathimportisfile146147iflen(argv)==2andargv[1]=='--shutdown':148ooutils.oo_shutdown_if_running()149else:150iflen(argv)<3orlen(argv)%2!=1:151print"USAGE:"152print" python %s INPUT-FILE[:SHEET] OUTPUT-FILE ..."%argv[0]153print"OR"154print" python %s --shutdown"%argv[0]155exit(255)156157try:158i=1159converter=SSConverter()160161whilei+1<len(argv):162print'%s => %s'%(argv[i],argv[i+1])163converter.convert(argv[i],argv[i+1],True)164i+=2165166exceptErrorCodeIOException,exception:167print"ERROR! ErrorCodeIOException %d"%exception.ErrCode168exit(1)

Comment viewing options

After investigating deeper, this script does shutdown OpenOffice. This works around the OpenOffice memory leak problems. Good.

The memory leak issue I saw appears if OpenOffice is already started BEFORE calling this script (headless instance started on the same port). Uno seems to be confused with what instance of OpenOffice to use and close. I ended up with hundreds of openoffice processes.

I tried to call your code on a few hundreds of spreasheets (1 call per 1 spreadsheet) and I noticed that the conversion (Excel to CSV) became slower after each call. In fact, something was eating more and more memory. My server eventually ran out of memory/swap and stopped.

With the calling code on my side, I had previously tried JODConvertor and Roo. I did not have such memory issues.

BTW, JODConverter worked fine but it does not support multiple sheets.
Roo has support for multiple sheets but it has two major issues for me. Limited support for Asian encoded characters. And silent crash for lines containings Merged cells (the lines are just skipped in the CSV output).

I have a problem using this script in Ubuntu,
NameError: global name 'OPENOFFICE_BIN' is not defined
in order to solve it, edit ooutils.py, go to line 19 and add this path:
/usr/lib/openoffice/program/

Thanks for listening. I posted a request for this functionality some time ago. As I couldn't figure out how to do it, I resourced to xls2csv, a Perl module. As to date it has worked fine, specially since sometimes the files to convert are large, say 200+ cols and 1500+ rows. Only complaint is that the installation is not as easy as installing OpenOffice.org with a simple apt-get install.

As Linux continues to play an ever increasing role in corporate data centers and institutions, ensuring the integrity and protection of these systems must be a priority. With 60% of the world's websites and an increasing share of organization's mission-critical workloads running on Linux, failing to stop malware and other advanced threats on Linux can increasingly impact an organization's reputation and bottom line.

Most companies incorporate backup procedures for critical data, which can be restored quickly if a loss occurs. However, fewer companies are prepared for catastrophic system failures, in which they lose all data, the entire operating system, applications, settings, patches and more, reducing their system(s) to “bare metal.” After all, before data can be restored to a system, there must be a system to restore it to.

In this one hour webinar, learn how to enhance your existing backup strategies for better disaster recovery preparedness using Storix System Backup Administrator (SBAdmin), a highly flexible bare-metal recovery solution for UNIX and Linux systems.