How To Generate Excel Files with Ruby

While Harvest provides powerful reports sometimes nothing beats having all of your data in a spreadsheet. Unfortunately, the defacto spreadsheet tool (Excel) has some tricky file formats, making programatic export more difficult than it should be. We recently made improvements to how we handle Excel exports in Harvest, and I’d like to share a few options for exporting spreadsheets with Ruby.

Perhaps the simplest way is to export Comma Separated Value (CSV) and let the Excel process these. Generating CSV files is generally fast, the format is well known but not trouble free. If you only export simple ASCII characters, Excel will work fine, but drop one non-English character in and Excel gets confused. Excel cannot select the encoding, so if you want to export UNICODE data (such as when you’re working in non-English languages) using CSV files with Excel won’t do. Sure all other spreadsheet programs can import UTF8 encoded CSV files, some of these programs are free (OpenOffice) so if you have the option of ignoring Excel CSV will do.

Next is using TSV files (Tab Separated Values), these solve the UNICODE problem but the user must remember to explicitly convert to native Excel format and this is not what they usually do with spreadsheets. Clearly, we must produce native Excel files, and this means either XLS (format used up till Office 2003 included) or the new XLSX documents based on the controversial OpenXML. The older format is undocumented but many hours of reverse engineering has been spent on it and there are quite a few libraries out there. For Ruby the most notable is Spreadsheet.

XLS & Spreadsheet has some drawbacks. First it has a column limit of 256 and 64K on the number of rows. Generally you will break down a lot sooner when exporting large number of rows as Spreadsheet is very slow to construct the global string table. See all strings in an XLS file are stored in a global string table then the row data merely refers to string by the index in the table. A nice optimization trick that makes opening XLS files significantly faster, but it also makes writes slow. Constructing the string table takes time and with the particular implementation Spreadsheet consumes a lot of memory as it keeps the entire document in memory all rows included, then upon save it constructs a Hash of all strings in the document. In short Spreadsheet is a great library for parsing and constucting decorated documents but breaks down above 20k rows.

We have implemented an XLSX generator to get Excel exports in Harvest. Unlike XLS the format is documented, albeit the thousands of pages of documentation provides only marginal help. However, it does support another string storage model, where all values are stored inline. This makes file generation faster while’st opening such documents for the first time will be somewhat slower. The resulting export will be about the same size since XLSX is basically a zip file of other xml documents. To use, first you need to install the gem:

The gem will recognize a few basic ruby data types and generate cells of the right type. There is no other support for formating the document as in different colors, font styles etc just raw output compatible with the following programs:

Open Office 3.2 (Linux, Mac, Windows)

Neo Office 3.2 (Mac)

Microsoft Office 2007 (Windows)

Microsoft Office 2010 (Windows)

Microsoft Office 2008 for Mac (versions 12.2.5 or above)

Microsoft Excel Viewer (Windows)

One notable exception from this list is Numbers from iWork ’09 as it does not yet support documents in the inline string storage model. Apple may fix this eventually.

I was evaluating simple_xlsx along with ruby spreadsheet ( http://spreadsheet.rubyforge.org/ ) and arydjmal’s to_xls ( http://github.com/arydjmal/to_xls ) and found it a good balance between simplicity (spreadsheet requires ole) and compatibility (to_xls generates plain xml that opens ok in Excel but not in some other applications).

With simple_xlsx and the example code above I was getting the following error message: “Excel cannot open this file. The file might have been damaged or modified from its original format.” After some serious debugging turns out “dcterms:created” time format in docProps/core.xml doesn’t comply with the expected xml time format. To fix this, open “lib/simple_xlsx/serializer.rb” on line 109 and replace:
2010-07-20T14:30:58.00Z
with:
#{Time.now.utc.xmlschema}
This will insert the current time in the correct format in the xml output generated.

I also found it useful to run this as a rails plugin rather than as a separate gem. First, install the code as a plugin:
./script/plugin install git://github.com/harvesthq/simple_xlsx_writer.git
Next, add “vendor/plugins/simple_xlsx_writer/init.rb” and paste in the following:
require ‘simple_xlsx’
You will need the rubyzip gem (and perhaps fast_xs) to satisfy the dependencies.

I am currently using the spreadsheet gem with some success. However, I have several ‘workarounds’ that I’d like to get away from.

Has anyone tried modifying this to open an existing workbook and adding rows to it, instead of creating each workbook from scratch? I have existing workbooks that I use as ‘templates’, and if I could use it in this way, it would great. Just checking before I try to make this work…

Hi Chris, I’m afraid simple_xlsx_writer won’t let you read or modify existing Excel documents, which is what you need for a ‘templating’ approach. Have a look at ruby spreadsheet ( http://spreadsheet.rubyforge.org/ ) it might be able to do what you need. It won’t work though if your worksheets are massive or you specifically need to use the newer XLSX file format.

We’ve had the same problem as you are describing with the Excel gem: it broke down when generating files with about 25k rows. We fixed this by tuning the garbage collector with the following values (this is for REE, but you can use similar values for 1.9)