Document Migrations

If you’ve been around this business for a while, you’ve seen your share of migrations. New operating systems, new networks, new hardware, even new document formats. I’d like to share some recollections of one such migration, and then suggest a solution.

In 1995 I was working at Lotus on Freelance Graphics, along with many others, getting SmartSuite ready for Windows 95. One day, as I walked to work and rounded the corner of Binney Street, I saw something unusual, even more unusual than the usual unusual one sees in Cambridge. Something was up. There were news vans parked in front of LDB, camera crews and reporters looking for comments, Lotus security videotaping the reporters asking for comments, and me standing there, clueless.

This was how I first heard of IBM’s takeover offer. It was hard to concentrate on porting to Windows 95 with all that news going on downstairs, but we managed.

In the weeks and months that followed there were many changes. At Lotus we were 100% SmartSuite users. No surprise there. Most of us did not even have a copy of Microsoft Office on our machines, unless we worked on file compatibility. Not only did we use SmartSuite for our collaborative work, creating and reviewing specifications, giving presentations, etc., we also ran some of our business processes on it. In particular we used an expense report application, done in 1-2-3 with LotusScript.

But IBM used Microsoft Office. So when IBM took over, we needed to migrate. Sure, there was whining and moaning and gnashing of teeth on our end about having to move to an inferior product. And it did take a little while to get accustomed to the different conventions of Office, typing AVERAGE() in Excel rather than @AVG() in 1-2-3, and stuff like that. But we did it. We moved to Office. It was clear to all that the benefits of having a single file format outweighed the short-term pain of migration.

It is interesting what we did not do:

We did not go and convert all existing legacy SmartSuite documents into Office format. What would have been the point? Most old documents are never touched again. Let them rest in peace.

We did not delete SmartSuite from our hard drives. We kept the application there for cases where we needed to access old documents.

We did not simply continue using SmartSuite and tell it to save in Office format. We knew that both fidelity-wise and performance-wise it is far better to use an application that supports a format natively than to rely on conversion software for interoperability.

We did not translate 1-2-3 macro-based applications into Excel macro-based applications. We took the opportunity to move straight to web-based applications. Aside from some standard presentation templates and similar boilerplate templates, we did not do a lot of conversion work.

In retrospect, the migration of file formats was one of the least contentious changes that accompanied the IBM takeover. We can handle file format changes, but eliminating the traditional Friday Beer Cart, now that was something to complain about…

I’m not much of one for committing unprovoked acts of methodology, but if I had to summarize what little wisdom I have in this area, I’d say that for a migration you want to evaluate your existing documents by three criteria (stability, complexity and business criticality) and develop a migration plan based on that.

In the first case you classify documents by how stable (unchanging) they are:

Hot documents — the documents that are being heavily changed and edited today, works-in-progress, in active collaborations

Cold documents — the documents which are no longer edited, though perhaps they are still read. Many of these documents may have zero value and are just taking up space. Others may be valuable records, but hidden away on someone’s hard-drive.

Warm documents — These are the ones that are in the middle, not seeing heavy activity, but they aren’t quite frozen either.

From the perspective of complexity we have:

Low complexity — simple text and graphics

Medium complexity — using more advanced features, created by power users

High complexity — “engineered documents”, using scripting and macros to create applications.

Finally you can also look at these documents from the perspective of business criticality. Of course, this will vary according to your business. It might be relevance to ongoing litigation, it might be according to a records retention policy, it might be whether it concerns currently open projects, etc. But for sake of argument, let’s take client or public exposure as a proxy for criticality, so we get this:

Internal use documents — internal presentations and reports

Customer facing documents — engagement reports, proposals, etc.

Publication ready documents — white papers, journal articles, etc.

These three dimensions — stability, complexity and criticality — can be combined, creating 27 different document classes. For example, our old expense report based on 1-2-3 macros would be classified as a hot, high complexity, internal use document.
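As a minimal sketch of how the three dimensions combine, the following enumerates every class. The lowercase labels and the function name are illustrative choices, not part of any real tool.

```python
from itertools import product

# The three dimensions described above, each with three levels.
STABILITY = ("cold", "warm", "hot")
COMPLEXITY = ("low", "medium", "high")
EXPOSURE = ("internal use", "customer facing", "publication ready")


def document_classes():
    """Return every (stability, complexity, exposure) combination."""
    return list(product(STABILITY, COMPLEXITY, EXPOSURE))


classes = document_classes()
print(len(classes))  # 3 * 3 * 3 combinations

# The old 1-2-3 expense report is one of these classes.
expense_report = ("hot", "high", "internal use")
print(expense_report in classes)
```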

So you are transitioning from Office legacy binary formats to ODF. What do you do with each of these document classes? You have four main strategies to consider:

Do nothing and preserve the document in the legacy format, maintaining, as needed, access to the legacy application.

Convert document to a portable high fidelity static representation, like PDF

Convert directly to ODF.

Reengineer as something other than a document.

So one migration policy might look like this:

Stability   Complexity   Exposure            Strategy
---------   ----------   -----------------   --------------
Cold        Low          Internal Use        Do nothing
Cold        Low          Customer Facing     Do nothing
Cold        Low          Publication Ready   Do nothing
Cold        Medium       Internal Use        Do nothing
Cold        Medium       Customer Facing     Do nothing
Cold        Medium       Publication Ready   Do nothing
Cold        High         Internal Use        Do nothing
Cold        High         Customer Facing     Convert to PDF
Cold        High         Publication Ready   Convert to PDF
Warm        Low          Internal Use        Convert to ODF
Warm        Low          Customer Facing     Convert to ODF
Warm        Low          Publication Ready   Convert to ODF
Warm        Medium       Internal Use        Convert to ODF
Warm        Medium       Customer Facing     Convert to ODF
Warm        Medium       Publication Ready   Convert to ODF
Warm        High         Internal Use        Convert to ODF
Warm        High         Customer Facing     Publish as PDF
Warm        High         Publication Ready   Publish as PDF
Hot         Low          Internal Use        Convert to ODF
Hot         Low          Customer Facing     Convert to ODF
Hot         Low          Publication Ready   Convert to ODF
Hot         Medium       Internal Use        Convert to ODF
Hot         Medium       Customer Facing     Convert to ODF
Hot         Medium       Publication Ready   Convert to ODF
Hot         High         Internal Use        Reengineer
Hot         High         Customer Facing     Reengineer
Hot         High         Publication Ready   Reengineer

There may be a better way of expressing the above (Karnaugh maps, anyone?) but that gives the idea. Also, I’m not suggesting that this is the “one true answer”, but merely that this may be a useful way of framing the problem.
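One more compact way to express the example policy is as a handful of rules rather than 27 rows. The sketch below is only one possible encoding; the function name and the lowercase labels are assumptions made for illustration, and the rules reproduce the example policy above, not a recommendation.

```python
def strategy(stability, complexity, exposure):
    """Return the example policy's strategy for one document class."""
    if stability == "cold":
        # Cold documents are left alone, except complex ones that outsiders see.
        if complexity == "high" and exposure != "internal use":
            return "convert to PDF"
        return "do nothing"
    if complexity != "high":
        # Warm or hot documents of low or medium complexity convert directly.
        return "convert to ODF"
    if stability == "hot":
        # Hot, high-complexity documents are really applications.
        return "reengineer"
    # What remains is warm and high complexity.
    if exposure == "internal use":
        return "convert to ODF"
    return "publish as PDF"


print(strategy("hot", "high", "internal use"))   # the old expense report
print(strategy("cold", "low", "customer facing"))
```

Collapsing the table this way also makes it easier to see where the policy draws its lines: complexity only matters at the high end, and exposure only matters for high-complexity documents that are not hot.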

Variations might include:

Have a default policy of doing no conversions, but create all new documents in ODF format.

By default, ignore all legacy documents. But the first time any legacy document is read or written, put it into a queue for evaluation and possible conversion.
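The second variation amounts to a queue that each legacy document enters at most once, the first time it is touched. Here is a rough sketch, assuming some hook (not shown, and entirely hypothetical) calls record_access whenever a legacy file is opened or saved.

```python
from collections import deque


class EvaluationQueue:
    """Queue legacy documents for review the first time they are accessed."""

    def __init__(self):
        self._seen = set()     # every path ever recorded
        self._queue = deque()  # paths still waiting for evaluation

    def record_access(self, path):
        """Record a read or write; return True if the path was newly queued."""
        if path in self._seen:
            return False
        self._seen.add(path)
        self._queue.append(path)
        return True

    def next_for_review(self):
        """Return the oldest queued path, or None when nothing is waiting."""
        return self._queue.popleft() if self._queue else None


queue = EvaluationQueue()
queue.record_access("reports/q3.doc")
queue.record_access("reports/q3.doc")  # a second access is ignored
print(queue.next_for_review())
```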

Much of this lends itself to automation. For example:

First you need to find all of the documents in an organization. This could be done by an ActiveX control on a page everyone in the company visits, an agent that spiders the intranet web pages and file servers, etc.

Each document is then scored.

The stability of a document could be found by looking at the last-read and last-write timestamps on the file. You could also look at web logs, or even at metadata in the document that records how many times it has been edited.

Complexity could be determined by scanning the document to see what features it uses. Some features, like script, would weight heavily for complexity. Think of it as a “goodness of fit” metric for how well the features used in the document fit within the ODF model.

Business criticality is harder to automate, but could be done based on owner of the document, metadata in the document, location of the document (public web page versus intranet), etc.

Calculate the scores, suggest actions to take, and then automate the action. This could lead to a nice automated migration solution.
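The discovery and scoring steps above might be sketched as follows. Everything specific here is an assumption made for illustration: the extension list, the 30-day and 365-day thresholds, and the feature names and weights. Detecting which features a document actually uses needs a format-aware scanner, which this sketch leaves out.

```python
import os
import time
from pathlib import Path

# Assumption: legacy binary Office documents are identified by extension alone.
LEGACY_EXTENSIONS = {".doc", ".xls", ".ppt"}

# Hypothetical feature weights; script weighs most heavily, as suggested above.
FEATURE_WEIGHTS = {"macros": 10, "embedded_objects": 3, "tracked_changes": 2}


def find_legacy_documents(root):
    """Yield every file under root whose extension marks it as legacy."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if Path(name).suffix.lower() in LEGACY_EXTENSIONS:
                yield Path(dirpath) / name


def stability(path, now=None, hot_days=30, cold_days=365):
    """Classify a file as hot, warm or cold from its last-write timestamp.

    st_atime (last read) could be consulted as well, but many file systems
    do not update it reliably, so only st_mtime is used here.
    """
    now = time.time() if now is None else now
    age_days = (now - Path(path).stat().st_mtime) / 86400
    if age_days <= hot_days:
        return "hot"
    if age_days >= cold_days:
        return "cold"
    return "warm"


def complexity(features, medium_at=3, high_at=10):
    """Classify a document from the names of the features it uses."""
    score = sum(FEATURE_WEIGHTS.get(feature, 0) for feature in features)
    if score >= high_at:
        return "high"
    if score >= medium_at:
        return "medium"
    return "low"
```

A driver would walk a file share with find_legacy_documents, score each file, and hand the resulting class to whatever policy the organization has chosen. Business criticality is left out because, as noted, it is the hardest part to automate.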

In summary, it is probably not worthwhile simply to go out and convert all of your legacy documents in a giant cathartic orgy of document transformations. Not all documents are worth that effort. In any organization you probably have many, many documents that will never be read again, ever. You also likely have some very complex documents that probably should be reengineered as web applications on your intranet. The other documents, the ones in the middle, are where you focus your migration effort.

Not converting documents does have one downside: the longer you wait to convert, the harder it might become, because the program that created the original document will at some point cease to work on readily available hardware, and new programs may be unable to read the old format.

On the other hand, I guess it’s a balance of costs: certain high cost to do all conversions now, versus a low probability of a high cost to do a small number of conversions much later.
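To make that balance concrete, here is a back-of-the-envelope sketch. The function names are made up, and the figures in the example are invented placeholders chosen only to show the shape of the comparison, not measurements of anything.

```python
def convert_now_cost(total_docs, cost_per_doc):
    """Certain cost of converting every legacy document up front."""
    return total_docs * cost_per_doc


def expected_deferred_cost(docs_needed_later, late_cost_per_doc, probability):
    """Expected cost of a few expensive conversions that may never be needed."""
    return probability * docs_needed_later * late_cost_per_doc


# Hypothetical figures: 100,000 documents at 1 cost unit each today, versus
# a 1-in-8 chance of rescuing 400 documents at 50 units each much later.
now_cost = convert_now_cost(100_000, 1)
later_cost = expected_deferred_cost(400, 50, 0.125)
print(now_cost, later_cost)
```

The comparison flips if the late per-document cost or the probability is high enough, which is the downside of waiting described above.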

No big conversion orgy is required. Over 95% of the documents need only to be read. This does not call for a massive conversion. What needs conversion is two things:

– The templates, especially those with plenty of macros. These are the ones that need to be used on a daily basis to produce new documents and feed corporate workflows. You want the new documents to be in the new format.

– The applications, those that use Excel as a reporting or data entry tool, the systems indexing official versions of documents. You want to adapt them to use the new format.

Almost everything else can be left out of scope for the migration project. Most documents are kept only for archival value. Users can be trained to convert the documents they need to have handy into the new format. Selecting these documents is a business decision that is best left to their owners. The conversion itself can be performed on an as-needed basis. No need to gather a big inventory and anguish about getting the classification right. Users will convert only what they need. After a while most documents will become obsolete and there will be no point in converting them anymore.

If there are difficult cases that can’t be read with the new office suite, a server can be configured to automatically convert anything sent to it into PDF. This server can be left available for years, until there are very few legacy documents still relevant.
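The core of such a conversion server could be as small as the sketch below. It assumes a LibreOffice or OpenOffice-style soffice binary is installed and on the PATH, and uses its headless --convert-to option; the function names, the timeout, and the way documents reach the server are all left as assumptions.

```python
import subprocess
from pathlib import Path


def pdf_command(source, out_dir, soffice="soffice"):
    """Build the command line for a headless conversion to PDF."""
    return [
        soffice,
        "--headless",
        "--convert-to", "pdf",
        "--outdir", str(out_dir),
        str(source),
    ]


def convert_to_pdf(source, out_dir, soffice="soffice", timeout=120):
    """Convert one document and return the path where the PDF is expected."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # check=True raises CalledProcessError if the converter reports failure.
    subprocess.run(pdf_command(source, out_dir, soffice), check=True, timeout=timeout)
    return out_dir / (Path(source).stem + ".pdf")
```

Whether the suite in question can actually open a given difficult document is exactly what needs testing; if it cannot, the server would have to drive the legacy application instead.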

Stuff that needs to be retained for a very long time for legal reasons can be batch processed into PDF up front if keeping it in the legacy format is not good enough. Users can point out where repositories of such documents are stored. If there is a legal reason to keep them, they should be indexed and kept in a well-defined place anyway. If they are not, then you have grounds for a project having nothing to do with the choice of the next file format or office suite.

For read-only compatibility, PDF is more than good enough. I find the latter point ironic, since Microsoft advertises read-and-convert compatibility as the main justification for OOXML without noting that PDF already does just about everything that needs to be done. We don’t want to retroactively edit our archives.

If there is anything left that still can’t be handled, there is an administrative procedure called derogation. You give an individual who has unusual requirements some unusual rights to deviate from the corporate standard and use a second office suite on top of the main one. He knows he is an exception and will jump through the required hoops to interface with the rest of the corporation. After a while the annoyance will make him find ways to bring himself back into the corporate standard.

In my opinion, massive legacy document conversion is a red herring.

A more serious issue is what you do with correspondents and business partners who insist on using a different file format. The perception that Microsoft defines the standard due to market share works like a self-fulfilling prophecy. There are people who will be reluctant to adopt ODF for fear of being unable to exchange documents with the masses that will use OOXML. But if everybody thinks the same, there is no real reason to stick with Microsoft, just some irrational group behavior.

An excellent guide to study and share with others and organizations. Thanks for taking the time to outline it in such a clear manner, Rob.

For me, I will convert my personal writings (short stories, theses, dissertation) to ODF, but since StarOffice/OpenOffice reads my legacy .doc files with such accuracy, your guidelines will save me a year of conversion time.

Conversion costs apply regardless of whether you go with ODF or OOXML. This is important because the Microsoft compatibility talk implies that ODF requires costly conversions while OOXML does not. This perception is wrong.

We don’t need to convert most legacy documents. Among those that need conversion, most are kept for archival purposes and can be converted to PDF because they will never be edited. What is left?

The conversion to ODF is required for stuff like templates and application front-ends. Should one choose OOXML, these same documents and applications also need to be converted. You don’t want to keep generating legacy stuff forever.

Whatever format one chooses, there is a need to inventory, convert and test. A difference in costs will occur only if OOXML somehow allows faster or better automatic conversion than ODF. How does that compare to licensing or hardware upgrades for Office 2007? Compatibility issues are overblown.

The difficulty of doing so is not really a good argument: not only has MS made import/export converters for older versions of Office available, but it provides tools* that make inventorying and bulk conversion relatively trouble-free. Seriously, what could they have done to make it any easier?

By not upgrading now, one is simply passing on costs to the future. It’s akin to deferring maintenance–generally not a wise idea.

Admittedly, there may be some intricate documents that break upon conversion. But that’s why, before converting, you archive all your documents. Then, if something goes amiss, you always have the option of restoring the original file.

I tested with 1 GB of Office doc and xls files and the MS mass OOXML converter. It seems to work pretty well.

I did notice one or two slight look changes in the spreadsheets.

We might consider it as a valid method for conversion in a year or more, although we do not expect the old MS Office formats to cause any problems for 5-10 years at least. We did actually find that it reduced storage use by almost 27%. That would save us 100k-200k in expenses a year (storage + backup + servers), not including possible network/email server savings. We should test more, as the savings could be due to the nature of the converted documents in the tested set.

Although we consider ODF a fine alternative, we have not considered it as an option (yet), as we consider the prime choice to be between Office products and not between Office formats.

I don’t think many people look at conversion as a space saver, but it might well be worth checking out if you have millions of documents.

I forgot to mention another good conversion strategy: to implement an internal service bureau. This would be a team of experts that can help users with difficult conversions. The service bureau may also operate utilities like:

– A server that converts documents received by mail into PDF and mails the result back to the user.

– A Terminal server running Microsoft Office for users that have a one-time or short term need to edit OOXML documents sent by third parties.

During initial implementation, the service bureau will act as a task force to convert large repositories where there is a business reason to convert.

Afterwards, conversion will be an on-going activity because outside correspondents might not migrate to ODF right away and insist on OOXML. This situation will persist until the market perception that document standards must come from Microsoft wears out.

The service bureau staff requirement will be reduced past the initial implementation, but some service will need to be maintained. Once ODF becomes dominant, the service bureau can be totally discontinued.

A note for those who believe too much in automated tools. They are good, but only to a point. Software has bugs. Converted documents need to be checked for fidelity. Inventories need to be validated for accuracy. This is a project cost in the cycle of inventory, convert and validate. This cost is a function of volume. Any strategy that reduces the volume reduces costs. Also, documents have retention limits past which they can be destroyed. In many cases, if the retention limit is shorter than the period for which you can keep the ability to read and convert legacy documents on demand, there is no immediate business need to convert.
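The retention argument reduces to a single date comparison, sketched here. The function name and both dates are hypothetical; in practice the second date is an estimate of how long the legacy application or a converter will stay usable.

```python
from datetime import date


def needs_conversion(retention_ends, legacy_readable_until):
    """True only if the document must outlive our ability to read its format."""
    return retention_ends > legacy_readable_until


# Hypothetical dates: a record that may be destroyed in 2012, while legacy
# readers are expected to remain available until 2017.
print(needs_conversion(date(2012, 1, 1), date(2017, 1, 1)))
```

Filtering an inventory through a check like this before any conversion is one way to reduce the volume that has to be converted and validated.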

Any conversion requires some human verification to see that it worked correctly. I’ll suggest a criterion for all: when translation technology is so perfect that you would be willing to run your resume through it, and send out the resulting document, as-is, without looking at it first, then translation software is good enough to run without a manual check, and even then only for documents as simple as a resume.

Compare this to image conversions, where the result is expected to be 100% flawless. This is because image formats (TIFF/GIF/JPG/PNG) represent nearly the same thing. They are really just different encodings of a rectangular grid of pixels, with different colors, along with some metadata specifying transparency, color space, compression, etc. But the core of it is just a grid of pixels, a bitmap. The abstract model expressed by a document format is several thousand times more complex than this.

Rob, doesn’t your above statement about the similarities of the different image formats invalidate much of your reasoning about why OOXML cannot exist as a separate standard format?

If two different image formats which represent almost exactly identical subject matter can exist as non-contradictory standards, why cannot two different office document formats that encompass significantly different object models do the same?

The image formats we have today were developed over a 20-year period. They developed to meet different needs, first scanners (TIFF), then for use in web graphics (GIF) then photographs (JPEG) and then to avoid a GIF patent (PNG).

I should also note that the JTC1 prohibition against contradictory standards did not exist 10-15 years ago when these conflicting image standards came about. The contradiction clause in the JTC1 Directives is a relatively recent addition. So one can assume that it was added to curb the abuses of the past. So presenting examples of contradictory standards before then does not really make a strong argument.

So back to OOXML. It was submitted to ISO only three months after ISO’s publication of ODF. This is hardly a timescale on which one can argue that ODF was superseded by newer technology. It is also clear that OOXML is attempting to serve the same market niche as ODF: personal productivity applications. No one can argue that Microsoft Office and OpenOffice are in different markets.