XPS (XML Paper Specification) is a fixed page format specification that is a useful alternative to PDF. Just as PDF is a 'cut-down' version of PostScript, XPS is a reduced schema version of XAML specifically for fixed page layout. With XPS being XML based, it should be a great format for generating your own documents. Unfortunately, there seems to be little available describing this format in a way that's useful for actual implementation when you want to do just that. I'm hoping to help fill in some of those gaps with a (short) series of articles.

My introduction to XPS began with mocking up documents using Word and then "printing" them using the XPS Printer driver provided by .NET v3, afterwards examining the XPS documents to learn how they are structured and how to manipulate them. Apparently, if you have MS Office 2007 and get an optional update, you can also do a "Save As" to produce an XPS document.

I found that those XPS files produced by Word and the "XPS Printer" often included a large number of unnecessary artifacts (especially if it's a file you've edited several times, changed the fonts, etc.). This particular tool cleans out a large number of those artifacts, eliminates some duplicates, and does a few other tweaks that help to reduce the overall size of the XPS file, although in most cases, only by a few KB. Stepping through what it does also serves as a useful introduction to XPS files.

If you're planning on doing your own XPS output, then mocking up your intended format and using this tool to clean up the result is a really handy way to start.

Originally, I had pursued XPS purely as a proof of concept for a billing system. However, when it became apparent that commercial systems for PDF / PostScript production were going to be in the "insanely expensive" price bracket, my "proof of concept" became the actual production system.

This particular part of the project came out of the necessity of cleaning up marketing materials ready for their inclusion into the customer's bill. This CodeProject article is derived from that work.

The OOXML organisation used by XPS files includes a large number of cross references between different parts (files) and within the individual files. I won't go into whether OOXML is a good or a bad thing, there's already enough argument about that. However, just to add to the confusion, the OOXML "spec" has been slightly tweaked for XPS.

In the case of XPS files, the internal structure can be thought of as having three tiers (please note that this is not the official explanation, but it works for me). At the root, there's the XPS file itself. Next, there are the individual documents carried within that. Finally, there are the individual pages for each document. At each of these tiers may be held references to other parts and also resources of various types. All of this is a gross over-simplification, of course, but you get the idea.

Many of the parts (files) within the OOXML structure can be given different names, rather than the ones used by the "XPS Printer" or that are shown in the sample files, so long as all of the cross references line up.

Within each file, the various parts generally don't have to be in any specific order; order only becomes important for specific issues regarding the layout of pages. Otherwise, just do whatever is most convenient for processing.

After "printing" a sample XPS file yourself (and renaming it to a .zip file), you would probably see a structure similar to the following:
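Something like this (a sketch of the driver's default layout; the exact file names under Metadata and Resources vary with the source document):

```
[Content_Types].xml
FixedDocumentSequence.fdseq
_rels/
    .rels
    FixedDocumentSequence.fdseq.rels
Metadata/
    ...
Documents/
    1/
        FixedDocument.fdoc
        _rels/
            ...
        Metadata/
            ...
        Pages/
            1.fpage
            2.fpage
            _rels/
                1.fpage.rels
                2.fpage.rels
        Resources/
            Fonts/
                ...
            Images/
                ...
```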

At each tier within the XPS document, you can find three different folders, although they don't have to be present at each tier:

_rels: Contains files that describe the relationships the files at this tier have with other parts within the XPS file.

Metadata: Holds metadata files related to this tier. For instance, thumbnail images of the document or the PrintTicket files.

Resources: Contains the resources (e.g., fonts and images) used by this tier of the XPS file.

Root Tier (XPS File)

At the root tier, there will be two files:

[Content_Types].xml: Enumerates the different types of files, specifically the file extensions, contained within this XPS document.

FixedDocumentSequence.fdseq: Lists out the actual documents contained within the XPS file, in effect pointing to the next tier in the hierarchy.

Note the schema namespace declaration in the root Types element and the "rels" extension declaration; these are specific to OOXML. Next, there are the "fdseq", "fdoc", and "fpage" extensions, which all declare parts of the XPS structure. Then, "odttf" for obfuscated OpenType font files; more on these in another article. Unfortunately, "xml" is used as the extension for the metadata PrintTicket files. And finally, "JPG" and "PNG" for the image files; you may also see others depending on what's sitting in your original source document. You can assume "JPG" is always going to be present because the metadata thumbnail image that's generated by the XPS printer driver is always a small JPEG image.
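A sketch of a driver-generated [Content_Types].xml (the ContentType values are the standard OPC/XPS ones to the best of my knowledge; the exact set of Default entries depends on your document):

```xml
<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
  <Default Extension="rels" ContentType="application/vnd.openxmlformats-package.relationships+xml"/>
  <Default Extension="fdseq" ContentType="application/vnd.ms-package.xps-fixeddocumentsequence+xml"/>
  <Default Extension="fdoc" ContentType="application/vnd.ms-package.xps-fixeddocument+xml"/>
  <Default Extension="fpage" ContentType="application/vnd.ms-package.xps-fixedpage+xml"/>
  <Default Extension="odttf" ContentType="application/vnd.ms-package.obfuscated-opentype"/>
  <Default Extension="xml" ContentType="application/vnd.ms-printing.printticket+xml"/>
  <Default Extension="jpg" ContentType="image/jpeg"/>
  <Default Extension="png" ContentType="image/png"/>
</Types>
```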

FixedDocumentSequence.fdseq is normally very simple. Not just its name, but also its extension tells us that it is a fixed document sequence file. For an XPS printer driver generated document, it should always look like this:
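A minimal sketch, assuming the driver's default document name:

```xml
<FixedDocumentSequence xmlns="http://schemas.microsoft.com/xps/2005/06">
  <DocumentReference Source="Documents/1/FixedDocument.fdoc"/>
</FixedDocumentSequence>
```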

You can see that it identifies the FixedDocumentSequence.fdseq file in the root tier and assigns it an arbitrary ID of R0. It also identifies the metadata thumbnail image which will be the thumbnail image for the entire XPS file itself.
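That root-level _rels/.rels file looks something like this (a sketch; the relationship Type URIs are the standard XPS/OPC ones, while the thumbnail's file name is an assumption):

```xml
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
  <Relationship Id="R0"
      Type="http://schemas.microsoft.com/xps/2005/06/fixedrepresentation"
      Target="/FixedDocumentSequence.fdseq"/>
  <Relationship Id="R1"
      Type="http://schemas.openxmlformats.org/package/2006/relationships/metadata/thumbnail"
      Target="/Metadata/thumbnail.jpg"/>
</Relationships>
```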

Also, in the _rels folder is FixedDocumentSequence.fdseq.rels - it should be fairly obvious what this is the relationships file for:
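A sketch of its contents (the PrintTicket file name here is an assumption):

```xml
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
  <Relationship Id="R0"
      Type="http://schemas.microsoft.com/xps/2005/06/printticket"
      Target="/Metadata/Job_PT.xml"/>
</Relationships>
```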

Here, the only relationship described is to the metadata PrintTicket file. PrintTickets will also be described in another article. This file will often be the only file in the root tier Metadata folder.

Document Tier

Also in the root there will be the Documents folder. This folder will contain the actual document within the XPS file. When using the .NET v3 XPS Printer Driver, this document (in its own subfolder) is always named "1", although the document can actually have any name. Normally, resources such as fonts and images used within the document will be contained at this tier under Resources.

Under the "1" folder will be FixedDocument.fdoc referred to in FixedDocumentSequence.fdseq above. This file lists out the pages in the order they are to be displayed or printed.
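For a two-page document, a sketch of FixedDocument.fdoc would be:

```xml
<FixedDocument xmlns="http://schemas.microsoft.com/xps/2005/06">
  <PageContent Source="Pages/1.fpage"/>
  <PageContent Source="Pages/2.fpage"/>
</FixedDocument>
```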

Page Tier

Finally, each document subfolder will contain a Pages subfolder, and each Pages subfolder has the individual page files. There will also be another _rels folder at this level containing a .rels file corresponding to each .fpage file.

If you open up each page file, you'll see quite plainly how XPS is a restricted subset of XAML with all the Path and Glyphs elements. Don't be surprised though to see the different parts of the page layout seemingly scattered about within the file. As long as there are no z-axis issues (i.e., one element must appear behind another), the XPS Printer Driver pumps out the various elements of the page in the order that suits it.

FixedPage is the root element for all pages. There can be a lot of other elements contained within a FixedPage element, but the XPS printer driver typically leaves us with just Path (graphics) and Glyphs (text) elements.
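A minimal sketch of a page (the dimensions are in 1/96-inch units, here US Letter; the font path is an assumption):

```xml
<FixedPage Width="816" Height="1056" xml:lang="en-US"
    xmlns="http://schemas.microsoft.com/xps/2005/06">
  <!-- A thin filled rectangle, e.g. an underline rule -->
  <Path Fill="#ff000000" Data="M 100,100 L 300,100 L 300,102 L 100,102 z"/>
  <!-- A run of text -->
  <Glyphs Fill="#ff000000" FontUri="/Documents/1/Resources/Fonts/F1.odttf"
      FontRenderingEmSize="12" OriginX="100" OriginY="96"
      UnicodeString="Hello"/>
</FixedPage>
```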

When it comes to the actual output of Glyphs, it's the Indices attribute that is used in preference to the UnicodeString. I've occasionally found that this leads to some interesting output. The Indices attribute is a list of all the glyphs to be used. If it is present, then there must be a corresponding character in the UnicodeString for each Indices entry. Each entry in the list comprises a glyph ID, optionally followed by a comma and an AdvanceWidth, with the entries delimited by semi-colons. There is actually a lot more that could be present in Indices, but this is about the limit of what you'll see being pumped out by the XPS printer driver. If you want Justified, Centered, or Right aligned text, then the Indices attribute is essential; take it out and you end up with simple Left aligned text with no special tricks. There is, though, a special trick for outputting Right aligned text without having to delve into the font files, which I'll cover in another article.
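An extract of the kind being discussed, reconstructed for illustration (the font path, glyph IDs, and advance widths are invented; each Indices entry is "glyphID,AdvanceWidth"):

```xml
<Path Fill="#ff000000" Data="M 100,656 L 347,656 L 347,657 L 100,657 z " />
<Glyphs Fill="#ff000000" FontUri="/Documents/1/Resources/Fonts/F1.odttf"
    FontRenderingEmSize="12" OriginX="100" OriginY="120"
    UnicodeString=" " Indices="3,32" />
<Glyphs Fill="#ff000000" FontUri="/Documents/1/Resources/Fonts/F1.odttf"
    FontRenderingEmSize="12" OriginX="104" OriginY="120"
    UnicodeString="Total: " Indices="58,71;82,44;87,28;68,55;79,33;29,33;3,32" />
```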

In the above extract, you can see some of the redundant artifacts that can be "cleaned" out. Within the Data attribute of the Path element, the spaces after the "M" and "L" are not needed, nor is the space before the terminating "z". The Glyphs element that has a UnicodeString of " " is completely unnecessary, and the trailing space at the end of the UnicodeString (and Indices) attributes in the next Glyphs element can also be eliminated. These may not seem like much, but a heavily edited Word document will tend to have a large number of such artifacts that end up in the corresponding XPS; get rid of these, and you can quite often get rid of some of the embedded font files as well, resulting in a massive reduction in file size.

Other redundant artifacts can be identified by comparing all of the files within the XPS package, looking for duplicates and keeping a record of those found. Later, any file that refers to a duplicate can have that reference altered to point to the original.

Speaking of the obfuscated font files, these are really extracts from the full font file of only the characters needed for your document. This can get interesting when you want to programmatically output some XPS (without using the .NET XPS methods) and find some of your characters have mysteriously disappeared.

This is a simple console application designed to be executed from your command line. Pass it the name of the XPS file you want cleaned. It will describe the steps it's going through as it progresses, and then finally, leave you with an output file with "-clean" appended to the filename.

Please read "Other Stuff" at the bottom of this article as you will need to get the ICSharpCode zip library to make this all work and I haven't put its DLL into the Zip.

It would be trivial to convert this simple application to a service or DLL.

This code should really be thought of as an XML pipeline, and in fact much of its operation could be changed to pass the constituent documents through as streams from one step to the next rather than using the intermediate files as I have here. However, I've structured it this way so that you can comment out the code that deletes the intermediate files and then go in and have a look inside them.

Also, having cleaned out a lot of the unnecessary artifacts, the resulting parts that make up the "cleaned" version of the files tend to make more sense.

Next, the original XPS file is opened up and each file is compared with every other file of the same type and size in an effort to identify duplicates. These duplicates will be dumped as the cleaned version of the XPS is built up, and any references to them in other files will also be altered. This code isn't that elegant, but it does the job.

// Duplicate files will be dropped and references to them
// altered to point to the 'original'
foreach (ZipEntry ze1 in zf)
{
    string ze1NewName = ze1.Name.Replace("Documents/1/", "Documents/2/");
    // Skip this file if we've already identified it as a duplicate
    if (dupFiles.ContainsKey(ze1NewName))
        continue;
    // Go back through the list to identify any duplicates
    foreach (ZipEntry ze2 in zf)
    {
        // Ready the stream for the 'original' file (re-opened for each
        // comparison so we always start at the beginning)
        using (Stream zs1 = zf.GetInputStream(ze1))
        {
            string ze2NewName = ze2.Name.Replace("Documents/1/",
                "Documents/2/");
            // Skip this file if it happens to be the same one,
            // is not the same type (extension),
            // or the file sizes differ
            if (ze1NewName == ze2NewName ||
                Path.GetExtension(ze1NewName) != Path.GetExtension(ze2NewName) ||
                ze1.Size != ze2.Size)
                continue;
            bool isEqual = true;
            // Ready some small buffers for the comparison
            byte[] buffer1 = new byte[4096];
            byte[] buffer2 = new byte[4096];
            int sourceBytes1;
            int sourceBytes2;
            // Now open up the two files and check if they are the same
            using (Stream zs2 = zf.GetInputStream(ze2))
            {
                // Using a fixed size buffer here makes no noticeable difference
                // for performance but keeps a lid on memory usage.
                do
                {
                    sourceBytes1 = zs1.Read(buffer1, 0, buffer1.Length);
                    sourceBytes2 = zs2.Read(buffer2, 0, buffer2.Length);
                    // If file size can be relied on,
                    // this test should never fire
                    if (sourceBytes1 != sourceBytes2)
                    {
                        isEqual = false;
                    }
                    // Only compare the bytes actually read, not the whole
                    // buffer -- the tail is stale data from a previous pass
                    for (int i = 0; isEqual && i < sourceBytes1; i++)
                    {
                        if (buffer1[i] != buffer2[i])
                        {
                            isEqual = false;
                        }
                    }
                } while (sourceBytes1 > 0 && isEqual);
            }
            if (isEqual)
            {
                // This file must be identified as a duplicate
                dupFiles.Add(ze2NewName, ze1NewName);
            }
        }
    }
}

Then, the actual cleaning phase begins: each file in the XPS that's not some kind of resource or metadata file is processed in turn and put into the output XPS file once it's been worked on. One common change that's applied is to 'move' all the files and references from document '1' to document '2'. Doing this sort of thing makes it a lot easier to merge one XPS file, produced by the XPS printer driver, with another later on.

The page files (.fpage) are passed through the cleanup XSLT to remove the redundant references and do some of the other tweaks; their corresponding .rels files are also regenerated from the 'cleaned' page file, which in turn is processed to build up a list of the resources and metadata actually used.

Below is the XSLT that does most of this cleanup work on the page file itself. The existing XPS methods in .NET 3 are focused around the simple generation of XPS output. To actually manipulate it requires switching to something like XSLT.

Just a note, these XSLTs are specifically set up to accommodate a Microsoft XSLT quirk that dates back at least to MSXML 3. Within each template, each element being created must have the correct namespace declared (unless it's being created inside another element), which will be discarded by the MS XSLT processor when it realises it doesn't need it. If you don't have a namespace declaration, the MS XSLT processor will insert an empty namespace declaration (xmlns="") in your element, which really tends to screw things up quite nicely.
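A minimal illustration of the pattern (the x:Canvas match here is purely illustrative):

```xml
<!-- Without the xmlns on the literal result element, the MS processor
     would emit xmlns="" on it, breaking the XPS namespace inheritance.
     With it declared, the processor discards the redundant declaration. -->
<xsl:template match="x:Canvas">
  <Canvas xmlns="http://schemas.microsoft.com/xps/2005/06">
    <xsl:apply-templates select="@*|node()"/>
  </Canvas>
</xsl:template>
```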

The above XSLT is primarily focused on identifying redundant whitespace and eliminating it. What this occasionally leads to is a situation where a particular font file is no longer needed, and it's in this situation that we can really reduce the size of the XPS file.

I could have added a call to the XSLT document() function to include the list of duplicate files (formatted as XML) and use it in the processing. However, this requires further changes to how the precompiled XSLT is generated, because it's a potential security risk, and also substantial changes to the XSLT itself so that it can identify the references to the 'duplicates' and replace them with a reference to the 'original'. I opted for a solution that's simpler from a coding perspective: just do a search and replace, line by line, on the output from the above XSLT.
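That line-by-line search and replace can be sketched like this (a sketch, not the article's exact code; dupFiles is the duplicate-to-original name map built by the comparison loop shown earlier, and the method name is my own):

```csharp
using System.Collections.Generic;
using System.IO;

static void RedirectDuplicateReferences(
    string partPath, IDictionary<string, string> dupFiles)
{
    // dupFiles: key = duplicate part name, value = name of the part to keep
    string[] lines = File.ReadAllLines(partPath);
    for (int i = 0; i < lines.Length; i++)
    {
        foreach (KeyValuePair<string, string> dup in dupFiles)
        {
            // References inside XPS parts are absolute, e.g. "/Documents/2/..."
            lines[i] = lines[i].Replace("/" + dup.Key, "/" + dup.Value);
        }
    }
    File.WriteAllLines(partPath, lines);
}
```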

The next XSLT to be run regenerates the .rels file for us from the 'cleaned' fpage file, in effect throwing away the references to now redundant resources and/or metadata.

<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:x="http://schemas.microsoft.com/xps/2005/06"
    exclude-result-prefixes="x">
  <xsl:output indent="yes" method="xml" encoding="utf-8" omit-xml-declaration="yes"/>
  <xsl:key name="resourceKey"
      match="//@*[starts-with(., '/Resources/Fonts/')
             or starts-with(., '/Documents/2/Resources/Images/')
             or starts-with(., '/Documents/2/Metadata/')]"
      use="."/>
  <xsl:template match="/">
    <Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
      <!-- Work our way through every unique resource attribute
           using the Muenchian method -->
      <xsl:apply-templates select="//@*[contains(., '/Resources/') or contains(., '/Metadata/')]
          [generate-id() = generate-id(key('resourceKey', .))]"/>
      <!-- Add in a reference for the printticket
           as this won't be found in the source page files -->
      <Relationship Type="http://schemas.microsoft.com/xps/2005/06/printticket"
          Target="/Documents/2/Metadata/Page1_PT.xml">
        <xsl:attribute name="Id">
          <xsl:value-of select="concat('R', count(//@*[starts-with(., '/Resources/Fonts/')
              or starts-with(., '/Documents/2/Resources/Images/')
              or starts-with(., '/Documents/2/Metadata/')]))"/>
        </xsl:attribute>
      </Relationship>
    </Relationships>
  </xsl:template>
  <xsl:template match="@*">
    <!-- List out the resource identifier -->
    <Relationship Type="http://schemas.microsoft.com/xps/2005/06/required-resource"
        xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
      <xsl:attribute name="Target">
        <xsl:value-of select="."/>
      </xsl:attribute>
      <xsl:attribute name="Id">
        <xsl:value-of select="concat('R', position())"/>
      </xsl:attribute>
    </Relationship>
  </xsl:template>
</xsl:stylesheet>

Well, it wouldn't be a real project involving XSLT unless the Muenchian method made an appearance now, would it? This XSLT ensures that we have only one Relationship element for each unique resource.

Another XSLT works with all the other files that need their references to various resources adjusted because we're moving everything from "1" to "2".

<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:x="http://schemas.microsoft.com/xps/2005/06"
    xmlns:r="http://schemas.openxmlformats.org/package/2006/relationships"
    exclude-result-prefixes="x r">
  <xsl:output indent="yes" method="xml" encoding="utf-8" omit-xml-declaration="yes"/>
  <xsl:template match="/">
    <xsl:apply-templates select="*"/>
  </xsl:template>
  <xsl:template match="r:Relationships">
    <!-- Processing for the 'primary' elements of page related .rels files -->
    <!-- Actually this particular template should never be invoked,
         it's here as an 'insurance' policy against 'maintenance' -->
    <Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
      <xsl:apply-templates select="@*"/>
      <xsl:apply-templates select="*[not(contains(@Target, '/Fonts/'))]"/>
    </Relationships>
  </xsl:template>
  <xsl:template match="r:Relationship">
    <!-- Processing for the 'primary' elements of other .rels files -->
    <Relationship xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
      <xsl:apply-templates select="@*"/>
    </Relationship>
  </xsl:template>
  <xsl:template match="x:FixedDocument|x:FixedPage|x:FixedDocumentSequence">
    <!-- Processing for the 'primary' elements for other than .rels files -->
    <xsl:element name="{name(.)}" namespace="http://schemas.microsoft.com/xps/2005/06">
      <xsl:apply-templates select="@*"/>
      <xsl:apply-templates select="*"/>
    </xsl:element>
  </xsl:template>
  <xsl:template match="*">
    <!-- Processing for all other elements -->
    <xsl:element name="{name(.)}" namespace="http://schemas.microsoft.com/xps/2005/06">
      <xsl:apply-templates select="@*"/>
      <xsl:choose>
        <xsl:when test="count(*) &gt; 0">
          <!-- If there are sub-elements process these -->
          <xsl:apply-templates select="*"/>
        </xsl:when>
        <xsl:otherwise>
          <!-- If there are no sub-elements
               then just take the contents of this element -->
          <xsl:value-of select="."/>
        </xsl:otherwise>
      </xsl:choose>
    </xsl:element>
  </xsl:template>
  <xsl:template match="@*">
    <!-- Processing for all attributes -->
    <xsl:attribute name="{name(.)}">
      <xsl:choose>
        <xsl:when test="starts-with(., '/Documents/1/Resources/Fonts/')">
          <!-- Alter font references to point to the 'root' resources folder -->
          <xsl:value-of select="substring-after(., '/Documents/1')"/>
        </xsl:when>
        <xsl:when test="starts-with(., '/Documents/1')">
          <!-- Alter all other document references to point to document '2' -->
          <xsl:value-of select="concat('/Documents/2',
              substring-after(., '/Documents/1'))"/>
        </xsl:when>
        <xsl:otherwise>
          <!-- Leave all other references alone -->
          <xsl:value-of select="."/>
        </xsl:otherwise>
      </xsl:choose>
    </xsl:attribute>
  </xsl:template>
</xsl:stylesheet>

The final XSLT actually produces text output. This one is designed to read all of the .rels files (those for each page, and the other one in the 'root' _rels folder, plus any others) and simply generate a listing that we can process to determine what resources and metadata files we really need.

<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:r="http://schemas.openxmlformats.org/package/2006/relationships"
    exclude-result-prefixes="r">
  <xsl:output indent="yes" method="text" encoding="utf-8" omit-xml-declaration="yes"/>
  <xsl:template match="/">
    <!-- List all the relationship 'targets': the resources and metadata files -->
    <xsl:for-each select="r:Relationships/r:Relationship/@Target">
      <xsl:value-of select="."/>
      <xsl:value-of select="' '"/>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>

The output from this last XSLT is the only one we don't pump out to a temporary file. Instead, it's pushed via a stream into a StringBuilder that's later processed into a list.

That brings us to stage 3 of processing the original XPS file. In this third pass, the files we worked with in the second stage are skipped (their processed output is already in the new XPS file); instead, this pass picks up all the resource and metadata files and, using the above list, puts them into the right places in the new XPS file. Any 'duplicate' files are tossed (ignored), and finally, any other outstanding files are also grabbed at this time.

Comments and Discussions

You mention that in a future article you will describe a trick to right align text. I looked and couldn't see a future article with this information. I am wondering if you could discuss what needs to be done to right align the string in the following:

The xps file was generated from Excel with the string in this field containing "1234567890" and being right aligned. I have figured out myself that if I use an En Space (U+2002) and left pad the text so that the unicode string is " $78.90" that I can make the text in the xps appear to be right aligned. Does your trick do something other than left pad the string like I found that could work?

My trick certainly does do something different. However, yours is more appropriate where you want to ensure that recipients of your document can still cut and paste the values from the document.

Here's my trick:

1. Find the OriginX that corresponds with where the text should finish, and use this value instead.
2. Reverse the string; in your case it would become "09.87$".
3. Flip the Bidi flag to output the string right to left.
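Applied to your example, the result would look something like this (the OriginX value and font path are assumptions):

```xml
<!-- Right-aligned "$78.90": OriginX is the right edge of the text,
     the string is reversed, and BidiLevel="1" renders it right to left -->
<Glyphs Fill="#ff000000" FontUri="/Documents/1/Resources/Fonts/F1.odttf"
    FontRenderingEmSize="12" OriginX="400" OriginY="96"
    BidiLevel="1" UnicodeString="09.87$"/>
```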

Hey Presto, right aligned text. But if you cut and paste from the XPS document itself, you'll get "09.87$".

The "proper" way is to know the widths of the glyphs and use a suitable AdvanceWidth value for the first glyph (the '$' sign) in the indices. But that requires processing the font file to figure that out, which is probably more than what you want.

Never attribute to malice that which can be adequately explained by stupidity.

Thanks for the idea. I initially tried padding text but the results were really really poor so I ended up coming up with a way to dynamically right align things.

I used your idea and added in the excel file (before it was generated into an xps) a 3 pixel row containing a keyword in each cell up to and including a cell to the right of all the content. So I can find the Glyph tags in the xps containing this keyword, find their OriginX, and use this as a guide as to where each column starts. So right aligning something is as easy as knowing which column it is in and then getting the column + 1's OriginX from this keyword data.

It means I have a lot of information stored about what is right aligned and in which column but it seems to work quite well. I also can use the OriginX's to figure out roughly where centered text would appear and have a not bad algorithm for it which includes the fontsize.

Thanks again for the idea because the padding was never going to work with what I was needing.

* An improvement to your code would be to make it a class, taking an XPS stream or package in the constructor, with a property to get the cleaned version.

This is a possibility. I'm not really working on this particular package much at the moment as it does what I've asked of it - mostly. However ...

Guson wrote:

* Another improvement is to use the control files, and not assume the "Documents/1" or "1.fpage" names used by the MS XPS printer.

Yes, the above was an assumption. The right way of handling this is to 'read' the primary index document and use that to 'traverse' through the document. I am contemplating this change, as some other tools that generate XPS documents do some 'strange' things.

There is a good post by Jo0815 from 7/11/07 which shows some code for how part of this task could be accomplished.

I also experimented by hand and deleted some files from the XPS archive and removed that page name from the FixedDocument.fdoc file, and that did work... I ended up with a 2-page document that worked correctly in the viewer when it had originally been a 3-page document.

In Jo0815's post, he (or she, I suppose) describes how there is still code to be written to remove any resources which are no longer necessary.

When I read this article about the XPSCleaner, I imagine that the two functionalities are a good fit, since I believe from your text that your app WOULD remove extraneous resources.

Have you considered adding this functionality to your app (or tackling it as a future written article)?

My code still isn't perfect as it assumes the XPS is coming from a particular source. But if you start with that source and then remove the page it will complete the cleanup of redundant resources.

The steps to remove a page (without reading your cross-referenced articles, so I could be repeating things here) are the following:

0. Open up the XPS file as a ZIP.
1. Examine the content types file for which file actually contains the document page listing - let's assume that's FixedDocument.fdoc.
2. Remove the page reference from FixedDocument.fdoc.
3. Remove the actual page file being referred to above.
4. Remove the associated .rels file for the page that's just been deleted.

Then clean up the other resources; this really requires identifying all the remaining resource references and the resources themselves, and then eliminating the redundant ones. Use my code for that.