OpenOffice.org ODF, Python and XML

Combine Python with the open format of ODF files to manipulate fine details.

Simple String Replacement

Let's take fix1.py and make an easy modification. Whenever two
hyphens appear, replace them with the em dash. Then, when we're
done, write the XML to stdout—that's exactly what the shell script
(fixit.sh) expects.

When I select the long dash (the em dash), its Unicode value, U+2014,
appears in the lower-right corner, where I've put a purple ellipse;
that's the value to put into the string in place of the double
hyphens. Let's call this script fix2.py:
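A minimal sketch of this kind of replacement, not the fix2.py listing itself. It assumes the document text lives in xml.dom.minidom text nodes (the td.data used later in this article suggests as much); the function name replace_in_text_nodes and the sample XML are hypothetical, chosen only for illustration.

```python
# -*- coding: utf-8 -*-
from xml.dom import minidom

EM_DASH = u'\u2014'  # the Unicode value of the em dash


def replace_in_text_nodes(node, old, new):
    """Recursively replace old with new in every text node under node."""
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE:
            # Text nodes keep their characters in .data
            child.data = child.data.replace(old, new)
        else:
            replace_in_text_nodes(child, old, new)


# Hypothetical stand-in for the XML pulled out of an ODF file.
SAMPLE = u'<p>Wait--what? <b>Yes--that.</b></p>'

doc = minidom.parseString(SAMPLE.encode('utf-8'))
replace_in_text_nodes(doc, u'--', EM_DASH)
result = doc.documentElement.toxml()
```

A real script would then write the whole document to stdout, for instance the bytes returned by doc.toxml('utf-8'), so that a wrapper such as fixit.sh can pick it up.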

Success! Now for the rest. Besides the double hyphen, we want to
change the en dash (U+2013) into an em dash. The syntax is just like
that of the double-hyphen replacement.
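As a rough illustration, both dash fixes can sit side by side as plain substring replacements. The helper name fix_dashes and the sample string are made up for this sketch.

```python
# -*- coding: utf-8 -*-
EN_DASH = u'\u2013'
EM_DASH = u'\u2014'


def fix_dashes(text):
    """Turn double hyphens and en dashes into em dashes."""
    text = text.replace(u'--', EM_DASH)    # double hyphen -> em dash
    text = text.replace(EN_DASH, EM_DASH)  # en dash -> em dash
    return text


fixed = fix_dashes(u'one--two \u2013 three')
```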

Replacement Using Regular Expressions

Replacing straight quotes with curly ones is more complicated, though,
because we have to decide whether each one is a starting or an ending
double quote. How to tell? Well, if the quote character is
at the start of the string, and there's a nonspace character afterward,
it's a left (or start-of-quote) curly quote. Ditto if
there's a blank before it and a nonspace afterward.

That's the easy way to describe it. We could code it like that,
or we could simply write a regular expression. I looked at the section
titled “re -- Regular expression operations” in Chapter 4 of Python's
library documentation and eventually came up with this:

sDpat = re.compile(r'(\A|(?<=\s))"(?=\S)', re.U)

Let me explain this left to right. We are creating sDpat, the
Starting Double-quote PATtern, that is, the pattern for a starting
double quote. We do that by calling the compile function in the re
module (for regular expressions). That analyzes the pattern once and
creates a regular expression object. We'll use sDpat to match straight
double quotes that should be turned into nice curly quotes at the
start of a quotation.

Now, about the pattern—the pattern contains a double-quote character
(") so we delimit it with single quotes, 'like this'. Also, we'll
pass some escapes (such as \A and \s) to re.compile, so let's make
this a raw string by putting an r in front of it.

(A little explanation for Perl users: in Python, \ escapes are
interpolated except in raw strings, whether single-quoted or
double-quoted; the delimiters don't affect interpolation as they
do in Perl.)

We can see how raw strings work by using Python's shell:

>>> print('normal string: \n is a newline')
normal string: 
 is a newline
>>> print(r'raw string: \n is not a newline')
raw string: \n is not a newline
>>>
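To tie this back to our pattern, here is a small sketch showing that the raw string is just a more readable spelling of the same characters. The variable names are hypothetical.

```python
# The pattern written as a raw string...
raw_pattern = r'(\A|(?<=\s))"(?=\S)'

# ...and the same pattern in a normal string, where every backslash
# must be doubled to survive Python's own escape processing.
escaped_pattern = '(\\A|(?<=\\s))"(?=\\S)'

same = (raw_pattern == escaped_pattern)

# In a raw string, \n stays as two characters: a backslash and an n.
raw_len = len(r'\n')
cooked_len = len('\n')
```

Both spellings hand re.compile exactly the same text; the raw form simply spares us the doubled backslashes.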

So, what's in that raw string? It consists of three parts:

1. The part before the quote character: (\A|(?<=\s)).
What we are doing is matching something (the '"' in this
case), but only if it occurs at the beginning of the
string or if it's preceded by a whitespace character.
The \A means “match beginning of the string”, the | means
“or” and (?<=\s) means “match if immediately preceded by
whitespace (a blank, tab or newline), but don't include that
whitespace itself in the match”. The enclosing parentheses
denote grouping.

2. The straight double quote itself: ".
That's what we're matching.

3. The part after the '"': (?=\S).
What we're doing is adding another condition: that the quote
character be followed by a non-whitespace character.

If all three conditions are met, that is, if a quote is there
(condition 2), and it's either at the start of the string or preceded
by whitespace (condition 1), and it's followed by some non-whitespace
character (condition 3), we want to replace it with an opening
double-quote character.
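The three conditions can be checked against a few sample strings. This is only a sketch: the helper opening_quote_positions and the samples are invented for illustration, while the pattern is the one shown above.

```python
import re

sDpat = re.compile(r'(\A|(?<=\s))"(?=\S)', re.U)


def opening_quote_positions(text):
    """Return the index of every straight quote that sDpat accepts."""
    return [m.start() for m in sDpat.finditer(text)]


samples = [
    u'"Hello," she said.',   # quote at start of string: matches at 0
    u'She said "hi" twice',  # quote after a blank: matches at 9
    u'a " b',                # followed by whitespace: no match
    u'5" pipe',              # preceded by a non-space: no match
]

positions = [opening_quote_positions(s) for s in samples]
```

Note that the closing quotes in the first two samples are left alone, because they are preceded by a non-whitespace character.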

Besides the pattern, you also can pass flags to re.compile. We pass
re.U so that escapes such as \s and \S follow the Unicode character
database. Because we're parsing a Unicode string, that is what we
want: a no-break space before a quote, for example, then counts as
whitespace.
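Here is a small sketch of what the flag buys us, using a no-break space (U+00A0) as the character in front of the quote. The sample text is hypothetical. In Python 3, patterns on ordinary strings are Unicode-aware by default, so there re.U is redundant but harmless.

```python
import re

sDpat = re.compile(r'(\A|(?<=\s))"(?=\S)', re.U)

NBSP = u'\u00a0'  # no-break space, whitespace per the Unicode database

# A quote that follows a no-break space rather than a plain blank.
text = u'Dr.' + NBSP + u'"Who"'

match = sDpat.search(text)
found_at = match.start() if match else None
```

With re.U in effect, the lookbehind (?<=\s) accepts the no-break space, so the quote at index 4 is recognized as a starting quote.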

Note that the syntax for replacing with a regular expression differs
from that of substring replacement: we use the sub (substitute) method
of the regular expression object (sDpat in this case):

td.data = sDpat.sub(sDquote, td.data)

Here we're taking td.data, the data in this particular node in the
XML tree, looking for the regular expression specified by sDpat,
and replacing whatever matched it (the straight " character in the
appropriate context) with the starting double quote, sDquote.
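A self-contained sketch of that substitution on a minidom text node follows. It assumes sDquote is the left double quotation mark, U+201C, and uses a made-up one-paragraph document; neither is taken from the full script.

```python
# -*- coding: utf-8 -*-
import re
from xml.dom import minidom

sDpat = re.compile(r'(\A|(?<=\s))"(?=\S)', re.U)
sDquote = u'\u201c'  # assumption: left (starting) curly double quote

# Hypothetical document with a single text node inside <p>.
doc = minidom.parseString(b'<p>He said "hello" and left.</p>')
td = doc.documentElement.firstChild  # the text node

# Replace every straight quote that sits in a starting position.
td.data = sDpat.sub(sDquote, td.data)

new_text = td.data
```

Only the first quote changes; the one after "hello" stays straight, because it is preceded by a letter rather than by whitespace.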
