Module diff

NoDeletes
Raised when the document no longer contains any pending deletes
(DEL_START/DEL_END)

token
Represents a diffable token, generally a word that is displayed to
the user. Opening tags adjacent to the word are attached to this
token in pre_tags, and closing tags that follow the word in
post_tags. Some exceptions occur when empty tags sit next to a
word, so there may be close tags in pre_tags, or open tags in
post_tags.

tag_token
Represents a token that is actually a tag. Currently this is just
the <img> tag, which takes up visible space just like a word but
is only represented in a document by a tag.

href_token
Represents the href in an anchor tag. Unlike other words, we only
show the href when it changes.

merge_delete(del_chunks, doc)
Adds the text chunks in del_chunks to the document doc (another
list of text chunks) with a marker to show it is a delete.
cleanup_delete later resolves these markers into <del> tags.

cleanup_delete(chunks)
Cleans up any DEL_START/DEL_END markers in the document, replacing
them with <del></del>. To do this while keeping the document
valid, it may need to drop some tags (either start or end tags).

split_delete(chunks)
Returns (stuff_before_DEL_START, stuff_inside_DEL_START_END,
stuff_after_DEL_END). Returns the first case found (there may be
more DEL_STARTs in stuff_after_DEL_END). Raises NoDeletes if
there's no DEL_START found.

locate_unbalanced_start(unbalanced_start, pre_delete, post_delete)
pre_delete and post_delete implicitly point to a place in the
document (where the two were split). This moves that point (by
popping items from one and pushing them onto the other). It moves
the point to try to find a place where unbalanced_start applies.

html_annotate(doclist, markup=default_markup)

The documents must be fragments (str/UTF8 or unicode), not
complete documents.

The markup argument is a function to mark up the spans of words.
This function is called like markup('Hello', 'version 2') and
returns HTML. The first argument is the text and never includes
any markup. The default wraps the word in a span with a title, so
markup('Hello', 'version 2') produces
<span title="version 2">Hello</span>.
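A minimal sketch of this API; the function is importable from
lxml.html.diff, and the version labels here are arbitrary strings
chosen for illustration:

```python
from lxml.html.diff import html_annotate

# doclist is ordered oldest to newest; each entry pairs an HTML
# fragment with a version label that is passed to the markup function.
result = html_annotate([
    ('<p>Hello</p>', 'version 1'),
    ('<p>Hello world</p>', 'version 2'),
])
# Words surviving from the first document keep its label; words that
# first appear in the second document get the second label.
```

Passing your own markup function instead of the default lets you emit
any wrapper you like around each annotated span.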

htmldiff(old_html, new_html)

Do a diff of the old and new document. The documents are HTML
fragments (str/UTF8 or unicode), they are not complete documents
(i.e., no <html> tag).

Returns HTML with <ins> and <del> tags added around the
appropriate text.

Markup is generally ignored, with the markup from new_html
preserved, and possibly some markup from old_html (though it is
considered acceptable to lose some of the old markup). Only the
words in the HTML are diffed. The exception is <img> tags, which
are treated like words, and the href attribute of <a> tags, which
are noted inside the tag itself when there are changes.
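A quick usage sketch (htmldiff is importable from lxml.html.diff; the
sample fragments are invented for illustration):

```python
from lxml.html.diff import htmldiff

old = '<p>Stephen Smith is a writer.</p>'
new = '<p>Stephen Smith is an author.</p>'

# Returns the new document's HTML with <ins>/<del> wrapped around the
# inserted and deleted words.
diff = htmldiff(old, new)
```

Since only words are diffed, attribute-only changes (other than href)
produce no <ins>/<del> markup.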

cleanup_delete(chunks)

Cleans up any DEL_START/DEL_END markers in the document, replacing
them with <del></del>. To do this while keeping the document
valid, it may need to drop some tags (either start or end tags).

It may also move the del into adjacent tags to try to put it in a
location similar to where it was originally (e.g., moving a delete
into a preceding <div> tag, if the del looks like (DEL_START,
'Text</div>', DEL_END)).
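The marker mechanics can be sketched with the module-level
DEL_START/DEL_END sentinels; this is a simplified illustration, as real
chunk lists are produced by the diffing machinery:

```python
from lxml.html.diff import DEL_START, DEL_END, cleanup_delete, merge_delete

doc = ['<p>', 'kept ', '</p> ']
merge_delete(['dropped '], doc)   # appends DEL_START, 'dropped ', DEL_END
chunks = cleanup_delete(doc)      # resolves the markers into <del>...</del>
html = ''.join(chunks)
```

After cleanup_delete runs, no DEL_START/DEL_END sentinels remain in the
chunk list.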

split_unbalanced(chunks)

Return (unbalanced_start, balanced, unbalanced_end), where each is
a list of text and tag chunks.

unbalanced_start is a list of all the tags that are opened, but
not closed in this span. Similarly, unbalanced_end is a list of
tags that are closed but were not opened. Extracting these might
mean some reordering of the chunks.
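For example (split_unbalanced is importable from lxml.html.diff; the
chunk lists are invented for illustration):

```python
from lxml.html.diff import split_unbalanced

# Fully balanced: nothing is extracted.
start, balanced, end = split_unbalanced(['<p>', 'Text', '</p>'])

# An opening tag with no matching close ends up in unbalanced_start.
start2, balanced2, end2 = split_unbalanced(['<b>', 'Hello'])
```

The balanced list keeps matched tags and plain text in their original
order; only the dangling tags are pulled out.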

locate_unbalanced_start(unbalanced_start, pre_delete, post_delete)

pre_delete and post_delete implicitly point to a place in the
document (where the two were split). This moves that point (by
popping items from one and pushing them onto the other). It moves
the point to try to find a place where unbalanced_start applies.
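
A sketch of that movement; the function mutates pre_delete and
post_delete in place, and the chunk lists here are invented for
illustration:

```python
from lxml.html.diff import locate_unbalanced_start

unbalanced_start = ['<div>']
pre_delete = ['<p>', 'Text ', '</p> ']
post_delete = ['<div>', 'More ', '</div> ']

# The leading <div> in post_delete matches the dangling start tag, so
# it is popped from post_delete and pushed onto pre_delete.
locate_unbalanced_start(unbalanced_start, pre_delete, post_delete)
```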

When this works, the dangling <div> that was found is effectively
replaced by the <div> already present in the original document. If
it doesn't work out, unbalanced_start is simply thrown away without
doing anything.

tokenize(html, include_hrefs=True)

Parses the given HTML and returns token objects (words with attached tags).

This parses only the content of a page; anything in the head is
ignored, and the <head> and <body> elements are themselves
optional. The content is then parsed by lxml, which ensures the
validity of the resulting parsed document (though lxml may make
incorrect guesses when the markup is particularly bad).

<ins> and <del> tags are also eliminated from the document, as
that gets confusing.

If include_hrefs is true, then the href attribute of <a> tags is
included as a special kind of diffable token.
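
For example (a sketch; token objects subclass str, carrying the word
itself, with the surrounding markup held in pre_tags/post_tags):

```python
from lxml.html.diff import tokenize

html = '<p><a href="http://example.com">link</a> here</p>'

# With hrefs included (the default), the href value appears as an
# extra diffable token alongside the words.
tokens = tokenize(html)
words = [str(t) for t in tokens]

# With include_hrefs=False, only the visible words become tokens;
# the <a> tag itself survives only in pre_tags/post_tags.
no_hrefs = tokenize(html, include_hrefs=False)
```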