
Differences because of implementation differences or functionality gaps

Difference

Description

Proposed resolution

Status

Parsoid is based on HTML5 semantics whereas the PHP parser is based on HTML4 semantics

Parsoid uses Domino, which implements the standardized HTML5 semantics, whereas the PHP parser relies on Tidy, which is based on HTML4. This leads to a number of parsing and rendering differences, primarily around broken HTML (of which there is a lot on Wikimedia wikis).

Parsoid is getting an update to generate the same heading ids as core. We will look at generating the mw-headline class in Parsoid as well, but we don't intend to generate the inner <span> elements, since they are unnecessary. If any bots or gadgets are affected, we'll work with their authors to update them.

Done

Edge-case differences between Parsoid's native implementations of some extensions and the PHP implementations of the same

Any extension that processes wikitext (e.g. Cite, Gallery) needs a native implementation in Parsoid. However, because of implementation differences, there are edge cases where the output differs (e.g. T104662, T96555, and a few others related to gallery).

Some of these (T104662, T96555) will be fixed in Parsoid. Others might be tweaked in the PHP implementation, or we might just treat the edge-case differences as undefined behavior that editors shouldn't rely on. Since these are edge cases, they should be fairly uncommon on wikis (otherwise, we would have fixed them).

In progress

Unavailability of some parser hooks in Parsoid compared to PHP parser

Parsoid and the PHP parser have different internals, and hence not all of the PHP parser's tag hooks are available in Parsoid. This page with parser hook stats lists extensions and the parser hooks they use. Some hooks, like ParserBeforeStrip and ParserAfterStrip, have no equivalent in Parsoid. So, in a Parsoid-only world, this could affect the output and functioning of extensions like <translate>.

We are going to develop a parser hooks API that is implementation independent (without exposing the internal details of how parsing happens) and port all the Wikimedia extensions to use this new API.

Parsoid is developing an extension API to support existing Parsoid-native extensions cleanly (Cite, Gallery, Poem, etc). We plan to extend the API gradually based on experience with adapting more extensions to work with Parsoid. In parallel, we will continue to deprecate unnecessary hooks and possibly rename some to reflect desired semantics.

This task is likely going to be completed after Parsoid moves to core.

Parsoid doesn't have special handling for pages in namespaces that have generated content. For example, the content for a page in the Category namespace is generated dynamically, and a page in the File namespace similarly has some generated content. There is a good argument to be made that Parsoid shouldn't duplicate this support and that clients should fetch this content from the MediaWiki API directly. However, this leaves Parsoid clients in a bit of a bind, because they don't know which namespaces are special in this way, i.e. for which pages the content is better fetched from the MediaWiki API. So, a good resolution of this problem would be helpful. Maybe Parsoid should handle requests for content in all namespaces and, where that content is better served from the MediaWiki API, redirect the client to the right URL?
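To make the client-side problem concrete, here is a minimal sketch of the routing decision a Parsoid client currently has to hard-code. The namespace ids 6 (File) and 14 (Category) are the standard MediaWiki values; the function name, the returned labels, and the assumption that only these two namespaces are special are hypothetical and for illustration only.

```python
# Standard MediaWiki namespace ids for the two namespaces named above.
NS_FILE = 6
NS_CATEGORY = 14

# Assumption: only these two namespaces have generated content. A real client
# has no authoritative way to discover this set, which is the problem described.
GENERATED_CONTENT_NAMESPACES = frozenset({NS_FILE, NS_CATEGORY})


def choose_content_source(namespace_id):
    """Return a (hypothetical) label for where a client should fetch content."""
    if namespace_id in GENERATED_CONTENT_NAMESPACES:
        return "mediawiki-api"
    return "parsoid"
```

With the redirect idea floated above, this table would live on the server side, and clients could request every page from Parsoid without knowing which namespaces are special.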

We run mass visual diff tests comparing the rendering of Parsoid output and PHP parser output. This table will be filled out as we inspect the visual diffs and identify the underlying cause of each. In addition to the sources of diffs above, here are a few more specific ones that we discovered.

Thumbnail images in PHP parser output have a magnify icon with HTML output of the form: <div class="magnify"><a href=".." class="internal" title="Enlarge"></a></div>. Parsoid output is missing this. This difference is the source of a lot of visual diffs.
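When triaging visual diffs, it can help to check whether a given diff is explained by the magnify icon alone. The following is a minimal sketch using only Python's standard library; the class and function names are hypothetical, and it is not part of the actual visual diffing code.

```python
from html.parser import HTMLParser


class MagnifyCounter(HTMLParser):
    """Counts <div> elements whose class list includes "magnify"."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        # The class attribute may be absent or valueless, so default to "".
        classes = (dict(attrs).get("class") or "").split()
        if "magnify" in classes:
            self.count += 1


def count_magnify_icons(html):
    """Return the number of magnify-icon wrappers found in an HTML string."""
    parser = MagnifyCounter()
    parser.feed(html)
    parser.close()
    return parser.count
```

A non-zero count for the PHP parser output and zero for the Parsoid output of the same page would suggest that the magnify icon contributes to the diff.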

Cite output needs styling (T156351 and T156350). This should also cover the styling requirements for cite ref links: some wikis, like eswiki and frwiki, skip the brackets, and knwiki (Kannada) uses Kannada numerals for the ref text.
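The per-wiki variations described above can be modelled as two independent choices: whether to wrap the label in brackets, and which digit set to use. The sketch below only illustrates that variation space in Python; the function name and parameters are hypothetical, and the real styling is expected to be done in CSS rather than in code like this.

```python
# Unicode code point of KANNADA DIGIT ZERO; the ten Kannada digits are contiguous.
KANNADA_DIGIT_ZERO = 0x0CE6


def format_ref_label(number, brackets=True, digit_zero=ord("0")):
    """Render a ref link label, optionally bracketed, in a given digit set.

    digit_zero is the code point of the zero digit of the target numeral
    system; it must belong to a contiguous block of ten digits.
    """
    digits = "".join(chr(digit_zero + int(ch)) for ch in str(number))
    return f"[{digits}]" if brackets else digits
```

For example, `format_ref_label(12)` models the default bracketed label, `format_ref_label(12, brackets=False)` models the bracket-free style, and passing `digit_zero=KANNADA_DIGIT_ZERO` switches the digits to Kannada numerals.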

The necessary styles for these various wikis are being added to the visual diffing code. Most of these wiki-specific styles are good candidates for addition to commons.css on the respective wikis.

However, as part of this, we've also identified some limitations in the Cite CSS output. We'll have to figure out how to resolve that.

In progress

Stalled on trying to figure out how general i18n support in Parsoid should work vis-a-vis visual editing.

Broken wikitext tables

Tables in fosterable position have different fixups in Tidy vs. an HTML5 parser (RemexHTML, Parsoid).