Description

MediaWiki core creates ids per heading based on the content of that heading. This is used to link to the given section from the table of contents, and is also often used by users to reference specific sections.

It would be great if Parsoid implemented the same ids, possibly in addition to the current random ids (using a meta tag?).

So, heading anchors are generated from wikitext of a heading, not the innerHTML of the heading. So, how is MCS doing this right now?

I ask because Parsoid doesn't have easy access to the wikitext of individual nodes that come from template content. It only has this for top-level content. So, I am trying to figure out how to deal with this. Of course, most headings are plain text, and the innerHTML of the heading will be equivalent to the wikitext, especially since the formatHeadings code strips HTML tags and normalizes whitespace. But this fails for headings that have quotes, links, or templates.

Scratch all that. I was wrong. It looks like the heading anchors are generated from the HTML, not the wikitext. I misinterpreted $text (param to formatHeadings) as being wikitext, but it is actually html.

IIRC sometimes the {{}} of the template is visible in the ID; it depends on which of several anchor encoding code paths are called. (I think there is one for the actual page HTML, one for the links in page history / recentchanges, and one for returning you where you were after clicking on a section edit link and saving? I might be confusing things, it was a long time ago I looked at this.)

In the enwp sandbox, I tested == {{1x|1=moo and ''boo'' and [[gah]] and {{1x|1=wtf}} x}} == and the heading anchor is <span class="mw-headline" id="moo_and_boo_and_gah_and_wtf_x"> which doesn't have any of the wikitext chars in it.
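To make the observation above concrete, here is a minimal sketch of deriving an anchor from a heading's HTML rather than its wikitext. It is only an approximation under stated assumptions: `heading_anchor` is a hypothetical helper, not the formatHeadings code, and it ignores whatever additional escaping core applies to non-ASCII or special characters.

```python
import re
from html import unescape


def heading_anchor(heading_html: str) -> str:
    """Approximate a heading anchor from rendered heading HTML (hypothetical helper)."""
    # Drop all tags but keep their text content.
    text = re.sub(r"<[^>]*>", "", heading_html)
    # Decode entities such as &amp; so the anchor reflects the visible text.
    text = unescape(text)
    # Collapse runs of whitespace and trim the ends.
    text = re.sub(r"\s+", " ", text).strip()
    # Assumption: spaces become underscores, as in the sandbox result above.
    return text.replace(" ", "_")


rendered = 'moo and <i>boo</i> and <a href="./Gah">gah</a> and wtf x'
print(heading_anchor(rendered))
```

With the rendered form of the sandbox heading as input, this yields the same id as the one observed above, with no wikitext characters in it.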

So, it turns out that inserting a <meta id="foo" /> does not help. https://.../title#foo does not take you to the anchor. You are still left at the top of the page. <h2 id="foo"> works as expected, but we cannot add ids to heading tags since we are using ids for data-parsoid. So, realistic options are using <span id="foo"></span> or <a name="foo"></a> both of which work as expected.

The named-anchor solution seems like a "clean" solution. However, it could affect clients that inspect a-tags, so it needs some discussion with them.

We could use empty spans, but it feels like a hack.

Thoughts?

Or, Parsoid could generate a <h2><span id="..">heading here </span></h2> like the PHP parser does. In any case, no matter which solution we go with, Parsoid's serialization code needs to be fixed up to ignore these new elements.
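As a rough illustration of what "ignore these new elements" could mean for the empty-span option, here is a sketch that drops anchor spans before serialization. Everything in it is an assumption: `strip_anchor_spans` is a hypothetical helper, it works on an HTML string for brevity whereas a real serializer would work on the DOM, and it treats every empty span that carries only an id as a generated anchor.

```python
import re

# Assumption: generated anchors are empty spans whose only attribute is an id.
# A real implementation would need a marker that distinguishes them from
# user-authored empty spans.
ANCHOR_SPAN = re.compile(r'<span id="[^"]*"></span>')


def strip_anchor_spans(html: str) -> str:
    """Remove generated anchor spans so they do not round-trip into wikitext."""
    return ANCHOR_SPAN.sub("", html)


# The heading id here is a placeholder, not a real Parsoid id.
doc = '<span id="foo"></span><h2 id="some-parsoid-id">foo</h2>'
print(strip_anchor_spans(doc))
```

Spans that have content or other attributes are left alone, so only the anchor-only form is dropped.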

One other thing I discovered is that the core code does not deduplicate ids if the heading ids are present elsewhere on some other element. For example:

<div id='x'>foo</div>
==x==

assigns id='x' to the heading as well, which is broken. Since we are going to dedupe ids for headings, we will dedupe them across the board in Parsoid HTML.
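A minimal sketch of what deduping across the board could look like, assuming the heading ids are checked against every id already present on the page and that collisions are resolved with a numeric suffix. `dedupe_heading_ids` and the `_2`, `_3` suffix scheme are assumptions for illustration, not a description of the actual Parsoid change.

```python
def dedupe_heading_ids(existing_ids, heading_ids):
    """Return heading ids made unique against all ids already in the document."""
    seen = set(existing_ids)  # ids on non-heading elements, e.g. <div id='x'>
    result = []
    for base in heading_ids:
        candidate = base
        n = 2
        # Assumed scheme: append _2, _3, ... until the id is unused.
        while candidate in seen:
            candidate = f"{base}_{n}"
            n += 1
        seen.add(candidate)
        result.append(candidate)
    return result


# <div id='x'> already claims "x", so the heading ==x== gets a suffixed id.
print(dedupe_heading_ids({"x"}, ["x", "x", "y"]))
```

The key difference from the behaviour described above is that the set of taken ids starts out with the ids of all other elements, not just earlier headings.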

I don't know whether it is worth generating additional empty-span tags (cannot be meta as we discovered) to keep those links unbroken.

My gut feeling would be no. Direct links to sections are likely less used in projects where the encoding would actually change for most titles, and they'd probably be fixed up fairly quickly. In any case, the experience would degrade somewhat gracefully, by still reaching the linked page.

The way to find out would be to drop the old-style escaping from IDs generated by the PHP parser.

Since we want to use Parsoid HTML for read views, and given that there might be gadgets, bots, scripts, etc. that rely on the output of the PHP parser, what is the argument for not generating <h*><span class="mw-headline" id="..">..</span></h*> like the PHP parser does?

The main reason for adding those spans was displaying section edit links next to the headings with 2004 browser technology. These days, it seems likely that section edit links could be added without those span wrappers, and in any case we wouldn't want to add those links in Parsoid output.

Considering that Parsoid HTML is aimed more at a clean structural representation of content, I think it would be preferable to avoid legacy UI artifacts leaking into it without a concrete need. While you are right that there will likely be some code expecting those span wrappers, the cost of migration should be more than offset by avoiding the cost inflicted on every new consumer who would be expecting headings to contain regular heading content, rather than a mix of edit UI & content.

The main reason for adding those spans was displaying section edit links next to the headings with 2004 browser technology. ...

I see ..

While you are right that there will likely be some code expecting those span wrappers, the cost of migration should be more than offset by avoiding the cost inflicted on every new consumer who would be expecting headings to contain regular heading content, rather than a mix of edit UI & content.