XML Sitemaps Feature Project Proposal

While web crawlers usually discover pages from links within the site and from other sites, sitemaps supplement this approach by allowing crawlers to pick up all URLs included in the sitemap and learn about those URLs using the associated metadata.

Today, WordPress core does not generate XML Sitemaps by default, affecting the search engine discoverability of a large number of WordPress websites. Four of the top 15 plugins in the WordPress plugin repository currently ship their own implementation of XML sitemaps, pointing to a universal need for this feature and great potential to join forces.

This post proposes integrating XML Sitemaps into WordPress Core as a feature project. The proposal was created as a collaboration between Yoast*, Google** and various contributors.

Proposed Solution

In a nutshell, the goal of the proposal is to integrate basic XML Sitemaps in WordPress Core and introduce an XML Sitemaps API to make it fully extensible. Below is a diagram of the proposed XML Sitemaps structure:

XML Sitemaps will be enabled by default, making the following content types indexable:

Homepage

Posts page

Core Post Types (Pages and Posts)

Custom Post Types

Core Taxonomies (Tags and Categories)

Custom Taxonomies

Users (Authors)

Additionally, the robots.txt file exposed by WordPress will reference the sitemap index.
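To make the proposed structure concrete, here is a short Python sketch (purely illustrative; the `sitemap-{type}.xml` URL scheme is an assumption, not something the proposal specifies) that builds a sitemap index referencing one sub-sitemap per content type:

```python
from xml.etree import ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap_index(base_url, content_types):
    """Build a sitemap index that points to one sub-sitemap per content type."""
    root = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for content_type in content_types:
        sitemap = ET.SubElement(root, "sitemap")
        loc = ET.SubElement(sitemap, "loc")
        # The URL scheme here is hypothetical; the proposal does not fix one.
        loc.text = f"{base_url}/sitemap-{content_type}.xml"
    return ET.tostring(root, encoding="unicode")

index_xml = build_sitemap_index(
    "https://example.com", ["posts", "pages", "categories", "tags", "users"]
)
print(index_xml)
```

The index itself would be the single URL referenced from `robots.txt`, so crawlers can discover every sub-sitemap from one entry point.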

Developers

An XML Sitemaps API will be introduced as part of the integration, allowing extensibility. At a high level, below are the ways the XML Sitemaps may be manipulated via the API:

Add extra sitemaps and sitemap entries

Add extra attributes to sitemap entries

Provide a custom XML Stylesheet

Exclude a specific post type from the sitemap

Exclude a specific post from the sitemap

Exclude a specific taxonomy from the sitemap

Exclude a specific term from the sitemap

Exclude a specific author from the sitemap

Exclude authors with a specific role from the sitemap
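The extensibility model behind the list above can be sketched with a toy filter registry, mirroring the WordPress hook pattern. This is Python and purely illustrative: the registry and the hook name `sitemap_post_ids` are hypothetical, not part of the proposed API.

```python
# A minimal filter registry in the spirit of WordPress's add_filter/apply_filters.
filters = {}

def add_filter(hook, callback):
    """Register a callback to run whenever `hook` is applied."""
    filters.setdefault(hook, []).append(callback)

def apply_filters(hook, value):
    """Pass `value` through every callback registered for `hook`."""
    for callback in filters.get(hook, []):
        value = callback(value)
    return value

# A plugin excludes a specific post ID (42) from the sitemap:
add_filter("sitemap_post_ids", lambda ids: [i for i in ids if i != 42])

print(apply_filters("sitemap_post_ids", [7, 42, 99]))  # [7, 99]
```

Each exclusion case in the list would map to a filter like this, so plugins can compose their own rules without replacing the core implementation.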

Non-Goals

While the initial XML Sitemaps integration will fulfill search engines' minimum requirements and cover most WordPress content types, below is a list of features which will not be included in the initial integration:

Image sitemaps

Video sitemaps

News sitemaps

User-facing changes like UI controls to exclude individual posts or pages from the sitemap

XML Sitemaps caching mechanisms

i18n

Since WordPress leadership plans to officially support multilingual websites, the XML Sitemaps will be flexible enough to list localized content in the future, as per web development best practices.

What’s next?

Your thoughts on this proposal would be greatly valued. Please share your feedback, questions or interest in collaboration by commenting on this post. After that we can decide on how to best proceed with this proposed project and set up a meeting on Slack to kick things off.


Thank you @tweetythierry for this proposal. This is one of the most needed features in WordPress. Many people who use WordPress don't know much about sitemaps and struggle to get their content indexed in search engines like Google. This will be very useful. I'm very interested in collaborating on this project.

This is a good idea – I think it needs to be done in collaboration with all the major SEO plugins. You also need to build in a mechanism to "no-index" certain posts, or even post / taxonomy types. There should also be an option to disable the sitemap functionality if you have an alternative solution. The no-indexing feature is important, as you are potentially exposing pages that users have kept hidden from the search engine results page.

I am very much for this feature becoming standard. However, sitemaps are a really big deal for sites that rely on Google for traffic, so this needs to be properly implemented.

You may also find that some SEO plugin providers are unwilling to collaborate, as you are potentially taking away the sole purpose of their plugin (or at least a major feature), which could reduce the number of installs they get through the repo and therefore the number of potential leads for their pro products or services.

It would be great if this could happen. It would be a great, market-disrupting feature.

So far we’ve received good feedback from Yoast and Jetpack in private discussions. I’m sure other plugin vendors will express interest in this as well.

As for the noindex feature you mention, the post already says that “UI controls to exclude individual posts or pages from the sitemap” is not part of this proposal.

At its core, this proposed sitemap solution would contain the minimum functionality needed to have a good, functioning XML sitemap in WordPress. SEO plugins can then extend this functionality to add UI controls, disable certain things, etc.

If anything, this proposal makes it easier for all these plugins to offer SEO solutions for WordPress, as there would be a unified, standardized solution in WordPress core that serves as a foundation for whatever features they might build on top of it.

Performance should not be taken lightly. As simple and straightforward as XML sitemaps are, they present some relatively significant performance challenges at scale.

As one example, how many URLs are going to be in each paginated (sub-)sitemap? A sitemap index file is limited to 50,000 sitemaps, so even though each sitemap is limited to a maximum of 50,000 URLs, generating 50,000 URLs in one page request would be extremely difficult to do on the fly and would perform poorly. If each paginated sitemap were limited to 10 posts (the default `posts_per_page` option value), that would mean a total capacity of only 500,000 pieces of content across all terms and posts. Even 100 per paginated sitemap would yield a relatively low upper limit of 5,000,000 pieces of content.
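The capacity arithmetic is easy to check. Assuming the sitemap protocol's limits of 50,000 sitemaps per index and 50,000 URLs per sitemap:

```python
# Upper bounds from the sitemap protocol: an index may list up to 50,000
# sitemaps, and each sitemap may list up to 50,000 URLs.
MAX_SITEMAPS = 50_000
MAX_URLS_PER_SITEMAP = 50_000

# Total capacity across the whole index at different page sizes:
print(MAX_SITEMAPS * 10)                    # 500,000 URLs at posts_per_page=10
print(MAX_SITEMAPS * 100)                   # 5,000,000 URLs at 100 per sitemap
print(MAX_SITEMAPS * MAX_URLS_PER_SITEMAP)  # 2,500,000,000 URLs at the protocol maximum
```

The page size therefore trades off per-request cost against total addressable content.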

I would like to get involved to help work through and overcome some of these challenges. What is the best avenue to do so, @tweetythierry?

Raising very valid points like this and expressing your interest here is a great first step! 🙂

I expect there to be some more comments flowing in on this post throughout the week. After that we can collect all the feedback so far and decide on the next steps, which likely means setting up a meeting on Slack to kick things off.

Hi,
This sounds like a great feature to have in core to avoid having to install a potentially bulkier plugin just for that feature.

One issue though, in the ‘Non-Goals’ section above it says:

“User-facing changes like UI controls to exclude individual posts or pages from the sitemap”

Does this mean that there won't be any method to exclude published posts (post / page / CPTs) from appearing in the sitemap file? This would be an essential requirement from my perspective, as it's pretty common to have content that is published but not intended to be listed on the site, e.g. it doesn't appear in navigation menus or any other list of pages, such as regular user-facing site map pages. I commonly use 'wp_list_pages' on 404 page templates to show a user-facing site map, using the 'exclude' parameter to exclude any post IDs that I don't want listed.

I don't see there having to be a UI provided to exclude posts / pages from this, but a developer function so that they can be excluded from the sitemap XML file, similar to 'wp_list_pages', would be enough. I would expect third-party XML sitemap plugins would also want this, to be able to make use of the core feature while bringing the added-value functions that their plugins provide.

The goal of the proposal is to integrate basic XML Sitemaps in WordPress Core and introduce an XML Sitemaps API to make it fully extensible. Take a look at the Developers section of the post; the XML Sitemaps API will allow devs to granularly exclude URLs.

Great news, though the risk here for me is that you have way more sitemaps than necessary for SEO – and turning them all on by default could lead Google up the proverbial garden path – wasting a site’s crawl budget.

Good point. Having a sitemap and actually submitting it to a search engine’s webmaster tools / search console are two different things. But I am sure any core solution and also any plugin adding sitemaps to WordPress can find a way to prevent any duplication in this regard 🙂

> Having a sitemap and actually submitting it to a search engine’s webmaster tools / search console are two different things.
Google and other search engines crawl the web. If they come across a sitemap index (especially if it is linked from robots.txt), they will index it, crawl it, and crawl all the sitemaps listed, unless they are explicitly blocked from doing so.

It's great news that WordPress plans to add sitemaps to core, like the Site Health module. It should also be adjustable based on user needs via actions/filters and other options, so that users don't need any extra plugin to generate their website's sitemap.

Add an option to remove the default sitemap so that, if needed, users can use another plugin's sitemap.

If any dev help is needed, I'm interested in collaborating on this project.

Ideally, they don’t even have to know! As the post says, it should just work. No user-facing changes are planned.

You’re right.

If it is not configurable without an extension, it will add many unwanted entries

The proposal so far is to only add things to the sitemap that are publicly viewable anyway.

I'm an SEO consultant. Unfortunately, many WordPress developers make mistakes when creating a Custom Post Type or Custom Taxonomy: they leave it "public", which leads to SEO issues.

The same is true for some content that is supposed to be private, such as a thank you page or WooCommerce’s private pages (“My account” for example).

A sitemap without configuration will therefore only cause SEO issues. It will do more harm than good.

If the feature does not include automatic submission to Google’s Search Console, it has absolutely no use

As far as I know, automatic submission is not something smaller sites need to be concerned about. And as long as the sitemap is referenced in the `robots.txt`, crawlers will pick it up.

Do not trust Google on that. Sometimes it does check your sitemap when you add it to the robots.txt file, but most of the time this is not the case (it can easily be seen by looking at the log files). The only effective method is manual submission.

Implementing image sitemaps is currently a non-goal, meaning there’s no intent to implement these. Plugins would be responsible for doing that.

While millions of WordPress sites might already have a plugin installed that adds XML sitemaps, it’s definitely not the majority of sites. Remember, WordPress powers an estimated third of the web.

keellye
8:02 pm on June 12, 2019

1. Okay, but please make a note: if you decide in the future to add images to the sitemap, keep this in mind. Thank you in advance.

2. Perhaps it is necessary to detect already-installed plugins (All in One SEO Pack, Yoast SEO, etc.) and/or the presence of sitemaps before enabling the sitemap by default, and enable or disable this feature to avoid conflicts?
It seems to me that the absence of a sitemap on a small site/blog does not prevent its indexing by search engines; sites of fewer than several thousand pages are successfully indexed (and all large sites/blogs are aware of sitemaps).

Arne Brachhold
4:23 pm on June 12, 2019

Hi,

Great idea! XML Sitemaps have really evolved to a standard every website should have, therefore it makes sense to include a basic functionality in the core.

The proposal looks very good to me. All published content should be included by default, because that is probably what the majority of users would expect. All functionality regarding filters based on posts, categories and so on should be plugin territory, since users would probably also like to have "noindex" tags on these pages. While the proposal already lists possible exclusion criteria, I would also add a generic filter to let plugins filter based on custom rules or post metadata.

Priorities for individual entries can also be provided by a plugin using the additional attributes (child elements?) mentioned in the proposal.

The structure looks fine to me too. One thing to keep in mind: if you have a website with a million posts and delete or unpublish one, not all of the paginated sitemaps should change. That would happen if you simply paginate them by ID and a LIMIT. The root sitemap should also contain the correct modification date of the sub-sitemaps so they do not need to be re-indexed if nothing has changed.
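The instability of naive offset-based pagination is easy to demonstrate. In this illustrative Python sketch, removing one early post shifts the contents of every later page, which in a sitemap context would churn every later sub-sitemap and force re-indexing:

```python
def paginate(ids, page, per_page):
    """Offset-based pagination: a page's contents depend on everything before it."""
    start = (page - 1) * per_page
    return ids[start:start + per_page]

ids = list(range(1, 13))            # 12 published post IDs
page3_before = paginate(ids, 3, 4)  # [9, 10, 11, 12]

ids.remove(2)                       # unpublish one early post...
page3_after = paginate(ids, 3, 4)   # [10, 11, 12] -- every later page shifted
```

Bucketing posts by a stable key (e.g. fixed ID ranges) instead of a running offset is one way to keep unchanged sub-sitemaps stable.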

Caching can be done by the popular caching plugins, so no need to include this in the core.

If the root sitemap is included in the robots.txt, Google will find it automatically. I am not sure if the “ping” functionality to notify Google about changes, which is provided by most Sitemap plugins, can also be implemented via the existing “Update Services” feature in WP. (If this feature will still survive the next WP releases, since the days of Technorati and Feedburner are gone for a long time).

All other major CMSes have XML Sitemap features in core (e.g. Magento, Prestashop), and it's functionality that WordPress users have long demanded in every past poll asking about future features.

I don’t think this should be added without a way for users to configure what is in the sitemap.
Custom post types are used by plugins for loads of different things that shouldn’t be added to a sitemap.
For example, contact form 7 stores submissions as a custom post type. Slideshow plugins store configurations as custom post types.
Even when building a site with your own CPTs and taxonomies, you might not want them all in the sitemap.

Seems like it should be off by default rather than enabling something which could potentially cause problems for a lot of non-technical users, who would then have to add code to disable it.

You have some good points about custom post types though. Ideally such a feature would only add posts and terms to the sitemap that are publicly viewable already. So internal post types or taxonomies would not be considered.

Yeah, my first thought was just enable posts, pages and their related taxonomies by default but then there needs to be a way to enable others.
The public setting for the post type is a good indicator but a dedicated setting might also be useful. I seem to remember a case where posts had to be public even if you don’t want the single pages to be accessible. Possibly when you just want to show a collection of items on an archive page or page template.

I would disagree that new features have to be enabled automatically to be useful. Enabling on a new installation maybe, but turning on a new feature which the site admin may be unaware of and could conflict with existing plugins and add unwanted content to a sitemap may be more unhelpful than helpful.
That said, most people with a sitemap are probably using something like Yoast anyway so it should just be a case of updating to a version that uses the API.
Yoast also turns everything on by default and most sites I look at have things in the sitemap that shouldn’t be there.

Off by default is the best option, then allowing users to turn on individual sitemaps. Otherwise you are going to send Google towards trillions of URLs that Google doesn’t need to crawl and waste site crawl resources.

I think Yoast SEO handles this brilliantly and is a great model to adapt.

The one (and only) problem we have with Yoast SEO's sitemaps is that on some of our very large sites (100,000+ pages) it can literally take down our server, so we throttle it, cache it, and run it on a background cron.

I think not including a UI to disallow specific pages/categories/tags is a poor choice. This is a very important feature for us. It creates a lot of potential for pages appearing in sitemaps which shouldn't be indexed, or which are set to noindex by any of the many other methods to do so.

I would also point to Google XML Sitemaps' brilliant feature to add items to the virtual robots.txt. Perhaps outside the minimal scope of this, but it's another feature that should be in core to properly manage these virtual files core creates.

Great to see this coming, fantastic news. This should be in core for sure, but I do agree that by default maybe it should be off, with a simple switch in General Settings to enable it.

If it is to be on by default, then I would like to see it on by default only for the most important post types, e.g. Pages and Posts. It should not be on for CPTs; like some have said, some plugins create their own mess that doesn't even need to be crawled, some sliders for example.

My only request here is that we should be able to easily turn on and off what is and isn't in the index file; if not, I can see most people just turning it off and going back to a third-party plugin that does what they want with XML sitemaps.

On a side note, why only extend this to XML sitemaps? How about HTML sitemaps too? They work in conjunction with each other, and pretty much any SEO who does one ends up doing the other. 😀

As far as on or off by default, I think it wouldn’t make sense for something like this to be included in core and then off by default.

Having said that, there could be an exclude_from_sitemap or include_in_sitemap argument for CPT registrations, and maybe for custom taxonomy registrations as well, for the taxonomy archives.

EBrockway
2:07 pm on June 13, 2019

Is anyone else seeing, like me, how this could easily lead to a lot of WordPress installations being hacked?

If you publish the usernames in a sitemap by default, you have the usernames of the admins; you just have to brute-force the password!
Also, on e-commerce websites it could even lead to clients' accounts being hacked…

Please at least provide a way to activate/deactivate certain sitemaps! And for security purposes, leave the user sitemap deactivated by default.

This is a long-awaited feature for me, but without control over what gets generated and added to the sitemap (e.g. CPTs / users), it could be nice but too much for some projects, plus maybe a security risk (considering the users get pushed in), as @ebrockway said.

Different use cases make use of WP in completely different ways. It would therefore be essential for many projects to be able to configure what the index includes. Users need control over what shows up in search engines.

As the original proposal explicitly excludes the idea of such an interface, I just cannot keep from wondering whether this proposal is a "clever" marketing stunt for premium plugins that serve the sole purpose of selling users a level of control they already had before it got taken away from them.

At this point, roughly half the comments above mention needing the ability to control what is included in the sitemap.

Several commenters have outlined clearly why including "all public post types" is insufficient. Many plugins set CPTs public for various reasons, yet those CPTs still shouldn't be included in a sitemap and indexed. There must be a way to manually enable/disable inclusion in core, without an additional plugin.

Will this concern be addressed directly in this proposal?

I really want to see this in core, but without addressing this the proposal is fatally flawed.

This is the territory of plugins, not Core…
This will generate a lot of support requests for existing SEO plugins (like mine), but also errors in Google Search Console (because many sites / themes are poorly designed).
In the end, frustration and misunderstanding for everyone.
34% of the web risks seeing its SEO impacted overnight.
Google shouldn’t be bothering with that.
The formatting of the web is definitively launched.

I believe I read in one of the proposal/comments that image sitemaps wouldn’t be included in the initial implementation. If you do, please add logic that automatically excludes image pages that are only used on pages or posts that are excluded. I ran into this anomaly with one of the current SEO plugins and it was the only crawl exception on the Google Search Console. This is hardly fatal, but irritating for OCD types like me.

In the above proposal you state that caching mechanisms will not be part of this initial integration.

I think this will probably make it a non-viable solution for inclusion in Core right from the start, as the main issue with sitemaps is not their basic generation (which is very straightforward) but their scalability.

Doing only uncached requests will simply not be an option on sites that are large, complex, high-traffic, or a combination of these. And as soon as you try to add caching, you'll face the actual issue with sitemaps: how to know what to update, and when (in paginated requests), without going through the entire dataset again.

There are solutions to this, of course, but scalability absolutely should be one of the design requirements.