Massive duplication of oEmbed postmeta

Description

Hey guys,

Ever since my blog grew to a considerable size (a few million PVs a month) and started slowing down and exploding my server, I've been looking into and implementing various optimizations. During one such pass through the data, I noticed this really weird oEmbed-related behavior, which I've been observing across a number of WP version upgrades.

I use [embed] shortcodes a lot, and every new post after a few minutes ends up with a ton of oembed caches that don't belong to it at all - they're all from other posts. Posts that don't even have [embed]s at all still have over 100 oembed entries in wp_postmeta.

Here's an example of just a small subset of data residing in the table:

There are now about 150,000 entries in the wp_postmeta table due to this, half of which are duplicated _oembed entries, which I think has a heavy impact on server load. Not only that, but I'm sure WP is filling the table up with values by redoing oEmbed queries, which may explain why load shoots up very high at times when publishing.
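One way to measure the duplication described above is to group the oEmbed rows by post and meta key. The following is only a sketch: it runs against an in-memory SQLite table with invented sample rows rather than a real WordPress database, and the `ESCAPE` clause is written for SQLite, so it would likely need adjusting before running the same query on MySQL.

```python
import sqlite3

# In-memory stand-in for wp_postmeta; the rows below are invented sample data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE wp_postmeta ("
    "meta_id INTEGER PRIMARY KEY, post_id INTEGER, meta_key TEXT, meta_value TEXT)"
)
sample_rows = [
    (101, "_oembed_aaa", "<iframe>video A</iframe>"),
    (101, "_oembed_aaa", "<iframe>video A</iframe>"),  # duplicate on the same post
    (101, "_oembed_aaa", "<iframe>video A</iframe>"),  # duplicate on the same post
    (102, "_oembed_aaa", "<iframe>video A</iframe>"),  # same embed, different post
    (102, "_edit_lock", "1300000000:1"),               # unrelated meta, ignored
]
conn.executemany(
    "INSERT INTO wp_postmeta (post_id, meta_key, meta_value) VALUES (?, ?, ?)",
    sample_rows,
)

# "_" is a single-character wildcard in LIKE, so it is escaped here.
DUPLICATES_SQL = r"""
    SELECT post_id, meta_key, COUNT(*) AS copies
    FROM wp_postmeta
    WHERE meta_key LIKE '\_oembed\_%' ESCAPE '\'
    GROUP BY post_id, meta_key
    HAVING COUNT(*) > 1
    ORDER BY copies DESC
"""

# Each result row is (post_id, meta_key, number_of_copies).
duplicates = conn.execute(DUPLICATES_SQL).fetchall()
for post_id, meta_key, copies in duplicates:
    print(post_id, meta_key, copies)
```

Only keys that appear more than once on the same post are reported; the same embed cached once on each of several posts is the expected per-post behavior and is not counted.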

Change History (38)

Duplicates are expected -- it's not smart enough to cache cross-post (say, if you embed the same video in multiple posts).

However it should only cache to the current post (i.e. the one with the video in it) and if it's not, then perhaps you are polluting the $post global somewhere. Can you reproduce this using the default theme? Perhaps you are using query_posts() in your sidebar or something?

By the way -- it's safe to delete all of those rows if you want to. The caches will be repopulated on the fly.

@Viper: even if he is using query_posts() before the posts are displayed or something along those lines (which I suspect is likely, since many themes offer a widgets area before the loop kicks in), shouldn't WP be smart enough to consult $wp_the_query rather than $wp_query when caching these things?

Just to update you guys on this ticket, Alex's recommendation before sending him the theme was to figure out if it's a plugin doing it or not. So, I cloned the setup, deleted all the oembed keys and tried to repro the issue, but so far was not able to. Because production gets so much more traffic hitting it in every which way possible, it could be a concurrency issue that is not present when I'm testing or something that is hitting it in a specific way.

One thing I did notice - if I delete the oembed keys for a specific post on production, then they don't get repopulated until I update or publish something. It doesn't happen instantly either - it seems to coincide with /wp-cron.php?doing_wp_cron showing up in the running thread list, although I can't be 100% sure they're related. Something to think about.

After doing more extensive research and stracing Apache threads, I believe I've finally root caused this bug.

It seems like all the duplicate oembed entries are created due to #17560 that I opened earlier today. (Please read it first, then proceed).

I've reset to_ping on all posts that had it set to

(comma separated)

which was about 850 out of almost 3000 posts, and none of the subsequent new posts contain duplicate oembed keys anymore.

I then cleared all oembed keys from the db (128k entries) and monitored the situation for an hour. No duplicates were created.

Then, to verify the theory, I went back to the backups of the postmeta and posts tables and cross-referenced the embeds that are created over and over with the posts they appear in. In that comparison, as I suspected, most (95%) of the posts containing those embeds had that corrupt to_ping column.

Now, it's beyond my knowledge why exactly this is happening, but my guess is that somewhere in wp-cron, paths get crossed, and duplicate oembed entries are created all over the place.

I've confirmed that fixing #17560 solves both this bug and the massive load after each post is published, a problem that's been getting worse and worse over the past months and made me lose a lot of sleep. WordPress makes sense again!

I'd like to re-open this bug. I have a script that cleans out bad values from #17560 every day, which helps, but the problem still exists in WP 3.2, and it's pretty bad. My server keeps getting annihilated after new posts, and I suspect it has something to do with these oembeds.

For example, I just published a fresh post with a number of oembed tags. The load shot up to 50 and Apache was maxing out big time. After it finally calmed down a few minutes later, I went to check and found this:

Checking one of those oembeds with 176 duplicates, I found that all 176 are attached to the same post id, the one I just published. This points towards some sort of a concurrency issue - as if all PHP threads suddenly see a new oembed tag that hasn't been resolved yet and all try to do it at the same time. I'm not sure what the mechanism for that is, perhaps it's related to wp-cron concurrency issues, but perhaps not.

It's also worth pointing out that I have Apache configured with 170 max children, which is suspiciously close to 176.
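The suspected mechanism (every worker sees the uncached embed and each one writes its own row) can be modeled with a small thread simulation. This is a sketch of the general check-then-insert race, not of WordPress internals: the list standing in for the post's meta rows, the key name, and the worker count are all assumptions, and a barrier is used to force the worst-case interleaving so the result is repeatable.

```python
import threading

WORKERS = 8  # stand-in for Apache children; the report above mentions 170


def simulate(use_lock):
    """Return how many cache rows end up stored for one embed."""
    rows = []                             # stand-in for one post's meta rows
    lock = threading.Lock()
    barrier = threading.Barrier(WORKERS)
    key = "_oembed_abc123"                # hypothetical cache key

    def handle_request():
        if use_lock:
            # Check and insert happen as one atomic step.
            with lock:
                if key not in rows:
                    rows.append(key)
            return
        # Unprotected check-then-insert.
        missing = key not in rows
        barrier.wait()                    # every worker finishes its check first
        if missing:
            rows.append(key)

    threads = [threading.Thread(target=handle_request) for _ in range(WORKERS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(rows)


print("without lock:", simulate(use_lock=False))
print("with lock:", simulate(use_lock=True))
```

In the unprotected case every worker stores its own copy, so the number of rows equals the number of concurrent workers, which would be consistent with a duplicate count close to the Apache child limit.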

P.S. I've switched from WP Super Cache to W3 Total Cache (latest public version 0.9.2.4) to utilize memcached support, in case this ends up being important. I'm using page cache, object cache, and db cache, all pulling from memcached.

I'm guessing that's the idea, but maybe concurrency bugs with multiple PHP threads are kicking in and messing things up.

As for JS, do I have JavaScript disabled? I'm not following what that has to do with the server-side? Or did you mean something else? The site has about 150k daily pageviews, so what my browser settings are shouldn't really matter.

Then, on line http://core.trac.wordpress.org/browser/trunk/wp-includes/media.php#L1049, the oembed cache is filled via an AJAX call if one updates the post (or publishes it). However, at this point the post was already inserted in the DB, and if it's publicly accessible (i.e. not private), it will show up on the front page. Thus, there may be duplicated keys if update_post_meta is executed concurrently.

One thing to try would be to update the oembed cache before the post becomes available on the front page.

Not sure if that makes sense or if it just means that I have to sleep :)

Makes perfect sense — that's actually why I asked if the author had JS disabled, because that would mean the cache wouldn't even be populated at all. Definitely possible for there to be a race condition here.

That said, update/add_metadata() are fairly careful. Pretty crazy traffic to cause that kind of flood. Curious if this has been seen on WP.com or if the code runs differently there.

Nacin, yeah, but I'm still not sure what 1 person's JS has to do with this issue where traffic is generated by 300-700 people on the site at any given time.

I think there are still other issues exacerbating the situation (the issue is a LOT worse on W3TC compared to WP Super Cache, for example), but those issues are helping surface the true problem with oEmbeds.

Shouldn't this be in the options table and just not autoload? If I post the same YouTube video on 19 posts, I should just have one cache entry. Seems like _oembed_* as an option or transient would solve this? It seems like deleting from the cache is less important than not duplicating.

I have over 270,000 _oembed_* entries in post_meta and only about 33K blog entries. Is it OK to just delete all of these, or will it have severe consequences? Looking this ticket over, it seems to be OK, but I would just like to make sure.

Thanks,
alan

It's OK, WP will just re-generate the oEmbed tags on the fly. The only thing is if some videos were taken down, instead of the embed tag (which wouldn't play anything anyway), you'd get plain links instead. Not really a big deal.

I'd like to get rid of the big non-autoload values that end up in options, not add more.

As things get pushed out of the cache we could see lots of oembed refetch activity on a busy site.

I said approximately the same thing to wonderboymusic myself, just not as well :) Pointed out that, for example, Twitter has a relatively low rate limit based on IP, which already quickly runs out on a shared host and for a user who embeds a bunch of tweets in one post and either leaves the editor open or saves small changes a bunch of times (anything that triggers save_post). Triggering even more of those requests on the front end all at once because of expiration would not end well.

Would love to see a better solution, but don't think an expiring transient is it. Also a bit late for 3.5.

Seems like the latter part of this discussion (about the storage location) would be better suited for #14759. The original problem with this ticket is why the duplication is happening in the first place.

Improve oEmbed caching. Introduces the concept of a TTL for oEmbed caches and a filter for oembed_ttl.

We will no longer replace previously valid oEmbed responses with an {{unknown}} cache value. When this happens due to reaching a rate limit or a service going down, it is data loss, and is not acceptable. This means that oEmbed caches for a post are no longer deleted indiscriminately every time that post is saved.

oEmbed continues to be cached in post meta, with the addition of a separate meta key containing the timestamp of the last retrieval, which is used to avoid re-requesting a recently cached oEmbed response. By default, we consider a value cached in the past day to be fresh. This can greatly reduce the number of outbound requests, especially in cases where a post containing multiple embeds is saved frequently.

The TTL used to determine whether or not to request a response can be filtered using oembed_ttl, thus allowing for the possibility of respecting the optional oEmbed response parameter cache_age or altering the period of time a cached value is considered to be fresh.
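The freshness rule and the "never overwrite a valid response" rule described above can be sketched roughly as follows. This is an illustrative model rather than the actual core code: `refresh_embed`, the `meta` dict standing in for one post's meta, the `fetch` callable, and the exact `_oembed_time_` key naming are all assumptions made for the example.

```python
import time

DAY_IN_SECONDS = 24 * 60 * 60
UNKNOWN = "{{unknown}}"


def refresh_embed(meta, key, fetch, now=None, ttl=DAY_IN_SECONDS):
    """Return embed HTML for `key`, refetching only when the cache is stale.

    `meta` is a plain dict standing in for one post's meta; `fetch` is a
    callable that returns HTML, or None when the provider request fails.
    """
    now = time.time() if now is None else now
    # Assumed naming for the companion timestamp key.
    time_key = key.replace("_oembed_", "_oembed_time_", 1)
    cached = meta.get(key)
    cached_at = meta.get(time_key, 0)
    has_valid_cache = cached is not None and cached != UNKNOWN

    # A valid value younger than the TTL is served without any request.
    if has_valid_cache and now - cached_at < ttl:
        return cached

    html = fetch()
    if html:
        meta[key] = html
        meta[time_key] = now
        return html

    # Fetch failed: keep a previously valid response instead of replacing it.
    if has_valid_cache:
        return cached
    meta[key] = UNKNOWN
    return UNKNOWN
```

With this shape, saving a post repeatedly within the TTL triggers no outbound requests, and a rate-limited or unavailable provider cannot erase an embed that was previously cached successfully.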

Now that oEmbeds are previewed in the visual editor as well as the media modal, oEmbed caches are often populated before a post is saved or published. By pre-populating and avoiding having to re-request that response, we also greatly reduce the chances of a stampede happening when a published post is visible before oEmbed caching is complete.

As it previously stood, a stampede was extremely likely to happen, as the AJAX caching was only triggered when $_GET['message'] was 1. The published message is 6. We now trigger the caching every time $_GET['message'] is present on the edit screen, as we are able to avoid triggering so many HTTP requests overall.

It doesn't seem like we ever really identified what was happening in your case, but I went ahead and mentioned this in [28972], as I believe the possibility of stampedes (a likely culprit) to be significantly reduced in many cases. The one where it might not is if you are using the text editor or caching oEmbeds from areas besides the post content, and posts are immediately published and viewed by many. markjaquith and I talked about implementing a lock during caching, but locks are not foolproof. Leaving open for now, would love if archon810 were able to report back. :)

This ticket was mentioned in IRC in #wordpress-dev by DrewAPicture.

I'm looking at my up-to-date 4.5.3 multisite retaining hundreds of rows of oembed cache post meta values, and am having trouble understanding from [28972] whether WP is supposed to clear ancient oembed cache values over time, or whether these are going to keep building up indefinitely. TTL seems to indicate it should, but how do I verify whether this is operating correctly? The hashed keys don't make it particularly easy.