[21:01:08] https://phabricator.wikimedia.org/E228 about to start....
[21:01:09] robla: am I chairing this meeting?
[21:01:24] I'm chairing....just getting my stuff in order....one sec
[21:02:21] #startmeeting RfC: image and oldimage tables
[21:02:21] Meeting started Wed Jul 13 21:02:21 2016 UTC and is due to finish in 60 minutes. The chair is robla. Information about MeetBot at http://wiki.debian.org/MeetBot.
[21:02:22] Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
[21:02:22] The meeting name has been set to 'rfc__image_and_oldimage_tables'
[21:02:41] #topic Please note: Channel is logged and publicly posted (DO NOT REMOVE THIS NOTE) | Logs: http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-office/
[21:03:03] hi everyone!
[21:03:27] https://phabricator.wikimedia.org/T589 is the topic for today
[21:03:54] Krinkle: can you give a quick summary of the topic?
[21:04:03] (Hi Everyone :)
[21:04:12] Yep
[21:04:25] Reference: https://www.mediawiki.org/wiki/Requests_for_comment/image_and_oldimage_tables - outlining here now
[21:05:14] hmm
[21:05:19] There are a fair number of problems with how we store multimedia in MediaWiki. This RFC will focus specifically on the internal storage model for files and revisions.
[21:05:27] (we're also trying to have a field narrowing conversation as defined here: https://www.mediawiki.org/wiki/Good_meetings#Taxonomy )
[21:06:13] (if we're successful, that means: "tentative consensus about the limited set of options, with someone assigned to clearly document the tentative consensus in the consensus-building forum of record")
[21:06:19] I think we have a lot of flexibility on where we want to go with this. Historically, the direction has been to replace it with a page/revision-like set up where files and file revisions have numerical IDs.
[21:06:24] if the idea is to get rid of INSERT SELECT then moving it to multi-content revisions won't do that
[21:06:27] But recently, several other ideas have come up.
[21:07:08] when jcrespo complained about how inefficient INSERT SELECT is, he linked to an incident involving filearchive, i.e. file deletion
[21:07:09] For the moment, I've summarised two problems at the top of the page: "File revisions should have better unique identifiers" and "Files should not be identified by their caption, but be given a language-neutral identifier."
[21:07:39] TimStarling: Yeah, I think whatever we come up with, the idea of rows being moved should definitely go away.
[21:07:48] Which to me seems like proposal #1 is unfit.
[21:07:50] those two problem statements have some overlap with the idea to use content hashes
[21:08:03] https://phabricator.wikimedia.org/T66214
[21:08:54] gwicke: That RFC primarily deals with the thumbnail API, and an HTTP API for files. It doesn't necessarily dictate the storage model and/or how people reference them in wikitext/html.
[21:08:57] as I said on the task, I am not entirely sure whether this is in scope, or how it would interact
[21:09:06] #info topic of discussion - what is the scope of this RFC
[21:09:53] One use case you could think about is, if we have file IDs, those will likely be per-wiki, which makes them unsuitable for use from a foreign file repo (e.g. if we have {{#file:1234}} how should that work cross-wiki, do we still allow shadowing of file names?)
[21:09:56] much of the motivation for content-based addressing is to solve the cache invalidation / consistency issue
[21:10:23] which is also a part of the motivation for moving towards more stable ids that don't change when images are renamed
[21:10:54] Another use case is to disallow mutation of files, and require a 1:1 mapping of a file upload and a description page - in which case re-upload would require updating all usage - which isn't practical cross-wiki and for third parties. Though redirects could be used as a canonical alias for frequently updated files.
[21:11:22] you could have a 1:n mapping from descriptions to files
[21:11:39] m:n if you factor in languages
[21:11:55] Yeah, though I think that would move internal details to the user land too much. We can duplicate internally on a per-fileblob level.
[21:12:05] By using hashes indeed.
[21:12:16] Having a 1:n mapping for mutable files and descriptions doesn't seem useful.
[21:12:34] we already have that, and it's shown as a "file history"
[21:12:47] #1 is a good idea regardless right? All tables should have primary keys...
[21:13:00] there is one description page per file history, not multiple for the same "file"
[21:13:05] And it doesn't prevent us from implementing any of the other solutions
[21:13:29] legoktm: Yeah, it would be a very minimal change, but wouldn't solve any of the underlying problems such as killing the concept of rows being moved across tables, which is the main blocker (/me adds to problem statement)
[21:13:32] Krinkle: I think we agree, it's already 1:n
[21:13:39] one description, n file revisions
[21:14:05] And having the primary key will make future schema changes (whatever they are) much easier
[21:14:08] #info 14:12:48 #1 is a good idea regardless right? All tables should have primary keys...
[21:14:18] gwicke: Ah, yes.
[21:14:29] gwicke: I was thinking 1 versioned file, multiple versioned description pages
[21:14:32] which we don't.
[21:14:53] yeah
[21:14:53] not currently
[21:15:07] although with languages, that might be in scope (?)
[21:15:20] #info [primary keys] wouldn't solve any of the underlying problems such as killing the concept of rows being moved across tables, which is the main blocker (/me adds to problem statement)
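For readers following along: the "rows being moved across tables" pattern under discussion is how MediaWiki historically demotes the current file row into oldimage on re-upload. A rough sketch of that flow (column lists abbreviated; this is illustrative, not the actual production query):

```sql
-- On re-upload, the current row is copied from image into oldimage.
-- The real code copies many more img_* columns into oi_* columns.
INSERT INTO oldimage (oi_name, oi_timestamp, oi_size /* , ... */)
  SELECT img_name, img_timestamp, img_size /* , ... */
  FROM image
  WHERE img_name = 'Example.jpg';

-- The image row is then overwritten in place with the new upload's
-- metadata, so the "same" file revision lives in different tables over
-- its lifetime and never keeps a stable identifier.
UPDATE image
  SET img_timestamp = '20160713210000', img_size = 12345 /* , ... */
  WHERE img_name = 'Example.jpg';
```

This INSERT SELECT plus UPDATE pair is the pattern jcrespo flagged as dangerous, and why primary keys alone (option 1) would not remove the underlying problem.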
[21:16:05] at a high level, is the goal of this RFC to clean up how the entities in the media / image space are structured?
[21:16:20] I'd like to disqualify option 1 from the RFC. But I'm curious if people think that would be worth the effort to do first. Personally I don't think it would benefit much given virtually any future direction would end up replacing that intermediary state completely.
[21:16:38] also, are concerns around multi-project & multi-language media use in scope?
[21:16:55] gwicke: What do you mean by 'how they are structured'?
[21:17:14] Krinkle: how the data model is structured
[21:17:24] (I hope all this Wikicommons' media including video can feed in to, conceptually, a Google Street View with OpenSimulator some years ahead, where we can all "wiki-add" to Google Street View and OpenSim (conceptually again) since WMF may have a film-realistic, interactive, 3D, group build-able, with avatars planned for all 8k languages with Time Slider virtual earth for wiki STEM research, on the horizon :)
[21:17:28] Krinkle: I'm not sure if option 1 solves the goals of what you want to do, but I think all tables do need primary keys regardless, so unless your plan is to remove the tables entirely, I don't see how it can be disqualified.
[21:17:29] for example, how many descriptions can we have per media blob
[21:18:07] * aude waves
[21:18:14] legoktm: option 2 involves a schema change that also includes primary keys. option 3 involves dropping the tables effectively, and re-creating them as secondary data (like categories) - which would have primary keys, but for completely new rows.
[21:18:40] #info Krinkle wants to remove option #1, legoktm suggests we need primary keys regardless
[21:18:41] gwicke: I think whether descriptions are smashed into one wikitext blob or multiple should be discussed separately.
[21:18:41] so really option 1 is already part of option 2 and 3?
[21:19:20] Krinkle: it affects the data model, in the sense that each language would probably have its own page id & history
[21:19:22] fine by me to drop option 1
[21:19:37] option 1 is: add img_id and oi_id to the existing tables
[21:19:40] while the license is probably tied to the blob itself
[21:19:50] legoktm: Kind of, option 1 has one aspect the other one doesn't: It makes moving rows more complicated by lazy-assigning primary keys as they move.
[21:19:53] Which is worse in some ways.
[21:20:31] It means you still don't have a revision ID until after it becomes the N-1 revision.
[21:20:43] okay
[21:20:54] As long as the tables have primary keys, I don't have an opinion :)
[21:21:02] Cool
[21:21:27] jynus: welcome! we've already decided everything; hope you don't mind ;-)
[21:21:36] * aude reads the rfc
[21:21:52] gwicke: If we want to split translations of descriptions onto separate wiki pages (e.g. sub pages as opposed to slots), I'd prefer to keep that separate. It would also involve creating a linking table of file to wiki page.
[21:21:53] the particular strategy suggested in #2 may need some refinement
[21:21:59] * robla now resumes trying to be a responsible chair
[21:22:19] of course with cur/old we renamed old and proxied cur, introducing a completely new page table
[21:22:24] which I think was the right way to do it
[21:22:35] INSERT SELECT is not inefficient, it is dangerous
[21:22:35] I think option 2 would be good to do even if we consider option 3 later. It would put us in a better shape. But the migration is very large, so we should keep in mind the migration cost.
[21:22:59] revision was also a completely new table
[21:23:16] basically a new abstraction layer which allowed non-intrusive migration
[21:23:19] I'm not entirely sure that we need an image id in addition to description ids & blob hashes
[21:23:27] Krinkle: we will have some linking like that when we add structured data support for commons
[21:23:34] but then sounds like the rfc is more generally about mediawiki
[21:23:41] Yeah.
[21:24:17] TimStarling: Right. If we create them as new tables, we could populate them one by one and do it again until it is caught up with the master, and then switch masters.
[21:24:28] with structured data, descriptions might be part of the data item which is multilingual
[21:24:49] I don't know how oldimage and image are in terms of scale.
[21:24:56] Probably bigger than enwiki was at the time of the live-1.5 migration
[21:25:08] Can we afford to have a complete copy of it for a short time?
[21:25:11] jynus: ^
[21:25:15] (in the same db)
[21:26:31] forget about implementation
[21:26:47] on WMF, that is my job to figure it out
[21:26:55] yeah, but we are figuring it out
[21:27:03] so do not worry about it
[21:27:17] not the implementation
[21:27:19] jynus: I was just wondering whether it is feasible space-wise to have a complete duplicate of 'image' and 'oldimage' in the commons prod as part of the migration script.
[21:27:22] the migration, I mean
[21:27:34] why do you need that?
[21:27:43] :D
[21:28:11] jynus: Tim mentioned the best way to migrate with minimal read-only time is to create new tables and populate those instead of changing the existing one.
[21:28:24] jynus: I think a lot of the thought around migration is to make sure we have something that works outside of the Wikimedia context (as well)
[21:28:33] I wouldn't be surprised if full migration would take days if not weeks.
[21:28:48] cannot we just convert image into image_revision?
[21:28:55] so, in a future world where descriptions are in wikidata & this RFC introduces an image id, we'd have a mapping from name to image id, Q-item to image ID(?), and image id to list of blob hashes?
[21:29:13] gwicke: That sounds pretty good.
[21:29:23] Note the concern about the practical use of image IDs, though.
[21:29:33] this RFC is about image revision backend, we're not going to have this discussion on terms of "don't think about that, that's the DBA's job"
[21:29:46] I do not mean that
[21:29:51] Where file names (terrible as they are) are kind-of practical to use globally (across wikis), file IDs would obviously conflict if kept numerical.
[21:30:04] I mean that sometimes you block yourselves thinking "that cannot be done"
[21:30:17] and 99% of the times things can be done
[21:30:35] discuss *if* you want a final state or not
[21:30:39] Krinkle: maybe globally, then need to be qualified with interwiki prefix or something
[21:30:42] ok, but I think you're jumping in without being fully up to date on the discussion
[21:30:48] Okay, as long as it doesn't require a larger hard drive or newer hardware just to migrate.
[21:30:49] there is always a way
[21:31:01] to make globally unique
[21:31:03] I do wonder if it could be name to blob hashes, name to Q-item, and Q-item to blob hashes instead
[21:31:17] I read the backlog
[21:31:55] ok, can we talk about what you're proposing then?
[21:32:21] convert image to image_revision -- not sure what this means
[21:32:53] I am not proposing anything, I am asking why you want to duplicate the 'image' and 'oldimage' tables
[21:33:55] here image_revision (in my mind) is the image table
[21:34:31] jynus: The current tables have each row describe the object in full (with no primary keys), with 'image' holding the current revs and oldimage holding the non-current ones. One of the new schemas proposed would involve the 'image' table no longer containing full descriptions (just a pointer to the current revision in image_revision) and both current and non-current rows being in image_revision.
[21:34:57] I know that
[21:35:15] I just do not see why you want to duplicate them
[21:35:22] it was just an idea
[21:35:30] I'm happy to back off from it if you have a better idea
[21:35:38] tell me about that idea
[21:36:30] Firstly, the software would need to know which table to query, and whether to expect the old or new system in it during the migration.
[21:36:34] I do not see how it fits proposal 1 or 2
[21:36:57] ok, so you are not duplicating things
[21:37:13] you just have 4 tables during the migration
[21:37:28] jynus: one idea to migrate to the new schema was to create the new tables with the new schema and names, import all the data, and drop/flip once finished.
[21:37:29] Yes
[21:37:32] which is ok
[21:37:41] that doesn't need *duplicating data*
[21:37:49] legoktm: Re [14:20] As long as the tables have primary keys, I don't have an opinion :) ... what are the steps for anticipating orders of magnitude more primary keys ... say for modeling a) brain neurons and b) at the much smaller nano-level (in a hypothetical Google Street View with OpenSimulator, conceptually and some years' ahead, and for brain research? Thanks.
[21:37:54] nothing against it
[21:38:01] and it does not require extra resources
[21:38:35] my idea was to do it in a similar way to how page/revision migration was done
[21:38:48] jynus: Well, commonswiki would temporarily hold a significantly larger data set. Since we'd have essentially a copy of those two tables?
[21:39:01] Or does mariadb have a way to import string and binary values by reference?
[21:39:19] but images would be on one format or another, not both, right?
[21:39:19] that is, add a new table (say imagerevision) which has one row per image revision, but with very small rows
[21:39:52] If what Tim says is #2, I like that better
[21:40:11] "just adding PKs" would be very inefficient
[21:40:19] jynus: No, the software would keep using the old tables until the new tables are completely ready for use (with a small read-only time to catch up)
[21:40:19] and denormalized
[21:40:40] no, I do not like that, Krinkle
[21:40:45] Exactly.
[21:41:04] we can do better
[21:41:06] All of this is #2. We're just talking about how the migration would go.
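The create-populate-swap migration being discussed might be sketched like this (table and variable names are hypothetical; batch sizes and exact DDL are illustrative only):

```sql
-- 1. Create an empty successor table alongside the old one
--    (the name image2 is hypothetical), and give it a primary key.
CREATE TABLE image2 LIKE image;
ALTER TABLE image2
  ADD COLUMN img_id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY FIRST;

-- 2. Backfill in batches, keyed on the old table's unique img_name,
--    while the site keeps reading/writing the old tables.
INSERT INTO image2 (img_name, img_timestamp /* , ... */)
  SELECT img_name, img_timestamp /* , ... */
  FROM image
  WHERE img_name > @last_seen
  ORDER BY img_name
  LIMIT 1000;

-- 3. Short read-only window: replay the tail of recent changes, then
--    swap atomically and flip the software to the new tables.
RENAME TABLE image TO image_old, image2 TO image;
```

The open question raised here is step 2's cost: whether commonswiki can afford the temporary duplicate of the bulky rows while both tables exist.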
[21:41:11] the image table currently has things like img_height, img_metadata
[21:41:22] we can migrate progressively
[21:41:26] img_metadata in particular can be enormous
[21:41:37] #info that is, add a new table (say imagerevision) which has one row per image revision, but with very small rows [if this is proposal #2] I like that better
[21:41:41] ftr, I'm still not convinced that an image id would help us solve the changing-image-dimension or more generally image metadata caching problem
[21:41:56] so you could introduce imagerevision which would just give a pointer to the location of the full image metadata
[21:42:13] yes, and that would be relatively small
[21:42:22] TimStarling: What location is that?
[21:42:39] initially it could point to the existing image and oldimage tables
[21:42:47] I do not have 100% clear the fields
[21:43:01] but I think those are details
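Tim's pointer idea, as described above, might look roughly like this (a sketch under the assumption that bulky fields such as img_metadata stay where they are for now; all names are hypothetical):

```sql
-- Narrow per-revision table: one small, fixed-size row per file
-- revision, with a stable numeric ID.
CREATE TABLE imagerevision (
  ir_id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  ir_name VARBINARY(255) NOT NULL,
  ir_timestamp BINARY(14) NOT NULL,
  -- Pointer to where the full metadata currently lives: initially the
  -- legacy image/oldimage row, later perhaps a dedicated store.
  ir_meta_location ENUM('image', 'oldimage', 'blob') NOT NULL,
  KEY ir_name_timestamp (ir_name, ir_timestamp)
);
```

Because the rows are small, the backfill is cheap, and the heavy columns (e.g. img_metadata) never need to be copied during the initial migration.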
[21:43:13] * gwicke heads out for lunch
[21:43:59] I would push for a non-freezing migration: being temporarily compatible with both, or filling it up in parallel
[21:44:11] so...we have three proposals in the current RFC, and option #2 is the one that this group seems to be the most interested in fleshing out. is that right?
[21:44:11] TimStarling: Hmm I guess we could even not have the pointers if this file-metadata table becomes keyed by image_revision
[21:44:42] potentially
[21:44:56] it may be that I'm overcomplicating things
[21:45:00] jynus: Yeah, but I don't think that is feasible due to the lack of the needed indexes and primary keys. I don't see a way to make the software maintain both in parallel.
[21:45:28] Anyhow, let's do actually worry about migration later.
[21:45:39] Let's continue about what direction we want to go in.
[21:45:39] since the problem we were dealing with with cur/old was that the old table at the time had the actual text in it, something like 90% of our entire database and we didn't even have the disk space to duplicate it
[21:46:03] Yeah, good point.
[21:46:06] It predates ES
[21:46:17] Krinkle, remember that I am going to add an id to the watchlist table (which has no PK) in a hot way
[21:46:39] so there are many tricks to do
[21:46:51] what is better
[21:46:54] Adding keys is simple imho. The software can be feature flagged even as to whether to create/query those.
[21:46:56] #info old cur/old->page migration happened before External Storage existed
[21:47:00] I added an autoincrement PK to the logging table, it was a nuisance IIRC, but possible
[21:47:19] I plan to reduce our database size magically: https://phabricator.wikimedia.org/T139055
[21:47:35] that was my point about "do not worry too much about that"
[21:47:45] I have you covered
[21:48:20] If we go with option 2. What about file IDs? Would they be appropriate for use in APIs and user generated content? How would this go cross-wiki?
[21:48:28] #info jynus plans to employ InnoDB compression, discussed in T139055
[21:48:28] T139055: Test InnoDB compression - https://phabricator.wikimedia.org/T139055
[21:48:41] Krinkle: you mean like https://www.mediawiki.org/wiki/User:NeilK/Multimedia2011/Titles ?
[21:48:58] with file IDs used as the main user-visible identifiers?
[21:49:28] [[File:38798231]] etc.
[21:50:01] TimStarling: Yeah, I mean, initially I guess that wouldn't be relevant, since we'd still have file description pages, which have a unique wiki page name, and transclusion of files is indirectly behind resolving the page title first.
[21:50:23] here 'file' means actual files that can be overridden by another revision, or a revision?
[21:50:24] But it is something we may wish we had done differently if we don't think about it now.
[21:50:31] * robla looks in the backlog for gwicke's task number regarding file ids
[21:50:57] I am a long way from convinced on this
[21:51:24] * gwicke returns with burrito
[21:51:28] I'm okay with saying that file IDs will stay internal like we do with page IDs.
[21:51:29] T66214 is the task gwicke mentioned earlier
[21:51:29] T66214: Use content hash based image / thumb URLs & define an official thumb API - https://phabricator.wikimedia.org/T66214
[21:52:08] the more I think, the more this is non-trivial
[21:52:13] it would however move the group of file name problems to a later RFC (need to have unique names at upload time, sometimes awkward to use in wikitext, not stable, subject to redirects being deleted etc.)
[21:52:26] purging caches? will it affect them?
[21:52:57] jynus: The current option proposed here doesn't affect that in any way. However if we adopt stable file IDs, all that will become significantly easier, not harder.
[21:53:13] (I need to check the logging and how it is stored in the db)
[21:53:20] jynus: those problems are what content hashes are designed to solve
[21:54:21] Krinkle: this is mostly independent of the backend issue isn't it?
[21:54:27] from what I can tell, for most use cases listed in the present RFC, content hashes would actually have better caching properties
[21:54:29] TimStarling: re "I am a long way from convinced on this" - can you elaborate? I'd agree that there doesn't seem to be an easy solution to having a unique file ID that is stable and yet usable across wikis.
[21:54:37] we can add a UI to access file IDs later if we want to
[21:54:49] Yeah, totally, we don't need to tackle that at all.
[21:55:04] we're nearing the end of our time. we started the meeting with 3 proposals in the RFC, and option 2 seems the most viable at the moment. is that right?
[21:55:07] The only thing that would become exposed is the file revision ID, not the file ID.
[21:55:16] Since we'd use that instead of the current (file name, timestamp) tuple.
[21:55:20] in APIs
[21:55:42] TimStarling: Right?
[21:55:49] yes
[21:55:53] (and that tuple is already fragmented by wiki, so no concerns there)
[21:56:38] sounds good, robla:
[21:56:38] revision id would be unique, as with page revisions
[21:56:41] TimStarling: I suppose link tables will also stay the same, for the same reason we changed them to be based on titles instead of page ids.
[21:57:06] wait, would this be for the file, or the page description?
[21:57:08] jynus: Yeah, but unlike file IDs (which would need to make sense from another wiki), file revisions are never referenced without a wiki context.
[21:57:20] yes
[21:57:23] URLs make sense in a global context
[21:57:27] as in, you are right
[21:57:32] autoincrement IDs, not so much
[21:57:38] UUIDs could work
[21:57:46] you could assign a 128-bit random ID or something
[21:58:10] but it's a discussion for another day
[21:58:26] gwicke: If we adopt something better than user-generated page names for public file identifiers in the future, the file page would presumably use that same ID in its title.
[21:58:46] I think there is general support for that, but maybe this needs to mature a bit more along with other related outstanding issues?
[21:58:52] would the ID identify a blob, or a description of a blob?
[21:59:05] But I agree with Tim that we should leave that as-is for the purpose of this RFC (we keep using file "titles" as the file page and transclusion name for instant commons etc.)
[21:59:33] we need to support titles anyway, at least for b/c
[21:59:36] Krinkle: any decisions you hope to make sure we document here?
[21:59:44] so I don't think we're boxing ourselves in by considering it later
[22:00:34] gwicke: The file ID I theorise here would identify the file as mutable entity. A revisioned wikitext description page would use that ID in its page name. (e.g. File:12345)
[22:00:44] But anyway, issues. Let's keep that outside the scope for now.
[22:00:50] I'm going to summarize this as "Krinkle updated the RFC to give 3 options, we discussed them, and #2 seems to be the current favorite"
[22:01:00] LGTM.
[22:01:25] * robla is going to hit #endmeeting in a couple of minutes; further discussion welcome on #wikimedia-tech
[22:01:27] option #1 is now excluded, but option #3 was not really discussed
[22:01:57] I'll update the task and call for fleshing out the details with regards to the exact schema (separate image metadata? any other columns to be dropped?) and how to migrate there (rename and mutate, or re-create, or both for a while)
[22:02:17] Krinkle: "the file as mutable entity" - great ... and the primary key too?