During the above session, we will talk about: what are the main problems with current xml/sql/other dumps, for users? If we could redo the entire dumps system from scratch, how would we solve these issues?

Suppose we tossed all our existing dumps infrastructure. What would we want the new Dumps 2.0 to look like? All formats, all content, all means of generation/scheduling up for discussion. Includes: HTML format, XML? JSON? Incrementals, media, EVERYTHING. Let's have a plan. If you don't come, you'll wind up with the same old Dumps 1.0!

The existing dumps have lasted a long time but it's time to put them to bed. Why?

Formats. We already have wikidata json dumps, HTML dumps from Restbase, XML dumps, mysql table dumps, and for a glorious brief while we had media image tarballs. These are/were all produced separately via separate tools, on different schedules, each as their own project. What we should have is one DumpManager to rule them all. Want dumps of X project in a new offline format? Write code for the job(s) needed that takes input/output filenames and a directory location, define dependencies and let the Manager do the rest.

Incrementals. This has to be solved a different way than either the custom binary format (which gets harder when we add new fields to db tables) or the horrid XML hack from the "adds-changes" dumps. We need to deal with this on the MediaWiki side, generating a list of X that have changed between Y and Z. How that then gets processed into nice content for the user is another discussion.
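A "list of X that have changed between Y and Z" could, for pages, come from a query over the revision table. A minimal sketch against an in-memory SQLite stand-in; the table and column names here are simplified assumptions, not the real MediaWiki schema:

```python
import sqlite3

# Toy stand-in for the MediaWiki revision table; the real schema differs,
# and the columns here (rev_id, rev_page, rev_timestamp) are simplified.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE revision (rev_id INTEGER, rev_page INTEGER, rev_timestamp TEXT)")
conn.executemany(
    "INSERT INTO revision VALUES (?, ?, ?)",
    [(1, 10, "20151101000000"),
     (2, 11, "20151115120000"),
     (3, 10, "20151120090000"),
     (4, 12, "20151201000000")],  # this one falls outside the window below
)

def pages_changed_between(conn, start, end):
    """Return the distinct page ids with at least one revision in [start, end)."""
    rows = conn.execute(
        "SELECT DISTINCT rev_page FROM revision "
        "WHERE rev_timestamp >= ? AND rev_timestamp < ? ORDER BY rev_page",
        (start, end),
    )
    return [r[0] for r in rows]

print(pages_changed_between(conn, "20151101000000", "20151201000000"))  # → [10, 11]
```

The changed-page-id list is cheap to produce; turning it into downloadable content (re-rendered pages, stub entries, etc.) is the separate discussion mentioned above.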

Content. We have WikiData which has its own unique content; should we do something special here? Media tarballs are missing; there was discussion some time back about "tarballs on the fly" (for hundreds of GB at once). Is that reasonable, can we pare this down to a more manageable problem? What about the right to fork? If any specific content is downloadable by the user but not all of it by the same user, is the right to fork preserved?

Scalability. The current code can't really be made much faster. We scale it by throwing more hardware at it, in a not very clever way. With the current approach we can't actually speed up the en wikipedia dumps; they will take longer and longer over time. The one DumpManager to rule them all should be able to farm jobs out to a little cluster of boring servers all alike and collect the output pieces back; want en wikipedia to run in a week? Make the cluster larger and it will.
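The "farm jobs out and collect the pieces back" model can be sketched with a plain worker pool; this is an illustration of the scaling idea only, not a proposal for the actual grid tool, and names like dump_chunk are made up:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-chunk job: in the real system this would serialize one
# range of pages to a file; here it just labels the chunk.
def dump_chunk(chunk_id):
    return "output-for-chunk-%d" % chunk_id

def run_dump(num_chunks, workers):
    # One worker per job and all jobs alike: to make en wikipedia finish
    # faster, raise `workers` (i.e. add identical servers to the cluster).
    with ThreadPoolExecutor(max_workers=workers) as pool:
        pieces = list(pool.map(dump_chunk, range(num_chunks)))
    return pieces  # collected back in order for recombination
```

The key property is that the chunks are independent, so wall-clock time scales down roughly with cluster size (until I/O contention dominates).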

Reliability. Because the (XML) dumps are a complicated duct-tape-and-paper-clip thing, and also because jobs take a long time, they are prone to failure for various reasons, from network hiccups to db issues to MW code pushes. If jobs were cheap, we would record the failure, rerun it automatically, if it failed again we would reschedule it automatically with some delay and retry up to n times and then whine to IRC and email on final failure.
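The rerun policy described here might look roughly like this; a sketch only, with hypothetical names and a stubbed alert in place of real IRC/email notification:

```python
import time

def run_with_retries(job, max_attempts=3, delay=0.0, alert=print):
    """Run a job; on failure, rerun after `delay` seconds, up to
    max_attempts tries in total; on final failure, call `alert`
    (a stand-in for the whine to IRC and email) and re-raise."""
    for attempt in range(1, max_attempts + 1):
        try:
            return job()
        except Exception as exc:
            last = exc
            if attempt < max_attempts:
                time.sleep(delay)
    alert("job failed after %d attempts: %s" % (max_attempts, last))
    raise last

# Hypothetical flaky job: fails twice (say, a network hiccup), then succeeds.
attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("transient failure")
    return "done"

print(run_with_retries(flaky, max_attempts=5))  # → done
```

The point of "jobs being cheap" is exactly that this loop becomes affordable: a rerun costs minutes, not the days a monolithic en wikipedia job costs today.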

Storage. These files are all sitting on one webserver with some sweet arrays. Maybe we want something else, a swift or ceph or some other distributed filesystem? Could any given dump be produced on local disks and then shoved into such a filesystem?

Downloading. We currently cap downloads rather severely. If the backend storage was different, and we had 2-3 boxes in front with good bandwidth, could we make downloads much better for community members? What about access to WMF folks generating stats, or labs users, how can we get them "nearly live" data without bw limits?

Maintainability. I've been hacking on these organically for 5 years now. And they were in use essentially unchanged years before that. Ewww. 'Nuff said.

What we should get done at the Summit:
Hash out the architecture and building blocks, with lists of things to research. Example: We need a job scheduler that can handle dependencies, reruns, collect output, monitor progress reports from job runners and feed them to a log, keep track of resources used per job per host, etc. What is available?
Determine what changes we would need to MW core in order to get a "changed content" list.
Rough outline of what we would need for media dumps to happen regularly without breaking the bank.
Parcel out research tasks and set a time frame for agreement on an architecture.
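On the dependency-handling requirement for the job scheduler, the core primitive (a topological ordering of jobs) is available off the shelf in the Python standard library; a sketch with hypothetical job names, not a scheduler recommendation:

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Hypothetical dump jobs for one wiki: stubs must exist before full page
# content, which must exist before recombining and format conversion.
deps = {
    "stubs": set(),
    "pages": {"stubs"},
    "recombine": {"pages"},
    "convert-json": {"recombine"},
}

order = list(TopologicalSorter(deps).static_order())
print(order)  # → ['stubs', 'pages', 'recombine', 'convert-json']
```

A real candidate tool also needs the rerun, logging, progress-monitoring, and per-host resource-accounting pieces listed above; the ordering is just the easy part.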

I have concerns about Formats and Incrementals (points one and two in the description above).

Formats - the need to rewrite dump files

It is necessary to rewrite both XML and SQL dump files before importing them. This is true for:

XML files always;

SQL files, when the user has a DBMS other than MySQL/MariaDB; and

SQL files, when the user wishes to avoid loss of service caused by the DROP TABLE command.

Dump transform tool example. I use a GAWK script to: a) remove the DROP TABLE and CREATE TABLE commands; b) substitute all INSERT commands with REPLACE; and c) scrape the CREATE TABLE command to determine the column order, and then substitute VALUES with (col1, col2, ...) VALUES. The first (a) prevents loss of service, the second (b) prevents loss of good records, and the third (c) protects against the frequent changes of column order in the database schema (T103583).
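The original script is GAWK; the same three transforms can be sketched in Python. This is a simplified illustration of the technique, not the actual tool: it assumes one statement per line and a toy dump without quoted edge cases, which real dumps do not guarantee.

```python
import re

def transform_sql_dump(lines):
    """Apply the three transforms described above: (a) drop DROP TABLE and
    the CREATE TABLE block, (b) turn INSERT into REPLACE, and (c) pin the
    column order scraped from the CREATE TABLE statement."""
    columns, out = [], []
    in_create = False
    for line in lines:
        if line.startswith("DROP TABLE"):
            continue                         # (a) prevents loss of service
        if line.startswith("CREATE TABLE"):
            in_create = True
            continue
        if in_create:
            m = re.match(r"\s*`(\w+)`", line)
            if m:
                columns.append(m.group(1))   # (c) scrape the column order
            if line.startswith(")"):
                in_create = False
            continue
        if line.startswith("INSERT INTO"):
            line = line.replace("INSERT INTO", "REPLACE INTO", 1)  # (b)
            line = line.replace(
                " VALUES", " (%s) VALUES" % ", ".join(columns), 1)
        out.append(line)
    return out

# Toy dump fragment to run the transform against.
dump = [
    "DROP TABLE IF EXISTS `page`;",
    "CREATE TABLE `page` (",
    "  `page_id` int NOT NULL,",
    "  `page_title` varbinary(255) NOT NULL",
    ") ENGINE=InnoDB;",
    "INSERT INTO `page` VALUES (1,'Foo');",
]
print(transform_sql_dump(dump))
```

The output keeps only the rewritten REPLACE statement, with the scraped column list spliced in before VALUES.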

Incrementals - missing incremental SQL dumps

The daily incremental dumps feature XML dumps only. This creates an impedance mismatch for the user. The complete dumps can be imported using transform tools such as mwxml2sql, followed by importing the resulting SQL dumps. By contrast, the daily incremental dumps can only be imported using maintenance/importDump.php, which is so slow that, for the largest wikis, importing each daily incremental dump takes longer than a day.

Suggested solution

Given that dump files must be rewritten, let us:

3.1) dump each wiki as three files in XML format:

foowiki-yyyymmdd-pages-articles.xml.bz2;

foowiki-yyyymmdd-stub-articles.xml.gz;

foowiki-yyyymmdd-links-articles.xml.bz2; and

mutatis mutandis for pages-meta-current and pages-meta-history;
3.2) provide a well supported tool to transform the trio of XML dumps into SQL dumps for each supported DBMS; and
3.3) discontinue dumping SQL format.

Initial steps

4.1) Dump generation. I am enhancing WP-MIRROR so that it can generate dumps in many formats (XML, SQL, HTML/ZIM, media, thumb). That way I can experiment without disturbing WMF dump operations.

4.2) Dump transform tool. As to point 3.2) above, I have written wpmirror-mwxml2sql in GAWK. Currently, it is a drop-in replacement for mwxml2sql (the @ArielGlenn version). The disadvantage is that it runs at about half the speed of mwxml2sql (this is the difference between interpreted GAWK and compiled C). The advantage is maintainability (greatly enhanced perspicuity, a fraction of the lines of code, no buffer overflows). To me, this tool could easily be extended to transform the above suggested trio of XML dump files into a complete set of SQL files for any supported DBMS (I would need to add a --dbms command-line option).

Further, I have been working with distributed processing frameworks like Hadoop and Spark. These systems are designed to process log files one line at a time. Needless to say, they work very badly with the current XML format. We (me and analytics engineers) have devoted a large chunk of time and engineering resources towards converting the XML dumps to line-by-line JSON so that we can work with them in these frameworks. I have been developing a schema that describes the format of this conversion. See https://github.com/mediawiki-utilities/python-mwxml/blob/master/doc/revision_document-0.1.0.json
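The XML-to-JSON-lines conversion might be sketched like this; the element names and fields are a toy subset chosen for illustration, not the real dump schema or the linked revision_document schema:

```python
import io
import json
import xml.etree.ElementTree as ET

def revisions_as_json_lines(stream):
    """Stream one JSON object per <revision>, so frameworks like Hadoop
    or Spark can split the output on newlines."""
    for _, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == "revision":
            yield json.dumps({
                "id": int(elem.findtext("id")),
                "timestamp": elem.findtext("timestamp"),
                "text": elem.findtext("text"),
            })
            elem.clear()  # keep memory flat on multi-GB inputs

# Toy fragment in the spirit of the dump format; real dumps are namespaced
# and carry many more fields per revision.
xml = io.BytesIO(b"<mediawiki><page><title>Foo</title><id>1</id>"
                 b"<revision><id>100</id>"
                 b"<timestamp>2015-11-01T00:00:00Z</timestamp>"
                 b"<text>hello</text></revision></page></mediawiki>")

lines = list(revisions_as_json_lines(xml))
for line in lines:
    print(line)
```

Because iterparse is streaming and each revision becomes one self-contained line, the conversion itself never needs the whole dump in memory and the output splits cleanly across workers.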

I'm not sure if such a JSON format should be primary or if we should just formalize the work that we've been doing in analytics to make the production of JSON files follow the production of XML files. I just wanted to flag that we're doing this and it would be good to move the work towards the source.

@wpmirrordev and @Halfak, these are both good suggestions (though I would likely not toss the sql dumps but keep them along with anything new we generate). But these are both secondary to what I have in mind; I want a new framework that will allow dumps in new formats to be easily added in to existing dumps; problems of scale to be (relatively) easily resolved if we are willing to add generic servers to a cluster; maintenance to be done by a number of people instead of only me; and rely in part on some sort of grid computing setup based on an existing open source project. I spent this week writing some notes with a couple of pictures to get at some of these issues and I'll get those on line this weekend. Getting a new framework with separate components deployed would allow anyone to add a JSON formatter to process the output of any particular dump, for example.

Having said that, please keep on with the suggestions, we'll use them, whether earlier or later.

The notes are now all online at the above location. Please remember that the diagram and the walkthroughs and generally all proposals are for getting this discussion started. Nothing is decided (well some of the goals are a must, but that's it). Please start commenting (here, preferably, or there if it's more convenient but notify here in that case).

Splitting can be achieved by subheadings. A typical example is Wiktionary that has one page per word, and subheadings for each language that has a definition of that word, e.g. https://en.wiktionary.org/wiki/Eskimo

However, these structures are not reflected in dumps, templates, robot operations or searches. But maybe they should? It would be nice to be able to search for words only inside the current book without learning all the intitle: syntax for searches. Wiktionary has many templates with the argument lang=, which is almost always lang=fr in the ==French== section of the page and lang=de in the ==German== section of the page. If the subheading could set this as a context (in this section, assume lang=fr, like a local variable in a programming language), the parameter would not be needed in each template call. (It would only be needed when it deviates from the section default.) So, considering that the XML dump is a structured document, perhaps we should see the wiki as one large structure, rather than a flat set of pages.

@LA2 this is something I've long wanted for the Wiktionaries, which all have their own 'markup' for language, definition, translation and so on. I do think it's out of the scope of this project but something that should be raised separately.

So guys and gals and the rest of us, I don't see any discussion yet on the architecture. We need to do as much of this _before_ the dev summit as possible, to make the best use of our time there. If you don't feel you know enough to comment, start asking questions; if you know enough to comment, start doing so. At a minimum I want people to comment on the architecture diagram and the various components; makes sense? Too heavy? Functions split up improperly? No grid computing tool will do what we need? Or you know just the thing?

Well, here are some comments and questions. Hopefully they aren't too obtuse, and might be of some utility.

comments

I like the idea of a central object store that contains various "chunks" of data which can be reprocessed by transformers, notifiers, downloaders, etc. I'm not sure what work is involved to produce this, but it's definitely the best feature of the redesign.

I think the architecture and various component functionality look fine, though it's too high-level to comment directly upon. I don't know if you're looking for direct feedback, but the job runner, logger, status monitor, etc look like an appropriate division of labor.

questions

What format are the chunks in the object store? The Redesign documents indicate uncompressed XML (and probably SQL for the table dumps) but I just want to be sure. Any possibility of JSON (which is less bulky) or SQLite (which is directly queryable)?

I honestly don't understand the grid concept. The document states: "This is some grid computing thing that we didn't write and we don't maintain.". Is this some third-party service, like Amazon Web Services (editorial: hope not) that does all the heavy lifting? If so, how exactly does that happen, since all the data resides on Wikimedia servers and would need to be transported elsewhere for the work? Wouldn't the time saved in grid computing be offset by data transport?

I never understood why history dumps need to be redumped every month, especially since they are so tremendous -- particularly the English Wikipedia history dumps. Since each revision gets assigned an incrementing integer ID, can't it just be dumped once and not redumped again? For example, assume all edits between 2015-11-01 and 2015-11-30 will be assigned a revision_id of 31,000,000 to 31,999,999. Can't Wikimedia just dump one file called "enwiki-revision-31000000-31999999" once, and never have to dump it again? From a server perspective, this would save time and space. From a user perspective, it would mean downloading a large file only once, rather than downloading the entire terabyte data-set month after month.
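The fixed revision-range naming proposed here is easy to sketch; the range width and filename convention below are hypothetical, matching the example in the comment:

```python
RANGE_SIZE = 1_000_000  # hypothetical fixed width of one revision range

def range_filename(rev_id, wiki="enwiki", size=RANGE_SIZE):
    """Map a revision id to the (hypothetical) range file that holds it."""
    lo = (rev_id // size) * size
    return "%s-revision-%d-%d" % (wiki, lo, lo + size - 1)

print(range_filename(31_123_456))  # → enwiki-revision-31000000-31999999
```

Since revision ids only grow, a closed range would in principle never need regenerating, which is the space and bandwidth saving being argued for (modulo the deleted/oversighted-revision problem raised in the replies).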

So first: grid computing in this context means: open source software that schedules and runs jobs keeping track of available resources on its cluster, rerunning if necessary, able to manage dependencies (I think), and so on. In our case a pretty small cluster of hosts, that are just COTS hardware. NOT AT ALL sending files somewhere else; the idea is simply that right now we have our own several levels of python code that do this poorly, and it would be nice to use an already developed open source package that was designed for this from the start.

Second: we dump the history dumps ordered by page; that's what's most useful for most users. Since pages change from one dump to the next, we rewrite them. If we decided to do this in revision id order instead, and we got everyone to go along with it, rewriting their tools and such, we would still need to regenerate the dumps from time to time to remove deleted revisions (from deleted pages) or oversighted/hidden revisions. Yes, it's true that those would still be available on archive.org but we don't want to be providing them forever as apparently current content. This thing about the deleted/oversighted revisions (which might often contain e.g. personal identifying information) is something where we might want to get the Legal team's advice.
It would be nice to have true incrementals in a format easily parsable with standard tools, and that indeed is one of the things up for discussion; what those would look like, how they could be implemented.

Third (but first for you): the object store will have intermediate files in uncompressed something. I don't really care what they are. They might be XML if we stick to XML format; they might be something else. We want a format converter that can produce multiple output formats when it comes time to recombine these into nice downloadable pieces anyways. Uncompressed is needed so that recombining them is as fast as we can have it.

I am indeed looking for direct feedback on the division of labour, and on everything else. If it would be helpful I could describe in another part of that document what each of those pieces corresponds to in the current code (ewww) but I don't know if that would be generally useful or not. If/when we get general agreement about this high level overview we can start fleshing out the details. But if you have proposals, no reason to wait til then.

Ah! Thanks for the explanations. I don't have much to add, but just some minor points

OSS grid computing gets a breath of relief for me. I don't really have any experience in this arena though. Are there certain packages that are leading candidates?

I didn't realize that revisions get deleted. I would think that these events are relatively uncommon, and that the range encompassing the deletions can just be dumped again. Over time, old revision-ranges (for example, 2002-02-25 -> 2002-03-24) should become static. Even in the worst case, if 1 page from every revision-range gets deleted, each range will just need to be dumped again. This won't be any worse than today. :)

As for page-order dumps, I honestly can't see a valid use case for them. For example, the 1st range is p000000010 - p000002875 . This encompasses a wide range of unrelated pages including Alchemy, A, Aristotle, International Atomic Time, Air conditioning, Arabian Sea, Parallel ATA, etc. I really can't see someone saying "I'll download p000000010 - p000002875 but not p000002877 - p000005030." If someone were looking for the complete edit history for a handful of pages, they'd use the API. If someone were looking for the complete edit history for tens of thousands of pages, then they would probably be randomly spread out over all these page-ranges and they'd have to download the whole thing. The new paradigm would be no different than the old paradigm.

I think uncompressed XML is fine for the object store formats. I think JSON would be "lighter", but most of the existing tool support is probably around XML. So no strong feelings / arguments from me there. :)

However, I would think that there would be a mixture of other formats also: SQL for the table dumps and binary for the image dumps (whenever that happens). Is that the case, or will everything be standardized to XML?

I'll think a little more on the overall division of labor, but this isn't something I work with, so my input will probably be very shallow. Are there specific issues you're concerned about, such as is 2 rounds of failures enough to require manual intervention for a failed job?

There are two ways for this to happen. One is that a page gets deleted; in that case the revision is no longer visible to a normal user and so it would not be dumped either. The second way it can 'disappear' is for it to be specifically oversighted because it has damaging content; in this case, the other revisions for the page are still accessible but the one revision can no longer be viewed by a normal user, and again it would not be dumped.

People use page-order dumps all the time because their bots or editing tools or history analyzers or whatnot all expect to look at the info for a single page at once.

I'm not wedded to any particular format for things in the object store. I'd prefer to have one format that fits all internally, but I could be talked out of that. I assume we'll provide XML final output and SQL final output because people have tools written for those formats. And we'll see what else people desire and what makes sense.

As to the division of labor, I'm concerned about all aspects of it :-) So if there's anything whatsoever that jumps out at you, please comment.

Regarding page-order dumps, I do understand that there are existing tools built around them. However, I still think that revision-order dumps would be better for the majority of users, as well as Wikimedia in the long-run. Especially since page-order dumps basically have a shelf-life of 1 month (they need to be regenerated / redownloaded the next month). Revision-order dumps persist longer: they only need to be regenerated / redownloaded when revisions get deleted.

That said, I'm not one to press the point, and would be interested to hear counterpoints from page-order users.

As for object store format, I really would not want one-universal-format, as it would make for awkward shoehorning. I believe separate formats would work fine: SQL for table dumps; binary for image dumps; SQL / XML / JSON for revision dumps. Personally, I think it's better to use the right tool for each task than to try to craft one tool for all tasks.

I looked again at the redesign documents, and at the risk of being an echo chamber, I really think it is logically fine. I think the dump manager does do a lot of work (it has the most bi-directional interactions with the other components), so it seems like it may be the biggest bottleneck (it submits a job, is the exclusive partner for the client, processes all jobs when completed, etc.). However, I think it's fine to have one central manager, and don't see any easy ways to decompose its duties. Again, just my 2 cents.

Notes on some possible job queue management/grid/multitask stuff. The go-to in Python is Celery; all exceptions from any job must be pickleable (ewww), and it has limited functionality, but we might be OK with it because it's rather low level. Another possibility is the Kubernetes job management piece, but this may be too high level for us. Collecting other suggestions, and we should start a page someplace to evaluate alternatives once there are more than two. Kubernetes is good for managing jobs that take different amounts of resources and Celery is not, but with the Dumps 2.0 straw man model we are going for one core per job so that all jobs look alike, though some may run longer than others.

This doesn't seem clear to me, and I think I'm coming to agree with @GWicke's comment T119029#1866851 that this is probably not a good fit for the summit in its current form. That said, it would be very disappointing to kick the can down the road for some more time.

@ArielGlenn, you wrote: "If you don't come, you'll wind up with the same old Dumps 1.0!" Really? Do you have a concise answer for "if I were in charge, I would solve X with Y?", for the following values of X?

Format

Incrementals

Content

Scalability

Reliability

Storage

Downloading

Maintainability

You also say "FOR CURRENT DISCUSSION" and then point to a giant wall of prose. That doesn't look like a discussion to me. It looks like it might be time to revisit @Halfak's earlier comment, because that looks like the seeds of a discussion that were killed early.

First, the same old Dumps 1.0 are what are running now. If we don't get some collaboration on 2.0 there won't be a 2.0. And that's exactly what I meant when I said that if people don't show up we'll just have the same old dumps we always had.

Second, for current discussion: the document is meant to be a straw man proposal that people can pick apart for discussion, in particular the diagram, because without giving some starting point for discussion, no discussion was taking place.

Halfak's comment is fine, it's just premature, in the sense that if people agree that there ought to be some sort of formatter that produces output, we can have that produce xml, json, etc. The idea is not dead at all.

I think there's a risk of second system syndrome here in the current proposal. I would start with the question "what's the most important problem to solve with the current system?" My instinct is that we can come up with a very simple system to solve @Halfak's problem, possibly a (hopefully temporary) "next gen" version that doesn't replace the old system at first. We keep iteratively solving problems with the new system, and then when the pain of maintaining two systems becomes too much, implement the old system on top of the new system.

OK, I'll backtrack then. There are some problems listed in the description but maybe none of those is "the most important problem to solve" with the current system. So I'm taking bids, what do people think? And yes as the maintainer I have an opinion but it's as a maintainer, not as a user, so I'll chime in after some other folks have spoken up.

Problem 1 does not need to wait for dumps 2.0; the instant workaround (though a workaround and not a solution) is to download from the your.org mirror which provides reasonable download speeds. In the meantime I will open a ticket for this issue and cc you on it, gathering the network issues ticket you have, a ticket I have open for ms1001 eth bonding, and adding one suggested by paravoid about looking at dataset1001 memory and related issues that may impact disk i/o.

Problem 2 should get a task for it also (again it does not need to wait for dumps 2.0) though I can not promise to start on it right away.

@ArielGlenn: To me it seems that the discussion so far lacks a shared agreement on what the most pressing problems with dumps are. This makes it difficult to evaluate candidate solutions and their trade-offs relative to the top priorities.

With the right preparation, a discussion at the dev summit could perhaps help to establish a shared agreement on the top problems to solve. It would be helpful if a candidate list could be worked out before the summit, so that it can inform the discussion.

I have scheduled this session on the Unconference track for tomorrow, Jan 5 at 10 a.m. We'll be identifying the main issues for users of the dumps (I already know what the main issues are for the maintainer!), and then discussing approaches to address those issues if we were able to design the dump system from scratch.

I agree with @Halfak that many of the big XML dumps are very difficult to use, and a single-line JSON format could be easily parallelized by users and would be much more convenient to parse in modern languages. They should also be compressed with bz2, not gz, as noted in T115222.

I would second that work as a good high-priority starting point, as suggested by @RobLa-WMF

Lots to process still. I'm going to chat with Adam Wight (didn't get to do that at the dev summit) and also Gabriel Wicke (same) this week and add those notes to the etherpad so we don't lose them; I suppose I should copy those etherpad notes to a wiki page right after that, actually. See the action items, the ball's on me but I'm trying more to coordinate than to decide by fiat.

Wikimedia Developer Summit 2016 ended two weeks ago. This task is still open. If the session in this task took place, please make sure 1) that the session Etherpad notes are linked from this task, 2) that followup tasks for any actions identified have been created and linked from this task, 3) to change the status of this task to "resolved". If this session did not take place, change the task status to "declined". If this task itself has become a well-defined action which is not finished yet, drag and drop this task into the "Work continues after Summit" column on the project workboard. Thank you for your help!

https://etherpad.wikimedia.org/p/WikiDev16-T114019 has rough notes from the talk with millimetric's team, I will clean these up and move to the wiki page for the dev session soon. AWight and maybe one more Aaron are next. Then it will be question culling time.

Will chat with @Halfak tomorrow at around 18.00 my time i.e. EET (16.00 UTC I guess). Notes to go to etherpad first and then wiki page as usual. After that AWight and then done with info gathering for this round.

Yes indeed, Flow and anything folks want to produce in the future. We want this to be as easy as adding a config section to a puppet manifest (once the script to produce the dataset is written and tested).

I have broken out all the questions into tasks and added them as blocking tasks to one of the four groups outlined by @awight above. In the next days I'll start grabbing suggestions from the session and from notes and adding them on those tasks; don't wait for that however. Please rewrite/clarify/add/move questions as you deem appropriate. Once we have settled on the list I'll prioritize a few that I'll try to move along towards conclusions first, but comment on any of them.

As promised this task is now being closed. If you're not getting notifications about those other tasks don't forget to subscribe to the Dumps-Rewrite project.