On Thu, Aug 9, 2012 at 6:54 AM, Denny Vrandečić <denny.vrandecic [at] wikimedia> wrote:
> * Merging the Wikidata branch (ContentHandler) is still open, see
> <https://bugzilla.wikimedia.org/show_bug.cgi?id=38622>. There has been
> no feedback in the last few weeks. Daniel is waiting for input.

Per discussion on the bug, there's an unresolved issue with the code as stored in Gerrit. Tim tried cloning this, and wasn't able to find some of the revisions that Daniel referred to. Last week, you mentioned that Daniel was going to send mail to the list about the Gerrit stuff, but I don't recall seeing that.

If you can get the code somewhere Tim can review it, he's ready to look at it.

I'd strongly suggest starting a separate thread on this mailing list about this (please change the subject line if you reply to this message). In short, this is a controversial approach, and it's unclear why you're letting it block your work.

The current "interwiki" related code in core has many assumptions baked in that prevent us from doing what we need to do in phase 1. For instance, "language links" can only be made to sites with an id that is a language code. Since we're properly identifying sites across our clients, we're using global identifiers, which will be "enwiki" rather than "en", so this cannot work. That's only one of the many evil things in the current code.

> this is a controversial approach

How so?

Is anyone suggesting building on top of the pile of crap we currently have would be better?

> Hi all,
>
> [...]
> * Changeset <https://gerrit.wikimedia.org/r/#/c/14295/>, bug
> <https://bugzilla.wikimedia.org/show_bug.cgi?id=38705> about handling
> sites. The idea is to migrate from the "interwiki" table to the new
> "Sites" facility. RobLa mentioned two weeks ago that Chad seems to be
> working in a similar direction, but we haven't seen comments yet. No
> discussion is ongoing or any substantial feedback was received here as
> well, and it seems somewhat stuck.
> [...]
>
> I hope this helps,
> Cheers,
> Denny

I would like some more information on this. The bug doesn't appear to even have the correct link for a discussion on this.

Redoing our interwiki code to deal with some mistakes we made in storage was something I was hoping to do. So if this is meant to replace the interwiki system, I'd like to look over the plan and overall idea to make sure we don't repeat the same mistakes.

> One thing that was tacked on the wiki page without mention here or a
> bug created was the "Stick to that language" extension. Is that a
> hard requirement, or nice to have?

We are currently investigating using the "Universal Language Selector" instead of "Stick to that language", and on first glance it looks good. If this remains like this, we will drop "Stick to that language". That is why I didn't list the corresponding open issues there. We'd be happy to go for ULS instead. We expect to have a resolution on that next week.

On 09.08.2012 16:49, Rob Lanphier wrote:
> On Thu, Aug 9, 2012 at 6:54 AM, Denny Vrandečić
> <denny.vrandecic [at] wikimedia> wrote:
>> * Merging the Wikidata branch (ContentHandler) is still open, see
>> <https://bugzilla.wikimedia.org/show_bug.cgi?id=38622>. There has been
>> no feedback in the last few weeks. Daniel is waiting for input.
>
> Per discussion on the bug, there's an unresolved issue with the code
> as stored in Gerrit. Tim tried cloning this, and wasn't able to find
> some of the revisions that Daniel referred to.

That would be strange; a straight fresh clone works fine for me, and the revisions are in the log.

Tim, please confirm that you are unable to see the changes I mentioned when you just switch to the Wikidata branch on an up to date working copy of core, ignoring Gerrit.

Also, dennyb added direct links to the respective commits on gitweb. They are there. Gerrit just doesn't know about them. And the shortlogs on gitweb are strange.

> Last week, you mentioned that Daniel was going to send mail to the
> list about the Gerrit stuff, but I don't recall seeing that.

I investigated the problem and reported my findings on bugzilla. There isn't much to say except "gerrit doesn't know about direct pushes" and "gitweb is confusing".

> If you can get the code somewhere Tim can review it, he's ready to look at it.

Well, it's in the git repo. Everyone on the team is using that branch for development and testing; they'd notice if important changes were missing. So I'm confident that it really *is* there.

> Hey,
>
>> So if this is something hoping to replace the interwiki system I'd like
>> to look over what the plan and overall idea is with this to make sure we
>> don't repeat the same mistakes.
>
> Please have a look at the patch on gerrit then. Feedback is much
> appreciated :) https://gerrit.wikimedia.org/r/#/c/14295/
>
> Cheers
>
> --
> Jeroen De Dauw
> http://www.bn2vs.com
> Don't panic. Don't be evil.
> --

Looking over the code, it does seem we're repeating the same issues that exist in the current interwiki system, the ones I was planning to eliminate when I moved includes/Interwiki.php to includes/interwiki/Interwiki.php and put the work on my endless to-do list.

The issue I was trying to deal with was storage. Currently we 100% assume that the interwiki list is a table and that there will only ever be one of them. But this counters multiple facts about interwikis in practice:
- We have a default set of interwiki links. Because we use a database instead of flat files, we end up inserting stuff on installation. As a result, when something changes (e.g. Wikimedia supports https:// and now all links are supposed to be protocol-relative), we have hundreds of wikis all using outdated interwiki rules even after they upgrade MediaWiki, because interwiki links are only inserted by the software on installation; they are not taken directly from the software's map.
- In practice we don't want one interwiki map. In projects like Wikimedia we actually usually want two or three. We want a global shared list of interwikis so that [[Wikipedia:]], [[commons:]], etc. work on every project. We want a shared list of interwikis for each project (i.e. Wikipedias, Wiktionaries, etc.), primarily so that [[en:]], [[es:]], etc. language links are not duplicated, since these can't be global, but also because there may be some interwiki links that apply to some projects and not others. And sometimes we also want a wiki-local interwiki list, because some communities want to add links to sites that other wikis don't, or we may want to localize a link. We end up writing absolutely horrible hacks we shouldn't have to, because the implementation is ignorant of reality.

I had planned to do a few primary things to the system:
- Drop the notion of the interwiki list simply being a database table. Multiple class implementations were going to make it possible to have database-backed interwiki lists, file-backed interwiki lists (in multiple formats), etc.
- Drop the single-list handling and allow a list of multiple interwiki sources to be configured from a $wg variable.

Together this would mean that our default list of interwiki links would no longer be stored in the interwiki table, but instead read directly from our source code, where cleaning up the urls would nicely update all wikis when they upgrade. It would also make it easy to set up multiple interwiki list sources for wikis, such as a global interwiki database, a project one, and a local one. And it would be possible to use simple text-based, file-backed interwiki lists so that people don't need to mess with SQL.
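The multi-source idea above can be sketched roughly like this. This is a hypothetical Python sketch (MediaWiki itself is PHP), and every class and function name here is made up for illustration:

```python
# Sketch of the "multiple interwiki sources" plan: sources are consulted
# in order, and later (more local) sources override earlier (default) ones.
class DictInterwikiSource:
    """A source backed by a plain mapping, e.g. parsed from a text file."""
    def __init__(self, prefixes):
        self.prefixes = prefixes

    def fetch(self, prefix):
        return self.prefixes.get(prefix)

def lookup(prefix, sources):
    # Walk every configured source; the last match wins, so a wiki-local
    # source can override the software's default map.
    result = None
    for source in sources:
        found = source.fetch(prefix)
        if found is not None:
            result = found
    return result

# Defaults shipped with the software, plus a wiki-local override.
defaults = DictInterwikiSource({"wikipedia": "//en.wikipedia.org/wiki/$1"})
local = DictInterwikiSource({"wikipedia": "//fr.wikipedia.org/wiki/$1"})
print(lookup("wikipedia", [defaults, local]))
```

Under this layout, fixing a url in the defaults source would take effect on every wiki at upgrade time, since the data is read from code rather than copied into a table at install time.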

----

But it looks like the new sites code is also focused around a single list of database-backed sites.

((Also, while there are a number of really interesting ideas, sorry to say it, but some of the code already triggers that "Must rewrite!" mood rather than a mood for small incremental tweaks.))

Also, anything in this area really needs to account for our lack of a user interface. If we rewrite this, then we absolutely must include a UI to view and edit it in core. By rewriting it we ditch every hack that tries to make it easy to control the interwiki list, and only make the problem worse. The notes on synchronizing with Wikidata look interesting. But this kind of thing absolutely has to be user-friendly and multi-wiki friendly at a core level, not only for wikis using Wikidata.

----

I think some of this stuff is a bit large to discuss in code review or email. I'd like to do this RfC style, listing everything we need from different perspectives so we can come up with something that doesn't need to be redone yet again.

Originally I was focused on taking the interwiki system's dependence out of the database. But the talk of synchronization and other things in the code has me thinking about things like a database table as a final index (like pagelinks, etc.), and fetching lists, siteinfo, etc. from other sites. So I have a feeling that the best thing we come up with will probably look different from what either of us started with.

Firstly though, I probably won't be able to come up with a good idea without a good understanding of Wikidata's role in all this:
- I would like to understand what Wikidata needs out of interwiki/sites and what it's going to do with the data
- I'd also like to know if Wikidata plans to add any interface that will add/remove sites

If we do this hastily, I think we may also miss a very good chance to make it possible to fix bug 11 and bug 10237 much more sanely.

Bug 39199 also covers a thought on in-page linking that I've been thinking about.

> The issue I was trying to deal with was storage. Currently we 100% assume that the interwiki list is a table and there will only ever be one of them.

Yes, we are not changing this. Having a more flexible system might or might not be something we'd want in MediaWiki. We do not need it in Wikidata though. The changes we're making here do not seem to affect this issue at all, so you can just as well implement it later on.

> In practice we don't want one interwiki map. In projects like
> Wikimedia we actually usually want two or three.
> ..
> And sometimes we also want a wiki-local interwiki list because some
> communities want to add links to sites that other wikis don't.

This we are actually tackling, although in a different fashion than you propose. Rather than having many different lists of sites to maintain, we have split sites from their configuration. The list of sites is global and shared by all clients. Their configuration, however, is local. So if wiki A wants to use site X as an interwikilink with prefix foobar, wiki B wants to use it with prefix baz, and wiki C does not want to use it as an interwikilink at all, this is perfectly possible. The split and associated generalization our changes bring add a lot of flexibility compared to the current system and remove bad assumptions currently baked in.
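The split described here (one shared site list, per-wiki usage configuration) can be illustrated with a toy sketch. Everything below, including the key names, is hypothetical and not the actual Wikidata code:

```python
# One global list of sites, shared by all client wikis.
GLOBAL_SITES = {
    "sitex": {"url": "https://x.example.org/wiki/$1"},
}

# Each wiki only stores how *it* uses each global site; the site row
# itself is never duplicated or modified locally.
local_config = {
    "wiki_a": {"sitex": {"interwiki_prefix": "foobar"}},
    "wiki_b": {"sitex": {"interwiki_prefix": "baz"}},
    "wiki_c": {"sitex": {"interwiki_prefix": None}},  # not usable as a prefix
}

def interwiki_prefix(wiki, site_id):
    """Return the prefix this wiki uses for the site, or None if disabled."""
    return local_config[wiki].get(site_id, {}).get("interwiki_prefix")

assert interwiki_prefix("wiki_a", "sitex") == "foobar"
assert interwiki_prefix("wiki_c", "sitex") is None
```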

> Also anything in this area really needs to think of our lack of user interface. If we rewrite this then we absolutely must include a UI to view and edit this in core.

Again, this is not something we're touching at all, or want to touch, as we don't need it. Personally I think it'd be great to have such facilities, and it makes sense to add them after the backend has been fixed. I'd be happy to work with you on this (or leave it entirely up to you) once we've got the relevant rewrite work done.

> By rewriting it we ditch every hack trying to make it easy to control the interwiki list and only make the problem worse.

Our change will not drop any existing functionality. I will make sure there are tools/facilities at least as good as (and probably better than) the current ones.

> I would like to understand what Wikidata needs out of interwiki/sites and what it's going to do with the data

We need this for our "equivalent links", which consist of a global site id and a page. Right now we do not have consistent global ids; in fact, we don't have global ids at all. We just have local ids that happen to be similar everywhere (one might not want this, but is pretty much forced into it right now), which must be language codes in order to be "languagelinks" or (better named) "equivalent links". Also, right now, all languagelinks are interwikilinks (wtf) - we want to be able to have "equivalent links" without them also being interwiki links!
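As a rough illustration of what an "equivalent link" amounts to under this model (the names here are mine for illustration, not Wikidata's actual code):

```python
# An "equivalent link" is just (global site id, page title): the site id
# is an opaque global identifier like "enwiki", not a language code, and
# nothing forces the link to also be rendered as an interwiki link.
from collections import namedtuple

EquivalentLink = namedtuple("EquivalentLink", ["global_site_id", "page"])

links = [
    EquivalentLink("enwiki", "Berlin"),
    EquivalentLink("dewiki", "Berlin"),
]

# No assumption anywhere that "enwiki" parses as a language code.
assert all(link.page == "Berlin" for link in links)
```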

> I'd also like to know if Wikidata plans to add any interface that will add/remove sites

The backend will have an interface to do this, but we're not planning on any API modules or UIs. The backend will be written keeping in mind people will want those though, so it ought to be easy to add them later on.

So to wrap up: I don't think there is any conflict between what we want to do (if you disagree, please provide some pointers). You can make your changes later on, and you will have a much more solid base to work on than now.

> We are currently investigating using the "Universal Language Selector"
> instead of "Stick to that language", and on first glance it looks
> good. If this remains like this, we will drop "Stick to that
> language". That is why I didn't list the corresponding open issues
> there. We'd be happy to go for ULS instead. We expect to have a
> resolution on that next week.

On 12-08-09 12:00 PM, Jeroen De Dauw wrote:
> This we are actually tackling, although in a different fashion than you
> propose. Rather than having many different lists of sites to maintain,
> we have split sites from their configuration. The list of sites is
> global and shared by all clients. Their configuration however is
> local.

I think we're going to need to have some of this and the synchronization stuff in core. Right now the code has nothing but the one sites table. There is no repo code, so presumably the only implementation of that for a while will be Wikidata. And if parts of this table are supposed to be editable in some cases where there is no repo, but non-editable otherwise, then I don't see any way for an edit UI to tell the difference.

I'm also not sure how this synchronization, which sounds one-way, will play with individual wikis wanting to add new interwiki links.

>> Also anything in this area really needs to think of our lack of user
>> interface. If we rewrite this then we absolutely must include a UI to
>> view and edit this in core.
>
> Again, this is not something we're touching at all, or want to touch,
> as we don't need it.
>
>> By rewriting it we ditch every hack trying to make it easy to
>> control the interwiki list and only make the problem worse.
>
> Our change will not drop any existing functionality. I will make sure
> there are tools/facilities at least as good (and probably better) than
> the current ones.

I'm talking about things like the interwiki extensions and the scripts that turn wiki tables into interwiki lists. All these things are written against the interwiki table. So by rewriting and using a new table we implicitly break all those working tricks and throw the user back into SQL.

>> I would like to understand what Wikidata needs out of
>> interwiki/sites and what it's going to do with the data
>
> We need this for our "equivalent links", which consist of a global
> site id and a page. [...] Also, right now, all languagelinks are
> interwikilinks (wtf) - we want to be able to have "equivalent links"
> without them also being interwiki links!

I like the idea of table entries without actual interwikis. The idea of some interface listing user-selectable sites came to mind, and perhaps sites being added trivially, even automatically. Though if you plan to support this, I think you'll need to drop the NOT NULL from site_local_key.

Actually, another thought makes me think the schema should be a little different. site_local_key probably shouldn't be a column; it should probably be another table, something like site_local_key (slc_key, slc_site), which would map things like en:, Wikipedia:, etc. to a specific site. I can see wikis wanting to use multiple interwiki names for the same site. In fact, I'm pretty sure this already happens with the existing interwiki table; we just create duplicate rows. But you want global ids, so I really don't think you want data duplication like that to happen.
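The suggested mapping table can be mocked up like so. The (slc_key, slc_site) layout comes from the proposal above; everything else is illustrative:

```python
# A separate key table lets several local prefixes point at one global
# site without duplicating the site row itself.
site_local_key = [
    ("en", "enwiki"),         # language-link prefix, e.g. on fr.wikipedia
    ("wikipedia", "enwiki"),  # interwiki prefix for the very same site
]

def sites_for_prefix(prefix):
    """All global site ids a local prefix maps to (normally one)."""
    return [site for key, site in site_local_key if key == prefix]

# Both prefixes resolve to the same single site row; no duplicate rows,
# so the site keeps exactly one global id.
assert sites_for_prefix("en") == sites_for_prefix("wikipedia") == ["enwiki"]
```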

>> I'd also like to know if Wikidata plans to add any interface that
>> will add/remove sites
>
> The backend will have an interface to do this, but we're not planning
> on any API modules or UIs.
>
> So to wrap up: I don't think there is any conflict between what we
> want to do (if you disagree, please provide some pointers).

I think I need to understand the plans you have for synchronization a bit more:
- Where does Wikidata get the sites?
- What synchronizes the data?
- What is the repo like, and what is it based off of? Is this wikis syncing from another wiki's sites table, or does Wikidata have a real set of data the sites table gets built from?
- Is this one-way synchronization or multiway?

Synchronization, treatment of the table (whether it's an index of something else or first-class data), and UIs for editing are a set of things where any one can get in the way of the ability to do the others later if you don't think of them all up front.

Our old interwiki table was treated as first-class data and was simple data that was easy to create an edit interface for. As a result, it's hard to do any synchronization for it, since we didn't plan for that. Likewise, if we design a sites table focused on synchronizing data while treating the table as simultaneously first-class data and partly an index, we can easily come up with something that is going to get in the way of the consistency needed for a UI.

One of our options might be to treat sites like an index of data built from other sources, just like pagelinks. Wikidata can act as a repo, the sites code can build from multiple sources with Wikidata being the first, and when a UI comes into play, the UI can create its own list of sites and that can be used as a source for the building of the sites table.

----

Heh, it probably doesn't help that this is making my abstract revision idea come up and making me want to have the UI depend on that.

Btw, if you really want to make this an abstract list of sites, dropping site_url and the other two related columns might be an idea. At first glance the url looks like something standard that every site would have. But once you throw something like MediaWiki into the mix, with short urls, long urls, and an API, the url really becomes type-specific data that should probably go in the blob. Especially when you start thinking about other custom types.

> I think we're going to need to have some of this and the synchronization
> stuff in core.
> Right now the code has nothing but the one sites table. No repo code so
> presumably the only implementation of that for awhile will be wikidata. And
> if parts of this table is supposed to be editable in some cases where there
> is no repo but non-editable then I don't see any way for an edit ui to tell
> the difference.

We indeed need some configuration setting(s) for wikis to distinguish between the two cases. That seems to be all the "synchronisation code" we'll need in core. It might or might not be useful to have more logic in core, or in some dedicated extension. Personally I think having the actual synchronization code in a separate extension would be nice, as a lot of it won't be Wikidata-specific. This is however not a requirement for Wikidata, so the current plan is to just have it in the extension, always keeping in mind that it should be easy to split it off later on. I'd love to discuss this point further, but it should be clear this is not much of a blocker for the current code, as it seems unlikely to affect it much, if at all.

On that note, consider that we're initially creating the new system in parallel with the old one, which enables us to just try out changes and alter them later on if it turns out there is a better way to do them. Then, once we're confident the new system is what we want to stick with, and know it works because of its usage by Wikidata, we can replace the current code with the new system. This ought to allow us to work a lot faster by not blocking on discussions and details for too long.

> I'm also not sure how this synchronization which sounds like one-way will play with individual wikis wanting to add new interwiki links.

For our case we only need it to work one way, from the Wikidata repo to its clients. More discussion would need to happen to decide on an alternate approach. I already indicated I think this is not a blocker for the current set of changes, so I'd prefer this to happen after the current code gets merged.

> I'm talking about things like the interwiki extensions and scripts that
> turn wiki tables into interwiki lists. All these things are written against
> the interwiki table. So by rewriting and using a new table we implicitly
> break all the working tricks and throw the user back into SQL.

I am aware of this. As noted already, the current new code does not yet replace the old code, so this is not a blocker yet, but it will be for replacing the old code with the new system. Having looked at the existing code using the old system, I think migration should not be too hard, since the new system can do everything the old one can do, and there is not that much code using it. The new system also has clear interfaces, preventing calling code from needing to know of the database table at all. That ought to facilitate the "do not depend on a single db table" goal a lot, obviously :)

> I like the idea of table entries without actual interwikis. The idea of
> some interface listing user selectable sites came to mind and perhaps sites
> being added trivially even automatically.
> Though if you plan to support this I think you'll need to drop the NOT
> NULL from site_local_key.

I don't think the field needs to allow for null - right now the local keys on the repo will by default be the same as the global keys, so none of them will be null. On your client wiki you will then have these values by default as well. If you don't want a particular site to be usable as a "languagelink" or "interwikilink", then simply set this in your local configuration. No need to set the local id to null. Depending on how we actually end up handling the defaulting process, having null might or might not turn out to be useful. This is a detail though, so I'd suggest sticking with NOT NULL for now, and then, if it turns out it'd be more convenient to allow for null when writing the sync code, just change it then.

> Actually, another thought makes me think the schema should be a little
> different.
> site_local_key probably shouldn't be a column, it should probably be
> another table.
> Something like site_local_key (slc_key, slc_site) which would map things
> like en:, Wikipedia:, etc... to a specific site.

Denny and I discussed this at some length, now already more than a month ago (man, this is taking long...). Our conclusions were that we do not need it, and would not benefit from it much in Wikidata. In fact, it'd introduce additional complexity, which is a good argument for not including it in our already huge project. I do agree that conceptually it's nicer to not duplicate such info, but if you consider the extra complexity you'd need to get rid of it, and the little gain you have (removal of some minor duplication which we've had since forever and which is not bothering anyone), I'm sceptical we ought to go with this approach, even outside of Wikidata.

> I think I need to understand the plans you have for synchronization a bit
> more.
> - Where does Wikidata get the sites

The repository wiki holds the canonical copy of the sites, which gets sent to all clients. Modification of the site data can only happen on the repository. All wikis (repo and clients) have their own local config, so they can choose to enable all sites for all functionality, completely hide them, or anything in between.

> - What synchronizes the data

The repo. As already mentioned, it might be nicer to split this off into its own extension at some point. But before we get to that, we first need to have the current changes merged.

> Btw if you really want to make this an abstract list of sites dropping
> site_url and the other two related columns might be an idea.
> At first glance the url looks like something standard that every site
> would have. But once you throw something like MediaWiki into the mix with
> short urls, long urls, and an API the url really becomes type specific data
> that should probably go in the blob. Especially when you start thinking
> about other custom types.

The patch sitting on gerrit already includes this. (Did you really look at it already? The fields are documented quite well I'd think.) Every site has a url (that's not specific to the type of site), but we have a type system with currently the default (general) site type and a MediaWikiSite type. The type system works with two blob fields, one for type specific data and one for type specific configuration.
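A toy model of the described split between a shared url field and type-specific blob data; the field and class names here are guesses for illustration, not the actual schema:

```python
# Every site has a plain url column, while type-specific details are
# serialized into a blob field, as the type system described above does.
import json

class Site:
    def __init__(self, global_id, site_type, url, type_data=None):
        self.global_id = global_id
        self.site_type = site_type   # e.g. "general" or "mediawiki"
        self.url = url               # shared by all site types
        # Type-specific data, stored serialized in a blob column.
        self.site_data = json.dumps(type_data or {})

mw = Site(
    "enwiki", "mediawiki", "https://en.wikipedia.org",
    type_data={
        "article_path": "/wiki/$1",      # short urls
        "script_path": "/w/index.php",   # long urls
        "api_path": "/w/api.php",        # the API
    },
)
assert json.loads(mw.site_data)["api_path"] == "/w/api.php"
```

Daniel's counterpoint above is that for a type like MediaWiki, even the url arguably belongs in the blob with the rest of the type-specific paths.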

On 12-08-09 3:55 PM, Jeroen De Dauw wrote:
> I don't think the field needs to allow for null - right now the local
> keys on the repo will by default be the same as the global keys, so
> none of them will be null. On your client wiki you will then have
> these values by default as well. If you don't want a particular site
> to be usable as "languagelink" or "interwikilink", then simply set
> this in your local configuration. No need to set the local id to null.

You mean site_config? You're suggesting the interwiki system should look for a site by site_local_key, and when it finds one, parse out the site_config, check if it's disabled, and if so ignore the fact that it found a site with that local key? Instead of just not having a site_local_key for that row in the first place?

>> Actually, another thought makes me think the schema should be a
>> little different.
>> site_local_key probably shouldn't be a column, it should probably
>> be another table.
>
> Denny and I discussed this at some length, now already more than a
> month ago (man, this is taking long...). Our conclusions were that we
> do not need it, and would not benefit from it much in Wikidata.

You've added global ids into this mix. So data duplication simply because one wiki needs a second local name will mean that one url now has two different global ids; this sounds precisely like something that is going to get in the way of the whole reason you wanted this rewrite. It will also start to create issues with the sync code. Additionally, the number of duplicates needed is going to vary wiki by wiki: en.wikisource is going to need one prefix, Wikipedia:, to link to en.wp, while fr.wp is going to need two, Wikipedia: and en:, to point to en.wp. I can only see data duplication creating more problems than we need.

As for the supposed complexity of this extra table: site_data and site_config are blobs of presumably serialized data. You've already eliminated the simplicity needed for this to be human-editable from SQL, so there is no reason to hold back on making the database schema the best it can be. As for deletions, if you're worried about making them simple, just add a foreign key with cascading deletion. Then the rows in site_local_key will automatically be deleted when you delete the row in sites, without any extra complexity.
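A minimal sketch of the suggestion, for the record (slc_key/slc_site come from the quoted proposal; the other table and column names here are illustrative, not the actual patchset schema):

```sql
-- Hypothetical two-table layout. Column lists are abbreviated and,
-- apart from slc_key/slc_site, the names are made up for illustration.
CREATE TABLE sites (
  site_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  site_global_key VARCHAR(32) NOT NULL, -- e.g. 'enwiki'
  site_data BLOB NOT NULL,
  site_config BLOB NOT NULL
);

CREATE TABLE site_local_key (
  slc_key VARCHAR(32) NOT NULL,   -- local prefix, e.g. 'en' or 'Wikipedia'
  slc_site INT UNSIGNED NOT NULL, -- the site this prefix maps to
  PRIMARY KEY (slc_key),
  FOREIGN KEY (slc_site) REFERENCES sites (site_id)
    ON DELETE CASCADE -- deleting a site removes its local keys too
);
```

With the cascade in place, a plain DELETE on the sites row removes the matching site_local_key rows in the same statement, so no extra application code is needed.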

> I think I need to understand the plans you have for
> synchronization a bit more.
> - Where does Wikidata get the sites
>
> The repository wiki holds the canonical copy of the sites, which gets
> sent to all clients. Modification of the site data can only happen on
> the repository. All wikis (repo and clients) have their own local
> config, so they can choose to enable all sites for all functionality,
> completely hide them, or anything in between.

Ok, I'm leaning more and more towards the idea that we should make the full sites table a second-class index of sites, pulled from any number of data sources, that you can carelessly truncate and have rebuilt (ie: it has no more value than pagelinks). Wikidata's data syncing would be served by creating a secondary table with the local link_{key,inline,navigation}, forward, and config columns. When you sync, the data from the Wikidata repo and the site-local table would be combined to create what goes into the index table with the full list of sites.

Doing it this way frees us from placing restrictions on whatever source we get sites from that we shouldn't be placing on them. Wikidata gets site-local stuff and global data, and doesn't have to worry about whether parts of the row are supposed to be editable or not. There is nothing stopping us from making our first non-wikidata site source a plaintext file, so we have time to write a really good UI. And the UI is free from restrictions placed by using this one table, so it's free to do it in whatever way fits a UI best, whether that means an editable wikitext page or, better yet, a nice ui using that abstract revision system I wanted to build.

> - What synchronizes the data
>
> The repo. As already mentioned, it might be nicer to split this off
> into its own extension at some point. But before we get to that, we
> first need to have the current changes merged.
>
> Btw if you really want to make this an abstract list of sites,
> dropping site_url and the other two related columns might be an idea.
> At first glance the url looks like something standard that every
> site would have. But once you throw something like MediaWiki into
> the mix with short urls, long urls, and an API, the url really
> becomes type specific data that should probably go in the blob.
> Especially when you start thinking about other custom types.
>
> The patch sitting on gerrit already includes this. (Did you really
> look at it already? The fields are documented quite well, I'd think.)
> Every site has a url (that's not specific to the type of site), but we
> have a type system with currently the default (general) site type and
> a MediaWikiSite type. The type system works with two blob fields, one
> for type specific data and one for type specific configuration.

Yeah, I looked at the schema; I know there is a data blob, and that's what I'm talking about. While you'd think a url is something every site would have one of, it's actually more of a type-specific piece of data, because some site types can actually have multiple urls, etc... which depend on what the page input is. So you might as well drop the 3 url-related columns and just use the data blob that you already have. The $1 pattern may not even work for some sites. For example, something like a gerrit type may want to know a specific root path for gerrit without any $1 funny business, and then handle what actual url gets output in special ways. ie: so that [[gerrit:14295]] links to https://gerrit.wikimedia.org/r/#/c/14295 while [[gerrit:I0a96e58556026d8c923551b07af838ca426a2ab3]] links to https://gerrit.wikimedia.org/r/#q,I0a96e58556026d8c923551b07af838ca426a2ab3,n,z

> You mean site_config?
> You're suggesting the interwiki system should look for a site by
> site_local_key, when it finds one parse out the site_config, check if it's
> disabled and if so ignore the fact it found a site with that local key?
> Instead of just not having a site_local_key for that row in the first place?

No. Since the interwiki system is not specific to any type of site, this approach would make it needlessly hard. The site_link_inline field determines whether the site should be usable as an interwiki link, as you can see in the patchset:

-- If the site should be linkable inline as an "interwiki link" using
-- [[site_local_key:pageTitle]].
site_link_inline bool NOT NULL,

So queries would be _very_ simple.
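For example, resolving an inline interwiki prefix against the patchset's one-table schema could be a single lookup (column names as discussed in this thread; treat this as a sketch, not the exact patchset query):

```sql
-- Resolve a prefix like [[foo:Some page]] to a site, skipping sites
-- that are not enabled for inline "interwiki" linking.
SELECT site_url
FROM sites
WHERE site_local_key = 'foo'
  AND site_link_inline = 1;
```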

> So data duplication simply because one wiki needs a second local name will mean that one url now has two different global ids, which sounds precisely like something that is going to get in the way of the whole reason you wanted this rewrite.

* It does not get in our way at all, and is completely disjunct from why we want the rewrite
* It's currently done like this
* The changes we do need and are proposing to make will make such a rewrite at a later point easier than it is now

> Doing it this way frees us from creating any restrictions on whatever source we get sites from that we shouldn't be placing on them.

* We don't need this for Wikidata
* It's a new feature that might or might not be nice to have, and that currently does not exist
* The changes we do need and are proposing to make will make such a rewrite at a later point easier than it is now

> So you might as well drop the 3 url related columns and just use the data blob that you already have.

I don't see what this would gain us at all. It would just make things more complicated.

> The $1 pattern may not even work for some sites.

* We don't need this for Wikidata
* It's a new feature that might or might not be nice to have, and that currently does not exist
* The changes we do need and are proposing to make will make such a rewrite at a later point easier than it is now

And in fact we are making this more flexible by having the type system. The MediaWiki site type could for instance be able to form both "nice" urls and index.php ones. Or a gerrit type could have the logic to distinguish between the gerrit commit number and a sha1 hash.

Hi Daniel,

thanks for your comments. Some of the suggestions you make would extend the functionality beyond what we need right now. They certainly look useful, and I don't think that we would make implementing them any harder than it is right now -- rather the opposite.

As usual, the perfect and the next-step are great enemies. I understand that the patch does not lead directly to a perfect world that would cover all of your use cases -- but it nicely covers ours.

My questions would be:

* do you think that we are going in the wrong direction, or do you think we are not going far enough yet?

* do you think that we are making some use cases harder to implement in the future than they would be now, and if so which ones?

* do you see other issues with the patch that should block it from being deployed, and which ones would that be?

Cheers, Denny

2012/8/10 Daniel Friesen <lists [at] nadir-seen-fire>: > On 12-08-09 3:55 PM, Jeroen De Dauw wrote: >> Hey, >> >> You bring up some good points. >> >> I think we're going to need to have some of this and the >> synchronization stuff in core. >> Right now the code has nothing but the one sites table. No repo >> code so presumably the only implementation of that for awhile will >> be wikidata. And if parts of this table is supposed to be editable >> in some cases where there is no repo but non-editable then I don't >> see any way for an edit ui to tell the difference. >> >> >> We indeed need some configuration setting(s) for wikis to distinguish >> between the two cases. That seems to be all "synchronisation code" >> we'll need in core. It might or might not be useful to have more logic >> in core, or in some dedicated extension. Personally I think having the >> actual synchronization code in a separate extension would be nice, as >> a lot of it won't be Wikidata specific. This is however not a >> requirement for Wikidata, so the current plan is to just have it in >> the extension, always keeping in mind that it should be easy to split >> it off later on. I'd love to discuss this point further, but it should >> be clear this is not much of a blocker for the current code, as it >> seems unlikely to affect it much, if at all. >> >> On that note consider we're initially creating the new system in >> parallel with the old one, which enabled us to just try out changes, >> and alter them later on if it turns out there is a better way to do >> them. Then once we're confident the new system is what we want to >> stick to, and know it works because of it's usage by Wikidata, we can >> replace the current code with the new system. This ought to allow us >> to work a lot faster by not blocking on discussions and details for to >> long. 
>> >> > I'm also not sure how this synchronization which sounds like one-way >> will play with individual wikis wanting to add new interwiki links. >> >> For our case we only need it to work one way, from the Wikidata repo >> to it's clients. More discussion would need to happen to decide on an >> alternate approach. I already indicated I think this is not a blocker >> for the current set of changes, so I'd prefer this to happen after the >> current code got merged. >> >> I'm talking about things like the interwiki extensions and scripts >> that turn wiki tables into interwiki lists. All these things are >> written against the interwiki table. So by rewriting and using a >> new table we implicitly break all the working tricks and throw the >> user back into SQL. >> >> >> I am aware of this. Like noted already, the current new code does not >> yet replace the old code, so this is not a blocker yet, but it will be >> for replacing the old code with the new system. Having looked at the >> existing code using the old system, I think migration should not be to >> hard, since the new system can do everything the old one can do and >> the current using code is not that much. The new system also has clear >> interfaces, preventing the script from needing to know of the database >> table at all. That ought to facilitate the "do not depend on a single >> db table" a lot, obviously :) >> >> I like the idea of table entries without actual interwikis. The >> idea of some interface listing user selectable sites came to mind >> and perhaps sites being added trivially even automatically. >> Though if you plan to support this I think you'll need to drop the >> NOT NULL from site_local_key. >> >> >> I don't think the field needs to allow for null - right now the local >> keys on the repo will be by default the same as the global keys, so >> none of them will be null. On your client wiki you will then have >> these values by default as well. 
If you don't want a particular site >> to be usable as "languagelink" or "interwikilink", then simply set >> this in your local configuration. No need to set the local id to null. >> Depending on how actually we end up handling the defaulting process, >> having null might or might not turn out to be useful. This is a detail >> though, so I'd suggest sticking with not null for now, and then if it >> turns out I'd be more convenient to allow for null when writing the >> sync code, just change it then. > You mean site_config? > You're suggesting the interwiki system should look for a site by > site_local_key, when it finds one parse out the site_config, check if > it's disabled and if so ignore the fact it found a site with that local > key? Instead of just not having a site_local_key for that row in the > first place? > >> Actually, another thought makes me think the schema should be a >> little different. >> site_local_key probably shouldn't be a column, it should probably >> be another table. >> Something like site_local_key (slc_key, slc_site) which would map >> things like en:, Wikipedia:, etc... to a specific site. >> >> >> Denny and I discussed this at some length, now already more then a >> month ago (man, this is taking long...). Our conclusions where that we >> do not need it, or would benefit from it much in Wikidata. In fact, >> I'd introduce additional complexity, which is a good argument for not >> including it in our already huge project. I do agree that conceptually >> it's nicer to not duplicate such info, but if you consider the extra >> complexity you'd need to get rid of it, and the little gain you have >> (removal of some minor duplication which we've had since forever and >> is not bothering anyone), I'm sceptical we ought to go with this >> approach, even outside of Wikidata. > You've added global ids into this mix. 
So data duplication simply > because one wiki needs a second local name will mean that one url now > has two different global ids this sounds precisely like something that > is going to get in the way of the whole reason you wanted this rewrite. > It will also start to create issues with the sync code. > Additionally the number of duplicates needed is going to vary wiki by > wiki. en.wikisource is going to need one Wikipedia: to link to en.wp > while fr.wp is going to need two, Wikipedia: and en: to point to en.wp. > I can only see data duplication creating more problems than we need. > > As for the supposed complexity of this extra table. site_data and > site_config are blobs of presumably serialized data. You've already > eliminated the simplicity needed for this to be human editable from SQL > so there is no reason to hold back on making the database schema the > best it can be. As for deletions if you're worried about making them > simple just add a foreign key with cascading deletion. Then the rows in > site_local_key will automatically be deleted when you delete the row in > sites without any extra complexity. > >> I think I need to understand the plans you have for >> synchronization a bit more. >> - Where does Wikidata get the sites >> >> >> The repository wiki holds the canonical copy of the sites, which gets >> send to all clients. Modification of the site data can only happen on >> the repository. All wikis (repo and clients) have their own local >> config so can choose to enable all sites for all functionality, >> completely hide them, or anything in between. > Ok, I'm leaning more and more towards the idea that we should make the > full sites table a second-class index of sites pulled from any number of > data sources that you can carelessly truncate and have rebuilt (ie: it > has no more value than pagelinks). 
> Wikidata's data syncing would be served by creating a secondary table > with the local link_{key,inline,navigation}, forward, and config > columns. When you sync the data from the Wikidata repo and the site > local table would be combined to create what goes into the index table > with the full list of sites. > Doing it this way frees us from creating any restrictions on whatever > source we get sites from that we shouldn't be placing on them. > Wikidata gets site local stuff and global data and doesn't have to worry > about whether parts of the row are supposed to be editable or not. There > is nothing stopping us from making our first non-wikidata site source a > plaintext file so we have time to write a really good UI. And the UI is > free from restrictions placed by using this one table, so it's free to > do it in whatever way fits a UI best. Whether that means it's an > editable wikitext page or better yet a nice ui using that abstract > revision system I wanted to build. > >> - What synchronizes the data >> >> >> The repo. As already mentioned, it might be nicer to split this off in >> it's own extension at some point. But before we get to that, we first >> need to have the current changes merged. >> >> Btw if you really want to make this an abstract list of sites >> dropping site_url and the other two related columns might be an idea. >> At first glance the url looks like something standard that every >> site would have. But once you throw something like MediaWiki into >> the mix with short urls, long urls, and an API the url really >> becomes type specific data that should probably go in the blob. >> Especially when you start thinking about other custom types. >> >> >> The patch sitting on gerrit already includes this. (Did you really >> look at it already? The fields are documented quite well I'd think.) 
>> Every site has a url (that's not specific to the type of site), but we >> have a type system with currently the default (general) site type and >> a MediaWikiSite type. The type system works with two blob fields, one >> for type specific data and one for type specific configuration. > Yeah, I looked at the schema I know there is a data blob, that's what > I'm talking about. I mean while you'd think that a url is something > every site would have one of it's actually more of a type specific piece > of data because some site types can actually have multiple urls, etc... > which depend on what the page input is. So you might as well drop the 3 > url related columns and just use the data blob that you already have. > The $1 pattern may not even work for some sites. For example something > like a gerrit type may want to know a specific root path for gerrit > without any $1 funny business and then handle what actual url gets > output in special ways. ie: So that [[gerrit:14295]] links to > https://gerrit.wikimedia.org/r/#/c/14295 while [[gerrit: > I0a96e58556026d8c923551b07af838ca426a2ab3]] links to > https://gerrit.wikimedia.org/r/#q,I0a96e58556026d8c923551b07af838ca426a2ab3,n,z


[Just to clarify, I'm doing inline replies to things various people said, not just Jeroen]

First and foremost, I'm a little confused as to what the actual use cases here are. Could we get a short summary, for those who aren't entirely following how wikidata will work, of why the current interwiki situation is insufficient? I've read I0a96e585 and http://lists.wikimedia.org/pipermail/wikitech-l/2012-June/060992.html, but everything seems very vague ("It doesn't work for our situation"), without any detailed explanation of what that situation is. At most the messages kind of hint at wanting to be able to access the list of interwiki types of the wikidata "server" from a wikidata "client" (and keep them in sync, or at least have them replicated from server->client). But there's no explanation given of why one needs to do that (are we doing some form of interwiki transclusion and need to render foreign interwiki links correctly? Want to be able to do global whatlinkshere and need unique global ids for various wikis? Something else?)

> * Site definitions can exist that are not used as "interlanguage link" and
> not used as "interwiki link"

And if we put one of those on a talk page, what would happen? Or, if foo were one such site, what would [[:foo:some page]] do? (Current behaviour is that it becomes an interwiki link.)

Although to be fair, I do see how the current way we distinguish between interwiki and interlang links is a bit hacky.

> And in fact we are making this more flexible by having the type system. The
> MediaWiki site type could for instance be able to form both "nice" urls and
> index.php ones. Or a gerrit type could have the logic to distinguish
> between the gerrit commit number and a sha1 hash.

I must admit I do like this idea. In particular, the current situation where we treat the value of an interwiki link as a title (aka spaces -> underscores, etc.) even for sites that do not use such conventions has always bothered me. Having interwikis that support url re-writing based on the value does sound cool, but I certainly wouldn't want said code in a db blob (and just using an integer site_type identifier is quite far away from giving us that, but it's still a step in a positive direction), which raises the question of where such rewriting code would go.

> The issue I was trying to deal with was storage. Currently we 100% assume
> that the interwiki list is a table and there will only ever be one of them.

Do we really assume that? Certainly that's the default config, but I don't think that is the config used on WMF. As far as I'm aware, Wikimedia uses a cdb database file (via $wgInterwikiCache), which contains all the interwikis for all sites. From what I understand, it supports various "scope" levels of interwikis, including per db, per site (Wikipedia, Wiktionary, etc), or global interwikis that act on all sites.

The feature is a bit wmf specific, but it does seem to support different levels of interwiki lists.

Furthermore, I imagine (but don't know, so let's see how fast I get corrected ;) that the cdb database was introduced not just as a convenience measure for easier administration of the interwiki tables, but also for better performance. If so, one should also take into account any performance hit that may come with switching to the proposed "sites" facility.

> Having interwikis that support url re-writing based on the value does
> sound cool, but I certainly wouldn't want said code in a db blob

We do not have code in blobs in the db; that seems like a rather mad thing to do! :)

The blobs we have are for holding data or config specific to some site type. Having url re-writing based on page value does not require any such type specific data or config. It requires type specific logic, which would just go in the relevant Site deriving class, for instance MediaWikiSite.

> Hi Daniel,
>
> thanks for your comments. Some of the suggestions you make would
> extend the functionality beyond what we need right now. They look
> certainly useful, and I don't think that we would make implementing
> them any harder than it is right now -- rather the opposite.
>
> As usual, the perfect and the next-step are great enemies. I
> understand that the patch does not lead to a perfect world directly,
> that would cover all of your use cases -- but it nicely covers ours.
>
> My questions would be:
>
> * do you think that we are going in the wrong direction, or do you
> think we are not going far enough yet?
>
> * do you think that we are making some use cases harder to implement
> in the future than they would be now, and if so which ones?

I don't think you're going far enough. You're rewriting a core feature in core, but key issues with the old system that should be fixed in any rewrite of it are explicitly being repeated, just because you don't need them fixed for Wikidata. I'm not expecting any of you to spend a pile of time writing a UI just because one is missing. But I do expect that, if we have a good idea what the optimal database schema and usage of the feature is, you'd make a small effort to include the fixes that Wikidata doesn't explicitly need, instead of rewriting it in a non-optimal format and forcing someone else to rewrite stuff again.

Take site_local_key as an example. Clearly site_local_key as a single column does not work. We know from our interwiki experience that we really want multiple keys. And there is absolutely no value at all in forcing data duplication.

If we use a proper site_local_key table right now, before submitting the code, it should be a minimal change to the code you have. (Unless the ORM mapper makes it hard to use joins, in which case you'd be making a bad choice from the start, since when someone does fix site_local_key they will need to break interface compatibility.)

If someone tries to do this later, they are going to have to make schema changes, do a full data migration in the updater, AND find some way to de-duplicate data. These are things that wouldn't need to be bothered with at all if the initial rewrite just took a few extra steps.
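Roughly, the later migration being described would have to look something like this (hypothetical sketch; slc_key/slc_site as in the proposal quoted earlier, other names illustrative):

```sql
-- Hypothetical later migration: split the column out into its own table...
INSERT INTO site_local_key (slc_key, slc_site)
SELECT site_local_key, site_id FROM sites;

ALTER TABLE sites DROP COLUMN site_local_key;

-- ...and then de-duplicate sites rows that exist only because one wiki
-- needed a second local name, repointing slc_site and picking a
-- canonical global id for each merged site (the hard, manual part).
```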

> * do you see other issues with the patch that should block it from
> being deployed, and which ones would that be?

I covered a few of them in inline comments on the commit: things like not understanding the role of group, using ints for site types being bad for extensibility, etc...

> 2012/8/10 Daniel Friesen <lists [at] nadir-seen-fire>: >> On 12-08-09 3:55 PM, Jeroen De Dauw wrote: >>> Hey, >>> >>> You bring up some good points. >>> >>> I think we're going to need to have some of this and the >>> synchronization stuff in core. >>> Right now the code has nothing but the one sites table. No repo >>> code so presumably the only implementation of that for awhile will >>> be wikidata. And if parts of this table is supposed to be editable >>> in some cases where there is no repo but non-editable then I don't >>> see any way for an edit ui to tell the difference. >>> >>> >>> We indeed need some configuration setting(s) for wikis to distinguish >>> between the two cases. That seems to be all "synchronisation code" >>> we'll need in core. It might or might not be useful to have more logic >>> in core, or in some dedicated extension. Personally I think having the >>> actual synchronization code in a separate extension would be nice, as >>> a lot of it won't be Wikidata specific. This is however not a >>> requirement for Wikidata, so the current plan is to just have it in >>> the extension, always keeping in mind that it should be easy to split >>> it off later on. I'd love to discuss this point further, but it should >>> be clear this is not much of a blocker for the current code, as it >>> seems unlikely to affect it much, if at all. >>> >>> On that note consider we're initially creating the new system in >>> parallel with the old one, which enabled us to just try out changes, >>> and alter them later on if it turns out there is a better way to do >>> them. Then once we're confident the new system is what we want to >>> stick to, and know it works because of it's usage by Wikidata, we can >>> replace the current code with the new system. This ought to allow us >>> to work a lot faster by not blocking on discussions and details for to >>> long. 
>>> >>> > I'm also not sure how this synchronization which sounds like one-way >>> will play with individual wikis wanting to add new interwiki links. >>> >>> For our case we only need it to work one way, from the Wikidata repo >>> to it's clients. More discussion would need to happen to decide on an >>> alternate approach. I already indicated I think this is not a blocker >>> for the current set of changes, so I'd prefer this to happen after the >>> current code got merged. >>> >>> I'm talking about things like the interwiki extensions and scripts >>> that turn wiki tables into interwiki lists. All these things are >>> written against the interwiki table. So by rewriting and using a >>> new table we implicitly break all the working tricks and throw the >>> user back into SQL. >>> >>> >>> I am aware of this. Like noted already, the current new code does not >>> yet replace the old code, so this is not a blocker yet, but it will be >>> for replacing the old code with the new system. Having looked at the >>> existing code using the old system, I think migration should not be to >>> hard, since the new system can do everything the old one can do and >>> the current using code is not that much. The new system also has clear >>> interfaces, preventing the script from needing to know of the database >>> table at all. That ought to facilitate the "do not depend on a single >>> db table" a lot, obviously :) >>> >>> I like the idea of table entries without actual interwikis. The >>> idea of some interface listing user selectable sites came to mind >>> and perhaps sites being added trivially even automatically. >>> Though if you plan to support this I think you'll need to drop the >>> NOT NULL from site_local_key. >>> >>> >>> I don't think the field needs to allow for null - right now the local >>> keys on the repo will be by default the same as the global keys, so >>> none of them will be null. On your client wiki you will then have >>> these values by default as well. 
If you don't want a particular site >>> to be usable as "languagelink" or "interwikilink", then simply set >>> this in your local configuration. No need to set the local id to null. >>> Depending on how actually we end up handling the defaulting process, >>> having null might or might not turn out to be useful. This is a detail >>> though, so I'd suggest sticking with not null for now, and then if it >>> turns out I'd be more convenient to allow for null when writing the >>> sync code, just change it then. >> You mean site_config? >> You're suggesting the interwiki system should look for a site by >> site_local_key, when it finds one parse out the site_config, check if >> it's disabled and if so ignore the fact it found a site with that local >> key? Instead of just not having a site_local_key for that row in the >> first place? >> >>> Actually, another thought makes me think the schema should be a >>> little different. >>> site_local_key probably shouldn't be a column, it should probably >>> be another table. >>> Something like site_local_key (slc_key, slc_site) which would map >>> things like en:, Wikipedia:, etc... to a specific site. >>> >>> >>> Denny and I discussed this at some length, now already more then a >>> month ago (man, this is taking long...). Our conclusions where that we >>> do not need it, or would benefit from it much in Wikidata. In fact, >>> I'd introduce additional complexity, which is a good argument for not >>> including it in our already huge project. I do agree that conceptually >>> it's nicer to not duplicate such info, but if you consider the extra >>> complexity you'd need to get rid of it, and the little gain you have >>> (removal of some minor duplication which we've had since forever and >>> is not bothering anyone), I'm sceptical we ought to go with this >>> approach, even outside of Wikidata. >> You've added global ids into this mix. 
So data duplication simply >> because one wiki needs a second local name will mean that one url now >> has two different global ids this sounds precisely like something that >> is going to get in the way of the whole reason you wanted this rewrite. >> It will also start to create issues with the sync code. >> Additionally the number of duplicates needed is going to vary wiki by >> wiki. en.wikisource is going to need one Wikipedia: to link to en.wp >> while fr.wp is going to need two, Wikipedia: and en: to point to en.wp. >> I can only see data duplication creating more problems than we need. >> >> As for the supposed complexity of this extra table. site_data and >> site_config are blobs of presumably serialized data. You've already >> eliminated the simplicity needed for this to be human editable from SQL >> so there is no reason to hold back on making the database schema the >> best it can be. As for deletions if you're worried about making them >> simple just add a foreign key with cascading deletion. Then the rows in >> site_local_key will automatically be deleted when you delete the row in >> sites without any extra complexity. >> >>> I think I need to understand the plans you have for >>> synchronization a bit more. >>> - Where does Wikidata get the sites >>> >>> >>> The repository wiki holds the canonical copy of the sites, which gets >>> send to all clients. Modification of the site data can only happen on >>> the repository. All wikis (repo and clients) have their own local >>> config so can choose to enable all sites for all functionality, >>> completely hide them, or anything in between. >> Ok, I'm leaning more and more towards the idea that we should make the >> full sites table a second-class index of sites pulled from any number of >> data sources that you can carelessly truncate and have rebuilt (ie: it >> has no more value than pagelinks). 
>> Wikidata's data syncing would be served by creating a secondary table
>> with the local link_{key,inline,navigation}, forward, and config
>> columns. When you sync, the data from the Wikidata repo and the
>> site-local table would be combined to create what goes into the index
>> table with the full list of sites.
>> Doing it this way frees us from placing restrictions on whatever
>> source we get sites from that we shouldn't be placing on them.
>> Wikidata gets site-local stuff and global data and doesn't have to
>> worry about whether parts of the row are supposed to be editable or
>> not. There is nothing stopping us from making our first non-wikidata
>> site source a plaintext file so we have time to write a really good
>> UI. And the UI is free from restrictions placed by using this one
>> table, so it's free to do it in whatever way fits a UI best - whether
>> that means it's an editable wikitext page or, better yet, a nice ui
>> using that abstract revision system I wanted to build.
>>
>>> > - What synchronizes the data
>>>
>>> The repo. As already mentioned, it might be nicer to split this off
>>> into its own extension at some point. But before we get to that, we
>>> first need to have the current changes merged.
>>>
>>> > Btw, if you really want to make this an abstract list of sites,
>>> > dropping site_url and the other two related columns might be an
>>> > idea. At first glance the url looks like something standard that
>>> > every site would have. But once you throw something like MediaWiki
>>> > into the mix with short urls, long urls, and an API, the url
>>> > really becomes type-specific data that should probably go in the
>>> > blob. Especially when you start thinking about other custom types.
>>>
>>> The patch sitting on gerrit already includes this. (Did you really
>>> look at it already? The fields are documented quite well, I'd
>>> think.)
>>> Every site has a url (that's not specific to the type of site), but
>>> we have a type system with currently the default (general) site type
>>> and a MediaWikiSite type. The type system works with two blob
>>> fields, one for type-specific data and one for type-specific
>>> configuration.
>>
>> Yeah, I looked at the schema; I know there is a data blob, that's
>> what I'm talking about. While you'd think that a url is something
>> every site would have one of, it's actually more of a type-specific
>> piece of data, because some site types can actually have multiple
>> urls, etc., which depend on what the page input is. So you might as
>> well drop the 3 url-related columns and just use the data blob that
>> you already have.
>> The $1 pattern may not even work for some sites. For example,
>> something like a gerrit type may want to know a specific root path
>> for gerrit without any $1 funny business, and then handle what actual
>> url gets output in special ways. ie: So that [[gerrit:14295]] links
>> to https://gerrit.wikimedia.org/r/#/c/14295 while
>> [[gerrit:I0a96e58556026d8c923551b07af838ca426a2ab3]] links to
>> https://gerrit.wikimedia.org/r/#q,I0a96e58556026d8c923551b07af838ca426a2ab3,n,z
>>
>>> Cheers
>>>
>>> --
>>> Jeroen De Dauw
>>> http://www.bn2vs.com
>>> Don't panic. Don't be evil.
>>> --
>>
>> ~Daniel Friesen (Dantman, Nadir-Seen-Fire) [http://daniel.friesen.name]

_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
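[Editor's note: the `site_local_key` side table Daniel proposes above, with the cascading deletion he mentions, can be sketched roughly as follows. This is an illustrative toy using SQLite, not the schema from the Gerrit changeset; only the names `sites`, `site_local_key`, `slc_key`, and `slc_site` come from the thread, the rest is assumed for the example.]

```python
import sqlite3

# Sketch of the proposed split: local interwiki prefixes live in their
# own table pointing at a site row, so one url keeps a single global id
# no matter how many local prefixes map to it.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked

conn.execute("""
    CREATE TABLE sites (
        site_id         INTEGER PRIMARY KEY,
        site_global_key TEXT NOT NULL UNIQUE   -- e.g. 'enwiki'
    )""")
conn.execute("""
    CREATE TABLE site_local_key (
        slc_key  TEXT NOT NULL PRIMARY KEY,    -- e.g. 'en' or 'Wikipedia'
        slc_site INTEGER NOT NULL
                 REFERENCES sites(site_id) ON DELETE CASCADE
    )""")

# fr.wp's case from the thread: two local prefixes for one target site.
conn.execute("INSERT INTO sites VALUES (1, 'enwiki')")
conn.executemany("INSERT INTO site_local_key VALUES (?, ?)",
                 [("en", 1), ("Wikipedia", 1)])

# Deleting the site cascades to both local keys - no manual cleanup.
conn.execute("DELETE FROM sites WHERE site_id = 1")
remaining = conn.execute("SELECT COUNT(*) FROM site_local_key").fetchone()[0]
print(remaining)  # 0
```

With the foreign key in place, the "extra complexity" of deletions reduces to the one `ON DELETE CASCADE` clause, which is the point Daniel is making.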

> But I do expect that if we have a good idea what the optimal database schema and usage of the feature is that you'd make a tiny effort to include the fixes that Wikidata doesn't explicitly need.

This is entirely reasonable to ask. However, this particular change is not tiny; it would cost us both effort to implement, and it would make the change set even bigger while we're trying to keep it small. We actually did go for low-hanging fruit we did not need ourselves here, so implying we don't care about concerns outside of our project would be short-sighted. After all, strictly speaking we do not _need_ this rewrite. We could just keep pouring crap onto the current pile and hope it does not collapse, rather than fix all of the issues our change is tackling.

> Instead of rewriting it using a non-optimal format and forcing someone else to rewrite stuff again.

We are not touching this, so you would still need to make the change if you want to fix this issue - but you would not need to do it _again_. To be honest, I don't understand why you have a problem here. We're making it easier for you to make this change. If you think it's that important, then let's get our changes through so you can start making yours without us getting in each other's way.

> Unless the ORM mapper makes it hard to use joins

It basically does not affect joins - it has no facilities for them, but it does not make them harder either.
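[Editor's note: "no facilities for joins" just means dropping to hand-written SQL when a join is needed, as in this toy sketch. It does not show MediaWiki's actual ORM classes; the table and column names follow the schema discussed in this thread, and the wrapper is hypothetical.]

```python
import sqlite3

# A thin ORM-style accessor covers single-table reads; a join is simply
# written as raw SQL alongside it. Illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sites (site_id INTEGER PRIMARY KEY, site_global_key TEXT);
    CREATE TABLE site_local_key (slc_key TEXT, slc_site INTEGER);
    INSERT INTO sites VALUES (1, 'enwiki');
    INSERT INTO site_local_key VALUES ('en', 1), ('Wikipedia', 1);
""")

class SiteTable:
    """The kind of single-table select an ORM wrapper typically provides."""
    def select_by_global_key(self, key):
        return conn.execute(
            "SELECT * FROM sites WHERE site_global_key = ?", (key,)
        ).fetchall()

# The wrapper has no join support, but nothing prevents a raw join:
rows = conn.execute("""
    SELECT slk.slc_key
    FROM sites s
    JOIN site_local_key slk ON slk.slc_site = s.site_id
    WHERE s.site_global_key = 'enwiki'
    ORDER BY slk.slc_key
""").fetchall()
print([r[0] for r in rows])  # ['Wikipedia', 'en']
```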

On Fri, Aug 10, 2012 at 9:02 AM, Daniel Friesen <lists [at] nadir-seen-fire> wrote: > I don't think you're going far enough. > You're rewriting a core feature in core but key issues with the old system > that should be fixed in any rewrite of it are explicitly being repeated just > because you don't need them fixed for Wikidata. > I'm not expecting any of you to spend a pile of time writing a UI because > it's missing. But I do expect that if we have a good idea what the optimal > database schema and usage of the feature is that you'd make a tiny effort to > include the fixes that Wikidata doesn't explicitly need. Instead of > rewriting it using a non-optimal format and forcing someone else to rewrite > stuff again. >

I agree one billion percent with everything you've said here, and it's the *exact* point I've been trying to make all along.

I have no qualms with people trying to fix this--it's something that needs to be fixed and has been on my todo list for far longer than it should've been. But if it's going to be fixed/rewritten, time should be taken so it is done properly.