Mon, Aug 19

My latest patchset on the change above is just a draft implementing some of the thoughts so far, so that we have a place to start from when we finalize our thinking on privacy here. It implements the following:

I could use some collaboration on the list of countries to blacklist. The paper Nuria mentions includes: China, Cuba, Egypt, Indonesia, Iran, Kazakhstan, Pakistan, Russia, Saudi Arabia, South Korea, Syria, Thailand, Turkey, Uzbekistan, and Vietnam. But the reasons for censorship differ quite a bit from country to country, and not all of them seem to need blacklisting. I took a guess at a first draft of the blacklist, but honestly I'm not sure. The governments in those regions, not just those specific countries, seem pretty troubling to me, and I don't have enough knowledge to know when something goes from troubling to dangerous.
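To make the idea concrete, here's a minimal sketch of how a blacklist could be applied when publishing per-country counts. The country codes below are purely illustrative, not a proposed final list, and the function name is made up:

```python
# Sketch: withhold per-country editor counts for blacklisted countries.
# The ISO codes below are illustrative examples only, NOT a proposed list.
BLACKLIST = {"CN", "IR", "SA", "SY"}

def redact(rows):
    """Replace counts for blacklisted countries with None so they are
    published as explicitly withheld rather than silently omitted."""
    return [
        (country, None if country in BLACKLIST else count)
        for country, count in rows
    ]
```

Publishing an explicit "withheld" marker (rather than dropping the row) avoids leaking information through a country's absence from the dataset.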

@Yair_rand, that's what we're trying to prevent, yes. The value of the data is great, and the risk will be minimized as much as possible. As Asaf points out above, we have had this conversation for a very long time. Our legal and security teams have thought about the potential danger of this dataset and signed off on us publishing it. Nevertheless, I personally would like to protect this dataset as much as possible and that's why I'm looking into how to make it harder to determine the country of specific editors. Does that make sense? Do you have additional concerns?

Sun, Aug 18

Thanks very much. I also would much prefer not to rehash the conversations we've already had. We're ready to release this data, and the work we're doing to find the best way to release it is just preparation until the privacy framework is ready. We just want to be able to justify our decisions. Bucketing and blacklisting seem like they fit into the privacy framework drafts I've seen so far, and I'll take some time during paternity leave to fill out that rationale if needed. So, here's where we are so far:

Fri, Aug 16

Quick status update. I am currently evaluating ways to release this data. This is just while we wait for our privacy framework to be finished. As soon as that's done, we can evaluate our possible solutions here and execute the release fairly quickly.

Right, but this task description mentions the CheckUser change, so either this description should be updated or subtasks should be added here, no? Otherwise how would people like me know where to look?

@MarcoSwart it hasn't changed, but maybe spider traffic has been steadily rising. There's also an unrelated but confusing bug where the time period on the dashboard changes by itself; the fix for that is being deployed soon.

We just want to be involved if you want to whitelist data to be kept for more than 90 days. Other than that, you don't need any approval from us. We're happy to look over your schema and advise on how easily the data could be loaded into, for example, Druid.

I looked at four hours of webrequest data around the time of Petr's post, and I couldn't see any 403s to wikidata.org/w/api.php. If someone knows when the error should show up, it would be very easy to find in the webrequest table:
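For illustration, this is roughly the kind of query I mean, wrapped in a small Python helper. The column and partition names (uri_host, uri_path, http_status, year/month/day/hour) follow the webrequest schema as I understand it, and the helper itself is made up:

```python
from datetime import datetime

# Sketch: build a HiveQL query against wmf.webrequest to find 403s to
# the Wikidata API in a given hour. Adjust field names if the schema
# differs from what's assumed here.
def forbidden_api_query(ts: datetime) -> str:
    return f"""
        SELECT dt, uri_host, uri_path, http_status
        FROM wmf.webrequest
        WHERE year = {ts.year} AND month = {ts.month}
          AND day = {ts.day} AND hour = {ts.hour}
          AND uri_host = 'www.wikidata.org'
          AND uri_path = '/w/api.php'
          AND http_status = '403'
    """
```

Restricting on the year/month/day/hour partition columns keeps the scan to a single hour of data instead of the whole table.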

Jul 29 2019

@leila: we can of course iterate on the format in the future. Eventually we'll have a public API to query the whole dataset. But for now we just want some idea of common / high priority use cases that we can try to serve with a simpler release. Thank you so much for looking into it.

Quick question: the description shows plans to refactor cu_changes.cuc_comment to cuc_comment_id. The activity on this task seems to have stopped and this refactor doesn't seem to have happened yet. Will it happen eventually or has it been abandoned?

Jul 26 2019

So, I looked into the code history more carefully. There's exactly one code change in AQS in 2019, and it doesn't touch pageview handling at all. npm saw fit to update some of the repository references for kad, swagger-ui, and json-stable-stringify. I suppose we could look into those, but it would be pretty bad luck if they were the cause. I think the logical next place to look is the layer in front of AQS; the problem is 99% likely to be coming from there. Pinging @Pchelolo to see if this sounds familiar. Petr, basically we're seeing a lot more 429s since around April 2019, and we see two different kinds:

I think that's right; we have a pretty good working group of the people who care about this. If they agree, I don't see much need for an RfC, since there would be too much context for someone else to catch up on. So I would vote to close this as invalid. As for requirements/constraints, I think the phab tasks describe those for now, and we'll document them as we go forward (there's still thinking/testing to do on what the stack should be).

@daniel sorry I didn't update the group on this, but this project is being led by Filippo and progressing nicely. There's a working version deployed in beta and progress on a production launch is tracked here: T226986. I'm not sure at this point how that interacts with this RfC. Maybe we should update it when we have a good idea what the production stack should look like.

This can be tricky to diagnose because we don't really know what, if any, upstream changes were made to Hyperswitch. Do you have a more accurate idea of when you started seeing this? Was it when you made the task, at the beginning of April this year?