Close votes as a distinct type of flag (previously they were stored as votes)

However, none of this is in the public data dump. There are bits and pieces - suggested edits predate the new review system, and duplicate links can be obtained from PostHistory with a bit of trouble - but not enough for any serious analysis.

Therefore, I recommend the following information be added to the public data dumps and data explorer:

Close and reopen flags

As with close votes, this information should have userIDs and exact times stripped before being made public. Suggested schema:

The question of whether or not deleted votes/flags should be included is a big one here: IMHO, it's very useful for doing historical analysis... But right now it's explicitly not included. If you can think of a good reason not to make this information public, post it below.

Links

The PostLinks table is quite useful: it contains not only duplicate links, but also ad-hoc links and even related links (the two sidebar lists next to most questions). I would like to have all of this information made available, but if this isn't feasible for some reason then ad-hoc links and duplicate links would be the most useful. Suggested schema:

Id
CreationDate
PostId
RelatedPostId
LinkTypeId (there is no LinkTypes table, but maybe there should be one)
Rank

Since all of this information is already public, I don't foresee any privacy issues with publishing it in the dump.

Review tasks

This covers everything available under /review. Nothing in this table should really be private, but I would recommend excluding information on audits for space and simplicity's sake. Suggested schema:

This is a tricky one... Most of this information is already public in some fashion, but making it easily accessible is another matter. In particular, First Post and Late Answer reviews can involve actions that are not made public: voting and flagging. Making it too easy to correlate a FP/LA review with specific events on a post could lead to trouble. Therefore, I recommend the following:

As with votes, omit the ID of the user reviewing.

As with votes, omit the exact time that the review took place

Omit any detailed information on the actions that contributed to the review for First Posts and Late Answers reviews.

Omit audit information (if it is left out of ReviewTasks)

I may be over-cautious here... For instance, it's possible that including #1 would be fine as long as #3 was omitted. Post your thoughts below.

Account ID on Users rows

This is actually a very old request, but Tim Stone reminded me of it again tonight. This is public information already; at one point, the ability to disassociate accounts from your network profile may have made it problematic, but given we don't allow that anymore... I don't see any issue here.

It's another integer in each row in the Users table that corresponds to your network ID (mine is 620) and is identical on each site linked to your network account.

Please review the above suggestions. In particular, if you can envision a way in which the release of such data could be used to compromise information that is normally considered private (such as voting data), call it out now - much better to be safe than sorry.

For review task results, viewing the task result page in a browser already shows the users, full timestamps, and result, so I don't think it is necessary to leave that information out. Omitting the detailed information should be sufficient.
–
ughoavgfhwNov 25 '13 at 1:56

Wanna post an answer that elaborates on that a bit, @ughoavgfhw? Or just post an answer?
–
Shog9♦Nov 25 '13 at 19:59

Actually I +1'd this because I thought it was to make the user ID's public for the already available data (I clearly didn't read it in remotely enough detail). I guess that's not going to happen. I just really want to generate some (non-user-specific) stats about close voting, for which I need user-specific data. (But if a change is in the works that would provide a long-term fix for the queue size, I guess that's not really needed any more)
–
DukelingNov 28 '13 at 2:02

We don't show pending users for close votes on the site, @Dukeling, so including them in the dump is a bit sketchy; we probably could do so for reviews (as noted above) but even that isn't strictly necessary for analyzing review.
–
Shog9♦Nov 28 '13 at 2:25

@Shog9 Even just adding the user ID for votes cast on already closed questions may allow for some fairly decent stats (e.g. how many 20k+ users vote to close questions how often).
–
DukelingNov 28 '13 at 3:11

If you're only interested in already-closed questions, you can get that data now, @Dukeling - but you'll have to extract the JSON that represents it from the data dump. It's not hard, but chances are you'll only do it if you have a really good reason to want that data.
–
Shog9♦Nov 28 '13 at 3:20

Can you update us if this request is near a status-planned phase?
–
reneMar 9 '14 at 19:53

It's on the list, @rene - but see this update: there are some issues with trying to cram all of this into the data dump, so only the stuff that makes sense to publish in both the dump and SEDE has been completed so far. We'll try to move some of the more transient stuff into SEDE only at some point, but that'll take some extra work to make it happen.
–
Shog9♦Mar 9 '14 at 19:58

6 Answers
6

I don't think it is necessary to leave the user ID or full timestamp out of the review task results table. That information is already available and associated with a general result on the review task's web page, viewable from the history. For example, here is a recent review from the late answer queue (with the username changed):

Since this information is publicly available already, there is no need to leave it out of the data dump. It should be sufficient to leave out detailed information about the actions taken.

After taking a close look at the proposed schemas here, I think we need to make a few clarifications and/or adjustments.

Specifically:

For CloseReasons, we should stick with the same naming conventions as the rest of the tables currently exposed through SEDE and call it CloseReasonTypes.

CloseAsOfftopicReasons should probably also be named CloseAsOfftopicReasonTypes (say that ten times fast!), and it should explicitly include:

the userids/names of the moderators involved in creating and approving the close reasons.

all currently active and deactivated close reasons. (Unapproved close reasons should be excluded).

Any close/reopen votes, flags, links, and review tasks should be excluded if they are related to a deleted post.

Exclude deleted votes and flags (that is, flags where ResultDate is not null) on the grounds that because a lot of closed posts are also deleted, providing information that's guaranteed to be incomplete will only do a disservice to anyone trying to perform any sort of meaningful analysis on how folks vote or flag.

We should generate a LinkTypes table. Don't need one internally, really, but it'd be nice to have one in the public schema for convenience/clarity.

The CompletedReviewTask field Shog mentions will have to be calculated because there's no other way to figure out what happened to the review item once we zero out the time part of the timestamp.

As far as including information about review tasks goes, I think being more paranoid is going to pay off more in the long run. While it's true that the information like exact timestamps is available on the site, acquiring it in aggregate right now is rather painful. Exposing it via SEDE would make it far easier, and we kind of don't want to make it easier for folks to make any guesses about how others vote or act.

So it's been pointed out that a lot of this data would be effectively worthless in a public data dump, due to the fact that they're generally months out of date. In addition, XML is huge, and the Stack Overflow data dump is already very large.

Therefore, I recommend including all of the information outlined above (with Anna's clarifications and caveats) in the Stack Exchange Data Explorer, but adding only the following to the public data dump:

Links but excluding Related links

Also a LinkTypes table documenting the two types of links that are included

Account ID on Users rows

That is all. Note that I'm explicitly excluding Close and Reopen flag-votes: as Anna noted, including deleted votes gives a skewed picture of what actually happens unless flags related to deleted posts are included as well... Since most close and re-open flags are deleted, this leaves very little relevant information to include in a public, months-out-of-date dump.

It would be nice to figure out a way of including some meta-data on deleted posts, as that would enable including a much larger set of information regarding flags and reviews... However, this would be a fairly major change, and as such is out of scope for this feature request.

If the work isn't too difficult, please consider converting NULL in denormalized count fields to a definite number (usually 0). As I noted with CommentCount, some of these fields mean that the count has never been update. But the distinction is rarely useful and there are some strange edge cases that mean it's not terribly reliable. For the sake of simplifying the data and not producing misleading results, please convert nulls in these fields to 0 when exporting to the public:

Nick wrapped this up today, mostly as specified with Anna's adjustments. The one notable difference is the replacement of ReviewTaskResults.CompletedReviewTask with ReviewTask.CompletedByReviewTaskId - this is just easier to use.

It would be great if you could add a magic link for the review tasks, like we have for the post itself, comments, and users (maybe others?). Something like review.PostID [Review Link]. Or it may be necessary to specify a specific queue, so [CloseReview Link], [FirstPostReview Link], etc.

Such a "magic link" would have to look up the PostID in the ReviewTasks table and pick a task to link to; as gnat notes, there might be more than one (particularly if a specific queue wasn't specified). I tend to think this would be easier handled in the query itself.
–
Shog9♦Jan 20 '14 at 16:44