Description of the tool/project

The ReadingLists extension provides an API to store and retrieve private lists of pages, e.g. for a "bookmark" or "read it later" feature. It's written with wikifarms in mind (a list can contain pages from multiple wikis) and aims to support both browsers and mobile devices which have their own permanent storage.
For more information see https://www.mediawiki.org/wiki/Reading/Reading_Lists and T164990: RfC: Reading List service.

Main security-relevant characteristics

API only, no user interface

All operations (except the maintenance script for purging old data), including viewing, are limited to the current user; this is enforced in the DAO class

Opt-in/opt-out API endpoints; no data can be stored for a given user before opt-in, and all data is destroyed on opt-out

Description of how the tool will be used at WMF

The API will power a REST proxy (to be written; does not require review¹). Initially, the REST proxy will be used to back up the reading lists in the Android app (which already has such a feature, but with no portability). iOS and web will add support later on. purge.php will be run periodically via cron to prune old deleted data.

Dependencies

None.

Has this project been reviewed before?

No.

Working test environment

Use the vagrant role as described on the extension page.

Post-deployment

Same as pre-deployment.

Notes

¹ The REST proxy will be written as a simple patch to the existing RESTBase codebase (T168972), and won't do much beyond translating REST requests into api.php requests (and, in one case, requests to the RESTBase summary endpoint) and composing the responses into a single JSON response. It will rely on api.php for security, by forwarding cookies.

Overall: Passes the security review. However, I'm concerned about general scalability and also some of the DB queries (several of the queries do Using where; Using temporary; Using filesort). I get the impression that much of this code is assuming specific size limits on the lists; it would be nice if the scaling assumptions were more specifically documented.

I would ask that the schema/queries be approved by a DBA prior to this extension being deployed.

Misc comments

None of this is security related, just my general opinion. Feel free to ignore any general comments you disagree with. Anything related to DB queries can be ignored if a DBA says it's OK.

The reverse interwiki prefix thing sort of looks like it should actually be in core.

This seems to have many hardcoded assumptions about lists being small (e.g. reordering requires submitting the order of all items on a list, making the API unworkable if lists have more than 1000 entries; some queries have no LIMIT clause; paging uses OFFSET clauses instead of range conditions in the WHERE; some parts have no paging at all).

First of all, it seems kind of odd to design something with such limitations from the get-go. User demand for large things tends to increase over time. Users will probably want large lists eventually, and it will be much harder to add support for that later (schema changes, breaking API changes) than to do it from the beginning.

Secondly, there are actually no restrictions in place preventing lists from being arbitrarily large. If we're going to assume lists are short, then that has to actually be enforced.
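To illustrate the OFFSET-vs-range-condition point, here is a minimal Python sketch (the table, column names, and data are hypothetical stand-ins, not the extension's actual schema). OFFSET paging makes the server walk past all skipped rows on every page, while a range condition on an indexed key lets it seek directly to the continuation point:

```python
# Sketch: OFFSET paging vs. keyset (range-condition) paging.
# Rows are (rle_id, title); rle_id stands in for an indexed primary key.
rows = [(i, f"Page_{i}") for i in range(1, 26)]  # stand-in for a DB table

def page_by_offset(offset, limit):
    # Roughly: SELECT ... ORDER BY rle_id LIMIT <limit> OFFSET <offset>
    # The server still scans past the first <offset> rows each time.
    return rows[offset:offset + limit]

def page_by_range(last_id, limit):
    # Roughly: SELECT ... WHERE rle_id > <last_id>
    #          ORDER BY rle_id LIMIT <limit>
    # An index on rle_id lets the server seek straight to the start.
    return [r for r in rows if r[0] > last_id][:limit]

first = page_by_range(0, 10)
second = page_by_range(first[-1][0], 10)  # continue from the last seen id
assert second == page_by_offset(10, 10)  # same page, cheaper to produce
```

Both styles return the same pages; the difference is that the range-condition version stays cheap no matter how deep into the list the client pages.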

No validation of entries (illegal titles such as {{foo}} are allowed).

No namespace support for titles. Is the assumption that people will only use main-namespace pages (which seems like an odd assumption)? Or is the assumption that namespaces will be stored in textual form (e.g. page title 'Project:Foo')? In that case, ?generator=readinglistentries doesn't work properly. Additionally, not all titles can be stored in such a scheme, as there need to be 255 bytes for the name part of the title plus space for the namespace part.
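A quick sketch of the field-length concern (the 255-byte figure comes from the comment above; the column type and helper are hypothetical): column limits are in bytes, so a title that fits in 255 characters can still overflow once a namespace prefix is added and the string is UTF-8 encoded.

```python
FIELD_MAX_BYTES = 255  # assumed byte limit of the title column

def fits_in_field(prefixed_title: str) -> bool:
    # DB column limits are in bytes, not characters, so multi-byte
    # UTF-8 characters count more than once toward the limit.
    return len(prefixed_title.encode("utf-8")) <= FIELD_MAX_BYTES

assert fits_in_field("Project:" + "a" * 247)        # exactly 255 bytes
assert not fits_in_field("Project:" + "a" * 248)    # 256 bytes: too long
assert not fits_in_field("Project:" + "ü" * 130)    # 8 + 260 bytes
```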

No validation of project names. Inserting an invalid project seems like it should be an error. It's also kind of odd to use domain names; almost all other cross-wiki extensions use wiki IDs.

The interwiki lookup scheme is not robust against people running multiple wikis on a single domain (not sure if that matters).

If there are no local interwiki prefixes, a bunch of PHP warnings are generated on ?action=query&generator=readinglistentries.

I'm getting some "Expectation (masterConns <= 0) by ApiMain::setRequestExpectations not met" warnings as well from the API query modules.

Has @jcrespo approved the SQL parts? Some of the queries do Using where; Using temporary; Using filesort. Since reading lists are meant to be small, this is not as bad as it usually is, but it still seems less than ideal, and in the case I looked at, rather minor changes to the schema would fix the query.

The fact that ?action=query&meta=readinglists will potentially return a random order for the order field when no specific order is set on list items (ordering by a null field), with no indication that the elements are unordered, is kind of odd. I would normally expect it to default to date added or something similar.
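A fallback sort key along those lines could look like this Python sketch (entry tuples and field names are made up for illustration; in SQL terms this is roughly ordering by COALESCE of the explicit order and the date added):

```python
from datetime import date

# Stand-in list entries: (title, explicit_order, date_added);
# explicit_order is None when the user never reordered the list.
entries = [
    ("B", None, date(2017, 3, 1)),
    ("A", None, date(2017, 1, 1)),
    ("C", 1,    date(2017, 2, 1)),
]

# Sort by the explicit order where set, falling back to date added,
# so unordered entries come back in a stable, predictable sequence.
ordered = sorted(entries, key=lambda e: (e[1] is None, e[1] or 0, e[2]))
assert [e[0] for e in ordered] == ["C", "A", "B"]
```

The key point is that the fallback makes the result deterministic; whether explicitly ordered entries sort before or after the fallback-ordered ones is a design choice.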

Doesn't properly truncate the description, colour, etc. fields before inserting into the DB. We shouldn't rely on MySQL's silent truncation behavior, and it's also important to truncate carefully to avoid dangling Unicode characters. Instead of truncating, it may make sense to return an API error if a field is too long.
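On the dangling-character point: cutting a string at a raw byte boundary can split a multi-byte UTF-8 sequence, leaving a broken partial character at the end. A minimal byte-safe truncation sketch (the function name and limits are illustrative, not the extension's code):

```python
def truncate_utf8(text: str, max_bytes: int) -> str:
    # Encode, cut at the byte limit, then decode while discarding any
    # partial multi-byte sequence left dangling at the cut point.
    return text.encode("utf-8")[:max_bytes].decode("utf-8", errors="ignore")

# 'é' is 2 bytes in UTF-8; a naive byte cut at 3 would split it.
assert truncate_utf8("aéb", 3) == "aé"       # cut lands after a full 'é'
assert truncate_utf8("ab" + "é", 3) == "ab"  # partial 'é' dropped cleanly
```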

IMO the RESTBase part doesn't require a security review: it's basically just a web proxy that converts between slightly different URL styles.

I get the impression that much of this code is assuming specific size limits on the lists; it would be nice if the scaling assumptions were more specifically documented.
(...)

This seems to have many hardcoded assumptions about lists being small (e.g. reordering requires submitting the order of all items on a list, making the API unworkable if lists have more than 1000 entries; some queries have no LIMIT clause; paging uses OFFSET clauses instead of range conditions in the WHERE; some parts have no paging at all).

First of all, it seems kind of odd to design something with such limitations from the get-go. User demand for large things tends to increase over time. Users will probably want large lists eventually, and it will be much harder to add support for that later (schema changes, breaking API changes) than to do it from the beginning.

Secondly, there are actually no restrictions in place preventing lists from being arbitrarily large. If we're going to assume lists are short, then that has to actually be enforced.

Internally, the order operations are the only ones with no paging; I should probably fix that. In the API, exposing paging for those would require support for the kind of two-level continuation that you get when combining list/generator and prop modules; I would rather not go there if I don't have to.

Very theoretically, different projects could have different rules for illegal titles (e.g. by customizing $wgLegalTitleChars), so this is hard to check due to the cross-wiki nature of the project.
In practice the security benefits probably make it worth ignoring that corner case. I'll add a check.

No namespace support for titles. Is the assumption that people will only use main-namespace pages (which seems like an odd assumption)? Or is the assumption that namespaces will be stored in textual form (e.g. page title 'Project:Foo')? In that case, ?generator=readinglistentries doesn't work properly. Additionally, not all titles can be stored in such a scheme, as there need to be 255 bytes for the name part of the title plus space for the namespace part.

I assumed textual form. Apps should be able to generate API requests from the list data they get, and that's difficult with namespace numbers. And the extension cannot turn numbers into namespaces on the fly, since it would have to know the namespace config of all the wikis for that. I think I tested the generator with non-mainspace titles and it worked fine, but I'll recheck.
Good point about the field length.

No validation of project names. Inserting an invalid project seems like it should be an error. It's also kind of odd to use domain names; almost all other cross-wiki extensions use wiki IDs.

Not sure if there is a way to tell what's valid in general. I guess a configurable callback could work.
The domain name is used by the client to construct API calls; clients are generally unable to translate wiki IDs to domain names. Internally, the projects will eventually have their own lookup table to preserve database space; wiki IDs could easily be added through that if there is a use case. (Maybe that should be turned into a more generic have-info-about-all-sites-of-a-wikifarm-in-the-DB thing and moved to core?)

The interwiki lookup scheme is not robust against people running multiple wikis on a single domain (not sure if that matters).

It seemed like too much effort to do that robustly when chances are no one will ever use it. The lookup class was made a service so that it's easy to replace with an implementation specific to a local wikifarm.

If there are no local interwiki prefixes, a bunch of PHP warnings are generated on ?action=query&generator=readinglistentries.

Will fix.

I'm getting some "Expectation (masterConns <= 0) by ApiMain::setRequestExpectations not met" warnings as well from the API query modules.

That means unclosed connections, but the code doesn't do any kind of low-level DB interaction, so I'm not sure what to do about that...

I would ask that the schema/queries be approved by a DBA prior to this extension being deployed.
(...)

Some of the queries do Using where; Using temporary; Using filesort. Since reading lists are meant to be small, this is not as bad as it usually is, but it still seems less than ideal, and in the case I looked at, rather minor changes to the schema would fix the query.

With the sortkeys being in a separate table, I don't think there is a way to avoid filesorts. Due to the (to-be-implemented) limits, they should happen on fairly small result sets, as you say.

The fact that ?action=query&meta=readinglists will potentially return a random order for the order field when no specific order is set on list items (ordering by a null field), with no indication that the elements are unordered, is kind of odd. I would normally expect it to default to date added or something similar.

Yeah, that would be very confusing. I should probably just make sure the order is set on insert, but having a fallback sortfield certainly wouldn't hurt.

Doesn't properly truncate the description, colour, etc. fields before inserting into the DB. We shouldn't rely on MySQL's silent truncation behavior, and it's also important to truncate carefully to avoid dangling Unicode characters. Instead of truncating, it may make sense to return an API error if a field is too long.

IMO the RESTBase part doesn't require a security review: it's basically just a web proxy that converts between slightly different URL styles.

That's fair, I just wanted to be clear on what parts I reviewed and what parts I didn't.

Very theoretically, different projects could have different rules for illegal titles (e.g. by customizing $wgLegalTitleChars), so this is hard to check due to the cross-wiki nature of the project.
In practice the security benefits probably make it worth ignoring that corner case. I'll add a check.

It does get complicated, though, if you do normalization, e.g. Wiktionary allowing lowercase first letters. In practice it doesn't matter too much if titles are not validated, as long as you don't directly use Title::makeTitle() anywhere, but it would be nice if possible. (At the moment the code does use Title::makeTitle in the generator. It should be switched to Title::makeTitleSafe() if the DB entries aren't validated & normalized before insertion.)

Not sure if there is a way to tell what's valid in general. I guess a configurable callback could work.
The domain name is used by the client to construct API calls; clients are generally unable to translate wiki IDs to domain names. Internally, the projects will eventually have their own lookup table to preserve database space; wiki IDs could easily be added through that if there is a use case. (Maybe that should be turned into a more generic have-info-about-all-sites-of-a-wikifarm-in-the-DB thing and moved to core?)

I was thinking of doing the ReverseInterwikiLookup thing and rejecting the project if it can't find the relevant interwiki code. Alternatively, you could look at the entries in WikiMap::getCanonicalServerInfoForAllWikis().

I'm getting some "Expectation (masterConns <= 0) by ApiMain::setRequestExpectations not met" warnings as well from the API query modules.

That means unclosed connections, but the code doesn't do any kind of low-level DB interaction, so I'm not sure what to do about that...

I think maybe it's because $this->dbw = wfGetDB( DB_MASTER ); is called even if it's unused. I'm unsure whether that opens an actual connection to the database or whether the connection is only opened once you actually use the DB reference.

I would ask that the schema/queries be approved by a DBA prior to this extension being deployed.

Some of the queries do Using where; Using temporary; Using filesort. Since reading lists are meant to be small, this is not as bad as it usually is, but it still seems less than ideal, and in the case I looked at, rather minor changes to the schema would fix the query.

With the sortkeys being in a separate table, I don't think there is a way to avoid filesorts. Due to the (to-be-implemented) limits, they should happen on fairly small result sets, as you say.

TBH, I'm not sure I understand the benefits of having a separate table here. I guess it's to use a single INSERT instead of many UPDATE queries, but I'm not sure avoiding many UPDATE queries really matters much. In any case, I'm OK with this provided the DBAs are, but I'd like them to explicitly say they are OK with the extension doing queries involving (small) filesorts and temporary tables, as that's usually frowned upon AFAIK.

> This seems to have many hardcoded assumptions about lists being small (e.g. reordering requires submitting the order of all items on a list, making the API unworkable if lists have more than 1000 entries; some queries have no LIMIT clause; paging uses OFFSET clauses instead of range conditions in the WHERE; some parts have no paging at all).

I specifically said I will block any deployment to production if lists get larger than 100 items/rows, as no hardware has been allocated for this functionality. I still stand by that statement, whether or not it has been ignored by the development team.

I specifically said I will block any deployment to production if lists get larger than 100 items/rows, as no hardware has been allocated for this functionality. I still stand by that statement, whether or not it has been ignored by the development team.

Reading Lists are used by approximately 10.5% of users (730K / 7M active users).
Median pages per list: 3
Average pages per list: 17
Approximate total number of lists: 1,059,354
Approximate number of lists per user: 1.45
Approximately 77% of users who use Reading Lists have one list. Most users have only a single digit number of entries in a list.

Additionally, we have found that there are two types of users (and the limits are designed around these use cases):

Users with a low number of items per list AND a low number of lists

Users with one large list (definitely more than 100 items)

While deciding whether to order new hardware we also thought about the roll out strategy:

Beta Android Q2

Production Android Q2

Beta iOS Q2

Production iOS Q3

Web FY2018-19 Q1

The total MAU count on Android and iOS is around 10M, so this is rather small user-wise. And Web will not come until next FY… and we can get new hardware then.

Given this information, it didn't seem like limits of 100/1000 would be that burdensome to the existing hardware. Is there something you are looking at that is concerning about 1000 vs 100?

I would like to get a handle on real practical limits on what existing hardware can handle and what the effects of 100 vs 1000 would be. Would it be helpful if we measured server load using the Android beta process?

Can we log reads and writes of small vs. large lists while it is in beta on Android? We would have a lot of control there and may be able to answer some questions you have about performance. If we can find the bottlenecks, we are more than willing to find workable solutions.

@Bawolff: in case you do want to look at the RESTBase part after all, it is in the reading_lists_beta branch, with 9ef2b5fd and 32131948 being the main change. I think there are only two aspects which have relevance for security: caching headers (since all data is private) and cookie forwarding (cf. T179553: Cookies should not be forwarded to different domains; the enumeration of known domains in RESTBase config, the network restrictions of the scb cluster, and the static list of projects introduced in a02d6eba are three independent guarantees that the cookies do not get sent to an insecure location).

All the non-security issues mentioned here have been fixed:

This seems to have many hardcoded assumptions about lists being small (e.g. reordering requires submitting the order of all items on a list, making the API unworkable if lists have more than 1000 entries; some queries have no LIMIT clause; paging uses OFFSET clauses instead of range conditions in the WHERE; some parts have no paging at all).

First of all, it seems kind of odd to design something with such limitations from the get-go. User demand for large things tends to increase over time. Users will probably want large lists eventually, and it will be much harder to add support for that later (schema changes, breaking API changes) than to do it from the beginning.

After T180402: Remove custom ordering from ReadingLists extension, allowing larger lists should not cause any performance problems. (The only operation whose complexity still scales with the limit size is counting the number of lists/items to enforce the limits. That's unavoidable and shouldn't be too bad.)

Secondly, there are actually no restrictions in place preventing lists from being arbitrarily large. If we're going to assume lists are short, then that has to actually be enforced.

No namespace support for titles. Is the assumption that people will only use main-namespace pages (which seems like an odd assumption)? Or is the assumption that namespaces will be stored in textual form (e.g. page title 'Project:Foo')? In that case, ?generator=readinglistentries doesn't work properly. Additionally, not all titles can be stored in such a scheme, as there need to be 255 bytes for the name part of the title plus space for the namespace part.

The fact that ?action=query&meta=readinglists will potentially return a random order for the order field when no specific order is set on list items (ordering by a null field), with no indication that the elements are unordered, is kind of odd. I would normally expect it to default to date added or something similar.

Doesn't properly truncate the description, colour, etc. fields before inserting into the DB. We shouldn't rely on MySQL's silent truncation behavior, and it's also important to truncate carefully to avoid dangling Unicode characters. Instead of truncating, it may make sense to return an API error if a field is too long.

Some of the queries do Using where; Using temporary; Using filesort. Since reading lists are meant to be small, this is not as bad as it usually is, but it still seems less than ideal, and in the case I looked at, rather minor changes to the schema would fix the query.

Well, it's either that or using lots of indexes, so I switched to the latter. There are still some filesorts, but I don't think they can be avoided, and they are on small result sets. T182053 and the commit message for ad1edc63 have the details.

I'll close this for housekeeping; the actual security review happened in T174126#3633349, and as far as I can see all the non-security concerns have been addressed as well. Please reopen if you think something was left out.