Oct 14 2017

It might be worth focusing more on robustness than on simple-page latency, as robustness is the more critical issue with Electron. Previously, I tested with a few very large articles (see T142226#2537844), which exercised timeout enforcement. Testing with a simulated overload (many concurrent requests for huge pages) could also be useful to ensure that concurrency limits and resource usage limits are thoroughly enforced.

Oct 2 2017

See T172815 for our general thinking on robust PDF rendering based on the experience with OCG and Electron. It boils down to using a fresh render process per request & thoroughly controlling its resource consumption and maximum runtime.
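As a rough illustration of the "fresh process per request, with hard limits" idea, here is a minimal Python sketch for a POSIX system. The function name, the default limits, and the stand-in command are all hypothetical; a real renderer may need different mechanisms (for example cgroups or container limits) to bound its resource use.

```python
import resource
import subprocess
import sys

# Stand-in for a real render command (hypothetical); it just prints a marker.
EXAMPLE_COMMAND = [sys.executable, "-c", "print('rendered')"]


def _clamp_limit(kind, wanted):
    """Set a resource limit, never asking for more than the current hard limit."""
    _, hard = resource.getrlimit(kind)
    value = wanted if hard == resource.RLIM_INFINITY else min(wanted, hard)
    resource.setrlimit(kind, (value, value))


def render_in_fresh_process(command, timeout_seconds=60.0,
                            max_memory_bytes=1 << 30, max_cpu_seconds=30):
    """Run one render in its own process; return its stdout, or None on failure."""

    def apply_limits():
        # Runs in the child just before exec. Note that preexec_fn is not
        # safe to combine with threads in the parent process.
        _clamp_limit(resource.RLIMIT_AS, max_memory_bytes)
        _clamp_limit(resource.RLIMIT_CPU, max_cpu_seconds)

    try:
        result = subprocess.run(
            command,
            capture_output=True,
            timeout=timeout_seconds,  # wall-clock cap; the child is killed on expiry
            preexec_fn=apply_limits,
        )
    except subprocess.TimeoutExpired:
        return None
    if result.returncode != 0:
        return None
    return result.stdout
```

Because every request gets a new process, a render that leaks memory or hangs affects only that request, and the parent can report the failure immediately.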

Sep 29 2017

Queues caused many of the issues with OCG. I would really advise you to stick to a simple stateless HTTP service. Such a service offers sane error handling, provides back-pressure, integrates well with caching and rate / concurrency limiting infrastructure, and is easy to test and reason about. Once you add a queue & separate request from response, you lose all of this.
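To make the back-pressure point concrete, here is a small Python sketch of a concurrency limiter that a stateless handler could use. The class name, status bodies, and limit are illustrative assumptions, not taken from any existing service. When all slots are busy, the caller gets an immediate 503 instead of being queued.

```python
import threading


class ConcurrencyLimiter:
    """Admit at most max_concurrent requests; reject the rest immediately."""

    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def handle(self, render):
        """Call render() if a slot is free and return (status, body)."""
        # Non-blocking acquire: no hidden queue, so overload is visible
        # to the client right away and it can back off or retry.
        if not self._slots.acquire(blocking=False):
            return 503, b"busy, retry later"
        try:
            return 200, render()
        except Exception:
            return 500, b"render failed"
        finally:
            self._slots.release()
```

Since the response carries the outcome of the same request, errors and overload surface directly to the client and to any caching or rate-limiting layer in front of the service.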

Sep 26 2017

In today's team sync meeting, we briefly touched on the possibility of combining the migration to the restriction table with the move to Cassandra 3. I think combining the two is attractive, as it lets us leverage the parallel double write / read testing we are doing anyway to test the new restriction storage as well. Doing this migration also lets us drop the revision table, in favor of action API requests for the few direct requests to the /title/{title} endpoint (T158100).

If I recall correctly, ResourceLoader client code on desktop already looks at the list of modules needed in a given page, checks client-side caches, fetches the remaining modules from the RL API (in a single call), and caches those modules separately in localStorage. Given that this discussion makes no reference to this, I am getting the impression that my understanding might be wrong. Could you clarify?

Sep 13 2017

Given the useful information we have in this task, I am proposing to widen the scope beyond the first job, towards generally coordinating the order of migrating individual jobs. @mobrovac, does that sound reasonable to you?

We briefly discussed this during today's sync meeting. While there are ways to set up targeted processing priorities for specific jobs (by wiki, type, or other criteria), we realized that there will likely be less of a need for this in the new setup. The Redis job queue divides processing throughput evenly between projects. This makes it relatively likely for individual projects to accumulate large backlogs, which would then need manual intervention (re-prioritization) to address.

Sep 12 2017

As far as I can tell, the page image(s) are handled as part of deferred linksUpdate processing. This means that the updates would be executed after the main web request, but on the same PHP thread that handled the original edit request.

Considering the scalability limits of Cassandra's schema synchronization we see in production, I think it would be good to reduce the number of storage groups more aggressively. Perhaps something like this?

I believe it was the pageimages designation for those articles I mentioned above. I am not exactly sure what happened on wiki, since the revisions have been deleted from public archives (and I don't have permission to view them).

@Ottomata, from a cursory look at those connectors, it looks like they all aim to capture all SQL updates (update, insert, delete). They don't seem to be targeted at emitting specific semantic events, such as the ones we are interested in for EventBus. This is where the SQL comment idea could help, by letting us essentially embed the events we want to have emitted in the statement, rather than trying to reverse-engineer an event from raw SQL statement(s).

In terms of document structure, the behavior in line two (add section around <div>-wrapped heading) seems to make sense. I think it also matches edit section behavior, which should ignore the <div> completely (as it is not DOM-based).

From a practical perspective, I think the biggest question is how common clients behave these days when must-revalidate is omitted and the client cache timeout expires. My memory on this is rather foggy, but I *think* that in the dark ages, behavior in that area was inconsistent, with early IE versions not re-validating even when they were online. If we can verify that all browsers we care about do the right thing (check as if must-revalidate were set when connected), then dropping must-revalidate in the headers would be harmless.

This proposed optimization is similar to something I implemented in Parsoid's HTML5 serializer. In that case, we switch between single & double quotes for HTML attributes depending on whether the attribute value contains more single quotes or double quotes. This had a very significant impact on Parsoid HTML size, mainly because it has many JSON values embedded in attributes.
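The quote-switching idea can be sketched in a few lines of Python. This is an illustration of the technique only, not Parsoid's actual serializer code, and the function name is made up for the example.

```python
def serialize_attribute(name: str, value: str) -> str:
    """Serialize an HTML attribute, picking the delimiter that needs fewer escapes."""
    # Ampersands are escaped regardless of the delimiter chosen.
    escaped = value.replace("&", "&amp;")
    if value.count('"') > value.count("'"):
        # More double quotes than single quotes: delimit with single
        # quotes, so only the (rarer) single quotes need escaping.
        return f"{name}='" + escaped.replace("'", "&#39;") + "'"
    # Otherwise use the conventional double-quote delimiter.
    return f'{name}="' + escaped.replace('"', "&quot;") + '"'
```

For a JSON value such as `{"a":"b"}`, the single-quoted form leaves all four double quotes unescaped, whereas the double-quoted form would turn each of them into `&quot;`.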

Sep 7 2017

Facebook actually heavily relies on SQL comments to pass event information to binlog tailer daemons (see the TAO paper). We currently use those SQL comments only to mark the source of a SQL query (PHP function), but could potentially add some annotations that would make it easy to generically extract & export such events into individual Kafka topics.
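To illustrate what such annotations might look like on the consuming side, here is a Python sketch that pulls an event out of a SQL comment. The `/* event: {...} */` format, the topic name, and the function name are all hypothetical; nothing here reflects an existing convention.

```python
import json
import re
from typing import Optional

# Hypothetical annotation format: /* event: {"topic": "...", ...} */
EVENT_COMMENT = re.compile(r"/\*\s*event:\s*(\{.*?\})\s*\*/", re.DOTALL)


def extract_event(statement: str) -> Optional[dict]:
    """Return the event embedded in a SQL statement's comment, if any."""
    match = EVENT_COMMENT.search(statement)
    if match is None:
        # Ordinary comments (e.g. the calling PHP function) are ignored.
        return None
    try:
        return json.loads(match.group(1))
    except json.JSONDecodeError:
        # Malformed annotation: treat as "no event" rather than failing.
        return None


statement = (
    "UPDATE page SET page_latest = 42 WHERE page_id = 7 "
    '/* event: {"topic": "revision-create", "page_id": 7} */'
)
event = extract_event(statement)
```

A binlog tailer could route each extracted event to a Kafka topic based on its `topic` field, without having to interpret the SQL itself. One limitation of this sketch is that the JSON payload must not contain the sequence `*/`.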

I don't have strong views on how to scale metrics and log collection. In any case, we have been doing this remotely for a while now (using standard formats like gelf for logs), so whether things are aggregated per pod or more centrally doesn't make a big difference to the services themselves.