This is definitely something we want to work on after the Cirrus->Kafka work is done.

This is done. We need a UDF that marks a search request as returning zero or nonzero results. Then we can just aggregate by wikiid and zero_results.

We still have no idea how we're going to visualise that many pairings in a satisfactory way (other than infinitely long sets of dropdowns).

I was thinking of doing something similar to the "Tile by zoom level" (http://discovery.wmflabs.org/maps/#tiles_total_by_zoom) where the user can choose an arbitrary combination of zoom levels to visualize simultaneously. So we could have two of those and let the user select arbitrary pairs.

Okay. So I've got a query that works and gets what we want. Problem: we have A LOT of wikis. Specifically, for 2015-11-10, the query returns nonzero/zero result counts for 840 wikis! That means the dataset containing these aggregates is going to grow by ~840 rows every day (roughly 300,000 rows a year). That's... not good.

Do we want to limit this to specific wikis? Daily top 100? Daily top 10? Here are the top 20 wikis for 2015-11-10 by # of nonzero-result queries:

USE ebernhardson;
SELECT
  wikiid,
  SUM(results.outcome) AS nonzero,
  COUNT(*) - SUM(results.outcome) AS zero
FROM (
  -- 1 if the final request in each request set returned any hits, else 0
  SELECT
    wikiid,
    IF(requests.hitstotal[SIZE(requests.querytype) - 1] > 0, 1, 0) AS outcome
  FROM cirrussearchrequestset
  WHERE year = 2015 AND month = 11 AND day = 10
) AS results
GROUP BY wikiid
ORDER BY nonzero DESC
LIMIT 20;
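To make the "daily top N" idea concrete outside of Hive, here's a small Python sketch of the same aggregation (the input shape and names are hypothetical, just mirroring the query above): count zero vs. nonzero outcomes per wiki, then keep only the top N wikis by nonzero-result queries.

```python
from collections import defaultdict

def top_wikis(requests, n=20):
    """Aggregate per-wiki zero/nonzero result counts and keep the top n
    wikis by number of nonzero-result queries.

    `requests` is an iterable of (wikiid, hits_total) pairs, where
    hits_total is the hit count of the final query in a request set.
    """
    counts = defaultdict(lambda: {"nonzero": 0, "zero": 0})
    for wikiid, hits_total in requests:
        outcome = "nonzero" if hits_total > 0 else "zero"
        counts[wikiid][outcome] += 1
    # Rank by nonzero count, descending, and truncate to the top n.
    ranked = sorted(counts.items(),
                    key=lambda kv: kv[1]["nonzero"],
                    reverse=True)
    return ranked[:n]
```

Capping at n would keep the daily growth of the aggregate dataset to n rows instead of ~840, at the cost of losing the long tail of small wikis.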

To answer the actual question I was asked: it might be good to have the top n projects (for, say, n=3) on the dashboard somewhere, but the question is... where? Clutter is bad, so we need to be careful about throwing in more data just because we can.

This would be a good topic of discussion for the Analysis meeting this afternoon.

We've solved some of the questions about visualisations here, because we did something very similar with the portal dashboards in T123347: Include geolocation data in portal dashboards. So, given that, this can be reprioritised because there aren't as many outstanding product questions.

This still represents a substantial amount of engineering work, though.

@Deskana have fun and let us know if you run into any problems. If you're happy with it after a few days (or a week?) we'll push it out to production.

I'm personally not satisfied with the performance hit at startup (caused by reading in the two new datasets, which are substantially larger than the others we have), but there's also not much we can do about that. It's just going to be a slow initial experience for whoever is the first person to open the dashboard on any given day. I wonder if we should move this out of the metrics dashboard and into its own "experimental" dashboard (where the forecasting dash lives). That way Dan and others can still use it but without it having an impact on the main dashboard. @Ironholds, thoughts?

I'm not sure if moving is necessarily the solution. Like, this should eventually live in those dashboards.

Do we gain anything if we do all the processing server-side? For instance, we could output both a flat TSV for transparency/reproducibility purposes and a serialised .RData file in which all the computations have already been done, and have the dashboard rely on the .RData. It should be much faster to load.