One source of complexity & JavaScript use on gwern.net is the use of Google AdSense advertising to insert banner ads. In considering design & usability improvements, removing the banner ads comes up every time as a possibility, as readers do not like ads, but such removal comes at a revenue loss and it’s unclear whether the benefit outweighs the cost, suggesting I run an A/B experiment. However, ads might be expected to have broader effects on traffic than individual page reading times/bounce rates, affecting total site traffic instead through long-term effects on or spillover mechanisms between readers (eg social media behavior), rendering the usual A/B testing method of per-page-load/session randomization incorrect; instead it would be better to analyze total traffic as a time-series experiment.

Design: A decision analysis of revenue vs readers yields an maximum acceptable total traffic loss of ~3%. Power analysis of historical gwern.net traffic data demonstrates that the high autocorrelation yields low statistical power with standard tests & regressions but acceptable power with ARIMA models. I design a long-term Bayesian ARIMA(4,0,1) time-series model in which an A/B-test running January-October 2017 in randomized paired 2-day blocks of ads/no-ads uses client-local JS to determine whether to load & display ads, with total traffic data collected in Google Analytics & ad exposure data in Google AdSense. The A/B test ran from 1 January 2017 to 15 October 2017, affecting 288 days with collectively 380,140 pageviews in 251,164 sessions.

Correcting for a flaw in the randomization, the final results yield a surprisingly large estimate of -14% traffic loss if all traffic were exposed to ads (95% credible interval: -13-16%) and an expected traffic loss of -9.7% from the subset of users without adblock, exceeding my decision threshold for disabling ads and rendering further experimentation profitless.

Thus, banner ads on gwern.net appear to be harmful and AdSense has been removed. If these results generalize to other blogs and personal websites, an important implication is that many websites may be harmed by their use of banner ad advertising without realizing it.

One thing about gwern.net I prize, especially in comparison to the rest of the Internet, is the fast page loads & renders. This is why in my previous A/B tests of site design changes, I have generally focused on CSS changes which do not affect load times. Benchmarking loads, the total time is dominated by Google AdSense (for the medium-sized banner advertisements centered above the title) and Disqus comments. While I want comments, so the Disqus is not optional1, AdSense I keep only because, well, it makes me some money (~$30 a month or ~$360 a year; it would be more but ~60% of visitors have adblock, which is apparently unusually high for the US). So ads are a good thing to do an experiment on: it offers a chance to remove one of the heaviest components of the page, an excuse to use a decision approach, an opportunity to try applying Bayesian time-series models in JAGS/Stan, and an investigation into whether longitudinal site-wide A/B experiments are practical & useful.

This isn’t a huge amount (it is much less than my monthly Patreon) and might be offset by the effects on load/render time and people not liking advertisement. If I am reducing my traffic & influence by 10% because people don’t want to browse or link pages with ads, then it’s definitely not worthwhile.

One of the more common criticisms of the usual A/B test design is that it is missing the forest for the trees & giving fast precise answers to the wrong question; a change may have good results when done individually, but may harm the overall experience or community in a way that shows up on the macro but not micro scale.2 In this case, I am interested less in time-on-page than in total traffic per day, as the latter will measure effects like resharing on social media (especially, given my traffic history, Hacker News, which always generates a long lag of additional traffic from Twitter & aggregators). It is somewhatappreciated that A/B testing in social media or network settings is not as simple as randomizing individual users & running a t-test - as the users are not independent of each other (violating SUTVA among other things). Instead, you need to randomize groups or subgraphs or something like that, and consider the effects of interventions on those larger more-independent treatment units.

So my usual ABalytics setup isn’t appropriate here: I don’t want to randomize individual visitors & measure time on page, I want to randomize individual days or weeks and measure total traffic, giving a time-series regression.

This could be randomized by uploading a different version of the site every day, but this is tedious, inefficient, and has technical issues: aggressive caching of my webpages means that many visitors may be seeing old versions of the site! With that in mind, there is a simple A/B test implementation in JS: in the invocation of the AdSense JS, simply throw in a conditional which predictably randomizes based on the current day (something like the day-of-year (1-366) modulo 2, hashing the day, or simply a lookup in an array of constants), and then after a few months, extract daily traffic numbers from Google Analytics/AdSense and match up with randomization and do a regression. By using a pre-specified source of randomness, caching is never an issue, and using JS is not a problem since anyone with JS disabled wouldn’t be one of the people seeing ads anyway. Since there might be spillover effects due to lags in propagating through social media & emails etc, daily randomization might be too fast, and 2-day blocks more appropriate, ensuring occasional runs up to a week or so to expose longer effects while still ensuring allocation equal total days to advertising/no-advertising.3

Setting this up in JS turned out to be a little tricky since there is no built-in function for getting day-of-year or for hashing numbers/strings; so rather than spend another 10 lines copy-pasting some hash functions, I copied some day-of-year code and then simply generated in R 366 binary variables for randomizing double-days and put them in a JS array for doing the randomization:

While simple, static, and cache-compatible, a few months in I discovered that I had perhaps been a little too clever: checking my AdSense reports on a whim, I noticed that the reported daily impressions was rising and falling in roughly the 2-day chunks expected, but it was never falling all the way to 0 impressions, instead, perhaps to a tenth of the usual number of impressions. This was odd because how would any browsers be displaying ads on the wrong days given that the JS runs before any ads code, and any browser not running JS would, ipso facto, never be running AdSense anyway? Then it hit me: whose date is the randomization based on? The browser’s, of course, which is not mine if it’s running in a different timezone. Presumably browsers across a dateline would be randomized into on on the same day as others are being randomized into off. What I should have done was some sort of timezone independent date conditional. Unfortunately, it was a little late to modify the code.

This implies that the simple binary randomization test is not good as it will be substantially biased towards zero/attenuated by the measurement error inasmuch as many of the pagehits on supposedly ad-free days are in fact being contaminated by exposure to ads. Fortunately, the AdSense impressions data can be used instead to regress on, say, percentage of ad-affected pageviews.

From a decision theory perspective, this is a good place to apply sequential testing ideas as we face a similar problem as with the Candy Japan A/B test and the experiment has an easily quantified cost: each day randomized off costs ~$1, so a long experiment over 200 days would cost ~$100 in ad revenue etc. There is also the risk of making the wrong decision and choosing to disable ads when they are harmless, in which case the cost as NPV (at my usual 5% discount rate, and assuming ad revenue never changes and I never experiment further, which are reasonable assumptions given how fortunately stable my traffic is and the unlikeliness of me revisiting a conclusive result from a well-designed experiment) would be 360log(1.05)=$7378\frac{360}{log(1.05)} = \$7378, which is substantial.

On the other side of the equation, the ads could be doing substantial damage to site traffic; with ~40% of traffic seeing ads and total page-views of 635123 in 2016 (1740/day), a discouraging effect of 5% off that would mean a loss of 635123⋅0.40⋅0.05=12702635123 \cdot 0.40 \cdot 0.05 = 12702, the equivalent of 1 week of traffic. My website is important to me because it is what I have accomplished & is my livelihood, and if people are not reading it, that is bad, both because I lose possible income and because it means no one is reading my work.

How bad? In lieu of advertising it’s hard to directly quantify the value of a page-view, so I can instead ask myself hypothetically, would I trade ~1 week of traffic for $360 (~$0.02/view, or to put it another way which may be more intuitive, would I delete gwern.net in exchange for >$18720/year)? Probably; that’s about the right number - with my current parlous income, I cannot casually throw away hundreds or thousands of dollars for some additional traffic, but I would still pay for readers at the right price, and weighing the feelings, I feel comfortable valuing page-views at ~$0.02. (If the estimate of the loss turns out to be near the threshold, then I can revisit it again and attempt more preference elicitation. Given the actual results, this proved to be unnecessary.)

Thus, the decision question is whether the decrease for the ad-affected 40% of traffic is >7%; or for traffic as a whole, if the decrease is >2.8%. If it is, then I am better off removing AdSense and increasing traffic; otherwise, the money is better.

How much should we expect traffic to fall?

Unfortunately, I’m not aware of any previous research similar to my proposal for examining the effect on total traffic rather than more common metrics such as revenue or per-page engagement. I assume such research exists, since there’s a literature on everything, but I haven’t found it yet and no one I’ve asked knows where it is either; and of course presumably the big Internet advertising giants have detailed knowledge of such spillover or emergent effects, although no incentive to publicize the harms.4 There is a sparse open literature on advertising avoidance, which focuses on surveys of consumer attitudes and economic modeling; skimming, the main results appear to be that people claim to dislike advertising on TV or the Internet a great deal, claim to dislike personalization but find personalized ads less annoying, a nontrivial fraction of viewers will take action during TV commercial breaks to avoid watching ads (5-23% for various methods of estimating/definitions of avoidance, and sources like TV channels), and are particularly annoyed by ads getting in the way when researching or engaged in goal-oriented activity, and in a work context (Amazon Mechanical Turk) will tolerate non-annoying ads without demanding large payment increases (Goldstein et al 2013/Goldstein et al 2014). But while those surveys & measurements show users will do some work to avoid ads (which is supported by the high percentage of browsers with adblockers installed) and in some contexts like jobs appear to be insensitive to ads, there is little information about to what extent ads unconsciously drive users away from a publisher towards other publishers or mediums, with pervasive amounts of advertising taken for granted (see cites in Abernethy, Edwards et al 2002, & Wilbur 2016). After running this experiment, a paper was published in 2018 on a large-scale (n=34m) long-term (~2 years) individual-level advertising experiment by the Pandora streaming music service (Huang et al 2018) which found a strikingly large effect of number of ads on reduced listener frequency & worsened retention, which accumulated over time and would have been hard to observe in a short-term experiment; this roughly paralleled my results (although it doesn’t examine any potential spillover or global effects). Almost simultaneously, Mozilla (Miroglio et al 2018) conducted a longitudinal (but non-randomized, using propensity scoring) study which found that after installing adblock, adblock users experienced increases in both active time spent in the browser (+28% over [matched] controls) and the number of pages viewed (+15% over control); the latter correlate is almost identical to the per-user effect my first A/B test had yielded.

My best guess is that the effect of any advertising avoidance ought to be a small percentage of traffic, since: A/B tests in general find small effects in the percentage range (my corpus of A/B tests - or see What works in e-commerce - a meta-analysis of 6700 online experiments, Brown & Jones 2017; p-Hacking and False Discovery in A/B Testing, Berman et al 2018), ads are pervasive online, only a fraction of users are ever exposed, AdSense loads ads asynchronously in the background so it never blocks the page loading or rendering (which would definitely be frustrating & web design holds that small delays in pageloads are harmful5), Google supposedly spends billions of dollars a year on a surveillance Internet & the most cutting-edge AI technology to better model users & target ads to them without irritating them too much (eg Hohnhold et al 2015), ads should have little effect on SEO or search engine ranking (since why would search engines penalize their own ads?), I’ve seen a decent amount of research on optimizing ad deliveries to maximize revenue & avoiding annoying ads but not reducing total harm, the banner ads have never offended or bothered me much when I browse my pages with adblocker disabled as they are normal medium-sized banners centered above the <title> element where one expects an ad6, and you would think that if there were any worrisome level of harm someone would’ve noticed by now & it’d be common knowledge to avoid ads unless you were desperate for revenue. So my prior estimate is of a small effect and needing to run for a long time to make a decision at a moderate opportunity cost.

How do we analyze this? In the ABalytics per-reader approach, it was simple: we defined a threshold and did a binomial regression. But by switching to trying to increase overall total traffic, I have opened up a can of worms. Let’s look at the traffic data:

Two things jump out. The distribution of traffic is weird, with enormous spikes; doing a log-transform to tame the spikes, it is also clearly a non-stationary time-series with heavy autocorrelation as traffic consistently grows and declines. These are not surprising, as social media traffic from sites like Hacker News or Reddit are notorious for creating spikes in site traffic (and sometimes bringing them down under the load), and I would hope that as I keep writing things, traffic would gradually increase! Nevertheless, both of these will make the traffic data difficult to analyze despite having over 6 years of it.

The answer is no. We are nowhere near being able to detect it with either a t-test or the nonparametric u-test (which one might expect to handle the strange distribution better), and the log transform doesn’t help. We can hardly even see a hint of the decrease in the t-test - the decrease in the mean is ~30 pageviews but the standard deviations are ~2900 and actually bigger than the mean. So the spikes in the traffic are crippling the tests and this cannot be fixed by waiting a few more months since it’s inherent to the data.

If our trusty friend the log-transform can’t help, what can we do? In this case, we know that the reality here is literally a mixture model as the spikes are being driven by qualitatively distinct phenomenon like a gwern.net link appearing on the HN front page, as compared to normal daily traffic from existing links & search traffic7; but mixture models tend to be hard to use. One ad hoc approach to taming the spikes would be to effectively throw them out by winsorizing/clipping everything at a certain point (since the daily traffic average is ~1700, perhaps twice that, 3000):

Better but still inadequate. Even with the spikes tamed, we continue to have problems; the logged graph suggests that we can’t afford to ignore the time-series aspect. A check of autocorrelation indicates substantial autocorrelation out to lags as high as 8 days:

Autocorrelation in gwern.net daily traffic: previous daily traffic is predictive of current traffic up to t=8 days ago

The usual regression framework for time-series is the ARIMA time-series model, in which the current daily value would be regressed on by each of the previous day’s values (with an estimated coefficient for each lag, as day 8 ought to be less predictive than day 7 and so on) and possibly a difference and a moving average (also with varying distances in time). The models are usually denoted as ARIMA(days back to use as lags, days back to difference, days back for moving average). So the pacf suggests that an ARIMA(8,0,0) might work - lags back 8 days but agnostic on differencing and moving averages, respectively. R’s forecast library helpfully includes both an arima regression function and also an auto.arima to do model comparison. auto.arima generally finds that a much simpler model than ARIMA(8,0,0) works best, preferring models like ARIMA(4,1,1) (presumably the differencing and moving-average steal enough of the distant lags’ predictive power that they no longer look better to AIC). Such an ARIMA model works well and now we actually can detect convincingly our simulated effect:

One might reasonably ask, what is doing the real work, the truncation/trimming or the ARIMA(4,1,1)? The answer is both; if we go back and regenerate the ads dataset without the truncation/trimming and we look again at the estimated effect of Ads, we find it changes to

For the simple linear model with no time-series or truncation, the standard error on the ads effect is 121; for the time-series with no truncation, the standard error is 81; and for the time series plus truncation, the standard error is 11. My conclusion is that we can’t leave either one out if we are to reach correct conclusions in any feasible sample size - we must deal with the spikes, and we must deal with the time-series aspect.

So having settled on a specific ARIMA model with truncation, I can do a power analysis. For a time-series, the simple bootstrap is inappropriate as it ignores the autocorrelation; the right bootstrap is the block bootstrap: for each hypothetical sample size n, split the traffic history into as many non-overlapping n-sized chunks m as possible, select-with-replacement from them m, and run the analysis. This is implemented in the R boot library.

So the false-positive rate is preserved for both, the ARIMA requires a reasonable-looking n<70 to be well-powered, but the u-test power is bizarre - the power is never great, never going >31.6%, and actually decreasing after a certain point, which is not something you usually see in a power graph. (The ARIMA power curve is also odd but at least it doesn’t get worse with more data!) My speculation about that is that it is because as the time-series window increases, more of the spikes come into view of the u-test, making the distribution dramatically wider & this more than overwhelms the gain in detectability; hypothetically, with even more years of data, the spikes would stop coming as a surprise and the gradual hypothetical damage of the ads will then become more visible with increasing sample size as expected.

An model is an autoregressive model of order 2, or AR(2) model. This model can be written as: X_t - mu = (Beta1 * (X_t-1 - mu)) + (Beta2 * (Xt-2 - mu)) + Z_t, where X_t is the stationary time series we are studying (the time series of volcanic dust veil index), mu is the mean of time series X_t, Beta1 and Beta2 are parameters to be estimated, and Z_t is white noise with mean zero and constant variance.

This was perhaps the first time I’ve attempted to write a complex model in Stan, in this case, adapting a simple ARIMA time-series model from the Stan manual. Stan has some interesting features like the variational inference optimizer which can find sensible parameter values for complex models in seconds, an active community & involved developers & an exciting roadmap, and when Stan works it is substantially faster than the equivalent JAGS model; but I encountered a number of drawbacks (in decreasing order of importance):

Stan’s treatment of mixture models and discrete variables is… not good. I like mixture models & tend to think in terms of them and latent variables a lot, which makes the neglect an issue for me. This was particularly vexing in my initial modeling where I tried to allow for traffic spikes from HN etc by having a mixture model, with one component for regular traffic and one component for traffic surges. This is relatively straightforward in JAGS as one defines a categorical variable and indexes into it, but it is a nightmare in Stan, requiring a bizarre hack.

I defy any Stan user to look at the example mixture model in the manual and tell me that they naturally and easily understand the target/temperature stuff as a way of implementing a mixture model. I sure didn’t. And once I did get it implemented, I couldn’t modify it at all. And it was slow, too, eroding the original performance advantage over JAGS. I was saved only by the fact that the A/B test period happened to not include many spikes and so I could simply drop the mixture model entirely.

mysterious segfaults and errors under a variety of condition; once when my cat walked over my keyboard, and frequently when running multi-core Stan in a loop. The latter was a serious issue for me when running a permutation test with 5000 iterates: when I run Stan on 8 chains in parallel normally (hence 1/8th the samples per chain) in a for-loop - the simplest way to implement the permutation test - it would occasionally segfault and take down R. I was forced to reduce the chains to 1 before it stopped crashing, making it 8 times slower unless I wished to add in manual parallel processing, running 8 separate Stans.

Stan’s support for posterior predictives is poor. The manual tells one to use a different module/scope, generated_quantities, lest the code be slow, which apparently requires one to copy-paste the entire likelihood section! Which is especially unfortunate when doing a time-series and requiring access to arrays/vectors declared in a different scope… I never did figure out how to generate posterior predictions correctly for that reason, and resorted to the usual Bugs/JAGS-like method (which thankfully does work)

Stan’s treatment of missing data is also uninspiring and makes me worried about more complex analyses where I am not so fortunate as to have perfectly clean complete datasets

Stan’s syntax is terrible, particularly the entirely unnecessary semicolons. It is 2017, I should not be spending my time adding useless end-of-line markers. If they are necessary for C++, they can be added by Stan itself. This was particularly infuriating when painfully editing a model trying to implement various parameterizations and rerunning only to find that I had forgotten a semicolon (as no language I use regularly, R, Haskell, shell, Python, or Bugs/JAGS, insists on them!).

Traffic looks similar whether counting by total page views or total time reading (average-time-reading-per-session x number-of-sessions); the data is definitely autocorrelated, somewhat noisy, and I get an impression that there is a noticeable decrease in the advertising days despite the measurement error:

The tests can only handle a binary variable, so next is a quick simple linear model, and then a Bayesian regression with an autocorrelation term; both turn up a weak effect for the binary randomization, and then much stronger (and negative) for the more accurate percentage measurement:

Treating percentage as additive is also odd, as one would expect it to be multiplicative in some sense. As well, brms makes it convenient to throw in a simple Bayesian structural autocorrelation (corresponding to AR(1) if I am understanding it correctly) but the function involved does not support the higher-order lags or moving average involved in traffic.

The fully Bayesian analysis in Stan, using ARIMA(4,0,1) time-series, a multiplicative effect of ads as percentage of traffic, skeptical informative priors of small negative effects, and extracting posterior predictions (of each day if hypothetically it were not advertising-affected) for further analysis:

We see a consistent & large estimate of harm: the mean of traffic falls by -14.5% (95% CI: -0.16 to -0.13; permutation test: p=0.01) on 100% ad-affected traffic!

To more directly calculate the harm, I turn to the posterior predictions, which were computed for each day under the hypothetical of no advertising; one would expect the prediction for all days to be somewhat higher than the actual traffics were (because almost every day has some non-zero % of ad-affected traffic), and, summed or averaged over all days, that gives the predicted loss of traffic from ads:

As this is so far past the decision threshold and the 95% credible interval around -0.14 is extremely tight (-0.16 - -0.13), the EVSI of any additional sampling is surely negative & not worth calculating.

Thus, I permanently removed the AdSense banner ad in the middle of 11 September 2017.

The result is surprising. I had been expecting some degree of harm but the estimated reduction is much larger than I expected. Could banner ads really be that harmful?

The effect is estimated with considerable precision, so it’s almost certainly not a fluke of the data (if anything I collected far more data than I should’ve); there weren’t many traffic spikes to screw with the analysis, so omitting mixture model or t-scale responses in the model doesn’t seem like it should be an issue either; the modeling itself might be driving it, but the crudest tests suggest a similar level of harm (just not at high statistical-significance or posterior probability); it does seem to be visible in the scatterplot; and the more realistic models - which include time-series aspects I know exist from the long historical time-series of gwern.net traffic & skeptical priors encouraging small effects - estimate it much better as I expected from my previous power analyses, and considerable tinkering with my original ARIMA(4,0,1) Stan model to check my understanding of my code (I haven’t used Stan much before) didn’t turn up any issues or make the effect go away. So as far as I can tell, this effect is real. I still doubt my results, but it’s convincing enough for me to disable ads, at least.

Does it generalize? I admit gwern.net is unusual: highly technical longform static content in a minimalist layout optimized for fast loading & rendering catering to Anglophone STEM-types in the USA. It is entirely possible that for most websites, the effect of ads is much smaller because they already load so slow, have much busier cluttered designs, their users have less visceral distaste for advertising or are more easily targeted for useful advertising etc, and thus gwern.net is merely an outlier for whom removing ads makes sense (particularly given my option of being Patreon-supported rather than depending entirely on ads like many media websites must). I have no way of knowing whether or not this is true, and as always with optimizations, one should benchmark one’s own specific usecase; perhaps in a few years more results will be reported and it will be seen if my results are merely a coding error or an outlier or something else.

If a max loss of 14% and average loss of ~9% (both of which could be higher for sites whose users don’t use adblock as much) is accurate and generalizable to other blogs/websites , there are many implications: in particular, it implies a huge deadweight loss to Internet users from advertising; and suggests advertising may be a net loss for many smaller sites. Ironically, in the latter case, those sites may not yet have realize, and may never realize, how much the pennies they earn from advertising are costing them, because the harm won’t show up in standard single-user A/B testing due to either measurement error hiding much of the effect or because it exists as an emergent global effect, requiring long-term experimentation & relatively sophisticated time-series modeling.

There may be a connection here to earlier observations on the business of advertising questioning whether advertising works, works more than it hurts or cannibalizes other avenues, works sufficiently well to be profitable, or sufficiently well to know if it is working at all: on purely statistical grounds, it should be hard to cost-effectively show that advertising works at all (Lewis & Reiley 2008/Lewis & Rao 2014), Steve Sailer’s observation that IRI’s BehaviorScan field-experiment linking individual TV advertisements & grocery store sales likely showed little effect, eBay’s own experiments to similar effect (Lewis et al 2011, Blake et al 2014), P&G & JPMorgan digital ad cuts (and the continued success of many São Paulo-style low-advertising retailers like Aldi or MUJI), the extreme inaccuracy of correlational attempts to predict advertising effects (Lewis et al 2011, Gordon et al 2016; see also Eckles & Bakshy 2017), political science’s difficulty showing any causal impact of campaign advertisement spending on victories (Kalla & Broockman 2017; & mostly recently, Donald Trump), and many anecdotal reports seriously questioning the value of Facebook or Google advertising for their businesses in yielding mistaken/curious or fraudulent or just useless traffic.

An attempt at a ARIMA(4,0,1) time-series mixture model, where the mixture has two components: one component for normal traffic where daily traffic is ~1000 making up >90% of daily data, and one component for the occasional traffic spike around 10x larger but happening rarely:

This code winds up continuing to fail due to label-switching issue (ie the MCMC bouncing between estimates of what each mixture component is because of symmetry or lack of data) despite using some of the suggested fixes in the Stan model like the ordering trick. Since there were so few spikes in 2017 only, the mixture model can’t converge to anything sensible; but on the plus side, this also implies that the complex mixture model is unnecessary for analyzing 2017 data and I can simply model the outcome as a normal.

Demo code of simple Expected Value of Sample Information (EVSI) in a JAGS log-Poisson model of traffic (which turns out to be inferior to a normal distribution for 2017 traffic data but I keep here for historical purposes). We consider an experiment resembling historical data with a 5% traffic decrease due to ads; the reduction is modeled and implies a certain utility loss given my relative preferences for traffic vs the advertising revenue, and then the remaining uncertainty in the reduction estimate can be queried for how likely it is that the decision is wrong and that collecting further data would then change a wrong decision to a right one:

so EVPI is $0.07. This doesn’t pay for any additional days of sampling, so there’s no need to calculate an exact EVSI.

I am not too happy about how much uncached JS the Disqus plugin loads or how long it takes to set itself up while spewing warnings in the browser console, but at the moment, I don’t know of any other static site commenting system which has good anti-spam capabilities or an equivalent user base, and Disqus has worked reasonably well for 5+ years.↩

This is especially an issue with A/B testing as usually practiced with NHST & arbitrary alpha threshold, which poses a sorites of suck problem; one could steadily degrade one’s website by repeatedly making bad changes which don’t appear harmful in small-scale experiments (no user harm, p>0.05, and increased revenue p<0.05, no harm, increased revenue, no harm, increased revenue etc). One might call this the Schlitz beer problem (How Milwaukee’s Famous Beer Became Infamous: The Fall of Schlitz), after the famous business case study: a series of small quality decreases/profit increases eventually had catastrophic cumulative effects on their reputation & sales. Another example is the now-infamous Red Delicious apple: widely considered one of the worst-tasting apples commonly sold, it was reportedly an excellent-tasting apple when first discovered in 1880, winning contests for its flavor; but its flavor worsened rapidly over the 20th century, a decline blamed on apple growers gradually switching to ever-redder sports which looked better in grocery stores, a decline which ultimately culminated in the near-collapse of the Red-Delicious-centric Washington State apple industry when consumer backlash finally began in the 1980s with the availability of tastier apples like Gala. This death by degrees can be countered by a few things, such as either testing regularly against a historical baseline to establish total cumulative degradation or carefully tuning α\alpha/β\beta thresholds based on a decision analysis (likely, one would conclude that statistical power must be made much higher and the p-threshold should be made less stringent for detecting harm).↩

Why block instead of, say, just randomizing 5-days at a time (simple randomization)? If we did that, we would occasionally do something like spend an entire month in one condition without switching, simply by rolling a 0 5 or 6 times in a row; since traffic can be expected to drift and change and spike, having such large units means that sometimes they will line up with noise, increasing the apparent variance thus shrinking the effect size thus requiring possibly a great deal more data to detect the signal. Or we might finish the experiment after 100 days (20 units) and discover we had n=15 for advertising and only n=5 for non-advertising (wasting most of our information on unnecessarily refining the advertising condition). Not blocking doesn’t bias our analysis - we still get the right answers eventually - but it could be costly. Whereas if we block pairs of 2-days ([00,11] vs [11,00]), we ensure that we regularly (but still randomly) switch the condition, spreading it more evenly over time, so if there are 4 days of suddenly high traffic, it’ll probably get split between conditions and we can more easily see the effect. This sort of issue is why experiments try to run interventions on the same person, or at least on age and sex-matched participants, to eliminate unnecessary noise. The gains can be extreme; in one experiment, I estimated that using twins rather than ordinary school-children would have let n<120n \lt \frac{1}{20}: The Power of Twins: Revisiting Student’s Scottish Milk Experiment Example. Thus, when possible, I block my experiments at least temporally.↩

After publishing initial results, Chris Stucchio commentedon Twitter: Most of the work on this stuff is proprietary. I ran such an experiment for a large content site which generated directionally similar results. I helped another major content site set up a similar test, but they didn’t tell me the results…as well as smaller effects from ads further down the page (e.g. Taboola). Huge sample size, very clear effects.David Kitchen, a hobbyist operator of several hundred forums, claims that removing banner ads boosted all his metrics (admittedly, at the cost of total revenue), but unclear if this used randomizations or just a before-after comparison.↩

Several of the Internet giants like Google have reported measurable harms from delays as small as 100ms.↩

Examples of ads I saw would be Lumosity ads or online university ads (typically master’s degrees, for some reason) on my DNB FAQ. They looked about what one would expect: generically glossy and clean. It is difficult to imagine anyone being offended by them.↩

This mixture model would have two distributions/components in it; there apparently is no point in trying to distinguish between levels of virality, as 3 or higher does not fit the data well: