To gain some statistical & web development experience and to improve my readers’ experiences, I have been running a series of CSS A/B tests since June 2012. As expected, most do not show any meaningful difference.

Time on page as a conversion goal (https://support.google.com/websiteoptimizer/bin/answer.py?hl=en-AU&answer=74345): every page converts, by using a timeout (mine is 40 seconds). Problem: dichotomizing a continuous variable into a single binary variable destroys a massive amount of information. This is well-known in the statistical and psychological literature (eg. MacCallum et al 2002) but I’ll illustrate further with some information-theoretical observations.

According to my Analytics, the mean reading time (time on page) is 1:47 and the maximum bracket, hit by 1% of viewers, is 1801 seconds; the range 1-1801 takes <10.8 bits to encode (log2(1801) ≈ 10.81), hence each page view could be represented by <10.8 bits (less, since reading time is so highly skewed). But if we dichotomize, then we learn simply that ~14% of readers will read for 40 seconds, hence each reader carries not 6 bits, nor 1 bit (if 50% read that long), but closer to 0.58 bits, the entropy of a 14%/86% split.
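A minimal Python sketch of that arithmetic, assuming the ~14% conversion rate and the 1-1801 second range quoted above; `binary_entropy` is an illustrative helper name, not part of any analysis described here:

```python
import math

def binary_entropy(p: float) -> float:
    """Shannon entropy, in bits, of a yes/no outcome with probability p."""
    if p in (0.0, 1.0):
        return 0.0
    q = 1.0 - p
    return -p * math.log2(p) - q * math.log2(q)

# Upper bound for the raw reading time: 1801 possible values (1-1801 seconds).
max_bits = math.log2(1801)

# Dichotomized at 40 seconds, with ~14% of readers converting.
dichotomized_bits = binary_entropy(0.14)

print(f"raw upper bound: {max_bits:.2f} bits")
print(f"dichotomized:    {dichotomized_bits:.2f} bits")
print(f"50/50 split:     {binary_entropy(0.5):.2f} bits")
```

Even the best case for a binary goal, a 50/50 split, caps out at 1 bit per reader, so a skewed 14%/86% split necessarily carries less.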

max-width is a CSS-3 property: it sets how wide the page will be in pixels if unlimited screen real estate is available. I noticed some people complained that pages were too wide and this made it hard to read, which apparently is a real thing since lines are supposed to fit in eye saccades. So I tossed in 800px, 900px, 1300px, and 1400px to the first A/B test.

It ran from mid-June to 1 August 2012. Unfortunately, I cannot be more specific: on 1 August, Google deleted Website Optimizer and told everyone to use Experiments in Google Analytics - and deleted all my information. The graph over time, the exact numbers - all gone. So this is from memory.

The results were initially very promising: conversion was defined as staying on a page for 40 seconds (I reasoned that this meant someone was actually reading the page), and had a base of around 70% of readers converting. With a few hundred hits, 900px converted at 10-20% more than the default! I was ecstatic. So when it began falling, I was only a little bothered (one had to expect some regression to the mean since the results were too good to be true). But as the hits increased into the low thousands, the effect kept shrinking all the way down to 0.4% improved conversion. At some points, 1300px actually exceeded 900px.

The second distressing thing was that Google’s estimated chance of a particular intervention beating the default (which I believe is a Bonferroni-corrected p-value) did not increase! Even as each version received 20,000 hits, the chance stubbornly bounced around the 70-90% range for 900px and 1300px. This remained true all the way to the bitter end. At the end, each version had racked up 93,000 hits and still was in the 80% decile. Wow.

Ironically, I was warned at the beginning about both of these possible behaviors by a paper I read on large-scale corporate A/B testing (http://www.exp-platform.com/Documents/puzzlingOutcomesInControlledExperiments.pdf; see also http://www.exp-platform.com/Documents/controlledExperimentDMKD.pdf and http://www.exp-platform.com/Documents/2013%20controlledExperimentsAtScale.pdf). It covered at length how many apparent trends simply evaporated, but it also covered later a peculiar phenomenon where A/B tests did not converge even after being run on ungodly amounts of data because the standard deviations kept changing (the user composition kept shifting and rendering previous data more uncertain). And it’s a general phenomenon that even for large correlations, the trend will bounce around a lot before it stabilizes (Schönbrodt & Perugini 2013).

Oy vey! When I discovered Google had deleted my results, I decided to simply switch to 900px. Running a new test would not provide any better answers.

In March 2013, I decided to give A/B testing another whack. Google Analytics Experiment did not seem to have improved and the commercial services continued to charge unacceptable prices, so I gave the Google Analytics custom variable integration approach another try using ABalytics. The usual puzzling, debugging, and frustration of combining so many disparate technologies (HTML and CSS and JS and Google Analytics) aside, it seemed to work on my test page. The current downside seems to be that the ABalytics approach may be fragile, and the UI in GA is awful (you have to do the statistics yourself).

1100px is close to my original A/B test indicating 1000px was the leading candidate, so that gives me additional confidence, as does the observation that 1300px and 1200px are the other leading candidates. (Curiously, the site conversion average before was 13.88%; perhaps my underlying traffic changed slightly around the time of the test? This would demonstrate why alternatives need to be tested simultaneously.) A quick and dirty R test of 1100px vs 1300px (prop.test(c(2632,2581),c(18164,18071))) indicates the difference isn’t statistically-significant (at p=0.58), and we might want more data; worse, there is no clear linear relation between conversion and width (the plot is erratic, and a linear fit has a dismal p=0.89).
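For readers without R handy, the 1100px vs 1300px comparison can be sketched in Python with only the standard library. This approximates what prop.test does for two groups (a pooled two-proportion test with Yates’ continuity correction); it is not R’s actual code, and `two_proportion_test` is a name made up for this example:

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_test(x1: int, n1: int, x2: int, n2: int) -> float:
    """Two-sided p-value for H0: both groups share one conversion rate.

    Uses the pooled normal approximation with Yates' continuity correction.
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    # The continuity correction shrinks the observed difference slightly.
    correction = 0.5 * (1 / n1 + 1 / n2)
    diff = max(abs(p1 - p2) - correction, 0.0)
    z = diff / se
    return 2 * (1 - NormalDist().cdf(z))

# 1100px: 2632 conversions of 18164 visits; 1300px: 2581 of 18071.
p_value = two_proportion_test(2632, 18164, 2581, 18071)
print(f"p = {p_value:.2f}")
```

The two rates differ by about 0.2 percentage points (14.49% vs 14.28%), which is roughly half a standard error at these sample sizes, so the result should land near the p=0.58 quoted above.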

But I want to move on to the next test and by the same logic it is highly unlikely that the difference between them is large or much in 1300px’s favor (the kind of mistake I care about: switching between 2 equivalent choices doesn’t matter, missing out on an improvement does matter - maximizing power (1-β), not minimizing α).

The New York Times ran an informal online experiment with a large number of readers (n=60750) and found that the Baskerville font led to more readers agreeing with a short text passage - this seems plausible enough given their very large sample size and Wikipedia’s note that “the refined feeling of the typeface makes it an excellent choice to convey dignity and tradition”.

Would this font work its magic on gwern.net too? Let’s see. The sample size is quite manageable, as over a month I will easily have 60k visits, and they tested 6 fonts, expanding their necessary sample. What sample size do I actually need? Their professor estimates the effect size of Baskerville at 1.5%; I would like my A/B test to have very high statistical power (0.9) and reach more stringent statistical-significance (p<0.01) so I can go around and in good conscience tell people to use Baskerville. I already know the average conversion rate is ~13%, so I get this power calculation:
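A rough sketch of such a power calculation in Python, using the standard normal-approximation formula for comparing two proportions. Two assumptions are mine: that the 1.5% effect is an absolute difference (13% vs 14.5%), and that this textbook formula is close enough to whatever routine produced the figure discussed next; `sample_size_per_group` is an illustrative name:

```python
from math import ceil, sqrt
from statistics import NormalDist

def sample_size_per_group(p1: float, p2: float, alpha: float, power: float) -> int:
    """Visitors needed per arm for a two-sided test of p1 vs p2."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the significance level
    z_power = z.inv_cdf(power)          # quantile for the desired power
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_power * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p1 - p2) ** 2)

# Baseline ~13% conversion, hoped-for +1.5 points, p<0.01, power 0.9.
n = sample_size_per_group(0.13, 0.145, alpha=0.01, power=0.90)
print(n)
```

Because the required n scales with the inverse square of the effect, halving the detectable effect to 0.75 points would roughly quadruple the traffic needed.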

15000 visitors in each group seems reasonable; at ~16k visitors a week, that suggests a few weeks of testing. Of course I’m testing 4 fonts (see below), but that still fits in the ~2 months I’ve allotted for this test.

I had not used Baskerville but Georgia since Georgia seemed similar and was convenient, but we’ll fix that now. Besides Baskerville & Georgia, we’ll omit Comic Sans (of course), but we can try Trebuchet for a total of 4 fonts (falling back to Georgia):

The sample size for each font is 20k higher than I projected due to the enormous popularity of an analysis of the lifetimes of Google services I finished during the test. Regardless, it’s clear that the results - with double the total sample size of the NYT experiment, focused on fewer fonts - are disappointing and there seems to be very little difference between fonts.

Since there’s only small differences between individual fonts, I wondered if there might be a difference between the two sans-serifs and the two serifs. If we lump the 4 fonts into those 2 categories and look at the small difference in mean conversion rate:

With essentially no meaningful differences between conversion rates, this suggests that however fonts matter, they don’t matter for reading duration. So I feel free to pick the font that appeals to me visually, which is Baskerville.

I have seen complaints that lines on gwern.net are too closely spaced or run together or cramped, referring to the line height (the CSS property line-height). I set the CSS to line-height: 150%; to deal with this objection, but this was a simple hack based on rough eyeballing of it, and it was done before I changed the max-width and font-family settings after the previous testing. So it’s worth testing some variants.

Most web design guides seem to suggest a safe default of 120%, rather than my current 150%. If we try to test each decile plus one on the outside, that’d give us 110, 120, 130, 140, 150, 160 or 6 options, which combined with the expected small effect, would require an unreasonable sample size (and I have nothing in the pipeline I expect might catch fire like the Google analysis and deliver an excess >50k visits). So I’ll try just 120/130/140/150, and schedule a similar block of time as fonts (ending the experiment on 16 August 2013, with presumably >70k datapoints).

One of the suggestions in the A/B testing papers was to run a null A/B test (or A/A test) where the payload is empty but the A/B testing framework is still measuring conversions etc. By definition, the null hypothesis of no difference should be true and at an alpha of 0.05, only 5% of the time would the null tests yield a p<0.05 (which is very different from the usual situation). The interest here is that it’s possible that something is going wrong in one’s A/B setup or in general, and so if one gets a statistically-significant result, it may be worthwhile investigating this anomaly.
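What a healthy A/A test looks like can be sketched by simulation. Everything below is hypothetical (the 13% rate, 500 visitors per arm, and the function names are chosen for illustration); it only demonstrates that identical arms should cross p<0.05 about 5% of the time:

```python
import random
from math import sqrt
from statistics import NormalDist

def p_value(x1: int, n1: int, x2: int, n2: int) -> float:
    """Two-sided pooled z-test p-value for a difference in conversion rates."""
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        return 1.0
    z = abs(x1 / n1 - x2 / n2) / se
    return 2 * (1 - NormalDist().cdf(z))

def simulate_aa_tests(rate: float, visitors: int, runs: int, seed: int) -> float:
    """Fraction of A/A tests (two identical arms) reaching p<0.05."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(runs):
        # Both arms draw conversions from the same underlying rate.
        a = sum(rng.random() < rate for _ in range(visitors))
        b = sum(rng.random() < rate for _ in range(visitors))
        if p_value(a, visitors, b, visitors) < 0.05:
            hits += 1
    return hits / runs

false_positive_rate = simulate_aa_tests(rate=0.13, visitors=500, runs=2000, seed=1)
print(f"{false_positive_rate:.3f}")
```

If a real A/A test on the live site produced significant results much more often than this, that would point to a problem in the randomization or the tracking rather than in the CSS.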

It’s easy to switch from the lineheight test to the null test; just rename the variables for Google Analytics, and empty the payloads:

The hyperlinks, on the other hand, make use of an off-black color: #303C3C, partially motivated by Ian Storm Taylor’s advice to “Never Use Black”. I wonder - should all the text be off-black too? And which combination is best? White/black? Off-white/black? Off-white/off-black? White/off-black? Let’s try all 4 combinations here.

I am a little curious about this one, so I scheduled a full month and a half: 10 September - 20 October. Due to far more traffic than anticipated from submissions to Hacker News, I cut it short by 10 days to avoid wasting traffic on a test which was done (a total n of 231,599 was more than enough). The results:

So, this suggests a change to the CSS: we switch the default background color from #FCFCFC to white, while leaving the default color its current black.

Reader Lucas asks in the comment sections whether, since we would expect new visitors to the website to be less likely to read a page in full than a returning visitor (who knows what they’re in for & probably wants more), including such a variable (which is something Google Analytics does track) might improve the analysis. It’s easy to ask GA for New vs Returning Visitor so I did:

I make heavy use of unordered lists in articles; for no particular reason, the symbol denoting the start of each entry in a list is the little black square, rather than the more common little circle. I’ve come to find the little squares a little chunky and ugly, so I want to test that. And I just realized that I never tested font size (just type of font), even though increasing font size is one of the most common CSS tweaks around. I don’t have any reason to expect an interaction between these two bits of design, unlike the previous A/B test, but I like the idea of getting more out of my data, so I am doing another factorial design, this time not 2x2 but 3x5. The options:

The results are a little confusing in factorial form: it seems pretty clear that Size is bad and that 100% performs best, but what’s going on with the list icon type? Do we have too little data or is it interacting with the font size somehow? I find it a lot clearer when plotted:

Immediately the negative effect of increasing the font size jumps out, but it’s easier to understand the list icon estimates: square performs the best in the 100% (the original default) font size condition but it performs poorly in the other font sizes, which is why it seems to do only medium-well compared to the others. Given how much better 100% performs than the others, I’m inclined to ignore their results and keep the squares.

100% and squares, however, were the original CSS settings, so this means I will make no changes to the existing CSS based on these results.

Another bit of formatting I’ve been meaning to test for a while is seeing how well Readability’s pull-quotes next to blockquotes perform, and to check whether my zebra-striping of nested blockquotes is helpful or harmful.

I discovered during this experiment that I could graph the conversion rate of each condition separately:

Google Analytics view on blockquote factorial test conversions, by day

What I like about this graph is how it demonstrates some basic statistical points:

the more traffic, the smaller sampling error is and the closer the 4 conditions are to their true values as they cluster together. This illustrates how even what seems like a large difference based on a large amount of data, may still be - unintuitively - dominated by sampling error

day to day, any condition can be on top; no matter which one proves superior and which version is the worst, we can spot days where the worst version looks better than the best version. This illustrates how insidious selection biases or choice of datapoints can be: we can easily lie and show black is white, if we can just manage to cherrypick a little bit.

the underlying traffic does not itself appear to be completely stable or consistent. There are a lot of movements which look like the underlying visitors may be changing in composition slightly and responding slightly. This harks back to the paper’s warning that for some tests, no answer was possible as the responses of visitors kept changing which version was performing best.
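The first two points follow from the usual binomial standard error, sqrt(p(1-p)/n). A small sketch, with hypothetical per-condition daily traffic and a ~20% conversion rate chosen only for illustration:

```python
from math import sqrt

def standard_error(rate: float, visitors: int) -> float:
    """Standard error of an observed conversion rate (binomial approximation)."""
    return sqrt(rate * (1 - rate) / visitors)

# Hypothetical daily traffic per condition, at a ~20% true conversion rate.
for visitors in (100, 400, 1600, 6400):
    se = standard_error(0.20, visitors)
    # An approximate 95% interval spans about +/- 1.96 standard errors.
    print(f"n={visitors:5d}: 20% +/- {1.96 * se * 100:.1f} points")
```

Quadrupling the traffic only halves the noise, so on a day with a few hundred visitors per condition, gaps of several percentage points between conditions are expected from chance alone.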

It was pointed out to me that in my previous font-size test, the clear linear trend may have implied that larger fonts than 100% were bad, but that I was making an unjustified leap in implicitly assuming that 100% was best: if bigger is worse, then mightn’t the optimal font size be something smaller than 100%, like 95%?

And while the blockquote background coloring is a good idea, per the previous test, what about the other place on gwern.net where I use a light background shading: the Table of Contents? Perhaps it would be better with the same background shading as the blockquotes, or no shading?

Finally, because I am tired of just 2 factors, I throw in a third factor to make it really multifactorial. I picked the number-sizing from the existing list of suggestions.

The two size tweaks turn out to be unambiguously negative compared to the status quo (with an almost negligible interaction term probably reflecting reader preference for consistency in sizes of letters and numbers - as one gets smaller, the other does better if it’s smaller too). The Table of Contents backgrounds also survive (thanks to the new vs old visitor type covariate adding power): there were 3 background types, e/f/r[gb], and f/r turn out to have negative coefficients, implying that e is best - but e is also the status quo, so no change is recommended.

At this point it seems worth asking whether running multifactorials has been worthwhile. The analysis is a bit more difficult, and the more factors there are, the harder to interpret. I’m also not too keen on encoding the combinatorial explosion into a big JS array for ABalytics. In my tests so far, have there been many interactions? A quick tally of the glm()/step() results:

Text & background color:

original: 2 main, 1 two-way interaction

survived: 2 main, 1 two-way interaction

List symbol and font-size:

original: 3 main, 2 two-way interactions

survived: 1 main

Blockquote formatting:

original: 2 main, 1 two-way

survived: 1 main

Font size & ToC background:

original: 4 mains, 5 two-ways, 2 three-ways

survived: 3 mains, 1 two-way

So of the 11 main effects, 9 two-ways, & 2 three-ways, there were confirmed in the reduced models: 7 mains, 2 two-ways (22%), & 0 three-ways (0%). And of the 2 interactions, only the black/white interaction was important (and even there, if I had regressed instead cbind(Successes, Failures) ~ Black + White, black & white would still have positive coefficients, they just would not be statistically-significant, and so I would likely have made the same choice as I did with the interaction data available).

Uppercase and none beat capitalize in both page titles & section headers (interaction does not survive). So I toss in a CSS declaration to uppercase section headers as well as the status quo of the title.

After the page title, the next thing a reader will generally see on my pages is the table of contents. It’s been tweaked over the years (particularly by suggestions from Hacker News) but still has some untested aspects, particularly the first two parts of div#TOC:

So, as I expected, putting the ToC on the right performed worse; the larger ToC widths don’t seem to be better but it’s unclear what’s going on there. A visual inspection of the Width data (library(ggplot2); qplot(Width,Rate,color=Alignment,data=rates)) suggests that 20% width was the best variant, so might as well go with that.

BLR is a JS library for highlighting textual paragraphs with pairs of half-lines to make reading easier. I run a randomized experiment on several differently-colored versions to see if default site-wide usage of BLR will improve time-on-page for gwern.net readers, indicating easier reading of the long-form textual content. Most versions perform worse than the control of no-highlighting; the best version performs slightly better but the improvement is not statistically-significant.

BeeLine Reader (BLR) is an interesting new browser plugin which launched around October 2013; I learned of it from the Hacker News discussion. The idea is that part of the difficulty in reading text is that when one finishes a line and saccades left to the continuation of the next line, the uncertainty of where it is adds a bit of stress, so one can make reading easier by adding some sort of guide to the next line; in this case, each matching pair of half-lines is colored differently, so if you are on a red half-line, when you saccade left, you look for a line also colored red, then you switch to blue in the middle of that line, and so on. A colorful variant on boustrophedon writing. I found the default BLR coloring garish & distracting, but I couldn’t see any reason that a subtle gray variant would not help: the idea seems plausible. And very long text pages (like mine) are where BLR should shine most.

I asked if there were a JavaScript version I could use in an A/B test; the initial JS implementation was not fast enough, but by 10 March 2014 it was good enough. BLR has several themes, including gray; I decided to test the variants no BLR, dark, blues, & expanded the gray selection to include grays #222222/#333333/#444444/#555555/#666666/#777777 (gray1-6; they vary in how blatant the highlighting is) for a total of 9 equally-randomized variants.

Since I’m particularly interested in these results, and I think many other people will find the results interesting, I will run this test extra-long: a minimum of 2 months. I’m only interested in the best variant, not estimating each variant exactly (what do I care if the ugly dark is 15% rather than 14%? I just want to know it’s worse than the control) so conceptually I want something like a sequential analysis or adaptive clinical trial or multi-armed bandit where bad variants get dropped over time; unfortunately, I haven’t studied them yet (and MABs would be hard to implement on a static site), so I’ll just ad hoc drop the worst variant every week or two. (Maybe next experiment I’ll do a formal adaptive trial.)

The usual implementation using ABalytics doesn’t work because it uses an innerHTML call to substitute the various fragments, and while HTML & CSS get interpreted fine, JavaScript does not; the offered solutions were sufficiently baroque I wound up implementing a custom subset of ABalytics hardwired for BLR inside the Analytics script:

(Why bl3? I don’t know JS, so it took some time; things I learned along the line included always leaving whitespace around a < operator, and that the none argument passed into beeline.setOptions causes a problem which some browsers will ignore and continue recording A/B data after but most browsers will not; this broke the original test. Then I discovered that BLR by default broke all the MathML/MathJax, causing nasty-looking errors over pages with math expressions; this broke the second test, and I had to get a fixed version.)

On 31 March, with total n having reached 15652 visits, I deleted the worst-performing variant: gray4, which at 19.21% was substantially underperforming the best-performing variant’s 22.38%, and wasting traffic. On 6 April, two Hacker News submissions having doubled visits to 36533, I deleted the next-worst variant, gray5 (14.66% vs control of 16.25%; p=0.038). On 9 April, the almost as inferior gray6 (15.67% vs 16.26%) was deleted. On 17 April, dark (16.00% vs 16.94%) was deleted. On 30 April, I deleted gray2 (17.56% vs 18.07%). 11 May, blues was gone (18.11% vs 18.53%), and on 31 May, I deleted gray3 (18.04% vs 18.24%).

Due to caching, the deletions didn’t necessarily drop data collection instantly to zero. Traffic was also heterogeneous: Hacker News traffic is much less likely to spend much time on page than the usual traffic.

The conversion data, with new vs returning visitor, segmented by period, and ordered by when a variant was deleted:

| Variant | Old | Total: n (%) | 10-31 March | 1-6 April | 7-9 April | 10-17 April | 18-30 April | 1-11 May | 12-31 May | 1-8 June |
|---|---|---|---|---|---|---|---|---|---|---|
| none | FALSE | 17648 (16.01%) | 1189 (19.26%) | 3607 (13.97%) | 460 (17.39%) | 1182 (16.58%) | 3444 (17.04%) | 2397 (14.39%) | 3997 (17.39%) | 2563 (16.35%) |
| none | TRUE | 8009 (23.65%) | 578 (24.91%) | 1236 (22.09%) | 226 (20.35%) | 570 (23.86%) | 1364 (27.05%) | 1108 (23.83%) | 2142 (22.46%) | 1363 (23.84%) |
| gray1 | FALSE | 17579 (16.28%) | 1177 (19.71%) | 3471 (14.06%) | 475 (13.47%) | 1200 (17.33%) | 3567 (17.49%) | 2365 (13.57%) | 3896 (18.17%) | 2605 (17.24%) |
| gray1 | TRUE | 7694 (23.85%) | 515 (28.35%) | 1183 (23.58%) | 262 (21.37%) | 518 (21.43%) | 1412 (26.56%) | 1090 (24.86%) | 2032 (22.69%) | 1197 (23.56%) |
| gray3 | FALSE | 14871 (15.81%) | 1192 (18.29%) | 3527 (14.15%) | 446 (15.47%) | 1160 (15.43%) | 3481 (17.98%) | 2478 (14.65%) | 3776 (16.26%) | 3 (33.33%) |
| gray3 | TRUE | 6631 (23.06%) | 600 (24.83%) | 1264 (21.52%) | 266 (18.05%) | 638 (21.79%) | 1447 (25.22%) | 1053 (24.60%) | 1912 (23.17%) | 51 (5.88%) |
| blues | FALSE | 10844 (15.34%) | 1157 (18.93%) | 3470 (14.35%) | 449 (16.04%) | 1214 (15.57%) | 3346 (17.54%) | 2362 (13.46%) | 3 (0.00%) | |
| blues | TRUE | 4544 (23.04%) | 618 (27.18%) | 1256 (23.81%) | 296 (20.27%) | 584 (22.09%) | 1308 (24.46%) | 1052 (22.15%) | 48 (12.50%) | |
| gray2 | FALSE | 8646 (15.51%) | 1220 (20.33%) | 3649 (13.81%) | 416 (15.14%) | 1144 (15.03%) | 3433 (17.54%) | 4 (0.00%) | | |
| gray2 | TRUE | 3366 (22.82%) | 585 (22.74%) | 1271 (21.79%) | 230 (16.52%) | 514 (21.60%) | 1298 (25.42%) | 44 (27.27%) | 6 (0.00%) | 3 (0.00%) |
| dark | FALSE | 5240 (14.05%) | 1224 (20.59%) | 3644 (13.83%) | 420 (13.81%) | 1175 (14.81%) | 1 (0.00%) | | | |
| dark | TRUE | 2161 (20.59%) | 618 (21.52%) | 1242 (20.85%) | 276 (21.74%) | 574 (20.56%) | 64 (10.94%) | 1 (0.00%) | 2 (0.00%) | 2 (50.00%) |
| gray6 | FALSE | 4022 (13.30%) | 1153 (19.51%) | 3610 (12.88%) | 409 (17.11%) | 1 (0.00%) | 2 (0.00%) | 3 (0.00%) | | |
| gray6 | TRUE | 1727 (20.61%) | 654 (23.70%) | 1358 (22.02%) | 259 (18.92%) | 95 (7.37%) | 11 (9.09%) | 1 (0.00%) | | |
| gray5 | FALSE | 3245 (12.20%) | 1175 (16.68%) | 3242 (12.21%) | 3 (0.00%) | | | | | |
| gray5 | TRUE | 1180 (21.53%) | 559 (25.94%) | 1130 (21.77%) | 34 (17.65%) | 16 (12.50%) | | | | |
| gray4 | FALSE | 1176 (18.54%) | 1174 (18.57%) | 1174 (18.57%) | 2 (0.00%) | | | | | |
| gray4 | TRUE | 673 (19.91%) | 650 (20.31%) | 669 (20.03%) | 1 (0.00%) | 1 (0.00%) | 2 (0.00%) | | | |
| | | 137438 (18.27%) | | | | | | | | |

Graphed:

Weekly conversion rates for each of the BeeLine Reader settings

I also received a number of complaints while running the BLR test (principally due to the dark and blues variants, but also apparently triggered by some of the less popular gray variants; the number of complaints dropped off considerably by halfway through):

2 in emails

2 on IRC unsolicited; when I later asked, there were 2 complaints of slowness loading pages & after reflowing

The BLR people say that there may be cross-browser differences, so I thought about throwing in browser as a covariate too (an unordered factor of Chrome & Firefox, and maybe I’ll bin everything else as an other browser); it seems I may have to use the GA API to extract conversion rates split by variant, visitor status, and browser. This turned out to be enough work that I decided to not bother.

As usual, a logistic regression on the various BLR themes with new vs returning visitors (Old) as a covariate. Because of the heterogeneity in traffic (and because I bothered breaking out the data by time period this time for the table), I also include each block as a factor. Finally, because I expected the 6 gray variants to perform similarly, I try out a multilevel model nesting the grays together.

The results are not impressive: only 2 gray variants out of the 8 variants have a positive estimate, and neither is statistically-significant; the best variant was gray1 (#222222 & #FBFBFB), at an estimated increase from 19.52% to 20.04% conversion rate. More surprising, the nesting turns out to not matter at all, and in fact the worst variant was gray. (The best-fitting multilevel model ignores the variants entirely, although it did not fit better than the regular logistic model incorporating all of the time periods, Old, and variants.)

An unlikely +0.5% to reading rates isn’t enough for me to want to add a dependency on another JS library, so I will be removing BLR. I’m not surprised by this result, since most tests don’t show an improvement, BLR coloring is pretty unusual for a website, and users wouldn’t have any understanding of what it is or ability to opt out of it; using BLR by default doesn’t work, but the browser extension might be useful since the user expects the coloring & can choose their preferred color scheme.

I was surprised that the gray variants could perform so wildly differently, from slightly better than the control to horribly worse, considering that they didn’t strike me as looking that different when I was previewing them locally. I also didn’t expect blues to last as long as it did, and thought I would be deleting it as soon as dark. This makes me wonder: are there color themes only subtly different from the ones I tried which might work unpredictably well? Since BLR by default offers only a few themes, I think BLR should try out as many color themes as possible to locate good ones they’ve missed.

Some limitations to this experiment:

no way for users to disable BLR or change color themes

did not include web browser type as a covariate, which might have shown that particular combinations of browser & theme substantially outperformed the control (then BLR could have improved their code for the bad browsers or a browser check done before highlighting any text)

did not use formal adaptive trial methodology, so the p-values have no particular interpretation

One of the site features I like the most is how the endnotes pop-out/float when the mouse hovers over the link, so the reader doesn’t have to jump to the endnotes and back, jarring their concentration and breaking their train of thought. I got the JS from Lukas Mathis back in 2010. But sometimes the mouse hovers by accident, and with big footnotes, the popped-up footnote can cover the screen and be unreadable. I’ve wondered if it’s as cool as I think it is, or whether it might be damaging. So now that I’ve hacked up an ABalytics clone which can handle JS in order to run the BLR experiment, I might as well run an A/B test to verify that the floating footnotes are not badly damaging conversions. (I’m not demanding the floating footnotes increase conversions by 1% or anything, just that the floating isn’t coming at too steep a price.)

As I had hoped, floating footnotes seem to do no harm, and the point-estimate is positive. The 95% CI, while not excluding zero, does exclude values worse than -0.035, which satisfies me: if floating footnotes are doing any harm, it’s a small harm.

Could you format your pages so that the texts are all aligned at the left? It looks unprofessional when the lines of text break at different areas. Could you make the site like a LaTeX article? The formatting is the only thing preventing you from looking really professional.

I wasn’t sure what he meant, since the text is left-aligned, and I can’t ask for clarification (anonymous means anonymous).

Looking at a random page, my best guess is that he’s bothered by the indentation at the start of successive paragraphs: in a sequence of paragraphs, the first paragraph is not indented (because it can’t be visually confused) but the successive paragraphs are indented by 1.5em in order to make reading easier. The CSS is:

I liked this, but I suppose for lots of small paragraphs, it lends a ragged appearance to the page. So might as well test a few variants of text-indent to see what works best: 0em, 0.1, 0.5, 1.0, 1.5, and 2.0.

In retrospect years later, after learning more about typography and revamping gwern.net CSS a number of times, I think Anonymous was actually talking about text justification: HTML/gwern.net is by default flush left, ragged right, with large whitespace gaps left where words of different lengths get moved to the next line but not broken/hyphenated or stretched to fill the line. Some people do not like text justification, describing ragged right as easier to read, but most typographers endorse it, it was historically the norm for professionally-set print, still carries connotations of class, and I think the appearance fits in with my overall site esthetic. I eventually enabled text justification on gwern.net in February 2019 (although I was irritated by the discovery that the standard CSS method of doing so does not work in the Chrome browser due to a long-standing failure to implement hyphenation support).

On 27 July 2014, since the 95% CIs for the best and worst indent variants no longer overlapped, I deleted the worst variant (0.1). On 23 August 2014, the 2.0em and 0.0em variants no longer overlapped, and I deleted the latter.

Daily traffic and conversion rates for each of the indentation settings

The conversion data, with new vs returning visitor, segmented by period, and ordered by when a variant was deleted:

A simple analysis of the totals would indicate that 0.1em is the best setting - which is odd since it was the worst-performing and first variant to be deleted, so how could it be the best? The graph of traffic suggests that, like before, the final totals are confounded by time-varying changes in conversion rates plus dropping variants; that is, 0.1em probably only looks good because after it was dropped, a bunch of Hacker News traffic hit and happened to convert at lower rates, making the surviving variants look bad. One might hope that all of that effect would be captured by the Old covariate as HN traffic gets recorded as new visitors, but that would be too much to hope for. So instead, I add a dummy variable for each of the 3 separate time-periods which will absorb some of this heterogeneity and make clearer the effect of the indentation choices.
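A toy calculation shows how dropping a variant before a traffic surge can flatter it in the totals. All numbers below are invented for illustration; the two variants convert identically whenever both are running:

```python
# Hypothetical (visitors, conversions) per period; both variants convert
# at exactly the same rate whenever they are both running.
periods = {
    "before surge": {"0.1em": (5000, 1000), "2.0em": (5000, 1000)},  # 20% each
    # 0.1em has been deleted; a surge of low-converting traffic arrives.
    "during surge": {"2.0em": (20000, 2400)},                         # 12%
}

def pooled_rate(variant: str) -> float:
    """Conversion rate from naively summing a variant's data over all periods."""
    visitors = conversions = 0
    for arms in periods.values():
        if variant in arms:
            n, x = arms[variant]
            visitors += n
            conversions += x
    return conversions / visitors

for variant in ("0.1em", "2.0em"):
    print(f"{variant}: {pooled_rate(variant):.1%}")
```

The deleted variant “wins” the pooled totals purely because it escaped the surge; comparing variants within each period, which is what a per-period dummy variable does in the regression, removes the artifact.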

There’s definitely temporal heterogeneity, given the statistical-significance of the time-period dummies, so that is good to know. But the estimated effects for each indentation variant are derisorily small (despite having spent n=159634), suggesting readers don’t care at all. Since I have no opinion on the matter, I suppose I’ll go with the highest point-estimate, 2em.

Looking at my current pages, one of the visual aspects that bothers me is the sidebar: it contains links to top-level pages, page-specific metadata, a search interface, and donation widgets (all separated by whitespace and horizontal rulers). It comes off as a little disorganized and messy.

So, I’d like to try out removing the horizontal rulers used as dividers, and hiding the search-engine and donations. Then in another A/B test I can try out different tweaks (maybe re-sort the sections or change the word-breaking).

I killed the test in late January. (I had gotten an idea I wanted to test, see next section: if the sidebar is too cluttered with site navigation, donation and metadata, why not move the metadata into the body?)

Looking at the sidebar some more, it occurred to me that the sidebar was serving 3 different purposes all mixed together:

site-wide: navigation to the main index/homepage, as well as meta-site pages like about me, the site, recent updates, and ways of getting RSS/email updates

site-wide: donation requests

page-specific: a page’s metadata about when that page’s content was first created, last modified, content tags, etc

The page metadata is the odd man out, and I’ve noticed that a lot of people seem to not notice the page metadata hiding in the sidebar (eg there will be comments wondering when a page was created, when that’s listed clearly right there in the page’s sidebar). What if I moved the page metadata to underneath the big title? I’d have to change the formatting, since I can’t afford to spend 10+ vertical lines of space the way it must be formatted in the sidebar, but the metadata could fit in 2-5 lines if I combine the logical pairs (so instead of 4 lines for created: / 7 May 2013 / modified: / 09 Jan 2015, just one line created: 7 May 2013; modified: 09 Jan 2015).

There are several different ways and levels of density, so I created 6 variants with increasing amounts of density.

As an HTML rather than CSS change, the implementation as an A/B test is more complex.

I define inline in the HTML template each of the 6 variants, as divs with the IDs metadata1..metadata6. In the default.css, I set them to display: none so the user does not see 6 different metadatas taking up 2 screens of space. Then, each A/B variant passed to ABalytics toggles back on one version using display: block. I also include a 7th variant, where none of the 6 should be visible, which is effectively the control condition which roughly matches the status quo of showing the metadata in the sidebar. (Roughly, since in the none condition, there won’t be metadata anywhere in the displayed page; but since the previous experiment indicated that removing elements from the sidebar didn’t make any noticeable difference, I decided to simplify the HTML source code by removing the original metadata div entirely to avoid any collisions or issues with the CSS/HTML I’ve defined.)

On 5 February 2015, the top variant (meta5) outperformed the bottom one (meta1, corresponding to my expectation that the taller variants would be worse than the compactest ones), so the worst was deleted. On 8 February 2015, the new top variant (meta6) outperformed meta4, so I deleted meta4. On 22 March 2015, meta6 outperformed none. On 25 May 2015, the difference was not statistically-significant but I decided to delete meta3 anyway. On 2 July 2015, I deleted meta2 similarly; given the ever smaller differences between variants, it may be time to kill the experiment.

A strange set of results. meta2 performs the best on new visitors, and worst on old visitors; while meta6 is the exact opposite. Because there are more new visitors than old visitors, meta2 is the best on average. Except I hate how meta2 looks and much prefer meta6. The confidence intervals are wide, though - it’s not clear that meta6 is definitely worse than meta2.

A CSE is a Google search query but one specialized in various ways - somewhat like offering a user a form field which redirects to a Google search query like QUERY site:gwern.net/docs/, but more powerful since you can specify thousands of URLs to blacklist and whitelist and have limited patterns. I have two: one is specialized for searching for anime/manga news sites and makes writing Wikipedia articles much easier (since you can search for a particular anime title and the results will be mostly news and reviews which you can use in a WP article, rather than images, songs, memes, Amazon and commercial sites, blogs, etc); and the second is specialized to search gwern.net, my Reddit, LessWrong, PredictionBook, Good Reads and some other sites, to make it easier to find something I may’ve written. The second I created to put in the sidebar and serve as a website search function. (I threw in the other sites because why not?)

Google provides HTML & JS for integrating a CSE somewhere, so creating & installing it was straightforward, and it went live 24 May 2013.

The problem is that the CSE search input takes up space in the sidebar, and is more JS to run on each page load and loads at least one other JS file as well. So on 17 July 2015, I took a look to evaluate whether it was worth keeping.

There had been 8974 searches since I installed it 785 days previously or ~11.4 searches per day; at least 119 were searches for e, which I assume were user mistakes where they didn’t intend to search and probably annoyed them. (The next most popular searches are Graeber/26, chunking/22, and nootropics/10, with CSE refusing to provide any further queries due to low volume. This suggests a long tail of search queries - but also that they’re not very important since it’s easy to find the DNB FAQ & my nootropics page, and it can hardly be useful if the top search is an error.)

To put these 8855 searches in perspective, in that same exact time period, there were 891,790 unique users with 2,010,829 page views. So only 0.44% of page-views involve a use of the CSE, or a ratio of 1:227. Is it net-beneficial to make 227 page-views incur the JS run & loading for the sake of 1 CSE search?

This might seem like a time to A/B test the presence/absence of the CSE div. (I can’t simply hide it using CSS like usual because it will still affect page loads.) Except consider the power issues: if that 1 CSE search converts, then to be profitable, it needs to damage the 227 other page-views conversion rate by <1/227. Or to put it the other way, the current conversion rate is ~17% of page-views and CSE search represents 0.44% of page-views, so if the CSE makes that one page-view 100% guaranteed to convert and otherwise converts normally, then over 1000 page-views, we have $0.17 \cdot 995 + 1.0 \cdot 5 = 174$ vs $0.17 \cdot 995 + 0.17 \cdot 5 = 170$, or 17.4% vs 17.0%.

Even with the most optimistic possible assumptions (perfect conversion, no negative effect), it takes 279,449 page-views to get decent power. This is ridiculous from a cost-benefit perspective, and worse given that my priors are against it due to the extra JS & CSS it entails.
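The size of that number can be checked with the standard closed-form sample-size approximation for comparing two proportions. This is a sketch under assumptions not stated in the text: a two-sided test at alpha = 0.05, 80% power, and equal traffic to both arms; `total_sample_size` is a hypothetical helper name. With the rounded rates of 17.4% vs 17.0% it comes out at roughly 279 thousand page-views, in line with the figure quoted above.

```python
from math import sqrt
from statistics import NormalDist


def total_sample_size(p1, p2, alpha=0.05, power=0.80):
    """Total n (both arms) for a two-sided two-proportion test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_power * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    per_arm = numerator / (p1 - p2) ** 2
    return 2 * per_arm


# Best case from the text: 995 ordinary page-views at 17% plus 5 that always convert.
with_cse = (0.17 * 995 + 1.0 * 5) / 1000      # about 0.174
without_cse = (0.17 * 995 + 0.17 * 5) / 1000  # 0.170
n_total = total_sample_size(0.174, 0.170)
```

The required sample grows with the inverse square of the difference in rates, which is why a 0.4 percentage-point effect is so expensive to detect.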

So I simply removed it. It was a bit of an experiment, and <8.9k searches does not seem worth it.

One source of complexity & JavaScript use on gwern.net is the use of Google AdSense advertising to insert banner ads. In considering design & usability improvements, removing the banner ads comes up every time as a possibility, as readers do not like ads, but such removal comes at a revenue loss and it’s unclear whether the benefit outweighs the cost, suggesting I run an A/B experiment. However, ads might be expected to have broader effects on traffic than individual page reading times/bounce rates, affecting total site traffic instead through long-term effects on or spillover mechanisms between readers (eg social media behavior), rendering the usual A/B testing method of per-page-load/session randomization incorrect; instead it would be better to analyze total traffic as a time-series experiment.

Design: A decision analysis of revenue vs readers yields a maximum acceptable total traffic loss of ~3%. Power analysis of historical gwern.net traffic data demonstrates that the high autocorrelation yields low statistical power with standard tests & regressions but acceptable power with ARIMA models. I design a long-term Bayesian ARIMA(4,0,1) time-series model in which an A/B-test running January-October 2017 in randomized paired 2-day blocks of ads/no-ads uses client-local JS to determine whether to load & display ads, with total traffic data collected in Google Analytics & ad exposure data in Google AdSense.

Correcting for a flaw in the randomization, the final results yield a surprisingly large estimate of -14% traffic loss if all traffic were exposed to ads (95% credible interval: -13% to -16%) and an expected traffic loss of -9.7%, exceeding the decision threshold for disabling ads and rendering further experimentation profitless.

Thus, banner ads on gwern.net appear to be harmful and AdSense has been removed. If these results generalize to other blogs and personal websites, an important implication is that many websites may be harmed by their use of banner ad advertising without realizing it.

A/B testing variants one at a time is fine as far as it goes, but it has several drawbacks that have become apparent:

fixed trials, compared to sequential or adaptive trial approaches, waste data/page-views. Looking back, it’s clear that many of these trials didn’t need to run so long.

they are costly to set up, both because of the details of a static site doing A/B tests but also because it requires me to define each change, code it up, collect, and analyze the results all by hand.

they are not amenable to testing complicated models or relationships, since factorial designs suffer combinatorial explosion.

they will test only the interventions the experimenter thinks of, which may be a tiny handful of possibilities out of a wide space of possible interventions (this is related to the cost: I won’t test anything that isn’t interesting, controversial, or potentially valuable, because it’s far too much of a hassle to implement/collect/analyze)

The topic of sequential trials leads naturally to multi-armed bandits (MAB), which can be seen as a generalization of regular experimenting which naturally reallocates samples across branches as the posterior probabilities change, in a way which minimizes how many page-views go to bad variants. It’s hard to see how to implement MABs on a static site, so this would probably motivate a shift to a dynamic site, at least to the extent that the server will tweak the served static content based on the current MAB.

MABs work for the current use case of specifying a small number of variants (eg <20) and finding the best one. Depending on implementation detail, they could also make it easy to run factorial trials checking for interactions among those variants, resolving another objection.

They’re still expensive to set up since one still has to come up with concrete variants to pit against each other, but if it’s now a dynamic server, it can at least handle the analysis automatically.
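To make the reallocation idea concrete, here is a minimal simulation of one common Bayesian MAB, Thompson sampling with a Beta posterior per variant. It is only a sketch: the text does not commit to a particular bandit algorithm, the binary converted/not-converted reward is an assumption, and `true_rates` are hypothetical conversion probabilities used solely to simulate visitors.

```python
import random


def thompson_bandit(true_rates, page_views, seed=0):
    """Simulate Beta-Bernoulli Thompson sampling over page variants.

    true_rates are hypothetical per-variant conversion probabilities,
    used only to simulate visitors.
    """
    rng = random.Random(seed)
    wins = [0] * len(true_rates)
    losses = [0] * len(true_rates)
    for _ in range(page_views):
        # Draw one plausible conversion rate per variant from its Beta posterior
        # (uniform Beta(1, 1) prior), then serve the variant with the best draw.
        draws = [rng.betavariate(1 + w, 1 + l) for w, l in zip(wins, losses)]
        arm = max(range(len(draws)), key=draws.__getitem__)
        if rng.random() < true_rates[arm]:
            wins[arm] += 1
        else:
            losses[arm] += 1
    return [w + l for w, l in zip(wins, losses)]  # page-views served per variant
```

Calling `thompson_bandit([0.10, 0.30], 5000)` returns how many simulated page-views each variant received; as evidence accumulates, the weaker variant is served less and less often, which is the saving over a fixed 50/50 split.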

MABs themselves are a special case of reinforcement learning (RL), which is a family of approaches to exploring complicated systems to maximize a reward at (hopefully) minimum data cost. Optimizing a website fits naturally into a RL mold: all the possible CSS and HTML variants are a very complicated system, which we are trying to explore as cheaply as possible while maximizing the reward of visitors spending more time reading each page.

To solve the expressivity problem, one could try to equip the RLer with a lot of power over the CSS: parse it into an AST, so instead of specifying by hand 100% vs 105% in a CSS declaration like div#sidebar-news a { font-size: 105%; }, the RLer sees a node in the AST like (font-size [Real ~ dnorm(100,20)]) and tries out numbers around 100% to see what yields higher conversion rates. Of course, this yields an enormous number of possibilities and my website traffic is not equally enormous. Informative priors on each node would help if one was using a Bayesian MAB to do the optimization, but a Bayesian model might be too weak to detect many effects. (You can’t easily put in interactions between every node of the AST, after all.)
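A rough sketch of what such a tunable node might look like: each adjustable value carries a prior and is sampled when the CSS is rendered. Everything here is hypothetical (the `RealParam` class, the dictionary standing in for a real CSS AST, and `render`); it only illustrates the (font-size [Real ~ dnorm(100,20)]) idea and does not parse actual CSS.

```python
import random
from dataclasses import dataclass


@dataclass
class RealParam:
    """A tunable numeric CSS value with a normal prior, like dnorm(100, 20)."""
    mean: float
    sd: float
    unit: str

    def sample(self, rng):
        return rng.gauss(self.mean, self.sd)


# Hypothetical one-rule "AST": selector -> property -> fixed string or RealParam.
SKELETON = {"div#sidebar-news a": {"font-size": RealParam(100, 20, "%")}}


def render(skeleton, rng):
    """Fill every RealParam with a draw from its prior and emit CSS text."""
    rules = []
    for selector, declarations in skeleton.items():
        body = []
        for prop, value in declarations.items():
            if isinstance(value, RealParam):
                value = f"{value.sample(rng):.0f}{value.unit}"
            body.append(f"{prop}: {value};")
        rules.append(f"{selector} {{ {' '.join(body)} }}")
    return "\n".join(rules)
```

`render(SKELETON, random.Random())` emits a rule of the form `div#sidebar-news a { font-size: N%; }` with N drawn around 100; an optimizer would replace the blind prior draw with whatever value it currently wants to explore.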

In a challenging problem like this, deep neural networks come to mind, yielding a deep reinforcement learner (Q-learning) - such a system made a splash in 2013-2015 in learning to play dozens of Atari games (DQN). The deep network handles interpretation of the input, and the RLer handles policy and optimization.

So the loop would go something like this:

a web browser requests a page

the server asks the RL for CSS to include

the RL generates a best guess at optimal CSS, taking the CSS AST skeleton and returning the defaults, with some fields/parameters randomized for exploration purposes (possibly selected by Bayesian optimization to maximize information gain)

the CSS is transcluded into the HTML page, and sent to the web browser

JS analytics in the HTML page report back how long the user spent on that page and details like their country, web browser, etc, which predict time on page (explaining variance, making it easier to see effects)

this time-on-page constitutes the reward which is fed into the RL and updates

return to waiting for a request

Learning can be sped up by data augmentation or local training: the developer can browse pages locally and based on whether they look horrible or not, insert pseudo-data. (If one variant looks bad, it can be immediately heavily penalized by adding, say, 100 page-views of that variant with low rewards.) Once previews have stabilized on not-too-terrible-looking, it can be run on live users; the developer’s preferences may introduce some bias compared to the general Internet population, but the developer won’t be too different and this will kill off many of the worst variants. As well, historical information can be inserted as pseudo-data: if the current CSS file has 17% conversion over 1 million page views, one can simulate 1m page views to that CSS variant’s considerable credit.
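One simple way to represent such pseudo-data, assuming the optimizer keeps a Beta posterior over each variant's conversion rate (a simplification, since the reward discussed above is time on page rather than a binary conversion), is to add the simulated page-views directly to the posterior's counts. The class and variable names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class VariantPosterior:
    """Beta posterior over a variant's conversion rate, stored as pseudo-counts."""
    conversions: float = 1.0      # Beta alpha (uniform prior)
    non_conversions: float = 1.0  # Beta beta

    def add_views(self, views, rate):
        """Credit `views` real or simulated page-views converting at `rate`."""
        self.conversions += views * rate
        self.non_conversions += views * (1 - rate)

    @property
    def mean(self):
        return self.conversions / (self.conversions + self.non_conversions)


current_css = VariantPosterior()
current_css.add_views(1_000_000, 0.17)  # historical record from the text

ugly_variant = VariantPosterior()
ugly_variant.add_views(100, 0.0)        # developer veto: 100 views, no conversions
```

The million historical views make the current CSS very hard to dislodge, while the 100-view penalty is strong enough to steer early exploration away from the vetoed variant yet small enough for real traffic to overturn.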

Parsing CSS into an AST seems difficult, and it is still limited in that it will only ever tweak existing CSS fields.

How to offer more power and expressivity to the RLer without giving it so much freedom that it will hang itself with gibberish CSS before ever finding working CSS, never mind improvements?

A powerful AI tool which could generate CSS on its own is the recurrent neural network (RNN): a NN which generates some output which gets fed back in until a long sequence has been emitted. (They usually also have some special support for storing memories over multiple recursive applications, using LSTM.) RNNs are famous for mimicking text and other sequential material; in one demo, Karpathy’s The Unreasonable Effectiveness of Recurrent Neural Networks, he trained a RNN on a Wikipedia dump in XML format and a LaTeX math book (both replicating the syntax quite well) and more relevantly, 474MB of C source code & headers where the RNN does a credible job of emitting pseudo-C code which looks convincing and is even mostly syntactically-correct in balancing parentheses & brackets, which more familiar Markov-chain approaches would have trouble managing. (Of course, the pseudo-C doesn’t do anything but that RNN was never asked to make it do something, either.) In another RNN paper, the authors trained it on Python source code and it was able to execute very simple Python programs and predict the output; this is perhaps not too surprising given the earlier Neural Turing Machines and solving the Traveling Salesman Problem (Pointer Networks). So RNNs are powerful and have already shown promise in learning how to write simple programs.

This suggests the use of an RNN inside an RLer for generating CSS files. Train the RNN on a few hundred megabytes of CSS files (there are millions online, no shortage there), which teaches the RNN about the full range of possible CSS expressions, then plug it into step 3 of the above website optimization algorithm and begin training it to emit useful CSS. For additional learning, the output can be judged using an oracle (a CSS validator like the W3C CSS Validation Service/w3c-markup-validator package, or possibly CSSTidy), and the error or reward based on how many validation errors there are. The pretraining provides extremely strong priors about what CSS should look like so syntactically valid CSS will be mostly used without the constraint of operating on a rigid AST, the RL begins optimizing particular steps, and providing the original CSS with a high reward prevents it from straying too far from a known good design.

Can we go further? Perhaps. In the Atari RL paper, the NN was specifically a convolutional neural network (CNN), used almost universally in image classification tasks; the CNN was in charge of understanding the pixel output so it could be manipulated by the RL. The RNN would have considerable understanding of CSS on a textual level, but it wouldn’t be easily able to understand how one CSS declaration changes the appearance of the webpage. A CNN, on the other hand, can look at a page+CSS as rendered by a web browser, and see what it looks like; possibly it could learn that messy layouts are bad, that fonts shouldn’t be made too big, that blocks shouldn’t overlap, etc. The RNN generates CSS, the CSS is rendered in a web browser, the rendering is looked at by a CNN… and then what? I’m not sure how to make use of a generative approach here. Something to think about.

It would be nifty if I could set up a NN to generate and optimize the CSS on gwern.net so I don’t have to learn CSS & devise tests myself; as a first step towards this, I wanted to see how well a recurrent neural network (RNN) could generate CSS after being trained on CSS. (If it can’t do a good job mimicking the average syntax/appearance of CSS based on a large CSS corpus, then it’s unlikely it can learn more useful things like generating usable CSS given a particular HTML file, or the ultimate goal - learn to generate optimal CSS given HTML files and user reactions.)

Fortunately, Karpathy has already written an easy-to-use tool char-rnn which has already been shown to work well on XML/LaTeX/C. (I was particularly amused by the LaTeX/math textbook, which yielded a compiling and even good-looking document after Karpathy fixed some errors in it; if the RNN had been trained against compile errors/warnings as well, perhaps it would not have needed any fixing at all…?)

Unfortunately, even on my i7 CPU, training is quite slow: ~3s a batch on the Tiny Shakespeare example. The important parameter here is train_loss; after some experimenting, I found that >3 means the output is total garbage, 1-2 is lousy, <1 is good, and <0.8 is very good.

With Tiny Shakespeare, the loss drops quickly at first, getting <4 within seconds and into the 2s within 20 minutes, but then the 1s take a long time to surpass, and <1 even longer (hours of waiting).

This is a toy dataset and suggests that for a real dataset I’d be waiting weeks or months. GPU acceleration is critical. I spent several days trying to get Nvidia’s CUDA to work, even signing up as a developer & using the unreleased version 7.5 preview of CUDA, but it seems that when they say Ubuntu 14.04 and not 15.04 (the latter is what I have installed), they are quite serious: everything I tried yielded bloodcurdling ATA hard drive errors (!) upon boot followed by a hard freeze the instant X began to run. This made me unhappy since my old laptop began dying in late July 2015 and I had purchased my Acer Aspire V17 Nitro Black Edition VN7-791G-792A laptop with the express goal of using its NVIDIA GeForce GTX 960M for deep learning. But at the moment I am out of ideas for how to get CUDA working aside from either reinstalling to downgrade to Ubuntu 14.04 or simply waiting for version 8 of CUDA which will hopefully support the latest Ubuntu. (Debian is not an option because on Debian Stretch, I could not even get the GPU driver to work, much less CUDA.)

Frustrated, I finally gave up and went the easy way: Torch provides an Amazon OS image preconfigured with Torch, CUDA, and other relevant libraries for deep learning.

The Torch AMI can be immediately launched if you have an AWS account. (I assume you have signed up, have a valid credit card, IP permission accesses set to allow you to connect to your VM at all, and a SSH public key set up so you can log in.) The two GPU instances seem to have the same number and kind of GPUs (1 Nvidia) and differ mostly in RAM & CPUs, neither of which are the bottleneck here, so I picked the smaller/cheaper g2.2xlarge type. (Cheaper here is relative; g2.2xlarge still costs $0.65/hr and when I looked at spot that day, ~$0.21.)

Once started, you can SSH using your registered public key like any other EC2 instance. The default username for this image is ubuntu, so:

First, to generate a decent sized CSS corpus; between all the HTML documentation installed by Ubuntu and my own WWW crawls, I have something like 1GB of CSS hanging around my drive. Let’s grab 20MB of it (enough to not take forever to train on, but not so little as to be trivial):

With 19.999M characters, our RNN can afford only <20M parameters; how big can I go with -rnn_size and -num_layers? (Which as they sound like, specify the size of each layer and how many layers.) The full set of char-rnn training options:

Some playing around suggests that the upper limit is 950 neurons and 3 layers, yielding a total of 18,652,422 parameters. (I originally went with 4 layers, but with that many layers, RNNs seem to train very slowly.) Some other settings to give an idea of how parameter count increases:

512/4: 8,012,032

950/3: 18,652,422

1000/3: 20,634,122

1024/3: 21,620,858

1024/4: 30,703,872

1024/5: 39,100,672

1024/6: 47,497,472

1800/4: 93,081,856

2048/4: 120,127,744

2048/5: 153,698,560

2048/6: 187,269,376
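The growth in these counts can be reproduced with a short formula. This sketch assumes a particular stacked-LSTM layout (one-hot character input, biased input-to-hidden and hidden-to-hidden maps into 4 gates per layer, and a biased linear decoder back to the vocabulary); that layout and the vocabulary sizes mentioned below are inferences that happen to match the listed numbers, not something stated in the text.

```python
def lstm_param_count(rnn_size, num_layers, vocab_size):
    """Parameters of a stacked LSTM character model.

    Assumed layout: one-hot input of width vocab_size; each layer has biased
    input-to-hidden and hidden-to-hidden maps into 4 gates; a biased linear
    decoder maps the top layer back to vocab_size.
    """
    gates = 4 * rnn_size
    total = 0
    for layer in range(num_layers):
        input_size = vocab_size if layer == 0 else rnn_size
        total += gates * (input_size + 1)  # input-to-hidden weights + bias
        total += gates * (rnn_size + 1)    # hidden-to-hidden weights + bias
    total += vocab_size * (rnn_size + 1)   # decoder weights + bias
    return total
```

Under this layout, `lstm_param_count(950, 3, 122)` gives 18,652,422 and `lstm_param_count(512, 4, 256)` gives 8,012,032, so the list above is consistent with a 122-character vocabulary for some runs and a 256-character one for others. The dominant term is 8 × rnn_size² per extra layer, which is why doubling the layer width roughly quadruples the count.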

If we really wanted to stress the EC2 image’s hardware, we could go as large as this:

This turns out to not be a good idea since it will take forever to train - eg after ~70m of training, still at train-loss of 3.7! I suspect some of the hyperparameters may be important - the level of dropout doesn’t seem to matter much but more than 3 layers seems to be unnecessary and slow if there are a lot of neurons to store state (perhaps because RNNs are said to unroll computations over each character/time-step instead of being forced to do all their computation in a single deep network with >4 layers?) - but with the EC2 clock ticking and my own impatience, there’s no time to try a few dozen random sets of hyperparameters to see which achieves best validation scores.

Undeterred, I decided to upload all the CSS (using the sort-key trick to reduce the archive size):

Unsurprisingly, this did not solve the problem, and with 1GB of data, even 1 pass over the data (1 epoch) would take weeks, likely. Additional problems included -val_frac’s default 0.05 and -eval_val_every’s default 1000: 0.05 of 1GB is 50MB, which means every time char-rnn checked on the validation set, it took ages; and since it only wrote a checkpoint out every 1000 iterations, hours would pass in between checkpoints. 1MB or 0.001 is a more feasible validation data size; and checking every 100 iterations strikes a reasonable balance between being able to run the latest & greatest and spending as much GPU time on training as possible.

Specifically, the loss on the validation set had exploded to 333.2351 (!). When I looked at samples from the check-pointed copy, it performed both well and poorly. th sample.lua cv/lm_css_epoch0.05_333.2351.t7 yielded:

Aside from the Unicode junk at the beginning, the output actually looks tremendously like CSS! The brackets are matched, the selectors look like selectors, and the fields are properly typed (pixels go into pixel fields, colors go into color fields, etc). If I validate the non-junk CSS part, the validator remarkably yields only 1 error, at line 52/.module-contributor h2.comment-hold-homicate.sptbed_postnames where it notes that Value Error : padding-top -24px negative values are not allowed : -24px. Considering it didn’t even finish 1 epoch, the mimicking is almost uncanny: it nails the various aspects like RGB color notation (both hex & rgba()), matching brackets, plausible-sounding identifiers (eg .scegee-category), etc. If I were shown this without any corresponding HTML, I would not easily be able to tell it’s all gibberish.

Chastened by the exploding-error problem and the mostly wasted ~26 hours of processing (7:30PM - 9:30PM the next day / $15.6), I tried a yet smaller RNN (500/2), run from 5PM-11AM (so total bill for all instances, including various playing around, restarting, generating samples, downloading to laptop etc: $25.58).

One flaw in the RNN I stumbled across but was unable to reproduce was that it seemed to have a problem with data URIs. A data URI is a special kind of URL which is its own content, letting one write small files inline and avoiding needing a separate file; for example, this following CSS fragment would yield a PNG image without the user’s browser making additional network requests or the developer needing to create a tiny file just for an icon or something:

So it’s a standard prefix like data:image/png;base64, followed by an indefinitely long string of ASCII gibberish, which is a textual base-64 encoding of the underlying binary data. The RNN sometimes starts a data URI and generates the prefix but then gets stuck continually producing hundreds or thousands of characters of ASCII gibberish without ever closing the data URI with a quote & parentheses and getting back to writing regular CSS.
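To show the structure concretely, here is a small sketch that builds such a data URI from raw bytes and wraps it in a CSS rule. The helper names and the `.icon` selector are hypothetical, and the 8-byte PNG file signature is used as a stand-in payload rather than a complete image.

```python
import base64


def png_data_uri(png_bytes):
    """Encode raw image bytes as a data URI usable inside CSS url(...)."""
    payload = base64.b64encode(png_bytes).decode("ascii")
    return f"data:image/png;base64,{payload}"


def css_background(selector, png_bytes):
    """Build a CSS rule that embeds the image inline instead of linking a file."""
    return f'{selector} {{ background-image: url("{png_data_uri(png_bytes)}"); }}'


# The 8-byte PNG signature stands in for a real image here.
example = css_background(".icon", b"\x89PNG\r\n\x1a\n")
```

A real icon would produce hundreds or thousands of base-64 characters between the prefix and the closing quote, and nothing in that run of characters signals how much longer it will go on.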

What’s going on there? Since PNG/JPG are compressed image formats, the binary encoding will be near-random and the base-64 encoding likewise near-random. The RNN can easily generate another character once it has started the base-64, but how does it know when to stop? (I know how to spell banana, I just don’t know when to stop! BA NA NA NA…) Possibly it has run into the limits of its memory and once it has started emitting base-64 and has reached a plausible length of at least a few score characters (few images can be encoded in less), it’s now too far away from the original CSS, and all it can see is base-64; so of course the maximal probability is an additional base-64 character…

This might be fixable by giving the RNN more neurons in the hope that with more memory it can break out of the base-64 trap; by training more (perhaps data URIs are too rare for it to have adequately learned it with the few epochs thus far); or by backpropagating error further in time/the sequence by increasing the size of the RNN in terms of unrolling (such as increasing -seq_length from 50). I thought improving the sampling strategy with beam search rather than greedy character-by-character generation would help but it turns out beam search doesn’t fix it and can perform worse, getting trapped in an even deeper local minimum of repeating the character A endlessly. Or of course one could delete data URIs and other undesirable features from the corpus, in which case those problems will never come up; still, I would prefer the RNN to handle issues on its own and have as little domain knowledge engineered in as possible. I wonder if the data URI issue might be what killed the large RNN at the end? (My other hypothesis is that the sort-key trick accidentally led to a multi-megabyte set of repetitions of the same common CSS file, which caused the large RNN to overfit, and then once the training reached a new section of normal CSS, the large RNN began making extremely confident predictions of more repetition, which were wrong and would lead to very large losses, possibly triggering the exploding-error killer.)

As the loss diminished to ~0.8-0.9, the sampled CSS output became even more realistic. At one point I was impressed to see that the RNN had learned to switch between minified and unminified CSS formatting. For example, above the output is unminified, but the RNN at 0.88 sometimes writes minified (following has been line-broken from a single line):

Now it’s readable and we can see the RNN has done an excellent job of still writing CSS while in minified-mode, and around this level of loss, I noticed the RNN had learned to write valid-looking URLs - fragments like background : url(/sites/all/themes/rpg_theme/images/btn/form_subscription_not-page.png) look exactly like what a human CSS programmer would write. (Unfortunately, this sample has 4 validation errors: 1 from an imbalanced bracket; 1 parse error on *zoom: 2 !important due to the asterisk, which is an old IE hack & arguably the RNN isn’t wrong; and 2 properties which don’t exist. Also in the RNN’s favor, I should note that lots of CSS in the wild will not have 0 validation errors.)

At 0.88, I also noticed the RNN was now making a valiant attempt to write comments. Bad comments, but still:

It stalwartly continued to try to write comments, slightly approximating English (even though there is not that much English text in those 20MB, only 8.5k lines with /* in them - it’s CSS, not text). Examples of comments extracted from a large sample of 0.766’s output (fgrep '/*' best.txt):

In under a day of GPU training on 20MB of CSS, a medium-sized RNN (~30M parameters) learned to produce high quality CSS, which passes visual inspection and on some batches yields few CSS syntactic errors. This strikes me as fairly impressive: I did not train a very large RNN, did not train it for very long, did not train it on very much, did no optimization of the many hyper-parameters, and it is doing unsupervised learning in the sense that it doesn’t know how well emitted CSS validates or renders in web browsers - yet the results still look good. I would say this is a positive first step.

Lessons learned:

GPUs > CPUs

char-rnn, while rough-edged, is excellent for quick prototyping

NNs are slow:

major computation is required for the best results

meaningful exploration of NN sizes or other hyperparameters will be challenging when a single run can cost days

computing large datasets or NNs on Amazon EC2 will entail substantial financial costs; it’s adequate for short runs but bills around $25 for two days of playing around are not a long-term solution

pretraining an RNN on CSS may be useful for a CSS reinforcement learner

After showing good-looking CSS can be generated from learning on a CSS corpus and mastery of the syntactic rules, the next question is how to incorporate meaning. The generated CSS doesn’t mean anything and will only do anything if it happens to have generated CSS modifying a sufficiently universal ID or CSS element (you might call the generated CSS what the average CSS looks like, although like the average man, average CSS does not exist in real life). We trained it to generate CSS from CSS. What if we trained it to generate CSS from HTML? Then we could feed in a particular HTML page and, if it has learned to generate meaningfully-connected CSS, then it should write CSS targeted on that HTML page. If a HTML page has a div named lightbox, then instead of the previous nonsense like .logo-events .show-luset .box-content li { width: 30px; }, perhaps it will learn to write instead something meaningful like #lightbox li { width: 30px; }. (Setting that to 30px is not a good idea, but once it has learned to generate CSS for a particular page, then it can learn to generate good CSS for a particular page.)

Before, creating a big CSS corpus was easy: simply find all the CSS files on disk, and cat them together into a single file which char-rnn could be fed. From a supervised learning perspective, the labels were also the inputs. But to learn to generate CSS from HTML, we need pairs of HTML and CSS: all the CSS for a particular HTML page.

I could try to take the CSS files and work backwards to where the original HTML page may be, but most of them are not easily found and a single HTML page may call several CSS files or vice versa. It seems simpler instead to generate a fresh set of files by taking some large list of URLs, downloading each URL, saving its HTML and then parsing it for CSS links which then get downloaded and combined into a paired CSS file, with that single CSS file hopefully formatted and cleaned up in other ways.

I don’t know of any existing clean corpus of HTML/CSS pairs: existing databases like Common Crawl would provide more data than I need but in the wrong format (split over multiple files as the live website serves it), and I can’t reuse my current archive downloads (how to map all the downloaded CSS back onto their original HTML file and then combine them appropriately?). So I will generate my own.

I would like to crawl a wide variety of sites, particularly domains which are more likely to provide clean and high-quality CSS exercising lots of functionality, so I grab URLs from:

(uniq --check-chars=18 is there as a hack for deduplication: we don’t need to waste time on 1000 URLs all from the same domain, since their CSS will usually all be near-identical; this defines all URLs with the same first 18 characters as being duplicates and so to be removed.)
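For readers who think more easily in Python than in coreutils, the prefix-deduplication hack can be sketched roughly as follows. This is only an illustration: the function name `dedupe_by_prefix` and the example URLs are made up, and like uniq it only collapses adjacent lines, so the list has to be sorted first.

```python
from itertools import groupby

def dedupe_by_prefix(urls, width=18):
    """Keep the first URL of each run of adjacent URLs sharing the same first `width` characters."""
    return [next(group) for _, group in groupby(urls, key=lambda u: u[:width])]

# Hypothetical input; "http://example.com" happens to be exactly 18 characters long.
urls = sorted([
    "http://example.com/a.html",
    "http://example.com/b.html",
    "http://example.org/index.html",
])
print(dedupe_by_prefix(urls))
```

The obvious weakness is the same as the shell version: a fixed prefix width treats short domains and long domains differently, so it is a cheap heuristic rather than true per-domain deduplication.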

For each URL index i in 1:n:
    download the HTML
    parse it
    pick out every `<link rel='stylesheet'>` and `<style>`
    for all linked stylesheets:
        download & concatenate into a single css
    concatenate the inline style into the single css
    write html -> ./i.html
    write css -> ./i.css
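The parsing step of that loop could look something like the sketch below, using only Python's standard library. It is not the approach I ended up with (see the discussion of uncss), and the class and function names are invented for illustration; it only finds the stylesheet URLs and inline `<style>` text and leaves the actual downloading and concatenation out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class StyleExtractor(HTMLParser):
    """Collect external stylesheet URLs and inline <style> text from one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.stylesheet_urls = []
        self.inline_css = []
        self._in_style = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        rel = (attrs.get("rel") or "").lower().split()
        if tag == "link" and "stylesheet" in rel and attrs.get("href"):
            # Resolve relative hrefs against the page URL.
            self.stylesheet_urls.append(urljoin(self.base_url, attrs["href"]))
        elif tag == "style":
            self._in_style = True

    def handle_endtag(self, tag):
        if tag == "style":
            self._in_style = False

    def handle_data(self, data):
        if self._in_style:
            self.inline_css.append(data)

def extract_styles(html, base_url):
    """Return (list of stylesheet URLs, concatenated inline CSS)."""
    parser = StyleExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.stylesheet_urls, "\n".join(parser.inline_css)

# Hypothetical page, for illustration only.
page = ('<html><head><link rel="stylesheet" href="/a.css">'
        '<style>p { color: red; }</style></head>'
        '<body><div id="lightbox"></div></body></html>')
print(extract_styles(page, "http://example.com/page.html"))
```

A real crawler would then fetch each URL in the returned list and append the results to `i.css` in document order, since order can matter for the cascade.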

Downloading the HTML part of the URL can be done with wget as usual, but if instructed to --page-requisites, it will spit CSS files all over the disk and the CSS would need to be stitched together into one file. It would also be good if unused parts of the CSS could be ignored, the formatting be cleaned up & made consistent across all pages, and while we’re wishing, JS evaluated just in case that makes a difference (since so many sites are unnecessarily dynamic these days). uncss does all this in a convenient command-line format; the downsides I noticed are that it is inherently much slower, that there is an unnecessary two-line header prefixed to the emitted CSS (specifying the URL evaluated) which is easily removed, and that uncss sometimes hangs & so something must be arranged to kill laggard instances so progress can be made. (Originally, I was looking for a tool which would download all the CSS on a page and emit it in a single stream/file rather than write my own tagsoup parser, but when I saw uncss, I realized that the minimizing/optimizing was better than what I had intended and would be useful - why make the RNN learn CSS which isn’t used by the paired HTML?) Installing:
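Killing laggard instances and dropping the two-line header can both be handled by a small wrapper. The sketch below is a generic illustration, not my actual script: the helper names are invented, and to avoid guessing at uncss's command-line interface, the demo uses short Python one-liners as stand-ins for a fast and a hung process (the header lines they print are made up too).

```python
import subprocess
import sys

def run_with_timeout(cmd, seconds):
    """Run cmd and return its stdout, or None if it fails or exceeds the time limit."""
    try:
        done = subprocess.run(cmd, capture_output=True, text=True, timeout=seconds)
    except subprocess.TimeoutExpired:
        return None  # subprocess.run kills the child before re-raising
    return done.stdout if done.returncode == 0 else None

def strip_header(css, header_lines=2):
    """Drop the first `header_lines` lines of the emitted CSS."""
    return "\n".join(css.splitlines()[header_lines:])

# Stand-ins for a well-behaved and a hung invocation (hypothetical output).
fast = [sys.executable, "-c",
        "print('/* header 1 */'); print('/* header 2 */'); print('p { color: red; }')"]
slow = [sys.executable, "-c", "import time; time.sleep(30)"]

print(strip_header(run_with_timeout(fast, 10)))
print(run_with_timeout(slow, 1))
```

Returning None for both failures and timeouts keeps the crawl loop simple: any URL whose result is None is skipped (or queued for a retry) and the loop moves on.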

TODO: once the screenshotter has finished one full pass, then you can add image harvesting to enforce clean triplets of HTML/CSS/PNG

This yields a good-sized corpus of clean HTML/CSS pairs:

ls *.css | wc --lines; cat *.css | wc --char

TODO: yield seems low: 1 in 3? Will this be enough even with 136k+ URLs? A lot of the errors seem to be sporadic, and page downloads work when retried. NYT seems to lock up uncss! Had to filter it out; too bad, their CSS was nice and complex.

Data augmentation is a way to increase corpus size by transforming each data point into multiple variants which are different on a low level but semantically are the same. For example, the best move in a particular Go board position is the same whether you rotate it by 90 degrees or 180 degrees; an upside-down or slightly brighter or slightly darker photograph of a fox is still a photograph of a fox, etc. By transforming them, we can make our dataset much larger and also force the NN to learn more about the semantics and not focus all its learning on mimicking surface appearances or making unwarranted assumptions. It seems to help image classification a lot (where the full set of data augmentation techniques used can be quite elaborate), and is a way you can address concerns about an NN not being robust to a particular kind of noise or transformation: you can include that noise/transformation as part of your data augmentation.

HTML and CSS can be transformed in various ways which textually look different but still mean the same thing to a browser: they can be minified, they can be reformatted per a style guide, some optimizations can be done to combine CSS declarations or write them in better ways, CSS files can be permuted (sometimes shuffling the order of declarations will change things by changing which of two overlapping declarations gets used, but apparently it’s rare in practice and CSS developers often write in random order), comments by definition can be deleted without affecting the displayed page, and so on.

TODO: use html5-tidy to clean up the downloaded html too? http://www.htacg.org/tidy-html5/documentation/#part_building keep both the original and clean version: this will be good data augmentation

Data augmentation:

raw HTML + uncss

tidy-html5 + uncss

tidy-html5 + csstidy(uncss)

tidy-html5 + minified CSS

tidy-html5 + shuffle CSS order as well? CSS is not fully but mostly declarative: http://www.w3.org/TR/2011/REC-CSS2-20110607/cascade.html#cascade
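A few of these transformations are simple enough to sketch directly. The code below is a deliberately naive illustration in plain Python (regexes and brace counting, not a real CSS parser, and not a replacement for csstidy or a proper minifier); all the function names are invented. It would mishandle comment markers or braces inside strings, and the shuffle drops any text after the final closing brace.

```python
import random
import re

def strip_comments(css):
    """Naively delete /* ... */ comments (would also hit comment markers inside strings)."""
    return re.sub(r"/\*.*?\*/", "", css, flags=re.DOTALL)

def minify(css):
    """Crude minification: drop comments, collapse whitespace, tighten around { } ;"""
    css = re.sub(r"\s+", " ", strip_comments(css))
    return re.sub(r"\s*([{};])\s*", r"\1", css).strip()

def split_rules(css):
    """Split CSS into top-level rules by tracking brace depth, so @media blocks stay whole."""
    rules, depth, start = [], 0, 0
    for i, ch in enumerate(css):
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                rules.append(css[start:i + 1].strip())
                start = i + 1
    return rules

def shuffled(css, seed=None):
    """Return the top-level rules in random order (may change the cascade!)."""
    rules = split_rules(strip_comments(css))
    random.Random(seed).shuffle(rules)
    return "\n".join(rules)

example = "/* c */ p {\n  color: red;\n}\n"
print(minify(example))
print(shuffled("a{x:1} @media print{b{y:2}} c{z:3}", seed=1))
```

Whitespace around `:` and `,` is left alone on purpose, since `div :first-child` and `div:first-child` are different selectors; a real minifier knows when that is safe and this sketch does not.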

http://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-with-gpus/ / http://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-2/ / http://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-3/ implementation of encoder-decoder with attention in Theano: https://github.com/kyunghyuncho/dl4mt-material/tree/master/session2 Currently, this code includes three subdirectories; session0, session1 and session2. session0 contains the implementation of the recurrent neural network language model using gated recurrent units, and session1 the implementation of the simple neural machine translation model. In session2, you can find the implementation of the attention-based neural machine translation model we discussed today. I am planning to make a couple more sessions, so stay tuned!

Neural Machine Translation by Jointly Learning to Align and Translate http://arxiv.org/abs/1409.0473 In this paper, we conjecture that the use of a fixed-length vector is a bottleneck in improving the performance of this basic encoder-decoder architecture, and propose to extend this by allowing a model to automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly. With this new approach, we achieve a translation performance comparable to the existing state-of-the-art phrase-based system on the task of English-to-French translation. Furthermore, qualitative analysis reveals that the (soft-)alignments found by the model agree well with our intuition.

Is it important in randomized testing to control for covariates, even powerful ones? A simulation using a website’s data suggests not.

In December 2013, I was discussing website testing with another site owner, whose site monetizes traffic by selling a product, while I just optimize for reading time. He argued (deleting identifying details since I will be using their real traffic & conversion numbers throughout):

I think a big part that gets lost out is the quality of traffic. For our [next website version] (still speccing it all out), one of my biggest requirements for A/B testing is that all referring traffic must be bucketed and split-test against them. Buckets themselves are amorphous - they can be visitors of the same resolution, visitors who have bought our guide, etc. But just comparing how we did (and our affiliates did) on sales of our guide (an easy to measure metric - our RPU), traffic matters so much. X sent 5x the traffic that Y did, yet still generated 25% less sales. That would destroy any meaningful A/B testing without splitting up the quality.

I was a little skeptical that this was a major concern much less one worth expensively engineering into a site, and replied:

Eh. You would lose some power by not correcting for the covariates of source, but the randomization would still work and deliver you meaningful results. As long as visitors were being randomized into the A and B variants, and there was no gross imbalance in cells between Y and X, and Y and X visitors didn’t react differently, you’d still get the right results - just you would need more traffic to get the same statistical power. I don’t think 25% difference between X and Y visitors would even cost you that much power…

…we conditioned on the user level covariates listed in the column labeled by the vector W in Table 1 using several methods to strengthen power; such panel techniques predict and absorb residual variation. Lagged sales are the best predictor and are used wherever possible, reducing variance in the dependent variable by as much as 40%…However, seemingly large improvements in $R^2$ lead to only modest reductions in standard errors. A little math shows that going from $R^2 = 0$ in the univariate regression to $R^2_{|w} = 50\%$ yields a sublinear reduction in standard errors of 29%. Hence, the modeling is as valuable as doubling the sample - a significant improvement, but one that does not materially change the measurement difficulty. An order-of-magnitude reduction in standard errors would require $R^2_{|w} = 99\%$, perhaps a nearly impossible goal.
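The "little math" in that passage can be checked in a couple of lines, assuming (my reading, not something the quote spells out) that the standard error of the treatment estimate scales with the residual standard deviation, i.e. with the square root of 1 minus R-squared. The function name is made up for illustration.

```python
import math

def se_reduction(r_squared):
    """Fractional drop in the standard error when covariates explain r_squared of the variance."""
    return 1 - math.sqrt(1 - r_squared)

for r2 in (0.40, 0.50, 0.99):
    print(f"R^2 = {r2:.2f}: standard error shrinks by {se_reduction(r2):.1%}")

# For comparison, doubling the sample size shrinks the standard error by 1 - 1/sqrt(2).
print(f"doubling n: {1 - 1 / math.sqrt(2):.1%}")
```

Under that assumption, explaining half the variance and doubling the sample give the same reduction, which is the equivalence the quote draws.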

In particular, if you lost a lot of power, wouldn’t that imply randomized trials were inefficient or impossible? The point of randomization is that it eliminates the impact of the indefinitely many observed and unobserved variables to let you do causal inference.

Since this seems like a relatively simple problem, I suspect there is an analytic answer, but I don’t know it. So instead, we can set this up as a simulated power analysis: we generate random data where we force the hypothesis to be true by construction, we run our planned analysis, and we see how often we get a p-value underneath 0.05 (which is the true correct answer, by construction).

Let’s say Y’s visitors convert at 10%, then X’s must convert at 10% * 0.75, as he said, and let’s imagine our A/B test of a blue site-design increases sales by 1%. (So in the better version, Y visitors convert at 11% and X convert at 8.5%.) We generate $\frac{n}{4}$ datapoints from each condition (X/blue, X/not-blue, Y/blue, Y/not-blue), and then we do the usual logistic regression looking for a difference in conversion rate, with and without the info about the source. So we regress Conversion ~ Color, to look at what would happen if we had no idea where visitors came from, and then we regress Conversion ~ Color + Source. These will spit out p-values on the Color coefficient which are almost the same, but not quite the same: the regression with the Source variable is slightly better so it should yield slightly lower p-values for Color. Then we count up all the times the p-value was below the magical amount for each regression, and we see how many statistically-significant p-values we lost when we threw out Source. Phew!

So we might like to do this for each sample size to get an idea of how the results change: what holds at n=100 may not hold at n=10,000. And ideally, for each n, we do the random data generation step many times, because it’s a simulation and so any particular run may not be representative. Below, I’ll look at n=1000, 1100, 1200, 1300, and so on up until n=10,000. And for each n, I’ll generate 1000 replicates, which should be pretty accurate.
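To make the procedure concrete, here is a dependency-free Python sketch of the same kind of simulation. It is not the analysis described above: to stay within the standard library it swaps the two logistic regressions for simpler stand-ins, a pooled two-proportion z-test in place of Conversion ~ Color, and a Cochran-Mantel-Haenszel test stratified by source in place of Conversion ~ Color + Source. The conversion rates are the ones from the text; the function names, seed, and replicate counts are arbitrary choices for illustration.

```python
import math
import random

def binom(rng, m, p):
    """Number of conversions among m visitors who each convert with probability p."""
    return sum(rng.random() < p for _ in range(m))

def z_test_p(s1, n1, s0, n0):
    """Two-sided p-value of a pooled two-proportion z-test (source ignored)."""
    pooled = (s1 + s0) / (n1 + n0)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n0))
    if se == 0:
        return 1.0
    z = (s1 / n1 - s0 / n0) / se
    return math.erfc(abs(z) / math.sqrt(2))

def cmh_p(strata):
    """Cochran-Mantel-Haenszel p-value over strata of (s1, n1, s0, n0), without continuity correction."""
    diff = var = 0.0
    for s1, n1, s0, n0 in strata:
        total, succ = n1 + n0, s1 + s0
        diff += s1 - n1 * succ / total
        var += n1 * n0 * succ * (total - succ) / (total ** 2 * (total - 1))
    if var == 0:
        return 1.0
    return math.erfc(math.sqrt(diff ** 2 / var / 2))

def power(n, reps=1000, seed=0, alpha=0.05):
    """Return (power ignoring source, power adjusting for source) at total sample size n."""
    rng = random.Random(seed)
    m = max(n // 4, 1)  # visitors per cell: X/blue, X/not-blue, Y/blue, Y/not-blue
    rates = {"Y": (0.11, 0.10), "X": (0.085, 0.075)}  # (blue, not-blue)
    hits_plain = hits_adjusted = 0
    for _ in range(reps):
        strata = [(binom(rng, m, blue), m, binom(rng, m, ctrl), m)
                  for blue, ctrl in rates.values()]
        s1 = sum(s[0] for s in strata)
        s0 = sum(s[2] for s in strata)
        hits_plain += z_test_p(s1, 2 * m, s0, 2 * m) < alpha
        hits_adjusted += cmh_p(strata) < alpha
    return hits_plain / reps, hits_adjusted / reps

for n in (1000, 5000):
    print(n, power(n, reps=200))
```

Because the tests differ from the regressions, the exact power figures will not match the ones discussed in this section; the point is only the structure: generate data with a known effect, analyze it both ways, and compare how often each analysis clears the significance threshold.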

So at n=1000 we don’t have decent statistical power to detect our true effect of 1% increase in conversion rate thanks to blue - only 8% of the time will we get our magical p<0.05 and rejoice in the knowledge that blue is boss. That’s not great, but that’s not what we were asking about.

Moving on to our original question, we see that the regressions controlling for source had very similar power to the regressions which didn’t bother. It looks like you may pay a small price of 2% less statistical power, but probably even less than that because so many of the other entries yielded an estimate of 0% penalty. And the penalty gets smaller as sample size increases and a mere 25% difference in conversion rate washes out as noise.

As expected, with tiny samples like 12, 22, or 32, the A/B test has essentially 0% power to detect any difference, and so it doesn’t matter if one controls for source or not. In the n=42+ range, we start seeing some small penalty, but the fluctuations from a 33% penalty to 0% penalty to 50% to 23% to 0% show that once we start nearing n=100, the difference barely exists, and the long succession of 1.0000s say that past that, we must be talking a very small power penalty of like 1%.

So let me pull up some real #s. I will give you source, # of unique visitors to sales page, # of unique visitors to buy page, # of actual buyers. Also note that I am doing it on a per-affiliate basis, and thus disregarding the origin of traffic (more on that later):

Website.com - 3963 - 722 - 293

X - 1232 - 198 - 8

Y - 1284 - 193 - 77

Z - 489 - 175 - 75

So even the origin of traffic was everywhere. X was all website, but pushed via FB. EC was email. Y was Facebook. Ours was 3 - email, Facebook, Twitter. Email converted at 13.72%, Facebook at 8.35%, and Twitter at 1.39%. All had >500 clicks.

So with that in mind, especially seeing how X and Y had the same # of people visit the buy page, but X converted at 10% the rate (and relatively to X, Y converted at 200%), I would wager that re-running your numbers would find that the origin matters.

Those are much bigger conversion differentials than the original 25% estimate, but the loss of power was so minute in the first case that I suspect that the penalty will still be relatively small.

I can fix the power analysis by looking at each traffic source separately and tweaking the random generation appropriately with liberal use of copy-paste. For the website, he said 3x500 but there are 3963 hits, so I’ll assume the remainder is his general organic website traffic. That gives me a total table:

In the most extreme case (total n=1000), where our controlled test’s power is 0.105 or 10.5% (well, what do you expect from that small an A/B test?), our test where we throw away the Source info has a power of 0.093 or 9.3%. So we lost 0.1143 or 11% of the power.

Its 14% estimate is reasonably close to 10.5% given all the simplifications I’m doing here. So, imagine our 0.139 power here was the victim of the 11% loss, and the true power is $x = 0.11x + 0.139$ where then x=0.15618. Given the p1 and p2 for our A/B test, how big would n then have to be to reach our true power?

So in this worst-case scenario with small sample size and very different true conversion rates, we would need another 178 page-views/visits to make up for completely throwing out the source covariate. This is usually a doable number of extra page-views.

What are the implications for my own A/B tests, with less extreme conversion differences? It might be interesting to imagine a hypothetical where my traffic split between my highest conversion traffic source and my lowest, and see how much extra n I must pay in my testing because I decline to figure out how to record source for tested traffic.

Looking at my traffic for the year 26 December 2012-2013, I see that of the top 10 referral sources, the highest converting source is bulletproofexec.com traffic at 29.95% of the 9461 visits, and the lowest is t.co (Twitter) at 8.35% of 15168. We’ll split traffic 50/50 between these two sources.

So our power loss is not too severe in this worst-case scenario: we lose a mean of 12% of our power, or around half.

We were examining a hypothetical conversion increase by 1% from 19.15% (mean(c(bulletP, tcoP))) to 20.15%. A regular 2-proportion power calculation (the closest thing to a binomial in the R standard library)

Its 14% estimate is reasonably close to 10.5% given all the simplifications I’m doing here. So, imagine our 0.08116 power here was the victim of the 12% loss, and the true power is $x = 0.12x + 0.08116$ where then x=0.0922273. Given the p1 and p2 for our A/B test, how big would n then have to be to reach our true power?
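The chain of reasoning here (undo the loss, then ask what n reaches the corrected power) can be sketched in Python with a normal-approximation power formula for a two-sided two-proportion test. This is an approximation written for illustration, not the R routine itself; the function names are invented, and treating n as the number of observations per group is an assumption on my part.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()

def power_two_proportions(n, p1, p2, alpha=0.05):
    """Approximate power of a two-sided two-proportion test with n observations per group."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    pooled_sd = sqrt((p1 + p2) * (1 - (p1 + p2) / 2))
    unpooled_sd = sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    return STD_NORMAL.cdf((sqrt(n) * abs(p1 - p2) - z_crit * pooled_sd) / unpooled_sd)

def undo_loss(observed_power, loss):
    """Solve x = loss * x + observed_power for x."""
    return observed_power / (1 - loss)

def n_for_power(target, p1, p2, alpha=0.05):
    """Smallest per-group n whose approximate power reaches target (needs p1 != p2, target < 1)."""
    if p1 == p2 or not 0 < target < 1:
        raise ValueError("need p1 != p2 and 0 < target < 1")
    n = 2
    while power_two_proportions(n, p1, p2, alpha) < target:
        n += 1
    return n

# Numbers from the text: 19.15% -> 20.15%, observed power 0.08116, 12% loss.
target = undo_loss(0.08116, 0.12)
n_observed = n_for_power(0.08116, 0.1915, 0.2015)
n_needed = n_for_power(target, 0.1915, 0.2015)
print(target, n_observed, n_needed, n_needed - n_observed)
```

The difference between the two n values is the extra traffic needed to buy back the power lost by ignoring source. The linear search is slow for high target powers but fine for the small ones at issue here.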

So this worst-case scenario means I must spend an extra n of 265 or roughly a fifth of a day’s traffic. Since it would probably cost me, on net, far more than a fifth of a day to find an implementation strategy, debug it, and incorporate it into all future analyses, I am happy to continue throwing out the source information & other covariates.

The loss here seems to be the average Negative Log Likelihood of each character; so a training loss of 3.78911860 means exp(-3.78911860) ~> 0.02 or 2% chance of predicting the next character. This is not much better than the base-rate of uniformly guessing each of the 128 ASCII characters, which would yield 1/128 ~> 0.0078125 or 0.8% chance. However, after a few hours of training and getting down to ~0.8, it’s starting to become quite impressive: 0.8 here translates to a 45% chance - not shabby! At that point, the RNN is starting to become a good natural-language compressor, as it’s approaching estimates of the entropy of natural human English, and RNNs have gotten close to records like 1.278 bits per character. (Which, after converting from bits to nats per character, implies that for English text about as complicated as Wikipedia, we shouldn’t expect our RNN to do any better than a training loss of ~0.89 and more realistically 0.9-1.1.)
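The conversions in this footnote are easy to reproduce. The sketch below assumes, as the footnote does, that the reported loss is a mean negative log-likelihood in nats; strictly speaking exp(-loss) is then the geometric mean of the per-character probabilities rather than a literal hit rate. The function names are made up.

```python
import math

def prob_from_loss(nats):
    """Geometric-mean per-character probability implied by a mean NLL in nats."""
    return math.exp(-nats)

def bits_to_nats(bits):
    """Convert bits per character into nats per character."""
    return bits * math.log(2)

print("initial loss:", prob_from_loss(3.78911860), "vs uniform ASCII:", 1 / 128)
print("loss of 0.8:", prob_from_loss(0.8))
print("1.278 bits per character in nats:", bits_to_nats(1.278))
```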

Several days after I gave up, Nvidia released a 7.5 RC which did claim to support Ubuntu 15.04, but installing it yielded the same lockup. I then installed Ubuntu 14.04 and tried the 14.04 version of that 7.5 RC, and that worked flawlessly for GPU acceleration of both graphics & NNs.

Eventually the Nvidia release caught up with 15.04 and I was able to use the Acer laptop for deep learning. This may not have been a good thing in the long run because the laptop wound up being bricked on 26 November 2016, with what I think was the motherboard dying, when it was just out of warranty, and corrupting the filesystem on the SSD to boot. This is an odd way for a laptop to die, and perhaps the warnings against using laptop GPUs for deep learning were right - the laptop was indeed running torch-rnn the night/morning it died.

The EC2 price chart describes it as High-performance NVIDIA GPUs, each with 1,536 CUDA cores and 4GB of video memory. These apparently are NVIDIA Quadro K5000 cards, which cost somewhere around $1500. (Price & performance-wise, there seem to be a lot of better options these days; for example, my GeForce GTX 960M seems to train at a similar speed as the EC2 instances do.) At $0.65/hr, that’s ~2300 hours or 96 days; at spot, 297 days. Even adding in local electricity cost and the cost of building a desktop PC around the GPUs, it’s clear that breakeven is under a year and that for more than the occasional dabbling, one’s own hardware is key. If nothing else, you won’t feel anxious about the clock ticking on your Amazon bill!