Testing a feature (i.e. not testing the code)
with users usually takes one of two forms: small-scale tests with
individuals or known groups, and large-scale tests with a subset of
production users. Waffle provides tools for the former and has some
suggestions for the latter.

Testing mode makes it possible to enable a flag via a querystring
parameter (like WAFFLE_OVERRIDE) but is unique for two reasons:

- it can be enabled and disabled on a flag-by-flag basis, and
- it only requires the querystring parameter once, then relies on
  cookies.

If the flag we’re testing is called foo, then we can enable testing
mode and send users to oursite.com/testpage?dwft_foo=1 (or =0), and
the flag will be on (or off) for them for the remainder of their
session.

Warning

Currently, the flag must be used by the first page they visit,
or the cookie will not get set. See #80 on GitHub.
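
Testing mode is a per-flag switch on the Flag model. A minimal sketch
of turning it on from code, assuming django-waffle’s Flag model and
the foo flag from the example above:

    from waffle.models import Flag

    # The "testing" field is waffle's per-flag testing-mode switch.
    flag, _ = Flag.objects.get_or_create(name="foo")
    flag.testing = True
    flag.save()

    # Turn testing mode off again when the test is over.
    flag.testing = False
    flag.save()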

Researchers can send a link with these parameters to anyone and then
observe or ask questions. At the end of their session, or when testing
mode is deactivated, users will fall back to normal behavior.

For a small group, like a company or team, it may be worth creating a
Django group and adding it to, or removing it from, the flag.
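
A minimal sketch, assuming django-waffle’s Flag model and Django’s
built-in auth groups (the group name feature-testers and the flag
name foo are hypothetical):

    from django.contrib.auth.models import Group
    from waffle.models import Flag

    # A hypothetical group of internal testers.
    testers, _ = Group.objects.get_or_create(name="feature-testers")

    flag = Flag.objects.get(name="foo")
    flag.groups.add(testers)     # flag is now active for group members
    # ...and later, to end the test:
    flag.groups.remove(testers)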

Large-scale tests are tests along the lines of “roll this out to 5% of
users and observe the relevant metrics.” Since “the relevant metrics”
are very difficult to define across all sites, here are some thoughts
from my experience with these sorts of tests.

Google Analytics—and I imagine similar products—has the ability to
segment by page or session variables. If you want to A/B test a
conversion rate or funnel, or otherwise measure the impact on some
client-side metric, using these variables is a solid way to go. For
example, in GA, you might do the following to A/B test a landing page:
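
A sketch of one approach, assuming waffle’s Django template tags, a
hypothetical flag named landing-v2, and GA’s classic (ga.js)
custom-variable API; newer GA versions use different calls:

    {% load waffle_tags %}
    {% flag "landing-v2" %}
      <script>
        // Tag this pageview so GA can segment on the new landing page.
        _gaq.push(['_setCustomVar', 1, 'ab-test', 'landing-v2', 3]);
      </script>
    {% else %}
      <script>
        _gaq.push(['_setCustomVar', 1, 'ab-test', 'landing-v1', 3]);
      </script>
    {% endflag %}

The final argument (3) makes the variable page-scoped; scopes 1 and 2
are visitor- and session-level, respectively.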

Similarly, you might set session- or visitor-scoped variables for
funnel tests.

The exact steps to set a variable like this, and then to create
segments and examine the data, will depend on your client-side
analytics tool. And, of course, this can be combined with other data
and further segmented if you need to.

Other times, the existing instrumentation (e.g. timers over the whole
view) can’t easily be segmented by flag. If you have enough data to be
statistically meaningful, you can still measure the aggregate impact
with the flag at a given proportion of traffic and derive the average
time for the new code.

If a flag enabling a refactored codepath is turned on for 20% of
users, and the overall average time has improved by 10%, you can
calculate that the refactored path’s average time is 50% lower!

You can use the following to figure out the average for requests using
the new code. Let \(t_{old}\) be the average time with the flag at
0%, and \(t_{total}\) be the average time with the flag at
\(p \times 100\%\). Then the average for requests using the new code,
\(t_{new}\), is:

\[t_{new} = t_{old} - \frac{t_{old} - t_{total}}{p}\]
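
This follows from writing the observed average as a mixture of the two
code paths, weighted by the rollout proportion:

\[t_{total} = (1 - p)\,t_{old} + p\,t_{new}\]

Solving for \(t_{new}\) gives the formula above.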

If you believe my math (you should check it!) then you can measure the
average with the flag at 0% to get \(t_{old}\) (let’s say 1.2
seconds), then at \(p \times 100\%\) (let’s say 20%, so \(p = 0.2\))
to get \(t_{total}\) (let’s say 1.08 seconds, a 10% improvement), and
you have enough to get the average of the new path.
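
Plugging those numbers in:

\[t_{new} = 1.2 - \frac{1.2 - 1.08}{0.2} = 1.2 - 0.6 = 0.6 \text{ seconds}\]

So the new codepath averages 0.6 seconds against the old 1.2 seconds,
which is the 50% improvement described above.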