This post is the third in our blog series on testing for digital organizers. Today I’ll be talking a bit about what an A/B test is and explain how to determine the sample size (definition below) you’ll need to conduct one.

Hey, pop quiz! Is 15% greater than 14%?

My answer is “well, kind of.” To see what I mean, let’s look at an example.

Let’s say you have two elevators, and one person at a time enters each elevator for a ride. After 100 people ride each elevator, you find that 15 people sneezed in elevator 1, and 14 people sneezed in elevator 2.

Clearly, a higher percentage of people sneezed in elevator 1 than elevator 2, but can you conclude with any certainty that elevator 1 is more likely to induce sneezing in its passengers? Or, perhaps, was the difference simply due to random chance?

In this contrived example, you could make a pretty good case for random chance just with common sense, but the real world is ambiguous so decisions can be trickier. Fortunately, some basic statistical methods can help us make these judgments.
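For readers who already have R open (it's the free tool introduced in the appendix below), here is a minimal sketch of how the elevator question could be checked. It uses R's built-in prop.test, and switching off the continuity correction is my own choice for illustration, not something this post prescribes.

```r
# Elevator example: 15 of 100 riders sneezed in elevator 1, 14 of 100 in elevator 2.
sneezes <- c(15, 14)
riders  <- c(100, 100)

# Two-sample test of equal proportions. The continuity correction is turned
# off here (an illustrative choice) so the statistic reflects the raw gap.
result <- prop.test(sneezes, riders, correct = FALSE)

result$estimate  # observed proportions: 0.15 and 0.14
result$p.value   # roughly 0.84, far above the usual 0.05 cutoff
```

A p-value that large says a one-point gap is entirely in line with what random chance produces at this sample size.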

One specific type of test for determining differences in proportions¹ is commonly called an A/B test. I’ll give a simple overview of the concepts involved and include a technical appendix with instructions on how to perform the procedures I discuss.

Let’s recall what we already said: we can perform a statistical test to help us detect a difference (or lack thereof) between the action rate in two samples. So, what’s involved?

I’ll skip over the nitty-gritty statistics of this, but it’s generally true that as the number of trials² increases, it becomes easier to tell whether the difference (if there’s any difference at all) between the two variations’ proportions is likely to be real, or just due to random chance. Or, slightly more accurately, as the number of trials increases, smaller differences between the variations can be more reliably detected.

What I’m describing is actually something you’ve probably already heard about: sample size. For example, if we have two versions of language on our contribution form, how many people do we need to have land on each variation of the contribution form to reliably detect a difference (and, consequently, decide which version is statistically “better” to use going forward)? That number is the sample size.

To determine the number of people you’ll need, there are a few closely related concepts (which I explain in the appendix), but for now, we’ll keep it simple. The basic idea is that as the percent difference between variations you wish to reliably detect decreases, the sample size you’ll need increases. So, if you want to detect a relatively small (say, 5%) difference between two variations, you’ll need a larger sample size than if you wanted to be able to detect a 10% difference.

How do you know the percent difference you’d like to be able to detect? Well, a good rule of thumb to start with is that if it’s a really important change (like, say, changing the signup flow on your website), you’d want to be able to detect really small changes, whereas for something less important, you’d be satisfied with a somewhat larger change (and therefore less costly test).

Here’s what that looks like:

Required sample size varies by the base action rate and percent difference you want to be able to reliably detect. Notice the trends: as either of those factors increases, holding all else equal, the sample size decreases.

For example, if you’re testing two versions of your contribution form language to see which has a higher conversion rate, your typical conversion rate is 20%, and you want to be able to detect a relative difference of around 5% (that is, 20% versus 21%), you’d need about 26k people in each group.
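If you'd like to see the trend as numbers, here is a rough sketch in R (the tool covered in the appendix) that builds a small grid of per-group sample sizes. The helper name n_per_group and the particular base rates and lifts are my own picks for illustration. The significance level (0.05) and power (0.8) match the values used in the appendix.

```r
# Per-group sample size for a base rate and a relative (percent) lift,
# at significance level 0.05 and power 0.8. Helper name is hypothetical.
n_per_group <- function(base_rate, relative_lift) {
  fit <- power.prop.test(p1 = base_rate,
                         p2 = base_rate * (1 + relative_lift),
                         sig.level = 0.05, power = 0.8)
  ceiling(fit$n)
}

base_rates <- c(0.05, 0.10, 0.20)  # illustrative base action rates
lifts      <- c(0.05, 0.10, 0.20)  # illustrative relative differences

# Rows are base rates, columns are relative lifts.
grid <- outer(base_rates, lifts, Vectorize(n_per_group))
dimnames(grid) <- list(base = base_rates, lift = lifts)
grid
```

Reading across a row or down a column, the required sample size shrinks as the lift or the base rate grows. The cell for a 20% base rate and a 5% lift lands near the 26k figure quoted above.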

For instructions on how to find that number, see the appendix below. Once you have determined your required sample size, you’ll be ready to set up your groups and variations, run the test, and evaluate the results of your test. Each of those will be upcoming posts in this series. For now, feel free to email info [at] actblue [dot] com with any questions!

Footnotes:

1 Note that this should be taken strictly as “proportions”. Of course, there are many things to be interested in other than the percentage of people who did an action vs. didn’t (e.g., donated vs. didn’t donate), like values of actions (e.g., contribution amounts), but for now, we’ll stick to the former.

2 I.e., the number of times something happens. For example, this could be the number of times someone reaches a contribution form.

Appendix:

Statistics is a big and sometimes complicated world, so I won’t explain this in too much detail. There are many classes and books that will dive into the specifics, but I want you to have a working knowledge of a few important concepts you’ll need to complete an accurate A/B test. I’m going to outline four closely related concepts necessary for determining your sample size, and walk through how to find this number. Even though I’m sticking to the basics, this section will be a bit on the technical side of things. Feel free to shoot an email our way with any questions; I’m more than happy to answer any and all.

Like I said, there are four closely related concepts when it comes to this type of statistical test: significance level, power, effect size, and sample size. I’ll talk about each of these in turn, and while I do, remember that our goal is to determine whether we can reject the assumption that the two versions are equal (or, in layman’s terms, figure out whether there is a real statistical difference between the two versions).

Significance level can be thought of as the (hopefully small) likelihood of a false positive. Specifically, it’s the probability that you falsely reject the assumption that the two versions are equal (i.e., claim that one version is actually better than the other, even though it’s not). When you hear someone talk about a p-value, they’re referencing a closely related concept: the p-value is the number your test produces, and you compare it against the significance level you chose in advance. The most commonly used significance level is 0.05, which is akin to saying “if there’s actually no difference, there’s a 5% chance that I’ll claim one anyway”.
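One way to build intuition for this is to simulate a pile of tests where the two versions are truly identical and count how often a test "finds" a difference anyway. The sketch below does that in R. The seed, group size, number of simulated tests, and 10% action rate are all arbitrary values I picked for illustration.

```r
set.seed(42)  # arbitrary seed so the run is repeatable

n_tests   <- 2000  # number of simulated tests
n_group   <- 1000  # people per group in each test
true_rate <- 0.10  # both groups share the same true action rate

p_values <- replicate(n_tests, {
  a <- rbinom(1, n_group, true_rate)  # actions in group A
  b <- rbinom(1, n_group, true_rate)  # actions in group B
  prop.test(c(a, b), c(n_group, n_group))$p.value
})

# Share of tests that wrongly flagged a difference at the 0.05 level.
false_positive_rate <- mean(p_values < 0.05)
false_positive_rate
```

The share should come out in the neighborhood of 0.05 or a bit below, since prop.test applies a continuity correction by default that makes it slightly conservative.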

Power is the probability that you’ll avoid a false negative. Or, said another way, the probability that if there’s a real difference, you’ll detect it. The standard value to use for this is 0.8, meaning there is an 80% chance you’ll detect it. 0.8 is by no means always the best value to choose for power, and there can be really good reasons for adjusting it, but it’s generally a good idea to change it only if you know exactly why you’re doing so. 0.8 will work for our purposes, though. Why not just pick a value of 0.9999, which is similar to saying “if there’s a real difference, there’s a 99.99% chance that I’ll detect it”? Well, that would be nice, but as you increase this value, the sample size required increases. And sample size is likely to be the limiting factor for an organization with a small (say, fewer-than-100k-member) list.
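To put rough numbers on that tradeoff, here is a small R sketch that holds everything else fixed and raises only the power. The rates (10% versus 10.5%) are the same ones used in the R command later in this appendix. The particular power values are my own picks.

```r
powers <- c(0.8, 0.9, 0.95, 0.99)  # illustrative power levels

# Per-group sample size needed at each power level, all else held equal.
n_needed <- sapply(powers, function(pw) {
  power.prop.test(p1 = 0.10, p2 = 0.105, sig.level = 0.05, power = pw)$n
})

data.frame(power = powers, n_per_group = ceiling(n_needed))
```

Each step up in power costs noticeably more people per group, which is why 0.8 is a common compromise.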

Effect Size. Of the two versions you’re testing against each other, typically you’d call one the ‘control’ and the other the ‘treatment’, so we’ll use those terms. Effect size is saying, what do you expect the proportion of actions (e.g., contributions) to be for the control, and what do you expect it to be for the treatment? The percent difference is the effect size. How this affects sample size is demonstrated in the above graph. But the whole point of running this test is that you don’t know what the two proportions will be in advance, so how can you pick those values? Well, actually, you estimate what your base action rate will be. For example, if your donation rate from an email is typically 5%, then you can use that as your base action rate. Then, for the second proportion, pick the smallest difference you’d like to be able to detect. Similarly to power, you might find yourself asking “well why wouldn’t I just pick the smallest possible difference?”. Again, the answer is that as you decrease the magnitude of the difference, the sample size you need will increase.

Finally, we have sample size, or the number of people we need to run the test on. If we have values for the above three things, we can figure out how big of a sample we need!

So how do we do that? Well, there are many ways to do it, but one of the easiest, best, and most accessible is R. It’s free, open-source, and has an excellent community for support (which really helps as you’re learning). Some might ask, “well that has a relatively high learning curve, doesn’t it? And, isn’t there some easier way to do this?” The answer to both of those questions is “maybe,” but I’ll give you everything you need in this blog post. There are also online calculators of varying quality that you can use, but R is really your best bet, no matter your tech level.

Doing this in R is actually pretty simple (and you’ll pick up another new skill!). After you download, install, and open R, enter the following command:

power.prop.test(p1 = 0.1, p2 = 0.105, sig.level = 0.05, power = 0.8)

and press enter. You’ll see a printout with a bunch of information, but you’re concerned with n. In this example, it’s about 58k. That number is the sample size for each group you’d need to detect, in this case, a 5% difference at a significance level of 0.05, a power of 0.8, and a base action rate of 10%. So, just to be certain we’re on the same page, a quick explanation:

p1: Your ‘base action rate’, or the value you’d expect for the rate you’re testing. If your donation rate is usually 8%, then p1 = 0.08.

p2: Your base action rate plus the smallest percent difference you’d like to be able to detect. If you only care about noticing a 10% difference, and your ‘base action rate’ is 8%, then p2 = 0.088 (0.08 + (0.08 * 0.10)).

Of course, your base action rate will likely be different, as will be the percent difference you’d like to be able to detect. So, substitute those values in, and you’re all set! Playing around with different values for these can help you gain a more intuitive sense of what happens to the required sample size as you alter certain factors.
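You can also run the calculation in the other direction. If you already know how many people you can put in each group, leave out power and supply n instead, and power.prop.test will report the power you'd have. In this sketch the 20,000-per-group figure is a made-up example. The rates are the same p1 and p2 as in the command above.

```r
# Suppose only 20,000 people per group are available (hypothetical figure).
fit <- power.prop.test(n = 20000, p1 = 0.10, p2 = 0.105, sig.level = 0.05)

achieved_power <- fit$power
achieved_power  # well below the 0.8 target
```

With about a third of the 58k per group the test calls for, the chance of catching a real 5% difference drops to well under half.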

We noticed something curious this week. Mitch McConnell sent an email this week that looked just like an Express Lane email, complete with “Express donate” links denoting specific amounts. And then another strange thing happened…Steve King did the same thing. Check them out:

When we stopped laughing we wrote this nice little note to them:

——————————————————–

Dear Mitch & Steve,

We’re flattered, really, that you want to use our tools in your emails. Mitch — you’re trying to run a “presidential level campaign,” and our tools are the best in the business, after all. And Steve, you’re looking to increase your national name recognition.

And we know you’re learning first hand in Kentucky and Iowa what an empowered small dollar donor base supporting an amazing Democratic candidate means.

So we get it. You’re jealous. But no, you can’t just try and steal or copy what we’ve built this last decade at ActBlue. Frankly, it’s impossible.

That’s because the most powerful thing about ActBlue is the nearly million strong community of committed Express donors. Without that community, those links you used are just links, not money makers, not flashy technology, and no use to you.

Imitations, however pale, are still flattering, so thanks. But you’re doing it wrong.

Best,

ActBlue

——————————————————–

Hilariously, if you clicked on any of Steve or Mitch’s links, they took you to a contribution form for just that amount, which is a recipe for a lot of lost money. It’s not just that they tried to copy us, it’s that they did it so badly. And they completely missed the point of why so many candidates and organizations around the country are asking people to give specific amounts right in the email.

There are over 950,000 ActBlue Express users that have saved their credit cards with us. With just one click, Express donors are powering campaigns and organizations. These days, 62% of all donations made through ActBlue are from people giving with an Express account!! That means more contributions and more funds because the less information people have to enter in, the more likely they are to give.

Mitch and Steve thought they could get that from just copying the style of Express Lane emails. Yeah, no.

First, try investing a decade in building a base of grassroots donors (unfortunately that means getting your party to actually care about people besides the Koch brothers), and then maybe your copy and paste efforts will be effective.

BTW we tested a version of this letter as a fundraising email to our list. It was a classic case of “you are not your list.” It was an email our staff really loved and had fun crafting, but our list didn’t respond as well to it as to a more traditional email talking about the momentum Democrats are gaining across the country. It’s not surprising, but it is a bit sad for us email writing nerds. And it’s a good reminder of why it’s so important to test email copy.

In this post, part of our testing blog series, I’ll talk a bit about some things you might consider testing, and—probably even more importantly—some things you might not want to test. This is all the more relevant if you’re managing a smaller list (say, fewer than 100k active members). As we’ll discuss in future posts, it takes huge sample sizes to reliably detect relatively small differences in two testing segments, so you’ll want to reserve your testing for factors that are likely to have larger effects on your goals, like subject lines, for example.

But to begin, we should be on the same page regarding why we test. It’s pretty simple. We tend to be pretty bad at guessing what will happen, so it’s often better to let data inform our decision making. For instance, when sending an email, should you go with a negative subject line like “This Republican is the worst!”, or a positive one like “Sally Jane is a great Democrat!”?

This trivial example allows us to demonstrate an important testing concept. Testing is only a tool; it’s not the final judge, nor does it say anything about the appropriateness of your content. If “This Republican is the worst!” isn’t in congruence with your campaign/organization’s messaging and mission, then you shouldn’t test that subject line, let alone use it for an email to your entire list.

So, then, assuming the subject matter is in-line with your messaging and mission, what’s something you should test, even with a small list? Subject lines could be one, but there are other things that could have a big impact on your action rates. What comes to mind first and foremost is email content.

By this I mean writing two completely different emails, whether they’re about the same thing or completely different concepts. The varied factor could be anything from your topic and theory of change to your tone and word choice. Even ostensibly similar emails—let alone drastically different ones—can yield very large differences in results. We at ActBlue, for example, regularly test at least three different fundraising emails for every one that we end up sending to our full list.

For one of our most recent email blasts, we sent four different email drafts, a couple of which were quite similar. The results? The best-performing draft brought in more than three times as many donations as the worst-performing draft! So, here’s a clear case in which performing a simple test can lead to much higher action rates, whether you’re looking for signatures on a petition or donations to your cause.

It might seem that writing three or more email drafts for every send is a bit much for a resource-constrained organizer. If that’s the case, you should still be message testing periodically, say, once a month or so. The goal here is to ascertain the biggest button-pushers for your list members. A standard example is testing the performance of an email highlighting the negative characteristics of your opponent against the performance of an email highlighting the endorsement of your candidate by a local community leader. This is a less resource-intensive way to gauge the temperature of your list and see what resonates with your list members.

So if the content of an email is something that is definitely worth testing, what are some things that small campaigns shouldn’t test? Well, anything that you expect won’t result in a large percentage difference between your test segments. For example, you certainly could test four differently colored donate buttons, but you shouldn’t.
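To make "large percentage difference" a little more concrete, here is a sketch in R (the tool we use for sample size calculations) that turns the question around. Given a fixed group size, how big does a difference have to be before you can reliably detect it? The 5,000-per-group figure and the 5% base rate are hypothetical numbers I chose for illustration. The significance level of 0.05 and power of 0.8 are the conventional defaults.

```r
# Hypothetical small list: 10,000 members split into two groups of 5,000,
# with a typical action rate of 5%.
fit <- power.prop.test(n = 5000, p1 = 0.05, sig.level = 0.05, power = 0.8)

detectable_rate <- fit$p2                         # roughly 0.063
relative_lift   <- (detectable_rate - 0.05) / 0.05

detectable_rate
relative_lift  # the same gap expressed as a percent difference
```

Under these assumptions, the treatment would need to beat the control by about a quarter before the test could reliably pick it up. A button color is unlikely to do that.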

Chances are, you won’t see a significant advantage in one of them over the others. How do I know this? Well, I can’t claim 100% certainty (nor can any honest analyst), but whenever we at ActBlue or some of our larger partners have tested something very small like this, we’ve seen that result.

For example, we wanted to run an A/B test¹ on our contribution form to find out whether we could increase the conversion rate by removing the header, “Employment Information” above two of the FEC-required fields. To see what that looked like (and for some more A/B test examples), check out this blog post. We knew that it would take us close to 150,000 page views to reliably detect the small percentage difference in the two segments of the test we required to make a permanent change to our contribution form. I’ll talk more about determining required sample size in a later post, but for now, the point is that it took a lot to get a little.² If you manage a smaller list, that means sending dozens of emails for a relatively minor gain, and that’s not worth your time.

Of course, context matters a lot, and in this case, context is your email program and your members. So, the final word is that if you really, really want to know, you should indeed test something for yourself instead of taking someone else’s word for it. But you’re much better off focusing on testing more meaningful factors (like your messaging) that are likely to result in clear and large differences. For the small things, you can learn from the organizations that have the resources to test small nuances. If you subscribe to numerous email lists, you’ll get a good gauge of what community best practices are at a given time.

Testing one email draft against another tells you exactly one thing: which (if either) is better. It doesn’t, however, tell you some things that can be quite valuable: Do members of your list tend to prefer positive emails or scare-to-action emails? Do they tend to respond well to fun, edgy language or slightly more formal language?

One A/B test won’t provide much of an answer to questions like that, but repeatedly testing two different email styles—like short, punchy emails against longer, more descriptive emails—over time can help you understand the style of communication your list members prefer, and therefore help you write emails with better action rates.

As you go on and develop your testing program, examining other questions like how much money to ask for in a fundraising email, how to best segment your list, and so on becomes more important and makes more sense from a cost-benefit perspective, too.

But to start, remember: make sure what you’re testing fits in with your organization’s messaging, plan a test that has a plausible chance of realizing big gains, and, more than anything else, work on honing your messaging. You’ll need to start out with bigger questions—and, therefore, more general tests—about your list members and eventually narrow down to the specifics.

The next post in our series about testing will talk about some essential factors involved in setting up a test, like setting up your groups and determining your required sample size. Expect that one to be published next week, after Netroots Nation.

Footnotes:

1 “A/B test” is an informal term for statistically testing two variations of some singular factor against each other in order to determine which, if either, is better for your desired outcome.

2 We have millions of people land on our contribution forms each month, so for us, there’s a huge payoff to testing minor details that result in small percentage-point gains. It’s thousands of tests like this one over the years that make our contribution forms so successful. But this is our context; running a testing program with a small list is a totally different game.

Recurring pledges are like gold. There’s a reason why they’re often called sustaining contributions. Building a base of recurring donors can have a huge impact on the sustainability of any organization, including campaigns.

And now we’re making it easier for you to raise more long-term recurring contributions. Introducing: infinite recurring!

You’ve got a choice: ask people for a recurring contribution for a defined number of months (old standard), or ask them for one with no expiration date (new!). You can also choose not to have a recurring option, but we don’t recommend it (I’ll explain later).

Here’s how you do it: Go to the edit page of any contribution form. Scroll down till you see this:

Click on it to expand. It’ll look like this:

Select your radio button and then scroll down and hit submit. Yep, that’s it.

ActBlue got its start helping candidates raise money for their campaigns, which are built in two-year cycles, so we allowed folks to set up recurring contributions for up to 48 months. The assumption was that donors would feel more comfortable signing up for a recurring contribution that would be sure to end at some point. These days, more and more organizations that are around cycle after cycle are using ActBlue. Plus, the way people use credit cards has changed and we have a whole system to let you extend/edit/add a new card to your recurring contribution, complete with prompts from us. It doesn’t make a ton of sense to have time-limited recurring contributions anymore.

So we tested it. Would forms with an infinite recurring ask perform as well as (or better than) forms with a set number of months? AND would you raise more money if you didn’t have a recurring ask on the form, but asked people with a pop-up recurring box after their contribution was submitted?

We’ve got some answers. Several committees have run tests, confirming that conversion rates on time-limited forms and infinite recurring forms are similar. So if you’re around longer than election day, go ahead and turn on infinite recurring.

Generally speaking, making a form shorter and giving people fewer options leads to higher conversion rates. So theoretically, taking the recurring option off of a form should lead to more donations. We have a pop-up recurring box that campaigns can turn on to try and persuade a one-time donor to make their donation recurring, and there seemed to be a reasonable chance that having no recurring ask on the form would raise more money.

Nope! Turns out that we got a statistical tie on conversion rates between having the recurring option on the form or off. Just having pop-up recurring turned on did not generate as many recurring contributions as having it both on the form and as a post-donation action.

There were slightly more contributions processed on forms without a recurring option, but not enough to generate a statistically significant result. Add to that the lost revenue from having fewer recurring donations, and you end up with a pretty clear takeaway: leave the recurring option on the form. Sure, you can turn off the recurring option, but you’ll likely lose money. And nobody wants that.

That’s why recurring contributions have been on every ActBlue contribution form since the beginning. These days we run anywhere from 8-14% recurring, and over $11 million is pledged to thousands of campaigns and organizations.

There is one big question we haven’t answered yet: will you raise more money overall from an infinite recurring contribution than, say, one with a 48-month expiration date? We’re currently working on a long-term experiment to test exactly that.

The answer might seem self-evident, but the truth is nobody really knows. Credit cards expire and people cancel their pledges. You never know for sure how much money you’ll raise from a recurring contribution, but if you pay attention to your long-term data, you’ll be able to figure out your pledge completion rate.

If you’re interested in figuring out a recurring donor strategy, we’re more than happy to give you some (free) advice. Just drop us a line at info@actblue.com.

We on the left have done a great job cultivating a “test, test, test” ethos, and while testing can result in big gains, it takes time and resources that digital organizers often don’t have. And for those working with a smaller list (say, fewer than 100k members or so), the challenges are even greater.

Don’t be discouraged, though. Anyone can run an effective testing program; you just need to be aware of your organization’s specific circumstances. For instance, if you have a small list, it’s important to know that there are actually a lot of things that you shouldn’t test (more on this to come in future posts).

To help you get on track toward developing a strong testing program, we’re going to publish a series of blog posts, each focused on a particular aspect of digital testing for small organizations. We’ll be talking about anything from tools and techniques to case studies and getting buy-in from your supervisor.

If there are any specific issues you’d like to see addressed in this series of testing blog posts, please reach out! An email to info@actblue.com with a subject line “ActBlue Blog: Testing” will be directed my way.

Our mission has been to make your lives easier this campaign season, so you can spend more time connecting with supporters. We rolled out a new refund feature that will do just that. Now you can issue your own refunds for your campaign or organization, as long as we haven’t sent you a check with that money yet.

As always, we’re ready and willing to handle your donors’ questions and refunds in a timely manner, but this feature gives you the option to issue the refund yourself if it suits your needs.

Now, if someone contacts you directly for a refund, you can feel free to take care of that while you’re on the phone with them. If you know a particular donor is over their contribution limit, or ran a donor card incorrectly, you can handle that refund right away in-house.

Making a refund

One of our other favorite new features, the search function, makes the whole refund tool possible. Navigate to the search tab of your Dashboard, fill in the donor information that you have available, and click search. Once you’ve found the right contribution, you can click on the associated date to open up all the contribution information.

If the contribution is eligible for a refund, you’ll find a drop-down menu at the bottom of the screen where you can choose the reason for your refund and process the refund.

If the contribution has already been disbursed in a check, you’ll see a message prompting you to contact ActBlue Customer Service to obtain a refund.

You have a contribution you can’t refund yourself? Have a question for us? Would like us to handle a refund? Let us know!

Just drop us a line at info [at] actblue [dot] com and we’d be happy to help.

The last day of June started off a bit slow compared to other monster end of quarter days. The question—is $3 million possible?—loomed in the office.

But then in a 3-hour period, from 1 to 4 PM EDT, we handled almost a quarter (23.5%) of the day’s total volume of $3.7 million(!!). Campaigns were waiting on the two major Supreme Court decisions to be announced before sending emails. The day’s narrative switched from a typical EoQ rush to the deadline to a more politicized one.

That $3.7 million came from over 85K contributions. On a single, gigantic last day of Q2 2014, we broke our own records for total dollar amount and number of contributions!

And those aren’t the only record shattering numbers from yesterday:

Of our all-time 15 busiest hours, 10 were on 6/30/14

Most contributions in an hour (7,046) from 10 to 11 PM

Most contributions in a minute (153) at 10:30 PM

We’re constantly upgrading our powerful tools to handle a ton of traffic, which makes days like these possible. At one point, we handled 655 contributions in 5 minutes! Our technical team works incredibly hard so we can offer Democrats the most reliable and secure software available. We’re ready for the surge of donations this fall.

You’ll notice in the graph above that there’s a massive spike in contributions at 4AM for each day. As you’d expect, it’s not because lots of people check their emails at that time. We get recurring contributions processed during the morning’s quiet hours, so campaigns wake up to bigger balances, ready to start the day. Overall, the number of recurring contributions has steadily increased as organizations are realizing the value of a steady stream of money.

June’s final tally? $20,513,475, our third biggest month ever. 541,427 contributions were made to 2,153 candidates, committees, and organizations, with an average donation of $37.89. Impressively, mobile donations accounted for more than a quarter (25.2%) of contributions this month.

                 June ’11      June ’12       June ’13      June ’14
Contributions    57,664        268,794        186,139       541,431
Volume ($)       $3,850,081    $11,624,120    $9,052,454    $20,513,558
Mean Donation    $66.77        $43.25         $48.63        $37.89
Committees       862           1,866          1,219         2,153

Compared to June 2012, the total number of contributions more than doubled, while the volume increased by about 76%. The difference there is the decrease in the average donation size ($43.25 in 2012 to $37.89 in 2014). Chalk that up to growth in small dollar donations, which means campaigns and organizations are expanding their base. Nothing like an increase in grassroots support for Democrats!
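For anyone who wants to check the arithmetic behind that comparison, here is a quick sketch in R using the contribution counts and dollar volumes from the table. The data frame and column names are my own. The mean donation is simply volume divided by contributions.

```r
# June figures as reported in the table above.
june <- data.frame(
  year          = c(2011, 2012, 2013, 2014),
  contributions = c(57664, 268794, 186139, 541431),
  volume        = c(3850081, 11624120, 9052454, 20513558)
)

# Mean donation = total volume / number of contributions.
june$mean_donation <- round(june$volume / june$contributions, 2)

y12 <- june[june$year == 2012, ]
y14 <- june[june$year == 2014, ]

contribution_ratio <- y14$contributions / y12$contributions  # about 2.01
volume_ratio       <- y14$volume / y12$volume                # about 1.76

june
```

Contributions grew faster than dollars, which is exactly what a falling average donation looks like.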

Just a quick glance at a graph of this month’s contributions day by day shows how productive the final days of a quarter can be. Five out of the last seven days of June were over $1 million. Even the significant fundraising bump from Cantor’s defeat on June 10 could not compare to the high-stakes pressure of an EoQ week.

This was an important quarter for candidates and organizations gearing up for November, and the ActBlue donor community came out in full force: they gave over 1.1M contributions, totaling $46M, to 2,810 different candidates, committees, and organizations. Despite the busiest months of the election cycle lying ahead of us, we’ve already raised more money than in any previous cycle! There are still 4 months and 2 days left until the November elections, and all signs are pointing to a massive final push.

Part of what’s driving the push is that more and more campaigns and organizations are using ActBlue’s Express Lane program, our one-click donation system. They’re already seeing results: Express users were responsible for 59.9% of June’s contributions.

We’ve now got over 933K Express users. And thanks to the madness of EoQ, there were 45K new sign ups over this past week alone.

Looking at our numbers from June and 2014’s second quarter, it’s clear that this election year is going to get even busier. What dog days of summer? This cycle doesn’t look like it’ll slow down.