Abstract

We study a voluntary contributions mechanism in which punishment may be allowed, depending on subjects' voted rules. We found that out of 160 group votes, even when groups had no prior experience with unrestricted punishment, no group ever voted to allow unrestricted punishment and no group ever allowed punishment of high contributors. Over a series of votes and periods of learning we found a distinct reluctance to allow any punishment at the beginning, with a gradual but clear evolution toward allowing punishment of low contributors. And groups allowing punishment of only low contributors achieved levels of cooperation and efficiency that are among the highest in the literature on social dilemmas. © 2008 Elsevier B.V. All rights reserved.

0. Introduction

Organizations such as teams, firms, and military units depend on cooperative effort to succeed, and organizational leadership often attempts to increase cooperative contributions and/or reduce free riding by instituting rewards and sanctions and by building a culture or norms of cooperation. Problems of cooperation (for example, in efforts to limit greenhouse gases or depletion of fisheries) and free riding in efforts to provide public goods share a common characteristic: incentives for the individual that lead to inefficiency in the group. Such problems are often called social dilemmas and have been the focus of numerous studies using the method of the laboratory decision-making experiment. In one key social dilemma experiment, Ostrom et al. (1992) found, for a model of overuse of a commons, that allowing face-to-face communication and allowing the subjects to sanction (punish) each other led to a significant increase in cooperative behavior. In another influential experiment, Fehr and Gächter (2000a) found that, in a voluntary contributions mechanism¹ (VCM), the opportunity for punishment had a dramatic positive effect on contributions, but this finding did not extend to average efficiency. In both experiments, punishment was made possible by allowing a subject to pay out of his/her earnings to reduce by a larger amount the earnings of another. Since punishment is costly to both the punisher and the punished, it was not surprising to observe that punishment had a less positive effect on efficiency than on contributions in VCMs or overuse in commons problems.

But at the same time practically everyone who studied the role of punishment noticed a curious phenomenon. While most punishment was targeted at low contributors in VCMs (and overusers in commons problems), a considerable amount of punishment was targeted at cooperators (high contributors in VCMs, low extractors in commons problems). The frequency of punishing high contributors in VCMs was too high to be explained as mistakes. Cinyabuguma et al. (2006) estimated that about 15% of punishment in several experiments² of this type was targeted at the highest contributor in a group, and about 25% at those who contributed more than their group's average. Researchers suggested several possible explanations: for example, revenge, harming others more than oneself to win relatively (tournament style), and moral resentment.³ These possible explanations suggested multiple preference types, including other-directed preferences (revenge, etc.) in addition to the self-interested preference for maximizing earnings found in most economic models. It seemed to us that the phenomenon of punishing high contributors in VCMs was more frequent than commonly recognized and likely to have adverse effects on contributions and efficiency. In practical life, if decentralized punishment of high contributors by resentful free riders has comparably high frequencies, it would be a serious problem.⁴ We called the punishment of high contributors "perverse punishment" because of its seeming inconsistency with self-interested earnings maximization.

☆ The research reported here was supported by N.S.F. Grant SES-0001769. We are indebted to two anonymous referees for helpful comments and suggestions.
* Corresponding author. Tel.: +1 401 863 3837; fax: +1 401 863 1970. E-mail address: Louis_Putterman@Brown.Edu (L. Putterman).
¹ The basic voluntary contributions mechanism without punishment is a particularly sharp social dilemma, in which each individual maximizes his payoff when others contribute their full endowments but he himself contributes nothing; yet when everyone contributes nothing, efficiency is minimized.

Because the directing of a significant fraction of punishment at high contributors appears to limit the usefulness of decentralized punishment as a mechanism or institution, we asked whether the problem might be corrected if groups of individuals were provided with the opportunity to choose their own rules governing the application of punishment. We conducted an experiment in which rules determining who can be punished are chosen by a series of votes, in order to see how the choices of rules evolved over time and how these choices affected cooperation and efficiency. In our experiment, subjects voted on three ballot items determining independently whether group members could reduce the earnings of low (below average), of average, and of high (above average) contributors to their group account (public good). We found that out of 160 group votes, no group ever voted to allow punishment of high contributors. Over a series of votes and periods of learning we found a distinct reluctance to allow any punishment at the beginning, with a gradual but clear evolution toward allowing punishment of low contributors. And groups adopting this rule of controlling perverse punishment achieved levels of contributions and efficiency that are among the highest in the literature on social dilemmas. Our main contributions are: to show how rules of punishment can evolve endogenously to address free rider problems, within the opportunities of institutional choice presented to the experimental subjects; and to show that perverse punishment can have strong negative effects on contributions and efficiency but is amenable to group control. These contributions, listed more specifically in Results 1-4, are based on the observed behaviors in the experiment and rely on direct counts or non-parametric tests using fully independent observations at the group or session level.
Toward the end of the results section, we also discuss regressions estimated using individual-level observations, here using group and period fixed effects to partially address the possible interdependence among observations. The paper is organized as follows. Section 1 reviews the theoretical outlook that informs our own and related research, then discusses the related literature. Section 2 presents the experimental design, Section 3 presents the analysis, and Section 4 discusses interpretative issues.

² In particular, Fehr and Gächter (2000a), Page et al. (2005), and Bochet et al. (2006).
³ A low contributing individual may be made uncomfortable by a high contributor's action, feel moral resentment and want to get even by punishing the high contributor. An experimental subject gave us this explanation in a debriefing statement.
⁴ Cinyabuguma et al. find support for the idea that most punishment of high contributors by low ones may reflect retaliatory motives. For an experiment on retaliatory punishment, see Nikiforakis (2008). Recently, the on-line auction site eBay announced a clamp-down on "tit-for-tat" feedback to prevent sellers from leaving negative feedback on buyers. "Today, the biggest issue with the system is that buyers are more afraid than ever to leave honest, accurate feedback because of the threat of retaliation," explained eBay North America president Bill Cobb in his January 29, 2008 announcement (Bangeman, 2008).

1. Theoretical intuitions and literature

1.1. Theory

Several social dilemmas have an iterated dominant strategy equilibrium, which implies a unique Nash equilibrium without any cooperation. The finitely repeated prisoner's dilemma (Kreps et al., 1982), the centipede game (McKelvey and Palfrey, 1992), and VCMs are examples having a unique Nash equilibrium with no cooperation. (One of the assumptions that lead to this result is that of a single preference type of payoff maximizers, all of whom believe that all the players are payoff maximizers.) Kreps et al. found this equilibrium result disturbing because many experiments on the prisoner's dilemma showed a pattern of substantial cooperation. A little later, McKelvey and Palfrey (1992) developed an exponential version of the centipede game for which there are large benefits of cooperation, a unique Nash equilibrium with no cooperation, and substantial cooperation in experimental observations. McKelvey and Palfrey thought the centipede game was an even simpler and more compelling example of the Nash equilibrium's predictive failure than is the prisoner's dilemma.

In response to the Nash equilibrium's predictive failure under the assumption of payoff maximizing as the only preference type, Kreps et al. and McKelvey and Palfrey modeled the two social dilemmas as (different) games of incomplete information with multiple preference types. Kreps et al. used two types: payoff maximizers and tit-for-tat players. McKelvey and Palfrey used two types: payoff maximizers and altruists. With multiple types and incomplete information, iterated dominance is no longer implied. Instead, the researchers solved for Bayes-Nash equilibria that more accurately predicted substantial cooperation until near the end of the game, as observed experimentally.

It is easy to check that for the VCM with a punishment opportunity and voting in our experiment, under the assumption of payoff maximizers as the single type, iterated dominance implies a Nash equilibrium predicting no cooperation and no punishment (and any voting pattern, including 100% abstentions). But this implication no longer holds when there are multiple preference types. This non-implication is suggestive because in numerous experiments researchers found substantial contributions in finitely repeated VCMs without punishment (see Davis and Holt, 1993; Ledyard, 1995, for surveys). And in VCM experiments with punishment but without voting, Fehr and Gächter (2000a, b, 2002), Carpenter and Matthews (2002), Masclet et al. (2003), Page et al. (2005), and Sefton et al. (2002) found substantial contributions and substantial punishment. These studies and the non-implication suggest the presence of multiple preference types in our experiment and other VCMs. Comparison with the prisoner's dilemma, the centipede game, and other Bayesian games points toward several predictions.
Payoff maximizers are likely to mimic cooperators to encourage their cooperation, because this is a reasonable strategy for increasing their payoffs. Cooperators are likely to punish low contributors because they dislike free riding (see Gintis et al., 2005), and this signals and warns free riders to contribute more. Perverse punishers appear, however, to be the opposite of cooperators. Fehr and Gächter (2000a, b) interpreted their results primarily in terms of the interaction of two preference types: purely selfish players (what we have called payoff maximizers) and a conditional cooperator type (see also Hoffman et al., 1998). Fischbacher et al. (2001) and Fischbacher and Gächter (2006) used a strategy method protocol to estimate that about 50% of those in their subject pools were of this second type.⁵ Further, punishment of high contributors, observed by Gächter and Herrmann (2005), Gächter et al. (2005), and Cinyabuguma et al. (2006), suggests that when punishment is an available option, the presence of a third type, whom we call perverse punishers, should also be taken into account. Based on the work mentioned above, we expected perverse punishers to account for not more than 25% of our subjects.⁶ A word of caution: we believe that these preference types are somewhat stylized interpretations rather than sharply fixed, non-overlapping characteristics. With this in mind, the interaction of the three types in our experiment leads intuitively to predictions regarding voting. It seems likely that conditional cooperators would vote to allow punishment of low contributors and prohibit punishment of high contributors, and payoff maximizers might vote similarly.⁷ It also seems likely that perverse punishers would vote to allow punishment of high contributors. But being in a minority, they would likely be outvoted, although by chance they might form a majority in a few out of a large number of randomly formed groups.
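
To give a rough sense of how often such a chance majority might arise, the following Python sketch computes a binomial tail probability. It is an illustration only, not part of the original analysis: it assumes that group members are drawn independently, that each is a perverse punisher with probability 0.25 (the upper bound mentioned above), that everyone votes, and that only perverse punishers vote Yes on punishing high contributors. The function name `prob_type_majority` is hypothetical.

```python
from math import comb


def prob_type_majority(share, group_size=4):
    """Chance that strictly more than half of a randomly formed group is of one type.

    Assumes (for illustration only) that members are drawn independently
    and that each is of the type with probability `share`.
    """
    # In a group of four, 3 members are needed: a 2-2 tie does not pass a ballot item.
    needed = group_size // 2 + 1
    return sum(
        comb(group_size, k) * share**k * (1 - share) ** (group_size - k)
        for k in range(needed, group_size + 1)
    )


# Upper bound from the text: perverse punishers are at most 25% of subjects.
p_group = prob_type_majority(0.25)
# The two designs together used 40 groups (10 sessions x 4 groups).
expected_groups = 40 * p_group
print(f"P(majority in one group) = {p_group:.4f}")
print(f"Expected such groups out of 40 = {expected_groups:.2f}")
```

Under these assumptions the chance is about 5% per group, or roughly 2 of 40 groups. Because 25% is an upper bound on the share of perverse punishers, this is best read as an upper bound on the prediction.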
Considering multiple preference types has been useful in explaining results in a large number of basic VCMs and VCMs with punishment. But VCMs are more complicated than the prisoner's dilemma or the centipede game, and to our knowledge, solving even the basic VCM for Bayes-Nash equilibria has so far been intractable. We attempt here only to use the intuitions developed above to guide interpretation of observed behaviors, hopefully contributing both to a practical understanding of social dilemmas and to future refinements of theory.

⁵ In a different experimental setup, a VCM with endogenous group formation, Page et al. (2005) estimated a 59% proportion of conditional cooperators.
⁶ When subjects from a population with this rough demography of types are randomly assigned to play a VCM in small groups, the groups may differ from one another in cooperation levels due to random differences in which types are represented and with what frequencies. Önes and Putterman (2007) grouped together on the one hand subjects displaying more cooperative behaviors and on the other hand subjects displaying less cooperation and more perverse punishing. They found, predictably, that the former achieved higher contributions and earnings than the latter.
⁷ Incentives in voting of course differ from those in a private action. For example, a payoff maximizer may prefer free riding to contributing, but at the same time find it in his interest to vote to allow punishment of low contributors. In his calculation he may believe that under such a rule he would lose the benefit from his own free riding, but be more than compensated by many erstwhile free riders who will contribute more in response to the threatened punishment of free riding. And in a population of mixed preference types, a payoff maximizer's calculation of the net advantage from the rule depends on his beliefs about whether there will be a sufficient number willing to punish free riders and make the threat of punishment effective.

1.2. Related literature

While our paper is the first to directly address effects of perverse punishment by allowing or prohibiting intermediate restrictions on punishment, there are related papers on the endogenous choice of institutional rules that allow or prohibit punishment altogether, or exogenously affect the role of punishment. Gürerk et al. (2005, 2006) designed two experiments that allowed subjects to "vote with their feet" in choosing between two groups, one allowing unrestricted punishment and the other no punishment. Subjects initially avoided the group with punishment, but with repeated opportunities to choose, almost all eventually chose the group with punishment, as a result achieving high contributions and efficiency. Their experiments differ from ours in that their subjects choose groups with either no punishment or unrestricted punishment, while our subjects have fixed groups and vote over alternative restrictions on punishment.

Botelho et al. (2005) designed an experiment that allowed subjects to choose between an institution with unrestricted punishment and another without any punishment. They found that the subjects voted overwhelmingly for the institution without punishment. In a related experiment, Sutter et al. (2005) found that subjects most often voted to allow rewards rather than punishment even though the latter raised contributions more. These experiments differed from ours by allowing only one vote for each group, and by not allowing choices of partially restricted punishment.

Botelho et al. (2005) also analyzed Fehr and Gächter's (2000a, 2002) data, finding lower earnings when punishment was allowed than when it was not allowed.⁸ In contrast, Gürerk et al. (2005, 2006) found earnings (efficiency) as high or higher in VCMs with unrestricted punishment than in VCMs without punishment opportunities. Masclet et al. (2003) also found higher earnings with unrestricted punishment compared with no punishment allowed. By varying the ratio of punishment's cost to the punisher versus the target of punishment, Nikiforakis and Normann (2008) and Egas and Riedl (2005) shed light on the conditions under which the unrestricted opportunity to punish does and does not increase efficiency.

Noting the detrimental effects of the punishment of high contributors, Cinyabuguma et al. (2006) designed a procedure they believed might reduce its incidence. The first two stages of the experiment were an ordinary VCM followed by a punishment opportunity. But in a third stage, each subject learned the frequency of each other subject's punishment of high, average, and low contributors, and each was given an opportunity to punish on the basis of this information.
The authors found that this incentive system led to less perverse punishment in the second stage, but fairly frequent perverse punishment in the third stage; for example, subjects who punished free riders in the second stage were then severely punished in the third stage, undermining the incentives in the first stages.

Gächter and Herrmann (2005) used population groupings (young rural Russians, older rural Russians, young urban Russians, older urban Russians) to study the effects of unrestricted punishment. They found large variations among the groups in frequency of punishing high contributors and the harmful effects of this perverse punishment which, they wrote, "can undermine the positive impact of punishment for cooperation and thereby limit the success of self-governance." Like Cinyabuguma et al. and our paper, Gächter and Herrmann emphasized the detrimental effect of perverse punishment on efficiency.

Casari and Luini (2005) compared effects of exogenously imposed punishment rules, including a rule requiring a subject to be targeted for punishment by at least two group members (in a group of five) before the punishment takes effect. They found that the restriction decreased punishment of high contributors and raised efficiency, but in this treatment the average contribution was quite low, not exceeding half of the endowment.

2. Design and predictions

2.1. Basic design

Our design extends the basic VCM in which subjects are randomly assigned to groups that remain fixed (a "partners" design) for a finite and known number of periods. Each subject in a group is provided with an initial endowment that he or she is asked to divide between a private account and a group account. Any funds placed in the group account are scaled up by the experimenter and divided equally among the subjects in the group without regard to individual contribution.
To this basic VCM we added punishment and voting opportunities in two designs to study how rules restricting or allowing punishment might emerge initially and evolve over a series of votes. In the experiment, individuals act anonymously and without communication.

We initially conducted a pilot experiment in which there were four partner groups with four subjects in each group. At the beginning of the 1st period, the subjects received instructions for playing a basic VCM without punishment, and each group played 10 periods of this repeated game (details of the basic VCM and its payoff function (1) are shown below). At the beginning of the 11th period the subjects received instructions for playing a VCM with unrestricted punishment, and each group played 10 periods of this repeated game (details and payoff function (2) are shown below). So far, this design is similar to Fehr and Gächter (2000a). But following these first 20 periods, each group voted on who, if anyone, could be punished in a final 10 periods (details of the ballot process are shown below). Of the four group votes, all four prohibited punishment of higher-than-average contributors; one group prohibited all punishment and the other three groups voted to allow punishment of low contributors.⁹

⁸ Cinyabuguma et al. (2004) found similar results for Fehr and Gächter (2000a) and in public goods and sanctions experiments by Carpenter and Matthews (2002), Sefton et al. (2002), Page et al. (2005), and Bochet et al. (2006). In their working paper, Cinyabuguma et al. (2004) used regression to study the impact of punishment upon changes in the punished subject's contribution, and found that each dollar of punishment of a group's highest contributor substantially decreased his or her next period contribution. The authors concluded that a major reason why punishment reduces efficiency in the experiments mentioned is the punishment of high contributors. Their calculations showed that in the related public goods and sanctions experiments by Bochet et al. and Page et al., earnings would have been higher with punishment than without it but for the presence of perverse punishment.
⁹ Due to a computer problem, the voted rules were not properly implemented; nonetheless, decisions up to and including the vote remain uncompromised, allowing us to make inferences from this pilot experiment occasionally in what follows.

Following this pilot, we wanted to see not only what rules are chosen initially but also what voting patterns would emerge with further experience. In the first of two designs, we increased the number of votes to three for each group, and correspondingly reduced the number of periods under which a voted rule governed before the next vote from 10 to 8. To keep the total number of periods at 30, we shortened the introductory experiences of VCMs with and without punishment from 10 periods each to 3 periods each. This became our 3-Vote design (see Fig. 1A).

As in the pilot treatment, subjects in the 3-Vote design were given instructions describing the basic VCM and then participated in the basic VCM (this time for 3 periods). They then received their second instructions about the opportunity of voluntary punishment, unrestricted except for some budgetary constraints (see below), and played for 3 periods under this condition, all before learning of the voting opportunities and items to be voted on. At the beginning of the 7th period, the subjects received their third instructions, which explained the voting process, and took their first vote on the rules governing who, if anyone, could be punished for the next 8 periods. At the beginning of the 15th period a second vote was taken and new rules regulating punishment were chosen. Then the subjects participated in 8 periods of the VCM with punishment (if any) governed by the second chosen rules. At the beginning of the 23rd period the third and final vote was taken, and the remaining 8 periods were conducted with possible punishment governed by this last vote. As in the pilot, we included practice exercises in each of the three sets of instructions.

Surprised to find that out of 60 group votes none allowed punishment of high contributors and that the majority of groups seemed to be converging towards allowing punishment of low contributors, we added a 5-Vote design (Fig. 1B), which differed from the 3-Vote design in that (a) there was no play, whether with or without punishment, before the determination of rules by vote, and (b) the sequences of play between votes were reduced from 8 to 6 periods, to allow for five votes and play phases in a session of similar duration. As Fig. 1B shows, the first and only instructions were given at the beginning of the experiment. They explained the basic VCM mechanism without punishment, possible rules governing punishment, and the opportunity to vote on them. Subjects then voted to allow or restrict punishment (without any hands-on experience of punishment or its restrictions). Then they participated for 6 periods in the VCM, governed by the chosen rules of punishment. At the beginning of the 7th period, the subjects voted again, and then participated in 6 periods of the VCM, governed by the chosen rules of punishment. The same process was repeated three more times, as shown in the figure. The 5-Vote design had the same number of periods (30) as the 3-Vote design.

The 5-Vote design functioned as a stress test for the results of the 3-Vote design in several ways. First, the task of learning and familiarization was harder, since the first choice of rules occurred before subjects had any experience interacting in a VCM with or without punishment. Second, the possibility that experiences such as annoyance with free riders or with receiving punishment could influence the first vote was eliminated. These differences permitted a test of whether the 3-Vote design's results were driven by its more gradual, hands-on learning. Third, in the 5-Vote design there were 100 group votes; thus, with 160 group votes in total, unanimity in prohibiting perverse punishment would be very unlikely unless there were strong factors leading in this direction.
Finally, with each group voting 5 times instead of 3, the monotone increase in votes for the rule allowing punishment of low but not high contributors would be less likely unless there were strong factors leading to this pattern.

In both the 3- and 5-Vote designs, sessions had 16 subjects assigned randomly to four groups of four subjects who remained together throughout the session. Each subject knew there were 16 subjects in the experiment room but could not tell which among the others in the session belonged to her group. Contribution and punishment choices (if any) were announced to other group members under randomly changing labels B, C, and D for one's fellow members, so that the behaviors of individuals could not be tracked from period to period, except by conjecture. A subject learned the total amount of punishment she had received, but not which group members punished her or by how much.

Just before the second and later votes of both designs, each subject was informed of the punishment rule chosen in the preceding votes of each of the four groups in their session, along with each group's average contributions and earnings during the periods the rule governed (the information was new for the most recently taken vote, and was repeated for the earlier votes). This information was included to speed the adjustment process, if there is one, and of course learning from the examples of others occurs in many real-world settings. The downside of providing this information, in terms of the number of fully independent observations, is substantial, but our main results are statistically significant in spite of this. Also, the first vote of each group remains a strictly independent observation, since no information about other groups was shared until immediately before the second vote.

2.2. Payoffs

All periods shared the same underlying structure. In each period, each subject had to decide on a division of 10 experimental dollars, in integer amounts, between a private account and a group account, before observing the choices of fellow group members. In a period, subject i earned

$$y_i = (10 - C_i) + 0.4 \sum_{j=1}^{4} C_j \tag{1}$$

where $C_i$ is i's contribution to the public account and the summation is taken over all members of i's group, including i. After all four made their decisions, each was informed of the contribution choices of the others. When punishment was permitted, it cost a subject 0.25 experimental dollars to reduce the earnings of another person by 1.00 experimental dollar. Subject i's earnings after punishment were thus

$$y_i = (10 - C_i) + 0.4 \sum_{j=1}^{4} C_j - 0.25 \sum_{j \neq i} R_{ij} - \sum_{j \neq i} R_{ji} \tag{2}$$

where $R_{ij}$ is the number of dollars by which i reduced j's earnings, and conversely for $R_{ji}$. General constraints on punishment in all treatments were: (i) a subject could not spend more than her/his pre-punishment earnings for the period on reducing the earnings of other subjects, (ii) a subject's post-punishment earnings for a period would be set to zero if earnings $y_i$ in Eq. (2) were negative, and (iii) a subject i could not spend more on reducing the earnings of a subject j in any period than would single-handedly reduce j's earnings according to Eq. (2) to less than zero.¹⁰ The Appendix shows the screen design for entering an individual's contribution and punishment decisions.

2.3. Voting

In a voting stage, each subject checked off one of three boxes (Yes, No, or No preference) beside each of three ballot items, on a screen set up as follows:

I vote to allow a person's earnings to be reduced if
(a) that person assigns less than the average amount¹¹ to the group account:   [ ] Yes   [ ] No   [ ] No preference
(b) that person assigns the average amount to the group account:   [ ] Yes   [ ] No   [ ] No preference
(c) that person assigns more than the average amount to the group account:   [ ] Yes   [ ] No   [ ] No preference
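
As an illustration of payoff functions (1) and (2) in Section 2.2, the following Python sketch computes per-period earnings for one group. It is a minimal sketch rather than the software used in the experiment: the function names are hypothetical, only the zero floor of constraint (ii) is applied, and constraints (i) and (iii) on punishment spending are assumed to be satisfied by the inputs.

```python
def vcm_payoffs(contributions, endowment=10.0, mpcr=0.4):
    """Eq. (1): y_i = (endowment - C_i) + mpcr * (sum of all contributions)."""
    total = sum(contributions)
    return [endowment - c + mpcr * total for c in contributions]


def payoffs_after_punishment(contributions, reductions, cost=0.25):
    """Eq. (2) with the zero floor of constraint (ii).

    reductions[i][j] is R_ij, the dollars by which member i reduces
    member j's earnings; diagonal entries are ignored.
    """
    base = vcm_payoffs(contributions)
    n = len(contributions)
    result = []
    for i in range(n):
        spent = cost * sum(reductions[i][j] for j in range(n) if j != i)
        received = sum(reductions[j][i] for j in range(n) if j != i)
        # Constraint (ii): negative earnings for the period are set to zero.
        result.append(max(0.0, base[i] - spent - received))
    return result


# Three full contributors and one free rider (an invented example).
contributions = [10, 10, 10, 0]
before = vcm_payoffs(contributions)

# Member 0 reduces the free rider's earnings by 4 dollars, at a cost of 1 dollar.
reductions = [[0, 0, 0, 4], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
after = payoffs_after_punishment(contributions, reductions)
print(before, after)
```

In this example the free rider earns more than each full contributor before punishment (about 22 versus 12 experimental dollars), which is the dilemma described in footnote 1. If all four contributed fully, each would earn 16, against 10 each when nobody contributes.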

In each group of four subjects, of those expressing a preference in ballot item (a), if there was a majority or tie of No votes against punishment of low contributors, then punishment of low contributors would be prohibited for the next 8 periods in the 3-Vote design and 6 periods in the 5-Vote design; and if a majority voted Yes, punishment of low contributors would be allowed; and correspondingly for ballot items (b) and (c).¹² After the vote, each group's members received a message reporting the voting outcome, which was one of $2^3 = 8$ possible punishment rules (i.e., combinations of the three ballot item choices).¹³ When a group voted to restrict punishment, a fixed zero appeared in the punishment box¹⁴ for all individuals to which the restriction applied during the punishment stages that followed each contribution stage. For example, members of a group that had voted to prohibit all punishment saw the standard punishment stage screen with fixed 0s in all the punishment boxes, indicating that no punishment was allowed in this case.

We conducted five sessions of each design using a total of 160 subjects (see Table 1).¹⁵ All of the sessions were conducted by computer in a computer lab at Brown University. At the end of each session, cumulative earnings for the 30 periods were totaled and converted to real money at the rate of 25 experimental dollars to one real dollar, and $5 was added as a participation fee. Sessions typically lasted a little less than 2 hours including instructions, and subjects' overall earnings averaged approximately $25. Instructions for both designs are similar and available in our Working Paper.¹⁶

Table 1. Numbers of groups, subjects, and votes.
Session design | Number of sessions | Number of groups in each session | Number of subjects in each group | Total number of subjects | Total number of group votes on rules
3-Vote | 5 | 4 | 4 | 80 | 60
5-Vote | 5 | 4 | 4 | 80 | 100

¹⁰ The purpose of (i) and (ii) was to keep all decisions financially independent of each other while maintaining a guaranteed minimum payment for recruiting reasons. The purpose of (iii) was to help subjects to avoid pointless spending on punishment in view of constraint (ii). Note, however, that it remained possible for subjects to "overspend" on punishing in the sense that both subject i and, say, subject k might each spend enough to reduce j's earnings for the period to zero, although only one subject's punishment would actually be effective in that case, given (ii). This could happen because subjects did not learn of punishment not carried out by or aimed at them, and the design (as in Fehr and Gächter, 2000a) keeps such information private so as not to encourage free riding on punishment.
¹¹ As explained in the instructions, "average amount" meant the average over the four members of the group in the contribution stage of the period in question. It could vary among groups and within a given group from one period to the next. Note that a vote to allow punishment of those contributing less than the group average of 4 players is the same event as a vote to allow punishment of those contributing less than the average of the 3 others.

3. Results

3.1. The voting pattern

In the 3-Vote design there were 720 individual votes (80 subjects each voting 3 times on 3 ballot items), and in the 5-Vote design 1200 individual votes. Table 2 shows the number of individual votes on each ballot item. The table shows that a substantial number of individuals voted to allow punishment of higher-than-average contributors, but many more voted to allow punishment of less-than-average contributors.

In the 3-Vote design there were 60 group votes (see Table 1), and in the 5-Vote design there were 100 group votes. In the 160 group votes altogether, only 4 of the 8 possible combinations of rules were ever chosen by majority rule.
These were to allow: (i) no punishment, 56 group votes; (ii) punishment of lower-than-average contributors and no other punishment, 98 votes; (iii) punishment of low-or-equal-to-average contributors and no other punishment, 4 votes; and (iv) punishment of equal-to-average contributors and no other punishment, 2 votes. Conspicuously absent from this list is any rule allowing punishment of higher-than-average contributors.

Result 1. No group ever voted to allow punishment of higher-than-average contributors, so perverse punishment was ruled out from the first opportunity to vote. In ruling out perverse punishment, every group also ruled out unrestricted punishment from the beginning.

The two rules "punishment of lower-than-average contributors and no other punishment" and "punishment of low-or-equal-to-average contributors and no other punishment" are similar, and we grouped them together under the heading of allowing punishment of "low-but-not-high" contributors. Fig. 2 shows how the group voting evolved over the sequence of votes for the 3- and 5-Vote designs. Result 2 summarizes the voting pattern over time.

Result 2. In both designs, a plurality of groups voted in their first vote to prohibit all punishment, with a substantial minority of groups voting to allow punishment of low-but-not-high contributors. Over the sequence of votes, this ordering reversed, so that in the final vote, nearly all groups voted to allow punishment of low-but-not-high contributors, with only a few remaining groups voting to prohibit all punishment.

12 We expected few cases where someone was exactly an average contributor, but for symmetry we treated the average contributor on a separate ballot item.

13 Only "Yes" and "No" votes were counted in determining majorities; for example, if 2 voted "Yes" and 2 voted "No", the proposal did not pass, but if 2 voted "Yes", 1 "No" and 1 "No preference", the proposal passed. Subjects were informed of whether a ballot item passed or not, but not by how many votes or who voted which way.

14 See the boxes labeled b, c and d on the lower left portion of the diagram in the appendix showing the screen design.

15 Subjects were Brown undergraduates, recruited by (a) distribution of flyers in the mailboxes of all undergraduates, (b) distribution of flyers in a large introductory economics course, (c) distribution of table slips at a student dining hall, and (d) advertising under the heading of "employment" in an online campus magazine, the Brown Daily Jolt. Analysis of information provided in the post-experiment debriefing shows that the subjects majored in a wide range of concentrations, with economics being the concentration of only 15%, about 5% more than the proportion in the overall student body. A little less than half the subjects had taken no economics courses at the college level. A total of 67% of the subjects were female, somewhat higher than the 53% share in the general student body. Brown's undergraduate population numbers about 5500, so students participating in a given session tended not to know one another.

16 See Ertan et al. (2005). In the instructions and experiment we used neutral language and did not use words like "free riding", "punishment", and "perverse punishment". See also Cinyabuguma et al. (2006), where we point out that punishment is most clearly perverse when aimed at a group's highest contributor. Here, as in that experiment, we distinguish between punishment of above average, average, and below average contributors, rather than between punishment of highest and of other contributors, because this seems more symmetrical and less likely to convey a biased framing of the problem to subjects.

Fig. 3. Average contributions for the two designs, by period and punishment rule: (A) the 3-Vote design and (B) the 5-Vote design. (Horizontal axis: periods 1–30.)
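The voting procedure described above (three ballot items, majority rule counting only "Yes" and "No" votes as in footnote 13) can be sketched in a few lines of Python. This is a minimal illustration only: the item labels, function name, and data layout are hypothetical and are not taken from the experimental software; the tallies are the group-vote counts reported in the text.

```python
from itertools import product

# One ballot item per class of potential punishment target.
# (Labels are illustrative, not the wording shown to subjects.)
BALLOT_ITEMS = ("below_average", "average", "above_average")


def item_passes(votes):
    """Return True if a ballot item passes in a group.

    Only "yes" and "no" votes are counted; "no preference" is ignored.
    A tie among the counted votes means the item does not pass.
    """
    yes = sum(1 for v in votes if v == "yes")
    no = sum(1 for v in votes if v == "no")
    return yes > no


# A rule is one allow/prohibit choice per ballot item: 2**3 = 8 rules.
ALL_RULES = list(product((False, True), repeat=len(BALLOT_ITEMS)))

# Group-vote tallies reported in the text for the four rules ever chosen.
# Keys are (below_average, average, above_average) punishment allowed.
OBSERVED_GROUP_VOTES = {
    (False, False, False): 56,  # no punishment
    (True, False, False): 98,   # below-average contributors only
    (True, True, False): 4,     # below-average or average contributors
    (False, True, False): 2,    # average contributors only
}

if __name__ == "__main__":
    print(len(ALL_RULES))                                       # number of possible rules
    print(item_passes(["yes", "yes", "no", "no"]))              # 2-2 tie
    print(item_passes(["yes", "yes", "no", "no preference"]))   # 2-1 majority
    print(sum(OBSERVED_GROUP_VOTES.values()))                   # total group votes
```

In this representation, Result 1 corresponds to the observation that no key of `OBSERVED_GROUP_VOTES` has its third entry set to `True`.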

Fig. 4. Efficiency for the two designs, by period and punishment rule: (A) the 3-Vote design and (B) the 5-Vote design.

3.2. Contributions and efficiency

Fig. 3 shows period-by-period contributions of groups for the two composite rules most frequently chosen. In both the 3- and 5-Vote designs, groups that allowed punishment of low-but-not-high contributors achieved substantially higher levels of contributions than did groups that prohibited punishment altogether. We tested the hypothesis that contributions are higher for groups choosing the punish low-but-not-high rule than for those choosing the rule of no punishment, against the null hypothesis of no difference, in two ways. First, to avoid the possible statistical dependence from one period to another, and from group to group in a session because of the information provided from the second vote onward, we set aside observations from the second vote onward and averaged contributions in the periods between the first and second votes (with a probable loss of power). Comparing contributions under no punishment with those under punish low-but-not-high at the group level, we found, in a one-tailed Mann–Whitney test, significance at the 0.1% level for the 3-Vote design (11 group observations without and 9 with punishment) and at the 5% level (p = 0.034) for the 5-Vote design (13 observations without and 7 with punishment). Second, we tested differences in behavior from the second vote onward in Wilcoxon matched-pair tests at the session level, with fewer observations but similar results.17

In both the 3- and 5-Vote designs, contributions in groups that permitted punishment of low-but-not-high contributors tended to increase over time until the end-game fall-off. In contrast, 3-Vote design groups that prohibited all punishment had falling levels of contributions over time, replicating earlier results on basic VCMs without punishment, and in the 5-Vote design contributions had a slightly increasing trend in the middle periods.18

Fig. 4 shows period-by-period efficiency of groups that voted to prohibit all punishment and groups that voted to prohibit perverse punishment while allowing punishment of low contributors. In Fig. 4A, average period efficiency was always higher under the rules allowing punishment of low contributors, and similarly in Fig. 4B, except in 6 periods.19 As with contributions, we performed both Mann–Whitney and Wilcoxon tests of the hypothesis that earnings are higher under restricted punishment than under no punishment at different levels of aggregation, with the resulting significance levels varying from 0.1% to 10% in the 3-Vote design and from the 10% level to insignificant in the 5-Vote design, due to the similarity of earnings under the two rules in some groups of periods.20

Table 3 compares contributions and efficiency under the two most voted rules and the exogenously imposed conditions of unrestricted punishment (periods 4–6 of the 3-Vote design) and no punishment (periods 1–3).21 The results of the five tests of Table 3 are summarized in Result 3:

Result 3. For each of the Wilcoxon matched pair tests on contributions, contributions are higher under the rule of punish low-but-not-high than under the rule of unrestricted punishment, and contributions are higher under the rule of unrestricted punishment than under the rule of no punishment, and this ordering is transitive. Correspondingly, efficiency is higher under punish low-but-not-high than under no punishment, and efficiency is higher under no punishment than under unrestricted punishment, and this ordering is transitive.

17 For these tests, we have at most one paired observation from each session, namely the average contribution per subject in all groups in the session that chose one rule, and the corresponding average in all groups that chose the other. This yields up to five paired averaged observations in groups allowing no punishment and in groups allowing punishment of low contributors in each design and set of periods, although there are fewer observations for sets of periods when only one rule is observed in one or more sessions. For example, if in a certain session and set of periods two groups allowed no punishment and two groups allowed punishment of low contributors, we averaged the contributions over the relevant periods in the first two groups and likewise for the second two groups, giving us one pair of observations for that session; if all four groups followed the same rule, the session offered no observation for this test. We performed Wilcoxon matched pair (ranked sum) tests on these data, with the following results, beginning with the 3-Vote design: for periods 7–14, only 3 sessions have observations of both rules, and although in all cases contributions are higher in the groups allowing punishment, the p-value of the one-tailed test is 0.055; for periods 15–22, with 4 valid sets of observations, the one-tailed test p-value is 0.034; for periods 23–30, only two sessions still have groups not using punishment, so although the ordering remains consistent, the one-tailed test p-value is 0.09. Turning to the 5-Vote design, we have for periods 1–6, 3 pairs of observations with one-tailed test p-value of 0.055; for periods 7–12, 4 pairs with one-tailed test p-value 0.034; for periods 13–18, 4 pairs, one with contrary ordering, hence one-tailed test p-value 0.072; periods 19–24, 4 pairs including one tie, and one-tailed test p-value of 0.055; periods 25–30, only 2 pairs, both with the usual order, but two-tailed test p-value 0.09.

18 In Fig. 3A and especially 3B, contributions under the endogenously chosen rule of no punishment are more sustained and decline more slowly than is typical in a VCM without punishment. But endogenous choice includes its process, including repeated voting and the ability to change rules, possibly leading to commitment effects (see Sutter et al.), restart effects (see the dashed vertical lines in Fig. 3), and selection effects as groups change rules in response to free-riding behavior (i.e., groups with the lowest levels of free riding are less likely to adopt a rule allowing punishment).

Notes (Table 3): One-tailed Wilcoxon matched pair tests. Tests 1–4 are for groups in the 3-Vote design. Test 1 compares the average contributions of periods 7–9 in groups that chose punish low in their first vote matched with the average contributions of the same group in periods 4–6 of unrestricted punishment (the number of distinct groups matched and compared is n = 9); and correspondingly for efficiency. Test 2 compares contributions (efficiency) for groups in periods 1–3 with contributions for the same groups in periods 4–6, n = 20. Test 3 compares periods 4–6 with 7–9, for the groups that chose no punishment in periods 7–9, n = 10. Test 4 compares members of the first groups in each session that switched from a voted rule of no punishment to a voted rule of punish low, comparing the 8-period averages before and after the switch; if two or more groups in a session switched at the same time, the behaviors of all of their members are averaged; n = 5. Test 5 is the same as Test 4, except it is for the 5-Vote design and 6-period averages are compared before and after the switch, n = 5. A less stringent version of Test 1 considers the first three periods of play in any group that adopted punish low, even if after the 2nd or 3rd vote. This test has n = 17 and a p-value < 0.1% for contributions and < 1% for earnings. We also considered less stringent versions of Tests 4 and 5 that compare each group that switched from a voted rule of no punishment to one of punish low, regardless of whether this was the first time such a switch had occurred among groups in their session. For Test 4, there are 9 paired observations and the p-value of the test statistic is < 1% for both contributions and earnings. For Test 5, there are 17 paired observations and the p-value is < 0.1% for contributions and < 5% for earnings. "Punish low" indicates punish low-but-not-high. *** Significance at the 1% level. ** Significance at the 5% level. * Significance at the 10% level, and $ insignificant, in one-tailed tests.

19 The difference in earnings between groups with no punishment and those with the punish-low-but-not-high rule (Fig. 4) is smaller than the difference in contributions (Fig. 3) because (a) an extra E$1 (one experimental dollar) of contribution raises efficiency by only E$0.60, and (b) punish-low-but-not-high groups achieve higher contributions but incur some punishment costs (E$1.25 per E$1 of punishment assigned). Experimenters with the voluntary contribution mechanism occasionally see in the lab a group that achieves high contributions without punishment or other aids, and the two groups that resisted voting for punishment in the 5-Vote design were of this type, their members perhaps priding themselves on being able to earn as much as those in other groups even without having recourse to the punishment threat.

20 As with contributions, we begin with Mann–Whitney tests using group-level observations from the periods between the 1st and 2nd votes only. For the 3-Vote design, the one-tailed test p-value in this case is 0.001; for the 5-Vote design, the test finds no difference based on punishment rule, consistent with what Fig. 4B shows in periods 1–6. Next, we performed Wilcoxon matched pair tests for each set of periods with a maximum of one pair of observations per session. For the 3-Vote design, there are 3 valid pairs for periods 7–14, all showing higher earnings with punishment, with one-tailed test p-value of 0.055; for periods 15–22, 4 pairs, p-value 0.034; and for periods 23–30, 2 pairs, ordered as expected, p-value of 0.090. For the 5-Vote design, periods 1–6 have 3 paired observations but the difference, as with the corresponding Mann–Whitney test, is not significant; for periods 7–12, 4 pairs, again no difference; for periods 13–18, 4 pairs, with those with punishment earning more in 3 of 4 cases, thus p-value 0.072; periods 19–24, 4 pairs, again 3 favoring those allowing punishment but this time one tie, thus p-value 0.055; for periods 25–30 only 2 valid pairs are left, with one session displaying one order and the other session the opposite order, hence no significant difference. Although violating the requirement of full independence of observations, it may nevertheless help to put these results into perspective, and convey a sense of the statistical power lost due to the dissemination of information, if we also report the results for tests using all group-level observations for all periods: for the 3-Vote design, the p-value of a one-tailed test would be less than 0.001; for the 5-Vote design, the p-value of the corresponding test is 0.01.

21 For example, in comparing contributions under the rule of punish low-but-not-high with contributions under the (exogenous) rule of unrestricted punishment in Test 1, we considered the 17 groups of the 3-Vote design that eventually chose the rule allowing punishment of low-but-not-high contributors (see Fig. 2A). For each of these groups we calculated the average group contribution over the first 3 periods that the group was governed by this rule. We matched this average with the same group's average contribution over the 3 periods of unrestricted punishment (periods 4–6 of the 3-Vote design). In the 17 matched pairs, 14 groups had higher contributions under the rule of punish low-but-not-high, 2 groups had higher contributions under unrestricted punishment, and 1 group was tied. The difference is significant (p = 0.001) in a two-tailed Wilcoxon matched pair test.
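The session-level pairing used for the Wilcoxon matched pair tests (footnotes 17 and 20) can be illustrated with a short sketch. The function name, the rule labels "none" and "punish_low", and the example numbers are all hypothetical and serve only to show the aggregation step: at most one pair per session, and no pair when every group in a session follows the same rule.

```python
from statistics import mean


def session_pairs(records):
    """Build at most one (no_punishment, punish_low) pair per session.

    `records` is an iterable of (session, rule, avg_contribution) tuples,
    one per group, where rule is "none" or "punish_low".
    Sessions in which only one rule is observed yield no pair.
    """
    by_session = {}
    for session, rule, value in records:
        by_session.setdefault(session, {}).setdefault(rule, []).append(value)

    pairs = []
    for session in sorted(by_session):
        rules = by_session[session]
        if "none" in rules and "punish_low" in rules:
            pairs.append((mean(rules["none"]), mean(rules["punish_low"])))
    return pairs


# Made-up numbers purely for illustration (not data from the experiment).
example = [
    (1, "none", 4.0), (1, "none", 6.0),
    (1, "punish_low", 8.0), (1, "punish_low", 9.0),
    (2, "punish_low", 7.0), (2, "punish_low", 7.5),
    (2, "punish_low", 9.0), (2, "punish_low", 8.0),
]

print(session_pairs(example))  # session 2 contributes no pair
```

Because groups have equal size, averaging group averages within a session is equivalent to averaging over subjects, which is how footnote 17 describes the observation.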

Fig. 5. Frequency of cases of free riding, by punishment rule. Bars (total number of observations for each rule in parentheses): no punishment (229); unrestricted punishment (205); punish low-but-not-high (103). Vertical axis: frequency of cases of free riding, from 0 to 1.0.

Because of the difference in the orderings for contributions and efficiency, the sequence of tests in Table 3 for efficiency is rearranged to show the transitivity. The difference in the orderings of contributions and efficiency is likely due to the cost of punishment.

3.3. Mitigating the free rider problem

In the literature on public goods games, it is common practice to use the term "free rider" loosely to denote any individual who contributes less than the socially optimal amount. It is worth noting, however, that in the absence of punishment anyone who contributes less than others earns more than these others and thus obtains a free ride on others' contributions; but when punishment is possible, a low contributor may fail to earn more, and therefore fail to free ride in actuality. To compare how successfully different sets of rules address free riding, we adopt in this section a definition that considers the full outcome, not simply the contribution decision. Specifically, the symmetric design of this and other VCM experiments suggests a simple definition of free riding: a subject A experiences free riding when someone else in his group, B, contributes less to the public good but earns more than A does.22 For a specified punishment rule, sequence of periods, and collection of groups, we define the frequency of free riding as the number of cases of free riding divided by the number of observations, and an observation as a pairing in a group in which one subject has a higher contribution than the other subject of the pair. By the design of a basic VCM without punishment and its payoff Eq. (1), every time someone contributes more than someone else, there is a case of free riding, because the higher contributor always has lower earnings than the lower contributor. Thus, under this definition of free riding, the frequency of free riding for the basic VCM is 100% (as shown in the first bar of Fig. 5).
But the frequency of free riding may decrease when sufficient punishment is directed at low contributors. For the rule of unrestricted punishment, over all 20 groups in periods 4–6 of the 3-Vote design, there were 205 observations of pairs of unequal contributions by subjects in a group, and 148 cases of free riding, for a frequency of 72% (see the middle bar). In comparison, the frequency of free riding in the first 3 periods after a group voted for the rule of punishing low-but-not-high contributors was 35% of the 103 observed unequal pairs. This is a striking reduction, considering that the rule of punish low-but-not-high does not prevent a higher-than-average contributor from free riding on a still higher contributor. The difference in free riding between unrestricted punishment and punish low-but-not-high contributors is significant (p < 0.0001) in a Fisher exact test.23

Result 4. In comparing VCMs with rules governing punishment, we found the highest frequency of free riding in groups operating with no punishment, less free riding in groups with unrestricted punishment, and least free riding in groups allowing punishment of low-but-not-high contributors.

A regression analysis of incentives to free ride finds the same ordering as in Result 4. In the regressions below, we follow Fehr and Gächter in defining subject i's absolute negative and positive deviations from the average of others' contributions as

\[
\text{Absolute Negative Deviation}_i =
\begin{cases}
|C_i - \bar{C}_{-i}| & \text{if } C_i < \bar{C}_{-i},\\
0 & \text{otherwise,}
\end{cases}
\qquad
\text{Positive Deviation}_i =
\begin{cases}
|C_i - \bar{C}_{-i}| & \text{if } \bar{C}_{-i} < C_i,\\
0 & \text{otherwise,}
\end{cases}
\]

where \(\bar{C}_{-i} = \sum_{j \neq i} C_j / 3\) is the average of others' contributions.

22 Under this definition, if everyone in a group contributed the same low amount, there would be no free riding (it is only defined for unequal contributors).

23 We also did a Wilcoxon matched pair test, which is also significant; see the Working Paper for details.
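The outcome-based definition of free riding in Section 3.3 and the Fisher exact comparison can be sketched as follows. Several things here are assumptions rather than details from the paper: the function names, the endowment of 10 used in the illustrative payoff (only the 0.40 return from the group account is stated in the text), and the count of 36 free-riding cases, which is inferred as the only integer consistent with "35% of the 103" pairs.

```python
from itertools import combinations
from math import comb


def free_riding_counts(contributions, earnings):
    """Return (cases, observations) for one group and period.

    An observation is a pair of subjects with unequal contributions;
    a case of free riding is such a pair in which the lower contributor
    earns more than the higher contributor.
    """
    cases = observations = 0
    for i, j in combinations(range(len(contributions)), 2):
        if contributions[i] == contributions[j]:
            continue  # only defined for unequal contributors (footnote 22)
        observations += 1
        low, high = (i, j) if contributions[i] < contributions[j] else (j, i)
        if earnings[low] > earnings[high]:
            cases += 1
    return cases, observations


def vcm_earnings(contributions, endowment=10.0, mpcr=0.4):
    """Basic VCM payoff without punishment (the endowment is an assumed value)."""
    total = sum(contributions)
    return [endowment - c + mpcr * total for c in contributions]


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d
    denom = comb(n, col1)

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))


# Without punishment, every unequal pair is a case of free riding.
contribs = [10, 5, 0, 10]
cases, obs = free_riding_counts(contribs, vcm_earnings(contribs))
print(cases, obs)

# Unrestricted punishment: 148 of 205 pairs; punish low-but-not-high: 36 of 103
# (36 is inferred from the reported 35%, not stated in the text).
p_value = fisher_exact_two_sided(148, 205 - 148, 36, 103 - 36)
print(p_value < 1e-4)
```

The first printed pair illustrates why the frequency is 100% in the basic VCM: with the payoff above, a lower contribution always yields strictly higher earnings within the same group.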

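Table 4 (entries as they survive in the source; column (1) entries are not present)

Average contribution by others (column assignment not recoverable): 0.230 (0.244), p = 0.350; 1.090 (0.405), p = 0.009; 0.228 (0.206), p = 0.269

                                 (2)        (3)        (4)        (5)
Positive deviation               n.a.       n.a.       n.a.       n.a.
Absolute negative deviation      1.217      1.039      1.054      0.967
  (standard error)              (0.148)    (0.122)    (0.138)    (0.095)
  p-value                       < 0.001    < 0.001    < 0.001    < 0.001
R2                               0.91       0.86       0.75       0.78
Observations                     82         92         241        176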

Notes: Punishment received as a function of deviation from group average in unrestricted and restricted punishment conditions. OLS regressions with period and group fixed effects, not shown. Unrestricted punishment, in Column 1, is observed in periods 4–6, where each observation is for one subject and one period. Columns 2–5 include one observation per subject under the rule allowing punishment of low-but-not-high contributors. In Columns 2 and 3, only the first three periods in which a group adopted the rule for the first time are included, while Columns 4 and 5 include all periods of restricted punishment. Numbers in parentheses are White heteroskedasticity-consistent standard errors. *** Significance at the 1% level. ** Significance at the 5% level. * Significance at the 10% level.
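As an illustration of how the regressors in Table 4 are built from raw contributions, the sketch below computes, for one subject in a four-person group, the average of the others' contributions and the two deviation terms defined in Section 3.3. The function name and the example contributions are hypothetical.

```python
def deviations(contributions, i):
    """Return (avg_others, abs_negative_dev, positive_dev) for subject i.

    avg_others is the mean contribution of the other group members.
    abs_negative_dev is the shortfall below that mean (0 if none);
    positive_dev is the excess above that mean (0 if none).
    """
    others = [c for j, c in enumerate(contributions) if j != i]
    avg_others = sum(others) / len(others)
    gap = contributions[i] - avg_others
    abs_negative = -gap if gap < 0 else 0.0
    positive = gap if gap > 0 else 0.0
    return avg_others, abs_negative, positive


group = [3, 6, 9, 6]  # made-up contributions for illustration
for subject in range(len(group)):
    print(subject, deviations(group, subject))
```

At most one of the two deviation terms is non-zero for any subject, which is what lets the regression estimate separate punishment responses for below-average and above-average contributors.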

Using Fehr and Gächter's specification (see their Table 5, p. 991), we first consider behavior in the 3 periods of the exogenously imposed rule of unrestricted punishment (periods 4–6 of the 3-Vote design; see column (1) of Table 4), and compare this with the first 3 periods of the endogenously chosen rule allowing punishment of low-but-not-high contributors in both the 3- and 5-Vote designs (columns (2) and (3)).24 Then we consider behavior for the punish low-but-not-high rule over all the periods which it governs in the 3- and 5-Vote designs (columns (4) and (5)). In each regression of Table 4 the dependent variable is subject i's punishment received in each period (3 periods for regressions (1), (2), and (3), and up to 24 and 30 periods in regressions (4) and (5), respectively). The independent variables are the Average Contribution of Others, i's Absolute Negative Deviation, i's Positive Deviation, and period and group dummies (not shown).25,26 The results in Table 4 are consistent with those of Fehr and Gächter in that Absolute Negative Deviation always obtains a significant positive coefficient. The coefficient on the Positive Deviation term in column (1), however, suggests that when it is allowed, perverse punishment exacerbates the incentive problem for high contributors.27 Table 5 re-organizes Table 4's

24 We include observations for only the first 3 periods under a rule in columns (2) and (3) to achieve comparability with the regression for periods 4–6 (column (1)), in view of the possibility that learning or other factors might change behaviors with more repetitions.

25 In both the unrestricted (Column 1) and restricted (Columns 2–5) punishment regressions, only the observations of individuals who could potentially be punished are included. The difference is that under unrestricted punishment, anyone can be punished.

26 The regressions were also estimated by the Tobit method, treating 0 punishment observations as potentially left-censored. Resulting coefficients are similar and similarly significant except in the case corresponding to Column (1), where they are not significant at conventional levels.

27 In fact there was considerable perverse punishment in periods 4–6 of the 3-Vote design. Of the 129 events of punishment, 28% were punishments aimed at higher-than-average contributors for the period and group in question, 19% at the highest contributor for the period and group in question, and 11% at individuals who contributed their full endowment. These percentages are calculated by counting each event (rather than dollar amount) of someone punishing someone else. They may be atypically high due to the short duration of the unrestricted punishment portion of our experiment. Yet similarly large amounts of perverse punishment are found in some other studies; see for example Anderson and Putterman (2006), Gächter and Herrmann (2005), and, for a regression result similar to column (1) in which the absolute positive deviation term also has a positive significant coefficient, Önes and Putterman (2007), Table 2.

Table 5 (surviving fragment, entries for higher-than-average contributors under punish low-but-not-high): n.a.; +0.60; +0.60.

findings in a manner that makes this clearer. In Column (1) of Table 4 the coefficient for Absolute Negative Deviation is $0.89, the estimated punishment for a $1 reduction in contribution for a less-than-average contributor under the rule of unrestricted punishment, in periods 4–6 of the 3-Vote design; it is shown as a negative gain of -$0.89 in Column (1) of Table 5. In Column (2) of Table 4 the coefficient for Absolute Negative Deviation is $1.22, the estimated punishment for a $1 reduction in contribution for a less-than-average contributor under the rule of punish low-but-not-high contributors, in the first 3 periods under that rule in the 3-Vote design; it is shown as a negative gain of -$1.22 in Column (2) of Table 5, etc. The +$0.60 throughout Table 5 is the $1 gain in the private account from reducing one's contribution by $1, minus the $0.40 loss in the individual's earnings from the group account. In Column (1) of Table 4 the coefficient for Positive Deviation is $0.38, the estimated punishment for each $1 of additional contribution for a higher-than-average contributor, under the rule of unrestricted punishment, in periods 4–6 of the 3-Vote design. The +$0.38 in Column (1) of Table 5 is the positive gain from contributing $1 less and avoiding $0.38 in perverse punishment, for a higher-than-average contributor. The cases labeled "n.a." in Table 5 are for the rule of punish low-but-not-high in Columns (2)–(5), in which case punishment of higher-than-average contributors is not allowed. Table 5 shows that for less-than-average contributors the net gain from contributing $1 less is negative for each of the cases in Columns (1)–(5). The -$0.29 in Column (1) suggests that unrestricted punishment can reverse a subject's incentive to free ride, for a subject contributing less than average, replicating Fehr and Gächter's earlier finding for the case of less-than-average contributors.
But the gain for less-than-average contributors is even more negative in Columns (2)–(5), suggesting that the incentive against free riding is strengthened for less-than-average contributors under the rule of punish low-but-not-high. Table 5 suggests that the incentive to contribute $1 less for higher-than-average contributors is not reversed under either unrestricted punishment or the rule of punish low-but-not-high. In Column (1), under unrestricted punishment, a subject with a higher-than-average contribution makes an estimated net gain of $0.98 from contributing $1 less (a gain of $0.38 from reduced perverse punishment added to the $0.60 gain from shifting away from the group account). In Columns (2)–(5), under the rule of punish low-but-not-high, a higher-than-average contributor bears no punishment, but still gains the $0.60 from a $1 shift from the public account. While neither rule reverses the incentive for a higher-than-average contributor to contribute less, the incentive toward free riding is smaller under the rule of punish low-but-not-high than under unrestricted punishment.28
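The net-gain arithmetic behind Table 5 can be reproduced directly from the figures quoted in the text. The sketch below uses only the coefficients stated there ($0.89 and $1.22 for Absolute Negative Deviation in Columns (1) and (2), $0.38 for Positive Deviation in Column (1)); the function names are hypothetical.

```python
# Gain from moving $1 from the group account to the private account:
# keep the $1, lose the $0.40 individual return from the group account.
PRIVATE_GAIN = 1.00 - 0.40


def net_gain_below_average(abs_neg_coef):
    """Net gain from contributing $1 less for a less-than-average contributor.

    abs_neg_coef is the estimated extra punishment per $1 of shortfall.
    """
    return PRIVATE_GAIN - abs_neg_coef


def net_gain_above_average(pos_coef=None):
    """Net gain from contributing $1 less for a higher-than-average contributor.

    pos_coef is the estimated perverse punishment avoided per $1; None means
    punishment of higher-than-average contributors is not allowed ("n.a.").
    """
    return PRIVATE_GAIN + (pos_coef if pos_coef is not None else 0.0)


print(round(net_gain_below_average(0.89), 2))  # Column (1), unrestricted punishment
print(round(net_gain_below_average(1.22), 2))  # Column (2), punish low-but-not-high
print(round(net_gain_above_average(0.38), 2))  # Column (1), unrestricted punishment
print(round(net_gain_above_average(None), 2))  # Columns (2)-(5)
```

A negative value means that contributing less does not pay once expected punishment is taken into account, which holds for below-average contributors under both rules but for above-average contributors under neither.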

3.4. Do subjects vote according to their type?

We conjectured that even though some subjects use opportunities to perversely punish (when punishment is unrestricted) and would likely vote to allow perverse punishment in our experiment, punishment of high contributors might nonetheless be ruled out since few groups would have a majority of members of this type. Results at the group level are consistent with this conjecture. Is there also evidence at the level of individuals, however, that subjects tended to vote according to type? Logit regressions provide some affirmative evidence. We estimated regressions in which the dependent variable is 1 if a subject voted to permit punishment specified by a particular rule and 0 otherwise. Explanatory variables included the subjects' contributions relative to their group averages during the periods preceding each vote, measures of how much punishment they had given and received, and period (i.e., vote) and group dummies. The coefficients on the subject's relative contribution were positive in the regressions on voting to allow punishment of low contributors, significant at the 5% level or better for both the 3- and the 5-Vote designs, and negative in the regressions on voting to allow punishment of high contributors, significant at the 10% level in the regression for the 3-Vote but not in that for the 5-Vote design. This evidence suggests that subjects were more (less) likely to vote to allow punishment of less-(greater-)than-average contributors the higher on average was their contribution above their group's average contribution in the 8 (6) previous periods. Details are in the Working Paper (Ertan et al., 2005).

28 When subjects make their contribution decision, they do not know what the other subjects' contributions will be, and are uncertain of what the average, and hence the boundary line of punishment risk, will be. This uncertainty creates an incentive toward higher contributions to be on the safe side of the unknown boundary.

4. Discussion and interpretation

We discuss our results and interpretation under the following headings: (a) a rough calculation on the plausibility that no group would ever allow punishment of high contributors in the 160 group votes of the combined 3- and 5-Vote designs; (b) institutional choice and its evolution with and without information on other groups' performance; (c) distaste for punishment and the role of opportunities to reconsider; (d) variability of experimental results; (e) implementation; (f) what the experiment appears to tell us about models of heterogeneous preference types; (g) heterogeneous preferences in other voting models.

(a) On the plausibility of unanimously prohibiting perverse punishment in 160 group votes. Even if only a quarter of subjects are prone to perversely punishing, it might seem implausibly rare that not a single group vote produced a majority for allowing it. As an anonymous referee commented: "[t]he fact that no group ever allowed punishment of high contributors will make readers suspicious, since results of such clarity are quite rare." How improbable is the unanimity result? Simple calculations suggest a wide range in the assessment of probability.
Consider the following composite hypothesis: (i) about 25% of punishment is targeted on higher-than-average contributors when punishment is unrestricted (see Section 1.1 and footnote 27); (ii) an individual who has a preference toward punishing high contributors is just as likely to punish as an individual who has a preference against such punishment (i.e., the proportion of subjects of given preference is the same as the proportion of corresponding punishment observations); (iii) perverse punishers are likely to vote their preference type to allow punishing high contributors, and similarly normal punishers are likely to vote their preference type to prohibit punishing high contributors (evidence for this from the logit analysis in Section 3.4); and (iv) the preference types are stable and randomly distributed. With these rough assumptions the binomial probability that a group of four subjects chooses to allow perverse punishment by a majority of 3 or 4 votes for the third ballot item is 0.0508 (we are setting aside complications from abstentions), the probability of prohibiting perverse punishment is 0.9492, and the expected number of group votes prohibiting perverse punishment is 152 out of the combined 160 group votes in the 3- and 5-Vote designs. This calculation roughly suggests that the vast majority of group votes will be to prohibit perverse punishment. But the binomial probability of unanimity, the event that 160 out of 160 votes prohibit perverse punishment, is small, 0.0002. However, this calculation depends on the assumption of statistical independence in type from period to period even for the same individual, and this is an unrealistically strong assumption. Consider another simple but unrealistic assumption in the other direction: that preference types and beliefs are so stable that they remain xed from period to period. 
Then it is as though there were only 40 independent group-level observations in the 10 sessions of the experiment, with the same votes and other decisions repeated many times. In that case the expected number of votes prohibiting perverse punishment is 38 out of the combined 40 group votes in the 3- and 5-Vote designs, and the binomial probability of unanimity, that 40 out of 40 votes prohibit perverse punishment, is much larger, 0.12. A glance at Fig. 2 shows that this second assumption on statistical dependence is unrealistically strong in the other direction, i.e., views change over time. Presumably the probability of unanimity is somewhere between 0.0002 and 0.12, likely pretty far from the two extreme calculations. The calculations serve to remind us of how sensitive such results are to assumptions on statistical independence when there are aggregations over many periods, and of the other uncertainties in (i)–(iv).

(b) On institutional choice, learning and evolution: Our experiment is one of several recent ones in which institutions are chosen by subjects through voting. Despite its stylized character, we think it suggests the considerable potential that the experimental method has for contributing to our understanding of how institutions emerge and evolve. We note again our choice of promoting a more accelerated and informed evolution of institutions by sharing information about outcomes among groups in given sessions, despite some cost to statistical independence. We would argue that when real-world groups decide on rules and practices, they often have access to information about the experience of similar groups, so the information spill-over in the experiment has a true-to-life quality. We want to emphasize, however, that 40 first votes were taken by the 160 subjects in our core treatments, and 4 more by the 16 subjects in our pilot experiment, and that each of these votes occurred with no information about others' choices or outcomes.
Apart from the evolution towards more use of "punish low-but-not-high" with additional votes, our findings (unanimous rejection of allowing punishment of high contributors in the initial vote, higher contributions and earnings with than without punishment of low contributors, lower earnings with unrestricted punishment, and lower frequency of successful free riding under the "punish low" rule) are all supported by tests using only decisions taken prior to information dissemination, as well as by tests using the full data set.
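
The binomial figures quoted for the composite hypothesis (i)-(iv) earlier in this section can be checked with a short sketch. It assumes, as in that rough calculation, that each of the four group members independently votes to allow perverse punishment with probability 0.25, that a majority of 3 or 4 votes is needed, and that abstentions are ignored. The function and variable names are illustrative only and are not part of the experiment's software.

```python
from math import comb


def prob_majority_allows(p_perverse: float, group_size: int = 4, majority: int = 3) -> float:
    """Probability that at least `majority` of `group_size` independent voters
    vote to allow perverse punishment, each doing so with probability p_perverse."""
    return sum(
        comb(group_size, k) * p_perverse**k * (1 - p_perverse) ** (group_size - k)
        for k in range(majority, group_size + 1)
    )


# Assumption (i): about 25% of subjects are of the perverse-punisher type.
p_allow = prob_majority_allows(0.25)
p_prohibit = 1 - p_allow

# Two polar assumptions about independence:
# 160 fully independent group votes versus 40 independent groups.
for n_votes in (160, 40):
    expected_prohibit = n_votes * p_prohibit
    p_unanimous = p_prohibit**n_votes
    print(
        f"n = {n_votes}: P(allow) = {p_allow:.4f}, P(prohibit) = {p_prohibit:.4f}, "
        f"expected prohibiting votes = {expected_prohibit:.1f}, "
        f"P(all prohibit) = {p_unanimous:.4f}"
    )
```

Under these assumptions the sketch returns the values reported in the text. The gap between the two unanimity probabilities comes entirely from the number of observations treated as independent.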

(c) Distaste for punishment and opportunities to reconsider. In one treatment of a set of 4-person VCM interactions similar to those in this paper, Bochet et al. (2006) let subjects communicate in a chat room before the 1st, 4th, and 7th periods of ten rounds of play. A noticeable finding was that out of 12 groups in the chat room treatment with punishment opportunities, not a single group discussed an explicit strategy of punishing low contributors, and in some groups, members' messages expressed the view that the punishment option was a "trap" set by the experimenters to reduce earnings. A distaste for punishment may help to account for the rejection of all forms of punishment in many of the initial votes in our experiment, for the rejection of punishment by most groups in Sutter et al. (2005) and Botelho et al. (2005), and for the initial preference shown for being in the group without punishment by most subjects in Gurerk et al. (2005, 2006). While eschewing the punishment idea in their deliberations, however, many of Bochet et al.'s subjects engaged in costly punishment when group members defected from their verbal agreements to contribute. And subjects in the present experiment seem to warm to the idea of allowing punishment of low contributors as they experience resentment of or anger at free riders and as they learn that groups permitting punishment tend to have higher earnings. The institutional choices made in our paper and in those of Sutter et al. (2005), Botelho et al. (2005), and Gurerk et al. (2005, 2006) might seem at first glance to be at odds, since our subjects and Gurerk et al.'s subjects seem to show a greater overall preference for punishment than do those of Sutter et al. and Botelho et al.
However, all share a common reluctance to adopt punishment rules at the outset, and much of the difference in overall outcomes may be attributed to the fact that our subjects and Gurerk et al.'s subjects have many opportunities to change rules or groups, while Sutter et al.'s and Botelho et al.'s subjects have only one opportunity to vote on rules. Also, our subjects might have voted more like Botelho et al.'s had they been required to choose between no punishment and unrestricted punishment only, since the results of periods 4-6 of our 3-Vote design are consistent with Botelho et al.'s point that subjects may be worse, not better, off with (unrestricted) punishment.

(d) On variability: At the same time, even a brief review of the literature on punishment in social dilemmas shows a large variability in experimental results. Experimentalists are well aware that small changes in experimental design and wording of instructions can affect experimental results, not just for experiments on punishment but quite generally. Still, the literature on punishment in social dilemmas seems to yield an especially large variability in results. Our suspicion is that this variability is partly due to punishment behavior itself being scattershot and variable. Thus there may not be a simple general answer to the question of whether punishment in social dilemmas is a good or bad thing: its effects may vary too much with the specific conditions.29

(e) On implementation: In the experiment, once a rule of punishment is chosen by vote, it is easily implemented by the experiment's computer software. In the real world, there is no such easy implementation. Nonetheless, in the practical world most organizations are hierarchical or a blend of hierarchy and symmetric volunteer elements, and organizations often find ways of managing, albeit imperfectly, who gets punished.
For example, in hierarchical organizations, if managers were more aware of the possibly high frequency of perverse punishers and of the high costs in efficiency, they might focus more on mitigation. Once aware, managers could work to limit decentralized punishment and attempt to instill norms of cooperation in much the same manner that managers attempt to control bullying behavior and harassment.

(f) On heterogeneous preferences: There is a continuing discussion about keeping the standard model, which limits the type of preferences to self-regarding (individual profit-maximizing) preferences. In favor of this approach is that it is parsimonious and often leads to specific predictions, which in turn are often consistent with experimental results. However, in this experiment, we don't see how we can interpret the results without positing some form of other-regarding preference types (e.g., conditional cooperators, perverse punishers). Other experiments on social dilemmas also suggest the need for modeling heterogeneous preference types, including both self-regarding and other-regarding or reciprocating types. Our experiment adds to the interpretation of heterogeneity in a particularly striking way. An argument for modeling only homogeneous self-regarding preferences is that introducing heterogeneous preferences is too "mushy," allowing almost any prediction and rationalizing almost any observed result. But our experiment has a strong and consistent pattern to it, suggesting that the existence of heterogeneous preferences need not always lead to indeterminate results.

(g) On heterogeneous preferences in other voting models: Our analysis suggests that the presence of multiple preference types may be important to predicting voting outcomes, and this may be true for other instances of public choice as well. Pork barrel politics provides an example. Ordeshook's (1986, pp.
210-215) model of pork barrel politics is one of a social dilemma where what is good for an individual legislator is bad for society as a whole. For example, Senator Stevens benefits

29 The fact that Gurerk et al.'s subjects earn more with than without unrestricted punishment, while the comparison goes the opposite way for our subjects and Botelho et al.'s, illustrates this variability. In personal communication, Simon Gachter reported that he and his collaborators found large differences in the frequency of perverse punishment and, correspondingly, in the benefit or lack of benefit of introducing a punishment option across subject pools in different countries and settings (a finding documented shortly before our paper went to press in the remarkable study by Herrmann et al., 2008).

by bringing pork to his district (the "bridge to nowhere"), while other Senators lose because their districts end up paying for the bridge, even when the net benefits of the bridge are negative. Why then don't the other Senators outvote Stevens? Ordeshook's answer is that in a pork bill, there can easily be an equilibrium where there are just enough earmarked pork projects to form a winning coalition, even when each of the projects has negative net benefits. Ordeshook's analysis depends heavily on the assumption that each legislator is narrowly self-interested (the self-interest may be in the form of an increased probability of re-election). In fact, the assumption of a single preference type of self-interest is still common in voting models in the political science literature. Our experiment and others on VCMs, the dictator game, and the centipede game (McKelvey and Palfrey, 1992) suggest that the assumption of homogeneous preference types can be misleading. If one allows for the possibility of heterogeneous preference types in Ordeshook's model, the equilibrium can shift and the predicted outcomes are not always as dire as Ordeshook's original model suggests. For example, some senators may care about doing "the right thing," or some voters may choose not to reward a senator who joins a pork coalition, so the situation may be more fluid than it appears in Ordeshook's model. But if the situation is this fluid, can anything happen? To deal with this possibility, we focused on observed behavior under the specific experimental conditions, and then interpreted the specific results in terms of heterogeneous preferences. We believe that this approach can work in experimental studies of other voting models, such as Ordeshook's, even when there are signs of heterogeneity and odd behavior, as there were in our study of voting and perverse punishment.
As another example, Meltzer and Richard's (1981) model of the level of redistributive taxation uses a median voter solution assuming strictly self-regarding preferences. More accurate explanations of the level of redistribution and its variation over time and place would consider the strength of preferences for greater equality on the part of some citizens, and resentment of the "undeserving poor" on the part of others (see, for instance, Benabou and Tirole, 2005). Such an addition of two almost opposite social preference types alongside self-interested types resembles the situation studied in this paper, where self-interested subjects co-exist with both cooperation-preferring and cooperation-resisting types, with the associated demographic leading to predictable voting outcomes.30

Appendix

Fig. A1 is the screen design for an individual to enter her contribution to the group account (box a), to learn of others' contributions (boxes b, c, and d), to enter her punishment decisions (boxes b', c', and d'), and to observe the computer's calculation of net earnings for a period.

[Fig. A1, screen layout. Put in group account: You a, B b, C c, D d. Earnings from private account: g = 10.0 - a. Total in group account: e = a + b + c + d. Earnings from group account: f = 0.4e.]