Introduction

The Time, Money, and Morality article has been HIBAR-ed on Twitter and the Blogosphere (e.g., by Rolf Zwaan and Greg Francis), and the discussion seems to revolve around the validity of inferences based on p-values close to 0.05 (e.g., they raise suspicions of p-hacking).

Unethical behaviour is operationalised as taking the opportunity to cheat on a task.
Priming methods vary across experiments, as do the tasks that allow for an opportunity to cheat.
The two postulates are tested in Experiment 1. Experiments 2-4 assess the role of self-reflection in cheating behaviour, which is operationalised differently across experiments.

Hold on to your P-curves for a moment… Back to the basics!

In this Post-Publication Peer-Review (3PR) I demonstrate that there is indeed some cause for concern about the way these results are presented and interpreted. Was it p-hacking? … I don't know and maybe I don't even care. To me this is an example of sloppy science: p-hacked or not, these results were allowed to be published by expert peers. It is more relevant to discuss the broken system of quality control that should have picked up on at least some of the following issues:

Important information is missing:

in general (e.g., number of subjects per condition, sample size determination)

No explanation of (conflicting) results across experiments (e.g., variation in amount of cheating)

No explanation for failing of random assignment to design levels (none of the experiments have equal N samples)

The article under scrutiny is by no means exceptional with respect to such issues. Moreover, the way frequency / proportion data are analysed in psychological science is generally awkward and most of the time completely wrong.

I will 3PR the data based on the information in the article and comment on the results:

The R code used to generate the results (and this page) is available in this Markdown file, and this post explains how to post to a WordPress blog.

I. Analysis of proportion / frequency data

Some concerns can be raised about the significant differences in the proportion of Cheating between various conditions reported in the 4 experiments.
First and foremost, no corrections for multiple comparisons are conducted. Should one do so, just 2 significant proportion differences remain: Money vs. Time in Experiments 1 & 4. In Experiment 3, the sample difference No Mirror: Money - Time falls just short of significance, missing the adjusted threshold only in the 2nd significant digit (original: p = 0.015, adjusted \( \alpha \) = 0.013, Bonferroni).
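As a small illustration of the Bonferroni arithmetic (a Python sketch, not the R code behind this post), the adjusted threshold is the nominal \( \alpha \) divided by the number of comparisons. The family size of 4 is taken from the four sub-tables compared in Experiment 3; the variable names are my own.

```python
# Bonferroni: test each p-value against alpha divided by the number of tests.
alpha = 0.05
n_tests = 4          # assumption: four comparisons within Experiment 3
p_original = 0.015   # Experiment 3, No Mirror: Money - Time (as reported)

alpha_adjusted = alpha / n_tests          # 0.0125, reported above as 0.013
survives = p_original < alpha_adjusted    # does the result stay significant?

print(f"adjusted alpha = {alpha_adjusted:.4f}")
print(f"p = {p_original} survives correction: {survives}")
```

The reported p-value is larger than the adjusted threshold, which is why this comparison is treated as a marginal case.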

Second, no continuity correction is applied, even though these proportions are calculated from discrete counts (participants). If a continuity correction is applied, 2-3 significant differences remain, depending on the \( \alpha \)-level chosen.
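To show what the continuity correction does, here is a minimal Python sketch (not the original R code) comparing the Pearson chi-square test for a 2x2 table with and without Yates' correction. It uses the Experiment 3, No Mirror cell counts that follow from the n per condition assumed in this post (Money: 20 of 30 cheated, Time: 11 of 31 cheated); the function name `chi2_2x2` is hypothetical.

```python
import math

def chi2_2x2(a, b, c, d, yates=False):
    """Pearson chi-square (df = 1) for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    diff = abs(a * d - b * c)
    if yates:
        # Yates' correction: shrink |ad - bc| by n/2, but never below zero.
        diff = max(0.0, diff - n / 2)
    stat = n * diff ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Upper-tail probability of a chi-square variable with 1 df.
    p = math.erfc(math.sqrt(stat / 2))
    return stat, p

# Rows: Money, Time. Columns: cheated, did not cheat (assumed counts).
stat_plain, p_plain = chi2_2x2(20, 10, 11, 20)
stat_yates, p_yates = chi2_2x2(20, 10, 11, 20, yates=True)

print(f"uncorrected: chi2 = {stat_plain:.2f}, p = {p_plain:.3f}")
print(f"Yates:       chi2 = {stat_yates:.2f}, p = {p_yates:.3f}")
```

The correction always makes the test more conservative, so a p-value that is already close to the threshold can easily cross it.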

(Cheating can be considered a dichotomous response, so logistic regression could also be used, see III. HAPPE-ing)

Note:
Experiments 2 & 3 do not list n per condition, so the most likely values for n are assumed (1. closest to an integer value; 2. as equal as possible; 3. adding up to the total N):

Experiment 2

| Prime | Assessment | Ncond * %Cheat = Ncheat (deviation) |
|---|---|---|
| Money | Personality | 36 * 0.2778 = 10.0008 (\( 8 \times 10^{-4} \)) |
| Time | Personality | 35 * 0.2857 = 9.9995 (\( 5 \times 10^{-4} \)) |
| Money | Intelligence | 38 * 0.5 = 19 (0) |
| Time | Intelligence | 33 * 0.303 = 9.999 (\( 10 \times 10^{-4} \)) |

Experiment 3

| Prime | Assessment | Ncond * %Cheat = Ncheat (deviation) |
|---|---|---|
| Money | Mirror | 31 * 0.387 = 11.997 (0.003) |
| Time | Mirror | 28 * 0.321 = 8.988 (0.012) |
| Money | No Mirror | 30 * 0.667 = 20.01 (0.01) |
| Time | No Mirror | 31 * 0.355 = 11.005 (0.005) |
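The first criterion for choosing n (the product of n and the reported proportion should be as close as possible to a whole number of cheaters) can be sketched as follows. This is a Python illustration under my own assumptions: the helper names `deviation` and `best_n` and the search range of 30 to 40 participants are hypothetical, and the other two criteria (equal group sizes, adding up to the total N) still have to be checked by hand.

```python
def deviation(n, pct):
    """Distance between n * pct and the nearest whole number of cheaters."""
    cheaters = n * pct
    return abs(cheaters - round(cheaters))

def best_n(pct, candidates):
    """Candidate n whose implied number of cheaters is closest to an integer."""
    return min(candidates, key=lambda n: deviation(n, pct))

# Reported proportions of cheating in Experiment 2 (from the table above).
reported = {
    "Money / Personality": 0.2778,
    "Time / Personality": 0.2857,
    "Time / Intelligence": 0.303,
}

candidates = range(30, 41)  # assumption: plausible cell sizes
for label, pct in reported.items():
    n = best_n(pct, candidates)
    print(f"{label}: n = {n}, deviation = {deviation(n, pct):.4f}")
```

A proportion such as 0.5 is compatible with any even n, which is why the integer criterion alone is not enough and the remaining two criteria are needed.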

1. log-linear analysis of observed cell frequencies

Log-linear analysis, or Poisson regression using the generalised linear model, can be used to test whether relationships exist among the variables in a multi-way contingency table. Here I analyse the number of participants in each cell of the design: the observed frequencies take the role of the dependent variable, and the levels of the design factors such as Mediator, Prime and Cheating are considered the levels of independent variables (another option would have been a logistic / probit regression with Cheating as the dependent binary / proportion variable).
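For a single two-way table, the deviance of the log-linear model without the interaction term equals the likelihood-ratio statistic \( G^2 = 2 \sum O \ln(O / E) \), with expected frequencies computed from the table margins. The Python sketch below (not the R code used for this post) computes it for one 2x2 sub-table only, namely Experiment 3, No Mirror with the assumed cell counts, so it illustrates the principle rather than reproducing the full multi-way analysis. The function name is hypothetical.

```python
import math

def g2_independence(table):
    """Likelihood-ratio statistic G^2 for independence in a two-way table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    g2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            if observed > 0:  # empty cells contribute nothing to the sum
                g2 += 2 * observed * math.log(observed / expected)
    return g2

# Rows: Money, Time. Columns: cheated, did not cheat (assumed counts).
no_mirror = [[20, 10], [11, 20]]
g2 = g2_independence(no_mirror)
# With 1 degree of freedom the chi-square tail probability is erfc(sqrt(x/2)).
p_value = math.erfc(math.sqrt(g2 / 2))
print(f"G2 = {g2:.2f}, p = {p_value:.3f}")
```

This drop in deviance corresponds to adding the Prime:Cheating term to a model that contains only the two main effects for this sub-table.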

Two types of result are given for each experiment:

First, a table listing deviance tests for the full (saturated) model. The analysis starts with the NULL model (all frequencies are equal) in the first row. Each subsequent row lists what happens to the deviance (of the model in the previous row) when a factor is added. A significant drop in deviance means adding the factor to the model contributes to predicting the difference between expected and observed frequencies. For hints of corroboration of the hypotheses reported in the paper, significant interactions between a design factor and Cheating are necessary.

Second, a mosaic plot is displayed. This is a graphical representation of the conditional cell frequencies. The mosaic plot also indicates which residual frequencies (observed - expected) are significantly below (red) or above (blue) the expected frequencies (residuals are interpretable as a Z-score). The coloured cells contribute most to a high and possibly significant \( \chi^2 \) value.
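The residuals behind such shading can be sketched in a few lines of Python. I assume Pearson residuals, \( (O - E) / \sqrt{E} \), with expected frequencies based on the table margins; the plots in this post may have used a different residual type. The example uses the Experiment 2, Intelligence sub-table with the assumed cell counts (Money: 19 of 38 cheated, Time: 10 of 33 cheated), and the function name is hypothetical.

```python
import math

def pearson_residuals(table):
    """Pearson residuals (O - E) / sqrt(E) under independence of rows and columns."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    residuals = []
    for i, row in enumerate(table):
        res_row = []
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            res_row.append((observed - expected) / math.sqrt(expected))
        residuals.append(res_row)
    return residuals

# Rows: Money, Time. Columns: cheated, did not cheat (assumed counts).
intelligence = [[19, 19], [10, 23]]
residuals = pearson_residuals(intelligence)
for label, res_row in zip(["Money", "Time"], residuals):
    print(label, [round(r, 2) for r in res_row])

# Cells with |residual| > 2 would be the ones that get coloured.
flagged = [(i, j) for i, row in enumerate(residuals)
           for j, r in enumerate(row) if abs(r) > 2]
print("cells beyond |2|:", flagged)
```

In this sub-table no cell comes close to the usual cut-off of 2 in absolute value.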

Note:
The significance of the change in deviance can depend on the order in which factors are added to the model and is not the same as a significant beta weight in a regression model.

Conclusion log-linear analysis:
This alternative, and in my opinion more appropriate, analysis is in agreement with the results after correction for multiple comparisons and continuity:

The mosaic plots show that there may be some unexpected factors driving the “effects” reported in the paper:

In Experiments 1 & 4 it is not so much the observed frequency of people that did cheat, but the number of participants that did not cheat that deviates from the expected frequencies based on table margins.

The Money prime caused fewer people to NOT cheat, whereas the Time prime caused more people to NOT cheat.

If there is a difference in amount of Cheating between samples, it is likely a “main effect” between the Time and Money prime (Cheating:Prime interaction). It is found to cause a significant drop in deviance in Experiments 1, 3 and 4.

Experiment 2 stands out: the observed differences in Cheating are unlikely to be due to chance, but none of the other factors help to explain the differences between expected and observed frequencies.

The point about the mosaic plots is not just semantics or methodologists' nit-picking. Take the mosaic plot Table.1.1: among the observed frequencies of CheatYES, the cell Money does not stand out much from Time and Control relative to what may be expected by chance. For CheatNO, on the other hand, the cell Money does stand out as different.

Effect Size Confidence Intervals:
To get a clearer idea about the significance of differences between cells I calculate confidence intervals around the effect size associated with contingency tables. The CIs in Figure 1 below are based on the exact Odds Ratio (using the non-central hypergeometric distribution) for a 2x2 sub-table of the full design obtained from Fisher's Exact Test, testing against \( H_0: OR = 1 \).

Note:
Here, the Confidence Levels have been adjusted to account for the fact that 3 (EXP1&4) and 4 (EXP2&3) subtables of the full design were compared (1-(0.05 / #tests)). The exact p-value from Fisher's exact test reported in the Figure was multiplied by the number of comparisons in each experiment.
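The adjustment described in this note can be sketched as follows. This Python illustration (not the R code behind the figure) computes a two-sided Fisher exact p-value from the hypergeometric distribution, multiplies it by the number of comparisons, and derives the adjusted confidence level. It uses the Experiment 3, No Mirror sub-table with the assumed cell counts; the function name is hypothetical, and the exact odds ratio with its CI is not reproduced here.

```python
from math import comb

def fisher_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d
    lowest, highest = max(0, row1 + col1 - n), min(row1, col1)
    total = comb(n, row1)
    # Hypergeometric probability of each possible top-left count x.
    pmf = {x: comb(col1, x) * comb(n - col1, row1 - x) / total
           for x in range(lowest, highest + 1)}
    p_observed = pmf[a]
    # Sum over all tables that are no more probable than the observed one.
    return sum(p for p in pmf.values() if p <= p_observed * (1 + 1e-9))

n_tests = 4  # sub-tables compared in Experiments 2 and 3
p_raw = fisher_two_sided(20, 10, 11, 20)   # Money vs. Time, No Mirror
p_adjusted = min(1.0, p_raw * n_tests)     # Bonferroni-adjusted p-value
conf_level = 1 - 0.05 / n_tests            # adjusted confidence level

print(f"raw p = {p_raw:.4f}, adjusted p = {p_adjusted:.4f}")
print(f"confidence level = {conf_level:.4f}")
```

Multiplying the p-value by the number of tests and comparing it with 0.05 is equivalent to comparing the raw p-value with 0.05 divided by the number of tests.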

Conclusion Proportion data

If there is an effect, it exists as a “main-effect” difference between the Money and Time primed samples in Experiment 1 and 4.

Experiment 3 No Mirror: Money - Time is a marginal case.

Experiment 2 did not yield any substantial effects.

4-5 out of 7 statistical inferences in the paper that are made based on proportion data should be considered invalid.

II. Analysis of extent of cheating

The extent of Cheating concerns the difference between actual accuracy (which is not provided as a result) and reported accuracy by a participant.
Experiments 1-3 report analyses of extent of Cheating including means and SDs. Sample size assumptions for Experiments 2 and 3 are the same as above.

Compare Cohen's d CIs

I created CIs around the effect sizes based on the means and SD reported for Experiment 1-3 using the R package MBESS.
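To make the calculation concrete, here is a rough Python sketch of Cohen's d with a confidence interval computed from summary statistics alone. Note the assumptions: it uses a large-sample normal approximation for the standard error of d, which is not the same as the noncentral-t based intervals that MBESS provides, and the means and SDs in the example are made-up placeholder values, not numbers from the article. Only the group sizes (30 and 31) correspond to the n assumed above for Experiment 3.

```python
import math
from statistics import NormalDist

def cohens_d_ci(m1, sd1, n1, m2, sd2, n2, conf_level=0.95):
    """Cohen's d (pooled SD) with an approximate normal-theory CI."""
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    d = (m1 - m2) / math.sqrt(pooled_var)
    # Large-sample approximation of the standard error of d.
    se = math.sqrt((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2)))
    z = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)
    return d, d - z * se, d + z * se

# Hypothetical means and SDs, for illustration only.
d, lower, upper = cohens_d_ci(m1=3.0, sd1=2.0, n1=30, m2=2.0, sd2=2.0, n2=31)
print(f"d = {d:.2f}, 95% CI [{lower:.2f}, {upper:.2f}]")
```

With groups of about 30 participants, even a medium-sized d of 0.5 comes with an interval that reaches down to roughly zero, which is the kind of pattern to look for when comparing the CIs across experiments.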

III. HAPPE-ing (Hypothesising After Post-Publication Evaluation)

Even without re-analysing the published data as I have done here, the conclusions by the authors can be questioned based on a comparison of very elementary results:

Across four experiments, using different primes and a variety of measures and tasks, we consistently
found that shifting people’s attention to time decreases dishonesty. Priming time makes people reflect
on who they are, and this self-reflection reduces their likelihood of behaving dishonestly.

The key is to compare the results across the 4 experiments and evaluate whether it is valid to infer that the core postulates have been corroborated. The designs and materials are slightly different each time, but if outcomes (e.g., proportion of cheating behaviour) vary systematically with one or more of the experimental differences, there may be another variable at work here.

One result that begs explanation is the drop in proportion Cheating in all the samples of Experiment 2 when compared to the other experiments. What is special about the procedure and methods? Regrettably more than 1 potential intervening factor changes with respect to Experiment 1.

A second odd omission in the interpretation of the results is the level of accuracy achieved by participants. In Experiments 1-3, the urge to cheat must have been less when a participant had achieved 90% accuracy. Experiment 4 is somewhat different in that the cheating opportunity concerns one “bottleneck” problem that is difficult to solve, but has to be correct in order to make other more easily solvable problems count in adding to the final reward. Here, accuracy could have an opposite effect in which less accurate participants cheat less. If 0 or only 1 extra item past the “bottleneck” item were solved, a participant might be less inclined to cheat than a participant who solved every problem except for the “bottleneck” item.

What is mediating what?

The figure below shows the interaction between the maximal financial incentive that could be awarded and the proportion cheating for each prime and experimental condition (indicating whether a mediator variable was manipulated in addition to being exposed to a prime). Note that the Intelligence and the No Mirror condition of Experiments 2 and 3 respectively are considered similar to Experiment 1 and 4, that is, they reflect a condition in which Self-reflection was not induced by any other means than priming:

This relationship can be tested in a generalised linear model, of course being fully aware that this is exploratory HAPPE-ing. I assume the samples from each experiment are independent and use the number of cheaters vs. non-cheaters as the dependent binomial variable. The model contains only those effects for which data are available (e.g., no interactions with both Prime and Mediator).

In the table above the model Intercept corresponds to the odds of Cheating compared to the Null-model when the predictors have the values: Prime = Time, Mediator = None and Reward = 0. Compared to the overall probability of observing Cheating behaviour, it thus seems that when the Time prime is presented without an induction of Self-reflection and a financial reward incentive, the odds of Cheating drop.

This appears to be a corroboration of the second postulate, but note that in this analysis (just as in the previous analyses), there is no real difference between the Time prime and prime = None. The standard errors around these parameters are quite high. A clearer picture emerges when the Intercept is defined as Prime = None, Mediator = None and Reward = 0 and the Odds Ratios are compared (exponentiation of the parameter estimates):

The odds ratios in the table above are multiplicative changes to the odds of observing Cheating = 1 when the predictor increases by 1 unit. So an OR < 1 will decrease the odds of observing Cheating behaviour and an OR > 1 will increase them. The 95% CIs are based on the profile likelihood and show that in most cases the effect covers a range below and above 1. The range for the effect of Self-Reflection is always below 1.
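Because an odds ratio acts on the odds and not directly on the probability, it can help to see the conversion spelled out. The Python sketch below uses made-up values (a baseline probability of 0.5 and odds ratios of 0.5 and 2), not estimates from the model; the function name is hypothetical.

```python
def apply_odds_ratio(prob, odds_ratio):
    """Probability after multiplying the odds of an event by an odds ratio."""
    odds = prob / (1 - prob)          # probability -> odds
    new_odds = odds * odds_ratio      # the OR is a multiplicative change
    return new_odds / (1 + new_odds)  # odds -> probability

baseline = 0.5  # hypothetical baseline probability of Cheating
for odds_ratio in (0.5, 1.0, 2.0):
    new_prob = apply_odds_ratio(baseline, odds_ratio)
    print(f"OR = {odds_ratio}: P(Cheating) = {new_prob:.3f}")
```

An OR of 0.5 therefore does not halve the probability: starting from 0.5, the probability drops to one third, and how large the change is depends on the baseline.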

One can interpret the modelled relationship between these variables as follows:

There is a weak positive association between the Maximal Financial Reward and the Probability of Cheating

The association changes with the value of Prime, becoming stronger when Money is primed, weaker when Time is primed

The induction of Self-reflection does not cause the association to change, it changes the intercept, the base-line Probability of Cheating at Reward = 0

A graphical representation of the model predictions more clearly reveals this relationship:

Conclusions, Discussion and further HAPPE-ing

The significant results between Time and Money in Experiments 1 and 4 probably arise due to the increase in Probability of Cheating when there is a financial reward and Money is primed.

It is unlikely there are any other “real” differences in these data except for the induction of Self-reflection: model predictions show it decreases the Probability of Cheating by the same amount for different primes.

Note that there were no actual data points for None + Self-reflection

The missing predictors in the Probability of Cheating analysis are the actual and reported accuracy of the performance (amount of correctly solved problems and money received respectively). These values cannot be inferred from the extent of cheating analyses. It seems reasonable to assume in most experiments there was less incentive to engage in Cheating by participants who were more accurate.

This brings up the question of whether the effects are driven by some sort of Speed-Accuracy instruction: Naturally, Time = Money, but taking the time to solve the problems may lead to higher accuracy and less incentive to cheat, likewise a focus on getting as many answers as possible may introduce errors and promote cheating.

In science there is a moral obligation to do the best one can to be as accurate as possible, and usually this means it is wise to be as modest as possible about one's scientific claims. I am not an expert in this field, but the sheer number of questions that can be raised about the validity of the inferences made in this paper makes one wonder who the peers were that achieved consensus about the credibility of this research and what their area of expertise was.

I am not saying this is irrelevant, or poor research; the two effects that survive the scrutiny of 3PR are certainly interesting. I am just a little worried this paper says more about the morality of contemporary scientific publishing than the scientific study of moral behaviour.

Some notes about this file:

This file was created using Markdown in RStudio: Unless otherwise indicated in the code blocks (e.g., by require), the basic R packages are used.

The one true gospel on statistical inference does not exist and more than one approach to analyse these data may be defensible.

Therefore: Please be aware these comments and suggestions reflect my own preferences and standards in these matters. If you feel I should change some of my preferences and/or standards please let me know, because I review and adjust them on a regular basis.