Monday, June 21, 2010

Deep Trouble for the Deep Self

Hi Everyone,

My name is David Rose and I am new to this blog and happy to be contributing. Jonathan Livengood, Justin Sytsma, Edouard Machery and I have written a paper discussing Chandra Sripada's Deep Self Model of intentional action. Recently, Sripada used structural equation modeling to show that his Deep Self Model was supported over several other models (e.g. Knobe's and Alicke's). We reanalyzed Sripada's data and found that his data do not support his Deep Self Model. For those interested, the paper is attached here:
Download Deep Trouble for the Deep Self circulated

In our paper, Sara and I argue that 1) normative factors, specifically goodness/badness judgments and moral status judgments, don’t explain the asymmetry in Knobe’s Chairman case; and 2) Deep Self-related factors do. In your critique, you agree that our data supports (1) and disagree that our data supports (2). We appreciate your support of our claim 1, since this is the topic of most of the paper and is the basis of the title of the paper (‘Telling more than we can know about intentional action’). But the arguments advanced in your paper against our claim 2 are problematic both in terms of specifics and in terms of broader theoretical approach. Let me take each in turn. I start with the specific claims in this comment (part 1) and then turn to issues around broader theoretical approach in a second comment (part 2).

Claim 1: Tetrad outputted a better model than the one adopted in Sripada and Konrath (S&K), so the S&K model should be rejected.
There are two problems here. First, Tetrad did not find a better model, it found an equivalent model. As anyone familiar with structural equation models (SEM) knows, equivalent models are invariably available. The difference in fit between the Tetrad-outputted and S&K models is very modest, and the models are non-nested and cannot be formally compared anyhow. Second, you misrepresent our purpose in using SEM. We did not seek to find the best fitting model among *all* possible models. Our approach was to test a small number of candidate hypotheses [several Normative Factor hypotheses and the Deep Self Concordance Account (DSCA)], all of which enjoy substantial antecedent support in the philosophical literature. We wanted to conduct a head-to-head test and found that the DSCA clearly beats normative factor models. We used modification indices only to make sure a good candidate model was not being prematurely rejected due to small issues with model specification. We do not follow your lead in thinking it is appropriate to blindly search for the best fitting model without any antecedent theoretical hypotheses at all, and then claim that alternatives (even alternatives with very strong antecedent support) are thereby undermined. I do think theory-free search can be useful. It can help one find new models which can then be tested in subsequent studies. I hope you will now do this if in fact you genuinely think your alternative Tetrad-delivered model is credible.

Claim 2: The positive sub-model associated with the DSCA does not fit the data.
This too is simply incorrect. A significant chi-square simply cannot be used in the way you propose to reject a model. A chi-square test is unreliable when sample size is large (i.e., >200; see the Kline text you cite, page 200) so established norms in the field are to use other fit indices, such as GFI and NNFI, which both show the positive sub-model is consistent with the data. There are also lots of problems with breaking a holistically fit (maximum likelihood estimated) and tested SEM model into parts (with resulting loss in degrees of freedom) and insisting on good fit for every part, but I will leave that issue alone for now.
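The sample-size point can be made concrete with a small sketch. For an ML-fitted SEM, the test statistic is roughly (N − 1) times the minimized discrepancy-function value, so a fixed small misfit crosses the significance threshold purely as N grows (the F_min value below is hypothetical, chosen only to illustrate the scaling):

```python
# Why the chi-square model test is sensitive to sample size:
# for an ML-fitted SEM, X^2 = (N - 1) * F_min, where F_min is the
# minimized discrepancy. Here F_min is a hypothetical, fixed small misfit.
F_min = 0.01
CRIT = 3.841  # chi-square critical value at alpha = .05, df = 1

for n in (100, 240, 1000):
    chi2 = (n - 1) * F_min
    print(n, round(chi2, 2), "reject" if chi2 > CRIT else "retain")
```

The identical tiny misfit is retained at N = 100 and N = 240 but rejected at N = 1000, which is exactly why fit indices are consulted alongside the chi-square test.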

Claim 3: The data supports that at most either the Att variable or Gen variable (the two Deep Self variables) influence Intentionality, but not both. Therefore the DSCA is undermined.
Simply stating this claim immediately reveals why it is unreasonable. If a model makes two predictions and only one turns out to be true, then the model is not thereby falsified. That is a much, much too dramatic conclusion. You do not even consider any alternatives to this extreme conclusion. Why?

But there is an even more serious problem here. You say you use two assumptions, Markov and Faithfulness, to infer certain probabilistic relations that need to hold among the variables in the positive sub-model, and then argue these probabilistic relations fail to obtain. However, you failed to mention a third critical assumption – Causal Sufficiency. This is the assumption that all the common causes of pairs of variables in the model are themselves measured. This assumption would fail if a latent (unmeasured) variable underlies Gen and Att. But that such a latent construct exists (call it the ‘Deep Self’ factor) has precisely been my hypothesis all along. It is clearly implied in the S&K manuscript, explicitly stated in other papers (that David has read), and was specifically stated at talks (that Edouard has seen). It is puzzling to me that you would frame the issue in such stark and unwavering terms as the data ‘undermining’ the DSCA when this alternative hypothesis is so readily available.

Of course I could not model this hypothesized latent variable with the current data set because the resulting model would be unidentified. But it is easy enough to do a poor man’s latent analysis with this data: average Gen and Att as an unweighted ‘Deep Self’ composite score, as is widely done in other SEM studies (see Dawes’ famous paper ‘The Robust Beauty of Improper Linear Models’ to see why this works). This yields a simple mediation model with a highly significant Sobel statistic for the path involving the averaged Deep Self variable. The bottom line is that your claim that the DSCA is ‘undermined’ is not at all warranted by the actual content of your much more modest finding, and you don’t even try to consider alternative less extreme conclusions.
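The composite-plus-Sobel procedure described above can be sketched in a few lines. This is only an illustration with simulated data: the variable names and effect sizes are hypothetical stand-ins for the real measures, not the actual S&K data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 240  # matches the sample size discussed in this thread

# Hypothetical stand-ins: case -> averaged 'Deep Self' composite -> intentionality
case = rng.normal(size=n)
deep_self = 0.6 * case + rng.normal(size=n)   # composite of Gen and Att
intent = 0.5 * deep_self + rng.normal(size=n)

def ols(y, X):
    """OLS with an intercept column; returns (coefficients, standard errors)."""
    X = np.column_stack([np.ones(len(y)), X])
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    return beta, se

# Path a: composite regressed on case
beta_a, se_a = ols(deep_self, case.reshape(-1, 1))
a, sa = beta_a[1], se_a[1]
# Path b: intentionality regressed on composite, controlling for case
beta_b, se_b = ols(intent, np.column_stack([deep_self, case]))
b, sb = beta_b[1], se_b[1]

# Sobel z for the indirect (mediated) effect a*b
sobel_z = a * b / np.sqrt(b**2 * sa**2 + a**2 * sb**2)
print(round(sobel_z, 1))
```

With a genuine mediated effect of this size, the Sobel z comfortably exceeds 1.96, which is the shape of the result claimed for the averaged Deep Self variable.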

Let me now turn to broader issues of your theoretical approach. You do not provide nearly enough context about the evidential status of the Tetrad framework from which you mount your critique. Structural equation modeling is an established method widely used in the behavioral/social sciences (PsycINFO alone showed more than 10,000 articles published over the last couple of decades). Sara and I use a very standard SEM methodology, and we execute it completely ‘by the book’; that is, we carefully follow the established norms for proposing and testing an SEM model, and use SEM only to adjudicate between competing a priori hypotheses (rather than blindly searching for the best fitting model). You offer a critique based on Tetrad, a theory-free search procedure that is very much a new kid on the block. Tetrad is not established in the field by any means (I am aware of only a handful of studies in the behavioral sciences that have used it, and fewer in leading journals). Moreover, Tetrad is certainly controversial, with many serious thinkers skeptical of it, especially in terms of practical applications. For example, the Causal Sufficiency assumption or other assumptions may not be met in actual empirical data sets, and if the theorist stipulates these assumptions are met, Tetrad will misrepresent the actual causal structure, which is exactly what happened in your Tetrad analysis. There are many other issues about Tetrad that I obviously cannot discuss here, and indeed am not at all qualified to discuss. But the relative merits of Tetrad are not at all the issue. The point is that when you mount a critique of an established method like ours using a new, experimental, and controversial method like Tetrad, it should be a basic ground rule that you flag this fact front and center. Otherwise, you present a very misleading picture to your audience that two equally well-established methods are being pitted against each other.
There may indeed be brilliant insights in the Tetrad framework for causal discovery (in fact I am absolutely sure there are), but as of yet, the approach is simply not established in the field and you need to acknowledge this clearly.

The need to be very clear about ‘norms in the field’ is particularly important in XPhi. Let’s face it, XPhi practitioners are not always going to be experts on every newfangled statistical method. As the field matures, people will gradually incorporate more elaborate methods into their repertoire. We should expect XPhi practitioners who use new methods to be forthright in identifying how established the methods are, since we will often have to take it on faith that the methods are established and trustworthy. You need to be much clearer that Tetrad is not an established method, and the one used by Sara and me very much is.

Here is the other issue about approach. SEM is great for some things. For example, when a model with strong a priori support, like ours, and the other equivalent models all uniformly say that the two normative factors are not driving intentionality judgments, that is clearly a strong result, and one that is very hard to show without SEM methods (for example, with simple polling methods alone). Here you and I are in agreement that SEM is very useful for making progress in philosophical debates. But here is where we seem to be disagreeing: I believe that SEM is *not* good for some purposes and am very cognizant of its limits. For example, when there are two equivalent models (e.g., our final model in S&K and your Tetrad-delivered model) that have similar fit and differ only in terms of the direction of a single causal arrow, it is unreasonable to expect a single data set like this to strongly adjudicate between the models, let alone to support your extreme claim that the data ‘undermine’ our model. SEM is not a magic wand, and correlational approaches based on linear models have their limits. It is here that the other tools at our disposal are going to be crucial. We have to build comprehensive theories around our respective accounts, and test these accounts head-to-head in new studies using manipulations (preferably coupled with SEM). This is the only reliable way to tease apart what is causing what.

Towards the end of moving forward with future theory-based studies, here are some questions for you that I hope you will answer: Have you been able to find any theoretical justification for your Tetrad-delivered ‘Intentionality First’ account in which intentionality judgments drive attitude attributions, and not the other way around (indeed, if I understand correctly, you think there is no causal arrow at all going from attitude attributions to intentionality judgments)? Why should we believe this account is plausible in light of everything we already know in philosophy and psychology about intentionality judgments? Do you have experiments, or proposals for experiments, that would support your account? How does the Intentionality First account deal with the fact that existing theories of intentional action (e.g., see Malle and Knobe 1997) uniformly say attitude attributions (i.e., attributions of beliefs, desires, foresight), inter alia, play a causal role in intentionality judgments, and not vice versa as your Intentionality First account would seem to predict? How does your account fit with recent experimental findings in the XPhi literature on intentional action, for example the cases and findings I report in my first Deep Self paper? These are all critical questions that need an answer. I look forward to hearing them.

Finally, let me add that though part of my job here is to offer criticisms and I am obviously partisan to my own view, there is actually much more we agree on. In particular, I think we agree that SEM-related methods might be very useful as a core method for XPhi. Also, let me add that I think the techniques you are using are very much at the cutting edge and will no doubt move the field forward in the future. Thanks again for your critique and your interest.

Thanks for this useful response. My co-authors will without any doubt reply, but I'd like to correct several mistakes in your response.

1. "Tetrad did not find a better model, it found an equivalent model." Not so. Tetrad *did* find a model that fits best by all fit criteria. As we note in the paper, when the models are not hierarchically related, fit indices are commonly used to determine which model is best. We give citations about this.

2. "you misrepresent our purpose in using SEM. We did not seek to find the best fitting model among *all* possible models." Not so: We do not say this was your goal. Rather, we object to the fact that, instead of searching for the best fitting model, you relied on what we call a guess-and-check method. *I* think that scientists should search instead of testing the models they happen to consider for usually subjective and arbitrary reasons.

3. Your point about using p to test a model is well taken, but you misunderstand the essence of the second argument (partly because of the way it was formulated). The real issue is that the fit of your full model derives almost entirely from the fit of its negative parts. So, the good fit of your model provides no evidence whatsoever for your positive causal hypotheses.

4. Your response to the third argument is puzzling: if someone holds a theory asserting P&Q and P cannot be true, then her theory is surely falsified! Surely you do not require that *every* claim made by a theory be shown false before the theory counts as falsified, or there would never be any falsification.

5. " For example, when there are two equivalent models (e.g., our final model in S&K and your Tetrad-delivered model) that have similar fit and differ only in terms of the direction of a single causal arrow, it is unreasonable to expect a single data set like this to strongly adjudicate between the models, let alone to support your extreme claim that the data ‘undermine’ our model." We do not say that the existence of a better fitting model undermines your account, exactly for this reason. The first argument concludes that your model is not supported over another better fitting model. Arguments 2 and 3 show that your model is undermined.

6. Your discussion of Tetrad is very misleading, and it's a bit of a smoke screen, really.

Tetrad is naturally not "theory free," quite the contrary! The mathematical properties of the algorithm we used are known (see the citations in the paper), and it is based on extensive work by Scheines, Glymour, Spirtes, and others.

Tetrad is not particularly controversial, contrary to what you suggest. Naturally, assumptions are made when Tetrad is used, but this is also the case for SEM - indeed, we have doubts that the assumptions you are making are met (see one of the footnotes at the beginning of the paper).

I also note that you are confusing "established" and "trustworthy". We give the citation in the paper concerning the trustworthiness of the algorithm we used.

7. Finally, our goal was not to propose a model of intentional action. We do not endorse the model output by Tetrad. In *my* mind, the conclusion to be drawn from our paper is that the measures you used and the data you recorded are not particularly useful for studying the psychological processes leading to judgments of intentional action. But this is not the point of the paper at all.

(PS: I vaguely remember that Dawes's paper is about the weights in linear regression, really a different issue from latent variables it seems to me, but I might be wrong.)

I just realized that by "theory free" you did not mean that Tetrad is not based on theory; instead, you referred to the contrast between searching and testing models specified a priori. Sorry for my mistake.

But let me highlight a virtue of Tetrad - why would you think that the models you happen to be able to imagine based on your preconceptions are the correct ones?

(1) “Claim 2: The positive sub-model associated with the DSCA does not fit the data.
This too is simply incorrect. A significant chi-square simply cannot be used in the way you propose to reject a model. A chi-square test is unreliable when sample size is large (i.e., >200; see the Kline text you cite, page 200) so established norms in the field are to use other fit indices, such as GFI and NNFI, which both show the positive sub-model is consistent with the data. There are also lots of problems with breaking a holistically fit (maximum likelihood estimated) and tested SEM model into parts (with resulting loss in degrees of freedom) and insisting on good fit for every part, but I will leave that issue alone for now.”

You say that a chi-square is unreliable when the sample size is large. Your sample size was 240; do you consider this to be large enough to render the chi-square tests of the sub-models unreliable? More importantly, though, when a model has good overall fit it is both useful and commonplace to investigate *why* it has good overall fit. The reason is that for any model I can dump in all sorts of “noise variables” that are just uncorrelated Gaussians and get a good p-value. So, looking at sub-models can be a useful way of seeing why the overall model fits well. Additionally, there is no “correct” method for analyzing sub-models (chi-square tests are perfectly apt, but so are BIC and AIC scores). Again, what’s really important is figuring out why a model fits well, and by looking at the various sub-models we can see which parts are fitting the data well and which aren’t.
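Since BIC and AIC are mentioned as alternatives for scoring sub-models, here is a minimal sketch under one common SEM convention (AIC = χ² − 2·df, which differs from χ² + 2·k only by a constant shared by all models fitted to the same data). The first sub-model's figures are the ones quoted in this thread; the second sub-model's figures are hypothetical:

```python
# One common convention scores an ML-fitted SEM as AIC = chi2 - 2 * df.
# Absolute values are meaningless; only differences between models fitted
# to the same data matter, and lower is better.
def aic(chi2, df):
    return chi2 - 2 * df

positive_part = aic(chi2=4.156, df=1)  # figures quoted for the positive sub-model
negative_part = aic(chi2=1.2, df=3)    # hypothetical well-fitting sub-model
print(positive_part, negative_part)
```

On this scoring the hypothetical negative part comes out well ahead of the positive part, which is the kind of sub-model diagnosis the comment has in mind.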

(2) (a) “But there is an even more serious problem here. You say you use two assumptions, Markov and Faithfulness, to infer certain probabilistic relations that need to hold among the variables in the positive sub-model, and then argue these probabilistic relations fail to obtain. However, you failed to mention a third critical assumption – Causal Sufficiency.”

(b) “Moreover Tetrad is certainly controversial, with many serious thinkers skeptical of it, especially in terms of practical applications. For example, the Causal Sufficiency assumption or other assumptions may not be met in actual empirical data sets, and if the theorist stipulates these assumptions are met, Tetrad will misrepresent actual causal structure, which is exactly what happened in your Tetrad analysis.”

The suggestion that the algorithms in Tetrad uniformly assume causal sufficiency is surely not true. Causal sufficiency is assumed by GES and PC, but there are some algorithms in Tetrad that do not assume it (e.g., BIC, FCI). So, I don’t see how the concern about causal sufficiency threatens the Tetrad project as you seem to suggest in 2b above.
Relevant to 2a, though, is that the disagreements between your model and the GES model are not due to causal sufficiency.

(3) “Let me now turn to broader issues of your theoretical approach. You do not provide nearly enough context about the evidential status of the Tetrad framework from which you mount your critique. Structural equation modeling is an established method widely used in the behavioral/social sciences (PsychInfo alone showed more than 10,000 articles published over the last couple of decades). Sara and I use a very standard SEM methodology, and we execute it completely ‘by the book’, that is we carefully follow the established norms for proposing and testing an SEM model, and use SEM only to adjudicate between competing a priori hypotheses (rather than blindly searching for the best fitting model). You offer a critique based on Tetrad, a theory-free search procedure that is very much a new kid on the block. Tetrad is not established in the field by any means (I am aware of only a handful of studies in the behavioral sciences that have used it, and fewer in leading journals).”

The way I understand the objection to Tetrad is that it should be considered “untrustworthy” since it does not benefit from the long history that SEM use has in the social sciences and, furthermore, there are no established rules governing its use. Linear regression has a long history and quite a few well-established rules to go along with it. However, linear regression is typically used for causal inference, and it is very unreliable when used for this purpose. That practice is widely accepted, but widespread acceptance by a community does not confer statistical legitimacy on a practice or mean that it is reliable.

I would like to emphasize that a priori theorizing certainly does play *some* role (as you have emphasized throughout your criticism) and that Tetrad is not some sort of black box that takes data as input and spits out true models. Tetrad is a tool for understanding “holistic” structural features of the data—and learning these features can be helpful in understanding the space of theoretical possibilities that are consistent with the data. Finally, Tetrad is a far better tool than standard SEM techniques, and there are several good papers that show this. If you’re interested (and if any others are interested), I would recommend the following:

Richard Scheines, Peter Spirtes, Clark Glymour, Greg Cooper, David Heckerman, and many others have written brilliant papers explaining why and how Tetrad (or, more precisely, directed graphical search procedures) is superior to standard social science methodology. The burden of proof, then, is actually a bit misplaced, and, as I hope we’ve shown, the real proof is in the pudding.

I find it bizarre that you don't actually provide the details concerning the content of the "best" model you claim to have identified with Tetrad. Of course, some model will always in principle be best relative to all other *possible* competing models. I didn't need a statistics lesson in causal modelling to know that. But to the extent that you have successfully found the best model, why not present the model itself and explain how it accounts for/explains all of the data thus far from experimental philosophy of action better than all of the existing rival models?

An uncharitable reader might be tempted to conclude that you were silent on this issue because your method of analysis doesn't actually enable you to conclude anything about what the best model actually includes/involves above and beyond the fact that it is the best conceivable model. So, perhaps you could say a few words to dissuade the uncharitable from drawing such a tempting inference?

p.s. I admittedly read your paper quickly. So if I simply missed your discussion of the content of the best model then I apologize in advance.

There is an important difference between two models being statistically equivalent and two models being statistically comparable. I agree that the Tetrad models and your models are not statistically comparable -- that is, there is no statistical test that will tell you whether or not the differences between the models are statistically significant. But the models are definitely *not* equivalent. The Tetrad models fit the data better. Whether that difference is statistically significant cannot be answered. That said, comparing non-nested models is typical and important -- even if one isn't doing automated search -- since competing theories often lead to non-nested alternative models. What appears to be the standard line in such cases -- as we note -- is to turn to information criteria like AIC, BIC, and the like and simply take the best scoring model. Moreover, using Raftery's Bayesian analysis, the difference seems more than modest to me. By Raftery's analysis, the Tetrad models are between three and nine times more likely than your models given the data. (This assumes that they were equally likely to begin with and that at least one of them is true. I really have no problems with you saying, "Yes, but I think my models are antecedently more likely." I'm not so much aiming to convince you as to convince people who come in with no or few prior theoretical commitments.)
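Raftery's approximation can be sketched in a couple of lines: a BIC difference of Δ translates into an approximate Bayes factor of exp(Δ/2). The Δ values below are back-solved to reproduce the quoted "three to nine times" range; they are illustrations, not figures from the paper:

```python
import math

def approx_bayes_factor(delta_bic):
    """Raftery-style approximation: BF is roughly exp(delta_BIC / 2)."""
    return math.exp(delta_bic / 2)

# Back-solved illustrations of the quoted range (not figures from the paper):
print(round(approx_bayes_factor(2.2), 1))  # ≈ 3.0
print(round(approx_bayes_factor(4.4), 1))  # ≈ 9.0
```

So BIC differences that look numerically modest (a few points) already correspond to one model being several times more likely than the other under equal priors.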

You say that your interest in fitting your models was to compare the DSCA and prescriptivist accounts head-to-head. But the model doesn't really do that. If you really wanted head-to-head comparison, you would have been better off with a simpler statistical technique along with manipulations of the relevant conditions. Even with the data you have, you would probably be better off looking at a bunch of multiple regressions. At the very least, these should be checked against the SEM. (We have a nice excuse that all we had to work with was a covariance matrix.)

Your point about sample size is troubling, and I'm not quite sure what to say on that score yet. I'm not finding the Kline passage you refer to. From other sources, I trust the N > 200 threshold you report, but I'd like to look at the passage you actually have in mind. Are you sure it's on page 200 of the 1998 edition? Anyway, my first reaction is to follow Kline's recommendation of using X^2/df < 3 as a measure of good fit in this case (based on p. 128 -- in the Fit Indexes section of Chapter 5). The positive part of your model has X^2 = 4.156 with df = 1, which would not pass Kline's test. Still, there doesn't appear to be much agreement on this test.
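For the quoted figures, both the p-value and Kline's ratio can be checked directly; for df = 1 the chi-square upper-tail probability reduces to erfc(√(x/2)), so no stats library is needed:

```python
import math

chi2, df = 4.156, 1  # figures quoted above for the positive sub-model

# For df = 1, the chi-square upper-tail probability is erfc(sqrt(x / 2))
p = math.erfc(math.sqrt(chi2 / 2))
ratio = chi2 / df

print(round(p, 3), round(ratio, 2))
```

The p-value sits just under .05 and the X²/df ratio sits just above 3, so the sub-model fails Kline's rule of thumb while only barely failing the conventional chi-square test, which is why the choice of criterion matters here.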

As to breaking a model into parts, maybe I'm missing something, but I don't see how the maximum likelihood fitting function matters. Is there a statistical point that I'm missing? I agree that it would be unreasonable to insist that every sub-model of a model have good fit, but that is clearly not what we did. In earlier drafts, some of us thought the right way to put the point was that you fit a model with two irrelevant variables -- sort of like adding two noise variables to a model that you really care about. I thought that that framing was unfair: the prescriptive variables in your model matter indirectly, and it is an interesting question how they might be related to the Deep Self variables. But that doesn't mean that the positive sub-model doesn't stand or fall on its own.

I'm not sure why it is obviously unreasonable to say that a model is undermined by the fact that at *best* one of its two claims is wrong, and at worst (which is a live option), both of its claims are wrong. That sounds like bad news to me. The point about Sufficiency is a bit of a red herring, and here's why: so long as the error terms in the model are uncorrelated -- which I think you also assume in order to get identification on the SEM that you fit -- Sufficiency is satisfied if Markov and Faithfulness are. I suppose we should have been clearer here.

As to how to interpret your own account, I suppose we have to defer to you, but why think that (1) the Deep Self Factor is a single thing that gives rise both to judgments about the attitudes the agent has and also to judgments about whether the agent has a reliable disposition to behave the same way in similar contexts, and (2) this Deep Self Factor is something other than judgments about whether the agent acted intentionally, which is one of the Tetrad models?

More to the point with respect to modeling, if you thought that Att and Gen were measuring a single latent, you should have said why you fit a path model, which assumes that each measurement variable is caused by an independent latent variable, rather than a hybrid model with two measurements or indicators of the Deep Self Factor. As you say, the worry is over identification, but then, if you knew that going in, and you knew how to fix the problem, why did you make the modeling choice you made? And why do you think it is unreasonable for us to interpret your modeling choice (especially in the target paper) as committing you to the assumptions of path analysis?

I actually agree with you about being clear about assumptions. What we tried to do -- and evidently need to be clearer about -- was to start from the assumptions that *you* make and then flag any additional assumptions (Markov and Faithfulness). I suppose we could say that there are skeptics of automated methods. But, as far as I can tell, these skeptics fall into one of two groups: (1) people like Freedman who are skeptical of causal modeling across the board, unless the modeling is backed up by manipulations; as such, I don't think he's any happier with ordinary SEM practice than he is with Tetrad; and (2) a typical attitude among researchers in education and psychology that SEM is not an exploratory technique. The first group is going to be hard to appease with any SEM-related work. The second group typically hasn't looked at the machine learning literature, some of which we mention in connection with explaining the GES procedure.

I'd like to echo Edouard's remarks here ... if you think the GES procedure is broken, then say why. There may be something to say here, and if there is, it probably has to do with the asymptotics of the procedure and the sample size at hand. But if you don't think the GES procedure is broken (or inappropriately applied in this case), then why does it matter whether or not it is widely used? C.S. Peirce used randomized assignment of treatment in the 1880s, which was not widely implemented until the middle of the twentieth century. Does that mean that it was unreliable when Peirce used it? M.S. Tswett used liquid-solid chromatography to identify the components of raw chlorophyll in 1905. The technique was rejected by leading chlorophyll chemists of the period, and despite the fact that Tswett's results were confirmed in 1912, chromatography was not widely used until the 1930s. Does that mean that chromatography was unreliable when Tswett used it? I could go on, but those are my favorite examples.

I think the Tetrad models are plausible, but that probably just betrays my ignorance about the literature you mention. I haven't read the Malle and Knobe article or your earlier work; my concern was with the SEM work. Before seeing any other studies, I find it quite plausible that people make intentionality judgments and then make judgments about attitudes and action-generalizability. But another possibility that neither of us can handle with the methods we've deployed so far is that there is lots of feedback. Anyway, the possibility of feedback makes me worried that cross-sectional data is not going to be adequate, and that is a whole other kettle of fish.

Sorry, but the beginning of your comment got cut off (it was addressed to both Edouard and me). So, I’ll take a stab at responding. First, thanks for looking at the paper and taking the time to comment. To begin, the claim that Tetrad found the best fitting model should not be understood as Tetrad found the best *possible* model. What Tetrad does (at least the GES algorithm) is provide the best fitting model given the data. Of course, there can be lots of relevant data that we just do not have yet (perhaps some crucial variables have yet to be measured), and so Tetrad can say nothing about what it does not have. So, the sense of *possible* here is not to be understood as “all conceivable models”.

As to your concern about why we did not go on to explain the model output by GES and endorse it as a positive account, my view is that this model is non-explanatory. As we note, but don’t elaborate on, the only variable that causally influences intentionality judgments is the case variable. This is largely uninformative, as it does not explain what it is about the case that is influencing these judgments. Perhaps we should put this in the body so that it’s clear that we are not endorsing this model as an explanatory model.

Finally, as to the uncharitable reader, they might conclude that our method is poor, but this would be for the reader to entirely miss the point of why we conducted the search. The reason we conducted the search was because it is more reliable than the “guess and check” method that is characteristic of standard SEM practice. This is the real thrust of our search.

I hope this clears up your concerns and again, thanks for looking over the paper and taking the time to respond.

I'm not sure I'm understanding what you mean by the "content" of the models we identified. Are you asking what the models look like qualitatively (that's in the paper) or are you asking what specific predictions the models make (that's not in the paper, since we didn't give parameterized versions of our models)? In the next draft I suppose we should fill in that gap. Qualitatively, one Tetrad model predicts that manipulating intentionality judgments will cause both attitude judgments and generalizability judgments; the other Tetrad model predicts that intentionality judgments will cause attitude judgments, while generalizability judgments will cause intentionality judgments. On the first model, it seems most reasonable to me to say that intentionality judgments have not been explained at all. All we have is the phenomenon -- already known -- that manipulating case (between help and harm) changes intentionality judgments. One plausible reaction to that is to deny that the measurement variables are adequate and thus deny that the SEM tells us anything at all. That leaves open the possibility that the DSCA is correct but that the way the Deep Self Factor(s) has/have been operationalized was poor. Of course, the prescriptivists could say the same thing.

Alternatively, one could maintain that the measurement variables are just fine and endorse the Tetrad models. In that case, you either have no explanation of intentionality judgments, and the message from Tetrad is "Look somewhere else," or you have a weaker version of the DSCA -- how much weaker I don't know -- that says, "Attitude judgments aren't important, only generalizability is important."

If one endorses the Tetrad models (or one of them), then one is committed to the claim that manipulating judgments of intentionality will sometimes bring about changes in attitude judgments and that these changes weakly favor the claim that "more intentional => more anti-environment" (since the unstandardized coefficient on that edge is -0.2 in the Tetrad model). Here is a simple experiment: tell people that the CEO harms/helps the environment intentionally or not and then ask them about the CEO's attitude toward the environment. Prediction: in the harm (help) condition, if the CEO is said to act intentionally, the scores for attitude (1 = anti-environment, 7 = pro-environment) will be significantly lower (higher) than if he is said not to act intentionally; AND the difference for the harm case will be slightly larger than the difference for the help case, to account for the negative coefficient. Similar story for generalizability, though there you have to pick a model. For the model that asserts that intentionality judgments cause generalizability judgments, the coefficient on that edge is 0.21, so the Tetrad model asserts that if you get people to judge that the CEO acted intentionally, then they will be more likely to say that his action is generalizable to other similar contexts.
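The predicted pattern for the attitude measure can be sketched as a small simulation. Everything here is assumed for illustration (cell size, baseline, noise level, the hypothetical helper `attitude_scores`); only the -0.2 edge coefficient is taken from the Tetrad model:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 200  # assumed participants per cell

def attitude_scores(intentional, coef=-0.2, base=4.0, sd=1.5):
    # Assumed data-generating process: attitude (1 = anti-, 7 = pro-environment)
    # shifts by the Tetrad edge coefficient times the intentionality rating
    # (taken as 1 or 7), plus Gaussian noise, clipped to the scale.
    shift = coef * (7 if intentional else 1)
    return np.clip(base + shift + rng.normal(0, sd, n), 1, 7)

# Harm condition: CEO described as acting intentionally vs. not.
harm_intentional = attitude_scores(True)
harm_unintentional = attitude_scores(False)
t, p = stats.ttest_ind(harm_intentional, harm_unintentional)
print(f"harm condition: t = {t:.2f}, p = {p:.2g}")
```

Under these assumed effect sizes, the intentional group's attitude scores come out reliably lower, which is the direction the Tetrad model's negative coefficient predicts.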

I don't know enough about the experiments in this neighborhood to know if there are experiments that show this interpretation can't work, but on its face, it seems plausible to me. Are there good reasons to think this interpretation is inadmissible?

In any event, I thought the goal of our paper was to give a short, punchy critique of Chandra and Sara's SEM analysis, not to present an alternative theory on the basis of their data. One reason for me not to want to do the latter is that we weren't working from raw data, just the covariance matrix and sample size they report. So, for example, I can't check how things look if instead of including the variable Case one fits two models -- one for the help condition and one for the harm condition.

There is much that I disagree with in David and Jonathan’s responses at the level of details, including claims about how the BIC should be used to compare non-nested models (this is not at all an accepted practice) and the use of the chi-square to reject models at large sample sizes (published guidelines in the field say this is illegitimate, and other fit statistics, such as the NNFI, are far preferred). I do, however, fear that we are perceived as trading rather pedantic points here and that these issues won’t get much traction in this forum. I want to concentrate here on the most crucial objection I have and leave the other stuff aside for now.

If I had to focus on the most glaring issue I have with your critique, it is the issue of rhetorical excess. You claim that our own data undermine the Deep Self account. This is a misrepresentation, and very, very uncharitable. I think people may have trouble following what you are up to, so I will try to clarify it here. You carve out a part of the S&K model called the positive model. This sub-model embodies the core claim of the Deep Self account that the Case manipulation (presenting people with the Harm versus Help condition) influences Intentionality Judgments by influencing Deep Self variables (values/attitude judgments and generalizability judgments). You note that when this sub-model is considered alone, without the normative factor variables in the model, it does not fit as well as the overall model (though I claim the sub-model still fits the data pretty well). The chief issue causing the worse fit is that values/attitude attributions and generalizability judgments are correlated, and that correlation is not adequately represented in the model (this can be determined by inspecting the covariance matrix residuals, but we do not need to get into this issue here).

At this point, the reasonable thing to do, one would think, is put a path in the model to capture the correlation. After all, the Deep Self account itself proposes that values/attitude judgments and generalizability judgments both measure people’s tendency to see an action as stemming from a person’s Deep Self, so the account antecedently predicts these two variables will be correlated. But what you do instead is claim that based on the data, our account is falsified. This conclusion is absolutely ridiculous! It is hard for people to see how absurd your conclusion is because they can’t follow the math very clearly.

To make things clearer, I have drawn models that show two ways to capture the correlation between values/attitude judgments and generalizability judgments. You can look at the models here:

Model 1 simply draws a path between values/attitude judgments and generalizability judgments (as you can clearly see, there is a significant correlation between these two variables). In Model 2, values/attitude judgments and generalizability judgments are averaged and the resulting averaged ‘Deep Self’ variable is treated as a mediator. Model 2 is actually preferable because it better reflects the underlying hypothesis that values/attitude judgments and generalizability judgments measure a common underlying latent variable (I called this a ‘poor man’s latent variable analysis’ in my previous post; forming composite scores like this is common in the behavioral sciences and has a good theoretical rationale). The Sobel statistic, which is a way to test the statistical significance of mediation effects, is highly significant for the Deep Self-related variables in both models (ps for all Sobel tests are < 0.001). Thus these models show that Deep Self-related variables are highly significant mediators of the link between Case and Intentionality Judgments, just as the Deep Self account predicts. In short, a small modification to the model, one that actually better reflects the underlying Deep Self theory, is all that is needed to see that the data do support the Deep Self account.
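For readers unfamiliar with it, the Sobel z is just the indirect effect a·b divided by its approximate standard error. A minimal sketch, with hypothetical path estimates and standard errors (the actual values are in the linked models, not here):

```python
import math
from scipy import stats

def sobel(a, sa, b, sb):
    # Sobel z for the indirect effect a*b, where a is the path from the
    # predictor to the mediator and b the path from mediator to outcome;
    # sa and sb are their standard errors.
    se = math.sqrt(b**2 * sa**2 + a**2 * sb**2)
    z = (a * b) / se
    p = 2 * stats.norm.sf(abs(z))  # two-sided p-value
    return z, p

# Hypothetical values: a = Case -> 'Deep Self' composite,
# b = composite -> Intentionality Judgments.
z, p = sobel(a=0.9, sa=0.10, b=0.5, sb=0.08)
print(f"Sobel z = {z:.2f}, p = {p:.1g}")
```

With path estimates of this rough size, the mediation effect comes out highly significant, which is the shape of result being claimed for both models.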

I think your lack of charity in dealing with the Deep Self account stems from your lack of any real interest in the existing large body of philosophical theory regarding intentional action, which could be very useful in guiding model construction and revision. Your interest in this issue seems exclusively to be in taking models as they are and churning them through Tetrad, without any concern for how existing bodies of theory are relevant for model selection or model revision. So it is not surprising that you don’t even consider any alternative models and go straight for the dramatic falsificationist conclusion. I hope you consider walking back your overheated claim that you have used our own data to undermine the Deep Self account.

The rhetorical stance we take depends on the pedantics, so it is hard to divorce them. If you agreed with us on the statistical details, then you would not find the rhetoric outlandish. Similarly, if we agreed with you on the statistical details, we would not be writing a criticism of your admittedly very interesting paper.

With that in mind, I have to say three things about the details. I will then try to come back and say two things about the bigger picture in a bit.

First, we really need to lay to rest the objection that BIC is not used the way we say it is or that we are not justified in so using it. Aside from Kline (whose book is perhaps *the* standard textbook reference for applied SEM work) and Raftery (whose work appears in a nice volume edited by Bollen, the master of SEM), here are a few others:

Raykov and Marcoulides (2006) _A first course in sem_. "The two indices, AIC and BIC, are widely used in applied statistics for purposes of model comparison. ... The ECVI, AIC, and BIC have become quite popular in SEM and latent variable modeling applications, particularly for the purpose of examining competing models, i.e., when a researcher is considering several models and wishes to select from them the one with best fit" (47).

Kaplan (2008) Structural equation modeling: foundations and extensions. "A particularly important feature of the AIC is that it can be used for comparing nonnested models." (118) "As we discussed above, the AIC, ECVI, and BIC, can be used to select among competing nonnested models as well as competing nested models." (126)

Schreiber et al. (2006) "Reporting structural equation modeling and confirmatory factor analysis results: a review," The Journal of Educational Research 99(6): "When comparing nonnested models, the AIC fit index is a good choice because the difference in the chi-square values among the models cannot be interpreted as a test statistic" (330). -- They cite Kline (2005). In their Table 2, they list AIC, BCC, BIC, CAIC, and ECVI as fit indices to use in comparing non-nested models.

Rust et al. (1995) "Comparing covariance structure models," International Journal of Research in Marketing 12, 279-291. "3.2.2. Akaike's criterion. Unfortunately, situations frequently arise in which the competing models share little in common, and are not nested. In such cases, the likelihood ratio chi-square test may not be applied. In recent years, however, new methods have been proposed for comparing non-nested structural equation models, assuming that the same variables are observed for each model" (282). They then discuss AIC at length. It should be noted that Loehlin (2004) Latent variable models: an introduction to factor, path, and structural equation analysis (also a very popular/influential SEM textbook) cites Rust approvingly on p. 73, writing: "For methods of comparing the fits of models that are not nested, see Rust et al. (1995)."

And here is a very recent example of cognitive neuroscientists using AIC and BIC to compare non-nested models in applied work:

Penke and Deary (In Press) "Some guidelines for structural equation modelling in cognitive neuroscience," Neurobiology of Aging, "The models in the present report and that of Charlton are not nested (i.e. not subsets of each other). Therefore, formally comparing them is only possible based on comparative fit indices, which do not allow for a significance test. Based on one of them, AIC, our alternative (-4.85) and modified alternative model (-6.84) fitted the data better than the full model Charlton proposed based on the theoretical assumptions they made (-3.11)." Penke and Deary go on to use BIC as well and even refer to the Raftery article we cited in comparing the posterior likelihood of their models against Charlton's.

So, please, before asserting again that using BIC to compare non-nested models is not an accepted practice, at least provide a reference or two. Or better, tell me what the standard is for being an "accepted practice" in SEM research, if this use of BIC doesn't count.
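For concreteness, here is how such a comparison runs in practice: BIC is computed for each candidate model from its maximized log-likelihood, parameter count, and sample size, and the lower value wins. The log-likelihoods and parameter counts below are hypothetical; only the sample size matches S&K's study:

```python
import numpy as np

def bic(log_lik, k, n):
    # Schwarz's BIC = k*ln(n) - 2*lnL; lower values indicate better models.
    # Unlike a likelihood-ratio test, this requires no nesting relation.
    return k * np.log(n) - 2 * log_lik

n = 240  # sample size reported by S&K
# Hypothetical maximized log-likelihoods and free-parameter counts for two
# non-nested candidate models fitted to the same data:
bic_a = bic(log_lik=-512.3, k=9, n=n)   # model A: fewer parameters
bic_b = bic(log_lik=-505.7, k=11, n=n)  # model B: better raw likelihood
delta = bic_a - bic_b
print(f"BIC(A) = {bic_a:.1f}, BIC(B) = {bic_b:.1f}, delta = {delta:.1f}")
# On Raftery's scale, a BIC difference of 2-6 counts as "positive" evidence
# for the lower-BIC model, and a difference > 10 as "very strong" evidence.
```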

Second, having done a bit of quick review, I think your point about sample size being over 200 is not so clear. Here, for example, is Barrett from a special issue of _Personality and Individual Differences_ dedicated to SEM: "The X2 test is the only statistical test for a SEM model fit to the data. A problem occurs when the sample size is "huge", as stated succinctly by Burnham and Anderson (2002). They note that "model goodness-of-fit" based on statistical tests becomes irrelevant when sample size is huge. ... The numbers being used in examples of "huge" datasets by Burnham and Anderson are of the order of 10,000 cases or more. Not the 200s or so which seems to be the "trigger" threshold at which many will reject the X2 test as being "flawed"!" (820) In the same issue, Bentler picks on Barrett for rejecting alternative fit indices -- though Barrett has reasons for doing so -- but noticeably, Bentler's only complaint about sample size considerations is that he thinks models with N < 200 should be *admissible*, whereas Barrett thinks they should be rejected out of hand. Markland is the only commenting author (the special issue is really a set of replies -- six or seven -- to Barrett's article) who takes issue with Barrett's remark on sample size, and he puts the "trigger" point somewhere in the 500s (855).
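The sample-size point is easy to see from the form of the test statistic: the model chi-square is (N - 1) times the minimized discrepancy, so a fixed amount of misfit eventually becomes "significant" once N is large enough. A quick sketch with assumed values for the discrepancy and degrees of freedom:

```python
from scipy import stats

F_min = 0.05  # assumed minimized discrepancy: a fixed amount of model misfit
df = 8        # assumed model degrees of freedom

for n in (100, 240, 1000, 10000):
    T = (n - 1) * F_min        # model chi-square statistic
    p = stats.chi2.sf(T, df)   # upper-tail p-value
    print(f"N = {n:5d}: chi2 = {T:6.1f}, p = {p:.4g}")
```

The same misfit that passes comfortably at N = 100 is decisively rejected at N = 10,000, which is why the debate is over where between those extremes the chi-square stops being informative.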

The chi-square isn't the only test that fails the positive sub-model: RMSEA fails it, too, and the NNFI doesn't give resounding support at 0.96628. In any event, hard and fast cut-offs are probably not a good way to look at the issue. The issue is that the good overall fit for your model turns into dramatically less good fit for the positive part, while the negative part shows extremely good fit. So, what looks like strong support for the positive core of your model is just not derived from the positive part of the model. The multiple jobs that you have the SEM doing for you hide the relatively poor fit of the positive part. Now, we probably overstate the case against the DSCA as a psychological account or theory, as opposed to confining ourselves to the model itself. There are ways to salvage the DSCA, some of which I mentioned in my reply to Thomas. But I don't think the DSCA can be maintained along with the modeling assumptions behind your SEM analysis and the data. Something has to give. So, I'm not ready to confess to the charge that we are "very, very uncharitable." Maybe we're a *little bit* uncharitable. Maybe. Perhaps we should not have said that the *DSCA* should be rejected outright (we say that twice) -- in my opinion, "undermines" (which we use more often) is the right term (and not too terribly overheated), since something might be undermined without being destroyed.

Third, adding an edge between Att and Gen might be the right way to go, except that in that case, the model is saturated, meaning that evaluating its fit or comparing it to other models is meaningless. Your second alternative model looks more promising, but we couldn't have constructed such a model even if we had thought it was a reasonable thing for us to do, since we don't have the data you used to build the aggregated variable.
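To spell out the saturation point: a path model's degrees of freedom equal the number of distinct entries in the observed covariance matrix minus the number of free parameters, and a saturated model has zero, so there is nothing left to test. A toy count, with hypothetical parameter numbers for a four-variable model:

```python
def sem_df(p, q):
    # Degrees of freedom for a path model: df = p(p+1)/2 - q, where
    # p(p+1)/2 is the number of distinct variances/covariances among the
    # p observed variables and q is the number of free parameters estimated.
    return p * (p + 1) // 2 - q

# Hypothetical counts for a four-variable model (e.g. Case, Att, Gen, Int):
print(sem_df(p=4, q=9))   # 1 -> one df left without the extra Att-Gen edge
print(sem_df(p=4, q=10))  # 0 -> saturated: fit is perfect by construction
```

With zero degrees of freedom the model reproduces the covariance matrix exactly, so fit indices cannot distinguish it from any other saturated model over the same variables.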

Finally, for my part, I admit to being more interested in the methodological and statistical issues than the theoretical ones in this case. My research is not on intentionality or philosophy of action more broadly, it is on causal reasoning. But that doesn't mean I have no interest in how the theory plays out or would be willing to use the Tetrad models as part of a criticism if I didn't think they were plausible. I asked Thomas what in the literature makes the story I tell about intentionality judgments coming first implausible, and I'll just repeat the question to you. Why do you think the Tetrad models are beyond the pale? What other experiments make that view crazy?

Hi Chandra,
I think what Jonathan has said is both apt and to the point. I would, however, like to mention a few things:

(1) I’m not exactly sure what is meant by “acceptable” and what you mean by “admissible” when referring to your sub-model. If you truly are invested in following the rules of thumb that guide the practice, then it seems you should be willing to claim that these models are sub-par at best and inadmissible at worst (since a *typical* practice is to reject models that don’t meet the .05 cut-off for a chi-square test, and your sub-model only hits .04. Furthermore, there are surely standards surrounding the use of other measures such as BIC and AIC that would also suggest we should select an alternative model, one that is not the DSCA. Jonathan has provided references suggesting that the standard would be to use AIC and BIC to compare non-nested models. He has also provided references showing that your disavowal of rejecting a sub-model based on a “large” sample size is misplaced. As I mentioned before, 240 doesn’t appear very large, and the standard, going by the references Jonathan gave, seems to be that the sample size must be “huge” before we can set aside the chi-square test. I would also like to point out that it is “standard” practice to investigate why a model has good overall fit by looking at that model’s component parts. Another “standard” is that Markov and Faithfulness be met, and as we’ve shown, your model violates these “accepted” conditions. So, if you are invested in following the rules of thumb, then shouldn’t you also accept these rules of thumb? And if you do, then shouldn’t you be willing to admit that your model is incorrect? And if not, what would be your justification for not endorsing something like, say, Markov and Faithfulness?) Again, I would like to emphasize that the sub-models are being investigated because we would like to know why the overall model fits so well. It turns out that the reason it does is that the sub-model embodying the DSCA fits *poorly* while the fit of the other variables is *excellent*.

(2) As I take it, your account is undermined. The account predicts A&B and, at best, your data show only A or B. This is enough for me to think the account false. Now, as Jonathan rightly pointed out, you may at this point claim that your measures were very poor and that this is why the predictions of your theory are not borne out. But again, this move is open to the prescriptivists, and the whole study would be rendered useless if both parties go for this. So, I think the right conclusion to draw from our overall analysis is that your model is undermined.

(3) “Your interest in this issue seems exclusively to be in taking models as they are and churning them through Tetrad, without any concern for how existing bodies of theory are relevant for model selection or model revision”. This is surely a misrepresentation of what we are doing. We’re not just “churning models through Tetrad” in order to see what pops out. As a matter of fact, we only “churned out” one model in Tetrad, and this was the result of a GES search. But this was not done for the sake of “churning out a model”. Rather, it was to make a methodological point; namely, that GES is superior to your “guess and check” method. As for our other two criticisms, we showed that the models you offer in your paper only have the appearance of supporting the DSCA: when we investigate why your model fits, we find that the part of the model that embodies the DSCA fits very *poorly*. Lastly, we showed that your own model violated Markov and Faithfulness, two very crucial conditions for causal modeling (and, I take it, conditions that *should* be met). Our paper is largely methodological, and these are issues that I think are crucial, especially for experimental philosophers. If causal graphical modeling is being endorsed as a legitimate way to conduct experimental philosophy, then I think people should be aware of the various issues surrounding this practice. And, furthermore, I think that what we’ve shown is that there are issues one should be careful about when using these techniques. So, I think that what we’ve done has much value (aside from the philosophical issues surrounding intentional action), especially if experimental philosophers are interested in using these techniques. But I should also point out that our analysis bears only on a small number of theories of intentional action. The “right” model may well be out there, and I think it’s important to continue looking for the correct model.
I personally am interested in these theories and have been working on developing my own account (and so naturally, I think the right account hasn’t already been put forward). But again, the methodological issues are just as important and I think that what we’ve shown is both how to be methodologically sound (at least with respect to the limited set of issues we deal with in our paper) and why it’s so important to be methodologically sound (since it bears on the theories we are ultimately interested in).

I want to keep the focus on what I think is the most glaring issue in your critique that I still don't feel is addressed in your responses. Jonathan touches on it, and David does not deal with it at all.
I agree that the positive model does not have as great a fit as the negative model. I conceded that already. We can bicker about how good a fit it has, but this is almost in poor taste because if there is one universally held maxim in SEM, it is that arguing over borderline fit statistics is pointless. Rather, one should go out and do further tests! I am certainly in favor of this (more on future studies later). But even if we go your way and say the fit of the positive model is poor, this too is kind of irrelevant. My suggestion all along has been to modify the positive model in the direction of one of the two models I linked to in my last post (here: http://sitemaker.umich.edu/sripada/files/deepselffigures.pdf ). I said these slightly modified models better reflect the actual Deep Self account, and indeed are actually how I model the data in other studies. This is how Jonathan responds:

‘… adding an edge between Att and Gen might be the right way to go, except that in that case, the model is saturated, meaning that evaluating its fit or comparing it to other models is meaningless.’

This is an inadequate response. You are absolutely right that adding this path makes my model saturated (actually, both my suggested models are saturated), thus making the models more akin to multiple regression than SEM. But so what? I would have liked the DSCA to have support through an SEM method rather than a multiple regression method, but it turns out, given the structure of the data, the SEM method has to be put aside here in order to make a small change in the model. Once you make the change I suggested, the resulting models gain support through the statistics that are appropriate for non-SEM models, i.e., tests on the regression coefficients linking the variables and the Sobel statistics for mediation, which are all highly significant (ps < 0.001). All your overheated claims about undermining the DSCA are based on failing to make the slight modification to the positive model that I have suggested, and you have barely discussed my suggested modification despite my making this point now in three successive posts. Is it because you think the modification is post hoc? This isn’t true, because the modification actually better reflects the Deep Self account as described in the manuscript, and I have *already* modeled my data in this modified way in my current analyses, prior to your critique ever coming out (as David and Edouard already know). So you need to either provide principled reasons why this slight modification is unacceptable, or else walk back your overheated claims.

So here is the final scorecard as I see it: Our rejection of normative factor models gains support through an SEM-level analysis (i.e., the negative model) and you agree with this claim. Our support of the DSCA gains its clearest support through a multiple regression-level analysis (of the modified positive models I linked to in my last post), and you have said not a word to disagree with this claim. The DSCA certainly isn't proven once and forever -- but no one ever claimed such a thing. The DSCA (when modeled in the way I suggest) is merely supported by the data and now we need to do further studies, especially collecting data that would allow an SEM-level analysis of the positive model. What is so earth-shattering in all this that licenses the overheated claim that you have used our own data to undermine the DSCA?

You do a very nice job in your last post of helping me see the BIC is more widely used than I first thought. I do however think that using the BIC in the way you suggest is not ‘established’. What does that mean? I will get into that and the other issues you raised in a comment to follow. Right now, I want to keep the focus on the issue that I raised above in this post. I think this is too important to get lost in a sea of ‘pedantry’ (as admittedly useful as the pedantics may be).

I feel that you are losing track of the dialectic, so let me remind you of two points.

First, the best fitting model with the variables you used in your forthcoming paper (output by GES) states that only Case influences people's intentionality judgment (Int). At the very least (that is, even if you do not conclude from its superior fit that the GES model is the better one), this means that your data do not support the two causal models you derived from the DSCA, since these data are consistent with an incompatible model. In fact, these data would not support any model derived from the DSCA. (Note that this does not mean that they undermine it, just that you cannot call on them to support your views.) I am still waiting for an answer to this point.

Second, you now seem to concede that the two causal models that were meant to represent the DSCA are undermined by the data. Good! This is progress! Note that this was the goal of our paper: You claim that some data support two causal models derived from the DSCA. We argue that they don't.

We made the further claim that this undermined the DSCA. As Jonathan notes, we should modify this claim: A more cautious conclusion is that if the two models are meant to represent the causal claims made by the DSCA, then the DSCA is undermined. (And as far as I remember, the M&L paper nowhere says that the two models are only approximate representations of the DSCA.)

Now, you insist that in fact there is a causal model that is a better representation of the claims made by the DSCA (one I should apparently be aware of - did you present it at the MPRG? I don't remember). In effect, you reject the antecedent of the conditional above. This might or might not be the case. I need to look in more detail at what you propose.

But, at this point, it is fair to say that we need not deny this. Again our target was the claims you make in the forthcoming paper in M&L. I feel that we now agree about them: These claims are mistaken.

First, I agree (at least in the main) that bickering about borderline-significant model fit is unhelpful. And I agree that more studies are called for -- studies that manipulate the variables of interest.

Second, I didn't know that you had changed your modeling choices as part of a new paper. Hence, I haven't evaluated the new position you take.

Third, the reason I remarked that the new model over the same variables is saturated is that I don't have your data to work with. I only have a covariance matrix. I can do some things with covariance matrices, but I can't fit a multiple regression from one without making some strong assumptions that I don't think actually hold in this case. I can't do anything at all for your second model, because I can't construct the new variable from the covariance matrix on the old variables -- at least, if these things are possible to do, I don't know how to do them (if anyone reading this does, please share!). So, you might very well be right, but I can't check for myself. My comment, then, was meant to flag this fact, not to be a criticism of your models. Sorry if I said that badly the first time around ... or this time around, for that matter!

I'm not sure anything we do is Earth-shattering ... it was only supposed to be DSCA shattering, after all. ;) Seriously, though, the further models you propose may support the DSCA, but our claim is not about those models, which make different assumptions. Our claim is that given the assumptions behind the path analysis you did, and given your reported covariance matrix, the DSCA is not confirmed but disconfirmed. Inference requires both data and assumptions, so if you change the modeling assumptions, you may very well get a different inference. That's fine with me. I'm not opposed to the DSCA as such -- I'm not sure that I find it all that plausible, but I don't have a dog in the intentional-action fight.

Put another way, I take it that this argument has the same structure as many philosophical arguments. You have set out some assumptions and drawn a conclusion. Given those assumptions, I claim a different conclusion follows. Changing the assumptions, you might get your original conclusion back again, but that won't preserve the original argument. If I'm right, the original argument is not persuasive. (That might actually be a good thing for the DSCA ... it might be that the new assumptions are more plausible, and hence, the new argument more persuasive if successful.) Anyway, I don't see any reason to characterize our claims as "overheated."

So, as I see it, the scorecard reads slightly differently: nothing that I can really comment on tells in favor of the DSCA, rather, what I can test looks unfavorable to the DSCA; there might very well be other things going on that end up favoring the DSCA; I haven't said anything about those things, because I *can't* say anything there without access to raw data.

And that brings us back to what I see as the two main disagreements between us. The biggest disagreement is about model-building. We think that model building should be done as much as possible from a data-first perspective; whereas, you think that models should be built from "background theory" (which I put in scare quotes because I find it typically pretty squishy, plastic, or hard to precisify). Second, we disagree about the initial plausibility of the Tetrad models. I really would like to know why you think those models are implausible on their face.