Why do beginner econometricians get worked up about the wrong things?

People make elementary errors when they run a regression for the first time. They inadvertently drop large numbers of observations by including a variable, such as spouse's hours of work, which is missing for over half their sample. They include every single observation in their data set, even when it makes no sense to do so. For example, individuals who are below the legal driving age might be included in a regression that is trying to predict who talks on the cell phone while driving. People create specification bias by failing to control for variables which are almost certainly going to matter in their analysis, like the presence of children or marital status.

But it is rare that I will have someone come to my office hours and ask "have I chosen my sample appropriately?" Instead, year after year, students are obsessed with learning how to use probit or logit models, as if their computer would explode, or the god of econometrics would smite them down, if they were to try to explain a 0-1 dependent variable by running an ordinary least squares regression.

I try to explain "look, it doesn't matter. It doesn't make much difference to your results. It's hard to come up with an intuitive interpretation of what logit and probit coefficients mean, and it's a hassle to calculate the marginal effects. You can run logit or probit if you want, but run a linear probability model as well, so I can tell whether or not anything weird is going on with the regression."

But they just don't believe me.

I am happy to concede to Dave Giles that, all else being equal, it is better to use probit than ordinary least squares, and that Stata's margins command is not that difficult for an undergraduate to use.

But all else is not equal. Using probit will not save a regression that combines men and women together into one sample when estimating the impact of having young children on the probability of being employed, and fails to include a gender*children interaction term. (The problem here is that children are associated with a higher probability of being employed for men, and a lower probability of being employed for women. These two effects cancel out in a sample that includes both men and women.)
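A toy calculation makes the cancellation concrete. The numbers below are made up for illustration, not real data; with a single binary regressor, the LPM coefficient is just the difference in mean outcomes between the two groups, so we can compute it directly from cell counts:

```python
# Hypothetical employment counts: (employed, total) by gender and kids status.
cells = {
    ("men", "kids"): (90, 100),
    ("men", "nokids"): (80, 100),
    ("women", "kids"): (60, 100),
    ("women", "nokids"): (70, 100),
}

def kids_effect(groups):
    """LPM coefficient on a lone 'kids' dummy = difference in employment rates."""
    emp_k = sum(cells[g, "kids"][0] for g in groups)
    n_k = sum(cells[g, "kids"][1] for g in groups)
    emp_nk = sum(cells[g, "nokids"][0] for g in groups)
    n_nk = sum(cells[g, "nokids"][1] for g in groups)
    return emp_k / n_k - emp_nk / n_nk

print(round(kids_effect(["men"]), 2))           # 0.1: kids raise men's employment
print(round(kids_effect(["women"]), 2))         # -0.1: kids lower women's employment
print(round(kids_effect(["men", "women"]), 2))  # 0.0: pooled, the effects cancel
```

No amount of probit machinery fixes this: the problem is the pooled sample, not the link function.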

Once students know how to appropriately define a sample, deal with missing values, spot an obviously endogenous regressor, and figure out which explanatory variables to include in their model, then it might be worth having a conversation about the relative merits of probit and linear probability models. Until then, I'm telling my students to use the regress command and, if it makes them feel better, stick "robust" at the end of it.

They don't listen.

It all comes down to the way that they have been taught econometrics. Most - not all - econometrics classes emphasize statistical theory. Students might run regressions, but often these are canned, ready-made examples, with the parameters of the analysis clearly defined, or straightforward replication exercises.

Econometrics is taught that way for a simple, practical reason: it's easy. When every student downloads his own data, works on his own unique problem, and specifies a novel and original model, each student will need a lot of individual help and attention. The marking cannot be delegated to a TA, because each research question, and each data set, is different, so it is impossible to write down a simple answer key. But spending hours upon hours reading students' first struggling steps at regression analysis is a huge amount of work. It's so much easier to mark a final exam consisting of calculations, short answer questions, and replication of theorems.

No one in my honours seminar this year is taking that easy route - and it's tough going.

Econometrics is a journey. Logit and probit are just one step on the path towards enlightenment. Once one arrives at probit, and can calculate marginal effects with ease, another challenge awaits - bootstrapped standard errors, perhaps, or correction for sample selection bias. The ultimate goal - identification of causal relationships - may never be achieved - but we journey down the path nonetheless.

Students need to discover that all econometricians reach a point where they say "there's only so much I can do, I'm going to stick with this regression, even though I know it's not perfect".

My favourite advice for would-be researchers comes from Alice's Adventures in Wonderland:

Begin at the beginning...and go on till you come to the end: then stop.

The beginning of a good piece of applied econometrics is formulation of a theory; a hypothesis about what matters and why. The next step is the identification of the sample - who the model applies to. Then comes the model specification - figuring out some way of establishing a relationship between the explanatory variables and the thing that is being explained. Only then do considerations like the choice between probit and linear probability model come into play.

Comments

This is a great post. I would add that students and professionals should look carefully at how the data are produced. They should look at the actual question in the survey not just the text description in the dataset. Even the order of the questions can have an impact on the usefulness of the data. As you note, the samples are critical. The challenges with the National Household (Harper) Survey are well known but Statistics Canada has been changing the samples on many key surveys.

Another way of putting this might be that these problems arise because you teach econometrics/statistics first. If you started with coursework on research design, sampling, representativeness, and so forth -- and then dove into the math -- the grave mathematical errors warned against in an econometrics class might be taken in the context of the much more grave research design issues that can take even a perfectly executed statistical analysis and render it useless. It's not that the econometrics is poorly taught, it's that there's a whole other course that should be taught first!

Hi Frances. This post certainly resonated with me! I teach a 4th year applied seminar course in which students must write an empirical paper. I've taught it for 12 years, so I have confronted the same issues you raise here.

The particular obsession I face with my students is Rsq. They really worry theirs is not big enough. So, I devote a lot of time to that in class.

For binary dependent variables, I have them do probit using Stata's 'dprobit'. I was a bit confused by your section on probit in the post because 'dprobit' is just so easy. Your point about LPM being 'good enough' in most cases is correct though--that is simply not where the body is buried. I've gotten the same thing from old-school referees and editors who insist on probit instead of LPM+hetero adjustment.

Oh jeez, the Rsq thing, yes. I have now drilled into my students the following simple diagnostic test: if your Rsq is greater than 0.5, something is wrong with your regression, and you must come and see me in office hours. The emphasis at Carleton is more on macro-econometrics than on micro-econometrics, so people are absolutely shocked the first time they run a x-sectional regression and come up with an R2 of, say, 0.1.

On dprobit - perhaps you can help me out with something. Here's a bit of code one of my students was struggling with, and the associated Stata output:

I never used to get this problem in earlier versions of Stata, but now it seems that Stata is trying to push people towards using the margins command by becoming more and more fussy about the use of mfx and dprobit.

Kevin, the other problem I'm having is that the coefficient estimates that people are generating by using probit plus the margins command are an order of magnitude different from the results they get from a LPM, and I don't know why.

Couldn't you interpret your students' obsession with formalism and technique as just a sign that they've already absorbed a lot of the culture and mindset of present-day academic economics? Although this obsession may limit their ability to do useful research in the future, it might also help at least some of them to achieve professional success, by making them focus on the things that matter for success.

Maurice, I could indeed - which raises the question "who should be teaching students econometrics?" Should it be someone like Kevin, who is basically an applied micro guy, or should it be an econometric theorist?

It's beside the point but every time I hear econometrics I feel the need to quote the inimitable twitter commentator Economist Hulk: WHEN FACTS CHANGE, HULK SMASH FACTS UNTIL THEY FIT HIS PRE-CONCEIVED THEORY. HULK CALL THIS ‘ECONOMETRICS’

This reminds me of many of Stephen Gordon's published articles in Maclean's and the Globe & Mail.

I don't recall this being a problem in my undergrad. We learned linear probability models, and we discussed their limitations in terms of producing predictions outside the [0,1] range. Logit and probit were mentioned in passing. Hardly anyone used them in undergrad papers.

Based on that, I'm guessing the undergraduate econometrics course you took was this one, or something like it: https://courses.cit.cornell.edu/3200/Syllabus3140.pdf. That course outline says: "Frequently, we will attempt to compensate for problems with data quality by using knowledge (or at least assumptions) from economic theory. This means that to do econometrics well, you need to know your economics and not just your statistics." That is my ideal of what an undergrad econometrics course should be like.

"...which raises the question "who should be teaching students econometrics?""

I'd go further and ask whether it makes sense to teach anything beyond basic statistical methods to undergraduates...what are students not learning in the many hours they spend pounding the technicalities of multivariate regression analysis into their brains?

In my career I've worked with many people with economics degrees. It has often struck me there is a large disconnect between what these people learned in school and what they ended up doing for a living. I don't mean that they're underemployed relative to their training level (although that does occur). I mean that in many cases they seem to lack what might be called "everyday analytical skills" - the skills necessary to know, for example, what can and cannot be reasonably inferred from examining a set of time-series plots. I don't mean to say all economics grads lack these skills, but I've seen this kind of thing often enough to make me think it points to a serious deficiency in economics programs. In particular, activities that might help develop "everyday analytical skills" - working through data-heavy case studies, using basic quantitative methods to examine claims made by government officials and the media - don't seem to be a big part of the undergrad curriculum.

I suspect there are two things at work here. The first is that many professors aren't interested in teaching the kind of mundane stuff I'm talking about - quite understandable, given the professors' own high level of training. The second is that economics departments like to see their undergraduate programs as farm-teams for the big leagues of grad school and (eventually) academia, and feel obliged to equip students with the knowledge they'll need to succeed when they go on to further study.

Stata version 11 introduced a new syntax for "factor variables" and replaced -mfx- with -margins-. The older -dprobit- command does not recognize this new syntax (so it would allow i.relationship, but not ib4.relationship).

There are several ways to compute marginal effects with -margins-. Average marginal effects after -probit- should be very close to the LPM coefficients estimated by -regress-, but other marginal effects (e.g., computed at the means) may differ.
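The gap between the two can be seen with a back-of-the-envelope calculation (the coefficient and index values below are illustrative, not Stata output). For a probit, the marginal effect of a continuous regressor at index xb is the normal density times the coefficient, and averaging the density over the sample is not the same as evaluating it at the sample mean:

```python
import math

def npdf(z):
    """Standard normal density."""
    return math.exp(-z * z / 2) / math.sqrt(2 * math.pi)

b = 1.0                            # hypothetical probit coefficient
xb = [-2.0, -1.0, 0.0, 1.0, 2.0]   # hypothetical fitted index values across a sample

ame = sum(npdf(z) * b for z in xb) / len(xb)   # average marginal effect
mem = npdf(sum(xb) / len(xb)) * b              # marginal effect at the mean of x

print(round(ame, 3))  # 0.198
print(round(mem, 3))  # 0.399: twice the AME when the index is spread out
```

Because the density peaks at zero, evaluating at the means can overstate the effect substantially relative to the sample average, which is the version that should line up with LPM coefficients.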

Maurice Lechat: In my whole career, and I always worked as an economist, I rarely used high-level stats or econometrics. At this moment, I am administering a colleague's exam and I wonder if I could pass the exam myself, insert a half-wink here (a student just left, having capitulated...).
In our "Introduction à l'économétrie" course, we often joked that we could get an A by reading Johnson's first chapter and a B by not falling asleep. Some days, it seems that after almost 40 years, all I remember is that the critical value of the D-W test is 2.16. But must it be at, below or above? And is homoscedastic marriage legal in Canada?
On a more serious note, out of approximately forty graduating students in our class, 4 were in the "Économétrie pure" stream. At a later reunion, a couple of them told us that they wished they had taken what they called the "real economics" courses. They were tired of manipulating numbers they didn't understand. Call it the Kotcherlakota syndrome...

Benoit - thanks for that. Do you know if there is any way of getting around the problem with ib3 in dprobit other than either recoding the categorical variable so that the desired base case is the first category, or typing in a long list of dummy variables? E.g. with provincial data, I really don't like having NFLD as the base case, as it's different from the mainland in a lot of ways, plus it's small, plus students don't necessarily know much about it and that makes it hard for them to interpret the results. I would rather have QC, ON or BC as the base case, because these are the big provinces population-wise. But it's a hassle to sit there and type in 9 provincial dummies + territorial dummies.

The student was calculating marginal effects at the means, that was the issue. Do you know how I can tweak the margins command so that it gives the same results as a LPM?

I can attest that this issue is present in other programs as well (Epidemiology comes to mind). I think it is partially that professors have internalized the subject-specific knowledge and no longer instantly recall how much work it was to get a good grounding in it at the beginning.

"To which there are clearly two possible solutions
(a) we should add another course to the program or
(b) other people should redesign their courses and teach things differently."

Agreed, but again I'd go further. I think the first couple of years should be about equipping students with core knowledge in quantitative methods and economic theory. In third-year the emphasis should switch toward the sort of applied analysis most economics grads are likely to do in their working lives. (I wouldn't rule out a more academic stream for students who see themselves going on to grad school.)

I should say my first comment above was meant to be somewhat sarcastic - not toward yourself, of course, but toward the state of academic economics. I was thinking of a couple of people I knew when I was in grad school. These guys were hard-core mathematical economics types, prime examples of geek-machismo who took pride in using difficult, obscure math. I used to tell them: "Math is just a set of tools. Why do you guys make such a big deal about it? Why use a gold-plated hammer when an ordinary hammer will do the job just fine?" The answer I got was always something like: "You just don't understand...in economics using advanced math is how you show people how smart you are." To which I'd respond "that's just messed up" (although I may not have said "messed").

Frances: which raises the question "who should be teaching students econometrics?" Should it be someone like Kevin, who is basically an applied micro guy, or should it be an econometric theorist?

I have always maintained that you should never let a micro theorist teach core micro (at undergraduate or graduate level), and you should never let an econometric theorist teach core econometrics (again at any level). Core, by definition, is something that is common to all economics, not just a specialist field; applied economists are more attuned to that reality. In the case of econometrics, I would go further and claim that you should never let a time-series econometrician teach core econometrics, but that might be a bit harder to sell to the econometrics fraternity!

Seamus - "Core, by definition, is something that is common to all economics, not just a specialist field" - interesting observation. I think a lot of theorists would agree with it as applied to Econ 1000; fewer would if it were applied to grad micro/macro/econometrics.

No, no, the snark is definitely there...I just didn't read carefully enough to pick up on it...my fault.

I too found Seamus Hogan's comment very interesting. What he said makes a lot of sense to me, but I gather this is not the way things are in economics. I wonder if this reflects some kind of pecking-order effect. My impression is that theorists are considered the royalty of academic economics, a cut above people who do applied work, and so it's usually thought best - or at least appropriate - to let a theorist teach a course in their field, even where an applied person might do a better job giving students what they really need to know. (In saying this I don't mean to offend anyone...just my impression.)

I personally like NFLD, but I understand that it may not be the best choice as a reference case. For Stata commands that do not recognize the new syntax for factor variables, the base category of a variable can be changed through -char-. Suppose for example that variable province is coded 35 for Ontario. This category could be set as the reference as follows:
char province[omit] 35
xi: dprobit yvar i.province

With the new syntax, there is a choice of alternative solutions to achieve the same result:
probit yvar b35.province
probit yvar b(freq).province // picks the most frequent category as the reference

You can also permanently set the reference category with -fvset-:
fvset base 35 province
but this setting will only be taken into account by commands that allow the new syntax.

Average marginal effects are the default with -margins, dydx(*)-, while marginal effects conditional on mean values have to be explicitly requested with -margins, dydx(*) atmeans-.

"Why do beginner econometricians get worked up about the wrong things?"

Because they don't understand what they are doing. Training people to use software packages creates technicians. You shouldn't be surprised when students think like technicians when they've been trained that way.

Remember, econometrics was developed without computers, in much the way that numerical methods were developed without computers. As a comparison, look at how applied mathematicians teach numerical methods/analysis. We don't set them loose to use a bunch of packaged numerical integration routines; we teach them why people worried about certain types of problems, how we overcame special cases or increased speed, and then we have the students code their own proof-of-concept implementation only after understanding the mathematics behind the result (lemma, proof, ...). Only then do we set them loose on professional packages. That way, when the package doesn't do what you expect, you know what to think about (and occasionally find a bug in professional software!).

If students can't prove the ADF test, what's the point of having students check for unit roots? It becomes an exercise in memorizing a sequence of black box routines.

@Avon I would submit that learning proofs to things is good practice if you want to do more proofs, or in some circumstances can be helpful in highlighting the assumptions that go into a result, but I'm not sure that time spent proving stuff wouldn't be better spent just learning more rules. I learned the Pythagorean rule in junior high, and didn't learn to prove it until undergrad. I don't think proving it helped me use it any better. Who cares why it holds? Pythagoras figured it out so I don't have to. What's the point of checking his work?

There's a tradeoff between breadth of knowledge and depth of knowledge, and as far as a university education (especially undergrad) I prefer breadth over depth. It's easy for me to go back later and deepen my knowledge about a given topic, it's a lot harder to learn about things I've never heard of (unknown unknowns).

I posted a link to this post to the tch-econ listserv, thinking some of the people there might be interested. Here's what I posted there:
---------------------
Which I have never done (and, now that I'm retired, will never do). Nonetheless, I thought this blog post makes some very interesting points:

@Kailer Mullet It's rather stunning to see an economist write that. Academic disciplines, like economics or physics, etc., are about finding the truth near as humans can find it. You cannot add to that knowledge or truly appreciate it if you won't put in the hard yards. Understanding a proof is not about checking someone's work, it's about understanding truth.

It is the mark of deep ignorance (and I mean that in the technical sense of the word, not as an insult) to suggest that one should largely not care about understanding the proof of the Pythagorean theorem. If you understand what the Pythagorean theorem is all about, it will lead you to ideas like extremal paths, metric spaces, differential geometry, and topology. All of these ideas are important in economics.

"Economists" who don't see the point in understanding what's under the hood become marketers for banks and pet government programs. They end up using software packages completely inappropriately, like using ARMA models to forecast exchange rates (and I have seen this on more than one occasion from MAs). They become salesmen of policy and advocates for an ever growing government.

I love watching students run a regression using "male" as a dummy variable, and then running it again using "female" as a dummy variable, and then trying to stick both male and female in the regression and seeing that stata insists upon dropping one of the variables - and then running all the regressions two or three times to figure out what's going on - and then suddenly, at the end of the day, a light goes on, and the student suddenly *gets* what dummy variables do.
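What Stata is detecting can be checked by hand with a few lines of made-up data: once an intercept is in the model, male + female = 1 for every observation, so the three columns are linearly dependent and X'X cannot be inverted.

```python
# Made-up data: with an intercept, male + female = 1 for every observation.
male = [1, 1, 0, 0, 1, 0]
female = [1 - m for m in male]
const = [1] * len(male)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

cols = (const, male, female)
XtX = [[dot(a, b) for b in cols] for a in cols]

# Determinant of the 3x3 Gram matrix X'X: zero means it is singular,
# which is why one of the dummies must be dropped.
det = (XtX[0][0] * (XtX[1][1] * XtX[2][2] - XtX[1][2] * XtX[2][1])
       - XtX[0][1] * (XtX[1][0] * XtX[2][2] - XtX[1][2] * XtX[2][0])
       + XtX[0][2] * (XtX[1][0] * XtX[2][1] - XtX[1][1] * XtX[2][0]))
print(det)  # 0
```

The "light goes on" moment is realizing that the dropped dummy isn't lost information: its effect lives in the intercept.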

Benoit answered your questions I think. I have always used the char [omit] thing to control which dummy is dropped. This is important since prov==10 (i.e. NL) is not the best one to be dropped, and that is often dropped by default. I wasn't aware of the ib4. syntax.

For your other question about the magnitude of the marginal effects, I've not seen large differences between OLS and dprobit. The one exception is for dummy variables where the magnitude of the impact is large. Since dummies aren't continuous, the 'marginal' isn't really a 'marginal'. Dprobit calculates norm(XB) when the dummy equals 0 and again when it equals 1 and takes the difference. When that dummy is influential, it can move XB so much that the curvy aspect of the normal distribution bites. This can make dprobit marginals for dummy variables smaller than what you see for LPM. I don't have an explanation for things going in the other direction, if you're finding bigger LPM coefficients!
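Kevin's point can be illustrated with a quick calculation (the index and coefficient below are hypothetical, just to show the mechanics). For a dummy with a large coefficient, the discrete change Φ(xb + b) − Φ(xb) that dprobit reports is noticeably smaller than the derivative-based approximation φ(xb)·b, because the normal CDF flattens out in the tails:

```python
import math

def ncdf(z):
    """Standard normal CDF."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def npdf(z):
    """Standard normal density."""
    return math.exp(-z * z / 2) / math.sqrt(2 * math.pi)

xb, b = 0.0, 2.0                    # hypothetical index and dummy coefficient
discrete = ncdf(xb + b) - ncdf(xb)  # the discrete change dprobit reports
linear = npdf(xb) * b               # the derivative-based approximation

print(round(discrete, 3))  # 0.477
print(round(linear, 3))    # 0.798: the curvature "bites" for big effects
```

For small coefficients the two are nearly identical; the gap only opens up when the dummy moves the index across a big chunk of the distribution.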

I read your post about cookbook econometrics. Shocking. Your suggestion that a deep desire to understand before exploring data is somehow gender specific is insulting to both men and women.

If your department admitted curious students, they would find it deeply unsatisfying to apply a computerized analysis without first understanding what's going on under the hood. That incurious students are admitted at all probably points to rent seeking in academia.

Economics is a beautiful discipline and there is much we don't understand. It's sad some professors reduce it to technician work. Technician graduates will end up marginally employable, running useless forecasts that promote some new government program or social engineering piece. Technicians without understanding do not have a future in today's world, except possibly in government. As a physicist turned quant, I can tell you none of them will be hired by people like me - I can guarantee that much.

Back in 1972 we got a brand new CDC 6400 computer, with a brand new random number generator, touted as "the real thing". For fun, I generated 1000 random numbers, divided them into 20 groups of 50 each, arbitrarily chose one as the dependent variable and the others as independent variables, then ran a stepwise regression. I got an R^2 of 0.5 and several significant coefficients. Startled, I repeated the experiment several hundred times and always got R^2 between 0.4 and 0.6.

Kind of ruined my confidence in regression analysis.
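George's experiment is easy to replicate. The sketch below is a simplified pure-Python version with a fixed seed: instead of stepwise selection, all 19 noise regressors are entered at once, so under the null the expected R² is about k/(n−1) ≈ 0.39 even though y is pure noise.

```python
import random

# Regress pure noise on pure noise: 50 observations, 19 regressors + intercept.
random.seed(1972)
n, k = 50, 19
y = [random.gauss(0, 1) for _ in range(n)]
X = [[1.0] + [random.gauss(0, 1) for _ in range(k)] for _ in range(n)]
p = k + 1

# Normal equations: (X'X) beta = X'y
XtX = [[sum(row[a] * row[b] for row in X) for b in range(p)] for a in range(p)]
Xty = [sum(row[a] * yi for row, yi in zip(X, y)) for a in range(p)]

# Solve by Gauss-Jordan elimination with partial pivoting.
A = [XtX[r][:] + [Xty[r]] for r in range(p)]
for c in range(p):
    piv = max(range(c, p), key=lambda r: abs(A[r][c]))
    A[c], A[piv] = A[piv], A[c]
    for r in range(p):
        if r != c:
            f = A[r][c] / A[c][c]
            A[r] = [x - f * z for x, z in zip(A[r], A[c])]
beta = [A[r][p] / A[r][r] for r in range(p)]

# R-squared: far from zero even though y is unrelated to X.
yhat = [sum(b * x for b, x in zip(beta, row)) for row in X]
ybar = sum(y) / n
ss_res = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))
ss_tot = sum((yi - ybar) ** 2 for yi in y)
r2 = 1 - ss_res / ss_tot
print(round(r2, 2))  # typically around 0.4 for pure noise
```

Stepwise selection, which George used, makes matters worse still, since it cherry-picks the regressors that happen to correlate with y by chance.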

FWIW it's kind of fun to get a book on regression aimed at statistics students (e.g. the one by Thomas Ryan) and an econometrics book and read them side by side. Doesn't look like the same subject at all.

George - but what did it do for your faith in Monte Carlo experiments? ;-)

Avon - on Tuesday I asked my students who were doing their final presentations to tell me what they'd learned in the course. Admittedly they were speaking in front of the class, and could see me grading them on the computer as they talked, so they were obviously going to put a positive spin on things. But some of the things they said actually moved me so much I just about burst into tears. One student said "I learned how to read a research paper in this class - how a paper is structured, and what the regression results mean."

Avon - I think the point about gender in the cookbook post is that one of the reasons "cookbook econometrics" is insulting is that it's a feminine metaphor. "Repair manual econometrics" wouldn't be nearly as insulting because it suggests looking under the hood and fixing things in a manly way. Just like it's more insulting to call a man a c**t than a d**k. That's all.

Frances - There is nothing wrong with cookbooks as long as you understand what is going on under the hood first. In mathematics and physics, there are tons of cookbooks for difficult calculations: Feynman diagrams in quantum field theory, Dynkin diagrams in Lie algebras, etc. Physicists and mathematicians look hard for simplifying structures that permit a cookbook recipe. They are proud when they have created the cookbook - there is nothing derogatory about it. But the point is, students of physics and mathematics don't use the cookbook without understanding precisely why it works. Curious students would feel "dirty" doing that - it's all technique and no understanding. Once you understand why the cookbook works, you get a flash of insight (and then you realize just how much smarter people like Feynman were than the rest of us). Really, the point is to understand the motivation for the cookbook in the first place and how the cookbook organizes the problem in a sensible way. University should be for the curious, not for people just looking to get a job (or because their parents think that university is a good idea). Curious people will always land on their feet, there's no need to ensure they can click through a GUI interface. But as I said, rent seeking in academia will extend university to the incurious as it expands the student population and faculty, siphoning more money from the public purse. Perhaps that would make a good research question for your more curious students.

Thanks for the Kennedy document, Kathy! These are great rules for statisticians like me! If there is a commandment that I would add to that list, it would be "Thou shalt check thy assumptions before using any model".

George: I'm not familiar with econometrics. How do statistics textbooks and econometrics textbooks differ in teaching regression?

Frances: Why is logistic regression not taught in econometrics? It's a pretty standard technique in basic statistics, and, aside from how the coefficients are estimated, it's relatively easy to interpret. In fact, I wished that my professors spent more time on it in my statistics education - I need to use logistic regression much more often than linear regression in my work in industry, so concepts like contingency tables, ROC curves, and concordance statistics are very useful and important.

Eric - in a lot of places logit is part of the core (I did it when I took econometrics from Peter Kennedy), but at Carleton the emphasis is more on time series/macro econometric techniques, so students don't come across it in the core statistics/econometrics sequence. It's done in the advanced undergrad econometrics course, which isn't required for students.

Frances: I did not learn about logistic regression until my graduate applied statistics course. However, I speak as someone who took 5 undergraduate statistics classes before pursuing a Master's degree in statistics, so my undergraduate education experience is not typical. Nonetheless, I believe that it's often not taught until 3rd- or 4th-year classes for statistics majors, and that should change. Not only is it good for students in the long term, it's immediately beneficial in the short term for 2nd- or 3rd-year students who pursue co-op work terms. If an employer asks about the basics of logistic regression in a job interview, it would be to their advantage to know it.

I'm not as familiar with economics, but my experience in the "real world" makes me think that it is greatly undervalued in undergraduate statistics education - for any student who needs to use statistics in their field.

On the difference between a statistician and an econometrician, here's a caricature I used to use. An econometrician and a statistician both estimate a regression of the form y = XB + e. The econometrician throws away the e and spends his/her time interpreting the XB. A statistician throws away the XB and spends his/her time worrying about the e.

More seriously, a statistician puts in a lot more effort trying to understand the data before fitting any model to it. The models tend to be tentative, and statisticians tend to have a lot of diagnostic tools, not just to test whether the assumptions of the model are satisfied, but whether something else is going on.

The conclusions drawn by statisticians tend to be more cautious and they highlight uncertainty more. As an example, when was the last time you saw the Bank of Canada publish forecasts/projections with confidence intervals attached? When I was in grad school, a student who didn't give a pretty comprehensive indication of uncertainty in his results would be shot on sight.

A final observation: Econometricians I have worked with tend to be thrown off stride by "dirty" or non-conforming data, and sometimes pretend that they didn't see that. By contrast, a good applied statistician glories in such data.

George - "The econometrician throws away the e and spends his/her time interpreting the XB. A statistician throws away the XB and spends his/her time worrying about the e."

that's a bit harsh. Econometricians like James MacKinnon have for many years been attempting to drill into the heads of students the importance of thinking about the error term. With regards to B of C forecasts: it's extremely difficult to attach standard errors to the results of some kinds of modelling exercises. Now does that mean that those models should be thrown away? I don't think so. Though I suspect sometimes the reasons for not including error terms are political or psychological rather than econometric, just like the way that the weather network never forecasts a 50% probability of precipitation, because the customers don't like it.

Eric - thanks for the comments. I guess the question is what would one take out of the curriculum in order to make room for logit. Dave Giles has argued that far too much time is spent on multicollinearity, and I'm inclined to agree with him.

Frances -I did say that was a caricature. Some econometricians are also excellent statisticians. And there are many in the younger generation who do take specification error seriously.

Part of the difference is whether one considers regression analysis as an exploratory or a confirmatory tool. If one wants to simply test hypotheses that have been formulated on a priori grounds, that's one thing (although there's a lot more randomness in the world than models usually allow for, and some confirmations are just luck). But using regression as an exploratory tool is different. You are stripping away some features in the data that are fairly obvious, so that you can get on with the real job of "seeing" what else is in the data.

According to this latter interpretation, you use a logit or a probit rather than a linear regression, not because of some theoretical consideration, but merely because it gives a better "fit" to the data, and lets you strip away more.

Last year I did a regression analysis for my undergraduate thesis and made a complete hash of it. Dealing with missing data, patching together data from different sources, cleaning it up, were all things I had never really thought about before and in hindsight can see where I screwed up. The other big mistake was trying to include way too many variables for the sample size I had, mostly useless dummies. I think I had something like 150 variables and only a touch over 200 observations. I worked on it for about a year and didn't realise how much I mucked it up until nearly the end of the year when it was way too late to go back and start from scratch like I should have. Having done some kind of applied econometrics course beforehand would have been a huge help in that regard.