Abstract

The omitted variables problem is one of regression analysis’ most serious problems. The standard approach to the omitted variables problem is to find instruments, or proxies, for the omitted variables, but this approach makes strong assumptions that are rarely met in practice. This paper introduces best projection reiterative truncated projected least squares (BP-RTPLS), the third generation of a technique that solves the omitted variables problem without using proxies or instruments. This paper presents a theoretical argument that BP-RTPLS produces unbiased reduced form estimates when there are omitted variables. This paper also provides simulation evidence that shows OLS produces between 250% and 2450% more error than BP-RTPLS when there are omitted variables and when measurement and round-off error is 1 percent or less. In an example, the government spending multiplier is estimated using annual data for the USA between 1929 and 2010.

1. Introduction

One of regression analysis’ most serious problems occurs when omitted variables affect the relationship between the dependent variable and included explanatory variables.1 If researchers estimate a regression without considering that the true slope is affected by other variables, then they obtain a slope estimate that is a constant,2 in contrast to the true slope, which varies with the omitted variables. In this case the regression coefficients are hopelessly biased and all statistics are inaccurate.
By substituting (1.2) into (1.1) to produce (1.3), we can see that an easy way to model this omitted variables problem is to use an interaction term, which is what we do for the remainder of this paper. However, it is important to realize that this modeling approach captures a much more general problem: one that occurs any time omitted variables affect the true slope.

The standard approach to dealing with the omitted variables problem is to use instrumental variables or proxies. However, to correctly use these approaches, the researcher must know how to correctly model the omitted variable’s influence on the dependent variable and the relationship between the instruments and the omitted variables. These requirements are often impossible to meet as many researchers do not even know what important variables they are omitting, much less how to correctly model their influence on the dependent variables via proxies.3 One implication of Kevin Clarke’s papers [1, 2] is that including additional proxies may increase or decrease the bias of the estimated coefficients. The approach taken in this paper avoids the problems discussed by Clarke by directly using the combined effects of all omitted variables instead of trying to replace individual omitted variables.

Specifically, this paper introduces the third generation of a technique that produces reduced form estimates of the slope. These estimates vary from observation to observation due to the influence of omitted variables, and they are obtained without using instruments and, thus, without having to make the strong assumptions required by instrumental variables. In essence, this technique recognizes that (for all observations associated with a given value for the known independent variable) the vertically highest observations will be associated with values for the omitted variables that increase the dependent variable the most and that the observations on the bottom will be associated with omitted variable values that increase the dependent variable the least.

Section 2 of this paper provides an intuitive explanation of this new technique, named “best projection reiterative truncated projected least squares” (BP-RTPLS), and provides a very brief survey of the literature concerning the predecessors to BP-RTPLS. Section 3 presents a theoretical argument that BP-RTPLS estimates will be unbiased. Section 4 presents simulation results that show that ordinary least squares (OLS) produces error that is between 250% and 2450% of the error of BP-RTPLS when there is 1 percent measurement/round-off error, when sample sizes of 100 or 500 observations are used, and when the omitted variable makes a 10 percent, 100 percent, or 1000 percent difference to the true slope. Section 5 provides an example, and Section 6 concludes.

2. An Intuitive Explanation of BP-RTPLS and Literature Survey

The key to understanding BP-RTPLS is Figure 1. To construct Figure 1, we generated two series of random numbers, one for the known independent variable and one for the omitted variable, q, each of which ranged from 0 to 100. We then defined the dependent variable so that its slope with respect to the independent variable depends linearly on q.
Thus the true slope equals 10 + 0.4q. Since q ranges from 0 to 100, the true slope will range from 10 (when q = 0) to 50 (when q = 100). Thus q makes a 500 percent difference to the slope. In Figure 1, we identified each point with that observation’s value for q. Notice that the upper edge of the data corresponds to relatively large values of q: 92, 98, 98, and 95. The lower edge of the data corresponds to relatively small values of q, such as 6. This makes sense since, as q increases, so does the dependent variable for any given value of the independent variable. For example, at one value of the independent variable, reading the values of q from top to bottom produces 91, 84, 76, 49, 33, and 10. Thus the relative vertical position of each observation is directly related to the values of q.4

Figure 1: The intuition behind 4D-RTPLS.

An alternative way to view Figure 1 is to realize that, since the true slope equals 10 + 0.4q, the slope will be at its greatest value along the upper edge of the data where q is largest and at its smallest value along the bottom edge of the data where q is smallest. This implies that the relative vertical position of each observation, for any given value of the independent variable, is directly related to the true slope.

Now imagine that we do not know what q is and that we have to omit it from our analysis. In this case, OLS produces an estimated equation with an R-squared of 0.6065 and a standard error of the slope of 2.452. On the surface, this OLS regression looks successful, but it is not. Remember that the true slope is 10 + 0.4q. Since q ranges from 0 to 100, the true slope (true derivative) ranges from 10 to 50, whereas OLS produced a constant slope of 30. OLS did the best it could given its assumption of a constant slope: it produced a slope estimate of approximately 10 + 0.4E[q] = 10 + 0.4(50) = 30. However, OLS is hopelessly biased by its assumption of a constant slope when, in truth, the slope is varying.
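The contrast between the constant OLS slope and the varying true slope can be reproduced with a short simulation. The sketch below is illustrative only. It assumes the data are generated as Y = 10X + 0.4qX with no intercept and no noise (a form inferred from the slope range of 10 to 50 described above), it uses 2000 observations and a fixed seed rather than the sample behind Figure 1, and the names `simulate` and `ols_fit` are hypothetical.

```python
import random


def ols_fit(xs, ys):
    """Return (slope, intercept) of a simple OLS regression of ys on xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def simulate(n=2000, seed=0):
    """Assumed data-generating process: Y = 10*X + 0.4*q*X, X and q uniform on [0, 100]."""
    rng = random.Random(seed)
    xs = [rng.uniform(0, 100) for _ in range(n)]
    qs = [rng.uniform(0, 100) for _ in range(n)]
    ys = [10 * x + 0.4 * q * x for x, q in zip(xs, qs)]
    return xs, qs, ys


xs, qs, ys = simulate()
slope, intercept = ols_fit(xs, ys)          # q is omitted from the regression
true_slopes = [10 + 0.4 * q for q in qs]    # varies by observation

print(f"OLS slope (constant): {slope:.2f}")
print(f"true slope range: {min(true_slopes):.2f} to {max(true_slopes):.2f}")
```

Under these assumptions the single OLS slope should sit near the middle of the true slope range, while individual observations have true slopes anywhere between 10 and 50.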

Although OLS is hopelessly biased when there are omitted variables that interact with the included variables, Figure 1 provides us with a very important insight—even when we do not know what the omitted variables are, even when we have no clue how to model the omitted variables or measure them, and even when there are no proxies for the omitted variables, Figure 1 shows us that the relative vertical position of each observation contains information about the combined influence of all omitted variables on the true slope. BP-RTPLS exploits this insight. We will first explain 4D-RTPLS (Four Directional RTPLS), UD-RTPLS (Up Down RTPLS), and LR-RTPLS (Left Right RTPLS). BP-RTPLS is the best estimate produced by 4D-RTPLS, UD-RTPLS, and LR-RTPLS.

4D-RTPLS begins with a procedure similar to two stage least squares (2SLS). 2SLS is used to eliminate simultaneous equation bias. In the first stage of 2SLS, all right hand side endogenous variables are regressed on all exogenous variables. The data are plugged into the resulting equations to create instruments for the right hand side endogenous variables. These instruments are then used in the second stage regression. The first stage procedure cuts off and discards all the variation in the right hand side endogenous variables that is not correlated with the exogenous variables.

In a similar fashion, 4D-RTPLS draws a frontier around the top data points in Figure 1. It then projects all the data vertically up to this frontier. By projecting the data to the frontier, all the data would correspond to the largest values for q. However, there is a possibility that some of the observations will be projected to an upper right hand side horizontal section of the frontier. For example, the 80 which is closest to the upper right hand corner of Figure 1 would be projected to a horizontal section of the frontier. This horizontal section does not show the true relationship between the independent and dependent variables, and it needs to be eliminated (truncated) before a second stage regression is run through the projected data. This second stage regression (OLS) finds a truncated projected least squares (TPLS) slope estimate for when q is at its most favorable level, and this TPLS slope estimate is then appended to the data for the observations that determined the frontier.

The observations that determined the frontier are then eliminated and the procedure repeated. We can visualize this removal as “peeling away” the upper frontier of the data points. As the process is iterated, we peel away the data in successive layers, working downward through the set of data points. The first iteration finds a TPLS slope estimate when the omitted variables cause the dependent variable to be at its highest level, ceteris paribus. The second iteration finds a TPLS slope estimate when the omitted variables cause the dependent variable to be at its second highest level, and so forth. This process is stopped when an additional regression would use fewer than ten observations (the remaining observations will be located at the bottom of the data). It is important to realize that the omitted variable, q, in this process will represent the combined influence of all forces that are omitted from the analysis. For example, if there are 1000 forces that are omitted, where 600 of them are positively related to the dependent variable and 400 are negatively related to it, then the first iteration will capture the effect of the 600 variables being at their largest possible levels and the 400 being at their lowest possible levels.
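The downward peel described above can be sketched in a few functions. This is a simplified illustration under stated assumptions, not the authors' implementation: the frontier is built as the concave upper envelope of the data (a geometric stand-in for the DEA problem given later in this section), an observation counts as having "determined" the frontier when it lies on it within a small tolerance, and all function names are hypothetical. The toy data assume Y = 10X + 0.4qX with q restricted to 10, 20, ..., 90 and with both the smallest and largest X present for every q.

```python
import random


def ols_slope(xs, ys):
    """Slope of an OLS fit of ys on xs (with an intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx


def cross(o, a, b):
    """Positive when the path o -> a -> b turns counter-clockwise."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def upper_frontier(points):
    """Vertices of the concave upper frontier, from the leftmost to the highest point."""
    best = {}
    for x, y in points:                       # keep only the highest y at each x
        best[x] = max(y, best.get(x, y))
    hull = []
    for p in sorted(best.items()):
        while len(hull) >= 2 and cross(hull[-2], hull[-1], p) >= 0:
            hull.pop()
        hull.append(p)
    top = max(range(len(hull)), key=lambda i: hull[i][1])
    return hull[: top + 1]                    # right of the top the frontier is horizontal


def frontier_height(hull, x):
    """Height of the frontier above x (flat to the right of the highest vertex)."""
    if x >= hull[-1][0]:
        return hull[-1][1]
    for (x0, y0), (x1, y1) in zip(hull, hull[1:]):
        if x0 <= x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    raise ValueError("x lies left of the frontier")


def peel_down(points, min_obs=10, tol=1e-9):
    """Return one (TPLS slope, frontier observations) pair per peeled layer."""
    remaining, results = list(points), []
    while len(remaining) >= min_obs:
        hull = upper_frontier(remaining)
        x_top = hull[-1][0]
        # Project vertically; points under the horizontal section are truncated.
        projected = [(x, frontier_height(hull, x)) for x, _ in remaining if x <= x_top]
        if len(hull) < 2 or len(projected) < min_obs:
            break
        slope = ols_slope([x for x, _ in projected], [y for _, y in projected])
        on_frontier = {(x, y) for x, y in remaining
                       if y >= frontier_height(hull, x) - tol * max(1.0, abs(y))}
        results.append((slope, sorted(on_frontier)))
        remaining = [p for p in remaining if p not in on_frontier]
    return results


rng = random.Random(1)
data = []
for q in range(10, 100, 10):
    xs = [1.0, 98.0] + [rng.uniform(1, 98) for _ in range(20)]
    data += [(x, 10 * x + 0.4 * q * x) for x in xs]
slopes = [s for s, _ in peel_down(data)]
print([round(s, 6) for s in slopes])
```

In this idealized dataset each layer is a straight line, so each iteration's TPLS slope should coincide with 10 + 0.4q for that layer's q. With real data the layers mix different q values, which is what the final regression described below is meant to address.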

Just as the entire dataset can be peeled down from the top, the entire dataset also can be peeled up from the bottom. Peeling up from the bottom would involve projecting the original data downward to the lower boundary of the data, truncating off any lower left hand side horizontal region, running an OLS regression through the truncated projected data to find a TPLS estimate for the observations that determined the lower boundary of the data, eliminating those observations that determined the lower boundary, and then reiterating this process until there are fewer than 10 observations left at the top of the data. By peeling the data from both the top to the bottom and from the bottom to the top, the observations at both the top and the bottom of the data will have an influence on the results. Of course, some of the observations in the middle of the data will have two TPLS estimated slopes associated with them—one from peeling the data downward and the other from peeling the data upward.

Above, we discussed projecting the data upward and downward; however, an alternative procedure would project the data to the left and to the right. 4D-RTPLS projects the data 4 different ways, upwards when peeling the data from the top, downward when peeling the data from the bottom, leftward when peeling the data from the left, and rightward when peeling the data from the right. When peeling the data from the right or left, any vertical sections of the frontier are truncated off for the same reasons that horizontal regions were truncated off when peeling the data downward and upward.

Once the entire dataset has been peeled from the top, bottom, left, and right, all the resulting TPLS estimates (with their associated data) are put into a final dataset. These TPLS estimates are then made the dependent variable in a final regression in which the reciprocal of the independent variable and the ratio of the dependent variable to the independent variable are the explanatory variables. The data are plugged back into this final regression to produce a separate 4D-RTPLS estimate for each observation. To understand the role of the final regression, consider Figure 1 again. If all the observations on the upper frontier had been associated with exactly the same omitted variable values (perhaps 98), then the resulting TPLS estimate would perfectly fit all of the observations it was associated with. However, Figure 1 shows that the observations on the upper frontier were associated with omitted variable values of 92, 98, 98, and 95. The resulting TPLS slope estimate would perfectly fit a q value of approximately5 96 (the mean of 92, 98, 98, and 95). When a TPLS estimate for a q of 96 is associated with qs of 92, 98, 98, and 95, some random variation (both positive and negative) remains. By combining the results from all iterations when peeling down, up, right, and left and then conducting this final regression, this random variation is eliminated.

Realize that the dependent variable is codetermined by the known independent variable and q. Thus the combination of the dependent and independent variables should contain information about q. This final regression exploits this insight in order to better capture the influence of q. The exact form of this final regression is justified by the following derivation.

In (2.2), the part usually omitted could be of many different functional forms (“” and “” could be any real number, positive, or negative):
If , then the right hand side of (2.3) perfectly matches the left hand side of (2.5) implying that just and should be in (2.6). However, if , including either or and might produce better estimates.6
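A worked version of the simplest case may help. The symbols here are assumptions introduced for illustration: Y is the dependent variable, X the known independent variable, q the omitted variable, and the alphas are constant coefficients. Suppose the omitted variable enters only through a linear interaction with X. Then the true slope can be written entirely in terms of Y/X and 1/X, which is why those two quantities are natural explanatory variables for a regression that models the slope:

```latex
\begin{align}
Y &= \alpha_0 + \alpha_1 X + \alpha_2 X q, \\
\frac{\partial Y}{\partial X} &= \alpha_1 + \alpha_2 q, \\
\frac{Y}{X} &= \frac{\alpha_0}{X} + \alpha_1 + \alpha_2 q
\quad\Longrightarrow\quad
\frac{\partial Y}{\partial X} = \frac{Y}{X} - \alpha_0 \,\frac{1}{X}.
\end{align}
```

When X enters the interaction term with a power other than one, the slope and Y/X no longer differ only by a multiple of 1/X, which is consistent with the remark that additional regressors might then improve the estimates.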

The mathematical equations used to calculate the frontier for each iteration of 4D-RTPLS are as follows: denote the dependent variable of observation “” by , , and the known independent variable of that observation by , . Consider the following variable returns to scale, output-oriented DEA problem, which is used when peeling the data downward:
The ratio of the maximally expanded dependent variable to the actual dependent variable (Φ) provides a measure of the influence of unfavorable omitted variables on each observation. This problem is solved once for each observation in the sample. For the observation under evaluation, the problem seeks the maximum expansion of the dependent variable consistent with best practice observed in the sample, that is, subject to the constraints in the problem. In order to project each observation upward to the frontier, its dependent variable value is multiplied by Φ (for (2.7), Φ will be greater than or equal to 1). Peeling the data from the right is accomplished by using (2.7) after switching the positions of the dependent and independent variables (in other words, every symbol in (2.7) that denotes the independent variable would refer to the dependent variable, and vice versa, when peeling from the right side).
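For a single independent and a single dependent variable, Φ can be computed without a linear programming solver. The sketch below assumes the textbook form of the variable returns to scale, output-oriented DEA problem: maximize Φ subject to Σλ_j x_j ≤ x_o, Σλ_j y_j ≥ Φ y_o, Σλ_j = 1, and λ_j ≥ 0. That form is an assumption here and may differ in notation from (2.7). Because the problem has only two constraints besides non-negativity, an optimum exists that mixes at most two observations, so it is enough to check every single observation at or to the left of x_o and every pair that straddles x_o. The function name `phi_output_vrs` is hypothetical, and the dependent variable is assumed to be strictly positive.

```python
def phi_output_vrs(xs, ys, o):
    """Output-oriented VRS expansion factor for observation o (one input, one output).

    Assumes ys[o] > 0. Checks single observations with x <= x_o and
    convex combinations of two observations whose x-values straddle x_o.
    """
    xo, yo = xs[o], ys[o]
    pts = list(zip(xs, ys))
    left = [(x, y) for x, y in pts if x <= xo]
    right = [(x, y) for x, y in pts if x > xo]
    best = max(y for _, y in left)            # observation o itself is in `left`
    for xl, yl in left:
        for xr, yr in right:
            w = (xo - xl) / (xr - xl)         # weight that lands exactly on x_o
            best = max(best, yl + w * (yr - yl))
    return best / yo


xs = [1.0, 3.0, 2.0]
ys = [2.0, 6.0, 2.0]
phis = [phi_output_vrs(xs, ys, i) for i in range(len(xs))]
print(phis)  # [1.0, 1.0, 2.0]
```

The first two observations define the frontier, so their factor is 1. The third lies halfway below the segment joining them, where the frontier height is 4, so its dependent variable would be doubled to reach the frontier.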

The variable returns to scale, input-oriented DEA problem used when peeling the data from the left is
To project the data to the frontier when peeling from the left, the independent variable value for each observation should be multiplied by the efficiency ratio found from (2.8), which will be less than or equal to 1. Observations on the frontier will have a ratio of exactly 1 for both (2.7) and (2.8). Finally, to peel the data upward from the bottom, (2.8) will be used after switching the positions of the dependent and independent variables.

4D-RTPLS projects the data up, down, left, and right. However, if a plot of the data shows a tall and thin column, then it might be best to just project up and down. For example, if q has a relatively large effect on the true slope, then the data will appear as a tall column with more efficient observations at the top of this column than at the sides. By projecting the data up and down, the data will be projected to where the efficient points are more concentrated. The more concentrated the efficient points are, the more likely they are to have similar q values and thus the resulting TPLS estimates will be more accurate. In this case, UD-RTPLS (Up Down RTPLS, which only projects up and down) will produce better estimates than 4D-RTPLS, ceteris paribus.

For similar reasons, when q has a relatively small effect on the true slope, the data will appear flat and fat, the efficient points will tend to be concentrated on the sides of the data, and LR-RTPLS (Left Right RTPLS) is likely to produce better estimates than 4D-RTPLS. Any round-off and measurement error that adds vertically to the value of the dependent variable would decrease the accuracy of UD-RTPLS more than it would decrease the accuracy of LR-RTPLS (because LR-RTPLS does not project in the direction in which the error was added). BP-RTPLS (best projection RTPLS) merely picks the direction of projection (UD, LR, or 4D) that produces the best estimates.

BP-RTPLS generates reduced form estimates that include all the ways that the independent and dependent variables are correlated. Thus, even when many variables interact via a system of equations, a researcher using BP-RTPLS does not have to discover and justify that system of equations. In contrast, traditional regression analysis theoretically must include all relevant variables in the estimation, and the resulting slope estimate captures the effect of the included variable alone, holding all other variables constant. BP-RTPLS’s reduced form estimates are not substitutes for traditional regression analysis’ partial derivative estimates. Instead BP-RTPLS and traditional regression estimates are complements which capture different types of information. BP-RTPLS has the disadvantage of not being able to tell the researcher the mechanism by which the independent variable affects the dependent variable. On the other hand, BP-RTPLS has the advantage of not having to model and find data for all the forces that can affect the dependent variable in order to estimate the slope. Both BP-RTPLS and traditional regression techniques find “correlations.” It is impossible for either one of them to prove “causation.”

A brief survey of the literature leading up to BP-RTPLS is now provided.7 Branson and Lovell [3] introduce the idea that by drawing a line around the top of a dataset and projecting the data to this line, one can eliminate variations in the dependent variable that are due to variations in omitted variables. Branson and Lovell projected the data to the left; they did not truncate off any vertical section of the frontier, nor did they use a reiterative process. Leightner [4] projected the data upward, discovered that truncating off any horizontal section of the frontier improved the results, and instituted a reiterative process. He named the resulting procedure “Reiterative Truncated Projected Least Squares” (RTPLS).

Leightner and Inoue [5] ran simulation tests which show that RTPLS produces (on average) less than half the error of OLS when there are omitted variables that interact with the included variables under a wide range of conditions. Leightner and Inoue [5] also explain how situations where the dependent variable is negatively related to the independent variable can be handled, how omitted variables that can change the sign of the slope can be handled, and how the influence of additional right hand variables can be eliminated before conducting RTPLS. Leightner [6] introduces bidirectional reiterative truncated least squares (BD-RTPLS), which peels the data from both the top and the bottom. Leightner [7] shows how the central limit theorem can be used to generate confidence intervals for groups of BD-RTPLS estimates. Published studies that used either RTPLS or BD-RTPLS in applications include Leightner [4, 6–12] and Leightner and Inoue [5, 13–15].

3. A Theoretical Argument That BP-RTPLS Is Unbiased

We will begin this section by explaining the conditions under which BP-RTPLS produces estimates that perfectly equal the true value of the slope. We will then argue that relaxing those conditions does not introduce bias into BP-RTPLS estimates. Therefore we will conclude that BP-RTPLS produces unbiased estimates. Figure 2 will be used to illustrate our argument.

Figure 2: When TPLS works perfectly.

If there is no measurement and round-off error and if the smallest and largest values of the known independent variable are associated with every possible value for the omitted variable, q, then UD-RTPLS, LR-RTPLS, 4D-RTPLS, and BP-RTPLS will all produce the same estimates, which perfectly match the true slope. Figure 2 was generated by making q a member of a fixed set of values, associating the smallest value of the independent variable, which had the value of 1, with each of those q values, and then associating the largest value of the independent variable, which had the value of 98, with each of those q values. The remaining observations were created by randomly generating values of the independent variable between 1 and 98 and randomly associating one of the q values with each observation.

In Figure 2, the first iteration when peeling the data downward would produce the true slope for all of the observations that determined the frontier in that iteration. For both Figures 1 and 2, the true slope is 10 + 0.4q; thus the true slope is 10 + 0.4(90) = 46 for the first iteration. The second iteration will also find the true slope for the observations on its frontier, a slope of 10 + 0.4(80) = 42. This will be true for all iterations. Furthermore, the exact same perfect slope will be found when the data are projected to the left when peeling from the left. Moreover, when peeling the data upwards and from the right, all iterations will continue to produce a perfect slope. The reason that each iteration works perfectly is that the two ends of each frontier contain identical omitted variable values which correspond to the largest (when peeling down or from the left) or smallest (when peeling up or from the right) omitted variable values remaining in the dataset; thus a frontier between the smallest and largest values of the independent variable will be a straight line with a slope that perfectly matches the true slope of every observation on the frontier. In this case, there is no need to run the final regression of BP-RTPLS because each TPLS estimate is perfect. However, if that final regression is run anyway, it will produce an R-squared of 1.0, and plugging the data back into the resulting equation will regenerate the TPLS estimate from each iteration.

Now that we have established under what conditions BP-RTPLS produces estimates that perfectly match the true slope, we will discuss what happens when those conditions are not met. Changes in these conditions can be grouped into three categories: (1) changes for which the TPLS estimates continue to perfectly match the true slope, (2) changes that will produce TPLS estimates that are greater than the true slope for observations with relatively small q and that are less than the true slope for observations with relatively large q, and (3) changes for which all the TPLS estimates of a given iteration are greater than (or less than) the true slope. We will provide reasons why each of these types of changes will not introduce systematic bias into the final BP-RTPLS estimates.

Omitting an observation from the middle of the frontier will not affect the TPLS slope estimates (to see this, eliminate any, or all, of the middle of the frontier observations that correspond to a q of 90 in Figure 2). Likewise, if the observation corresponding to the upper right hand 90 in Figure 2 is eliminated, then the first iteration when peeling the data downward would continue to generate the true slope because eliminating that observation would just create a small horizontal region in the first iteration which would be truncated off.

However, if the three observations with a q of 90 in the upper right part of Figure 2 were all eliminated, then the observation identified by an 80 in the upper right would define the upper right side of the first frontier. In this case, the resulting TPLS estimate of the slope for the first iteration would be slightly too small for the observations identified with 90s and too big for the uppermost observation identified by an 80.8 The same phenomenon happens when we are peeling upward (or from the right), if the observation identified by a q of 10 on the right hand side was eliminated. In this case the observation identified by a 20 on the far right side would define the right side of the first frontier; as a consequence, the first iteration when peeling upward (or from the right) would generate a slope that was slightly too large for the observations with a q of 10 but too small for the observation with a q of 20. In both of these cases, the TPLS estimated slopes of the observations with relatively small q are too large and the TPLS estimated slopes of the observations with relatively large q are too small. It is important to note that, since the TPLS slope estimate for this iteration is found using OLS, the relative weight of the slopes overestimated in this iteration should approximately equal the relative weight of the slopes underestimated. The relative weight of the overestimation would cancel out with the relative weight of the underestimation when the final regression of the BP-RTPLS process forces the results to go through the origin, thus eliminating any possible bias from this phenomenon.9

The third type of changes in Figure 2 would cause all of the TPLS estimates for a given iteration to be larger than (or smaller than) the true slope. For example, when the dataset is peeled downward (or from the left) if all the observations corresponding to were eliminated, then the lower left hand observation identified by a 10 would define the lower left edge of the first frontier. In this case TPLS would generate a slope estimate that was slightly too large for the observations identified by 90 s and much too large for the one observation identified by the 10. Likewise when peeling the data upwards (or from the right) if all of the observations identified with an were eliminated and the next two observations identified by a 10 (in the lower left part of Figure 2) were eliminated, then the observation identified by a 30 in the lower left side of Figure 2 would define the left hand edge of the first frontier. In this case the TPLS slope estimate would be slightly too small for all the observations identified by a 10 on the frontier and much too small for the observation identified by the 30. The incidence and weight of TPLS estimates that are greater than the true slope should be approximately equal to the incidence and weight of TPLS estimates that are less than the true slope when the final BP-RTPLS estimate is made. Thus these inaccuracies in the TPLS estimates should also be eliminated when the final BP-RTPLS estimate is made.

None of the three categories of changes discussed above would add a systematic bias to BP-RTPLS estimates. Additional types of changes are possible, like eliminating observations on both ends of the frontier for a given iteration; however, these types of changes would cause effects that are some combination of the effects discussed above. Finally there is no reason why “random” error would add systematic bias either.

4. Simulation Results

Our first set of simulations are based on computer generated values of and which are uniform random numbers ~, where 0 is the lower bound of the distribution and 10 is the upper bound. Measurement and round-off error, , is generated as a normal random number whose standard deviation is adjusted to be 0%, 1%, or 10% of variable ’s standard deviation. We consider 18 cases—all the combinations where (1) the omitted variable makes a 10%, 100%, or a 1000% difference in , (2) where measurement and round-off error is 0%, 1%, or 10% of , and (3) either 100 observations or 500 observations are used. Equations (4.1), (4.2), and (4.3) are used to model when the omitted variable makes a 10%, 100%, and 1000% difference in , respectively.

Consider
The true slope for (4.2) would be 1 + 0.1q; since q ranges from 0 to 10, the true slope will range from 1 (when q = 0) to 2 (when q = 10). Thus, for (4.2), the omitted variable, q, makes a 100% difference to the true slope. For similar reasons q makes a 10% difference to the true slope in (4.1) and approximately a 1000% difference in (4.3). Total error for the ith observation would equal the error from the omitted variable plus the added measurement and round-off error.

Tables 1 and 2 present the mean of the absolute value of the error and the standard deviation of the error for 18 sets of 5000 simulations each, where the errors from OLS and from RTPLS are defined by (4.4) and (4.5), respectively. In these equations, “OLS” refers to the OLS estimate of the slope when q is omitted and “True” refers to the true slope as calculated by plugging each observation’s data into the derivatives of (4.1)–(4.3) above. “RTPLS” is the RTPLS estimate of the slope, where “BD,” “UD,” “LR,” or “4D” could be substituted for the prefix.
The mean absolute value of the percent OLS error (Table 1, row 1) was calculated from (4.6), where “” is the number of observations in a simulation and “” is the number of simulations:
Equation (4.7) was used to calculate the standard deviation of OLS error (Table 2, row 1), where = the mean of .
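The OLS error measure can be illustrated with a small simulation. Everything specific in this sketch is an assumption made for illustration: the percent error is taken to be (estimate − true)/true, the data are generated as Y = X + bqX with X and q uniform on [0, 10], there is no intercept and no measurement error, and the interaction coefficients b = 0.01, 0.1, and 1.0 are inferred from the stated 10%, 100%, and 1000% slope differences (only the 0.1 value appears explicitly in the text). The function names are hypothetical.

```python
import random


def ols_slope(xs, ys):
    """Slope of an OLS fit of ys on xs (with an intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx


def mean_abs_pct_error(estimates, truths):
    """Mean of |estimate - true| / |true| across observations."""
    return sum(abs(e - t) / abs(t) for e, t in zip(estimates, truths)) / len(truths)


def ols_error(b, n=5000, seed=0):
    """Mean absolute percent error of the constant OLS slope when q is omitted."""
    rng = random.Random(seed)
    xs = [rng.uniform(0, 10) for _ in range(n)]
    qs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [x + b * q * x for x, q in zip(xs, qs)]   # assumed form, no noise
    slope = ols_slope(xs, ys)                      # q omitted from the regression
    true_slopes = [1 + b * q for q in qs]
    return mean_abs_pct_error([slope] * n, true_slopes)


errors = {b: ols_error(b) for b in (0.01, 0.1, 1.0)}
for b, err in errors.items():
    print(f"b = {b}: mean absolute OLS error = {100 * err:.1f}%")
```

Under these assumptions the error grows sharply with the importance of the omitted variable, and the smallest and largest cases should land near the OLS figures the text quotes from Table 1 (roughly 2.4% and 71%).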

Table 1: The mean of the absolute value of the error.

Table 2: The standard deviation of the error.

Consider
The absolute value of the mean error (Table 1) and the standard deviation (Table 2) of RTPLS error (Row 2) were calculated with (4.5)–(4.7), respectively, where “” was substituted for “.”

The results when 100 observations are used in each simulation are shown in Panel A, and the results when 500 observations are used are shown in Panel B. Columns 1–3, 4–6, and 7–9 correspond to when the omitted variable makes a 10%, 100%, and 1000% difference in , respectively. No measurement and round-off error was added for columns 1, 4, and 7; 1% measurement and round-off error was added for columns 2, 5, and 8; and 10% measurement and round-off error was added for columns 3, 6, and 9. Row one of Tables 1 and 2 presents the OLS results when was omitted. Row 2a presents the results of using BD-RTPLS, the second generation of this technique.10 Rows 2b, 2c, and 2d present the results of using UD-RTPLS, LR-RTPLS, and 4D-RTPLS, respectively. When running the simulations for rows 2b, 2c, and 2d, three different sets of possible explanatory variables for the final regression were considered: , , and . The set of final regression explanatory variables that produced the largest OLS/RTPLS ratio for rows 2b, 2c, and 2d of a given column is what is reported in that column for Tables 1 and 2. This set of final regression explanatory variables was , , , and for column 3 and just and for all other columns. Row 2e and 3e for BP-RTPLS (Best Projection-RTPLS) just repeats the result in the three lines above it that corresponds to the largest OLS/RTPLS ratio.

When comparing the relative absolute value of the mean error (Table 1) and standard deviation (Table 2) of OLS error to RTPLS error by observation, the natural log of the ratio of OLS error to RTPLS error was substituted for the error term in (4.6) and (4.7), and then the antilog of the result was found (row 3 of Tables 1 and 2, respectively).11 The natural log of the ratio of OLS to RTPLS error had to be used in order to center this ratio symmetrically around the number 1. Consider a two observation example where the ratio is 5/1 for one observation and 1/5 for the other observation. In this example, the mean OLS/RTPLS ratio is 2.6, making OLS appear to have 2.6 times as much error as RTPLS when (in this example) OLS and RTPLS are performing the same on average. Taking the natural log solves this problem: ln(5) = 1.609 and ln(1/5) = −1.609, so their average is zero, and the antilog of zero is 1, correctly showing that OLS and RTPLS are performing equally well in this example.
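The two observation example can be checked directly. The sketch below compares the plain average of the ratios with the antilog of the average log ratio (a geometric mean); the function name `antilog_mean_log` is hypothetical.

```python
import math


def antilog_mean_log(ratios):
    """Antilog of the mean natural log of the ratios (their geometric mean)."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


ratios = [5 / 1, 1 / 5]                    # OLS/RTPLS error ratio for two observations
arithmetic = sum(ratios) / len(ratios)     # plain average of the ratios
log_centered = antilog_mean_log(ratios)    # average taken in logs, then exponentiated

print(f"arithmetic mean of ratios: {arithmetic:.1f}")
print(f"antilog of mean log ratio: {log_centered:.1f}")
```

The plain average suggests OLS is far worse, while the log-based average comes out at 1, which reflects that each method is five times better on one of the two observations.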

In our tables, we present the mean of the absolute value of the error for OLS and for RTPLS so that the reader can understand the size of the error involved. However, our primary focus is on the OLS/RTPLS ratio because this ratio gives the greatest possible emphasis on the accuracy of estimates for individual observations. It is important to realize that dividing the mean absolute value of the error for OLS by the mean absolute value of the error for RTPLS will not duplicate the OLS/RTPLS error ratio.

Table 1 shows that the mean of the absolute value of the error from OLS is 2.4% to 2.5% when q makes a 10% difference to the true slope (Panel A, line 1, columns 1–3), whereas when q makes a 1000% difference to the true slope, the mean error from OLS is 71.4% (Panel A, line 1, columns 7-8). In contrast, the mean of the absolute value of the error from BD-RTPLS is only 8.93% when q makes a 1000% difference and measurement and round-off error is 10% (Panel A, line 2b, column 9). Moving from 71.4% error to 8.9% error is a huge improvement.

Notice also that the mean of the absolute value of error for OLS does not noticeably change with the amount of measurement and round-off error added, but the mean of RTPLS error does increase as measurement and round-off error increases (Table 1, lines 1 and 2). Furthermore, as the sample size increases from 100 observations (Panel A) to 500 observations (Panel B), the mean of the absolute value of OLS error does not noticeably fall; however, sometimes the mean RTPLS error falls and sometimes it rises as the sample size increases from 100 to 500 observations. We have no convincing explanation for why the mean RTPLS error sometimes rises as the sample size increases.

OLS produces greater mean error than RTPLS except for when and for both sample sizes (lines 1 and 2, column 3) and when , , and when , when 500 observations are used (lines 1-2, columns 2 and 6, Panel B). When we focus on the OLS/RTPLS mean error ratio, RTPLS outperforms OLS for all cases (the OLS/RTPLS ratio is greater than 1) except for when only makes a 10% difference and . It makes sense that when and are the same size, then RTPLS is not able to use the relative vertical position of observations to capture the influence of (because this vertical position contains an equal amount of contamination).

When 100 observations and the best projection direction are used (line ), the OLS/RTPLS ratio shows (ignoring the case where both and ) that OLS produces between 2.58 and 18.92 times (258% to 1892%) as much error as RTPLS. When 500 observations and the best projection direction are used (ignoring the case where both and ), OLS produces between 1.67 and 39.79 times (167% to 3979%) as much error as RTPLS.

Table 1 (line 3) reveals a very interesting pattern. The optimal projection direction is left and right (LR-RTPLS) when makes a 10% difference and ; is left, right, up, and down (4D-RTPLS) when makes a 100% difference and or 1%; is again left and right when and ; and is always up and down (UD-RTPLS) when makes a 1000% difference. This pattern is the same for 100 observations and 500 observations and is the exact same pattern that is obtained by looking at the maximum OLS/RTPLS ratios for the standard deviation of the error (Table 2, line 3). Furthermore, this pattern reappears in Tables 3 and 4 (Panel B) when a single set of data is extensively analyzed. This is a persistent pattern.

Table 3: One set of data, .

Table 4: One set of data, additional simulations.

As discussed in Section 2 of this paper, an increase in the importance of should stretch the data upwards, leading to the efficient observations being more concentrated at the top of the frontier than they are along the sides of the frontier, which would cause a projection upward and downward (UD-RTPLS) to be more accurate than a projection left or right—concentrated efficient observations must have more similar values for than nonconcentrated efficient observations. The opposite happens when makes a relatively small percent change in the true slope. In this case the dataset is flatter, causing the efficient observations to be more concentrated on the left and right and less concentrated on the top and bottom. When this happens (columns 1–3 of Tables 1 and 2), then LR-RTPLS is more accurate than its alternatives. In between the extremes of LR-RTPLS and UD-RTPLS is 4D-RTPLS which projects in all four directions and explains columns 4 and 5 of Tables 1 and 2. The presence of measurement and round-off error makes it harder for RTPLS to correctly capture the influence of the omitted variables. Error also vertically shifts the frontier upwards. Thus, when gets larger, its influence is diminished by projecting left and right (LR-RTPLS). This explains line 3c of column 6 of Tables 1 and 2 as it compares to line 3d, columns 4 and 5.

Table 2 (comparing line 2 of Panels A and B) also shows that as the sample size increases from 100 observations to 500 observations, the standard deviation of RTPLS error fell when is 0% (columns 1, 4, and 7) and when makes a 1000% difference and (column 8). In all other cases, increasing the sample size caused the standard deviation of RTPLS error to increase. In contrast, changing the sample size (or changing the amount of measurement and round-off error) did not noticeably change the standard deviation of the error for OLS (Table 2, line 1). However, increasing the importance of does increase the standard deviation of the error for OLS. Furthermore, OLS has a smaller standard deviation of the error than RTPLS when and or 10% and when and for both sample sizes (Table 2, line 2, columns 2, 3, and 6). In all other cases, RTPLS has a smaller standard deviation of the error than OLS. When the ratio between OLS and RTPLS of the standard deviation of the error is found for each observation and then the mean is found (using the log procedure described above), OLS has a greater standard deviation of the error than RTPLS for all cases; the OLS/RTPLS ratio ranges from 1.07 to 1.55.

The patterns found in Tables 1 and 2 for the best projection direction are repeated in Panel B of Tables 3 and 4. Tables 3–5 use the same set of 100 values for , , and . Leightner and Inoue [5] generated the values for , , and as random numbers between 0 and 10 and imposed no distributional assumptions (they also list the and data in their Table 1 and the data in footnote 5 of Table 5). The dependent variable for Table 3 (both panels) was generated by plugging in the values for , , and into where the numerical value for the given in Table 3, column 2, is 1000 times and represents measurement and round-off error . Since both and are series of numbers that range from 0 to 10, multiplying by 0.4 makes equal to 40% of .12 The given in column 3 of Tables 3 and 4 is “ as a percent of ” and was calculated as the maximum value for divided by (the maximum value of minus the maximum value for ). for Table 4, Panels A and B, was calculated as . Thus for these two panels, is 20% of . Likewise the of Table 4, Panels C and D, were calculated as ; thus of .
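The data-generating process described above can be sketched as follows. This is only an illustration under stated assumptions: the interaction form of the true equation follows the modeling approach of Section 1, but the coefficient values `A0`, `A1`, and `A2`, the seed, the variable names (`q` for the omitted variable, `e` for the error series), and the percent-error formula are hypothetical and are not taken from Table 3 or from (4.6). The sketch also assumes that the 0.4 multiplier is applied to the error series.

```python
import random

random.seed(0)  # arbitrary seed, used only to make the sketch reproducible
N = 100

# Known variable x, omitted variable q, and error e: random numbers
# between 0 and 10, with no further distributional assumptions.
x = [random.uniform(0, 10) for _ in range(N)]
q = [random.uniform(0, 10) for _ in range(N)]
e = [random.uniform(0, 10) for _ in range(N)]

# Hypothetical coefficients (assumptions, not the paper's values).
A0, A1, A2 = 10.0, 1.0, 0.5
ERROR_SCALE = 0.4  # scales e to 40% of the range of x

# Dependent variable: the omitted variable q changes the slope on x.
y = [A0 + A1 * xi + A2 * qi * xi + ERROR_SCALE * ei
     for xi, qi, ei in zip(x, q, e)]

# True slope for each observation varies with q.
true_slope = [A1 + A2 * qi for qi in q]


def ols_slope(xs, ys):
    """Slope of a simple OLS regression of ys on xs (with a constant)."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    return sxy / sxx


# OLS yields one constant slope, so its error differs by observation.
b_ols = ols_slope(x, y)
abs_pct_error = [100 * abs(b_ols - s) / abs(s) for s in true_slope]
mean_abs_pct_error = sum(abs_pct_error) / N
print("OLS slope:", b_ols)
print("mean absolute percent error of OLS:", mean_abs_pct_error)
```

Raising `A2` in this sketch increases the importance of the omitted variable, which is the dimension along which successive rows of Tables 3 and 4 differ.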

Table 5: Other specifications.

Each successive row of a given panel in Tables 3 and 4 represents an increase in the importance of as shown in column 2. The mean error and the OLS/RTPLS ratios in Tables 3–5 were calculated in the same way as they were in Table 1, sans the taking of the mean value of 5000 simulations. Just as was done for Table 1, all the combinations of UD-RTPLS, LR-RTPLS, and 4D-RTPLS with three different sets of possible explanatory variables for the final regression were considered: , , and . For Table 3, Panel A, and for Table 4, Panels A and C, the best set of explanatory variables for the final regression was always , , , and (and only those results are presented). Likewise, for Table 3, Panel B, and for Table 4, Panels B and D, the best set of explanatory variables for the final regression was always and (and only those results are presented). These patterns mirror the patterns found in Table 1 where , , , and were the best explanatory variables in column 3 and and were the best explanatory variables in all other columns. Notice that Panels B and D are extensions of Panels A and C, respectively, with several rows of overlap presented (see the % given in column 2).

In Table 3 (where of ), LR-RTPLS, 4D-RTPLS, and UD-RTPLS produced the largest OLS/RTPLS ratio when affected the true slope by 300% to 380%, 390% to 440%, and more than 440%, respectively. This progression from LR-RTPLS to 4D-RTPLS to UD-RTPLS as increases in importance reflects the progression shown in Table 1. Furthermore, it is reflected in Table 4, Panel B. In Table 4 (where of ), LR-RTPLS, 4D-RTPLS, and UD-RTPLS produced the largest OLS/RTPLS ratio when affected the true slope by 120% to 140%, 150% to 170%, and more than 170%, respectively. Thus a smaller amount of (Table 4, Panel B) leads to narrower ranges for LR-RTPLS and 4D-RTPLS at much smaller values for the importance of than did the case with a larger amount of in Table 3, Panel B. In Table 4, Panels C and D, as a percent of falls even more (to 10%) and the results show no region (given our increasing the importance of by 10% for each row) where LR-RTPLS and 4D-RTPLS are best.

Finally, notice that the mean of the absolute value of OLS’s error always increases as the importance of increases (column 4 of Tables 3 and 4); in contrast the mean of the absolute value of BP-RTPLS’s error always falls (columns 5–7) when estimates using just and are optimal (Panels B and D). In all the cases shown in Tables 3 and 4, if is less than 5% of , then UD-RTPLS using and in the final regression is the BP-RTPLS method.

Table 5 replicates the results of Table 5 of Leightner and Inoue [5] for applying the first generation of this technique (RTPLS) to different types of equations and compares those results to BP-RTPLS. Column 1 gives the equation estimated. Column 2 gives the true equation into which the data from Tables 1 and 5 of Leightner and Inoue [5] were inserted. Table 5, column 3, presents the mean of the absolute value of the error for OLS (calculated using (4.6), sans the taking of the mean of 5000 simulations). Column 5 gives the mean of the absolute value of the error for BP-RTPLS, column 7 gives the OLS/BP-RTPLS ratios, and column 8 identifies the specific form BP-RTPLS took: UD, LR, and 4D correspond to UD-RTPLS, LR-RTPLS, and 4D-RTPLS, respectively; no + signs, one + sign, and two + signs after UD, LR, and 4D indicate , , and as the explanatory variables in the final regression, respectively. “1D” in column 8 denotes RTPLS.

The numbers not in parentheses in columns 4 and 6 duplicate the numbers given in Table 5 of Leightner and Inoue [5] for the first generation of this technique (RTPLS) for the mean of the absolute value of the error for RTPLS and for the OLS/RTPLS ratio. The numbers in parentheses in columns 4 and 6 show how RTPLS would have performed if a constant had not been included in the final regression.13 A comparison of the numbers not in parentheses to those in parentheses dramatically illustrates how important it is not to include a constant in the final regression: omitting the constant increased the OLS/RTPLS ratio for all but two of the cases (lines 1d and 3b) and the average OLS/RTPLS ratio increased 3.82-fold.14

If might be negative (Line 1, Table 5), then a preliminary OLS regression should be run between and . If this preliminary regression generates a positive (as it did for lines 1d, 1g, 1h, 1i, and 1j), then normal BP-RTPLS can be used (note: true was negative for 4, 43, 26, 20, and 16 percent of the observations in lines 1(d), 1(g), 1(h), 1(i), and 1(j), resp.). However, the preliminary regression found a negative for the cases given in lines 1(a), 1(b), 1(c), 1(e), and 1(f). In these cases, all were multiplied by negative one and then a constant (equal to 101, which was sufficiently big to make all positive) was added to all . The normal BP-RTPLS process was then conducted using the adjusted , but the resulting were remultiplied by minus one. Multiplying either or by negative one and then adding a constant to make them all positive is necessary because (2.7) and (2.8) only work for positive relationships.
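The sign adjustment described above can be illustrated with a small sketch. Two assumptions are made here and should be read as such: the transformed series is taken to be the dependent variable, and a simple OLS slope stands in for the BP-RTPLS procedure, which is not implemented in this example. The function names and the toy data are hypothetical.

```python
def flip_and_shift(values, shift=101.0):
    """Multiply by negative one, then add a constant so all values are positive."""
    adjusted = [shift - v for v in values]
    if min(adjusted) <= 0:
        raise ValueError("shift is too small to make all values positive")
    return adjusted


def ols_slope(xs, ys):
    """Slope of a simple OLS regression of ys on xs (with a constant)."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    return sxy / sxx


# Toy data with a negative relationship (slope of -3).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [50.0 - 3.0 * xi for xi in x]

# Step 1: the preliminary regression finds a negative slope.
preliminary = ols_slope(x, y)

# Step 2: flip and shift so the relationship becomes positive.
y_adj = flip_and_shift(y)

# Step 3: estimate on the adjusted data (OLS as a stand-in here).
slope_adj = ols_slope(x, y_adj)

# Step 4: multiply the result by minus one to undo the flip.
slope_recovered = -slope_adj
print(preliminary, slope_adj, slope_recovered)
```

Adding the constant changes only the level of the adjusted series, not its slope, which is why multiplying the final estimates by minus one is sufficient to undo the transformation.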

This entire paper deals with misspecification error in that the influence of omitted variables is ignored when using OLS for all of this paper’s cases. However, Table 5, line 2(a), extends misspecification error even to the relationship between and : should be squared (column 2), but it is not (column 1). In this case BP-RTPLS produced 24 percent mean error (column 5) and a third of the error of OLS (column 7). Line 3 shows the results of using RTPLS when omitted variables affect an exponent. Line 4 of Table 5 demonstrates that the relationship between the omitted variable and the known independent variable does not have to be modeled for BP-RTPLS to work well; BP-RTPLS noticeably outperforms OLS when the interaction term is (Line 4(a)), (Line 4(b)), and (Line 4(c)).

Line 5 of Table 5 shows how BP-RTPLS can be used when there is more than one known independent variable, where only one of them interacts with omitted variables. Leightner and Inoue [5] argue that OLS produces consistent estimates for the known independent variables that do not interact with omitted variables. Therefore, to apply BP-RTPLS to the equation in Line 5 of Table 5, an OLS estimate can be made of . can then be calculated as . Finally, RTPLS can be used normally to find the relationship between and (note: in Table 5, Line 5 the error from OLS is from estimating ). In all the cases shown in Table 5, BP-RTPLS noticeably outperforms OLS. Comparing column 7 to column 6 of Table 5 and line 3(a) to 3(b) of Table 1 clearly shows that BP-RTPLS produces a major improvement over the first two generations of this technique.
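The two-step idea for Line 5 can be sketched under hypothetical assumptions. The equation, coefficient values, sample size, and variable names below are illustrative and are not those of Table 5: `x1` interacts with an omitted variable `q`, while `x2` does not. The sketch shows only the first step (an OLS estimate of the coefficient on `x2`) and the construction of the adjusted dependent variable; the subsequent RTPLS step is not implemented here.

```python
import random

random.seed(1)  # arbitrary seed for reproducibility
N = 5000        # hypothetical sample size, large so consistency is visible

x1 = [random.uniform(0, 10) for _ in range(N)]
x2 = [random.uniform(0, 10) for _ in range(N)]
q = [random.uniform(0, 10) for _ in range(N)]  # omitted variable

# Hypothetical true equation: only x1 interacts with q.
B2_TRUE = 2.0
y = [10.0 + (1.0 + 0.1 * qi) * a + B2_TRUE * b
     for a, b, qi in zip(x1, x2, q)]


def ols_two_regressors(u, v, w):
    """OLS of w on u and v with a constant; returns the two slopes."""
    n = len(w)
    mu, mv, mw = sum(u) / n, sum(v) / n, sum(w) / n
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suw = sum((a - mu) * (c - mw) for a, c in zip(u, w))
    svw = sum((b - mv) * (c - mw) for b, c in zip(v, w))
    det = suu * svv - suv ** 2
    return (svv * suw - suv * svw) / det, (suu * svw - suv * suw) / det


# Step 1: OLS estimate of the coefficient on the non-interacting variable.
_, b2_hat = ols_two_regressors(x1, x2, y)

# Step 2: remove the estimated contribution of x2 from y.
y_adj = [yi - b2_hat * b for yi, b in zip(y, x2)]

# Step 3 (not shown): apply RTPLS to the relationship between y_adj and x1.
print("estimated coefficient on x2:", b2_hat)
```

Because `x2` is generated independently of `x1` and `q` in this sketch, the OLS coefficient on `x2` lands close to its true value even though the slope on `x1` is misestimated.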

5. Example

When the government buys goods and services (G), it causes gross domestic product (GDP) to increase by a multiple of the spending. The pathways linking G and GDP are numerous, interacting, and complex. For example, increased government spending will cause producer and consumer incomes to rise, cause interest rates to rise, and put upward or downward pressure on the exchange rate, affecting exports and imports, which in turn affect GDP. Many economists have spent their careers trying to model all the important interconnections in order to better advise the government. To complement the efforts of these economists, BP-RTPLS can be used to produce reduced form estimates of the government spending multiplier without having to model all the “omitted variables.”

Annual data for the USA between 1929 and 2010 were downloaded from the Bureau of Economic Analysis website (http://www.bea.gov/). The data were in billions of 2005 dollars and corrected for inflation using a chain-linked index method. The top line of Figure 3 shows the results of using LR-RTPLS, and the bottom line the results of using UD-RTPLS, to estimate the government spending multiplier. If 4D-RTPLS had been depicted, it would lie between the top and bottom lines. Although LR-RTPLS and UD-RTPLS produced different estimates, the two lines are close to each other and they are approximately parallel.

The UD-RTPLS (LR-RTPLS) estimate for 2010 of 6.01 (6.26) implies that a one dollar increase in real government spending would cause real GDP to increase by 6.01 (6.26) dollars. The big dip in the estimated multiplier coincides with WWII: the UD-RTPLS (LR-RTPLS) estimate for 1940 was 5.44 (5.67) and it fell to 1.65 (1.73) in 1945. It makes sense that the government purchasing bullets, tanks, and submarines (many of which were destroyed in WWII) would have a smaller multiplier effect than the government building roads and schools during nonwar times. The UD-RTPLS (LR-RTPLS) estimates climbed from 3.12 (3.26) in 1953 to 6.33 (6.60) in 2007. The crisis that started in the USA in 2008 caused the government multiplier to fall by five percent. An OLS estimate of the multiplier is 5.22 for all years.
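As an arithmetic check on the magnitudes quoted above, the percent changes between the reported estimates can be computed directly. The sketch below uses only the numbers given in the text; pairing 2007 with 2010 to measure the effect of the crisis is an assumption made for this illustration, as are the function and variable names.

```python
def percent_change(old, new):
    """Percent change from old to new."""
    return 100.0 * (new - old) / old


# Multiplier estimates quoted in the text, by method and year.
estimates = {
    "UD-RTPLS": {1940: 5.44, 1945: 1.65, 1953: 3.12, 2007: 6.33, 2010: 6.01},
    "LR-RTPLS": {1940: 5.67, 1945: 1.73, 1953: 3.26, 2007: 6.60, 2010: 6.26},
}

changes = {}
for method, series in estimates.items():
    changes[method] = {
        "1940-1945": percent_change(series[1940], series[1945]),
        "1953-2007": percent_change(series[1953], series[2007]),
        "2007-2010": percent_change(series[2007], series[2010]),
    }
    for period, value in changes[method].items():
        print(f"{method} {period}: {value:.1f}%")
```

Under this pairing of years, both methods show a decline of roughly five percent after 2007 and a much larger decline during WWII, which illustrates that the two projection directions agree on the pattern of changes even though their levels differ.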

6. Conclusion

This paper has developed and extensively tested a third generation of a technique that uses the relative vertical position of observations to account for the influence of omitted variables that interact with the included variables without having to make the strong assumptions of proxies or instruments. The contributions of this paper include the following.

First, Leightner and Inoue [5] showed that RTPLS has less bias than OLS when there are omitted variables that interact with the included variables. However, this paper shows that both RTPLS and BD-RTPLS (the first two generations of this technique) still contained some bias (see footnote 9) because they included a constant in the final regression. Section 3 of this paper shows that the third generation of this technique (BP-RTPLS) is not biased. Second, this paper shows that when RTPLS does not include a constant, it produces OLS/RTPLS ratios that are 586 percent higher on average than when it does include a constant in Table 1 (ignoring column 3) and 382 percent higher in Table 5. Deleting this constant constitutes a major improvement.

Third, this is the first paper to test how the direction of data projection and the variables included in the final regression affect the results. Very strong and persistent patterns were found, including (1) that , , , and should be used as the explanatory variables in the final regression when has an extremely small effect on the true slope and that only and should be used when has a normal or relatively larger effect on the true slope,15 and (2) that, as the importance of the omitted variable increases and as the size of measurement and round-off error decreases, there is usually a range where LR-RTPLS produces the best estimates, followed by a range where 4D-RTPLS is best, followed by UD-RTPLS being best. However, UD-RTPLS using just and in the final regression will be (by far) the best procedure for the widest range of possible values for the importance of , for the size of , and for the type of specification. We recommend that researchers wanting to use BP-RTPLS use UD-RTPLS but test the robustness of their results by comparing them to (at the very least) LR-RTPLS estimates and then focus their analysis on conclusions that can be drawn from both the UD-RTPLS and LR-RTPLS estimates.

Acknowledgments

The authors appreciate the comments and suggestions made on earlier generations of BP-RTPLS by Knox Lovell, Ron Smith, Lawrence Marsh, and Charles Horioka.

Endnotes

If the true relationship is and if is omitted from the analysis, but has no relationship with and thus does not affect the true slope, then acts as a source of additional random variation that (in large samples) does not change the numerical value of the estimated slope ; however, it will affect the estimated level of statistical significance. One indicator of the importance of “omitted variable bias” is that a Google Scholar search conducted in September 2011 generated 276,000 hits for that phrase. Those hits included [16–30]. These papers include applications to criminal arrest rates, school achievement, hospital costs, psychological distress, housing values, employment, health care expenditures, the cost of equity capital, effects of unemployment insurance, productivity, and financial aid for higher education.

The estimate for will be approximately , where is the expected, or mean, value for .

Instrumental variables must also be ignorable, or not add any explanatory value independent of their correlation with the omitted variable. Furthermore, they must be so highly correlated with the omitted variable that they capture the entire effect of the omitted variable on the dependent variable [1]. Other methods for addressing omitted variable bias (e.g., see [20, 22, 28, 30]) also require questionable assumptions that are not made by BP-RTPLS.

If, instead of adding in (1.1), we had subtracted , then the smallest would be on the top and the largest on the bottom of Figure 1. Either way, the vertical position of observations captures the influence of the omitted variable .

We say “approximately” because how the data are projected to the frontier will affect the resulting TPLS estimate. If the data are projected upwards, then the top of the frontier is weighted heavier. If the data are projected to the left, then the bottom of the frontier is weighted heavier. Notice that projecting to the upper frontier in Figure 1 eliminated approximately 92 percent of the variation due to omitted variables (the was changed from a range from 1 to 98 to a range from 92 to 98). The final regression eliminates any remaining variation due to omitted variables.

is more likely than to be correlated to ; thus we consider adding either or and , but not just . Notice that (2.6) should be estimated without using a constant.

All of the existing RTPLS and BD-RTPLS literature truncated off any horizontal and vertical regions of the frontier, truncated off 3% of the other side of the frontier, and used a constant in the final regression. BP-RTPLS does not truncate off 3% of the other side of the frontier nor does it add a constant to the final regression.

Think of a straight regression line that would pass through the observations on the frontier. The slope of that regression line would be flatter than the slope through just the 90s and steeper than the slope going through all the 80s in Figure 2. Also notice that in this case the second iteration will return to producing a perfect slope estimate for the remaining observations associated with a of 80, after truncating off a small horizontal region of the frontier.

If we plotted the true value of the slope versus the BP-RTPLS estimate of the slope, then BP-RTPLS works perfectly if its estimates lie on the 45 degree line. The effect discussed in this paragraph implies that if a constant were added to the final BP-RTPLS estimate (which would be incorrect), then the BP-RTPLS line would cross the 45 degree line in the middle of the data and the triangle formed by the 45 degree line and the BP-RTPLS line below this crossing would be identical to the triangle formed above this crossing. This is exactly what we find if we add a constant to the final BP-RTPLS estimate. However, when a constant is not included, the two triangles, being of equal size, offset each other and the BP-RTPLS estimates lie along the 45 degree line, indicating the absence of bias. This implies that adding a constant to the final regression, as was done in the first two generations of this technique, resulted in biased estimates; however, this is not a problem in the third generation.

This BD-RTPLS is not exactly the same as the second generation of this technique. It is like the second generation in that it peels the data both down and up and that it used a constant, and in the final regression. It is unlike the second generation because it did not truncate off the smallest 3% of the in each iteration when peeling down and the largest 3% of the when peeling up. However, by making the difference between BD-RTPLS in line 2a and UD-RTPLS in line 2b solely the presence of a constant in the final regression for BD-RTPLS, we dramatically illustrate why a constant should not be included in the final regression. To see this, compare line 3(a) and 3(b).

Leightner and Inoue [5] mistakenly substituted “” for in their counterpart for (4.6). This resulted in the absolute value being taken twice, which should not have been done. This affected their results the most when the size of measurement error was 10%.

Thus this is always positive. This always positive can be thought of as the combined effects of an omitted variable that shifts the relationship between and upwards (without changing its slope) with measurement and round-off error that would sometimes increase and sometimes decrease . This was the easiest way to construct error that can be calibrated to .

The old RTPLS not only used a constant in the final regression, it also truncated off the first 3% of the frontier which occurred on the side of the frontier opposite any potentially horizontal or vertical region and it did not make estimates for the observations that corresponded to the 3% of the observations with the smallest values for . The numbers given in parentheses in columns 4 and 6 of Table 5 do none of these things. However, the numbers in parentheses do use the best set of explanatory variables for the final regression: , , or as indicated in column 8.

For approximately half the cases, BP-RTPLS estimates (column 7) were less than the RTPLS estimates when a constant is not used (numbers in parentheses in column 6). This implies that the TPLS estimates from peeling the data downward were more accurate than the TPLS estimates from peeling the data upwards for this dataset.

Table 5 shows that this rule may not hold for other specifications. Much more work needs to be done to determine the optimal set of explanatory variables for the final regression under different specifications.