This forum supports discussion on topics of specific interest to the Model Analysis Special Interest Group of the System Dynamics Society (SIG-MA). It is currently unmoderated, and anybody who is signed in may post to the forum.

m1 is the regression coefficient for the effect of a one-point increase in (a) on (y), etc....

And I want to include greenspace and dog poop density in a model as separate policy variables that can be changed exogenously.
- might it be an okay interpretation to use a lookup table for greenspace with the equation y(a) = c + m1(a)
- and a second lookup table with the equation y(b) = c + m2(b)
- such that the inputs to the lookup would be greenspace (a) and dog poop density (b), respectively
- and the output from the lookup tables would be new 'indicated fraction happy people in neighborhood', used to feed a goal-gap structure with the other input being 'actual fraction happy people in neighborhood'....

1. Does this use of a regression coefficient make sense?

And, if so, the second question is (and goes back to a recent discussion on this forum)...
2. should we multiply or should we add y(a) and y(b) to create a combined 'indicated fraction happy people in neighborhood' to feed the goal-gap structure?
My guess is that in this case we should add them, as that's the relationship specified by the regression equation.
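A quick arithmetic sketch of the add-vs-multiply question, using made-up coefficients (c, m1, m2 and the inputs are hypothetical). One thing to watch if adding: summing the two lookup outputs y(a) = c + m1*a and y(b) = c + m2*b counts the intercept c twice, whereas the regression itself specifies y = c + m1*a + m2*b.

```python
# Hypothetical regression parameters and policy inputs (made up).
c, m1, m2 = 0.4, 0.002, -0.005
a, b = 50.0, 20.0  # green space index, dog poop density index

regression = c + m1 * a + m2 * b     # what the regression equation says
added = (c + m1 * a) + (c + m2 * b)  # naive sum of the two lookup outputs

print(regression)  # 0.4
print(added)       # 0.8 -- differs from the regression by the extra intercept c
```

So adding is indeed the relationship the regression specifies, but only after removing the duplicated intercept.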

I can post a sample model tomorrow if it helps clarify the question.

Thoughts? It doesn't need to be a perfect interpretation, we're just trying to put these spatial regression coefficients to good use by integrating them into an SD model....

What you can do with a cross-sectional (or in this case spatial) formulation inside an aggregate model depends strongly on the exact formulation being used. If the formulation is linear, then you can just move the spatial integration outside of the equation, and what is true for each neighborhood is true in aggregate for the entire community. That is, if

average happiness in sector = a + b * average green space per person in sector

then

average happiness in community = a + b * average green space per person in community

So that exact same equation applies.

Once you leave linearity this does not work. That is, if y = f(x) then the expected value of y is not equal to f(expected value of x) unless f is linear. It is, however, possible to make assumptions about the distribution of x and come to some meaningful approximations. For example, in your case, you might assume that the green space exogenous effect was proportional to all locations so that doubling green space would double it everywhere. In this case you could derive computationally the relationship between average green space and average happiness. If there are 2 effects you might do this as a two dimensional lookup surface.
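A small numerical sketch of the point above, with entirely made-up numbers: for a nonlinear per-neighbourhood relationship f, the community average of f(x) is not f(community average of x), but under the proportional-scaling assumption you can tabulate the aggregate relationship computationally.

```python
def f(x):
    """Hypothetical nonlinear happiness response to green space
    (diminishing returns; illustrative only)."""
    return x ** 0.5

green_space = [1.0, 4.0, 9.0, 16.0]  # made-up per-neighbourhood values

mean_x = sum(green_space) / len(green_space)                # 7.5
mean_f = sum(f(x) for x in green_space) / len(green_space)  # (1+2+3+4)/4 = 2.5

print(f(mean_x))  # ~2.74: f applied to the community average
print(mean_f)     # 2.5:   community average of f -- not the same

# Proportional-scaling assumption: doubling green space doubles it
# everywhere. Tabulating average happiness against average green space
# gives the aggregate lookup for the community-level model.
for scale in (0.5, 1.0, 1.5, 2.0):
    avg_x = scale * mean_x
    avg_y = sum(f(scale * x) for x in green_space) / len(green_space)
    print(round(avg_x, 2), round(avg_y, 3))
```

With two effects, the same tabulation over a grid of (scale_a, scale_b) pairs would yield the two-dimensional lookup surface mentioned above.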

A clarification, please. I'm not familiar with spatial integration, and I don't know the specifics of the two independent variables you use as examples, except from an empirical perspective. When using regression results as the basis for a lookup, the betas (coefficients of the independent variables) carry dimensions. In a linear regression relating income to education, such as wage = alpha + beta*education + u, beta carries an implied dimension of wages/education. Yet using dimensioned variables as input to a table goes against our considered good practice.

If the regression betas are going to be used in lookup tables, I expect it would be advantageous to have the data normalized against a standard so that the inputs to the table are dimensionless.

What do you think?

Of course, if the model or its users gain faith in the model's dynamic implications by referencing established literature that includes regression parameters, that's a plus to consider.

Thank you for the responses. I will try to clarify, as I now have a bit more clarity myself on what's happening.

First, the regression (not my work) is a geographically-weighted regression. It looks at 200+ neighbourhoods and finds some significant variables correlated(?) with a specific public health issue. I will ask my colleague if we can try and publish this at some time in the next few months so I don't need to be so vague...sorry...

Now, the regression did not use time-series data. I believe because of small sample sizes in the neighbourhoods, it aggregated 7 years of data into one equation:

y(x) is a fraction of the population with a disease. The variables contribute to the disease.

Well, in the end we have not used lookup tables. Instead, we simply use the regression equation. So, we start the model in equilibrium where the actual y(x) or fraction of the population with the disease is equal to the regression-indicated y(x). Then, we change a variable...and therefore there is a gap between the actual and the indicated y(x)....so, the stocks adjust in a first-order adjustment process, classic goal-gap behaviour.
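A minimal sketch of the goal-gap structure described above, with hypothetical numbers (the adjustment time, time step, and values are all assumptions, not from the actual model): the stock 'actual' adjusts toward the regression-indicated value in a first-order process, starting from equilibrium and responding when a policy change shifts the indicated value.

```python
def simulate(indicated, actual0, adjustment_time=5.0, dt=0.25, horizon=40.0):
    """First-order adjustment of 'actual' toward 'indicated' (Euler)."""
    actual = actual0
    trajectory = [actual]
    for _ in range(int(horizon / dt)):
        gap = indicated - actual
        actual += dt * gap / adjustment_time  # classic goal-gap flow
        trajectory.append(actual)
    return trajectory

# Policy change lowers the regression-indicated fraction with disease
# from the equilibrium value 0.20 to 0.10; the stock follows.
traj = simulate(indicated=0.10, actual0=0.20)
print(traj[0], traj[-1])  # 0.20 -> approaches 0.10
```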

I am not sure that the part of Bob's response related to aggregation is relevant to this particular problem, because we have simply subscripted the same structure by the 200+ neighbourhoods and there is no interaction between the neighbourhoods, although I'm sure his feedback will be useful in the future (and hopefully I am not missing something important by discounting it in this case?). Again, I will try and supply an example model soon.

Also, I am not sure I understand: if y = f(x), then the expected value of y is not equal to f(expected value of x) unless f is linear. I need to think this over, though I do believe it is relevant to the model we are working on.

In an attempt to respond to Eliot's comment...hmmm. By " using dimensioned variables as input to a table goes against our considered good practice" do you mean that using a dimensioned variable as an input to a lookup table sneaks around our good practice of unit consistency? Incredibly, I've never considered this!

Eliot, in the end, I believe that all of the variable1, variable2, etc. are indexed values, e.g. between 0 and 100 where 0 = no green space and 100 = maximum green space imaginable (although there's not necessarily a neighbourhood with a value of 100 for green space).

But not all of the independent variables are indexed...for example, one refers to the average level of education in the neighbourhood...so using the regression equation in the way I described is not really unit-consistent, is it?

The purpose of the SD model is somewhat secondary to the regression...the geographically-weighted regression is the core of the research, and we are trying to implement the regression coefficients as exogenous inputs into a simple SD model of disease progression because of the power of even a simple stock and flow model and the desire to add behaviour-over-time.

I hope this helps clarify. I welcome your comments although I probably need more time to digest them and really understand the implications of what you've offered. Again, I apologize for not posting a model. I will do so at my earliest convenience.

1. Using the regression equation directly is probably the most transparent and convenient thing to do. However, you might need to modify the relationships to handle extreme conditions. There could easily be combinations of inputs that don't occur in the original data that produce physically impossible outputs. A typical approach would be to take a linear relationship (which might yield impossibly-negative happiness at some point) and transform it with a logistic, so that it preserves the central slope, while constraining the extremes to a fixed range.
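A hedged sketch of the logistic transform suggested above, with made-up regression parameters c and m: the linear relationship can produce impossible fractions at extreme inputs, while the transformed version keeps the central slope and stays in (0, 1).

```python
import math

def linear(x, c=0.5, m=0.08):
    """Hypothetical linear regression relationship; can leave [0, 1]."""
    return c + m * x

def constrained(x, c=0.5, m=0.08):
    """Logistic transform of the linear relationship.

    sigmoid(4*(y - 0.5)) has value 0.5 and slope 1 at y = 0.5, so the
    central slope of the linear relationship is preserved while the
    output is squeezed into (0, 1).
    """
    y = linear(x, c, m)
    return 1.0 / (1.0 + math.exp(-4.0 * (y - 0.5)))

print(linear(10))       # 1.3 -- an impossible fraction
print(constrained(10))  # ~0.96, bounded below 1
print(constrained(0))   # 0.5, matches linear(0) at the center
```

The choice of 0.5 as the operating point is an assumption for the sketch; in practice you would center the logistic on the mean of the data so the transform is invisible in the fitted range and only bites at the extremes.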

2. Units aren't really a problem. In y = a*x, the regression coefficient (a) has units (y per x). However, it's often helpful to put things in a normalized form, like y/y0 = a'*(x/x0), where a' = a*x0/y0 is dimensionless.

3. Whether you can use the regression directly seems like it boils down to the same question as whether the regression is right. In both cases, a per-cell, aggregate data approach assumes that the time constant of poop->happiness relationships is short with respect to the measurement intervals and that cells don't interact. The latter means that spatial decay of effects is rapid with respect to cell size. If those are approximately true, then everything's OK. One should be able to detect whether this is true in the regression by looking at temporal and spatial correlations of residuals. Also, you could repeat the regression, including values from adjacent cells as explanatory vars for the cell under consideration.
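A rough sketch of the residual check suggested above: if cells interact, a cell's residual should correlate with its neighbours' residuals. All data here are made up (cells on a line, neighbours = adjacent cells); a real check would use the actual regression residuals and the true neighbourhood adjacency.

```python
def correlation(u, v):
    """Pearson correlation of two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    var_u = sum((a - mu) ** 2 for a in u)
    var_v = sum((b - mv) ** 2 for b in v)
    return cov / (var_u * var_v) ** 0.5

# Made-up residuals for cells laid out on a line.
residuals = [0.2, 0.1, -0.1, -0.3, -0.2, 0.0, 0.3, 0.1]
neighbour_mean = [
    (residuals[i - 1] + residuals[i + 1]) / 2
    for i in range(1, len(residuals) - 1)
]
r = correlation(residuals[1:-1], neighbour_mean)
print(r)  # ~0.88 for these numbers: residuals track their neighbours,
          # which would hint that cells interact
```

A correlation near zero would instead support the no-interaction assumption behind the per-cell regression.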

4. I think Bob's point pertains to the relationship between aggregate and disaggregate data in general, which might be important in a number of circumstances. This might crop up, for example, if your grid cells are large, but there are important sub-grid-scale processes going on. That might give rise to nonlinearities or tipping point dynamics that invalidate the assumptions of the regression.

5. An alternative to plugging the regression into the cell structure would be to build the spatial model, including extreme conditions constraints, nonlinearities, and feedback among cells, and calibrate it to the data directly. This would be better than the regression, because you wouldn't have to make so many restrictive assumptions, though it might also be computationally burdensome. Even if you couldn't get the spatial-dynamic model to run fast enough to calibrate, you could at least use it to generate synthetic data, and use the data to test the regression approach to see if the answers make sense.

With a linear model, using the regression equation directly should work fine, but!

The "but" relates to the fact that the underlying regression pooled data from different times, which means it assumes the relationship is static. Since the purpose of the SD component is to demonstrate the implications of having a dynamic component, using the regression directly seems inappropriate.

It is probably better in this case to use the conceptual formulation underlying the regression (this is typically different from the regression equation itself, since the regression involves accommodations to the available data), along with a parametrization that is in the ballpark of the regression coefficients. Then just show the implications of time-based behavior and see if you can infer from that which way the regression coefficients might be biased, and any implications this would have for policies based on the regression equation. The purpose here is to add a dimension to the static analysis that is effectively qualitative, even though it is a numerical simulation informing a statistical model.