Do you have any case in which fitting a multivariate regression (i.e., a network with multiple output nodes) outperforms fitting one output at a time in terms of accuracy?

I ask because, as I see it, there is no sharing of information between output nodes, unlike in linear multivariate regression, where the residuals of different responses can be correlated:

$Y_{i1} = X_i \beta_1 + \epsilon_{i1}$

$Y_{i2} = X_i \beta_2 + \epsilon_{i2}$

$E[\epsilon_{i1}] = E[\epsilon_{i2}] = 0$,

$Cov(\epsilon_{i1}, \epsilon_{j2}) = 0$ for $i \neq j$,

$Cov(\epsilon_{i1}, \epsilon_{i2}) = \sigma_{12}$.

In this simple (bivariate) case, if the correlation between $\epsilon_{i1}$ and $\epsilon_{i2}$ is close to $1$, then I may expect that a very high value of $Y_1$ (relative to its expected value $X \beta_1$) will come with a correspondingly high value of $Y_2$ (relative to its expected value $X \beta_2$). This is something I don't find in a neural network context, since the nodes are not random variables but deterministic functions of the input.
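To make the correlated-residuals setup concrete, here is a small simulation sketch (NumPy, with made-up coefficients and a correlation of 0.95 chosen for illustration): even when each response is fit separately, the residuals remain strongly correlated across responses.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
X = rng.normal(size=(n, 1))
beta1, beta2 = 2.0, -1.5  # arbitrary true coefficients

# Residual pairs correlated across responses (rho = 0.95),
# but independent across observations i.
cov = np.array([[1.0, 0.95],
                [0.95, 1.0]])
eps = rng.multivariate_normal([0.0, 0.0], cov, size=n)

Y1 = X[:, 0] * beta1 + eps[:, 0]
Y2 = X[:, 0] * beta2 + eps[:, 1]

# Fit each response separately by least squares.
b1 = np.linalg.lstsq(X, Y1, rcond=None)[0]
b2 = np.linalg.lstsq(X, Y2, rcond=None)[0]
r1 = Y1 - X @ b1
r2 = Y2 - X @ b2

# The cross-response correlation survives in the residuals.
print(np.corrcoef(r1, r2)[0, 1])  # close to 0.95
```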

$\begingroup$What matters in multivariate regression is that the errors, not the variables, are correlated.$\endgroup$
– Richard Hardy, Mar 22 '17 at 16:42

$\begingroup$You're definitely right, I corrected the question. I meant the residuals, of course.$\endgroup$
– Tommaso Guerrini, Mar 22 '17 at 16:50

$\begingroup$Your edit still does not make sense to me...$\endgroup$
– Richard Hardy, Mar 22 '17 at 17:12

$\begingroup$$Y_{ij} = X_i \beta_j + \epsilon_i^{(j)}$, $i=1,\dots,n$, $j=1,\dots,p$, with $\epsilon_i^{(j)} \sim N(0, \sigma_{(j)}^2)$ but $Cov(\epsilon^{(j)}, \epsilon^{(k)}) = \sigma_{jk} I$ (so that residuals of different components of the response are correlated when considered at the same observation, but independent across observations, and the same holds for residuals of the same component at different observations).$\endgroup$
– Tommaso Guerrini, Mar 22 '17 at 17:41


$\begingroup$The part "there is a covariance matrix over residuals and responses are correlated" still does not make sense to me. There are two parts: (1) there is a covariance matrix (obvious); (2) responses are correlated (irrelevant). What would be relevant is that the errors from different individual equations are correlated.$\endgroup$
– Richard Hardy, Mar 23 '17 at 8:15

1 Answer

Here's a cartoon representation of the models. Model A uses a single network to predict both outputs. Model B uses a separate network for each output. Because the input to both networks is the same, we can re-frame model B as the equivalent model C. In this case, the input is passed to hidden/output layers that are simply concatenated copies of those in model B. The hidden-layer weights would have a block-diagonal structure, such that weights between units in the left and right halves are zero.

Presumably the output units are linear (because this is a regression problem) and the hidden units are nonlinear (otherwise why bother with a neural net). We can think of a network as mapping the input nonlinearly into a feature space. The images of the inputs in feature space are given by the activations of the last hidden layer. The output layer then performs linear regression in feature space. Training the network amounts to jointly learning the regression weights and feature space mapping.

Training model A would mean learning a single feature space mapping. Another way to say this is that we want to find a single representation of the input that's good for predicting both outputs. With model B, we'd find a separate representation for each output. By the equivalence to model C, this can also be seen as finding a single, higher dimensional representation.
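The B/C equivalence can be checked numerically. The sketch below (NumPy, with arbitrary random weights and a tanh hidden layer chosen as an assumption) builds model C by stacking the two model-B hidden layers and zeroing the cross-block output weights; the two formulations produce identical outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h = 4, 8
x = rng.normal(size=d_in)

# Model B: two separate one-hidden-layer nets, one per output.
W1a, w2a = rng.normal(size=(d_h, d_in)), rng.normal(size=d_h)
W1b, w2b = rng.normal(size=(d_h, d_in)), rng.normal(size=d_h)
yB = np.array([w2a @ np.tanh(W1a @ x),
               w2b @ np.tanh(W1b @ x)])

# Model C: one net whose hidden layer concatenates both halves,
# with a block-structured output layer (cross-block weights zero).
W1c = np.vstack([W1a, W1b])          # (2*d_h, d_in)
W2c = np.zeros((2, 2 * d_h))
W2c[0, :d_h] = w2a                   # left block  -> output 1
W2c[1, d_h:] = w2b                   # right block -> output 2
yC = W2c @ np.tanh(W1c @ x)

print(np.allclose(yB, yC))  # True
```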

Models B/C are much bigger networks than model A and have many more parameters. Consequently, they should be much more flexible. This could be good or bad depending on the situation. A bigger network can learn more complicated functions, given enough data. Given insufficient data, it can be more prone to overfit. Consider what would happen if we scaled this to 100 outputs instead of 2.
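As a rough illustration of the size difference, here is a hypothetical weight count (biases ignored, one hidden layer, dimensions chosen arbitrarily) for the shared model A versus the separate-networks model B as the number of outputs grows:

```python
def n_params(d_in, d_h, k_out, shared):
    """Weight count for one-hidden-layer nets (biases ignored)."""
    if shared:
        # Model A: one net with k_out output units.
        return d_in * d_h + d_h * k_out
    # Model B: k_out separate single-output nets.
    return k_out * (d_in * d_h + d_h)

d_in, d_h = 10, 50
for k in (2, 100):
    print(k, n_params(d_in, d_h, k, True), n_params(d_in, d_h, k, False))
# With 2 outputs: 600 vs 1100 weights.
# With 100 outputs: 5500 vs 55000 -- model B grows roughly linearly
# in the number of outputs, while model A reuses the hidden layer.
```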

$\begingroup$Thank you, super answer! The thing is that model $C$ to me is the same as model $B$, the only difference being the fact that you are minimizing a joint loss. The connections are exactly the same.$\endgroup$
– Tommaso Guerrini, Mar 24 '17 at 8:05

$\begingroup$Can I ask where you got the images?$\endgroup$
– Tommaso Guerrini, Mar 24 '17 at 8:41

$\begingroup$@TommasoGuerrini That's correct. I mentioned in the post that models B and C are equivalent, but perhaps the wording wasn't clear. They're just different ways of looking at the same thing. The point was to compare B/C to A. I drew the images in OpenOffice.$\endgroup$
– user20160, Mar 24 '17 at 8:43


$\begingroup$For B, you could train the two networks on separate machines (with a different GPU each), so I think that might be more efficient. If using a single machine, I don't know. In principle, the structure is the same, but not sure how implementation details would affect things. With C, you'd need to make sure the sparse/block diagonal weight matrix is handled correctly.$\endgroup$
– user20160, Mar 24 '17 at 8:59