Generalized Coefficient of Correlation for Non-Linear Relationships

What is the best correlation coefficient R(X, Y) to measure non-linear dependencies between two variables X and Y? Let's say that you want to assess whether there is a linear or quadratic relationship between X and Y. One way to do it is to perform a polynomial regression such as Y = a + bX + cX^2, and then measure the standard (Pearson) coefficient of correlation between the predicted and observed values of Y. How good is this approach?
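To make the idea concrete, here is a minimal sketch in Python with NumPy. The function name poly_corr and the test data are my own choices for illustration, not part of any library. It fits Y = a + bX + cX^2 by least squares and returns the Pearson correlation between fitted and observed Y.

```python
import numpy as np

def poly_corr(x, y, degree=2):
    """R(X, Y): Pearson correlation between observed y and the
    values predicted by a least-squares polynomial fit of y on x."""
    coeffs = np.polyfit(x, y, degree)   # highest-degree coefficient first
    y_hat = np.polyval(coeffs, x)       # predicted values
    return np.corrcoef(y, y_hat)[0, 1]

# Illustrative data: a purely quadratic relationship on a symmetric range.
x = np.linspace(-1.0, 1.0, 21)
y = x ** 2

pearson = np.corrcoef(x, y)[0, 1]   # the ordinary correlation misses it (close to 0)
r_quad = poly_corr(x, y, degree=2)  # the polynomial version detects it (close to 1)
print(pearson, r_quad)
```

Note that the sketch returns nan when the fitted values are constant, since the correlation is then undefined.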

Note that the proposed correlation coefficient R(X, Y) is not symmetric. One way to get a symmetric version is to use the maximum of | R(X, Y) | and | R(Y, X) |. It will be equal to 1 if and only if there is an exact polynomial or inverse polynomial relationship (of the chosen degree) between X and Y.

Note: If one checks the model Y = a + bX + cX^2, the "inverse polynomial" model would be X = a' + b'Y + c'Y^2. So, R(X, Y) is computed on the first regression, while R(Y, X) is computed on the second (reversed, also called dual) regression.
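The symmetric version can be sketched as follows, again assuming NumPy and using my own illustrative names (poly_corr, symmetric_poly_corr). The example data satisfy X = Y^2 exactly, so the reversed regression is exact while the direct one is only approximate.

```python
import numpy as np

def poly_corr(x, y, degree=2):
    """R(X, Y): correlation between observed y and a polynomial fit of y on x."""
    y_hat = np.polyval(np.polyfit(x, y, degree), x)
    return np.corrcoef(y, y_hat)[0, 1]

def symmetric_poly_corr(x, y, degree=2):
    """max(|R(X, Y)|, |R(Y, X)|): direct and reversed (dual) regressions."""
    r_xy = poly_corr(x, y, degree)  # model Y = a + bX + cX^2
    r_yx = poly_corr(y, x, degree)  # model X = a' + b'Y + c'Y^2
    return max(abs(r_xy), abs(r_yx))

# Illustrative data: Y = sqrt(X), that is, X = Y^2 exactly.
x = np.linspace(0.0, 2.0, 21)
y = np.sqrt(x)

print(poly_corr(x, y))            # high, but below 1
print(poly_corr(y, x))            # 1 up to rounding error
print(symmetric_poly_corr(x, y))  # 1 up to rounding error
```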

Discussion

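An issue with my approach is the risk of over-fitting. If you have n observations (with distinct X values) and n coefficients in the regression, the fitted polynomial passes through every point, so my correlation will always be 1, even if Y is pure noise.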
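A quick way to see this, as a sketch assuming NumPy and made-up random data: with n = 6 observations, a polynomial of degree n - 1 = 5 has 6 coefficients and interpolates the points exactly, so the correlation is 1 even though Y is unrelated to X.

```python
import numpy as np

def poly_corr(x, y, degree):
    """Correlation between observed y and a polynomial fit of y on x."""
    y_hat = np.polyval(np.polyfit(x, y, degree), x)
    return np.corrcoef(y, y_hat)[0, 1]

rng = np.random.default_rng(0)  # fixed seed, illustrative only
n = 6
x = np.arange(n, dtype=float)
y = rng.normal(size=n)          # pure noise, unrelated to x

r_full = poly_corr(x, y, degree=n - 1)  # n coefficients: exact interpolation
r_quad = poly_corr(x, y, degree=2)      # 3 coefficients: no perfect fit
print(r_full, r_quad)
```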

There are various ways to avoid this problem, for instance:

Use a polynomial of degree at most 2, regardless of the number of observations.

Use much smoother functions than polynomials, for instance functions that have at most one extremum (maximum or minimum) and that grow no faster than a linear function. Even in that case, use a small number of coefficients in the regression, maybe log(log(n)), where n is the number of observations.

The correlation coefficient in question can also be used for model selection: among models with a comparably small number of coefficients, the best one would be the one whose correlation is closest to 1.

In general, I would recommend Mutual Information (MI) as an approach to measuring the strength of a relationship between two variables when the form of that relationship is unknown, whether it is linear, nonlinear, or even non-functional.

There are some challenges though. Most notably, you need to estimate the joint density of the two variables (discussed in the paper supplement and the attached slides), and with the number of data points illustrated above, that would not be possible. But then again, you can't really fit a high-order polynomial through the number of points shown above either (i.e., it's obviously overfit, as illustrated). So the issue of sample size is really a generic problem. Nevertheless, MI is probably a bit more data hungry than a constrained analytical model (if you know in advance what the model should be).
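For readers who want to experiment, here is a rough sketch of one simple way to estimate MI, assuming NumPy. It uses a histogram (plug-in) estimate of the joint density. This is my own choice for illustration, not necessarily the estimator used in the paper mentioned above. The function name mutual_information and the bin count are assumptions, and the estimate is sensitive to both the number of bins and the sample size.

```python
import numpy as np

def mutual_information(x, y, bins=8):
    """Plug-in MI estimate (in nats) from a 2-D histogram of (x, y)."""
    counts, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = counts / counts.sum()          # estimated joint distribution
    px = pxy.sum(axis=1, keepdims=True)  # marginal of x, shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)  # marginal of y, shape (1, bins)
    mask = pxy > 0                       # skip empty cells (0 * log 0 = 0)
    return float(np.sum(pxy[mask] * np.log(pxy[mask] / (px @ py)[mask])))

# Illustrative data: a noisy quadratic dependence versus independence.
rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, 5000)
y_dep = x ** 2 + rng.normal(scale=0.05, size=5000)
y_ind = rng.uniform(-1.0, 1.0, 5000)

print(mutual_information(x, y_dep))  # clearly positive
print(mutual_information(x, y_ind))  # near 0, with a small positive bias
```

The small positive value obtained for independent data reflects the bias of the plug-in estimate, which grows as the sample shrinks or the number of bins increases. That is one way to see why MI is data hungry.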