I'm really confused about the difference in meaning, in the context of linear regression, of the following terms:

F statistic

R squared

Residual standard error

I found this website which gave me great insight into the different terms involved in linear regression; however, the terms mentioned above look quite alike (as far as I understand). I will cite what I read and what confused me:

Residual Standard Error is a measure of the quality of a linear regression fit. … The Residual Standard Error is the average amount that the response (dist) will deviate from the true regression line.

1. So this is actually the average distance of the observed values from the lm line?

The R-squared statistic provides a measure of how well the
model is fitting the actual data.

2. Now I'm getting confused: if the RSE tells us how far our observed points deviate from the regression line, then a low RSE actually tells us "your model fits the observed data points well", i.e. how well our model fits. So what is the difference between R squared and RSE?

F-statistic is a good indicator of whether there is a relationship
between our predictor and the response variables.

3. Is it true that we can have an F value indicating a strong relationship that is NON LINEAR, so that our RSE is high and our R squared is low?

3 Answers

The best way to understand these terms is to do a regression calculation by hand. I wrote two closely related answers (here and here); they may not fully address your particular case, but read through them nonetheless. Maybe they will also help you conceptualize these terms better.

In a regression (or ANOVA), we build a model from a sample dataset which enables us to predict outcomes for a population of interest. To do so, the following three components are calculated in a simple linear regression, from which the other quantities can then be derived, e.g. the mean squares, the $F$-value, the $R^2$ (also the adjusted $R^2$), and the residual standard error ($RSE$):

total sums of squares ($SS_{total}$)

residual sums of squares ($SS_{residual}$)

model sums of squares ($SS_{model}$)

Each of these assesses how well the model describes the data; each is the sum of the squared distances from the data points to the fitted model (illustrated as red lines in the plot below).

The $SS_{total}$ assesses how well the mean fits the data. Why the mean? Because the mean is the simplest model we can fit, and it therefore serves as the baseline to which the least-squares regression line is compared. This plot using the cars dataset illustrates that:

The $SS_{residual}$ assesses how well the regression line fits the data.

The $SS_{model}$ measures how much better the regression line fits the data than the mean does (i.e. the difference between the $SS_{total}$ and the $SS_{residual}$).
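In symbols, with observed values $y_i$, fitted values $\hat{y}_i$, and the mean $\bar{y}$, the three components are:

$$SS_{total} = \sum_i (y_i - \bar{y})^2, \qquad SS_{residual} = \sum_i (y_i - \hat{y}_i)^2, \qquad SS_{model} = \sum_i (\hat{y}_i - \bar{y})^2,$$

and in least-squares regression they satisfy $SS_{total} = SS_{model} + SS_{residual}$.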

To answer your questions, let's first calculate the terms you want to understand, starting with the model and its output as a reference:
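Since the reference code is not reproduced here, a minimal R sketch (using the built-in cars dataset, as in the plots above) might look like the following; the variable names ss.total, ss.residual, and ss.model are chosen to match the snippet further below:

```r
# Fit a simple linear regression: stopping distance as a function of speed
model <- lm(dist ~ speed, data = cars)
summary(model)

# The three sums of squares
ss.total    <- sum((cars$dist - mean(cars$dist))^2)  # squared distances from the mean
ss.residual <- sum(residuals(model)^2)               # squared distances from the regression line
ss.model    <- ss.total - ss.residual                # improvement over the mean

# Mean squares: sums of squares averaged over their degrees of freedom
ms.model    <- ss.model / 1                          # one predictor, so df_model = 1
ms.residual <- ss.residual / df.residual(model)      # df_residual = n - 2 here
```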

If you remember that the $SS_{residual}$ was the sum of the squared distances between the observed data points and the model (the regression line in the second plot above), and that the $MS_{residual}$ is just the averaged $SS_{residual}$, the answer to your first question is yes: the $RSE$ (the square root of the $MS_{residual}$) represents the average distance of the observed data from the model. Intuitively this also makes perfect sense, because if the distance is smaller, your model fit is also better.
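As a concrete check (again a sketch using the cars dataset): in simple linear regression the $RSE$ is $\sqrt{SS_{residual}/(n-2)}$, and computing it by hand reproduces the "Residual standard error" that summary() reports:

```r
model <- lm(dist ~ speed, data = cars)
ss.residual <- sum(residuals(model)^2)

# RSE = sqrt(SS_residual / (n - 2)) for a simple linear regression
rse <- sqrt(ss.residual / df.residual(model))
rse
summary(model)$sigma  # the same value, as reported by summary()
```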

Q2:

Now I'm getting confused: if the RSE tells us how far our observed points deviate from the regression line, then a low RSE actually tells us "your model fits the observed data points well", i.e. how well our model fits. So what is the difference between R squared and RSE?

Now the $R^2$ is the ratio of the $SS_{model}$ and the $SS_{total}$:

# R squared
r.sq <- ss.model/ss.total
r.sq

The $R^2$ expresses how much of the total variation in the data can be explained by the model (the regression line). Remember that the total variation was the variation in the data when we fitted the simplest model to the data, i.e. the mean. Compare the $SS_{total}$ plot with the $SS_{model}$ plot.

So to answer your second question, the difference between the $RSE$ and the $R^2$ is that the $RSE$ tells you something about the inaccuracy of the model (in this case the regression line) given the observed data.

The $R^2$, on the other hand, tells you how much variation is explained by the model (i.e. the regression line) relative to the variation that was explained by the mean alone (i.e. the simplest model).

Q3:

Is it true that we can have an F value indicating a strong relationship that is NON LINEAR, so that our RSE is high and our R squared is low?

The $F$-value, on the other hand, is calculated as the model mean square $MS_{model}$ (the signal) divided by the $MS_{residual}$ (the noise):
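A sketch of that calculation in R (using the cars dataset as above), compared against the F-statistic that summary() reports:

```r
model <- lm(dist ~ speed, data = cars)
ss.total    <- sum((cars$dist - mean(cars$dist))^2)
ss.residual <- sum(residuals(model)^2)
ss.model    <- ss.total - ss.residual

ms.model    <- ss.model / 1                      # df_model = 1 (one predictor)
ms.residual <- ss.residual / df.residual(model)  # df_residual = n - 2

# F = signal / noise
f.value <- ms.model / ms.residual
f.value
summary(model)$fstatistic[1]                     # same value, from summary()
```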

(2) You are understanding it correctly; you are just having a hard time with the concept.

The $R^2$ value represents how well the model accounts for the data. It can only take values between 0 and 1. It is the proportion of the variation of the points in the dataset that the model can explain.

The RSE is more a descriptor of how much the original data deviates from the model. So the $R^2$ says, "this is how well the model explains the presented data." The RSE says, "when mapped, we expected the data to be here, but here is where it actually was." They are very similar but are used for validation in different ways.

The F-statistic is the division of the model mean square and the residual mean square. Software like Stata, after fitting a regression model, also provide the p-value associated with the F-statistic. This allows you to test the null hypothesis that your model's coefficients are zero. You could think of it as the "statistical significance of the model as a whole."
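In R (assuming the same cars model as above; Stata's output is analogous), that p-value can be recovered from the F-statistic and its degrees of freedom with pf():

```r
model <- lm(dist ~ speed, data = cars)
fstat <- summary(model)$fstatistic    # named vector: value, numdf, dendf

# Upper-tail probability of the F distribution = p-value of the overall model test
p.value <- pf(fstat["value"], fstat["numdf"], fstat["dendf"], lower.tail = FALSE)
unname(p.value)
```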