One of the questions in the final drew my attention to the fact that I had somehow failed to get one topic straight from the lectures. This is the issue of noise. Perhaps a little discussion would help?

Firstly, are we on the same page if I take noise in this context always to mean a discrepancy between a learned hypothesis and the function that it attempts to approximate? The latter is the true signal, and the former is some (usually somewhat inaccurate) approximation to it.

Secondly, the notion of deterministic noise is very clear from the presentation in the books and lectures as the difference between the mean hypothesis and the target function (assuming some probability distribution on the set of possible samples and some fixed machine that converts samples to hypotheses, which can then be compared pointwise to the target). But given that the mean hypothesis can be very different from any of the hypotheses in the hypothesis set, I am not sure it helps to think of it as the "best" hypothesis in the set. Especially since sometimes when it is in the set, it is not the best hypothesis! I believe the fact that, when it is in the hypothesis set, it is often close to the best approximation may have something to do with certain derived probability distributions typically being not very asymmetric. [Everything's approximately Gaussian, right?]

Anyhow, to me, the concept of deterministic noise seems to be essentially the same as the bias term in the bias-variance decomposition. Would you agree with this statement?

In hindsight, the next point is where I am sure I was guilty of muddled understanding. The selection of a particular sample is presumably a random thing, resulting from some probability distribution on the set of all possible samples, and is thus stochastic in nature. Only now have I come to the conclusion that this does not form part of what is called stochastic noise.

It would help here if a separate special term were used (wouldn't "sampling noise" be clearer than "variance", a very general term more prone to being misinterpreted than "bias", in my opinion?). Unfortunately, as it is a source of noise resulting from the random selection of a particular sample, it is stochastic in nature, which is how I explain my wrong interpretation of the term.

So, am I right in thinking that stochastic noise is limited to noise in the target which is completely independent of the information contained in the input $\mathbf{x}$? This is certainly a key practical concept, with no distinction between noise that is purely random in nature and noise that stems from missing information in the inputs (whether this is still true in the case of quantum entanglement is a diversion from machine learning ...). For example, if you have a deterministic function of 3 variables and are given only 2 of them to learn from, the dependence on the 3rd variable may look exactly like stochastic noise, right?
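The three-variable example can be checked numerically. What follows is only a minimal sketch under stated assumptions: the target `target`, its coefficients, the uniform input distribution, and the use of a linear least-squares learner are all hypothetical choices made for illustration, not anything taken from the book or lectures.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Three inputs, drawn independently and uniformly on [-1, 1] (assumption).
X = rng.uniform(-1.0, 1.0, size=(n, 3))


def target(X):
    """Hypothetical deterministic target of all three variables."""
    return X[:, 0] - 2.0 * X[:, 1] + 0.5 * X[:, 2]


y = target(X)

# The learner only sees the first two variables (plus a constant term).
A = np.column_stack([np.ones(n), X[:, :2]])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
residual = y - A @ coef

# The leftover error is essentially the hidden 0.5 * x3 contribution.
residual_std = residual.std()
hidden_std = 0.5 * X[:, 2].std()

# From the learner's viewpoint the residual carries no linear information
# about the two inputs it was given.
corr_with_seen = [np.corrcoef(residual, X[:, j])[0, 1] for j in range(2)]

print("fitted coefficients:", coef)
print("residual std:", residual_std, "hidden-term std:", hidden_std)
print("correlation with seen inputs:", corr_with_seen)
```

In this setup the residual has roughly the spread of the hidden term and is uncorrelated with the two visible inputs, so a learner restricted to those inputs cannot tell it apart from additive stochastic noise.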

Thanks, Yaser, for getting us to think about an important issue. Hopefully I am in a better position now to use it in practice. Please do point out anything that I still have wrong or incomplete.

At a high level, noise is the discrepancy between the best you could do within $\mathcal{H}$ (not the learned hypothesis) and $f$. You cannot model stochastic noise. In a similar way, $\mathcal{H}$ cannot model the deterministic noise.

To a first approximation, the deterministic noise 'level' is quantified by the bias term. However, the effect of the noise does not end there. When there is noise, it is also harder to find the best fit. This shows up as the indirect impact of the noise, which is in the var term.

Yes, the bias is determined by the mean hypothesis. For most standard models, this is close to the best fit, but not necessarily so, as you point out. With respect to thinking about deterministic noise, it is better to think about the actual best fit; the part of $f$ that is 'orthogonal' to this best fit acts like noise and cannot be modeled.
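The gap between the mean hypothesis and the best fit can be seen in a small simulation. This is a sketch under assumptions that are not stated in the thread: the target is taken to be sin(pi x) on [-1, 1] with uniform inputs, the hypothesis set is the set of lines, and each data set has just two noiseless points, so the learned line passes through both.

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials = 200_000

# Each row is one data set of two points from the assumed target sin(pi x).
x = rng.uniform(-1.0, 1.0, size=(n_trials, 2))
y = np.sin(np.pi * x)

# Line through the two points of each data set.
slopes = (y[:, 1] - y[:, 0]) / (x[:, 1] - x[:, 0])
intercepts = y[:, 0] - slopes * x[:, 0]

# Mean hypothesis: average the learned lines over data sets.
mean_slope = slopes.mean()
mean_intercept = intercepts.mean()

# Best fit within the set of lines: least squares against the target itself,
# approximated on a dense grid over the input range.
grid = np.linspace(-1.0, 1.0, 20001)
f_grid = np.sin(np.pi * grid)
best_slope, best_intercept = np.polyfit(grid, f_grid, 1)

# Squared error of each line against the target, averaged over the grid.
err_mean_hyp = np.mean((mean_slope * grid + mean_intercept - f_grid) ** 2)
err_best_fit = np.mean((best_slope * grid + best_intercept - f_grid) ** 2)

print("mean hypothesis slope:", mean_slope, "error:", err_mean_hyp)
print("best-fit slope:       ", best_slope, "error:", err_best_fit)
```

Under these assumptions the mean hypothesis is a noticeably flatter line than the best fit, yet its error against the target is only slightly larger, which matches the remark that the two are usually close but not identical.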

Yes, you are correct. The fact that the data set is finite and random is not related to the stochastic noise. It is not the randomness of the data set per se that is bad, but the finiteness of it. So you are right, it may be a good idea to emphasize that the randomness in the data set is not related to the stochastic noise. In fact, we had at some point toyed with introducing the term 'finite sample noise' to highlight this point, but decided against it.

However, this randomness of the finite data set is crucial because it is actually what leads to the var term. If the data set were large, tending to infinity, then the var term would tend to zero (typically at a rate of 1/N).
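The shrinking var term can be estimated by simulation. The sketch below rests on assumptions chosen only for illustration: a sin(pi x) target with uniform inputs on [-1, 1], a least-squares line as the learner, and a helper named `var_term` that is hypothetical.

```python
import numpy as np


def var_term(n_points, n_trials=4000, noise_std=0.0, seed=0):
    """Estimate the var term for least-squares lines fitted to n_points samples."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1.0, 1.0, size=(n_trials, n_points))
    # Assumed target sin(pi x), plus optional additive stochastic noise.
    y = np.sin(np.pi * x) + noise_std * rng.standard_normal(x.shape)

    # Closed-form least-squares line for every data set at once.
    xm = x.mean(axis=1, keepdims=True)
    ym = y.mean(axis=1, keepdims=True)
    slope = ((x - xm) * (y - ym)).sum(axis=1) / ((x - xm) ** 2).sum(axis=1)
    intercept = ym[:, 0] - slope * xm[:, 0]

    # Spread of the learned lines around their mean, averaged over the inputs.
    grid = np.linspace(-1.0, 1.0, 201)
    preds = slope[:, None] * grid[None, :] + intercept[:, None]
    return preds.var(axis=0).mean()


var_by_n = {n: var_term(n) for n in (5, 20, 80)}
var_noisy = var_term(20, noise_std=0.5)

for n, v in var_by_n.items():
    print(f"N={n:3d}  var={v:.4f}  N*var={n * v:.3f}")
print(f"N= 20 with stochastic noise: var={var_noisy:.4f}")
```

In this setup the estimated var falls as N grows, with N times var staying of the same order, and adding stochastic noise at a fixed N makes var larger, which is the indirect impact of noise.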

So what is actually going on is as follows. There is stochastic noise and deterministic noise. These have a direct impact on the error through the $\sigma^2$ and bias terms. Now there is also the var term, which is the indirect impact of the stochastic and deterministic noise. This indirect impact is due to 'not being able to find the best fit' from a finite data set. It is hard to find the best fit from a finite data set, and even harder when there is noise (stochastic or deterministic).
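For reference, the three terms can be written out as the usual bias-variance decomposition with noise. The notation here is an assumption rather than something quoted from the thread: $g^{(\mathcal{D})}$ is the hypothesis learned from data set $\mathcal{D}$, $\bar{g}$ is the mean hypothesis, and the test label is $y = f(\mathbf{x}) + \epsilon$ with $\epsilon$ zero-mean, of variance $\sigma^2$, and independent of $\mathbf{x}$ and $\mathcal{D}$.

```latex
\mathbb{E}_{\mathcal{D},\,\mathbf{x},\,\epsilon}\!\left[\big(g^{(\mathcal{D})}(\mathbf{x}) - y\big)^2\right]
  = \underbrace{\sigma^2}_{\text{stochastic noise}}
  + \underbrace{\mathbb{E}_{\mathbf{x}}\!\left[\big(\bar{g}(\mathbf{x}) - f(\mathbf{x})\big)^2\right]}_{\text{bias}}
  + \underbrace{\mathbb{E}_{\mathbf{x}}\,\mathbb{E}_{\mathcal{D}}\!\left[\big(g^{(\mathcal{D})}(\mathbf{x}) - \bar{g}(\mathbf{x})\big)^2\right]}_{\text{var}},
\qquad
\bar{g}(\mathbf{x}) = \mathbb{E}_{\mathcal{D}}\!\left[g^{(\mathcal{D})}(\mathbf{x})\right]
```

The first two terms do not depend on the size of the data set; only the var term does, which is why it carries the indirect, finite-sample effect of both kinds of noise.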

The contents of this forum are to be used ONLY by readers of the Learning From Data book by Yaser S. Abu-Mostafa, Malik Magdon-Ismail, and Hsuan-Tien Lin, and participants in the Learning From Data MOOC by Yaser S. Abu-Mostafa. No part of these contents is to be communicated or made accessible to ANY other person or entity.