Suppose I have train and test data sets, and both contain missing values. Can I join them and then call alldata.fillna(alldata.mean()), or will that cause data leakage? What would be a better way, given that imputing them separately would give a different mean for the same column?

2 Answers

This is a controversial subject with no clear best answer and myriad options, some of which are model-specific. You can drop the missing values, replace them with extreme values, interpolate, replace them with the median, impute with nearest neighbors, etc. In every case except dropping them altogether, you make assumptions and create data where no data exists. I can't tell you the best approach, but I can advise you to try multiple approaches and check whether the end results are robust to the choice.
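
As a rough sketch of trying more than one approach, the snippet below compares mean and median imputation with pandas. The toy data and the helper name impute_with_train_stat are made up for illustration. It also assumes the fill statistic is computed on the training set only and then reused for the test set. That is a common way to address the leakage concern in the question, since a mean taken over the joined data would include test values, but it is only one option among those listed above.

```python
import numpy as np
import pandas as pd

# Toy data, invented purely for illustration.
train = pd.DataFrame({"x": [1.0, 2.0, np.nan, 9.0]})
test = pd.DataFrame({"x": [np.nan, 4.0]})


def impute_with_train_stat(train_df, test_df, stat="mean"):
    """Fill NaNs in both frames with a per-column statistic computed on train only.

    Hypothetical helper: `stat` is either "mean" or "median".
    """
    if stat == "mean":
        fill_values = train_df.mean()    # per-column mean, NaNs skipped
    elif stat == "median":
        fill_values = train_df.median()  # per-column median, NaNs skipped
    else:
        raise ValueError(f"unsupported stat: {stat}")
    # The same train-derived values fill both sets, so the test data
    # never influences the imputation.
    return train_df.fillna(fill_values), test_df.fillna(fill_values)


train_mean, test_mean = impute_with_train_stat(train, test, "mean")
train_median, test_median = impute_with_train_stat(train, test, "median")

print(test_mean)
print(test_median)
```

Fitting the model once on each imputed version and comparing the scores gives a quick check of whether the results depend on the imputation choice.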