
I have a data set with 3,000 features and 18,000 instances, and a continuous dependent variable that measures time. The histogram of the dependent variable shows that it has a bimodal distribution. I am building regression models to predict the time, but none of the models make useful predictions; the $R^2$ values of all of the models are $0$. I plotted the residuals of the models and verified that they are normally distributed.

I have used LassoCV/Lasso, ElasticNetCV/ElasticNet, and Kernel Ridge Regression in Scikit-Learn, as well as XGBoost, to build the models. I used the magnitudes of the Lasso coefficients to filter my data frame down to the features with the largest weights, but that did not work. I used the feature_importances_ attribute of the XGBoost model to do something similar, but that also did not help.
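To make the filtering step concrete, here is a minimal sketch of coefficient-based feature selection with LassoCV. It runs on synthetic data, and the sizes, the `top_k` cutoff, and all variable names are assumptions made for illustration, not the actual pipeline from the question.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data (assumption): 300 rows, 50 features,
# of which only the first three actually drive the target.
rng = np.random.default_rng(0)
n_samples, n_features = 300, 50
X = rng.normal(size=(n_samples, n_features))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.5 * X[:, 2] + rng.normal(scale=0.5, size=n_samples)

# Standardize so that coefficient magnitudes are comparable across features.
X_scaled = StandardScaler().fit_transform(X)

# LassoCV picks the regularization strength by cross-validation.
lasso = LassoCV(cv=5, random_state=0).fit(X_scaled, y)

# Keep the top_k features with the largest absolute coefficients.
top_k = 3  # hypothetical cutoff
selected = np.argsort(np.abs(lasso.coef_))[::-1][:top_k]
X_filtered = X[:, selected]

print("selected feature indices:", sorted(selected.tolist()))
```

Note that this kind of filter only helps if the target depends on the features roughly linearly; if the two modes are driven by something a linear model cannot see, the ranking may not be informative.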

I have been looking into methods such as Kernel Density Estimation and Gaussian Mixture Models, but am unsure how to use them for my regression task. I have also been looking into Modal Regression and fitting of multi-functions, but am not familiar with how to implement them in Python.

I have also considered finding out which features contribute to each mode, splitting the data set into two distributions, and running the models on the two separate data sets, but I am not sure how to combine the results into a single picture for the entire data set.
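One possible way to sketch the split-and-recombine idea: label each instance's mode with a two-component Gaussian mixture fitted on the target, train a classifier to predict the mode from the features, fit one regressor per mode, and combine the per-mode predictions weighted by the predicted mode probabilities. The code below is only an illustration on synthetic data. It assumes the mode is predictable from the features, and the choice of LogisticRegression and LinearRegression is a placeholder, not a recommendation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import r2_score
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import train_test_split

# Synthetic bimodal target (assumption): feature 0 decides the mode,
# feature 1 shifts the value within a mode.
rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 5))
true_mode = (X[:, 0] > 0).astype(int)
y = np.where(true_mode == 1, 20.0, 5.0) + X[:, 1] + rng.normal(scale=0.5, size=n)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Step 1: assign each training target to one of two modes.
gmm = GaussianMixture(n_components=2, random_state=0).fit(y_train.reshape(-1, 1))
labels = gmm.predict(y_train.reshape(-1, 1))

# Step 2: learn to predict the mode from the features alone.
clf = LogisticRegression(max_iter=1000).fit(X_train, labels)

# Step 3: fit a separate regressor on the instances of each mode.
regs = {k: LinearRegression().fit(X_train[labels == k], y_train[labels == k]) for k in (0, 1)}

# Step 4: combine per-mode predictions, weighted by predicted mode probability.
# clf.classes_ is [0, 1], so the columns of proba line up with per_mode.
proba = clf.predict_proba(X_test)
per_mode = np.column_stack([regs[k].predict(X_test) for k in (0, 1)])
y_pred = (proba * per_mode).sum(axis=1)

r2 = r2_score(y_test, y_pred)
print("held-out R^2:", round(r2, 3))
```

If the classifier in step 2 cannot separate the modes on real data, that in itself suggests the features carry little information about which mode an instance belongs to, which would be consistent with an $R^2$ near $0$.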

In general, how should I approach building a regression model with a multimodal dependent variable?