Identify "important variables" for partition trees regression

Jun 14, 2019 12:02 PM

I am developing a partition tree regression analysis in JMP 13. I have 83 variables, and I am using VIF to identify the variables that are not collinear, with the aim of finding the most "representative" variables for the partition tree regression. (The example I am following is Bank Revenues, Multiple Linear Regression with Transformation, from Building Better Models with JMP Pro, Chapter 4, SAS Press (2015), by Grayson, Gardner and Stephens.) My question: is this an appropriate approach, or is there a more specific protocol in JMP for identifying these "important" variables to use with regression partition trees?

Re: Identify "important variables" for partition trees regression

I'm no expert, but I've gone through that book as well (as an aside, I liked the examples and practice work). There are some good examples in later chapters that might be applicable to your particular situation, too.

The VIF is certainly one way to flag variables that are largely redundant with other predictors in the model. Examining the column contributions in the decision tree results can be quite helpful, too. Another approach could be principal component analysis (PCA), which can help to group related variables together so that you can use the resulting PCs in modeling the response. The bootstrap forest platform is supposed to be robust against overfitting because it randomly samples both the observations and the factors. Both the bootstrap forest and boosted tree should "kick" out non-useful factors, as they won't split on those factors and their contributions will drop to very low values.
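For reference, here is the usual textbook definition of the VIF (this is the general formula, not something specific to the JMP report). The VIF for predictor j comes from regressing that predictor on all of the other predictors and taking the R-squared of that fit:

```latex
\mathrm{VIF}_j = \frac{1}{1 - R_j^2},
\qquad \text{e.g. } R_j^2 = 0.90 \;\Rightarrow\; \mathrm{VIF}_j = \frac{1}{1 - 0.90} = 10
```

A VIF near 1 means the predictor is nearly uncorrelated with the others. Values above roughly 5 or 10 are often treated as a sign of collinearity, but that cutoff is only a rule of thumb.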

If you have enough data to create a validation column with training, validation, and test data, this will help to simplify your model a lot. The validation portion helps to choose the most parsimonious model from the group, while the test data is used to see how well the model works with "blind" data. One note on this: you might want to consider whether the split should be stratified, randomized, or chosen some other way. This can depend partly (or largely) on the response you're modeling. Is it continuous, binary, etc.?

Lastly, comparing several different models (Model Comparison platform), such as NN, partition, bootstrap forest, boosted tree, or generalized regression, will also help to choose which one works best for your data. Be sure to compare your models on the "test" data only, or on any rows that were manually held back by hiding and excluding (i.e., you don't want to test on the training data).
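To make the "test data only" point concrete, here is a minimal Python sketch (not JMP output). The numbers, the model names, and the helper `rmse_on_test` are all made up for illustration. It scores two sets of predictions using only the rows marked as test:

```python
import math

def rmse_on_test(actual, predicted, split):
    """Root mean squared error computed only on rows labeled 'test'."""
    squared_errors = [
        (a - p) ** 2
        for a, p, s in zip(actual, predicted, split)
        if s == "test"
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical response values and the role of each row.
actual = [10.0, 12.0, 9.0, 11.0, 14.0, 8.0]
split = ["train", "train", "train", "valid", "test", "test"]

# Hypothetical predictions from two models. The first one reproduces the
# training rows exactly but does worse on the held-back rows.
predictions = {
    "partition": [10.0, 12.0, 9.0, 10.0, 11.0, 9.0],
    "bootstrap forest": [10.5, 11.5, 9.5, 11.0, 13.0, 8.0],
}

scores = {name: rmse_on_test(actual, pred, split) for name, pred in predictions.items()}
best = min(scores, key=scores.get)  # model with the lowest test RMSE
print(scores, best)
```

In this toy data the model that fits the training rows perfectly has the larger test error, which is exactly what scoring on the training rows would hide.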

I'm not sure if there is a "standardized" method to determine this, but approaching the problem from several different angles will certainly help you see where the most overlap is, and help you decide on the "representative" variables.

Re: Identify "important variables" for partition trees regression

Thanks for the suggestions. They clarified many issues for me, and I will work along those lines. One of my limitations, for example in comparing models, is that I do not have JMP Pro. However, I found the JMP to R Add-In Builder, which I understand could make it easier for me to use other algorithms for the kind of analysis you suggest. I mention this last option because I am a non-coder >.<

Re: Identify "important variables" for partition trees regression

There will be some limitations in model generation when you don't have access to JMP Pro. On the other hand, you can still do ARIMA modeling (if appropriate) and a simplified NN and partition tree. Another platform that might help is Predictor Screening, under Analyze -> Screening. This could also help to identify which of the many factors are likely to be important.

Another thing you can try, in the Fit Model platform, is to choose "Stepwise" under "Personality". This will also use different criteria to select which factors (and mixed terms) are needed and which aren't. Here, it helps a lot to have a pretty good idea of what kind of model to use. If you have lots of data, you could select all your factors and then choose Response Surface under the "Macros" drop-down button. You just want to make sure your degrees of freedom are not reduced too far. If you know that there is some physical process governing the response, this can be a starting point for building a model equation. One last thing you might want to look into is transforming your data before modeling it, especially if it's not normally distributed. Transforming the data can help the modeling process achieve a robust fit that has better predictive capabilities than it would otherwise.
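To put a rough number on the degrees-of-freedom caution: assuming the response surface macro builds a full quadratic model (intercept, main effects, squared terms, and all two-factor interactions) and all k factors are continuous, the number of parameters to estimate is

```latex
p = 1 + k + k + \frac{k(k-1)}{2},
\qquad k = 83:\quad p = 1 + 83 + 83 + \frac{83 \cdot 82}{2} = 1 + 166 + 3403 = 3570
```

Each estimated parameter uses up one degree of freedom, so with all 83 variables from the original question you would need well over 3570 rows before a model like this leaves anything for estimating error. That is a good reason to screen the factors down first.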

As far as the validation column goes, I know standard JMP doesn't come with the GUI to generate it, but here's some JSL that will generate N (I have it set to 5) validation columns that are stratified. Some models are sensitive to the validation column, even when stratifying on a certain response, so testing out different validation columns can be helpful in evaluating the robustness of the model.

In the above script, i runs from 1 to 5, so it generates 5 validation columns that are split 60% training, 20% validation and 20% test (you can change the values to what suits your data best), and it will be stratified random on whichever column(s) you choose in the :Group popup window. If you don't want to stratify, then you can replace that command with a random category formula.
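As a language-neutral illustration of what such a script does, here is a minimal Python sketch. It is not the JSL referred to above. The function name, the example data, and the 0/1/2 coding for training/validation/test are assumptions made for this sketch. It shuffles the rows within each level of the stratification column and then cuts each level 60/20/20, once per validation column:

```python
import random
from collections import defaultdict

def stratified_validation_column(groups, fractions=(0.6, 0.2, 0.2), seed=None):
    """Return one label per row: 0 = training, 1 = validation, 2 = test.

    Rows are shuffled within each stratum (each distinct value in `groups`)
    and then cut according to `fractions`, so every stratum gets roughly
    the same training/validation/test proportions.
    """
    rng = random.Random(seed)
    rows_by_group = defaultdict(list)
    for row, group in enumerate(groups):
        rows_by_group[group].append(row)

    labels = [None] * len(groups)
    for rows in rows_by_group.values():
        rng.shuffle(rows)
        n_train = round(len(rows) * fractions[0])
        n_valid = round(len(rows) * fractions[1])
        for position, row in enumerate(rows):
            if position < n_train:
                labels[row] = 0
            elif position < n_train + n_valid:
                labels[row] = 1
            else:
                labels[row] = 2  # whatever is left becomes test
    return labels

# Hypothetical binary response used as the stratification column.
response = ["yes"] * 20 + ["no"] * 80

# Five validation columns, one per seed, mirroring i = 1..5 in the post.
validation_columns = [
    stratified_validation_column(response, seed=i) for i in range(1, 6)
]
print([column.count(0) for column in validation_columns])
```

Fitting the same model against each of the five columns and comparing the results is one way to see how sensitive the model is to the particular split.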

JSL is a pretty easy language to code in, so if you have time, I highly suggest testing out some scripts to become familiar with it. I find it quite helpful when performing the same tasks on different data sets -- it helps to automate things. You can also check out the Scripting Index under the Help menu.