Modeling and predicting pathology from multivariate clinical data

New Member

Hello, I have a clinical data set that consists of 5 clinical measurements on thousands of tissue samples. Each sample also has a pathology diagnosis that is 1 of 5 possible diagnoses (all different types of tumors). I am interested in predicting which pathologic class future samples will belong to based on the 5 clinical measurements. I recognize the predictor can be built using machine learning, and I will apply both decision trees and a deep learning method to the data soon. However, I first wanted to explore simpler analyses that the machine learning findings could be compared against. An example of the structure of the data is below, averaged across all samples. The numbers are fake.

Is there a statistical approach to take that might say which clinical tests are "important" for which diagnoses? For example, to determine that Tumor Type 5 is best classified by Blood Test 2 > 15, Biopsy Test 1 < 13, and Imaging Test 1 > 2?

Do any other analysis methods jump out at you besides machine learning that I should consider?

Not a robit

As you know, you have a classification problem here, and many machine learning algorithms work for that. A couple of questions: are the continuous lab tests bounded within 0-1.00? Are the continuous lab tests correlated with each other?

Yeah, my first inkling was also logistic regression. You wouldn't need tumor-free patients if you did regroupings: 1 vs. 2-5; 2 vs. 3-5, ..., 4 vs. 1-3, 5 vs. 1-4. But that is a lot of testing when you think about correcting for false discovery. If it is just for fun, though, it would give you a glimpse into the relationships. Another crude option would be to just run linear regression and treat the 1-5 outcome as a continuous variable, though that imposes an ordering on the tumor types that may not exist. Another option is multinomial logistic regression, but you would need to choose your reference group accordingly.
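To make the regrouping idea concrete, here is a rough sketch of "one class vs. the rest" logistic regression in plain Python. It is only an illustration under assumptions: the two features, the class names (type_A etc.), and the numbers are made up, the features are assumed to be standardised already, and the fitting is bare gradient descent with no p-values or false discovery correction. In practice a library routine (e.g. scikit-learn's LogisticRegression) would be the usual choice.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic(X, y, lr=0.1, n_iter=2000):
    """Binary logistic regression by batch gradient descent.
    X: list of feature rows, y: list of 0/1. Returns (intercept, weights)."""
    n, p = len(X), len(X[0])
    b, w = 0.0, [0.0] * p
    for _ in range(n_iter):
        grad_b, grad_w = 0.0, [0.0] * p
        for row, target in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, row))) - target
            grad_b += err
            for j in range(p):
                grad_w[j] += err * row[j]
        b -= lr * grad_b / n
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
    return b, w

def predict_proba(model, row):
    """Probability that a row belongs to the 'one' class of a fitted model."""
    b, w = model
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, row)))

def one_vs_rest(X, labels):
    """Fit one binary model per class: that class (1) vs. all others (0)."""
    return {k: fit_logistic(X, [1 if lab == k else 0 for lab in labels])
            for k in sorted(set(labels))}

# Made-up, already-standardised values for two hypothetical tests.
X = [[2.0, 0.1], [1.8, -0.2], [2.2, 0.0],
     [0.1, 2.0], [-0.1, 1.9], [0.0, 2.1],
     [-0.1, 0.1], [0.1, -0.1], [0.0, 0.0]]
labels = ["type_A"] * 3 + ["type_B"] * 3 + ["type_C"] * 3

models = one_vs_rest(X, labels)
for k, (b, w) in models.items():
    # A large positive weight suggests higher values of that test favour class k.
    print(k, round(b, 2), [round(wj, 2) for wj in w])
```

The sign and relative size of each weight give the "glimpse into relationships" mentioned above; with 5 classes and 5 tests that is 25 coefficients to look at, which is where the multiple testing concern comes from.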

Lastly, you could look at correlations; you could probably get away with Spearman rank correlation.
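For reference, Spearman rank correlation is just Pearson correlation computed on the ranks, with ties given their average rank. A minimal standard-library sketch follows; the function names and the two example tests are made up for illustration, and something like scipy.stats.spearmanr would normally be used instead.

```python
from statistics import mean

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman rank correlation = Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Made-up values for two hypothetical tests measured on the same samples.
blood_test = [1, 2, 3, 4, 5]
imaging_test = [2, 4, 8, 16, 32]  # monotone but non-linear in blood_test
rho = spearman(blood_test, imaging_test)
print(rho)
```

Because it only uses ranks, a monotone but non-linear relationship between two tests still shows up as a strong correlation, which is why it is a forgiving first look.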

P.S. I would be curious how support vector machines or random forests would do. I have always wondered if you could run a bunch of short trees, say just one for each variable (if independent), pull the split points from each, and plot a histogram of the splits to make a decision from. Running full trees would also help to distinguish variable importance.
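Here is one way the short-tree idea could be tried, as a sketch only: fit a depth-1 tree (a stump) on a single variable over many bootstrap resamples, collect the chosen thresholds, and look at their distribution. Everything here is an assumption for illustration: the Gini criterion, the function names, the "type5 vs. other" grouping, and the toy numbers. The brute-force split search is quadratic in sample size, so for thousands of samples a library tree (e.g. scikit-learn's DecisionTreeClassifier with max_depth=1) would be the practical route.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(x, y):
    """Threshold on one variable that minimises weighted Gini impurity.
    Returns None if all x values are identical."""
    pairs = sorted(zip(x, y), key=lambda p: p[0])
    best_t, best_score = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no threshold fits between equal values
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best_score:
            best_score = score
            best_t = (pairs[i - 1][0] + pairs[i][0]) / 2  # midpoint threshold
    return best_t

def bootstrap_splits(x, y, n_boot=200, seed=0):
    """Fit one stump per bootstrap resample and collect the thresholds."""
    rng = random.Random(seed)
    n = len(x)
    splits = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        t = best_split([x[i] for i in idx], [y[i] for i in idx])
        if t is not None:
            splits.append(t)
    return splits

# Made-up values of one hypothetical test, grouped as one tumor type vs. the rest.
x = [1, 2, 3, 4, 10, 11, 12, 13]
y = ["other"] * 4 + ["type5"] * 4

splits = bootstrap_splits(x, y, n_boot=50)
# Crude text histogram of the split points, binned to the nearest integer.
for value, count in sorted(Counter(round(t) for t in splits).items()):
    print(value, "#" * count)
```

A tight cluster of split points would suggest a stable cutoff for that test, while a spread-out histogram would suggest the test does not separate that tumor type cleanly on its own.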