Developments over the last few years may change (statistically significantly!) the way we analyze our data. These include the wide availability of powerful computers (especially those with graphics processing units, or GPUs, that allow large-scale parallelized computations), open-source programming languages such as R (https://cran.r-project.org) and Python (https://www.python.org), and machine learning (ML) software (e.g., Scikit-Learn, Theano, TensorFlow, Caffe, Weka, and Apache Spark). As a result, ML is increasingly being used for data analysis in medicine.[1],[2],[3]

ML algorithms can be supervised or unsupervised, depending on whether a class or outcome variable is available. In addition to the commonly used linear and logistic regression, many other linear models are available, such as Ridge regression, Lasso, Elastic net, Least Angle Regression, Bayesian regression, Perceptron, Random sample consensus, Theil–Sen estimator, and Huber regression. The last three have the advantage of being robust to outliers. Currently, only linear and logistic regression analyses are widely used in medical studies.
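The robustness to outliers mentioned above can be illustrated with a minimal sketch in Python using scikit-learn. The data here are synthetic and purely an assumption for illustration (a true slope of 2 with ten deliberately corrupted observations); the sketch only compares the fitted slopes and is not a recommended analysis pipeline.

```python
import numpy as np
from sklearn.linear_model import (
    HuberRegressor,
    LinearRegression,
    RANSACRegressor,
    TheilSenRegressor,
)

# Synthetic data (an assumption for illustration): y = 2x + 1 plus noise.
rng = np.random.default_rng(0)
n_samples = 100
X = rng.uniform(0, 10, size=(n_samples, 1))
y = 2.0 * X[:, 0] + 1.0 + rng.normal(scale=0.5, size=n_samples)

# Corrupt the 10 observations with the largest x so they act as outliers.
outlier_idx = np.argsort(X[:, 0])[-10:]
y[outlier_idx] -= 30.0

models = {
    "ols": LinearRegression(),
    "huber": HuberRegressor(max_iter=1000),
    "theil_sen": TheilSenRegressor(random_state=0),
    "ransac": RANSACRegressor(random_state=0),
}

slopes = {}
for name, model in models.items():
    model.fit(X, y)
    # RANSAC keeps its final model, refit on the inliers, in `estimator_`.
    fitted = model.estimator_ if name == "ransac" else model
    slopes[name] = float(fitted.coef_[0])

for name, slope in slopes.items():
    print(f"{name:>10}: slope = {slope:.2f} (true slope = 2.00)")
```

On data of this kind, the ordinary least squares slope is pulled toward the corrupted points, whereas the three robust estimators should stay much closer to the true value.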

Supervised learning also includes many other techniques, most of them nonlinear, such as Linear and Quadratic Discriminant Analysis, Kernel ridge regression, Support vector machines, Stochastic gradient descent, Nearest Neighbors, Gaussian processes, Cross decomposition, Naive Bayes (e.g., Bernoulli, Gaussian, and Multinomial), Decision trees (Decision tree, Extra tree), Ensemble methods (including Bagging, Random Forest, AdaBoost, Gradient Tree Boosting, and Voting classifier), and supervised neural network models (e.g., multilayer perceptron). Cross decomposition techniques, including partial least squares and canonical correlation analysis, can find relationships between two matrices and hence can be used when the group or outcome variable is itself multivariate, like the predictor variables. Most of the techniques mentioned above can be used for regression (as an alternative to linear regression) and for classification (as an alternative to logistic regression).

Many of these algorithms are computationally intensive and were traditionally time-consuming, but they can now be run easily on modern computers. Each technique has its own assumptions, advantages, disadvantages, and situations in which it is most useful. These algorithms may produce models and results that differ from those of linear/logistic regression, but they may actually be closer to the truth. The average values of coefficients obtained from multiple algorithms are also likely to be indicative of true relationships. It is only prudent to make use of the variety of available techniques for the analysis of medical research data.
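The idea of averaging coefficients across algorithms can be sketched as follows. This is a minimal example under stated assumptions: the data are simulated with a known coefficient vector (true_coef, hypothetical), the predictors are standardized so that coefficients are on a comparable scale, and the six linear models and their penalty settings are arbitrary choices rather than a recommended set.

```python
import numpy as np
from sklearn.linear_model import (
    BayesianRidge,
    ElasticNet,
    HuberRegressor,
    Lasso,
    LinearRegression,
    Ridge,
)
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n_samples = 300
true_coef = np.array([3.0, -2.0, 0.0, 1.0])  # assumed ground truth

X = rng.normal(size=(n_samples, true_coef.size))
y = X @ true_coef + rng.normal(scale=1.0, size=n_samples)

# Standardize predictors so coefficients from all models are comparable.
X_std = StandardScaler().fit_transform(X)

models = {
    "ols": LinearRegression(),
    "ridge": Ridge(alpha=1.0),
    "lasso": Lasso(alpha=0.05),
    "elastic_net": ElasticNet(alpha=0.05, l1_ratio=0.5),
    "bayesian_ridge": BayesianRidge(),
    "huber": HuberRegressor(max_iter=1000),
}

# One row of coefficients per algorithm, one column per predictor.
coef_table = np.vstack([model.fit(X_std, y).coef_ for model in models.values()])
mean_coef = coef_table.mean(axis=0)
sd_coef = coef_table.std(axis=0)

for j, (mean, sd) in enumerate(zip(mean_coef, sd_coef)):
    print(f"x{j}: mean coefficient = {mean:+.2f}, SD across models = {sd:.2f}")
```

A small standard deviation across models suggests that an association does not depend on the choice of algorithm, although agreement among closely related linear models is weaker evidence than agreement among methods with different assumptions.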

Several feature selection methods are also available to retain the features that explain most of the variance in the outcome while rejecting those that contribute little. These include sequential feature selection, minimum redundancy maximum relevance, correlation feature selection, regularized trees, relief, and information gain-based feature selection, among others. They are broadly categorized into filter, wrapper, and embedded methods (the last of which combine feature selection and learning in a single step). The importance of individual features can also be estimated by many methods, such as Gradient Boosting, AdaBoost, Extra Trees, Decision Tree, and Random Forest. In addition, many techniques for model selection and evaluation are available. These include cross-validation, model persistence, and validation and learning curves. Comparison of different algorithms using cross-validation is especially popular.
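A comparison of algorithms by cross-validation might look like the following sketch. The dataset is generated with scikit-learn's make_classification and is purely synthetic; the four candidate classifiers, the 5-fold scheme, and the choice of area under the ROC curve as the metric are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary-outcome data (an assumption for illustration).
X, y = make_classification(
    n_samples=400,
    n_features=10,
    n_informative=4,
    n_redundant=2,
    n_clusters_per_class=1,
    class_sep=1.5,
    random_state=0,
)

# Scaling lives inside the pipeline so it is refit on each training fold.
candidates = {
    "logistic": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "naive_bayes": GaussianNB(),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

# The same folds are used for every algorithm so that scores are comparable.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_results = {
    name: cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
    for name, model in candidates.items()
}

for name, scores in cv_results.items():
    print(f"{name:>14}: AUC = {scores.mean():.3f} (SD {scores.std():.3f})")
```

Keeping preprocessing inside the pipeline avoids leaking information from the test folds into the training folds.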

Unconventional approaches used by ML techniques may result in unexpected benefits. For example, Simjanoska et al. were able to accurately determine blood pressure from raw electrocardiographic data![4] Moreover, multiple techniques can now easily be applied to medical data.[5] For example, [Figure 1] shows the results of a regression analysis of factors associated with low birth weight in the publicly available “birthwt” dataset using 20 different regression algorithms. Similarly, when 13 classification algorithms were run for feature selection on the publicly available South African Heart Disease (“sahd”) dataset, age, tobacco, low-density lipoprotein, family history, and Type A personality were selected by 92%, 85%, 77%, 69%, and 62% of the algorithms, respectively. Obesity, adiposity, systolic blood pressure, and alcohol intake were selected by only 23%, 15%, 8%, and 8% of the algorithms, respectively. A meta-analysis of this kind, pooling results from different algorithms applied to the same data, is likely to produce answers as correct as a meta-analysis of data from different studies that use the same single algorithm.
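The kind of tally described above, in which each algorithm votes for the features it selects, can be sketched as follows. This is not the analysis behind the reported percentages: the data are simulated (three informative features, x0 to x2, and five pure-noise features, all hypothetical), only six selectors are used, and each is asked to keep its top three features.

```python
from functools import partial

import numpy as np
from sklearn.ensemble import (
    ExtraTreesClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.feature_selection import (
    RFE,
    SelectFromModel,
    SelectKBest,
    f_classif,
    mutual_info_classif,
)
from sklearn.linear_model import LogisticRegression

# Simulated data (an assumption): only x0, x1, and x2 influence the outcome.
rng = np.random.default_rng(7)
n_samples, n_features, k = 600, 8, 3
feature_names = [f"x{i}" for i in range(n_features)]
X = rng.normal(size=(n_samples, n_features))
latent = 2.0 * X[:, 0] - 1.5 * X[:, 1] + 1.0 * X[:, 2]
y = (latent + rng.logistic(size=n_samples) > 0).astype(int)


def top_k_by_importance(estimator):
    """Embedded selection: keep the k features with the largest importance."""
    return SelectFromModel(estimator, max_features=k, threshold=-np.inf)


selectors = {
    # Filter methods
    "anova_f": SelectKBest(f_classif, k=k),
    "mutual_info": SelectKBest(partial(mutual_info_classif, random_state=0), k=k),
    # Wrapper method
    "rfe_logistic": RFE(LogisticRegression(max_iter=1000), n_features_to_select=k),
    # Embedded methods
    "random_forest": top_k_by_importance(
        RandomForestClassifier(n_estimators=200, random_state=0)
    ),
    "extra_trees": top_k_by_importance(
        ExtraTreesClassifier(n_estimators=200, random_state=0)
    ),
    "gradient_boosting": top_k_by_importance(
        GradientBoostingClassifier(random_state=0)
    ),
}

# One boolean row per selector: True where that selector kept the feature.
support = np.vstack([sel.fit(X, y).get_support() for sel in selectors.values()])
vote_share = support.mean(axis=0)

for name, share in sorted(zip(feature_names, vote_share), key=lambda p: -p[1]):
    print(f"{name}: selected by {share:.0%} of selectors")
```

Features chosen by most selectors are the ones whose relevance is least dependent on any single method's assumptions.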

Figure 1: Comparison of coefficients for different variables obtained from 20 different algorithms. Uterine irritability (“ui”), history of hypertension (“ht”), and smoking status (“smoke”) show the most consistent associations.