avNNet.default

data         Data frame from which variables specified in formula are preferentially to be taken.
subset       An index vector specifying the cases to be used in the training sample. (NOTE: If given, this argument must be named.)
na.action    A function to specify the action to be taken if NAs are found. The default action is for the procedure to fail. An alternative is na.omit, which leads to rejection of cases with missing values on any required variable. (NOTE: If given, this argument must be named.)
contrasts    a list of contrasts to be used for some or all of the factors appearing as variables in the model formula.
object       an object of class avNNet as returned by avNNet.
newdata      matrix or data frame of test examples. A vector is considered to be a row vector comprising a single case.
type         Type of output, either: raw for the raw outputs, code for the predicted class or prob for the class probabilities.
...          arguments passed to nnet

Details

Following Ripley (1996), the same neural network model is fit using different random number seeds. All the resulting models are used for prediction. For regression, the outputs from the networks are averaged. For classification, the model scores are first averaged, then translated to predicted classes. Bagging can also be used to create the models.

If a parallel backend is registered, the foreach package is used to train the networks in parallel.

Value

For avNNet, an object of "avNNet" or "avNNet.formula". Items of interest in the output are:

model        a list of the models generated from nnet
repeats      an echo of the model input
names        if any predictors had only one distinct value, this is a character string of the remaining columns. Otherwise a value of NULL

Author(s)

These are heavily based on the nnet code from Brian Ripley.

References

Ripley, B. D. (1996) Pattern Recognition and Neural Networks. Cambridge.

See Also

nnet, preProcess
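The averaging step described in the Details for avNNet can be sketched in base R. This is only an illustration under stated assumptions, not the package's implementation: the three probability matrices are made-up stand-ins for the outputs of networks fit with different seeds, and average_scores is a hypothetical helper name.

```r
# Hypothetical class-probability outputs from three networks (rows = cases).
p1 <- matrix(c(0.7, 0.3,  0.4, 0.6), ncol = 2, byrow = TRUE)
p2 <- matrix(c(0.6, 0.4,  0.6, 0.4), ncol = 2, byrow = TRUE)
p3 <- matrix(c(0.8, 0.2,  0.3, 0.7), ncol = 2, byrow = TRUE)
classes <- c("A", "B")

# Average the model scores element-wise across the list of matrices.
average_scores <- function(prob_list) {
  Reduce(`+`, prob_list) / length(prob_list)
}

avg <- average_scores(list(p1, p2, p3))
colnames(avg) <- classes

# Translate the averaged scores to predicted classes (largest score wins).
pred <- classes[max.col(avg, ties.method = "first")]
```

Averaging the scores before picking a class means a single over-confident network cannot decide the prediction on its own, which differs from taking a majority vote over per-network class labels.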

bag.default

B              the number of bootstrap samples to train over.
bagControl     a list of options.
...            arguments to pass to the model function
fit            a function that has arguments x, y and ... and produces a model object that can later be used for prediction. Example functions are found in ldaBag, plsBag, nbBag, svmBag and nnetBag.
predict        a function that generates predictions for each sub-model. The function should have arguments object and x. The output of the function can be any type of object (see the example below where posterior probabilities are generated). Example functions are found in ldaBag, plsBag, nbBag, svmBag and nnetBag.
aggregate      a function with arguments x and type. The function takes the output of the predict function and reduces the bagged predictions to a single prediction per sample. The type argument can be used to switch between predicting classes or class probabilities for classification models. Example functions are found in ldaBag, plsBag, nbBag, svmBag and nnetBag.
downSample     a logical: for classification, should the data set be randomly sampled so that each class has the same number of samples as the smallest class?
oob            a logical: should out-of-bag statistics be computed and the predictions retained?
allowParallel  if a parallel backend is loaded and available, should the function use it?
vars           an integer. If this argument is not NULL, a random sample of size vars is taken of the predictors in each bagging iteration. If NULL, all predictors are used.
object         an object of class bag.
newdata        a matrix or data frame of samples for prediction. Note that this argument must have a non-null value

Details

The function is basically a framework where users can plug in any model to assess the effect of bagging. Example functions can be found in ldaBag, plsBag, nbBag, svmBag and nnetBag. Each has elements fit, pred and aggregate.

One note: when vars is not NULL, the sub-setting occurs before the fit and predict functions are called. In this way, the user probably does not need to account for the change in predictors in their functions.

When using bag with train, classification models should use type = "prob" inside of the predict function so that predict.train(object, newdata, type = "prob") will work.

If a parallel backend is registered, the foreach package is used to train the models in parallel.

Value

bag produces an object of class bag with elements

fits           a list with two sub-objects: the fit object has the actual model fit for that bagged sample and the vars object is either NULL or a vector of integers corresponding to which predictors were sampled for that model
control        a mirror of the arguments passed into bagControl
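The fit, pred and aggregate roles described above can be sketched in base R with a linear model. This is a simplified illustration, not the package's code: simple_bag, fit_fun, pred_fun and agg_fun are hypothetical names, the data are simulated, and out-of-bag statistics and down-sampling are left out.

```r
set.seed(1)
x <- data.frame(a = rnorm(60), b = rnorm(60))
y <- 2 * x$a - x$b + rnorm(60, sd = 0.1)

# fit: takes x, y and ... and returns a model object.
fit_fun <- function(x, y, ...) lm(y ~ ., data = cbind(x, y = y))

# pred: takes a fitted object and new predictors x.
pred_fun <- function(object, x) predict(object, newdata = x)

# aggregate: reduces the list of per-model predictions to one per sample.
agg_fun <- function(x, type = "raw") rowMeans(do.call(cbind, x))

simple_bag <- function(x, y, B = 10, vars = NULL) {
  lapply(seq_len(B), function(i) {
    rows <- sample(nrow(x), replace = TRUE)   # bootstrap sample of cases
    cols <- if (is.null(vars)) seq_along(x) else sort(sample(ncol(x), vars))
    # Predictor sub-setting happens before the fit function is called.
    list(fit = fit_fun(x[rows, cols, drop = FALSE], y[rows]), vars = cols)
  })
}

fits   <- simple_bag(x, y, B = 10)
preds  <- lapply(fits, function(m) pred_fun(m$fit, x[, m$vars, drop = FALSE]))
bagged <- agg_fun(preds)
```

Because the predictor subset is stored next to each fit, the prediction step can apply the same columns without the fit or pred functions knowing that any sub-setting took place.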

bagEarth

Usage

    ## S3 method for class 'formula'
    bagEarth(formula, data = NULL, B = 50, summary = mean, keepX = TRUE,
             ..., subset, weights, na.action = na.omit)

    ## Default S3 method:
    bagEarth(x, y, weights = NULL, B = 50, summary = mean, keepX = TRUE, ...)

Arguments

formula      A formula of the form y ~ x1 + x2 + ...
x            matrix or data frame of x values for examples.
y            matrix or data frame of numeric values outcomes.
weights      (case) weights for each example - if missing defaults to 1.
data         Data frame from which variables specified in formula are preferentially to be taken.
subset       An index vector specifying the cases to be used in the training sample. (NOTE: If given, this argument must be named.)
na.action    A function to specify the action to be taken if NAs are found. The default action is for the procedure to fail. An alternative is na.omit, which leads to rejection of cases with missing values on any required variable. (NOTE: If given, this argument must be named.)
B            the number of bootstrap samples
summary      a function with a single argument specifying how the bagged predictions should be summarized
keepX        a logical: should the original training data be kept?
...          arguments passed to the earth function

Details

The function computes an Earth model for each bootstrap sample.

Value

A list with elements

fit          a list of B Earth fits
B            the number of bootstrap samples
call         the function call
x            either NULL or the value of x, depending on the value of keepX
oob          a matrix of performance estimates for each bootstrap sample

Author(s)

Max Kuhn (bagEarth.formula is based on Ripley's nnet.formula)

bagFDA

na.action    A function to specify the action to be taken if NAs are found. The default action is for the procedure to fail. An alternative is na.omit, which leads to rejection of cases with missing values on any required variable. (NOTE: If given, this argument must be named.)
B            the number of bootstrap samples
keepX        a logical: should the original training data be kept?
...          arguments passed to the mars function

Details

The function computes an FDA model for each bootstrap sample.

Value

A list with elements

fit          a list of B FDA fits
B            the number of bootstrap samples
call         the function call
x            either NULL or the value of x, depending on the value of keepX
oob          a matrix of performance estimates for each bootstrap sample

Author(s)

Max Kuhn (bagFDA.formula is based on Ripley's nnet.formula)

References

J. Friedman, Multivariate Adaptive Regression Splines (with discussion) (1991). Annals of Statistics, 19/1,

See Also

fda, predict.bagFDA

Examples

    library(mlbench)
    library(earth)
    data(Glass)

    set.seed(36)
    ## Hold out all but 150 randomly chosen rows for testing
    inTrain <- sample(1:dim(Glass)[1], 150)

    trainData <- Glass[ inTrain, ]
    testData <- Glass[-inTrain, ]

    baggedFit <- bagFDA(Type ~ ., trainData)
    ## Column 10 of Glass is the outcome, Type
    confusionMatrix(predict(baggedFit, testData[, -10]),
                    testData[, 10])


BloodBrain    Blood Brain Barrier Data

Description

Mente and Lombardo (2005) develop models to predict the log of the ratio of the concentration of a compound in the brain and the concentration in blood. For each compound, they computed three sets of molecular descriptors: MOE 2D, rule-of-five and Charge Polar Surface Area (CPSA). In all, 134 descriptors were calculated. Included in this package are 208 non-proprietary literature compounds. The vector logBBB contains the concentration ratio and the data frame bbbDescr contains the descriptor values.

Usage

    data(BloodBrain)

Value

bbbDescr     data frame of chemical descriptors
logBBB       vector of assay results

Source

Mente, S.R. and Lombardo, F. (2005). A recursive-partitioning model for blood-brain barrier permeation, Journal of Computer-Aided Molecular Design, Vol. 19, pg


BoxCoxTrans.default    Box-Cox and Exponential Transformations

Description

These classes can be used to estimate transformations and apply them to existing and future data.

Usage

    BoxCoxTrans(y, ...)
    expoTrans(y, ...)

    ## Default S3 method:
    BoxCoxTrans(y, x = rep(1, length(y)), fudge = 0.2, numUnique = 3,
                na.rm = FALSE, ...)

    ## Default S3 method:
    expoTrans(y, na.rm = TRUE, init = 0, lim = c(-4, 4), method = "Brent",
              numUnique = 3, ...)

    ## S3 method for class 'BoxCoxTrans'
    predict(object, newdata, ...)

    ## S3 method for class 'expoTrans'
    predict(object, newdata, ...)

Arguments

y            a numeric vector of data to be transformed. For BoxCoxTrans, the data must be strictly positive.
x            an optional dependent variable to be used in a linear model.
fudge        a tolerance value: lambda values within +/-fudge will be coerced to 0 and within 1+/-fudge will be coerced to 1.
numUnique    how many unique values should y have to estimate the transformation?
na.rm        a logical value indicating whether NA values should be stripped from y and x before the computation proceeds.
init, lim, method
             initial values, limits and optimization method for optim.
...          for BoxCoxTrans: options to pass to boxcox. plotit should not be passed through. For predict.BoxCoxTrans, additional arguments are ignored.
object       an object of class BoxCoxTrans or expoTrans.
newdata      a numeric vector of values to transform.

Details

The BoxCoxTrans function is basically a wrapper for the boxcox function in the MASS library. It can be used to estimate the transformation and apply it to new data.

expoTrans estimates the exponential transformation of Manly (1976) but assumes a common mean for the data. The transformation parameter is estimated by directly maximizing the likelihood.

If any(y <= 0) or if length(unique(y)) < numUnique, lambda is not estimated and no transformation is applied.
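The effect of the fudge argument can be illustrated with a small base R sketch. It assumes the textbook Box-Cox formula and an inclusive tolerance; coerce_lambda and box_cox are hypothetical helpers written for this illustration, and the package's own handling (for example of lambda = 1) may differ.

```r
# Hypothetical helper: snap an estimated lambda to 0 or 1 when it lies
# within +/- fudge of either value (inclusive bounds are an assumption).
coerce_lambda <- function(lambda, fudge = 0.2) {
  if (abs(lambda) <= fudge) return(0)
  if (abs(lambda - 1) <= fudge) return(1)
  lambda
}

# Textbook Box-Cox transform for strictly positive y.
box_cox <- function(y, lambda) {
  stopifnot(all(y > 0))
  if (lambda == 0) log(y) else (y^lambda - 1) / lambda
}

y   <- c(1, 2, 4, 8)
lam <- coerce_lambda(0.15)   # within fudge of 0, so treated as 0
z   <- box_cox(y, lam)       # same as log(y)
```

Snapping lambda to 0 or 1 trades a small amount of fit for a transformation that is easier to interpret: a plain log, or an essentially linear rescaling.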

calibration    Probability Calibration Plot

Description

For classification models, this function creates a calibration plot that describes how consistent model probabilities are with observed event rates.

Usage

    calibration(x, ...)

    ## S3 method for class 'formula'
    calibration(x, data = NULL, class = NULL, cuts = 11, subset = TRUE,
                lattice.options = NULL, ...)

    ## S3 method for class 'calibration'
    xyplot(x, data, ...)

    panel.calibration(...)

Arguments

x            a lattice formula (see xyplot for syntax) where the left-hand side of the formula is a factor class variable of the observed outcome and the right-hand side specifies one or more columns corresponding to a numeric ranking variable for a model (e.g. class probabilities). The classification variable should have two levels.
data         For calibration.formula, a data frame (or more precisely, anything that is a valid envir argument in eval, e.g., a list or an environment) containing values for any variables in the formula, as well as groups and subset if applicable. If not found in data, or if data is unspecified, the variables are looked for in the environment of the formula. This argument is not used for xyplot.calibration.
class        a character string for the class of interest
cuts         If a single number, this indicates the number of splits of the data used to create the plot. By default, it uses as many cuts as there are rows in data. If a vector, these are the actual cuts that will be used.
subset       An expression that evaluates to a logical or integer indexing vector. It is evaluated in data. Only the resulting rows of data are used for the plot.
lattice.options
             A list that could be supplied to lattice.options
...          options to pass through to xyplot or the panel function (not used in calibration.formula).

Details

calibration.formula is used to process the data and xyplot.calibration is used to create the plot.

To construct the calibration plot, the following steps are used for each model:

1. The data are split into cuts - 1 roughly equal groups by their class probabilities
2. the number of samples with true results equal to class are determined
3. the event rate is determined for each bin

xyplot.calibration produces a plot of the observed event rate by the mid-point of the bins.

This implementation uses the lattice function xyplot, so plot elements can be changed via panel functions, trellis.par.set or other means. calibration uses the panel function panel.calibration by default, but it can be changed by passing that argument into xyplot.calibration.

The following elements are set by default in the plot but can be changed by passing new values into xyplot.calibration: xlab = "Bin Midpoint", ylab = "Observed Event Percentage", type = "o", ylim = extendrange(c(0, 100)), xlim = extendrange(c(0, 100)) and panel = panel.calibration

Value

calibration.formula returns a list with elements:

data         the data used for plotting
cuts         the number of cuts
class        the event class
probNames    the names of the model probabilities

xyplot.calibration returns a lattice object

Author(s)

Max Kuhn, some lattice code and documentation by Deepayan Sarkar

See Also

xyplot, trellis.par.set

Examples

    ## Not run:
    data(mdrr)
    mdrrDescr <- mdrrDescr[, -nearZeroVar(mdrrDescr)]
    mdrrDescr <- mdrrDescr[, -findCorrelation(cor(mdrrDescr), .5)]

    inTrain <- createDataPartition(mdrrClass)
    trainX <- mdrrDescr[inTrain[[1]], ]
    trainY <- mdrrClass[inTrain[[1]]]
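Separately from the package example above, the three binning steps listed in the calibration Details can be sketched in base R. This is an illustration under assumptions, not the package's code: the probabilities and outcomes are simulated, the bins are taken to be equal-width intervals on [0, 1], and the class of interest is labelled "event".

```r
set.seed(2)
prob  <- runif(200)                          # hypothetical P(event) per sample
truth <- factor(ifelse(runif(200) < prob, "event", "other"))
cuts  <- 11

# 1. Split the data into cuts - 1 bins by class probability.
breaks <- seq(0, 1, length.out = cuts)
bin    <- cut(prob, breaks = breaks, include.lowest = TRUE)

# 2. Count the samples in each bin whose true class is the class of interest.
events <- tapply(truth == "event", bin, sum)
n      <- tapply(truth == "event", bin, length)

# 3. Event rate per bin, paired with the bin midpoint (both in percent).
calib <- data.frame(
  midpoint = 100 * (head(breaks, -1) + tail(breaks, -1)) / 2,
  percent  = 100 * as.numeric(events) / as.numeric(n)
)
```

For a well-calibrated model the percent column tracks the midpoint column, so plotting one against the other gives points close to the diagonal.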

caretFuncs

maximize     a logical; should the metric be maximized?
tol          a scalar to denote the acceptable difference in optimal performance (see Details below)
y            a list of data frames with variables Overall and var
size         an integer for the number of variables to retain

Details

This page describes the functions that are used in backwards selection (aka recursive feature elimination). The functions described here are passed to the algorithm via the functions argument of rfeControl. See rfeControl for details on how these functions should be defined.

The pick functions are used to find the appropriate subset size for different situations. pickSizeBest will find the position associated with the numerically best value (see the maximize argument to help define this).

pickSizeTolerance picks the lowest position (i.e. the smallest subset size) that has no more than an X percent loss in performance. When maximizing, it calculates (O-X)/O*100, where X is the set of performance values and O is max(X). This is the percent loss. When X is to be minimized, it uses (X-O)/O*100 (so that values greater than O have a positive "loss"). The function finds the smallest subset size that has a percent loss less than tol.

Both of the pick functions assume that the data are sorted from smallest subset size to largest.

Author(s)

Max Kuhn

See Also

rfeControl, rfe

Examples

    ## For picking subset sizes:
    ## Minimize the RMSE
    example <- data.frame(RMSE = c(1.2, 1.1, 1.05, 1.01, 1.01, 1.03, 1.00),
                          Variables = 1:7)
    ## Percent Loss in performance (positive)
    example$PctLoss <- (example$RMSE - min(example$RMSE))/min(example$RMSE)*100

    xyplot(RMSE ~ Variables, data = example)
    xyplot(PctLoss ~ Variables, data = example)

    absoluteBest <- pickSizeBest(example, metric = "RMSE", maximize = FALSE)
    within5Pct <- pickSizeTolerance(example, metric = "RMSE", maximize = FALSE)

    cat("numerically optimal:",
        example$RMSE[absoluteBest],
        "RMSE in position",
        absoluteBest, "\n")
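The percent-loss rule from the Details can also be written out directly in base R, which makes the arithmetic easy to check by hand. The function below is a hypothetical re-implementation for illustration only, and tol = 1.5 is simply a value chosen for this sketch.

```r
# Hypothetical re-implementation of the tolerance rule, for illustration.
pick_within_tolerance <- function(perf, tol = 1.5, maximize = FALSE) {
  best <- if (maximize) max(perf) else min(perf)
  loss <- if (maximize) {
    (best - perf) / best * 100
  } else {
    (perf - best) / best * 100
  }
  # Values are assumed sorted from smallest subset size to largest,
  # so the first qualifying position is the smallest subset.
  which(loss < tol)[1]
}

rmse <- c(1.2, 1.1, 1.05, 1.01, 1.01, 1.03, 1.00)
pos  <- pick_within_tolerance(rmse, tol = 1.5, maximize = FALSE)
```

With these RMSE values the percent losses are roughly 20, 10, 5, 1, 1, 3 and 0, so a tolerance of 1.5 selects position 4, while the numerically best value sits at position 7.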

Details

More details on these functions can be found at html#filter.

This page documents the functions that are used in selection by filtering (SBF). The functions described here are passed to the algorithm via the functions argument of sbfControl. See sbfControl for details on how these functions should be defined.

anovaScores and gamScores are two examples of univariate filtering functions. anovaScores fits a simple linear model between a single feature and the outcome, then the p-value for the whole model F-test is returned. gamScores fits a generalized additive model between a single predictor and the outcome using a smoothing spline basis function. A p-value is generated using the whole model test from summary.gam and is returned.

If a particular model fails for lm or gam, a p-value of 1 is returned.

Author(s)

Max Kuhn

See Also

sbfControl, sbf, summary.gam


cars    Kelly Blue Book resale data for 2005 model year GM cars

Description

Kuiper (2008) collected data on Kelly Blue Book resale data for 804 GM cars (2005 model year).

Usage

    data(cars)

Value

cars         data frame of the suggested retail price (column Price) and various characteristics of each car (columns Mileage, Cylinder, Doors, Cruise, Sound, Leather, Buick, Cadillac, Chevy, Pontiac, Saab, Saturn, convertible, coupe, hatchback, sedan and wagon)

Source

Kuiper, S. (2008). Introduction to Multiple Regression: How Much Is Your Car Worth?, Journal of Statistics Education, Vol. 16, html
