NOTE: Simple cross-validation was proposed by Kurtz (1948) as a remedy for the Rorschach test, a personality test criticized by psychometricians for its lack of common psychometric attributes such as data normality. Building on Kurtz's simple cross-validation, Mosier (1951) developed double cross-validation, which Krus and Fuller (1982) later extended to multicross-validation.

The purpose is to find out whether the results are replicable or merely a matter of random fluctuation, i.e., to verify the replicability of the results. The sample is randomly divided into 2 or more subsets, and the test results are validated by comparing the sub-samples. There are several categories: simple, double, and multiple.

* Simple cross-validation. Take regression as an example. In simple cross-validation, the first sub-sample is usually used for deriving the regression equation, while another sub-sample is used for generating predicted scores from that equation. Next, the cross-validity coefficient is computed by correlating the predicted scores and the observed scores on the outcome variable.

* Double cross-validation. Double cross-validation goes a step further than its simple counterpart. Taking regression as an example again, in double cross-validation regression equations are generated in both sub-samples, and then both equations are used to generate predicted scores and cross-validity coefficients.

* Multicross-validation. Multicross-validation is an extension of double cross-validation. In this form of cross-validation, double cross-validation procedures are repeated many times by randomly selecting sub-samples from the data set. In the context of regression analysis, beta weights computed in each sub-sample are used to predict the outcome variable in the corresponding sub-sample. Next, the observed and predicted scores of the outcome variable in each sub-sample are used to compute the cross-validity coefficient.

Problems: weak with small sample sizes. Ang (1998) argued that cross-validation is problematic because splitting an already small sample increases the risk that the beta weights are just artifacts of the sub-sample. Thus, Ang endorsed use of the jackknife, which will be discussed next.
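The simple cross-validation recipe above can be sketched in R. The simulated data frame `d` and its variables `x` and `y` are illustrative assumptions, not data from any particular study:

```r
# Simple cross-validation for a regression model (sketch).
# The data frame `d` below is simulated purely for illustration.
set.seed(1)
d <- data.frame(x = rnorm(100))
d$y <- 2 + 3 * d$x + rnorm(100)

# Randomly split the sample into two halves
idx   <- sample(nrow(d), nrow(d) / 2)
train <- d[idx, ]
test  <- d[-idx, ]

# Derive the regression equation in the first sub-sample
fit <- lm(y ~ x, data = train)

# Generate predicted scores in the second sub-sample
pred <- predict(fit, newdata = test)

# Cross-validity coefficient: correlate predicted and observed scores
cross_validity <- cor(pred, test$y)
```

Double cross-validation would simply repeat the fit/predict steps with the roles of `train` and `test` exchanged, giving a second cross-validity coefficient.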

Example: regression analysis
* Leaving out the ith pair of points (Xi, Yi), use least squares methods to find parameter estimates a(-i) and b(-i).
* Then find the best predictor of the left-out Yi: Yhat(-i) = a(-i) + b(-i)*Xi.
* Repeat the above for each point and calculate the cross-validation score, or estimated predictive error: (1/n)*sum(Yi - Yhat(-i))^2.
* If we also calculate the cross-validation score for the log(X) model, we can again use predictive error as a basis to choose between the models.
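The leave-one-out steps above can be sketched in R. The simulated `x` and `y` and the helper name `loocv_score` are illustrative assumptions:

```r
# Leave-one-out cross-validation score for simple linear regression.
# Simulated data, for illustration only: the true relation is linear.
set.seed(2)
x <- runif(50, 1, 10)
y <- 1 + 0.5 * x + rnorm(50, sd = 0.3)

loocv_score <- function(x, y) {
  pred <- numeric(length(x))
  for (i in seq_along(x)) {
    fit <- lm(y[-i] ~ x[-i])                        # fit without the ith pair
    pred[i] <- coef(fit)[1] + coef(fit)[2] * x[i]   # predict the left-out Yi
  }
  mean((y - pred)^2)                                # cross-validation score
}

cv_linear <- loocv_score(x, y)        # linear model
cv_log    <- loocv_score(log(x), y)   # competing log(X) model
```

Since the data were generated from a linear relation, `cv_linear` should come out smaller than `cv_log`, which is exactly the model-choice use of predictive error described above.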

Algorithm (jackknife): Given a data set of n observations, we create n new datasets, each of size n - 1, by leaving out one point at a time.
* Suppose we have a sample of observations X = {X1, ..., Xn} that we will use to estimate a parameter Theta.
* Suppose that T = T(X) is an estimator of Theta.
* Draw n jackknife samples X-1, X-2, ..., X-n by leaving out the first, second, ..., nth observation respectively.
* Calculate T-i = T(X-i) for each i.
* Calculate T* = (1/n)*sum(T-i).
* The jackknife estimate of variance is ((n-1)/n)*sum(T-i - T*)^2.
* The jackknife estimate of bias is (n-1)(T* - T), which, when subtracted from T, leads to the bias-corrected estimate nT - (n-1)T*.
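A minimal R sketch of this jackknife algorithm, using the sample mean as the statistic T (an illustrative choice; any estimator could be substituted):

```r
# Jackknife estimates of variance and bias for a statistic T,
# following the algorithm above. Here T = the sample mean (illustrative).
set.seed(3)
X <- rnorm(30, mean = 5)
n <- length(X)
T_full <- mean(X)

# T-i computed on each leave-one-out sample
T_i    <- sapply(seq_len(n), function(i) mean(X[-i]))
T_star <- mean(T_i)

jack_var    <- ((n - 1) / n) * sum((T_i - T_star)^2)   # jackknife variance
jack_bias   <- (n - 1) * (T_star - T_full)             # jackknife bias
T_corrected <- n * T_full - (n - 1) * T_star           # bias-corrected estimate
```

For the mean, the jackknife variance reduces exactly to var(X)/n and the bias estimate is zero, which makes this a handy sanity check before plugging in a less trivial statistic.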

Algorithm (bootstrap):
* Suppose we have a sample of observations X = {X1, ..., Xn} that we want to use to estimate a parameter Theta.
* Suppose also that we have derived T(X), an estimator of Theta, and want to evaluate its sampling variance.
* Draw a bootstrap sample X1 of size n by sampling with replacement from X.
* Calculate T1 = T(X1).
* Repeat the resampling b times to get T1, T2, ..., Tb.
* Calculate the mean of the bootstrapped statistics: meanT = (1/b)*sum(Ti).
* Finally, calculate the bootstrap estimate of variance: (1/(b-1))*sum(Ti - meanT)^2.
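A minimal R sketch of this bootstrap algorithm, again using the sample mean as the statistic (an illustrative choice):

```r
# Bootstrap estimate of the sampling variance of a statistic,
# following the algorithm above. Here T = the sample mean (illustrative).
set.seed(4)
X <- rnorm(100, mean = 0, sd = 2)
b <- 2000   # number of bootstrap resamples

# Draw b bootstrap samples (with replacement) and compute T on each
T_boot <- replicate(b, mean(sample(X, replace = TRUE)))

meanT    <- mean(T_boot)
boot_var <- (1 / (b - 1)) * sum((T_boot - meanT)^2)
# For the mean, boot_var should be close to var(X) / length(X)
```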

EXAMPLES

##############################################################################
# Resample data
##############################################################################
# Based on http://www.zoology.ubc.ca/~schluter/zoo502stats/Rtips.resample.html

# Let's assume that the data are a sample of measurements for a single
# variable stored in a vector "x". The data may be numeric or categorical.
# A single bootstrap replicate is obtained as follows. The replace option is
# used to indicate that sampling is carried out with replacement.
xboot = sample(x, replace = TRUE)
# Calculate the statistic of interest (for example, the mean) on the resampled
# data in "xboot" and store the result in a vector created for this purpose.
z = vector()        # initialize z (do only once)
z[1] = mean(xboot)  # first bootstrap replicate estimate
# Repeat steps (1) and (2) many times. The result will be a large collection
# of bootstrap replicate estimates for subsequent analysis.

# In other cases, two or more variables are measured on individuals (e.g.,
# stem height, leaf area, petal diameter, etc.). Assume that each row of a
# data frame "mydata" is a different individual, and each column a different
# variable.
# To resample individuals (i.e., rows):
iboot = sample(1:nrow(mydata), replace = TRUE)
bootdata = mydata[iboot, ]
# The data frame "bootdata" will contain a single bootstrap replicate
# including all the variables.
# Calculate the statistic of interest on the resampled data and store the
# result in a vector created for this purpose. For example, to calculate the
# correlation between two variables x and y in "bootdata":
z = vector()        # initialize z (do only once)
z[1] = cor(bootdata$x, bootdata$y)
# Repeat steps (1) and (2) many times. The result will be a large collection
# of bootstrap replicate estimates for subsequent analysis.

# Based on http://www.zoology.ubc.ca/~schluter/zoo502stats/Rtips.resample.html

### Randomization test
# A randomization test uses resampling and the computer to generate a null
# distribution for a test statistic. The test statistic is a measure of
# association between two variables or difference between groups. Each
# randomization step involves randomly resampling without replacement the
# values of one of the two variables in the data. The two variables may be
# categorical (character or factor), numeric, or one of each.

## Both variables categorical
# R has a built-in randomization procedure for a contingency test of
# association when both variables are categorical (call them A1 and A2). To
# apply it, execute the usual command for the chi-squared contingency test,
# but set the "simulate.p.value" option to TRUE. The number of replicates in
# the randomization is set by the option "B" (default is 2000). Each
# randomization rearranges the values in the contingency table while keeping
# all the row and column totals fixed to their observed values.
chisq.test(A1, A2, simulate.p.value = TRUE, B = 5000)

## Data are numeric
# If one or both of the variables is numeric, then you will need to create a
# short loop to carry out the resampling necessary for the randomization test.
# Choose one of the two variables to resample (call it "x"). It doesn't matter
# which of the two variables you choose. Keep the other variable (call it "y")
# unchanged (there is no benefit to resampling both variables).
# Resample "x" without replacement to create a new vector (call it x1).
x1 = sample(x, replace = FALSE)
# Calculate the test statistic to measure association between y and the
# randomized variable x1. Store the result in a vector created for this
# purpose. For example, to calculate the correlation between the two variables:
z = vector()       # initialize z (do only once)
z[1] = cor(x1, y)  # first randomization result
# Repeat steps (1) and (2) many times. The result will be a large collection
# of replicates representing the null distribution of your test statistic.
# Use this null distribution to calculate an approximate P-value for your
# test. This involves calculating the proportion of values in the null
# distribution that are as extreme or more extreme than the observed value of
# the test statistic (calculated from the real data). Whether this calculation
# must take account of just one or both tails of the null distribution depends
# on your test statistic.

## Based on http://www.burns-stat.com/pages/Tutor/bootstrap_resampling.html
# The idea: Permutation tests are restricted to the case where the null
# hypothesis really is null -- that is, that there is no effect. If changing
# the order of the data destroys the effect (whatever it is), then a random
# permutation test can be done. The test checks whether the statistic for the
# actual data is unusual relative to the distribution of the statistic for
# permuted data.
# Our example permutation test is to test volatility clustering of the S&P
# returns. Below is an R function that computes the statistic for Engle's
# ARCH test.
engle.arch.test = function(x, order = 10) {
  xsq  = x^2
  nobs = length(x)
  inds = outer(0:(nobs - order - 1), order:1, "+")
  xmat = xsq[inds]
  dim(xmat) = dim(inds)
  xreg = lm(xsq[-1:-order] ~ xmat)
  summary(xreg)$r.squared * (nobs - order)
}
# All you need to know is that the function returns the test statistic and
# that a big value means there is volatility clustering, but here is an
# explanation of it if you are interested. The test does a regression with
# the squared returns as the response and some number of lags (most recent
# previous data) of the squared returns as explanatory variables. (An estimate
# of the mean of the returns is generally removed first, but this has little
# impact in practice.) If the last few squared returns have power to predict
# tomorrow's squared return, then there must be volatility clustering. The
# tricky part of the function is the line that creates inds. This object is a
# matrix of the desired indices of the squared returns for the matrix of
# explanatory variables. Once the explanatory matrix is created, the
# regression is performed and the desired statistic is returned. The default
# number of lags to use is 10.
# A random permutation test compares the value of the test statistic for the
# original data to the distribution of test statistics when the data are
# permuted. We do this below for our example.
spx.arch.perm = numeric(1000)
for (i in 1:1000) {
  spx.arch.perm[i] = engle.arch.test(sample(spxret))
}
hist(spx.arch.perm, col = 'yellow')
abline(v = engle.arch.test(spxret), col = 'blue')
# The simplest way to get a random permutation in R is to put your data vector
# as the only argument to the sample function. The call:
sample(spxret)
# is equivalent to:
sample(spxret, size = length(spxret), replace = FALSE)

# A simple calculation for the p-value of the permutation test is to count the
# number of statistics from permuted data that exceed the statistic for the
# original data and then divide by the number of permutations performed. In R
# a succinct form of this calculation is:
mean(spx.arch.perm >= engle.arch.test(spxret))
# In my case I get 0.016. That is, 16 of the 1000 permutations produced a test
# statistic larger than the statistic from the real data. A test using 100,000
# permutations gave a value of 0.0111.

# To do a leave-one-out jackknife estimate for the mean of the data,
# ?jackknife gives an example. Having a look at the jackknife function, we see
# that it demands two parameters: x and theta. x is supposed to contain the
# data, and theta the function that is applied to the data.

# Bootstrap
# Now I want to program the estimation for one coefficient of a linear
# regression model.
DF = as.data.frame(matrix(rnorm(250), ncol = 5))  # my data
model.lm = formula(V1 ~ V2 + V3 + V4)             # my model

# Now I need to specify the theta function. Here x is not the data itself but
# is used as the row index vector to select a subset from the data frame
# (xdata). Also the coefficient to be returned is specified.
theta = function(x, xdata, coefficient) {
  coef(lm(model.lm, data = xdata[x, ]))[coefficient]
}
# So now at each leave-one-out run the model is calculated with a subset
# defined by the vector x (here one row is left out) and one coefficient is
# returned:
results = jackknife(1:50, theta, xdata = DF, coefficient = "(Intercept)")

# To expand this code to the estimation of all the regression coefficients is
# only a small step now. As the theta function is supposed to return a scalar
# and not a list of estimates for each coefficient, the following workaround
# is used: the sapply function calls the jackknife function four times,
# prompting a different parameter estimate at each run. The prompted
# coefficient is passed on to the jackknife function by the three-dot (...)
# argument.
# The following function calculates all the coefficients:
jackknife.apply = function(x, xdata, coefs) {
  sapply(coefs,
         function(coefficient) jackknife(x, theta, xdata = xdata,
                                         coefficient = coefficient),
         simplify = FALSE)
}
# now jackknife.apply() can be called
results = jackknife.apply(1:50, DF, c("(Intercept)", "V2", "V3", "V4"))

# Based on http://www.burns-stat.com/pages/Tutor/bootstrap_resampling.html
# The idea: Models should be tested with data that were not used to fit the
# model. If you have enough data, it is best to hold back a random portion of
# the data to use for testing. Cross validation is a trick to get
# out-of-sample tests but still use all the data. The sleight of hand is to do
# a number of fits, each time leaving out a different portion of the data.
# Cross validation is perhaps most often used in the context of prediction.
# Everyone wants to predict the stock market. So let's do it.
# Below we predict tomorrow's IBM return with a quadratic function of today's
# IBM and S&P returns.
predictors = cbind(spxret, ibmret, spxret^2, ibmret^2, spxret * ibmret)[-251, ]
predicted = ibmret[-1]
predicted.lm = lm(predicted ~ predictors)
# The p-value for this regression is 0.048, so it is good enough to publish in
# a journal. However, you might want to hold off putting money on it until we
# have tested it a bit more.
# In cross validation we divide the data into some number of groups. Then for
# each group we fit the model with all of the data that are not in the group,
# and test that fit with the data that are in the group. Below we divide the
# data into 5 groups.
group = rep(1:5, length = 250)
group = sample(group)
mse.group = numeric(5)
for (i in 1:5) {
  group.lm = lm(predicted[group != i] ~ predictors[group != i, ])
  mse.group[i] = mean((predicted[group == i] -
                       cbind(1, predictors[group == i, ]) %*% coef(group.lm))^2)
}
# The first command above repeats the numbers 1 through 5 to get a vector of
# length 250. We do not want to use this as our grouping because we may be
# capturing systematic effects: in our case a group would fall on one
# particular day of the week until a holiday interrupted the pattern. Hence
# the next line permutes the vector so we have random assignment, but still an
# equal number of observations in each group.
# The for loop estimates each of the five models and computes the
# out-of-sample mean squared error.
# The mean squared error of the predicted vector taking only its mean into
# account is 0.794. The in-sample mean squared error from the regression is
# 0.759. This is a modest improvement, but in the context of market prediction
# it might be of use if the improvement were real. However, the cross
# validation mean squared error is 0.800 -- even higher than from the constant
# model. This is evidence of overfitting (which this regression model surely
# is doing).

There are a number of R packages that are either confined to or touch upon bootstrapping or its relatives. These include:
* boot: This package incorporates quite a wide variety of bootstrapping tricks.
* bootstrap: A package of relatively simple functions for bootstrapping and related techniques.
* coin: A package for permutation tests (which are discussed below).
* Design: This package includes bootcov for bootstrapping the covariance of estimated regression parameters and validate for testing the quality of fitted models by cross validation or bootstrapping.
* MChtest: This package is for Monte Carlo hypothesis tests, that is, tests using some form of resampling. This includes code for sampling rules where the number of samples taken depends on how certain the result is.
* meboot: Provides a method of bootstrapping a time series.
* permtest: A package containing a function for permutation tests of microarray data.
* resper: A package for doing restricted permutations.
* scaleboot: This package produces approximately unbiased hypothesis tests via bootstrapping.
* simpleboot: A package of a few functions that perform (or present) bootstraps in simple situations, such as one and two samples, and linear regression.

There are a large number of R packages that include bootstrapping. Examples include multtest, which has the boot.resample function, and Matching, which has a function for a bootstrapped Kolmogorov-Smirnov test (of the equality of two probability distributions).

##############################################################################
# Based on http://www.zoology.ubc.ca/~schluter/zoo502stats/Rtips.resample.html
library(boot)

### Single variable
# To use the boot package you will need to write a function to calculate the
# statistic of interest. The format is illustrated below for the sample mean,
# but any univariate function would be handled similarly. We'll call our
# function "boot.mean". When you have finished writing the script for a
# function you will need to cut and paste it into the command window so that R
# can access it (you'll need to do this just once in an R session). Here, "x"
# refers to the vector of data. "i" refers to a vector of indices, which must
# be included as an argument in the function and employed as shown.
boot.mean = function(x, i) {
  boot.mean = mean(x[i])
}
# The command "boot" will automatically carry out all the resampling and
# computations required. For this example, "x" is the vector of original data
# and "boot.mean" is the name of the function we created above to calculate
# the statistic of interest. R specifies the number of bootstrap replicate
# estimates desired.
z = boot(x, boot.mean, R = 2000)
# The resulting object (here named "z") is a boot object containing all the
# results. Use the following additional commands to pull out the results.
print(z)                   # Bootstrap calculation of bias and SE
sd(z$t)                    # Another way to get the standard error
hist(z$t)                  # Histogram of bootstrap replicate estimates
qqnorm(z$t)                # Normal quantiles of replicate estimates
boot.ci(z, type = "bca")   # 95% confidence interval using BCa
boot.ci(z, type = "perc")  # Same using percentile method

### Two or more variables
# Here's how you would use the boot command if the statistic of interest must
# be calculated from two (or more) variables (for example, a correlation,
# regression slope, or odds ratio). Assume that there are two variables of
# interest, "x" and "y". If not done already, put the two variables into a
# data frame (here called "mydata"):
mydata = cbind.data.frame(x, y, stringsAsFactors = FALSE)
# Then create a function to calculate the statistic of interest on the
# variables in "mydata". For example, to create a function that calculates
# the correlation coefficient between the two variables "x" and "y", use
boot.cor = function(mydata, i) {
  x = mydata$x[i]
  y = mydata$y[i]
  boot.cor = cor(x, y)
}
# Here, "i" refers to a vector of indices, which must be included as an
# argument in the function and employed as shown.
# Finally, pass your data frame and function to the boot command:
z = boot(mydata, boot.cor, R = 2000)
# See the previous section for a list of commands to pull results from the
# boot object (here named "z").

## Based on http://www.ats.ucla.edu/stat/R/library/bootstrap.htm
# Calculating the standard error of the median.
# Creating the data set by taking 100 observations from a normal distribution
# with mean 5 and stdev 3; we have rounded each observation to the nearest
# integer.
data = round(rnorm(100, 5, 3))
data[1:10]
#  [1] 6 3 3 4 3 8 2 2 3 2
# Obtaining 20 bootstrap samples and displaying the first of them:
resamples = lapply(1:20, function(i) sample(data, replace = TRUE))
resamples[1]
# Calculating the median for each bootstrap sample:
r.median = sapply(resamples, median)
r.median
#  [1] 4.0 4.5 4.0 5.0 4.0 5.0 5.0 5.0 5.0 4.0 5.0 5.0 5.0 5.0 5.0 4.0 5.0 5.0
# [19] 6.0 5.0
# Calculating the standard deviation of the distribution of medians:
sqrt(var(r.median))
# [1] 0.5250313
# Displaying the histogram of the distribution of the medians:
hist(r.median)

# We can put all these steps into a single function where all we would need to
# specify is which data set to use and how many times we want to resample in
# order to obtain the adjusted standard error of the median. For more
# information on how to construct functions please consult the R library pages
# on introduction to functions and advanced functions.
# Function which will bootstrap the standard error of the median:
b.median = function(data, num) {
  resamples = lapply(1:num, function(i) sample(data, replace = TRUE))
  r.median = sapply(resamples, median)
  std.err = sqrt(var(r.median))
  list(std.err = std.err, resamples = resamples, medians = r.median)
}
# Generating the data to be used (same as in the above example):
data1 = round(rnorm(100, 5, 3))
# Saving the results of the function b.median in the object b1:
b1 = b.median(data1, 30)
# Displaying the first of the 30 bootstrap samples:
b1$resamples[1]
# Displaying the standard error:
b1$std.err
# [1] 0.5155477
# Displaying the histogram of the distribution of medians:
hist(b1$medians)
# We can input the data directly into the function and display the standard
# error in one line of code:
b.median(rnorm(100, 5, 2), 50)$std.err
# [1] 0.5104178
# Displaying the histogram of the distribution of medians:
hist(b.median(rnorm(100, 5, 2), 50)$medians)

### Based on http://thebiobucket.blogspot.com/2011/04/bootstrap-with-strata.html
# Here's a worked example for comparing group averages with bootstrap
# confidence intervals, allowing for different subsample sizes by calling the
# strata argument within the bootstrap function.
# The data are set up analogous to a before-after impact experiment conducted
# on plots across four different successional stages. Similarity was
# calculated for each plot's composition before and after an impact. One
# hypothesis was that certain stages would show higher / lower average
# similarities, that is, a higher / lower impact on composition. As plots
# within stages were situated within different subsites and the number of
# replicates was unbalanced, this had to be allowed for by use of the "strata"
# argument in the boot.ci call:
library(boot)
library(Hmisc)