2017年11月17日 星期五

A ConvNet arranges its neurons in three dimensions (width, height, depth), as visualized in one of the layers. Every layer of a ConvNet transforms the 3D input volume to a 3D output volume of neuron activations. In this example, the red input layer holds the image, so its width and height would be the dimensions of the image, and the depth would be 3 (Red, Green, Blue channels).An example input volume in red (e.g. a 32x32x3 CIFAR-10 image), and an example volume of neurons in the first Convolutional layer. Each neuron in the convolutional layer is connected only to a local region in the input volume spatially, but to the full depth (i.e. all color channels). Note, there are multiple neurons (5 in this example) along the depth, all looking at the same region in the input - see discussion of depth columns in text below

he neurons from the Neural Network chapter remain unchanged: They still compute a dot product of their weights with the input followed by a non-linearity, but their connectivity is now restricted to be local spatially.

Pooling layer downsamples the volume spatially, independently in each depth slice of the input volume. Left: In this example, the input volume of size [224x224x64] is pooled with filter size 2, stride 2 into output volume of size [112x112x64]. Notice that the volume depth is preserved. Right: The most common downsampling operation is max, giving rise to max pooling, here shown with a stride of 2. That is, each max is taken over 4 numbers (little 2x2 square).

2017年11月11日 星期六

In PCA, we are interested to find the directions (components) that maximize the variance in our dataset. In DataSet, 某些Feature對於每一筆X有比較大的差異, 對Y分類影響會比較明顯。若Feature 對於所有X都差不多, 則此Feature並不是作為分群(Partition)的最好選擇。

PCA(n_components=2), 會自動找出前2個主要Feature。下圖為2維, 畫出其分類(0,1,2)的狀態

==============

Residual variance ?

Then you fit a regression model. You use the regression equation to calculate a predicted score for each person.

Then you find the difference between the predicted scores and the actual scores. You calculate the variance of the set of scores. It's the residual variance. The residual variance will be less than the total variance (or if your predictors are completely useless, they will be equal).

How much variance did you explain?Explained variance = (total variance - residual variance)

The proportion of variance explained is therefore:

explained variance / total variance

If your predicted scores exactly match the outcome scores, you've perfectly predicted the scores, and you've explained all of the variance. The residuals are all zero.

(Note: The calculations are done with sums of squares, variances will give a very slightly different answer as they are usually calculated as SS/(N-1). But if the sample is large, this difference is trivial).

However, by partitioning the available data into three sets, we drastically reduce the number of samples which can be used for learning the model, and the results can depend on a particular random choice for the pair of (train, validation) sets.

A solution to this problem is a procedure called cross-validation (CV for short). A test set should still be held out for final evaluation, but the validation set is no longer needed when doing CV. In the basic approach, called k-fold CV, the training set is split into k smaller sets (other approaches are described below, but generally follow the same principles). The following procedure is followed for each of the k “folds”:

A model is trained using of the folds as training data;

the resulting model is validated on the remaining part of the data (i.e., it is used as a test set to compute a performance measure such as accuracy).

The performance measure reported by k-fold cross-validation is then the average of the values computed in the loop. This approach can be computationally expensive, but does not waste too much data (as it is the case when fixing an arbitrary test set), which is a major advantage in problem such as inverse inference where the number of samples is very small.

Size of minibatches for stochastic optimizers. If the solver is ‘lbfgs’, the classifier will not use minibatch. When set to “auto”, batch_size=min(200, n_samples)

max_iter : int, optional, default 200

Maximum number of iterations. The solver iterates until convergence (determined by ‘tol’) or this number of iterations. For stochastic solvers (‘sgd’, ‘adam’), note that this determines the number of epochs (how many times each data point will be used), not the number of gradient steps.

that is , total parameter update= (# of batch_size) * (# of epochs)

tol: float, optional, default 1e-4

Tolerance for the optimization. When the loss or score is not improving by at least tol for two consecutive iterations, unless learning_rate is set to ‘adaptive’, convergence is considered to be reached and training stops.

momentum : float, default 0.9

Momentum for gradient descent update. Should be between 0 and 1. Only used when solver=’sgd’.

early_stopping : bool, default False

Whether to use early stopping to terminate training when validation score is not improving. If set to true, it will automatically set aside 10% of training data as validation and terminate training when validation score is not improving by at least tol for two consecutive epochs. Only effective when solver=’sgd’ or ‘adam’

validation_fraction : float, optional, default 0.1

The proportion of training data to set aside as validation set for early stopping. Must be between 0 and 1. Only used if early_stopping is True

predict_proba(X) Probability estimates. The returned estimates for all classes are ordered by the label of classes.

Y=predict_proba(X) is a method of a (soft) classifier outputting the probability of the instance being in each of the classes.
Y 是一個list, 列出X 會落在Y分類中的機率值
若是二元分類, 則Y為落在0 或落在1的機率
若是多元分類, 則Y為落在各類別的機率值 (加起來的機率總合為1)

“Ordinal” is easy to remember because is sounds like “order” and that’s the key to remember with “ordinal scales”–it is the order that matters, but that’s all you really get from these.

Interval ==> +, 1

Like the others, you can remember the key points of an “interval scale” pretty easily. “Interval” itself means “space in between,” which is the important thing to remember–interval scales not only tell us about order, but also about the value between each item.

有次序關係: 非常滿意、滿意、普通、不滿意 、非常不滿意

Ratio ==> +, - x /

一般Rational Number 有理數 (即整數與可化為"分數"的數都是有理數), 有理數可以做加減乘除

Ratio scales provide a wealth of possibilities when it comes to statistical analysis.

Summary

In summary, nominal variables are used to “name,” or label a series of values. Ordinal scales provide good information about the order of choices, such as in a customer satisfaction survey. Interval scales give us the order of values + the ability to quantify the difference between each one. Finally, Ratio scales give us the ultimate–order, interval values, plus the ability to calculate ratios since a “true zero” can be defined.

Multiclass option can be either ‘ovr’ or ‘multinomial’. If the option chosen is ‘ovr’, then a binary problem is fit for each label. Else the loss minimised is the multinomial loss fit across the entire probability distribution. Does not work for liblinear solver.