Now, if we inspect the shape of X_reduced, it is very clear how many features were selected. So the question now is: which ones?

A common suggestion is that the coef_ attribute of the LinearSVC holds the answer: iterate over it, and the features whose coefficients are different from zero are the selected ones. Well, this is wrong, but it gets you very close to the real result.

After checking X_reduced, I see I got 310 selected features, and that number is certain, since I am inspecting the resulting matrix. If I count non-zero coef_ entries instead, 414 features out of a total of 2000 were selected, so it is close to the real thing but not equal.

According to the scikit-learn LinearSVC docs, a mean(X) is involved when threshold=None, but I am stuck and have no idea what to do with that.

UPDATE: Here is a link with data and code that reproduce the error; it's just a few KB.

Best How To:

I think LinearSVC() does return the features with non-zero coefficients. Could you please upload the sample data file and code script (for example, via a Dropbox share link) that reproduce the inconsistency you saw?
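In the meantime, here is a minimal sketch of where a 310-vs-414 style gap can come from, assuming the behaviour the question quotes from the docs: with threshold=None, the selector falls back to the mean of the absolute coefficients rather than keeping every non-zero one (the dataset below is synthetic, just for illustration):

    import numpy as np
    from sklearn.svm import LinearSVC
    from sklearn.datasets import make_classification

    X, y = make_classification(n_samples=200, n_features=50, random_state=0)
    svc = LinearSVC(C=0.1, penalty='l1', dual=False).fit(X, y)

    importances = np.abs(svc.coef_).sum(axis=0)           # combine per-class rows
    n_nonzero = (importances > 0).sum()                   # the "coef_ thing" count
    n_selected = (importances >= importances.mean()).sum()  # threshold=None -> mean
    print(n_nonzero, n_selected)  # n_selected should match X_reduced.shape[1]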

Perhaps you are confusing the concept of optimising a statistical model from a set of data points with that of fitting a curve through a set of data points. Some of the scikit-learn code cited above is trying to optimise a statistical model from a set of data points. In...

You can look at RandomForest, which is a well-known and quite efficient classifier. In scikit-learn, some classes, such as RandomForestClassifier, can be run over several cores. It has a constructor parameter that can be used to define the number of cores, or a value that will use...
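A short sketch of that parameter in use (n_jobs=-1 asks for all available cores; the dataset is synthetic):

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.datasets import make_classification

    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
    # n_jobs controls how many cores the forest trains (and predicts) on
    clf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
    clf.fit(X, y)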

It looks like sklearn requires a data shape of (row number, column number). If your data shape is (row number,), like (999,), it does not work. Using numpy.reshape, you should change it to (999, 1), e.g. data.reshape((999, 1)). In my case, it worked with that.
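For example, assuming a 1-D array of 999 values:

    import numpy as np

    data = np.arange(999, dtype=float)    # shape (999,) -- rejected by sklearn
    data = data.reshape((999, 1))         # shape (999, 1) -- accepted
    # data.reshape(-1, 1) does the same without hard-coding the row count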

You have to use the Support Vector Machine (LibSVM) Operator. In contrast to the classic SVM, which only supports two-class problems, the LibSVM implementation (http://www.csie.ntu.edu.tw/~cjlin/papers/libsvm.pdf) supports multi-class classification as well as regression.

I get more than one digit in my results; are you sure it is not due to your dataset? (For example, using a very small dataset would yield simple decision trees and so 'simple' probabilities.) Otherwise it may only be the display that shows one digit,...

Here is my guess about what is happening in your two types of results: .days does not convert your index into a form that repeats between your train and test samples, so it becomes a unique value for every date in your dataset. As a consequence, your models either...

First of all, why use a ball tree? Maybe your metric implies this, but if that's not the case, you could use a kd-tree too. I will approach your question from a theoretical point of view. The radius parameter is set to 1.0 by default. This might...

The Pipeline documentation slightly overstates things. It has all the estimator methods of its last estimator. These include things like predict(), fit_predict(), fit_transform(), transform(), decision_function(), predict_proba(), and so on. It cannot expose any other functions, because it wouldn't know what to do with all the other steps in the pipeline. For most situations,...
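A small sketch of that delegation, assuming a scaler followed by an SVC as the last step (both chosen just for illustration):

    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC
    from sklearn.datasets import make_classification

    X, y = make_classification(random_state=0)
    pl = Pipeline([('scale', StandardScaler()), ('svc', SVC())]).fit(X, y)
    pl.predict(X[:5])            # works: the last step (SVC) has predict()
    pl.decision_function(X[:5])  # works: SVC has decision_function() too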

Try this:

    goodwords = ((countmatrix > 1).mean(axis=0) <= 0.8).nonzero()[0]

It first computes a Boolean matrix which is True where countmatrix > 1, then computes the column-wise mean of it. Where the mean is at most 0.8 (80%), the corresponding column index is returned by nonzero(). So, goodwords will contain all...

The optimal size of images is one at which you can easily classify the object yourself. Yes, classifiers work better after normalization, and there are several options. The most popular is to center the dataset (subtract the mean) and normalize the range of values, say to the [-1, 1] range. Another popular way of normalization is similar to the previous one but...
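A minimal sketch of that centering and rescaling, assuming pixel-like data in [0, 255]:

    import numpy as np

    X = np.random.RandomState(0).rand(100, 64) * 255.0   # stand-in pixel data
    X = X - X.mean(axis=0)            # center: subtract the per-feature mean
    X = X / np.abs(X).max(axis=0)     # rescale each feature into [-1, 1]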

RandomForests are built on Trees, which are very well documented. Check how Trees use the sample weighting:
- User guide on decision trees - tells exactly what algorithm is used
- Decision tree API - explains how sample_weight is used by trees (which for random forests, as you have determined, is the...

I am not using Python, but I did something you need in C++ & OpenCV. Hope you succeed in converting it to whatever language.

    // choose how many eigenvectors you want:
    int nEigensOfInterest = 0;
    float sum = 0.0;
    for (int i = 0; i < mEiVal.rows; ++i) {
        sum...

I suspect image1.jpg is a color image, so im is 3D, with shape (num_rows, num_cols, num_color_channels). One option is to tell imread to flatten the image into a 2D array by giving it the argument flatten=True:

    im = misc.imread('image1.jpg', flatten=True)

Or you could apply canny to just one of the...

You most likely have an older version of scikit-learn. You can check the current version using:

    python -c "import sklearn as sk; print sk.__version__"

If you're using 0.16.1, you should be able to import LSHForest. ...

The classification report should be straightforward - a report of precision/recall/F-measure for each class in your test data. In multiclass problems, it is not a good idea to read precision/recall and F-measure over the whole data, because any imbalance would make you feel you've reached better results than you actually have. That's where such reports help....
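For example (the labels below are made up):

    from sklearn.metrics import classification_report

    y_true = [0, 1, 2, 2, 1, 0, 2]
    y_pred = [0, 2, 2, 2, 1, 0, 1]
    # one precision/recall/F1 row per class, plus support counts
    print(classification_report(y_true, y_pred))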

For what you're describing, you just need to use train_test_split with a following split on its results. Adapting the tutorial there, start with something like this:

    import numpy as np
    from sklearn import cross_validation
    from sklearn import datasets
    from sklearn import svm

    iris = datasets.load_iris()
    iris.data.shape, iris.target.shape
    # ((150, 4), (150,))
    ...

It is advisable to prepend regular expressions with r; this should work:

    vectorizer2 = TfidfVectorizer(token_pattern=r'(?u)\b\w\w+\b',
                                  ngram_range=(1, 2), max_df=1.0, min_df=1)
    train_set_tfidf = vectorizer2.fit_transform(train_set)

This is a known bug in the documentation, but if you look at the source code they do use raw literals....

I think 0.695652 is the same thing as 0.70. The scikit-learn f1_score documentation explains that in the default mode, the F1 score is reported for the positive class in binary classification. Also, you can easily reach the score of 0.86 with the formulation of the F1 score. The formulation of the F1 score...
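For reference, the formulation is the harmonic mean of precision and recall:

    # F1 = 2 * precision * recall / (precision + recall)
    def f1(precision, recall):
        return 2 * precision * recall / (precision + recall)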

The problem was with the scipy/numpy install. I'd been using the (normally excellent!) unofficial installers from http://www.lfd.uci.edu/~gohlke/pythonlibs/. Uninstalling/re-installing from there made no difference, but installing with the official installers (linked from http://www.scipy.org/install.html) did the trick.

Unfortunately, this is currently not as nice as it could be. You need to use FeatureUnion to concatenate the two kinds of features, and the transformer in each needs to select the features and transform them. One way to do that is to make a pipeline of a transformer that selects...
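A minimal sketch of that pattern, assuming a hypothetical ItemSelector transformer that pulls one field out of each sample (the names and fields here are made up):

    from sklearn.base import BaseEstimator, TransformerMixin
    from sklearn.pipeline import Pipeline, FeatureUnion
    from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

    class ItemSelector(BaseEstimator, TransformerMixin):
        """Select a single field from dict-like samples."""
        def __init__(self, key):
            self.key = key
        def fit(self, X, y=None):
            return self
        def transform(self, X):
            return [x[self.key] for x in X]

    union = FeatureUnion([
        ('title', Pipeline([('sel', ItemSelector('title')),
                            ('tfidf', TfidfVectorizer())])),
        ('body',  Pipeline([('sel', ItemSelector('body')),
                            ('counts', CountVectorizer())])),
    ])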

The code should output the warning: "Warning: Escape sequence '\U' is not valid. See 'help sprintf' for valid escape sequences." You need to escape the \ when using sprintf; with your code, path is just C:. For examples of how proper escaping is done, please check the documentation for sprintf. Instead...

Yes, you will need to convert the strings to numerical values. The naive Bayes classifier cannot handle strings, as there is no way a string can enter into a mathematical equation. If your strings have some "scalar value", for example "large, medium, small", you might want to classify...
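For instance, an ordinal mapping along those lines (the mapping itself is just an illustration):

    sizes = ['large', 'small', 'medium', 'small']
    order = {'small': 0, 'medium': 1, 'large': 2}   # assumed ordering
    encoded = [order[s] for s in sizes]             # [2, 0, 1, 0]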

Firstly, it's better to leave the import at the top of your code instead of within your class:

    from sklearn.feature_extraction.text import TfidfVectorizer

    class changeToMatrix(object):
        def __init__(self, ngram_range=(1, 1), tokenizer=StemTokenizer()):
            ...

Next, StemTokenizer doesn't seem to be a canonical class. Possibly you've got it from http://sahandsaba.com/visualizing-philosophers-and-scientists-by-the-words-they-used-with-d3js-and-python.html or maybe somewhere else, so we'll assume it...

You need to do a grid search with cross-validation (GridSearchCV) instead of just CV. CV is used for performance evaluation and doesn't itself fit the estimator.

    from sklearn.datasets import make_classification
    from sklearn.svm import SVC
    from sklearn.grid_search import GridSearchCV

    # unbalanced classification
    X, y = make_classification(n_samples=1000, weights=[0.1, 0.9])

    # use grid search for tuning...

First off, it might not be good to go by recall alone. You can achieve a recall of 100% simply by classifying everything as the positive class. I usually suggest using AUC for selecting parameters, and then finding a threshold for the operating point (say, a given precision level)...
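For instance (the scores below are the illustrative ones from the scikit-learn docs):

    from sklearn.metrics import roc_auc_score

    y_true = [0, 0, 1, 1]
    y_score = [0.1, 0.4, 0.35, 0.8]  # e.g. decision_function or predict_proba output
    print(roc_auc_score(y_true, y_score))   # 0.75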

Yes, the default kernel is RBF with gamma equal to 1/k. See the other defaults in the javadocs here or here. NB: Weka contains its own implementation, SMO, but it also provides a wrapper for libsvm, and "LibSVM runs faster than SMO" (note that this requires libsvm to be installed; see the docs)....

Unless you have some implementation bug (test your code with synthetic, well-separated data), the problem might lie in the class imbalance. This can be solved by adjusting the misclassification cost (see this discussion in CV). I'd use the cost parameter of fitcsvm to increase the misclassification cost of the...

You can access the individual decision trees in the estimators_ attribute of a fitted random forest instance. You can even re-sample that attribute (it's just a Python list of decision tree objects) to add or remove trees and see the impact on the quality of the prediction of the resulting...
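A short sketch of that (the slicing is just an illustration; keep n_estimators consistent if you resample the list):

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.datasets import make_classification

    X, y = make_classification(random_state=0)
    rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
    print(type(rf.estimators_), len(rf.estimators_))  # plain list of trees
    rf.estimators_ = rf.estimators_[:10]              # keep only 10 trees
    rf.n_estimators = len(rf.estimators_)             # keep metadata in sync
    print(rf.score(X, y))                             # quality with fewer trees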

Generally, the combination of a fairly low number of n_samples, a high probability of randomly flipping the label flip_y and a large number of n_classes should get you where you want. You can try the following:

    from sklearn.cross_validation import cross_val_score
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    lr =...

Django is a Python framework, meaning that you need to have Python installed to use it. Once you have Python, you can use whatever Python package you want (compatible with the version of Python you are using).

It looks like you are looking for OneHotEncoder. For an explanation, take a look at the Encoding categorical features section of the docs. The idea is that you will make a column for each city, with a 0/1 value indicating whether the sample belongs to that city. You might also be...
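A minimal sketch with integer-coded city ids (the ids are made up; sparse=False is the old-style signature, used here just so the output prints as a dense array):

    import numpy as np
    from sklearn.preprocessing import OneHotEncoder

    cities = np.array([[0], [1], [2], [1]])   # one integer city id per sample
    enc = OneHotEncoder(sparse=False)
    print(enc.fit_transform(cities))           # one 0/1 column per city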

The pipeline calls transform on the preprocessing and feature selection steps if you call pl.predict. That means that the features selected in training will be selected from the test data (the only thing that makes sense here). It is unclear what you mean by "apply" here. Nothing new will be...

Image classification can be quite general. In order to define good features, first you need to be clear about what kind of output you want. For example, images can be categorized according to their scenes into nature views, city views, indoor views, etc. Different kinds of classification may require...

The GaussianNB() implemented in scikit-learn does not allow you to set the class priors. If you read the online documentation, you see that .class_prior_ is an attribute rather than a parameter. Once you fit the GaussianNB(), you can access the class_prior_ attribute. It is calculated by simply counting the number of different...
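A quick sketch of that on tiny made-up data:

    import numpy as np
    from sklearn.naive_bayes import GaussianNB

    X = np.array([[1.0], [1.1], [2.0], [2.1], [2.2]])
    y = np.array([0, 0, 1, 1, 1])
    gnb = GaussianNB().fit(X, y)
    print(gnb.class_prior_)   # [0.4 0.6] -- relative label frequencies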

SVD is a dimensionality reduction tool, which means it reduces the order (number) of your features to a more representative set. From the source code on GitHub:

    def fit_transform(self, X, y=None):
        """Fit LSI model to X and perform dimensionality reduction on X.

        Parameters
        ----------
        X : {array-like, sparse matrix}, shape...

Threading by itself does not speed up Python processes whose bottleneck is CPU rather than IO (read/write), because of the global interpreter lock (GIL). To actually get the speedup, sklearn uses multiprocessing for parallelization. This is different from threading in that the objects are copied into a separate process, and...
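A tiny sketch of process-based parallelism with joblib, the library sklearn uses for this (the worker function is made up):

    from joblib import Parallel, delayed

    def square(x):          # stand-in for a CPU-bound task
        return x * x

    # n_jobs=2 runs the calls in two worker processes, sidestepping the GIL
    print(Parallel(n_jobs=2)(delayed(square)(i) for i in range(8)))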