DirtyCommentsPreprocessor (registered as dirty_comments_preprocessor) preprocesses samples by converting them to lowercase, expanding English contractions that use an apostrophe ('), and collapsing runs of more than three identical symbols to two symbols.
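
The three steps described above can be illustrated with a minimal standalone sketch. It does not use DeepPavlov and is not the component's actual implementation: the function name preprocess_dirty_comments and the CONTRACTIONS table are hypothetical, and the threshold follows the description literally (runs of more than three identical symbols).

```python
import re

# Hypothetical subset of apostrophe contractions; the real component's
# replacement table may differ.
CONTRACTIONS = [
    (r"won't", "will not"),
    (r"can't", "can not"),
    (r"n't", " not"),
    (r"'re", " are"),
    (r"'ll", " will"),
    (r"'ve", " have"),
    (r"\bi'm\b", "i am"),
]


def preprocess_dirty_comments(batch):
    """Lowercase, expand contractions, and collapse long character runs."""
    cleaned = []
    for sample in batch:
        text = sample.lower()
        for pattern, replacement in CONTRACTIONS:
            text = re.sub(pattern, replacement, text)
        # Collapse runs of four or more identical characters to two.
        text = re.sub(r"(.)\1{3,}", r"\1\1", text)
        cleaned.append(text)
    return cleaned


print(preprocess_dirty_comments(["You're sooooo cool!!!!", "I don't know"]))
```

Specific contractions are listed before the generic n't rule, so that "won't" becomes "will not" rather than "wo not".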

StrLower (registered as str_lower) converts samples to lowercase.

Universal preprocessors already implemented for other types of features:

OneHotter (registered as one_hotter) one-hot encodes a batch of samples, where each sample is an integer label or a list of integer labels (both kinds can be mixed in one batch). If the multi_label parameter is set to True, it returns a single one-dimensional vector per sample, with several elements equal to 1.
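
The behavior described above can be sketched in plain Python. This is an illustration under assumptions, not the component's code: the function name one_hot and the depth argument (the number of classes) are hypothetical, and labels are assumed to be 0-based.

```python
def one_hot(batch, depth, multi_label=False):
    """One-hot encode a batch of integer labels or lists of integer labels."""
    encoded = []
    for sample in batch:
        is_single = isinstance(sample, int)
        labels = [sample] if is_single else list(sample)
        if multi_label:
            # One vector per sample, with a 1 at every label position.
            vector = [0] * depth
            for label in labels:
                vector[label] = 1
            encoded.append(vector)
        else:
            # One vector per label; a single integer yields a single vector.
            rows = []
            for label in labels:
                row = [0] * depth
                row[label] = 1
                rows.append(row)
            encoded.append(rows[0] if is_single else rows)
    return encoded


# A batch that mixes an integer label with a list of labels.
print(one_hot([1, [0, 2]], depth=3, multi_label=True))
```

With multi_label=True, the sample [0, 2] becomes the single vector [1, 0, 1]. Without it, the same sample becomes two separate one-hot vectors.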

A vectorizer is a component that converts a batch of text samples into a batch of vectors.

SklearnComponent (registered as sklearn_component) is a DeepPavlov wrapper for most sklearn estimators, vectorizers, etc. For example, to get a TF-IDF vectorizer, set model_class to sklearn.feature_extraction.text:TfidfVectorizer and infer_method to transform in the config, and pass load_path, save_path and any other sklearn model parameters.
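
As an illustration, such a config entry might look like the fragment below, written as a Python dict that mirrors the JSON config. Only model_class, infer_method, load_path and save_path come from the description above. The class_name, in, fit_on and out keys, the file paths and the max_features value are assumptions made for this sketch and should be checked against the DeepPavlov version in use.

```python
import json

# Hypothetical pipeline entry for a TF-IDF vectorizer.
sklearn_tfidf_component = {
    "class_name": "sklearn_component",       # assumed key for the registered name
    "model_class": "sklearn.feature_extraction.text:TfidfVectorizer",
    "infer_method": "transform",
    "save_path": "path/to/tfidf.pkl",        # placeholder path
    "load_path": "path/to/tfidf.pkl",        # placeholder path
    "in": ["x"],                             # assumed input name
    "fit_on": ["x"],                         # assumed training input name
    "out": ["x_vec"],                        # assumed output name
    "max_features": 10000,                   # an ordinary TfidfVectorizer parameter
}

# Serialize to the JSON form used in config files.
print(json.dumps(sklearn_tfidf_component, indent=2))
```

The model_class string separates the module path from the class name with a colon.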

HashingTfIdfVectorizer (registered as hashing_tfidf_vectorizer) implements a hashing version of the usual TF-IDF vectorizer. It builds a TF-IDF matrix of size [n_documents x n_features] from a collection of documents, where n_features equals hash_size.
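
The idea of a hashed TF-IDF matrix can be sketched with the standard library alone. This is a simplified illustration and not the component's implementation: the function name hashed_tfidf, whitespace tokenization, the CRC32 hash, the smoothed IDF formula and the dense list-of-lists output are all assumptions chosen for brevity.

```python
import math
import zlib
from collections import Counter


def hashed_tfidf(documents, hash_size=16):
    """Build a dense [n_documents x hash_size] TF-IDF matrix via token hashing."""
    n_documents = len(documents)

    # Map each token to a bucket index and count occurrences per document.
    bucket_counts = []
    for doc in documents:
        buckets = [
            zlib.crc32(token.encode("utf-8")) % hash_size
            for token in doc.lower().split()
        ]
        bucket_counts.append(Counter(buckets))

    # Document frequency: in how many documents each bucket occurs.
    doc_freq = Counter()
    for counts in bucket_counts:
        doc_freq.update(counts.keys())

    matrix = []
    for counts in bucket_counts:
        row = [0.0] * hash_size
        for bucket, term_freq in counts.items():
            # Smoothed IDF (an assumption; other variants are common).
            idf = math.log((1 + n_documents) / (1 + doc_freq[bucket])) + 1.0
            row[bucket] = term_freq * idf
        matrix.append(row)
    return matrix


matrix = hashed_tfidf(["a b", "b c", "c c d"], hash_size=8)
print(len(matrix), len(matrix[0]))
```

Because tokens are mapped to a fixed number of buckets, no vocabulary needs to be stored. The cost is that distinct tokens can collide in the same column, which becomes more likely as hash_size shrinks.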