BERT (Bidirectional Encoder Representations from Transformers) is a Transformer pre-trained on masked language modeling
and next sentence prediction tasks. This approach showed state-of-the-art results on a wide range of NLP tasks in
English.

RuBERT was trained on the Russian part of Wikipedia and news data. We used this training data to build a vocabulary of Russian subtokens and took
the multilingual version of BERT-base as the initialization for RuBERT.

In DeepPavlov, we made it easy to use pre-trained BERT for downstream tasks such as classification, tagging, question answering and
ranking. We provide pre-trained models and examples of how to use BERT with DeepPavlov.

deeppavlov.models.bert.BertClassifierModel (see here) provides an easy-to-use
solution for classification problems using pre-trained BERT.
Any of the pre-trained English, multilingual and Russian BERT models listed above can be used.

The two main components of the BERT classifier pipeline in DeepPavlov are
deeppavlov.models.preprocessors.BertPreprocessor (see here)
and deeppavlov.models.bert.BertClassifierModel (see here).
Raw texts should be passed to bert_preprocessor, which tokenizes them into subtokens,
encodes the subtokens as vocabulary indices, and creates token and segment masks.
If classes are converted to one-hot labels in the pipeline, one_hot_labels should be set to true.
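
The preprocessing steps can be illustrated with a minimal, self-contained sketch. It is not the BertPreprocessor implementation: the toy vocabulary, the helper names wordpiece and preprocess, and the max_seq_length value are assumptions made for illustration only.

```python
# Toy vocabulary (an assumption); a real BERT vocabulary has tens of thousands of subtokens.
VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
         "deep": 4, "##pav": 5, "##lov": 6, "is": 7, "great": 8}


def wordpiece(word, vocab):
    """Greedy longest-match-first split of one word into subtokens."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            # Continuation pieces carry the "##" prefix.
            candidate = word[start:end] if start == 0 else "##" + word[start:end]
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no subtoken matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


def preprocess(text, vocab, max_seq_length):
    """Return (subtokens, input_ids, input_mask, segment_ids) for one text."""
    subtokens = ["[CLS]"]
    for word in text.lower().split():
        subtokens.extend(wordpiece(word, vocab))
    subtokens = subtokens[:max_seq_length - 1] + ["[SEP]"]
    input_ids = [vocab[t] for t in subtokens]
    input_mask = [1] * len(input_ids)    # 1 marks real subtokens
    segment_ids = [0] * len(input_ids)   # single-text input: one segment
    padding = max_seq_length - len(input_ids)
    input_ids += [vocab["[PAD]"]] * padding
    input_mask += [0] * padding          # 0 marks padding
    segment_ids += [0] * padding
    return subtokens, input_ids, input_mask, segment_ids


subtokens, input_ids, input_mask, segment_ids = preprocess("DeepPavlov is great", VOCAB, 8)
print(subtokens)
print(input_ids, input_mask, segment_ids)
```

The three index lists have the same fixed length, so a batch of texts can be stacked into rectangular arrays before being passed to the encoder.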

bert_classifier applies a dense layer, whose output size equals the number of classes, to the pooled output of the Transformer encoder.
It is followed by a softmax activation (or a sigmoid if the multilabel parameter is set to true in the config).
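
The difference between the two activations can be seen in a small sketch of such a classification head. The pooled vector, weights and bias below are made-up numbers, and the function names are hypothetical; the sketch only mirrors the description above and is not taken from the BertClassifierModel code.

```python
import math


def dense(pooled, weights, bias):
    """Affine layer: one logit per class from the pooled encoder output."""
    return [sum(w * x for w, x in zip(row, pooled)) + b
            for row, b in zip(weights, bias)]


def softmax(logits):
    """Probabilities over mutually exclusive classes (they sum to 1)."""
    top = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(logits):
    """Independent per-class probabilities for the multilabel case."""
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]


def classify(pooled, weights, bias, multilabel=False):
    logits = dense(pooled, weights, bias)
    return sigmoid(logits) if multilabel else softmax(logits)


pooled = [0.5, -1.0, 2.0]        # hypothetical pooled output, hidden size 3
weights = [[1.0, 0.0, 0.5],      # hypothetical weights: 2 classes x hidden size 3
           [-1.0, 0.5, 0.0]]
bias = [0.0, 0.1]

single = classify(pooled, weights, bias)                   # softmax
multi = classify(pooled, weights, bias, multilabel=True)   # sigmoid
print(single, multi)
```

With softmax the class probabilities compete and sum to one, while with sigmoid each class is scored independently, so several labels can be active for the same text.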

A pre-trained BERT model can also be used for sequence tagging. Examples of using BERT for sequence tagging can be
found here. The module used for tagging is BertNerModel.
To tag each word, the representation of its first sub-word element is extracted, so only one vector is produced per word.
These representations are passed to a dense layer or a Bi-RNN layer to produce a distribution over tags. An
optional CRF layer can be added on top.
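
The selection of first sub-word representations can be sketched as follows. The function names and the two-dimensional vectors are hypothetical, and the sketch assumes WordPiece-style subtokens in which continuation pieces start with "##"; it is an illustration, not the BertNerModel code.

```python
def first_subword_mask(subtokens):
    """1 for the first subtoken of each word, 0 for continuations and special tokens."""
    special = {"[CLS]", "[SEP]", "[PAD]"}
    return [0 if t in special or t.startswith("##") else 1 for t in subtokens]


def select_word_vectors(subtoken_vectors, mask):
    """Keep one vector per word: the one at its first subtoken."""
    return [vec for vec, keep in zip(subtoken_vectors, mask) if keep]


subtokens = ["[CLS]", "deep", "##pav", "##lov", "is", "great", "[SEP]"]
# Hypothetical encoder outputs: one 2-dimensional vector per subtoken.
vectors = [[float(i), float(i) * 0.1] for i in range(len(subtokens))]

mask = first_subword_mask(subtokens)
word_vectors = select_word_vectors(vectors, mask)
print(mask)
print(len(word_vectors))
```

Three words go in and three vectors come out, so the tag distribution produced by the following layer lines up with the original words rather than with the subtokens.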

Context Question Answering on the SQuAD dataset is the task
of finding the answer to a question in a given context. This task can be formalized as predicting the start
and end positions of the answer in the context. BertSQuADModel uses two linear
transformations to predict the probability that the current subtoken is the start or end position of an answer. For details, check the
Context Question Answering documentation page.
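
A minimal sketch of this formulation is given below. The two linear transformations follow the description above; the subtoken vectors, the weights, the max_answer_len limit and the rule of picking the span with the highest product of start and end probabilities are assumptions for illustration and are not taken from the BertSQuADModel code.

```python
import math


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear(vectors, weight, bias):
    """One score per subtoken: dot product with a weight vector plus a bias."""
    return [sum(w * x for w, x in zip(weight, vec)) + bias for vec in vectors]


def best_span(start_probs, end_probs, max_answer_len=10):
    """Highest-scoring (start, end) pair with start <= end (assumed decoding rule)."""
    best, best_score = (0, 0), -1.0
    for s, p_start in enumerate(start_probs):
        for e in range(s, min(s + max_answer_len, len(end_probs))):
            score = p_start * end_probs[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best, best_score


# Hypothetical 2-dimensional representations of a 5-subtoken context.
context_vectors = [[0.1, 0.0], [2.0, 0.1], [0.3, 0.2], [0.0, 3.0], [0.1, 0.1]]
start_probs = softmax(linear(context_vectors, weight=[1.0, 0.0], bias=0.0))
end_probs = softmax(linear(context_vectors, weight=[0.0, 1.0], bias=0.0))

span, score = best_span(start_probs, end_probs)
print(span)
```

Restricting the search to pairs with start <= end rules out spans that would end before they begin, which independent argmax choices over the two distributions do not guarantee.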

There are two main approaches to text ranking. The first is interaction-based, which is relatively accurate but
slow; the second is representation-based, which is less accurate but faster [1].
In DeepPavlov, interaction-based ranking with BERT is implemented by two main components,
BertRankerPreprocessor and BertRankerModel,
and representation-based ranking by the components
BertSepRankerPreprocessor and BertSepRankerModel.
The additional components
BertSepRankerPredictorPreprocessor and BertSepRankerPredictor are intended for the interact mode,
where the ranking task is to retrieve the best possible response from a provided response base with the help of
the trained model. Working examples with the trained models are given here.
Statistics are available here.
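
The contrast between the two approaches can be sketched with a toy example. Everything here is an illustrative assumption rather than the DeepPavlov implementation: toy_encode is a bag-of-words stand-in for a BERT encoder, cosine similarity is one possible scoring function, and the helper names are hypothetical. In the interaction-based setting the context and a candidate response are packed into one input sequence, so each candidate needs its own encoder pass. In the representation-based setting the texts are encoded separately, so response vectors can be computed once and reused.

```python
import math

VOCAB = ["hello", "how", "are", "you", "fine", "thanks", "weather", "sunny"]


def toy_encode(text):
    """Stand-in for a BERT encoder: a bag-of-words count vector."""
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]


def cosine(a, b):
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


def joint_input(context, response):
    """Interaction-based input: both texts in one sequence, told apart by segment ids."""
    ctx, resp = context.lower().split(), response.lower().split()
    tokens = ["[CLS]"] + ctx + ["[SEP]"] + resp + ["[SEP]"]
    segment_ids = [0] * (len(ctx) + 2) + [1] * (len(resp) + 1)
    return tokens, segment_ids


def rank_by_representation(context, response_base, response_vectors):
    """Representation-based ranking against pre-computed response vectors."""
    ctx_vec = toy_encode(context)   # the only encoder pass needed at query time
    scores = [cosine(ctx_vec, vec) for vec in response_vectors]
    order = sorted(range(len(response_base)), key=lambda i: -scores[i])
    return [response_base[i] for i in order]


response_base = ["fine thanks", "sunny weather", "hello you"]
response_vectors = [toy_encode(r) for r in response_base]   # computed once, offline

tokens, segment_ids = joint_input("how are you", "fine thanks")
ranked = rank_by_representation("hello how are you", response_base, response_vectors)
print(tokens, segment_ids)
print(ranked[0])
```

Caching the response vectors is what makes retrieval from a large response base practical in the representation-based setting, at the cost of the model never seeing the context and the response together.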