Multi-Task Learning in Language Model for Text Classification

Edward Ma · Mar 27

Howard and Ruder propose a new method to enable robust transfer learning for any NLP task by combining pre-trained embeddings, language model (LM) fine-tuning, and classifier fine-tuning.

The same 3-layer LSTM architecture, with identical hyperparameters apart from tuned dropout, yields a robust model that outperforms prior work on six downstream NLP tasks.

They named the method Universal Language Model Fine-tuning (ULMFiT).

This story discusses Universal Language Model Fine-tuning for Text Classification (Howard and Ruder, 2018) and covers the following:

- Architecture
- Experiment
- Implementation

Architecture

As mentioned above, ULMFiT has three stages.

The first stage uses general-domain data to build a pre-trained LM.

The second stage fine-tunes the LM on the target data set, while the third stage fine-tunes the classifier on the target data set.

It does not use a Transformer but a regular LSTM (the AWD-LSTM) with various tuned dropout hyperparameters.
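To make the backbone concrete, here is a minimal PyTorch sketch of a 3-layer LSTM language model. The sizes (400-d embeddings, 1150 hidden units, three layers) follow the AWD-LSTM configuration used in the paper, but this plain `nn.LSTM` omits the weight-dropped regularization of the real AWD-LSTM, and the class name and dropout value are illustrative.

```python
import torch
import torch.nn as nn

class LSTMLanguageModel(nn.Module):
    """Minimal 3-layer LSTM language model (sketch, not the full AWD-LSTM)."""

    def __init__(self, vocab_size, emb_dim=400, hidden_dim=1150,
                 num_layers=3, dropout=0.4):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        # dropout here applies between LSTM layers only; the real AWD-LSTM
        # additionally drops recurrent weights (weight-dropping).
        self.lstm = nn.LSTM(emb_dim, hidden_dim, num_layers=num_layers,
                            dropout=dropout, batch_first=True)
        self.decoder = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens):
        # tokens: (batch, seq_len) -> next-token logits at every position
        emb = self.embedding(tokens)
        output, _ = self.lstm(emb)
        return self.decoder(output)

model = LSTMLanguageModel(vocab_size=1000)
logits = model(torch.randint(0, 1000, (2, 16)))
```

The same network body is reused across all three stages; only the decoder head changes when switching from language modeling to classification.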

General-domain LM pretraining

The problem is domain-agnostic, so we can leverage any data to train the model.

In other words, the training corpus can be very large.

For example, it can use the whole of Wikipedia or Reddit content.

The purpose is to capture general language features that transfer to many kinds of downstream problems.

Significant improvements from transfer learning in NLP have been demonstrated by many previous experiments.

Target task LM fine-tuning

A general-purpose model may not perform well directly on a specific problem because it is too generic.

Therefore, fine-tuning is a necessary step.

First of all, the model is fine-tuned on the language modeling (LM) objective using the target data.

In theory, it converges much faster than general-domain LM training because it only needs to learn the characteristics of the target task's data.

Some tricks are used to boost the performance of ULMFiT.

They are discriminative fine-tuning and slanted triangular learning rates.

Discriminative fine-tuning uses a different learning rate for each layer.

From their experiments, Howard and Ruder found it sufficient to choose the learning rate of the last layer only: the layer below the last uses the last layer's rate divided by 2.6, and the same formula sets the learning rates of all lower layers.

Regular Stochastic Gradient Descent (SGD) updates the parameters as

θ_t = θ_{t−1} − η · ∇_θ J(θ)

where η is the learning rate and ∇_θ J(θ) is the gradient of the objective function (Howard and Ruder, 2018).

SGD with discriminative fine-tuning updates each layer with its own rate:

θ_t^l = θ_{t−1}^l − η^l · ∇_{θ^l} J(θ)

where η^l is the learning rate of the l-th layer (Howard and Ruder, 2018).
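The per-layer rule can be sketched as a small framework-free helper that derives every layer's learning rate from the top layer's rate. The function name is my own; the factor 2.6 comes from the paper.

```python
def discriminative_lrs(top_lr, num_layers, ratio=2.6):
    """Per-layer learning rates for discriminative fine-tuning.

    The top (last) layer uses top_lr; each lower layer uses the rate of
    the layer above divided by `ratio` (2.6 in Howard and Ruder, 2018).
    Returns rates ordered from the bottom layer to the top layer.
    """
    return [top_lr / ratio ** (num_layers - 1 - l) for l in range(num_layers)]

lrs = discriminative_lrs(top_lr=0.01, num_layers=3)
# bottom: 0.01 / 2.6^2, middle: 0.01 / 2.6, top: 0.01
```

In a framework such as PyTorch, these per-layer rates would typically be supplied to the optimizer as separate parameter groups, each with its own `lr`.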

Slanted triangular learning rates (STLR) is another dynamic learning rate approach: the rate increases linearly at the beginning of training and then decays linearly, forming a triangle.

The STLR formula is:

cut = ⌊T · cut_frac⌋
p = t / cut, if t < cut
p = 1 − (t − cut) / (cut · (1/cut_frac − 1)), otherwise
η_t = η_max · (1 + p · (ratio − 1)) / ratio

T is the number of training iterations.

cut_frac is the fraction of iterations during which the learning rate increases.

cut is the iteration at which the rate switches from increasing to decreasing.

p is the fraction of the iterations over which the rate has increased or decreased, ratio specifies how much smaller the lowest learning rate is than the maximum rate η_max, and η_t is the learning rate at iteration t.
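The schedule can be written as a standalone Python function. The defaults cut_frac = 0.1, ratio = 32, and η_max = 0.01 follow the values used in the paper; the function name is my own.

```python
import math

def slanted_triangular_lr(t, T, cut_frac=0.1, ratio=32, lr_max=0.01):
    """Slanted triangular learning rate (Howard and Ruder, 2018).

    t: current training iteration (0-based), T: total iterations.
    The rate rises linearly to lr_max over the first cut_frac of
    training, then decays linearly back to lr_max / ratio.
    """
    cut = math.floor(T * cut_frac)  # iteration where increase stops
    if t < cut:
        p = t / cut                                        # increasing phase
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))     # decreasing phase
    return lr_max * (1 + p * (ratio - 1)) / ratio

# Example: with T = 1000 the rate peaks at iteration cut = 100.
peak = slanted_triangular_lr(100, 1000)   # = lr_max
start = slanted_triangular_lr(0, 1000)    # = lr_max / ratio
```

Because the increase occupies only a short fraction of training, the model quickly reaches a high learning rate and then spends most of the iterations refining parameters as the rate decays.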