Authors: Duo Chai, Wei Wu, Qinghong Han, Fei Wu, Jiwei Li
Paper Link: here
Note: This is a write-up of my understanding of the paper
The standard methodology for text classification works in two stages - text feature extraction and classification. In this formulation, categories are simply represented as indices in the label vocabulary, leaving the model with no explicit instructions on what to classify.
This paper suggests an alternative strategy wherein each category label is associated with a category description. These descriptions are either hand-crafted templates or are generated by extractive/abstractive models trained with reinforcement learning. The methodology follows a textual-entailment-style approach: the text and the description are concatenated and fed to a classifier, which decides whether the current label should be assigned to the text.
This strategy forces the model to attend to the most salient (important) features of the text by baking the label description into the input (think of it as a hard version of attention), making the input more information-rich.
Let’s first have a glimpse of BERT’s attention mechanism. You can use ExBERT in order to visualize this or refer to the figures in this paper.
As you can see, the self-attention mechanism pays attention to only a handful of words in each of its layers. This means the actual class indicators in the text can be just a few keywords, possibly buried deep in the text, making it hard to separate the grain from the chaff.
If you are dealing with a multi-label text classification problem, then signals from different classes might get entangled in the text as well. All of this makes things difficult for the model, as it first needs to learn to associate the relevant text with the target aspect and only then decide the sentiment.
Past work on query generation:
This paper takes inspiration from 1. and 4. above and marries the two.
You concatenate the query qy with the text x, feed the pair through the transformer, and take h[CLS], which encodes the entire query + text. (The [CLS] token bakes in all the features from the input and is what is sent forward to the softmax/sigmoid layer for classification.)
{[CLS]; qy; [SEP]; x} → transformers in BERT → contextual representation, h[CLS]
You then pass it through the sigmoid layer,
p(y|x) = sigmoid(W2ReLU(W1h[CLS] + b1) + b2) → value between 0 and 1
For single-label classification, you just take the argmax over the per-class probabilities (picking the top class this way is equivalent to what a softmax would give you)
y˜ = arg maxy({p(y|x), ∀y ∈ Y})
For multi-label classification,
y˜ = {y | p(y|x) > 0.5, ∀y ∈ Y}
These are binary classifiers, and you can have N such binary classifiers, one per class; a minimal sketch of this formulation follows below.
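To make this concrete, here is a minimal sketch of the per-label binary formulation, assuming a HuggingFace BERT backbone. The class name DescriptionBinaryClassifier, the hidden size of 768 and the example template are my own illustrative choices, not from the paper.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class DescriptionBinaryClassifier(nn.Module):
    def __init__(self, model_name="bert-base-uncased", hidden_dim=768):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)
        # two-layer head: p(y|x) = sigmoid(W2 ReLU(W1 h_[CLS] + b1) + b2)
        self.head = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, input_ids, attention_mask, token_type_ids):
        out = self.bert(input_ids=input_ids,
                        attention_mask=attention_mask,
                        token_type_ids=token_type_ids)
        h_cls = out.last_hidden_state[:, 0]                  # h_[CLS]
        return torch.sigmoid(self.head(h_cls)).squeeze(-1)   # value between 0 and 1

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = DescriptionBinaryClassifier()

# {[CLS]; qy; [SEP]; x}: the sentence-pair tokenizer builds [CLS] qy [SEP] x [SEP]
description = "the text is about sports"                     # hypothetical template qy
text = "The local team won the championship last night."
enc = tokenizer(description, text, return_tensors="pt")
prob = model(**enc)                                           # p(y|x) for this one label
print(prob.item())
```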
You may also formulate your multi-class problem as a single N-class classifier by concatenating all descriptions with the input x,
{[CLS1]; q1; [CLS2]; q2; …; [CLSN]; qN ; [SEP]; x} → fed to transformer → h[CLS1], h[CLS2], …, h[CLSN]
Do note though that the N-class classifier strategy cannot handle the multi-label classification case.
The probability of assigning class n to instance x is obtained by first mapping each h[CLSn] to a scalar and then applying a softmax over these scalars:
an = h^T · h[CLSn]
p(y = n|x) = exp(an) / Σm=1…N exp(am)
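Below is a rough sketch of how this single N-class variant could be wired, under the assumption that one [CLSn] marker per description has already been inserted into the input and that we know their positions. marker_positions and the learnable vector h used to map each h[CLSn] to a scalar are my own naming.

```python
import torch
import torch.nn as nn

class MultiCLSClassifier(nn.Module):
    def __init__(self, encoder, hidden_dim=768):
        super().__init__()
        self.encoder = encoder                           # e.g. a BertModel
        self.h = nn.Parameter(torch.randn(hidden_dim))   # maps each h_[CLSn] to a scalar

    def forward(self, input_ids, attention_mask, marker_positions):
        # marker_positions: (batch, N) indices of [CLS1] ... [CLSN] in the input
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        hidden = out.last_hidden_state                   # (batch, seq_len, hidden_dim)
        idx = marker_positions.unsqueeze(-1).expand(-1, -1, hidden.size(-1))
        h_cls = torch.gather(hidden, 1, idx)             # (batch, N, hidden_dim)
        a = h_cls @ self.h                               # a_n = h^T h_[CLSn], shape (batch, N)
        return torch.softmax(a, dim=-1)                  # p(y = n | x)
```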
There are primarily three strategies for constructing the label descriptions: the Template strategy (hand-crafted descriptions), the Extractive strategy and the Abstractive strategy.
The important thing to note here is that the goal is for the model to generate the most appropriate description of each class conditioned on the current text to classify, and the appropriateness of the generated descriptions should directly correlate with the final classification performance.
The Template Strategy is usually very labour intensive, and human-generated templates might be sub-optimal. So let’s have a deeper look at the Extractive and Abstractive strategies.
If you have worked on any text summarisation problem, you can build your intuition for the Extractive Strategy by thinking of SQuAD-style span extraction from the body of the text, and for the Abstractive Strategy by thinking of GPT-2/GPT-3-style generative text generation.
For each input x = {x1, · · · , xT}, the extractive model generates a description qyx for each class label y, where qyx is a substring of x. For different inputs x, the descriptions for the same class can be different. This is quite intuitive: when x changes, the base text from which extractions are made changes, and thus the descriptions change too.
It’s important to note that for a given input x, there might not be any sub-string that is appropriate as a description for a particular class y.
To deal with this problem, N dummy tokens are appended to x; if the extractive model picks a dummy token for a class, it falls back to using the hand-crafted template for that category as the description. Also note that, in this case, the hand-crafted templates are built using some regex heuristics rather than being completely manual.
Now you must be wondering: when does the reinforcement learning bit come in, and how does it actually help?
Reinforcement learning is used to back-propagate the signal indicating which span contributes how much to the classification performance. Let’s also introduce some typical reinforcement learning components here -
For each class label y, the action is to pick a text span {xis , · · · , xie} from x to represent qyx → we need the start and end indices of the span, ais,ie
For each class label y, the policy π defines the probability of selecting the start index is and the end index ie.
Each token xk within x is mapped to a representation hk using BERT.
Pstart(y, k) = exp(Wys · hk) / Σt=1…T exp(Wys · ht)
Pend(y, k) = exp(Wye · hk) / Σt=1…T exp(Wye · ht)
Each class y has a class-specific Wys and Wye
Probability of a text span with the starting index is and ending index ie being the description for class y, Pspan(y, ais,ie) is -
Pspan(y, ais,ie) = Pstart(y, is) × Pend(y, ie)
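Here is a small sketch of how these start/end/span probabilities could be computed, assuming the token representations hk already come from BERT. The parameter layout and the names W_start, W_end and span_prob are mine, not the paper's.

```python
import torch
import torch.nn as nn

class ExtractivePolicy(nn.Module):
    def __init__(self, num_classes, hidden_dim=768):
        super().__init__()
        # one class-specific Wys and Wye per class y
        self.W_start = nn.Parameter(torch.randn(num_classes, hidden_dim))
        self.W_end = nn.Parameter(torch.randn(num_classes, hidden_dim))

    def forward(self, token_reps, y):
        # token_reps: (T, hidden_dim) BERT representations h_k of the tokens of x
        p_start = torch.softmax(token_reps @ self.W_start[y], dim=0)  # Pstart(y, k)
        p_end = torch.softmax(token_reps @ self.W_end[y], dim=0)      # Pend(y, k)
        return p_start, p_end

def span_prob(p_start, p_end, i_s, i_e):
    # Pspan(y, a_{is,ie}) = Pstart(y, is) * Pend(y, ie)
    return p_start[i_s] * p_end[i_e]
```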
Given x and the description qyx, the classification model assigns a probability to the correct label for x; this probability is used as the reward to update both the classification model and the extractive model.
For multi-class classification, the reward is given by -
R(x, qyx for all y) = p(y = n|x), where n is the gold label for x
To find the optimal policy, the REINFORCE algorithm is used, which maximizes the expected reward Eπ[R(x, qy)]
For each generated description qyx and the corresponding x,
L = −Eπ[R(qyx, x)]
REINFORCE approximates the expectation above with samples drawn from the policy π, and the gradient used to update the parameters is given by -
∇L ≈ -Σi=1…B∇logπ(ais,ie|x, y)[R(qy)-b]
where, b denotes the baseline value, which is set to the average of all previous rewards.
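Here is a toy sketch of the resulting REINFORCE update for one sampled span, assuming the reward is the classifier's probability of the gold label and the baseline is a running average of previous rewards; all names here are my own.

```python
import torch

def reinforce_loss(log_prob_span, reward, baseline):
    # log_prob_span: log pi(a_{is,ie} | x, y) for the sampled span
    # reward:        R(qy), e.g. p(y = gold | x) from the classifier (detached)
    # baseline:      b, the average of previous rewards
    return -(reward - baseline) * log_prob_span

# Usage inside a training step (p_start, p_end come from the extractive policy):
# i_s = torch.multinomial(p_start, 1)
# i_e = torch.multinomial(p_end, 1)
# log_prob = torch.log(p_start[i_s]) + torch.log(p_end[i_e])
# loss = reinforce_loss(log_prob, reward.detach(), baseline)
# loss.backward()
```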
So, the Extractive Policy is initialized to generate sub-strings of x as descriptions. Then the extractive model and the classification model are jointly trained based on the reward.
For each class label y, the action is to generate the description qyx = {q1, · · · , qL} defined by pθ
And the policy, PSEQ2SEQ is given by -
PSEQ2SEQ(qy|x) = Πi=1..Lpθ(qi|q<i, x, y)
where q<i denotes all the already generated tokens
PSEQ2SEQ(qy|x) for different classes y shares the same structure and parameters, with the only difference being that a class-specific embedding hy is appended to each source and target token.
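As a small illustration of this factorisation, here is how the log of PSEQ2SEQ(qy|x) could be accumulated from per-step decoder outputs. step_logits is a hypothetical tensor of decoder logits; the class embedding hy is assumed to have been appended inside the decoder already.

```python
import torch

def sequence_log_prob(token_ids, step_logits):
    # token_ids:   (L,)  the generated description qy = {q1, ..., qL}
    # step_logits: (L, vocab_size) decoder logits for each generation step
    log_probs = torch.log_softmax(step_logits, dim=-1)
    # log p_theta(q_i | q_<i, x, y) for each generated token
    picked = log_probs[torch.arange(len(token_ids)), token_ids]
    return picked.sum()  # log PSEQ2SEQ(qy | x) = sum_i log p_theta(q_i | ...)
```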
In the abstractive setting we are generating free text, so the space of possible tokens at every step - let's call it the action space - is really big. A widely recognized challenge when training language models with RL is the resulting high variance, precisely because the action space is so huge.
So this cannot be handled with the basic REINFORCE algorithm alone. To deal with it, we use REGS (Reward for Every Generation Step), an alternative to Monte Carlo search for assigning per-step rewards (further reading)
Unlike standard REINFORCE training, in which the same reward is used to update the probability of all tokens within the description, REGS trains a discriminator that is able to assign rewards to partially decoded sequences.
∇L ≈ -Σi=1…L∇logπ(qi|q<i, hy)[R(q<i) - b(q<i)]
here, R(q<i) denotes the reward given the partially decoded sequence q<i as the description, and b(q<i) denotes the baseline (average of all previous rewards).
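Here is a toy sketch of the REGS-style update, where every prefix q<i carries its own reward and baseline. step_log_probs, step_rewards and step_baselines are my own names for these per-step quantities.

```python
import torch

def regs_loss(step_log_probs, step_rewards, step_baselines):
    # step_log_probs: (L,) log pi(q_i | q_<i, h_y) for the sampled tokens
    # step_rewards:   (L,) R(q_<i), the reward for each partially decoded prefix
    # step_baselines: (L,) running averages of previous per-step rewards
    advantages = (step_rewards - step_baselines).detach()
    return -(advantages * step_log_probs).sum()
```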
The generative policy PSEQ2SEQ is initialized using a pretrained encoder-decoder, with the input being x and the output being the template descriptions. Then the description-generation model and the classification model are jointly trained based on the reward.