Authors: Duo Chai, Wei Wu, Qinghong Han, Fei Wu, Jiwei Li
Paper Link: here
Note: This is a write-up of my understanding of the paper
The standard methodology for text classification works in two stages - text feature extraction and classification. In this formulation, categories are simply represented as indices in the label vocabulary, leaving the model with no explicit instructions on what to classify.
This paper suggests an alternative strategy wherein each category label is associated with a category description. These descriptions are either hand-crafted templates or are generated by extractive/abstractive models trained with reinforcement learning. The methodology follows a textual-entailment-style approach: the text and the description are concatenated and fed to a classifier, which decides whether the current label should be assigned to the text.
This strategy forces the model to attend to the most salient (important) features of the text by baking the label description into the input (think of it as a hard version of attention), making the input more information-rich.
Let’s first have a glimpse of BERT’s attention mechanism. You can use ExBERT in order to visualize this or refer to the figures in this paper.
As you can see, the self-attention mechanism pays attention to only a handful of words in each of its layers. This means the actual class indicators in the text can be just a few keywords, possibly buried deep in the text, making it hard to separate the grain from the chaff.
If you are dealing with a multi-label text classification problem, then signals from different classes might get entangled in the text as well. All of this makes things difficult for the model, as it first needs to learn to associate the relevant text with the target aspect and only then decide the sentiment.
Past work on query generation:
This paper takes inspiration from 1. and 4. above and marries the two.
You concatenate the query qy with the text x, feed the pair through the transformer, and take h[CLS], which encodes the entire query + text. (The [CLS] token bakes in all the features from the input and is what is sent forward to the softmax/sigmoid layer for classification.)
{[CLS]; qy; [SEP]; x} → transformers in BERT → contextual representation, h[CLS]
You then pass it through the sigmoid layer,
p(y|x) = sigmoid(W2ReLU(W1h[CLS] + b1) + b2) → value between 0 and 1
For single-label classification, you just take the argmax over the per-class probabilities (picking the top class this way is equivalent to what a softmax would give you)
y˜ = arg maxy({p(y|x), ∀y ∈ Y})
For multi-label classification,
y˜ = {y | p(y|x) > 0.5, ∀y ∈ Y}
These are binary classifiers, and you can have N such binary classifiers, one per class; a minimal sketch of this formulation follows below.
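To make this concrete, here is a minimal sketch of the per-label binary formulation, assuming a HuggingFace BERT backbone. The class name DescriptionBinaryClassifier, the hidden size of 768 and the example template are my own illustrative choices, not from the paper.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class DescriptionBinaryClassifier(nn.Module):
    def __init__(self, model_name="bert-base-uncased", hidden_dim=768):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)
        # two-layer head: p(y|x) = sigmoid(W2 ReLU(W1 h_[CLS] + b1) + b2)
        self.head = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, input_ids, attention_mask, token_type_ids):
        out = self.bert(input_ids=input_ids,
                        attention_mask=attention_mask,
                        token_type_ids=token_type_ids)
        h_cls = out.last_hidden_state[:, 0]                  # h_[CLS]
        return torch.sigmoid(self.head(h_cls)).squeeze(-1)   # value between 0 and 1

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = DescriptionBinaryClassifier()

# {[CLS]; qy; [SEP]; x}: the sentence-pair tokenizer builds [CLS] qy [SEP] x [SEP]
description = "the text is about sports"                     # hypothetical template qy
text = "The local team won the championship last night."
enc = tokenizer(description, text, return_tensors="pt")
prob = model(**enc)                                           # p(y|x) for this one label
print(prob.item())
```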
You may also formulate your multi-class problem as a single N-class classifier by concatenating all descriptions with the input x,
{[CLS1]; q1; [CLS2]; q2; …; [CLSN]; qN ; [SEP]; x} → fed to transformer → h[CLS1], h[CLS2], …, h[CLSN]
Do note though that the N-class classifier strategy cannot handle the multi-label classification case.
The probability of assigning class n to instance x is obtained by first mapping each h[CLSn] to a scalar and then applying a softmax over these scalars:
an = h^T · h[CLSn]
p(y = n|x) = exp(an) / Σm=1…N exp(am)
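Below is a rough sketch of how this single N-class variant could be wired, under the assumption that one [CLSn] marker per description has already been inserted into the input and that we know their positions. marker_positions and the learnable vector h used to map each h[CLSn] to a scalar are my own naming.

```python
import torch
import torch.nn as nn

class MultiCLSClassifier(nn.Module):
    def __init__(self, encoder, hidden_dim=768):
        super().__init__()
        self.encoder = encoder                           # e.g. a BertModel
        self.h = nn.Parameter(torch.randn(hidden_dim))   # maps each h_[CLSn] to a scalar

    def forward(self, input_ids, attention_mask, marker_positions):
        # marker_positions: (batch, N) indices of [CLS1] ... [CLSN] in the input
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        hidden = out.last_hidden_state                   # (batch, seq_len, hidden_dim)
        idx = marker_positions.unsqueeze(-1).expand(-1, -1, hidden.size(-1))
        h_cls = torch.gather(hidden, 1, idx)             # (batch, N, hidden_dim)
        a = h_cls @ self.h                               # a_n = h^T h_[CLSn], shape (batch, N)
        return torch.softmax(a, dim=-1)                  # p(y = n | x)
```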
There are primarily three strategies for constructing the label descriptions: the Template strategy (hand-crafted descriptions), the Extractive strategy and the Abstractive strategy.
The important thing to note here is that the goal is for the model to generate the most appropriate description of each class conditioned on the current text to classify, and the appropriateness of the generated descriptions should directly correlate with the final classification performance.
The Template Strategy is usually very labour intensive, and human-generated templates might be sub-optimal. So let’s have a deeper look at the Extractive and Abstractive strategies.
If you have worked on any text summarisation problem, you can build your intuition for the Extractive Strategy by thinking of SQuAD-style span extraction from the body of the text, and for the Abstractive Strategy by thinking of GPT-2/GPT-3-style generative text generation.
For each input x = {x1, · · · , xT}, the extractive model generates a description qyx for each class label y, where qyx is a substring of x. For different inputs x, the descriptions for the same class can be different. This is quite intuitive: when x changes, the base text from which extractions are made changes, and thus the descriptions change too.
It’s important to note that for a given input x, there might not be any sub-string that is appropriate as a description for a particular class y.
To deal with this problem, N dummy tokens are appended to x; if the extractive model picks a dummy token for a class, it falls back to using the hand-crafted template for that category as the description. Also note that, in this case, the hand-crafted templates are built using some regex heuristics rather than being completely manual.
Now you must be wondering: when does the reinforcement learning bit come in, and how does it actually help?
Reinforcement learning is used to back-propagate the signal indicating which span contributes how much to the classification performance. Let’s also introduce some typical reinforcement learning components here -
For each class label y, the action is to pick a text span {xis , · · · , xie} from x to represent qyx → we need the start and end indices of the span, ais,ie
For each class label y, the policy π defines the probability of selecting the start index is and the end index ie.
Each token xk within x is mapped to a representation hk using BERT.
Pstart(y, k) = exp(Wys · hk) / Σt=1…T exp(Wys · ht)
Pend(y, k) = exp(Wye · hk) / Σt=1…T exp(Wye · ht)
Each class y has a class-specific Wys and Wye
Probability of a text span with the starting index is and ending index ie being the description for class y, Pspan(y, ais,ie) is -
Pspan(y, ais,ie) = Pstart(y, is) × Pend(y, ie)
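Here is a small sketch of how these start/end/span probabilities could be computed, assuming the token representations hk already come from BERT. The parameter layout and the names W_start, W_end and span_prob are mine, not the paper's.

```python
import torch
import torch.nn as nn

class ExtractivePolicy(nn.Module):
    def __init__(self, num_classes, hidden_dim=768):
        super().__init__()
        # one class-specific Wys and Wye per class y
        self.W_start = nn.Parameter(torch.randn(num_classes, hidden_dim))
        self.W_end = nn.Parameter(torch.randn(num_classes, hidden_dim))

    def forward(self, token_reps, y):
        # token_reps: (T, hidden_dim) BERT representations h_k of the tokens of x
        p_start = torch.softmax(token_reps @ self.W_start[y], dim=0)  # Pstart(y, k)
        p_end = torch.softmax(token_reps @ self.W_end[y], dim=0)      # Pend(y, k)
        return p_start, p_end

def span_prob(p_start, p_end, i_s, i_e):
    # Pspan(y, a_{is,ie}) = Pstart(y, is) * Pend(y, ie)
    return p_start[i_s] * p_end[i_e]
```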
Given x and the description qyx, the classification model assigns a probability to the correct label for x; this probability is used as the reward to update both the classification model and the extractive model.
For multi-class classification, the reward is given by -
R(x, qyx for all y) = p(y = n|x), where n is the gold label for x
To find the optimal policy, the REINFORCE algorithm is used, which maximizes the expected reward Eπ[R(x, qy)]
For each generated description qyx and the corresponding x,
L = −Eπ[R(qyx, x)]
REINFORCE approximates the expectation above with samples drawn from the policy π, and the gradient used to update the parameters is given by -
∇L ≈ -Σi=1…B∇logπ(ais,ie|x, y)[R(qy)-b]
where, b denotes the baseline value, which is set to the average of all previous rewards.
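Here is a toy sketch of the resulting REINFORCE update for one sampled span, assuming the reward is the classifier's probability of the gold label and the baseline is a running average of previous rewards; all names here are my own.

```python
import torch

def reinforce_loss(log_prob_span, reward, baseline):
    # log_prob_span: log pi(a_{is,ie} | x, y) for the sampled span
    # reward:        R(qy), e.g. p(y = gold | x) from the classifier (detached)
    # baseline:      b, the average of previous rewards
    return -(reward - baseline) * log_prob_span

# Usage inside a training step (p_start, p_end come from the extractive policy):
# i_s = torch.multinomial(p_start, 1)
# i_e = torch.multinomial(p_end, 1)
# log_prob = torch.log(p_start[i_s]) + torch.log(p_end[i_e])
# loss = reinforce_loss(log_prob, reward.detach(), baseline)
# loss.backward()
```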
So, the Extractive Policy is initialized to generate sub-strings of x as descriptions. Then the extractive model and the classification model are jointly trained based on the reward.
For each class label y, the action is to generate the description qyx = {q1, · · · , qL} defined by pθ
And the policy, PSEQ2SEQ is given by -
PSEQ2SEQ(qy|x) = Πi=1..Lpθ(qi|q<i, x, y)
where q<i denotes all the already generated tokens
PSEQ2SEQ(qy|x) for different classes y shares the same structure and parameters, with the only difference being that a class-specific embedding hy is appended to each source and target token.
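As a small illustration of this factorisation, here is how the log of PSEQ2SEQ(qy|x) could be accumulated from per-step decoder outputs. step_logits is a hypothetical tensor of decoder logits; the class embedding hy is assumed to have been appended inside the decoder already.

```python
import torch

def sequence_log_prob(token_ids, step_logits):
    # token_ids:   (L,)  the generated description qy = {q1, ..., qL}
    # step_logits: (L, vocab_size) decoder logits for each generation step
    log_probs = torch.log_softmax(step_logits, dim=-1)
    # log p_theta(q_i | q_<i, x, y) for each generated token
    picked = log_probs[torch.arange(len(token_ids)), token_ids]
    return picked.sum()  # log PSEQ2SEQ(qy | x) = sum_i log p_theta(q_i | ...)
```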
In the abstractive setting we are generating free text, so the space of possible tokens at every step - let's call it the action space - is really big. A widely recognized challenge when training language models with RL is the resulting high variance, precisely because the action space is so huge.
So this cannot be handled with the basic REINFORCE algorithm alone. To deal with it, we use REGS (Reward for Every Generation Step), an alternative to Monte Carlo search for assigning per-step rewards (further reading)
Unlike standard REINFORCE training, in which the same reward is used to update the probability of all tokens within the description, REGS trains a discriminator that is able to assign rewards to partially decoded sequences.
∇L ≈ -Σi=1…L∇logπ(qi|q<i, hy)[R(q<i) - b(q<i)]
here, R(q<i) denotes the reward given the partially decoded sequence q<i as the description, and b(q<i) denotes the baseline (average of all previous rewards).
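Here is a toy sketch of the REGS-style update, where every prefix q<i carries its own reward and baseline. step_log_probs, step_rewards and step_baselines are my own names for these per-step quantities.

```python
import torch

def regs_loss(step_log_probs, step_rewards, step_baselines):
    # step_log_probs: (L,) log pi(q_i | q_<i, h_y) for the sampled tokens
    # step_rewards:   (L,) R(q_<i), the reward for each partially decoded prefix
    # step_baselines: (L,) running averages of previous per-step rewards
    advantages = (step_rewards - step_baselines).detach()
    return -(advantages * step_log_probs).sum()
```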
The generative policy PSEQ2SEQ is initialized using a pretrained encoder-decoder, with the input being x and the output being the template descriptions. Then the description-generation model and the classification model are jointly trained based on the reward.