Named Entity Recognition with PyTorch Transformers

What if I told you that you can develop a state-of-the-art Natural Language Processing (NLP) system to do Language Generation, Question Answering or Named Entity Recognition with only a few lines of code? Sounds too good to be true?

Thanks to the folks at HuggingFace, this is now a reality, and top-performing language representation models have never been so easy to use for virtually any downstream NLP task. HuggingFace’s Transformers Python library lets you use any pre-trained model such as BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet or CTRL and fine-tune it for your task. This is truly the golden age of NLP!

In this post, I will show how to use the Transformers library for the Named Entity Recognition task. You can access the code for this post in the dedicated GitHub repository.

Transformers, a new NLP era!

Following the progress in general deep learning research, Natural Language Processing (NLP) has taken enormous leaps in the last two years: OpenAI’s GPT, Google’s BERT, Google and Carnegie Mellon University’s XLNet, Facebook’s RoBERTa and SpanBERT. These models belong to a new generation of NLP models using Transformer architectures that can learn long-term dependencies in text. As explained in the excellent post at MLexplained: “a transformer is an encoder-decoder architecture model which uses attention mechanisms to forward a more complete picture of the whole sequence to the decoder at once rather than sequentially”.

Moreover, they support transfer learning: models pre-trained on huge text corpora are made available so that the rest of the community can easily fine-tune them on specific tasks. This means that anyone can benefit from the latest research on language representation model architectures and avoid the expensive step of pre-training on a corpus like Wikipedia (e.g. four days on 4 to 16 Cloud TPUs for BERT pre-training). This is the new NLP supply chain, as illustrated by Han Xiao in the figure below!

The new NLP supply chain – Han Xiao, source:

In a previous post, we showed the potential of the BERT model on the sentence classification task. Let’s see how difficult it is to perform Named Entity Recognition (NER) using several top-performing models.

Disease and Chemical Extraction

In the biomedical field, diseases, genes and chemicals are among the most searched entities. That’s why the BioCreative challenge (a challenge for evaluating text mining and information extraction systems applied to the biological domain) proposed a task for disease and chemical extraction in 2015. We will reuse the annotated data they published, called “BC5CDR”, to train and evaluate our transformer models.

The figure below, taken from the website paperswithcode, shows the supremacy of these models on the BioCreative benchmark dataset we are about to experiment with.

Similarly to recent papers, we are using the pre-processed version of the dataset made available along with the paper entitled “A Neural Network Multi-Task Learning Approach to Biomedical Named Entity Recognition” by Crichton et al. In particular we are using the BC5CDR-IOB data pre-formatted in the IOB2 tagging scheme (short for inside, outside, beginning). Each word token is tagged with one of the following 5 labels: O, B-Disease, I-Disease, B-Chemical or I-Chemical. An “O” tag indicates that a token belongs to no chunk. The B- prefix before a tag indicates that the tag is the beginning of a chunk, and an I- prefix before a tag indicates that the tag is inside a chunk.

The dataset is composed of 9,141 training sentences and 4,797 testing sentences. Here is an example sentence from the training data:

Torsade B-Disease
de I-Disease
pointes I-Disease
ventricular B-Disease
tachycardia I-Disease
during O
low O
dose O
intermittent O
dobutamine B-Chemical
treatment O
in O
a O
patient O
with O
dilated B-Disease
cardiomyopathy I-Disease
and O
congestive B-Disease
heart I-Disease
failure I-Disease
. O

In this example, 4 diseases are tagged: Torsade de pointes, ventricular tachycardia, dilated cardiomyopathy and congestive heart failure; 1 chemical annotation is present: dobutamine.
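
To make the tagging scheme concrete, here is a minimal sketch of how IOB2 tags can be grouped back into entity mentions. The function name iob2_to_entities is my own, not something from the repository, and the handling of a stray I- tag (it is simply ignored) is an assumption of this sketch.

```python
from typing import List, Optional, Tuple


def iob2_to_entities(tokens, tags) -> List[Tuple[str, str]]:
    """Group IOB2-tagged tokens into (entity text, entity type) pairs."""
    entities: List[Tuple[str, str]] = []
    chunk: List[str] = []
    chunk_type: Optional[str] = None
    for token, tag in zip(tokens, tags):
        # An I- tag of the same type extends the chunk currently open.
        if tag.startswith("I-") and chunk and tag[2:] == chunk_type:
            chunk.append(token)
            continue
        # Any other tag closes the open chunk, if there is one.
        if chunk:
            entities.append((" ".join(chunk), chunk_type))
            chunk, chunk_type = [], None
        # A B- tag opens a new chunk; "O" (or a stray I-) opens nothing.
        if tag.startswith("B-"):
            chunk, chunk_type = [token], tag[2:]
    if chunk:
        entities.append((" ".join(chunk), chunk_type))
    return entities


# The training sentence shown above, one "token tag" pair per line.
sample = """Torsade B-Disease
de I-Disease
pointes I-Disease
ventricular B-Disease
tachycardia I-Disease
during O
low O
dose O
intermittent O
dobutamine B-Chemical
treatment O
in O
a O
patient O
with O
dilated B-Disease
cardiomyopathy I-Disease
and O
congestive B-Disease
heart I-Disease
failure I-Disease
. O"""

tokens, tags = zip(*(line.split() for line in sample.splitlines()))
entities = iob2_to_entities(tokens, tags)
for text, entity_type in entities:
    print(entity_type, "->", text)
```

Running it on the sentence above recovers the four disease mentions and the single chemical mention listed in the text.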

The NER task is a multi-class classification problem where, for each token, the model provides a probability for each of the 5 classes (“O”, “B-Disease”, “I-Disease”, “B-Chemical”, “I-Chemical”). To make the BERT model suited for the NER task, we add a token classification head on top of the BERT model, consisting of a softmax layer. We use the multi-class cross-entropy loss as the objective function.
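
As a rough illustration of what the softmax head and the loss compute, here is a dependency-free sketch in plain Python. The logits are made-up numbers, not model outputs, and in practice the deep learning framework does this for you on tensors; the point is only to show the arithmetic for one sentence.

```python
import math
from typing import List

LABELS = ["O", "B-Disease", "I-Disease", "B-Chemical", "I-Chemical"]


def softmax(logits: List[float]) -> List[float]:
    """Turn one token's raw scores into a probability per label."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(token_logits: List[List[float]], gold: List[str]) -> float:
    """Mean negative log-probability of the gold label over all tokens."""
    losses = []
    for logits, label in zip(token_logits, gold):
        probs = softmax(logits)
        losses.append(-math.log(probs[LABELS.index(label)]))
    return sum(losses) / len(losses)


# Made-up scores for two tokens, e.g. "low" and "dobutamine".
token_logits = [
    [4.0, 0.5, 0.2, 0.1, 0.1],  # strongly favours "O"
    [0.1, 0.2, 0.1, 3.5, 0.3],  # strongly favours "B-Chemical"
]
gold = ["O", "B-Chemical"]
loss = cross_entropy(token_logits, gold)
print(f"loss = {loss:.3f}")
```

A model that knows nothing and gives every label the same score would get a loss of log(5) per token, so anything well below that means the head is learning.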

We are using the seqeval script to perform our evaluation (this is the same evaluation as used for the CoNLL NER benchmark dataset).
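
For readers unfamiliar with this kind of evaluation, the sketch below shows the idea of exact-match, entity-level scoring: a predicted entity only counts if its boundaries and type both match the ground truth. It is a simplified stand-in written for this post, not the seqeval code itself, which handles more edge cases; the function names are my own.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int, str]  # (start index, end index exclusive, entity type)


def extract_spans(tags: List[str]) -> Set[Span]:
    """Collect the entity spans encoded by one sentence's IOB2 tags."""
    spans: Set[Span] = set()
    start, entity_type = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing chunk
        continues = tag.startswith("I-") and entity_type == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, entity_type))
            start, entity_type = None, None
        if tag.startswith("B-"):
            start, entity_type = i, tag[2:]
    return spans


def entity_f1(gold: List[List[str]], pred: List[List[str]]) -> float:
    """Micro-averaged F1 over exactly matching entity spans."""
    true_positives = n_gold = n_pred = 0
    for gold_tags, pred_tags in zip(gold, pred):
        gold_spans = extract_spans(gold_tags)
        pred_spans = extract_spans(pred_tags)
        true_positives += len(gold_spans & pred_spans)
        n_gold += len(gold_spans)
        n_pred += len(pred_spans)
    precision = true_positives / n_pred if n_pred else 0.0
    recall = true_positives / n_gold if n_gold else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# The prediction finds the chemical but truncates the two-token disease.
gold = [["B-Disease", "I-Disease", "O", "B-Chemical"]]
pred = [["B-Disease", "O", "O", "B-Chemical"]]
score = entity_f1(gold, pred)
print(score)
```

Note how the truncated disease mention earns no partial credit: one of two gold entities is matched and one of two predictions is correct, so precision, recall and F1 are all 0.5.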

Let’s experiment with several flavours of the BERT model, namely Google’s BERT, AllenAI’s SciBERT and Facebook’s SpanBERT. Both BERT and SpanBERT are trained on English Wikipedia articles, whereas SciBERT is trained on scientific articles.

You can find in the README file the instructions to set up your environment and download the various pre-trained models. The main code is in and takes care of loading the pre-trained model and its architecture; using the model’s vocabulary to efficiently tokenize the word tokens we have in our train and test sets; and adding the softmax head layer for the NER task. Provided you have a TokenClassification class for other models, you can then run the same code on another model just by changing its name in the script parameter!
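
One detail worth illustrating is what happens to the labels when a word is split into several sub-word pieces by the model’s vocabulary. The sketch below uses a toy greedy longest-match splitter and a tiny invented vocabulary, so it is only an approximation of what the real tokenizer does. Giving the word’s label to the first piece and a placeholder label “X” to the remaining pieces is a common convention that I am assuming here, not something taken from the repository.

```python
from typing import List, Set, Tuple

CONTINUATION_LABEL = "X"  # hypothetical placeholder, to be ignored by the loss


def wordpiece(word: str, vocab: Set[str]) -> List[str]:
    """Greedy longest-match-first split of one word, WordPiece style."""
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece fits: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


def align_labels(
    words: List[str], labels: List[str], vocab: Set[str]
) -> Tuple[List[str], List[str]]:
    """Tokenize each word and spread its label over the resulting pieces."""
    sub_tokens: List[str] = []
    sub_labels: List[str] = []
    for word, label in zip(words, labels):
        pieces = wordpiece(word, vocab)
        sub_tokens.extend(pieces)
        sub_labels.extend([label] + [CONTINUATION_LABEL] * (len(pieces) - 1))
    return sub_tokens, sub_labels


# Tiny invented vocabulary; a real model ships tens of thousands of entries.
toy_vocab = {"do", "##but", "##amine", "treatment"}
sub_tokens, sub_labels = align_labels(
    ["dobutamine", "treatment"], ["B-Chemical", "O"], toy_vocab
)
print(list(zip(sub_tokens, sub_labels)))
```

With this toy vocabulary “dobutamine” is split into three pieces, and only the first one carries the B-Chemical label.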

We fine-tune the model for 3 iterations. With too few iterations the model will not be specialized enough for our task; with too many we hit the “catastrophic forgetting” problem, where we lose the benefit of the language representation learned by the pre-trained model.

Looking at the sentence lengths in our data (figure below), we can see that our sentences contain at most 226 word tokens. Therefore we set the max_sequence_length parameter to 256.
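
The length check itself is easy to reproduce. The sketch below reads sentences in the “token tag” format shown earlier (blank line between sentences) and reports the longest one. Rounding up to the next power of two is my guess at how 226 became 256, and the helper names are my own. Keep in mind that this counts word tokens, as the text does; sub-word tokenization and special tokens can make the sequence seen by the model longer.

```python
from typing import List


def read_sentences(text: str) -> List[List[str]]:
    """Split 'token tag' lines into sentences separated by blank lines."""
    sentences: List[List[str]] = []
    current: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            if current:
                sentences.append(current)
                current = []
            continue
        current.append(line.split()[0])  # keep the token, drop the tag
    if current:
        sentences.append(current)
    return sentences


def smallest_power_of_two_at_least(n: int) -> int:
    """Round n up to the next power of two (assumed rounding rule)."""
    size = 1
    while size < n:
        size *= 2
    return size


# Two short sentences standing in for the real training file.
toy_data = """dobutamine B-Chemical
treatment O

dilated B-Disease
cardiomyopathy I-Disease
. O
"""

sentences = read_sentences(toy_data)
longest = max(len(sentence) for sentence in sentences)
print(longest, smallest_power_of_two_at_least(longest))
print(smallest_power_of_two_at_least(226))
```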

The results of our experiments are presented in the table below.

Model     Parameters                            Global F1-Score  Disease F1-Score  Chemical F1-Score
BERT      base-cased, max_sequence_length=256   0.871            0.816             0.917
SciBERT   large-cased, max_sequence_length=256  0.904            0.863             0.938
SpanBERT  large-cased, max_sequence_length=256  0.858            0.809             0.890

All three models perform very well (comparable to the performance reported on the paperswithcode dashboard). Still, pre-training the model on scientific articles (closer to our application data) leads to better results, with an F1 score of 0.904.

Let’s look at some errors made by the best-performing model, SciBERT. We circle the ground truth annotations in green. The model’s predictions have an orange background for disease annotations and a green one for chemical annotations.

An example where our model has failed is:

Myotonia congenita(MC) is caused by a defect in the skeletal muscle chloride channel function, which may cause sustained membrane depolarisation.

Although the abbreviation “MC” may be difficult for the model to disambiguate, the last disease tag seemed easier to spot as it is preceded by “cause”. In another example below, it is not obvious that the prediction is incorrect (it does not appear in the ground truth). I might well have tagged “multiple congenital anomalies” as a disease myself. This shows the importance and difficulty of constructing a high-quality, consensual gold set.

We report a newborn infant with multiple congenital anomalies (anotia and Taussig-Bing malformation) due to exposure to isotretinoin within the first trimester.