AI or not AI? Classifying ArXiv articles with BERT

Things are evolving faster and faster in the NLP world. We can’t go 6 months without someone releasing a new language representation model that breaks records on major downstream benchmarks. Take, for example, the Named Entity Recognition task illustrated in the picture below. Behind the recent breakthroughs (since 2017) are new paradigms such as transfer learning (enabled by the availability of models pre-trained on very large datasets) and the Transformer architecture.

Performance on Named Entity Recognition (NER) on CoNLL-2003 (English) over time

In this post we will use the state-of-the-art language model BERT, proposed in October 2018 by Jacob Devlin et al., to classify arXiv articles as AI or not AI. To do so, we will use a pre-trained BERT model to compute sentence and paragraph embeddings for each paper’s title and abstract. We will then train a binary classifier on a subset of articles with their domain labels and test its performance on a held-out set of articles whose domain labels are hidden from the classifier.

Understanding BERT

While previous NLP models like word2vec mapped each word to a fixed vector, recent models such as ULMFiT, ELMo, BERT or XLNet consider a word in its context. They assign a vector to each word based on all of the surrounding words and are therefore able to distinguish between the various meanings of a word. For example, the word “lie” will have a different vector representation in the sentences “I promised my friends that I would never lie to them” and “The town lies south of the river”. We will take advantage of this finer-grained semantic understanding of sentences to determine the application domain of an article.

For those of you who want details on BERT and want to understand its architecture in simple words, I suggest this great post from mlexplained.

The ArXiv Dataset

Like in our previous post using node2vec, I will use the ArXiv dataset published on Kaggle. The dataset contains metadata about 41,000 papers published between 1992 and 2018. Altogether, the articles were written by almost 60,000 distinct authors. Authors submit their article to one or more categories offered by ArXiv. The 11 most frequently used tags in this dataset are:

| Tag | Tag label | Count | Percentage of articles with this tag |
|---|---|---|---|
| cs.CV | Computer Vision and Pattern Recognition | 13,902 | 33.91% |
| cs.LG | Machine Learning | 13,735 | 33.50% |
| cs.AI | Artificial Intelligence | 10,481 | 25.56% |
| stat.ML | Machine Learning | 10,326 | 25.19% |
| cs.CL | Computation and Language | 6,417 | 15.65% |
| cs.NE | Neural and Evolutionary Computing | 3,819 | 9.31% |
| cs.IR | Information Retrieval | 1,443 | 3.52% |
| math.OC | Optimization and Control | 1,020 | 2.49% |
| cs.RO | Robotics | 973 | 2.37% |
| cs.LO | Logic in Computer Science | 643 | 1.57% |
| cs.SI | Social and Information Networks | 639 | 1.56% |

There are a couple of things to note here beyond the expected long tail distribution of the tags:

  • Although this may be obvious, an article may have one or more tags. In our dataset the mean number of tags per article is 1.97.
  • ArXiv’s tag taxonomy is only two levels deep (e.g. “Computer Science” / “Artificial Intelligence” or “Computer Science” / “Computer Vision and Pattern Recognition”). Although papers in “Computer Vision and Pattern Recognition” are to a large extent also part of “Artificial Intelligence”, these two tags co-occur in only 993 articles (7.14% of the Computer Vision papers). The full overlap for the “Artificial Intelligence” tag is illustrated in the diagram below (the “AI” tag is deliberately drawn out of proportion so that all the intersections are visible). Authors tend to tag their papers only with more specific tags and not co-tag them with more generic ones. This will have an impact on our classifier’s performance, as we will see.
  • The tags are assigned by the authors and are not manually curated by staff. This adds extra noise to the labels used for the classification task.
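
The statistics discussed above (tag counts, mean number of tags per article, co-occurrence of two tags) can be computed along these lines. This is only a minimal sketch: the `papers` list is a made-up toy sample, not the Kaggle data, and the function names are hypothetical. The last line also shows how a binary AI / not AI label could be derived from the tags.

```python
from collections import Counter

# Toy stand-in for the dataset: one list of tags per paper (made-up values).
papers = [
    ["cs.CV"],
    ["cs.CV", "cs.AI"],
    ["cs.LG", "stat.ML"],
    ["cs.AI", "cs.LO"],
    ["cs.CL", "cs.LG", "cs.AI"],
]


def tag_statistics(tag_lists):
    """Return (number of papers per tag, mean number of distinct tags per paper)."""
    counts = Counter(tag for tags in tag_lists for tag in set(tags))
    mean_tags = sum(len(set(tags)) for tags in tag_lists) / len(tag_lists)
    return counts, mean_tags


def co_occurrence(tag_lists, tag_a, tag_b):
    """Return (papers carrying both tags, share of tag_a papers that also carry tag_b)."""
    with_a = [tags for tags in tag_lists if tag_a in tags]
    both = sum(1 for tags in with_a if tag_b in tags)
    share = both / len(with_a) if with_a else 0.0
    return both, share


counts, mean_tags = tag_statistics(papers)
both, share = co_occurrence(papers, "cs.CV", "cs.AI")

print(counts.most_common(3))
print(f"mean tags per paper: {mean_tags:.2f}")
print(f"cs.CV and cs.AI together: {both} papers ({share:.0%} of cs.CV papers)")

# Binary target for the AI / not AI task: 1 if the paper carries the cs.AI tag.
labels = [int("cs.AI" in tags) for tags in papers]
```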

Extracting sentence and paragraph embeddings

While in an earlier post we looked at vector representations of graph nodes and edges, here we focus our attention on vector representations of text. Specifically, we are using BERT-as-a-service to extract:

  • A sentence embedding of the paper’s title,
  • A paragraph embedding of the paper’s abstract. To obtain the paragraph embedding, we split the abstract into sentences, get the embedding of each sentence, then average them to get a single vector representation of the abstract.
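
The averaging step can be sketched as follows. The sentence splitter is a naive regular expression and `fake_encode` is a hypothetical stand-in that returns random vectors so the example runs without a BERT server; both are assumptions of this sketch rather than what was used for the post. With a running server, the client's `encode` method could be passed in instead, as hinted in the comment.

```python
import re

import numpy as np


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def paragraph_embedding(text, encode):
    """Average the sentence embeddings returned by `encode`.

    `encode` takes a list of strings and returns an array of shape (n, dim).
    """
    sentences = split_sentences(text)
    if not sentences:
        raise ValueError("cannot embed an empty text")
    vectors = np.asarray(encode(sentences), dtype=float)
    return vectors.mean(axis=0)


def fake_encode(sentences, dim=768):
    """Hypothetical encoder: random vectors, not semantically meaningful."""
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(sentences), dim))


# With a running server, the encoder could instead be (untested here):
#   from bert_serving.client import BertClient
#   encode = BertClient().encode

abstract = "We propose a new model. It is evaluated on two benchmarks. Results improve."
title_vec = np.asarray(fake_encode(["a new model for text classification"]))[0]
abstract_vec = paragraph_embedding(abstract, fake_encode)
print(title_vec.shape, abstract_vec.shape)  # (768,) (768,)
```

The sketch works with flat vectors of length 768; adding a leading axis gives the (1, 768) shape used later in the post.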

The advantage of using BERT-as-a-service is its built-in multi-threaded server implementation and the possibility of loading any pre-trained or fine-tuned BERT model. By default, it takes the embedding of each token from the second-to-last hidden layer and averages them into a single sentence vector.

Google provides multiple pre-trained BERT models that can be reused straightaway or after fine-tuning. In our case we are using the uncased 12-layer base model, uncased_L-12_H-768_A-12, with a maximum length of 64 input tokens (the titles and the abstract sentences in our dataset don’t exceed 64 tokens).
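
To make the default pooling concrete, here is a small NumPy sketch of mean pooling over the second-to-last layer. The hidden states are random numbers standing in for real BERT activations, and the function name is hypothetical. Excluding padding positions through an attention mask is an assumption of this sketch about how the averaging is done.

```python
import numpy as np


def mean_pool_second_to_last(hidden_states, attention_mask):
    """Average the token vectors of the second-to-last layer, ignoring padding.

    hidden_states: array of shape (num_layers, seq_len, hidden_size)
    attention_mask: array of shape (seq_len,), 1 for real tokens and 0 for padding
    """
    layer = hidden_states[-2]  # second-to-last layer
    mask = np.asarray(attention_mask, dtype=float)[:, None]
    return (layer * mask).sum(axis=0) / mask.sum()


rng = np.random.default_rng(42)
# 12 layers, 64 token positions, 768 dimensions (random stand-in values).
hidden_states = rng.normal(size=(12, 64, 768))
# 10 real tokens followed by 54 padding positions.
attention_mask = np.array([1] * 10 + [0] * 54)

sentence_vec = mean_pool_second_to_last(hidden_states, attention_mask)
print(sentence_vec.shape)  # (768,)
```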

Training and evaluating a binary classifier

For each paper, we encode its title as an embedding of shape (1, 768). Similarly, we encode each sentence of its abstract as a 768-dimensional vector, then compute the mean of these vectors, thus obtaining a single vector representation of shape (1, 768).

The two vectors are concatenated into a single vector of shape (1, 1536). That vector is then fed to a single-layer neural network with a sigmoid activation that outputs the AI / not AI classification. A figure of the workflow is presented below.
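
As a rough illustration of what this classifier computes, the NumPy sketch below concatenates the two embeddings and applies one dense unit with a sigmoid. The weights are random and untrained, and all names are hypothetical. In Keras, the same forward pass would presumably correspond to a model made of a single `Dense(1, activation="sigmoid")` layer, but that mapping is my assumption.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def predict_proba(title_vecs, abstract_vecs, weights, bias):
    """Concatenate title and abstract embeddings, then apply one sigmoid unit."""
    features = np.concatenate([title_vecs, abstract_vecs], axis=1)  # (n, 1536)
    return sigmoid(features @ weights + bias)  # (n,) probabilities of "AI"


rng = np.random.default_rng(0)
title_vecs = rng.normal(size=(4, 768))      # stand-in title embeddings
abstract_vecs = rng.normal(size=(4, 768))   # stand-in abstract embeddings
weights = rng.normal(scale=0.01, size=1536)  # untrained, random parameters
bias = 0.0

probs = predict_proba(title_vecs, abstract_vecs, weights, bias)
print(probs.shape)  # (4,)
```

A single sigmoid unit on top of fixed features is equivalent to logistic regression on the 1536-dimensional vector.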

Our dataset has imbalanced classes: for every 3 “not AI” papers there is only 1 “AI” paper. To handle the imbalance, we use scikit-learn’s train_test_split function, stratified by class label, so that both splits keep the same class proportions. We then weight the classifier’s loss function in Keras using the ratio 1:3 to give more weight to the under-represented “AI” class.
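
A sketch of these two steps is shown below, on random stand-in data. The 20% test size, the random seed and the `weighted_bce` helper are assumptions of this example, not values from the experiments. The helper only illustrates the effect of class weights on a binary cross-entropy loss; with Keras one would typically pass the dictionary through the `class_weight` argument of `fit` instead.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 1536))      # random stand-in features
y = np.array([1] * 100 + [0] * 300)   # 1 "AI" paper for every 3 "not AI" papers

# Stratifying on y keeps the 1:3 class ratio in both splits.
# test_size=0.2 is an assumed value for illustration.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Errors on the minority class (label 1) count three times as much.
class_weight = {0: 1.0, 1: 3.0}


def weighted_bce(y_true, y_prob, class_weight, eps=1e-7):
    """Binary cross-entropy where each sample is scaled by its class weight."""
    y_true = np.asarray(y_true)
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    per_sample = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    weights = np.where(y_true == 1, class_weight[1], class_weight[0])
    return float(np.mean(weights * per_sample))


print(y_train.mean(), y_test.mean())
```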

We run the following 4 experiments:

  • Random baseline (weighted): a random baseline that predicts labels with the same class distribution as in the training set
  • Title: classifier using only the title embedding
  • Abstract: classifier using only the abstract embedding
  • Title+abstract: classifier using both title and abstract embeddings
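
The weighted random baseline can be sketched as below: each test paper is labeled “AI” with a probability equal to the share of “AI” papers in the training set. The label arrays are synthetic and the function name is hypothetical.

```python
import numpy as np


def weighted_random_baseline(y_train, n_test, seed=0):
    """Predict 1 with probability equal to the positive rate of the training set."""
    rng = np.random.default_rng(seed)
    positive_rate = float(np.mean(y_train))
    return (rng.random(n_test) < positive_rate).astype(int)


# Synthetic labels with the 1:3 class ratio described in the post.
y_train = np.array([1] * 2500 + [0] * 7500)
y_test = np.array([1] * 5000 + [0] * 15000)

y_pred = weighted_random_baseline(y_train, len(y_test))

true_pos = int(np.sum((y_pred == 1) & (y_test == 1)))
precision = true_pos / max(int(np.sum(y_pred == 1)), 1)
recall = true_pos / int(np.sum(y_test == 1))
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Because the predictions are independent of the true labels, precision and recall should both land close to the positive rate (about 0.25 with this class ratio).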

Our classifier is trained for 10 epochs using the Adam optimizer, the binary cross-entropy loss and a sigmoid activation function for the output layer. The results are presented in the table below:

| AI/not AI? | Precision | Recall | F1 | MCC | ROC-AUC |
|---|---|---|---|---|---|
| Random baseline (weighted) | 0.26 | 0.24 | 0.25 | 0.01 | 0.50 |
| Title | 0.48 | 0.75 | 0.59 | 0.42 | 0.74 |
| Abstract | 0.61 | 0.71 | 0.66 | 0.54 | 0.78 |
| Title+Abstract | 0.61 | 0.71 | 0.66 | 0.53 | 0.78 |
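
For reference, precision, recall, F1 and the Matthews correlation coefficient (MCC) can all be derived from the four cells of a confusion matrix, as in the sketch below. The counts are hypothetical values chosen only to illustrate the formulas (with a 1:3 class ratio); they are not the confusion matrix of the experiments. ROC-AUC is left out because it needs the predicted scores rather than hard decisions.

```python
import math


def binary_metrics(tp, fp, fn, tn):
    """Compute precision, recall, F1 and MCC from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "mcc": mcc}


# Hypothetical counts: 100 positive and 300 negative papers.
metrics = binary_metrics(tp=71, fp=45, fn=29, tn=255)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Unlike F1, MCC also takes the true negatives into account, which makes it a useful complement on imbalanced classes.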

Note that the abstract’s embedding provides better results than the embedding of titles. There is not much improvement by using the title and abstract embeddings together. The overall performance is better than random but not that high. Let’s have a look at the confusion matrix and 3 examples of false negatives and false positives, to see what’s happening:

Examples of False Negatives:
> “fast non-parametric tests of relative dependency and similarity” has been classified as non AI while according to the arXiv labels it should be AI
> “automatic differentiation variational inference” has been classified as non AI and should be AI
> “report: dynamic eye movement matching and visualization tool in neuro gesture” has been classified as non AI and should be AI

Examples of False Positives:
> “non-confluent nlc graph grammar inference by compressing disjoint subgraphs” has been classified as AI while according to the arXiv labels it should be non AI
> “the cyborg astrobiologist: testing a novelty-detection algorithm on two mobile exploration systems at rivas vaciamadrid in spain and at the mars desert research station in utah” has been classified as AI and should be non AI
> “complex-network analysis of combinatorial spaces: the nk landscape case” has been classified as AI and should be non AI

I would argue that some of the false positives should in fact be labeled AI. This could be explained by the fact that not all “AI” papers are tagged as such by their authors. Many authors prefer a more fine-grained tag such as “Computer Vision” and do not add the more generic tag “AI”. The shallow taxonomy of ArXiv does not help in this situation.

If this assumption is correct, we should get better results for a more focused domain. Let’s run the same analysis but for the binary classification of papers as “Computer Vision and Pattern Recognition” or not. We change the imbalance ratio to 1:2 and get the following results:

| CV/not CV? | Precision | Recall | F1 | MCC | ROC-AUC |
|---|---|---|---|---|---|
| Random baseline (weighted) | 0.35 | 0.34 | 0.35 | 0.02 | 0.51 |
| Title | 0.79 | 0.80 | 0.79 | 0.69 | 0.85 |
| Abstract | 0.75 | 0.94 | 0.83 | 0.74 | 0.89 |
| Title+Abstract | 0.84 | 0.89 | 0.86 | 0.79 | 0.90 |

These results are much better than for the broader “Artificial Intelligence” category. Also note that, this time, combining the title and abstract embeddings leads to better results than using either one alone.

BERT pre-trained embeddings can lead to good results for the classification of scientific papers. There are several alternatives we could try from here. We could try pre-training BERT on scientific articles, or fine-tuning BERT on the AI/not-AI task. We could even combine the text embeddings with graph node embeddings. But to make real progress on this problem we need a more consistently labeled dataset. We used the author-supplied arXiv labels because they were easy to get, not because they were the best possible labels for the task.