Querying text annotations at scale with SPARK

Whether you are analyzing textual data or building features from text, you will likely use text annotations. While many software libraries and resources exist to annotate documents, querying these annotations at scale remains non-trivial. Consider the text in the illustration above about the BRCA1 tumor suppressor gene, which is normally expressed in the cells of breast tissue. You may want to know:

  • Q1: Which diseases are mentioned with BRCA1 or BRCA2 genes in the document title?
  • Q2: Which cell lines occur within 30 characters before or after the BRCA1 or BRCA2 genes?
  • Q3: What are the genes co-occurring with the BRCA1 gene in the same sentence?

Answering these questions requires the following:

  • annotations about biomedical entities like gene, disease or cell line as well as document annotations such as title or sentence.
  • a mechanism to query these annotations using logical functions such as contains, before, or filters by annotation type or annotation properties.

In this post, we will see how to use the AnnotationQuery library to easily query annotations of PubMed articles using SPARK. We will leverage existing biomedical article annotations so that we can focus on the task of writing complex annotation queries.

The code for this post is available on GitHub.

About AnnotationQuery

AnnotationQuery is a library (available in Python and Scala) developed by Elsevier Labs to rapidly query annotation sets at scale with SPARK using composable logical functions.

Let’s have a glimpse at some function compositions using our example queries listed above.

Query example 1: Which diseases are mentioned with BRCA1 or BRCA2 genes in the document title?
Annotation sets: The query uses both biomedical annotations (disease, genes), which we name aqPUB in our code, and document annotations (title), named aqOM for Original Markup.
AnnotationQuery functions: ContainedIn(FilterType(aqPUB,"disease"),Contains(FilterType(aqOM,"title"), FilterProperty(aqPUB,"identifier",valueArr=Array("675","672")) ))

The query can be interpreted as follows: What are the diseases that are contained in a title that itself contains annotations whose property identifier has the value “675” or “672”? This function composition expresses a co-occurrence. While disease and title are both annotation types, BRCA1 and BRCA2 are values of the annotation type gene. We use their identifiers (672 and 675 respectively), stored as a property, to filter these annotations.
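To make the composition more concrete, here is a minimal in-memory sketch of the same co-occurrence logic in plain Scala. It is not the AnnotationQuery implementation: the Annot class, the lower-case function names and the end-exclusive offsets are assumptions made for illustration, and the toy data mimics the article title shown later in this post.

```scala
object CooccurrenceSketch {
  // Simplified stand-in for an annotation: document, type, character span, properties.
  final case class Annot(docId: String, annotType: String, start: Long, end: Long,
                         props: Map[String, String] = Map.empty)

  // Keep annotations of a given type.
  def filterType(annots: Seq[Annot], t: String): Seq[Annot] =
    annots.filter(_.annotType == t)

  // Keep annotations whose property `name` has one of the given values.
  def filterProperty(annots: Seq[Annot], name: String, values: Set[String]): Seq[Annot] =
    annots.filter(a => a.props.get(name).exists(values.contains))

  // True when `outer` covers `inner` in the same document.
  private def spans(outer: Annot, inner: Annot): Boolean =
    outer.docId == inner.docId && outer.start <= inner.start && outer.end >= inner.end

  // Left annotations that contain at least one right annotation.
  def contains(left: Seq[Annot], right: Seq[Annot]): Seq[Annot] =
    left.filter(l => right.exists(r => spans(l, r)))

  // Left annotations that lie inside at least one right annotation.
  def containedIn(left: Seq[Annot], right: Seq[Annot]): Seq[Annot] =
    left.filter(l => right.exists(r => spans(r, l)))

  // Toy data: one gene and one disease inside a 108-character title.
  val pub = Seq(
    Annot("doc1", "gene", 0, 5, Map("identifier" -> "672")),
    Annot("doc1", "disease", 94, 107, Map("identifier" -> "MESH:D001943")))
  val om = Seq(Annot("doc1", "title", 0, 108))

  // Same shape as query example 1.
  val q1: Seq[Annot] = containedIn(
    filterType(pub, "disease"),
    contains(filterType(om, "title"), filterProperty(pub, "identifier", Set("675", "672"))))
}
```

Reading the composition inside out helps: the inner contains keeps the titles that mention one of the two genes, and the outer containedIn keeps the diseases located in those titles.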

Query example 2: Which cell lines occur within 30 characters before or after the BRCA1 or BRCA2 genes?
Annotation sets: The query uses only biomedical annotations (cell lines and genes), named aqPUB in our code.
AnnotationQuery functions: Or(
Before(FilterType(aqPUB,"cellline"),FilterProperty(aqPUB,"identifier",valueArr=Array("675","672")), 30),
After(FilterType(aqPUB,"cellline"),FilterProperty(aqPUB,"identifier",valueArr=Array("675","672")), 30)
)

The query can be interpreted as follows: What are the cell lines occurring within 30 characters before BRCA1 or BRCA2 genes, or occurring within 30 characters after BRCA1 or BRCA2 genes?
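The proximity logic can be sketched in the same spirit. The snippet below is a plain-Scala illustration, not the library code: the Span class and function names are hypothetical, and we assume that "within 30 characters" means a gap of at most 30 characters between the two spans.

```scala
object ProximitySketch {
  // Hypothetical minimal span: document, label and end-exclusive character offsets.
  final case class Span(docId: String, label: String, start: Long, end: Long)

  // Left spans that end before some right span starts, with a gap of at most `dist`.
  def before(left: Seq[Span], right: Seq[Span], dist: Long): Seq[Span] =
    left.filter(l => right.exists(r =>
      l.docId == r.docId && l.end <= r.start && r.start - l.end <= dist))

  // Left spans that start after some right span ends, with a gap of at most `dist`.
  def after(left: Seq[Span], right: Seq[Span], dist: Long): Seq[Span] =
    left.filter(l => right.exists(r =>
      l.docId == r.docId && l.start >= r.end && l.start - r.end <= dist))

  // Union of two result sets without duplicates.
  def or(a: Seq[Span], b: Seq[Span]): Seq[Span] = (a ++ b).distinct

  // Toy spans (offsets are made up for the example).
  val genes = Seq(Span("d1", "BRCA2", 100, 105))
  val cellLines = Seq(
    Span("d1", "Capan-1", 120, 127),  // starts 15 characters after the gene ends
    Span("d1", "HEPG2", 40, 45),      // ends 55 characters before the gene starts
    Span("d2", "UACC3199", 110, 118)) // same region but in another document

  val near: Seq[Span] = or(before(cellLines, genes, 30), after(cellLines, genes, 30))
}
```

Only the first cell line qualifies here: the second one is too far away and the third one belongs to a different document.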

Query example 3: What are the genes co-occurring with the BRCA1 gene in the same sentence?
Annotation sets: The query uses both biomedical annotations (genes), named aqPUB in our code, and the sentence annotations from Stanford Core NLP, named aqSCNLP.
AnnotationQuery functions: ContainedIn(FilterType(aqPUB,"gene"),Contains(FilterType(aqSCNLP,"sentence"), FilterProperty(aqPUB,"identifier","672")) )

The query can be interpreted as follows: What are the genes contained in a sentence that itself contains the gene identified by the property identifier “672”?

Simple, right? There is no need to build an index or set up a search engine! And best of all, the code is natively built for scalability!

More functions are described in the documentation of the library. It is worth noting that the library works solely with the offsets, types and properties of the annotations, not with the text itself. Nonetheless, there are convenient functions like Hydrate or Concordancer to retrieve the annotated text and its surrounding context. We will use such functions later in this post.
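Since annotations only carry offsets, getting the text back is a matter of slicing the document string. The sketch below illustrates that idea in plain Scala; the hydrate function here is a hypothetical stand-in and not the library's Hydrate, and it assumes end-exclusive offsets. The sample string is the article title shown later in this post.

```scala
object HydrateSketch {
  // Return the text covered by [start, end) in the document string, clamped to its bounds.
  def hydrate(text: String, start: Int, end: Int): String = {
    val s = math.max(0, math.min(start, text.length))
    val e = math.max(s, math.min(end, text.length))
    text.substring(s, e)
  }

  val title =
    "BRCA1 promoter methylation in peripheral blood is associated with the risk of triple-negative breast cancer."

  // Gene annotation: offset 0, length 5. Disease annotation: offset 94, length 13.
  val gene: String = hydrate(title, 0, 0 + 5)
  val disease: String = hydrate(title, 94, 94 + 13)
}
```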

Data preparation

Before being able to run our queries, we first need to get the data in the right shape. This process is described in the illustration below:

We use the PubTator APIs to access PubMed articles (title and abstract) along with their biomedical annotations. PubTator uses state-of-the-art named entity recognition tools in the domain to detect entities of type Gene, Chemical, Cell line, Disease, Species and Mutation. This constitutes our first annotation set – pubtator. Each article comes with metadata like the year of publication and some formatting (title, abstract passages). This formatting will be the basis for a second annotation set – original markup.

Our last annotation set – scnlp – will be the identification of sentences in the text. Those annotations will be generated using Stanford Core NLP.

Each of the blue boxes in the illustration above corresponds to a Scala App object that is detailed below. You can alternatively follow along with the code files.


The process starts with a list of PubMed article IDs we are interested in processing (stored in ./data/keys). For each article, we query PubTator and store the XML response in the ./data/xml folder.

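val results_rawXML = sc.textFile(keyFile).repartition(numParts)
    .mapPartitions(keyIter => {
      keyIter.map(key => {
        try {
          // for each key, make a GET request to pubtator
          val url = "" + key
          // read the response body as a string (assumed: a plain GET via scala.io.Source)
          val rawXML = scala.io.Source.fromURL(url, "UTF-8").mkString
          // store the file in the raw mount
          FileUtils.writeStringToFile(new File(rawXMLMnt + key), rawXML, "UTF-8")
        } catch {
          case e: Exception => {
            // handler body not shown in the source
          }
        }
      })
    })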

Let’s take a look at the XML we get for the first article (for conciseness, we omit the abstract passage which is structured the same way as the title passage).

<?xml version='1.0' encoding='UTF-8'?><!DOCTYPE collection SYSTEM 'BioC.dtd'>
      <infon key="journal">Int. J. Cancer; 2019 Aug 301 doi:10.1002/ijc.32655</infon>
      <infon key="year">2019</infon>
      <infon key="type">title</infon>
      <infon key="authors">Prajzendanc K, Domagała P, Hybiak J, Ryś J, Huzarski T, Szwiec M, Tomiczek-Szwiec J, Redelbach W, Sejda A, Gronwald J, Kluz T, Wiśniowski R, Cybulski C, Łukomska A, Białkowska K, Sukiennicki G, Kulczycka K, Narod SA, Wojdacz TK, Lubiński J, Jakubowska A, </infon>
      <text>BRCA1 promoter methylation in peripheral blood is associated with the risk of triple-negative breast cancer.</text>
      <annotation id="2">
        <infon key="identifier">672</infon>
        <infon key="type">Gene</infon>
        <infon key="NCBI Homologene">5276</infon>
        <location length="5" offset="0"/>
        <text>BRCA1</text>
      </annotation>
      <annotation id="3">
        <infon key="Identifier">MESH:D001943</infon>
        <infon key="type">Disease</infon>
        <location length="13" offset="94"/>
        <text>breast cancer</text>
      </annotation>

The document is identified by an id and composed of two passages: one of type title including some document-level metadata like year of publication and one of type abstract. Each passage has a text, an offset based on the document text and a set of annotations with their offset, length, type, identifier and original text. We now have to separate the text from the annotations to be able to manipulate them individually.
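To illustrate what "separating" an annotation means, here is a small sketch that reads one annotation element with the JDK's DOM parser and turns its offset and length into a start and end offset. This is only an illustration: the pipeline in this post uses XQuery for that step, and the Loc class, the end-exclusive end offset and the lower-casing of infon keys are assumptions of this sketch.

```scala
import java.io.ByteArrayInputStream
import javax.xml.parsers.DocumentBuilderFactory
import org.w3c.dom.Element

object BioCLocationSketch {
  // Hypothetical flattened view of one annotation.
  final case class Loc(annotType: String, identifier: String, start: Long, end: Long)

  def parseAnnotation(xml: String): Loc = {
    val doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
      .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")))

    // Collect the <infon key="...">value</infon> pairs, with lower-cased keys.
    val infons = doc.getElementsByTagName("infon")
    val props = (0 until infons.getLength).map { i =>
      val el = infons.item(i).asInstanceOf[Element]
      el.getAttribute("key").toLowerCase -> el.getTextContent
    }.toMap

    // <location offset=".." length=".."/> gives the span of the annotation.
    val loc = doc.getElementsByTagName("location").item(0).asInstanceOf[Element]
    val start = loc.getAttribute("offset").toLong
    Loc(props("type"), props("identifier"), start, start + loc.getAttribute("length").toLong)
  }

  val sample: String =
    """<annotation id="3"><infon key="identifier">MESH:D001943</infon><infon key="type">Disease</infon><location length="13" offset="94"/><text>breast cancer</text></annotation>"""

  val parsed: Loc = parseAnnotation(sample)
}
```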

For each XML file, we use XQuery to extract the text string, the original document markup (whole document, title and abstract) and the PubTator annotations:

  • ./data/str contains the string content of the document stripped of any annotation (all annotation offsets reference this text)
  • ./data/pubtator contains the pubtator annotations including Gene, Disease, Chemical, Mutation, Species and CellLine
  • ./data/om contains the original markup of the document including Document, Title and Abstract.

Below is the XQuery that creates the document text (stripped of any annotation or structure) from the original XML file. spark-xml-utils is used to run the XML transformation on SPARK. Note that PubTator considers that there is an extra character between the title and the abstract, which is why a single space is inserted between the two.

val xquery_str = """
declare default element namespace "";

let $titlePassage := /collection/document/passage[infon[@key="type"] = "title"]
let $abstractPassage := /collection/document/passage[infon[@key="type"] = "abstract"]

return
  concat(string($titlePassage/text), " ",  string($abstractPassage/text))
"""
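The effect of that extra character on offsets can be checked with a few lines of plain Scala. This is a sketch with made-up strings and hypothetical helper names; it only mirrors the concatenation performed by the XQuery above.

```scala
object DocTextSketch {
  // Join title and abstract with one separator character, like the concat in the XQuery.
  def docText(title: String, abstractText: String): String =
    title + " " + abstractText

  // Offset at which the abstract starts in the joined text.
  def abstractOffset(title: String): Int = title.length + 1

  // Toy strings, not taken from a real article.
  val title = "A short title."
  val abstractText = "The abstract starts here."
  val text: String = docText(title, abstractText)

  // Slicing the joined text from the abstract offset gives the abstract back.
  val recovered: String = text.substring(abstractOffset(title))
}
```

Any abstract annotation whose offset is expressed against the whole document therefore lines up with this string only if the separator is kept.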

AnnotationQuery expects the annotations to come with the following structure:

case class CATAnnotation(
  docId: String, // Document Id
  annotSet: String, // Annotation set (such as scnlp, pubtator)
  annotType: String, // Annotation type (such as sentence, gene)
  startOffset: Long, // Starting offset for the annotation
  endOffset: Long, // Ending offset for the annotation
  annotId: Long, // Annotation Id (needs to be unique)
  other: Option[String] = None) // Contains any attributes (name-value pairs ampersand delimited)

To prepare for that formatting, we once again leverage XQuery to create annotations as a caret-separated file with the exact same fields. The execution code is then pretty straightforward:

val results = sc.textFile(keyFile).repartition(numParts)
     .mapPartitions(keyIter => {
       // one XQuery processor per transformation, created once per partition
       val proc_pubtator = XQueryProcessor.getInstance(xquery_pubtator)
       val proc_om = XQueryProcessor.getInstance(xquery_om)
       val proc_str = XQueryProcessor.getInstance(xquery_str)
       keyIter.map(key => {
         try {
           val rawXML = FileUtils.readFileToString(new File(rawXMLMnt + key), "UTF-8")
           // Remove DocType declaration
           val cleanXML = rawXML.replaceAll("<!DOCTYPE(.)*><collection","<collection")
           val annot_pubtator = proc_pubtator.evaluateString(cleanXML)
           FileUtils.writeStringToFile(new File(pubtatorAnnotMnt + key), annot_pubtator, "UTF-8")
           val annot_om = proc_om.evaluateString(cleanXML)
           FileUtils.writeStringToFile(new File(omAnnotMnt + key), annot_om, "UTF-8")
           val annot_str = proc_str.evaluateString(cleanXML)
           FileUtils.writeStringToFile(new File(strMnt + key), annot_str, "UTF-8")
         } catch {
           case e: Exception => {
             // handler body not shown in the source
           }
         }
       })
     })

Let’s take a look at the first 5 PubTator annotations for the first PubMed article:



This app uses Stanford Core NLP to annotate the sentences contained in each article text. The annotations are then stored in ./data/scnlp.

val results = sc.textFile(keyFile).repartition(numParts)
    .mapPartitions(keyIter => {
      // Create the SCNLP pipeline once per partition and reuse it for every key
      val props: Properties = new Properties()
      props.put("annotators", "tokenize, ssplit")
      val pipeline: StanfordCoreNLP = new StanfordCoreNLP(props)
      keyIter.map(key => {
        try {
          val rawStr = FileUtils.readFileToString(new File(strMnt + key), "UTF-8")

          // get the sentences contained in the raw string (title and abstract from pubmed)
          val scnlp_annotation: Annotation = pipeline.process(rawStr)
          val sentences = scnlp_annotation.get(classOf[SentencesAnnotation]).asScala.toList

          // write annotations in the caret format, e.g. 1^scnlp^sentence^0^1635^origAnnotID=1
          // (array contents are an assumed reconstruction: id, set, type, start, end, original id)
          val annotations = (for {(sentence: CoreMap, idx) <- sentences.zipWithIndex} yield (Array(
            (idx + 1).toString, "scnlp", "sentence",
            sentence.get(classOf[CharacterOffsetBeginAnnotation]).toString,
            sentence.get(classOf[CharacterOffsetEndAnnotation]).toString,
            "origAnnotID=" + (idx + 1)).mkString("^"))).mkString("\n")

          FileUtils.writeStringToFile(new File(scnlpAnnotMnt + key), annotations, "UTF-8")

        } catch {
          case e: Exception => {
            // handler body not shown in the source
          }
        }
      })
    })
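For readers who want to try the sentence-to-caret-line step without installing Stanford Core NLP, here is a rough stand-in that uses the JDK's java.text.BreakIterator instead. It is a different and much simpler sentence splitter, so its boundaries will not always match Core NLP; the function name and the choice to trim trailing whitespace from each sentence are assumptions of this sketch.

```scala
import java.text.BreakIterator
import java.util.Locale

object SentenceOffsetsSketch {
  // Split `text` into sentences and emit one caret-separated line per sentence:
  // id^scnlp^sentence^startOffset^endOffset^origAnnotID=id
  def sentenceLines(text: String): Seq[String] = {
    val it = BreakIterator.getSentenceInstance(Locale.US)
    it.setText(text)

    // Collect all sentence boundaries: the first one, then every following one.
    val first = it.first()
    val rest = Iterator.continually(it.next()).takeWhile(_ != BreakIterator.DONE).toList
    val bounds = first :: rest

    // Each pair of consecutive boundaries delimits one sentence.
    bounds.zip(bounds.drop(1)).zipWithIndex.map { case ((from, to), i) =>
      // BreakIterator keeps trailing whitespace with the sentence, so trim it off the end offset.
      val end = from + text.substring(from, to).replaceAll("\\s+$", "").length
      val id = i + 1
      s"${id}^scnlp^sentence^${from}^${end}^origAnnotID=${id}"
    }
  }

  val lines: Seq[String] = sentenceLines("First one. Second one.")
}
```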

The resulting annotation set is here again formatted in a caret-separated file. Here are the first 5 annotations for the first PubMed article:



This app stores each annotation set (om, pubtator and scnlp) in a parquet file. We use the class CATAnnotation provided by the AnnotationQuery library to make sure our parquet file complies with the expected format explained above.

for (annotSet <- annotSets) {
    val annotMnt = annotMntFolder + annotSet + "/"
    val parquetMnt = parquetMntFolder + annotSet

    // Get the annotations for each key and return (key, annotations)
    val annots = sc.textFile(keyFile).repartition(numParts).map(key => {
      (key, FileUtils.readFileToString(new File(annotMnt + key), "UTF-8"))
    })

    // Remove empty records, aborted records, ignored records
    val filteredAnnots = annots.filter(rec => rec._2.length > 0)
      .filter(rec => !rec._2.startsWith("***"))

    // FlatMap to get all the annotations for each record
    val catAnnotations = filteredAnnots.flatMap(rec => {
      val arr = rec._2.split("\n")
      val res = new ListBuffer[CATAnnotation]()
      for (i <- arr) {
        // caret-separated fields: annotId^annotSet^annotType^startOffset^endOffset[^other]
        val parts = i.split("\\^")
        val docId = rec._1
        val annotSet = parts(1)
        val annotType = parts(2)
        val startOffset = parts(3).toLong
        val endOffset = parts(4).toLong
        val annotId = parts(0).toLong
        var other: String = null
        if (parts.size == 6) {
          other = parts(5)
        }
        res += CATAnnotation(docId, annotSet, annotType, startOffset, endOffset, annotId,
          if (other != null) Some(other) else None)
      }
      res
    })

    import sqlContext.implicits._
    // Write the parquet file (assumed: the exact write call is not shown in the source)
    catAnnotations.toDS().write.parquet(parquetMnt)
}
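The parsing step can also be tried without Spark. The sketch below parses one caret-separated line with the same field order as the loop above and additionally splits the ampersand-delimited attributes into a map. The Row class and both function names are hypothetical, and no escaping or decoding of attribute values is attempted.

```scala
object CaretLineSketch {
  // Hypothetical mirror of the annotation structure described earlier.
  final case class Row(docId: String, annotSet: String, annotType: String,
                       startOffset: Long, endOffset: Long, annotId: Long,
                       other: Option[String] = None)

  // Field order in the file: annotId^annotSet^annotType^startOffset^endOffset[^other]
  def parseLine(docId: String, line: String): Row = {
    val parts = line.split("\\^")
    Row(docId, parts(1), parts(2), parts(3).toLong, parts(4).toLong, parts(0).toLong,
      if (parts.length == 6) Some(parts(5)) else None)
  }

  // Split "name=value&name=value" into a map; malformed pairs are ignored.
  def properties(other: Option[String]): Map[String, String] =
    other.toSeq
      .flatMap(_.split("&").toSeq)
      .flatMap { pair =>
        pair.split("=", 2) match {
          case Array(k, v) => Some(k -> v)
          case _           => None
        }
      }.toMap

  val row: Row = parseLine("doc1", "1^scnlp^sentence^0^1635^origAnnotID=1")
  val props: Map[String, String] = properties(row.other)
}
```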

AnnotationQuery results

Now that our data is in the right format, the moment has finally arrived to query our annotations. The Query app runs several scenarios querying the annotations with logical relations. Here we detail the ones corresponding to the three questions stated at the beginning of this post:

Which diseases are mentioned with BRCA1 or BRCA2 genes in the document title?

This first question makes use of pubtator and om annotation sets. The corresponding code is listed below:

val q1_annot = ContainedIn(FilterType(aqPUB,"disease"),Contains(FilterType(aqOM,"title"), FilterProperty(aqPUB,"identifier",valueArr=Array("675","672")) ))

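// assumed reconstruction of a garbled snippet: group disease mentions by identifier and count them
val q1 = q1_annot.select(Array("orig", "identifier").map(x => $"properties".getItem(x).alias(x)): _*)
          .groupBy("identifier")
          .agg(collect_set("orig") as "labels", count("*") as "count").orderBy(desc("count"))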

Top 5 results are:

This is in line with our expectations, as mutations of the BRCA1 and BRCA2 genes are associated with both breast and ovarian cancers.
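A "top 5" of this kind boils down to grouping the matched annotations by identifier, collecting their surface forms and counting. Here is a plain-Scala sketch of that aggregation on toy data; the identifiers "A" and "B", the topK function and the tie-breaking rule are all made up for the example and do not come from the query results.

```scala
object TopKSketch {
  // Input: (identifier, surface form) pairs, one per matched annotation.
  // Output: (identifier, distinct surface forms, number of matches), most frequent first.
  def topK(hits: Seq[(String, String)], k: Int): Seq[(String, Set[String], Int)] =
    hits.groupBy(_._1).toSeq
      .map { case (id, rows) => (id, rows.map(_._2).toSet, rows.size) }
      .sortBy { case (id, _, n) => (-n, id) } // by count descending, then identifier
      .take(k)

  // Toy hits with placeholder identifiers.
  val hits = Seq(
    "A" -> "breast cancer",
    "A" -> "Breast Cancer",
    "B" -> "ovarian cancer")

  val ranked: Seq[(String, Set[String], Int)] = topK(hits, 5)
}
```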

Which cell lines occur within 30 characters before or after the BRCA1 or BRCA2 genes?

The second question only uses the pubtator annotation set. The corresponding code goes like this:

val q2_annot = Or(
                  Before(FilterType(aqPUB,"cellline"),FilterProperty(aqPUB,"identifier",valueArr=Array("675","672")), 30),
                  After(FilterType(aqPUB,"cellline"),FilterProperty(aqPUB,"identifier",valueArr=Array("675","672")), 30)
               )
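// assumed reconstruction of a garbled snippet: group cell line mentions by identifier and count them
val q2 = q2_annot.select(Array("orig", "identifier").map(x => $"properties".getItem(x).alias(x)): _*)
          .groupBy("identifier")
          .agg(collect_set("orig") as "labels", count("*") as "count").orderBy(desc("count"))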

The result is more surprising:

Among the top 5 cell lines mentioned within 30 characters of the BRCA1 and BRCA2 genes we find the expected breast cancer cells, but also liver cancer cells (HEPG2) and pancreatic cancer cells (Capan-1).

One can gain more insight into some of the results by “hydrating” the annotations with their contextual text.


Using the Concordancer function on the first five results of the second query (with 60 contextual characters), we obtain:

Document ID | Annot. set | Annot. type | Hydrated text
10954590 | pubtator | cellline | 5′ regulatory region. In contrast, the non-BRCA1 expressing UACC3199 cells were completely methylated at all 75 CpGs. Chromatin
10954590 | pubtator | cellline | ted BRCA1 expressing cells. The chromatin of the methylated UACC3199 BRCA1 promoter was inaccessible to DNA-protein interactions
11126365 | pubtator | cellline | and break repair. The human BRCA2-deficient human cell line Capan-1, whilst being sensitive to ionizing radiation, is also sens
11126365 | pubtator | cellline | iophage T4 DNA ligase or human DNA ligase III. BRCA2-mutant Capan-1 cells may possess reduced DNA ligase activity during BER.
16322213 | pubtator | cellline | ch was reversible in the heavily BRCA1-methylated cell line UACC3199 following treatment with 5-aza-2′-deoxycytidine and trichos
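The rows above are fixed-size windows of text around each annotation, which is why they start and end mid-word. The idea can be sketched in plain Scala as follows; this concordance function is a hypothetical illustration, not the library's Concordancer, and it assumes end-exclusive offsets.

```scala
object ConcordanceSketch {
  // Return the annotated text plus up to `window` characters of context on each side.
  def concordance(text: String, start: Int, end: Int, window: Int): String = {
    val from = math.max(0, start - window)
    val to = math.min(text.length, end + window)
    text.substring(from, to)
  }

  // Toy sentence adapted from one of the rows above.
  val text = "The BRCA2-mutant Capan-1 cells may possess reduced DNA ligase activity."
  val start: Int = text.indexOf("Capan-1")
  val end: Int = start + "Capan-1".length

  // 10 characters of context on each side of the cell line mention.
  val snippet: String = concordance(text, start, end, 10)
}
```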

What are the genes co-occurring with the BRCA1 gene in the same sentence?

This last question makes use of pubtator and scnlp annotation sets. The corresponding code goes like this:

  val cooc_brca1_annot = ContainedIn(FilterType(aqPUB,"gene"),Contains(FilterType(aqSCNLP,"sentence"), FilterProperty(aqPUB,"identifier","672")) )

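  // assumed reconstruction of a garbled snippet: group co-occurring genes by identifier and count them
  val cooc_brca1 = cooc_brca1_annot.select(Array("orig", "identifier").map(x => $"properties".getItem(x).alias(x)): _*)
      .groupBy("identifier")
      .agg(collect_set("orig") as "labels", count("*") as "count").orderBy(desc("count"))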

Top 5 results are:

The top 5 co-occurring genes are usual suspects in DNA damage pathways. This query could be a good way for cancer researchers to look at recent gene or protein associations described in the literature.


In this post, we experimented with the AnnotationQuery library to query annotation sets at scale. It is a convenient and fast way to run complex queries (with composable functions) without the need for an indexer and a search engine. I particularly like the fact that AnnotationQuery results are typed spark.sql.Dataset objects. This facilitates their integration into a bigger SPARK pipeline.

Knowing that annotations are often mapped to a taxonomy, it could be interesting to extend the library to take advantage of taxonomy semantics and to be able to run queries like: “what are the diseases of the respiratory system co-occurring with a given gene?”. Here the respiratory system would correspond to an upper concept in the taxonomy that is linked to several diseases. Similarly for geolocation, exploiting cardinalities to ask for entities located in or near another one may be of interest to data scientists in the domain.

Finally, I want to thank Darin McBeath, the author of both the AnnotationQuery and spark-xml-utils libraries, for his help in setting up part of the code.