Linked Data: combining data from Wikidata and Eurostat datasets

With the rise of open data, it has never been simpler to access data from different sources and use them together. In this post, we will consume data expressed with Semantic Web technologies using Python code and create visualisations about countries from information in Wikidata and Eurostat. Because the two datasets share common identifiers, in this case the NUTS (Nomenclature of Territorial Units for Statistics) codes, we can merge the information easily. This is the idea behind Linked Data. The visualisations will try to answer the following questions:

  • What is the correlation between GDP per capita and life expectancy?
  • What is the correlation between GDP per capita and the enrolment of young children in the education system?

In our code, we will use several Python libraries well known to data scientists, such as pandas, numpy and matplotlib, as well as rdflib and SPARQLWrapper, which are used to interact with RDF, the Semantic Web data model.

%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import cm
import rdflib
from SPARQLWrapper import SPARQLWrapper, JSON

def get_sparql_dataframe(service, query):
    """Helper function to convert SPARQL results into a Pandas data frame."""
    sparql = SPARQLWrapper(service)
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    # convert() parses the JSON response into a Python dict
    processed_results = sparql.query().convert()
    cols = processed_results['head']['vars']

    out = []
    for row in processed_results['results']['bindings']:
        # Unbound (optional) variables are absent from the row, so default to None
        item = []
        for c in cols:
            item.append(row.get(c, {}).get('value'))
        out.append(item)

    return pd.DataFrame(out, columns=cols)
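
To see what the helper does without calling an endpoint, the sketch below applies the same flattening to a hand-written dictionary laid out like a SPARQL JSON result. The sample rows and the function name bindings_to_dataframe are assumptions made for illustration; they are not output from Wikidata.

```python
import pandas as pd

# Hand-written sample in the SPARQL JSON results layout (values are made up).
sample = {
    "head": {"vars": ["name", "nuts_code"]},
    "results": {
        "bindings": [
            {"name": {"type": "literal", "value": "Bulgaria"},
             "nuts_code": {"type": "literal", "value": "BG"}},
            # No optional match: the variable is simply absent from the row.
            {"name": {"type": "literal", "value": "Japan"}},
        ]
    },
}


def bindings_to_dataframe(results):
    """Flatten a SPARQL JSON result dict into a DataFrame (hypothetical helper)."""
    cols = results["head"]["vars"]
    rows = [[b.get(c, {}).get("value") for c in cols]
            for b in results["results"]["bindings"]]
    return pd.DataFrame(rows, columns=cols)


df = bindings_to_dataframe(sample)
print(df)
```

Rows that lack an optional variable end up with a missing value in that column, which matters for the filtering step later on.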

Wikidata is a collaboratively edited knowledge base that follows Semantic Web principles and feeds some of the data blocks in Wikipedia. It offers a SPARQL endpoint against which we can run graph queries. The following query retrieves all sovereign states with their English label, life expectancy, GDP per capita, population, country code and, where available, NUTS code.

The following image shows a graph view of the data related to Bulgaria.

endpoint = ""
query = """
select * {
    ?country wdt:P31 wd:Q3624078 .
    ?country rdfs:label ?name .
    filter(lang(?name) = 'en')
    ?country wdt:P2250 ?life_exp .
    ?country wdt:P2132 ?GDP_per_capita .
    ?country wdt:P1082 ?population .
    ?country wdt:P297 ?code .
    optional { ?country wdt:P605 ?nuts_code . }
}
"""

countries = get_sparql_dataframe(endpoint, query)
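# Keep country-level rows only: drop NUTS codes longer than two characters (regions);
# copy() avoids chained-assignment warnings in the casts below
countries = countries.loc[~(countries['nuts_code'].str.len() > 2)].copy()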
countries['population'] = countries['population'].astype('float64')
countries['life_exp'] = countries['life_exp'].astype('float64')
countries['GDP_per_capita'] = countries['GDP_per_capita'].astype('float64')
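
The filter above is easy to misread, so here is a minimal sketch of its behaviour on toy rows. The values are illustrative only and the frame name toy is an assumption; the point is that rows with no NUTS code are kept, while codes longer than two characters are dropped.

```python
import pandas as pd

# Toy rows; the values are illustrative only, not real Wikidata output.
toy = pd.DataFrame({
    "name": ["A", "B", "C"],
    "nuts_code": pd.Series(["BG", None, "BG3"], dtype=object),
    "population": ["7000000", "125000000", "2000000"],
})

# str.len() yields NaN for a missing code, NaN > 2 is False, and ~False is True,
# so rows without a NUTS code survive while longer codes are dropped.
kept = toy.loc[~(toy["nuts_code"].str.len() > 2)].copy()

# SPARQL JSON values arrive as strings, hence the explicit numeric cast.
kept["population"] = kept["population"].astype("float64")
print(kept)
```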

Using Wikidata alone, we can plot each country's GDP per capita on one axis and life expectancy on the other. The area of each bubble grows with the country's population. The result should resemble the famous chart presented by Hans Rosling.

# Similar to Hans Rosling's presentation
countries['scaled population'] = ((countries['population'] - countries['population'].min())/200000+1).astype(int)
# Pass the colormap by name; cm.get_cmap() has been removed from recent matplotlib releases
ax = countries.plot.scatter(x='GDP_per_capita', y='life_exp', s=countries['scaled population'],
                            figsize=(20, 10), cmap='viridis', c='life_exp', colorbar=False,
                            alpha=0.6, edgecolor=(0, 0, 0, .2));
countries[['GDP_per_capita','life_exp','code']].apply(lambda x: ax.text(*x),axis=1);
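
The scatter plot gives a visual impression of the relationship; a numeric summary could complement it. The sketch below computes a Pearson correlation coefficient with pandas on made-up numbers (they are not Wikidata figures), once on raw GDP per capita and once on its logarithm. The log variant is an assumption on my part, motivated by GDP per capita spanning several orders of magnitude.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only; they are not Wikidata figures.
toy = pd.DataFrame({
    "GDP_per_capita": [1000.0, 5000.0, 20000.0, 45000.0, 60000.0],
    "life_exp": [58.0, 68.0, 75.0, 80.0, 82.0],
})

# Pearson correlation on the raw values.
r_raw = toy["GDP_per_capita"].corr(toy["life_exp"])

# Same coefficient after a log10 transform of GDP per capita.
r_log = np.log10(toy["GDP_per_capita"]).corr(toy["life_exp"])

print(f"Pearson r (raw GDP):   {r_raw:.2f}")
print(f"Pearson r (log10 GDP): {r_log:.2f}")
```

On the real data, the same two lines would be applied to the countries frame after the numeric casts.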

The second dataset comes from Eurostat, the statistical office of the European Union, which gathers national statistical data from EU countries. The file we query contains data cubes for the education indicator “participation rates of 4-year-olds in education at regional level”. It includes both regional and country-level figures; we will use only the country-level ones.

# Get Eurostat Linked Data through their portal:
g = rdflib.Graph()
g.parse('', format='application/rdf+xml')
qres = g.query("""
PREFIX sdmx-measure: <>
PREFIX dcterms: <>
PREFIX eus: <>
SELECT ?geo ?value
WHERE {
    ?obs dcterms:date "2012" .
    ?obs eus:geo ?geo .
    ?obs eus:indic_ed <> .
    ?obs sdmx-measure:obsValue ?value .
}
""")
# Participation rates of 4-year-olds in education at regional level
eurostat = pd.DataFrame(columns = ['eurostat_country','education_value'], data=[[str(geo), str(value)] for geo, value in qres])
eurostat['nuts_code'] = eurostat['eurostat_country'].str.split('#').str[1]
eurostat['education_value'] = eurostat['education_value'].astype(float)

Finally, we merge the two datasets on the NUTS identifiers of the countries present in both Wikidata and Eurostat.

linked_data = eurostat.merge(countries, how='inner', on='nuts_code')
linked_data['scaled population'] = linked_data['scaled population']*10
# Pass the colormap by name; cm.get_cmap() has been removed from recent matplotlib releases
ax2 = linked_data.plot.scatter(x='GDP_per_capita', y='education_value', s=linked_data['scaled population'],
                               figsize=(20, 10), cmap='viridis', c='education_value', colorbar=False,
                               alpha=0.6, edgecolor=(0, 0, 0, .2));
linked_data[['GDP_per_capita','education_value','nuts_code']].apply(lambda x: ax2.text(*x),axis=1);

Using information from Wikidata and Eurostat together, we can now plot GDP per capita on one axis and the education indicator on the other. The plot is limited to European countries because Eurostat only contains information about them.
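
The restriction to European countries follows directly from the inner join: a row survives only if its NUTS code appears on both sides. A minimal sketch on toy frames, with placeholder values and assumed frame names, shows which rows match and which are dropped; the indicator option of pandas merge is used here purely for inspection.

```python
import pandas as pd

# Toy frames with placeholder values; column names follow the ones used in this post.
countries_toy = pd.DataFrame({
    "name": ["Bulgaria", "France", "Japan"],
    "nuts_code": ["BG", "FR", None],
    "GDP_per_capita": [1.0, 2.0, 3.0],
})
eurostat_toy = pd.DataFrame({
    "nuts_code": ["BG", "FR", "XX"],
    "education_value": [10.0, 20.0, 30.0],
})

# indicator=True adds a _merge column telling which side each row came from.
outer = eurostat_toy.merge(countries_toy, how="outer", on="nuts_code", indicator=True)
print(outer[["nuts_code", "name", "_merge"]])

# The inner join keeps only codes present on both sides.
inner = eurostat_toy.merge(countries_toy, how="inner", on="nuts_code")
print(inner)
```

Checking the outer join first is a cheap way to spot identifiers that failed to match before they silently disappear from the plot.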

By sharing vocabularies and identifiers, or by establishing equivalences between them, linked datasets make data integration much easier and allow data scientists to prototype and analyse data very quickly. Once gathered and merged, the data constitutes a knowledge graph, ready for graph mining tasks such as graph feature extraction or link prediction.