Linked Data – Past, present (2019) and future
About two weeks ago, I had the pleasure of giving a talk at the ESSnet (European Statistical System network) final meeting held in Sofia, Bulgaria. My presentation was about the Semantic Web and its development into Linked Data (here is a link to my presentation). The Semantic Web turns 20 this year and, although much has been accomplished, the reality is not quite what was first imagined.
With the rise of open data, it has never been simpler to access data from different sources and use them jointly. For instance, it takes only a few lines of code to combine macro-economic data about European countries to study the correlation between education level and GDP. We can do so using Wikidata and Eurostat. Since the two linked datasets use a common semantics for representing countries and regions, namely the NUTS (Nomenclature of Territorial Units for Statistics) identifiers, we can merge the information easily. This is called Linked Data.
While this notion is now familiar to Web practitioners, its foundations go back to the creation of the Web in 1989. Tim Berners-Lee's original proposal already included the idea of a “graph” view with semantic relations between documents and entities.
By 1994, the first technologies of the Web had been developed: the HTTP protocol, the HTML language, URI identifiers and the hyperlink. The URI gives us a mechanism to uniquely identify an entity (a real person, a document, etc.) and a means to look it up on the Web. The hyperlink establishes a relationship between two entities. In practice, it was only used to navigate from one document to another and could be interpreted as a generic “is related to” relation. What was missing from the original vision was the possibility to define other types of relations, such as “is a friend of” or “has address”, that a computer can interpret. This limitation was the main message of Tim Berners-Lee's presentation at the first World Wide Web conference, held in Geneva in 1994: the Web should be designed for both humans and machines, with the meaning of things and their relations made explicit. To support that vision and coordinate Web development efforts, the W3C was created.
Around the W3C's Semantic Web activity, a community of researchers in Artificial Intelligence, knowledge representation and Web experts defined the next bricks of the Web stack. Among them: RDF, to represent facts in the form of triples (subject, predicate, object); RDF Schema, adding semantics to the triples and allowing the representation of instances, properties, etc.; OWL, to define ontologies with a Description Logic that supports reasoning; and SPARQL, a query language.
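The triple model is simple enough to sketch in a few lines. Below is a toy in-memory triple store with pattern matching, analogous to a basic SPARQL graph pattern (the URIs and facts are illustrative, not canonical vocabularies):

```python
# RDF represents facts as (subject, predicate, object) triples.
# A toy in-memory triple store; the example.org URIs are illustrative.
EX = "http://example.org/"
triples = {
    (EX + "Sofia", EX + "isCapitalOf", EX + "Bulgaria"),
    (EX + "Bulgaria", EX + "memberOf", EX + "EU"),
    (EX + "Sofia", EX + "population", "1236000"),
}

def match(s=None, p=None, o=None):
    """Return triples matching a pattern; None acts as a wildcard."""
    return [(ts, tp, to) for ts, tp, to in triples
            if (s is None or ts == s)
            and (p is None or tp == p)
            and (o is None or to == o)]

# "What do we know about Sofia?" -- analogous to the SPARQL pattern
# { ex:Sofia ?p ?o }.
for _, p, o in sorted(match(s=EX + "Sofia")):
    print(p, o)
```

RDF Schema and OWL then layer meaning on top of these bare triples (class membership, property domains, inference rules), which plain tuples cannot express by themselves.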
The use of Description Logic has admittedly been the weak point of Semantic Web adoption. The AI/ontology approach fails to address most real-world data on the Web, which is uncertain, incomplete, inconsistent and riddled with errors. Adoption of the Semantic Web thus turned out differently from the original vision, focusing on a more pragmatic use for data representation, integration and linking. Linked Data was born! Here the emphasis is on sharing information using a graph structure and following the Linked Data principles: the use of URIs to refer to things, the availability of the resource at the given URI over HTTP for both humans and machines to access, and the interlinking of data. The full formal system defined in OWL is not necessary, and people started to reuse common vocabularies with a light formal commitment. Constraints can be described declaratively, using languages like SHACL or ShEx, in a more intuitive way. This corresponds to the actual needs of most data users and producers on the Web. The shift started in 2007 and was clearly apparent by 2012 in the community and industry reactions:
- The major Web search engine companies launched the schema.org initiative to embed semantics in Web pages. This activity alone surpasses any original hope for semantic mark-up, with more than 40% of Web pages semantically annotated.
- Projects like DBpedia or Wikidata span most data domains and bridge data silos on the Web.
- The W3C broadened its activity from the Semantic Web to the Web of Data in 2013.
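The constraint languages mentioned above (SHACL, ShEx) validate what data must look like, rather than inferring new facts the way OWL does. A minimal sketch of that validation idea over plain triples (the names and data are illustrative, and this is not SHACL's actual RDF syntax):

```python
# SHACL- and ShEx-style validation: declare what well-formed data
# looks like ("every Person must have an address") and report the
# violations. Names and data below are illustrative only.
triples = [
    ("alice", "rdf:type", "Person"),
    ("alice", "hasAddress", "1 Main St"),
    ("bob", "rdf:type", "Person"),  # bob has no address -> violation
]

def validate(triples, target_class, required_property):
    """Report subjects of target_class missing required_property."""
    instances = {s for s, p, o in triples
                 if p == "rdf:type" and o == target_class}
    satisfied = {s for s, p, o in triples if p == required_property}
    return sorted(instances - satisfied)

print(validate(triples, "Person", "hasAddress"))  # -> ['bob']
```

Note the closed-world flavour: missing data is a violation to report, not an unknown to reason about, which matches how most data producers actually think about quality.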
This trend is also visible in people’s interest through organic searches:
Still, publishing and maintaining Linked Data comes at a cost. Organisations and public departments find it difficult to justify the cost associated with Linked Data: some publishing mechanisms are expensive to operate (e.g. a SPARQL endpoint serving costly queries) while offering little visibility on data usage. Initiatives in that direction (e.g. SPARQL monitoring, Linked Data Fragments) are starting to emerge.
The biggest challenge going forward is not at the technical level, as demonstrated by the continuous publication of Linked Data in various domains. It lies in agreeing on a common semantics. Librarians have worked on the BIBFRAME data model using RDF; the museum and art community has worked on the Getty vocabularies for the description of their assets. The European Statistical System network is currently working on harmonising the description of statistical dimensions in RDF across all European countries. Such an endeavour is not trivial. An example that struck me during the final meeting was the “simple” notion of population estimate. Although you might think this basic notion is consensual among European countries, in reality it is not. In Ireland, the population estimates, as depicted in the table below, are calculated based on either:
- the Census de-facto population (which includes visitors), or
- the Census usually resident and present population (which excludes residents who are temporarily absent).
While in France, the population estimates are calculated using:
- the usual residence criterion (which excludes visitors and includes the temporarily absent).
Aligning these notions is hard work; this is what makes Linked Data difficult in practice and why it requires tight cooperation within domain communities.
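The mismatch above can be made machine-visible by giving each national definition its own identifier and spelling out its definitional properties, instead of merging on the shared label "population estimate". A toy sketch, with identifiers and properties invented for illustration:

```python
# The same label, "population estimate", hides different national
# definitions. Identifiers and properties below are illustrative.
definitions = {
    "ie:populationEstimate": {
        "basis": "census de-facto population",
        "includes_visitors": True,
    },
    "fr:populationEstimate": {
        "basis": "usual residence",
        "includes_visitors": False,
    },
}

def comparable(a, b):
    """Two indicators should only be merged if their definitions agree."""
    return definitions[a] == definitions[b]

print(comparable("ie:populationEstimate", "fr:populationEstimate"))  # -> False
```

A naive join on the label would silently mix incompatible figures; explicit semantics turn that silent error into a detectable mismatch, which is exactly the kind of agreement work the domain communities must do.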
Agreement on semantics will play a major role in the years to come with the rise of sensors and the IoT: the Sentient Web. The Sentient Web can be defined as “ecosystems of services with awareness of the world through sensors and reasoning based upon graph data and rules together with graph algorithms and machine learning” (David Raggett); the term was first coined by Michael N. Huhns. For the many sensors to interoperate efficiently, a clear semantics will be essential.
Another area of future interest is knowledge discovery. Nowadays it is hard to avoid the term “knowledge graph” when talking about Linked Data. The two notions have in common the use of a graph as a means of knowledge representation. A knowledge graph (KG) is a set of triples with semantic relations and associated node and/or edge properties. Although a KG does not require the use of URIs, most of them borrow this mechanism from the Semantic Web. Well-known examples of KGs are the Google Knowledge Vault, the Microsoft Academic KG and Facebook Graph Search. Large-scale industrial KGs have pushed research towards new methods of pattern mining (graph mining) such as link prediction or graph-feature-based prediction. Following the success of Swanson's manual literature-based discovery in the 1980s, it is only natural that knowledge graph mining can surface indirect links that form new information. An example of this is the recent use of graph data to identify fraudulent organisations in the Panama Papers investigation.
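To illustrate the idea of surfacing indirect links, here is a minimal link-prediction sketch using the common-neighbours heuristic on a made-up graph (the entities and relations are invented for illustration, not Panama Papers data):

```python
# Link prediction on a knowledge graph: the common-neighbours
# heuristic scores a candidate link (a, b) by how many nodes both
# already connect to. The graph below is entirely made up.
edges = {
    ("acme_ltd", "registered_at", "po_box_42"),
    ("shellco", "registered_at", "po_box_42"),
    ("acme_ltd", "director", "j_doe"),
    ("shellco", "director", "j_doe"),
    ("honest_inc", "director", "m_smith"),
}

def neighbours(node):
    """All nodes directly linked to `node`, ignoring edge direction."""
    return ({o for s, _, o in edges if s == node} |
            {s for s, _, o in edges if o == node})

def common_neighbour_score(a, b):
    """Number of shared neighbours -- a simple link-prediction score."""
    return len(neighbours(a) & neighbours(b))

# acme_ltd and shellco share an address and a director: an indirect
# link worth surfacing to an investigator.
print(common_neighbour_score("acme_ltd", "shellco"))     # -> 2
print(common_neighbour_score("acme_ltd", "honest_inc"))  # -> 0
```

Industrial systems use far richer signals (embeddings, path features, learned models), but the principle is the same: connections never stated explicitly become visible through the graph structure.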
Although the original vision of the Semantic Web has not yet been realised, many of its concepts are now used by major actors on the Web and are leading the way to new methods that facilitate data integration and data comprehension.