Tuesday, 1 December 2009

How Linked Data can improve web search

What is Linked Data?
At the recent Jadu Experience Day, the keynote speech placed an emphasis on the concept known as Linked Data. This is a method of making data visible, allowing it to be shared, and connecting to it using the web.

The idea of Linked Data was first set out by Sir Tim Berners-Lee as part of his description of the Semantic Web. There are four key principles which need to be met for Linked Data to happen:

  • Assign every resource on the web a unique identifier (called a URI, or Uniform Resource Identifier)
  • Use web based (HTTP, or Hypertext Transfer Protocol) URIs so that these resources can be referred to and looked up on the Internet
  • Provide useful information (a structured description or metadata) about the resource so that people finding the resource know what it is
  • Include links to other, related URIs so that further related information on the Web is easier to discover

So in essence, all resources on the web should be uniquely identified (via a URI) and looked up using the web's own protocol (HTTP). Each resource should carry descriptive information about itself (metadata) so that people can tell what the resource is and what it can be used for. Finally, each resource should contain links to other related resources that may also be of use, so that any resource can lead you on to further, related resources. This is similar to the ‘Related Information’ or ‘See Also’ links that you frequently find in web pages.
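The four principles above can be sketched in a few lines of Python. This is only an illustration: the Resource class, the publish, look_up and discover helpers, and the example.org URIs are all invented here, and a plain dictionary stands in for a real HTTP look-up.

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    uri: str                # principles 1 and 2: a unique, HTTP-based identifier
    description: dict       # principle 3: metadata describing the resource
    links: list = field(default_factory=list)  # principle 4: related URIs


# A dictionary stands in for the web: looking up a key plays the role of an HTTP request.
web = {}


def publish(resource):
    web[resource.uri] = resource


def look_up(uri):
    return web[uri]


def discover(start_uri):
    """Follow links from one resource and list every URI that can be reached."""
    seen = []
    queue = [start_uri]
    while queue:
        uri = queue.pop(0)
        if uri in seen or uri not in web:
            continue
        seen.append(uri)
        queue.extend(web[uri].links)
    return seen


# Hypothetical resources, using the reserved example.org domain.
RATES = "http://example.org/dataset/crime-rates"
PATTERNS = "http://example.org/dataset/crime-patterns"
OFFICE = "http://example.org/org/statistics-office"

publish(Resource(RATES, {"title": "Crime rates"}, [PATTERNS, OFFICE]))
publish(Resource(PATTERNS, {"title": "Crime patterns"}, [RATES]))
publish(Resource(OFFICE, {"title": "Statistics office"}))

print(look_up(RATES).description["title"])
print(discover(RATES))
```

Starting from a single URI, discover reaches all three resources purely by following links, which is the ‘See Also’ behaviour described above.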

Extracted from Linked Data
“Linked Data is about using the Web to connect related data that wasn’t previously linked, or using the Web to lower the barriers to linking data using other methods. More specifically, Wikipedia defined Linked Data as a term used to describe a recommended practice for exposing, sharing and connecting pieces of data, information, knowledge on the Semantic Web using URIs and RDF (Resource Description Framework).”

For any of this to work, the web needs to have as much data exposed to it as possible. Every single organisation, business, research laboratory, government, school, hospital, university and so on has data. With the exception of personal data, all data should be available to the web. Once it is on the web, then it can become Linked Data.

To repeat the mantra first uttered by Sir Tim Berners-Lee: we want raw data now!

For a scientist working on a cure for cancer, the more data at your disposal, the better. You can cross check results from other studies, and link the data together to form a more complete picture. As each resource provides links to other related resources, all manner of discoveries become possible, including those which were not originally thought of.

Intelligent searching
It is not difficult to see how Linked Data is critical to search discovery, and will allow search engines to become much more intelligent. If all data is available and exposed on the web, if all resources are identified and described, and if all resources provide links to other, related resources, then suddenly your search engine becomes much more powerful.

Some types of question are just not possible with current search engines. Where the question is simple and easily framed, then current search engines can return relevant results. But where the question is more complex, or not so easy to frame, then current search engines may return irrelevant results.

Search engines use complex algorithms to return results, but one key mechanism they all use is trying to match the specified search terms to the content (which will also include metadata). Where enough of the search terms appear in a piece of content, it is returned as a match. This is simple word matching. I am sure we have all tried in vain to find a useful answer to a question that was difficult to put into search terms.
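The word matching idea can be shown with a deliberately naive sketch. Real search engines are far more sophisticated than this; the keyword_score and search functions and the two sample documents are invented purely for illustration.

```python
def keyword_score(query, document):
    """Count how many distinct query words also appear in the document."""
    query_terms = set(query.lower().split())
    document_terms = set(document.lower().split())
    return len(query_terms & document_terms)


def search(query, documents, min_score=2):
    """Return documents sharing at least min_score words with the query, best first."""
    scored = [(keyword_score(query, doc), doc) for doc in documents]
    hits = [(score, doc) for score, doc in scored if score >= min_score]
    hits.sort(key=lambda pair: -pair[0])
    return [doc for score, doc in hits]


# Invented sample content.
documents = [
    "The UK crime drama with the highest viewer rate",
    "Recorded offences rose and fell across England and Wales",
]

results = search("crime rate fluctuations in the UK", documents)
print(results)
```

The first document is returned because it happens to share the words "crime", "rate", "the" and "UK" with the query, even though it is about television. The second, which is actually about changing crime levels, shares no words with the query and is missed.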

For example, take the question "what were the crime rate fluctuations in the UK between 2000 and the present". Current search engines would struggle to return anything useful, as the question, while easily stated in English, is not easily framed for a search engine to work with.

If government data on crime rates is available to the search engine, and this data is linked to other related data, such as crime patterns and statistics, as well as the raw data itself, then the question becomes more easily answered, and search engines are better able to return meaningful results.
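As a rough sketch of why raw, linked data helps, the crime rate question becomes a small calculation once the figures themselves are available. Everything here is assumed for illustration: the dataset layout, the example.org URIs, the fluctuations function, and above all the numbers, which are invented placeholders and not real statistics.

```python
# A hypothetical published dataset: identified by a URI, described by metadata,
# linked to related data, and carrying the raw observations themselves.
crime_rates = {
    "uri": "http://example.org/dataset/uk-crime-rates",
    "description": {"title": "Crime rate by year", "unit": "placeholder index"},
    "links": ["http://example.org/dataset/uk-crime-patterns"],
    # Invented placeholder values, NOT real statistics.
    "observations": {2000: 100, 2001: 110, 2002: 105},
}


def fluctuations(dataset, start_year):
    """Return (year, change from the previous year) for each year after start_year."""
    observations = dataset["observations"]
    years = sorted(year for year in observations if year >= start_year)
    changes = []
    for previous, current in zip(years, years[1:]):
        changes.append((current, observations[current] - observations[previous]))
    return changes


print(fluctuations(crime_rates, 2000))
print(crime_rates["links"])
```

The answer comes from the data rather than from matching words in a page, and the links entry points to where a follow-up question about crime patterns could be taken next.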

All kinds of questions become possible to answer with Linked Data. It empowers people. Search engines become much more intelligent, and capable of answering even the most complex of questions.

We want raw data now!

