linked data: relevant content on this site

The business of unique identification

Talat Chaudhri — Thu, 23 Feb 2012 23:58:26 +0000

What need is there for unique identifiers?

Put in relatively non-technical language, there is an increasing concern in information science in general to uniquely identify different things, organisations or people that could otherwise be confused, whether on the Internet or in the physical world. In technical terms, these are all referred to as resources (even if people might find it vaguely demeaning in normal language to be considered as such). This need, whether real or perceived in any particular context, has grown as the complexity of information available on the Web has grown almost exponentially, increasing the potential for confusing similar resources.

Why aren't names good enough?

1. People

It is not necessarily enough to have a name, since even a relatively unusual combination of names might easily not be unique from a worldwide or even universal perspective. At the most basic level, John Steven Smith might be unique in a place called Barton, but even if you cross-reference the name with the place, two people with the same name could easily be confused, for example if there are several places called Barton.

My own name, Talat Zafar Chaudhri, might appear to be more unique until you realise that these are all fairly common names in the Indian subcontinent and thus in the Indo-Pakistani diaspora, so it is reasonably possible or even fairly likely that another named individual exists with this particular choice of spelling (of which others may exist). I am also Talat Chaudhri, T. Chaudhri, T Chaudhri, T.Z. Chaudhri, TZ Chaudhri and similar variations (with or without spaces and punctuation) that might make it harder to decide which individuals to reconcile as a single individual, especially by machine processing. At least I do not vary the spelling of my surname, but some people may, especially in cases such as my own where other transliterations could be possible: for example, my father previously used the spelling Chaudhry and many others such as Chaudry, Chowdhary and Chowdhuri are equally possible. I understand when companies misspell it, but a computer might not be sure if these were definitely the same person, even if it went to the lengths of calculating a probability for this.
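To illustrate why machine reconciliation of such variants is uncertain, here is a minimal sketch using only the Python standard library. The helper names `normalise` and `similarity` are hypothetical, and the character-based ratio from `difflib` is just one possible stand-in for "calculating a probability"; real reconciliation services use considerably richer evidence.

```python
from difflib import SequenceMatcher


def normalise(name: str) -> str:
    """Lower-case a name and replace punctuation with spaces, collapsing runs of spaces."""
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in name)
    return " ".join(cleaned.lower().split())


def similarity(a: str, b: str) -> float:
    """Return a ratio between 0.0 and 1.0 describing how alike two names look."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()


reference = "Talat Zafar Chaudhri"
variants = ["Talat Chaudhri", "T. Chaudhri", "T.Z. Chaudhri", "Talat Chaudhry", "T Chowdhuri"]

# Print a score for each variant; none of these scores proves identity.
for variant in variants:
    print(f"{variant!r}: {similarity(reference, variant):.2f}")
```

Normalisation removes the differences in spacing and punctuation, but a score below 1.0 for a transliteration variant still leaves the question open: a high score may link two different people, and a low one may separate two forms of the same name.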

Moreover, people change personal titles (e.g. I have been both a Mr and a Dr and I am occasionally still referred to as the former by companies that do not allow for the latter option); they have multiple, changing work roles and work places, and may be known in multiple contexts, e.g. work, social, voluntary roles and similar. At work, one may have additional roles in various professional bodies, so it may not be apparent who is who. Two people might have the same name in a large professional group, e.g. physicists, and may even produce outputs related to the same subject. Who owns which ones? This is a particular issue for electronically available outputs on the Internet, e.g. publications, educational resources, audio, visual or audiovisual resources and so on.

2. Organisations

The same issue arises for organisations. Can we be sure that a Board of Licencing Control is unique? No. Perhaps it is merely a variant spelling of the Board of Licensing Control? What if one, but not all, of these were re-named as Burundian Licencing Control? What if the Board of Licencing Control merged with the Department for Regulatory Affairs under either of these names, a combination, or an entirely new name, yet continued its association with the assets of the originals? De-mergers are likewise possible, and may present issues of uncertain ownership of resources.

Perhaps there are organisations with this name in several countries but serving utterly different purposes, and perhaps one is merely one possible translation of a term into English but used natively in another language. Historical names have been used in multiple contexts that may still be valid, e.g. the Irish Volunteers, and these might need to be kept clearly separate from each other. Conversely, there are also organisations that have multiple names or forms of names, whether in one language or in multiple languages or during their history, e.g. Óglaigh na hÉireann is Irish for both the terrorist Irish Republican Army (IRA) and most of its subsequent splinter groups but is also an acceptable name, for historical reasons, for the Defence Forces of the Republic of Ireland, and previously just the Irish Army (an tArm) that now forms a part of it. These are clearly not the same and must be distinguished. It must also be noted that typographical constraints and character encodings will lead to yet more duplicate forms.
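The point about character encodings can be shown with a short sketch using Python's standard `unicodedata` module. The helper names `canonical` and `strip_accents` are hypothetical; the example only shows how one visible name can exist as several distinct strings.

```python
import unicodedata

# The same name written with precomposed accented letters (escapes used for clarity).
composed = "\u00d3glaigh na h\u00c9ireann"
# The same name with each accent stored as a separate combining character.
decomposed = unicodedata.normalize("NFD", composed)
# A form produced where accents were unavailable, e.g. by a typographical constraint.
ascii_only = "Oglaigh na hEireann"


def canonical(text: str) -> str:
    """Bring equivalent Unicode sequences to one composed form (NFC)."""
    return unicodedata.normalize("NFC", text)


def strip_accents(text: str) -> str:
    """Drop combining marks so accented and unaccented forms can be compared."""
    split = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in split if not unicodedata.combining(ch))


print(composed == decomposed)                        # byte-for-byte comparison of the raw strings
print(canonical(composed) == canonical(decomposed))  # comparison after normalisation
print(strip_accents(composed) == ascii_only)         # comparison after removing accents
```

Normalisation reconciles forms that differ only in encoding. Stripping accents is a lossier step that can also merge names that are genuinely different, so it is better suited to suggesting candidate matches than to asserting identity.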

Isn't this bigger than the question of unique identification?

Yes, the need for complex metadata to express these things can go far beyond merely identifying resources in a unique manner. However, before one can even start thinking about complex descriptive and relational metadata, one first has to be clear which resource is mentioned: hence the first step must be unique identification of what it is we are talking about. Only once we have done that can we feel reasonably confident about talking about how resources relate to one another and how they may have changed over time.

Overall, there is an ever increasing need to make clear what is meant, as more and more things and agents have on-line identities that need to be distinguished, whether this is as an owner of resources or as a referent within a resource, e.g. the subject of the resource in a particular context, and even of the role played and the relationship to other resources or agents, perhaps in a specific time period. Information models can quickly become extremely complex, and this is certainly true where identity is concerned.

What is an identifier?

In concept, an identifier is similar to a name. At its most basic, an identifier in the context of an information system is a token (usually a number or a string of characters) used to refer to an entity (anything which can be referred to). Identifiers are fundamental to most, if not all, information systems. As the global network of information systems evolves, identifiers take on a greater significance. And as the Web becomes more 'machine readable', it becomes vital for all organisations that publish Internet resources to adopt well-managed strategies for creating, maintaining and consistently using identifiers to refer to those assets they care about.

What are unique identifiers?

The simple answer is that unique identifiers are the only way to avoid misidentification confidently, and therefore to prevent any errors about ownership or rights over resources that might arise, as well as making sure that large bodies of resources contain reliable information generally.

The fundamental question is whether the identifier or token that has been chosen is unique and how best to ensure this. Some identifiers are so complex that mathematical probability makes them effectively unique in the universe, notably UUIDs. In essence, a UUID is no more than a complex numerical token: it is only additional complexity (and thus uniqueness) that it offers compared to, for example, a running number. Others like names can only be distinguished unambiguously by making a series of statements about which names are considered equivalent, which contexts (e.g. a person's work or town) are valid, and so on, where a number of relationships have to be attached to a particular identifier and checked in order to reach an acceptable level of uniqueness and to eliminate any mistaken connections with resources that might be similar in name or perhaps also in other respects by chance.

The problem with UUIDs is that, while the chances of them failing to be unique are, for all practical purposes, non-existent, it is not very clear from a UUID alone what the nature of that resource is. It may be machine-readable but it says nothing about who generated that identifier and when, or which other identifiers might exist for the same resource in different systems that also generated an identifier for the same resource. Consequently, the need to associate other metadata with any complex number or other similar token remains (including but not limited to UUIDs). Simply, no single token can be sufficient for any complex purpose and, at the very least, an electronic or physical resource must be referenced for the token to have any useful meaning at all.
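As a minimal sketch of this point, the following Python fragment generates a random UUID with the standard `uuid` module and wraps it in a record carrying the metadata that the bare token lacks. The class `IdentifiedResource`, its field names and the example URI are assumptions made for illustration, not part of any existing scheme.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class IdentifiedResource:
    """A hypothetical record pairing an opaque token with descriptive metadata."""
    label: str                      # human-readable description of the resource
    created_by: str                 # which system or agent minted the identifier
    identifier: uuid.UUID = field(default_factory=uuid.uuid4)
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    same_as: list[str] = field(default_factory=list)  # identifiers for the same resource elsewhere


record = IdentifiedResource(label="Example research paper", created_by="example-repository")
# Record an identifier that another (made-up) system issued for the same resource.
record.same_as.append("https://repository.example.org/id/eprint/123")

print(record.identifier)   # an opaque token, meaningless on its own
print(record)              # the token together with the context needed to interpret it
```

The token guarantees distinctness; everything a person or another system needs in order to use it sits in the surrounding fields.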

This is effectively what a URL is: another type of token. While I will not go into the whole discussion about URLs and URNs as sub-types of URIs, it is worth noting that, in many quarters, the term URL is no longer preferred despite it being the most commonly used in practice. In strict terms, there is a clear difference: while a URI is usually resolvable to an electronic resource, which may be either a description of a physical or electronic resource or may be an electronic resource itself, there is technically no requirement that a URI should be resolvable, i.e. all it needs to be is a token that doesn't necessarily have to represent an address that actually delivers a resource. However, it is usual to use the HTTP scheme, which is designed for delivering such a resource, so it would be somewhat eccentric and misleading if one were deliberately to choose an ostensibly resolvable syntax that does not in fact resolve. In effect, virtually all such URIs are also URLs (unless a resource has become unavailable and link rot has set in), since the latter must locate the resource or representation of it: this is inherently useful. Any URI that resolves, i.e. URL, will be effectively unique within the standard Domain Name System (DNS). As a result, there is no absolute need for UUIDs in many contexts, since a sufficiently unique and practical token already exists in the URI. Any unique but arbitrary token serves the core purpose here.
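The distinction between a URI that is merely a token and one whose syntax promises an address can be sketched with Python's standard `urllib.parse`. The function `describe` and both example URIs are hypothetical, and the check only inspects syntax: it makes no network request, so it cannot detect link rot.

```python
from urllib.parse import urlparse


def describe(uri: str) -> dict:
    """Split a URI into parts and note whether its syntax suggests it can be resolved."""
    parts = urlparse(uri)
    return {
        "scheme": parts.scheme,
        "authority": parts.netloc,   # the DNS name, where the scheme uses one
        "path": parts.path,
        # Assumption for this sketch: only HTTP(S) URIs with a host look resolvable.
        "looks_resolvable": parts.scheme in {"http", "https"} and bool(parts.netloc),
    }


examples = [
    "http://id.example.org/person/1234",               # a URL: names and locates
    "urn:uuid:6e8bc430-9c3a-11d9-9669-0800200c9a66",   # a URN: names only
]

for uri in examples:
    print(uri, describe(uri))
```

In the HTTP case the uniqueness comes from the authority part being registered in the DNS; the URN carries no such location information and works purely as a name.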

Aren't identifiers really just names?

Yes and no. Names are intrinsically arbitrary too when they are first given. However, they are identifiable on a number of levels from a human perspective. In addition to a combination of names belonging to one or more particular linguistic and/or ethnic origins and usually identifying gender, they quickly become associated with a particular person, so their use in uniquely identifying that person within a given context becomes central to maintaining the person's reputation in whatever they do. This is, for example, particularly important to academics in Higher Education. In modern times, this name resolution needs to be done globally wherever the Internet is the context, whereas previously it would have been possible to use fewer additional pieces of information in more restricted contexts (e.g. a village, a country etc), depending on the purpose. These different contexts still co-exist but it is now necessary to provide as many as possible, since one cannot control or predict why the information is being requested in each instance on a global system such as the Internet.

How does this affect Higher and Further Education?

Increasing numbers of professionals and the bodies that they work for and represent need to describe their resources on the Internet, whether those are in themselves electronic resources, whether they are descriptions of electronic or physical resources (metadata), or whether they are other representations of physical resources, perhaps in addition to themselves being electronic resources (e.g. photographs). This is a particularly pressing issue in Higher Education and, to an increasing extent, in Further Education. Academic outputs may include publications, educational resources, visual, audio and audiovisual resources and so on. Perhaps the best known is the issue of scholarly publications, partly through the rise of the Open Access movement to make such resources freely available.

There is already a range of identifiers for academics and related professional university staff. One of the problems is that these are created for specific purposes that only cover whichever subset of staff is relevant to those purposes. For example, HESA keeps records that contain a HESA number for academic staff, which means that at least those who have published academic outputs will have such a number. Another number called the HUSID number is maintained for students, since tracking academic careers from student to staff is one important concern for HESA. Many academics in relevant fields may have ISNI numbers, which are used widely in the media content industries. Many academics will have one or more professional staff pages, including within repositories and Current Research Information Systems (CRIS), each with a URI, not to mention OpenIDs and URIs associated with Web services which they use professionally and/or privately, e.g. LinkedIn, Academia.edu, Facebook, Twitter and so on.

Here are some examples belonging to Brian Kelly of UKOLN:

The problem is that the coverage of these numbers is not universal within the HE sector, and there is no single recognised authority or other agreement to prevent and resolve conflicts where information is not consistent between two or more information sources.

At present, the JISC are trying to solve this through the Unique Identifiers Task and Finish Group, which also includes representatives of HESA, HEFCE, the various Research Councils in the UK and UKOLN. The preferred solution is currently the ORCID academic identifier, which is being developed internationally with publishers, with a great deal of input from the United States in particular.

In order to succeed, any such identifier will need international penetration of the higher education sector, since academics will not use it unless it delivers the sorts of interoperability benefits that make their work easier and become integrated into the recognised systems required of them by funders and publishers in the course of their work. Since students and academics change roles and institutions, this needs to be recognised and outputs properly allocated to institutions and departments, which may themselves change identities, merge and de-merge over time.

While institutions will need to reduce the workload on academics by bulk loading information about staff, since the main incentive to use the system is that every academic has a record, there is also an issue about control. Should academics have the ability to alter their records at will? Are assertions automatically trusted or does a particular record for an academic's time at an institution need to be verified by that trusted body? Who should maintain a list of trusted bodies who can back up assertions? How will this effort be funded sustainably? It becomes clear that some of these points are central structural concerns whereas others may cover only fringe issues such as avoiding deliberate falsification, which may be rare.

Proprietary academic identifiers

There are also a number of proprietary identifiers associated with different commercial services related to electronic publishing and related academic service industries. Thomson Reuters and Elsevier provide identities for individuals and organisations as part of their bibliographic and academic services; similarly, search services such as Google Scholar (see the study in this blog post) and Microsoft Academic Search have also started to offer identifiers (see this blog post). There may be privacy issues, for example in Google and Microsoft publicly surfacing information about researchers without explicit consent: while this information might have been suitable for the limited purpose of publication, academics may not have intended for it to be synthesised into a single, public description of their personal details available to all.

Some of these services introduce new problems, since their primary purpose is commercial and it is often less of a priority to deal with the internal issues facing academic institutions unless that impacts significantly on the ability to make commercial profit. These may be resolved over time or be reintroduced as services change and compete: the academic has little or no control over the effects of commercial decisions upon their work. For example, Microsoft Academic Search often misrepresents outputs as belonging to similarly named individuals (thus is currently failing at unique identification) and, by default, requires the manual input of researchers to edit out errors and take a proactive approach towards managing the information about themselves. This brings the overall quality of data into question: for large-scale statistical purposes, this could be tolerable, depending on the degree of error; however, for academic citations and reporting purposes such as the Research Excellence Framework (REF), it would not be acceptable to use this data without further refinement, which would most likely remain a long, manual process.

Software and services

Any software application layer, whether operated by commercial companies, higher educational institutions, funders or governmental bodies, needs to be maintained. If information is harvested or processed automatically, it needs to be clear who corrects information where errors are found and what the resources are for academics to contact individuals with the time and effort available to improve the data as part of their work. In the case of commercial organisations, this is usually unclear and may change. There is no guarantee that the commercial reason for providing services will continue over time, unlike in most cases in the public sector within Higher Education. Coverage of such commercial services is often geared towards institutions rather than individuals: for example, Google Scholar requires registration using a valid university email address that it recognises, which would exclude private scholars and perhaps some retired staff who produce research.

The Web of Things

It has already been mentioned that electronic descriptions or other representations of physical objects may be found on the internet, including written descriptions, pictures, geographical locations, dimensions and so on. It is even possible to describe physical objects that were extant but are now historical, or which have moved or whose location is now unknown, referencing comparable objects and linking these descriptions with other resources that are related. In each case, the nature of the relationship, relevant agents who may have been responsible for it, and when it was valid can be described in metadata.

This opens the way for the Web of Things, a term used to describe that part of the Semantic Web that covers physical resources as opposed to, or as well as, purely electronic ones. Some authorities use the term to mean physical objects with miniaturised electronic devices to enable them to be located, whereas others merely mean any physical object that is described in a record on the Web. It may be argued that all electronic resources have relationships to physical ones, even if that is only with regard to authorship and subject. The Resource Description Framework (RDF) provides a means to describe these relationships and transmit information about them in ways readable to humans and machines. Although these are usually expressed as triples, where two things are described with a relationship between them, metadata structures such as the Common European Research Information Framework (CERIF) can add link tables that give far more detailed information about the relationships themselves. All of this can be made available as Linked Data and surfaced in many software applications on the Web.
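The idea of a triple can be sketched in plain Python without any RDF library. Everything here is illustrative: the `Triple` type, the `match` helper and all `http://example.org/` URIs are made up for this example, and a real application would more likely use an RDF toolkit and published vocabularies.

```python
from typing import NamedTuple, Optional


class Triple(NamedTuple):
    """Two things joined by a named relationship, each identified by a URI."""
    subject: str
    predicate: str
    obj: str


BASE = "http://example.org/"   # hypothetical namespace for the sketch

triples = [
    Triple(BASE + "person/1", BASE + "vocab/authorOf", BASE + "paper/42"),
    Triple(BASE + "paper/42", BASE + "vocab/about", BASE + "object/vase"),
    Triple(BASE + "object/vase", BASE + "vocab/heldBy", BASE + "museum/7"),
]


def match(store: list[Triple],
          subject: Optional[str] = None,
          predicate: Optional[str] = None,
          obj: Optional[str] = None) -> list[Triple]:
    """Return the triples that agree with every part given; None acts as a wildcard."""
    return [
        t for t in store
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    ]


# Which statements are made about the paper?
for t in match(triples, subject=BASE + "paper/42"):
    print(t.predicate, "->", t.obj)
```

Because every part of a triple is a unique identifier, statements from different sources about the same paper or object can be combined simply by pooling them. A plain triple has no room for details about the relationship itself, such as when it was valid, which is the gap that link tables of the kind found in CERIF address.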

The Semantic Web is often seen as a utopian view of a future where no electronic resources will be published without complex information being provided or automatically generated about its origins. The reality is that manual entry of information is generally very limited unless it serves the purposes of the person entering it, and this cannot be relied upon as an approach to ensuring large-scale, consistent metadata on a sufficient scale for the Semantic Web to work. Technology has in some cases improved to the extent that geographical and technical information is now automatically produced, for example in digital cameras and in mobile phones able to record GPS coordinates.

However, the effort and cost required to catalogue the entire physical world would be enormous, and it is highly doubtful whether this is even possible. Where the Semantic Web could be useful is within particular large bodies of data, for example experimental scientific data, publications and so on. In the case of the Web of Things, this could include art collections, photography, archaeological information, the locations of public institutions and many more. For all of these purposes, it will be necessary to provide unique identifiers for increasingly large numbers of resources, including things and agents, in order to provide complex metadata about them.

Education in the wider world

It has perhaps not been sufficiently investigated how unique identifiers for researchers and other staff in Higher Education will fit into the wider question of unique identification on the Web. Relevant purposes might be:

(1) commercial, for example the identification of companies and individuals owning the rights to photos, music, video or publications, particularly legacy resources of ongoing commercial value in terms of royalties and performance licensing.

(2) governmental, for example biometric information about people, used in border controls, crime prevention and citizenship contexts; or about public or private organisations such as charities, political groups of interest to law enforcement etc. Information about individuals, in particular, may be subject to privacy laws, which will vary between jurisdictions.

It is clear that there are interfaces between the various agents and outputs of academic institutions and many other purposes, notably those commercial and governmental activities already described. For example, a foreign student or member of staff seeking a work permit will require institutions and governmental bodies to use personal and citizenship information co-operatively, which will be linked to their academic identity in the course of their work at the institution. Some of this information will be private and some public, so there is an issue about who can see which parts of a particular corpus of Linked Data, requiring authentication protocols and systems.

The extent to which consistency of approach between HE institutions and other sectors and contexts can ever be ensured is moot, since there is of course no single international authority and because any single metadata solution that tried to cover so many diverse purposes would be fatally unwieldy. How different, flexible approaches can be understood by machine processing is perhaps the technological key to how well the Semantic Web will answer these questions in future, both within Higher Education and beyond.

JiscEXPO Emerging Themes

Adrian Stevenson — Thu, 13 Jan 2011 17:15:00 +0000

In the previous post I gave an overview of the JiscEXPO project outputs available so far, and hinted at ones coming soon. In this post I focus more on the themes and issues that are starting to appear. It can be quite difficult to distill these out of the information available, but I have been able to see a few patterns emerging, even though it is still relatively early days.

Archives Hub Record for Sir Ernest Shackleton

Given that linked data is, of course, about data, a number of issues have been appearing around this subject. Linked data will generally require some data modeling and, as the Locah project reports, this may mean having to change your data model mindset:

“it took me quite a while to get away from the idea of modelling the EAD record, rather than the actual data.”

“I found actually getting a ’starting point’ a bit difficult. I think this is because everything can be a starting point”

There can also be inherent complexities in the existing data that can make the modeling difficult:

“perhaps one of the thorniest [questions] is that arising from one of the fundamental characteristics of the nature of archival description [which is] typically based on a “hierarchical”, “multi-level” approach”

“One consequence of the multi-level approach in archival description practice is a strong sense of the importance of “context” … the descriptions of the “lower level” units should be read and interpreted in the context of the higher levels of description”

“So, there is arguably a (perhaps unavoidable) element of tension between the strongly “contextual” emphasis of EAD and ISAD(G) and the “bounded descriptions” of “Linked Data”.”

The process of transforming and exposing linked data can also highlight ‘dirty’ data, and issues around disambiguation. The MusicNet project mentions problems arising from different naming conventions and input error when looking for records that represent the same musical composer in multiple data sets. They’ve been experimenting with a data alignment tool they developed to help solve these issues, and have put together this YouTube video demo:

Locah have also been finding numerous examples of inconsistencies, for example where the ‘creator’ is ‘Joe Bloggs and others’ rather than just a name.

Linkbrainz have noted some scalability challenges that can arise with some linked data:

“[the] problem becomes acute for classical composers like Bach who are credited with tens of thousands recordings … the complete RDF resource description for Bach would be immense. This would cause an unacceptable load on the database server and long wait times for dereferenced URIs.”

They suggest a solution that uses the pagination of the HTML pages for the RDF or RDFa, but note that this is not ideal from a modeling point of view. They also mention that including RDFa in the MusicBrainz HTML pages can increase the page size by somewhere between 5% and 30%.

Linkbrainz have also had to contend with some licensing issues:

“… some content in the MusicBrainz database is licensed as by-nc-sa… JISC considers this license incompatible with completely open data. Therefore, this small subset of the MusicBrainz database will likely be omitted from our translation moving forward.”

However, the JiscOpenBib project appears to have sorted out data licensing without too many problems, having recently announced that the British Library is providing bibliographic data under CC0 Public Domain Dedication Licence.

MusicNet have drawn attention to the question of how to sustain the data from the JiscEXPO projects, and the HE sector in general in the longer term, suggesting that we need provision for UK academic data to be hosted on the JANET network under a suitable .ac.uk domain. A hosted data.ac.uk is proposed, possibly JISC funded, to lower the technical and financial barrier to entry to publish RDF. One suggestion is that this could be possible via the data.gov.uk education datastore.

Locah believe there is a significant skills and training gap in the linked data area, noting a lack of domain specific examples, and a lack of helpful information about how to create a data model. They suggest that at the moment, a certain level of expertise is needed to model data and output RDF, and that efforts to address this and make it easier would help the take up of linked data. They do however note that adopting the linked data approach is already paying dividends by making development more user focused:

“the very big plus with this different kind of thinking is that by definition it puts what the user is interested in at the forefront of your thinking”

So we can see the projects are meeting a range of challenges in exposing their linked data. It’s worth noting that many of the difficulties do not uniquely arise from outputting linked data, and in fact, the projects are in many cases simply ‘exposing’ existing problems that have thus far remained hidden behind data silos. It’s good to hear about the positive effects the linked data approach can have on helping to steer development in a more user focused direction. It will be interesting to see how the projects get on and what further themes arise when the demonstration prototypes start to appear this year.

JiscEXPO Quarterly Executive Newsletter

Adrian Stevenson — Wed, 01 Dec 2010 11:51:00 +0000

A key aim of this JISC programme newsletter is to highlight some of the outputs and emerging themes from across the ten projects that make up the ‘JiscEXPO’ programme funded by the 2/10 grant funding call, and, if possible, use these to identify any themes that cut across other ongoing JISC programmes. This work is invaluable to JISC, as the information is fed directly back into the overall evaluation of the programme which informs the Digital Infrastructure Directors, and which in turn is synthesised for JISC’s Senior Management Team.

The broad aims of the ‘exposing content for education and research’ call (‘JiscEXPO’ for short) were to make a collection of resources available as structured linked data by adopting Tim Berners-Lee’s ‘four rules of linked data’, and to produce a prototype that meets a ‘compelling end user case’. Projects were also invited to report on opportunities and barriers in making the linked data available so other UK HE and FE institutions could learn from their experiences. The information in this newsletter is mainly taken from the projects’ blogs, and most of the links below point to these. A single blog post from a project can end up being discussed at the top of JISC, which can and often does affect new policy decisions. So projects, please keep your blog posts coming!

We’re about a third of the way through JiscEXPO, and we’re now starting to see some emergent themes appear. I’ve attempted to summarise where JiscEXPO is up to so far, and give an overview of the themes arising.

A notable early JiscEXPO output was data.open.ac.uk from the LUCERO project which exposes the data available in the Open University’s various institutional repositories and makes it openly available for re-use. Already live are Open Research Online, the OU Podcasts, and some of the courses from the Study at the OU website.

The LinkBrainz project will be publishing the metadatabase from the popular MusicBrainz service as Linked Data, along with a number of tutorials for users. BBC Music is one of a number of sites that pulls in data from MusicBrainz, so we can look forward to some exciting developments following the release of their linked data. In September they announced that their RDFa test server was available, so we can now see what the data they’re embedding in MusicBrainz will look like. I had a quick look at the RDFa for John Coltrane’s ‘A Love Supreme’ using the Sindice service.

The fishDelish project have also just provided access to their FishBase species linked data, and the JISC OpenBibliography project expects to have data available soon as referred to in a recent progress report.

The LOCAH project will be making data from the Mimas based Archives Hub and Copac services available as Linked Data. I happen to know (as I’m managing the project) that we have linked data available via SPARQL interfaces on a number of test servers. We’re still working on refining our data models, as well as cleaning up and enhancing the data before we make these available publicly, but we have made details of our Hub and Copac modeling work available on the blog. Pete Johnston has also posted about our approach to URI patterns, and our blog post on the challenges of exposing linked data has been well received.

The aim of the JISC Open Citations project is to publish life science bibliographic citation data as Open Linked Data. They have recently made available their first four ontologies of SPAR, the Semantic Publishing and Referencing Ontologies, an integrated ecosystem of generic ontologies.

So, that gives an idea of where we’re up to with outputs so far. I’ve not covered every project in this post, so as not to make it too long, but I will be sure to highlight all the significant outputs in forthcoming posts (so please don’t be offended if you’re one of the JiscEXPO projects not mentioned yet). In the next post I’ll be looking at the themes that have been emerging.

JiscEXPO Programme Synthesis

Adrian Stevenson — Thu, 30 Sep 2010 10:46:00 +0000

Earlier this year, JISC issued the 2/10 Grant Funding Call for ‘Deposit of research outputs and Exposing digital content for education and research’, JiscDEPO and JiscEXPO for short. In addition to managing the LOCAH Project, which was funded as part of JiscEXPO, I am also now undertaking the ‘Synthesis Liaison’ role for the JiscEXPO programme (tag = #jiscexpo), working with programme manager David Flanders.

I’ve essentially only just started this role, so I’m still getting a sense of the other projects. I was already familiar with the JiscEXPO programme website as one of the participating projects, but I’ll now be getting more familiar with all the other projects too. The programme synthesis activities are described in more detail on the JiscEXPO site, but it’s basically about identifying ‘emergent patterns’ from across the projects, and ‘synthesising’ these. I’ll then be writing posts on this blog to let you all (and JISC) know what I find. I’ll also be commenting on each of the project blogs, and will be attempting to aid discussion and cross-pollination of ideas across the projects through comments and cross-links. This diagram from the JiscEXPO site aims to describe the process pictorially:

JiscEXPO Programme Synthesis Activity Overview

In many respects this is a new way of doing programme synthesis for JISC, so it’s going to be interesting to see how it goes.

Consuming and producing linked data in a content management system

Thom Bunting — Wed, 15 Sep 2010 12:46:11 +0000

At this summer's Institutional Web Management Workshop in Sheffield (IWMW 2010), I demonstrated how it is becoming feasible for a content management system both to consume and to produce linked data resources. In a parallel session, I presented an overview of the current state of play in 'Semantic content management: consuming and producing RDF in Drupal'. In a video-recorded plenary session (specifically in a nine-minute segment of the recording, from 34 through 42 minutes), I briefly reviewed how a modern CMS can enrich local datasets with remote linked datasets and, by engaging with the web of data, produce new insights. Here I explain the scope of what I demonstrated at this event, outline some practical implementation procedures, and evaluate initial results.

Scope: interim check on long sojourn towards promised land

The scope of my demonstration was limited: quickly testing current feasibility of consuming and producing linked data sets in a real-world context. Using recent developments in content management technology, work on this demo was designed to check:

how easily local datasets can be combined and enriched with remote datasets
how effectively a content management system can engage with linked data to provide new insights
how close we are to that promised land where linked data technology can provide real benefits to a broad range of websites

Choosing a context immediately relevant to participants in this year's Institutional Web Management Workshop, I decided to build a 'proof of concept' website providing a synoptic view of institutions and speakers participating over many years in IWMW events. From IWMW organisers I understood that quite a lot of data related to these events was already available, in discrete forms. Event information going back more than 10 years was accessible via many separate IWMW RSS feeds and web pages. Other datasets of interest, however, languished in office spreadsheets (until now, buried within the classic 'information silo'). With so many sources of data available, the challenge was to find a way of presenting disparate sets of information in a unified, manageable, understandable way. Having read with interest the explanations and arguments in "Exploiting Linked Data For Building Web Applications" by Michael Hausenblas (2009), my objective was to check how linked data technologies can have practical applications within a real-world website:

Semantic Web technologies are around now for a while, already. However, in the development of real-world Web applications these technologies have ... little impact to date. With linked data this situation has changed dramatically in the past couple of months. This article shows how linked datasets can be exploited to build rich Web applications with little effort.

Hausenblas observes that "in contrast to the full-fledged Semantic Web vision ... linked data is mainly about publishing structured data in RDF using URIs rather than focusing on the ontological level or inferencing":

This simplification -- just as the Web simplified the established academic approaches of Hypertext systems -- lowers the entry barrier for data provider, hence fosters a wide-spread adoption.

This 'simplicity wins' argument rings true with regard to many technology development patterns, and with human nature. Personally, it strongly reminds me of what I noticed during the early days of the web. During the mid-1990s I could well understand how SGML adherents disliked the relatively gross simplifications of HTML and its growing preoccupation with presentation and formatting rather than semantic structure. Nevertheless, it did seem clear then, as now, that simplification and widespread adoption were highly correlated.

Is it really becoming simpler, as Hausenblas and others recently claim, so that "linked datasets can be exploited to build rich Web applications with little effort", thanks to advances in content management systems? This summer, as I worked on a prototype website for my IWMW presentation, I remembered how long the journey has been to the long-anticipated 'semantic web' promised land. In Tim Berners-Lee's first recorded proposal for the World Wide Web, as drafted in March 1989 and then revised in 1990, there are remarkable indications of 'semantic web' notation, aligned with the much later development of RDF (as noted by Dan Brickley in 'Semantic Web History: Nodes and Arcs 1989-1999'). Looking back again this summer, I noticed how the 20 years elapsed since this first proposal did seem, in comparison to normally fast-paced 'internet time', very much like 40 years of wandering in the desert. I wondered whether the 'semantic web' promised land, flowing with linked data, was at last in sight.

With its core integration of a robust RDF API and its much-heralded functionality to produce and consume linked data, the forthcoming Drupal 7 promised, after two years of active planning and development, to bring linked data technologies into a widely used content management system:

While it is worthwhile to mention that the first of these [content management] systems appeared at around the same time as Semantic Web technologies emerged, with RDF being standardized in 1999, the development of CMSs and Semantic Web technologies have gone largely separate paths. Semantic Web technologies have matured to the point where they are increasingly being deployed on the Web. But the HTML Web still dwarfs this emerging Web of Data and — boosted by technologies such as CMSs — is still growing at much faster pace than the Semantic Web.... Approaching site administrators of widely used CMSs with easy-to-use tools to enhance their site with Linked Data will not only be to their benefit, but also significantly boost the Web of Data. (Corlosquet, Delbru, Clark, Polleres, Decker, 2009)

In designing a prototype website for the IWMW event, I specifically wanted to evaluate:

Beyond the handful of apps and websites described by Hausenblas as exemplary integrations of linked data (Faviki, DBpedia Mobile, BBC Music, Musicbrainz), how easy would exploiting linked data resources be for a broad range of websites managed by an open source content management system?
How faithfully can a modern CMS implement best-practice guidelines for exploiting linked data, such as those explained by Hausenblas?
Where some guidelines cannot yet be implemented, are practical benefits achievable?

Implementation procedures: from hypothetical to actual

Hausenblas describes how key linked data principles could be applied in building a hypothetical website:

imagine a historical ... website http://example.org/cw/ that deals with the topic 'Cold War' ... [and] assume the site is powered by a popular software such as Wordpress or Drupal. (Hausenblas, 2009)

Whereas Hausenblas bases explanations on hypotheticals, I wanted to evaluate more closely what can actually be achieved in building a website that exploits linked data resources, using a specific, currently available CMS. Given the buzz of anticipation for the forthcoming release of version 7 with core RDF integration, I chose Drupal as the best candidate for a feasibility test. Hausenblas explains, at a high level, two "steps needed for exploiting linked datasets in an exemplary Web application":

In order to exploit linked dataset[s] properly, basically two steps are required: (i) prepare your own data, and (ii) select appropriate target datasets.

Preparing local data

As explained in my post on the prototype website entitled 'Consuming and producing RDF: current arrangements', my first stage of work concentrated on local datasets:

extracting available data from IWMW registration details kept in office spreadsheets
compiling event information (session abstracts and speaker bios) from RSS feeds on IWMW website
cross-checking IWMW web pages for detailed information about sessions and speaker affiliations

During this first stage of local data extraction and compilation, I used Perl scripts to create relevant datasets. Overall, this first stage of work required more time and effort than the next stage. Because it needed ad hoc data-munging scripts, this work on local data ultimately proved more tedious than the more routine retrieval of linked data resources in stage two.

Selecting linked data resources

Once these local datasets were extracted and compiled for IWMW speakers and their affiliations, it became clear how DBpedia could supply quite a lot of useful linked data. During this second stage of work on the prototype, I used a combination of Perl scripts to retrieve and process RDF triples (including textual descriptions, statistics, geolocation coordinates, etc.) from DBpedia, and then Drupal utility modules ('Feeds' and 'Taxonomy CSV') to batch-load this data into relevant segments of the prototype 'IWMW synoptic' website.

Note: Forthcoming modules such as 'SPARQL views', as explained by Lin Clark in a project proposal and video, are designed to enable "average users to integrate SPARQL into their website workflow" without the need for external scripts. As I worked this summer on retrieving and integrating linked data into a demo website, however, this facility was missing in both Drupal 6 and the Drupal 7 alphas.
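As a rough illustration of the 'retrieve, process, then batch-load' step described above, here is a minimal Python sketch (the scripts used for the demo were written in Perl and are not reproduced here). It groups subject-predicate-object triples by subject and flattens them into CSV of the kind a batch-import module could load. The sample triples, the short predicate names and the column layout are all hypothetical placeholders, not data retrieved from DBpedia.

```python
import csv
import io

# Hypothetical sample triples (subject, predicate, object). The values are
# placeholders for illustration only, not data retrieved from DBpedia.
TRIPLES = [
    ("http://dbpedia.org/resource/Example_University", "label", "Example University"),
    ("http://dbpedia.org/resource/Example_University", "students", "12345"),
    ("http://dbpedia.org/resource/Example_University", "lat", "51.0"),
    ("http://dbpedia.org/resource/Example_University", "long", "-2.0"),
    ("http://dbpedia.org/resource/Another_Institute", "label", "Another Institute"),
]

# Assumed column layout for the CSV import; adjust to the target site.
COLUMNS = ["uri", "label", "students", "lat", "long"]


def triples_to_rows(triples):
    """Group triples by subject so that each resource becomes one flat record."""
    records = {}
    for subject, predicate, obj in triples:
        record = records.setdefault(subject, {"uri": subject})
        record[predicate] = obj
    return list(records.values())


def rows_to_csv(rows, columns=COLUMNS):
    """Serialise records as CSV text; values missing from a record stay empty."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = triples_to_rows(TRIPLES)
csv_text = rows_to_csv(rows)
print(csv_text)
```

Keeping the flattening separate from the serialisation means the same records could be written out in whatever format a given import module expects.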

Beyond these datasets from DBpedia, a range of further resources could be integrated given more time and scope to engage with the Web of Data:

filtered datasets retrieved from SPARQL queries on DBpedia (as illustrated by Martin Poulter in his follow-up blog post 'Getting information about UK HE from Wikipedia')
tags coordinated with Open Calais, via Faviki (correlated with DBpedia), or (more recently) via managed-thesaurus-tag-recommendation service such as PoolParty using 'SKOS thesauri enriched with Linked Data'
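To give a flavour of the first option, the following Python sketch builds (but does not send) a GET request URL for a SPARQL query against DBpedia's public endpoint. The query itself is only illustrative: the class and property names (dbo:University, dbo:numberOfStudents) and the 'format' parameter are assumptions that should be checked against the endpoint before use, and this is not the query used in Martin Poulter's post.

```python
from urllib.parse import urlencode

DBPEDIA_ENDPOINT = "http://dbpedia.org/sparql"

# Illustrative query only: the class and property names are assumptions
# and should be verified against the current DBpedia ontology.
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?university ?name ?students WHERE {
  ?university a dbo:University ;
              rdfs:label ?name ;
              dbo:numberOfStudents ?students .
  FILTER (lang(?name) = "en")
}
LIMIT 20
""".strip()


def build_query_url(endpoint, query, result_format="application/sparql-results+json"):
    """Return a GET URL carrying the query in the standard 'query' parameter.

    The 'format' parameter is an assumption about this endpoint; a more
    portable alternative is to send an HTTP Accept header instead.
    """
    params = {"query": query, "format": result_format}
    return endpoint + "?" + urlencode(params)


url = build_query_url(DBPEDIA_ENDPOINT, QUERY)
print(url)
```

The resulting URL could then be fetched with any HTTP client, and the result bindings flattened for loading in the same way as the other DBpedia data.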

Initial results, trends, and directions of travel

Even with the limited scope and time available for working on the 'IWMW synoptic' demo website, it was possible to produce quite a lot of initial results. Here are some links to views of local datasets enriched with linked data:

sortable table of participating organisations, compared with distance from event (exportable in .doc and .csv formats)
filterable and sortable table of participating organisations, compared with student numbers (exportable in .doc and .csv formats)
interactive map of organisations contributing speakers to IWMW (clickable pop-ups to display enriched data sets)
SPARQL endpoint producing RDF in wide range of formats (XML, JSON, Turtle etc)
SPARQL query form, available for queries on local and remote endpoints
selective summary of speakers bios, abstracts (note: complete dataset not yet loaded into prototype website)
overview of IWMW speaker affiliations (screenshot below)

How easy?

Short answer: CMS arrangements do make it remarkably easy to present local data enriched with linked data, accessible in both human-usable and machine-readable views. In the currently transitional state of Drupal development (as explained in 'Semantic content management: consuming and producing RDF in Drupal'), however, this requires quite a bit of ad hoc preparation. This summer, I needed to write custom scripts both for preparing local data and for retrieving linked data. This latter process of retrieving linked data should become easier when utility modules such as 'SPARQL views' and others become available following official release of Drupal 7. Only after a full complement of RDF modules becomes available following an official release of Drupal 7 can the optimistic vision of CMS advocates be justified:

Again, the [website] operator is in a comfortable position: for his system plug-ins exist allowing to expose data with just a few configuration changes. (Hausenblas, 2009)

My experience this summer proves that it takes more than just 'a few configuration changes' before a CMS manager can start consuming and producing linked data robustly. Such a 'comfortable position' is not yet quite a reality.

How faithful?

Hausenblas discusses three best-practice guidelines for making a content management system "Web-of-Data compliant":

re-using relevant ontologies and vocabularies (such as FOAF)
exposing linked data as RDF/XML, RDFa, or in SPARQL endpoints
minting URIs along the lines used by DBpedia (where machine-readable (RDF) and human-usable (HTML) versions are distinguished within URI spaces /resource and /html paths, ideally accessible via automated content negotiation)

Guideline 1: Re-using common vocabularies

Regarding the first guideline, I found that Drupal 6 RDF modules available this summer do facilitate re-use of commonly used vocabularies such as FOAF (and many others). In fact, just a few configuration changes were required for the demo site to output RDF such as this (abridged) excerpt:

Guideline 2: Exposing linked data in various formats

With regard to this second guideline for exposing linked data as RDF/XML, RDFa, or as query output from a SPARQL endpoint, I found that:

Drupal 6 RDF modules can easily export a range of linked data in RDF/XML format. (Upon official release of Drupal 7, there will be 'out of the box' support for RDFa output.)
It was easy to set up a SPARQL endpoint with just a few configuration changes, so that it could respond (in a very wide range of formats) to queries on triples compiled automatically (via cron runs) from website content.

As a result of the transitional state of module development pending final release of Drupal 7, however, I found that RDF/XML output included eccentric ('site') vocabulary tags. In effect this produced redundant noise in RDF which, albeit distracting to the human eye, could be safely ignored by machine-read processes keyed to a standard vocabulary such as FOAF.
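The point about ignoring vocabulary noise can be sketched in a few lines of Python: a consumer keyed to FOAF keeps only the triples whose predicate falls within the FOAF namespace and discards the rest. The FOAF namespace URI is the real one; the 'site' namespace, the node URI and the sample values are hypothetical stand-ins, since the demo site's actual output is not reproduced here.

```python
FOAF = "http://xmlns.com/foaf/0.1/"
# Hypothetical stand-in for the site-specific vocabulary described above.
SITE = "http://example.org/site-vocab#"

# Hypothetical triples of the kind a profile page might emit.
TRIPLES = [
    ("http://example.org/node/1", FOAF + "name", "A. Speaker"),
    ("http://example.org/node/1", SITE + "internal_field", "42"),
    ("http://example.org/node/1", FOAF + "homepage", "http://example.org/~speaker"),
]


def keep_vocabulary(triples, namespaces):
    """Keep only triples whose predicate starts with one of the given namespaces."""
    prefixes = tuple(namespaces)
    return [triple for triple in triples if triple[1].startswith(prefixes)]


foaf_only = keep_vocabulary(TRIPLES, [FOAF])
for subject, predicate, obj in foaf_only:
    print(subject, predicate, obj)
```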

Guideline 3: Mint machine-readable and human-usable URIs

Regarding this third guideline, I found that the current state of development in Drupal RDF modules could not support an ideal arrangement for automated content negotiation as implemented by DBpedia. Drupal 6 RDF modules do, however, support parallel RDF and HTML output using URI patterns such as:

http://iwmw-rdf.ukoln.info/node/48/rdf (person profile in RDF format)
http://iwmw-rdf.ukoln.info/node/48/ (person profile in HTML format)

Not ideal yet reasonably practical.
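To show what the missing piece would involve, here is a minimal Python sketch of content negotiation layered over the two parallel URI patterns above: it reads an HTTP Accept header and picks the RDF or the HTML address for a node. This is an assumption-laden illustration, not how Drupal or DBpedia implements negotiation; the function names and the list of RDF media types are my own.

```python
# Media types treated as requests for RDF (an assumed, non-exhaustive list).
RDF_TYPES = ("application/rdf+xml", "text/turtle")
HTML_TYPES = ("text/html", "application/xhtml+xml", "*/*")


def parse_accept(header):
    """Return acceptable media types, highest q value first (ties keep header order)."""
    entries = []
    for position, part in enumerate(header.split(",")):
        fields = [field.strip() for field in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        quality = 1.0
        for field in fields[1:]:
            if field.startswith("q="):
                try:
                    quality = float(field[2:])
                except ValueError:
                    quality = 0.0
        if quality > 0:
            entries.append((-quality, position, media_type))
    return [media_type for _, _, media_type in sorted(entries)]


def choose_uri(node_uri, accept_header):
    """Map a node URI to its RDF or HTML variant, defaulting to HTML."""
    base = node_uri.rstrip("/")
    for media_type in parse_accept(accept_header):
        if media_type in RDF_TYPES:
            return base + "/rdf"
        if media_type in HTML_TYPES:
            return base + "/"
    return base + "/"


print(choose_uri("http://iwmw-rdf.ukoln.info/node/48", "application/rdf+xml"))
print(choose_uri("http://iwmw-rdf.ukoln.info/node/48", "text/html,*/*;q=0.8"))
```

In a deployed arrangement the chosen address would normally be returned to the client as an HTTP redirect from a single canonical URI, rather than simply printed.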

The future?

If Hausenblas, Corlosquet and others are correct about prospects for CMS developments boosting the adoption of linked data technologies, this can dramatically broaden the numbers and types of websites engaged with the Web of Data. Probably more than 7 million websites were using Drupal in July 2010 (including many large, high-traffic and high-profile websites in commercial, governmental, and academic contexts). As more websites transition to the new Drupal 7, this can sharply increase the number of websites consuming and producing linked data. Is this the future as illustrated in the DrupalCon Boston 2008 keynote presentation 'Video from the future'? That keynote, which announced the start of work on integrating RDF into Drupal core, illustrated some interesting RDF 'web of data' mashups. The current focus is on increasing take-up. As illustrated by Google Trends, levels of interest in 'semantic web' technologies (as reflected in search volumes) declined steadily from 2004 to 2010.

By contrast, Google Trends indicates that search volume levels for 'linked data' are gradually rising. At this point, is active interest in 'linked data' overtaking long-established interest in the 'semantic web'? If Drupal's integration of RDF into its core functionality can help dramatically expand the number of websites engaging with linked data, this is good news for tribes on a long sojourn towards a promised land.

References

Michael Hausenblas, "Exploiting Linked Data to Build Web Applications," IEEE Internet Computing, vol. 13, no. 4, pp. 68-73, July/Aug. 2009, doi:10.1109/MIC.2009.79.
Stéphane Corlosquet, Renaud Delbru, Tim Clark, Axel Polleres, Stefan Decker, "Produce and Consume Linked Data with Drupal!", Proceedings of the 8th International Semantic Web Conference (ISWC 2009), Springer, 2009, doi:10.1007/978-3-642-04930-9_48.

Aggregation and the Resource Discovery Taskforce vision

Paul Walk — Thu, 19 Aug 2010 12:43:35 +0000

On Tuesday of this week, UKOLN convened a group of invited experts to discuss aggregation in the context of the Resource Discovery Taskforce's vision. The Resource Discovery Taskforce (RDTF), a joint JISC / RLUK venture, has summed up its vision:

UK researchers and students will have easy, flexible and ongoing access to content and services through a collaborative, aggregated and integrated resource discovery and delivery framework which is comprehensive, open and sustainable

Given the limitations of time and resources, and with a firm intention to make a real contribution, the RDTF has decided to focus on aggregation of metadata as a means to progressing the vision. There was some debate at the meeting about the extent to which aggregation is something worth focussing on, and a general concern that this not become an end in itself, rather than a means to an end. We agreed to use the phrase 'aggregation as a tactic' as a way of characterising the proper relationship of this approach to the vision, and steered the remainder of the meeting to address aggregation from a mainly technical perspective. To get the ball rolling, I introduced a slide wherein I attempt to list possible reasons for aggregating data:

to address systems/network latency - a cache
for ‘Web Scale concentration’
- ‘gaming’ Google - raising ‘visibility’ of content
- network effects if user facing services also developed
to showcase (e.g. scale & nature of OER in UK)
to create middleman business opportunities
as infrastructure to support locally developed services
as an approach to preservation

This was discussed at some length, and we agreed that some other reasons could be added to this list:

for economic reasons - e.g. to achieve economies of scale through storing & managing metadata in one place, implying that the aggregation becomes the sole source of a given metadata record
to add value to the data through processes, especially around data quality, which are impractical or even impossible to contemplate when the metadata is distributed
to simplify licensing from the point of view of the consumer of the aggregated data

We noted that while the RDTF vision seems to concentrate on metadata describing resources and their provision, other types of metadata, such as user-generated annotations and user attention or activity data, are also of great potential interest and value and might be aggregated advantageously.

The importance of registries to help in the identification and discovery of relevant data was raised.

For the second part of the day we broke the meeting up into three smaller groups, each concentrating on an aspect of the preceding general discussions. Each of these groups, when they summarised their discussions for the whole meeting later, identified issues and made recommendations. Where these are generally applicable (which they mostly are), rather than outline them in the following descriptions of the breakout groups I have treated them together in two sections at the end of this post.

Breakout 1: APIs

This group looked at the role which Application Programming Interfaces (APIs) have to play in an environment of aggregated metadata and related services. It used a spectrum of technological interventions ranging from specific service development to meet a particular need, through to generic infrastructure provision to provide opportunities for others to develop services, and attempted to place classes of APIs on this spectrum:

It was agreed that it was important to understand this distinction, and to be equipped to judge where to 'draw the line' between meeting specific requirements and investing in capacity for future innovation. There is clearly a tension between agility, which becomes more desirable as one moves along the spectrum towards services meeting users' requirements, and stability, which is necessary for infrastructure to be trusted. Part of the purpose of APIs is to help to manage this tension.

APIs are for developers, and so APIs on aggregations must be highly usable from the point of view of a developer. Focussing on the need for aggregations to expose APIs so that services can build upon them, this group made some recommendations (included in the general recommendations at the end of this post) about the sorts of general features an API should exhibit. In general, it was agreed that an API on an aggregation must be more convenient, from the point of view of a developer, than going directly to the individual sources. Leaving aside simple issues of network latency, in a possible Linked Data future where data is commonly openly available, the aggregation and its API must not become a barrier to building services and adding value to data.

This group also discussed the issue of federation of aggregations - where one aggregation feeds another. There are serious engineering issues with this kind of federation which require better understanding.

Breakout 2: Aggregation as tactic

This group decided to start by looking for "prior art": examples of successful uses of aggregation as a tactic for improving resource discovery. With this approach, it was suggested, it would be possible to identify stakeholder groups which are already 'bought into' the idea of using aggregation as a tactic in this way, which ought to be easier than convincing people from scratch. The trick would seem to be to identify a shared service which could be developed upon an aggregation of metadata, and which they could recognise would be beneficial to them. Examples of successful aggregations were identified and included:

Copac (aggregated records from National, Academic, and Specialist Library Catalogues)
SUNCAT (a national serials union catalogue)
Worldcat (a global, aggregated library catalogue)

Echoing an earlier point, the group suggested that the value in aggregation as a tactic comes from the ability to normalise metadata into some sort of canonical form. This aspect of the aggregation adding value to the data it aggregates is crucial if the source record holders are to be persuaded to participate.
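As a small illustration of what normalising into a canonical form can mean in practice, the Python sketch below maps records from two differently structured sources onto one shared set of field names and tidies whitespace along the way. The source names, field names and sample records are hypothetical; real aggregations such as those listed above use far richer schemas and matching rules.

```python
# Hypothetical per-source mappings: canonical field name -> source field name.
FIELD_MAPS = {
    "source_a": {"title": "title", "creator": "author", "date": "year"},
    "source_b": {"title": "dc_title", "creator": "dc_creator", "date": "dc_date"},
}


def normalise(record, source):
    """Map one source record onto the canonical field names.

    Missing fields become empty strings and runs of whitespace are collapsed,
    so that records from different sources can be compared directly.
    """
    mapping = FIELD_MAPS[source]
    canonical = {"source": source}
    for canonical_field, source_field in mapping.items():
        value = record.get(source_field, "")
        canonical[canonical_field] = " ".join(str(value).split())
    return canonical


record_a = {"title": "An  Example Title ", "author": "A. N. Author", "year": 2010}
record_b = {"dc_title": "An Example Title", "dc_creator": "A. N. Author"}

normalised = [normalise(record_a, "source_a"), normalise(record_b, "source_b")]
for item in normalised:
    print(item)
```

Keeping the original source name on each canonical record preserves a minimal form of provenance, which matters if source record holders are to trust the aggregation.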

The group suggested that JORUM's role in supporting the national (and global) Open Educational Resources (OER) movement was very much in line with this thinking: that JORUM enhances discoverability of OERs created in UK institutions, while simultaneously offering the potential for long term archiving (preservation). Again, this group suggested, the importance of the registry becomes apparent, with JORUM likely to become important as a service providing identification and 'provenance' services.

The group discussed the idea of concentrating on one particular domain, such as geography, on the grounds that this could then be built out to an extent that other domains would become interested once they had seen what has been achieved. The counter to this argument was a suggestion that it might be better to consider a range of resource types including scholarly communications (bibliographic data), learning materials, repositories, spatial/geographical data and multi-media.

It was also noted that the 'aggregation as a tactic' argument might apply to self-archiving and Open Access, where the case is similar to that made for JORUM and OERs.

It was suggested that this was leading to a set of tactics which would help content providers get over a 'fear' of aggregation, and encourage them to open up from a position of 'data ownership'. It was also recognised that once this is achieved, aggregation as a tactic creates opportunities for 'middle-men' to add value through new services building on top of the aggregation.

Interestingly, this group suggested that aggregation might be a short- or medium-term tactic, and that the 'end game' would be to dis-aggregate content back to source. At this point, the remaining infrastructure would be of the 'registry' type, helping to locate data at source.

Breakout 3: Build better websites!

The emphasis of this session was on advising and enabling those who hold source metadata to make it available in an appropriate form. The group identified a number of 'steps' that a content provider might take. These steps are ordered in a system of progressive desirability in a model influenced by Tim Berners-Lee's Linked Data Note:

make data available in an open form (even using the much-maligned CSV format if necessary)
assign and expose HTTP URIs for everything, and expose useful content at those URIs
publish as XML
expose semantics

It was noted that a provider need not work through these steps sequentially: it is perfectly acceptable, and even desirable, to jump in at step 4. However, this might represent a significant barrier to some, so steps 1-3 are there to give content providers a chance to engage comfortably.
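The four steps can be made concrete with a short Python sketch that takes a single record through each of them in turn: plain CSV, an HTTP URI for the item, an XML serialisation, and finally triples using a shared vocabulary. The URI space, the record and the XML element names are hypothetical, and the choice of Dublin Core for the last step is my own assumption, as the group did not name a vocabulary.

```python
import csv
import io
import xml.etree.ElementTree as ET

BASE_URI = "http://example.org/id/"      # hypothetical URI space
DC = "http://purl.org/dc/elements/1.1/"  # Dublin Core element set namespace

# Step 1: data made available in an open form, here plain CSV.
CSV_TEXT = "id,title,creator\n42,An Example Item,A. N. Author\n"
rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))


def mint_uri(row):
    """Step 2: assign an HTTP URI to the item."""
    return BASE_URI + row["id"]


def to_xml(row):
    """Step 3: publish the same record as XML (element names are made up)."""
    item = ET.Element("item", {"uri": mint_uri(row)})
    for field in ("title", "creator"):
        ET.SubElement(item, field).text = row[field]
    return ET.tostring(item, encoding="unicode")


def to_triples(row):
    """Step 4: expose semantics by describing the item with a shared vocabulary."""
    subject = mint_uri(row)
    return [
        (subject, DC + "title", row["title"]),
        (subject, DC + "creator", row["creator"]),
    ]


xml_text = to_xml(rows[0])
triples = to_triples(rows[0])
print(xml_text)
for triple in triples:
    print(triple)
```

Each step reuses the output of the one before, which is one way of reading the claim that the earlier steps give providers a comfortable route towards the last.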

Barriers specific to this model being adopted successfully include the issue of securing vendor 'buy-in'. For content providers to support this model, their software platforms need to enable it, which in most cases they may not do at present. Also, specific skills in Linked Data are not so widespread in these sectors (yet), and an appreciation of and support for Linked Data is not common among senior managers. It was recommended that JISC create some political momentum around this, perhaps devising a convincing argument for senior management. It was also suggested in this breakout group that RDTF should provide a central resource (guidance and possibly infrastructure) for hosting data, especially for smaller organisations.

This approach was summed up as a description of a potential glam.ac.uk where glam is galleries, libraries, archives and museums.

General Issues

Lack of technical expertise in libraries, museums and archives. This applies most strongly in respect of the 'build better websites' model, but is also true more generally, especially when the long-tail of glams is considered.
Business case, or possible lack thereof. The content providers need to see a clear benefit before committing to the cost involved in supporting the aggregation of their data.
Content providers often show a reluctance to make data openly available on the grounds that they may expose poor quality which reflects badly on them.

Recommendations

The various discussions during the meeting gave rise to a number of suggested recommendations. It should be noted that these are based on a few short hours of discussion - however the experience of the group which made them is considerable, so I hope they might be considered seriously.

The 4 step model for advising/supporting content providers in opening up their metadata
The RDTF should fund aggregation projects that demonstrate value in these steps
- e.g. "Tell me how my content is being used"
Providers should provide a semantic sitemap leading to a data aggregation. This could be RDF or XML
Providers should expose the schemas they use (whether their own schemas or links to established schemas)
Aggregation services should provide guidance to content providers about schemas to be used (a registry of recommended schemas would be a useful component)
Aggregators should not reject data on basis of schema used by the content provider - aggregators should be prepared to accept anything
The RDTF should (in partnership with others) seek to engage with vendors of collections/content management systems in the various domains.
Aggregations should have supported APIs which are attractive to and convenient for developers, offering developer-friendly output formats such as XML or JSON
Aggregation should be considered, perhaps, as a temporary approach to aiding discoverability. More extremely, a 'just in time' approach to aggregation might be considered.
A 'cookbook' of design patterns involving aggregation as a technical approach to resource discovery might be a useful thing to consider funding.
A '2 tier' model of metadata might be worth considering, where one tier is for common, basic description and identification, and the other tier is for more targeted uses.
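The '2 tier' idea in the last recommendation might be pictured as a record with a small, fixed core alongside an open-ended set of extensions. The Python sketch below is one hypothetical shape for such a record; the field names and the example extension are assumptions of mine, not something agreed at the meeting.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """A two-tier metadata record (hypothetical field names).

    Tier 1 holds common, basic description and identification.
    Tier 2 holds anything needed for more targeted uses, keyed by purpose.
    """
    identifier: str
    title: str
    provider: str
    extensions: dict = field(default_factory=dict)

    def core(self):
        """Return only the tier-1 fields that any consumer can rely on."""
        return {
            "identifier": self.identifier,
            "title": self.title,
            "provider": self.provider,
        }


record = Record(
    identifier="http://example.org/id/42",
    title="An Example Item",
    provider="Example Library",
    extensions={"preservation": {"checksum": "placeholder"}},
)
print(record.core())
print(sorted(record.extensions))
```

A generic discovery service could work from the core alone, while a specialised service looks up only the extension it understands.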

Many thanks to those who attended and made the meeting a success:

Peter Burnhill (Edina)
Hugh Glaser (Seme4)
David Kay (Sero)
Andrew Kitchen (Becta)
Ross MacIntyre (Mimas)
Andy McGregor (JISC)
Paul Miller (Cloud of Data)
Andy Powell (Eduserv)
Owen Stephens (independent)
Adrian Stevenson (UKOLN)
Paul Walk (UKOLN)
Jo Walsh (Edina)

And thanks to Adrian also for organising the meeting.