
Consuming and producing linked data in a content management system

Thom Bunting — Wed, 15 Sep 2010 12:46:11 +0000

At this summer's Institutional Web Management Workshop in Sheffield (IWMW 2010), I demonstrated how it is becoming feasible for a content management system both to consume and to produce linked data resources. In a parallel session, I presented an overview of the current state of play in 'Semantic content management: consuming and producing RDF in Drupal'. In a video-recorded plenary session (specifically in a nine-minute segment of the recording, from minute 34 through minute 42), I briefly reviewed how a modern CMS can enrich local datasets with remote linked datasets and, by engaging with the web of data, produce new insights. Here I explain the scope of what I demonstrated at this event, outline some practical implementation procedures, and evaluate initial results.

Scope: interim check on long sojourn towards promised land

The scope of my demonstration was limited: a quick test of the current feasibility of consuming and producing linked datasets in a real-world context. Using recent developments in content management technology, the demo was designed to check:

how easily local datasets can be combined and enriched with remote datasets
how effectively a content management system can engage with linked data to provide new insights
how close we are to that promised land where linked data technology can provide real benefits to a broad range of websites

Choosing a context immediately relevant to participants in this year's Institutional Web Management Workshop, I decided to build a 'proof of concept' website providing a synoptic view of institutions and speakers participating over many years in IWMW events. From IWMW organisers I understood that quite a lot of data related to these events was already available, in discrete forms. Event information going back more than 10 years was accessible via many separate IWMW RSS feeds and web pages. Other datasets of interest, however, languished in office spreadsheets (until now, buried within the classic 'information silo'). With so many sources of data available, the challenge was to find a way of presenting disparate sets of information in a unified, manageable, understandable way. Having read with interest the explanations and arguments in "Exploiting Linked Data For Building Web Applications" by Michael Hausenblas (2009), I set out to check how linked data technologies can be applied in practice within a real-world website:

Semantic Web technologies are around now for a while, already. However, in the development of real-world Web applications these technologies have ... little impact to date. With linked data this situation has changed dramatically in the past couple of months. This article shows how linked datasets can be exploited to build rich Web applications with little effort.

Hausenblas observes that "in contrast to the full-fledged Semantic Web vision ... linked data is mainly about publishing structured data in RDF using URIs rather than focusing on the ontological level or inferencing":

This simplification -- just as the Web simplified the established academic approaches of Hypertext systems -- lowers the entry barrier for data provider, hence fosters a wide-spread adoption.

This 'simplicity wins' argument rings true with regard to many technology development patterns, and with human nature. Personally, it strongly reminds me of what I noticed during the early days of the web. During the mid-1990s I could well understand how SGML adherents disliked the relatively gross simplifications of HTML and its growing preoccupation with presentation and formatting rather than semantic structure. Nevertheless, it did seem clear then, as now, that simplification and widespread adoption were highly correlated.

Is it really becoming simpler, as Hausenblas and others recently claim, so that "linked datasets can be exploited to build rich Web applications with little effort", thanks to advances in content management systems? This summer, as I worked on a prototype website for my IWMW presentation, I remembered how long the journey has been to the long-anticipated 'semantic web' promised land. In Tim Berners-Lee's first recorded proposal for the World Wide Web, as drafted in March 1989 and then revised in 1990, there are remarkable indications of 'semantic web' notation, aligned with the much later development of RDF (as noted by Dan Brickley in 'Semantic Web History: Nodes and Arcs 1989-1999'). Looking back again this summer, I noticed how the 20 years elapsed since this first proposal did seem, in comparison to normally fast-paced 'internet time', very much like 40 years of wandering in the desert. I wondered whether the 'semantic web' promised land, flowing with linked data, was at last in sight.

With its core integration of a robust RDF API and its much-heralded functionality to produce and consume linked data, the forthcoming Drupal 7 promised, after two years of active planning and development, to bring linked data technologies into a widely used content management system:

While it is worthwhile to mention that the first of these [content management] systems appeared at around the same time as Semantic Web technologies emerged, with RDF being standardized in 1999, the development of CMSs and Semantic Web technologies have gone largely separate paths. Semantic Web technologies have matured to the point where they are increasingly being deployed on the Web. But the HTML Web still dwarfs this emerging Web of Data and — boosted by technologies such as CMSs — is still growing at much faster pace than the Semantic Web.... Approaching site administrators of widely used CMSs with easy-to-use tools to enhance their site with Linked Data will not only be to their benefit, but also significantly boost the Web of Data. (Corlosquet, Delbru, Clark, Polleres, Decker, 2009)

In designing a prototype website for the IWMW event, I specifically wanted to evaluate:

Beyond the handful of apps and websites described by Hausenblas as exemplary integrations of linked data (Faviki, DBpedia Mobile, BBC Music, Musicbrainz), how easy would exploiting linked data resources be for a broad range of websites managed by an open source content management system?
How faithfully can a modern CMS implement best-practice guidelines for exploiting linked data, such as those explained by Hausenblas?
Where some guidelines cannot yet be implemented, are practical benefits achievable?

Implementation procedures: from hypothetical to actual

Hausenblas describes how key linked data principles could be applied in building a hypothetical website:

imagine a historical ... website http://example.org/cw/ that deals with the topic 'Cold War' ... [and] assume the site is powered by a popular software such as Wordpress or Drupal. (Hausenblas, 2009)

Whereas Hausenblas bases explanations on hypotheticals, I wanted to evaluate more closely what can actually be achieved in building a website that exploits linked data resources, using a specific, currently available CMS. Given the buzz of anticipation for the forthcoming release of version 7 with core RDF integration, I chose Drupal as the best candidate for a feasibility test. Hausenblas explains, at a high level, two "steps needed for exploiting linked datasets in an exemplary Web application":

In order to exploit linked dataset[s] properly, basically two steps are required: (i) prepare your own data, and (ii) select appropriate target datasets.

Preparing local data

As explained in my post on the prototype website entitled 'Consuming and producing RDF: current arrangements', my first stage of work concentrated on local datasets:

extracting available data from IWMW registration details kept in office spreadsheets
compiling event information (session abstracts and speaker bios) from RSS feeds on IWMW website
cross-checking IWMW web pages for detailed information about sessions and speaker affiliations

During this first stage of local data extraction and compilation, I used Perl scripts to create relevant datasets. Because it needed ad hoc data-munging scripts, this first stage required more time and effort, and ultimately proved more tedious, than the more routine retrieval of linked data resources in stage two.

Selecting linked data resources

Once these local datasets were extracted and compiled for IWMW speakers and their affiliations, it became clear that DBpedia could supply quite a lot of useful linked data. During this second stage of work on the prototype, I first used Perl scripts to retrieve and process RDF triples (including textual descriptions, statistics, geolocation coordinates, etc.) from DBpedia, and then used Drupal utility modules ('Feeds' and 'Taxonomy CSV') to batch-load this data into relevant segments of the prototype 'IWMW synoptic' website.
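As a rough illustration of the reshaping step in this stage (not the Perl scripts actually used for the prototype), the following Python sketch flattens query results in the standard SPARQL JSON results layout into CSV text of the kind a batch-loading module could import. The sample data, the column names and the function name `bindings_to_csv` are assumptions made up for this example.

```python
import csv
import io


def bindings_to_csv(results, columns):
    """Flatten SPARQL JSON results into CSV text, one row per binding.

    Variables missing from a binding (e.g. unmatched OPTIONAL patterns)
    become empty cells rather than errors.
    """
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(columns)
    for binding in results["results"]["bindings"]:
        writer.writerow([binding.get(col, {}).get("value", "") for col in columns])
    return out.getvalue()


# Made-up sample in the SPARQL JSON results layout (not real DBpedia output).
sample = {
    "head": {"vars": ["org", "lat", "long"]},
    "results": {"bindings": [
        {"org": {"type": "uri", "value": "http://dbpedia.org/resource/Example_University"},
         "lat": {"type": "literal", "value": "51.5"},
         "long": {"type": "literal", "value": "-2.5"}},
        {"org": {"type": "uri", "value": "http://dbpedia.org/resource/Another_University"}},
    ]},
}

csv_text = bindings_to_csv(sample, ["org", "lat", "long"])
```

Writing empty cells for missing values keeps every row the same width, which tends to matter more to CSV importers than to RDF tooling.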

Note: Forthcoming modules such as 'SPARQL views', as explained by Lin Clark in a project proposal and video, are designed to enable "average users to integrate SPARQL into their website workflow" without need for external scripts. As I worked this summer on retrieving and integrating linked data into a demo website, however, this facility was missing in both Drupal 6 and the Drupal 7 alphas.

Beyond these datasets from DBpedia, a range of further resources could be integrated given more time and scope to engage with the Web of Data:

filtered datasets retrieved from SPARQL queries on DBpedia (as illustrated by Martin Poulter in his follow-up blog post 'Getting information about UK HE from Wikipedia')
tags coordinated with Open Calais, via Faviki (correlated with DBpedia), or (more recently) via a managed thesaurus and tag recommendation service such as PoolParty using 'SKOS thesauri enriched with Linked Data'
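To give a flavour of the first option, here is a minimal Python sketch that builds a request URL for a filtered query against DBpedia's public SPARQL endpoint. It only constructs the URL and sends nothing. The class and property names used (`dbo:University`, `dbo:country`, `dbo:numberOfStudents`) are assumptions about DBpedia's ontology and may not match what was available in 2010 or what Martin Poulter's post uses.

```python
from urllib.parse import urlencode

# DBpedia's public SPARQL endpoint.
ENDPOINT = "http://dbpedia.org/sparql"

# Hypothetical query: UK universities with (optionally) their student numbers.
# The dbo: terms are assumptions and should be checked against the live ontology.
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?uni ?name ?students WHERE {
  ?uni a dbo:University ;
       dbo:country <http://dbpedia.org/resource/United_Kingdom> ;
       rdfs:label ?name .
  OPTIONAL { ?uni dbo:numberOfStudents ?students }
  FILTER (lang(?name) = "en")
}
LIMIT 50
""".strip()


def build_request_url(endpoint, query):
    """Return a GET URL carrying the query in the standard 'query' parameter."""
    return endpoint + "?" + urlencode({"query": query})


url = build_request_url(ENDPOINT, QUERY)
```

A client would normally also send an `Accept` header such as `application/sparql-results+json` to choose the result format.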

Initial results, trends, and directions of travel

Even with the limited scope and time available for working on the 'IWMW synoptic' demo website, it was possible to produce quite a lot of initial results. Here are some links to views of local datasets enriched with linked data:

sortable table of participating organisations, compared with distance from event (exportable in .doc and .csv formats)
filterable and sortable table of participating organisations, compared with student numbers (exportable in .doc and .csv formats)
interactive map of organisations contributing speakers to IWMW (clickable pop-ups to display enriched data sets)
SPARQL endpoint producing RDF in wide range of formats (XML, JSON, Turtle etc)
SPARQL query form, available for queries on local and remote endpoints
selective summary of speakers' bios and abstracts (note: complete dataset not yet loaded into prototype website)
overview of IWMW speaker affiliations (screenshot below)

How easy?

Short answer: CMS arrangements do make it remarkably easy to present local data enriched with linked data, accessible in both human-usable and machine-readable views. In the currently transitional state of Drupal development (as explained in 'Semantic content management: consuming and producing RDF in Drupal'), however, this requires quite a bit of ad hoc preparation. This summer, I needed to write custom scripts both for preparing local data and for retrieving linked data. The latter process should become easier when utility modules such as 'SPARQL views' and others become available following the official release of Drupal 7. Only once that full complement of RDF modules is available can the optimistic vision of CMS advocates be justified:

Again, the [website] operator is in a comfortable position: for his system plug-ins exist allowing to expose data with just a few configuration changes. (Hausenblas, 2009)

My experience this summer shows that it takes more than just 'a few configuration changes' before a CMS manager can start consuming and producing linked data robustly. Such a 'comfortable position' is not yet quite a reality.

How faithful?

Hausenblas discusses three best-practice guidelines for making a content management system "Web-of-Data compliant":

re-using relevant ontologies and vocabularies (such as FOAF)
exposing linked data as RDF/XML, RDFa, or in SPARQL endpoints
minting URIs along the lines used by DBpedia (where human-usable (HTML) and machine-readable (RDF) versions of each /resource URI are distinguished within separate URI paths, /page and /data respectively, ideally accessible via automated content negotiation)

Guideline 1: Re-using common vocabularies

Regarding the first guideline, I found that Drupal 6 RDF modules available this summer do facilitate re-use of commonly used vocabularies such as FOAF (and many others). In fact, just a few configuration changes were required for the demo site to output RDF such as this (abridged) excerpt:

Guideline 2: Exposing linked data in various formats

With regard to this second guideline for exposing linked data as RDF/XML, RDFa, or as query output from a SPARQL endpoint, I found that:

Drupal 6 RDF modules can easily export a range of linked data in RDF/XML format. (Upon official release of Drupal 7, there will be 'out of the box' support for RDFa output.)
It was easy to set up a SPARQL endpoint with just a few configuration changes, so that it could respond (in a very wide range of formats) to queries on triples compiled automatically (via cron runs) from website content.

As a result of the transitional state of module development pending final release of Drupal 7, however, I found that RDF/XML output included eccentric ('site') vocabulary tags. In effect this produced redundant noise in RDF which, albeit distracting to the human eye, could be safely ignored by machine-read processes keyed to a standard vocabulary such as FOAF.

Guideline 3: Mint machine-readable and human-usable URIs

Regarding this third guideline, I found that the current state of development in Drupal RDF modules could not support an ideal arrangement for automated content negotiation as implemented by DBpedia. Drupal 6 RDF modules do, however, support parallel RDF and HTML output using URI patterns such as:

http://iwmw-rdf.ukoln.info/node/48/rdf (person profile in RDF format)
http://iwmw-rdf.ukoln.info/node/48/ (person profile in HTML format)

Not ideal, yet reasonably practical.
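To make the gap concrete, the Python sketch below shows the kind of decision automated content negotiation would make over these two parallel URIs: given a client's `Accept` header, pick the RDF or the HTML address. This is an illustration only, not something the Drupal modules did; the function name `pick_uri` and the list of RDF media types are assumptions, and the header parsing is deliberately simplified.

```python
RDF_TYPES = ("application/rdf+xml", "text/turtle")
HTML_TYPES = ("text/html", "application/xhtml+xml", "*/*")


def parse_accept(accept_header):
    """Return (quality, media type) pairs from a simplified Accept header."""
    prefs = []
    for part in accept_header.split(","):
        fields = part.strip().split(";")
        media = fields[0].strip().lower()
        quality = 1.0
        for field in fields[1:]:
            name, _, value = field.strip().partition("=")
            if name == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0
        if media:
            prefs.append((quality, media))
    return prefs


def pick_uri(node_uri, accept_header):
    """Choose the /rdf variant only when an RDF type is strictly preferred."""
    base = node_uri.rstrip("/")
    prefs = parse_accept(accept_header)
    best_rdf = max((q for q, m in prefs if m in RDF_TYPES), default=0.0)
    best_html = max((q for q, m in prefs if m in HTML_TYPES), default=0.0)
    return base + "/rdf" if best_rdf > best_html else base + "/"


browser = pick_uri("http://iwmw-rdf.ukoln.info/node/48/",
                   "text/html,application/xhtml+xml;q=0.9,*/*;q=0.8")
crawler = pick_uri("http://iwmw-rdf.ukoln.info/node/48/", "application/rdf+xml")
```

Defaulting to HTML when nothing is stated keeps ordinary browsers on the human-usable page, which is also how the parallel URIs behave without any negotiation at all.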

The future?

If Hausenblas, Corlosquet and others are correct about prospects for CMS developments boosting the adoption of linked data technologies, this can dramatically broaden the numbers and types of websites engaged with the Web of Data. Probably more than 7 million websites were using Drupal in July 2010 (including many large, high-traffic and high-profile websites in commercial, governmental, and academic contexts). As more websites transition to the new Drupal 7, this can sharply increase the number of websites consuming and producing linked data. Is this the future as illustrated in the DrupalCon Boston 2008 keynote presentation 'Video from the future'? That keynote, which announced the start of work on integrating RDF into Drupal core, illustrated some interesting RDF 'web of data' mashups. The current focus is on increasing take-up. As illustrated by Google Trends, levels of interest in 'semantic web' technologies (as reflected in search volumes) declined steadily from 2004 to 2010.

By contrast, Google Trends indicates that search volume levels for 'linked data' are gradually rising. At this point, is active interest in 'linked data' overtaking long-established interest in the 'semantic web'? If Drupal's integration of RDF into its core functionality can help dramatically expand the number of websites engaging with linked data, this is good news for tribes on a long sojourn towards a promised land.

References

Michael Hausenblas, "Exploiting Linked Data to Build Web Applications," IEEE Internet Computing, vol. 13, no. 4, pp. 68-73, July/Aug. 2009, doi:10.1109/MIC.2009.79.
Stéphane Corlosquet, Renaud Delbru, Tim Clark, Axel Polleres, Stefan Decker, "Produce and Consume Linked Data with Drupal!", Proceedings of the 8th International Semantic Web Conference (ISWC 2009), Springer, 2009, doi:10.1007/978-3-642-04930-9_48.

Aggregation and the Resource Discovery Taskforce vision

Paul Walk — Thu, 19 Aug 2010 12:43:35 +0000

On Tuesday of this week, UKOLN convened a group of invited experts to discuss aggregation in the context of the Resource Discovery Taskforce's vision. The Resource Discovery Taskforce (RDTF), a joint JISC / RLUK venture, has summed up its vision:

UK researchers and students will have easy, flexible and ongoing access to content and services through a collaborative, aggregated and integrated resource discovery and delivery framework which is comprehensive, open and sustainable

Given the limitations of time and resources, and with a firm intention to make a real contribution, the RDTF has decided to focus on aggregation of metadata as a means of progressing the vision. There was some debate at the meeting about the extent to which aggregation is something worth focussing on, and a general concern that it should not become an end in itself rather than a means to an end. We agreed to use the phrase 'aggregation as a tactic' as a way of characterising the proper relationship of this approach to the vision, and steered the remainder of the meeting to address aggregation from a mainly technical perspective. To get the ball rolling, I introduced a slide in which I attempted to list possible reasons for aggregating data:

to address systems/network latency - a cache
for ‘Web Scale concentration’
- ‘gaming’ Google - raising ‘visibility’ of content
- network effects if user facing services also developed
to showcase (e.g. scale & nature of OER in UK)
to create middleman business opportunities
as infrastructure to support locally developed services
as an approach to preservation

This was discussed at some length, and we agreed that some other reasons could be added to this list:

for economic reasons - e.g. to achieve economies of scale through storing & managing metadata in one place, implying that the aggregation becomes the sole source of a given metadata record
to add value to the data through processes, especially around data quality, which are impractical or even impossible to contemplate when the metadata is distributed
to simplify licensing from the point of view of the consumer of the aggregated data

We noted that while the RDTF vision seems to concentrate on metadata describing resources and their provision, other types of metadata, such as user-generated annotations and user attention or activity data, which are also of great potential interest and value, might be aggregated advantageously.

The importance of registries to help in the identification and discovery of relevant data was raised.

For the second part of the day we broke the meeting up into three smaller groups, each concentrating on an aspect of the preceding general discussions. Each of these groups, when they summarised their discussions for the whole meeting later, identified issues and made recommendations. Where these are generally applicable (which they mostly are), rather than outline them in the following descriptions of the breakout groups I have treated them together in two sections at the end of this post.

Breakout 1: APIs

This group looked at the role which Application Programming Interfaces (APIs) have to play in an environment of aggregated metadata and related services. It used a spectrum of technological interventions ranging from specific service development to meet a particular need, through to generic infrastructure provision to provide opportunities for others to develop services, and attempted to place classes of APIs on this spectrum:

It was agreed that it was important to understand this distinction, and to be equipped to judge where to 'draw the line' between meeting specific requirements and investing in capacity for future innovation. There is clearly a tension between agility, which becomes more desirable as one moves along the spectrum towards servicing users' requirements, and stability, which is necessary for infrastructure to be trusted. Part of the purpose of APIs is to help to manage this tension.

APIs are for developers, and so APIs on aggregations must be highly usable from the point of view of a developer. Focussing on the need for aggregations to expose APIs so that services can build upon them, this group made some recommendations (included in the general recommendations at the end of this post) about the sorts of general features an API should exhibit. In general, it was agreed that an API on an aggregation must be more convenient, from the point of view of a developer, than going directly to the individual sources. Leaving aside simple issues of network latency, in a possible Linked Data future where data is commonly openly available, the aggregation and its API must not become a barrier to building services and adding value to data.

This group also discussed the issue of federation of aggregations - where one aggregation feeds another. There are serious engineering issues with this kind of federation which require better understanding.

Breakout 2: Aggregation as tactic

This group decided to start by looking for "prior art": examples of successful uses of aggregation as a tactic for improving resource discovery. With this approach, it was suggested, it would be possible to identify stakeholder groups which are already 'bought into' the idea of using aggregation as a tactic in this way, which ought to be easier than convincing people from scratch. The trick would seem to be to identify a shared service which could be developed upon an aggregation of metadata, and which they could recognise would be beneficial to them. Examples of successful aggregations were identified and included:

Copac (aggregated records from National, Academic, and Specialist Library Catalogues)
SUNCAT (a national serials union catalogue)
Worldcat (a global, aggregated library catalogue)

Echoing an earlier point, the group suggested that the value in aggregation as a tactic comes from the ability to normalise metadata into some sort of canonical form. This aspect of the aggregation adding value to the data it aggregates is crucial if the source record holders are to be persuaded to participate.
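As a very small illustration of what normalising into a canonical form can mean in practice, the Python sketch below maps differently named source fields onto one agreed set of field names, and keeps anything it does not recognise instead of rejecting the record. The field names, the alias table and the function name `normalise` are all hypothetical and were not discussed at the meeting.

```python
# Hypothetical canonical field names, each with source-side aliases.
FIELD_ALIASES = {
    "title": ("title", "dc:title"),
    "creator": ("creator", "dc:creator", "author"),
    "identifier": ("identifier", "dc:identifier", "id"),
}

# Reverse lookup: lower-cased source field name -> canonical field name.
ALIAS_LOOKUP = {
    alias.lower(): canonical
    for canonical, aliases in FIELD_ALIASES.items()
    for alias in aliases
}


def normalise(record):
    """Map a source record (string values assumed) onto canonical fields.

    Unrecognised or duplicate fields are preserved under 'extras' so that
    nothing supplied by the content provider is thrown away.
    """
    canonical, extras = {}, {}
    for key, value in record.items():
        target = ALIAS_LOOKUP.get(key.lower())
        if target and target not in canonical:
            canonical[target] = value.strip()
        else:
            extras[key] = value
    canonical["extras"] = extras
    return canonical


cleaned = normalise({"dc:title": " Example ", "Author": "A. N. Other", "shelfmark": "X1"})
```

Keeping the leftovers alongside the canonical fields means the aggregation adds value without discarding detail the source record holder may care about.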

The group suggested that JORUM's role in supporting the national (and global) Open Educational Resources (OER) movement was very much in line with this thinking: that JORUM enhances discoverability of OERs created in UK institutions, while simultaneously offering the potential for long term archiving (preservation). Here again, this group suggested, the importance of the registry becomes apparent, with JORUM likely to become important as a service providing identification and 'provenance' services.

The group discussed the idea of concentrating on one particular domain, such as geography, on the grounds that this could then be built out to an extent that other domains would become interested once they had seen what has been achieved. The counter to this argument was a suggestion that it might be better to consider a range of resource types including scholarly communications (bibliographic data), learning materials, repositories, spatial/geographical data and multi-media.

It was also noted that the 'aggregation as a tactic' argument might apply to self-archiving and Open Access, where similar arguments apply as for JORUM and OERs.

It was suggested that this was leading to a set of tactics which would help content providers get over a 'fear' of aggregation, and encourage them to open up from a position of 'data ownership'. It was also recognised that once this is achieved, aggregation as a tactic creates opportunities for 'middle-men' to add value through new services building on top of the aggregation.

Interestingly, this group suggested that aggregation as a tactic might be a short- or medium-term tactic, and that the 'end game' would be to dis-aggregate content back to source. At this point, the remaining infrastructure would be of the 'registry' type, helping to locate data at source.

Breakout 3: Build better websites!

The emphasis of this session was on advising and enabling those who hold source metadata to make it available in an appropriate form. The group identified a number of 'steps' that a content provider might take. These steps are ordered in a system of progressive desirability in a model influenced by Tim Berners-Lee's Linked Data Note:

make data available in an open form (even using the much-maligned CSV format if necessary)
assign and expose HTTP URIs for everything, and expose useful content at those URIs
publish as XML
expose semantics

It was noted that these steps do not demand that a provider should work their way through them sequentially: it is perfectly acceptable and even desirable to jump in at step 4. However, this might represent a significant barrier to some, so steps 1-3 are there to give content providers a chance to engage comfortably.
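The four steps can be pictured with a single record. The Python sketch below takes one made-up catalogue entry through each step in turn: CSV, an HTTP URI, XML, and finally statements using a shared vocabulary. The record, the `http://example.org/item/` URI space and the choice of Dublin Core terms serialised as N-Triples are assumptions for illustration, not recommendations made by the group.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical record and URI space, invented for this example.
record = {"id": "42", "title": "Example map of Bath", "creator": "Example Museum"}
BASE = "http://example.org/item/"

# Step 1: make the data available in an open form (CSV).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(record), lineterminator="\n")
writer.writeheader()
writer.writerow(record)
as_csv = buffer.getvalue()

# Step 2: assign an HTTP URI to the thing described.
uri = BASE + record["id"]

# Step 3: publish as XML, carrying the URI with the record.
item = ET.Element("item", {"uri": uri})
for field in ("title", "creator"):
    ET.SubElement(item, field).text = record[field]
as_xml = ET.tostring(item, encoding="unicode")

# Step 4: expose semantics by using a shared vocabulary (Dublin Core here).
# Values are assumed to contain no quotes or backslashes, so no escaping is done.
DC = "http://purl.org/dc/elements/1.1/"
as_ntriples = "\n".join(
    f'<{uri}> <{DC}{field}> "{record[field]}" .' for field in ("title", "creator")
)
```

Each step reuses the output of the earlier ones, which is one way of reading the claim that steps 1-3 let providers engage comfortably before tackling step 4.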

Barriers specific to this model being adopted successfully include the issue of securing vendor 'buy-in'. For content providers to support this model, their software platforms need to enable it, which in most cases may not be so at present. Also, specific skills in Linked Data are not so widespread in these sectors (yet), and an appreciation of and support for Linked Data is not common among senior managers. It was recommended that JISC create some political momentum around this, perhaps devising a convincing argument for senior management. It was also suggested in this breakout group that RDTF should provide a central resource (guidance and possibly infrastructure) for hosting data, especially for smaller organisations.

This approach was summed up as a description of a potential glam.ac.uk where glam is galleries, libraries, archives and museums.

General Issues

Lack of technical expertise in libraries, museums and archives. This applies most strongly in respect of the 'build better websites' model, but is also true more generally, especially when the long-tail of glams is considered.
Business case, or possible lack thereof. The content providers need to see a clear benefit before committing to the cost involved in supporting the aggregation of their data.
Content providers often show a reluctance to make data openly available on the grounds that they may expose poor quality which reflects badly on them.

Recommendations

The various discussions during the meeting gave rise to a number of suggested recommendations. It should be noted that these are based on a few short hours of discussion - however the experience of the group which made them is considerable, so I hope they might be considered seriously.

The 4 step model for advising/supporting content providers in opening up their metadata
The RDTF should fund aggregation projects that demonstrate value in these steps
- e.g. "Tell me how my content is being used"
Providers should provide a semantic sitemap leading to a data aggregation. This could be RDF or XML
Providers should expose the schemas they use (whether their own schemas or links to established schemas)
Aggregation services should provide guidance to content providers about schemas to be used (a registry of recommended schemas would be a useful component)
Aggregators should not reject data on the basis of the schema used by the content provider - aggregators should be prepared to accept anything
The RDTF should (in partnership with others) seek to engage with vendors of collections/content management systems in the various domains.
Aggregations should have supported APIs which are attractive to and convenient for developers, offering developer-friendly output formats such as XML or JSON
Aggregation should be considered, perhaps, as a temporary approach to aiding discoverability. More extremely, a 'just in time' approach to aggregation might be considered.
A 'cookbook' of design patterns involving aggregation as a technical approach to resource discovery might be a useful thing to consider funding.
A '2 tier' model of metadata might be worth considering, where one tier is for common, basic description and identification, and the other tier is for more targeted uses.

Many thanks to those who attended and made the meeting a success:

Peter Burnhill (Edina)
Hugh Glaser (Seme4)
David Kay (Sero)
Andrew Kitchen (Becta)
Ross MacIntyre (Mimas)
Andy McGregor (JISC)
Paul Miller (Cloud of Data)
Andy Powell (Eduserv)
Owen Stephens (independent)
Adrian Stevenson (UKOLN)
Paul Walk (UKOLN)
Jo Walsh (Edina)

And thanks to Adrian also for organising the meeting.