
HTML5 Case Study 1: Semantics and Metadata: Machine Understandable Documents

Administrator — Mon, 16 Jul 2012 12:10:39 +0000

Author: Sam Adams

1 About This Case Study

Institutions and researchers need to maintain and grow their reputations, which means increasing the exposure of their research outputs on the web. Embedding machine-understandable metadata into their Web sites helps by making those outputs more visible, easier to discover and more widely used.

The benefits of such approaches for institutions are:

Increased exposure of research (and other) outputs, and the effect this will have on assessment metrics, and hence funding.

The benefits for the individual include:

Increased personal exposure and recognition.
Standing out from the crowd in an increasingly competitive environment.
Assisting their own research, making it easier and more efficient to find things.
Increasing the usefulness of their own outputs.

This case study reviews the current mainstream approaches to embedding machine-understandable [FN 1] metadata into HTML documents: microformats, RDFa and microdata - and investigates their use for creating 'semantic' scholarly publications.

Note: all references to HTML5 microdata refer to the May 25, 2011 specification [Ref-07] unless otherwise stated. Changes contained in the editor's draft [Ref-08] have not been addressed.

Target Audience

This case study is primarily designed for developers and publishers interested in embedding machine-understandable metadata into their Web pages, those interested in extracting such data, and the wider community interested in the development of a semantic web.

It is also hoped that the communities behind the various technologies and specifications used in the course of this case study will be interested in the feedback regarding their usability and any limitations encountered.

Finally this study highlights areas where further work may help to develop standard approaches.

What Is Covered

This case study reviews the current state of the microformat, RDFa and microdata approaches to embedding semantic mark-up in HTML documents, and reports on their application to the encoding of semantic metadata in scholarly publications.

What Is Not Covered

HTML5 adds a number of new elements for describing the structure of a Web page semantically - e.g. article, header, section. These elements have been used in the course of carrying out this case study, but will not be discussed here.

Further information on the semantic HTML5 elements is available in this series of case studies [Ref-13] and Mark Pilgrim's Dive into HTML5 [Ref-11].

2 Introduction

Originally, the World Wide Web's content was designed solely for humans to read, not for computers to interpret in a meaningful way. Today the technologies to change this exist: by creating HTML with embedded semantics we can publish documents that both humans and machines can 'understand'. The growth in the publication of machine-understandable information is driving the emergence of a Semantic Web - "an extension of the current [web], in which information is given well-defined meaning, better enabling computers and people to work in cooperation" [Ref-2]. This is creating new opportunities, allowing heterogeneous data sources to be integrated and making it possible for software agents to infer new insights. These can be as 'straightforward' as helping users to discover information, or as complex as discovering new relationships between known disease symptoms and potential molecular targets for new drugs [Ref-10].

At the same time, it has become impractical for anyone to manually keep on top of the ever-accelerating volume of published text and data. Increasingly, the first reading (and filtering) of publications is done by a machine: this is effectively what search engines do. If you are not providing the appropriate machine-understandable metadata (the equivalent of writing a 'paragraph' for the machine to review), then humans are unlikely ever to get to see the document! On the other hand, providing rich metadata will make it easier for potential users to discover your content, and increase the likelihood that other services will direct people to your pages.

This report presents some examples showing how search engines currently exploit embedded semantic metadata, and demonstrates how such data can be authored. It then provides a broader review of the state of current technologies, before discussing some issues that remain to be addressed.

3 Case Study: Searching and Rich Snippets

Publishing machine-understandable metadata is not ‘blue skies’ thinking – organisations are doing it right now, and today’s search engines are exploiting it to improve their listings and provide a richer user experience.

Person Profiles: LinkedIn

Searches for 'sam adams cambridge' on both Google and Bing return my LinkedIn profile high in their hits. LinkedIn includes semantic markup of data in its profiles, and both search engines extract information from this to enrich their search listings.

Google displays my photo, location and current role, in what is termed a 'Rich Snippet':

Figure 1. Google display of author’s LinkedIn profile.

While Bing highlights my field of work, recommendations and connections:

Figure 2. Bing display of author’s LinkedIn profile.

These additions make the result stand out from surrounding hits, increasing the likelihood that someone will visit the page.

Google Recipe Search

When one performs a search for 'shepherds pie' on google.com [FN2], the search engine will present the user with rich results listings, and options to filter the results in meaningful ways:

Figure 3. The google.com rich results listings for search term ‘shepherds pie’.

Individual search hits (e.g. red box) can include a picture of the dish and information such as the number of reviews and average score, the cooking time and the number of calories per serving. Similarly, the user is given options (green box) to filter the recipes, e.g. selecting those that use lamb (rather than beef!) or those that require less than 30 minutes' cooking time. All this is achieved by the web sites publishing the recipes embedding appropriate semantic markup in their pages, allowing the search engine to 'understand' the content.

Similar workflows could be applied to searching in the scholarly domain, if appropriate semantically published data is made available. If the cookery business can do this, surely universities can - higher education is falling behind home-economics Web sites!

4 Example Application: Researchers' Homepages

All institutions provide homepages for their academic staff, and many for other staff and researchers too. These can be made to appear as ‘Rich Snippets’ in Google results with the addition of semantic markup for a small number of metadata elements:

Name
Address (locality, country)
Job Title
Photograph (optional)

The original markup is given below:

<article>
  <h1>Sam Adams</h1>
  <img src="http://www.ukoln.ac.uk/isc/html5-case-studies/adams/html/tn_sam-adams.jpg">
  <h2>Cambridge (UK) based Software Developer & Consultant</h2>
</article>

With semantic mark-up (using HTML5 Microdata / schema.org – see discussion below for details):

<article itemscope itemtype="http://schema.org/Person">
  <h1 itemprop="name">Sam Adams</h1>
  <img itemprop="image" src="http://www.ukoln.ac.uk/isc/html5-case-studies/adams/html/tn_sam-adams.jpg">
  <h2>
    <span itemprop="address" itemscope itemtype="http://schema.org/PostalAddress">
      <span itemprop="addressLocality">Cambridge</span>
      (<span itemprop="addressCountry">UK</span>)
    </span>
    based <span itemprop="jobTitle">Software Developer & Consultant</span>
  </h2>
</article>

Figure 4. Resulting Google ‘Rich Snippet’ [FN3]

5 Technical Discussions

The remainder of this report contains more detailed technical discussions. The technologies described above are reviewed in more detail, and some current issues discussed. Four areas are covered:

A review of the different approaches to embedding semantic metadata into HTML5 documents.
A review of the types of data/metadata found in the different scholarly publications under investigation.
An evaluation of the suitability of each of the methods of embedding semantic metadata for supporting the types of data required by this study.
Production of example works with embedded metadata.

Semantic data formats

This section provides an overview of the three major formats for embedding semantics in HTML documents – microformats, RDFa and microdata. For a comprehensive review of their implementation choices and support for different features see [Ref-15].

Microformats

Microformats [FN4] are simple conventions for embedding semantic mark-up about a specific domain into human-readable (X)HTML/XML documents. There are microformat specifications supporting a variety of types of data, a number of which have seen quite widespread up-take – e.g., hCard [FN5] for describing people and organisations, hCalendar [FN6] for describing calendars and events, and rel-tag [FN7] for marking up tags, keywords and categories in pages such as blog posts.

Microformats have been designed to be straightforward for humans to use, with mark-up based around existing, widely used HTML features as shown in Figure 5:

<p class="vcard">
  <a class="url fn" href="http://www.seadams.co.uk/">Sam Adams</a>
  is a <span class="role">software developer</span>.
</p>

Figure 5. Example of an hCard describing Sam Adams.

Note that in Figure 5 the vcard class on the p element indicates that the child elements form an hCard. The classes on those child elements (url, fn, role) indicate the properties they describe.

The major criticisms of the microformat specifications are:

Conflicts with formatting information: Microformats make wide use of the class HTML attribute which is more usually employed by selectors for style sheets giving presentation instructions for a page. While the HTML specifications permit the use of the class attribute "for general purpose processing by user agents" [FN8], overloading the attribute in this manner makes it impossible to tell whether a class attribute is being used for styling purposes, or to mark up a data field, and conflicts can arise when microformats are introduced to existing Web sites.

Processing challenges: The ambiguity between data and format specification also makes it impossible to extract marked-up data in a generic manner - a processor can only extract data conforming to microformats that it knows about. In the above example, a processor cannot know that it should associate the value of the a element's href attribute with the url property, and its text content with fn (full name), unless these rules are hard-coded.

Accessibility: a number of microformats use the abbr HTML element to encode text in both human-friendly and machine-readable formats. For example, a date-time may be encoded as:

<abbr title="20110921T14:00:00+0100">Wednesday 21st at 2 o’clock</abbr>

Unfortunately, this usage of the abbr element is not compatible with the screen readers used by many blind and partially sighted users, which has led some organisations, most notably the BBC [Ref-14] [Ref-5], to ban the use of microformats that make use of this pattern.

Approval process / Extensibility: in order to prevent conflicts between microformat and property names, new microformats require centralised registration, and approval through a community process [FN9]. This can make it a lengthy and sometimes difficult process to establish a microformat for a new type of data.
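The processing challenge described above can be illustrated with a minimal Python sketch. It is not taken from any microformats library: the HCARD_RULES table, the HCardParser class and the extract_hcard function are hypothetical names introduced here, and only the three properties from Figure 5 are handled. The point is that the processor has to carry hard-coded, per-property rules (url comes from href, fn and role come from text content), because nothing in the mark-up says so.

```python
from html.parser import HTMLParser

# Hard-coded knowledge of hCard: where each property's value lives.
# Nothing in the mark-up states this; the rules are an assumption baked
# into the processor, which is exactly the limitation being illustrated.
HCARD_RULES = {"url": "href", "fn": "text", "role": "text"}
VOID_TAGS = {"img", "br", "hr", "meta", "link", "input"}


class HCardParser(HTMLParser):
    """Collects the url, fn and role properties of the first hCard found."""

    def __init__(self):
        super().__init__()
        self.card = {}
        self._vcard_depth = None  # depth of the element carrying class="vcard"
        self._open = []           # per open element: properties collecting its text

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if self._vcard_depth is None and "vcard" in classes:
            self._vcard_depth = len(self._open)
        text_props = []
        if self._vcard_depth is not None:
            for name in classes:
                rule = HCARD_RULES.get(name)
                if rule == "href" and "href" in attrs:
                    self.card[name] = attrs["href"]
                elif rule == "text":
                    self.card[name] = ""
                    text_props.append(name)
        if tag not in VOID_TAGS:
            self._open.append(text_props)

    def handle_data(self, data):
        # Text belongs to every open element that is collecting text.
        for text_props in self._open:
            for name in text_props:
                self.card[name] += data

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._open:
            return
        self._open.pop()
        if self._vcard_depth is not None and len(self._open) <= self._vcard_depth:
            self._vcard_depth = None  # we have left the vcard element


def extract_hcard(html):
    parser = HCardParser()
    parser.feed(html)
    parser.close()
    return parser.card


FIGURE_5 = """
<p class="vcard">
  <a class="url fn" href="http://www.seadams.co.uk/">Sam Adams</a>
  is a <span class="role">software developer</span>.
</p>
"""

print(extract_hcard(FIGURE_5))
```

A class name that is absent from HCARD_RULES is silently ignored, so supporting a new microformat means editing the processor rather than the page.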

RDFa

The RDFa specification provides a mechanism for embedding RDF (the language of the Semantic Web) data models into XHTML documents. RDFa brings the full power of RDF to embedding semantic data into Web documents, and is automatically compatible with the work of the Semantic Web community. In contrast to microformats, RDF/RDFa embraces 'distributed extensibility': anyone can create a new vocabulary. This is achieved without having to worry about conflicting with another vocabulary’s names, by using a URL the authors control as a namespace for the vocabulary. Technologies such as RDF Schema (RDFS) and Web Ontology Language (OWL) enable the construction of machine-understandable descriptions of the required structure of RDF entities. The separation between data and formatting mark-up, combined with more strictly specified parsing rules, ensures that problems such as the url/fn ambiguity discussed above do not arise.

RDFa has, however, been widely criticised for its complexity in a number of areas:

XML basis: RDFa was originally developed for use with XHTML, and, as such, requires that documents be well formed XML. Since up-take of XHTML has been limited, the specification has been ported to support less well formed HTML; however, differences between HTML and XML can cause difficulties when processing RDF in HTML documents [FN10].

Use of prefixes: RDFa relies on XML namespace prefixes, which, it has been argued, “most authors simply do not understand, and which many implementors end up getting wrong” and “lead[s] to flaky copy-and-paste behaviour” [Ref-6]. This is further complicated by the prefixed terms (technically CURIEs, rather than QNames) appearing in attribute values, which few (if any) authoring tools understand, QNames generally being confined to element and attribute names.

Complex formatting rules: depending on the context in which they appear, relationships in RDFa are variously expressed using either a property, rel or rev attribute, and authors can easily be confused about which is the correct one to use for a given situation – using the wrong one can still generate a valid RDF graph, but not with the meaning the author intended.

The RDFa 1.1 specification, currently under development [FN11], aims to address such concerns, by:

Permitting use of full URIs as property names, rather than requiring prefixed CURIEs
Providing a mechanism for specifying a default vocabulary for a given scope within a document, thereby removing the need to prefix property names
Permitting the external definition of standard collections of prefixes, using ‘profile’ documents

While RDFa 1.0 is widely used, there are very few sites or applications currently supporting RDFa 1.1.

Microdata

The Microdata specification was created during the development of HTML5, with the aim of addressing the common use cases for embedding metadata, while avoiding some of the concerns that are raised around microformats and RDFa. James Graham of Opera [Ref-4] has stated that, “Compared to microformats I believe the HTML 5 microdata offers more consistent parsing rules [...] and cleaner separation from the rest of the markup language. Compared to RDFa, microdata offers a considerably simpler authoring experience which I believe to be critical to gaining traction with a large base of users.”

Microdata introduces a set of new attributes for specifying data ‘items’ and their properties. Items can be assigned a type (defined using a URL) which provides a context for prefix-less property names, similar to the role of namespaces in RDF/RDFa. Properties may also be specified using a URL, in which case they can be applied in any context, without requiring a specific item type. Currently there is no mechanism for providing machine-understandable specification of microdata vocabularies, or mapping between URL and ‘simple’ property names; so it is not possible to mix ‘simple’ names from different vocabularies in a single item. This contrasts with RDF/RDFa, where objects (items) can be assigned multiple classes (types), and it is straightforward to mix property names from different vocabularies.

The microdata specification currently includes instructions for mapping microdata to JSON. Some earlier versions of the specification have included instructions for converting HTML Microdata to RDF, but they have been removed from the current draft.
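To make the mapping to JSON more concrete, the following Python sketch walks a page and builds a structure in the spirit of the specification's JSON form (a list of items, each with a type list and a properties map whose values are lists). It is a simplified illustration under stated assumptions, not an implementation of the specification's algorithm: MicrodataParser and extract_microdata are hypothetical names, well-formed mark-up is assumed, and features such as itemref, itemid and URL resolution are ignored. Unlike a microformat processor, it needs no knowledge of any particular vocabulary.

```python
import json
from html.parser import HTMLParser

VOID_TAGS = {"img", "br", "hr", "meta", "link", "input"}
# Elements whose property value comes from an attribute rather than text
# (a subset of the cases listed in the microdata specification).
VALUE_ATTRS = {"meta": "content", "img": "src", "a": "href", "link": "href"}


class MicrodataParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.items = []        # top-level items
        self._item_stack = []  # items whose element is currently open
        self._elements = []    # one frame per open non-void element

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        names = (attrs.get("itemprop") or "").split()
        owner = self._item_stack[-1] if self._item_stack else None
        frame = {"item": None, "names": [], "owner": owner, "text": []}
        if "itemscope" in attrs:
            item = {"type": (attrs.get("itemtype") or "").split(),
                    "properties": {}}
            if names and owner is not None:
                for name in names:  # a nested item is a property value
                    owner["properties"].setdefault(name, []).append(item)
            else:
                self.items.append(item)
            frame["item"] = item
            self._item_stack.append(item)
        elif names and owner is not None:
            if tag in VALUE_ATTRS:
                value = attrs.get(VALUE_ATTRS[tag]) or ""
                for name in names:
                    owner["properties"].setdefault(name, []).append(value)
            else:
                frame["names"] = names  # value is the text content
        if tag not in VOID_TAGS:
            self._elements.append(frame)

    def handle_data(self, data):
        for frame in self._elements:
            if frame["names"]:
                frame["text"].append(data)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._elements:
            return
        frame = self._elements.pop()
        if frame["names"]:
            value = "".join(frame["text"])
            for name in frame["names"]:
                frame["owner"]["properties"].setdefault(name, []).append(value)
        if frame["item"] is not None:
            self._item_stack.pop()


def extract_microdata(html):
    parser = MicrodataParser()
    parser.feed(html)
    parser.close()
    return {"items": parser.items}


SAMPLE = """
<article itemscope itemtype="http://schema.org/Person">
  <h1 itemprop="name">Sam Adams</h1>
  <h2>
    <span itemprop="address" itemscope itemtype="http://schema.org/PostalAddress">
      <span itemprop="addressLocality">Cambridge</span>
      (<span itemprop="addressCountry">UK</span>)
    </span>
    based <span itemprop="jobTitle">Software Developer &amp; Consultant</span>
  </h2>
</article>
"""

print(json.dumps(extract_microdata(SAMPLE), indent=2))
```

The same function can be pointed at mark-up using any item type, since property names are taken directly from the itemprop attributes.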

Metadata available in scholarly works

This case study is not looking at adding new metadata to scholarly publications, but semantically encoding metadata that is already being recorded. The focus is on bibliographic and citation data – i.e. metadata about the publication itself, and about other publications that it cites and references.

PLoS Articles

The Public Library of Science (PLoS) [FN12] is an open access publisher. Alongside the conventional HTML and PDF formatted versions of papers they publish, PLoS also makes available raw XML versions (conforming to the U.S. National Library of Medicine Document Type Definition (NLM DTD)). The XML files contain considerable amounts of metadata, including:

Article title
Author names and affiliations
Citation (journal title, year, volume, pages)
Publisher
Publication date
URL
DOI
Reference list – titles, authors, citation (e.g., journal title, year, volume, issue, pages)

CrystalEye Entries

CrystalEye [FN13] is a repository aggregating openly published crystallographic molecular structures from across the Web. CrystalEye entries consist of Crystallographic Information Files and Chemical Markup Language XML files describing the crystallographic structure, as well as, recently, an RDF representation of information about the crystal. There is an HTML splash page for each entry, providing a summary of the crystal structure, and linking to the various resources (files) making up the entry. The full semantic data can already be retrieved as an RDF/XML file, but there are core items of metadata that, if encoded in the HTML splash page, could assist Web crawlers and browsers in respect of:

Title and authors of the crystal structure
Identity of molecular entities in the crystal structure
Citation for the original publication

Evaluation of suitability

Microformats

Microformats such as rel="license" and rel="tag" are likely to be useful for adding semantics to licence statements and content tags, due to their simplicity. However, there are currently no microformat specifications or drafts that address the more complex requirements of scholarly works. While there are ‘exploratory discussions’ around citations, this process appears to have been on-going for some years, and it is likely to be some time before a specification starts to emerge.

RDFa

RDF is widely used to process data in many communities, including the handling of scholarly metadata. This means there are already a large number of RDF vocabularies available; examples with particular relevance to scholarly publishing include:

Dublin Core
FOAF (Friend of a Friend)
Bibliographic Ontology
PRISM (Publishing Requirements for Industry Standard Metadata)
FRBR (Functional Requirements for Bibliographic Records)

The Dublin Core vocabulary is very widely used for marking up basic metadata (e.g. title, creator(s), description…) and is straightforward to use to mark-up a resource’s title:

<h1 property="dc:title">My Really Great Paper</h1>

where the dc prefix is bound to the namespace http://purl.org/dc/elements/1.1/

Author names are also straightforward to encode using Dublin Core in RDFa:

<p>
  <span property="dc:creator">Sam Adams</span>
  <span property="dc:creator">John Smith</span>
</p>

And more complex descriptions of an author can be supported:

<p>
  <span rel="dcterms:creator">
    <span property="foaf:name">Sam Adams</span>
    <span rel="foaf:url" resource="http://www.seadams.co.uk/" />
  </span>
</p>

where the dcterms prefix is bound to the namespace http://purl.org/dc/terms/
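The prefix bindings mentioned above are what turn a short term such as dc:title into a full property URI. A minimal Python sketch of that expansion is given below. The expand_curie function and the PREFIXES table are hypothetical names for illustration; the dc and dcterms namespaces are those stated in the text, while the foaf namespace URI is not given in this case study and is included on the assumption that the standard FOAF namespace is intended.

```python
# Prefix bindings that a page would declare for its RDFa mark-up.
PREFIXES = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
    "foaf": "http://xmlns.com/foaf/0.1/",  # assumption: standard FOAF namespace
}


def expand_curie(curie, prefixes=PREFIXES):
    """Expand a prefixed term (e.g. 'dc:title') to a full property URI."""
    prefix, separator, reference = curie.partition(":")
    if not separator or prefix not in prefixes:
        # An undeclared prefix is the kind of copy-and-paste error that
        # critics of RDFa prefixes point to.
        raise KeyError("no namespace bound for term %r" % curie)
    return prefixes[prefix] + reference


print(expand_curie("dc:title"))
print(expand_curie("dcterms:creator"))
```

If mark-up is copied into a page without its prefix declarations, the lookup fails, which is why a processor cannot recover the intended property from the attribute value alone.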

The existence of two versions of the Dublin Core vocabulary – the original 15 elements, and the larger set of DC terms – can cause confusion for authors: strictly following the specifications, a creator should be specified as a simple (‘literal’) string if using the original elements, and as an object with properties if using the DC terms vocabulary. This means that data of the form:

<p>
  <span rel="dcterms:creator">Sam Adams</span>
</p>

is not strictly permitted, although such constructs are quite commonly observed.

Bibliographic data

There are a number of RDF vocabularies for describing bibliographic data. During the course of this case study we have evaluated the two most widely used: the Bibliographic Ontology (BIBO) [FN14] and Publishing Requirements for Industry Standard Metadata (PRISM) [FN15]. Both vocabularies contain broadly equivalent terms (e.g. title, authors, journal, issue number, volume number…); however, strict conformance to their specifications imposes quite different structures on the data. Here we have focused on marking up journal article metadata, but the vocabularies can also be used to mark up bibliographic data about books, reports and other resources.

The PRISM vocabulary imposes a flat structure, consisting of an article, with a list of properties describing the bibliographic data.

Figure 6. The flat data structure imposed by the PRISM vocabulary.

In contrast, BIBO imposes a nested structure, where following the specification, an article is described as part of an issue, which is in turn part of a journal. According to BIBO’s specification, it is not permitted to use the properties in the ‘flat’ style of the PRISM structure. However, these rules are not always observed (e.g., by some of the examples found in the documentation of BIBO’s Web site!).

Figure 7. The nested data structure imposed by the Bibliographic Ontology.

A second difference is in marking up a journal’s name. While both vocabularies use the Dublin Core title property to mark up an article’s title, the PRISM vocabulary includes an explicit publicationName term, whereas BIBO uses Dublin Core title again (this is made possible by the nested data structure). These differences make BIBO well suited to building databases of bibliographic data, where it may be useful to model issues and journals explicitly. However, PRISM’s simpler data structure makes it better suited than BIBO for encoding bibliographic metadata in documents.
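The contrast between the flat and nested structures can be sketched as plain Python dictionaries. This is an illustration only, not a reproduction of Figures 6 and 7. The PRISM term names follow those used in the microdata example later in this study; the BIBO term names (bibo:volume, bibo:issue, bibo:pageStart, bibo:pageEnd), the use of dcterms:isPartOf for nesting, and the placement of volume on the issue are assumptions made for the sketch. The flatten_bibo helper is hypothetical.

```python
# Flat, PRISM-style description: every property hangs off the article.
prism_article = {
    "dc:title": "An investigation of FUD",
    "prism:publicationName": "J Interest Things",
    "prism:volume": "7",
    "prism:number": "2",
    "prism:startingPage": "162",
    "prism:endingPage": "164",
}

# Nested, BIBO-style description: article -> issue -> journal.
# The journal's name reuses dc:title, which is unambiguous only because
# it sits on a separate, nested object.
bibo_article = {
    "rdf:type": "bibo:Article",
    "dc:title": "An investigation of FUD",
    "bibo:pageStart": "162",
    "bibo:pageEnd": "164",
    "dcterms:isPartOf": {
        "rdf:type": "bibo:Issue",
        "bibo:volume": "7",
        "bibo:issue": "2",
        "dcterms:isPartOf": {
            "rdf:type": "bibo:Journal",
            "dc:title": "J Interest Things",
        },
    },
}


def flatten_bibo(article):
    """Project a nested BIBO-style record onto the flat PRISM-style shape."""
    issue = article["dcterms:isPartOf"]
    journal = issue["dcterms:isPartOf"]
    return {
        "dc:title": article["dc:title"],
        "prism:publicationName": journal["dc:title"],
        "prism:volume": issue["bibo:volume"],
        "prism:number": issue["bibo:issue"],
        "prism:startingPage": article["bibo:pageStart"],
        "prism:endingPage": article["bibo:pageEnd"],
    }


print(flatten_bibo(bibo_article) == prism_article)
```

Both records carry the same information; the nested form needs two extra objects to hold it, which is the overhead that makes the flat form more convenient inside a document.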

…

Figure 8. Describing an article’s bibliographic information using RDFa / PRISM vocabulary.

Microdata

Since microdata is a relatively recent development, there are not yet many vocabularies available. The first W3C version of the Microdata specification included a number of predefined types and property names for describing common structures. They were removed from subsequent drafts, but some standard vocabularies (vCard, vEvent and Licensing works) are still included in the current WHATWG specification.

Microdata received a major boost in June 2011, when Bing, Google and Yahoo! announced a joint initiative called schema.org [Ref-3] to support a common set of schemas for structured data mark-up on the Web. Schema.org has chosen to use microdata because it strikes a “balance between the extensibility of RDFa and the simplicity of microformats”. The primary benefit of marking up data using the schema.org vocabulary is to improve one’s display in search results. Google, for example, will display Rich Snippets [FN16] in its search listings for pages containing schema.org mark-up of supported data types, such as Events, Organisations and People.

Among its data types, schema.org includes a ScholarlyArticle type, which we can use to describe an article. Adding a title (name) to such an item is straightforward:

<article itemtype="http://schema.org/ScholarlyArticle" itemscope>
  <h1 itemprop="name">An investigation of FUD</h1>
</article>

Author names are a little more complicated, as you have to start a new Person item, and then attach properties to that:

<p>
  <span itemprop="author" itemscope itemtype="http://schema.org/Person">
    <span itemprop="name">Sam Adams</span>
  </span>,
  <span itemprop="author" itemscope itemtype="http://schema.org/Person">
    <span itemprop="name">John Smith</span>
  </span>
</p>

The schema.org specification does not permit the simpler:

<p>
  <span itemprop="author">Sam Adams</span>,
  <span itemprop="author">John Smith</span>
</p>

It nonetheless seems likely that many examples of this approach will appear as use of the schema.org vocabulary grows.

Bibliographic data

The schema.org vocabulary for ScholarlyArticles does not support concepts such as volume, issue number and DOI, which are needed to mark up journal papers’ bibliographic and citation data. This leaves three options for representing such data using Microdata:

1. Extend schema.org

The specification for schema.org allows Web masters to introduce new properties for existing schema.org classes, so we could simply introduce properties such as ‘volume’, ‘issueNumber’ and ‘doi’. However, this carries the risk that a property name we introduce could conflict with another extension. It would also be difficult to document these extensions: the natural place for a user to find information about properties of schema.org classes is the schema.org Web site, but there would be no information about our extensions there.

<p>
  <span itemprop="journalTitle">J Interest Things</span>
  <span itemprop="volumeNumber">7</span>
  (<span itemprop="issueNumber">2</span>)
  <span itemprop="pageStart">162</span>
  -<span itemprop="pageEnd">164</span>
</p>

2. Extend schema.org with external vocabularies

While Microdata properties whose names are plain words (e.g. ‘author’) can only be used within the context of item types for which they are defined, if properties are named using URLs, they can be used on items of any type, though this can end up being quite verbose:

<p>
  <span itemprop="http://prismstandard.org/namespaces/basic/2.0/publicationName">J Interest Things</span>
  <span itemprop="http://prismstandard.org/namespaces/basic/2.0/volume">7</span>
  (<span itemprop="http://prismstandard.org/namespaces/basic/2.0/number">2</span>)
  <span itemprop="http://prismstandard.org/namespaces/basic/2.0/startingPage">162</span>
  -<span itemprop="http://prismstandard.org/namespaces/basic/2.0/endingPage">164</span>
</p>
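When pages are generated from templates or scripts, the verbosity of URL-named properties can be kept out of the authoring step. The following Python sketch shows one possible approach; the prism_span and prism_citation helpers are hypothetical names introduced for illustration, and the namespace URL is the one used in the mark-up above.

```python
from html import escape

PRISM_NS = "http://prismstandard.org/namespaces/basic/2.0/"


def prism_span(term, value, namespace=PRISM_NS):
    """Return a <span> whose itemprop is the full URL of a vocabulary term."""
    return '<span itemprop="{}">{}</span>'.format(
        escape(namespace + term), escape(str(value)))


def prism_citation(journal, volume, number, first_page, last_page):
    """Build the citation paragraph using URL-named PRISM properties."""
    return "<p>{} {} ({}) {}-{}</p>".format(
        prism_span("publicationName", journal),
        prism_span("volume", volume),
        prism_span("number", number),
        prism_span("startingPage", first_page),
        prism_span("endingPage", last_page),
    )


print(prism_citation("J Interest Things", 7, 2, 162, 164))
```

The generated mark-up is as verbose as the hand-written version, but the long URLs appear only once in the source that produces it.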

3. Use a different vocabulary

We could create a whole new Microdata vocabulary for scholarly works (possibly building on an existing RDF vocabulary). However, this runs the risk of missing out on the ecosystem/support that may develop around schema.org, given the dominance of its backers.

Example works

To explore the options raised above further, tools have been developed to demonstrate the production of scholarly documents containing semantically encoded metadata:

PLoS Articles

As previously discussed, the raw XML is made available for articles published in PLoS journals. In order to generate examples of articles with semantically marked-up metadata, an XSLT stylesheet has been developed that transforms the XML articles into HTML5, with semantic mark-up of embedded metadata.

The stylesheet has been packaged into a Web application that is accessible at: http://html5app.bluefen.co.uk.

The source code for this application, including the XSLT stylesheet, is available from http://bitbucket.org/bluefen/html5app.

CrystalEye Entries

CrystalEye is powered by an instance of the Chempound data repository. Chempound generates splash pages for data items using a templating system. The templates used to generate splash pages for CrystalEye entries have been extended to encode core metadata: title and authors of the crystal structure, and citation of the source publication.

The repository is available at: http://crystaleye.ch.cam.ac.uk

6 Conclusions

Embedding semantic metadata into HTML pages is clearly a topic of current interest. Unfortunately there is not yet a clear standard for generating this mark-up; instead there are a number of competing formats. The strongest contenders seem to be RDFa and microdata, each of which has advantages and disadvantages when compared to the other. Given its longer history, RDFa is currently the more widely used of the two. On the other hand, given microdata's simpler form and its recent backing by the Web’s major search engines through the schema.org initiative, it seems likely that large amounts of microdata will start to appear shortly.

Assuming that microdata does take off, conventions for describing scholarly works will be needed. There are a number of options, though they all suffer from potential drawbacks:

Extend schema.org vocabularies; but the extensions could clash with someone else’s.
Mint a whole new microdata vocabulary of scholarly works; but this misses out the ecosystem/support that may develop around schema.org, given its backers
Use schema.org so far as possible, and import elements of other vocabularies, e.g. BIBO/PRISM; but this would rapidly become a bit untidy/unwieldy
Some other option.

There are advantages and disadvantages to each of these options, but the most important factor is consensus.

It is worth bearing in mind that the microdata specification is not yet finalised. At the same time, the current development of the RDFa 1.1 [Ref-1] specification appears to be addressing some of the concerns regarding the complexity of producing RDFa.

While it is unlikely that these efforts will merge anytime in the foreseeable future, ideally a mechanism for interoperability will develop.

7 Addendum

There have been a number of developments since this case study was initially written:

Late in September 2011 the W3C launched a Microdata/RDFa Task Force [FN17] to analyse the relationship between the two formats.
Work is ongoing on a ‘Microdata to RDF’ specification [Ref-9].
The microdata specification has been changed to allow an item to have multiple item types, so long as they all “are defined to use the same vocabulary” [Ref-8].
Schema.org have announced [Ref-12] that they are introducing support for RDFa 1.1 Lite [Ref-16] – “a very minimal subset that will work for 80% of the folks out there doing simple markup” – alongside microdata, in order to “allow publishers to focus more on what they want to say with their data, rather than on the details of its specific encoding as markup”.

It still does not look as though the microdata and RDFa efforts are likely to merge; however, efforts are clearly being made to improve their interoperability.

There is not yet any consensus as to whether one format will emerge as the de facto standard for data publication on the Web. My personal feeling is that RDFa is likely to be the stronger contender for this, since it offers greatest flexibility and supports complex data models. Moreover, the development of the RDFa 1.1, and especially the RDFa Lite 1.1, specifications has made it much simpler to publish than was previously the case (RDFa Lite 1.1 looks to be as simple to use as microdata). Microdata suffers from the limitation that it cannot support the more complex use cases for data publication, so will never be able to completely replace RDFa.

References

[1] Adida, B., Birbeck, M., McCarron, S., & Herman, I. (2011) RDFa Core 1.1. http://www.w3.org/TR/rdfa-core/

[2] Berners-Lee, T., Hendler, J., & Lassila, O. (2001) The Semantic Web. Scientific American. 17 May 2001. http://www.scientificamerican.com/article.cfm?id=the-semantic-web

[3] Google. (2011). Introducing schema.org: Search engines come together for a richer web. Webmaster Central Blog, 2 June 2011. http://googlewebmastercentral.blogspot.com/2011/06/introducing-schemaorg-search-engines.html

[4] Graham, J. (2009) Does anyone like microdata? Post to public-html@w3.org Fri, 26 Jun 2009. http://lists.w3.org/Archives/Public/public-html/2009Jun/0736.html

[5] Hassell, J. (2008). Why the BBC removed microformat DateTime patterns from bbc.co.uk. 4 July 2008. BBC Internet Blog. http://www.bbc.co.uk/blogs/bbcinternet/2008/07/why_the_bbc_removed_microforma.html

[6] Hickson, I. (2009). Annotating structured data that HTML has no semantics for. Post to [whatwg] list. Sun May 10 03:32:34 PDT 2009 http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2009-May/019681.html

[7] Hickson, I. (2011). HTML Microdata. W3C Working Draft 25 May 2011. http://www.w3.org/TR/2011/WD-microdata-20110525/

[8] Hickson, I. (2012). HTML Microdata. Editor’s Draft 6 February 2012. http://dev.w3.org/html5/md/

[9] Kellogg, G. (2011) Microdata to RDF. https://dvcs.w3.org/hg/htmldata/raw-file/37500d90742f/ED/microdata-rdf/20111118/index.html

[10] Neumann, E. K., Miller, E., & Wilbanks, J. (2004, November). What the semantic web could do for the life sciences. Drug Discovery Today 6(2) p228-236. http://lambda.csail.mit.edu/~chet/papers/others/n/neumann/neumann04biosi....

[11] Pilgrim, M. (2011). Dive Into HTML5: What Does It All Mean? http://diveintohtml5.info/semantics.html

[12] Schema.org (2011). Using RDFa 1.1 Lite with Schema.org. http://blog.schema.org/2011/11/using-rdfa-11-lite-with-schemaorg.html

[13] Sefton, P. (2012). Conventions and Guidelines for Scholarly HTML5 Documents. HTML5 Case Studies, UKOLN.

[14] Smethurst, M. (2008). Removing Microformats from bbc.co.uk/programmes, 23 June 2008. BBC Radio Labs Blog. http://www.bbc.co.uk/blogs/radiolabs/2008/06/removing_microformats_from_bbc.shtml

[15] Sporny, M. (2011a, June 11). An Uber-comparison of RDFa, Microdata and Microformats. http://manu.sporny.org/2011/uber-comparison-rdfa-md-uf/

[16] Sporny, M. (2011b). RDFa Lite 1.1 – W3C Editor’s Draft 30 October 2011. http://www.w3.org/2010/02/rdfa/drafts/2011/ED-rdfa-lite-20111030/

Footnotes

[1] Much of the information published on the web is machine-readable, but a much smaller proportion is currently machine-understandable. Information is machine-readable if it is published in a form that can be extracted and manipulated using a computer. If information is published in a machine-understandable manner, software agents can interpret it and reason over it. Unlike humans, machines cannot infer relationships and contexts, so in order to be machine-understandable, data must have clearly defined semantics and structure.

Information published using ASCII characters in an HTML page, or in a CSV file or spreadsheet (rather than using images and PDFs), is machine-readable. However, without clear structure and semantic annotations giving ‘meaning’ to each component of the information in a manner that a software agent can interpret, it is not machine-understandable.

[2] As of November 19, 2011, this functionality is only available on google.com, not google.co.uk.

[3] Generated using the Rich Snippets Testing Tool: http://www.google.com/webmasters/tools/richsnippets

[4] Microformats http://microformats.org/

[5] hCard http://microformats.org/wiki/hcard

[6] hCalendar http://microformats.org/wiki/hcalendar

[7] rel="tag" http://microformats.org/wiki/rel-tag

[8] HTML 4.01 Specification. Chapter 7: The global structure of an HTML document. http://www.w3.org/TR/html4/struct/global.html

[9] The microformats process http://microformats.org/wiki/process

[10] RDFa in HTML issues http://rdfa.info/wiki/Rdfa-in-html-issues

[11] RDFa 1.1 Nears Completion http://rdfa.info/2011/03/31/rdfa-1-1-almost-ready/

[12] The Public Library of Science http://www.plos.org/

[13] CrystalEye http://wwmm.ch.cam.ac.uk/crystaleye/

[14] Web site for the Bibliographic Ontology, known as BIBO http://bibliontology.com/

[15] Publishing Requirements for Industry Standard Metadata (PRISM) http://www.prismstandard.org/

[16] Rich snippets: http://www.google.com/support/webmasters/bin/topic.py?topic=21997

[17] HTML Data Task Force: http://www.w3.org/wiki/Html-data-tf

DC-2011

Alex Ball — Fri, 30 Sep 2011 13:13:00 +0000

On 21-23 September 2011, I attended the Eleventh International Conference on Dublin Core and Metadata Applications, known as DC-2011 to its friends but #dcmi11 to the true elite. The National Library of the Netherlands (KB) in The Hague made a pleasant setting for the event, although it was perhaps too small. That is to say, the public portion of it did not have sufficient rooms for all the parallel sessions, so some had to be held deep in the secure area of the building. This, as you can imagine, caused headaches for delegates and hosts alike and restricted movement between sessions. In spite of this there was a friendly and lively atmosphere.

On the first day there were tutorial sessions introducing the world of Dublin Core to those less familiar with it. I was not able to attend, and I feel I missed out as people kept telling me about meerkats being behind the name for the original 15 Dublin Core elements. Or something like that.

The conference proper kicked off on the second day with Mikael Nilsson explaining that interoperability (system B understanding what system A produced) is insufficient, and what we really need is harmonization. In other words, metadata that conform to multiple specifications, and systems that can understand and integrate multiple metadata schemes. If you're familiar with RDF and application profiles, you can see where this is going.

In the following plenary session, Jae-Eun Baek used a task-based, 5W1H model to compare different archival and preservation metadata schemes. The 5W1H refers to questions that the metadata are supposed to answer about a task: who does it, why they do it, what they do it to, and so on. The model revealed how different metadata schemes concentrate on different lifecycle stages. This was followed by Kai Eckert, who explained how the Dublin Core Abstract Model needs to be extended in order to provide proper support for recording the provenance of metadata. It involves allowing Description Sets to be the subject of further Descriptions (specifically Annotations); if you know about RDF named graphs, you'll recognise the concept.

The next session was all about mapping between different schemes. Gordon Dunsire argued that to get the benefit of working with Semantic Web technologies, we need to avoid translating values into different formats, and instead concentrate on mapping out the relationships between the properties themselves. Ahsan Morshed talked about how concepts in AGROVOC (an agricultural thesaurus) were mapped to other vocabularies; of particular interest was the way multiple languages were used to pin down the concepts in question. Lastly, Nuno Freire reported on efforts to transform subject headings from various schemes into sets of more specific properties (times, places, events), to make them easier for computers to work with.

The afternoon saw proceedings split into project reports and Dublin Core Community and Task Group workshops. I was involved in the Science and Metadata Community workshop. Jian Qin gave an update on the work she and I are doing with DataCite to produce a Dublin Core Application Profile version of the DataCite Metadata Specification. I gave an overview of current scientific metadata schemes with the aid of some diagrams based on the scoping study I conducted a couple of years ago. The other highlight was a presentation from Michael Lauruhn and Véronique Malaisé of Elsevier on their work with linked data, including the Elsevier Merged Medical Taxonomy (EMMeT) and the Data to Semantics research project.

The talk by Emmanuelle Bermès that kicked off the final day will probably best be remembered for its cookery metaphors, especially the 'stone soup'. If you're not aware of the fable that features stone soup, think of it as a benign slippery slope: some people who weren't willing to help make soup were persuaded instead to incrementally improve boiling water (with stones in) until it became soup. If data are the ingredients, and a functional web of linked data is the soup we're after, what are the 'stones' that will catalyse the transformation from one to the other?

The third plenary session presented the experience of people working with linked data. Antoine Isaac recounted how the Europeana digital library has been making a transition from Europeana Semantic Elements to the (linked-data-friendly) Europeana Data Model, the design decisions they had to make and problems they had normalizing their stock of data. Daniel Vila-Suero justified the style guidelines he and his colleagues have been working on for naming and labelling ontologies in the Multilingual Web. These are being trialled with IFLA's implementation of the FRBR model in RDF. Benjamin Zapilko talked about trying to perform statistical analysis directly through SPARQL. One of his conclusions was that it would probably be better to teach statistical packages SPARQL than to teach SPARQL statistics.

The final plenary collected some more examples of metadata usage in practice. Jörg Brunsmann gave the latest from the SHAMAN Project on handling engineering data, although of most interest to me was how he introduced the notion of Metadata Information Packages to OAIS. Mohammed Ourabah Soualah described the challenges of agreeing a common protocol for cataloguing Arabic manuscripts in Dublin Core, for a cross-search application. Finally, we had a screencast recorded by Oksana Zavalina on the different ways in which digital library collections handled collection-level metadata using the DC Collection Application Profile.

The afternoon was again a mixture of project updates and Community/Task Group meetings. The Registry Community meeting was largely taken up with discussions about the proposed requirements for a new system to manage DCMI's namespaces (and any that its Communities might want to set up). The highlight of the projects session was a paper on encoding the relationships between jazz musicians (e.g. influencedBy, mentorOf) in RDF.

The closing plenary consisted of two videos. The first was from the Free Your Metadata project, who provide guidance on using Google Refine to publish Linked Open Data. The second was an extensive and tuneful tourism advertisement for Malaysia, the host country for next year's conference.

That was my first experience of the Dublin Core conference, but with up to six parallel streams each afternoon, I can't claim to have a representative view of it. There was an entire unconference component I didn't experience at all. If there is a common theme I can pick out, it is that the technology still hasn't caught up with the demands of people working with the thornier issues of metadata. There was palpable impatience for Named Graphs to become an official part of RDF, for instance. I see a lot of potential for great work to come out of the Community meetings that form a major part of the Conference, and although I'm clearly biased, my own Community meeting was the highlight for me.

Aggregation and the Resource Discovery Taskforce vision

Paul Walk — Thu, 19 Aug 2010 12:43:35 +0000

On Tuesday of this week, UKOLN convened a group of invited experts to discuss aggregation in the context of the Resource Discovery Taskforce's vision. The Resource Discovery Taskforce (RDTF), a joint JISC / RLUK venture, has summed up its vision:

UK researchers and students will have easy, flexible and ongoing access to content and services through a collaborative, aggregated and integrated resource discovery and delivery framework which is comprehensive, open and sustainable

Given the limitations of time and resources, and with a firm intention to make a real contribution, the RDTF has decided to focus on aggregation of metadata as a means of progressing the vision. There was some debate at the meeting about the extent to which aggregation is something worth focussing on, and a general concern that this should not become an end in itself, rather than a means to an end. We agreed to use the phrase 'aggregation as a tactic' as a way of characterising the proper relationship of this approach to the vision, and steered the remainder of the meeting to address aggregation from a mainly technical perspective. To get the ball rolling, I introduced a slide in which I attempted to list possible reasons for aggregating data:

to address systems/network latency - a cache
for ‘Web Scale concentration’
- ‘gaming’ Google - raising ‘visibility’ of content
- network effects if user facing services also developed
to showcase (e.g. scale & nature of OER in UK)
to create middleman business opportunities
as infrastructure to support locally developed services
as an approach to preservation

This was discussed at some length, and we agreed that some other reasons could be added to this list:

for economic reasons - e.g. to achieve economies of scale through storing & managing metadata in one place, implying that the aggregation becomes the sole source of a given metadata record
to add value to the data through processes, especially around data quality, which are impractical or even impossible to contemplate when the metadata is distributed
to simplify licensing from the point of view of the consumer of the aggregated data

We noted that while the RDTF vision seems to concentrate on metadata describing resources and their provision, other types of metadata, such as user-generated annotations and user attention or activity data, are also of great potential interest and value, and might be aggregated advantageously.

The importance of registries to help in the identification and discovery of relevant data was raised.

For the second part of the day we broke the meeting up into three smaller groups, each concentrating on an aspect of the preceding general discussions. Each of these groups, when they summarised their discussions for the whole meeting later, identified issues and made recommendations. Where these are generally applicable (which they mostly are), rather than outline them in the following descriptions of the breakout groups I have treated them together in two sections at the end of this post.

Breakout 1: APIs

This group looked at the role which Application Programming Interfaces (APIs) have to play in an environment of aggregated metadata and related services. It used a spectrum of technological interventions ranging from specific service development to meet a particular need, through to generic infrastructure provision to provide opportunities for others to develop services, and attempted to place classes of APIs on this spectrum.

It was agreed that it was important to understand this distinction, and to be equipped to judge where to 'draw the line' between meeting specific requirements and investing in capacity for future innovation. There is clearly a tension between agility, which becomes more desirable as one moves along the spectrum towards services that meet users' requirements, and stability, which is necessary for infrastructure to be trusted. Part of the purpose of APIs is to help to manage this tension.

APIs are for developers, and so APIs on aggregations must be highly usable from the point of view of a developer. Focussing on the need for aggregations to expose APIs so that services can build upon them, this group made some recommendations (included in the general recommendations at the end of this post) about the sorts of general features an API should exhibit. In general, it was agreed that an API on an aggregation must be more convenient, from the point of view of a developer, than going directly to the individual sources. Leaving aside simple issues of network latency, in a possible Linked Data future where data is commonly openly available, the aggregation and its API must not become a barrier to building services and adding value to data.

This group also discussed the issue of federation of aggregations - where one aggregation feeds another. There are serious engineering issues with this kind of federation which require better understanding.

Breakout 2: Aggregation as tactic

This group decided to start by looking for "prior art" - examples of successful uses of aggregation as a tactic for improving resource discovery. With this approach, it was suggested, it would be possible to identify stakeholder groups which are already 'bought into' the idea of using aggregation as a tactic in this way, which ought to be easier than convincing people from scratch. The trick would seem to be to identify a shared service which could be developed upon an aggregation of metadata, and which they could recognise as beneficial to them. Examples of successful aggregations were identified and included:

Copac (aggregated records from National, Academic, and Specialist Library Catalogues)
SUNCAT (a national serials union catalogue)
WorldCat (a global, aggregated library catalogue)

Echoing an earlier point, the group suggested that the value in aggregation as a tactic comes from the ability to normalise metadata into some sort of canonical form. This aspect of the aggregation adding value to the data it aggregates is crucial if the source record holders are to be persuaded to participate.
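As a rough illustration of what normalising metadata into a canonical form can involve, the following Python sketch maps records from two sources onto one shared set of field names. The source names, field names and sample records are all hypothetical and are not taken from Copac, SUNCAT or WorldCat.

```python
# Hypothetical per-source mappings: source field name -> canonical field name.
FIELD_MAPS = {
    "library_a": {"ti": "title", "au": "creator", "yr": "date"},
    "library_b": {"Title": "title", "Author": "creator", "Published": "date"},
}


def normalise(source, record):
    """Return a copy of the record using canonical field names and tidy values."""
    mapping = FIELD_MAPS[source]
    canonical = {"source": source}
    for field, value in record.items():
        if field in mapping:
            # Convert to text and collapse runs of whitespace so that
            # equivalent values from different sources compare equal.
            canonical[mapping[field]] = " ".join(str(value).split())
    return canonical


records = [
    ("library_a", {"ti": "Metadata  basics", "au": "A. Author", "yr": 2009}),
    ("library_b", {"Title": "Metadata basics", "Author": "A. Author", "Published": "2009"}),
]
normalised = [normalise(source, record) for source, record in records]
```

Once both records share the same field names and value conventions, the aggregation can compare, de-duplicate or index them in ways that are awkward while the data remains in its source-specific shapes.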

The group suggested that JORUM's role in supporting the national (and global) Open Educational Resources (OER) movement was very much in line with this thinking: that JORUM enhances discoverability of OERs created in UK institutions, while simultaneously offering the potential for long term archiving (preservation). Again, the group suggested, the importance of the registry becomes apparent, with JORUM likely to become important as a service providing identification and 'provenance' services.

The group discussed the idea of concentrating on one particular domain, such as geography, on the grounds that this could then be built out to an extent that other domains would become interested once they had seen what has been achieved. The counter to this argument was a suggestion that it might be better to consider a range of resource types including scholarly communications (bibliographic data), learning materials, repositories, spatial/geographical data and multi-media.

It was also noted that the 'aggregation as a tactic' argument might apply to self-archiving and Open Access, where the arguments are similar to those for JORUM and OERs.

It was suggested that this was leading to a set of tactics which would help content providers get over a 'fear' of aggregation, and encourage them to open up from a position of 'data ownership'. It was also recognised that once this is achieved, aggregation as a tactic creates opportunities for 'middle-men' to add value through new services building on top of the aggregation.

Interestingly, this group suggested that aggregation as a tactic might be a short- or medium-term tactic, and that the 'end game' would be to dis-aggregate content back to source. At this point, the remaining infrastructure would be of the 'registry' type, helping to locate data at source.

Breakout 3: Build better websites!

The emphasis of this session was about advising & enabling those who hold source metadata to make it available in an appropriate form. The group identified a number of 'steps' that a content provider might take. These steps are ordered in a system of progressive desirability in a model influenced by Tim Berners-Lee's Linked Data Note:

make data available in an open form (even using the much-maligned CSV format if necessary)
assign and expose HTTP URIs for everything, and expose useful content at those URIs
publish as XML
expose semantics

It was noted that these steps do not demand that a provider should work their way through them sequentially - it is perfectly acceptable and even desirable to jump in at step 4. However, this might represent a significant barrier to some, so steps 1-3 are there to give content providers a chance to engage comfortably.
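The four steps can be sketched for a single record as follows. This is only an illustration under stated assumptions: the record, the example.org base URI and the element names are hypothetical, and Dublin Core terms are used for step 4 as one possible vocabulary rather than one the group recommended.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical catalogue record and base URI, used only for illustration.
BASE_URI = "http://example.org/items/"
record = {"id": "42", "title": "Field notebook", "creator": "A. Collector"}


def as_csv(rec):
    """Step 1: publish the data in an open form, here CSV."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rec))
    writer.writeheader()
    writer.writerow(rec)
    return buffer.getvalue()


def item_uri(rec):
    """Step 2: give the item an HTTP URI (serving content there is not shown)."""
    return BASE_URI + rec["id"]


def as_xml(rec):
    """Step 3: publish the record as XML, identified by its URI."""
    item = ET.Element("item", {"uri": item_uri(rec)})
    for field in ("title", "creator"):
        ET.SubElement(item, field).text = rec[field]
    return ET.tostring(item, encoding="unicode")


def as_triples(rec):
    """Step 4: expose semantics as subject-predicate-object statements."""
    dc = "http://purl.org/dc/terms/"
    subject = item_uri(rec)
    return [
        (subject, dc + "title", rec["title"]),
        (subject, dc + "creator", rec["creator"]),
    ]
```

Each function builds on the same underlying record, which reflects the point that a provider can start at whichever step is comfortable without redoing earlier work.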

Barriers specific to this model being adopted successfully include the issue of securing vendor 'buy-in'. For content providers to support this model, their software platforms need to enable it. In most cases this may not be so at present. Also, specific skills in Linked Data are not so widespread in these sectors (yet), and an appreciation of and support for Linked Data is not common among senior managers. It was recommended that JISC create some political momentum around this, perhaps devising a convincing argument for senior management. It was also suggested in this breakout group that RDTF should provide a central resource (guidance & possibly infrastructure) for hosting data, especially for smaller organisations.

This approach was summed up as a description of a potential glam.ac.uk where glam is galleries, libraries, archives and museums.

General Issues

Lack of technical expertise in libraries, museums and archives. This applies most strongly in respect of the 'build better websites' model, but is also true more generally, especially when the long-tail of glams is considered.
Business case, or possible lack thereof. The content providers need to see a clear benefit before committing to the cost involved in supporting the aggregation of their data.
Content providers often show a reluctance to make data openly available on the grounds that they may expose poor quality which reflects badly on them.

Recommendations

The various discussions during the meeting gave rise to a number of suggested recommendations. It should be noted that these are based on a few short hours of discussion - however the experience of the group which made them is considerable, so I hope they might be considered seriously.

Adopt the 4-step model for advising/supporting content providers in opening up their metadata
The RDTF should fund aggregation projects that demonstrate value in these steps
- e.g. "Tell me how my content is being used"
Providers should provide a semantic sitemap leading to a data aggregation. This could be RDF or XML
Providers should expose the schemas they use (whether their own schemas or links to established schemas)
Aggregation services should provide guidance to content providers about schemas to be used (a registry of recommended schemas would be a useful component)
Aggregators should not reject data on the basis of the schema used by the content provider - aggregators should be prepared to accept anything
The RDTF should (in partnership with others) seek to engage with vendors of collections/content management systems in the various domains.
Aggregations should have supported APIs which are attractive to and convenient for developers, offering developer-friendly output formats such as XML or JSON
Aggregation should be considered, perhaps, as a temporary approach to aiding discoverability. More extremely, a 'just in time' approach to aggregation might be considered.
A 'cookbook' of design patterns involving aggregation as a technical approach to resource discovery might be a useful thing to consider funding.
A '2 tier' model of metadata might be worth considering, where one tier is for common, basic description and identification, and the other tier is for more targeted uses.
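One way to picture the '2 tier' model together with developer-friendly output is the Python sketch below, in which a record carries a small common tier and an optional tier of more targeted metadata, and is serialised as JSON. The class, its field names and the "geo" extension are assumptions made for illustration, not something specified at the meeting.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Record:
    # Tier 1: common, basic description and identification.
    identifier: str
    title: str
    # Tier 2: more targeted metadata, keyed by a hypothetical profile name.
    extensions: dict = field(default_factory=dict)


def to_json(record, tiers=1):
    """Serialise a record as JSON; tier 2 is included only when asked for."""
    data = asdict(record)
    if tiers < 2:
        del data["extensions"]
    return json.dumps(data, sort_keys=True)


rec = Record(
    identifier="http://example.org/items/42",
    title="Field notebook",
    extensions={"geo": {"place": "Bath"}},
)
```

A consumer that only needs discovery-level data can call to_json(rec), while a more specialised service can request to_json(rec, tiers=2) and read the extra tier.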

Many thanks to those who attended and made the meeting a success:

Peter Burnhill (Edina)
Hugh Glaser (Seme4)
David Kay (Sero)
Andrew Kitchen (Becta)
Ross MacIntyre (Mimas)
Andy McGregor (JISC)
Paul Miller (Cloud of Data)
Andy Powell (Eduserv)
Owen Stephens (independent)
Adrian Stevenson (UKOLN)
Paul Walk (UKOLN)
Jo Walsh (Edina)

And thanks to Adrian also for organising the meeting.
