HTML5 Case Study 1: Semantics and Metadata: Machine Understandable Documents

Author: Sam Adams

1 About This Case Study

Institutions and researchers need to maintain and grow their reputations, which means increasing the exposure of their research outputs on the Web. Embedding machine-understandable metadata in their Web sites helps to do this by making those outputs more visible, easier to discover and more widely used.

The benefits of such approaches for institutions are:

  • Increased exposure of research (and other) outputs, and the effect this will have on assessment metrics, and hence funding.

The benefits for the individual include:

  • Increased personal exposure and recognition.
  • Standing out from the crowd in an ever increasingly competitive environment.
  • Assisting their own research, making it easier and more efficient to find things.
  • Increasing the usefulness of their own outputs.

This case study reviews the current mainstream approaches to embedding machine-understandable [FN 1] metadata into HTML documents: microformats, RDFa and microdata - and investigates their use for creating 'semantic' scholarly publications.

Note: all references to HTML5 microdata refer to the May 25, 2011 specification [Ref-07] unless otherwise stated. Changes contained in the editor's draft [Ref-08] have not been addressed.

Target Audience

This case study is primarily designed for developers and publishers interested in embedding machine-understandable metadata into their Web pages, those interested in extracting such data, and the wider community interested in the development of a semantic web.

It is also hoped that the communities behind the various technologies and specifications used in the course of this case study will be interested in the feedback regarding their usability and any limitations encountered.

Finally this study highlights areas where further work may help to develop standard approaches.

What Is Covered

This case study reviews the current state of the microformat, RDFa and microdata approaches to embedding semantic mark-up in HTML documents, and reports on their application to the encoding of semantic metadata in scholarly publications.

What Is Not Covered

HTML5 adds a number of new elements for describing the structure of a Web page semantically - e.g. article, header, section. These elements have been used in the course of carrying out this case study, but will not be discussed here.

Further information on the semantic HTML5 elements is available in this series of case studies [Ref-13] and Mark Pilgrim's Dive into HTML5 [Ref-11].

2 Introduction

Originally the World Wide Web's content was designed solely for humans to read, not for computers to interpret in a meaningful way. Today the technologies to change this exist: by creating HTML with embedded semantics we can publish documents that both humans and machines can 'understand'. The growth in the publication of machine-understandable information is driving the emergence of a Semantic Web - "an extension of the current [web], in which information is given well-defined meaning, better enabling computers and people to work in cooperation" [Ref-2]. This is creating new opportunities, allowing heterogeneous data sources to be integrated and making it possible for software agents to infer new insights. These can be as 'straightforward' as helping users to discover information, or as complex as discovering new relationships between known disease symptoms and potential molecular targets for new drugs [Ref-10].

At the same time, it has become impractical for anyone to manually keep on top of the ever accelerating volume of published text and data. Increasingly the first reading (and filtering) of publications is done by a machine - this is effectively what search engines do. If you're not providing the appropriate machine-understandable metadata - the equivalent of writing a 'paragraph' for the machine to review - then the humans are unlikely to ever get to see the document! On the other hand, providing rich metadata will make it easier for potential users to discover your content, and increase the likelihood that other services will direct people to your pages.

This report presents some examples showing how search engines currently exploit embedded semantic metadata, and demonstrates how such data can be authored. It then provides a broader review of the state of current technologies, before discussing some issues that remain to be addressed.

3 Case Study: Searching and Rich Snippets

Publishing machine-understandable metadata is not ‘blue skies’ thinking – organisations are doing it right now, and today’s search engines are exploiting it to improve their listings and provide a richer user experience.

Person Profiles: LinkedIn

Searches for 'sam adams cambridge' on both Google and Bing return my LinkedIn profile high in their hits. LinkedIn includes semantic markup of data in its profiles, and both search engines extract information from this to enrich their search listings.

Google displays my photo, location and current role, in what is termed a 'Rich Snippet':


Figure 1. Google display of author’s LinkedIn profile.

While Bing highlights my field of work, recommendations and connections:


Figure 2. Bing display of author’s LinkedIn profile.

These additions make the result stand out from surrounding hits, increasing the likelihood that someone will visit the page.

Google Recipe Search

When one performs a search for 'shepherds pie' on google.com [FN2], the search engine will present the user with rich results listings, and options to filter the results in meaningful ways:


Figure 3. The google.com rich results listings for search term ‘shepherds pie’.

Individual search hits (e.g. red box) can include a picture of the dish and information such as the number of reviews and average score, and the cooking time and number of calories per serving. Similarly the user is given options (green box) to filter the recipes (e.g. selecting those using lamb, rather than beef!), or those that require less than 30 minutes cooking time. All this is achieved by the web sites publishing the recipes embedding appropriate semantic markup in their pages, allowing the search engine to 'understand' the content.

Similar workflows could be applied to searching in the scholarly domain, if appropriate semantically published data is made available. If the cookery business can do this, surely universities can - higher education is falling behind home-economics Web sites!

4 Example Application: Researchers' Homepages

All institutions provide homepages for their academic staff, and many for other staff and researchers too. These can be made to appear as ‘Rich Snippets’ in Google results with the addition of semantic markup for a small number of metadata elements:

  • Name
  • Address (locality, country)
  • Job Title
  • Photograph (optional)

The original markup is given below:

<article>
<h1>Sam Adams</h1>
<img src="http://www.ukoln.ac.uk/isc/html5-case-studies/adams/html/tn_sam-adams.jpg">
<h2>Cambridge (UK) based Software Developer &amp; Consultant</h2>
</article>

With semantic mark-up (using HTML5 Microdata / schema.org – see discussion below, for details):

<article itemscope itemtype="http://schema.org/Person">
<h1 itemprop="name">Sam Adams</h1>
<img itemprop="image" src="http://www.ukoln.ac.uk/isc/html5-case-studies/adams/html/tn_sam-adams.jpg">
<h2>
<span itemprop="address" itemscope itemtype="http://schema.org/PostalAddress">
<span itemprop="addressLocality">Cambridge</span>
(<span itemprop="addressCountry">UK</span>)
</span>
based <span itemprop="jobTitle">Software Developer &amp; Consultant</span>
</h2>
</article>

Figure 4. Resulting Google ‘Rich Snippet’ [FN3]

5 Technical Discussions

The remainder of this report contains more detailed technical discussions. The technologies described above are reviewed in more detail, and some current issues discussed. Four areas are covered:

  1. A review of the different approaches to embedding semantic metadata into HTML5 documents.
  2. A review of the types of data/metadata found in the different scholarly publications under investigation.
  3. An evaluation of the suitability of each of the methods of embedding semantic metadata for supporting the types of data required by this study.
  4. Production of example works with embedded metadata.

Semantic data formats

This section provides an overview of the three major formats for embedding semantics in HTML documents – microformats, RDFa and microdata. For a comprehensive review of their implementation choices and support for different features see [Ref 15].

Microformats

Microformats [FN4] are simple conventions for embedding semantic mark-up about a specific domain into human-readable (X)HTML/XML documents. There are microformat specifications supporting a variety of types of data, a number of which have seen quite widespread up-take – e.g., hCard [FN5] for describing people and organisations, hCalendar [FN6] for describing calendars and events, and rel-tag [FN7] for marking up tags, keywords and categories in pages such as blog posts.

Microformats have been designed to be straightforward for humans to use, with mark-up based around existing, widely used HTML features as shown in Figure 5:

<p class="vcard">
<a class="url fn" href="http://www.seadams.co.uk/">Sam Adams</a>
is a <span class="role">software developer</span>.
</p>

Figure 5. Example of an hCard describing Sam Adams.

Note in Figure 5 the vcard class on the p element indicates that the child elements form an hCard. The subsequent classes (url, fn, role) indicate the properties their elements describe.

The major criticisms of the microformat specifications are:

Conflicts with formatting information: Microformats make wide use of the class HTML attribute which is more usually employed by selectors for style sheets giving presentation instructions for a page. While the HTML specifications permit the use of the class attribute "for general purpose processing by user agents" [FN8], overloading the attribute in this manner makes it impossible to tell whether a class attribute is being used for styling purposes, or to mark up a data field, and conflicts can arise when microformats are introduced to existing Web sites.

Processing challenges: The ambiguity between data and format specification also makes it impossible to extract marked-up data in a generic manner - a processor can only extract data conforming to microformats that it knows about. In the above example, a processor cannot know that it should associate the value of the a element's href attribute with the url property, and its text content with fn (full name), unless these rules are hard-coded.

Accessibility: a number of microformats use the abbr HTML element to encode text in both human friendly and machine readable formats. e.g., a date-time may be encoded as:

<abbr title="20110921T14:00:00+0100">Wednesday 21st at 2 o’clock</abbr>

Unfortunately this usage of the abbr element is not compatible with the screen readers used by many blind and partially sighted users, which has led some organisations, most notably the BBC ([Ref-14] and [Ref-5]), to ban the use of microformats that make use of this pattern.

Approval process / Extensibility: in order to prevent conflicts between microformat and property names, new microformats require centralised registration, and approval through a community process [FN9]. This can make it a lengthy and sometimes difficult process to establish a microformat for a new type of data.
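
The processing challenge noted above can be illustrated with a short Python sketch. It is not a conforming microformats parser: the HCARD_RULES table and the class and function names are hypothetical, and the sketch assumes well-formed markup with explicit closing tags. The point is only that the mapping from a class name to the place its value lives (an attribute or the text content) has to be supplied by the processor, because nothing in the markup states it.

```python
from html.parser import HTMLParser

# Hypothetical, hard-coded subset of hCard rules: for each class name, where
# the property value is found. A generic processor has no way to discover
# these rules from the markup itself.
HCARD_RULES = {
    "url": "href",   # value is the element's href attribute
    "fn": "text",    # value is the element's text content
    "role": "text",
}
VOID = {"img", "br", "hr", "input", "meta", "link"}  # elements without end tags


class HCardSketch(HTMLParser):
    """Collects known hCard properties found inside class="vcard"."""

    def __init__(self):
        super().__init__()
        self.card = {}
        self._stack = []  # one list of text-valued properties per open element

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if not self._stack:
            if "vcard" in classes:
                self._stack.append([])  # the vcard root element itself
            return
        text_props = []
        for name in classes:
            rule = HCARD_RULES.get(name)  # unknown classes are silently skipped
            if rule == "href" and "href" in attrs:
                self.card[name] = attrs["href"]
            elif rule == "text":
                self.card[name] = ""
                text_props.append(name)
        if tag not in VOID:
            self._stack.append(text_props)

    def handle_endtag(self, tag):
        if self._stack and tag not in VOID:
            self._stack.pop()

    def handle_data(self, data):
        for text_props in self._stack:
            for name in text_props:
                self.card[name] += data


def parse_hcard(html):
    parser = HCardSketch()
    parser.feed(html)
    parser.close()
    return parser.card
```

Applied to the hCard in Figure 5 (with the url, fn and role classes in place), parse_hcard returns a dictionary with those three keys. A class that is not in the table, whether it is a styling hook or a property from a microformat the processor does not know, is simply ignored, which is the ambiguity the criticism describes.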

RDFa

The RDFa specification provides a mechanism for embedding RDF (the language of the Semantic Web) data models into XHTML documents. RDFa brings the full power of RDF to embedding semantic data into Web documents, and is automatically compatible with the work of the Semantic Web community. In contrast to microformats, RDF/RDFa embraces 'distributed extensibility' - anyone can create a new vocabulary. This is achieved without having to worry about conflicting with another vocabulary’s names, by using a URL the authors control as a namespace for the vocabulary. Technologies such as RDF Schema (RDFS) and Web Ontology Language (OWL) enable the construction of machine-understandable descriptions of the required structure of RDF entities, and the separation between data and formatting mark-up, combined with more strictly specified parsing rules, ensures that problems such as the url/fn ambiguity, discussed above, do not arise.

RDFa has, however, been widely criticised for its complexity in a number of areas:

XML basis: RDFa was originally developed for use with XHTML, and, as such, requires that documents be well formed XML. Since up-take of XHTML has been limited, the specification has been ported to support less well formed HTML; however, differences between HTML and XML can cause difficulties when processing RDF in HTML documents [FN10].

Use of prefixes: RDFa relies on XML namespace prefixes, which, it has been argued, “most authors simply do not understand, and which many implementors end up getting wrong” and “lead[s] to flaky copy-and-paste behaviour” [Ref 6]. This is further complicated by the prefixed terms (technically CURIEs, rather than QNames) appearing in attribute values which few (if any?) authoring tools understand, QNames generally being confined to element and attribute names.

Complex formatting rules: depending on the context in which they appear, relationships in RDFa are variously expressed using either a property, rel or rev attribute, and authors can easily be confused about which is the correct one to use for a given situation – using the wrong one can still generate a valid RDF graph, but not with the meaning the author intended.

The RDFa 1.1 specification, currently under development [FN11], aims to address such concerns, by:

  • Permitting use of full URIs as property names, rather than requiring prefixed CURIEs
  • Providing a mechanism for specifying a default vocabulary for a given scope within a document, thereby removing the need to prefix property names
  • Permitting the external definition of standard collections of prefixes, using ‘profile’ documents

While RDFa 1.0 is widely used, there are very few sites or applications currently supporting RDFa 1.1.
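
The prefix mechanism, and the copy-and-paste fragility attributed to it above, can be sketched in a few lines of Python. This is an illustration rather than an RDFa processor: expand_curie and resolve_property are hypothetical helper names, the dc and dcterms bindings are the ones used elsewhere in this case study, and the foaf binding is the standard FOAF namespace. The fallback in resolve_property is a simplified reading of the RDFa 1.1 behaviour of accepting full URIs as property names.

```python
# Prefix bindings, as they would be declared with xmlns:* attributes.
PREFIXES = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
    "foaf": "http://xmlns.com/foaf/0.1/",
}


def expand_curie(value, prefixes):
    """Expand 'prefix:reference' to a full URI using the bindings in scope.

    Raises KeyError when the prefix has no binding, which is the situation
    a fragment ends up in when it is pasted into a page that lacks the
    corresponding declaration.
    """
    prefix, sep, reference = value.partition(":")
    if not sep:
        raise ValueError(f"not a prefixed name: {value!r}")
    return prefixes[prefix] + reference


def resolve_property(value, prefixes):
    """Simplified RDFa 1.1-style lookup: if the part before the colon is not
    a bound prefix, treat the whole value as a full URI."""
    try:
        return expand_curie(value, prefixes)
    except KeyError:
        return value
```

With the bindings in place, expand_curie("dc:title", PREFIXES) yields the full Dublin Core URI, while the same call with an empty mapping fails. resolve_property shows why allowing full URIs removes that dependency: a property written out in full resolves to itself whatever bindings are in scope.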

Microdata

The Microdata specification has been created during the development of HTML5, with the aim of addressing the common use cases for embedding metadata, while avoiding some of the concerns that are raised around microformats and RDFa. James Graham of Opera [Ref-4] has stated that, “Compared to microformats I believe the HTML 5 microdata offers more consistent parsing rules [...] and cleaner separation from the rest of the markup language. Compared to RDFa, microdata offers a considerably simpler authoring experience which I believe to be critical to gaining traction with a large base of users.”

Microdata introduces a set of new attributes for specifying data ‘items’ and their properties. Items can be assigned a type (defined using a URL) which provides a context for prefix-less property names, similar to the role of namespaces in RDF/RDFa. Properties may also be specified using a URL, in which case they can be applied in any context, without requiring a specific item type. Currently there is no mechanism for providing machine-understandable specification of microdata vocabularies, or mapping between URL and ‘simple’ property names; so it is not possible to mix ‘simple’ names from different vocabularies in a single item. This contrasts with RDF/RDFa, where objects (items) can be assigned multiple classes (types), and it is straightforward to mix property names from different vocabularies.

The microdata specification currently includes instructions for mapping microdata to JSON. Some earlier versions of the specification have included instructions for converting HTML Microdata to RDF, but they have been removed from the current draft.
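
To give a feel for what such a mapping produces, the following Python sketch walks a fragment of microdata and emits a JSON document. It is a simplification under stated assumptions, not the specification's algorithm: MicrodataSketch, add_value and microdata_to_json are hypothetical names, itemref and itemid are ignored, the markup is assumed to be well formed with explicit closing tags, and the output shape (an items list whose entries carry type and properties) is only loosely modelled on the specification's JSON format.

```python
import json
from html.parser import HTMLParser

# Elements with no end tag, and (for a simplified subset of elements) the
# attribute that carries a property value instead of the text content.
VOID = {"img", "meta", "link", "br", "hr", "input"}
VALUE_ATTR = {"a": "href", "link": "href", "img": "src", "meta": "content"}


def add_value(item, names, value):
    for name in names:
        item["properties"].setdefault(name, []).append(value)


class MicrodataSketch(HTMLParser):
    def __init__(self):
        super().__init__()
        self.items = []    # top-level items found so far
        self._items = []   # stack of items whose elements are still open
        self._frames = []  # one frame per open (non-void) element

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        names = (attrs.get("itemprop") or "").split()
        parent = self._items[-1] if self._items else None
        frame = {"opens_item": False, "text": None}
        if "itemscope" in attrs:
            item = {"type": (attrs.get("itemtype") or "").split(), "properties": {}}
            if names and parent is not None:
                add_value(parent, names, item)  # nested item is a property value
            else:
                self.items.append(item)
            if tag not in VOID:
                self._items.append(item)
                frame["opens_item"] = True
        elif names and parent is not None:
            if tag in VALUE_ATTR:
                add_value(parent, names, attrs.get(VALUE_ATTR[tag]) or "")
            else:
                frame["text"] = (parent, names, [])  # filled in by handle_data
        if tag not in VOID:
            self._frames.append(frame)

    def handle_data(self, data):
        for frame in self._frames:
            if frame["text"]:
                frame["text"][2].append(data)

    def handle_endtag(self, tag):
        if tag in VOID or not self._frames:
            return
        frame = self._frames.pop()
        if frame["text"]:
            item, names, parts = frame["text"]
            add_value(item, names, "".join(parts))
        if frame["opens_item"]:
            self._items.pop()


def microdata_to_json(html):
    parser = MicrodataSketch()
    parser.feed(html)
    parser.close()
    return json.dumps({"items": parser.items}, indent=2)


SAMPLE = """
<article itemscope itemtype="http://schema.org/Person">
<h1 itemprop="name">Sam Adams</h1>
<h2>
<span itemprop="address" itemscope itemtype="http://schema.org/PostalAddress">
<span itemprop="addressLocality">Cambridge</span>
(<span itemprop="addressCountry">UK</span>)
</span>
based <span itemprop="jobTitle">Software Developer &amp; Consultant</span>
</h2>
</article>
"""
```

Calling microdata_to_json(SAMPLE) gives a single Person item whose address property holds a nested PostalAddress item. Unlike the microformats case, no vocabulary-specific rules are needed: the same walk works for any item type, because the attributes themselves say which elements are items and which are properties.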

Metadata available in scholarly works

This case study is not looking at adding new metadata to scholarly publications, but semantically encoding metadata that is already being recorded. The focus is on bibliographic and citation data – i.e. metadata about the publication itself, and about other publications that it cites and references.

PLoS Articles

The Public Library of Science (PLoS) [FN12] is an open access publisher. Alongside the conventional HTML and PDF formatted versions of papers they publish, PLoS also makes available raw XML versions (conforming to the U.S. National Library of Medicine Document Type Definition (NLM DTD)). The XML files contain considerable amounts of metadata, including:

  • Article title
  • Author names and affiliations
  • Citation (journal title, year, volume, pages)
  • Publisher
  • Publication date
  • URL
  • DOI
  • Reference list – titles, authors, citation (e.g., journal title, year, volume, issue, pages)

CrystalEye Entries

CrystalEye [FN13] is a repository aggregating openly published crystallographic molecular structures from across the Web. CrystalEye entries consist of Crystallographic Information Files and Chemical Markup Language XML files describing the crystallographic structure, as well as, recently, an RDF representation of information about the crystal. There is an HTML splash page for each entry, providing a summary of the crystal structure, and linking to the various resources (files) making up the entry. The full semantic data can already be retrieved as an RDF/XML file, but there are core items of metadata that, if encoded in the HTML splash page, could assist Web crawlers and browsers in respect of:

  • Title and authors of the crystal structure
  • Identity of molecular entities in the crystal structure
  • Citation for the original publication

Evaluation of suitability

Microformats

Microformats such as rel="license":

<a href="http://creativecommons.org/licenses/by/2.0/" rel="license">cc by 2.0</a>

and rel="tag":

<a href="http://example.com/tag/html5" rel="tag">html5</a>

are likely to be useful for adding semantics to licence statements and content tags, due to their simplicity. However, there are currently no microformat specifications or drafts relating to scholarly works’ more complex requirements. While there are ‘exploratory discussions’ around citations, this process appears to have been on-going for some years, and it is likely to be some time before a specification starts to emerge.

RDFa

RDF is widely used to process data in many communities, including the handling of scholarly metadata. This means there are already a large number of RDF vocabularies available; examples with particular relevance to scholarly publishing include:

  • Dublin Core
  • FOAF (Friend of a Friend)
  • Bibliographic Ontology
  • PRISM (Publishing Requirements for Industry Standard Metadata)
  • FRBR (Functional Requirements for Bibliographic Records)

The Dublin Core vocabulary is very widely used for marking up basic metadata (e.g. title, creator(s), description…) and is straightforward to use to mark-up a resource’s title:

<h1 property="dc:title">My Really Great Paper</h1>

where the dc prefix is bound to the namespace http://purl.org/dc/elements/1.1/

Author names are also straightforward to encode using Dublin Core in RDFa:

<p>
<span property="dc:creator">Sam Adams</span>
<span property="dc:creator">John Smith</span>
</p>

And more complex descriptions of an author can be supported:

<p>
<span rel="dcterms:creator">
<span property="foaf:name">Sam Adams</span>
<span rel="foaf:homepage" resource="http://www.seadams.co.uk/" />
</span>
</p>

where the dcterms prefix is bound to the namespace http://purl.org/dc/terms/

The existence of two versions of the Dublin Core vocabulary – the original 15 elements, and the larger set of DC terms – can cause confusion for authors: strictly following the specifications, a creator should be specified as a simple (‘literal’) string if using the original elements, and as an object with properties if using the DC terms vocabulary. This means that data of the form:

<p>
<span rel="dcterms:creator">Sam Adams</span>
</p>

is not strictly permitted, although such constructs are quite commonly observed.

Bibliographic data

There are a number of RDF vocabularies for describing bibliographic data. During the course of this case study we have evaluated the two most widely used: the Bibliographic Ontology (BIBO) [FN14] and Publishing Requirements for Industry Standard Metadata (PRISM) [FN15]. Both vocabularies contain broadly equivalent terms (e.g. title, authors, journal, issue number, volume number…); however, in order to conform strictly to their specifications, they impose quite different structures on the data. Here we have focused on marking up journal article metadata; however, the vocabularies can also be used to mark up bibliographic data about books, reports and other resources.

The PRISM vocabulary imposes a flat structure, consisting of an article, with a list of properties describing the bibliographic data.

Figure 6. The flat data structure imposed by the PRISM vocabulary.

In contrast, BIBO imposes a nested structure, where following the specification, an article is described as part of an issue, which is in turn part of a journal. According to BIBO’s specification, it is not permitted to use the properties in the ‘flat’ style of the PRISM structure. However, these rules are not always observed (e.g., by some of the examples found in the documentation of BIBO’s Web site!).

Figure 7. The nested data structure imposed by the Bibliographic Ontology.
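
The difference between the two shapes can be sketched with plain Python dictionaries. This is an illustration only: the example values are those used in the microdata examples later in this study, the bibo: and dcterms:isPartOf names are terms from the respective vocabularies, but their exact placement on the article or the issue is an assumption made for the sketch rather than a statement of what the BIBO specification requires. The flatten helper is hypothetical.

```python
# Flat, PRISM-style description: every property hangs off the article.
flat_article = {
    "dc:title": "An investigation of FUD",
    "prism:publicationName": "J Interest Things",
    "prism:volume": "7",
    "prism:number": "2",
    "prism:startingPage": "162",
    "prism:endingPage": "164",
}

# Nested, BIBO-style description: article -> issue -> journal.
nested_article = {
    "dc:title": "An investigation of FUD",
    "bibo:pageStart": "162",
    "bibo:pageEnd": "164",
    "dcterms:isPartOf": {           # the issue (property placement assumed)
        "bibo:volume": "7",
        "bibo:issue": "2",
        "dcterms:isPartOf": {       # the journal
            "dc:title": "J Interest Things",
        },
    },
}


def flatten(nested):
    """Collapse the nested article -> issue -> journal shape into the flat one."""
    issue = nested["dcterms:isPartOf"]
    journal = issue["dcterms:isPartOf"]
    return {
        "dc:title": nested["dc:title"],
        "prism:publicationName": journal["dc:title"],
        "prism:volume": issue["bibo:volume"],
        "prism:number": issue["bibo:issue"],
        "prism:startingPage": nested["bibo:pageStart"],
        "prism:endingPage": nested["bibo:pageEnd"],
    }
```

The nested form names the issue and the journal as things in their own right, which is convenient when several articles share them. The flat form needs no intermediate objects, which keeps the mark-up embedded in a single article page short.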

A second difference is in marking up a journal’s name. While both vocabularies use the Dublin Core title property to mark up an article’s title, the PRISM vocabulary includes an explicit publicationName term, whereas BIBO uses Dublin Core title again (this is made possible by the nested data structure). These differences make BIBO well suited to building databases of bibliographic data, where it may be useful to model issues and journals explicitly. However, PRISM’s simpler data structure makes it better suited than BIBO for encoding bibliographic metadata in documents.

<html xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:prism="http://prismstandard.org/namespaces/basic/2.0/">

<article about="">
<h1 property="dc:title">…</h1>
<p>
<span property="dc:creator">…</span>
</p>
<p>
<span property="prism:publicationName">…</span>
<span property="prism:volume">…</span>
(<span property="prism:number">…</span>)
<span property="prism:startingPage">…</span>-<span property="prism:endingPage">…</span>
</p>
<p>DOI: <a rel="prism:url" href="http://dx.doi.org/…">…</a></p>
</article>

Figure 8. Describing an article’s bibliographic information using RDFa / PRISM vocabulary.

Microdata

Since microdata is a relatively recent development, there are not yet many vocabularies available. The first W3C version of the Microdata specification included a number of predefined types and property names for describing common structures. They were removed from subsequent drafts, but some standard vocabularies (vCard, vEvent and Licensing works) are still included in the current WHATWG specification.

Microdata received a major boost in June 2011, when Bing, Google and Yahoo! announced a joint initiative called schema.org [Ref-3] to support a common set of schemas for structured data mark-up on the Web. Schema.org has chosen to use microdata due to it striking a “balance between the extensibility of RDFa and the simplicity of microformats”. The primary benefit of marking up data using the schema.org vocabulary is to improve one’s display in search results. Google, for example, will display Rich Snippets [FN16] in its search listings for pages containing schema.org mark-up of supported data types, such as Events, Organisations and People.

Among its data types, schema.org includes a ScholarlyArticle type, which we can use to describe an article:

<article itemtype="http://schema.org/ScholarlyArticle" itemscope>

</article>

Adding a title (name) to this is straightforward:

<article itemtype="http://schema.org/ScholarlyArticle" itemscope>
<h1 itemprop="name">An investigation of FUD</h1>
</article>

Author names are a little more complicated, as you have to start a new Person item, and then attach properties to that:

<p>
<span itemprop="author" itemscope itemtype="http://schema.org/Person">
<span itemprop="name">Sam Adams</span>
</span>,
<span itemprop="author" itemscope itemtype="http://schema.org/Person">
<span itemprop="name">John Smith</span>
</span>
</p>

The schema.org specification does not permit the simpler:

<p>
<span itemprop="author">Sam Adams</span>,
<span itemprop="author">John Smith</span>
</p>

Although it seems likely that many examples of this approach will appear as use of the schema.org vocabulary grows.

Bibliographic data

The schema.org vocabulary for ScholarlyArticles does not support concepts such as volume, issue number and DOI, which are needed to mark up journal papers’ bibliographic and citation data. This leaves three options for representing such data using Microdata:

1. Extend schema.org

The specification for schema.org allows Web masters to introduce new properties for existing schema.org classes; so we could simply introduce ‘volume’, ‘issueNumber’, ‘doi’ etc properties. However, this carries the risk that a property name we introduce could conflict with another extension. It would also be difficult to document these extensions – the natural place for a user to find information about properties of schema.org classes is on the schema.org Web site, but there would be no information about our extensions there.

<p>
<span itemprop="journalTitle">J Interest Things</span>
<span itemprop="volumeNumber">7</span>
(<span itemprop="issueNumber">2</span>)
<span itemprop="pageStart">162</span>
-<span itemprop="pageEnd">164</span>
</p>

2. Extend schema.org with external vocabularies

While Microdata properties whose names are plain words (e.g. ‘author’) can only be used within the context of item types for which they are defined, if properties are named using URLs, they can be used on items of any type, though this can end up being quite verbose:

<p>
<span itemprop="http://prismstandard.org/namespaces/basic/2.0/publicationName">J Interest Things</span>
<span itemprop="http://prismstandard.org/namespaces/basic/2.0/volume">7</span>
(<span itemprop="http://prismstandard.org/namespaces/basic/2.0/number">2</span>)
<span itemprop="http://prismstandard.org/namespaces/basic/2.0/startingPage">162</span>
-<span itemprop="http://prismstandard.org/namespaces/basic/2.0/endingPage">164</span>
</p>

3. Use a different vocabulary

We could create a whole new Microdata vocabulary for scholarly works (possibly building on an existing RDF vocabulary). However, this runs the risk of missing out on the ecosystem/support that may develop around schema.org, given the dominance of its backers.

Example works

To explore the options raised above further, tools have been developed to demonstrate the production of scholarly documents containing semantically encoded metadata:

PLoS Articles

As previously discussed, the raw XML is made available for articles published in PLoS journals. In order to generate examples of articles with semantically marked-up metadata, an XSLT stylesheet has been developed that transforms the XML articles into HTML5, with semantic mark-up of embedded metadata.

The stylesheet has been packaged into a Web application that is accessible at: http://html5app.bluefen.co.uk.

The source code for this application, including the XSLT stylesheet, is available from http://bitbucket.org/bluefen/html5app.

CrystalEye Entries

CrystalEye is powered by an instance of the Chempound data repository. Chempound generates splash pages for data items using a templating system. The templates used to generate splash pages for CrystalEye entries have been extended to encode core metadata: title and authors of the crystal structure, and citation of the source publication.

The repository is available at: http://crystaleye.ch.cam.ac.uk

6 Conclusions

Embedding semantic metadata into HTML pages is clearly a topic of current interest. Unfortunately there is not yet a clear standard for generating this mark-up; instead there are a number of competing formats. The strongest contenders seem to be RDFa and microdata, each of which has advantages and disadvantages when compared to the other. Given its longer history, RDFa is currently the more widely used of the two. On the other hand, due to its simpler form, and the recent backing of microdata by the Web’s major search engines through the schema.org initiative, it seems likely that large amounts of microdata will start to appear shortly.

Assuming that microdata does take off, conventions for describing scholarly works will be needed. There are a number of options, though they all suffer from potential drawbacks:

  • Extend schema.org vocabularies; but the extensions could clash with someone else’s.
  • Mint a whole new microdata vocabulary of scholarly works; but this misses out the ecosystem/support that may develop around schema.org, given its backers
  • Use schema.org so far as possible, and import elements of other vocabularies, e.g. BIBO/PRISM; but this would rapidly become a bit untidy/unwieldy
  • Some other option.

There are advantages and disadvantages to each of these options, but the most important factor is consensus.

It is worth bearing in mind that the microdata specification is not yet finalised. At the same time, the current development of the RDFa 1.1 [Ref-1] specification appears to be addressing some of the concerns regarding the complexity of producing RDFa.

While it is unlikely that these efforts will merge anytime in the foreseeable future, ideally a mechanism for interoperability will develop.

7 Addendum

There have been a number of developments since this case study was initially written:

  • Late in September 2011 the W3C launched a Microdata/RDFa Task Force [FN17] to analyse the relationship between the two formats.
  • Work is ongoing on a ‘Microdata to RDF’ specification [Ref-9].
  • The microdata specification has been changed to allow an item to have multiple item types, so long as they all “are defined to use the same vocabulary” [Ref-8].
  • Schema.org have announced [Ref-12] that they are introducing support for RDFa 1.1 lite [Ref-16] – “a very minimal subset that will work for 80% of the folks out there doing simple markup” – alongside microdata, in order to “allow publishers to focus more on what they want to say with their data, rather than on the details of its specific encoding as markup”.

It still does not look like the microdata and RDFa efforts are likely to merge; however, efforts are clearly being made to improve their interoperability.

There is not yet any consensus as to whether one format will emerge as the de facto standard for data publication on the Web. My personal feeling is that RDFa is likely to be the stronger contender for this, since it offers the greatest flexibility and supports complex data models. Moreover, the development of the RDFa 1.1 specification, and especially of RDFa Lite 1.1, has made RDFa much simpler to publish than was previously the case (RDFa Lite 1.1 looks to be as simple to use as microdata). Microdata suffers from the limitation that it cannot support the more complex use cases for data publication, so will never be able to completely replace RDFa.

References

[1] Adida, B., Birbeck, M., McCarron, S., & Herman, I. (2011) RDFa Core 1.1. http://www.w3.org/TR/rdfa-core/

[2] Berners-Lee, T., Hendler, J., & Lassila, O. (2001) The Semantic Web. Scientific American. 17 May 2001. http://www.scientificamerican.com/article.cfm?id=the-semantic-web

[3] Google. (2011). Introducing schema.org: Search engines come together for a richer web. Webmaster Central Blog, 2 June 2011. http://googlewebmastercentral.blogspot.com/2011/06/introducing-schemaorg-search-engines.html

[4] Graham, J. (2009) Does anyone like microdata? Post to public-html@w3.org Fri, 26 Jun 2009. http://lists.w3.org/Archives/Public/public-html/2009Jun/0736.html

[5] Hassell, J. (2008). Why the BBC removed microformat DateTime patterns from bbc.co.uk. 4 July 2008. BBC Internet Blog. http://www.bbc.co.uk/blogs/bbcinternet/2008/07/why_the_bbc_removed_microforma.html

[6] Hickson, I. (2009). Annotating structured data that HTML has no semantics for. Post to [whatwg] list. Sun May 10 03:32:34 PDT 2009 http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2009-May/019681.html

[7] Hickson, I. (2011). HTML Microdata. W3C Working Draft 25 May 2011. http://www.w3.org/TR/2011/WD-microdata-20110525/

[8] Hickson, I. (2012). HTML Microdata. Editor’s Draft 6 February 2012. http://dev.w3.org/html5/md/

[9] Kellogg, G. (2011) Microdata to RDF. https://dvcs.w3.org/hg/htmldata/raw-file/37500d90742f/ED/microdata-rdf/20111118/index.html

[10] Neumann, E. K., Miller, E., & Wilbanks, J. (2004, November). What the semantic web could do for the life sciences. Drug Discovery Today 6(2) p228-236. http://lambda.csail.mit.edu/~chet/papers/others/n/neumann/neumann04biosi....

[11] Pilgrim, M. (2011). Dive Into HTML5: What Does It All Mean? http://diveintohtml5.info/semantics.html

[12] Schema.org (2011). Using RDFa 1.1 Lite with Schema.org. http://blog.schema.org/2011/11/using-rdfa-11-lite-with-schemaorg.html

[13] Sefton, P. (2012). Conventions and Guidelines for Scholarly HTML5 Documents. HTML5 Case Studies, UKOLN.

[14] Smethurst, M. (2008). Removing Microformats from bbc.co.uk/programmes, 23 June 2008. BBC Radio Labs Blog. http://www.bbc.co.uk/blogs/radiolabs/2008/06/removing_microformats_from_bbc.shtml

[15] Sporny, M. (2011a, June 11). An Uber-comparison of RDFa, Microdata and Microformats. http://manu.sporny.org/2011/uber-comparison-rdfa-md-uf/

[16] Sporny, M. (2011b). RDFa Lite 1.1 – W3C Editor’s Draft 30 October 2011. http://www.w3.org/2010/02/rdfa/drafts/2011/ED-rdfa-lite-20111030/


Footnotes

[1] Much of the information published on the web is machine-readable, but a much smaller proportion is currently machine-understandable. Information is machine-readable if it is published in a form that can be extracted and manipulated using a computer. If information is published in a machine-understandable manner, software agents can interpret it and reason over it. Unlike humans, machines cannot infer relationships and contexts, so in order to be machine-understandable, data must have clearly defined semantics and structure.

Information published using ASCII characters in an HTML page, or in a CSV file or spread sheet (rather than using images and PDFs) is machine-readable. However, without clear structure and semantic annotations giving ‘meaning’ to each component of the information in a manner that a software agent can interpret, it is not machine-understandable.

[2] As of November 19, 2011, this functionality is only available on google.com, not google.co.uk.
[3] Generated using the Rich Snippets Testing Tool: http://www.google.com/webmasters/tools/richsnippets
[4] Microformats http://microformats.org/
[7] rel="tag" http://microformats.org/wiki/rel-tag
[8] HTML 4.01 Specification. Chapter 7: The global structure of an HTML document. http://www.w3.org/TR/html4/struct/global.html
[9] The microformats process http://microformats.org/wiki/process
[12] The Public Library of Science http://www.plos.org/
[14] Web site for the Bibliographic Ontology, known as BIBO http://bibliontology.com/
[15] Publishing Requirements for Industry Standard Metadata (PRISM) http://www.prismstandard.org/
[17] HTML Data Task Force: http://www.w3.org/wiki/Html-data-tf

