HTML5 Case Study 8: Conventions and Guidelines for Scholarly HTML5 Documents

16 July 2012 - 2:01pm — Administrator

Author: Peter Sefton

1. About This Case Study

HTML5 case study on “Conventions and Guidelines for Scholarly HTML5 Documents” by Peter Sefton

This case study looks at the fundamentals of using HTML5 for scholarly documents of all kinds, particularly theses and courseware documents (with application to journal articles as well), but with an eye on a much broader spectrum of resources, including those which are the subject of other case studies in this project such as slide presentations. It will aim to establish the basic structural and semantic building blocks for how resources should be marked up for the Web, to increase their utility for people and machines, as well as help to ensure they can be preserved effectively. This case study will build on work already undertaken by the Scholarly HTML community as well as the other HTML5 case studies [1], [2], [3], [4], [5], [6], [7] and [8].

Target Audience

The audience for this work is tool developers building authoring systems, repositories and publishing infrastructure for academic documents. The outcomes could be used by people hand-coding documents, but that is not a very likely scenario. A related case study implements the advice from this study in a tool that lets users create HTML5 from Microsoft Word (using the Word 2000 HTML format, which is available on MS Windows from v2000 to v2010) [7], and a third looks at how citations embedded in a document using the guidelines presented here can be re-formatted [6].

What Is Covered

The following aspects are covered in this document:

The basic structural backbone of a scholarly HTML document; how to mark up the scope of what is the content on a page. For example, on a blog, which section is the scholarly work as distinct from the navigation elements, advertisements, etc.?
Best practice for marking up sections within the document (whether to use nested sections or just headings - discussion of issues like putting headings in tables and their implications).
A brief discussion of techniques for embedding rhetorical semantics in documents. That is, the ability to distinguish an introduction or conclusion, or to mark parts of a text such as learning objectives by drawing on XML schemas and ontologies. Some generic advice about how to mark up other kinds of semantic relationships, such as linking to a data file, illustrated with examples from Chemistry.
Work on metadata and semantics by other case studies incorporated into the core guidelines.

The following is still in gestation:

Anchors for commenting and annotation: this requires some attention as simple schemes such as numbering paragraphs are very limited in capability, and current tools such as digress.it require documents to be in a particular dialect of HTML to work. The introduction of standards in this area would allow interoperability between commenting and annotation systems.

This work should have an impact on:

Search engine optimisation (SEO), particularly for services such as Google Scholar.
Reduced friction in moving documents through submission processes to journals, to repositories and to review processes such as peer review, thesis examination, and assessment, via automated metadata extraction.
Improved machine-readability for text and data-mining processes.
Improved accessibility for readers - guidelines will take into account WCAG accessibility guidelines.
Preservation: the guidelines will assist authors and tool makers in constructing documents which do not 'rot' as technology changes.

2. Use Case

The use case here is very broad: it is about the optimal mark-up for any kind of academic-related document on the Web. The Web was conceived as a vehicle for scholarship, but in the two decades since its invention, scholarly communications have taken a back seat to the driver, commerce. The most common form of Web publishing for scholarly publications is articles in PDF format, which readers are expected to download and manage themselves. PDF has the advantage of capturing an absolute layout, preserving the exact look of a document, but it was designed for capturing print - not for delivery to an increasing variety of screen sizes (and to devices with no screen at all).

It has been recognised that the current scholarly publications landscape is not serving the needs of the community. Publishers have built an industry around the creation of paper-like objects which do not allow:

Delivery to any device
Re-use by humans to create new works.
Rich integration of publications with supporting data and visualisations of data.
Machine-processable semantics so that research literature can be mined, indexed and analysed automatically.

For learning resources the situation is a little different, in that there has been some history of Web-based delivery of materials controlled by institutions themselves.

As noted in another JISC project [FN1], HTML is the major format for the next stage of the development of the Web: well-structured Web resources can not only be delivered and used on the Web, they are also the basis for creating e-books. It is clear from the extremely rapid growth of the e-book market for commercial publishing that learning resources and research materials will need to follow. HTML5 is essential both to the open EPUB3 standard and to market leader Amazon's newest format.

In both research and learning materials, there is still a distinct lack of tools for creating Web-native resources, at least in a way accessible to typical academics. The use case of an academic sitting down to create richly structured HTML5 academic objects, with embedded semantics and preservation-quality mark-up, is something of a dream at this stage. The best this case study can hope to achieve is to provide a starting point for a description of what Scholarly HTML should look like, and for a roadmap for tool development that allows the scholarly Web to take its rightful place alongside the Web 'high street'.

3. Solution

The solution has two parts. The first is a guide to marking up Scholarly HTML documents. The second points to software packages and techniques that are useful in the process of marking up documents, both existing tools, and tools developed as part of this project.

How to Mark Up Documents for Scholarly HTML

This section will become a stand-alone guide and be posted on the Scholarly HTML Website as the core guide to structuring HTML documents. Scholarly HTML is a term used by a loose group of people interested in bringing scholarship to the Web, or the Web back to its scholarly roots as a publication and research platform. The group met physically once at a meeting convened by Peter Murray-Rust at Cambridge University in March 2011.

Use HTML5, Microdata and Common Vocabularies

HTML5 is an evolving standard which codifies HTML in the context of the real world. The Wikipedia page for HTML5 [9] is a good starting point for pointers to the specification. This document will assume that the reader is familiar with the HTML5 standard, in particular outline structures and microdata; Mark Pilgrim's Dive Into HTML5 [10] is a good free introduction.

Within the Scholarly HTML group there was a short, vigorous debate about whether or not Scholarly HTML should be required to be well-formed XML. There were fears on both sides: on the one hand that HTML resources would be impossible to parse reliably, and on the other that making XML mandatory would be too high a bar, reducing the pool of available content considerably. Mark Pilgrim covers very similar arguments in his chapter on the background of HTML5, including the now-abandoned XHTML standard. The good news is that with HTML5 you do not have to choose. The HTML5 standard specifies exactly how HTML5 should be parsed, and once parsed it can be re-serialised as XML. So, for machine-based processing, the advice is to use an HTML5 parser, not an XML parser. Then, if you want to use XML tools, serialise the document as XML.

For example, here is some Python to illustrate the process using html5lib - a reference implementation of the parsing rules, and lxml. This is copied and pasted from an Ubuntu Linux environment.

First, install the Python libraries you need:

pip install lxml html5lib

Then, open a Python shell (type python) and try this:

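import html5lib  # Handles the HTML5 parsing
from html5lib import treebuilders
from lxml import etree  # Handles the XML serialisation

# Ask html5lib to build the parsed document as an lxml tree
parser = html5lib.HTMLParser(tree=treebuilders.getTreeBuilder("lxml"))

# Deliberately sloppy input: unclosed <p> and <b> elements
e = parser.parse("<p>This is some HTML<p>Which is very far from being <b>XML <p>But which the HTML parser will be OK with")

# Serialise the repaired tree as XML text
print(etree.tostring(e.getroot(), encoding="unicode"))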

The result is well-formed XML (yes, the namespace is probably wrong, but this will serve as the input to downstream processes).

<html:html xmlns:html="http://www.w3.org/1999/xhtml">

<html:head/>

<html:body>

<html:p>This is some HTML</html:p>

<html:p>Which is very far from being <html:b>XML</html:b></html:p>

<html:p>

<html:b>But which the HTML parser will be OK with</html:b>

</html:p>

</html:body>

</html:html>

So, the advice for Scholarly HTML is:

Use HTML5 as per the standard including Microdata.

The Context for Pages

Scholarly works on the Web are unlikely to be stand-alone documents. They will very often be embedded in content management systems, repositories or publisher Web sites. It is out of scope for this case-study to consider the structure of Web pages produced by these Web applications, but for an example of best practice in structuring HTML5 Web sites, including navigation elements and so on, see the Common Framework case study conducted by Bilbie [2].

The key points in that case study involve:

Flexible design that can re-flow to any size of device.
Use of HTML5 attributes in the mark-up to provide cues to screen readers and other assistive software.

Be Declarative

HTML5 has a close relationship to the JavaScript scripting language and has the <canvas> element, on which you can cause things to appear via scripts. But there are several reasons to avoid making academic works depend on particular scripts or applications, and instead look for ways to express the meaning of a work and the parts it links to as plain HTML, with enough information in it for scripts, etc., to come into play when needed:

Works can be archived and preserved independently of the scripts on which they depend
Other people can reuse the declaratively specified data in new ways
Revising the work when new applications emerge is much easier when there is a clear separation between documents, together with data and media, as opposed to the code that does interesting things with the documents, data and media.

A good example of this approach can be seen in this series of case studies in the work by Adams [1] and MacGillivray [5] on citation formats, where the same declarative format is a meeting point for two different projects. Another project based on a declarative format (though not HTML5) is the work by Gray on embedding 3D motion-capture models in HTML5. The embedding is done using a declarative XML format rather than via a script [3].

Outline Structure

HTML5 has an <article> element which at first glance seems to be the perfect container for scholarly works. It seems obvious that it should be used for the text of a scholarly article and reasonable that it should be used for book chapters, course modules and so on. The problem with this is that content management systems may also be using <article>. For example, the WordPress default theme at the time of writing uses <article> to mark up posts.

So, the advice is:

Conventions for document-level mark-up:

If your scholarly work is going to be part of a stand-alone Web page, or you know that it is appropriate in the context into which it will be published, use <article>.

If the article is going to be sent off to a publisher or posted to a blog (where, for example, the theme might change at some point), it is safer to use the <section> element.

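In either case, mark up the scholarly work with microdata semantics, declaring it as an item of the Schema.org ScholarlyArticle type (http://schema.org/ScholarlyArticle).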

Note that the Schema.org definition for Scholarly Article is at present rather light on detail, defining it as:

A Scholarly Article

This guide is making the assumption that, in spirit, it is really the more generic "Scholarly Work". If more specific terms are added to the Schema.org vocabularies, or more appropriate terms are identified, then this advice will need to change.

Within the section or article element chosen, the question arises how to mark up the structure of a work with headings, sections, etc. In HTML5, documents have an outline, which can be computed using a well-specified algorithm.

This means that the use of internal section elements within resources has no real impact on semantics, so how you format a document depends on what is convenient or necessary:

For authoring in a text editor, or even an HTML editor, the use of sections may be an unnecessary complication; consider using headings which are not wrapped in enclosing sections.
For authoring in a word-processing environment, nested sections are impossible to implement; so use heading styles and choose conversion software that can respect them - for example WordDown, produced as a demonstrator for this project [7].
To add microdata semantics at the section level of the document, it will be necessary to use section mark-up on which to ‘hang’ microdata attributes. This is a significant barrier to editing with lightweight tools such as Markdown.
In published documents, using sections makes it easier for other people to copy and paste or machine-process documents, even though they could determine the structure of the document by computing its outline.
For the most general way of presenting documents for publication, it is possible to use this structure where each section has an <h1> heading, even where they are nested within each other, but this may not be encouraging re-use by others who need to edit the document.
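The point that nested sections and bare ranked headings can yield the same outline can be illustrated with a rough sketch. The code below is not the HTML5 outline algorithm: it is a deliberately simplified approximation (depth is the number of open sectioning elements plus the heading rank), and the names OutlineSketch and outline are hypothetical. It uses only Python's standard-library html.parser, which is more forgiving than strict but is not a full HTML5 parser.

```python
from html.parser import HTMLParser

SECTIONING = {"article", "section", "nav", "aside"}
HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class OutlineSketch(HTMLParser):
    """Collect (depth, heading text) pairs from an HTML fragment.

    Simplification (an assumption of this sketch, not the real algorithm):
    depth = open sectioning elements + heading rank - 1.
    """

    def __init__(self):
        super().__init__()
        self.section_depth = 0
        self.current_rank = None  # rank of the heading being read, if any
        self.buffer = []
        self.outline = []

    def handle_starttag(self, tag, attrs):
        if tag in SECTIONING:
            self.section_depth += 1
        elif tag in HEADINGS:
            self.current_rank = int(tag[1])
            self.buffer = []

    def handle_endtag(self, tag):
        if tag in SECTIONING:
            self.section_depth = max(0, self.section_depth - 1)
        elif tag in HEADINGS and self.current_rank is not None:
            depth = self.section_depth + self.current_rank - 1
            self.outline.append((depth, "".join(self.buffer).strip()))
            self.current_rank = None

    def handle_data(self, data):
        if self.current_rank is not None:
            self.buffer.append(data)


def outline(html_text):
    """Return headings as (relative depth, text), shallowest heading at 0."""
    parser = OutlineSketch()
    parser.feed(html_text)
    parser.close()
    if not parser.outline:
        return []
    base = min(depth for depth, _ in parser.outline)
    return [(depth - base, text) for depth, text in parser.outline]


nested = "<section><h1>Intro</h1><section><h1>Background</h1></section></section>"
flat = "<h1>Intro</h1><h2>Background</h2>"
print(outline(nested) == outline(flat))
```

For real documents a tool such as the H5o outliner mentioned later in this case study is a better guide to the outline a browser would compute.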

For documents that need to work in legacy browsers, and in content management systems where you do not have control over the CSS used, current best-practice advice is to use a <header> block with the document title in an <h1> element, then use ... <h5> throughout the document, each in a section, to enable the use of microdata semantics and to aid others in re-using the content. (For example, loading such a document into Microsoft Word would lose the sections, but keep the headings.)

<header>
<!-- document title -->
<h1 itemprop='name'>...</h1>
...
</header>
<section>
<h1>...</h1>
...
<section>
<h1>...</h1>
...
</section>
</section>

Embedding Metadata and Semantics

Sam Adams has included a discussion of the most prominent methods of embedding semantics in documents in his case study [1]. He considers microformats, RDFa and microdata. As microdata is part of the HTML5 specification, and is receiving mainstream support from major internet companies, it is recommended as the default method of adding semantics to Scholarly HTML documents.

Conventions for embedded semantics and metadata:

Use the schema.org vocabularies where possible, and when they are not adequate extend semantics by using well-documented ontologies or vocabularies maintained by groups with an interest in scholarship.

A blog post [FN2] is available as an example of some of the design considerations in using microdata; it reports on work done as part of this case study.
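To give a feel for what a downstream tool can do with embedded microdata, here is a minimal sketch that pulls items and their properties out of a page. It is an illustration under stated assumptions, not a conforming microdata implementation: it ignores itemref, itemid and multi-valued itemprop attributes, and the class and function names (MicrodataSketch, extract_microdata) and the sample URL are hypothetical. It relies only on the Python standard library.

```python
import json
from html.parser import HTMLParser

VOID = {"meta", "link", "img", "br", "hr", "input"}
URL_ATTR = {"a": "href", "link": "href", "img": "src"}


class MicrodataSketch(HTMLParser):
    """Collect microdata items as nested dicts (simplified)."""

    def __init__(self):
        super().__init__()
        self.items = []   # top-level items found so far
        self.scopes = []  # stack of items whose elements are still open
        self.open = []    # stack of open elements

    def _add(self, prop, value):
        if prop is None or not self.scopes:
            if isinstance(value, dict):
                self.items.append(value)  # item with no enclosing scope
            return
        self.scopes[-1]["properties"].setdefault(prop, []).append(value)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        prop = attrs.get("itemprop")
        entry = {"tag": tag, "item": None, "prop": None, "text": []}
        if "itemscope" in attrs:
            item = {"type": attrs.get("itemtype"), "properties": {}}
            self._add(prop, item)  # attach to parent before opening scope
            entry["item"] = item
        elif prop is not None:
            if tag in URL_ATTR:
                self._add(prop, attrs.get(URL_ATTR[tag]))
            elif tag == "meta":
                self._add(prop, attrs.get("content"))
            else:
                entry["prop"] = prop  # value is the element's text
        if tag in VOID:
            return
        if entry["item"] is not None:
            self.scopes.append(entry["item"])
        self.open.append(entry)

    def handle_endtag(self, tag):
        if tag in VOID or tag not in [e["tag"] for e in self.open]:
            return
        while self.open:  # also closes unclosed children such as <p>
            entry = self.open.pop()
            if entry["item"] is not None:
                self.scopes.pop()
            if entry["prop"] is not None:
                self._add(entry["prop"], "".join(entry["text"]).strip())
            if entry["tag"] == tag:
                break

    def handle_data(self, data):
        for entry in self.open:
            if entry["prop"] is not None:
                entry["text"].append(data)


def extract_microdata(html_text):
    parser = MicrodataSketch()
    parser.feed(html_text)
    parser.close()
    return parser.items


sample = """<article itemscope itemtype="http://schema.org/ScholarlyArticle">
<h1 itemprop="name">A sample title</h1>
<a itemprop="url" href="http://example.org/paper">link</a>
</article>"""
print(json.dumps(extract_microdata(sample), indent=2))
```

The Live Microdata tool and the Show5ource bookmarklet listed under Tools do this job properly in the browser; the sketch only shows why declaratively marked-up properties are easy for software to harvest.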

Marking up Rhetorical Semantics

It is useful to be able to mark up sections in academic publications that have different roles. A W3C working draft on the "Ontology of Rhetorical Blocks" [11] puts it like this:

Having the rhetorical block structure externalised and attached to the digital publications would enable a richer and more expressive searching and browsing experience. One would be able to quickly spot the METHODS blocks within the publication and possibly resume the reading activity only to those, thus reducing the time usually spent on reading the entire publication. On the other hand, being able to formulate queries for content specific only to such blocks could already improve the quality (and possibly the quantity) of the set of relevant publications (e.g. methods: "autosomal-dominant mutations in APP").

For example, to mark up the major body sections of a document use mark-up like this:

Or one of the other rhetorical elements defined in the above-mentioned W3C ORB draft:

Relating the section back to the containing document still needs consideration, but using this kind of mark-up is a first step to capturing some of the document structure that XML is often used to describe, but in a more flexible way, that can be applied directly to Web documents.

As a simple demonstration of how this is useful, the WordDown converter that generates HTML from Word documents uses this mark-up for a references or bibliography section:

By using a public, well-defined and documented URI we increase the chances that software can interoperate and that others can reuse our scholarly resources. (But note that in the model of citations we are proposing here, detailed information about references might be stored in the document at the point they are cited, or not at all if the author is citing by reference).

Citations

See the case study by Sefton [6] on citation formatting, which contains draft examples.

Linking to Data and Supporting Documents

To link to a data set in a declarative way, use this pattern, with a generic property from the Citation ontology (cito) and a type which is domain-specific from an appropriate vocabulary, in this case a term associated with Chemical Markup Language:

This declarative statement of the relationship between a scholarly work and supporting data could then be turned into something interactive using a JavaScript library that is loaded by a bookmarklet, extension or added by the CMS serving the page. A worked example of using similar declarative mark-up to embed chemical visualisations in Web pages is available in a blog post [FN3].
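Before any script can make such a link interactive, it has to find it. The sketch below shows one way a tool might locate data links, assuming (this is an assumption, not the pattern prescribed by this guide) that the link is an <a> element whose itemprop is a full CiTO property URI in the http://purl.org/spar/cito/ namespace. The property name citesAsDataSource, the file name and the names DataLinkFinder and find_data_links are illustrative only.

```python
from html.parser import HTMLParser

# Assumed namespace for Citation ontology (CiTO) property URIs
CITO = "http://purl.org/spar/cito/"


class DataLinkFinder(HTMLParser):
    """Collect (CiTO property name, href) pairs from anchor elements."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        prop = attrs.get("itemprop") or ""
        if tag == "a" and prop.startswith(CITO) and attrs.get("href"):
            # Keep only the local part of the property URI
            self.links.append((prop[len(CITO):], attrs["href"]))


def find_data_links(html_text):
    finder = DataLinkFinder()
    finder.feed(html_text)
    finder.close()
    return finder.links


page = ('<p>See <a itemprop="http://purl.org/spar/cito/citesAsDataSource" '
        'href="data/molecule.cml">the supporting data</a>.</p>')
print(find_data_links(page))
```

Because the relationship is stated in the mark-up rather than in a script, the same page can be archived, indexed or enhanced by different tools without modification.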

Compound Documents/Objects

Many scholarly resources are made up of more than one part: theses and books have chapters; journal issues are made up of multiple articles, reviews, etc. Courseware is very often made up of disparate resources brought together in a lesson context, and the research object of the future really must comprise a wide range of components, including documents, data, provenance information and so on.

In the academic environment, the Open Archives Initiative has worked on a standard way of describing compound objects, the Object Reuse and Exchange (ORE) standard [FN4]. ORE is complicated to understand. For the purposes of real-world Web practice that is easy to implement, a simplified approach is needed, one that, as always, draws on terms from established ontologies and vocabularies.

Take the case of a table of contents on the Web for work that is made up of multiple parts (NOTE: this approach is an early proposal only).

<article itemtype="http://schema.org/Book" itemscope>
<h1 itemprop="name">My book!</h1>
<ol >
<li itemscope itemtype="http://schema.org/ScholarlyArticle" itemprop="http://www.openarchives.org/ore/terms/aggregates">
<a href="./chapter1.html">
<span itemprop="name">Chapter One</span>
</a>
</li>
…
</ol>
</article>
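A table of contents like the one above is a natural thing for a tool to generate. The following sketch builds the same kind of mark-up from a list of parts; it is only an illustration of the early proposal, and the function name build_contents and its parameters are hypothetical. It adds itemscope to each list item on the assumption that microdata requires it wherever itemtype is used.

```python
from html import escape

ORE_AGGREGATES = "http://www.openarchives.org/ore/terms/aggregates"


def build_contents(title, parts,
                   part_type="http://schema.org/ScholarlyArticle"):
    """Return table-of-contents mark-up for a work made of several parts.

    parts is a list of (href, name) pairs, one per aggregated resource.
    """
    lines = [
        '<article itemscope itemtype="http://schema.org/Book">',
        f'<h1 itemprop="name">{escape(title)}</h1>',
        "<ol>",
    ]
    for href, name in parts:
        lines += [
            f'<li itemscope itemtype="{escape(part_type)}" '
            f'itemprop="{ORE_AGGREGATES}">',
            f'<a href="{escape(href)}">'
            f'<span itemprop="name">{escape(name)}</span></a>',
            "</li>",
        ]
    lines += ["</ol>", "</article>"]
    return "\n".join(lines)


print(build_contents("My book!", [("./chapter1.html", "Chapter One")]))
```

All text and attribute values pass through html.escape, so titles containing characters such as & or quotation marks do not break the generated mark-up.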

Tools

As part of this work various tools were created by the author, drawing on open source libraries:

WordDown is a Word-to-HTML5 conversion application, covered in another case study document [7] and hosted on the Google Code site for jiscHTML5 [FN5].
ReCite is a citation re-formatter that uses Citation Style Language (CSL) to reformat the references in a page; it is covered in another case study [6].
Show5ource is JavaScript code, packaged as a bookmarklet for:
- Extracting Microdata from HTML5 documents in JSON format
- Copying and pasting the source of HTML5 documents for simple pasting into content management systems
- Alpha code for EPUB packaging of compound resources.

Also useful are these tools:

The Live Microdata tool [FN6] is useful for debugging microdata and uses the same underlying library as Show5ource.
H5o is an outline bookmarklet which can show the HTML5-compliant outline of a page [FN7].

4. Impact

This work, being very new, has had minimal impact apart from some engagement from a small cohort of scholars interested in Scholarly HTML. The techniques described here need to be tested in the use-case scenarios outlined above. None of this work will have measurable impact until there are interoperating services which use the conventions outlined here, or equivalents; but unless someone specifies some base-line conventions to try, those experiments will be doomed to be point-to-point agreements between disconnected projects.

5. Challenges

The main challenge in this work lies in dealing with a number of incomplete parallel projects. In particular, schema.org has been disruptive in that it promises a mainstream vocabulary of URIs that can be used as Microdata types and properties, yet it has some inconsistencies and lacks documentation for some parts (e.g., merely defining ScholarlyArticle as ‘A Scholarly Article’).

6. References

[1] Semantics and metadata, Adams, S., HTML5 Case Studies, University of Bath: UKOLN.

[2] The common Web design, Bilbie, A., HTML5 Case Studies, University of Bath: UKOLN.

[3] 3Dactyl: Using WebGL to represent human movement in 3D, Gray, S. HTML5 Case Studies, University of Bath: UKOLN.

[4] Re-Implementation of the Maavis assistive technology using HTML5 technologies, Lee, S. HTML5 Case Studies, University of Bath: UKOLN.

[5] Visualising embedded metadata. MacGillivray, M. HTML5 Case Studies, University of Bath: UKOLN.

[6] Challenging the tyranny of citation formats: Automated citation formatting for HTML5. Sefton, P. HTML5 Case Studies, University of Bath: UKOLN.

[7] WordDown: Word-to-HTML5 conversion tool, Sefton, P. HTML5 Case Studies, University of Bath: UKOLN.

[8] HTML5-based E-lecture Framework, Wang, Q., HTML5 Case Studies, University of Bath: UKOLN.

[9] HTML5 – Wikipedia, The Free Encyclopedia. Wikipedia, http://en.wikipedia.org/w/index.php?title=HTML5&oldid=455493654

[10] Dive into HTML5, Mark Pilgrim. http://diveintohtml5.info/

[11] Ontology of Rhetorical Blocks (ORB) Editor's Draft 5 June 2011, Ciccarese, P. & Groza, T. World Wide Web Consortium. http://www.w3.org/2001/sw/hcls/notes/orb/

Footnotes

[1] The #jiscPUB Project, http://jiscpub.blogs.edina.ac.uk/about/

[2] Scholarly HTML5: experimenting on myself with microdata and Schema.org vocabs http://ptsefton.com/2011/09/12/scholarly-html5-experimenting-on-myself-with-microdata-and-schema-org-vocabs.htm

[3] Scholarly HTML: Fraglets of progress http://ptsefton.com/2011/03/18/scholarly-html-fraglets-of-progress.htm

[4] What the OAI-ORE protocol can do for you http://ptsefton.com/2008/10/14/what-the-oai-ore-protocol-can-do-for-you.htm#id5

[5] jischtml5: Collection of HTML5 case studies and examples of scholarly resources and tools for processing them http://code.google.com/p/jischtml5/wiki/WordDown

[6] Live Microdata, http://foolip.org/microdatajs/live/

[7] Downloads - h5o – HTML5 outliner (bookmarklet, Chrome extension) – Google Project Hosting http://code.google.com/p/h5o/downloads/list
