peter sefton: relevant content on this site

HTML5 Case Study 9: WordDown: A Word-to-HTML5 Conversion Tool

Administrator — Mon, 16 Jul 2012 14:09:28 +0000

Author: Peter Sefton

HTML5 Case Study 9: WordDown: A Word-to-HTML5 Conversion Tool

1. About This Case Study

Target Audience

The main audience for this work is tool developers building authoring systems, repositories and publishing infrastructure for academic documents.

It may also be useful to committed academic authors, comfortable with HTML already and some technical skills, who would be able to install a bookmarklet and possibly run Python code on a Windows machine (i.e., not the broad academic community at this stage).

What Is Covered

This case study examines ways that academic authors working with word processors such as Microsoft Word, the OpenOffice.org family and Google Docs would be able to produce compliant Scholarly HTML5. Due to time constraints, the tool developed for this project handles Microsoft Word documents only, but the principles outlined here apply more broadly.

Word processors are used very widely in academia for all sorts of document authoring. Articles, essays, theses, course materials, and so on, are being produced in huge quantities in Microsoft Word and similar applications in the Higher Education sector. Yet they are not easy to convert to good-quality, clean, semantically rich HTML 5 of the type being discussed in the JISC HTML5 Project.

This is a critical piece of work for the overall project of bringing scholarship to the (semantic) web. If the tools still being used to create academic content do not create HTML natively, then who will do the mark-up? What new tools are needed, if any? This case study will consider these questions, as well as produce some demonstrations of what is possible with the example application, WordDown.

What Is Not Covered

The questions posed above about tools for creating HTML5 are even more pertinent if we are targeting XML for scholarly documents. However, XML is in its teens now, yet there have been no widely available tools produced to create XML for DTDs such as DocBook or TEI that have gained any kind of traction with large user groups. This is an interesting question for the Higher Education sector, but it is out of scope for this case study.

2. Use Case

There are no formal studies of which I am aware that show the usage rates of different academic authoring tools by country or discipline; but it is quite clear that Microsoft Word (along with other word processors) is a very widely used document-creation tool in many, many disciplines at our Higher Education institutions and research organisations. For example, on this study the UKOLN team have requested that case studies be submitted in Word format, using a template supplied by UKOLN. While the template gives the case study documents some structure, using 'Save as HTML...' from Word will not produce good-quality HTML - far from the kind of structured documents that are being produced as exemplars in the JISC HTML5 Project, Word's output is focused on a decade-old approach which tries to match paper formatting.

The format is discussed in an article by the author [1] which notes that some simple transformations can render it into XML, which can be processed by standard parsers. A short excerpt from the HTML produced by MS Word for one of the case study documents illustrates this:

and

There are three main issues with the native format.

It contains processing information which is foreign to HTML, embedded using a mixture of comments to hide code from being rendered in browsers, and some non-standard mark-up. An example of this is the way it renders lists as a series of paragraphs rather than as an HTML list structure. This is an example from one of the other case studies. Constructs like <!--[if !supportLists] are not part of the HTML standard, although all browsers have evolved to deal with this idiom.
It is oriented towards looking right, rather than working as a Web page so contains a lot of information about indents and so on that are page-oriented.
It is very verbose - with span elements and other spurious mark-up appearing as an artefact of the editing process.

There is now in Word a 'Filtered' output which removes much of the worst mark-up but it still does not create documents which are clean, reusable HTML or HTML5.

The use case here is using Microsoft Word in the production of any academically oriented document that is destined for the Web, including articles, theses and other student work, reports, course materials, academically inclined blog posts or other Web pages, and reports such as this one.

3. Solution

The improvement this study proposes to the current approach in using Word to create HTML5 is a tool called WordDown, created for this JISC project. This is a JavaScript application which runs in a Web browser and processes Word documents into clean HTML5. It takes Word's HTML output as an input.

The wiki at the JISC HTML5 Project at Google code [1] covers how to run the application and the basics of how to format documents.

Background to WordDown

The Word 2000 HTML format has been a feature of Microsoft's flagship word processor since the '2000' version, and has been the target of much criticism. At the time it was introduced, it was capable of rendering almost all features of Word documents into a kind of HTML, using a combination of extended CSS formatting and islands of very obscure non-standard mark-up. It was actually not very far away from XML, and could be processed into XML with a small transformation program [2].

The solution presented here revisits the format with modern tools by loading it into a modern Web browser and using the jQuery framework to interrogate various aspects of the formatting. Recent versions of Web browsers are all coded to deal gracefully with the 'mutant mark-up' in Word's HTML output, hiding Word-specific code in comments, because HTML5 parsing rules take account of all kinds of legacy issues like Microsoft's non-standard mark-up.

The application WordDown is inspired by the success of lightweight wiki-style mark-up languages which allow users to create HTML (and PDF in some cases) from simple text files . One of the foremost examples of this class of language is the MarkDown format, used by the Pandoc processing framework. (Peter Krautzberger has a useful introduction to Markdown and Pandoc for academic authors and explains how it may be considered what we might call 'the new LaTeX' for academic authoring.)

In Markdown, one way of making a heading is to preface some text with #.

# Introduction (turns into <h1> in HTML)

## About markdown (turns into <h2>

In WordDown, to accomplish the same thing, the author uses the built-in heading styles - Heading 1 and Heading 2 respectively. These styles are the only widely-used standard way of structuring documents in the word-processing world - other elements such as quotes or lists have no equivalent standard implementations.

To make a block-quote in Markdown, you use a greater-than character:

> This is a block-quote.

In WordDown, just indent the paragraph either using the formatting tools on the Word ribbon, or define a style - but (at this stage at least) the WordDown processor does not use style names other than for detecting some headings; it uses indenting. So any indented style which is not a list or a heading will be treated as a block-quote.

The algorithm WordDown uses is being documented on the Google Code site, initially via the code. In essence, it is designed around the assumption that the user wants to create clean HTML, not to recreate the look of a paper document. So in a similar way to the light-weight wiki mark-up languages, it uses formatting and indenting as structural cues. The main device used is to look at the left margin:

With a plain paragraph that is not a heading or a list item, an indent greater than the preceding non-heading paragraph means block-quote.
Paragraphs in a monospace font are mapped to the <pre> (pre-format) element.

Features Summary

WordDown has the following features which are described in greater detail on the Google Code site for the software [2].

Creates HTML from Word documents saved using "Save as HTML..." on Word for Windows versions from 2000 to 2010. The code runs in a Web browser and is packaged both as a bookmarklet and as a small Python Web server that users need to run from their documents directory.
Works with Zotero citations and embeds them in-line using best-practice Scholarly HTML5 conventions.
Can create rich semantic HTML5 with embedded microdata, given microformats in the source document.

Demonstration: Screenshots

The simplest way to run WordDown is to save documents manually as HTML, load them into a Web browser and use the WordDown bookmarklet as documented on the Google code wiki. A slightly easier workflow (which is harder to set up) is to run the WordDown server:

Figure 1. Browsing local files using the WordDown Web server.

When the user selects a word document, the WordDown server runs Microsoft Word in the background, saves the document as HTML, inserts Javascript into the head and serves the result back the user's browser. The result is that the user is presented with an HTML version of the Word document using a stylesheet derived from the one used by the W3C for its standards documents:

Figure 2. A Word document (this one) converted to HTML5 by WordDown running the browser.

The resulting document is HTML5 - and can be saved by the user for reuse. Alternatively, using another JavaScript application developed for the JISC HTML5 Project, parts of the document can be copied and pasted via the Show5ource bookmarklet.

Figure 3. The Show5 bookmarklet.

In Figure 3 the Show5 bookmarklet identifies the HTML5 sections in a document and lets the user click to see copy and paste-ready source, or grab the whole document as a Zip file with all images.

Figure 4. Show5 encodes image data in dataURIs to entire Web pages which can be copied and pasted, for example into a CML (corporate multi-lingual) such as WordPress.

Finally, the tool can create semantically rich documents. Here is the JSON format data which can be extracted from the page by clicking on the {} link:

Figure 5. JSON format data which can be extracted from the page by clicking on the {} link.

This data was embedded in the document details table, using a microformat-like technique which is documented on the Google Code wiki.

Demonstration Web Documents

Demonstration documents in Word format (as noted above) that can be automatically transformed to HTML5 with embedded document semantics and re-processable citations can be found in the Google Code repository [3].

4. Impact

This work has had no impact so far as it is very new, but could be important to the uptake of HTML5 in academia if it is picked up by user communities, such as, for example, the authors who publish to KnowledgeBlogs, or agencies such as UKOLN involved in publishing a variety of academic materials.

To have a substantial impact, there would need to be a driver for people to create HTML5 materials for academia. Except in pockets of activity (e.g. academic blogging) this is not currently the case. One current trend - the move to ebooks away from paper may finally tip the balance and have academic authors looking for tools that can create the HTML they need as the building block for epub and Amazon Kindle ebook publications.

This tool would need work to make it easier to deploy in academic contexts.

Of course, an official HTML and/or EPUB plug-in from Microsoft itself working along similar lines could make this work obsolete overnight.

5. Challenges

The biggest challenge in this project has been the cross-site scripting rules in Web browsers which prevent code from accessing certain domains. In this case, if the Word document is loaded from the local file system, code in the browser may not access images from the local file system - this means the plug-in is prevented from doing any processing on images, such as creating data URIs, or creating a zip file of the entire document with all its images. To get around this, a simple Web service was created using Python, repeating a design pattern used in a previous project, the Integrated Content Environment (ICE) [2] which used a local Web server on users' machines to convert office documents to HTML. At the moment, this Web server is only suitable for use by technically adept users who can install Python 3 and run it, as well as download the source code, but it could be packaged as a Windows executable, were appropriate resources available.

6. Conclusions

The WordDown application shows that:

The JQuery framework in a modern HTML5 browser is a very powerful document-processing language. When Word 2000 was released, it was unthinkable that a browser-based conversion system could be written in a few hundred lines of code.
Using very simple heuristics borrowed from lightweight wiki mark-up languages, it looks feasible to produce high-quality semantically rich HTML5 materials with a minimum of training for authors, if they can be persuaded that most formatting will be discarded, and only structurally meaningful formatting will be kept.

Where academics are using the Web, and care about producing well structured resources, as with a growing cohort of academic bloggers and scholars embracing the 'Beyond the PDF' movement (whether they know it or not), it should be possible to introduce a tool such as WordDown into the tool chains supplied by supporting institutions.

A small investment in frameworks to support users wishing to run this kind of code could dramatically reduce production costs for HTML5 and EPUB e-book resources, by giving authors and editors tools to create resources straight from the tools they know best, such as MS Word.

7. References

[1] Word to XML and back again Sefton, P. O'Reilly, XML.com. December 8, 2004. http://www.xml.com/pub/a/2004/12/08/word-to-xml.html

[2] The integrated content environment. In AUSWEB 2006. Noosa: Southern Cross University. Sefton, P. http://eprints.usq.edu.au/archive/00000697/01/Sefton_ICE-ausweb06-paper-revised-3.pdf.

Footnotes

[1] WordDown - jischtml5. http://code.google.com/p/jischtml5/wiki/WordDown

[2] WordDown - jischtml5, http://code.google.com/p/jischtml5/wiki/WordDown

[3] WordDown - jischtml5, http://code.google.com/p/jischtml5/wiki/WordDown

HTML5 Case Study 8: Conventions and Guidelines for Scholarly HTML5 Documents

Administrator — Mon, 16 Jul 2012 14:01:38 +0000

Author: Peter Sefton

1. About This Case Study

HTML5 case study on “Conventions and Guidelines for Scholarly HTML5 Documents” by Peter Sefton

This case study looks at the fundamentals of using HTML5 for scholarly documents of all kinds, particularly theses and courseware documents (with application to journal articles as well), but with an eye on a much broader spectrum of resources, including those which are the subject of other case studies in this project such as slide presentations. It will aim to establish the basic structural and semantic building blocks for how resources should be marked up for the Web, to increase their utility for people and machines, as well as help to ensure they can be preserved effectively. This case study will build on work already undertaken by in the Scholarly HTML community as well as the other HTML5 case studies [1], [2], [3], [4], [5], [6], [7] and [8].

Target Audience

The audience for this work is tool developers building authoring systems, repositories and publishing infrastructure for academic documents. The outcomes could be used by people hand-coding documents, but that is not a very likely scenario; another related case study will implement the advice from this study into a tool to allow users to create HTML5 from Microsoft Word (using the Word 2000 HTML format which is available on MS Windows from v2000 to v2010) [7] and a third looks at how citations embedded in a document using the guidelines presented here can be re-formatted [6].

What Is Covered

The following aspects are covered in this document:

The basic structural backbone of a scholarly HTML document; how to mark up the scope of what is the content on a page. For example, on a blog, which section is the scholarly work as distinct from the navigation elements, advertisements, etc.?
Best practice for marking up sections within the document (whether to use nested sections or just headings - discussion of issues like putting headings in tables and their implications).
A brief discussion of techniques for embedding rhetorical semantics in documents. That is, the ability to distinguish an introduction or conclusion, or to mark parts of a text such as learning objectives by drawing on XML schemas and ontologies. Some generic advice about how to mark up other kinds of semantic relationships, such as linking to a data file, illustrated with examples from Chemistry.
Work on metadata and semantics by other case studies incorporated into the core guidelines.

The following is still in gestation:

Anchors for commenting and annotation: this requires some attention as simple schemes such as numbering paragraphs are very limited in capability, and current tools such as digress.it require documents to be in a particular dialect of HTML to work. The introduction of standards in this area would allow interoperability between commenting and annotation systems.

This work should have an impact on:

Search engine optimisation (SEO), particularly for services such as Google Scholar.
Reduced friction in moving documents through submission processes to journals, to repositories and to review processes such as peer review, thesis examination, and assessment, via automated metadata extraction.
Improved machine-readability for text and data-mining processes.
Improved accessibility for readers - guidelines will take into account WCAG accessibility guidelines.
Preservation: the guidelines will assist authors and tool makers in constructing documents which do not 'rot' as technology changes.

2. Use Case

The use case here is very broad: it is about the optimal mark-up for any kind of academic-related document on the Web. The Web was conceived as a vehicle for scholarship, but in the two decades since its invention, scholarly communications have taken a back seat to the driver, commerce. The most common form of Web publishing for scholarly publications is articles in PDF format, which readers are expected to download and manage themselves. PDF has the advantage of capturing an absolute layout, preserving the exact look of a document, but it was designed for capturing print - not for delivery to an increasing variety of screen sizes (and to devices with no screen at all).

It has been recognised that the current scholarly publications landscape is not serving the needs of the community. Publishers have built an industry around the creation of paper-like objects which do not allow:

Delivery to any device
Re-use by humans to create new works.
Rich integration of publications with supporting data and visualisations of data.
Machine-processable semantics so that research literature can be mined, indexed and analysed automatically.

For learning resources the situation is a little different, in that there has been some history of Web-based delivery of materials controlled by institutions themselves.

As noted in another JISC project [FN1] HTML is the major format for the next stage of the development of the Web, as well structured Web resouces can not only be delivered and used on the Web, they are the basis for creating e-books. It is clear from the extremely rapid growth of the e-book market for commercial publishing that learning resources and research materials will need to follow. HTML5 is essential both to the open EPUB3 standard and to market leader Amazon's newest format.

In both research and learning materials, there is still a distinct lack of tools for creating Web-native resources, at least in a way accessible to typical academics. The use case of an academic sitting down to create richly structured HTML5 academic objects, with embedded semantics and preservation-quality mark-up, is something of a dream at this stage. The best this case study can hope to achieve is to provide a starting point for a description of what Scholarly HTML should look like and to provide a starting point for a roadmap for tool development to allow the scholarly Web to take its rightful place alongside the Web 'high street';.

3. Solution

The solution has two parts. The first is a guide to marking up Scholarly HTML documents. The second points to software packages and techniques that are useful in the process of marking up documents, both existing tools, and tools developed as part of this project.

How to Mark Up Documents for Scholarly HTML

This section will become a stand-alone guide and be posted on the Scholarly HTML Website as the core guide to structuring HTML documents. Scholarly HTML is a term used by a loose group of people interested in bringing scholarship to the Web, or the Web back to its scholarly roots as a publication and research platform. The group met physically once at a meeting convened by Peter Murray-Rust at Cambridge University in March 2011.

Use HTML5, Microdata and Common Vocabularies

HTML5 is an evolving standard which codifies HTML in the context of the real world. The Wikipedia page for HTML5 [9] is a good starting point for pointers to the specification. This document will assume that the reader is familiar with the HTML5 standard, in particular outline structures and microdata; Mark Pilgrim's Dive Into HTML5 [10] is a good free introduction.

Within the Scholarly HTML group there was a short, vigorous, debate about whether or not Scholarly HTML should be required to be well-formed XML. There were fears on both sides - that HTML resources would be impossible to parse reliably, and on the other hand that making XML mandatory would be too high a bar, reducing the pool of available content considerably. Mark Pilgrim covers very similar arguments in his chapter on the background of HTML5 including the now-abandoned XHTML standard. The good news is that with HTML5 you do not have to choose. The HTML5 standard specifies exactly how HTML5 should be parsed - and once parsed it can be re-serialised as XML. So, for machine-based processing, the advice is use an HTML5 parser, not an XML parser. Then, if you want to use XML tools, serialise the document as XML.

For example, here is some Python to illustrate the process using html5lib - a reference implementation of the parsing rules, and lxml. This is copied and pasted from an Ubuntu Linux environment.

First, install the Python libraries you need:

sudo easy_install lxml html5lib

Then, open a Python shell (type python) and try this:

import html5lib #Handles the ’HTML5’ stuff

from html5lib import treebuilders

from lxml import etree #Handles the XML serialization

parser = html5lib.HTMLParser(tree=treebuilders.getTreeBuilder(“lxml”))

e = parser.parse(“<p>This is some HTML<p>Which is very far from being <b>XML <p>But which the HTML parser will be OK with”)

print etree.tostring(e.getroot())

The result is well-formed XML (yes, the namespace is probably wrong, but this will serve as the input to downstream processes).

xmlns:html=”http://www.w3.org/1999/xhtml“>

<html:head/>

<html:body>

<html:p>THis is some HTML</html:p>

<html:p>Which is very far from being <html:b>XML</html:b></html:p>

<html:p>

<html:b>But which the HTML parser will be OK with</html:b>

</html:p>

</html:body>

</html:html>

So, the advice for scholary HTML is:

Use HTML5 as per the standard including Microdata.

The Context for Pages

Scholarly works on the Web are unlikely to be stand-alone documents. They will very often be embedded in content management systems, repositories or publisher Web sites. It is out of scope for this case-study to consider the structure of Web pages produced by these Web applications, but for an example of best practice in structuring HTML5 Web sites, including navigation elements and so on, see the Common Framework case study conducted by Bilbie [2].

The key points in that case study involve:

Flexible design that can re-flow to any size of device.
Use of HTML5 attributes in the mark-up to provide cues to screen readers and other assistive software.

Be Declarative

HTML5 has a close relationship to the Javascript scripting language and has theelement on which you can cause things to appear, via scripts. But there are several reasons to avoid making academic works depend on particular scripts or applications, and instead look for ways to express the meaning of a work and the parts it links to as plain HTML, with enough information in it for scripts, etc., to come into play when needed:

Works can be archived and preserved independently of the scripts on which they depend
Other people can reuse the declaratively specified data in new ways
Revising the work when new applications emerge is much easier when there is a clear separation between documents, together with data and media, as opposed to the code that does interesting things with the documents, data and media.

A good example of this approach can be seen in this series of case studies in the work by Adams [1] and MacGillivray [5] on citation formats, where the same declarative format is a meeting point for two different projects. Another project based on a declarative format (though not HTML5) is the work by Gray on embedding 3D motion-capture models in HTML5. The embedding is done using a declarative XML format rather than via a script [3].

Outline Structure

HTML5 has an <article> element which at first glance seems to be the perfect container for scholarly works. It seems obvious that it should be used for the text of a scholarly article and reasonable that it should be used for book chapters, course modules and so on. The problem with this is that content management systems may also be using <article>. For example, the WordPress default theme at time of writing, uses <article> to mark up posts.

So, the advice is:

Conventions for document-level mark-up:

If your scholarly work is going to be part of a stand-alone Web page, or you know that it is appropriate in the context into which it will be published use <article>

If the article is going to be sent off to a publisher, posted to a blog (where for example the theme might change at some point) it is safer to use the <article> element.

In either case, mark up the scholarly work with microdata semantics:

Note that the Schema.org definition for Scholarly Article is at present rather light on detail, defining it as:

A Scholarly Article

This guide is making the assumption that, in spirit it is really the more generic "Scholarly Work". If more delicate terms are added to the Schema.org vocabularies or more appropriate terms identified then this advice will need to change.

Within the section or article element chosen, the question arises how to mark up the structure of a work with headings, sections, etc. In HTML5, documents have an outline, which can be computed using a well-specified algorithm.

This means that the use of internal section elements within resources has no real impact on semantics, so how you format a document depends on what is convenient or necessary:

For authoring in a text editor, or even an HTML editor, the use of sections may be an unnecessary complication; consider using headings which are not wrapped in enclosing sections.
For authoring in a word-processing environment, nested sections are impossible to implement; so use heading styles and choose conversion software that can respect them - for example WordDown, produced as a demonstrator for this project [7].
To add microdata semantics at the section level of the document, it will be necessary to use section mark-up on which to ‘hang’ microdata attributed. This is a significant barrier to editing with lightweight tools such as Markdown.
In published documents, using sections makes it easier for other people to copy and paste or machine-process documents, even though they could determine the structure of the document by computing its outline.
For the most general way of presenting documents for publication, it is possible to use this structure where each section has an <h1> heading, even where they are nested within each other, but this may not be encouraging re-use by others who need to edit the document.

For documents that need to work in legacy browsers, and content management systems where you do not have control over the CSS used current best-practice advice is to use a <header> block with the document title in and </header> element, then use ... <h5> throughout the document, each in a section, to enable the use of microdata semantics, and to aid others in re-using the content. (For example, loading such a document into Microsoft Word would lose the sections, but keep the headings.)

<header>
<!– document title–>
itemprop='name'>...,
...
</header>
<section>
<h1>...</h1>
...
<section>
<h1>...</h1>
...
</section>
</section>

Embedding Metadata and Semantics

Sam Adams has included a discussion of the most prominent methods of embedding semantics in documents in his case study. He considers microformats, RDFa and microdata. As microdata is part of the HTML5 specification, and is receiving mainstream support from major internet companies it is recommended as the default method of adding semantics to Scholarly HTML documents [1]:

Conventions for embedded semantics and metadata:

Use the schema.org vocabularies where possible, and when they are not adequate extend semantics by using well-documented ontologies or vocabularies maintained by groups with an interest in scholarship.

A blog post [FN2] is available as example of some of the design considerations in using microdata and which reports on work done as part of this case study.

Marking up Rhetorical Semantics

It is useful to be able to mark up sections in academic publications that have different roles. A W3C working draft on the "Ontology of Rhetorical Blocks" [11] puts it like this:

Having the rhetorical block structure externalised and attached to the digital publications would enable a richer and more expressive searching and browsing experience. One would be able to quickly spot the METHODS blocks within the publication and possibly resume the reading activity only to those, thus reducing the time usually spent on reading the entire publication. On the other hand, being able to formulate queries for content specific only to such blocks could already improve the quality (and possibly the quantity) of the set of relevant publications (e.g. methods: "autosomal-dominant mutations in APP").

For example, to mark up the major body sections of a document use mark-up like this:

Or one of the other rhetorical elements defined in the above-mentioned W3C ORB draft:

Relating the section back to the containing document still needs consideration, but using this kind of mark-up is a first step to capturing some of the document structure that XML is often used to describe, but in a more flexible way, that can be applied directly to Web documents.

As a simple demonstration of how this is useful, the WordDown converter that generates HTML from Word documents uses this mark-up for a references or bibliography section:

By using a public, well-defined and documented URI we increase the chances that software can interoperate and that others can reuse our scholarly resources. (But note that in the model of citations we are proposing here, detailed information about references might be stored in the document at the point they are cited, or not at all if the author is citing by reference).

Citations

See the case study by Sefton [6] on citation formatting, which contains draft examples.

Linking to Data and Supporting Documents

To link to a data set in a declarative way, use this pattern, with a generic property from the Citation ontology (cito) and a type which is domain-specific from an appropriate vocabulary, in this case a term associated with Chemical Markup Language:

This declarative statement of the relationship between a scholarly work and supporting data could then be turned into something interactive using a JavaScript library that is loaded by a bookmarklet, extension or added by the CMS serving the page. A worked example of using similar declarative mark-up to embed chemical visualisations in Web pages is available in a blog post [3].

Compound Documents/Objects

Many scholarly resources are made up of more than one part: theses and chapters have books; journal issues are made up of multiple articles; reviews etc. Courseware is very often made up of disparate resources brought together in a lesson context, and the research object of the future really must comprise a wide range of components, including documents, data, provenance information and so on.

In the academic environment, the Open Access Initiative has worked on a standard way of describing compound objects, the Object Reuse and Exchange (ORE) standard [FN4]. ORE is complicated to understand. For the purposes of real-world Web practice that is easy to implement, a simplified approach is needed; but as always, drawing on terms from established ontologies and vocabularies.

Take the case of a table of contents on the Web for work that is made up of multiple parts (NOTE: this approach is an early proposal only).

<article itemtype="http://schema.org/Book" itemscope>
<h1 itemprop=”name”>My book!</h1>
<ol >
<li itemtype="http://schema.org/ScholarlyArticle" itemprop="http://www.openarchives.org/ore/terms/aggregates">
<a href="./chapter1.html">
<span itemprop=”name”>Chapter One</span>
</a>
</li>
…
</ol>
</article>

Tools

As part of this work various tools were created by the author, drawing on open source libraries:

WordDown is a Word-to-HTML5 conversion application, covered in another case study document [7] and hosted on the Google Code site for jiscHTML5 [FN5].
ReCite is a citation re-formatter that uses Citation Style Language (CSL) to reformat the references in a page covered in another case study (Sefton, 2012a).
Show5ource is Javacript code, packaged as a bookmarklet for:
- Extracting Microdata from HTML5 documents in JSON format
- Copying and pasting the source of HTML5 documents for simple pasting into content management systems
- Alpha code for EPUB packaging of compound resources.

Also useful are these tools:

The Live Microdata tool [FN6] is useful for debugging microdata and uses the same underlying library as Show5ource.
H5o is an outline bookmarklet which can show the HTML5-compliant outline of a page [FN7].

4. Impact

This work, being very new, has had minimal impact apart from some engagement from a small cohort of scholars interested in Scholarly HTML. The techniques described here need to be tested in the use-case scenarios outlined above. None of this work will have measurable impact until there are services interoperating which use the conventions outlined here, or equivalents, but without someone specifying some base-line conventions to try, then those experiments will be doomed to be point-to-point agreements between disconnected projects.

5. Challenges

The main challenge in this work lies in dealing with a number of incomplete parallel projects. In particular schema.org has been disruptive in that it promises a mainstream vocabularly of URIs that can be used as Microdata types and properties yet it has some inconsistencies and lacks documentation for some parts (e.g., merely defining ScholarlyArticle as ‘A Scholarly Article’).

6. References

[1] Semantics and metadata, Adams, S., HTML5 Case Studies, University of Bath: UKOLN.

[2] The common Web design, Bilbie, A., HTML5 Case Studies, University of Bath: UKOLN.

[3] 3Dactyl: Using WebGL to represent human movement in 3D, Gray, S. HTML5 Case Studies, University of Bath: UKOLN.

[4] Re-Implementation of the Maavis assistive technology using HTML5 technologies, Lee, S. HTML5 Case Studies, University of Bath: UKOLN.

[5] Visualising embedded metadata. MacGillivray, M. HTML5 Case Studies, University of Bath: UKOLN.

[6] Challenging the tyranny of citation formats: Automated citation formatting for HTML5. Sefton, P. HTML5 Case Studies, University of Bath: UKOLN.

[7] WordDown: Word-to-HTML5 conversion tool, Sefton, P. HTML5 Case Studies, University of Bath: UKOLN.

[8] HTML5-based E-lecture Framework, Wang, Q., HTML5 Case Studies, University of Bath: UKOLN.

[9] HTML5 – Wikipedia, The Free Encyclopedia. Wikipedia, http://en.wikipedia.org/w/index.php?title=HTML5&oldid=455493654

[10] Dive into HTML5, Mark Pilgrim. http://diveintohtml5.info/

[11] Ontology of Rhetorical Blocks (ORB) Editor's Draft 5 June 2011, Ciccarese, P. & Groza, T. World Wide Web Consortium. http://www.w3.org/2001/sw/hcls/notes/orb/

Footnotes

[1] The #jiscPUB Project, http://jiscpub.blogs.edina.ac.uk/about/

[2] Scholarly HTML5: experimenting on myself with microdata and Schema.org vocabs http://ptsefton.com/2011/09/12/scholarly-html5-experimenting-on-myself-with-microdata-and-schema-org-vocabs.htm

[3] Scholarly HTML: Fraglets of progress http://ptsefton.com/2011/03/18/scholarly-html-fraglets-of-progress.htm

[4] What the OAI-ORE protocol can do for you http://ptsefton.com/2008/10/14/what-the-oai-ore-protocol-can-do-for-you.htm#id5

[5] http://ptsefton.com/2008/10/14/what-the-oai-ore-protocol-can-do-for-you.... jischtml5: Collection of HTML5 case studies and examples of scholarly resources and tools for processing them http://code.google.com/p/jischtml5/wiki/WordDown

[6] Live Microdata, http://foolip.org/microdatajs/live/

[7] Downloads - h5o – HTML5 outliner (bookmarklet, Chrome extension) – Google Project Hosting http://code.google.com/p/h5o/downloads/list

HTML5 Case Study 7: Challenging the Tyranny of Citation Formats: Automated Citation Formatting

Administrator — Mon, 16 Jul 2012 13:58:25 +0000

Author: Peter Sefton

1. About This Case Study

HTML5 case study on "Challenging the Tyranny of Citation Formats: Automated Citation Formatting for HTML5" by Peter Sefton

What Is Covered

This case study looks at how citations and reference lists can be represented in HTML5 in two ways; firstly with reference information supplied in-page and secondly using URIs that point to trusted bibliographic data stores. The end goal is to automate as much of the citation and reference management experience as possible at all stages of the academic workflow, from research to authoring, to publishing to citation analysis, generation of metrics and machine processing of data. This case study covers the HTML5 coding that would be needed to support the goal of end-to-end citation processing, and provides a proof-of-concept demonstration of the principles. There are, however, some significant problems to overcome before we have working software that can be used by a broad range of academic authors.

An outcome of this case study is a Javascript tool demonstrating how scholarly works of all kinds can be marked up in such a way that users can decide to display the citations and reference list in any format they choose via the Citation Style Language (CSL), initially choosing from a small set of commonly used formats. With further development, the tool would have access to the several hundred community-edited CSL styles available from the Zotero Project. The tool, ReCite, will continue to be developed at least until the completion of this JISC HTML5 work, and is documented on the Google Code wiki for the project.

The initial prototype builds on work in the JISC-funded Open Bibliography Project [1] and was developed in parallel with a case study by MacGillivray [2].

The case study has also been conducted with an eye to practical authoring tools, with consideration given to how people may create this kind of content using familiar tools such as word processors with Zotero or Mendeley and similar systems, or with LaTeX and BibTex, Wikis and online Web editors such as those provided on WordPress.

The exemplars and documentation cover:

Techniques for adding citation and reference lists using URIs in HTML5 and implications for authoring.
The issues this work presents for tool-chains where authors are using word processors, LaTeX, etc., to create content.

The point of this is to build on other projects in this stream mentioned above, covering the basics of embedding citations and structuring documents to:

Show user-centric user interface possibilities for displaying citations.
Point the way to improved authoring services that do away with unnecessary formatting of citations and references altogether.
Together with Mark MacGillivray's visualisation project [2], show a potential future for citations and referencing that transcends the limitations of a system still rooted in paper-based conventions.

The issues discussed in this case study were the subject of a session at THATCamp, Canberra (The Humanities and Technology Camp) [FN1] in October 2011, convened by the author.

This case study attempts to discuss some of the workflow issues for all kinds of academic work involving referencing, but due to time and resource constraints, will focus on one particular tool-chain: authoring in Microsoft Word, using Zotero to manage references.

What Is Not Covered

There is no scope in this case study for a detailed survey of citation practices or lengthy discussion of the relative merits of embedding citation data vs using URIs. It aims to produce a starting point for one simple, useful tool, and to gain further experience for the JISC community with bibliographic data.

Also out of scope is the detailed work needed to map between citation formats, and to write connector-code for online services so that citation by reference can be resolved reliably to machine-readable bibliographic data.

Target Audience

The main audience for this work is tool developers building authoring systems, repositories and publishing infrastructure for academic documents. The outcomes could be used by people hand-coding documents, but that is not a very likely scenario.

It is also targeted at digitally adept researchers, such as those in the digital humanities, or working on Web-based scholarship tools.

2. Use Case

Citations and references are a key part of all academic practice across many resource types including, but not limited to:

Articles (best not to call them "papers" in an HTML5 project).
Theses and other student work such as essays.
Books.
Course materials.
CVs and portfolios.
Report documents such as this one.

Even today, when there are many tools available for managing references and formatting documents, students are taught how to hand-format references. At the aforementioned THAT Camp session, two participants involved in information literacy noted that even though students should be using reference management software in their work, it is important that they can read the main citation formats used in their disciplines, and learning to hand-construct citations is a good way to learn. The nature of conventional citation and reference has evolved under the constant constraints of paper-based publishing, not least of which has been space, something that Web-based publishing has largely removed - but these constraints have made the process very costly in terms of researchers' time. But looking ahead to a world where the constraints of paper-based scholarship no longer require us to invent and memorise rigorous concise encodings for bibliographic references and citations, participants in the workshop agreed that it would be possible in Web contexts to show reference information in new ways, such as simple 'pop-up' tables that show the data set out with labelled fields. This would save us having to know, for example whether the text in italics in a reference in a given format is the name of the article or the name of the journal, or whether the theme under consideration is the title of a book or a report.

But for now, obviously citation formatting conventions are still important, hence this work on making it possible for users to choose the format they prefer, to improve the usability of online resources. Other benefits include:

Reliable machine processing for better citation metrics.
Reduced effort for authors, editors and readers who reuse citations.

Guidance for Authors

In addition to the solution described below, this case study looks at the broader issues around citation management in a Web world. This section proposes some best practice for academic authors.

The goals of this section are:

To describe current best practice for authors wanting to put academic resources on the Web with the most useful possible citations, including guidance on how to format HTML pages and descriptions of the demonstration tools developed in this case study.
To enable authors to meet publisher or university requirements for particular citation formats as easily as possible.
Provide a platform for change; where instead of publishers or markers requiring references to be formatted in a particular text-based format, the requirement is for rich citation data to be embedded.

Citation Data Online vs Embedded? Do Both

One of the main goals for best-practice HTML5 citation should be to make sure that citations have URIs that can be resolved to Web resources. That is, all bibliography should be published and authors should be able to use high-quality online sources when possible. This means that academic resources become part of the linked-data Web.

Take, for example, historians working with the National Library of Australia's Trove collection of Australian newspapers[FN2]. Using the Web interface, they can search the collection to locate an article. With Zotero [FN3] running in Firefox, an add icon appears in the browser so researchers can create a record for an article such as one about flooding in Toowoomba, Victoria in 1893 [3]. If researchers are sharing their Zotero libraries, then they have now created a usable public URI with bibliographic information for that article. It would then be possible to cite by reference to the bibliographic record.

While it might seem to be at odds with the above requirement to use URIs, it is also best practice in many situations to include a cached copy of the bibliographic data with the document. Including data inline means:

Data is available for re-formatting citations when working off-line.
Authors can fine-tune citation details to suit their own purposes.

Therefore, for maximum potential utility to readers, re-users, and machine-processing applications, including both cite-by-reference to a URI and embedded bibliographic metadata is important. In disciplines where high-quality reference sources such as PubMed Central are available online, then there may be no substantial advantage in using embedded metadata, whereas in the humanities, or areas where URIs are likely to be less stable, then it offers some extra insurance.

Formatting HTML5 citations

Following work by Sam Adams on the JISC HTML5 Project [4] subsequently refined by group discussion, the minimal citation for an online resource should be embedded in HTML5 in the following way:

<span itemtype=”http://schema.org/ScholarlyArticle“
itemscope=”"
itemprop=”http://purl.org/ontology/bibo/cites>
DISPOSABLE citation marker here
<link itemprop=”url” href=”http://example.com/uri-for-citation/” />
</span>

The 'HTML5 way' to embed further data is to use more link and metadata elements to embed the citation data at the point of citation. Sam Adams; case study [4] has chosen the bibliographic ontology for its namespace. For example, author names:

While this is the ‘HTML5 way’ to embed data, in this case study the initial prototype for the recite application is implemented with a hack. In the code described in the Solution section below, data is simply embedded in the document using a data URI that encodes the JSON data supplied by Zotero. In order to work with citeproc.js, the most efficient way to store data is using the JSON format that is used by citeproc.js, described at:

https://github.com/citation-style-language/schema/blob/master/csl-data.json

This is not usual HTML5 best practice, and subsequent versions will attempt to use more standard approaches. To accomplish this, though, more work needs to be done to map the JSON format to the Bibliographic ontology.

To illustrate this in a concrete way, let us take the Zotero example above, step by step.

The author cites the newspaper article about flooding using Zotero in Microsoft Word, using the Zotero tool, e.g. [3] .
Behind the scenes, Zotero stores the citation in JSON format with complete citation data in a Word field. In the raw save as HTML format from Word, this looks like so (truncated):

<!–[if supportFields]><span lang=EN-GB style=’mso-fareast-language:EN-GB’><span style=’mso-element:field-begin’></span>

ADDIN ZOTERO_ITEM CSL_CITATION

{"citationID":"j3o3mb0pi","properties":{"formattedCitation":"[2]","plainCitation":"[2]"},"citationItems":[{"id":55,"uris":["http://zotero.org/users/568/items/B2F9H4I2"],"uri":["http://zotero.org/users/568/items/B2F9H4I2"],"itemData":{"id":55,"type":"article-newspaper","title":"GREAT

FLOOD AT TOOWOOMBA. SYDNEY,

…

<![endif]–>

Using the WordDown Word-to-HTML converter, the author converts the document to HTML5. WordDown does two things:

1. For practical reasons, it keeps the JSON-formatted code as a data URI – so that it can be re-used by software that understands the Zotero JSON format.

2. For more general, standards-compliant use, WordDown converts the Zotero data from JSON into Microdata, so that the complete citation information is hidden inline at the point of citation, like this:

Note that this microdata is a demonstration only, it does not have a formal namespace for the Zotero data. At this stage, this case study remains incompatible with those of Adams [4], which examines several formats, none of which are directly compatible with Zotero’s format and MacGillivary [2] which uses the BibJSON format which has been developed independently of Zotero’s JSON format.

How to Generate This Code

HTML5-embedded citations are not practical to type by hand, so tool support is needed. The solution delivered as part of this use case is designed for word-processor users. Part of the ongoing work on this project is to discuss with the developers of other tools how their tool-chains might support similar outputs. In particular the WordPress community does a lot of work on citation plug-ins, and Pandoc [4], supports generation of HTML with CSL support for citations, discussion on mailing lists supporting those communities has informed this case study.

3. Solution

The solution presented in this case study consists of three main parts:

Collaborative work with JISCHTML5 project participants on declarative specifications for embedding citation data in HTML5 publications. This is covered in the case study on core HTML5 structure [5] and in the commentary below about best practice.
A demonstration implementation in the WordDown HTML5 [6] conversion tool of how Zotero references in Word documents can be captured.
The demonstration Javascript software for reformatting citations, ReCite, uses the Citation Style Language (CSL). CSL is a language for describing citation formats, invented by a USacademic, Bruce D’Arcus. It is now used in at least two major reference management applications: Zotero and Mendeley[5].

This solution uses the citeproc.js library for formatting citations.

This work has had very limited impact so far, but it has attracted some interest from those working on related projects and in areas such as the digital humanities. It is an important area for investment by the Higher Education sector, though, for simple reasons of productivity. Huge amounts of time are expended on current citation practices in authoring, publishing and re-using academic content.

4. Challenges

The main challenge in this work is aligning effort that is happening in many different projects across the world, and choosing which of many competing metadata standards to support. While Zotero, as an open source product, and Mendeley as a free-to-use product, are both widely deployed, and can share citation formats via CSL, the JSON format used by the citation formatter citeproc.js is not very well documented and at this stage it is not clear exactly how much work will be involved to map the JSON format to HTML5 microdata and to other formats.

There are also problems with the citeproc.js library used in the ReCite code. It uses an XML processing library which does not appear to work very well in the Google Chrome browser or Apple's Safari. It is not clear at this stage if these problems can be resolved or an alternative found.

5. Conclusions

This case study has shown a proof of concept implementation that demonstrates that citations can be inserted into a document in one mode, in a word processing application, and be published in HTML5 with the semantic structure of the citation intact; so it can be reformatted on demand, and, more importantly, processed by machines. This is just one tool-chain, the same principle could be applied in many different workflows.

While the proof of concept works it is limited by interoperability problems between the citation and bibliographic formats in use today on the Web and in academic systems. This early work shows promise, but to reap substantial benefits, investment is needed in the following areas:

Cross-walk services so that systems using different standards for bibliographic metadata can be bridged. A public web API would be ideal, and would reduce costs for web developers working with all kinds of academic materials which have formal citations in them
Guides users following and implementing current common academic authoring, editorial and publishing workflows, including information literacy materials for use in higher education institutions.
Representing the needs of higher education and research in current commercially driven quasi-standards efforts, particularly the Schema.org consortium, which is defining standards for representing embedded in HTML5 materials.

We have an opportunity, via a modest effort in development, standardisation work and outreach, to realise the goal set out in this case study: to automate processing of in-text citations and reference lists in publications.

6. References

[1] Open Bibliography for Science, Technology, and Medicine. Jones, R., MacGillivray, M., Murray-Rust, P., Pitman, J., Sefton, P., O'Steen, B., & Waites, W. (2011). Retrieved from: http://www.dspace.cam.ac.uk/handle/1810/238406.

[2] Visualising Embedded Metadata, MacGillivray, M. HTML5 Case Studies, University of Bath: UKOLN.

[3] GREAT FLOOD AT TOOWOOMBA. SYDNEY, Friday. Gippsland Times, 1893. p.3.

[4] Semantics and Metadata, HTML5 Case Studies, Adams, S. UKOLN, University of Bath:.

[5] Sefton, P. (2012b). Conventions and guidelines for Scholarly HTML5 documents, HTML5 Case Studies, UKOLN, University of Bath.

[6] Sefton, P. (2012c). WordDown: Word to HTML5 conversion tool, HTML5 Case Studies, UKOLN, University of Bath.

Footnotes

[1] THATCamp Canberra - The Humanities And Technology unconference, 7-9 October, University of Canberra, http://thatcampcanberra.org/

[2] TROVE Digitised Newspapers and more, http://trove.nla.gov.au/ndp/del/home

[3] Zotero, http://www.zotero.org/

[4] Pandoc, http://johnmacfarlane.net/pandoc/

[5] Mendeley, http://www.mendeley.com/

HTML5 Case Studies: Introduction

Administrator — Mon, 16 Jul 2012 11:59:11 +0000

Case studies illustrating development approaches to use of HTML5 and related Open Web Platform standards in the UK Higher Education sector.

Author: Brian Kelly

1. About This Document

This document provides an introduction to a series of HTML5 case studies which were commissioned by the JISC. The document gives an introduction to HTML5 and related standards developed by the W3C and explains why these developments represent a significant development to Web standards, which is of more significance than previous incremental developments to HTML and CSS.

2. About HTML5

As described in Wikipedia [1] HTML5 is a markup language for structuring and presenting content on the Web. HTML5 is the fifth version of the HTML language which was created in 1990. Since then the language has evolved from HTML 1, HTML 2, HTML 3.2, HTML 4 and XHTML 1.

The core aims of HTML5 are to improve the language with support for the latest multimedia while keeping it easily readable by humans and consistently understood by computers and devices.

HTML5 has been developed as a response to the observation that the HTML and XHTML standards in common use on the Web are a mixture of features introduced by various specifications, along with those introduced by software products such as web browsers, those established by common practice, and the many syntax errors in existing web documents

It is also an attempt to define a single markup language that can be written in either HTML or XHTML syntax. It includes detailed processing models to encourage more interoperable implementations; it extends, improves and rationalises the markup available for documents, and introduces markup and application programming interfaces (APIs) for complex web applications.

For the same reasons, HTML5 is also a potential candidate for cross-platform mobile applications. Many features of HTML5 have been built with the consideration of being able to run on low-powered devices such as smartphones and tablets.

In particular, HTML5 adds many new syntactical features. These include the new,andelements, as well as the integration of Scalable Vector Graphics (SVG) content that replaces the uses of generic tags and MathML for mathematical formulae.

As illustrated in Figure 2 HTML5 is built on a series of related technologies, which are at different stages of standardisation (see [2]). These features are designed to make it easy to include and handle multimedia and graphical content on the web without having to resort to proprietary plugins and APIs. Other new elements, such as,,and, are designed to enrich the semantic content of documents. New attributes have been introduced for the same purpose, while some elements and attributes have been removed. Some elements, such as , and

have been changed, redefined or standardised. The APIs and document object model (DOM) are no longer afterthoughts, but are fundamental parts of the HTML5 specification. HTML5 also defines in some detail the required processing for invalid documents so that syntax errors will be treated uniformly by all conforming browsers and other user agents.

3. The Open Web Platform

The Open Web Platform (OWP) is the name given to a collection of Web standards which have been developed by the W3C [3]. The Open Web Platform has been defined as “a platform for innovation, consolidation and cost efficiencies” [4].

The Open Web Platform covers Web standards such as HTML5, CSS 2.1, CSS3 (including the Selectors, Media Queries, Text, Backgrounds and Borders, Colors, 2D Transformations, 3D Transformations, Transitions, Animations, and Multi-Columns modules), CSS Namespaces, SVG 1.1, MathML 3, WAI-ARIA 1.0, ECMAScript 5, 2D Context, WebGL, Web Storage, Indexed Database API, Web Workers, WebSockets Protocol/API, Geolocation API, Server-Sent Events, Element Traversal, DOM Level 3 Events, Media Fragments, XMLHttpRequest, Selectors API, CSSOM View Module, Cross-Origin Resource Sharing, File API, RDFa, Microdata and WOFF.

Use of the term Open Web Platform can be helpful in describing developments which make use of standards which complement HTML5.

The list of Web standards covered by the term provides an indication of the significant developments which are currently taking place which aim to provide much greater and more robust support for use of the Web across a variety of platforms and for a variety of uses.

4. Importance to Higher Education

The Web became of strategic importance to higher education in the mid 1990s primarily in its role as an informational resource. As the potential of Web became better understood new types of services were developed and the Web is now used to support the key areas of significance to higher educational institutions: teaching and learning and research.

However although innovative uses of the Web have been seen in these areas, the limitations of Web standards made it difficult and costly to develop highly-interactive cross-platform applications. Such difficulties meant that significant developments in use of the Web to provide applications (as opposed to access to information) was being led to large global companies, with Google's range of services such as Google Docs providing an example of a widely used Web-based application.

The experiences gained in developing such Web-based applications led to the evolution of Web standards to support such development work. In addition the growth in popularity of mobile devices led to the development of standards which could be used across multiple types of devices, in addition to the cross-platform independence which allowed Web services to be accessed across desktop PCs running MS Windows, Apple Macintosh or Linux operating systems.

Developments to the HTML5 standard enable multimedia resources to be embedded in HTML resources as a native resources. In addition developments to related standards, such as SVG (Scalable Vector Graphics) and MathML (the Mathematics Markup Language) together with developments to standards which support programmatic manipulation of objects defined in these markup languages will provide a rich environment for the development of new types of tools and services which will be value to support a range of institutional requirements.

In addition the support for mobile devices will enable access to this new generation of applications to be provided across a range of mobile devices, including iPhones and iPads, Android devices and smart phones and tablet computers which may use operating systems provided by other vendors.

In brief the development of HTML5 and the Open Web Platform can provide the following benefits across higher education:

A rich environment for the development of applications which can run in a Web browser.
A rich environment for the development of applications which can run across a range of platforms and suit the particular requirements of mobile devices.
A rich environment for defining the structure of scholarly resources, such as research papers, to support more effective processing of the resources.
A neutral and open environment based on use of open standards which can provide a level playing field for application development.

5. About the HTML5 Case Studies

The HTML5 case studies have been commissioned in order to demonstrate development approaches taking place across the higher education sector by early adopters in order to support a variety of use cases which are particularly relevant in a higher education context.

The case studies are aimed primarily at developers and technical managers who wish to gain a better understanding of ways in which development approaches based on use of HTML5 and Open Web Platform can be used.

Whilst the examples described in the case studies are being used across a number of higher educational institutions we appreciate that not all institutions will wish to make use of the approaches described in the case studies - in particular we recognise that institutions may not have the development and support expertise to emulate the approaches described in the following documents. However increasingly we are seeing commercial vendors making use of HTML5 in new versions of their products. This suggests vendor support for HTML5 may be a relevant factor that in the procurement of new applications.

6. Summary of the HTML5 Case Studies

The HTML5 case studies included in this work are summarised below:

Case Study 1: Semantics and Metadata: Machine-Understandable Documents by Sam Adams
Case Study 2: CWD: The Common Web Design by Alex Bilbie:
Case Study 3: Re-Implementation of the Maavis Assistive Technology Using HTML5 by Steve Lee
Case Study 4: Visualising Embedded Metadata by Mark MacGillivray
Case Study 5: The HTML5-Based e-Lecture Framework by Qingqi Wang
Case Study 6: 3Dactyl: Using WebGL to Represent Human Movement in 3D by Stephen Gray
Case Study 7: Challenging the Tyranny of Citation Formats: Automated Citation Formatting by Peter Sefton
Case Study 8: Conventions and Guidelines for Scholarly HTML5 Documents by Peter Sefton
Case Study 9: WordDown: A Word-to-HTML5 Conversion Tool by Peter Sefton

Semantics and Metadata: Machine-Understandable Documents: Case study 1 describes how embedding machine-understandable metadata into researchers' Web sites can help to enhance researchers' reputation by making their research outputs more visible, easier to discover and increasing their use.

CWD: The Common Web Design: Case study 2 describes the Common Web Design (CWD): the interface for the University of Lincoln’s online services. Developed with HTML5 and CSS3 technologies, the University of Lincoln's Common Web Design enables rapid development of attractive, interactive and modern Web sites.

Re-Implementation of the Maavis Assistive Technology Using HTML5: Case study 3 is aimed at those interested in applications that provide alternative or innovative user experiences using HTML5 Web applications. The focus is on assistive technology which is designed to enable wider access to media, apps and other online technology. This access may be for users who have varying access requirements, such as older users or those with physical or cognitive impairment. Alternatively it may be for use in environments that require alternative interaction styles, for example in bright light or with restricted access to a mobile device.

Visualising Embedded Metadata: Case study 4 addresses ways of enhancing the dissemination and discoverability of research outputs. Having achieved success in making bibliographic metadata available on a large scale there is now a need to demonstrate ways for individuals and small groups to interact easily and usefully with the data, in order to show the benefit of open bibliography and open publishing in general. This case study describes how HTML5 and related Open Web Platform standards such as JavaScript, DOM and SVG can be used to provide visualisations of embedded metadata.

The HTML5-Based e-Lecture Framework: Case study 5 focuses on providing a solution to allow e-lecture creators to convert their Microsoft PowerPoint presentations into online lectures in a simple and quick fashion. The resulting e-lecture can be easily deployed on an existing Web server and delivered to both desktop and mobile platforms.

3Dactyl: Using WebGL to Represent Human Movement in 3D: Case study 6 covers the development of 3Dactyl, a hardware and software configuration, which is intended to record and represent the physical movements of an individual online in three dimensions, for scholarly research purposes. Resulting 3D scenes (as an XML document) are embeddable within a standard Web page or VLE. Examples of such 3D footage might be various forms of performance art, e.g. dance, drama or even sport where the performance of play strokes can be carefully analysed. Within the same constraints of space, surgical or therapeutic procedures would be another feasible use. When such scenes are viewed on future versions of browsers, they will not, typically, require special plug-ins to use the 3D footage interactively

Challenging the Tyranny of Citation Formats: Automated Citation Formatting: Case study 7 looks at how citations and reference lists can be represented in HTML5 in two ways; firstly with reference information supplied in-page and secondly using URIs that point to trusted bibliographic data stores. The end goal is to automate as much of the citation and reference management experience as possible at all stages of the academic workflow, from research to authoring, to publishing to citation analysis, generation of metrics and machine processing of data.

Conventions and Guidelines for Scholarly HTML5 Documents: Case study 8 looks at the fundamentals of using HTML5 for scholarly documents of all kinds, particularly theses and courseware documents (with application to journal articles as well), but with an eye on a much broader spectrum of resources, including those which are the subject of other case studies in this project such as slide presentations. It will aim to establish the basic structural and semantic building blocks for how resources should be marked up for the Web, to increase their utility for people and machines, as well as help to ensure they can be preserved effectively.

WordDown: A Word-to-HTML5 Conversion Tool: Case study 9 examines ways that academic authors working with word processors such as Microsoft Word, the OpenOffice.org family and Google Docs would be able to produce compliant Scholarly HTML5.

References

[1] HTML5, Wikipedia, <http://en.wikipedia.org/wiki/HTML5>

[2] Sergey’s HTML5 & CSS3 Quick Reference. 2nd Edition, Sergey Mavrody, ISBN 978-0-9833867-2-8

[3] Open Web Platform, Wikipedia, <http://en.wikipedia.org/wiki/Open_Web_Platform>

[4] Jeffe Jappe, W3C CEO quoted in <http://www.w3.org/2001/tag/doc/IAB_Prague_2011_slides.html>