HTML5 Case Study 7: Challenging the Tyranny of Citation Formats: Automated Citation Formatting

Author: Peter Sefton

1. About This Case Study

HTML5 case study on "Challenging the Tyranny of Citation Formats: Automated Citation Formatting for HTML5" by Peter Sefton

What Is Covered

This case study looks at how citations and reference lists can be represented in HTML5 in two ways: first, with reference information supplied in-page and, second, using URIs that point to trusted bibliographic data stores. The end goal is to automate as much of the citation and reference management experience as possible at all stages of the academic workflow, from research to authoring, publishing, citation analysis, generation of metrics and machine processing of data. This case study covers the HTML5 coding that would be needed to support the goal of end-to-end citation processing, and provides a proof-of-concept demonstration of the principles. There are, however, some significant problems to overcome before we have working software that can be used by a broad range of academic authors.

An outcome of this case study is a JavaScript tool demonstrating how scholarly works of all kinds can be marked up in such a way that users can decide to display the citations and reference list in any format they choose via the Citation Style Language (CSL), initially choosing from a small set of commonly used formats. With further development, the tool would have access to the several hundred community-edited CSL styles available from the Zotero Project. The tool, ReCite, will continue to be developed at least until the completion of this JISC HTML5 work, and is documented on the Google Code wiki for the project.

The initial prototype builds on work in the JISC-funded Open Bibliography Project [1] and was developed in parallel with a case study by MacGillivray [2].

The case study has also been conducted with an eye to practical authoring tools, with consideration given to how people may create this kind of content using familiar tools such as word processors with Zotero, Mendeley and similar systems, LaTeX with BibTeX, wikis, and online Web editors such as those provided on WordPress.

The exemplars and documentation cover:

  • Techniques for adding citation and reference lists using URIs in HTML5 and implications for authoring.
  • The issues this work presents for tool-chains where authors are using word processors, LaTeX, etc., to create content.

The point of this is to build on other projects in this stream mentioned above, covering the basics of embedding citations and structuring documents to:

  • Show user-centric user interface possibilities for displaying citations.
  • Point the way to improved authoring services that do away with unnecessary formatting of citations and references altogether.
  • Together with Mark MacGillivray's visualisation project [2], show a potential future for citations and referencing that transcends the limitations of a system still rooted in paper-based conventions.

The issues discussed in this case study were the subject of a session at THATCamp, Canberra (The Humanities and Technology Camp) [FN1] in October 2011, convened by the author.

This case study attempts to discuss some of the workflow issues for all kinds of academic work involving referencing, but due to time and resource constraints, will focus on one particular tool-chain: authoring in Microsoft Word, using Zotero to manage references.

What Is Not Covered

There is no scope in this case study for a detailed survey of citation practices or lengthy discussion of the relative merits of embedding citation data vs using URIs. It aims to produce a starting point for one simple, useful tool, and to gain further experience for the JISC community with bibliographic data.

Also out of scope is the detailed work needed to map between citation formats, and to write connector-code for online services so that citation by reference can be resolved reliably to machine-readable bibliographic data.

Target Audience

The main audience for this work is tool developers building authoring systems, repositories and publishing infrastructure for academic documents. The outcomes could be used by people hand-coding documents, but that is not a very likely scenario.

It is also targeted at digitally adept researchers, such as those in the digital humanities, or working on Web-based scholarship tools.

2. Use Case

Citations and references are a key part of all academic practice across many resource types including, but not limited to:

  • Articles (best not to call them "papers" in an HTML5 project).
  • Theses and other student work such as essays.
  • Books.
  • Course materials.
  • CVs and portfolios.
  • Report documents such as this one.

Even today, when there are many tools available for managing references and formatting documents, students are taught how to hand-format references. At the aforementioned THATCamp session, two participants involved in information literacy noted that even though students should be using reference management software in their work, it is important that they can read the main citation formats used in their disciplines, and learning to hand-construct citations is a good way to learn.

Conventional citation and reference practice evolved under the constraints of paper-based publishing, not least of which was space. Web-based publishing has largely removed that constraint, but the conventions it produced remain very costly in terms of researchers' time. Looking ahead to a world where paper-based scholarship no longer requires us to invent and memorise rigorous, concise encodings for bibliographic references and citations, participants in the session agreed that it would be possible in Web contexts to show reference information in new ways, such as simple 'pop-up' tables that show the data set out with labelled fields. This would save us having to know, for example, whether the text in italics in a reference in a given format is the name of the article or the name of the journal, or whether the title in question belongs to a book or a report.

For now, citation formatting conventions are still important, hence this work on making it possible for users to choose the format they prefer, which improves the usability of online resources. Other benefits include:

  • Reliable machine processing for better citation metrics.
  • Reduced effort for authors, editors and readers who reuse citations.

Guidance for Authors

In addition to the solution described below, this case study looks at the broader issues around citation management in a Web world. This section proposes some best practice for academic authors.

The goals of this section are:

  • To describe current best practice for authors wanting to put academic resources on the Web with the most useful possible citations, including guidance on how to format HTML pages and descriptions of the demonstration tools developed in this case study.
  • To enable authors to meet publisher or university requirements for particular citation formats as easily as possible.
  • To provide a platform for change, where instead of publishers or markers requiring references to be formatted in a particular text-based format, the requirement is for rich citation data to be embedded.

Citation Data Online vs Embedded? Do Both

One of the main goals for best-practice HTML5 citation should be to make sure that citations have URIs that can be resolved to Web resources. That is, all bibliography should be published and authors should be able to use high-quality online sources when possible. This means that academic resources become part of the linked-data Web.

Take, for example, historians working with the National Library of Australia's Trove collection of Australian newspapers [FN2]. Using the Web interface, they can search the collection to locate an article. With Zotero [FN3] running in Firefox, an add icon appears in the browser so researchers can create a record for an article, such as one about flooding in Toowoomba, Queensland in 1893 [3]. If researchers share their Zotero libraries, they have now created a usable public URI with bibliographic information for that article. It is then possible to cite by reference to the bibliographic record.

While it might seem to be at odds with the above requirement to use URIs, it is also best practice in many situations to include a cached copy of the bibliographic data with the document. Including data inline means:

  • Data is available for re-formatting citations when working off-line.
  • Authors can fine-tune citation details to suit their own purposes.

Therefore, for maximum potential utility to readers, re-users, and machine-processing applications, it is important to include both a cite-by-reference URI and embedded bibliographic metadata. In disciplines where high-quality reference sources such as PubMed Central are available online, there may be no substantial advantage in using embedded metadata, whereas in the humanities, or in areas where URIs are likely to be less stable, it offers some extra insurance.

Formatting HTML5 citations

Following work by Sam Adams on the JISC HTML5 Project [4], subsequently refined by group discussion, the minimal citation for an online resource should be embedded in HTML5 in the following way:

<span itemtype="http://schema.org/ScholarlyArticle"
itemscope=""
itemprop="http://purl.org/ontology/bibo/cites">
DISPOSABLE citation marker here
<link itemprop="url" href="http://example.com/uri-for-citation/" />
</span>

The 'HTML5 way' to embed further data is to add more link and meta elements carrying the citation data at the point of citation. Sam Adams' case study [4] has chosen the Bibliographic Ontology for its namespace. For example, author names:

<span itemtype="http://schema.org/Person" itemscope="" itemprop="author">
<meta itemprop="name" content="Sam Adams" />
<link itemprop="url" href="http://example.com/uri-for-author" />
</span>
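
Because this markup is tedious to type by hand, a tool would normally generate it. The following is a minimal sketch of such a generator, not code from ReCite or WordDown. The function names `escapeHtml` and `buildCitation` and the `{ name, uri }` shape for authors are assumptions made for illustration; the element and attribute values are copied from the two fragments above.

```javascript
// Escape characters that are unsafe inside HTML text and attribute values.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Build the minimal cite-by-reference markup.
// `marker` is the disposable visible text; `uri` identifies the cited work.
// `authors` is optional: an array of { name, uri } objects (hypothetical shape).
function buildCitation(marker, uri, authors = []) {
  const authorSpans = authors.map((a) =>
    [
      '<span itemtype="http://schema.org/Person" itemscope="" itemprop="author">',
      `<meta itemprop="name" content="${escapeHtml(a.name)}" />`,
      `<link itemprop="url" href="${escapeHtml(a.uri)}" />`,
      "</span>",
    ].join("\n")
  );
  return [
    '<span itemtype="http://schema.org/ScholarlyArticle"',
    'itemscope=""',
    'itemprop="http://purl.org/ontology/bibo/cites">',
    escapeHtml(marker),
    `<link itemprop="url" href="${escapeHtml(uri)}" />`,
    ...authorSpans,
    "</span>",
  ].join("\n");
}

const html = buildCitation("[4]", "http://example.com/uri-for-citation/", [
  { name: "Sam Adams", uri: "http://example.com/uri-for-author" },
]);
console.log(html);
```

Escaping the marker and the URIs matters in practice, since titles and query strings routinely contain quotation marks and ampersands.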

While this is the 'HTML5 way' to embed data, the initial prototype of the ReCite application in this case study is implemented with a hack. In the code described in the Solution section below, data is simply embedded in the document using a data URI that encodes the JSON data supplied by Zotero. Because ReCite relies on citeproc.js, the most efficient way to store the data is in the JSON format that citeproc.js itself uses, described at:

https://github.com/citation-style-language/schema/blob/master/csl-data.json

This is not usual HTML5 best practice, and subsequent versions will attempt to use more standard approaches. To accomplish this, though, more work needs to be done to map the JSON format to the Bibliographic ontology.
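
The data URI technique can be sketched in a few lines. This is an illustration of the general idea only, not the ReCite or WordDown source: the helper names `toDataUri` and `fromDataUri` are hypothetical, and the citation object is a cut-down version of the Zotero data shown later in this case study.

```javascript
const PREFIX = "data:application/json,";

// Serialise a citation object and percent-encode it into a data URI.
function toDataUri(citation) {
  return PREFIX + encodeURIComponent(JSON.stringify(citation));
}

// Recover the citation object from a data URI produced by toDataUri.
function fromDataUri(uri) {
  if (!uri.startsWith(PREFIX)) {
    throw new Error("Not an application/json data URI");
  }
  return JSON.parse(decodeURIComponent(uri.slice(PREFIX.length)));
}

// Cut-down citation object; field names follow the Zotero JSON in this case study.
const citation = {
  citationItems: [
    {
      uris: ["http://zotero.org/users/568/items/B2F9H4I2"],
      itemData: {
        id: 393,
        type: "article-newspaper",
        title: "GREAT FLOOD AT TOOWOOMBA. SYDNEY, Friday.",
        "container-title": "Gippsland Times",
        issued: { "date-parts": [["1893", 2, 20]] },
      },
    },
  ],
};

const uri = toDataUri(citation);
const roundTripped = fromDataUri(uri);
console.log(roundTripped.citationItems[0].itemData.title);
```

The resulting string can be placed in the href of a link element. Software that understands the JSON format can then read the citation back without any network access.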

To illustrate this in a concrete way, let us take the Zotero example above, step by step.

  • The author cites the newspaper article about flooding using the Zotero tool in Microsoft Word, e.g. [3].
  • Behind the scenes, Zotero stores the citation in JSON format, with complete citation data, in a Word field. In the raw HTML that Word produces with 'Save as HTML', this looks as follows (truncated):

<!--[if supportFields]><span lang=EN-GB style='mso-fareast-language:EN-GB'><span style='mso-element:field-begin'></span>
ADDIN ZOTERO_ITEM CSL_CITATION
{&quot;citationID&quot;:&quot;j3o3mb0pi&quot;,&quot;properties&quot;:{&quot;formattedCitation&quot;:&quot;[2]&quot;,&quot;plainCitation&quot;:&quot;[2]&quot;},&quot;citationItems&quot;:[{&quot;id&quot;:55,&quot;uris&quot;:[&quot;http://zotero.org/users/568/items/B2F9H4I2&quot;],&quot;uri&quot;:[&quot;http://zotero.org/users/568/items/B2F9H4I2&quot;],&quot;itemData&quot;:{&quot;id&quot;:55,&quot;type&quot;:&quot;article-newspaper&quot;,&quot;title&quot;:&quot;GREAT
FLOOD AT TOOWOOMBA. SYDNEY,
<![endif]-->

  • Using the WordDown Word-to-HTML converter, the author converts the document to HTML5. WordDown does two things:

1. For practical reasons, it keeps the JSON-formatted code as a data URI – so that it can be re-used by software that understands the Zotero JSON format.

<link itemprop="url" href="data:application/json,%A0%20%7B%22citationID%22%3A%22ptFzCvsW%22%2C%22properties%22%3A%7B%22formattedCitation%22%3A%22%28Anon%201893%29%22%2C%22plainCitation%22%3A%22%28Anon%201893%29%5B2%5D%22%7D%2C%22citationItems%22%3A%5B%7B%22id%22%3A393%2C%22uris%22%3A%5B%22http%3A//zotero.org/users/568/items/B2F9H4I2%22%5D%2C%22uri%22%3A%5B%22http%3A//zotero.org/users/568/items/B2F9H4I2%22%5D%2C%22itemData%22%3A%7B%22id%22%3A393%2C%22type%22%3A%22article-newspaper%22%2C%22title%22%3A%22GREAT%20FLOOD%20AT%20TOOWOOMBA.%20SYDNEY%2C%20Friday.%22%2C%22container-title%22%3A%22Gippsland%20Times%22%2C%22publisher-place%22%3A%22Vic.%22%2C%22page%22%3A%223%22%2C%22event-place%22%3A%22Vic.%22%2C%22issued%22%3A%7B%22date-parts%22%3A%5B%5B%221893%22%2C2%2C20%5D%5D%7D%2C%22accessed%22%3A%7B%22date-parts%22%3A%5B%5B2011%2C10%2C17%5D%5D%7D%7D%7D%5D%2C%22schema%22%3A%22https%3A//github.com/citation-style-language/schema/raw/master/csl-citation.json%22%7D%20">

2. For more general, standards-compliant use, WordDown converts the Zotero data from JSON into Microdata, so that the complete citation information is hidden inline at the point of citation, like this:

<span itemprop="cites" itemscope="itemscope" itemtype="http://schema.org/ScholarlyArticle"><meta itemprop="id" content="393"><meta itemprop="type" content="article-newspaper"><meta itemprop="title" content="GREAT FLOOD AT TOOWOOMBA. SYDNEY, Friday."><meta itemprop="container-title" content="Gippsland Times"><meta itemprop="publisher-place" content="Vic."><meta itemprop="page" content="3"><meta itemprop="event-place" content="Vic."><span itemprop="issued"><meta itemprop="date-parts" content="1893220"></span><span itemprop="accessed"><meta itemprop="date-parts" content="20111017"></span><link itemprop="uri" href="http://zotero.org/users/568/items/B2F9H4I2"></span>

Note that this microdata is a demonstration only; it does not have a formal namespace for the Zotero data. At this stage, the approach in this case study remains incompatible with that of Adams [4], which examines several formats, none of which is directly compatible with Zotero's format, and with that of MacGillivray [2], which uses the BibJSON format, developed independently of Zotero's JSON format.
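
The JSON-to-microdata step can be sketched as below. This is not WordDown's code, only an illustration of the kind of mapping involved; `itemToMicrodata`, `datePartsToString` and `escapeAttr` are hypothetical names. One deliberate difference from the output shown above is an assumption of this sketch: dates are written as zero-padded, hyphen-separated strings, because running the date parts together without separators is ambiguous.

```javascript
function escapeAttr(value) {
  return String(value)
    .replace(/&/g, "&amp;")
    .replace(/"/g, "&quot;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Turn date-parts such as [["1893", 2, 20]] into a padded, hyphenated string.
function datePartsToString(dateParts) {
  return dateParts[0]
    .map((part, i) => String(part).padStart(i === 0 ? 4 : 2, "0"))
    .join("-");
}

// Convert one itemData object into microdata markup at the point of citation.
function itemToMicrodata(itemData, uri) {
  const parts = [
    '<span itemprop="cites" itemscope="itemscope" itemtype="http://schema.org/ScholarlyArticle">',
  ];
  for (const [key, value] of Object.entries(itemData)) {
    if (value && typeof value === "object" && Array.isArray(value["date-parts"])) {
      // Date fields become a nested span holding a single meta element.
      parts.push(
        `<span itemprop="${escapeAttr(key)}">` +
          `<meta itemprop="date-parts" content="${datePartsToString(value["date-parts"])}">` +
          "</span>"
      );
    } else if (typeof value === "string" || typeof value === "number") {
      parts.push(`<meta itemprop="${escapeAttr(key)}" content="${escapeAttr(value)}">`);
    }
    // Other structured fields (for example author lists) are skipped in this sketch.
  }
  if (uri) {
    parts.push(`<link itemprop="uri" href="${escapeAttr(uri)}">`);
  }
  parts.push("</span>");
  return parts.join("");
}

const itemData = {
  id: 393,
  type: "article-newspaper",
  title: "GREAT FLOOD AT TOOWOOMBA. SYDNEY, Friday.",
  "container-title": "Gippsland Times",
  page: "3",
  issued: { "date-parts": [["1893", 2, 20]] },
  accessed: { "date-parts": [[2011, 10, 17]] },
};

const microdata = itemToMicrodata(itemData, "http://zotero.org/users/568/items/B2F9H4I2");
console.log(microdata);
```

A production mapping would also need an agreed vocabulary for the property names, which is the namespace problem noted above.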

How to Generate This Code

HTML5-embedded citations are not practical to type by hand, so tool support is needed. The solution delivered as part of this case study is designed for word-processor users. Part of the ongoing work on this project is to discuss with the developers of other tools how their tool-chains might support similar outputs. In particular, the WordPress community does a lot of work on citation plug-ins, and Pandoc [4] supports generation of HTML with CSL support for citations. Discussion on the mailing lists supporting those communities has informed this case study.

3. Solution

The solution presented in this case study consists of three main parts:

  1. Collaborative work with JISC HTML5 project participants on declarative specifications for embedding citation data in HTML5 publications. This is covered in the case study on core HTML5 structure [5] and in the commentary below about best practice.
  2. A demonstration implementation in the WordDown HTML5 [6] conversion tool of how Zotero references in Word documents can be captured.
  3. The demonstration JavaScript software for reformatting citations, ReCite, which uses the Citation Style Language (CSL). CSL is a language for describing citation formats, invented by a US academic, Bruce D’Arcus. It is now used in at least two major reference management applications: Zotero and Mendeley [5].

This solution uses the citeproc.js library for formatting citations.
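
The principle of reformatting on demand can be illustrated without citeproc.js. The sketch below is a deliberately simplified stand-in: the two 'styles' are hand-written JavaScript functions rather than CSL style files, and none of the names (`styles`, `render`, `yearOf`, `authorOf`) come from ReCite or citeproc.js. It shows only that one stored record can yield different citation markers and reference entries depending on the reader's choice.

```javascript
// Hand-written stand-ins for citation styles. Real CSL styles are XML
// documents processed by a CSL engine; these functions only illustrate the idea.
const styles = {
  // Numeric style: the in-text marker is the item's position in the reference list.
  numeric: {
    citation: (item, index) => `[${index + 1}]`,
    reference: (item, index) =>
      `[${index + 1}] ${item.title} ${item["container-title"]}, ${yearOf(item)}. p.${item.page}.`,
  },
  // Author-date style: works with no recorded author are cited as "Anon".
  authorDate: {
    citation: (item) => `(${authorOf(item)} ${yearOf(item)})`,
    reference: (item) =>
      `${authorOf(item)} (${yearOf(item)}). ${item.title} ${item["container-title"]}, p.${item.page}.`,
  },
};

function yearOf(item) {
  return String(item.issued["date-parts"][0][0]);
}

function authorOf(item) {
  if (!item.author || item.author.length === 0) return "Anon";
  return item.author.map((a) => a.family).join(", ");
}

// Render every item in the chosen style.
function render(items, styleName) {
  const style = styles[styleName];
  return items.map((item, i) => ({
    citation: style.citation(item, i),
    reference: style.reference(item, i),
  }));
}

const items = [
  {
    id: 393,
    type: "article-newspaper",
    title: "GREAT FLOOD AT TOOWOOMBA. SYDNEY, Friday.",
    "container-title": "Gippsland Times",
    page: "3",
    issued: { "date-parts": [["1893", 2, 20]] },
  },
];

for (const styleName of Object.keys(styles)) {
  console.log(styleName, render(items, styleName)[0]);
}
```

In the real tool the style definitions come from CSL files, so supporting a new format means adding a style rather than writing new code.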

This work has had very limited impact so far, but it has attracted some interest from those working on related projects and in areas such as the digital humanities. It is an important area for investment by the Higher Education sector, though, for simple reasons of productivity. Huge amounts of time are expended on current citation practices in authoring, publishing and re-using academic content.

4. Challenges

The main challenge in this work is aligning effort that is happening in many different projects across the world, and choosing which of many competing metadata standards to support. Zotero, as an open source product, and Mendeley, as a free-to-use product, are both widely deployed and can share citation formats via CSL. However, the JSON format used by the citation formatter citeproc.js is not very well documented, and at this stage it is not clear exactly how much work will be involved in mapping the JSON format to HTML5 microdata and to other formats.

There are also problems with the citeproc.js library used in the ReCite code. It uses an XML processing library which does not appear to work very well in the Google Chrome browser or Apple's Safari. It is not clear at this stage if these problems can be resolved or an alternative found.

5. Conclusions

This case study has presented a proof-of-concept implementation demonstrating that citations can be inserted into a document in one mode, in a word processing application, and be published in HTML5 with the semantic structure of the citation intact, so that they can be reformatted on demand and, more importantly, processed by machines. This is just one tool-chain; the same principle could be applied in many different workflows.

While the proof of concept works, it is limited by interoperability problems between the citation and bibliographic formats in use today on the Web and in academic systems. This early work shows promise, but to reap substantial benefits, investment is needed in the following areas:

  1. Cross-walk services so that systems using different standards for bibliographic metadata can be bridged. A public web API would be ideal, and would reduce costs for web developers working with all kinds of academic materials that contain formal citations.
  2. Guides for users following and implementing current common academic authoring, editorial and publishing workflows, including information literacy materials for use in higher education institutions.
  3. Representing the needs of higher education and research in current commercially driven quasi-standards efforts, particularly the Schema.org consortium, which is defining standards for representing metadata embedded in HTML5 materials.

We have an opportunity, via a modest effort in development, standardisation work and outreach, to realise the goal set out in this case study: to automate processing of in-text citations and reference lists in publications.

6. References

[1] Jones, R., MacGillivray, M., Murray-Rust, P., Pitman, J., Sefton, P., O'Steen, B., & Waites, W. (2011). Open Bibliography for Science, Technology, and Medicine. Retrieved from: http://www.dspace.cam.ac.uk/handle/1810/238406.

[2] MacGillivray, M. Visualising Embedded Metadata, HTML5 Case Studies, UKOLN, University of Bath.

[3] GREAT FLOOD AT TOOWOOMBA. SYDNEY, Friday. Gippsland Times, 1893. p.3.

[4] Adams, S. Semantics and Metadata, HTML5 Case Studies, UKOLN, University of Bath.

[5] Sefton, P. (2012b). Conventions and guidelines for Scholarly HTML5 documents, HTML5 Case Studies, UKOLN, University of Bath.

[6] Sefton, P. (2012c). WordDown: Word to HTML5 conversion tool, HTML5 Case Studies, UKOLN, University of Bath.


Footnotes

[FN1] THATCamp Canberra - The Humanities And Technology unconference, 7-9 October, University of Canberra, http://thatcampcanberra.org/
[FN2] TROVE Digitised Newspapers and more, http://trove.nla.gov.au/ndp/del/home

