Technical Foundations Blog

Blog no longer active

Administrator — Thu, 02 May 2013 16:18:44 +0000

Development of the site and content ceased because of the withdrawal of JISC funding for the Information Support Centre at UKOLN from August 2013, which was announced in October 2013. As a result, this blog is no longer being updated and comments have been switched off on all previous posts.

Instant Messaging: Past, Present and Future

Talat Chaudhri — Fri, 08 Mar 2013 11:22:57 +0000

A brief history...

Instant messaging has been around on the Internet for longer than the World Wide Web. In its earliest, purest (and, it's probably fair to say, crudest) form, it was possible to use the Unix command line tool write to output a message to another user's terminal, provided that they had previously typed mesg y (i.e. messaging yes), or indeed to directly echo or even cat the contents of a file to another terminal. Surprisingly, there was also a tool for real-time typing to the other terminal, talk, which eventually settled on a split-screen approach. (Far more recently, this was one of the supposed "killer" features of Google Wave before its development was abandoned - yet it had existed in a simpler form many years before.) While the original write and talk utilities have been gradually improved so that they can talk to users on different servers - and, for example, provide security over the Secure Sockets Layer (SSL) - they were never user-friendly tools for the non-technical user. They are still installed by default on some Unix/Linux distributions but are little used even by developers, given the huge variety of more modern, scalable technologies.

What was not provided by these early utilities was the ability to have anything but the crudest control over who could chat with you. Once you had typed mesg y, anybody on that server could contact you until you typed mesg n (i.e. messaging no). In addition, giving somebody else control over your terminal was a major security issue. Moreover, the modern concepts of contact lists (i.e. friends) and presence information (e.g. available, busy, offline etc) were also missing. You can still tell, although only by user name, who is logged into the same Unix/Linux server as you by typing simply who into the terminal.

The next great step in the history of instant messaging was Internet Relay Chat (IRC). This essentially provides command-line chat rooms, which can be made at least somewhat more user-friendly through graphical user interface (GUI) tools such as mIRC, as well as private messages to individuals. While it is not particularly obvious how to indicate presence, amongst the myriad other commands that are available, all of the functions that one would expect in modern instant messaging are available. It was later made available over SSL, which provides basic security from snooping. However, IRC remains susceptible to netsplits, takeover wars from hackers, denial-of-service attacks, and one is not automatically guaranteed a unique identifier or nick if it has already been used on that server by another user or if the server does not allow a nickserv (nickname registration server). Despite all these failings and its consequent decline in popularity, IRC remains popular with developer communities because of its relative simplicity, in addition to a certain retro chic.

The MSN era and beyond

Instant messaging came to the ordinary user through a myriad of mutually non-interoperable commercial protocols, each with their own Graphical User Interface (GUI) provided by the company in question and many spin-off open source replacements. The underlying technology behind these protocols was not published, but they effectively supplied what would now be called an Application Programming Interface (API), stating how developers could write tools that could communicate with their servers. One could not simply run one's own server because the underlying technology was proprietary. Many of these are still in use, for example AIM, Yahoo Instant Messaging etc, and perhaps the largest, Microsoft's MSN or later Windows Live Messenger, has only just been retired through a merger with Skype. (This has, as a side effect, removed the ability of MSN users to chat with users of Yahoo Instant Messaging, as this is not possible in Skype.) For most users, all that has changed since those days is the gradual migration to new tools such as Skype, which adds voice and video chat, and Facebook chat, which is merely convenient because of the critical mass of contacts who are already on Facebook. Similarly, Google Talk offers IM services to anybody who already has a Google account and uses GMail for web-based email. Both Facebook and Google Talk have later added audio and video chat. Together, these dominate the market because they are attached to the most widely used Internet services and are accessible to ordinary, non-technical users. In the case of Facebook and Google Talk, there is the added advantage of access via the Web without downloading any dedicated software.

Open Standards

Both Google Talk and Facebook chat are particularly interesting because, unknown to the bulk of their users, they implement the open standard XMPP (also known as Jabber), although Facebook is not fully compliant with the roster system that enables one to have contacts across different XMPP servers. The reason for this is, of course, that they only want Facebook users to chat with other Facebook users rather than enable chat with other XMPP users, which would naturally include competitors such as Google Talk that also implement the protocol. However, the competition for instant messaging does not seem to be as fierce as it was, and the competitors have formed agreements: Facebook chat is now integrated into Skype despite Facebook offering competition in audio and video chat tools from its own Web site. This may be because the free service is effectively a loss leader: it does not provide commercial income directly. Instead, Skype market additional paid services such as providing Skype Out (calling landline or mobile telephones), providing users with external telephone numbers and voicemail services, group video chat and so on; similarly, Facebook make their revenue through advertising on their site, which is attractive because of the free social networking tools, including instant messaging, audio and video chat. It appears to be in everybody's interest to cooperate to some degree.

Google stands out among the other commercial players in allowing its users to chat to other XMPP users who have accounts on different servers, either commercial, free or privately operated. One can talk to Google Talk users (or any other XMPP users) using a free account with jabber.org or even run one's own server (as one could with IRC) using ejabberd or similar open source XMPP server software. However, audio and video chat is limited to users with Google accounts, providing the incentive to prefer their all-in-one, one-stop shop approach to Internet services, which is convenient for most users. The development of open source extensions to XMPP has been slow. It is still difficult to find XMPP servers that deploy Jingle, the extension for audio and video chat, which is considerably harder to do effectively than merely installing an XMPP server, which is the work of an hour or two.

While XMPP is the de facto standard for modern IM, both for open source and increasingly for commercial services, it is not without criticism. It is verbose, relying on XML, which can be a problem where bandwidth is limited. This is a small problem for IM services but a much larger one when, for example, audio and video streams are added: it does not support binary data streams natively. It is designed for a federated network run on a number of servers and its network vulnerability, while not as high as that of IRC, remains a structural issue. It uses massive unicasting and does not support multicasting, which is a minor efficiency issue in chat rooms but becomes much more of a problem for group audio and video streaming. It is possible to directly substitute a newer, although relatively little known, protocol called PSYC, an interserver that supports XMPP and IRC natively, which alleviates most of these problems. This allows XMPP and IRC to interoperate seamlessly, in addition to enabling fine control over notifications to and from other systems, friendcasting, multicasting, news federation, interoperability with microblogging systems such as Twitter and so on, via programmable chatrooms. It takes about an hour or two to set up the psyced server, about the same as a basic IRC or XMPP server. This does, however, retain the federation approach: an entirely re-engineered Peer-to-Peer (P2P) approach is under development for future iterations of the protocol. As it is an open source project of interest mostly to technical users, development has been relatively slow.
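To give a rough sense of the verbosity and unicasting points above, the following minimal Python sketch builds a single chat message as an XML stanza of the general kind that XMPP uses and compares its size with the text that it carries. The addresses and function names are invented for illustration, and real stanzas usually carry further attributes and namespaces, so the figures are indicative only. The second function simply counts the separate copies that a chat room server has to send when multicasting is not available.

```python
import xml.etree.ElementTree as ET


def build_message_stanza(sender: str, recipient: str, text: str) -> bytes:
    """Serialise a minimal XMPP-style <message/> stanza (illustrative only)."""
    message = ET.Element(
        "message", {"from": sender, "to": recipient, "type": "chat"}
    )
    body = ET.SubElement(message, "body")
    body.text = text
    return ET.tostring(message, encoding="utf-8")


def unicast_copies(room_size: int) -> int:
    """Copies sent when one occupant speaks and each of the others
    receives a separate unicast copy."""
    return max(room_size - 1, 0)


if __name__ == "__main__":
    text = "Hello"
    # Hypothetical addresses on two different servers.
    stanza = build_message_stanza("alice@example.org", "bob@example.net", text)
    print(stanza.decode("utf-8"))
    print(f"payload: {len(text.encode('utf-8'))} bytes, stanza: {len(stanza)} bytes")
    print(f"copies sent in a 50-person room: {unicast_copies(50)}")
```

For a short text message the XML wrapper is several times larger than the payload, which matters little for text chat but illustrates why the same approach scales poorly to audio and video.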

Voice Over IP

Coming to the same market from a diametrically opposed perspective is the SIP standard for Voice Over Internet Protocol (VOIP), which began as an audio service and later developed both IM and video services in addition. This is widely used in the commercial sector: for example Vonage in the UK. There are open source varieties that can be deployed by anybody, albeit with some technical difficulty, such as Asterisk and FreeSWITCH. These incur costs only where they connect to the Public Switched Telephone Network (PSTN) that provides ordinary landline telephony, but they also enable low-cost, in-house management of telephone extensions, voicemail and related services, and make telephony available through computer terminals as well as telephones. One can manage distributed calling, effectively enabling call centres, using this free technology, which can be installed even on a home server. While most people would not have a particular reason to go to such effort, the entry costs to setting up complex systems have been radically reduced to the point where they would now be affordable for small organisations who can rely either on voluntary contributions of development effort or who can outsource the work cheaply.

Why is this important?

Technologies such as XMPP may not be of immediate interest to the average Internet user, either in the HE sector or more widely. However, they underlie so many of the Internet services that we may use on a daily basis that issues such as interoperability of services via open standards are worth knowing about, at the very least in order to gain an understanding of the relative difficulty of providing such services and the costs involved. Given that more and more reliance is being put on an increasingly small group of major providers of Internet services by vast numbers of ordinary users, the consequences for privacy and management of personal information are potentially immense. The intense debate about whether a federated approach, relying on a network of servers, or a peer-to-peer approach is the best way (or even a feasible way) to mitigate these risks is relevant to many other technologies, of which instant messaging is only one: the most significant of these may be social networking. For most people, social networking is vastly more important than, for example, darknet services and/or file sharing, which currently account for the large bulk of peer-to-peer services in widespread use. Indeed, it is social networking, which typically gathers a number of pre-existing technologies together for convenience around the core microblogging service, that best highlights the widely differing approaches to the future of the architecture of Internet services.

URIs and URLs: Quick Reference

Talat Chaudhri — Mon, 25 Feb 2013 11:25:47 +0000

The difference between URIs and URLs has been explained elsewhere. The type of URL that one generally sees is an HTTP URL. You can think of the family of URIs as in the diagram below, which shows some of the most commonly encountered URIs (not by any means a complete list).

All of the URL protocols are associated with commonly used Internet services, of which the World Wide Web is only one, using the HTTP(S) scheme. The secure variants are mostly provided using the Secure Sockets Layer (SSL) or its successor Transport Layer Security (TLS), except in the case of Secure Shell (SSH), which has its own built-in encryption protocol. Unless the data being transmitted is not sensitive, e.g. ordinary web pages, it is almost always a good idea to use the secure varieties of the URL protocols. The exception is where the data is being sent via an SSH tunnel, or where other security is in place that protects against both external and other local users; note that a Virtual Private Network (VPN), for example, will not prevent snooping by other users of that private network. The port is written as a colon followed by a number, and the default port is generally assumed where it is omitted, although technically there is nothing preventing the use of a non-standard port except the inevitable confusion with other services using those ports. There are well known alternative ports used by developers for testing and similar purposes, e.g. port 8080 instead of the usual 80 for web servers.
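The way that a default port is assumed where none is given can be sketched in a few lines of Python using the standard library's urlsplit function. The table of defaults below is an assumption made for this example, listing only a handful of conventional ports for schemes mentioned in this post, and effective_port is an invented helper name.

```python
from urllib.parse import urlsplit

# Conventional default ports for a few schemes discussed in this post.
# Illustrative assumption only: this is not an exhaustive registry.
DEFAULT_PORTS = {
    "http": 80,
    "https": 443,
    "ftp": 21,
    "ssh": 22,
    "sftp": 22,
    "telnet": 23,
    "pop": 110,
    "pops": 995,
    "imap": 143,
    "imaps": 993,
}


def effective_port(url: str):
    """Return the explicit port if one is given, otherwise the scheme's
    conventional default, or None if the scheme is not in the table."""
    parts = urlsplit(url)
    if parts.port is not None:
        return parts.port
    return DEFAULT_PORTS.get(parts.scheme)


if __name__ == "__main__":
    for url in (
        "http://www.google.com/",
        "http://www.google.com:80/",
        "http://localhost:8080/",
    ):
        print(url, "->", effective_port(url))
```

The first two addresses resolve to the same port, which is the sense in which they are equivalent, whereas the third shows a developer's alternative port overriding the default.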

While most users will not need to know the protocol syntax for the majority of the URL schemes apart from HTTP(S), and will almost certainly never need to know about the syntax of URN schemes, nevertheless these services underlie the functionality of the Internet services that they use every day. It is at least a good idea to understand the basic pattern, which most schemes share, together with an understanding of when and how to use SSL/TLS. Most users will know roughly how an HTTP(S) URL works, which follows the basic pattern used in most of the other URL schemes. Some protocol schemes are rarely seen expressed as an address in actual software implementations, even though they are widely used: this depends on the purpose and nature of the protocol, and to some extent on whether or not it is ever directly accessed from a command line tool in practice. Most non-technical users never do this except in the case of typing Web addresses into the address bar of a browser, which is why only the HTTP(S) protocol is ubiquitous to the general public.
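The basic pattern shared by most of these schemes (scheme, optional user, host, optional port, then a path) can be illustrated with a short Python sketch. It assumes nothing beyond the standard library's urlsplit function; describe_url is an invented helper name and the address is a made-up example in the style of those used in this post.

```python
from urllib.parse import urlsplit


def describe_url(url: str) -> dict:
    """Break a URL into the parts that most URL schemes have in common."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,   # e.g. http, sftp, imaps
        "user": parts.username,   # None when no user is given
        "host": parts.hostname,
        "port": parts.port,       # None when the default port is implied
        "path": parts.path,
    }


if __name__ == "__main__":
    example = "sftp://bob@myserver:22/my_folder_path/my_file.example"
    for name, value in describe_url(example).items():
        print(f"{name}: {value}")
```

The same function works unchanged on an HTTP(S) address, which is why familiarity with Web addresses carries over to the less visible schemes.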

You will notice that the URL scheme (for locating resources) has a large number of very commonly used protocols, whereas the URN scheme (for naming resources) is not as well known but remains technically important for more complex naming schemes, where more specific semantics are required. These are typically used by libraries and in developing Internet services that require access to large data sets about electronic and real-world resources, but are not seen by the average Internet user. Officially, all of these schemes are URI schemes, including both URLs and URNs, but here they are separated by those that locate resources and those that do not.

    URI
     |
     +--- URL
     |     |
     |     +--- HTTP e.g. http://www.google.com/ (using default port 80, equivalent to http://www.google.com:80/)
     |     |     |
     |     |     +--- HTTPS (secure/encrypted) e.g. https://accounts.google.com/ServiceLogin (using default port 443, equivalent to https://accounts.google.com:443/ServiceLogin)
     |     |
     |     +--- SMTP e.g. smtp://bob.fisher@mymailservice.com:25 (also mailto:bob.fisher@mymailservice.com)
     |     |     |
     |     |     +--- SMTPS (secure/encrypted) e.g. smtps://bob.fisher@mymailservice.com:465 (also mailto:bob.fisher@mymailservice.com)
     |     |  
     |     +--- POP3 e.g. pop://bob.fisher@mymailservice.com:110 (for downloading email from a remote server)
     |     |     |
     |     |     +--- POP3S (secure/encrypted) e.g. pops://bob.fisher@mymailservice.com:995
     |     |
     |     +--- IMAP4 e.g. imap://bob.fisher@mymailservice.com:143 (for synchronising email with a remote server)
     |     |     |
     |     |     +--- IMAP4S (secure/encrypted) e.g. imaps://bob.fisher@mymailservice.com:993
     |     |
     |     +--- FTP e.g. ftp://bob.fisher@myserver.com/my_folder_path/my_file.example (or ftp://bob.fisher@myserver.com:21/my_folder_path/my_file.example)
     |     |     |
     |     |     +--- FTPS e.g. ftps://bob.fisher@myserver.com:990/my_folder_path/my_file.example (it is now more normal to use SFTP via SSH instead)
     |     |
     |     +--- XMPP e.g. xmpp://bob.fisher@mychatservice.com:5222 (e.g. for GTalk, Facebook, jabber.org or other open protocol instant messaging)
     |     |     |
     |     |     +--- XMPPS (secure/encrypted) e.g. xmpps://bob.fisher@mychatservice.com:5222 (over the same default port or the legacy 5223)
     |     |
     |     +--- IRC e.g. irc://myircserver.org:6667/#mychatchannel
     |     |     |
     |     |     +--- IRCS (secure/encrypted) e.g. ircs://myircserver.org:6697/#mychatchannel
     |     |
     |     +--- TELNET e.g. telnet://bob:mypassword@myserver:23 (highly insecure for command line access but occasionally used for other purposes)
     |           |
     |           +--- TELNET (secure/encrypted), as above but using SSL and either the same port or the SSH port 22, usually abandoned in favour of SSH
     |           |
     |           +--- SSH (secure/encrypted) e.g. ssh://bob:mypassword@myserver:22 (for command line access and related purposes)
     |           |
     |           +--- SFTP (secure/encrypted) e.g. sftp://bob:mypassword@myserver:22 (for file downloads, with the related UNIX/LINUX/POSIX scp command)
     |
     +--- URN (these examples were taken from Wikipedia)
     |           |
     |           +--- International Standard Book Number (ISBN) e.g. urn:isbn:0451450523 (the book The Last Unicorn, by Peter S. Beagle, 1968)
     |           |
     |           +--- International Standard Audiovisual Number (ISAN) e.g. urn:isan:0000-0000-9E59-0000-O-0000-0000-2 (the film Spider-Man, 2002)
     |           |
     |           +--- International Standard Serial Number (ISSN) e.g. urn:issn:0167-6423 (the scientific journal Science of Computer Programming)
     |           |
     |           +--- Request For Comments (RFC) for memoranda of the Internet Engineering Task Force (IETF) on internet standards and protocols, e.g. urn:ietf:rfc:2648
     |           |
     |           +--- MPEG7 e.g. urn:mpeg:mpeg7:schema:2001 (the default namespace rules for MPEG-7 video metadata)
     |           |
     |           +--- Object Identifier (OID), e.g. urn:oid:2.16.840 (the United States of America)
     |           |
     |           +--- UUID e.g. urn:uuid:6e8bc430-9c3a-11d9-9669-0800200c9a66 (a type of unique identifier that is mathematically improbable to duplicate, version 1)
     |           |
     |           +--- National Bibliography Number (NBN) e.g. urn:nbn:de:bvb:19-146642 (a document in the Bibliotheksverbund Bayern, Germany, with library and document number)
     |           |
     |           +--- European Union Directive e.g. urn:lex:eu:council:directive:2010-03-09;2010-19-UE (using the Lex URN namespace for legislation)
     |     
     +--- URC (internet standard proposal never developed, largely replaced by XML, RDF, JSON etc in providing metadata)
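The URN examples in the diagram all follow the pattern urn, then a namespace, then a namespace-specific string, separated by colons. As a minimal sketch of how software might take these apart, the Python below splits a URN and then applies two namespace-specific checks: the ISBN-10 check digit (a weighted sum modulo 11) and the version number embedded in a UUID. The function names split_urn and isbn10_is_valid are invented for this example; only the uuid module is part of the standard library.

```python
import uuid


def split_urn(urn: str):
    """Split 'urn:<namespace>:<specific part>' into (namespace, specific part)."""
    pieces = urn.split(":", 2)
    if len(pieces) != 3 or pieces[0].lower() != "urn" or not pieces[1] or not pieces[2]:
        raise ValueError(f"not a URN: {urn!r}")
    return pieces[1].lower(), pieces[2]


def isbn10_is_valid(isbn: str) -> bool:
    """Check an ISBN-10: digits weighted 10 down to 1 must sum to a multiple of 11."""
    chars = isbn.replace("-", "").upper()
    if len(chars) != 10:
        return False
    total = 0
    for position, char in enumerate(chars):
        if char == "X" and position == 9:
            value = 10  # 'X' stands for ten, and only in the final position
        elif char in "0123456789":
            value = int(char)
        else:
            return False
        total += (10 - position) * value
    return total % 11 == 0


if __name__ == "__main__":
    namespace, specific = split_urn("urn:isbn:0451450523")
    print(namespace, specific, "valid:", isbn10_is_valid(specific))

    namespace, specific = split_urn("urn:uuid:6e8bc430-9c3a-11d9-9669-0800200c9a66")
    print(namespace, "version:", uuid.UUID(specific).version)
```

The point of the sketch is that the urn prefix and namespace tell software which rules apply to the remainder of the string, which is what is meant by a naming scheme with more specific semantics.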

As noted, Uniform Resource Characteristics (URC) were abandoned in the early history of the internet. Numerous anomalies have developed, such as that noted above where both FTPS and SFTP perform similar functions in a different way, or where some services use a different port for SSL/TLS but XMPP usually does not. The Digital Object Identifier (DOI) scheme is effectively a URN scheme but has never been registered as such and performs the same function whilst officially remaining a URL scheme.

Why should universities care about identifiers?

Talat Chaudhri — Fri, 17 Aug 2012 15:19:53 +0000

Why do identifiers matter for research?

Imagine that you are a senior manager in an institution within the UK Higher Education sector with responsibilities for research: you have read some basic details about unique researcher identifiers and perhaps institutional identifiers. However, it may not be immediately apparent just how important these issues are, which may seem on the face of it to be a relatively superficial and/or trivial organisational matter. Clearly, any such strategic decision-maker will long have been aware of the demands of the Research Excellence Framework (REF) and its predecessor the Research Assessment Exercise (RAE), in which successful reporting of the best research outputs of university departments is crucial to the on-going funding of the institution. This is particularly central to the work of research-led universities, which is an increasingly competitive sector: even universities that formerly focussed more on teaching than research are increasingly aware of the need to drive up standards of quality research in order to secure additional funding.

The reality of unique identification in research

However, as anyone who has actually engaged with the business of research reporting to any degree will tell you, it is far from a superficial or trivial matter to carry out such an exercise without thinking very carefully about how researchers are identified; moreover, identifying the research groups, departments, projects and institutions that they may have variously belonged to at different times, all of which may have been re-organised on many occasions, is a considerable challenge that raises significant technical as well as organisational issues.

Perhaps the biggest problem of all derives from the scale of research reporting. On such a massive scale, it has to be done in a systematic way across higher education institutions in order to be useful. Any lack of a systematic approach in collecting the information on the institutional level will inevitably result in higher costs in processing the information later into a useful form, for example by governmental organisations such as HESA and the Research Councils (RCUK) relevant to each area of academic study. This may be carried out for a variety of reasons, amongst them for example:

The need to produce statistics at a national and at an institutional level in order to gauge how successfully different parts of the research community are performing in comparison to each other and to similar institutions internationally, which may be a determinant of how funding is allocated.
The production of good, widely accessible information about the work of academic researchers and research groups for the purposes of future research, both in identifying research as a basis for future work and for guiding individuals and groups in terms of who they might work with in future, who their competitors may be, and in creating wider bibliographic information for a whole range of purposes related to future publications.
Open Access, an increasing requirement imposed by funders where research is publicly funded.
Accountability in the use of public funds for research.

It is precisely the lack of a national approach to providing consistent metadata about individuals and groups connected with research that raises costs, creates inefficiencies and frustrates the development of new software functionality. This makes the jobs of research managers more difficult and ultimately reduces both the funds available for research and their best use within the sector. It is therefore the business of senior managers of academic research to care about identifiers.

Researcher identifiers: a crucial first step

Before any wider metadata about research may be considered, the most fundamental issue is identifying individuals who carry out research. Before this happens consistently on a national level, there is little point addressing the subsequent issue of identifying groups and institutions engaged in research consistently. It is also important to consider any national approach in terms of interoperability with other international approaches wherever possible: while, on the one hand, funders and statistics agencies can only hope to mandate national identifier schemes, at the same time it is clear that research collaboration is cross-institutional and international in scope, in some cases including researchers from numerous countries in one project or even in the production of one individual paper, data set or other research activity. This is the approach that has been taken by the JISC, together with RCUK, HESA and other partners in setting up the Research Identifiers Task and Finish Group, which is due to report in October 2012.

One emerging candidate with cross-sector and international support is the ORCID researcher identifier scheme, whose rapid development in 2011-12 is scheduled to culminate in a public launch in October 2012. There are, of course, existing, widely-used but relatively simple identifiers such as the HESA researcher identifier, and identifiers provided through commercial providers' web interfaces, but thus far these have not provided dependable unique identification. All such identifiers could be linked to a system like ORCID that is designed on interoperable principles and is not dependent on any particular software platform or web interface. An alternative approach is taken by the ISNI number: whereas ORCID seeks to offer individual researchers and institutions the ability to manage their data on a distributed model, ISNI represents a centrally moderated, bibliographic approach led by national libraries and other similar institutions with national and strategic responsibilities. It remains to be seen whether these different approaches are in competition or whether they will offer different but complementary functionality within the sector, and much may be dependent on how software vendors implement them.

Current Research Information Systems (CRIS)

It is not simply a matter of tracking publications and other related outputs, for example in institutional repositories. This part of the equation is by now relatively well established in the UK HE sector, although it continues to develop: the issues surrounding Open Access, for example, have not been fully resolved. This, however, is just at the level of the final outputs of research and does not provide anything like sufficient insight into the processes of research, the projects and groups carrying it out, the staff involved or the costs. Traditionally, this information has been gathered in a very long-winded process that is individual to each institution's particular workflows and processes (although there are obviously great similarities of approach between them), often a partly paper-based exercise that has been migrated to an extremely varied range of systems and databases, few of which are interoperable or complete. Many departments may be involved in the process apart from the institution's research office and the department in which the researchers are based, but perhaps the most significant would be the finance office, the human resources department and the library, to name just the key players. It will be necessary to keep some information confidential, e.g. personal staff information, salaries and so forth, to share some information internally and with research funders, and to publish other information, e.g. in a research repository that forms the institution's "shop window" of public outputs, library databases and so forth. The term Research Information Management (RIM) has emerged to cover all of these information gathering and information processing activities.

In order to do this systematically, more sophisticated research information management software has been developed, often known as Current Research Information Systems (CRIS). The market in the HE sector is currently led, in terms of the number of institutions adopting the software, by PURE, produced by ATIRA; other major players are Symplectic Elements, and CONVERIS, produced by AVEDAS. A more recent entrant to this market is Thomson Reuters' Research in View. There are currently no open source products, although a JISC-funded modular approach by the Research Management and Administration Service (RMAS) project may have an increasing impact in this area, depending on subsequent adoption by HE institutions. It is not an overstatement to say that HE institutions are currently in a rush towards early adoption of these CRIS systems, motivated by the need to use research data to compete with each other for funding opportunities.

Next steps: organisational identifiers

In the next 2-3 years, it is likely that the matter of unique researcher identification will be resolved through the emergence of a dominant standard that has sufficient take-up and leverage in the UK and international HE sector to facilitate the work of research institutions and funders. Following this, there will be organisational structures associated with research that will require unique identification, often on a multi-layered basis: for example, a project may be at several institutions, perhaps internationally, and their staff may be in various departments or similar units whose names have changed or have been merged or de-merged at various times, all of which will require careful date and time stamping to make the information reliable for the period that it covers. There will be issues related to copyright, commercialisation and spin-off companies that make the precise provenance of research critical to the future success of academic research and development. Standards for organisational identifiers are therefore the next important issue on the horizon. As with researcher identification standards, this is a rapidly developing area of which research managers and senior managers with strategic responsibility for research will need to keep abreast.
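The date stamping described above can be sketched as a small data structure. This is only an illustration of the idea: the class UnitName, the function name_on, the identifier unit-001 and the department names are all invented for the example and do not correspond to any existing identifier standard.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class UnitName:
    """One name that an organisational unit carried during a given period."""
    unit_id: str                     # hypothetical stable identifier for the unit
    name: str
    valid_from: date
    valid_to: Optional[date] = None  # None means the name is still current

    def covers(self, day: date) -> bool:
        """True if this name was in force on the given day."""
        return self.valid_from <= day and (
            self.valid_to is None or day <= self.valid_to
        )


def name_on(history, unit_id: str, day: date) -> Optional[str]:
    """Look up what a unit was called on a particular day, if known."""
    for record in history:
        if record.unit_id == unit_id and record.covers(day):
            return record.name
    return None


if __name__ == "__main__":
    # Invented example: one unit that was renamed after a re-organisation.
    history = [
        UnitName("unit-001", "Department of Computing", date(2001, 8, 1), date(2009, 7, 31)),
        UnitName("unit-001", "School of Computer Science", date(2009, 8, 1)),
    ]
    print(name_on(history, "unit-001", date(2005, 1, 1)))
    print(name_on(history, "unit-001", date(2012, 8, 17)))
```

Keeping the identifier stable while the name and its validity period change is what allows a research output from one year to be attributed reliably to a unit that has since been renamed or merged.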

Microservices in (and beyond) Research Information Management

Talat Chaudhri — Fri, 25 May 2012 15:37:53 +0000

Microservices: are they all that new?

Recently there has been something of a revival of interest in a small-scale development approach towards software design for repositories: microservices. This is far from an entirely new idea but seems to have been somewhat slow to develop in practice, even to date; a useful summary of the approach was given by Neil Jacobs back in 2010. Moreover, a modular approach towards software that fulfils various related functions in managing web content related to research clearly has a much longer history, and is not in itself particularly surprising in software development more broadly. However, it seems that microservices as an approach is gradually acquiring a clearer identity within this space, so it may be worth taking a look back at the nature of the types of software used in managing research content of various types, how they are related, and whether and to what extent terms like "repository", "Current Research Information System", "Research Information Management system" and so forth overlap in terms of software functionality that they offer.

Defining terms: "repository", CRIS, RIM etc

Institutions within Higher Education are often faced with questions of procurement such as technical suitability and sustainable technical support. Although these areas are broader than those normally covered by the Technical Foundations web site, since they encompass non-technical considerations related to funding, policy and practice that drive software acquisition in universities and related institutions, the purely technical aspects are securely within scope and of considerable interest to the community at large in terms of developing useful technical guidance.

The question "What is a repository?" is likely to have a range of possible answers, but Neil Jacobs noted the revival of an approach summarised in Cliff Lynch’s 2007 description of the institutional repository as “a set of services that a university offers to the members of its community for the management and dissemination of digital materials created by the institution and its community members”. Without reiterating the points made by Neil Jacobs in detail, suffice it to say that these efforts have been led by institutions such as the California Digital Library and notably by John Kunze and others. The difficulty with this approach in general is not a purely technical one but one of technical resources, and it is not unique to the microservices approach but can for example be seen with systems such as Fedora Commons as well.

Software development approaches

While the most modular, customisable and flexible technical approaches can often be adapted most quickly (and arguably most effectively) to the challenging technical demands placed on them, significant development resources, usually in-house, are generally required in order to tailor the software to local requirements. In practice, the result is often that only certain large institutions are able to justify and support software systems such as Fedora or even "roll their own" local software solutions. A useful example is the eSciDoc suite of services, developed by the Max Planck Society and FIZ Karlsruhe. Together, these effectively represent what in other contexts (e.g. the Linux world) might be called a "distribution", in this case based on Fedora. It is also worth noting that these services have been developed so that they can be used independently of eSciDoc, for example with DSpace or another repository system. In this way, true to Cliff Lynch's definition, each aspect of what together we call a "repository" is handled by a different piece of software, which then interoperates with a range of other web services according to local requirements.

"Does it do more than we already do?"

This, in a nutshell, is the microservices approach. However, there is no reason why the question should be restricted to repositories, since "repository" is itself something of a catch-all term for a class of web content services that are by no means identical in their principal functions and aims, even where they are using the same underlying software. Where, for instance, does the functionality of a repository end and that of a research management system, research information management system or Current Research Information System begin? Without a clear understanding of what these systems do, it is possible if not likely that higher education institutions, especially where decisions about procurement could be made by relatively non-technical managers, might easily end up acquiring more than one system with overlapping functions. Clearly, in times of difficult financial circumstances, this ought to be avoided wherever possible. It is worth spelling out what exactly different systems do in order to minimise duplication of effort.

Similar software issues facing HEIs

The question need not be limited to repositories and research information management either, although it is not the intention to get into great detail in this particular blog post. For example, libraries are frequently offered new products either by vendors with whom they have existing contracts or by their rivals. It is always in the interests of a vendor to sell a new product, so the question of duplication of technical functionality and/or the most effective technology to address a local need is of far more pressing concern to the institution than the vendor. A range of commercial library portals are on offer, built on but extending the functionality of library catalogues and commercial publications databases related to e-journals such as Web of Science. It is a common experience amongst library staff to feel unsure to what extent new software is offering new functionality, how it fits their technical requirements, and to what extent it may be re-packaging existing functionality in new clothes. The same could perhaps be said, for example, of systems relating to human resources or institutional finance offices.

What else can these systems do?

Returning to repositories and research information management, it is clear that a wide range of resource types are being published on the web through a range of related systems. The best recognised use of the repository is as a research publications repository, which is usually how the wider term "institutional repository" is understood within the context of higher education and issues relating to but not confined to Open Access. Increasingly, attention has turned to Current Research Information Systems, based on the CERIF standard, and similar research information systems. Of particular interest is the RMAS approach, effectively building such a system from a range of related pieces of software, i.e. a microservices approach outside the limits of the repository sphere. Research information management covers all aspects of the processes of research creation and dissemination, including research reporting, human resources, finance and publication, while publications repositories commonly focus only on the last of these. This is usually the area where institutions operate systems whose functionality overlaps, as there is no reason in principle why a CRIS, for example, cannot expose research publications on the Web: this is possible with the main commercial systems such as PURE and Converis, for example.

In any case, there is no necessary limitation on the term "repository" to cover only resources relating to the outputs of research. Teaching and learning materials, amongst a wider range of educational resources, are another major area that has seen substantial growth in the last two or three years. Various types of media resources from images to time-based media such as audio and video recordings are found in institutional repositories for a number of different academic purposes, e.g. art collections, media archives, music collections, health information and so on, not all of which are the direct products of either research or teaching but may be connected with one or both. In this context, it is as well to remember that the term "repository" means little more in essence than "organised place or system to put something [on the Web]" and that many such systems, especially older ones, have always been known as "digital archives", "electronic libraries", "media collections" and so on, in contexts where the word "repository" would still not generally be recognised. Large data collections are often stored in systems that are, in effect, repositories, but whose development has been through systems not normally known by that term.

Solutions that fit problems

In summary, dividing the world of software systems for academic and related outputs too rigidly into "repositories" and "research information systems" may be at the root of many of the difficulties that arise in understanding which technical functionality is required for any given local purpose and the extent to which systems overlap. A better, more precise understanding of these functionalities would help to avoid unnecessary duplication of effort and proliferation of systems. Some approaches are effectively bundled within one piece of software for a particular purpose, e.g. DSpace and EPrints in the repositories space. These offer a conventional set of services that fit the requirements of most institutions but may place some limits on the ability to customise those services indefinitely. Even these systems are built to be general purpose systems with considerable potential for local customisation. However, there is the tendency seen elsewhere (for instance in open source software with a large and disparate user base) to introduce software bloat: more and more functionality, some of it never used by the majority of implementations, is shipped with each succeeding version as new scenarios are met with.

While potentially introducing the problem of sufficient availability and sustainability of technical development effort, microservices sit at the opposite end of this spectrum. Each service is ideally a separate entity on the web server, built for maximum interoperability with the other services that may be required for local purposes. Rather than acting as plug-ins to a base software system (which is perhaps an intermediate approach), these are separate code bases able to run independently, even where they may have been intended, as in RMAS or eSciDoc, to be used frequently together. The technical issues and demands of each system will be different in every case.

The business of unique identification

Talat Chaudhri — Thu, 23 Feb 2012 23:58:26 +0000

What need is there for unique identifiers?

Put in relatively non-technical language, there is an increasing concern in information science in general to uniquely identify different things, organisations or people that could otherwise be confused, whether on the Internet or in the physical world. In technical terms, these are all referred to as resources (even if people might find it vaguely demeaning in normal language to be considered as such). This need, whether real or perceived in any particular context, has grown as the complexity of information available on the Web has grown almost exponentially, increasing the potential for confusing similar resources.

Why aren't names good enough?

1. People

It is not necessarily enough to have a name, since even a relatively unusual combination of names might easily not be entirely unique from a worldwide or even universal perspective: at the basic level, John Steven Smith might be unique in a place called Barton but even if you cross-reference these references, two people with the same name could easily be confused, for example if there are several possible places called Barton.

My own name, Talat Zafar Chaudhri, might appear to be more unique until you realise that these are all fairly common names in the Indian subcontinent and thus in the Indo-Pakistani diaspora, so it is reasonably possible or even fairly likely that another named individual exists with this particular choice of spelling (of which others may exist). I am also Talat Chaudhri, T. Chaudhri, T Chaudhri, T.Z. Chaudhri, TZ Chaudhri and similar variations (with or without spaces and punctuation) that might make it harder to decide which individuals to reconcile as a single individual, especially by machine processing. At least I do not vary the spelling of my surname, but some people may, especially in cases such as my own where other transliterations could be possible: for example, my father previously used the spelling Chaudhry and many others such as Chaudry, Chowdhary and Chowdhuri are equally possible. I understand when companies misspell it, but a computer might not be sure if these were definitely the same person, even if it went to the lengths of calculating a probability for this.
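The reconciliation problem described above can be made concrete with a minimal sketch in Python. It assumes a deliberately naive matching key (first initial plus surname) and uses the standard library's difflib.SequenceMatcher as a stand-in for a similarity score. The function names and the extra name "Tariq Chaudhri" are invented for illustration and are not taken from any real name-matching system.

```python
import re
from difflib import SequenceMatcher


def normalise(name: str) -> str:
    """Lower-case a name and replace punctuation with spaces."""
    cleaned = re.sub(r"[^\w\s]", " ", name.lower())
    return " ".join(cleaned.split())


def naive_key(name: str) -> tuple[str, str]:
    """Hypothetical matching key: (first initial, surname)."""
    tokens = normalise(name).split()
    return tokens[0][0], tokens[-1]


def surname_similarity(a: str, b: str) -> float:
    """Score between 0 and 1: a rough similarity, not a true probability."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()


variants = ["Talat Zafar Chaudhri", "Talat Chaudhri", "T. Chaudhri",
            "T Chaudhri", "T.Z. Chaudhri", "TZ Chaudhri"]

# Every variant collapses to the same key ...
keys = {naive_key(v) for v in variants}

# ... but so does a different (invented) person with the same initial.
false_match = naive_key("Tariq Chaudhri") in keys

# Transliterations of the surname are similar but never identical.
scores = {s: surname_similarity("Chaudhri", s)
          for s in ["Chaudhry", "Chaudry", "Chowdhary", "Chowdhuri"]}
```

The false match is the point of the sketch: a key loose enough to merge all of the variants also merges different people, and a similarity score only says how alike two spellings are, not whether they refer to the same individual.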

Moreover, people change personal titles (e.g. I have been both a Mr and a Dr and I am occasionally still referred to as the former by companies that do not allow for the latter option); they have multiple, changing work roles and work places, and may be known in multiple contexts, e.g. work, social, voluntary roles and similar. At work, one may have additional roles in various professional bodies, so it may not be apparent who is who. Two people might have the same name in a large professional group, e.g. physicists, and may even produce outputs related to the same subject. Who owns which ones? This is a particular issue for electronically available outputs on the Internet, e.g. publications, educational resources, audio, visual or audiovisual resources and so on.

2. Organisations

The same issue arises for organisations. Can we be sure that a Board of Licencing Control is unique? No. Perhaps it is merely a variant spelling of the Board of Licensing Control? What if one, but not all, of these were re-named as Burundian Licencing Control? What if the Board of Licencing Control merged with the Department for Regulatory Affairs under either of these names, a combination, or an entirely new name, yet continued its association with the assets of the originals? De-mergers are likewise possible, and may present issues of uncertain ownership of resources.

Perhaps there are organisations with this name in several countries but serving utterly different purposes, and perhaps one is merely one possible translation of a term into English but used natively in another language. Historical names have been used in multiple contexts that may still be valid, e.g. the Irish Volunteers, and these might need to be kept clearly separate from each other. Conversely, there are also organisations that have multiple names or forms of names, whether in one language or in multiple languages or during their history, e.g. Óglaigh na hÉireann is Irish for both the terrorist Irish Republican Army (IRA) and most of its subsequent splinter groups but is also, however, an acceptable name, for historical reasons, for the Defence Forces of the Republic of Ireland, and previously just the Irish Army (an tArm) that now forms a part of it. These are clearly not the same and must be distinguished. It must be also noted that typographical constraints and character encodings will lead to yet more duplicate forms.
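As a rough illustration of the two problems above (several names for one organisation, and one name for several organisations), here is a minimal Python sketch of a name index. The "org:" identifiers and function names are invented for the example. It also shows one of the character-encoding duplicates mentioned above: the same accented letter can be stored either as a single character or as a letter plus a combining accent, and Unicode normalisation is one way to make the two compare equal.

```python
import unicodedata
from collections import defaultdict


def label_key(label: str) -> str:
    """Normalise Unicode form, case and spacing of a name label."""
    composed = unicodedata.normalize("NFC", label)
    return " ".join(composed.casefold().split())


# Maps a normalised label to every (hypothetical) organisation using it.
name_index = defaultdict(set)


def register(label: str, org_id: str) -> None:
    name_index[label_key(label)].add(org_id)


def candidates(label: str) -> set:
    """All organisations that a given label might refer to."""
    return name_index.get(label_key(label), set())


# Two spellings, one organisation.
register("Board of Licencing Control", "org:1")
register("Board of Licensing Control", "org:1")

# One name, two distinct organisations.
register("Irish Volunteers", "org:2")
register("Irish Volunteers", "org:3")

# Precomposed versus decomposed accents: different strings, same key.
precomposed = "\u00d3glaigh na h\u00c9ireann"
decomposed = "O\u0301glaigh na hE\u0301ireann"
same_key = label_key(precomposed) == label_key(decomposed)
```

A label that returns more than one candidate cannot identify an organisation on its own, which is where a separate unique identifier becomes necessary.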

Isn't this bigger than the question of unique identification?

Yes, the need for complex metadata to express these things can go far beyond merely identifying resources in a unique manner. However, before one can even start thinking about complex descriptive and relational metadata, one first has to be clear which resource is mentioned: hence the first step must be unique identification of what it is we are talking about. Only once we have done that can we feel reasonably confident about talking about how resources relate to one another and how they may have changed over time.

Overall, there is an ever increasing need to make clear what is meant, as more and more things and agents have on-line identities that need to be distinguished, whether this is as an owner of resources or as a referent within a resource, e.g. the subject of the resource in a particular context, and even of the role played and the relationship to other resources or agents, perhaps in a specific time period. Information models can quickly become extremely complex, and this is certainly true where identity is concerned.

What is an identifier?

An identifier is similar in its basic concept to a name. At its most basic, an identifier in the context of an information system is a token (usually a number or a string of characters) used to refer to an entity (anything which can be referred to). Identifiers are fundamental to most, if not all, information systems. As the global network of information systems evolves, identifiers take on a greater significance. And as the Web becomes more 'machine readable', it becomes vital for all organisations that publish Internet resources to adopt well-managed strategies for creating, maintaining and consistently using identifiers to refer to those assets they care about.

What are unique identifiers?

The simple answer is that unique identification is the only way to avoid misidentification confidently, and therefore to prevent any errors about ownership or rights over resources that might arise, as well as making sure that large bodies of resources contain reliable information generally.

The fundamental question is whether the identifier or token that has been chosen is unique and how best to ensure this. Some identifiers are so complex that mathematical probability makes them effectively unique in the universe, notably UUIDs. In essence, a UUID is no more than a complex numerical token: it is only additional complexity (and thus uniqueness) that it offers compared to, for example, a running number. Others like names can only be distinguished unambiguously by making a series of statements about which names are considered equivalent, which contexts (e.g. a person's work or town) are valid, and so on, where a number of relationships have to be attached to a particular identifier and checked in order to reach an acceptable level of uniqueness and to eliminate any mistaken connections with resources that might be similar in name or perhaps also in other respects by chance.
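The contrast between a running number and a UUID can be sketched in a few lines of Python using the standard library's uuid module. The two "systems" here are hypothetical and stand for any two databases that issue identifiers without coordinating with each other.

```python
import itertools
import uuid


def make_running_counter():
    """Each system keeps its own simple running number, starting at 1."""
    counter = itertools.count(1)
    return lambda: next(counter)


# Two hypothetical, unconnected systems.
next_id_system_a = make_running_counter()
next_id_system_b = make_running_counter()

ids_a = [next_id_system_a() for _ in range(3)]
ids_b = [next_id_system_b() for _ in range(3)]

# Running numbers are unique within a system but clash across systems.
clashes = set(ids_a) & set(ids_b)

# Random (version 4) UUIDs need no coordination between the systems.
uuids_a = [uuid.uuid4() for _ in range(3)]
uuids_b = [uuid.uuid4() for _ in range(3)]
uuid_clashes = set(uuids_a) & set(uuids_b)
```

The running numbers collide as soon as the two sets of records are brought together, whereas the chance of two random UUIDs colliding is small enough to ignore in practice.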

The problem with UUIDs is that, while the chances of them failing to be unique are, to all practical purposes, non-existent, it is not very clear from a UUID alone what the nature of that resource is. It may be machine-readable but it says nothing about who generated that identifier and when, or which other identifiers might exist for the same resource in different systems that also generated an identifier for the same resource. Consequently, the need to associate other metadata with any complex number or other similar token remains (including but not limited to UUIDs). Simply, no single token can be sufficient for any complex purpose and, at the very least, an electronic or physical resource must be referenced for the token to have any useful meaning at all.
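One way to picture the extra metadata that an opaque token needs is as a small record wrapped around it. The following Python sketch is only an illustration: the class, its field names and the example.org address are assumptions made for this example, not part of any standard.

```python
import uuid
from dataclasses import dataclass, field
from datetime import date


@dataclass
class IdentifierRecord:
    """Hypothetical record attaching context to an otherwise opaque token."""
    token: str                 # the identifier itself, e.g. a UUID
    issued_by: str             # who generated it
    issued_on: date            # when it was generated
    resource_type: str         # what kind of thing it refers to
    same_as: list[str] = field(default_factory=list)  # tokens from other systems


def known_tokens(record: IdentifierRecord) -> set[str]:
    """Every token under which the same resource is known."""
    return {record.token, *record.same_as}


record = IdentifierRecord(
    token=str(uuid.uuid4()),
    issued_by="Example University repository",   # invented issuer
    issued_on=date(2012, 2, 23),
    resource_type="person",
    same_as=["http://example.org/staff/1234"],   # invented equivalent
)
```

The token alone answers none of the questions raised above; it is the surrounding fields that say who issued it, when, for what kind of resource, and which identifiers elsewhere refer to the same thing.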

This is effectively what a URL is: another type of token. While I will not go into the whole discussion about URLs and URNs as sub-types of URIs, it is worth noting that, in many quarters, the term URL is no longer preferred despite it being the most commonly used in practice. In strict terms, there is a clear difference: while a URI is usually resolvable to an electronic resource, which may be either a description of a physical or electronic resource or may be an electronic resource itself, there is technically no requirement that a URI should be resolvable. All it needs to be is a token, which does not necessarily have to represent an address that actually delivers a resource. However, it is usual to use the HTTP scheme, which is designed for delivering such a resource, so it would be somewhat eccentric and misleading if one were deliberately to choose an ostensibly resolvable syntax that does not in fact resolve. In effect, virtually all such URIs are also URLs (unless a resource has become unavailable and link rot has set in), since the latter must locate the resource or representation of it: this is inherently useful. Any URI that resolves, i.e. a URL, will be effectively unique within the standard Domain Name System (DNS). As a result, there is no absolute need for UUIDs in many contexts, since a sufficiently unique and practical token already exists in the URI. Any unique but arbitrary token serves the core purpose here.
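The difference between a URI that merely names something and one whose syntax promises an address can be seen by taking two URIs apart with Python's standard urllib.parse module. This is a sketch under stated assumptions: the example.org address and the UUID are placeholders, and "looks_resolvable" is a hypothetical check of syntax only, which makes no network request and so cannot detect link rot.

```python
from urllib.parse import urlparse


def describe(uri: str) -> dict:
    """Split a URI into its scheme and authority and guess, from syntax
    alone, whether it is meant to be resolvable over HTTP."""
    parts = urlparse(uri)
    return {
        "scheme": parts.scheme,
        "authority": parts.netloc,
        "looks_resolvable": parts.scheme in {"http", "https"}
                            and bool(parts.netloc),
    }


# An HTTP URI: the domain name makes it unique and (ostensibly) locatable.
http_uri = describe("http://example.org/id/person/1234")

# A URN: a perfectly valid URI that names a resource without locating it.
urn = describe("urn:uuid:123e4567-e89b-12d3-a456-426614174000")
```

Both strings are usable as unique tokens; only the first carries the additional expectation that something will be delivered when it is followed.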

Aren't identifiers really just names?

Yes and no. Names are intrinsically arbitrary too when they are first given. However, they are identifiable on a number of levels from a human perspective. In addition to a combination of names belonging to one or more particular linguistic and/or ethnic origins and usually identifying gender, they quickly become associated with a particular person, so their use in uniquely identifying that person within a given context becomes central to maintaining the person's reputation in whatever they do. This is, for example, particularly important to academics in Higher Education. In modern times, this name resolution needs to be done globally wherever the Internet is the context, whereas previously it would have been possible to use fewer additional pieces of information in more restricted contexts (e.g. a village, a country etc), depending on the purpose. These different contexts still co-exist but it is now necessary to provide as many as possible, since one cannot control or predict why the information is being requested in each instance on a global system such as the Internet.

How does this affect Higher and Further Education?

Increasing numbers of professionals and the bodies that they work for and represent need to describe their resources on the Internet, whether those are in themselves electronic resources, whether they are descriptions of electronic or physical resources (metadata), or whether they are other representations of physical resources, perhaps in addition to themselves being electronic resources (e.g. photographs). This is a particularly pressing issue in Higher Education and, to an increasing extent, in Further Education. Academic outputs may include publications, educational resources, visual, audio and audiovisual resources and so on. Perhaps the best known is the issue of scholarly publications, partly through the rise of the Open Access movement to make such resources freely available.

There is already a range of identifiers for academics and related professional university staff. One of the problems is that these are created for specific purposes that only cover whichever subset of staff is relevant to those purposes. For example, HESA keeps records that contain a HESA number for academic staff, which means that at least those who have published academic outputs will have such a number. Another number called the HUSID number is maintained for students, since tracking academic careers from student to staff is one important concern for HESA. Many academics in relevant fields may have ISNI numbers, which are used widely in the media content industries. Many academics will have one or more professional staff pages, including within repositories and Current Research Information Systems (CRIS), each with a URI, not to mention OpenIDs and URIs associated with Web services which they use professionally and/or privately, e.g. LinkedIn, Academia.edu, Facebook, Twitter and so on.

Here are some examples belonging to Brian Kelly of UKOLN:

The problem is that the coverage of these numbers is not universal within the HE sector, and there is no single recognised authority or other agreement to prevent and resolve conflicts where information is not consistent between two or more information sources.

At present, the JISC are trying to solve this through the Unique Identifiers Task and Finish Group, which also includes representatives of HESA, HEFCE, the various Research Councils in the UK and UKOLN. The preferred solution is currently the ORCID academic identifier, which is being developed internationally with publishers, with a great deal of input from the United States in particular.

In order to succeed, any such identifier will need international penetration of the higher education sector, since academics will not use it unless it delivers the sorts of interoperability benefits that make their work easier and become integrated into the recognised systems required of them by funders and publishers in the course of their work. Since students and academics change roles and institutions, this needs to be recognised and outputs properly allocated to institutions and departments, which may themselves change identities, merge and de-merge over time.

While institutions will need to reduce the workload on academics by bulk loading information about staff, since the main incentive to use the system is that every academic has a record, there is also an issue about control. Should academics have the ability to alter their records at will? Are assertions automatically trusted or does a particular record for an academic's time at an institution need to be verified by that trusted body? Who should maintain a list of trusted bodies who can back up assertions? How will this effort be funded sustainably? It becomes clear that some of these points are central structural concerns whereas others may cover only fringe issues such as avoiding deliberate falsification, which may be rare.

Proprietary academic identifiers

There are also a number of proprietary identifiers associated with different commercial services related to electronic publishing and related academic service industries. Thomson Reuters and Elsevier provide identities for individuals and organisations as part of their bibliographic and academic services; similarly, search services such as Google Scholar (see the study in this blog post) and Microsoft Academic Search have also started to offer identifiers (see this blog post). There may be privacy issues, for example in Google and Microsoft publicly surfacing information about researchers without explicit consent: while this information might have been suitable for the limited purpose of publication, academics may not have intended for it to be synthesised into a single, public description of their personal details available to all.

Some of these services introduce new problems, since their primary purpose is commercial and it is often less of a priority to deal with the internal issues facing academic institutions unless that impacts significantly on the ability to make commercial profit. These may be resolved over time or be reintroduced as services change and compete: the academic has little or no control over the effects of commercial decisions upon their work. For example, Microsoft Academic Search often misrepresents outputs as belonging to similarly named individuals (thus is currently failing at unique identification) and, by default, requires the manual input of researchers to edit out errors and take a proactive approach towards managing the information about themselves. This brings the overall quality of data into question: for large-scale statistical purposes, this could be tolerable, depending on the degree of error; however, for academic citations and reporting purposes such as the Research Excellence Framework (REF), it would not be acceptable to use this data without further refinement, which would most likely remain a long, manual process.

Software and services

Any software application layer, whether operated by commercial companies, higher educational institutions, funders or governmental bodies, needs to be maintained. If information is harvested or processed automatically, it needs to be clear who corrects information where errors are found and what the resources are for academics to contact individuals with the time and effort available to improve the data as part of their work. In the case of commercial organisations, this is usually unclear and may change. There is no guarantee that the commercial reason for providing services will continue over time, unlike in most cases in the public sector within Higher Education. Coverage of such commercial services is often geared towards institutions rather than individuals: for example, Google Scholar requires registration using a valid university email address that it recognises, which would exclude private scholars and perhaps some retired staff who produce research.

The Web of Things

It has already been mentioned that electronic descriptions or other representations of physical objects may be found on the internet, including written descriptions, pictures, geographical locations, dimensions and so on. It is even possible to describe physical objects that were extant but are now historical, or which have moved or whose location is now unknown, referencing comparable objects and linking these descriptions with other resources that are related. In each case, the nature of the relationship, relevant agents who may have been responsible for it, and when it was valid can be described in metadata.

This opens the way for the Web of Things, a term used to describe that part of the Semantic Web that covers physical resources as opposed to, or as well as, purely electronic ones. Some authorities use the term to mean physical objects with miniaturised electronic devices to enable them to be located, whereas others merely mean any physical object that is described in a record on the Web. It may be argued that all electronic resources have relationships to physical ones, even if that is only with regard to authorship and subject. The Resource Description Framework (RDF) provides a means to describe these relationships and transmit information about them in ways readable to humans and machines. Although these are usually expressed as triples, where two things are described with a relationship between them, metadata structures such as the Common European Research Information Format (CERIF) can add link tables that give far more detailed information about the relationships themselves. All of this can be made available as Linked Data and surfaced in many software applications on the Web.
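The difference between a bare triple and a richer link record can be sketched in plain Python without any RDF library. The example.org URIs are placeholders, and the Link class is only loosely inspired by the idea of a link table that carries a role and a period of validity; it is an assumption made for illustration and does not reproduce the actual CERIF schema. The predicate is the Dublin Core "creator" term.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# A triple is simply (subject, predicate, object), each given as a URI.
Triple = tuple[str, str, str]

DCT_CREATOR = "http://purl.org/dc/terms/creator"

triples: list[Triple] = [
    ("http://example.org/output/1", DCT_CREATOR, "http://example.org/person/1234"),
    ("http://example.org/output/2", DCT_CREATOR, "http://example.org/person/1234"),
]


def objects(subject: str, predicate: str) -> list[str]:
    """Everything related to a subject by a given predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


@dataclass
class Link:
    """Hypothetical link record: the relationship has its own role and dates."""
    source: str
    target: str
    role: str
    start: date
    end: Optional[date] = None   # None means the link is still current

    def valid_on(self, day: date) -> bool:
        return self.start <= day and (self.end is None or day <= self.end)


employment = Link(
    source="http://example.org/person/1234",
    target="http://example.org/org/1",
    role="employee",
    start=date(2010, 1, 1),
    end=date(2013, 7, 31),
)
```

A triple can say that a person is related to an organisation; the link record can also say in what capacity and during which period, which matters when people and institutions change over time.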

The Semantic Web is often seen as a utopian view of a future where no electronic resources will be published without complex information being provided or automatically generated about their origins. The reality is that manual entry of information is generally very limited unless it serves the purposes of the person entering it, and this cannot be relied upon as an approach to ensuring consistent metadata on a sufficient scale for the Semantic Web to work. Technology has in some cases improved to the extent that geographical and technical information is now automatically produced, for example in digital cameras and in mobile phones able to record GPS coordinates.

However, it is highly doubtful whether the effort and cost required to catalogue the entire physical world could be met, or to what extent this is even possible. Where the Semantic Web could be useful is within particular large bodies of data, for example experimental scientific data, publications and so on. In the case of the Web of Things, this could include art collections, photography, archaeological information, the locations of public institutions and many more. For all of these purposes, it will be necessary to provide unique identifiers for increasingly large numbers of resources, including things and agents, in order to provide complex metadata about them.

Education in the wider world

It has perhaps not been sufficiently investigated how unique identifiers for researchers and other staff in Higher Education will fit into the wider question of unique identification on the Web. Relevant purposes might be:

(1) commercial, for example the identification of companies and individuals owning the rights to photos, music, video or publications, particularly legacy resources of ongoing commercial value in terms of royalties and performance licensing.

(2) governmental, for example biometric information about people, used in border controls, crime prevention and citizenship contexts; or about public or private organisations such as charities, political groups of interest to law enforcement etc. Information about individuals, in particular, may be subject to privacy laws, which will vary between jurisdictions.

It is clear that there are interfaces between the various agents and outputs of academic institutions and many other purposes, notably those commercial and governmental activities already described. For example, a foreign student or member of staff seeking a work permit will require institutions and governmental bodies to use personal and citizenship information co-operatively, which will be linked to their academic identity in the course of their work at the institution. Some of this information will be private and some public, so there is an issue about who can see which parts of a particular corpus of Linked Data, requiring authentication protocols and systems.

The extent to which consistency of approach between HE institutions and other sectors and contexts can ever be ensured is moot, since there is of course no single international authority and because any single metadata solution that tried to cover so many diverse purposes would be fatally unwieldy. How different, flexible approaches can be understood by machine processing is perhaps the technological key to how well the Semantic Web will answer these questions in future, both within Higher Education and beyond.

IDCC11 Workshop on Domain names and persistence on 8 December

Brian Kelly — Wed, 09 Nov 2011 12:10:29 +0000

Many JISC-funded projects are involved in the development of important aspects of the technological infrastructure which will support teaching and learning, research and administrative activities across the sector. Other projects may be developing digital content for use across the sector.

But how robust is the technical infrastructure and how sustainable will the content and the services be? Such issues will be dependent on factors such as the standards which are used and sustainable business models. But an additional important factor is the persistence of Internet domain names.

The seventh IDCC (International Digital Curation Conference) is taking place in Bristol on 5-7 December 2011. The accompanying series of co-located events includes a one-day workshop on Domain names and persistence.

As described in the workshop description:

The vulnerability of any digital material to unexpected or unintended changes in Internet domain name assignment, and hence to the outcome of domain name resolution, is widely recognised. The fact that domain names are not permanently assigned is regularly cited as one of the main reasons why http: URIs cannot be regarded as persistent identifiers over the long term.

However, the claim that http: URIs are considered inadequately persistent is belied by widespread reliance on them in digital material that will undoubtedly persist, such as technical standards and research articles. As this practice continues - and it certainly will - it will become increasingly important as a matter of clarity, trust, and integrity to align Web governance, which currently specifies potential impermanence for domain name assignments, with practice. Either it needs to be brought about that at least some domain name assignments are universally recognised as persistent, and hence at least that vulnerability to http: URI persistence removed for URIs using them, or a credible alternative must be supplied. But attempts to establish permanent actionable URIs outside of the http: URI scheme have met with little success. It is therefore necessary to investigate the prospects for universal recognition of at least some permanent domain names.

The workshop organisers are inviting presentations on any subject relevant to the problem of domain name persistence, including, but not limited to:

Better characterisation of the problem(s) (or denial that there are any)
- Is leasing vs. owning the source of the problem? What would owning a domain name even mean?
- What other managed naming systems are there, and how well do they persist (or fail to)?
Accounts of experience with problems or solutions
Relationship of long-term domain name persistence to domain name continuity management, that is, provision for catastrophic loss of the ability to host a domain due to natural disaster, civil unrest or government action
Domain name ownership information archiving
Places to look for solutions
- Creating parallel/alternative domain (or whole URI) lookup mechanisms
- Creating a top-level domain (TLD) within which different persistence guarantees would be expected/enforced (how?)
- Changing the rules (how?) so that e.g. standards bodies could give more credible persistence guarantees
- Domain name insurance schemes
- Mutual aid pacts
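
As a rough illustration of the underlying problem, the following Python sketch shows that the only part of an http: URI handed to the domain name system is its host component; everything else is interpreted by whichever server currently answers for that name. The function name and the example URI are hypothetical, and the URI uses the reserved example.org domain.

```python
from urllib.parse import urlsplit


def dns_dependent_part(uri):
    """Return the host name that must keep resolving for the URI to work."""
    parts = urlsplit(uri)
    if parts.scheme not in ("http", "https"):
        raise ValueError("not an http: URI: " + uri)
    return parts.hostname


# Hypothetical identifier on a domain reserved for examples.
uri = "http://standards.example.org/spec/2011/metadata#title"
print(dns_dependent_part(uri))
```

If the registration for that host lapses or changes hands, the path and fragment are unaffected as strings, but they no longer lead to the same resource.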

Attendance at the workshop costs £80 and bookings can be made on the IDCC 2011 conference website.

DC-2011

Alex Ball — Fri, 30 Sep 2011 13:13:00 +0000

On 21-23 September 2011, I attended the Eleventh International Conference on Dublin Core and Metadata Applications, known as DC-2011 to its friends but #dcmi11 to the true elite. The National Library of the Netherlands (KB) in The Hague made a pleasant setting for the event, although it was perhaps too small. That is to say, the public portion of it did not have sufficient rooms for all the parallel sessions, so some had to be held deep in the secure area of the building. This, as you can imagine, caused headaches for delegates and hosts alike and restricted movement between sessions. In spite of this there was a friendly and lively atmosphere.

On the first day there were tutorial sessions introducing the world of Dublin Core to those less familiar with it. I was not able to attend, and I feel I missed out as people kept telling me about meerkats being behind the name for the original 15 Dublin Core elements. Or something like that.

The conference proper kicked off on the second day with Mikael Nilsson explaining that interoperability (system B understanding what system A produced) is insufficient, and what we really need is harmonization. In other words, metadata that conform to multiple specifications, and systems that can understand and integrate multiple metadata schemes. If you're familiar with RDF and application profiles, you can see where this is going.

In the following plenary session, Jae-Eun Baek used a task-based, 5W1H model to compare different archival and preservation metadata schemes. The 5W1H refers to questions that the metadata are supposed to answer about a task: who does it, why they do it, what they do it to, and so on. The model revealed how different metadata schemes concentrate on different lifecycle stages. This was followed by Kai Eckert, who explained how the Dublin Core Abstract Model needs to be extended in order to provide proper support for recording the provenance of metadata. It involves allowing Description Sets to be the subject of further Descriptions (specifically Annotations); if you know about RDF named graphs, you'll recognise the concept.
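
For readers who have not met named graphs, the idea can be sketched in a few lines of Python. This is not the Dublin Core Abstract Model extension that Kai Eckert proposed, only an illustration of the underlying pattern: each statement carries the name of the graph it belongs to, and that name can itself be the subject of further statements. All identifiers here (the ex: prefix, the graph names, the sample values and the helper functions) are hypothetical.

```python
# Each quad is (subject, predicate, object, graph name).
quads = [
    ("ex:report1", "dc:title", "Annual Report", "ex:descriptionSet1"),
    ("ex:report1", "dc:creator", "ex:alice", "ex:descriptionSet1"),
    # Provenance: statements whose subject is the description set itself.
    ("ex:descriptionSet1", "dc:creator", "ex:cataloguerBob", "ex:annotations"),
    ("ex:descriptionSet1", "dc:date", "2011-09-22", "ex:annotations"),
]


def statements_in(graph, quads):
    """Return the (subject, predicate, object) triples held in a named graph."""
    return [(s, p, o) for (s, p, o, g) in quads if g == graph]


def provenance_of(graph, quads):
    """Return the (predicate, object) pairs asserted about a named graph."""
    return [(p, o) for (s, p, o, g) in quads if s == graph]


print(statements_in("ex:descriptionSet1", quads))
print(provenance_of("ex:descriptionSet1", quads))
```

The point of the pattern is that the metadata about the report and the metadata about that metadata stay separate, yet remain linked through the graph name.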

The next session was all about mapping between different schemes. Gordon Dunsire argued that to get the benefit of working with Semantic Web technologies, we need to avoid translating values into different formats, and instead concentrate on mapping out the relationships between the properties themselves. Ahsan Morshed talked about how concepts in AGROVOC (an agricultural thesaurus) were mapped to other vocabularies; of particular interest was the way multiple languages were used to pin down the concepts in question. Lastly, Nuno Freire reported on efforts to transform subject headings from various schemes into sets of more specific properties (times, places, events), to make them easier for computers to work with.

The afternoon saw proceedings split into project reports and Dublin Core Community and Task Group workshops. I was involved in the Science and Metadata Community workshop. Jian Qin gave an update on the work she and I are doing with DataCite to produce a Dublin Core Application Profile version of the DataCite Metadata Specification. I gave an overview of current scientific metadata schemes with the aid of some diagrams based on the scoping study I conducted a couple of years ago. The other highlight was a presentation from Michael Lauruhn and Véronique Malaisé of Elsevier on their work with linked data, including the Elsevier Merged Medical Taxonomy (EMMeT) and the Data to Semantics research project.

The talk by Emmanuelle Bermès that kicked off the final day will probably best be remembered for its cookery metaphors, especially the 'stone soup'. If you're not aware of the fable that features stone soup, think of it as a benign slippery slope: some people who weren't willing to help make soup were persuaded instead to incrementally improve boiling water (with stones in) until it became soup. If data are the ingredients, and a functional web of linked data is the soup we're after, what are the 'stones' that will catalyse the transformation from one to the other?

The third plenary session presented the experience of people working with linked data. Antoine Isaac recounted how the Europeana digital library has been making a transition from Europeana Semantic Elements to the (linked-data-friendly) Europeana Data Model, the design decisions they had to make and problems they had normalizing their stock of data. Daniel Vila-Suero justified the style guidelines he and his colleagues have been working on for naming and labelling ontologies in the Multilingual Web. These are being trialled with IFLA's implementation of the FRBR model in RDF. Benjamin Zapilko talked about trying to perform statistical analysis directly through SPARQL. One of his conclusions was that it would probably be better to teach statistical packages SPARQL than to teach SPARQL statistics.

The final plenary collected some more examples of metadata usage in practice. Jörg Brunsmann gave the latest from the SHAMAN Project on handling engineering data, although of most interest to me was how he introduced the notion of Metadata Information Packages to OAIS. Mohammed Ourabah Soualah described the challenges of agreeing a common protocol for cataloguing Arabic manuscripts in Dublin Core, for a cross-search application. Finally, we had a screencast recorded by Oksana Zavalina on the different ways in which digital library collections handled collection-level metadata using the DC Collection Application Profile.

The afternoon was again a mixture of project updates and Community/Task Group meetings. The Registry Community meeting was largely taken up with discussions about the proposed requirements for a new system to manage DCMI's namespaces (and any that its Communities might want to set up). The highlight of the projects session was a paper on encoding the relationships between jazz musicians (e.g. influencedBy, mentorOf) in RDF.

The closing plenary consisted of two videos. The first was from the Free Your Metadata project, who provide guidance on using Google Refine to publish Linked Open Data. The second was an extensive and tuneful tourism advertisement for Malaysia, the host country for next year's conference.

That was my first experience of the Dublin Core conference, but with up to six parallel streams each afternoon, I can't claim to have a representative view on it. There was an entire unconference component I didn't experience at all. If there is a common theme I can pick out, it is that the technology still hasn't caught up with the demands of people working with the thornier issues of metadata. There was palpable impatience for Named Graphs to become an official part of RDF, for instance. I see a lot of potential for great work to come out of the Community meetings that form a major part of the Conference, and although I'm clearly biased, my own Community meeting was the highlight for me.

Draft ORCID API is now open for viewing!

Ben O'Steen — Thu, 29 Sep 2011 13:57:00 +0000

The API draft is now available for public viewing and covers:

Levels of privacy and other contextual terminology.
Public query API by way of illustrative HTTP query dialogues.
Protected Data query via OAuth.
- OAuth Workflow is illustrated in some depth

This is a pre-release of the API; it is nearly there, but it would be foolish to assume that the API will not change if any difficulties arise or if a better way is agreed upon.

Google Doc version of API:

https://docs.google.com/document/d/1hEHwKEpQ3wH-qmgmQAgdxdcEIG1jmv6e2-FgdEfW89I/edit?hl=en_GB

As the document is ‘view-only’, you cannot comment on it directly. Please post queries and observations to the ORCID Researcher Google group.

NB Posting a comment here will not directly reach the other members of the ORCID board.

ORCID Outreach Event at CERN

Ben O'Steen — Thu, 22 Sep 2011 10:23:00 +0000

Program

10:00 Welcome and what’s new – Howard Ratner, ORCID Chair (Slides [PPTX 2.55Mb])

Talk discussed:

Key quote “ORCID will work to support the creation of a permanent, clear and unambiguous record of scholarly communication by enabling reliable attribution of authors and contributors”

Re-statement of the 10 ORCID principles

Various demographics and participant statistics

Illustration of how the Trusted Partners can give more weight to the assertions made in a profile by a researcher by ‘agreeing’ (same_as).

An overview of other researcher ID initiatives and some bullet points on why they feel ORCID is different:

Only not-for-profit contributor identifier initiative dedicated to an open and global service focused on scholarly communication

ORCID is backed by a non-profit organization with over 250 participants behind it

ORCID is backed by many different stakeholders

Publishers are an important ORCID stakeholder but are just one part

ORCID is serious about building an open system

ORCID is the only researcher identifier that is not limited to discipline, institution or geographic area

ORCID is the one to bridge them all by registering the identifiers of all other relevant standalone services (silos big and small)

10:30 What ORCID already does and will do next – Brian Wilson and Geoff Bilder for the Technical Working Group (Slides [PPTX 3.8Mb])

Talk covered:

Development approach, timeline and progress overview

Discussion of the form of ORCIDs as URLs

Overview of what the Query API will provide (non-technical)

Details of the VIVO/ORCID collaboration and code resulting from that.

11:00 Open Q&A on the above

11:30 Cool, but who’s going to pay for that – Craig Van Dyck and Ed Pentz for the Business Working Group (Slides [PPTX 1.19Mb])

Talk covered:

Details of the financial models and projections for the ORCID project

Expected cost to institutions, publishers and funders

$2.75 million required as investment capital (to be paid back after the project breaks even)

13:30 ORCID and me: synergies – Each followed by animated discussion with the audience

ORCID and researchers – Cameron Neylon, STFC

Cameron’s key points were:

Without giving researchers total control over their data and their profile, the system will fail. This includes the power to not list works and co-authorship that the researcher does not want to show.

The most authoritative information you have about a researcher WILL be from the researcher. Not the institution, not the publisher, but the researcher. It is up to them to specify what is ‘true’ or not.

Researchers wanted three things:

Online profiles that could be used to generate CVs (as maintenance-free as possible) – “It should just know about what articles I publish”

Tracking and aggregation of non-standard outputs in repositories (eg Data, software). This also relates to an identifier being used as a marker that I can use to say “This is a scholarly output for me” even on non-traditional outputs (eg blog posts)

And this is the key: automating and simplifying grant submission systems and, critically, manuscript submission systems. That clearly got the most votes, is probably the most tractable, and offers the most opportunity for immediate traction with researchers.

ORCID and data – Jan Brase, DataCite (Slides [PPT 0.5Mb])

Provided an overview of DataCite and why it exists (no current convention for citing datasets, attributing impact to them or linking them to the articles which use them)

“DataCite is part of ORCID as ORCID is a community, DataCite is about linking all types of scientific content together, and author identification is one of the key issues”

DataCite search interface: http://search.datacite.org/ui

An example PANGAEA dataset (NB not the one used in presentation unfortunately): http://doi.pangaea.de/10.1594/PANGAEA.733100

ORCID and funding agencies – Carlos Morais-Pires, European Commission (Slides [PDF])

Provided the EU context for FP8, and where ORCID and related efforts may fit within the overall strategy, including overarching figures and funding information.

No questions were raised immediately following this talk, but it did give a very good context to the levels of money that the EU is pushing into this area.

ORCID and your university library – Consol Garcia, Biblioteca del Campus del Baix Llobregat (a Prezi which I cannot find online, may be private)

Provided a good illustration of why the ‘first name, last name’ paradigm falls flat for many cultures and languages.

Asked many questions about what ORCID may do to help libraries but also how it could fit within library practices as they currently stand.

[Ben: Fundamentally, it raised more issues about current library practices and their shortfalls than about what a global ID for researchers could do]

ORCID and your repository – Najko Jahn, Universität Bielefeld

The presentation gave an overview of the work they had been doing on their repository for the past year or more. They had already begun to tackle the author disambiguation problem, assigning IDs to authors and so on. Librarians suggested which works to attribute to researchers, and the researchers could simply confirm or deny that they had authored each work. They had done so for approximately 300 of their researchers.

The key question he posed at the end was “What would adopting ORCID do for my repository?” which is a perfectly valid question, given the work they had already undertaken to disambiguate. The discussion was slow, but eventually focussed on the difference in scope – their researcher IDs were locally valid without a widely understood API to query about them, and an international ID system would have a global scope, with effort being made so that the API is as simple but useful as possible.
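
The suggest-then-confirm workflow described in this talk could be modelled roughly as follows. This is only a sketch of the general pattern, not the Bielefeld implementation: the class names, the identifier formats and the helper function are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    PENDING = "pending"
    CONFIRMED = "confirmed"
    DENIED = "denied"


@dataclass
class Suggestion:
    author_id: str  # locally assigned author ID (hypothetical format)
    work_id: str    # repository record ID (hypothetical format)
    status: Status = Status.PENDING

    def respond(self, is_mine: bool) -> None:
        """Record the researcher's answer to a librarian's suggestion."""
        self.status = Status.CONFIRMED if is_mine else Status.DENIED


def confirmed_works(suggestions, author_id):
    """List the works a given researcher has confirmed as their own."""
    return [s.work_id for s in suggestions
            if s.author_id == author_id and s.status is Status.CONFIRMED]


# A librarian proposes two attributions; the researcher accepts only one.
suggestions = [Suggestion("local:0001", "record:42"),
               Suggestion("local:0001", "record:43")]
suggestions[0].respond(True)
suggestions[1].respond(False)
print(confirmed_works(suggestions, "local:0001"))
```

Note that the author identifiers in such a scheme only mean something inside the one repository, which is the difference in scope that the discussion settled on.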

ORCID and your journal – Brian Hole, Ubiquity Press

Talked about how ORCID may work with a small, independent publisher and what made them different from others (publishing by researchers, for researchers)