Identifiers

Talat Chaudhri — Wed, 06 Feb 2013 17:10:09 +0000

Digital identifiers are central to information retrieval and management, and are consequently crucial to the operation of the World Wide Web. In particular, the ability to find and refer to a digital or real-world, physical resource and/or a description of such a resource relies on its unique identification. The relationships between such resources, described in XML and/or RDF, may provide a powerful resource for research by establishing what other entities and resources are linked to the one being described and providing a means to find information that might not otherwise have been discovered. Higher education institutions are large-scale producers and consumers of information about their research and teaching activities, and also generate considerable volumes of administrative data that are linked to these activities. These resources may, for example, include research projects, resulting publications, teaching resources and complex data sets. To secure grant funding or commercial collaborations, it is often crucial to be able to demonstrate the success and productivity of academic institutions or individual academic units within them. In particular, the unique identification of academic staff is crucial to their continuing success.

Background:

On the most basic level, Web technologies intrinsically require the identification of information resources so that requests for such resources can be properly processed and so that the correct resources can be supplied. To this end, every resource on the World Wide Web is uniquely identified with a Uniform Resource Identifier (URI). The identifiers must be unique simply in order that there is no confusion about which resource is being sought or which server ought to seek to provide it. These resources may be of any file type whatsoever, including textual documents, web pages, images, video or audio resources, scripts, compressed archives etc; these may be for human viewing or may be intended to be machine readable in order to support a particular web service.

Information on the Web is found within resources: these may be the object of the request in themselves, if those resources are electronic; however, if the resources are not electronic, e.g. physical objects such as buildings, works of art, books or even people, the electronic resource is likely to be only a description of the actual resource that is being sought. Examples might be catalogues, lists, biographies and similar. While it may seem odd from a human perspective to refer even to people as resources, servers simply return the information, however organised, that is attached to the HTTP identifier that has been requested, whatever it may be. For instance, the simplest resource related to a person (or organisation, similarly) could be a declaration of the forms of a person’s name and other personal details that identify that person better as an individual and avoid confusion, i.e. disambiguate from other similarly named people, organisations or other resources.

Technically, there are two types of URI. The most immediately familiar of these is the Uniform Resource Locator (URL), which, for example, will be seen in the browser address bar above this and all other web pages. Another, rather less well known type of URI is the Uniform Resource Name (URN).

The difference between the URI and URL is a technical one and is often poorly understood in practice, to the extent that these terms are frequently confused. In theory at least, the URI only has to be a uniform identifier that is unique, which means that there is no automatic requirement for it to be expressed using the Hypertext Transfer Protocol (HTTP), or any other standard protocol such as XMPP, SMTP, MAILTO, FTP and so on. Effectively, any name or other sequence of characters that is unique to a particular resource could theoretically be classified as a URI and be used accordingly. However, in practice, it is not automatically useful to have a simple identifier that does not provide a means to locate and retrieve the resource in question. There is nothing to prevent a URI being expressed using the HTTP or other protocol and choosing a URI that points to a genuine resource on an active server. This provides a means for the server to resolve the request for that resource and hence to either return it, point to an alternative resource, state why the resource is no longer available, or declare that the resource does not exist, as may be appropriate. Each protocol, including HTTP, has its own technical communication standards which servers should adhere to.
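
The distinction can be made concrete with a minimal sketch, assuming Python's standard urllib.parse module. The identifiers listed are invented for illustration and do not come from any real service; the helper name describe is likewise hypothetical. The sketch only reports which scheme each identifier uses and whether it names a network host that could be asked to resolve it.

```python
from urllib.parse import urlparse

# Illustrative identifiers only; none of these refers to a real resource.
examples = [
    "http://example.org/staff/jsmith",
    "ftp://example.org/pub/report.pdf",
    "mailto:j.smith@example.org",
    "urn:isbn:9780306406157",
]

def describe(uri: str) -> tuple[str, bool]:
    """Return the URI scheme and whether the URI names a network host."""
    parts = urlparse(uri)
    return parts.scheme, bool(parts.netloc)

for uri in examples:
    scheme, has_host = describe(uri)
    print(f"{uri}: scheme={scheme}, names a host={has_host}")
```

All four strings are valid URIs, but only the first two name a server from which the resource could be requested directly.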

Unlike the URL, there is no requirement for a URN to resolve to a resource. It is instead intended to provide a particular recognised scheme, over and above the normal HTTP identifier, for the identifier to be globally unique. In practice, some means to resolve the URN is generally required if the resource can be returned electronically via the Web, though not always in the case of physical resources. An example of this might be the International Standard Book Number (ISBN) for a particular physical book. It is intrinsically useful in the physical world to be able to discover this identifier, and so providing it via the Web may be useful for both humans and machines to discover the locations of actual physical copies or descriptions thereof, even without immediately providing a document on the Web that contains such a description. Inevitably, it is always more useful to provide a Web resource immediately that describes the physical resource in more detail, in addition to simply providing a URN within some appropriate schema. On this basis, the URL is a more flexible and widely applicable form of identifier than the URN, while the URN provides one method for providing more specific, narrowly defined semantic information as part of that identifier. It is not the only method of providing such information, which may instead be marked up, for example using the Resource Description Framework (RDF), which may be expressed in XML, or in simpler ways such as the JavaScript Object Notation (JSON). These would normally be contained in a machine readable document resource that itself has a resolvable HTTP URI, i.e. a URL.
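
As a rough illustration of the ISBN case, the following sketch assumes a URN of the form urn:isbn:<number> and checks the standard ISBN-13 check digit. The function names are hypothetical, and the sample number is used only as a well-formed example; passing the check says nothing about where a copy of the book can be found.

```python
def parse_isbn_urn(urn: str) -> str:
    """Extract the bare ISBN from a URN of the form 'urn:isbn:<number>'."""
    prefix, namespace, value = urn.split(":", 2)
    if prefix.lower() != "urn" or namespace.lower() != "isbn":
        raise ValueError(f"not an ISBN URN: {urn!r}")
    return value.replace("-", "")

def is_valid_isbn13(isbn: str) -> bool:
    """Check the ISBN-13 check digit: weights alternate 1, 3 and the sum is a multiple of 10."""
    if len(isbn) != 13 or not isbn.isdigit():
        return False
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(isbn))
    return total % 10 == 0

isbn = parse_isbn_urn("urn:isbn:978-0-306-40615-7")
print(isbn, is_valid_isbn13(isbn))
```

The identifier can be validated and exchanged without any server being involved; a separate service is still needed to turn it into a description or a location.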

The concept of unique identification may be combined with the concept of describing relationships, often using RDF or other similar or related mark-up technologies. The idea that physical resources can have presence on the Web via their descriptions is often called the “Web of Things” and is part of the Semantic Web. The latter provides a means to find out how resources are related to other resources, e.g. how the Leaning Tower of Pisa is related to Italy, where it is located as a grid reference and its GPS coordinates, who built it, which famous experiment was allegedly conducted there by Galileo etc. When a large quantity of information is made available using description schemas that are machine-readable, Web services can be developed that can, for example, discover video, audio or textual resources such as publications about that place, or even perhaps artworks or photographs depicting it. For example, a scientist and an architect might be interested, for different reasons, in closely related resources to do with the construction and more recent stabilisation of the famous tower in Pisa. At least in theory, the Semantic Web, which relies on unique identifiers for every resource and every document providing such descriptions, could have unlimited potential for providing a means to cross-reference stores of information that are available on the Web and could thus provide more in total than any one such store is capable of providing alone.
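
The idea of describing relationships can be sketched without any RDF library by treating each statement as a (subject, predicate, object) triple. This is a simplification under stated assumptions: the URIs and predicate names below are invented for illustration, and a real deployment would use published vocabularies.

```python
# Each statement is a (subject, predicate, object) triple.
# All URIs and predicate names here are hypothetical.
TOWER = "http://example.org/id/leaning-tower-of-pisa"

triples = [
    (TOWER, "locatedIn", "http://example.org/id/italy"),
    (TOWER, "label", "Leaning Tower of Pisa"),
    ("http://example.org/id/galileo", "allegedlyExperimentedAt", TOWER),
    ("http://example.org/doc/stabilisation-report", "about", TOWER),
]

def related_to(resource, statements):
    """Return every statement that mentions the resource as subject or object."""
    return [t for t in statements if resource in (t[0], t[2])]

for subject, predicate, obj in related_to(TOWER, triples):
    print(subject, predicate, obj)
```

Because every statement refers to the tower by the same URI, a service can gather everything said about it from several sources, which is the cross-referencing that depends on unique identifiers.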

In higher education, the creation and provision of resources is inevitably critical to all academic and teaching work. It is inevitable, therefore, that the Web has become a critical tool in carrying out all of these functions. For example, the effective organisation of complex research activities leading to academic publications that bring in funding revenue, or of teaching resources that facilitate the education of students, will contribute considerably to the performance of particular institutions. Providing and organising this information across the various functions and departments of an institution is often key: research data, human resources or finance information, library catalogues and publication information can be combined centrally to provide a powerful and flexible database of the activities of an institution or its departments, including their strengths and weaknesses at any particular point in time.

As technologies are developed and used to bring together diverse sources of information more efficiently and automatically than was possible by manual processes, it becomes clear how increasingly important it will be to identify the individuals, organisations, documents, artifacts and multimedia resources involved in these activities correctly. It is a particular concern to identify the names of people, organisations and their subdivisions and places correctly, since the productivity of individuals and institutions relies heavily on the accuracy of information and because any misattribution could impact negatively on, for example, their standing with funders and commercial partners. On a fundamental level, identifiers contribute to the efficient organisation, consumption and reuse of information resources on the Web. The relevance of unique identification to such a major producer and consumer of information as the higher education sector clearly cannot be overestimated in this light.

Current Usage:

There are so many different schemas and protocols for unique identification, within the overall superset of URIs and the broad divisions into URLs and URNs, that it is practically impossible to create an exhaustive list of them. Within the higher education sector, there are a number of broad classes of identifiers and metadata schemas containing various approaches to unique identification that can be described in broad detail here.

On the simplest level, one technique is to aggregate information about an identifier and its various or equivalent forms in other contexts or schemas in order that it cannot be confused with similar identifiers. This is a useful approach, for example, with regard to the names of people and organisations, which have multiple name forms that might be valid either concurrently or at different times in their history. It is useful to know that a common name like John Smith refers to John Nathan Smith, since there are fewer so named, and even more useful to state a relationship about where (and when) he was employed. A machine will not know whether J.N. Smith, J N Smith, JN Smith, Dr. J. Smith and perhaps later Prof. J. Smith are different people or the same person, especially where titles change or an individual has several valid name forms. It is useful to state that these are equivalents, as together these contribute to the uniqueness of the record or records about this individual. Details about where a person worked and their job, together with what the institution was called at various times, with date stamps where appropriate, serve to make these entities unique. The order of elements of a name and what function each has, for instance in East Asian names where the family name may be given first, can also be marked up in metadata.
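
A minimal sketch of how a machine might at least group such name forms is given below. It assumes Western name order with the surname last and a small, hypothetical list of titles; the functions name_key and may_match are invented for illustration. A match shows only that two forms are compatible, not that they refer to the same person.

```python
import re

# A small, illustrative list of titles; a real list would be longer.
TITLES = {"dr", "prof", "mr", "mrs", "ms"}

def name_key(name: str) -> tuple[str, str]:
    """Reduce a name form to (surname, initials), assuming the surname comes last."""
    tokens = re.sub(r"[.,]", " ", name).split()
    tokens = [t for t in tokens if t.lower() not in TITLES]
    *given, surname = tokens
    initials = ""
    for token in given:
        # "JN" is read as two initials; "John" contributes only "J".
        initials += token if token.isupper() else token[0]
    return surname.lower(), initials.upper()

def may_match(a: str, b: str) -> bool:
    """True if the surnames agree and one set of initials is a prefix of the other."""
    (surname_a, initials_a), (surname_b, initials_b) = name_key(a), name_key(b)
    return surname_a == surname_b and (
        initials_a.startswith(initials_b) or initials_b.startswith(initials_a)
    )

for form in ["J.N. Smith", "J N Smith", "JN Smith", "Dr. J. Smith", "Prof. J. Smith"]:
    print(form, name_key(form), may_match(form, "John Nathan Smith"))
```

Grouping of this kind is only a first step: the employment details and date stamps described above are what would distinguish two different people who both happen to be J. N. Smith.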

Such metadata records can be produced in a relatively simple way, depending on how much semantic information is necessary or useful. At the simplest, a major metadata schema such as Dublin Core can be used with repeated fields for forms of the name of an entity associated with a resource. More complex metadata schemas such as the Common European Research Information Format (CERIF) can be used to encode more complex information, e.g. the dates when a particular academic’s pre-marital and married names were valid, which could be cross-referenced to her or his publications to ensure that they were correctly cited and/or verifiable against the publisher’s information. Similarly, the name or changing names of the organisation(s) and their sub-divisions at the time of the production of a teaching resource or academic article could be verified. Where these records disagree, it is possible to programmatically establish the likelihood of errors by comparing different resources and thus correcting and adding to (or enriching) the metadata held about resources and their relationships.
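
The value of date-stamped name forms can be sketched as follows. This is a plain Python stand-in rather than CERIF itself, and the names and dates are invented for illustration: given the publication date of an article, the record yields the name form that was valid at that time, which can then be compared with the name the publisher holds.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class NameRecord:
    name: str
    valid_from: date
    valid_to: Optional[date] = None  # None means the name is still valid

def name_on(records, when):
    """Return the name form valid on the given date, or None if no record covers it."""
    for record in records:
        if record.valid_from <= when and (record.valid_to is None or when <= record.valid_to):
            return record.name
    return None

# Hypothetical academic whose name changed in mid-2004.
history = [
    NameRecord("A. Jones", date(1995, 1, 1), date(2004, 6, 30)),
    NameRecord("A. Evans", date(2004, 7, 1)),
]

print(name_on(history, date(2001, 3, 15)))
print(name_on(history, date(2010, 9, 1)))
```

A disagreement between the name returned here and the name on a publication record is the sort of discrepancy that could be flagged for checking.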

However, it is useful to have an overall, canonical URI to combine with a record or records about individuals, organisations or other entities, for example if some sources of information are less well curated than others. Uniqueness can be determined by methods such as Universally Unique Identifiers (UUIDs) that rely on mathematical probability to be functionally unique. While convenient for machines, these must be combined with strings of characters representing common names or words in order to be meaningful to a human as well, which is easily achieved in markup languages such as RDF, or just in simple XML. It is also useful to be able to apply unique identifiers to documents, such as the Digital Object Identifier (DOI). This is often used, for example, by publishers to give a unique identifier to a published electronic resource such as an academic article, but could be used for any document. Metadata attributes are associated with the identifier so that persistent identification is provided for that document together with relationships that the document has, e.g. to its author.
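
As a small sketch of pairing a machine-oriented identifier with a human-readable label, the following uses Python's standard uuid module. The record layout and the function name mint_identifier are assumptions made for illustration only.

```python
import uuid

def mint_identifier(label: str) -> dict:
    """Pair a randomly generated UUID, written as a URN, with a human-readable label."""
    return {"id": f"urn:uuid:{uuid.uuid4()}", "label": label}

record = mint_identifier("John Nathan Smith")
print(record["id"], record["label"])
```

The UUID is unique only in a probabilistic sense and carries no meaning in itself, so the label and any further metadata must travel with it.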

In terms of Web services such as social networks, which may include academic and other professional networks such as Academia.edu and LinkedIn, the ability to judge whether a particular public-facing user account is the same as a named user on another service is also important to the discovery, for example, of potential collaborators or professional rivals and their work, be they individuals or organisations. This may, for some individuals, extend into personal social Web services that are partially or wholly used for academic or other professional purposes, such as Twitter. Purely academic identifiers also exist, for example commercial identifiers such as those offered by Thomson Reuters at ResearcherID.com, and the Scopus Author Identifier; or those provided by national organisations or within the public domain, such as the International Standard Name Identifier (ISNI) and the ORCID researcher identifier.
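
Some of these schemes build a degree of error detection into the identifier itself. The sketch below assumes the structure that ORCID documents for its identifiers: sixteen characters, of which the last is a check character computed with the ISO 7064 MOD 11-2 algorithm. The function names are hypothetical, and the identifiers used are ORCID's own published samples. A well-formed identifier is not necessarily one that has been assigned to anyone.

```python
def orcid_check_digit(base_digits: str) -> str:
    """Compute the check character from the first 15 digits (ISO 7064 MOD 11-2)."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)

def is_well_formed_orcid(orcid: str) -> bool:
    """Check length, digits and check character; does not look the identifier up."""
    compact = orcid.replace("-", "")
    if len(compact) != 16 or not compact[:15].isdigit():
        return False
    return orcid_check_digit(compact[:15]) == compact[15].upper()

print(is_well_formed_orcid("0000-0002-1825-0097"))
print(is_well_formed_orcid("0000-0002-1825-0098"))
```

A check of this kind catches most mistyped identifiers before they enter a metadata record, but confirming who an identifier belongs to still requires the registry.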

Current Issues:

Within academia, the business of unique identification has advanced at different rates for different types of resources and related entities that are available or described on the Web. On the one hand, it has been in the interests of publishers to make sure that academic articles from which they derive profit are uniquely identified in order that they can support their subscription income. It is not surprising then that the DOI scheme is the most widely implemented scheme for digital object identification, or that it has been most widely applied so far to such published materials. Pre-existing international schemes such as ISBN and ISSN have been relatively easily re-used in the Web context.

On the other hand, there has been significantly less progress towards a de facto internationally recognised standard for unique identifiers for individuals. The ISNI identifier is widely used by public funding bodies but the individuals and even the organisations described by it have no direct input into the accuracy or presentation of the information except initially through their national research assessment schemes, e.g. the Research Excellence Framework (REF) in the United Kingdom. The ORCID scheme has the provisional support of funding bodies represented by Research Councils UK and by similar organisations in other developed countries, as well as by major publishers; however, it is still in the early stages of development and it remains unclear whether the necessary adoption will occur among academics. The approach taken by ORCID is for organisations to seed the information about their academic staff but for the information to be controlled by the academics themselves and authenticated by the trusted institutions who employ or fund them. At present, ORCID appears to have considerable initial support and is a fast-developing standard.

The unique identification of organisations has progressed less far. An academic may have worked at several institutions, each of which may have alternative names (e.g. bilingual names in Welsh or Gaelic) or which may have changed their names. They may work within more than one research group, department and/or school, which may have been reorganised, merged or de-merged over time with resultant effects on the appropriate nomenclature at the time that a particular resource was created or published. As yet, although some metadata schemas such as CERIF have the ability to mark up increasingly complex information of this type, there is little evidence that this is being done by academic institutions. In countries such as Australia, there may be national institutions, e.g. the Australian National Data Service (ANDS) that maintain or provide approved lists of names. However, there is no evidence that national bodies who keep such lists for their own purposes have made these widely available in machine-readable, date-stamped formats for programmatic reuse such as metadata verification and enrichment.

Individual software services may create widely used methods of identification for their own purposes, for example within the repository software DSpace, Fedora and EPrints. As supplied by default, however, these are not especially reusable outside the software, given that the agreed international standards for identifying individuals and organisations are not well established. It is always possible to include any unique identifier, such as ResearcherID or ORCID, in a record, but the usefulness will be limited if there is no consistency of practice in terms of which metadata field is used in each local customisation. There are numerous useful approaches such as that taken by the developers of eSciDoc in the Control of Named Identities (CoNE) service, which can be used either with the Fedora-based eSciDoc or as a standalone service. CoNE, for example, can provide a means to uniquely identify any entity. However, these have not seen wide adoption beyond their specific software communities.

The CERIF metadata standard can perform a similar function with an almost arbitrary level of granularity of information, yet it has not been nearly as widely implemented in software as the Dublin Core metadata standard which has no such abilities but which is far easier to implement. In the same way, Dublin Core application profiles provide an overlay of more complex metadata built on top of simpler Dublin Core. However, these are used largely in limited, specific information retrieval environments, e.g. libraries or specific, subject-based services and have not developed into widely used, de facto standards.

There are numerous approaches to cross-referencing which online identities in professional and social networks belong to the same individual. Most of these, however, rely on the services themselves providing a means to do so. As these are commercial, there is no guarantee that they will continue to do so in future if commercial rivalries later cause that to be outside their own interests, or that the information will be publicly viewable, even if the user chooses to set privacy settings to allow this, or is even able to do so.

However, the non-commercial means to state such equivalences are of uncertain value too. Most users who have an OpenID, for example, which is effectively a user account attached to a unique identifier and metadata about that individual, only have one because they use a service like Facebook which, usually without their knowledge, provides one. It may not be obvious to a non-technical user that it can be in their interests to state publicly which other professional services may represent them. While Friend Of A Friend (FOAF) allows relatively simple declarations about individuals and their relationships with other entities, most individuals do not have the technical knowledge or means to provide markup in such files on the Internet, nor are they even aware of what such files can achieve. It is not clear to what extent commercial search algorithms used by the major search engines may take FOAF, or indeed any other metadata source, into account, which may impact upon the practical value of providing such metadata.

Why should universities care about identifiers?

Talat Chaudhri — Fri, 17 Aug 2012 15:19:53 +0000

Why do identifiers matter for research?

Imagine that you are a senior manager in an institution within the UK Higher Education sector with responsibilities for research: you have read some basic details about unique researcher identifiers and perhaps institutional identifiers. However, it may not be immediately apparent just how important these issues are, which may seem on the face of it to be a relatively superficial and/or trivial organisational matter. Clearly, any such strategic decision-maker will long have been aware of the demands of the Research Excellence Framework (REF) and its predecessor the Research Assessment Exercise (RAE), in which successful reporting of the best research outputs of university departments is crucial to the on-going funding of the institution. This is particularly central to the work of research-led universities, which is an increasingly competitive sector: even universities that formerly focussed more on teaching than research are increasingly aware of the need to drive up standards of quality research in order to secure additional funding.

The reality of unique identification in research

However, as anyone who has actually engaged with the business of research reporting to any degree will tell you, it is far from a superficial or trivial matter to carry out such an exercise without thinking very carefully about how researchers are identified. Moreover, identifying the research groups, departments, projects and institutions that they may have variously belonged to at different times, all of which may have been re-organised on many occasions, is a considerable challenge that raises technical as well as organisational issues.

Perhaps the biggest problem of all derives from the scale of research reporting. On such a massive scale, it has to be done in a systematic way across higher education institutions in order to be useful. Any lack of a systematic approach in collecting the information on the institutional level will inevitably result in higher costs in processing the information later into a useful form, for example by governmental organisations such as HESA and the Research Councils (RCUK) relevant to each area of academic study. This may be carried out for a variety of reasons, amongst them for example:

The need to produce statistics at a national and at an institutional level in order to gauge how successfully different parts of the research community are performing in comparison to each other and to similar institutions internationally, which may be a determinant of how funding is allocated.
The production of good, widely accessible information about the work of academic researchers and research groups for the purposes of future research, both in identifying research as a basis for future work and for guiding individuals and groups in terms of who they might work with in future, who their competitors may be, and in creating wider bibliographic information for a whole range of purposes related to future publications.
Open Access, an increasing requirement imposed by funders where research is publicly funded.
Accountability in the use of public funds for research.

It is precisely the lack of a national approach to providing consistent metadata about individuals and groups connected with research that raises costs, creates inefficiencies and frustrates the development of new software functionality. This makes the jobs of research managers more difficult and ultimately reduces both the funds available to research and their best use within the sector. It is therefore the business of senior managers of academic research to care about identifiers.

Researcher identifiers: a crucial first step

Before any wider metadata about research may be considered, the most fundamental issue is identifying individuals who carry out research. Before this happens consistently on a national level, there is little point addressing the subsequent issue of identifying groups and institutions engaged in research consistently. It is also important to consider any national approach in terms of interoperability with other international approaches wherever possible: while, on the one hand, funders and statistics agencies can only hope to mandate national identifier schemes, at the same time it is clear that research collaboration is cross-institutional and international in scope, in some cases including researchers from numerous countries in one project or even in the production of one individual paper, data set or other research activity. This is the approach that has been taken by the JISC, together with RCUK, HESA and other partners in setting up the Research Identifiers Task and Finish Group, which is due to report in October 2012.

One emerging candidate with cross-sector and international support is the ORCID researcher identifier scheme, whose rapid development in 2011-12 is scheduled to culminate in a public launch in October 2012. There are, of course, existing, widely-used but relatively simple identifiers such as the HESA researcher identifier, and identifiers provided through commercial providers' web interfaces, but thus far these have not provided dependable unique identification. All such identifiers could be linked to a system like ORCID that is designed on interoperable principles and is not dependent on any particular software platform or web interface. An alternative approach is taken by the ISNI number: whereas ORCID seeks to offer individual researchers and institutions the ability to manage their data on a distributed model, ISNI represents a centrally moderated, bibliographic approach led by national libraries and other similar institutions with national and strategic responsibilities. It remains to be seen whether these different approaches are in competition or whether they will offer different but complementary functionality within the sector, and much may be dependent on how software vendors implement them.

Current Research Information Systems (CRIS)

It is not simply a matter of tracking publications and other related outputs, for example in institutional repositories. This part of the equation is by now relatively well established in the UK HE sector, although it continues to develop: the issues surrounding Open Access, for example, have not been fully resolved. This, however, is just at the level of the final outputs of research and does not provide anything like sufficient insight into the processes of research, the projects and groups carrying it out, the staff involved or the costs. Traditionally, this information has been gathered in a very long-winded process that is individual to each institution's particular workflows and processes (although there are obviously great similarities of approach between them), often a partly paper-based exercise that has been migrated to an extremely varied range of systems and databases, few of which are interoperable or complete. Many departments may be involved in the process apart from the institution's research office and the department in which the researchers are based, but perhaps the most significant would be the finance office, the human resources department and the library, to name just the key players. It will be necessary to keep some information confidential, e.g. personal staff information, salaries and so forth, to share some information internally and with research funders, and to publish other information, e.g. in a research repository that forms the institution's "shop window" of public outputs, library databases and so forth. The term Research Information Management (RIM) has emerged to cover all of these information gathering and information processing activities.

In order to do this systematically, more sophisticated research information management software has been developed, often known as Current Research Information Systems (CRIS). The market in the HE sector is currently led, in terms of the number of institutions adopting the software, by PURE, produced by ATIRA; other major players are Symplectic Elements, and CONVERIS, produced by AVEDAS. A more recent entrant to this market is Thomson Reuters' Research in View. There are currently no open source products, although a JISC-funded modular approach by the Research Management and Administration Service (RMAS) project may have an increasing impact in this area, depending on subsequent adoption by HE institutions. It is not an overstatement to say that HE institutions are currently in a rush towards early adoption of these CRIS systems, motivated by the need to use research data to compete with each other for funding opportunities.

Next steps: organisational identifiers

In the next 2-3 years, it is likely that the matter of unique researcher identification will be resolved through the emergence of a dominant standard that has sufficient take-up and leverage in the UK and international HE sector to facilitate the work of research institutions and funders. Following this, there will be organisational structures associated with research that will require unique identification, often on a multi-layered basis: for example, a project may be at several institutions, perhaps internationally, and their staff may be in various departments or similar units whose names have changed or have been merged or de-merged at various times, all of which will require careful date and time stamping to make the information reliable for the period that it covers. There will be issues related to copyright, commercialisation and spin-off companies that make the precise provenance of research critical to the future success of academic research and development. Standards for organisational identifiers are therefore the next important issue on the horizon. As with researcher identification standards, research managers and senior managers with strategic responsibility for research will need to keep abreast of this rapidly developing area.

The business of unique identification

Talat Chaudhri — Thu, 23 Feb 2012 23:58:26 +0000

What need is there for unique identifiers?

Put in relatively non-technical language, there is an increasing concern in information science in general to uniquely identify different things, organisations or people that could otherwise be confused, whether on the Internet or in the physical world. In technical terms, these are all referred to as resources (even if people might find it vaguely demeaning in normal language to be considered as such). This need, whether real or perceived in any particular context, has grown as the complexity of information available on the Web has grown almost exponentially, increasing the potential for confusing similar resources.

Why aren't names good enough?

1. People

It is not necessarily enough to have a name, since even a relatively unusual combination of names might easily not be entirely unique from a worldwide or even universal perspective: at the basic level, John Steven Smith might be unique in a place called Barton, but even if you cross-reference the name with the place, two people with the same name could easily be confused, for example if there are several possible places called Barton.

My own name, Talat Zafar Chaudhri, might appear to be more unique until you realise that these are all fairly common names in the Indian subcontinent and thus in the Indo-Pakistani diaspora, so it is reasonably possible or even fairly likely that another named individual exists with this particular choice of spelling (of which others may exist). I am also Talat Chaudhri, T. Chaudhri, T Chaudhri, T.Z. Chaudhri, TZ Chaudhri and similar variations (with or without spaces and punctuation) that might make it harder to decide which individuals to reconcile as a single individual, especially by machine processing. At least I do not vary the spelling of my surname, but some people may, especially in cases such as my own where other transliterations could be possible: for example, my father previously used the spelling Chaudhry and many others such as Chaudry, Chowdhary and Chowdhuri are equally possible. I understand when companies misspell it, but a computer might not be sure if these were definitely the same person, even if it went to the lengths of calculating a probability for this.
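
To illustrate the point about machine processing, here is a minimal sketch in Python of how a program might compare name forms. The function names normalise and similarity are invented for this illustration, and the score produced by the standard library's difflib module is only a spelling similarity ratio, not a true probability that two records describe the same person.

```python
import re
from difflib import SequenceMatcher


def normalise(name: str) -> str:
    """Lower-case a name and drop spaces, punctuation and digits, so that
    'T.Z. Chaudhri' and 'TZ Chaudhri' collapse to the same string."""
    return re.sub(r"[\W\d_]+", "", name.lower())


def similarity(a: str, b: str) -> float:
    """Return a ratio between 0 and 1; higher means more similar spelling."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()


if __name__ == "__main__":
    # Compare one surname spelling against the transliterations mentioned above.
    for variant in ["Chaudhry", "Chaudry", "Chowdhary", "Chowdhuri", "Smith"]:
        print(variant, round(similarity("Chaudhri", variant), 2))
```

Even a perfect score proves nothing about identity: two different people can share exactly the same spelling, which is why a separate unique identifier is needed.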

Moreover, people change personal titles (e.g. I have been both a Mr and a Dr and I am occasionally still referred to as the former by companies that do not allow for the latter option); they have multiple, changing work roles and work places, and may be known in multiple contexts, e.g. work, social, voluntary roles and similar. At work, one may have additional roles in various professional bodies, so it may not be apparent who is who. Two people might have the same name in a large professional group, e.g. physicists, and may even produce outputs related to the same subject. Who owns which ones? This is a particular issue for electronically available outputs on the Internet, e.g. publications, educational resources, audio, visual or audiovisual resources and so on.

2. Organisations

The same issue arises for organisations. Can we be sure that a Board of Licencing Control is unique? No. Perhaps it is merely a different spelling of the Board of Licensing Control? What if one, but not all, of these were re-named as Burundian Licencing Control? What if the Board of Licencing Control merged with the Department for Regulatory Affairs under either of these names, a combination, or an entirely new name, yet continued its association with the assets of the originals? De-mergers are likewise possible, and may present issues of uncertain ownership of resources.

Perhaps there are organisations with this name in several countries but serving utterly different purposes, and perhaps one is merely one possible translation of a term into English but used natively in another language. Historical names have been used in multiple contexts that may still be valid, e.g. the Irish Volunteers, and these might need to be kept clearly separate from each other. Conversely, there are also organisations that have multiple names or forms of names, whether in one language or in multiple languages or during their history, e.g. Óglaigh na hÉireann is Irish for both the terrorist Irish Republican Army (IRA) and most of its subsequent splinter groups but is also, however, an acceptable name, for historical reasons, for the Defence Forces of the Republic of Ireland, and previously just the Irish Army (an tArm) that now forms a part of it. These are clearly not the same and must be distinguished. It must also be noted that typographical constraints and character encodings will lead to yet more duplicate forms.

Isn't this bigger than the question of unique identification?

Yes, the need for complex metadata to express these things can go far beyond merely identifying resources in a unique manner. However, before one can even start thinking about complex descriptive and relational metadata, one first has to be clear which resource is mentioned: hence the first step must be unique identification of what it is we are talking about. Only once we have done that can we feel reasonably confident about talking about how resources relate to one another and how they may have changed over time.

Overall, there is an ever increasing need to make clear what is meant, as more and more things and agents have on-line identities that need to be distinguished, whether this is as an owner of resources or as a referent within a resource, e.g. the subject of the resource in a particular context, and even of the role played and the relationship to other resources or agents, perhaps in a specific time period. Information models can quickly become extremely complex, and this is certainly true where identity is concerned.

What is an identifier?

An identifier is similar in its basic concept to a name. At its most basic, an identifier in the context of an information system is a token (usually a number or a string of characters) used to refer to an entity (anything which can be referred to). Identifiers are fundamental to most, if not all, information systems. As the global network of information systems evolves, identifiers take on a greater significance. And as the Web becomes more 'machine readable', it becomes vital for all organisations that publish Internet resources to adopt well-managed strategies for creating, maintaining and consistently using identifiers to refer to those assets they care about.

What are unique identifiers?

The simple answer is that unique identification is the only way to avoid misidentification confidently, and therefore to prevent any errors about ownership or rights over resources that might arise, as well as making sure that large bodies of resources contain reliable information generally.

The fundamental question is whether the identifier or token that has been chosen is unique and how best to ensure this. Some identifiers are so complex that mathematical probability makes them effectively unique in the universe, notably UUIDs. In essence, a UUID is no more than a complex numerical token: it is only additional complexity (and thus uniqueness) that it offers compared to, for example, a running number. Others like names can only be distinguished unambiguously by making a series of statements about which names are considered equivalent, which contexts (e.g. a person's work or town) are valid, and so on, where a number of relationships have to be attached to a particular identifier and checked in order to reach an acceptable level of uniqueness and to eliminate any mistaken connections with resources that might be similar in name or perhaps also in other respects by chance.

The problem with UUIDs is that, while the chances of them failing to be unique are, to all practical purposes, non-existent, it is not very clear from a UUID alone what the nature of that resource is. It may be machine-readable but it says nothing about who generated that identifier and when, or which other identifiers might exist for the same resource in different systems that also generated an identifier for the same resource. Consequently, the need to associate other metadata with any complex number or other similar token remains (including but not limited to UUIDs). Simply, no single token can be sufficient for any complex purpose and, at the very least, an electronic or physical resource must be referenced for the token to have any useful meaning at all.
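
As a minimal sketch of this point, the following Python fragment generates a random (version 4) UUID with the standard library and wraps it in a small record that states who issued it, when, and what it refers to. The IdentifierRecord class, its field names and the mint function are hypothetical and exist only for this illustration.

```python
import uuid
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class IdentifierRecord:
    """A token plus the minimum context needed to make it meaningful."""
    token: uuid.UUID      # the arbitrary, effectively unique token
    issued_by: str        # who generated it (hypothetical field)
    issued_at: datetime   # when it was generated (hypothetical field)
    refers_to: str        # the resource that the token identifies


def mint(issued_by: str, refers_to: str) -> IdentifierRecord:
    # uuid4() uses 122 random bits, so a collision is vanishingly unlikely,
    # but the token alone carries no information about the resource.
    return IdentifierRecord(
        token=uuid.uuid4(),
        issued_by=issued_by,
        issued_at=datetime.now(timezone.utc),
        refers_to=refers_to,
    )


if __name__ == "__main__":
    print(mint("Example University repository", "http://example.org/papers/1"))
```

The token stays opaque; everything a person or another system needs in order to interpret it lives in the accompanying metadata.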

This is effectively what a URL is: another type of token. While I will not go into the whole discussion about URLs and URNs as sub-types of URIs, it is worth noting that, in many quarters, the term URL is no longer preferred despite it being the most commonly used in practice. In strict terms, there is a clear difference: while a URI is usually resolvable to an electronic resource, which may be either a description of a physical or electronic resource or may be an electronic resource itself, there is technically no requirement that a URI should be resolvable, i.e. all it needs to be is a token that doesn't necessarily have to represent an address that actually delivers a resource. However, it is usual to use the HTTP scheme, which is designed for delivering such a resource, so it would be somewhat eccentric and misleading if one were deliberately to choose an ostensibly resolvable syntax that does not in fact resolve. In effect, virtually all such URIs are also URLs (unless a resource has become unavailable and link rot has set in), since the latter must locate the resource or representation of it: this is inherently useful. Any URI that resolves, i.e. a URL, will be effectively unique within the standard Domain Name System (DNS). As a result, there is no absolute need for UUIDs in many contexts, since a sufficiently unique and practical token already exists in the URI. Any unique but arbitrary token serves the core purpose here.
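
The distinction can be sketched with Python's standard urllib.parse module. The describe function and both example identifiers below are invented for illustration. Note that the check is purely syntactic: it says whether a URI uses a scheme designed for delivering a resource, not whether the address still resolves.

```python
from urllib.parse import urlparse


def describe(uri: str) -> dict:
    """Split a URI into its scheme and host and say whether its syntax
    suggests that it is meant to be resolved over the Web."""
    parts = urlparse(uri)
    return {
        "scheme": parts.scheme,
        "host": parts.netloc,
        "looks_resolvable": parts.scheme in ("http", "https") and bool(parts.netloc),
    }


if __name__ == "__main__":
    # An HTTP URI: a token that is also an address (hypothetical example).
    print(describe("http://example.org/id/person/42"))
    # A URN: a valid URI that names something without locating it.
    print(describe("urn:uuid:123e4567-e89b-12d3-a456-426614174000"))
```

Link rot is invisible to this kind of check, which is one reason why a URI's usefulness as a locator depends on the continued maintenance of the service behind it.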

Aren't identifiers really just names?

Yes and no. Names are intrinsically arbitrary too when they are first given. However, they are identifiable on a number of levels from a human perspective. In addition to a combination of names belonging to one or more particular linguistic and/or ethnic origins and usually identifying gender, they quickly become associated with a particular person, so their use in uniquely identifying that person within a given context becomes central to maintaining the person's reputation in whatever they do. This is, for example, particularly important to academics in Higher Education. In modern times, this name resolution needs to be done globally wherever the Internet is the context, whereas previously it would have been possible to use fewer additional pieces of information in more restricted contexts (e.g. a village, a country etc), depending on the purpose. These different contexts still co-exist but it is now necessary to provide as many as possible, since one cannot control or predict why the information is being requested in each instance on a global system such as the Internet.

How does this affect Higher and Further Education?

Increasing numbers of professionals and the bodies that they work for and represent need to describe their resources on the Internet, whether those are in themselves electronic resources, whether they are descriptions of electronic or physical resources (metadata), or whether they are other representations of physical resources, perhaps in addition to themselves being electronic resources (e.g. photographs). This is a particularly pressing issue in Higher Education and, to an increasing extent, in Further Education. Academic outputs may include publications, educational resources, visual, audio and audiovisual resources and so on. Perhaps the best known is the issue of scholarly publications, partly through the rise of the Open Access movement to make such resources freely available.

There is already a range of identifiers for academics and related professional university staff. One of the problems is that these are created for specific purposes that only cover whichever subset of staff is relevant to those purposes. For example, HESA keeps records that contain a HESA number for academic staff, which means that at least those who have published academic outputs will have such a number. Another number called the HUSID number is maintained for students, since tracking academic careers from student to staff is one important concern for HESA. Many academics in relevant fields may have ISNI numbers, which are used widely in the media content industries. Many academics will have one or more professional staff pages, including within repositories and Current Research Information Systems (CRIS), each with a URI, not to mention OpenIDs and URIs associated with Web services which they use professionally and/or privately, e.g. LinkedIn, Academia.edu, Facebook, Twitter and so on.

Here are some examples belonging to Brian Kelly of UKOLN:

The problem is that the coverage of these numbers is not universal within the HE sector, and there is no single recognised authority or other agreement to prevent and resolve conflicts where information is not consistent between two or more information sources.

At present, the JISC are trying to solve this through the Unique Identifiers Task and Finish Group, which also includes representatives of HESA, HEFCE, the various Research Councils in the UK and UKOLN. The preferred solution is currently the ORCID academic identifier, which is being developed internationally with publishers, with a great deal of input from the United States in particular.

In order to succeed, any such identifier will need international penetration of the higher education sector, since academics will not use it unless it delivers the sorts of interoperability benefits that make their work easier and become integrated into the recognised systems required of them by funders and publishers in the course of their work. Since students and academics change roles and institutions, this needs to be recognised and outputs properly allocated to institutions and departments, which may themselves change identities, merge and de-merge over time.

While institutions will need to reduce the workload on academics by bulk loading information about staff, since the main incentive to use the system is that every academic has a record, there is also an issue about control. Should academics have the ability to alter their records at will? Are assertions automatically trusted or does a particular record for an academic's time at an institution need to be verified by that trusted body? Who should maintain a list of trusted bodies who can back up assertions? How will this effort be funded sustainably? It becomes clear that some of these points are central structural concerns whereas others may cover only fringe issues such as avoiding deliberate falsification, which may be rare.

Proprietary academic identifiers

There are also a number of proprietary identifiers associated with different commercial services related to electronic publishing and related academic service industries. Thomson Reuters and Elsevier provide identities for individuals and organisations as part of their bibliographic and academic services; similarly, search services such as Google Scholar (see the study in this blog post) and Microsoft Academic Search have also started to offer identifiers (see this blog post). There may be privacy issues, for example in Google and Microsoft publicly surfacing information about researchers without explicit consent: while this information might have been suitable for the limited purpose of publication, academics may not have intended for it to be synthesised into a single, public description of their personal details available to all.

Some of these services introduce new problems, since their primary purpose is commercial and it is often less of a priority to deal with the internal issues facing academic institutions unless that impacts significantly on the ability to make commercial profit. These may be resolved over time or be reintroduced as services change and compete: the academic has little or no control over the effects of commercial decisions upon their work. For example, Microsoft Academic Search often misrepresents outputs as belonging to similarly named individuals (thus is currently failing at unique identification) and, by default, requires the manual input of researchers to edit out errors and take a proactive approach towards managing the information about themselves. This brings the overall quality of data into question: for large-scale statistical purposes, this could be tolerable, depending on the degree of error; however, for academic citations and reporting purposes such as the Research Excellence Framework (REF), it would not be acceptable to use this data without further refinement, which would most likely remain a long, manual process.

Software and services

Any software application layer, whether operated by commercial companies, higher educational institutions, funders or governmental bodies, needs to be maintained. If information is harvested or processed automatically, it needs to be clear who corrects information where errors are found and what the resources are for academics to contact individuals with the time and effort available to improve the data as part of their work. In the case of commercial organisations, this is usually unclear and may change. There is no guarantee that the commercial reason for providing services will continue over time, unlike in most cases in the public sector within Higher Education. Coverage of such commercial services is often geared towards institutions rather than individuals: for example, Google Scholar requires registration using a valid university email address that it recognises, which would exclude private scholars and perhaps some retired staff who produce research.

The Web of Things

It has already been mentioned that electronic descriptions or other representations of physical objects may be found on the internet, including written descriptions, pictures, geographical locations, dimensions and so on. It is even possible to describe physical objects that were extant but are now historical, or which have moved or whose location is now unknown, referencing comparable objects and linking these descriptions with other resources that are related. In each case, the nature of the relationship, relevant agents who may have been responsible for it, and when it was valid can be described in metadata.

This opens the way for the Web of Things, a term used to describe that part of the Semantic Web that covers physical resources as opposed to, or as well as, purely electronic ones. Some authorities use the term to mean physical objects with miniaturised electronic devices to enable them to be located, whereas others merely mean any physical object that is described in a record on the Web. It may be argued that all electronic resources have relationships to physical ones, even if that is only with regard to authorship and subject. The Resource Description Framework (RDF) provides a means to describe these relationships and transmit information about them in ways readable to humans and machines. Although these are usually expressed as triples, where two things are described with a relationship between them, metadata structures such as the Common European Research Information Framework (CERIF) can add link tables that give far more detailed information about the relationships themselves. All of this can be made available as Linked Data and surfaced in many software applications on the Web.
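
The difference between a bare triple and a richer link record can be sketched in plain Python. This is only an illustration of the idea: the Triple and Link classes, the role value and the example.org URIs are hypothetical, and the Link record is loosely inspired by the kind of dated, role-bearing relationship that CERIF link tables describe rather than an implementation of CERIF itself.

```python
from dataclasses import dataclass
from datetime import date
from typing import NamedTuple, Optional


class Triple(NamedTuple):
    """Two things joined by a relationship, as in an RDF statement."""
    subject: str
    predicate: str
    obj: str


@dataclass(frozen=True)
class Link:
    """A relationship that also records its role and period of validity."""
    subject: str
    obj: str
    role: str
    start: date
    end: Optional[date] = None   # None means the link is still current

    def valid_on(self, day: date) -> bool:
        return self.start <= day and (self.end is None or day <= self.end)


if __name__ == "__main__":
    paper = "http://example.org/papers/1"
    person = "http://example.org/id/person/42"
    dept = "http://example.org/id/org/information-studies"

    print(Triple(paper, "http://purl.org/dc/terms/creator", person))

    membership = Link(person, dept, "member", date(2008, 1, 1), date(2011, 12, 31))
    print(membership.valid_on(date(2010, 6, 1)))
```

A triple can say that a person belonged to a department; a dated link can also say when, which matters once departments merge, de-merge or change their names.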

The Semantic Web is often seen as a utopian view of a future where no electronic resources will be published without complex information being provided or automatically generated about their origins. The reality is that manual entry of information is generally very limited unless it serves the purposes of the person entering it, and this cannot be relied upon as an approach to ensuring large-scale, consistent metadata on a sufficient scale for the Semantic Web to work. Technology has in some cases improved to the extent that geographical and technical information is now automatically produced, for example in digital cameras and in mobile phones able to record GPS coordinates.

However, given the effort and cost required, it is highly doubtful whether cataloguing the entire physical world is even possible. Where the Semantic Web could be useful is within particular large bodies of data, for example experimental scientific data, publications and so on. In the case of the Web of Things, this could include art collections, photography, archaeological information, the locations of public institutions and many more. For all of these purposes, it will be necessary to provide unique identifiers for increasingly large numbers of resources, including things and agents, in order to provide complex metadata about them.

Education in the wider world

It has perhaps not been sufficiently investigated how unique identifiers for researchers and other staff in Higher Education will fit into the wider question of unique identification on the Web. Relevant purposes might be:

(1) commercial, for example the identification of companies and individuals owning the rights to photos, music, video or publications, particularly legacy resources of ongoing commercial value in terms of royalties and performance licensing.

(2) governmental, for example biometric information about people, used in border controls, crime prevention and citizenship contexts; or about public or private organisations such as charities, political groups of interest to law enforcement etc. Information about individuals, in particular, may be subject to privacy laws, which will vary between jurisdictions.

It is clear that there are interfaces between the various agents and outputs of academic institutions and many other purposes, notably those commercial and governmental activities already described. For example, a foreign student or member of staff seeking a work permit will require institutions and governmental bodies to use personal and citizenship information co-operatively, which will be linked to their academic identity in the course of their work at the institution. Some of this information will be private and some public, so there is an issue about who can see which parts of a particular corpus of Linked Data, requiring authentication protocols and systems.

The extent to which consistency of approach between HE institutions and other sectors and contexts can ever be ensured is moot, since there is of course no single international authority and because any single metadata solution that tried to cover so many diverse purposes would be fatally unwieldy. How different, flexible approaches can be understood by machine processing is perhaps the technological key to how well the Semantic Web will answer these questions in future, both within Higher Education and beyond.

ORCID Outreach Event at CERN

Ben O'Steen — Thu, 22 Sep 2011 10:23:00 +0000

Program

10:00 Welcome and what’s new – Howard Ratner, ORCID Chair (Slides [PPTX 2.55Mb])

Talk discussed:

Key quote “ORCID will work to support the creation of a permanent, clear and unambiguous record of scholarly communication by enabling reliable attribution of authors and contributors”

Re-statement of the 10 ORCID principles

Various demographics and participant statistics

Illustration of how the Trusted Partners can give more weight to the assertions made in a profile by a researcher by ‘agreeing’ (same_as):

An overview of other researcher ID initiatives and some bullet points on why they feel ORCID is different:

Only not-for-profit contributor identifier initiative dedicated to an open and global service focused on scholarly communication

ORCID is backed by a non-profit organization with over 250 participants behind it

ORCID is backed by many different stakeholders

Publishers are an important ORCID stakeholder but are just one part

ORCID is serious about building an open system

ORCID is the only researcher identifier that is not limited to discipline, institution or geographic area

ORCID is the one to bridge them all by registering the identifiers of all other relevant standalone services (silos big and small)

10:30 What ORCID already does and will do next – Brian Wilson and Geoff Bilder for the Technical Working Group (Slides [PPTX 3.8Mb])

Talk covered:

Development approach, timeline and progress overview

Discussion of the form of ORCIDs as URLs

Overview of what the Query API will provide (non-technical)

Details of the VIVO/ORCID collaboration and code resulting from that.

11:00 Open Q&A on the above

11:30 Cool, but who’s going to pay for that – Craig Van Dyck and Ed Pentz for the Business Working Group (Slides [PPTX 1.19Mb])

Talk covered:

Details of the financial models and projections for the ORCID project

Expected cost to institutions, publishers and funders

$2.75 million required as investment capital (to be paid back after the project breaks even)

13:30 ORCID and me: synergies – Each followed by animated discussion with the audience

ORCID and researchers – Cameron Neylon, STFC

Cameron’s key points were:

Without giving researchers total control over their data and their profile, the system will fail. This includes the power to not list works and co-authorship that the researcher does not want to show.

The most authoritative information you have about a researcher, WILL be from the researcher. Not the institution, not the publisher, but the researcher. It is up to them to specify what is ‘true’ or not.

Researchers wanted three things:

Online profiles that could be used to generate CVs (as maintenance-free as possible) – “It should just know about what articles I publish”

Tracking and aggregation of non-standard outputs in repositories (eg Data, software). This also relates to an identifier being used as a marker that I can use to say “This is a scholarly output for me” even on non-traditional outputs (eg blog posts)

And this is the key. Automating and simplifying grant submissions systems but critically manuscript submission systems. That got clearly the most votes, is probably actually the most tractable and offers the most opportunity for immediate traction with researchers.

ORCID and data – Jan Brase, DataCite (Slides [PPT 0.5Mb])

Provided an overview of DataCite and why it exists (no current convention for citing datasets, attributing impact to them or linking them to the articles which use them)

“DataCite is part of ORCID as ORCID is a community, DataCite is about linking all types of scientific content together, and author identification is one of the key issues”

DataCite search interface: http://search.datacite.org/ui

An example PANGAEA dataset (NB not the one used in presentation unfortunately): http://doi.pangaea.de/10.1594/PANGAEA.733100

ORCID and funding agencies – Carlos Morais-Pires, European Commission (Slides [PDF])

Provided the EU context for FP8, and where ORCID and related efforts may fit within the overall strategy, including overarching figures and funding information.

No questions were raised immediately following this talk, but it did give a very good context to the levels of money that the EU is pushing into this area.

ORCID and your university library – Consol Garcia, Biblioteca del Campus del Baix Llobregat (a Prezi which I cannot find online, may be private)

Provided a good illustration of why the ‘first name, last name’ paradigm falls flat for many cultures and languages.

Asked many questions about what ORCID may do to help libraries but also how it could fit within library practices as they currently stand.

[Ben: Fundamentally, it raised more issues about current library practices and their shortfalls than about what a global id for researchers could do]

ORCID and your repository – Najko Jahn, Universität Bielefeld

The presentation gave an overview of the work they had been doing for the past year or more on their repository. They had already begun to tackle the author disambiguation problem, assigning IDs to authors and so on. Librarians suggested which works to attribute to researchers, and the researchers were able simply to confirm or deny that the work was authored by them. They had done so for approximately 300 of their researchers.

The key question he posed at the end was “What would adopting ORCID do for my repository?” which is a perfectly valid question, given the work they had already undertaken to disambiguate. The discussion was slow, but eventually focussed on the difference in scope – their researcher IDs were locally valid without a widely understood API to query about them, and an international ID system would have a global scope, with effort being made so that the API is as simple but useful as possible.

ORCID and your journal – Brian Hole – Ubiquity Press

Talked about how ORCID may work with a small, independent publisher and what made them different from others (publishing by researchers, for researchers)

ORCID Executive Update (Sept 11)

Ben O'Steen — Thu, 22 Sep 2011 06:50:00 +0000

ORCID in a nutshell (current strategy):

ORCID is a registry of profiles for people involved in research – a profile can be created by the person themselves (self-registry) or by what is termed a Trusted Partner, such as a University or Publisher.
The people using the system decide who is and is not a researcher, not the system itself.
A self-registered profile, for “John Smith” for example, can state that it is the same ‘John Smith’ in a profile created by a Trusted Partner and vice-versa. (akin to the semantic web’s “sameAs”)
Profiles which are linked like this in both directions (researcher to trusted partner and back again) are trusted more than a profile without such verifying claims.
Profile data can have varying levels of privacy: fields can be made public (anyone can see the data), protected (only those that a researcher authorises can see the data) or private (only the researcher can see it). It is expected that when profiles are linked in the above manner, the researcher’s privacy settings will cover the data submitted by the other parties too (but this mechanism is by no means confirmed or implemented yet.)
A researcher will be able to authorise other parties to access their protected data using a scheme called OAuth. This is a simple process for the user, and requires little to be remembered on their part. An example Twitter OAuth authorisation can be seen in the first 30 seconds of http://www.youtube.com/watch?v=yhrbmUbF0IE - blink and you’ll miss it.
The main selling point for the system at this time is that it is attempting to save a researcher’s time spent filling in publisher and funder forms for article and bid submissions by having the pertinent details automatically drawn from their ORCID profile (once the publisher/funder’s system has been authorised via the aforementioned OAuth)
The later selling point, when a tipping point of signed up users is reached, is expected to be for the universities, funders and publishers. The ability to draw up an REF return or to see which publications have been made as a result of which project funding is an expected feature.
It is expected that usable ORCIDs will be assigned from Q2 2012
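
The linking and privacy rules in the list above can be sketched in Python as follows. This is an illustrative model only, not ORCID's software or API: the Profile class, its attribute names and both functions are hypothetical, and, as noted above, the real mechanism had not been confirmed or implemented at the time of writing.

```python
from dataclasses import dataclass, field
from enum import Enum


class Visibility(Enum):
    PUBLIC = "public"        # anyone can see the data
    PROTECTED = "protected"  # only parties authorised by the researcher
    PRIVATE = "private"      # only the researcher


@dataclass
class Profile:
    profile_id: str
    owner: str                                     # the researcher described
    data: dict = field(default_factory=dict)       # name -> (value, Visibility)
    same_as: set = field(default_factory=set)      # ids claimed to be the same person
    authorised: set = field(default_factory=set)   # parties allowed protected data


def mutually_linked(a: Profile, b: Profile) -> bool:
    """True only when each profile asserts that it matches the other."""
    return b.profile_id in a.same_as and a.profile_id in b.same_as


def visible_fields(profile: Profile, viewer: str) -> dict:
    """Return the fields that a given viewer is allowed to see."""
    result = {}
    for name, (value, visibility) in profile.data.items():
        if (visibility is Visibility.PUBLIC
                or viewer == profile.owner
                or (visibility is Visibility.PROTECTED
                    and viewer in profile.authorised)):
            result[name] = value
    return result
```

In this sketch a claim made in only one direction does not count as verified, and a publisher sees protected fields only after the researcher has added it to the authorised set, which is the step that OAuth would carry out in practice.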

Money:

(much of the following is taken from Ed Pentz’s powerpoint presentation: http://orcid.org/sites/default/files/bwgsep11.pptx WARNING: new Powerpoint required to view.)

Current projections suggest that the ORCID system will require operating costs of around $2.1 million a year for the next few years.
The organisation has approximately 6 months of funding capital left to work with and is on a funding drive at this moment.
It is looking to follow in the footsteps of other CrossRef projects by asking publishers and the like for loans – it projects that it will reach the break-even point in 5 to 6 years.
No researcher is going to pay for access to the service to create and use a profile and its ID.
The Trusted Partners are expected to pay – what the value-added services for these parties might be is still under discussion.
The 5 to 6 years break-even point is based on what seems to be a conservative uptake by these parties – however, the system still needs to be sold to them! The following figures are extremely preliminary (tiering is based on number of people/size of organisation):

[Ben: Just repeating - these figures are pre- pre- pre-alpha and subject to change at the drop of a hat. In fact, I'd bet that they already have]

Things yet to be dealt with (my opinion):

Whilst no-one has stated a problem with ORCID’s software being Open Source, it has yet to be released as an Open Source Project. The code base that they are working on, IP belonging to Thomson-Reuters, has been scrubbed of any Thomson-Reuters specific code and they (T-R) have agreed that it is suitable to be placed under an OSI licence. It just hasn’t been done yet.
The ORCID software release was planned to be just a deployable .war file – without source code. This obviously is not acceptable if the O in ORCID is to continue to stand for Open (in spirit if not pedantically.)
How privacy is to be handled with multiple parties asserting various pieces of information is not yet decided or agreed upon. This type of functionality is quite a deal-breaker for many academics.
How malicious or false claims are going to be dealt with, at a policy level, has not been clear. What level of recourse will an individual have against false claims made (mistakenly) by a trusted partner and vice-versa? Researchers making multiple accounts? Profiles made by bored teenagers for ‘fun’?
There is still a short-term investment funding gap of $2.75 million – it remains to be seen what happens if the code has still not been made open source by the end of the six months, should no other sources of capital be found.
Whilst other identifier schemes can be easily included within an ORCID profile, it is not clear whether – at an organisational level – they would be happy if another organisation used the ORCID code to set up another ‘ORCID’ system. Due to the timeline of when ORCID might go live (Q2 2012), the urgency with which other organisations require identifiers might force other systems to be put into place much earlier. For example, as Andrew Treloar jokingly quoted on the ORCID outreach event’s live chat: “If you guys have an ORC-ID, then I want an ELF-ID” – could the next ORCID-free six months force some funders to take matters into their own hands?
ORCID exit-strategies – both for the organisation and for individual profiles. What happens when the money runs out? What happens to the data? If someone wanted ‘out’, is there a way for them to remove all their data and take it with them? (in a similar vein to http://www.dataliberation.org/)
The authorisation system relies on OAuth (which is no bad thing) but I don’t think that the time required for existing organisations to adopt this has been adequately estimated. ORCID’s use on other systems to save time and effort filling in forms is a crucial part of the ‘sales pitch’ to academics – this hasn’t gotten the visible focus I would’ve expected.

ORCID – a taster of the API

Ben O'Steen — Tue, 13 Sep 2011 10:09:00 +0000

As the official draft API (googledoc) is both in flux and read-protected so that only those invited can see it, I am unable to give you a complete view of how things are shaping up.

However, I can relay a number of key points that everyone involved is concerned about:

It must have sensible (some may say RESTful) URLs
Human and machine-readable data is a must via
- Content-negotiation,
- and optionally, “suffix” negotiation (adding a “.xml” or “/xml”) for convenience.
OAuth is the current plan to share trust, allowing users the greatest control over what and who has access to their live profile data.
Profile creation/editing “By Proxy” is important, but shouldn’t take any control of the researcher’s basic profile information from the researcher themselves.
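
To make the content-negotiation and “suffix” negotiation points above concrete, here is a minimal Python sketch of how a server might pick a representation. It is not taken from the draft API: the function names, the supported media types and the rule that a suffix overrides the Accept header are all assumptions made for illustration.

```python
# Hypothetical sketch: none of these names come from the ORCID draft API.
SUFFIXES = {
    ".xml": "application/xml",
    "/xml": "application/xml",
    ".json": "application/json",
    "/json": "application/json",
}
SUPPORTED = ["text/html", "application/xml", "text/xml", "application/json"]
DEFAULT = "text/html"  # assumed default for human readers


def parse_accept(header):
    """Return acceptable media types from an Accept header, best q-value first."""
    ranked = []
    for position, part in enumerate(header.split(",")):
        fields = [f.strip() for f in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        quality = 1.0
        for field in fields[1:]:
            if field.startswith("q="):
                try:
                    quality = float(field[2:])
                except ValueError:
                    quality = 0.0
        if quality > 0:  # q=0 means "not acceptable"
            ranked.append((-quality, position, media_type))
    return [media_type for _, _, media_type in sorted(ranked)]


def choose_format(path, accept_header="*/*"):
    """Return (resource_path, media_type); media_type is None if nothing fits."""
    # Assumption: an explicit suffix wins over whatever the client sent in Accept.
    for suffix, media_type in SUFFIXES.items():
        if path.endswith(suffix):
            return path[: -len(suffix)], media_type
    for media_type in parse_accept(accept_header):
        if media_type == "*/*":
            return path, DEFAULT
        if media_type in SUPPORTED:
            return path, media_type
    return path, None


print(choose_format("/cid/0723-1814-6587-5983.xml"))
print(choose_format("/cid/0723-1814-6587-5983", "text/xml"))
```

A real service would probably answer the `None` case with a 406 response; that choice is left out of the sketch.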

Some code!

Gudmundur A. Thorisson (University of Leicester and a member of the ORCID Technical Advisory Group) has put together an emulation of certain portions of the ORCID API, including some of the OAuth parts:

https://github.com/gthorisson/orcid-sandbox

Get the code from GitHub, making sure you already have Ruby and Rails installed:

$ cd orcid-sandbox
$ bundle install
$ bundle exec rake db:migrate
$ bundle exec rake db:setup
$ rails server -p 3001 -d

(you may need “$ scripts/rails server -p 3001 -d” instead)

Once you have the code up and running, you should be able to log in, make accounts and so on.

# OAuth-protected access to profile
[mummi@cambozola-2]curl  http://localhost:3001/profile -H "Accept: text/xml"  -I
HTTP/1.1 401 Unauthorized
X-Ua-Compatible: IE=Edge
Connection: Keep-Alive
Content-Type: text/html; charset=utf-8
Date: Thu, 08 Sep 2011 12:20:50 GMT
Server: WEBrick/1.3.1 (Ruby/1.8.7/2009-06-12)
X-Runtime: 0.087887
Content-Length: 0
Cache-Control: no-cache

# Public access to profile
[mummi@cambozola-2]curl  http://localhost:3001/cid/0723-1814-6587-5983 -H "Accept: text/xml" -I
HTTP/1.1 200 OK
X-Ua-Compatible: IE=Edge
Etag: "390c3560fce0064d65dd1373799f13d0"
Connection: Keep-Alive
Content-Type: application/xml; charset=utf-8
Date: Thu, 08 Sep 2011 12:20:49 GMT
Server: WEBrick/1.3.1 (Ruby/1.8.7/2009-06-12)
X-Runtime: 0.035443
Content-Length: 0
Cache-Control: max-age=0, private, must-revalidate

# Sample response (from Mike's XML examples)
[mummi@cambozola-2]curl  http://localhost:3001/cid/0723-1814-6587-5983 -H "Accept: text/xml"
<?xml version="1.0" encoding="UTF-8"?>
<orcid-bio-response xsi:schemaLocation="http://www.orcid.org/ns/orcid_bio_response_1.0.xsd"
    xmlns="http://www.orcid.org/ns/bio_response"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <response_version>1.0</response_version>
    <response_summary>
        <submission-date>09-10-2012 15:50:01</submission-date>
        <completion-date>09-10-2012 15:50:07</completion-date>
        <total-researchers-found>1</total-researchers-found>
        <!-- error message, if applicable -->
        <!-- unable to connect to ORCID, no matching researchers etc -->
        <error-desc>No researcher found for this institution.</error-desc>
    </response_summary>
    <profileList>
        <researcher-profile>
            <!--  ORCID elements will be present for every researcher found -->
            <orcid>XXXXXXXXXXXXXX</orcid>
            <!-- In short, researcher has claimed the profile if confirmed=true-->
            <confirmed>true</confirmed>
            <firstName>Josiah</firstName>
            <lastName>Carberry</lastName>
            <middleName>Stinkney</middleName>
            <other-names>
                <other-name>J. Carberry</other-name>
                <other-name>J. S. Carberry</other-name>
            </other-names>
            <researcher-urls>
                <url>http://library.brown.edu/about/hay/carberry.php</url>
                <url>http://en.wikipedia.org/wiki/Josiah_S._Carberry</url>
                <url>http://www.brown.edu/Administration/News_Bureau/Databases/Encyclopedia/search.php?serial=C0070</url>
            </researcher-urls>
            <institution>
                <name>Brown University</name>
                <address>
                        <addressLine1>38 Brown Street / Box 1920</addressLine1>
                        <city>Providence</city>
                        <state-or-province>Rhode Island</state-or-province>
                        <country>United States</country>
                        <postalcode>02912</postalcode>
                    </address>
                <departmentName>Psychoceramics</departmentName>
                <departmentName>High Energy Metaphysics</departmentName>
                <role>Researcher (Academic)</role>
                <start-date>1929</start-date>
            </institution>
            <bulk-institution>Brown University</bulk-institution>
            <sponsor>Brown University Library</sponsor>
            <affiliate-institution>
                <name> Wesleyan University</name>
                <address>
                        <addressLine1>Wesleyan University</addressLine1>
                        <addressLine2>Czech-Republic</addressLine2>
                        <city>Middletown</city>
                        <state-or-province>Connecticut</state-or-province>
                        <country>United States</country>
                        <postalcode>06459</postalcode>
                    </address>
                <departmentName>Bilocation</departmentName>
                <role>Researcher (Academic)</role>
                <start-date>1930</start-date>
            </affiliate-institution>
        </researcher-profile>
    </profileList>
</orcid-bio-response>
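
As a rough idea of how a client might consume a response like the one above, the following Python sketch parses a cut-down copy of the sample with the standard library. The element names and namespace are copied from the sample. The function name and the shape of the returned dictionaries are my own assumptions, and the format itself is a draft that may well change.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the sample response; the "o" prefix is arbitrary.
NS = {"o": "http://www.orcid.org/ns/bio_response"}

# A cut-down copy of the sample response, kept inline so the sketch is self-contained.
SAMPLE = """
<orcid-bio-response xmlns="http://www.orcid.org/ns/bio_response">
    <profileList>
        <researcher-profile>
            <orcid>XXXXXXXXXXXXXX</orcid>
            <confirmed>true</confirmed>
            <firstName>Josiah</firstName>
            <lastName>Carberry</lastName>
            <middleName>Stinkney</middleName>
            <other-names>
                <other-name>J. Carberry</other-name>
                <other-name>J. S. Carberry</other-name>
            </other-names>
            <institution>
                <name>Brown University</name>
            </institution>
        </researcher-profile>
    </profileList>
</orcid-bio-response>
"""


def summarise_profiles(xml_text):
    """Return one small dict per researcher-profile element (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    profiles = []
    for profile in root.findall("o:profileList/o:researcher-profile", NS):
        def text(tag):
            # Missing elements come back as an empty string.
            return (profile.findtext(tag, default="", namespaces=NS) or "").strip()

        name_parts = [text("o:firstName"), text("o:middleName"), text("o:lastName")]
        profiles.append({
            "orcid": text("o:orcid"),
            "confirmed": text("o:confirmed") == "true",
            "name": " ".join(part for part in name_parts if part),
            "other_names": [e.text for e in profile.findall("o:other-names/o:other-name", NS)],
            "institution": text("o:institution/o:name"),
        })
    return profiles


for entry in summarise_profiles(SAMPLE):
    print(entry)
```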

ORCID: some questions and answers

Ben O'Steen — Tue, 13 Sep 2011 09:38:00 +0000

The following is from an email exchange with Nicky Ferguson. These are my answers to the questions
he posed, and as such shouldn’t be considered the opinion of the ORCID project itself. They are the
answers I believe are correct, based on the meetings and discussions I have been part of on the
technical advisory group.

If any other member of the advisory group can correct any inaccuracies in the comments, I’d be
most appreciative.

> 1. ISNI, ORCID, VIAF etc … will they each or should they be a
> subset of UUID, in a world where there is a need for identifiers for
> all sorts of things from lab notebooks to datasets to institutions, as
> well as researchers?

ORCID and VIAF have both plumped for a ‘short’ number and a verbal
prefix (eg VIAF ID: 747462). It is intended (eventually) that the profile
corresponding to a given ORCID should be able to be found from
an ORCID site, and not necessarily the ORCID site.

You can currently construct URLs for both,
where the ID number is used as a suffix to do a lookup on that
researcher/author/etc, with effort and consideration being given so
that the URL prefix will not change in the near future. It is naive to
think that any URL prefix will never, ever change, but keeping the
URL usable for as long as humanly possible is given serious thought.
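
A small Python sketch of the prefix-plus-ID idea: if the prefix is kept in one place, a change of URL prefix touches one table entry and none of the stored identifiers. The sandbox prefix is the path used by the emulation in the API post above. The VIAF prefix and the helper name are assumptions for illustration only.

```python
from urllib.parse import quote

# Assumed prefixes: only the sandbox path appears elsewhere in these posts.
RESOLVER_PREFIXES = {
    "orcid-sandbox": "http://localhost:3001/cid/",
    "viaf": "http://viaf.org/viaf/",  # assumption, not confirmed by the source
}


def lookup_url(scheme, identifier):
    """Build a lookup URL by appending the ID to the scheme's current prefix."""
    prefix = RESOLVER_PREFIXES[scheme]
    # Percent-encode anything unexpected so the ID cannot alter the URL structure.
    return prefix + quote(identifier, safe="-")


print(lookup_url("viaf", "747462"))
print(lookup_url("orcid-sandbox", "0723-1814-6587-5983"))
```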

With UUIDs, you will have to do something identical as there is no DNS
lookup *system* for them but a handful of individual sites that record
links as it suits them. Due to the UUID range being so large, the key
advantage of the scheme is that given a suitably random manner to
generate them, collisions between UUIDs made on separate systems are
incredibly rare. I’m not sure that anyone has recorded a collision
yet (disregarding those due to poorly configured entropy pools on
virtual machines). This means that it is perfectly reasonable to
generate UUIDs for things completely independently of any central
organising body, and so makes them very cheap and long-lasting.
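
To put a rough number on “incredibly rare”, here is a Python sketch using the standard uuid module and the usual birthday-bound approximation. It assumes random (version 4) UUIDs, which carry 122 random bits, and a well-behaved random source. The function name is mine.

```python
import math
import uuid

RANDOM_BITS = 122  # a version 4 UUID has 128 bits, of which 6 are fixed


def collision_probability(count, bits=RANDOM_BITS):
    """Approximate chance of at least one collision among `count` random IDs.

    Uses the birthday-bound approximation 1 - exp(-n(n-1) / 2^(bits+1)).
    """
    exponent = -count * (count - 1) / float(2 ** (bits + 1))
    return -math.expm1(exponent)


# Two systems can mint identifiers independently, with no central registry.
first = uuid.uuid4()
second = uuid.uuid4()
print(first)
print(second)

# Estimated collision chance after a trillion independently generated UUIDs.
print(collision_probability(10 ** 12))
```

The estimate only holds when the random source is sound, which is exactly the caveat about entropy pools on virtual machines.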

People do not like them however – subjectively – they do not like them
as part of visible URLs, they do not like them as identifiers to
wield, and they do not like identifiers for themselves that they
cannot remember by rote.

> 2. Who decides who is a researcher? In the UK some universities call
> all their members of staff ”teacher/researchers”, others make a clear
> distinction. What about schoolchildren who jointly author a paper?
> What about researchers in charities or industry who may never author a
> paper. What about peer-reviewers and research “users”?

ORCID currently is an “Allow then Deny Later” system. The main
‘ORCID’ site will be a self-signup website (with an initially limited
ability for proxies to sign up and create and amend profiles for others)
and the ‘researcher-iness’ of profiles will not be policed as there is no need to,
unless the profile claims something untruthful.

The core of the system is based on trust – if a person claims an institutional affiliation,
that will be marked as untrusted until that institution
verifies this. If an institution or research group doesn’t verify the
data, care is being taken that this is displayed as clearly as
possible.
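
The claim-then-verify idea might be modelled along these lines. This is a purely hypothetical Python sketch: the class, field and method names are mine, and nothing here reflects the actual ORCID data model.

```python
from dataclasses import dataclass, field


@dataclass
class AffiliationClaim:
    """A self-asserted affiliation that stays untrusted until verified."""

    institution: str
    claimed_by: str
    verified_by: set = field(default_factory=set)

    def verify(self, verifier):
        """Record that `verifier` has confirmed this claim."""
        self.verified_by.add(verifier)

    @property
    def trusted(self):
        # Assumption: only the named institution itself can make the claim trusted.
        return self.institution in self.verified_by

    def label(self):
        """Text for display, making unverified data as obvious as possible."""
        status = "verified" if self.trusted else "unverified (self-asserted)"
        return f"{self.institution}: {status}"


claim = AffiliationClaim(institution="Brown University", claimed_by="J. Carberry")
print(claim.label())
claim.verify("Brown University")
print(claim.label())
```

Policing then attaches to the claim rather than to the person: the profile always exists, and only the status of each assertion changes.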

There is no need to police people, only to police the claims they make
about themselves and the works they claim to have a hand in
publishing.

>
> 3. Even institutions which pride themselves on their research may
> only have 20-30% of their staff who are researchers, how do you sell a
> business case to them that they should alter their systems to
> accommodate an identifier for only a minority of the staff on their
> finance/HR/security systems?

Again, the ORCID system (and to an extent the VIAF system) is geared
to help the researcher – at a basic level, keeping a note of the ID
which a researcher has is all that is required to begin to benefit
from it. I think that due to the well understood pace at which change
occurs within the administrative systems of an institution, the first
meeting at which a business case for change might need to be presented
will occur many, many months after the researchers have adopted the
system for themselves as just part of the academic toolset. And if the
researchers do not find it useful, then it will disappear like so many
of the previous ID systems.

> 4. Similar question about researchers themselves – they have been
> disappointingly reluctant to deposit their papers in repositories and
> to use grant numbers in their publications, even when “mandated” – who
> will design the compelling interfaces which will encourage them to use
> ORCID … in the academic community we don’t have a great track record
> at designing compelling interfaces?

For one thing, it is not the academic community that is designing the
interface – it has already been outsourced to a small team of local
designers and developers that CrossRef have had good working
relationships with, so there is hope there. The key will be
whether or not the system will save time for the researcher and make
certain tasks that they already do easier.

The API for the ORCID
service is very much the focus at the moment and certain use-cases
have been thought through, such as encouraging publishers and journal
submission processes to use the ID system, rather than get the
researcher (or PA/postgrad by proxy) to fill in all their information
again, as well as bootstrapping the ORCID database with information
already within existing bibliographic databases so that many profiles
need only be claimed and verified, rather than generated anew.

I do not mean to knock the institutional repository scene unduly
(having been an institutional repo person myself) but I have yet to see
more than a few repositories strive to make researchers’
lives easier and better. It is worth noting that those repositories
are the ones that are thriving.

>
> 5. What role would a national registry need to play to map ORCID (or
> a.n.other identifier) with key information? and finally …

In short, include something semantically similar to ‘rdfs:seeAlso’
within the database/triplestore/profile for the national registry’s
version of the same person. Many of the codebase changes occurring at
this time are so that the informational claims within other
whitelisted registries can be automatically shown and interpreted
within the ORCID store, moving towards a multi-trust system.
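
As an illustration of that kind of link, this Python sketch emits one N-Triples statement using rdfs:seeAlso (the RDF Schema property). The national registry URL is a made-up placeholder, the subject is the sandbox URL from the API post above, and the helper name is my own.

```python
RDFS_SEE_ALSO = "http://www.w3.org/2000/01/rdf-schema#seeAlso"


def see_also_triple(subject_uri, object_uri):
    """Return one N-Triples line stating subject rdfs:seeAlso object."""
    for uri in (subject_uri, object_uri):
        # Reject characters that would break the <...> IRI syntax.
        if any(ch in uri for ch in '<>" \n'):
            raise ValueError("not a usable URI: %r" % uri)
    return "<%s> <%s> <%s> ." % (subject_uri, RDFS_SEE_ALSO, object_uri)


# Placeholder URLs: the registry address is hypothetical.
print(see_also_triple(
    "http://localhost:3001/cid/0723-1814-6587-5983",
    "http://registry.example/people/12345",
))
```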

>
> 6. I understand that the idea is that the researchers themselves
> would control the registration and updating processes – but
> institutions, funders and government agencies will surely want to
> maintain their own registries/database using the ID … yes? Is the
> mechanism for change control of personal information thought out?
>

As mentioned above, the changes occurring and being implemented are to
effect a solid multi-trust control system, which will allow for the
kind of distributed profiles you mention to be accepted. However, the
systems have to provide data such that a machine can use it, and that
may be the sticking point for a few of these systems.

Confidence, and the business of persistent identification

Paul Walk — Thu, 28 Oct 2010 12:48:10 +0000

The persistent identification of resources is a foundational element of the JISC Information Environment. There are several schemes and technologies available to support this, with one of the most prominently used in the JISC IE being the Digital Object Identifier (DOI). Built on the Handle technology, the DOI, under the stewardship of the not-for-profit International DOI Foundation (IDF), adds the important element of collective commitment and management, based on straightforward business interests. DOIs are allocated and managed through Registration Agencies (RAs).

DOI has become somewhat synonymous with scholarly publishing, with most people working in the JISC IE having encountered them in citations for papers in online journals and repositories. However, while publishers continue to play an important role in minting and using DOIs, the use of DOIs to persistently identify datasets produced in research is growing in significance. Last year saw the creation of a new RA - DataCite, which deals with this relatively new and growing area.

There has been much debate over the years about the persistent identification of resources - especially at the technical level. Yet all technical solutions are bound, eventually, to come up against the issue of the persistence, or lack thereof, of organisations of people. In the JISC IE space we can see that publishers come and go, and that journal titles, for example, merge or change ownership from time to time. Universities, seen by many as very persistent organisations (a pre-conception which might, sadly, be tested in the next few years) do, nonetheless, merge and change.

The creation of a body which has as its primary goal the management of the persistence of identifiers - essentially the role of the Registration Agency in DOI - is an approach to addressing this lack of permanence. Within the 'ecosystem' of the RAs, each participant has a vested interest not only in maintaining their own identifiers, but in ensuring that the system as a whole continues to function well. From this point of view, it is in the interests of all participants that the commitment from others is strong, which means that the addition of new RAs, such as DataCite, can only be a good thing.

Over the last year or so, IDF has been working with MovieLabs as part of a project to establish the not-for-profit Entertainment Identifier Registry (EIDR). This initiative includes the establishment of a new Registration Agency for DOIs for all digital resources created for TV and film by a consortium of many of the major producers in the entertainment industry. EIDR is actively seeking more participants, and offers a variety of types of membership.

While the engagement of this new industry may not be directly relevant to many people working in the scope of the JISC IE, the confidence and investment which this industry has placed in the DOI system is significant. This development increases the viability of DOI in general and, as such, should make it a more attractive prospect to those working in the JISC IE and in HE in the UK generally.

Essentially, confidence is an important aspect of persistence - and significant buy-in to DOI from such different sectors, commercial and public, should increase confidence in this solution.

A whitepaper about EIDR is available on request.

An introduction to DOI in a higher education context (set of presentation slides)