Identifiers

6 February 2013 - 5:10pm — Talat Chaudhri

Digital identifiers are central to information retrieval and management, and are consequently crucial to the operation of the World Wide Web. In particular, the ability to refer to a digital or real-world, physical resource, or to a description of such a resource, relies on its unique identification. The relationships between such resources, described in XML and/or RDF, may provide a powerful tool for research by establishing what other entities and resources are linked to the one being described and by providing a means to find information that might not otherwise have been discovered. Higher education institutions are large-scale producers and consumers of information about their research and teaching activities, and also generate considerable volumes of administrative data that are linked to these activities. These resources may, for example, include research projects, resulting publications, teaching resources and complex data sets. In order to secure grant funding or commercial collaborations, it is often crucial to be able to demonstrate the success and productivity of academic institutions or of individual academic units within them. In particular, the unique identification of academic staff is crucial to their continuing success.

Author: Talat Chaudhri
Last updated on 2 May 2013 - 3:57pm

On the most basic level, Web technologies intrinsically require the identification of information resources so that requests for such resources can be properly processed and so that the correct resources can be supplied. To this end, every resource on the World Wide Web is uniquely identified with a Uniform Resource Identifier (URI). The identifiers must be unique simply in order that there is no confusion about which resource is being sought or which server ought to seek to provide it. These resources may be of any file type whatsoever, including textual documents, web pages, images, video or audio resources, scripts, compressed archives etc; these may be for human viewing or may be intended to be machine readable in order to support a particular web service.

Information on the Web is found within resources: these may be the object of the request in themselves, if those resources are electronic; however, if the resources are not electronic, e.g. physical objects such as buildings, works of art, books or even people, the electronic resource is likely to be only a description of the actual resource that is being sought. Examples might be catalogues, lists, biographies and similar. While it may seem odd from a human perspective to refer even to people as resources, servers simply return the information, however organised, that is attached to the HTTP identifier that has been requested, whatever it may be. For instance, the simplest resource related to a person (or organisation, similarly) could be a declaration of the forms of a person’s name and other personal details that identify that person better as an individual and avoid confusion, i.e. disambiguate from other similarly named people, organisations or other resources.

Technically, there are two types of URI. The most immediately familiar of these is the Uniform Resource Locator (URL), which, for example, will be seen in the browser address bar above this and all other web pages. Another, rather less well known type of URI is the Uniform Resource Name (URN).

The difference between the URI and URL is a technical one and is often poorly understood in practice, to the extent that these terms are frequently confused. In theory at least, the URI only has to be a uniform identifier that is unique, which means that there is no automatic requirement for it to be expressed using the Hypertext Transfer Protocol (HTTP), or any other standard protocol or URI scheme such as XMPP, SMTP, mailto, FTP and so on. Effectively, any name or other sequence of characters that is unique to a particular resource could theoretically be classified as a URI and be used accordingly. However, in practice, a simple identifier is of limited use if it does not provide a means to locate and retrieve the resource in question. There is nothing to prevent a URI being expressed using HTTP or another protocol, nor to prevent the choice of a URI that points to a genuine resource on an active server. This provides a means for the server to resolve the request for that resource and hence to either return it, point to an alternative resource, state why the resource is no longer available, or declare that the resource does not exist, as may be appropriate. Each protocol, including HTTP, has its own technical communication standards to which servers should adhere.
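
As a minimal sketch of the distinction described above, the following Python fragment splits a few identifiers into their scheme and remainder using the standard library. The hosts, paths and addresses are invented for illustration; the point is only that some URIs name a server that can be asked for the resource, while others are simply names.

```python
from urllib.parse import urlsplit

# Illustrative identifiers only: the hosts, paths and addresses are made up.
EXAMPLES = [
    "http://www.example.org/staff/jsmith",
    "ftp://ftp.example.org/pub/data.zip",
    "mailto:j.smith@example.org",
    "urn:isbn:9780306406157",
]

def describe(uri: str) -> dict:
    """Split a URI into its scheme, host (if any) and remaining path."""
    parts = urlsplit(uri)
    return {
        "scheme": parts.scheme,
        "host": parts.hostname,  # None when the URI names no server
        "rest": parts.path,
    }

for uri in EXAMPLES:
    print(describe(uri))
```

In this sketch, the first two identifiers carry a host and so indicate where a request could be sent, whereas the last two carry none.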

Unlike the URL, there is no requirement for a URN to resolve to a resource. It is instead intended to provide a particular recognised scheme, over and above the normal HTTP identifier, for the identifier to be globally unique. In practice, some means to resolve the URN is generally required if the resource can be returned electronically via the Web, though not always in the case of physical resources. An example of this might be the International Standard Book Number (ISBN) for a particular physical book. It is intrinsically useful in the physical world to be able to discover this identifier, and so providing it via the Web may be useful for both humans and machines to discover the locations of actual physical copies or descriptions thereof, even without immediately providing a document on the Web that contains such a description. Inevitably, it is always more useful to provide a Web resource immediately that describes the physical resource in more detail, in addition to simply providing a URN within some appropriate schema. On this basis, the URL is a more flexible and widely applicable form of identifier than the URN, while the URN provides one method for providing more specific, narrowly defined semantic information as part of that identifier. It is not the only method of providing such information, which may instead be marked up, for example using the Resource Description Framework (RDF), which may be expressed in XML, or in simpler ways such as the JavaScript Object Notation (JSON). These would normally be contained in a machine readable document resource that itself has a resolvable HTTP URI, i.e. a URL.
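
The ISBN case can be sketched in a few lines of Python. The fragment below is an illustration rather than a reference implementation: it splits a URN into its namespace and namespace-specific string, and verifies the check digit of a 13-digit ISBN. The function names are hypothetical, and nothing here resolves the URN to a location.

```python
def parse_urn(urn: str) -> tuple[str, str]:
    """Split 'urn:<namespace>:<specific string>' into its last two parts."""
    prefix, namespace, specific = urn.split(":", 2)
    if prefix.lower() != "urn":
        raise ValueError(f"not a URN: {urn!r}")
    return namespace.lower(), specific

def isbn13_is_valid(isbn: str) -> bool:
    """Check an ISBN-13 check digit: weights alternate 1, 3 across the digits."""
    digits = [int(c) for c in isbn if c.isdigit()]
    if len(digits) != 13:
        return False
    total = sum(d * (1 if i % 2 == 0 else 3) for i, d in enumerate(digits))
    return total % 10 == 0

namespace, value = parse_urn("urn:isbn:9780306406157")
print(namespace, value, isbn13_is_valid(value))
```

The identifier is globally unique and self-checking, yet some separate service is still needed to turn it into a description or a shelf location.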

The concept of unique identification may be combined with the concept of describing relationships, often using RDF or other similar or related mark-up technologies. The idea that physical resources can have presence on the Web via their descriptions is often called the “Web of Things” and is part of the Semantic Web. The latter provides a means to find out how resources are related to other resources, e.g. how the Leaning Tower of Pisa is related to Italy, where it is located as a grid reference and its GPS coordinates, who built it, which famous experiment was allegedly conducted there by Galileo etc. When a large quantity of information is made available using description schemas that are machine-readable, Web services can be developed that can, for example, discover video, audio or textual resources such as publications about that place, or even perhaps artworks or photographs depicting it. For example, a scientist and an architect might be interested, for different reasons, in closely related resources to do with the construction and more recent stabilisation of the famous tower in Pisa. At least in theory, the Semantic Web, which relies on unique identifiers for every resource and every document providing such descriptions, could have unlimited potential for providing a means to cross-reference stores of information that are available on the Web and could thus provide more in total than any one such store is capable of providing alone.
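
The idea of relationships between uniquely identified resources can be sketched without any RDF library, using plain subject, predicate and object triples. Everything under `http://example.org/id/` below is a hypothetical identifier invented for this illustration, as are the predicate names; a real service would use published vocabularies and authoritative URIs.

```python
# Hypothetical identifiers and predicates under example.org, for illustration.
EX = "http://example.org/id/"

TRIPLES = {
    (EX + "leaning-tower-of-pisa", EX + "locatedIn", EX + "pisa"),
    (EX + "pisa", EX + "locatedIn", EX + "italy"),
    (EX + "article-42", EX + "about", EX + "leaning-tower-of-pisa"),
    (EX + "photo-7", EX + "depicts", EX + "leaning-tower-of-pisa"),
}

def match(subject=None, predicate=None, obj=None):
    """Return the triples that agree with every argument that is not None."""
    return sorted(
        t for t in TRIPLES
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    )

# Everything that points at the tower, whatever the relationship.
related = [s for s, _, _ in match(obj=EX + "leaning-tower-of-pisa")]
print(related)
```

Because every resource has one unambiguous identifier, statements gathered from different stores can be merged into the same set and queried together.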

In higher education, the creation and provision of resources is inevitably critical to all academic and teaching work. It is inevitable, therefore, that the Web has become a critical tool in carrying out all of these functions. For example, the effective organisation of complex research activities leading to academic publications that bring in funding revenue, or of teaching resources that facilitate the education of students, will contribute considerably to the performance of particular institutions. Providing and organising this information across the various functions and departments of an institution is often key: research data, human resources or finance information, library catalogues and publication information can be combined centrally to provide a powerful and flexible database of the activities of an institution or its departments, including their strengths and weaknesses at any particular point in time.

As technologies are developed and used to bring together diverse sources of information more efficiently and automatically than was possible by manual processes, it becomes clear how increasingly important it will be to identify the individuals, organisations, documents, artifacts and multimedia resources involved in these activities correctly. It is a particular concern to identify the names of people, organisations and their subdivisions and places correctly, since the recorded productivity of individuals and institutions relies heavily on the accuracy of information and because any misattribution could impact negatively on, for example, their standing with funders and commercial partners. On a fundamental level, identifiers contribute to the efficient organisation, consumption and reuse of information resources on the Web. In this light, the relevance of unique identification to such a major producer and consumer of information as the higher education sector clearly cannot be overestimated.

There are so many different schemas and protocols for unique identification, within the overall superset of URIs and the broad divisions into URLs and URNs, that it is practically impossible to create an exhaustive list of them. Within the higher education sector, there are a number of broad classes of identifiers and metadata schemas containing various approaches to unique identification that can be described in broad detail here.

On the simplest level, one technique is to aggregate information about an identifier and its various or equivalent forms in other contexts or schemas in order that it cannot be confused with similar identifiers. This is a useful approach, for example, with regard to the names of people and organisations, which have multiple name forms that might be valid either concurrently or at different times in their history. It is useful to know that a common name like John Smith refers to John Nathan Smith, since there are fewer so named, and even more useful to state a relationship about where (and when) he was employed. A machine will not know whether J.N. Smith, J N Smith, JN Smith, Dr. J. Smith and perhaps later Prof. J. Smith are different or identical people, especially where titles change or an individual has several valid name forms. It is useful to state that these are equivalents, as together these contribute to the uniqueness of the record or records about this individual. Details about where a person worked and their job, together with what the institution was called at various times, with date stamps where appropriate, serve to make these entities unique. The order of the elements of a name and the function of each, for instance in East Asian names where alternative name orders are used, can also be marked up in metadata.
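
A rough sketch of this aggregation in Python follows. It assumes a single hypothetical record with an invented identifier under example.org, and a deliberately crude normalisation (lower-casing, dropping punctuation and a short list of titles). The equivalences are stated explicitly in the record; the code does not attempt to infer them.

```python
import re

# A short, illustrative list of titles to ignore when comparing name forms.
TITLES = {"dr", "prof", "professor", "mr", "mrs", "ms"}

def normalise(name: str) -> str:
    """Lower-case a name, strip punctuation and titles, and close up spaces."""
    tokens = re.sub(r"[^\w\s]", " ", name.lower()).split()
    return "".join(t for t in tokens if t not in TITLES)

# Hypothetical record: the identifier and the stated name forms are invented.
RECORD = {
    "id": "http://example.org/id/person/0001",
    "name_forms": ["John Nathan Smith", "J.N. Smith", "Dr. J. Smith"],
}

# Map every stated form, once normalised, to the one canonical identifier.
INDEX = {normalise(form): RECORD["id"] for form in RECORD["name_forms"]}

def resolve(name: str):
    """Return the identifier for a name form, or None if it is not stated."""
    return INDEX.get(normalise(name))

print(resolve("JN Smith"), resolve("Prof. J. Smith"), resolve("Jane Smith"))
```

A form as short as "J. Smith" would of course match many people in a real data set, which is why employment details and date stamps are needed alongside the name forms.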

Such metadata records can be produced in a relatively simple way, depending on how much semantic information is necessary or useful. At the simplest, a major metadata schema such as Dublin Core can be used with repeated fields for forms of the name of an entity associated with a resource. More complex metadata schemas such as the Common European Research Information Format (CERIF) can be used to encode more complex information, e.g. the dates when a particular academic’s pre-marital and married names were valid, which could be cross-referenced to her or his publications to ensure that they were correctly cited and/or verifiable against the publisher’s information. Similarly, the name or changing names of the organisation(s) and their sub-divisions at the time of the production of a teaching resource or academic article could be verified. Where these records disagree, it is possible to programmatically establish the likelihood of errors by comparing different resources and thus correcting and adding to (or enriching) the metadata held about resources and their relationships.
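
The simplest case, repeated Dublin Core fields, might look like the following sketch, which uses only the Python standard library. The `record` wrapper element, the title and the name forms are assumptions made for the example; only the Dublin Core element namespace and the `title` and `creator` elements come from the standard itself.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC)

def build_record(title: str, creator_forms: list[str]) -> ET.Element:
    """Build a small record with one dc:creator element per name form."""
    record = ET.Element("record")  # hypothetical wrapper element
    ET.SubElement(record, f"{{{DC}}}title").text = title
    for form in creator_forms:
        ET.SubElement(record, f"{{{DC}}}creator").text = form
    return record

record = build_record("An example article", ["Smith, John Nathan", "Smith, J.N."])
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

Nothing in such a record says that the two creator values are the same person or when each form was valid, which is the kind of information the more complex schemas are designed to carry.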

However, it is useful to have an overall, canonical URI to combine with a record or records about individuals, organisations or other entities, for example if some sources of information are less well curated than others. Uniqueness can be determined by methods such as Universally Unique Identifiers (UUIDs) that rely on mathematical probability to be functionally unique. While convenient for machines, these must be combined with strings of characters representing common names or words in order to be meaningful to a human as well, which is easily achieved in markup languages such as RDF, or just in simple XML. It is also useful to be able to apply unique identifiers to documents, such as the Digital Object Identifier (DOI). This is often used, for example, by publishers to give a unique identifier to a published electronic resource such as an academic article, but could be used for any document. Metadata attributes are associated with the identifier so that persistent identification is provided for that document together with relationships that the document has, e.g. to its author.
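
As an illustration of the UUID approach, the sketch below uses Python's standard `uuid` module to create both a random identifier and a name-based one, and pairs the latter with a human-readable label. The source URI and the label are invented for the example.

```python
import uuid

# A random (version 4) UUID: unique with overwhelming probability.
random_id = uuid.uuid4()

# A name-based (version 5) UUID: the same input always gives the same result.
# The URI used as input here is hypothetical.
SOURCE = "http://example.org/id/person/0001"
name_based = uuid.uuid5(uuid.NAMESPACE_URL, SOURCE)

# Pair the machine identifier with a label that is meaningful to a human.
record = {"id": f"urn:uuid:{name_based}", "label": "John Nathan Smith"}
print(random_id, record)
```

The name-based variant is convenient when several systems must arrive at the same identifier independently from a shared input.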

In terms of Web services such as social networks, which may include academic and other professional networks such as Academia.edu and LinkedIn, the ability to judge whether a particular public-facing user account is the same as a named user on another service is also important to the discovery, for example, of potential collaborators or professional rivals and their work, be they individuals or organisations. This may, for some individuals, extend into personal social Web services that are partially or wholly used for academic or other professional purposes, such as Twitter. Purely academic identifiers also exist, for example commercial identifiers such as those offered by Thomson Reuters at ResearcherID.com, and the Scopus Author Identifier; or those provided by national organisations or within the public domain, such as the International Standard Name Identifier (ISNI) and the ORCID researcher identifier.

Within academia, the business of unique identification has advanced at different rates for different types of resources and related entities that are available or described on the Web. On the one hand, it has been in the interests of publishers to make sure that academic articles from which they derive profit are uniquely identified in order that they can support their subscription income. It is not surprising then that the DOI scheme is the most widely implemented scheme for digital object identification, or that it has been most widely applied so far to such published materials. Pre-existing international schemes such as ISBN and ISSN have been relatively easily re-used in the Web context.

On the other hand, there has been significantly less progress towards a de facto internationally recognised standard for unique identifiers for individuals. The ISNI identifier is widely used by public funding bodies but the individuals and even the organisations described by it have no direct input into the accuracy or presentation of the information except initially through their national research assessment schemes, e.g. the Research Excellence Framework (REF) in the United Kingdom. The ORCID scheme has the provisional support of funding bodies represented by Research Councils UK and by similar organisations in other developed countries, as well as by major publishers; however, it is still in the early stages of development and it remains unclear whether the necessary adoption will occur among academics. The approach taken by ORCID is for organisations to seed the information about their academic staff but for the information to be controlled by the academics themselves and authenticated by the trusted institutions who employ or fund them. At present, ORCID appears to have considerable initial support and is a fast-developing standard.

The unique identification of organisations has progressed less far. An academic may have worked at several institutions, each of which may have alternative names (e.g. bilingual names in Welsh or Gaelic) or which may have changed their names. They may work within more than one research group, department and/or school, which may have been reorganised, merged or de-merged over time with resultant effects on the appropriate nomenclature at the time that a particular resource was created or published. As yet, although some metadata schemas such as CERIF have the ability to mark up increasingly complex information of this type, there is little evidence that this is being done by academic institutions. In countries such as Australia, there may be national institutions, e.g. the Australian National Data Service (ANDS) that maintain or provide approved lists of names. However, there is no evidence that national bodies who keep such lists for their own purposes have made these widely available in machine-readable, date-stamped formats for programmatic reuse such as metadata verification and enrichment.
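
The kind of machine-readable, date-stamped list described above could be as simple as the following sketch. The organisation, its names and its dates are entirely invented, and the structure is an assumption for illustration rather than the format of CERIF or of any national service.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass(frozen=True)
class NameForm:
    name: str
    valid_from: date
    valid_to: Optional[date] = None  # None means the form is still current
    language: str = "en"

# Invented history of an invented institution, including a bilingual form.
HISTORY = [
    NameForm("Example University College", date(1990, 1, 1), date(2004, 12, 31)),
    NameForm("Example University", date(2005, 1, 1)),
    NameForm("Prifysgol Enghraifft", date(2005, 1, 1), language="cy"),
]

def names_on(day: date, history: list[NameForm]) -> list[str]:
    """Return every name form that was valid on the given day."""
    return [
        n.name for n in history
        if n.valid_from <= day and (n.valid_to is None or day <= n.valid_to)
    ]

print(names_on(date(2001, 6, 1), HISTORY))
print(names_on(date(2010, 1, 1), HISTORY))
```

Given a publication date, a verification service could then check whether the affiliation printed on an article matches a name that was valid at the time.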

Individual software services may create widely used methods of identification for their own purposes, for example within the repository software DSpace, Fedora and EPrints. As supplied by default, however, these are not especially reusable outside the software, given that the agreed international standards for identifying individuals and organisations are not well established. It is always possible to include any unique identifier, such as ResearcherID or ORCID, in a record, but the usefulness will be limited if there is no consistency of practice in terms of which metadata field is used in each local customisation. There are numerous useful approaches such as that taken by the developers of eSciDoc in the Control of Named Identities (CoNE) service, which can be used either with the Fedora-based eSciDoc or as a standalone service. CoNE, for example, can provide a means to uniquely identify any entity. However, these have not seen wide adoption beyond their specific software communities.

The CERIF metadata standard can perform a similar function with an almost arbitrary level of granularity of information, yet it has not been nearly as widely implemented in software as the Dublin Core metadata standard which has no such abilities but which is far easier to implement. In the same way, Dublin Core application profiles provide an overlay of more complex metadata built on top of simpler Dublin Core. However, these are used largely in limited, specific information retrieval environments, e.g. libraries or specific, subject-based services and have not developed into widely used, de facto standards.

There are numerous approaches to cross-referencing which online identities in professional and social networks belong to the same individual. Most of these, however, rely on the services themselves providing a means to do so. As these are commercial, there is no guarantee that they will continue to do so in future if commercial rivalries later cause that to be outside their own interests, or that the information will be publicly viewable, even if the user chooses to set privacy settings to allow this, or is even able to do so.

However, the non-commercial means to state such equivalences are of uncertain value too. Most users who have an OpenID, for example, which is effectively a user account attached to a unique identifier and metadata about that individual, only have one because they use a service like Facebook which, usually without their knowledge, provides one. It may not be obvious to a non-technical user that it can be in their interests to state publicly which other professional services may represent them. While Friend of a Friend (FOAF) allows relatively simple declarations about individuals and their relationships with other entities, most individuals do not have the technical knowledge or means to provide markup in such files on the Internet, and may not even be aware of what such markup can achieve. It is not clear to what extent commercial search algorithms used by the major search engines may take FOAF, or indeed any other metadata source, into account, which may impact upon the practical value of providing such metadata.
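
To give a sense of how small such a declaration can be, the sketch below generates a FOAF description in Turtle syntax from Python. The person URI, name and account URLs are invented, and the helper function is hypothetical. Pointing `foaf:account` directly at a profile URL is a simplification (FOAF models accounts as resources in their own right), and the name is not escaped, so this should be read as an outline only.

```python
# Namespace of the FOAF vocabulary.
FOAF = "http://xmlns.com/foaf/0.1/"

def foaf_turtle(person_uri: str, name: str, accounts: list[str]) -> str:
    """Render a minimal FOAF description of one person as Turtle text."""
    statements = ["a foaf:Person", f'foaf:name "{name}"']
    statements += [f"foaf:account <{url}>" for url in accounts]
    body = " ;\n    ".join(statements)
    return f"@prefix foaf: <{FOAF}> .\n\n<{person_uri}>\n    {body} .\n"

# All values below are invented for illustration.
turtle = foaf_turtle(
    "http://example.org/id/person/0001",
    "John Nathan Smith",
    ["http://network-a.example.com/jnsmith", "http://network-b.example.com/jsmith"],
)
print(turtle)
```

A few lines of this kind, published at a stable address, are enough to assert that two accounts belong to the same individual; the difficulty lies less in the markup than in getting individuals to produce and host it.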