repositories: relevant content on this site

Microservices in (and beyond) Research Information Management

Talat Chaudhri — Fri, 25 May 2012 15:37:53 +0000

Microservices: are they all that new?

Recently there has been something of a revival of interest in a small-scale development approach towards software design for repositories: microservices. This is far from an entirely new idea but seems to have been somewhat slow to develop in practice, even to date; a useful summary of the approach was given by Neil Jacobs back in 2010. Moreover, a modular approach towards software that fulfils various related functions in managing web content related to research clearly has a much longer history, and is not in itself particularly surprising in software development more broadly. However, it seems that microservices as an approach is gradually acquiring a clearer identity within this space, so it may be worth taking a look back at the nature of the types of software used in managing research content of various types, how they are related, and whether and to what extent terms like "repository", "Current Research Information System", "Research Information Management system" and so forth overlap in terms of software functionality that they offer.

Defining terms: "repository", CRIS, RIM etc

Institutions within Higher Education are often faced with questions of procurement such as technical suitability and sustainable technical support. Although these areas are broader than those normally covered by the Technical Foundations web site, since they encompass non-technical considerations related to funding, policy and practice that drive software acquisition in universities and related institutions, the purely technical aspects are securely within scope and of considerable interest to the community at large in terms of developing useful technical guidance.

The question "What is a repository?" is likely to have a range of possible answers, but Neil Jacobs noted the revival of an approach summarised in Cliff Lynch’s 2007 description of the institutional repository as “a set of services that a university offers to the members of its community for the management and dissemination of digital materials created by the institution and its community members”. Without reiterating the points made by Neil Jacobs in detail, suffice it to say that these efforts have been led by institutions such as the California Digital Library and notably by John Kunze and others. The difficulty with this approach in general is not a purely technical one but one of technical resources, and it is not unique to the microservices approach but can for example be seen with systems such as Fedora Commons as well.

Software development approaches

While the most modular, customisable and flexible technical approaches can often be adapted most quickly (and arguably most effectively) to the challenging technical demands placed on them, significant development resources, usually in-house, are normally required in order to tailor the software to local requirements. In practice, the result is often that only certain large institutions are able to justify and support software systems such as Fedora or even "roll their own" local software solutions. A useful example is the eSciDoc suite of services, developed by the Max Planck Society and FIZ Karlsruhe. Together, these effectively represent what in other contexts (e.g. the Linux world) might be called a "distribution", in this case based on Fedora. It is also worth noting that these services have been developed so that they can be used independently of eSciDoc, for example with DSpace or another repository system. In this way, true to Cliff Lynch's definition, each aspect of what together we call a "repository" is handled by a different piece of software, which then interoperates with a range of other web services according to local requirements.

"Does it do more than we already do?"

This, in a nutshell, is the microservices approach. However, there is no reason why the question should be restricted to repositories, since "repository" is itself something of a catch-all term for a class of web content services that are by no means identical in their principal functions and aims, even where they are using the same underlying software. Where, for instance, does the functionality of a repository end and that of a research management system, research information management system or Current Research Information System begin? Without a clear understanding of what these systems do, it is possible if not likely that higher education institutions, especially where decisions about procurement could be made by relatively non-technical managers, might easily end up acquiring more than one system with overlapping functions. Clearly, in times of difficult financial circumstances, this ought to be avoided wherever possible. It is worth spelling out what exactly different systems do in order to minimise duplication of effort.

Similar software issues facing HEIs

The question need not be limited to repositories and research information management either, although it is not the intention to get into great detail in this particular blog post. For example, libraries are frequently offered new products either by vendors with whom they have existing contracts or by their rivals. It is always in the interests of a vendor to sell a new product, so the question of duplication of technical functionality and/or the most effective technology to address a local need is of far more pressing concern to the institution than the vendor. A range of commercial library portals are on offer, built on but extending the functionality of library catalogues and commercial publications databases related to e-journals such as Web of Science. It is a common experience amongst library staff to feel unsure to what extent new software is offering new functionality, how it fits their technical requirements, and to what extent it may be re-packaging existing functionality in new clothes. The same could perhaps be said, for example, of systems relating to human resources or institutional finance offices.

What else can these systems do?

Returning to repositories and research information management, it is clear that a wide range of resource types are being published on the web through a range of related systems. The best recognised use of the repository is as a research publications repository, which is usually how the wider term "institutional repository" is understood within the context of higher education and issues relating to but not confined to Open Access. Increasingly, attention has turned to Current Research Information Systems, based on the CERIF standard, and similar research information systems. Of particular interest is the RMAS approach, effectively building such a system from a range of related pieces of software, i.e. a microservices approach outside the limits of the repository sphere. Research information management covers all aspects of the processes of research creation and dissemination, including research reporting, human resources, finance and publication, while publications repositories commonly focus only on the last of these. This is usually the area where institutions operate systems whose functionality overlaps, as there is no reason in principle why a CRIS, for example, cannot expose research publications on the Web: this is possible with the main commercial systems such as PURE and Converis.

In any case, there is no necessary limitation on the term "repository" to cover only resources relating to the outputs of research. Teaching and learning materials, amongst a wider range of educational resources, are another major area that has seen substantial growth in the last two or three years. Various types of media resources from images to time-based media such as audio and video recordings are found in institutional repositories for a number of different academic purposes, e.g. art collections, media archives, music collections, health information and so on, not all of which are the direct products of either research or teaching but may be connected with one or both. In this context, it is as well to remember that the term "repository" means little more in essence than "organised place or system to put something [on the Web]" and that many such systems, especially older ones, have always been known as "digital archives", "electronic libraries", "media collections" and so on, in contexts where the word "repository" would still not generally be recognised. Large data collections are often stored in systems that are, in effect, repositories, but whose development has been through systems not normally known by that term.

Solutions that fit problems

In summary, dividing the world of software systems for academic and related outputs too rigidly into "repositories" and "research information systems" may be at the root of many of the difficulties that arise in understanding which technical functionality is required for any given local purpose and the extent to which systems overlap. A better, more precise understanding of these functionalities would help to avoid unnecessary duplication of effort and proliferation of systems. Some approaches are effectively bundled within one piece of software for a particular purpose, e.g. DSpace and EPrints in the repositories space. These offer a conventional set of services that fit the requirements of most institutions but may place some limits on the ability to customise those services indefinitely. Even these systems are built to be general purpose systems with considerable potential for local customisation. However, there is the tendency seen elsewhere (for instance in open source software with a large and disparate user base) to introduce software bloat: more and more functionality, some of it never used by the majority of implementations, is shipped with each succeeding version as new scenarios are met with.

While potentially introducing the problem of sufficient availability and sustainability of technical development effort, microservices are at the opposite end of this spectrum. Each service is ideally a separate entity on the web server, built for maximum interoperability with the other services that may be required for local purposes. Rather than acting as plug-ins to a base software system (which is perhaps an intermediate approach), these are separate code bases able to run independently, even where they may have been intended, as in RMAS or eSciDoc, to be used frequently together. The technical issues and demands of each system will be different in every case.

The business of unique identification

Talat Chaudhri — Thu, 23 Feb 2012 23:58:26 +0000

What need is there for unique identifiers?

Put in relatively non-technical language, there is an increasing concern in information science in general to uniquely identify different things, organisations or people that could otherwise be confused, whether on the Internet or in the physical world. In technical terms, these are all referred to as resources (even if people might find it vaguely demeaning in normal language to be considered as such). This need, whether real or perceived in any particular context, has grown as the complexity of information available on the Web has grown almost exponentially, increasing the potential for confusing similar resources.

Why aren't names good enough?

1. People

It is not necessarily enough to have a name, since even a relatively unusual combination of names might easily not be unique from a worldwide perspective. At the most basic level, John Steven Smith might be unique in a place called Barton, but even if you cross-reference the name with the place, two people with the same name could easily be confused, for example if there are several possible places called Barton.

My own name, Talat Zafar Chaudhri, might appear to be more unique until you realise that these are all fairly common names in the Indian subcontinent and thus in the Indo-Pakistani diaspora, so it is reasonably possible or even fairly likely that another named individual exists with this particular choice of spelling (of which others may exist). I am also Talat Chaudhri, T. Chaudhri, T Chaudhri, T.Z. Chaudhri, TZ Chaudhri and similar variations (with or without spaces and punctuation) that might make it harder to decide which individuals to reconcile as a single individual, especially by machine processing. At least I do not vary the spelling of my surname, but some people may, especially in cases such as my own where other transliterations could be possible: for example, my father previously used the spelling Chaudhry and many others such as Chaudry, Chowdhary and Chowdhuri are equally possible. I understand when companies misspell it, but a computer might not be sure if these were definitely the same person, even if it went to the lengths of calculating a probability for this.
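As a rough illustration of the machine-processing problem described above, the following Python sketch normalises punctuation and spacing and then scores how alike two name strings are. It is only a sketch under stated assumptions: the function names and the 0.85 threshold are hypothetical choices made for illustration, and a simple string score of this kind cannot by itself establish that two names belong to the same person.

```python
import re
from difflib import SequenceMatcher


def normalise(name: str) -> str:
    """Lower-case a name and collapse full stops and runs of spaces."""
    # "T.Z. Chaudhri" and "T Z  Chaudhri" both become "t z chaudhri"
    return re.sub(r"[.\s]+", " ", name.lower()).strip()


def similarity(a: str, b: str) -> float:
    """Return a rough score between 0 and 1 for how alike two names look."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()


def likely_same(a: str, b: str, threshold: float = 0.85) -> bool:
    """Guess whether two strings name the same person.

    The threshold is an arbitrary assumption; any real system would
    need further context (place, employer, dates) before merging records.
    """
    return similarity(a, b) >= threshold


if __name__ == "__main__":
    for variant in ["Chaudhry", "Chaudry", "Chowdhary", "Chowdhuri"]:
        print(variant, round(similarity("Chaudhri", variant), 3))
```

Note that a form such as "TZ Chaudhri" still normalises differently from "T.Z. Chaudhri", which shows how quickly simple rules run out and why a separate identifier is attractive.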

Moreover, people change personal titles (e.g. I have been both a Mr and a Dr and I am occasionally still referred to as the former by companies that do not allow for the latter option); they have multiple, changing work roles and work places, and may be known in multiple contexts, e.g. work, social, voluntary roles and similar. At work, one may have additional roles in various professional bodies, so it may not be apparent who is who. Two people might have the same name in a large professional group, e.g. physicists, and may even produce outputs related to the same subject. Who owns which ones? This is a particular issue for electronically available outputs on the Internet, e.g. publications, educational resources, audio, visual or audiovisual resources and so on.

2. Organisations

The same issue arises for organisations. Can we be sure that a Board of Licencing Control is unique? No. Perhaps it is merely a variant spelling of the Board of Licensing Control? What if one, but not all, of these were re-named as Burundian Licencing Control? What if the Board of Licencing Control merged with the Department for Regulatory Affairs under either of these names, a combination, or an entirely new name, yet continued its association with the assets of the originals? De-mergers are likewise possible, and may present issues of uncertain ownership of resources.

Perhaps there are organisations with this name in several countries but serving utterly different purposes, and perhaps one is merely one possible translation of a term into English but used natively in another language. Historical names have been used in multiple contexts that may still be valid, e.g. the Irish Volunteers, and these might need to be kept clearly separate from each other. Conversely, there are also organisations that have multiple names or forms of names, whether in one language or in multiple languages or during their history, e.g. Óglaigh na hÉireann is Irish for both the terrorist Irish Republican Army (IRA) and most of its subsequent splinter groups but is also, however, an acceptable name, for historical reasons, for the Defence Forces of the Republic of Ireland, and previously just the Irish Army (an tArm) that now forms a part of it. These are clearly not the same and must be distinguished. It must be also noted that typographical constraints and character encodings will lead to yet more duplicate forms.

Isn't this bigger than the question of unique identification?

Yes, the need for complex metadata to express these things can go far beyond merely identifying resources in a unique manner. However, before one can even start thinking about complex descriptive and relational metadata, one first has to be clear which resource is mentioned: hence the first step must be unique identification of what it is we are talking about. Only once we have done that can we feel reasonably confident about talking about how resources relate to one another and how they may have changed over time.

Overall, there is an ever increasing need to make clear what is meant, as more and more things and agents have on-line identities that need to be distinguished, whether this is as an owner of resources or as a referent within a resource, e.g. the subject of the resource in a particular context, and even of the role played and the relationship to other resources or agents, perhaps in a specific time period. Information models can quickly become extremely complex, and this is certainly true where identity is concerned.

What is an identifier?

An identifier is similar in its basic concept to a name. At its most basic, an identifier in the context of an information system is a token (usually a number or a string of characters) used to refer to an entity (anything which can be referred to). Identifiers are fundamental to most, if not all, information systems. As the global network of information systems evolves, identifiers take on a greater significance. And as the Web becomes more 'machine readable', it becomes vital for all organisations that publish Internet resources to adopt well-managed strategies for creating, maintaining and consistently using identifiers to refer to the assets they care about.

What are unique identifiers?

The simple answer is that unique identifiers are the only way to avoid misidentification with confidence, and therefore to prevent any errors about ownership or rights over resources that might arise, as well as to make sure that large bodies of resources contain reliable information generally.

The fundamental question is whether the identifier or token that has been chosen is unique and how best to ensure this. Some identifiers are so complex that mathematical probability makes them effectively unique in the universe, notably UUIDs. In essence, a UUID is no more than a complex numerical token: it is only additional complexity (and thus uniqueness) that it offers compared to, for example, a running number. Others like names can only be distinguished unambiguously by making a series of statements about which names are considered equivalent, which contexts (e.g. a person's work or town) are valid, and so on, where a number of relationships have to be attached to a particular identifier and checked in order to reach an acceptable level of uniqueness and to eliminate any mistaken connections with resources that might be similar in name or perhaps also in other respects by chance.
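The contrast between a running number and a UUID can be sketched in a few lines of Python using the standard library. The function names here are hypothetical, chosen only for illustration: a running number is unique only within the one system that counts it, whereas a random (version 4) UUID can be generated anywhere without co-ordination, with a negligible chance of collision.

```python
import itertools
import uuid

# A running number: unique only inside the system that holds this counter.
_counter = itertools.count(1)


def next_local_id() -> int:
    """Return the next running number (1, 2, 3, ...)."""
    return next(_counter)


def new_global_id() -> str:
    """Return a random version 4 UUID as a 36-character string."""
    return str(uuid.uuid4())


if __name__ == "__main__":
    print(next_local_id(), next_local_id())
    print(new_global_id())
```

Two separate systems that each start counting from 1 will clash immediately; two systems that each generate UUIDs will, for all practical purposes, never do so.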

The problem with UUIDs is that, while the chances of them failing to be unique are, to all practical purposes, non-existent, it is not very clear from a UUID alone what the nature of that resource is. It may be machine-readable but it says nothing about who generated that identifier and when, or which other identifiers might exist for the same resource in different systems that also generated an identifier for the same resource. Consequently, the need to associate other metadata with any complex number or other similar token remains (including but not limited to UUIDs). Simply, no single token can be sufficient for any complex purpose and, at the very least, an electronic or physical resource must be referenced for the token to have any useful meaning at all.
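One way to picture the point that a bare token needs accompanying metadata is the following sketch. The record structure and all of its field names are assumptions made for illustration, not taken from any particular system: it simply attaches who issued a token, when, what it refers to, and which other identifiers are known for the same resource.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List


@dataclass
class IdentifierRecord:
    """A token plus the minimum context needed to make it meaningful."""
    token: str                 # the opaque identifier itself
    issued_by: str             # who generated the identifier
    issued_at: datetime        # when it was generated
    refers_to: str             # the resource (or its description) identified
    same_as: List[str] = field(default_factory=list)  # identifiers elsewhere


def mint(issued_by: str, refers_to: str) -> IdentifierRecord:
    """Create a new record around a freshly generated UUID."""
    return IdentifierRecord(
        token=str(uuid.uuid4()),
        issued_by=issued_by,
        issued_at=datetime.now(timezone.utc),
        refers_to=refers_to,
    )


if __name__ == "__main__":
    record = mint("Example University", "https://repository.example.org/items/1")
    record.same_as.append("https://cris.example.org/outputs/42")
    print(record)
```

The example.org addresses are placeholders. The design choice worth noting is that the token stays opaque, while everything a person or machine needs in order to interpret it lives alongside it.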

This is effectively what a URL is: another type of token. While I will not go into the whole discussion about URLs and URNs as sub-types of URIs, it is worth noting that, in many quarters, the term URL is no longer preferred despite it being the most commonly used in practice. In strict terms, there is a clear difference: while a URI is usually resolvable to an electronic resource, which may be either a description of a physical or electronic resource or may be an electronic resource itself, there is technically no requirement that a URI should be resolvable, i.e. all it needs to be is a token that doesn't necessarily have to represent an address that actually delivers a resource. However, it is usual to use the HTTP scheme, which is designed for delivering such a resource, so it would be somewhat eccentric and misleading if one were deliberately to choose an ostensibly resolvable syntax that does not in fact resolve. In effect, virtually all such URIs are also URLs (unless a resource has become unavailable and link rot has set in), since the latter must locate the resource or representation of it: this is inherently useful. Any URI that resolves, i.e. any URL, will be effectively unique within the standard Domain Name System (DNS). As a result, there is no absolute need for UUIDs in many contexts, since a sufficiently unique and practical token already exists in the URI. Any unique but arbitrary token serves the core purpose here.
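The distinction drawn above between URIs that are expected to resolve and those that are merely names can be sketched by inspecting the scheme of a URI with Python's standard urllib.parse module. The function name and the wording of the returned labels are hypothetical; the sketch does not attempt to fetch anything, so it shows only what the syntax promises, not whether a given address still resolves.

```python
from urllib.parse import urlparse


def describe_uri(uri: str) -> str:
    """Classify a URI by its scheme, without trying to resolve it."""
    scheme = urlparse(uri).scheme.lower()
    if scheme in ("http", "https"):
        # HTTP URIs use a syntax designed for delivering a resource
        return "locator (URL): expected to resolve"
    if scheme == "urn":
        # URNs are names; nothing requires them to be resolvable
        return "name (URN): not required to resolve"
    if scheme:
        return "other URI scheme"
    return "no scheme: not an absolute URI"


if __name__ == "__main__":
    for candidate in ["http://example.org/id/person/1",
                      "urn:isbn:0451450523",
                      "just-a-token"]:
        print(candidate, "->", describe_uri(candidate))
```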

Aren't identifiers really just names?

Yes and no. Names are intrinsically arbitrary too when they are first given. However, they are identifiable on a number of levels from a human perspective. In addition to a combination of names belonging to one or more particular linguistic and/or ethnic origins and usually identifying gender, they quickly become associated with a particular person, so their use in uniquely identifying that person within a given context becomes central to maintaining the person's reputation in whatever they do. This is, for example, particularly important to academics in Higher Education. In modern times, this name resolution needs to be done globally wherever the Internet is the context, whereas previously it would have been possible to use fewer additional pieces of information in more restricted contexts (e.g. a village, a country etc), depending on the purpose. These different contexts still co-exist but it is now necessary to provide as many as possible, since one cannot control or predict why the information is being requested in each instance on a global system such as the Internet.

How does this affect Higher and Further Education?

Increasing numbers of professionals and the bodies that they work for and represent need to describe their resources on the Internet, whether those are in themselves electronic resources, whether they are descriptions of electronic or physical resources (metadata), or whether they are other representations of physical resources, perhaps in addition to themselves being electronic resources (e.g. photographs). This is a particularly pressing issue in Higher Education and, to an increasing extent, in Further Education. Academic outputs may include publications, educational resources, visual, audio and audiovisual resources and so on. Perhaps the best known is the issue of scholarly publications, partly through the rise of the Open Access movement to make such resources freely available.

There are already a range of identifiers for academics and related professional university staff. One of the problems is that these are created for specific purposes that only cover whichever subset of staff is relevant to those purposes. For example, HESA keeps records that contain a HESA number for academic staff, which means that at least those who have published academic outputs will have such a number. Another number called the HUSID number is maintained for students, since tracking academic careers from student to staff is one important concern for HESA. Many academics in relevant fields may have ISNI numbers, which are used widely in the media content industries. Many academics will have one or more professional staff pages, including within repositories and Current Research Information Systems (CRIS), each with a URI, not to mention OpenIDs and URIs associated with Web services which they use professionally and/or privately, e.g. LinkedIn, Academia.edu, Facebook, Twitter and so on.

Here are some examples belonging to Brian Kelly of UKOLN:

The problem is that the coverage of these numbers is not universal within the HE sector, and there is no single recognised authority or other agreement to prevent and resolve conflicts where information is not consistent between two or more information sources.

At present, the JISC are trying to solve this through the Unique Identifiers Task and Finish Group, which also includes representatives of HESA, HEFCE, the various Research Councils in the UK and UKOLN. The preferred solution is currently the ORCID academic identifier, which is being developed internationally with publishers, with a great deal of input from the United States in particular.

In order to succeed, any such identifier will need international penetration of the higher education sector, since academics will not use it unless it delivers the sorts of interoperability benefits that make their work easier and becomes integrated into the recognised systems required of them by funders and publishers in the course of their work. Since students and academics change roles and institutions, this needs to be recognised and outputs properly allocated to institutions and departments, which may themselves change identities, merge and de-merge over time.

While institutions will need to reduce the workload on academics by bulk loading information about staff, since the main incentive to use the system is that every academic has a record, there is also an issue about control. Should academics have the ability to alter their records at will? Are assertions automatically trusted or does a particular record for an academic's time at an institution need to be verified by that trusted body? Who should maintain a list of trusted bodies who can back up assertions? How will this effort be funded sustainably? It becomes clear that some of these points are central structural concerns whereas others may cover only fringe issues such as avoiding deliberate falsification, which may be rare.

Proprietary academic identifiers

There are also a number of proprietary identifiers associated with different commercial services related to electronic publishing and related academic service industries. Thomson Reuters and Elsevier provide identities for individuals and organisations as part of their bibliographic and academic services; similarly, search services such as Google Scholar (see the study in this blog post) and Microsoft Academic Search have also started to offer identifiers (see this blog post). There may be privacy issues, for example in Google and Microsoft publicly surfacing information about researchers without explicit consent: while this information might have been suitable for the limited purpose of publication, academics may not have intended for it to be synthesised into a single, public description of their personal details available to all.

Some of these services introduce new problems, since their primary purpose is commercial and it is often less of a priority to deal with the internal issues facing academic institutions unless that impacts significantly on the ability to make commercial profit. These may be resolved over time or be reintroduced as services change and compete: the academic has little or no control over the effects of commercial decisions upon their work. For example, Microsoft Academic Search often misrepresents outputs as belonging to similarly named individuals (thus is currently failing at unique identification) and, by default, requires the manual input of researchers to edit out errors and take a proactive approach towards managing the information about themselves. This brings the overall quality of data into question: for large-scale statistical purposes, this could be tolerable, depending on the degree of error; however, for academic citations and reporting purposes such as the Research Excellence Framework (REF), it would not be acceptable to use this data without further refinement, which would most likely remain a long, manual process.

Software and services

Any software application layer, whether operated by commercial companies, higher educational institutions, funders or governmental bodies, needs to be maintained. If information is harvested or processed automatically, it needs to be clear who corrects information where errors are found and what the resources are for academics to contact individuals with the time and effort available to improve the data as part of their work. In the case of commercial organisations, this is usually unclear and may change. There is no guarantee that the commercial reason for providing services will continue over time, unlike in most cases in the public sector within Higher Education. Coverage of such commercial services is often geared towards institutions rather than individuals: for example, Google Scholar requires registration using a valid university email address that it recognises, which would exclude private scholars and perhaps some retired staff who produce research.

The Web of Things

It has already been mentioned that electronic descriptions or other representations of physical objects may be found on the internet, including written descriptions, pictures, geographical locations, dimensions and so on. It is even possible to describe physical objects that were extant but are now historical, or which have moved or whose location is now unknown, referencing comparable objects and linking these descriptions with other resources that are related. In each case, the nature of the relationship, relevant agents who may have been responsible for it, and when it was valid can be described in metadata.

This opens the way for the Web of Things, a term used to describe that part of the Semantic Web that covers physical resources as opposed to, or as well as, purely electronic ones. Some authorities use the term to mean physical objects with miniaturised electronic devices to enable them to be located, whereas others merely mean any physical object that is described in a record on the Web. It may be argued that all electronic resources have relationships to physical ones, even if that is only with regard to authorship and subject. The Resource Description Framework (RDF) provides a means to describe these relationships and transmit information about them in ways readable to humans and machines. Although these are usually expressed as triples, where two things are described with a relationship between them, metadata structures such as the Common European Research Information Format (CERIF) can add link tables that give far more detailed information about the relationships themselves. All of this can be made available as Linked Data and surfaced in many software applications on the Web.
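The difference between a plain triple and a link table entry can be sketched in Python as follows. This is an illustration only: the class and field names are hypothetical and are loosely inspired by the idea of a link record that carries a role and a period of validity, not copied from the actual CERIF schema or from any RDF library. The example.org addresses are placeholders.

```python
from dataclasses import dataclass
from typing import NamedTuple, Optional


class Triple(NamedTuple):
    """Two things joined by a relationship, as in an RDF statement."""
    subject: str
    predicate: str
    obj: str


@dataclass(frozen=True)
class LinkRecord:
    """A relationship that carries its own descriptive attributes."""
    source: str
    target: str
    role: str
    start_date: Optional[str] = None   # when the relationship became valid
    end_date: Optional[str] = None     # when it ceased to be valid, if known


def to_triple(link: LinkRecord) -> Triple:
    """Flatten a link record to a triple, losing the dates on the way."""
    return Triple(link.source, link.role, link.target)


if __name__ == "__main__":
    link = LinkRecord(
        source="http://example.org/id/person/1",
        target="http://example.org/id/org/7",
        role="http://example.org/vocab/employedBy",
        start_date="2009-01-01",
        end_date="2012-12-31",
    )
    print(to_triple(link))
```

Flattening the record to a triple keeps the fact of the relationship but discards when it held, which is exactly the extra detail that a link table is meant to preserve.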

The Semantic Web is often seen as a utopian view of a future where no electronic resources will be published without complex information being provided or automatically generated about their origins. The reality is that manual entry of information is generally very limited unless it serves the purposes of the person entering it, and this cannot be relied upon as an approach to ensuring consistent metadata on a sufficient scale for the Semantic Web to work. Technology has in some cases improved to the extent that geographical and technical information is now automatically produced, for example in digital cameras and in mobile phones able to record GPS coordinates.

However, it is highly doubtful whether the effort and cost required to catalogue the entire physical world could be met, or to what extent this is even possible. Where the Semantic Web could be useful is within particular large bodies of data, for example experimental scientific data, publications and so on. In the case of the Web of Things, this could include art collections, photography, archaeological information, the locations of public institutions and many more. For all of these purposes, it will be necessary to provide unique identifiers for increasingly large numbers of resources, including things and agents, in order to provide complex metadata about them.

Education in the wider world

It has perhaps not been sufficiently investigated how unique identifiers for researchers and other staff in Higher Education will fit into the wider question of unique identification on the Web. Relevant purposes might be:

(1) commercial, for example the identification of companies and individuals owning the rights to photos, music, video or publications, particularly legacy resources of ongoing commercial value in terms of royalties and performance licensing.

(2) governmental, for example biometric information about people, used in border controls, crime prevention and citizenship contexts; or about public or private organisations such as charities, political groups of interest to law enforcement etc. Information about individuals, in particular, may be subject to privacy laws, which will vary between jurisdictions.

It is clear that there are interfaces between the various agents and outputs of academic institutions and many other purposes, notably those commercial and governmental activities already described. For example, a foreign student or member of staff seeking a work permit will require institutions and governmental bodies to use personal and citizenship information co-operatively, which will be linked to their academic identity in the course of their work at the institution. Some of this information will be private and some public, so there is an issue about who can see which parts of a particular corpus of Linked Data, requiring authentication protocols and systems.

The extent to which consistency of approach between HE institutions and other sectors and contexts can ever be ensured is moot, since there is of course no single international authority and because any single metadata solution that tried to cover so many diverse purposes would be fatally unwieldy. How different, flexible approaches can be understood by machine processing is perhaps the technological key to how well the Semantic Web will answer these questions in future, both within Higher Education and beyond.

Confidence, and the business of persistent identification

Paul Walk — Thu, 28 Oct 2010 12:48:10 +0000

The persistent identification of resources is a foundational element of the JISC Information Environment. There are several schemes and technologies available to support this, with one of the most prominently used in the JISC IE being the Digital Object Identifier (DOI). Built on the Handle technology, the DOI, under the stewardship of the not-for-profit International DOI Foundation (IDF), adds the important element of collective commitment and management, based on straightforward business interests. DOIs are allocated and managed through Registration Agencies (RAs).

DOI has become somewhat synonymous with scholarly publishing, with most people working in the JISC IE having encountered them in citations for papers in online journals and repositories. However, while publishers continue to play an important role in minting and using DOIs, the use of DOIs to persistently identify datasets produced in research is growing in significance. Last year saw the creation of a new RA - DataCite, which deals with this relatively new and growing area.

There has been much debate over the years about the persistent identification of resources - especially at the technical level. Yet all technical solutions are bound, eventually, to come up against the issue of the persistence, or lack thereof, of organisations of people. In the JISC IE space we can see that publishers come and go, and that journal titles, for example, merge or change ownership from time to time. Universities, seen by many as very persistent organisations (a pre-conception which might, sadly, be tested in the next few years) do, nonetheless, merge and change.

The creation of a body which has as its primary goal the management of the persistence of identifiers - essentially the role of the Registration Agency in DOI - is an approach to addressing this lack of permanence. Within the 'ecosystem' of the RAs, each participant has a vested interest not only in maintaining their own identifiers, but in ensuring that the system as a whole continues to function well. From this point of view, it is in the interests of all participants that the commitment from others is strong, which means that the addition of new RAs, such as DataCite, can only be a good thing.

Over the last year or so, IDF has been working with MovieLabs as part of a project to establish the not-for-profit Entertainment Identifier Registry (EIDR). This initiative includes the establishment of a new Registration Agency for DOIs for all digital resources created for TV and film by a consortium of many of the major producers in the entertainment industry. EIDR is actively seeking more participants, and offers a variety of types of membership.

While the engagement of this new industry may not be directly relevant to many people working in the scope of the JISC IE, the confidence and investment which this industry has placed in the DOI system is significant. This development increases the viability of DOI in general and, as such, should make it a more attractive prospect to those working in the JISC IE and in HE in the UK generally.

Essentially, confidence is an important aspect of persistence - and significant buy-in to DOI from such different sectors, commercial and public, should increase confidence in this solution.

A whitepaper about EIDR is available on request.

An introduction to DOI in a higher education context (set of presentation slides)

Aggregation and the Resource Discovery Taskforce vision

Paul Walk — Thu, 19 Aug 2010 12:43:35 +0000

On Tuesday of this week, UKOLN convened a group of invited experts to discuss aggregation in the context of the Resource Discovery Taskforce's vision. The Resource Discovery Taskforce (RDTF), a joint JISC / RLUK venture, has summed up its vision:

UK researchers and students will have easy, flexible and ongoing access to content and services through a collaborative, aggregated and integrated resource discovery and delivery framework which is comprehensive, open and sustainable

Given the limitations of time and resources, and with a firm intention to make a real contribution, the RDTF has decided to focus on aggregation of metadata as a means to progressing the vision. There was some debate at the meeting about the extent to which aggregation is something worth focussing on, and a general concern that this not become an end in itself, rather than a means to an end. We agreed to use the phrase 'aggregation as a tactic' as a way of characterising the proper relationship of this approach to the vision, and steered the remainder of the meeting to address aggregation from a mainly technical perspective. To get the ball rolling, I introduced a slide wherein I attempt to list possible reasons for aggregating data:

to address systems/network latency - a cache
for ‘Web Scale concentration’
- ‘gaming’ Google - raising ‘visibility’ of content
- network effects if user facing services also developed
to showcase (e.g. scale & nature of OER in UK)
to create middleman business opportunities
as infrastructure to support locally developed services
as an approach to preservation

This was discussed at some length, and we agreed that some other reasons could be added to this list:

for economic reasons - e.g. to achieve economies of scale through storing & managing metadata in one place, implying that the aggregation becomes the sole source of a given metadata record
to add value to the data through processes, especially around data quality, which are impractical or even impossible to contemplate when the metadata is distributed
to simplify licensing from the point of view of the consumer of the aggregated data

We noted that while the RDTF vision seems to concentrate on metadata describing resources and their provision, other types of metadata, such as user-generated annotations and user attention or activity data, which are also of great potential interest and value, might be aggregated advantageously.

The importance of registries to help in the identification and discovery of relevant data was raised.

For the second part of the day we broke the meeting up into three smaller groups, each concentrating on an aspect of the preceding general discussions. Each of these groups, when they summarised their discussions for the whole meeting later, identified issues and made recommendations. Where these are generally applicable (which they mostly are), rather than outline them in the following descriptions of the breakout groups I have treated them together in two sections at the end of this post.

Breakout 1: APIs

This group looked at the role which Application Programming Interfaces (APIs) have to play in an environment of aggregated metadata and related services. It used a spectrum of technological interventions ranging from specific service development to meet a particular need, through to generic infrastructure provision to provide opportunities for others to develop services, and attempted to place classes of APIs on this spectrum.

It was agreed that it was important to understand this distinction, and to be equipped to judge where to 'draw the line' between meeting specific requirements and investing in capacity for future innovation. There is clearly a tension between agility, which becomes more desirable as one moves along the spectrum towards those servicing users' requirements, and stability, which is necessary for infrastructure to be trusted. Part of the purpose of APIs is to help to manage this tension.

APIs are for developers, and so APIs on aggregations must be highly usable from the point of view of a developer. Focussing on the need for aggregations to expose APIs so that services can build upon them, this group made some recommendations (included in the general recommendations at the end of this post) about the sorts of general features an API should exhibit. In general, it was agreed that an API on an aggregation must be more convenient, from the point of view of a developer, than going directly to the individual sources. Leaving aside simple issues of network latency, in a possible Linked Data future where data is commonly openly available, the aggregation and its API must not become a barrier to building services and adding value to data.

This group also discussed the issue of federation of aggregations - where one aggregation feeds another. There are serious engineering issues with this kind of federation which require better understanding.

Breakout 2: Aggregation as tactic

This group decided to start by looking for "prior art" - examples of successful uses of aggregation as a tactic for improving resource discovery. With this approach, it was suggested, it would be possible to identify stakeholder groups which are already 'bought into' the idea of using aggregation as a tactic in this way, which ought to be easier than convincing people from scratch. The trick would seem to be to identify a shared service which could be developed upon an aggregation of metadata, and which they could recognise would be beneficial to them. Examples of successful aggregations were identified and included:

Copac (aggregated records from National, Academic, and Specialist Library Catalogues)
SUNCAT (a national serials union catalogue)
Worldcat (a global, aggregated library catalogue)

Echoing an earlier point, the group suggested that the value in aggregation as a tactic comes from the ability to normalise metadata into some sort of canonical form. This aspect of the aggregation adding value to the data it aggregates is crucial if the source record holders are to be persuaded to participate.
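What normalising into a canonical form might look like can be sketched briefly. The two source catalogues, their field names and the canonical field names below are all hypothetical, chosen only to show the idea of mapping differently shaped records onto one shape while keeping what cannot be mapped.

```python
from typing import Dict

# Hypothetical field mappings for two imaginary source catalogues.
FIELD_MAPS: Dict[str, Dict[str, str]] = {
    "library_a": {"Title": "title", "Author": "creator", "ISSN": "identifier"},
    "library_b": {
        "dc:title": "title",
        "dc:creator": "creator",
        "dc:identifier": "identifier",
    },
}


def normalise(source: str, record: Dict[str, str]) -> Dict[str, object]:
    """Map a source record onto the canonical field names.

    Fields with no known mapping are kept under "unmapped" rather than
    rejected, so nothing supplied by the provider is thrown away.
    """
    mapping = FIELD_MAPS[source]
    canonical: Dict[str, object] = {"source": source}
    unmapped: Dict[str, str] = {}
    for field_name, value in record.items():
        cleaned = " ".join(value.split())  # collapse stray whitespace
        target = mapping.get(field_name)
        if target is None:
            unmapped[field_name] = cleaned
        else:
            canonical[target] = cleaned
    canonical["unmapped"] = unmapped
    return canonical


record_a = normalise("library_a", {"Title": "  Nature ", "ISSN": "0028-0836"})
record_b = normalise("library_b", {"dc:title": "Nature", "dc:identifier": "0028-0836"})
```

Once both records share field names and cleaned values, matching and de-duplication across sources become simple comparisons, which is one way the aggregation can add value that the individual sources cannot.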

The group suggested that JORUM's role in supporting the national (and global) Open Educational Resources (OER) movement was very much in line with this thinking: that JORUM enhances discoverability of OERs created in UK institutions, while simultaneously offering the potential for long term archiving (preservation). Again, this group suggested, the importance of the registry becomes apparent, with JORUM likely to become important as a service providing identification and 'provenance' services.

The group discussed the idea of concentrating on one particular domain, such as geography, on the grounds that this could then be built out to an extent that other domains would become interested once they had seen what has been achieved. The counter to this argument was a suggestion that it might be better to consider a range of resource types including scholarly communications (bibliographic data), learning materials, repositories, spatial/geographical data and multi-media.

It was also noted that the 'aggregation as a tactic' argument might apply to self-archiving and Open Access, for which the arguments are similar to those for JORUM and OERs.

It was suggested that this was leading to a set of tactics which would help content providers get over a 'fear' of aggregation, and encourage them to open up from a position of 'data ownership'. It was also recognised that once this is achieved, aggregation as a tactic creates opportunities for 'middle-men' to add value through new services building on top of the aggregation.

Interestingly, this group suggested that aggregation might be a short- or medium-term tactic, and that the 'end game' would be to dis-aggregate content back to source. At this point, the remaining infrastructure would be of the 'registry' type, helping to locate data at source.

Breakout 3: Build better websites!

The emphasis of this session was on advising & enabling those who hold source metadata to make it available in an appropriate form. The group identified a number of 'steps' that a content provider might take. These steps are ordered in a system of progressive desirability in a model influenced by Tim Berners-Lee's Linked Data Note:

(1) make data available in an open form (even using the much-maligned CSV format if necessary)
(2) assign and expose HTTP URIs for everything, and expose useful content at those URIs
(3) publish as XML
(4) expose semantics

It was noted that these steps do not demand that a provider should work their way through them sequentially - it is perfectly acceptable and even desirable to jump in at step 4 - however this might represent a significant barrier to some, so steps 1-3 are there to give content providers a chance to engage comfortably.
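The four steps listed above can be walked through with a toy collection using only the Python standard library. Everything specific here is an assumption for illustration: the base URI, the sample rows and the XML element names are invented, and the only real vocabulary used is the pair of Dublin Core Terms property URIs. The sketch mints URIs but does not serve anything at them, which a real provider would need to do for step 2.

```python
import csv
import io
import xml.etree.ElementTree as ET

BASE_URI = "http://example.org/collection/"  # hypothetical base URI
DC_TITLE = "http://purl.org/dc/terms/title"
DC_CREATOR = "http://purl.org/dc/terms/creator"

# Step 1: data in an open form, even if that is only CSV.
CSV_DATA = (
    "id,title,creator\n"
    "1,Harbour at dusk,A. Painter\n"
    "2,Field notes,B. Writer\n"
)


def read_rows(text):
    """Parse the CSV text into a list of dictionaries, one per item."""
    return list(csv.DictReader(io.StringIO(text)))


def uri_for(row):
    """Step 2: give every item its own HTTP URI."""
    return BASE_URI + row["id"]


def to_xml(rows):
    """Step 3: publish the same records as XML (element names invented)."""
    root = ET.Element("records")
    for row in rows:
        record = ET.SubElement(root, "record", {"uri": uri_for(row)})
        ET.SubElement(record, "title").text = row["title"]
        ET.SubElement(record, "creator").text = row["creator"]
    return ET.tostring(root, encoding="unicode")


def to_triples(rows):
    """Step 4: expose semantics as (subject, predicate, object) triples."""
    triples = []
    for row in rows:
        triples.append((uri_for(row), DC_TITLE, row["title"]))
        triples.append((uri_for(row), DC_CREATOR, row["creator"]))
    return triples


rows = read_rows(CSV_DATA)
```

Each step reuses the output of the one before, which is the point of the model: a provider who has only reached step 1 has still done work that steps 2 to 4 build on.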

Barriers specific to this model being adopted successfully include the issue of securing vendor 'buy-in'. For content providers to support this model, their software platforms need to enable it. In most cases this may not yet be so. Also, specific skills in Linked Data are not so widespread in these sectors (yet), and an appreciation of and support for Linked Data is not common among senior managers. It was recommended that JISC create some political momentum around this, perhaps devising a convincing argument for senior management. It was also suggested in this breakout group that RDTF should provide a central resource (guidance & possibly infrastructure) for hosting data, especially for smaller organisations.

This approach was summed up as a description of a potential glam.ac.uk where glam is galleries, libraries, archives and museums.

General Issues

Lack of technical expertise in libraries, museums and archives. This applies most strongly in respect of the 'build better websites' model, but is also true more generally, especially when the long-tail of glams is considered.
Business case, or possible lack thereof. The content providers need to see a clear benefit before committing to the cost involved in supporting the aggregation of their data.
Content providers often show a reluctance to make data openly available on the grounds that they may expose poor quality which reflects badly on them.

Recommendations

The various discussions during the meeting gave rise to a number of suggested recommendations. It should be noted that these are based on a few short hours of discussion - however the experience of the group which made them is considerable, so I hope they might be considered seriously.

The 4 step model for advising/supporting content providers in opening up their metadata
The RDTF should fund aggregation projects that demonstrate value in these steps
- e.g. "Tell me how my content is being used"
Providers should provide a semantic sitemap leading to a data aggregation. This could be RDF or XML
Providers should expose the schemas they use (whether their own schemas or links to established schemas)
Aggregation services should provide guidance to content providers about schemas to be used (a registry of recommended schemas would be a useful component)
Aggregators should not reject data on basis of schema used by the content provider - aggregators should be prepared to accept anything
The RDTF should (in partnership with others) seek to engage with vendors of collections/content management systems in the various domains.
Aggregations should have supported APIs which are attractive to and convenient for developers, offering developer-friendly output formats such as XML or JSON
Aggregation should be considered, perhaps, as a temporary approach to aiding discoverability. More extremely, a 'just in time' approach to aggregation might be considered.
A 'cookbook' of design patterns involving aggregation as a technical approach to resource discovery might be a useful thing to consider funding.
A '2 tier' model of metadata might be worth considering, where one tier is for common, basic description and identification, and the other tier is for more targeted uses.

Many thanks to those who attended and made the meeting a success:

Peter Burnhill (Edina)
Hugh Glaser (Seme4)
David Kay (Sero)
Andrew Kitchen (Becta)
Ross MacIntyre (Mimas)
Andy McGregor (JISC)
Paul Miller (Cloud of Data)
Andy Powell (Eduserv)
Owen Stephens (independent)
Adrian Stevenson (UKOLN)
Paul Walk (UKOLN)
Jo Walsh (Edina)

And thanks to Adrian also for organising the meeting.

DuraSpace

Talat Chaudhri — Fri, 15 May 2009 12:03:00 +0000

The recent announcement of the merger of DSpace and Fedora Commons as DuraSpace is potentially a very significant advance in the repositories sector. Although the two platforms will continue to exist as separate entities, they will no doubt collaborate to their mutual benefit in technical development. In addition, new software products such as DuraCloud are described.

I have personally found DSpace to be an effective and flexible platform, although until version 1.5 it was missing some fundamental functionality that meant it was overall an inferior product to EPrints. However, I have always said that if certain issues were sorted out, such as a more granular permissions system, versioning and so on, it is otherwise as good overall as EPrints. (I particularly appreciate how easy it is to see full metadata records in DSpace, from the point of view of research, though this is an entirely trivial technical point - it just happens to suit my work in my present and previous post!)

DSpace is clearly second to EPrints in terms of market penetration, but it is the only other major competitor to enjoy such a sizeable market share. Fedora is third not on its technical merits but largely because it is not a "packaged" product and requires much more customisation. It is evident that both platforms have much to gain from the collaboration. I would bet that the EPrints people may well have cause to worry about their future market dominance, given this development.

I'm particularly interested because Fedora makes much greater use of RDF, a technology that has its supporters and detractors, but has not been the basis for a wholesale change to the promised Semantic Web that might have been hoped for. However, one can see the potential application within content management systems such as repositories. One stumbling block seems to be that triple stores are not particularly efficient databases and need significant optimisation efforts before they rival traditional relational databases, a subject on which I am not a great expert at present. (I thank a colleague at UKOLN for educating me on this.) It is particularly interesting, then, to note the reference to efforts to improve the triplestore-based storage layer Mulgara.
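One way to see where the extra cost can come from is to store the same record twice in SQLite, once as a conventional row and once as generic triples, and compare the queries. This is a deliberately naive sketch with invented table and column names; it says nothing about how Mulgara or any real triple store is implemented, only why a generic subject-predicate-object layout tends to need more joins.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE paper (id TEXT PRIMARY KEY, title TEXT, author TEXT);
    CREATE TABLE triple (s TEXT, p TEXT, o TEXT);
    """
)

# The same facts stored twice: one relational row, two triples.
conn.execute("INSERT INTO paper VALUES ('paper:1', 'On Repositories', 'A. Author')")
conn.executemany(
    "INSERT INTO triple VALUES (?, ?, ?)",
    [
        ("paper:1", "title", "On Repositories"),
        ("paper:1", "author", "A. Author"),
    ],
)

# Relational layout: one row lookup returns both properties.
relational = conn.execute(
    "SELECT title, author FROM paper WHERE id = 'paper:1'"
).fetchone()

# Triple layout: each further property needs another self-join on the table.
via_triples = conn.execute(
    """
    SELECT t.o, a.o
    FROM triple AS t
    JOIN triple AS a ON a.s = t.s
    WHERE t.s = 'paper:1' AND t.p = 'title' AND a.p = 'author'
    """
).fetchone()
```

Both queries return the same pair of values, but the second grows by one join for every additional property requested, which is the kind of pattern that indexing and query optimisation in a dedicated triple store have to address.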

I'm awaiting further developments with considerable interest, noting the new version DSpace 1.5.2 and recent references to the planned versions 1.6 and 2.0. I wonder how much the repositories community will have changed in a year's time? Things seem to be moving fast right now.

The collaborative research environment: publications management, CRIS systems and repositories

Talat Chaudhri — Thu, 02 Apr 2009 16:41:00 +0000

Some months ago, I intended to write a post inspired by this post by Chris Rusbridge on CRIS systems. One particular motivation for this was a reference he made to some comments by Stuart Lewis, particularly relating to the Symplectic Publications Management System. This was the subject of an impressive demo in Aberystwyth by Daniel Hook of Symplectic and Imperial College, London, while I was in my previous post as Repository Advisor. We were very interested by CRIS systems as back-end systems for managing research outputs at source. I believe that Stuart had been in touch with Niamh Brennan, who has since been kind enough to give me some general details about the CRIS system that has been produced in-house at Trinity College, Dublin. I should note that I have not seen it in operation.

One reason that I never published the original post is that most commentators in the repository community probably agree that a CRIS is innately a Good Thing in any event, so it needs no detailed endorsement from me. (A further concern is that I do not wish to be seen to be making specific comments on developments at Aberystwyth, which is a matter for them, so I shall confine my remarks to the systems at issue here.) In short, as Stuart’s comments cited by Chris in the above blog make clear, the CRIS does not replace the repository but “sits behind it” and provides content. There are of course a great variety of ways in which this could be achieved in practice, depending on the particular system in question. I will address the definition of a CRIS further below. However, the Symplectic system deserves more detailed commentary here, for reasons that will become clear.

Symplectic Publications Management System

There seems no reason to doubt Symplectic’s obvious competence in publications management, having witnessed how effective this piece of software had already been, albeit having only been tested in a limited number of institutions at the time. It was, to my mind, impressive that Daniel Hook was both frank and receptive about the merits and development potential of his system. Indeed he had no real need for the hard sell because the software had been built "by academics for academics", so it consequently fulfilled most of the basic requirements already and many of the other features that he was questioned about were already in the pipeline.

However, this post is by no means intended as a wholehearted eulogy directed at the Symplectic system. It was developed in the context of scientific disciplines, and had not at the time been tested in arts and humanities disciplines, although to be fair this point was made openly by Daniel Hook himself. I do not see why it should fail in principle in non-science disciplines, although the take-up may be less enthusiastic in the same way that one finds in repositories. The real problem is the limited coverage of these disciplines by union databases such as Web of Science, though this is not the fault of Symplectic. Their efforts may in time provide one reason to improve those databases, which underlie the core functionality of the system.

Briefly, the system replaces the university's reporting system. The academic logs on and merely has to choose whether to agree that a new paper in Web of Science belongs to him or her. They are able to declare that it is identical to another version, perhaps in another database, or else they can separate two papers of the same name, e.g. a conference paper and a published paper. They can optionally correct the metadata. Evidently, for papers not found in Web of Science, academics can enter the details of the papers manually, although this process then loses all of its automated advantage over author self-deposit in a repository. For this reason, the system has the greatest advantage in science subjects. The administrator can see whether academics are responding to its suggestions and thus, in disciplines with good database coverage, can gain a good idea of how comprehensively they are archiving content. Finally the system is entirely interoperable with DSpace, so that items may be deposited through either interface (I am less sure about EPrints and Fedora). Permissions can be set and delegated on a fairly granular level, unlike in repositories, so that academics can have control over who can deposit items on their behalf, such as co-authors.

The system provides useful tools to help simplify and automate research reporting that is already mandatory for academics, rather than introducing new obligations and unnecessarily duplicating the deposit/ingest phase. By contrast, repositories require this as an additional process, after research reporting. Of course, certain repositories are used for research reporting, which requires an effective or explicit institutional mandate, but in their present forms they are badly suited to describing the finer details of funding grants and projects. This is handled rather better by Symplectic, but obviously requires manual metadata input.

It may come as no surprise, given my involvement with SWAP, that I should level a criticism at the Symplectic system on the basis of its simple versioning model. Clearly one might wish to be able to say more than whether two papers are identical or not. It was clear that no complex relationships could be described at the time of the demo. I hope that this is addressed in future iterations of the software.

CRIS systems

A further issue is whether the Symplectic system qualifies as a CRIS at all, as pointed out to me by Niamh Brennan. Apparently it does not. There is no indication that it supports the CERIF standard maintained by EuroCRIS, and moreover it does not provide the means to support the creation and versioning of research outputs and related project information from their very inception, which is one of the purposes of a CRIS system. Instead, it only deals with papers after publication, in the same way as a repository.

Whether or not academics, particularly those whose working practices are long-established, would willingly switch over to using a CRIS in this way is perhaps a matter for doubt. It might be possible to compare, for example, mandated and voluntary use of CRIS systems, but at present I have no evidence for the relative feasibility or success of either approach. I hope to be able to see other CRIS systems in action, and that an Open Source platform will become available before long. I suspect that they have considerable potential in managing research and making it freely available. However, it may be naïve to simply assume that they will provide a complete solution for research management. At present, there appear to be three commercial competitors: CONVERIS, PURE and UniCRIS [updated 22/09/2009].

The functionality that these systems highlight that traditional repository platforms lack – with the proviso of course that no two in-house CRIS systems necessarily share the same functionality – is the ability to offer a collaborative environment for research management and research reporting. In the case of a CRIS, this is supposed to extend to the creation of research from its inception. As a repository manager, I recall being asked by several academics why they were unable to manage their own metadata in our repository, considering that responsibility for their own research is a core function of their employment. This is an entirely fair criticism, to which I will return.

It is difficult to see why a CRIS should not share all of the main characteristics of a repository. Both manage the ingest of bitstreams and the creation of metadata. Both repositories and the Symplectic system can be used to expose these materials to the web or alternatively set permissions to view them, and to generate publications lists for authors or research groups. Ultimately these are a set of very similar systems with slightly different but related, complementary purposes.

Implications for repositories

This discussion highlights the conceptual similarity of the current repository platforms to publication platforms, despite the usual claim that they are not publishing research but merely archiving it after publication. In legal and practical terms at least, they are self-evidently re-publishing it by making it publicly available in a further location. (One might speculate that this is the reason for the animosity of certain publishers towards repositories.) After all, to publish ultimately means to make public. But in academic usage, publishing also includes peer review. In my view, it might be a good time to start drawing a clear distinction between these functions.

I have heard Paul Walk refer many times to the “silo effect”, which seems to be at the core of the problem with the present repository platforms. Unlike other websites, the repository is not an interactive site. Having seen the statistics for repository access using Google Analytics, I can report that only a small minority of visitors access the site directly or by referral from university web pages. Virtually all papers are found using Google (only rarely other search engines) and those visitors do not remain on the site after they have found the content. They are largely oblivious to the existence of the repository. Very few users were referred by Intute or OAIster. All of these problems are to some extent unavoidable and well-known, but the repository should at least be more interactive than it is from the point of view of the depositors.

I should note that the large part of my experience with repositories relates to DSpace, though I am also familiar with EPrints. The My DSpace (user area) function offers the user no functionality other than being able to change their personal details and submit content, and even the latter usually requires manual authorisation by the administrator. EPrints is generally similar. This immediately results in confusion, erroneous error reports and so on from sometimes irate new users, but it is quite logical for the repository manager, who needs to know who is submitting content and generally wants to intervene to give basic initial copyright advice in order to save problems later with incorrect versions being supplied.

There is clearly a role for automating the setting of permissions for deposit on the basis of staff records. Though it presents problems, it can be done routinely for other university software systems (including apparently the Symplectic system). I would be interested to hear whether this has been done successfully in the majority of EPrints and DSpace repositories.

But the issue runs deeper: the only tool that the repository manager has to safeguard copyright liabilities is the ability to monitor submissions. It would be most unwise to set editorial permissions for all authors over their own metadata because there is no function to view recent changes, only recent deposits. Such changes might easily be in breach of copyright, despite the author’s best intentions. (In DSpace up to version 1.5, permissions can only be set by group, not for individual authors.) There is no easy versioning system, such as the sort used in wikis, and any changes to bitstreams can normally only be reversed by recourse to restoring from general back-ups.

It is evident to me that the present situation is not scalable. When the majority or even the plurality of research outputs are available, there will be far too much in each repository for its staff to deal with. A more scalable, collaborative solution is needed before that situation is reached, involving all parties concerned with the production of research in maintaining it on the web.

General recommendations

To summarise, the consequence is that items are “frozen” as soon as the repository manager completes checking for copyright and basic metadata compliance, after which any changes must be requested by email. The following features would need to be in place in order to allow repository managers to implement a more collaborative workflow:

(1) historical version control needs to be in place so that editors can easily roll back records and choose which version may be viewed by the public.

(2) in addition to the initial checking step in the workflow, all recent changes to the repository need to be visible to the repository manager. These could be vetted, i.e. they could remain pending until allowed by the repository manager, if desired.

(3) users need to be better identified as authors, even where another individual, e.g. co-author or administrator, has deposited on their behalf.

(4) the permissions system needs to allow granular control over what an individual may change within their own deposits, whether bitstreams or metadata, and within groups in the repository.

(5) bitstreams and links to versions elsewhere on the web should be treated equally in the interface, since the user is concerned with getting the resource, not where it is.

(6) as in the Symplectic system, users should be allowed to delegate those rights to individuals within their workgroup.

(7) as in Symplectic, users should have control over which publications appear in their publications list, and in what order.
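Recommendations (1), (2), (4) and (6) can be sketched together as a small data model. What follows is a minimal illustration in Python under my own assumptions, not the design of any existing repository platform: the Version and Record classes and all of their fields and methods are hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Version:
    number: int
    editor: str
    metadata: dict
    approved: bool = False  # stays pending until the repository manager allows it


@dataclass
class Record:
    author: str
    editable_fields: set = field(default_factory=set)  # (4) granular control
    delegates: set = field(default_factory=set)        # (6) delegated rights
    versions: list = field(default_factory=list)       # (1) full history is kept
    public_version: Optional[int] = None

    def can_edit(self, user, field_name):
        trusted = user == self.author or user in self.delegates
        return trusted and field_name in self.editable_fields

    def propose(self, user, changes):
        """Store a change as a new pending version; nothing is overwritten."""
        for name in changes:
            if not self.can_edit(user, name):
                raise PermissionError(f"{user} may not change {name!r}")
        # Simplification: each proposal builds on the latest stored version.
        base = dict(self.versions[-1].metadata) if self.versions else {}
        base.update(changes)
        version = Version(len(self.versions) + 1, user, base)
        self.versions.append(version)
        return version

    def pending(self):
        """(2) Recent changes awaiting the repository manager's decision."""
        return [v for v in self.versions if not v.approved]

    def approve(self, number):
        self.versions[number - 1].approved = True
        self.public_version = number

    def roll_back(self, number):
        """(1) Make an earlier approved version the public one again."""
        if not self.versions[number - 1].approved:
            raise ValueError("only approved versions can be made public")
        self.public_version = number

    def public_metadata(self):
        if self.public_version is None:
            return None
        return self.versions[self.public_version - 1].metadata


record = Record(author="asmith", editable_fields={"title", "abstract"})
record.propose("asmith", {"title": "Draft title"})
record.approve(1)
record.delegates.add("bjones")           # asmith delegates to a colleague
record.propose("bjones", {"title": "Corrected title"})
pending_before = [v.number for v in record.pending()]
record.approve(2)
record.roll_back(1)                      # the manager reverts the public view
print(pending_before, record.public_metadata())
```

The point of the sketch is that an author's edit never replaces what the public sees until it has been approved, and that an approved version can always be restored without recourse to back-ups.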

It would also be desirable for repositories to use a method similar to that used by Symplectic to compare records to new items appearing in Web of Science. In that way, authors would have an ongoing dialogue with the repository, and an impetus to use it as a collaborative tool. In effect, there should be no conceptual difference between a publications management system and a repository – and, frankly, the former term is instantly meaningful to the user, while the latter means nothing to them at all. Such public relations disasters have led to the present stagnation in repositories. I would suggest that in any working system, the CRIS, the publications management system and the repository ideally need to be modular parts of the same software.

The repository world currently suffers from a dictatorial, top-down management structure that is in considerable part imposed by the design of the current repository platforms. At present, repositories do not meet the standards of collaborative software that we have come to expect from Web 2.0 services. They also seem to be in direct conflict with the traditional responsibility of the academic and/or department for proper research reporting within the institution. The Symplectic Publications Management System and the concept of the CRIS offer a more collaborative model that can help avoid repeating the same tasks. In addition, they can help record complex grant information, some of it confidential or commercial, that may not be for public release on the Web. Moreover, they offer an interactive workflow with real time-saving benefits for academics. In contrast, self-archiving in repositories merely repeats a task that is mandatory elsewhere, and does not represent joined-up thinking.

Incorporating the already mandatory process of research reporting into such systems would effectively bypass the need for institutional mandates for self-archiving, since self-archiving would become part of the existing workflow through new software tools. Progress with such mandates has been crushingly slow, despite the success of the few that have been achieved. However, since all institutions are organised differently, it is simply not the case that any single method for acquiring freely accessible content, however effective when implemented, is the only possible route to success. Collaborative research management is clearly a good idea for this end, as well as for saving time and effort on the part of academics, irrespective of the merits of institutional mandates. It would also be a more scalable, sustainable basis for repositories to work on than the present situation. On that basis, I suggest that it may well be a Good Thing.