Microservices in (and beyond) Research Information Management
Microservices: are they all that new?
Recently there has been something of a revival of interest in a small-scale development approach towards software design for repositories: microservices. This is far from an entirely new idea but seems to have been somewhat slow to develop in practice, even to date; a useful summary of the approach was given by Neil Jacobs back in 2010. Moreover, a modular approach towards software that fulfils various related functions in managing web content related to research clearly has a much longer history, and is not in itself particularly surprising in software development more broadly. However, it seems that microservices as an approach is gradually acquiring a clearer identity within this space, so it may be worth taking a look back at the nature of the types of software used in managing research content of various types, how they are related, and whether and to what extent terms like "repository", "Current Research Information System", "Research Information Management system" and so forth overlap in terms of software functionality that they offer.
Defining terms: "repository", CRIS, RIM etc
Institutions within Higher Education are often faced with questions of procurement such as technical suitability and sustainable technical support. Although these areas are broader than those normally covered by the Technical Foundations web site, since they encompass non-technical considerations related to funding, policy and practice that drive software acquisition in universities and related institutions, the purely technical aspects are securely within scope and of considerable interest to the community at large in terms of developing useful technical guidance.
The question "What is a repository?" is likely to have a range of possible answers, but Neil Jacobs noted the revival of an approach summarised in Cliff Lynch’s 2007 description of the institutional repository as “a set of services that a university offers to the members of its community for the management and dissemination of digital materials created by the institution and its community members”. Without reiterating the points made by Neil Jacobs in detail, suffice it to say that efforts in this direction have been led by institutions such as the California Digital Library, notably by John Kunze and others. The difficulty with this approach is not purely technical but one of technical resources. Nor is it unique to microservices: it can also be seen, for example, with systems such as Fedora Commons.
Software development approaches
While the most modular, customisable and flexible technical approaches can often be adapted most quickly (and arguably most effectively) to the demands placed on them, significant development resources, usually in-house, are generally required to tailor the software to local requirements. In practice, the result is often that only certain large institutions are able to justify and support software systems such as Fedora or even "roll their own" local software solutions. A useful example is the eSciDoc suite of services, developed by the Max Planck Society and FIZ Karlsruhe. Together, these services effectively represent what in other contexts (e.g. the Linux world) might be called a "distribution", in this case based on Fedora. It is also worth noting that these services have been developed so that they can be used independently of eSciDoc, for example with DSpace or another repository system. In this way, true to Cliff Lynch's definition, each aspect of what together we call a "repository" is handled by a different piece of software, which then interoperates with a range of other web services according to local requirements.
"Does it do more than we already do?"
This, in a nutshell, is the microservices approach. However, there is no reason why the question should be restricted to repositories, since "repository" is itself something of a catch-all term for a class of web content services that are by no means identical in their principal functions and aims, even where they are using the same underlying software. Where, for instance, does the functionality of a repository end and that of a research management system, research information management system or Current Research Information System begin? Without a clear understanding of what these systems do, it is possible if not likely that higher education institutions, especially where decisions about procurement could be made by relatively non-technical managers, might easily end up acquiring more than one system with overlapping functions. Clearly, in times of difficult financial circumstances, this ought to be avoided wherever possible. It is worth spelling out what exactly different systems do in order to minimise duplication of effort.
Similar software issues facing HEIs
The question need not be limited to repositories and research information management either, although it is not the intention to go into great detail in this particular blog post. For example, libraries are frequently offered new products either by vendors with whom they have existing contracts or by their rivals. It is always in the interests of a vendor to sell a new product, so the question of duplicated technical functionality, and of the most effective technology to address a local need, is of far more pressing concern to the institution than to the vendor. A range of commercial library portals is on offer, built on but extending the functionality of library catalogues and commercial publications databases related to e-journals such as Web of Science. It is a common experience amongst library staff to feel unsure to what extent new software is offering new functionality, how it fits their technical requirements, and to what extent it may be re-packaging existing functionality in new clothes. The same could perhaps be said, for example, of systems relating to human resources or institutional finance offices.
What else can these systems do?
Returning to repositories and research information management, it is clear that a wide range of resource types is being published on the web through a range of related systems. The best recognised use of the repository is as a research publications repository, which is usually how the wider term "institutional repository" is understood within the context of higher education and issues relating to but not confined to Open Access. Increasingly, attention has turned to Current Research Information Systems, based on the CERIF standard, and similar research information systems. Of particular interest is the RMAS approach, effectively building such a system from a range of related pieces of software, i.e. a microservices approach outside the limits of the repository sphere. Research information management covers all aspects of the processes of research creation and dissemination, including research reporting, human resources, finance and publication, while publications repositories commonly focus only on the last of these. Publication is usually the area where institutions operate systems whose functionality overlaps, as there is no reason in principle why a CRIS cannot expose research publications on the Web: this is possible with the main commercial systems such as PURE and Converis, for example.
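As a rough illustration of how such overlap might be spelled out, the short Python sketch below lists the functions each system covers and reports those shared by any pair. The function labels are taken from the paragraph above, but the mapping of labels to systems, and the `overlapping_functions` helper itself, are assumptions made for illustration rather than a description of any real product.

```python
from itertools import combinations
from typing import Dict, Set, Tuple

# Illustrative (assumed) inventory of the functions each system covers.
systems: Dict[str, Set[str]] = {
    "publications repository": {"publication"},
    "CRIS": {"research reporting", "human resources", "finance", "publication"},
}


def overlapping_functions(
    inventory: Dict[str, Set[str]]
) -> Dict[Tuple[str, str], Set[str]]:
    """Return the functions shared by each pair of systems, if any."""
    overlaps: Dict[Tuple[str, str], Set[str]] = {}
    # Sort by system name so that the pair keys are deterministic.
    for (name_a, funcs_a), (name_b, funcs_b) in combinations(
        sorted(inventory.items()), 2
    ):
        shared = funcs_a & funcs_b
        if shared:
            overlaps[(name_a, name_b)] = shared
    return overlaps


if __name__ == "__main__":
    for pair, shared in overlapping_functions(systems).items():
        print(pair, "both cover:", sorted(shared))
```

An inventory of this kind is only as good as the descriptions fed into it, but even a crude list can make duplicated functionality visible before a procurement decision is taken.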
In any case, there is no necessary limitation on the term "repository" to cover only resources relating to the outputs of research. Teaching and learning materials, amongst a wider range of educational resources, are another major area that has seen substantial growth in the last two or three years. Various types of media resources from images to time-based media such as audio and video recordings are found in institutional repositories for a number of different academic purposes, e.g. art collections, media archives, music collections, health information and so on, not all of which are the direct products of either research or teaching but may be connected with one or both. In this context, it is as well to remember that the term "repository" means little more in essence than "organised place or system to put something [on the Web]" and that many such systems, especially older ones, have always been known as "digital archives", "electronic libraries", "media collections" and so on, in contexts where the word "repository" would still not generally be recognised. Large data collections are often stored in systems that are, in effect, repositories, but whose development has been through systems not normally known by that term.
Solutions that fit problems
In summary, dividing the world of software systems for academic and related outputs too rigidly into "repositories" and "research information systems" may be at the root of many of the difficulties that arise in understanding which technical functionality is required for any given local purpose and the extent to which systems overlap. A better, more precise understanding of these functionalities would help to avoid unnecessary duplication of effort and proliferation of systems. Some approaches are effectively bundled within one piece of software for a particular purpose, e.g. DSpace and EPrints in the repositories space. These offer a conventional set of services that fit the requirements of most institutions but may place some limits on the ability to customise those services indefinitely. Even so, they are built to be general-purpose systems with considerable potential for local customisation. However, there is a tendency, seen elsewhere (for instance in open source software with a large and disparate user base), towards software bloat: more and more functionality, some of it never used by the majority of implementations, is shipped with each succeeding version as new scenarios are encountered.
Microservices sit at the opposite end of this spectrum, although they potentially introduce the problem of securing sufficient and sustainable technical development effort. Each service is ideally a separate entity on the web server, built for maximum interoperability with the other services that may be required for local purposes. Rather than acting as plug-ins to a base software system (which is perhaps an intermediate approach), these are separate code bases able to run independently, even where they may have been intended, as in RMAS or eSciDoc, to be used frequently together. The technical issues and demands of each system will be different in every case.
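The idea of independent services combined according to local need can be sketched in a few lines of Python. This is a deliberately simplified illustration: real microservices would normally be separate processes communicating over the web, whereas here plain functions stand in for them. The service names (`mint_identifier`, `store`, `publish`), the `compose` helper and the example.org address are all hypothetical and do not correspond to any actual RMAS or eSciDoc component.

```python
from typing import Callable, Dict, List

# A record passed between services is a plain dictionary in this sketch.
Record = Dict[str, str]
Service = Callable[[Record], Record]


def mint_identifier(record: Record) -> Record:
    """Hypothetical identifier service: derive a local ID from the title."""
    slug = record["title"].lower().replace(" ", "-")
    return {**record, "id": f"local:{slug}"}


def store(record: Record) -> Record:
    """Hypothetical storage service: mark the record as stored."""
    return {**record, "stored": "yes"}


def publish(record: Record) -> Record:
    """Hypothetical publication service: expose the record at a web address."""
    return {**record, "url": f"https://repository.example.org/{record['id']}"}


def compose(services: List[Service]) -> Service:
    """Chain independently written services into one local workflow."""

    def pipeline(record: Record) -> Record:
        for service in services:
            record = service(record)
        return record

    return pipeline


if __name__ == "__main__":
    # One institution might need only identifiers and publication...
    light = compose([mint_identifier, publish])
    # ...while another also wants storage in the same workflow.
    full = compose([mint_identifier, store, publish])
    print(light({"title": "Annual Report"}))
    print(full({"title": "Annual Report"}))
```

The point of the sketch is that no service depends on a shared base system: each can be left out, replaced or reused elsewhere, which is what distinguishes this arrangement from a plug-in model.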