Technical Principles for the Discovery Ecosystem

22 July 2011 - 12:46pm — Paul Walk

Editorial note: This is a draft proposal and the contents were not finalised. It was a very early attempt to outline some over-arching technical principles to underpin the Discovery ecosystem, and contributed to the discussion that informed that work.

This is a revised version, following comments from the Discovery management group.

Underlying Concerns

The 7 technical principles outlined below should be considered within the context of the following underlying, non-technical concerns.

Pragmatism, balancing consensus with agility

Consensus on technical and information standards is what allows information systems in the Discovery ecosystem to interoperate. Discovery favours open standards, but is also pragmatic about the adoption of less open standards where they are in mainstream use.

While there are, undoubtedly, benefits to be gained from consolidation, standardisation and consistency of approach in the Discovery ecosystem, it is also understood that there are domains and communities of practice within the ecosystem which take different approaches, use different standards etc. Beyond standards, Discovery recognises the likely emergence of conventions - agreements based on less formal consensus which generally emerges from community practice. This is anticipated especially in the case of smaller domains or use-cases, where formal standardisation is too expensive in terms of resources.

The ecosystem includes many domains which may overlap to a greater or lesser extent, so context sensitivity is an important principle of Discovery.

Staying open to innovation

No information system is perfect. Advances and changes in technology and human behaviour introduce a steady flow of opportunities and challenges - it is these which drive innovation. Discovery recognises the need for stability and persistence of data and services, balancing the needs of infrastructure with a more responsive approach to continuous innovation (or ‘perpetual beta’). In order to manage this, Discovery encourages the adoption of a set of operating guidelines designed to allow a guaranteed, low level of interoperability.

One type of user that Discovery recognises as especially relevant to the ecosystem is the developer.

Technical Principles

1. Discovery is heterogeneous

The Discovery ecosystem is a heterogeneous environment, encompassing a wide variety of users, resources and types of resources, domains, technologies and business models. Discovery balances the need for a degree of homogeneity to serve management and interoperability requirements, with a recognition of the importance of variety in any ecosystem.

2. Discovery is resource-oriented

Discovery is innately resource-oriented. It is a principle of Discovery that metadata resources may have intrinsic value, and that the ‘opening up’ of these to all will create more value as they are used, enhanced and combined with other resources. This is not to say that services are not also important within the Discovery ecosystem. However, the principles of resource orientation are designed to allow the value of the resources themselves to become immediately exploitable by providing simple and easy access to them, without requiring a ‘thick’ layer of services.

The Web is clearly one of the dominant global information distribution systems, and it is the common (though by no means only) context within which people search for digital resources. Any strategy aimed at increasing the discoverability of metadata resources must therefore have a clear approach to exploiting The Web.

The Resource Oriented Architecture (ROA), as defined by Richardson and Ruby, offers guidelines for the implementation of the REST architecture, specifically on The Web. Where REST is a theoretical framework, the ROA guidelines give pragmatic advice, showing how REST can be utilised to good effect in a Web context. The success of the ‘Web 2.0’ paradigm owes much to the widespread adoption and validation of the principles outlined in the ROA. It is, therefore, an approach which is proven to work in some contexts.

The ROA approach is well suited to Discovery’s core principle of open data.
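
A minimal sketch of what resource orientation can look like in practice, assuming a hypothetical in-memory store of metadata records and a hypothetical /records/<id> URI layout (neither is taken from Discovery itself): each record is addressable at its own URI and is retrieved with a plain HTTP GET, with no further service layer in between.

```python
import json
from wsgiref.util import setup_testing_defaults

# Hypothetical in-memory store of metadata records, keyed by identifier.
RECORDS = {
    "rec-001": {"title": "An example record", "creator": "Unknown"},
}

PREFIX = "/records/"  # hypothetical URI layout: one URI per record


def app(environ, start_response):
    """Minimal WSGI application exposing each record as a resource."""
    path = environ.get("PATH_INFO", "")
    if environ.get("REQUEST_METHOD", "GET") != "GET":
        # Read-only sketch: only the GET part of the uniform interface.
        start_response("405 Method Not Allowed", [("Allow", "GET")])
        return [b""]
    record_id = path[len(PREFIX):] if path.startswith(PREFIX) else None
    if record_id in RECORDS:
        body = json.dumps(RECORDS[record_id]).encode("utf-8")
        start_response("200 OK", [("Content-Type", "application/json"),
                                  ("Content-Length", str(len(body)))])
        return [body]
    start_response("404 Not Found", [("Content-Type", "text/plain")])
    return [b"not found"]


def fetch(path):
    """Call the app directly, without a network, and return (status, body)."""
    environ = {}
    setup_testing_defaults(environ)  # fills in a default GET request
    environ["PATH_INFO"] = path
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    body = b"".join(app(environ, start_response))
    return captured["status"], body
```

Calling fetch("/records/rec-001") returns the record as JSON with a 200 status, while an unknown path returns 404. The same app object could be served by any WSGI server.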

3. Discovery is distributed

The Web is a network - actually a directed graph - of resources, deployed on The Internet, which is a global network of devices (typically computers). Sandwiched between these two in the logical hierarchy is a network of systems - primarily clustered around the ubiquitous Web server and Web browser components. Often we refer to The Web to mean all three of these layers taken together. The so-called social network is the graph of relationships between people interacting with resources and, indirectly, each other, mediated by The Web and other information systems deployed on The Internet. Reed’s Law tells us that the potential value of such networks is considerable, and as the current phase in the development of these networks matures, there is plenty of emerging evidence of real value.
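
To give a sense of scale for Reed’s Law: it holds that the value of a group-forming network grows with the number of possible subgroups of its N participants, which is 2^N - N - 1 (all subsets, minus the empty set and the N single-member sets). The short sketch below compares that count with the number of pairwise connections, N(N-1)/2. The function names are illustrative only.

```python
def pairwise_connections(n):
    """Number of distinct pairs among n participants."""
    return n * (n - 1) // 2


def possible_subgroups(n):
    """Number of subgroups with at least two members (the Reed's Law count)."""
    return 2 ** n - n - 1


for n in (5, 10, 20):
    print(n, pairwise_connections(n), possible_subgroups(n))
```

For 10 participants there are 45 possible pairs but 1013 possible subgroups, and the gap widens exponentially as N grows.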

To be ready to operate in a network-friendly manner is an important principle for Discovery. By nature, Discovery is concerned with a plethora of information resources and services from a wide variety of sources and is prepared, where appropriate, to deal with these in situ.

Although The Web has some part to play in most information systems in the Discovery ecosystem already, the client-server paradigm is still strongly represented. With a resource-oriented approach, the predictable interaction implied by the client-server model is challenged. The Web is starting to be realised as a network where nodes are both client and server - functioning in potentially many different interactions with other nodes.

This allows for, and even encourages, the possibility that systems operating in the Discovery ecosystem can be both providers of information resources and services at the same time that they consume and use other, remote resources and services.

The idea of the Application Programming Interface (API), and principles of modular systems design, are important concepts for Discovery.

4. Discovery relies on persistent global identifiers

The resource-oriented architecture encourages the identification of information entities. In the Discovery ecosystem, such entities are typically metadata records, although there is growing interest in experimenting with a finer granularity of metadata in a Linked Data context. In any information system, such entities are uniquely identified. As Discovery deals with open data, such identifiers must be globally unique for the distribution of resources and services to work. The default global identifier scheme for The Web is the HTTP URI; however, there are other important schemes in use in the Discovery ecosystem.

In addition to this, a commitment to the persistence of these identifiers is an important principle of Discovery. Careful planning for and design of URIs and other identifiers is essential. While ‘persistent’ does not necessarily mean ‘permanent’, it certainly does imply ‘carefully managed’.

Good practice in the design and management of persistent global identifiers is fundamental to the success of Discovery if it is to realise the value of its growing networks of information.
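
One way to picture ‘carefully managed’ identifiers is sketched below. This is an illustration under stated assumptions, not a Discovery specification: the base URI, the registry and its entries are all hypothetical. The identifier is minted from stable parts only (no file extensions or software-specific query strings), and a small registry maps it to wherever the record currently lives, so the location can change without the identifier changing. A withdrawn identifier is answered with 410 Gone rather than silently disappearing.

```python
from urllib.parse import quote

BASE_URI = "http://example.org/id"  # hypothetical identifier namespace

# Hypothetical registry: persistent identifier -> managed state.
REGISTRY = {
    "http://example.org/id/maps/0042": {
        "status": "active",
        "location": "http://example.org/catalogue/view?item=42",
    },
    "http://example.org/id/maps/0007": {
        "status": "withdrawn",
        "location": None,
    },
}


def mint_uri(collection, local_id):
    """Build an identifier from stable parts, percent-encoding each segment."""
    return "/".join([BASE_URI, quote(collection, safe=""), quote(local_id, safe="")])


def resolve(uri):
    """Return (HTTP status code, current location) for an identifier."""
    entry = REGISTRY.get(uri)
    if entry is None:
        return 404, None   # never minted
    if entry["status"] == "withdrawn":
        return 410, None   # once existed, deliberately retired
    return 303, entry["location"]  # redirect to the current location
```

If the catalogue software is replaced, only the location values in the registry need updating; every published identifier keeps working.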

5. Discovery is built on aggregations of metadata

Metadata aggregation is a foundational aspect of the Discovery vision. This might seem somewhat in opposition to the principle that Discovery is distributed: however, The Web is sufficiently unrestrictive that it allows both distribution and aggregation as useful strategies in certain contexts. Dempsey uses the terms diffusion and concentration to describe these two approaches and indicates how they are complementary.

Aggregation of information resources is made possible by a combination of open data, widely used and understood information standards, and networked systems. It is apparent that in various information domains the aggregation of metadata into information ‘hubs’, or large nodes in the network, is a powerful approach to increasing discoverability. Aggregation raises visibility - itself a simple path towards greater discoverability of resources.

Discovery anticipates the emergence of a hybrid of distributed information resources described in aggregations of metadata resources, with many relationships and interactions between these.
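
As a rough illustration of the aggregation idea, the sketch below merges metadata records from several providers into a single ‘hub’, keyed by a shared global identifier. The record shape (an "id" and a "title"), the provider names and the example data are assumptions made for this example only.

```python
def aggregate(sources):
    """Merge records from several providers, keyed by global identifier.

    `sources` maps a provider name to a list of records; each record is a
    dict with an "id" (a global identifier) and a "title".
    """
    hub = {}
    for provider, records in sources.items():
        for record in records:
            entry = hub.setdefault(
                record["id"], {"id": record["id"], "providers": [], "titles": []}
            )
            entry["providers"].append(provider)
            # Keep each distinct title once, in order of first appearance.
            if record["title"] not in entry["titles"]:
                entry["titles"].append(record["title"])
    return hub


# Hypothetical example data from two providers describing an overlapping item.
sources = {
    "library-a": [{"id": "http://example.org/id/book/1", "title": "A Survey"}],
    "archive-b": [
        {"id": "http://example.org/id/book/1", "title": "A survey (2nd copy)"},
        {"id": "http://example.org/id/map/9", "title": "County map"},
    ],
}
hub = aggregate(sources)
```

The merge only works because both providers use the same global identifier for the shared item, which is why the identifier principle and the aggregation principle support each other.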

6. Discovery works well with global search engines

Search Engine Optimisation (SEO) is the process of exploiting an understanding of the functions and algorithms of the major global search engines. With such an understanding, Web content providers can present web resources in such a way that they gain the optimum ranking in the indexes created by those search engines. SEO is a fully developed industry in the commercial sector, but many of its principles and techniques are well known and applicable to the Discovery ecosystem.

While global search engines are not the only route to resource discovery on The Web, they are, in many contexts, the most important. Discovery will aid providers of publicly-funded information resources from the education, research and cultural heritage sectors to develop their own SEO strategies.

7. Discovery data is explicitly licensed for (re)use

Data is made available within the Discovery ecosystem in the expectation that it will be used and, potentially, combined with other data to be re-used in new ways. Such re-use of data is controlled through licensing. It is therefore crucial that data is explicitly licensed, and that the license declaration is itself a resource available on the network.

The use of an open data license - such as CC0 - is welcomed by Discovery, as this significantly reduces the complexity involved in its re-use.
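
A small sketch of what ‘explicitly licensed’ could mean at the level of a single record, assuming records are simple key/value structures and that the licence is stated with the Dublin Core dcterms:license property (both are assumptions for illustration). The licence is given as a URI, so the declaration is itself a resource on the network; the CC0 URI shown is the published address of the CC0 1.0 dedication.

```python
CC0_URI = "https://creativecommons.org/publicdomain/zero/1.0/"


def is_explicitly_licensed(record):
    """True if the record names its licence as an HTTP(S) URI."""
    licence = record.get("dcterms:license", "")
    return licence.startswith(("http://", "https://"))


# Hypothetical records: one with a licence URI, one with a vague statement,
# one with no licence information at all.
licensed = {"dcterms:title": "County map", "dcterms:license": CC0_URI}
vague = {"dcterms:title": "County map", "dcterms:license": "free to use"}
missing = {"dcterms:title": "County map"}
```

Only the first record passes the check: a free-text statement such as "free to use" cannot be followed or interpreted by software, and a missing statement leaves re-use rights unknown.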

Key links:

Discovery website

http://discovery.ac.uk/

Attachments:

rdtftechguidelines.pdf

rdtftechguidelines.epub

rdtftechguidelines.doc
