
Exchanging Scientific Metadata with Dublin Core

Scientific data constitute a vital evidence base for validating the conclusions of research. Without a supporting set of information and metadata to provide context, though, data can easily lose all meaning. While collecting and storing this metadata in an ad hoc manner is possible, it is not a robust solution: it is all too easy to miss important information, and hard for researchers other than the author to navigate the collection. Standard metadata schemes have therefore evolved that formalize both the information that should be collected and (usually) how it should be formatted. In designing these schemes there is always a compromise to be made between efficiently supporting specific scientific methodologies and making the information portable to a wide range of applications. The two most common approaches are either to create an original scheme and provide mappings to more general (or more widely used) schemes such as Dublin Core, or to construct a specialist scheme as a profile of one or more existing schemes.


Author: Alex Ball
Last updated on 2 May 2013 - 3:26pm

The metadata schemes used to describe scientific data, and indeed research data generally, vary widely according to several inter-related factors. Perhaps of prime importance is the discipline: different areas of research have different measures of quality for data, and therefore place different requirements on the supporting information and metadata. Also significant is the application: the metadata needed to make a certain type of experiment repeatable will clearly depend on the experiment, and will be a quite different set to that needed to operate a data discovery service. To a certain extent, the type of data being described will influence the metadata provided, simply because different types exhibit different properties.

With all these possibilities for variation, there is a tendency for scientific metadata to be highly specialized. For example, the MIBBI (Minimum Information for Biological and Biomedical Investigations) Project lists over thirty metadata schemes, each setting out the minimum information needed to reproduce a different type of experiment. One of these, MIAME (Minimum Information About a Microarray Experiment), has itself been specialized to cover four sub-types: environmental transcriptomics, nutrigenomics, plant transcriptomics and toxicogenomics.

While this specialization is efficient and for the most part intuitive for the scientist working with the data, it has its disadvantages. Most notably, it makes it harder to integrate data from related areas into a single data catalogue and, scaling up, forms an obstacle to interdisciplinary research. There are at least three different approaches that can be used to overcome this difficulty.

It is sometimes possible to sidestep the issue using full-text indexing and text mining techniques on the metadata. For example, in the field of bioinformatics, data models and metadata schemes are evolving rapidly, so any mappings between schemes become obsolete very quickly. Because of this, the European Bioinformatics Institute does not rely on such mappings to provide a cross-search facility for its wide collection of datasets and databases. Instead, it uses a Lucene-based search engine to enable users to search across the full text of the metadata. The problem with this approach is that it relies upon text mining techniques to infer the semantics of the metadata, instead of using those inherent in the scheme.
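As a rough illustration of the idea (a toy sketch, not the European Bioinformatics Institute's actual Lucene configuration), the following Python snippet flattens records from two different metadata schemes into plain text and builds a simple inverted index over them; the record identifiers and field names are invented for the example.

```python
from collections import defaultdict
import re

# Records from two hypothetical metadata schemes. Cross-searching works
# because we index the flattened text, not the scheme-specific fields.
records = {
    "dataset-1": {"dc:title": "Microarray study of yeast",
                  "dc:creator": "A. Researcher"},
    "dataset-2": {"gcmd:Entry_Title": "Sea surface temperature, 2010",
                  "gcmd:Originating_Center": "NASA GCMD"},
}

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

# Inverted index: term -> set of record identifiers containing it.
index = defaultdict(set)
for rid, metadata in records.items():
    for value in metadata.values():
        for term in tokenize(value):
            index[term].add(rid)

def search(query):
    """Return the records containing every term in the query."""
    term_sets = [index.get(term, set()) for term in tokenize(query)]
    return set.intersection(*term_sets) if term_sets else set()

print(search("yeast microarray"))  # {'dataset-1'}
print(search("temperature"))       # {'dataset-2'}
```

A production search engine such as Lucene adds analysis, ranking and scale on top of this basic structure, but the key point is the same: queries match the text of the metadata, with no mapping between schemes required.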

Where metadata schemes are relatively stable, a more common approach is to provide mappings or crosswalks from specialist schemes to more general and widely adopted schemes. For example, the NERC DataGrid uses a metadata scheme called MOLES (Metadata Objects for Linking Environmental Sciences); the maintainers of the scheme provide mappings both to the Directory Interchange Format and to ISO 19115, the international standard for geographic metadata. This approach is most often used where the application using the metadata, or the data type being described, has unique requirements not adequately met by other schemes. Performing the crosswalk almost certainly involves the loss or corruption of some metadata semantics, and whenever a new version of any of the schemes involved is released, the mappings need to be reviewed.
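A minimal sketch of how such a crosswalk behaves is given below; the specialist element names are invented for illustration and are not the real MOLES or Directory Interchange Format mappings. Note how elements with no Dublin Core equivalent simply drop out.

```python
# Hypothetical crosswalk from an invented specialist scheme to simple
# Dublin Core; coarser target elements lose some semantics on the way.
CROSSWALK = {
    "Entry_Title":       "dc:title",
    "Data_Creator":      "dc:creator",
    "Summary":           "dc:description",
    "Temporal_Coverage": "dc:coverage",   # finer temporal semantics lost
}

def to_dublin_core(record):
    """Map a specialist record to Dublin Core, reporting what is lost."""
    dc, unmapped = {}, {}
    for element, value in record.items():
        target = CROSSWALK.get(element)
        if target:
            dc.setdefault(target, []).append(value)
        else:
            unmapped[element] = value  # no equivalent: loss occurs here
    return dc, unmapped

record = {"Entry_Title": "Soil moisture grid",
          "Data_Creator": "B. Scientist",
          "Sensor_Calibration": "v2 lab protocol"}
dc, lost = to_dublin_core(record)
print(dc)    # the mapped Dublin Core elements
print(lost)  # {'Sensor_Calibration': ...} -- dropped by the crosswalk
```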

An alternative approach is to form the specialist scheme as a profile of one or more existing schemes. In other words, the local scheme ‘borrows’ metadata elements from existing schemes, adding conventions or semantics as necessary. For example, EDMED version 1 (used by the European Directory of Marine Environmental Datasets) and ANZLIC (the Australia and New Zealand Land Information Council Metadata Profile) are both profiles of ISO 19115. UK AGMAP (the UK Academic Geospatial Metadata Application Profile) is both a profile of ISO 19115 and a superset of another profile, UK GEMINI (the UK Geospatial Metadata Interoperability Initiative Standard). The metadata scheme used by the British Geological Survey’s National Geoscience Data Centre is for the most part a profile of ISO 19115, but also contains additional elements intended to allow lossless conversion of metadata written according to the National Geospatial Data Framework’s Discovery Metadata Guidelines, as well as elements supporting three-dimensional models.

The advantage of the profiling approach is that applications that understand the general scheme from which elements are drawn will, with no further effort, be able to understand the corresponding parts of the specialist profile. This approach also requires less maintenance effort, as for each element in the profile, only one relationship with an external scheme needs to be monitored, instead of (potentially) many.
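The following sketch shows the profiling idea in miniature, using the Python rdflib library: the general-purpose statements borrow elements from DCMI Metadata Terms, while one discipline-specific element comes from a hypothetical local namespace (ex: is invented for the example).

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCTERMS

# Hypothetical local namespace supplying the discipline-specific element.
EX = Namespace("http://example.org/schema/")

g = Graph()
g.bind("dcterms", DCTERMS)
g.bind("ex", EX)

dataset = URIRef("http://example.org/data/42")
# General elements borrowed from DCMI Metadata Terms...
g.add((dataset, DCTERMS.title, Literal("Plant transcriptome counts")))
g.add((dataset, DCTERMS.creator, Literal("C. Researcher")))
# ...plus an element defined only in the local scheme.
g.add((dataset, EX.growthChamberTemperature, Literal("22 C")))

print(g.serialize(format="turtle"))
```

An application that understands only DCMI Metadata Terms can process the dcterms statements in this record without knowing anything about the ex: namespace.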

As Dublin Core metadata enjoys wide support across many applications, it is a popular choice both as a source of elements for use in local profiles, and as a target for metadata crosswalks and mappings.

Dublin Core is the primary metadata scheme used by a number of data repositories and services. It also contributes significantly to the Dryad Metadata Application Profile; the other schemes profiled by Dryad include the Darwin Core standard and the Bibliographic Ontology. Dublin Core further influenced the specification for metadata collected during the assessment stage of a Data Asset Framework audit.[1]

Even where Dublin Core is not integrated directly into metadata schemes, it is still recognised as an important common language for metadata. For example, mappings to Dublin Core are known to exist for the following metadata schemes:

  • the Directory Interchange Format maintained by the Committee on Earth Observation Satellites International Directory Network and used as the native metadata scheme for the Global Change Master Directory, hosted by NASA;
  • MOLES, the metadata scheme maintained by the Centre for Environmental Data Archival at STFC, and used by the NERC DataGrid;
  • UK AGMAP, the application profile used and maintained by the GoGeo service at EDINA;
  • the DataCite Metadata Schema for the publication and citation of research data.

Dublin Core is designed to be useful as a way of describing as wide a range of resources as possible. As such it is well suited as a fallback format for exchanging scientific metadata across disciplines, and for incorporating data into systems and archives that are not explicitly set up to deal with them. On the other hand, being a general scheme, it necessarily misses some of the less common metadata concepts a specialist scheme might use, and captures others with less detailed semantics.

Metadata profiles go some way to resolving these kinds of issues, particularly when the metadata are expressed as Linked Data. The profiling concept means that Dublin Core need not be the only format used for exchange purposes, as elements from other schemes can be used as needed. Linked Data mechanisms (for declaring one element to be a special case of another) allow applications to take advantage of detailed semantics where they are understood, and fall back to general semantics where they are not.
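A sketch of that fallback mechanism, again in Python with rdflib and the same invented ex: namespace: the local scheme declares its element to be an rdfs:subPropertyOf of dcterms:creator, so an application that knows only the Dublin Core term can still recover the value.

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCTERMS, RDFS

EX = Namespace("http://example.org/schema/")  # hypothetical local scheme
g = Graph()

# The local scheme declares its element as a special case of the
# general Dublin Core term.
g.add((EX.principalInvestigator, RDFS.subPropertyOf, DCTERMS.creator))

dataset = URIRef("http://example.org/data/42")
g.add((dataset, EX.principalInvestigator, Literal("D. Scientist")))

def values_for(graph, subject, general_property):
    """Values of a property, including declared specializations
    (one level of rdfs:subPropertyOf only, for simplicity)."""
    props = {general_property}
    props |= set(graph.subjects(RDFS.subPropertyOf, general_property))
    return [o for p in props for o in graph.objects(subject, p)]

# Falls back from dcterms:creator to the specialized local element.
print(values_for(g, dataset, DCTERMS.creator))  # [Literal('D. Scientist')]
```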

When producing profiles in this way, it is important to understand the domain and range of the metadata elements being used or specialized. The domain of an element refers to the things that can have the property represented by the element; the range refers to the values the element can take. The properties in the Dublin Core Metadata Element Set do not have any restrictions on their domain or range, but those in the set of DCMI Metadata Terms do. Any specializations of the latter properties should therefore have the same or narrower restrictions; if this is not possible, perhaps because it would cause a clash with the data model of another scheme in use, an equivalent property from the former set (or another scheme) should be used as the basis instead.
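For instance, dcterms:creator is declared with the range dcterms:Agent, so a specialization of it should keep that range or narrow it. The sketch below makes the point with a hypothetical ex:hostLaboratory property whose range is an invented subclass of dcterms:Agent.

```python
from rdflib import Graph, Namespace
from rdflib.namespace import DCTERMS, RDFS

EX = Namespace("http://example.org/schema/")  # hypothetical local scheme
g = Graph()
g.bind("dcterms", DCTERMS)
g.bind("ex", EX)

# dcterms:creator has the range dcterms:Agent, so a specialization
# should keep or narrow that range; ex:Laboratory is an invented
# subclass of dcterms:Agent used as the narrower range here.
g.add((EX.Laboratory, RDFS.subClassOf, DCTERMS.Agent))
g.add((EX.hostLaboratory, RDFS.subPropertyOf, DCTERMS.creator))
g.add((EX.hostLaboratory, RDFS.range, EX.Laboratory))

print(g.serialize(format="turtle"))
```

Had a broader range been needed (say, plain text names), basing the property on dc:creator from the unconstrained Element Set instead would avoid a clash with the dcterms data model.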

It should be noted that while the technology for handling Linked Data is becoming more commonly available, it is by no means ubiquitous among scientific tools, applications and data repositories. There is something of a chicken-and-egg situation where, until there is a critical mass of scientific metadata in a Linked Data format, wide support for it will not be forthcoming, and the full benefit of using it will not be realized. Even so, expressing metadata in this way is not only good preparation for future applications but also assists with more manual migration of metadata between current systems.


