The metadata schemes used to describe scientific data, and indeed research data generally, vary widely according to several inter-related factors. Perhaps of prime importance is the discipline: different areas of research have different measures of quality for data, and therefore place different requirements on the supporting information and metadata. Also significant is the application: the metadata needed to make a certain type of experiment repeatable will clearly depend on the experiment, and will be quite a different set from that needed to operate a data discovery service. To a certain extent, the type of data being described will also influence the metadata provided, simply because different types exhibit different properties.
With all these possibilities for variation, there is a tendency for scientific metadata to be highly specialized. For example, the MIBBI (Minimum Information for Biological and Biomedical Investigations) Project lists over thirty metadata schemes, each setting out the minimum information needed to reproduce a different type of experiment. One of these, MIAME (Minimum Information About a Microarray Experiment), has itself been specialized to cover four sub-types: environmental transcriptomics, nutrigenomics, plant transcriptomics and toxicogenomics.
While this specialization is efficient and for the most part intuitive for the scientist working with the data, it has its disadvantages. Most notably, it makes it harder to integrate data from related areas into a single data catalogue and, scaling up, forms an obstacle to interdisciplinary research. There are at least three different approaches that can be used to overcome this difficulty.
It is sometimes possible to sidestep the issue by using full-text indexing and text mining techniques on the metadata. For example, in the field of bioinformatics, data models and metadata schemes evolve rapidly, so any mappings between schemes become obsolete very quickly. Because of this, the European Bioinformatics Institute does not rely on such mappings to provide a cross-search facility for its extensive collection of datasets and databases. Instead, it uses a Lucene-based search engine to let users search across the full text of the metadata. The drawback of this approach is that it relies on text mining techniques to infer the semantics of the metadata, rather than using the semantics inherent in the scheme.
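To illustrate the principle, the following sketch uses Apache Lucene directly. It is not the EBI's actual implementation, and the record identifiers and content are invented; it simply shows the general technique: metadata records from different schemes are flattened to plain text, indexed, and queried with a single free-text search, with no mapping between schemes.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class MetadataFullTextSearch {
    public static void main(String[] args) throws Exception {
        Directory dir = new ByteBuffersDirectory(); // in-memory index, for illustration only
        StandardAnalyzer analyzer = new StandardAnalyzer();

        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
            // Records from different schemes are flattened to plain text before
            // indexing; the schemes' structure and semantics are not preserved.
            addRecord(writer, "rec-001",
                "MIAME record: microarray experiment, Arabidopsis thaliana, cold stress");
            addRecord(writer, "rec-002",
                "ISO 19115 record: sea surface temperature grid, North Atlantic, 2003");
        }

        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            // One free-text query searches across all records, whatever their scheme.
            Query query = new QueryParser("fulltext", analyzer).parse("arabidopsis");
            for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
                System.out.println(searcher.doc(hit.doc).get("id")); // Lucene 8/9 API
            }
        }
    }

    private static void addRecord(IndexWriter writer, String id, String text)
            throws Exception {
        Document doc = new Document();
        doc.add(new StringField("id", id, Field.Store.YES));      // stored, not tokenized
        doc.add(new TextField("fulltext", text, Field.Store.NO)); // tokenized for search
        writer.addDocument(doc);
    }
}
```

The trade-off is visible in the index itself: once flattened, nothing distinguishes a species name from a place name or an author, which is precisely the loss of scheme semantics noted above.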
Where metadata schemes are relatively stable, a more common approach is to provide mappings or crosswalks from specialist schemes to more general and widely adopted schemes. For example, the NERC DataGrid uses a metadata scheme called MOLES (Metadata Objects for Linking Environmental Sciences); the maintainers of the scheme provide mappings to both the Directory Interchange Format and ISO 19115, the international standard for geographic metadata. This approach is most often used where the application using the metadata, or the type of data being described, has unique requirements not adequately met by other schemes. Performing a crosswalk almost certainly involves the loss or corruption of some metadata semantics, and whenever a new version of any of the schemes involved is released, the mappings need to be reviewed.
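At its simplest, a crosswalk is a mapping table from the elements of one scheme to those of another, applied record by record. The toy sketch below makes both the mechanism and the characteristic loss concrete; the source element names are invented stand-ins for MOLES-style elements (the real mapping, maintained by the scheme's owners, is far richer), while the targets are simplified Directory Interchange Format elements.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Toy crosswalk from a specialist scheme to a more general one. Source
 * element names are hypothetical; targets are simplified DIF elements.
 */
public class CrosswalkSketch {
    // Specialist element -> general element. Anything absent from this table
    // has no counterpart in the target scheme: this is where semantics are lost.
    private static final Map<String, String> SPECIALIST_TO_DIF = Map.of(
        "dataEntity.title", "Entry_Title",
        "dataEntity.abstract", "Summary",
        "observationStation.location", "Spatial_Coverage"
        // e.g. a "deployment.activity" element with no DIF counterpart
        // would simply never survive the crosswalk
    );

    public static Map<String, String> crosswalk(Map<String, String> record) {
        Map<String, String> out = new LinkedHashMap<>();
        record.forEach((element, value) -> {
            String target = SPECIALIST_TO_DIF.get(element);
            if (target != null) {
                out.put(target, value); // unmapped elements are silently discarded
            }
        });
        return out;
    }
}
```

A new release of either scheme can invalidate any row of such a table, which is why the mappings must be reviewed with every version change.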
An alternative approach is to form the specialist scheme as a profile of one or more existing schemes. In other words, the local scheme ‘borrows’ metadata elements from existing schemes, adding further conventions or semantics as necessary. For example, EDMED version 1 (used by the European Directory of Marine Environmental Data) and ANZLIC (the Australia and New Zealand Land Information Council Metadata Profile) are both profiles of ISO 19115. UK AGMAP (the UK Academic Geospatial Metadata Application Profile) is both a profile of ISO 19115 and a superset of another profile, UK GEMINI (the UK Geospatial Metadata Interoperability Initiative Standard). The metadata scheme used by the British Geological Survey’s National Geoscience Data Centre is for the most part a profile of ISO 19115, but it also contains additional elements intended to allow lossless conversion of metadata written according to the National Geospatial Data Framework’s Discovery Metadata Guidelines, as well as elements supporting three-dimensional models.
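The structure of a profile can be sketched as follows. In this hypothetical example, most elements are borrowed from ISO 19115 and two are local additions; the element names are illustrative and not drawn from any published profile. A generic ISO 19115-aware consumer can process the borrowed elements and simply ignore the rest, with no crosswalk required.

```java
import java.util.List;

/**
 * Sketch of an application profile. Most elements are borrowed from a
 * general scheme (labelled "iso19115" here); two are local additions.
 * The element names are illustrative, not taken from any published profile.
 */
public class ProfileSketch {
    record Element(String name, String sourceScheme) {}

    static final List<Element> PROFILE = List.of(
        new Element("title",            "iso19115"), // borrowed elements: any
        new Element("abstract",         "iso19115"), // ISO 19115-aware consumer
        new Element("geographicExtent", "iso19115"), // already understands these
        new Element("boreholeDepth",    "local"),    // local additions, e.g. to
        new Element("modelFormat",      "local")     // support 3D models
    );

    public static void main(String[] args) {
        // A generic consumer processes the elements it recognises and
        // ignores the local additions.
        PROFILE.stream()
               .filter(e -> e.sourceScheme().equals("iso19115"))
               .forEach(e -> System.out.println("understood: " + e.name()));
    }
}
```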
The advantage of the profiling approach is that applications that understand the general scheme from which elements are drawn will, with no further effort, be able to understand the corresponding parts of the specialist profile. This approach also requires less maintenance effort, as for each element in the profile, only one relationship with an external scheme needs to be monitored, instead of (potentially) many.