In the previous post I gave an overview of the JiscEXPO project outputs available so far, and hinted at ones coming soon. In this post I focus more on the themes and issues that are starting to appear. It can be quite difficult to distill these out of the information available, but I have been able to see a few patterns emerging, even though it is still relatively early days.
Archives Hub Record for Sir Ernest Shackleton
Given that linked data is, of course, about data, a number of issues have been appearing around this subject. Linked data will generally require some data modeling and, as the Locah project reports, this may mean having to change your data model mindset:
“it took me quite a while to get away from the idea of modelling the EAD record, rather than the actual data.”
“I found actually getting a ‘starting point’ a bit difficult. I think this is because everything can be a starting point”
There can also be inherent complexities in the existing data that can make the modeling difficult:
“perhaps one of the thorniest [questions] is that arising from one of the fundamental characteristics of the nature of archival description [which is] typically based on a “hierarchical”, “multi-level” approach”
“One consequence of the multi-level approach in archival description practice is a strong sense of the importance of “context” … the descriptions of the “lower level” units should be read and interpreted in the context of the higher levels of description”
“So, there is arguably a (perhaps unavoidable) element of tension between the strongly “contextual” emphasis of EAD and ISAD(G) and the “bounded descriptions” of “Linked Data”.”
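To make that tension a little more concrete, here is a minimal Python sketch. It is an illustration only, not Locah's data model: the `Unit` class, the `title` and `partOf` property names, and the example.org URIs are all invented placeholders. It contrasts a "bounded description" of a lower-level unit, which says only what is directly stated about that unit, with the chain of higher-level units a reader would need to follow to recover its context.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Unit:
    """One level of a multi-level archival description (hypothetical model)."""
    uri: str
    title: str
    parent: Optional["Unit"] = None


def bounded_description(unit: Unit) -> List[Tuple[str, str, str]]:
    """Triples stated directly about this unit, plus a link to its parent.

    'title' and 'partOf' are placeholder property names, not a real vocabulary.
    """
    triples = [(unit.uri, "title", unit.title)]
    if unit.parent is not None:
        triples.append((unit.uri, "partOf", unit.parent.uri))
    return triples


def context_chain(unit: Unit) -> List[Unit]:
    """Walk up the parent links and return the units from top level down."""
    chain = []
    current: Optional[Unit] = unit
    while current is not None:
        chain.append(current)
        current = current.parent
    chain.reverse()
    return chain


# Invented sample hierarchy: collection -> series -> item.
collection = Unit("http://example.org/unit/1", "Papers of an explorer")
series = Unit("http://example.org/unit/1/2", "Expedition correspondence", collection)
item = Unit("http://example.org/unit/1/2/3", "Letter to a sponsor", series)
```

In this sketch, `bounded_description(item)` returns only two triples, while `context_chain(item)` shows that the item's full context spans three separate resources. A consumer of the data has to follow the `partOf` links to get what an archivist would read as a single description.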
The process of transforming and exposing linked data can also highlight ‘dirty’ data and issues around disambiguation. The MusicNet project mentions problems arising from different naming conventions and input errors when looking for records that represent the same musical composer in multiple data sets. They’ve been experimenting with a data alignment tool they developed to help solve these issues, and have put together a YouTube video demo of it.
Locah have also been finding numerous examples of inconsistencies, for example where the ‘creator’ is recorded as ‘Joe Bloggs and others’ rather than just a name.
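As a rough illustration of the kind of clean-up involved, the Python sketch below normalises name strings into a crude comparison key and separates a trailing "and others" from a creator field. It is not the MusicNet alignment tool or Locah's code; the function names and the matching rules are assumptions made for this example, and real data would need far more care.

```python
import re
import unicodedata
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def normalise_name(name: str) -> str:
    """Reduce a personal name to a crude key for comparing variants."""
    # Decompose accented characters and drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    plain = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Turn "Surname, Forenames" into "Forenames Surname".
    if "," in plain:
        surname, _, forenames = plain.partition(",")
        plain = f"{forenames} {surname}"
    # Lowercase, replace anything that is not a letter with a space,
    # then collapse runs of whitespace.
    cleaned = re.sub(r"[^a-z\s]", " ", plain.lower())
    return " ".join(cleaned.split())


def split_creator(creator: str) -> Tuple[str, bool]:
    """Return (name, has_others) for values like 'Joe Bloggs and others'."""
    match = re.match(r"^(.*?)\s+and others$", creator.strip(), flags=re.IGNORECASE)
    if match:
        return match.group(1), True
    return creator.strip(), False


def group_candidates(names: Iterable[str]) -> Dict[str, List[str]]:
    """Group name strings that share the same normalised key."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for name in names:
        groups[normalise_name(name)].append(name)
    return dict(groups)
```

Names that fall into the same group are only candidates for being the same person and would still need human review. The sketch also misses many real variants: initials versus full forenames, for instance, produce different keys.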
Linkbrainz have noted some scalability challenges that can arise with some linked data:
“[the] problem becomes acute for classical composers like Bach who are credited with tens of thousands recordings … the complete RDF resource description for Bach would be immense. This would cause an unacceptable load on the database server and long wait times for dereferenced URIs.”
They suggest a solution that reuses the pagination of the HTML pages for the RDF or RDFa, but note that this is not ideal from a modeling point of view. They also mention that including RDFa in the MusicBrainz HTML pages can increase the page size by somewhere between 5% and 30%.
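The general idea of paginating a large resource description can be sketched as follows. This is a generic Python illustration rather than the Linkbrainz implementation: triples are plain tuples instead of objects from an RDF library, and the `?page=` URI pattern, the `creditedOn` property, and the example.org URIs are all assumptions.

```python
from typing import Dict, List, Optional, Tuple

Triple = Tuple[str, str, str]


def paginate_description(
    base_uri: str,
    triples: List[Triple],
    page: int,
    page_size: int = 100,
) -> Dict[str, object]:
    """Return one page of a resource description with links to its neighbours."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive")
    # Ceiling division; an empty description still has one (empty) page.
    total_pages = max(1, (len(triples) + page_size - 1) // page_size)
    if page > total_pages:
        raise ValueError(f"page {page} is beyond the last page ({total_pages})")

    def page_uri(n: int) -> str:
        # Hypothetical URI pattern for a page of the description.
        return f"{base_uri}?page={n}"

    start = (page - 1) * page_size
    prev_uri: Optional[str] = page_uri(page - 1) if page > 1 else None
    next_uri: Optional[str] = page_uri(page + 1) if page < total_pages else None
    return {
        "uri": page_uri(page),
        "triples": triples[start:start + page_size],
        "prev": prev_uri,
        "next": next_uri,
        "total_pages": total_pages,
    }


# Invented sample data: one composer credited on 250 recordings.
bach = "http://example.org/artist/bach"
sample = [(bach, "creditedOn", f"http://example.org/recording/{i}") for i in range(250)]
first_page = paginate_description(bach, sample, page=1)
```

Each request then returns a bounded number of triples, which keeps the load on the server predictable. The cost is that no single response describes the resource completely, which may be part of why the approach sits awkwardly with the data model.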
Linkbrainz have also had to contend with some licensing issues:
“… some content in the MusicBrainz database is licensed as by-nc-sa… JISC considers this license incompatible with completely open data. Therefore, this small subset of the MusicBrainz database will likely be omitted from our translation moving forward.”
However, the JiscOpenBib project appears to have sorted out data licensing without too many problems, having recently announced that the British Library is providing bibliographic data under a CC0 Public Domain Dedication licence.
MusicNet have drawn attention to the question of how to sustain the data from the JiscEXPO projects, and from the HE sector in general, in the longer term. They suggest that we need provision for UK academic data to be hosted on the JANET network under a suitable .ac.uk domain. A hosted data.ac.uk is proposed, possibly JISC funded, to lower the technical and financial barriers to publishing RDF. One suggestion is that this could be possible via the data.gov.uk education datastore.
Locah believe there is a significant skills and training gap in the linked data area, noting a lack of domain-specific examples and a lack of helpful information about how to create a data model. They suggest that, at the moment, a certain level of expertise is needed to model data and output RDF, and that efforts to address this and make it easier would help the take-up of linked data. They do, however, note that adopting the linked data approach is already paying dividends by making development more user-focused:
“the very big plus with this different kind of thinking is that by definition it puts what the user is interested in at the forefront of your thinking”
So we can see the projects are meeting a range of challenges in exposing their linked data. It’s worth noting that many of the difficulties are not unique to outputting linked data; in many cases the projects are simply ‘exposing’ existing problems that have thus far remained hidden behind data silos. It’s good to hear about the positive effects the linked data approach can have in helping to steer development in a more user-focused direction. It will be interesting to see how the projects get on and what further themes arise when the demonstration prototypes start to appear this year.