Broadening support for linked data in MARC

Posted on Fri 24 January 2014 in Libraries

The following is an email that I sent to the MARC mailing list on January 24, 2014 that might be of interest to those looking to provide better support for linked data in MARC (hopefully as just a transitional step):


In the spirit of making it possible to express linked data in MARC for any data field, would it be worthwhile exploring the possibility of defining subfield $0 as valid for all data fields, and then relaxing the definition such that in the absence of a specific MARC Organization Code or Source Identifier code, it would be understood to be the default of Source Identifier "(uri)" (that is, a URI)?

Right now, for fields that can be controlled by authority records, a system has to do one of two things. It can map the MARC Organization Code or Source Identifier code to some URI (if subfield $0 directly identifies the source of the authority record or source identifier). Or, as in many systems, it can look up the local authority record that controls the field and then look up the source of that authority record, again using localized logic for the MARC Organization Code / Source Identifier code.

The current limitations are that:

  1. subfield $0 can currently be applied only to a handful of fields
  2. systems have to parse out the code and number and reassemble them to (potentially) find a linked source on the other side. For example, I could teach Evergreen or VuFind or Koha (all systems to which I've contributed varying amounts of code) to take any $0 that starts with "(LoC)n" and know that it needs to map that to "http://id.loc.gov/authorities/names/n + number", as well as mapping "(LoC)sh" to "http://id.loc.gov/authorities/subjects/sh + number", but that's laborious and repetitive and likely to get out of sync between systems rather rapidly.
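To make the second limitation concrete, here is a rough Python sketch of the parse-and-reassemble logic that each system would have to carry. The prefix table holds only the two mappings mentioned above; the function name, the regular expression, and the sample control numbers are illustrative assumptions, not code from Evergreen, VuFind, or Koha.

```python
import re

# Hypothetical lookup table containing only the two examples from the text.
# Every system would need to maintain (and keep in sync) a table like this.
PREFIX_TO_BASE = {
    ("LoC", "n"): "http://id.loc.gov/authorities/names/",
    ("LoC", "sh"): "http://id.loc.gov/authorities/subjects/",
}

# "(source)" followed by an alphabetic prefix and a numeric part.
_CONTROL_NUMBER = re.compile(
    r"^\((?P<source>[^)]+)\)\s*(?P<prefix>[a-z]+)\s*(?P<number>\d+)$"
)


def control_number_to_uri(subfield_0):
    """Return a URI for a "(code)number" style $0 value, or None if unknown."""
    match = _CONTROL_NUMBER.match(subfield_0.strip())
    if match is None:
        return None
    base = PREFIX_TO_BASE.get((match["source"], match["prefix"]))
    if base is None:
        return None
    # Reassemble the prefix and number on the far side of the mapping.
    return base + match["prefix"] + match["number"]


if __name__ == "__main__":
    # Sample control number chosen for illustration only.
    print(control_number_to_uri("(LoC)n79021164"))
```

Any source or prefix missing from the table simply yields None, which is where systems would drift apart as each maintains its own list.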

The alternative that I'm proposing, to allow $0 on any field and to assume a default Source Identifier of "(uri)" in the absence of any explicit identifier, would enable systems to provide links for entities that are currently uncontrolled. For example, field 264 (producers/publishers/distributors/manufacturers) is currently not a controllable field. If the proposal were accepted, however, systems could include structured data that identifies the producer/publisher/distributor/manufacturer when they generate record detail pages.
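As a minimal sketch of the relaxed rule, assuming a $0 value is either "(code)value" or a bare value, a system could separate the source from the identifier like this. The function name and return shape are hypothetical; only the "(uri)" default comes from the proposal.

```python
import re

# An explicit source is a parenthesized code at the very start of the value.
_SOURCE_PREFIX = re.compile(r"^\(([^)]+)\)(.*)$")


def split_source(subfield_0):
    """Return (source_code, identifier) for a $0 value.

    Under the proposed rule, a value with no parenthesized code is
    treated as if its source were "(uri)".
    """
    value = subfield_0.strip()
    match = _SOURCE_PREFIX.match(value)
    if match:
        return match.group(1), match.group(2).strip()
    return "uri", value


if __name__ == "__main__":
    print(split_source("http://dbpedia.org/resource/Elsevier"))
    print(split_source("(LoC)n79021164"))
```

With this rule a bare URI needs no lookup table at all, while values that carry an explicit code still fall through to whatever local mapping a system already has.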

I will certainly acknowledge that it's not a perfect proposal as-is, as for a 264 you would most likely want to provide a link for subfield $b and a separate link for subfield $a, whereas for many other fields you're providing a link for the entire combination of subfields -- but it would be a step forward from where we are now.

An extension, then, would be to provide some optional means of defining for which subfield(s) the $0 provides a controlling link; for example, using square brackets and 1-indexed positional integers you could do something slightly horrible like:

264 #1 $a Cambridge : $b Elsevier, $c 2013 $0 [1]http://dbpedia.org/resource/Cambridge,_Massachusetts $0 [2]http://dbpedia.org/resource/Elsevier
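The positional notation in the 264 example could be parsed roughly as follows. This is a sketch under two assumptions that the proposal does not spell out: positions count the non-$0 subfields in order, and a $0 without a bracketed position applies to the whole field. The function name and the list-of-tuples representation of a field are also hypothetical.

```python
import re

# "[n]" at the start of a $0 value, followed by the link itself.
_POSITIONAL = re.compile(r"^\[(\d+)\](.+)$")


def link_subfields(subfields):
    """Pair each data subfield with the URI from a positional $0, if any.

    `subfields` is a list of (code, value) tuples in field order.
    Returns (field_uri, linked), where `linked` is a list of
    (code, value, uri_or_None) and `field_uri` is the URI from a $0
    that has no positional prefix (assumed to cover the whole field).
    """
    # Assumption: positions are 1-indexed over the subfields other than $0.
    data = [(code, value) for code, value in subfields if code != "0"]
    by_position = {}
    field_uri = None
    for code, value in subfields:
        if code != "0":
            continue
        match = _POSITIONAL.match(value.strip())
        if match is None:
            field_uri = value.strip()
            continue
        position = int(match.group(1))
        if 1 <= position <= len(data):
            by_position[position] = match.group(2).strip()
    linked = [
        (code, value, by_position.get(index))
        for index, (code, value) in enumerate(data, start=1)
    ]
    return field_uri, linked


if __name__ == "__main__":
    field_264 = [
        ("a", "Cambridge :"),
        ("b", "Elsevier,"),
        ("c", "2013"),
        ("0", "[1]http://dbpedia.org/resource/Cambridge,_Massachusetts"),
        ("0", "[2]http://dbpedia.org/resource/Elsevier"),
    ]
    for entry in link_subfields(field_264)[1]:
        print(entry)
```

The subfield values, punctuation included, pass through untouched; only the $0 values are interpreted.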

The advantage here is that you can maintain the existing punctuation but have tightly defined linked entities that you can then express when you publish information about this record elsewhere--and you have a ready handle for pulling in more information about any of the linked resources within your MARC-based systems--without having to subsequently do string clean-up and entity matching, etc. And this gives us, perhaps, a way forward from MARC to something else that is more focused on linked data.

Note: I want to thank Karen Coyle for first getting me to think about this problem with her blog post "Linked Data First Steps & Catch-21".

Thanks for any consideration that might go into this informal proposal,

Dan Scott