Friday, January 24. 2014
The following is an email that I sent to the MARC mailing list on January 24, 2014 that might be of interest to those looking to provide better support for linked data in MARC (hopefully as just a transitional step):
In the spirit of making it possible to express linked data in MARC for any data field, would it be worthwhile exploring the possibility of defining subfield $0 as valid for all data fields, and then relaxing the definition such that in the absence of a specific MARC Organization Code or Source Identifier code, it would be understood to be the default of Source Identifier "(uri)" (that is, a URI)?
Right now, for fields that can be controlled by authority records, the mechanism is either to work out the mapping between the MARC Organization Code or Source Identifier code and some URI (if subfield $0 directly identifies the source of the authority record or source identifier), or (in many systems) to look up the local authority record that controls the field and then look up the source for that authority record (again relying on localized logic for the MARC Organization Code / Source Identifier code).
The current limitations are that:
The alternative that I'm proposing--to allow $0 on any field, and to assume a default Source Identifier of "(uri)" in the absence of any explicit identifier--would enable systems to provide links for entities that are currently uncontrolled. For example, field 264 (producers/publishers/distributors/manufacturers) is currently not a controllable field. If the proposal were accepted, however, systems generating record detail pages could include structured data that identifies the producer/publisher/distributor/manufacturer.
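To make the proposed default concrete, here is a minimal Python sketch of how a system might interpret a $0 value under this proposal. The function name `parse_subfield_0` and the sample control number are hypothetical, invented for illustration; the only rule taken from the proposal is that a $0 value without a parenthetical prefix is treated as if its source were "(uri)".

```python
import re

# A leading "(code)" prefix, such as "(DLC)" or "(uri)", names the source.
_SOURCE_PREFIX = re.compile(r"^\(([^)]+)\)\s*(.*)$", re.DOTALL)


def parse_subfield_0(value, default_source="uri"):
    """Split a $0 value into (source, identifier).

    If the value has no parenthetical prefix, fall back to the proposed
    default source of "uri" and treat the whole value as the identifier.
    """
    value = value.strip()
    match = _SOURCE_PREFIX.match(value)
    if match:
        return match.group(1), match.group(2).strip()
    return default_source, value


# Illustrative values only; the control number is made up for the example.
print(parse_subfield_0("(DLC)n 79021164"))
print(parse_subfield_0("http://dbpedia.org/resource/Elsevier"))
```

Under this reading, existing $0 values with an explicit prefix keep their current meaning, and only bare values pick up the new default.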
I will certainly acknowledge that it's not a perfect proposal as-is, as for a 264 you would most likely want to provide a link for subfield $b and a separate link for subfield $a, whereas for many other fields you're providing a link for the entire combination of subfields -- but it would be a step forward from where we are now.
An extension, then, would be to provide some optional means of defining for which subfield(s) the $0 provides a controlling link; for example, using square brackets and 1-indexed positional integers you could do something slightly horrible like:
264 #1 $a Cambridge : $b Elsevier, $c 2013 $0 http://dbpedia.org/resource/Cambridge,_Massachusetts $0 http://dbpedia.org/resource/Elsevier
The advantage here is that you can maintain the existing punctuation but have tightly defined linked entities that you can then express when you publish information about this record elsewhere--and you have a ready handle for pulling in more information about any of the linked resources within your MARC-based systems--without having to subsequently do string clean-up and entity matching, etc. And this gives us, perhaps, a way forward from MARC to something else that is more focused on linked data.
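As a rough sketch of how a MARC-based system might consume such positional links, the Python below pairs each data subfield with its controlling URI. It assumes one possible reading of the notation: the bracketed 1-indexed integer is written as a prefix on the $0 value (for example `[2] http://...`) and counts data subfields in field order. That placement, the function name `link_subfields`, and the tuple representation of a field are all assumptions made for illustration.

```python
import re

# Hypothetical notation: an optional "[n]" prefix on a $0 value, where n is
# the 1-indexed position of the data subfield that the URI controls.
_POSITION_PREFIX = re.compile(r"^\[(\d+)\]\s*(.+)$")


def link_subfields(subfields):
    """Pair each non-$0 subfield with the URI that controls it, if any.

    `subfields` is a list of (code, value) tuples in field order.
    Returns a list of (code, value, uri_or_None) tuples.
    $0 values without a positional prefix are ignored in this sketch.
    """
    data = [(code, value) for code, value in subfields if code != "0"]
    links = {}
    for code, value in subfields:
        if code != "0":
            continue
        match = _POSITION_PREFIX.match(value.strip())
        if match:
            links[int(match.group(1))] = match.group(2)
    return [
        (code, value, links.get(position))
        for position, (code, value) in enumerate(data, start=1)
    ]


field_264 = [
    ("a", "Cambridge :"),
    ("b", "Elsevier,"),
    ("c", "2013"),
    ("0", "[1] http://dbpedia.org/resource/Cambridge,_Massachusetts"),
    ("0", "[2] http://dbpedia.org/resource/Elsevier"),
]
for code, value, uri in link_subfields(field_264):
    print(code, value, uri)
```

The display strings keep their original punctuation, while the URIs travel alongside them as clean identifiers, so no string clean-up is needed before publishing the links.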
Note: I want to thank Karen Coyle for first getting me to think about this problem with her blog post "Linked Data First Steps & Catch-21".
Thanks for any consideration that might go into this informal proposal,
Tuesday, January 21. 2014
Last week I was putting the finishing touches on the first serious academic paper I have written in a long time, and decided that I wanted to provide backup for some of the assertions I had made. Naturally, the deadline was tight, so getting any articles via interlibrary loan was out of the question. This was going to be a purely electronic, immediate access affair.
So what does a systems librarian with a vast array of licensed materials (the "dark web," as the info lit people like to say) at his university's disposal do when faced with a research problem like this? Well, turn to Google Scholar, naturally.
As it turns out, I was able to find the majority of what I needed through Google Scholar: this will come as a surprise to no one who has used it, but it's remarkably good at finding electronic copies of articles and conference proceedings. Sometimes they are preprints on the conference website; sometimes they are copies posted in the institutional repository or on the researcher's own website; sometimes they are what appear to be illicit copies* (you can tell by the watermark) posted on random domains. The more recent the article, the more likely it seemed that I would be able to find an on-demand copy.
My work is in the intersection of the semantic web and library systems, so it's perhaps not surprising that the library-oriented articles tended to have been published further in the past and were less likely to have a freely available copy, whereas almost anything of interest on the semantic web side was immediately available. I suspect that not many people were thinking about open access to research in the 1990s; still, it made me cringe a little to find familiar names amongst the authors of papers on open source library software that I would have liked to cite, but which were locked behind a paywall to which not even my university (with its amazing provincial and federal consortium deals) had licensed access. So, of course, the citations went to papers that were available to me.
Call it anecdotal, call me a lazy researcher, but to me the evolution seems inevitable. If your work is freely available (ideally via a properly legal venue, like publishing in an open access journal, or depositing a copy of your paper in your institutional repository or on your web site--assuming your publication contract allows it) then you are more likely to get citations; and if that pattern continues and coincides with citation counts as one measure of a researcher's effectiveness, what will the incentive be for keeping your work locked behind a paywall?
*A notable example is the seminal article "The Semantic Web" written by Tim Berners-Lee, James Hendler, and Ora Lassila and published in Scientific American in 2001. The official version of the paper is locked behind Scientific American's paywall at http://www.scientificamerican.com/article/the-semantic-web/ and they serve up interstitial ads between searches on their site(!). The primary electronic version offered up by Google Scholar, on the other hand, is a PDF posted at Google Code. Google Code is hardly a notable scholarly publishing site, but I bet it serves up way more copies than SciAm does.**
**Note that if you dig into all of the available copies of the article, there are hundreds scattered across university course pages and semantic web community sites. (Cue Darth Vader: "The infringement is strong in this one.") I assume SciAm knows that the blowback of trying to enforce copyright measures against infringers with this particular high-profile article would be intense; I'm not sure what lesson we're supposed to derive from that.
Thursday, October 31. 2013
I released File_MARC 1.0.1 yesterday after receiving a bug report from the most excellent Mark Jordan about a basic (but data-corrupting) problem that had existed since the very early days (almost seven years ago). If you generate MARC binary output from File_MARC, you should upgrade immediately.
In the MARC binary output code, I was testing a string for the presence of a value--roughly, "if ($value)"--and returning false if no value was present. Which is fine, except when said value was '0', in which case that test returns FALSE. Whoops.
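For readers who don't live in PHP, the gotcha can be modeled in a few lines of Python. This is an emulation, not the File_MARC code: `php_truthy` reproduces PHP's rule that the strings "" and "0" both convert to false, and `has_value_fixed` shows one possible strict check (the equivalent of comparing against the empty string), which is an assumption about the shape of a fix rather than the actual patch.

```python
def php_truthy(value: str) -> bool:
    """Model PHP's string-to-boolean rule: only "" and "0" are falsy."""
    return value not in ("", "0")


def has_value_buggy(value: str) -> bool:
    # Equivalent of PHP's `if ($value)`: wrongly treats "0" as missing.
    return php_truthy(value)


def has_value_fixed(value: str) -> bool:
    # Equivalent of a strict check such as `$value !== ''`:
    # only the empty string counts as missing.
    return value != ""


for sample in ("", "0", "00", "a"):
    print(repr(sample), has_value_buggy(sample), has_value_fixed(sample))
```

The two checks disagree only on the string "0", which is exactly the value that triggered the data corruption.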
It's one of the oldest gotchas in PHP, and it lived for a very long time in this library. Probably because very few people want to generate MARC binary output. But now, that bug is squashed, and a unit test will ensure that it does not come back.
Thursday, October 17. 2013
I submitted the following proposal to the Library Technology Conference 2014 and thought it might be of general interest.
Structuring library data on the web with schema.org: we're on it!
This work is licensed under a Creative Commons Attribution-Share Alike 2.5 Canada License.