Cataloguing for the open web: schema.org in library catalogues and websites

Posted on Tue 01 July 2014 in Libraries

tl;dr: my slides are available at http://stuff.coffeecode.net/2014/understanding_schema, and the slides from Jenn and Jason are also available from ALA Connect at http://connect.ala.org/node/222959.

On Sunday, June 29th Jenn Riley, Jason Clark, and I presented at the ALCTS/LITA jointly sponsored session Understanding schema.org (http://ala14.ala.org/node/14382). The build-up to the session was pretty amazing; I was delighted to learn that Jason and I had been working on pretty much parallel efforts over the past couple of years. Jenn did a great job of organizing the session, and by the time we started talking 276 people had indicated their interest in attending: two more than had expressed interest in the BIBFRAME Forum Update scheduled in the same time slot. Our room was large and quite full.

Jenn started the session out strong by advancing her argument that libraries need to target discovery elsewhere: there is no way that libraries can compete directly with major search engines like Google, Bing, and Yahoo, whether in the discovery tools we have to offer, in our presence in the consciousness of most of the population as the starting point for discovery, or in the resources we can direct towards closing the huge gap in technology, usability, and mindshare that the search engines have opened up over the past two decades. But we can take steps to start working with the search engines so that our resources can be discovered and accessed more directly by them.

That led quite naturally to my own part of the session, in which I talked about my attempt to turn cataloguing's efforts to provide access points in our niche catalogues into access points for the open web by publishing schema.org structured data from library catalogues like Evergreen, Koha, and VuFind. I began by pointing out the legacy of restrictive robots.txt files that still lives on in many catalogues today, then worked through some basics, such as how sitemaps enable search engines--which strive to provide relevant, useful results that matter to users in their context at a particular place and time--to efficiently crawl just the most recently changed pages of interest. Then I launched into the heart of the talk, showing how catalogues that publish schema.org structured data can turn an undifferentiated mass of presentation-oriented HTML and words into machine-comprehensible entities: classes like Book and Organization, connected by properties like publisher, and carrying values for properties like author, datePublished, and isbn.
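To make that concrete, here is a minimal sketch of what RDFa-flavoured schema.org markup for a bibliographic record can look like. It is not drawn from any particular catalogue's templates, and the title, author, publisher, and ISBN values are invented purely for illustration:

    <!-- a bibliographic record exposed as a schema.org Book via RDFa -->
    <div vocab="http://schema.org/" typeof="Book">
      <h1 property="name">A Sample Book</h1>
      <span property="author" typeof="Person">
        <span property="name">Doe, Jane</span>
      </span>
      <span property="publisher" typeof="Organization">
        <span property="name">Example Press</span>
      </span>
      <span>Published: <span property="datePublished">2014</span></span>
      <span>ISBN: <span property="isbn">9780123456789</span></span>
    </div>

A search engine parsing this page no longer sees just words; it sees a Book entity with a Person as its author and an Organization as its publisher.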

For this talk I used visualizations generated by the RDFa playground (http://rdfa.info/play) to illustrate the structured data contained in some real examples from a production Evergreen system (thanks to Bibliomation). Given that I'm normally a text-and-talk kind of guy, the illustrations seemed to help out--particularly in showing how holdings map quite readily to the Product / Offer structure more commonly used by commercial enterprises to reflect the prices, locations, and availability of their products.
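As a rough sketch of that mapping, each copy on the shelf can be expressed as an Offer attached to the Book; the barcode and branch name below are entirely hypothetical values, not taken from any real system:

    <div vocab="http://schema.org/" typeof="Book">
      <span property="name">A Sample Book</span>
      <!-- each holding becomes an Offer, much like a product listing -->
      <div property="offers" typeof="Offer">
        <link property="availability" href="http://schema.org/InStock">
        <span property="sku">30007001234567</span>
        <span property="seller" typeof="Library">
          <span property="name">Example Public Library, Main Branch</span>
        </span>
      </div>
    </div>

The availability enumeration, the seller, and the item identifier line up neatly with how a retailer would describe stock in a particular store.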

Of course, the evolution from unstructured, to structured, to linked data had its payoff beginning with the link from holdings to the libraries that hold the resources. We have plenty more we can and must do, but unlike other efforts which are still crystallizing and which will require significant architectural work to happen before libraries can even begin trying out real systems, you can use schema.org-enabled systems today. And adapting systems to publish schema.org structured data only requires access to the HTML templates for your system (which, hopefully, you have: otherwise you have bigger problems to deal with!) and following the patterns that have already been established by Evergreen, Koha, and VuFind.

Jason did a great job showing a broader set of use cases for schema.org, including work he has led on digital collections, such as embedding the Recipe type in a book of recipes. He also covered some of the evolution of the vocabulary, including the exciting possibilities introduced by the Action type and the potentialAction property for describing RESTful APIs... which naturally led to an off-the-top-of-the-head enumeration of actions such as BorrowAction and LendAction that are perfect for libraries.
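For the curious, the following sketch shows roughly how the potentialAction pattern could be expressed in RDFa. Keep in mind that BorrowAction was only a speculative type in that discussion, and the hold-placement URL is a made-up example:

    <div vocab="http://schema.org/" typeof="Book" resource="#record42">
      <span property="name">A Sample Book</span>
      <!-- a speculative BorrowAction advertising a hold-placement endpoint -->
      <div property="potentialAction" typeof="BorrowAction">
        <div property="target" typeof="EntryPoint">
          <meta property="urlTemplate" content="https://catalogue.example.org/place_hold?record=42">
          <meta property="httpMethod" content="POST">
        </div>
      </div>
    </div>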

Perhaps the best part of the session, however, was the insightful questions from the audience (along with the genuinely enthusiastic response to our talks). We had deliberately left 15 minutes for questions, and we were not disappointed: from questions about how we move from structured data to more linked data (I riffed on the Dodds/Davis Progressive Enrichment linked data pattern at http://patterns.dataincubator.org/book/progressive-enrichment.html, suggesting that we should be able to store links for each field or value of interest directly in our MARC records, as I wrote about in /archives/278-Broadening-support-for-linked-data-in-MARC.html), to questions about what proprietary systems are doing this with schema.org today (alas, none that I'm aware of, unless something has changed since February: /archives/282-Were-not-waiting-for-the-ILS-to-change.html).
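As a closing illustration of what progressive enrichment could look like in practice, the fragments below (assumed to sit inside a vocab="http://schema.org/" context) show the same author before and after a link has been stored alongside the heading in the MARC record; the VIAF URI is a made-up placeholder:

    <!-- before enrichment: the author is just a literal string -->
    <span property="author" typeof="Person">
      <span property="name">Doe, Jane</span>
    </span>

    <!-- after enrichment: the same entity also carries a stable URI
         (an invented VIAF identifier) drawn from the MARC record -->
    <span property="author" typeof="Person" resource="http://viaf.org/viaf/0000000">
      <span property="name">Doe, Jane</span>
    </span>

Nothing about the display changes; the record simply gets richer as links become available.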