How discovery layers have closed off access to library resources, and other tales of schema.org from LITA Forum 2014

Posted on Sat 08 November 2014 in Libraries

At the LITA Forum yesterday (presentation), I accused most discovery layers of not solving the discoverability problems of libraries, but instead exacerbating them by launching us headlong into a closed, unlinkable world. Coincidentally, Lorcan Dempsey's opening keynote contained a subtle criticism of discovery layers. I wasn't that subtle.

Here's why I believe commercial discovery layers are not "of the web": check out their robots.txt files. If you're not familiar with robots.txt files, these are what search engines and other well-behaved automated crawlers of web resources use to determine whether they are allowed to visit and index the content of pages on a site. Here's what the robots.txt files look like for a few of the best-known discovery layers:

    User-agent: *
    Disallow: /

That effectively says "Go away, machines; your kind isn't wanted in these parts." And that, in turn, closes off your library's resources to search engines and other aggregators of content, which runs completely counter to the overarching desire to evolve toward a linked open data world.

During the question period, Marshall Breeding challenged my assertion as being unfair to what are meant to be merely indexes of library content. I responded that most libraries have replaced their catalogues with discovery layers, closing off open access to what have traditionally been their core resources, and he rather quickly conceded that that was indeed a problem.

(By the way, a possible solution might be to simply offer two different URL patterns, something like /library/* for library-owned resources to which access should be granted, and /licensed/* for resources to which open access to the metadata is problematic due to licensing issues, and which robots can therefore be restricted from accessing.)
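To make that concrete, here is a minimal sketch of what such a robots.txt might look like, assuming the hypothetical /library/ and /licensed/ URL prefixes from the previous paragraph; crawlers could index everything except the licensed metadata:

    # Hypothetical robots.txt for a discovery layer that separates
    # library-owned records from licensed metadata by URL prefix.
    User-agent: *
    Disallow: /licensed/
    Allow: /library/

The Allow line is technically redundant, since anything not disallowed is crawlable by default, but it makes the intent explicit for the humans maintaining the file.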

Compared to commercial discovery layers on my very handwavy usability vs. discoverability plot, general search engines rank pretty high on both axes; they're the ready-at-hand tool in browser address bars. And they grok schema.org, so if we publish schema.org data, maybe we get a discoverability win for our users.

But even if we don't (SEO is a black art at best, and maybe the general search engines won't find the right mix of signals to boost the relevance of our resources for specific users in specific locations at specific times), we still get access to that structured data across systems in an extremely reusable way. With sitemaps, we can build our own specialized search engines (Solr or ElasticSearch or Google Custom Search Engine or whatever) that represent specific use cases. Our more sophisticated users can piece together data to, for example, build dynamic lists of collections, using a common, well-documented vocabulary and tools rather than having to dip into the arcane world of library standards (Z39.50 and MARC21).
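As a rough illustration of that reuse, here is a sketch of a harvester that walks a catalogue's sitemap and collects embedded schema.org data for indexing elsewhere. The sitemap URL is invented, and it assumes the pages embed their data as JSON-LD; pages that publish RDFa or microdata would need a different extractor:

    # Sketch using only the Python 3 standard library: read a catalogue
    # sitemap, fetch each page it lists, and collect any JSON-LD blocks
    # so they can be fed into a local index (Solr, ElasticSearch, etc.).
    # The sitemap URL below is hypothetical.
    import json
    import re
    import urllib.request
    import xml.etree.ElementTree as ET

    SITEMAP_URL = "https://catalogue.example.edu/sitemap.xml"
    SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"
    JSON_LD = re.compile(
        r'<script[^>]*type="application/ld\+json"[^>]*>(.*?)</script>',
        re.DOTALL | re.IGNORECASE,
    )

    def sitemap_urls(sitemap_url):
        """Yield the <loc> entries from a standard sitemap.xml file."""
        with urllib.request.urlopen(sitemap_url) as response:
            tree = ET.parse(response)
        for loc in tree.iter(SITEMAP_NS + "loc"):
            yield loc.text.strip()

    def harvest(sitemap_url):
        """Yield (URL, parsed JSON-LD) pairs for every page in the sitemap."""
        for url in sitemap_urls(sitemap_url):
            with urllib.request.urlopen(url) as response:
                html = response.read().decode("utf-8", errors="replace")
            for match in JSON_LD.finditer(html):
                yield url, json.loads(match.group(1))

    if __name__ == "__main__":
        for url, record in harvest(SITEMAP_URL):
            print(url, record)

A few dozen lines of glue like this is all it takes to turn "our metadata is on the web" into "our metadata is in whatever index we want to build."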

So why not iterate our way towards the linked open data future by building on what we already have now?

As Karen Coyle wrote in a much more elegant fashion, the transition looks roughly like:

  • Stored data -> transform/template -> human readable HTML page
  • Stored data -> transform/template (tweaked) -> machine & human readable HTML page

That is, by simply tweaking the same mechanism you already use to generate a human readable HTML page from the data you have stored in a database or flat files or what have you, you can embed machine readable structured data as well.
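For example, a tweaked catalogue template might wrap the bibliographic details it already prints with RDFa attributes that map them to schema.org types and properties (microdata or JSON-LD would work just as well; the title, author, and ISBN below are invented for illustration):

    <div vocab="http://schema.org/" typeof="Book">
      <h1 property="name">An Invented Title</h1>
      <p>by
        <span property="author" typeof="Person">
          <span property="name">A. N. Author</span>
        </span>
      </p>
      <p>ISBN: <span property="isbn">9780000000000</span></p>
    </div>

The page still renders for humans exactly as it did before; the extra attributes are only visible to machines.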

That is, in fact, exactly the approach I took with Evergreen, VuFind, and Koha. And they now expose structured data and generate sitemaps out of the box using the same old MARC21 data. Evergreen even exposes information about libraries (locations, contact information, hours of operation) so that you can connect its holdings to specific locations.
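That library-level markup might look something like the following sketch (the name, telephone number, and hours are invented, and the actual templates differ in detail):

    <div vocab="http://schema.org/" typeof="Library">
      <span property="name">Example Campus Library</span>
      <span property="telephone">+1-555-555-0100</span>
      <span property="openingHours" content="Mo-Fr 08:00-22:00">
        Monday to Friday, 8am to 10pm
      </span>
    </div>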

And what about all of our resources outside of the catalogue? Research guides, fonds descriptions, institutional repositories, publications... I've been lucky enough to be working with Camilla McKay and Karen Coyle on applying the same process to the Bryn Mawr Classical Review. At this stage, we're exposing basic entities (Reviews and People) largely as literals, but we're laying the groundwork for future iterations where we link them up to external entities. And all of this is built on a Tcl + SGML infrastructure.

So why schema.org? It has the advantage of being a de facto generalized vocabulary that can be understood and parsed across many different domains, from car dealerships to streaming audio services to libraries, and it can be embedded into existing HTML relatively simply, as long as you can modify the templating layer of your system.

And schema.org offers much more than just static structured data: schema.org Actions are surfacing in applications like Gmail as a way of providing directly actionable links. There's no reason we shouldn't embrace that approach to expose "SearchAction", "ReadAction", "WatchAction", "ListenAction", and "ViewAction", along with "OrderAction" (request), "BorrowAction" (borrow or renew), "Place on Reserve", and other common actions, as a standardized API that extends well beyond libraries (see Hydra for a developing approach to this problem).
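As a rough sketch of the idea, a catalogue record page could advertise a hold request as a schema.org potentialAction; the record, target URL, and choice of OrderAction here are invented for illustration:

    <script type="application/ld+json">
    {
      "@context": "http://schema.org",
      "@type": "Book",
      "name": "An Invented Title",
      "potentialAction": {
        "@type": "OrderAction",
        "target": "https://catalogue.example.edu/place_hold?record=12345"
      }
    }
    </script>

A client that understood the vocabulary could then offer a "Request this item" button without knowing anything about the underlying ILS.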

I want to thank Richard Wallis for inviting me to co-present with him; it was a great experience, and I really enjoy meeting and sharing with others who are putting linked data theory into practice.