Tales of a semantic web dropout (or what I meant to say at code4lib 2014)

Posted on Thu 03 April 2014 in Linked Open Data

Last week I had the fantastic experience of returning to the code4lib conference for the first time since 2008, and as a speaker to boot.

The title of my talk was "Structured data NOW: seeding schema.org in library systems". I had given two talks the prior week on a substantially similar subject (teaching Koha, Evergreen, and VuFind how to express schema.org structured data via RDFa), but all three conferences had very different audiences. I felt great about my talks at LibTechConf and the Evergreen International Conference, but those were one hour long and 45 minutes long, respectively. code4lib, on the other hand, schedules 20-minute slots; it is a veritable crucible for speakers. I remixed and rewrote my code4lib talk obsessively leading up to the conference, and ultimately ended up adding content to my overall message, which was obviously the wrong direction to take things... but before this audience of my peers, I felt an absolute need to explain why I had chosen to spend much of the past year and a half focused on RDFa and schema.org. That, in turn, forced me to cut a significant amount from the actual delivery, which meant that the audience didn't get the takeaway message that I actually wanted to deliver. One peer, in fact, described it as "a good refresher on microdata", which was almost exactly what I had wanted to avoid doing (microdata vs. RDFa aside) for this audience!

So, let me pick things up where I left off; but first, I'll give you time to go watch the video (YouTube) or read through the slides until you get to the slide titled "Structured library information".

.............................

All caught up? Good! Now let's pretend that I had about ten more minutes; here's roughly what I wanted to impart:

Structured library information: Given that schema.org offers the Library type, and library systems often contain information such as the hours of operation, contact information, physical address, and branch relationships, we can teach our library systems to express that as structured data. And good news: Evergreen (as of the 2.6 release) will do exactly that! So if you remember all the way back to the start of the presentation, where I was pointing at various map services that had differing levels of knowledge about our libraries (and often required different social media accounts), publishing your data in an openly accessible, standard format should make it possible for those map services (including OpenStreetMap) to do a better job of reflecting our presence in the world.
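
To make that concrete, here is a minimal Python sketch of the kind of RDFa Lite markup a library system could emit for a schema.org Library. It is an illustration only, not Evergreen's actual template: the function name `library_rdfa`, its parameters, and the sample values are all hypothetical. It covers only the name, telephone, address, and opening hours.

```python
from html import escape


def library_rdfa(name, telephone, street, locality, opening_hours):
    """Render a hypothetical RDFa Lite fragment describing a schema.org Library.

    opening_hours is a list of strings in the schema.org openingHours
    text format, e.g. "Mo-Fr 09:00-17:00".
    """
    hours = "\n".join(
        '  <time property="openingHours" datetime="{0}">{1}</time>'.format(
            escape(h, quote=True), escape(h)
        )
        for h in opening_hours
    )
    return (
        '<div vocab="http://schema.org/" typeof="Library">\n'
        '  <h1 property="name">{name}</h1>\n'
        '  <span property="telephone">{telephone}</span>\n'
        '  <div property="address" typeof="PostalAddress">\n'
        '    <span property="streetAddress">{street}</span>\n'
        '    <span property="addressLocality">{locality}</span>\n'
        '  </div>\n'
        '{hours}\n'
        '</div>'
    ).format(
        name=escape(name),
        telephone=escape(telephone),
        street=escape(street),
        locality=escape(locality),
        hours=hours,
    )


if __name__ == "__main__":
    # Made-up sample values, purely for illustration.
    print(library_rdfa(
        "Example Branch Library",
        "555-0100",
        "1 Main Street",
        "Exampleville",
        ["Mo-Fr 09:00-17:00", "Sa 10:00-14:00"],
    ))
```

Because the values are escaped before they go into the template, a branch name such as "Arts & Science" still produces well-formed markup that a crawler can parse.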

Thought experiment: Now that we're publishing our holdings in a commonly understood Offer format, and linking those holdings to the library that holds them, and (in the case of Evergreen) providing information about those libraries, when can we stop batch uploading MARC at irregular intervals just to create union catalogs? In fact, wouldn't we be able to build ILL systems that can do a much better, more competitive job once we're making this information openly available on the web?

Sitemaps: Of course, to tell search engines and crawlers what pages are of interest and when they have been updated, you have to offer a sitemap. Fortunately, Koha, VuFind, and Evergreen (to a lesser extent) all support generation of sitemaps today.
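
For readers who have not looked inside one, a sitemap is just an XML list of page URLs with optional last-modified dates. The following Python sketch reads those pairs out of a sitemap using only the standard library; the function name `parse_sitemap` and the sample URLs are hypothetical, and it handles a plain `urlset` only (not a sitemap index file).

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Made-up sample document, purely for illustration.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://catalogue.example.org/record/1</loc>
    <lastmod>2014-03-30</lastmod>
  </url>
  <url>
    <loc>https://catalogue.example.org/record/2</loc>
  </url>
</urlset>"""


def parse_sitemap(xml_text):
    """Return (loc, lastmod) pairs from a <urlset>; lastmod is None if absent."""
    root = ET.fromstring(xml_text)
    entries = []
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", namespaces=SITEMAP_NS)
        lastmod = url.findtext("sm:lastmod", namespaces=SITEMAP_NS)
        if loc:
            entries.append((loc.strip(), lastmod.strip() if lastmod else None))
    return entries


if __name__ == "__main__":
    for loc, lastmod in parse_sitemap(SAMPLE):
        print(loc, lastmod)
```

A crawler can compare each `lastmod` value against the date of its previous visit and skip records that have not changed.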

Quick union catalogues: As a proof of concept, I showed that we can build union catalogues on the backs of existing general search engines by creating a Google Custom Search Engine (CSE) that tied together the holdings of two different VuFind instances along with an Evergreen instance under a single search box. It is as ugly as sin, but it took me all of about ten minutes to cobble together; Google had already crawled all of the pages, so I just had to tell it what hostnames and URL patterns I cared about. The CSE even gives you some limited support for directly querying the underlying structured data. Later on, Sean Aery from Duke gave a lightning talk that showed off how they had taken exactly this approach to provide a search interface for their finding aids and digital collections, and made it beautiful!

Quick union catalogues: in progress: As a firm believer in the importance of decentralization, I pointed at a simple in-progress Python script that would crawl sitemaps and extract structured data from all of the indexed pages. My intention was to provide a complete indexed solution with a simple web frontend, but I got a bit bogged down in first updating the Fedora packages for several of the dependencies, then tackling some bugs in the upstream libraries themselves. More to be done here!
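
The extraction step can be sketched in a few lines of Python. To be clear, this is not the in-progress script mentioned above, and it is nowhere near a conforming RDFa processor: it is a deliberately naive illustration built on the standard library's `html.parser` that collects `property` values from a page, ignoring nesting, prefixes, and `typeof` scoping. The names `PropertyCollector` and `extract_properties` are hypothetical.

```python
from html.parser import HTMLParser


class PropertyCollector(HTMLParser):
    """Collect (property, value) pairs from elements with a `property` attribute.

    Naive on purpose: the value is either the `content` attribute or the
    text seen up to the next closing tag. Nested properties are not handled.
    """

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._current = None  # property name currently being captured
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        prop = attrs.get("property")
        if prop is None:
            return
        if attrs.get("content") is not None:
            # e.g. <meta property="datePublished" content="1965">
            self.pairs.append((prop, attrs["content"]))
        else:
            self._current, self._text = prop, []

    def handle_data(self, data):
        if self._current is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._current is not None:
            self.pairs.append((self._current, "".join(self._text).strip()))
            self._current = None


def extract_properties(html_text):
    """Return the (property, value) pairs found in one HTML page."""
    collector = PropertyCollector()
    collector.feed(html_text)
    collector.close()
    return collector.pairs


if __name__ == "__main__":
    page = (
        '<div vocab="http://schema.org/" typeof="Book">'
        '<h1 property="name">Dune</h1>'
        '<span property="author">Frank Herbert</span>'
        '<meta property="datePublished" content="1965">'
        '</div>'
    )
    print(extract_properties(page))
```

A real crawler would hand each page to a proper RDFa parser and load the resulting triples into an index; the sketch only shows that the structured data is sitting in the HTML, ready to be harvested.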

.............................

Hmm. Well, maybe I didn't miss conveying as much as I had feared. On the bright side, there was a great deal of interest in the SchemaBibEx "best practices and recommendations" documentation that I had promised we were working on... and today Richard Wallis described some of his work in this area. So that's a good thing. And even if some of the audience walked away from my talk with just an introduction to RDFa and schema.org, that put them in an extremely good position to enjoy and understand Sean's subsequent lightning talk.

Oh, and my admission of being a semantic web dropout (due to the complexity of content negotiation and heterogeneous vocabularies and billions of triples and RDF/XML) ended up being a perfect setup for the immediately following talk, "Next Generation Catalogue - RDF as a Basis for New Services" by Anne-Lena Westrum and Asgeir Rekkavik from the Oslo Public Library, who basically said "Semantic web? Oh yeah, we can totally do that!" and proceeded to show their MARC2RDF and RDF2MARC workflows. Very cool stuff (and delightful scheduling by the conference program committee!)