Parsing the schema.org vocabulary for fun and frustration

Posted on Thu 01 August 2013 in misc

For various reasons I've spent a few hours today trying to parse the schema.org vocabulary into a nice, searchable database structure. Unfortunately, for a linked data effort that's two years old now and arguably one of the most important efforts out there, it's been an exercise in frustration.

OWL - oww, oww, oww

My first attempt was to work through the "official OWL version of the terms" at http://schema.org/docs/schemaorg.owl (per http://schema.rdfs.org/). That quickly led me to file a bug asking for the OWL to be served with the MIME type application/xml rather than text/html, so that my browser would do a better job of rendering it. But that was a minor issue. As I worked through the OWL, I discovered that it was not at all in sync with the vocabulary as documented on the http://schema.org web site.

For example, looking at the OWL for datePublished, things look okay at first glance:

    <!-- http://schema.org/datePublished -->
    <owl:DatatypeProperty rdf:about="http://schema.org/datePublished">
        <rdfs:label xml:lang="en">datePublished</rdfs:label>
        <rdfs:comment xml:lang="en">Date of first broadcast/publication.</rdfs:comment>
        <rdfs:range rdf:resource="http://www.w3.org/2000/01/rdf-schema#Literal"/>
        <rdfs:domain>
            <owl:Class>
                <owl:unionOf rdf:parseType="Collection">
                    <rdf:Description rdf:about="http://schema.org/CreativeWork"/>
                </owl:unionOf>
            </owl:Class>
        </rdfs:domain>
    </owl:DatatypeProperty>

But the first problem is that the range property is a lie. Per http://schema.org/datePublished, the value is supposed to be restricted to http://schema.org/Date (backed up by an RDF(S) assertion in the page itself).

In fact, this problem affects all of the range declarations in the OWL document: everything has a range of rdfs:Literal. Not good.
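A quick way to check a claim like this is to tally the targets of every rdfs:range declaration. The following is a minimal sketch using Python's standard library; SAMPLE_OWL is a trimmed copy of the datePublished excerpt wrapped in an rdf:RDF root that I added for illustration, and count_ranges is a hypothetical helper name. Against the real file you would pass in the full document text instead.

```python
import xml.etree.ElementTree as ET
from collections import Counter

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"

# Assumption: a trimmed stand-in for the OWL file, not the full document.
SAMPLE_OWL = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:owl="http://www.w3.org/2002/07/owl#">
  <owl:DatatypeProperty rdf:about="http://schema.org/datePublished">
    <rdfs:label xml:lang="en">datePublished</rdfs:label>
    <rdfs:range rdf:resource="http://www.w3.org/2000/01/rdf-schema#Literal"/>
  </owl:DatatypeProperty>
</rdf:RDF>"""


def count_ranges(owl_text):
    """Count how often each rdfs:range target URI appears in the document."""
    root = ET.fromstring(owl_text)
    counts = Counter()
    for rng in root.iter(f"{{{RDFS}}}range"):
        # rdf:resource is None when the range is an anonymous class instead.
        counts[rng.get(f"{{{RDF}}}resource")] += 1
    return counts


range_counts = count_ranges(SAMPLE_OWL)
for target, n in range_counts.most_common():
    print(n, target)
```

If a single URI dominates the tally for the whole file, the range declarations are carrying no useful information.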

RDF/XML

Next, I tried the RDF/XML format. Looking at http://schema.org/datePublished again, we see:

    <rdf:Description rdf:about="http://schema.org/datePublished">
        <rdf:type rdf:resource="http://www.w3.org/1999/02/22-rdf-syntax-ns#Property"/>
        <rdfs:label xml:lang="en">Date Published</rdfs:label>
        <rdfs:comment xml:lang="en">Date of first broadcast/publication.</rdfs:comment>
        <rdfs:domain rdf:resource="http://schema.org/CreativeWork"/>
        <rdfs:range rdf:resource="http://www.w3.org/2001/XMLSchema#date"/>
        <rdfs:isDefinedBy rdf:resource="http://schema.org/CreativeWork"/>
    </rdf:Description>

Better, in that the range is no longer a generic literal, but instead of pointing to http://schema.org/Date, it points to http://www.w3.org/2001/XMLSchema#date. That seems weird and concerning. In addition, working with this format quickly becomes difficult because you have to follow chains of blank-node references like the following to make sense of things:

    <rdf:Description rdf:nodeID="node180s0ohklx169">
        <rdf:type rdf:resource="http://www.w3.org/2002/07/owl#Class"/>
    </rdf:Description>
    <rdf:Description rdf:nodeID="node180s0ohklx170">
        <rdf:first rdf:resource="http://schema.org/Distance"/>
        <rdf:rest rdf:nodeID="node180s0ohklx171"/>
    </rdf:Description>
    <rdf:Description rdf:nodeID="node180s0ohklx171">
        <rdf:first rdf:resource="http://schema.org/QuantitativeValue"/>
        <rdf:rest rdf:resource="http://www.w3.org/1999/02/22-rdf-syntax-ns#nil"/>
    </rdf:Description>
    <rdf:Description rdf:nodeID="node180s0ohklx169">
        <owl:unionOf rdf:nodeID="node180s0ohklx170"/>
    </rdf:Description>
    <rdf:Description rdf:about="http://schema.org/depth">
        <rdfs:range rdf:nodeID="node180s0ohklx169"/>
        <rdfs:isDefinedBy rdf:resource="http://schema.org/Product"/>
    </rdf:Description>
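To show what that chasing involves, here is a rough sketch that resolves the range of http://schema.org/depth from the excerpt above by following rdfs:range to the blank node, then owl:unionOf, then the rdf:first / rdf:rest chain until rdf:nil. The rdf:RDF wrapper and the helper names (index_nodes, find, walk_list, resolve_ranges) are mine, added for illustration. A real RDF library would do this for you; the point is how much plumbing a single property needs.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
OWL = "http://www.w3.org/2002/07/owl#"

# Assumption: the excerpt above, wrapped in a root element so it parses.
SAMPLE_RDF = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:owl="http://www.w3.org/2002/07/owl#">
  <rdf:Description rdf:nodeID="node180s0ohklx169">
    <rdf:type rdf:resource="http://www.w3.org/2002/07/owl#Class"/>
  </rdf:Description>
  <rdf:Description rdf:nodeID="node180s0ohklx170">
    <rdf:first rdf:resource="http://schema.org/Distance"/>
    <rdf:rest rdf:nodeID="node180s0ohklx171"/>
  </rdf:Description>
  <rdf:Description rdf:nodeID="node180s0ohklx171">
    <rdf:first rdf:resource="http://schema.org/QuantitativeValue"/>
    <rdf:rest rdf:resource="http://www.w3.org/1999/02/22-rdf-syntax-ns#nil"/>
  </rdf:Description>
  <rdf:Description rdf:nodeID="node180s0ohklx169">
    <owl:unionOf rdf:nodeID="node180s0ohklx170"/>
  </rdf:Description>
  <rdf:Description rdf:about="http://schema.org/depth">
    <rdfs:range rdf:nodeID="node180s0ohklx169"/>
    <rdfs:isDefinedBy rdf:resource="http://schema.org/Product"/>
  </rdf:Description>
</rdf:RDF>"""


def index_nodes(root):
    """Merge the children of every rdf:Description under its nodeID or URI."""
    nodes = {}
    for desc in root.iter(f"{{{RDF}}}Description"):
        key = desc.get(f"{{{RDF}}}nodeID") or desc.get(f"{{{RDF}}}about")
        nodes.setdefault(key, []).extend(desc)
    return nodes


def find(nodes, key, tag):
    """Return the first child element with this tag under a node, or None."""
    return next((el for el in nodes.get(key, []) if el.tag == tag), None)


def walk_list(nodes, head):
    """Follow rdf:first / rdf:rest links and yield each list member URI."""
    while head is not None:
        first = find(nodes, head, f"{{{RDF}}}first")
        if first is None:
            break
        yield first.get(f"{{{RDF}}}resource")
        rest = find(nodes, head, f"{{{RDF}}}rest")
        # rdf:rest names the next blank node; rdf:nil has no nodeID, so we stop.
        head = rest.get(f"{{{RDF}}}nodeID") if rest is not None else None


def resolve_ranges(nodes, prop_uri):
    """Return the range URIs for a property, expanding owl:unionOf lists."""
    rng = find(nodes, prop_uri, f"{{{RDFS}}}range")
    if rng is None:
        return []
    direct = rng.get(f"{{{RDF}}}resource")
    if direct:
        return [direct]
    union = find(nodes, rng.get(f"{{{RDF}}}nodeID"), f"{{{OWL}}}unionOf")
    if union is None:
        return []
    return list(walk_list(nodes, union.get(f"{{{RDF}}}nodeID")))


nodes = index_nodes(ET.fromstring(SAMPLE_RDF))
depth_ranges = resolve_ranges(nodes, "http://schema.org/depth")
print(depth_ranges)
```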

Insert "Ain't nobody got time for that!" meme image here, if you will...

JSON

Somewhat ironically, given the XML orientation of the semantic web world, the best option I've found so far ends up being the JSON representation of the schema. Here's our http://schema.org/datePublished friend, again:

    "datePublished": {
      "comment": "Date of first broadcast/publication.",
      "comment_plain": "Date of first broadcast/publication.",
      "domains": [
        "CreativeWork"
      ],
      "id": "datePublished",
      "label": "Date Published",
      "ranges": [
        "Date"
      ]
    },

Very easy to work with, and the data seems to be accurate! So far I've noticed only a few quirks:

Datatypes and properties have populated comment properties, but types only have an empty string. Of course I could harvest those myself for each type by hitting the corresponding schema.org page, but I shouldn't have to do that.
Datatypes and types have url properties, but properties do not. It's straightforward to create them yourself by appending the id property to "http://schema.org/", but consistency would be nice.
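Both quirks are easy to paper over while loading the data. The sketch below is one way to get the JSON into a searchable SQLite database, under a few assumptions: that the file groups its entries under top-level "datatypes", "properties", and "types" keys, and that entries look like the datePublished example. The CreativeWork entry in SAMPLE_JSON is a made-up stand-in that mimics the empty comment, and the table layout and load_terms name are my own.

```python
import json
import sqlite3

BASE = "http://schema.org/"

# Assumption: a tiny stand-in for the real JSON file; the "types" entry is
# hypothetical and only mirrors the empty-comment quirk described above.
SAMPLE_JSON = """
{
  "properties": {
    "datePublished": {
      "comment": "Date of first broadcast/publication.",
      "comment_plain": "Date of first broadcast/publication.",
      "domains": ["CreativeWork"],
      "id": "datePublished",
      "label": "Date Published",
      "ranges": ["Date"]
    }
  },
  "types": {
    "CreativeWork": {
      "comment": "",
      "id": "CreativeWork",
      "label": "Creative Work",
      "url": "http://schema.org/CreativeWork"
    }
  }
}
"""


def load_terms(json_text):
    """Load schema.org terms into an in-memory SQLite database."""
    data = json.loads(json_text)
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE terms (id TEXT, kind TEXT, label TEXT, url TEXT, comment TEXT)"
    )
    db.execute("CREATE TABLE ranges (property_id TEXT, range_id TEXT)")
    for kind in ("datatypes", "properties", "types"):
        for term in data.get(kind, {}).values():
            # Properties lack a url, so build one from the id.
            url = term.get("url") or BASE + term["id"]
            # Store empty comments as NULL so they are easy to find later.
            comment = term.get("comment") or None
            db.execute(
                "INSERT INTO terms VALUES (?, ?, ?, ?, ?)",
                (term["id"], kind, term.get("label"), url, comment),
            )
            db.executemany(
                "INSERT INTO ranges VALUES (?, ?)",
                [(term["id"], r) for r in term.get("ranges", [])],
            )
    return db


db = load_terms(SAMPLE_JSON)
rows = db.execute(
    "SELECT t.url, r.range_id FROM terms t "
    "JOIN ranges r ON r.property_id = t.id"
).fetchall()
missing = db.execute("SELECT id FROM terms WHERE comment IS NULL").fetchall()
print(rows)
print(missing)
```

Storing empty comments as NULL makes it simple to list the types whose descriptions would still need to be harvested from the schema.org pages.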

Summary

I would recommend that anyone looking to do something meaningful with the schema.org vocabulary start with the JSON format for now. You will likely be much more efficient than if you wade into one of the other formats and start dealing with multiple namespaces, etc., before ultimately realizing that the format you're working with is deeply flawed or out of sync. I've also given myself the task of offering up a merge request or two to the Git repository where the scrapers that generate these formats live, so hopefully these problems will be resolved sooner rather than later.