Book (introduction): RDFa with schema.org codelab

From basic HTML to RDFa: first steps

In this exercise, you will learn the basic steps required to add simple RDFa structured data to an existing library catalogue page for a book.

View the page source HTML

Open step1/rdfa_book.html in an HTML-friendly text editor such as vim, emacs, nano, the Chrome Dev Editor... anything but Notepad! You should see something like the following HTML source for the web page:

<!DOCTYPE html>
<html>
  <head>
    <meta charset="UTF-8">
    <title>Data: a collection of problems from many fields for the student and research worker</title>
  </head>
  <body>
    <div id="record">
      <div style="float:right;">
        <img alt="Cover" class="recordcover" src="../cover.png">
      </div>
      <h1>Data</h1>
      <h2>a collection of problems from many fields for the student and research worker </h2>
      <h4>D. F. Andrews ; A. M. Herzberg</h4>
      <table class="citation" summary="Bibliographische Detailangaben">
        <tr valign="top">
          <th>Autor:</th>
          <td class="recordAuthor"><a href="/Summon/Search?lookfor=%22Herzberg%2C+Agnes+M%22&type=Author">Herzberg, Agnes M</a></td>
        </tr>
        <tr valign="top">
          <th>Weitere Autoren:</th>
          <td class="recordSecAuthor"><a href="/Summon/Search?lookfor=%22Andrews%2C+David+F%22&type=Author">Andrews, David F</a></td>
        </tr>
        <tr valign="top">
          <th>Format:</th>
          <td class="recordFormat"><span class="iconlabel book">Buch</span></td>
        </tr>
        <tr valign="top">
          <th>Sprache:</th>
          <td class="recordLanguage">German</td>
        </tr>
        <tr valign="top">
          <th>Veröffentlicht:</th>
          <td>New York [u.a.] : Springer, 1985</td>
        </tr>
        <tr valign="top">
          <th>Umfang:</th>
          <td>XX, 442 S. : graph. Darst.</td>
        </tr>
        <tr valign="top">
          <th>ISBN:</th>
          <td>3-540-96125-9<br>
              0-387-96125-9<br>
          </td>
        </tr>
      </table>
    </div>
  </body>
</html>

Note: In a pinch, you can use the browser development tools to view and edit the source of the web page (CTRL-Shift-i in Chrome or Firefox, in the Elements or Inspector tab respectively).

Extract and view structured data in HTML

There are a number of RDFa parsers, both online and locally installable, that can help you check the results of your work. Copy and paste the HTML source into each of the following online structured data extraction tools:

The results should (not surprisingly!) show that the page currently contains no structured data.

Add the RDFa vocabulary declaration

Preamble

RDFa (Resource Description Framework in attributes) enables us to embed descriptions of things (types) and their properties within HTML documents using just a handful of HTML attributes.

To avoid a tower of Babel situation where one person uses the type name "author" to refer to the same concept that someone else calls a "writer", collections of types and their properties are typically standardized and published as a vocabulary (also known as an ontology).

Each type and property is expected to have a dereferenceable URI so that you (or more realistically the machines) can look up the definition of the vocabulary element and determine its relationship (if any) to other vocabulary elements. For example, you can look up http://schema.org/Book and learn that it is a subclass of the Thing / CreativeWork hierarchy.

You could use the full URI for each vocabulary element, but that would be extremely verbose, especially given vocabularies that publish URIs like http://rdaregistry.info/Elements/a/countryAssociatedWithThePerson. Therefore, RDFa offers the @vocab attribute: if you add a vocab="http://<path/for/vocab>" attribute to an HTML element, the specified value is automatically prepended to the value of every RDFa @typeof and @property attribute within its scope.
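As a minimal sketch of that difference (the wrapping <div> elements are illustrative and not part of the exercise files), the two fragments below should describe the same thing; the second simply lets @vocab supply the common prefix:

```html
<!-- Without @vocab: every type and property is written as a full URI -->
<div typeof="http://schema.org/Book">
  <span property="http://schema.org/name">Data</span>
</div>

<!-- With @vocab: short terms are joined onto the declared vocabulary URI,
     so "Book" becomes http://schema.org/Book and "name" becomes
     http://schema.org/name -->
<div vocab="http://schema.org/" typeof="Book">
  <span property="name">Data</span>
</div>
```

Because the term is simply appended to the @vocab value, a value of http://schema.org without the trailing slash would produce http://schema.orgBook, which is not the URI you want.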

Declare the schema.org vocabulary as your default

We're going to use the schema.org vocabulary for our exercise, as it includes types and properties that enable us to describe many things of general interest without having to mix and match multiple vocabularies. Declare the default vocabulary for the HTML document as http://schema.org/ on the <body> element. Note: Do not forget the trailing slash (/)!

Check your markup

<!DOCTYPE html>
<html>
  <head>
    <meta charset="UTF-8">
    <title>Data: a collection of problems from many fields for the student and research worker</title>
  </head>
  <body vocab="http://schema.org/">
...

Checkpoint: Your HTML page should now look like step1/check.html

Specialized versus general vocabularies

Many vocabularies focus on a particular domain; for example:

Friend-of-a-Friend (FOAF): social connections
Portable Contacts (PoCo): contact information
Bibliographic Ontology (Bibo): bibliographic references
Good Relations (gr): sales offers, orders, and agents

In practice, documents often ended up using types and properties from several different vocabularies. While vocabulary description languages like RDF Schema (RDFS) and the Web Ontology Language (OWL) offer ways to express equivalence between types and properties of different vocabularies, it can still be extremely complex to publish and consume mixed-vocabulary documents.

schema.org, on the other hand, tries to provide a vocabulary that can describe almost everything, albeit in many cases with less granularity than more specialized vocabularies.
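To make the contrast concrete, here is a hedged sketch of the same small description written twice. The first fragment mixes Bibo, Dublin Core terms, and FOAF through the RDFa @prefix attribute; the second uses only schema.org. The choice of terms in the first fragment is an assumption made for illustration, not a recommended mapping.

```html
<!-- Hypothetical mixed-vocabulary markup: three prefixes for one short description -->
<div prefix="bibo: http://purl.org/ontology/bibo/ dcterms: http://purl.org/dc/terms/ foaf: http://xmlns.com/foaf/0.1/"
     typeof="bibo:Book">
  <span property="dcterms:title">Data</span>
  <span property="dcterms:creator" typeof="foaf:Person">
    <span property="foaf:name">Herzberg, Agnes M</span>
  </span>
</div>

<!-- Roughly the same description using only the schema.org vocabulary -->
<div vocab="http://schema.org/" typeof="Book">
  <span property="name">Data</span>
  <span property="author" typeof="Person">
    <span property="name">Herzberg, Agnes M</span>
  </span>
</div>
```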

Add the type and associated properties for your page

Preamble

Unless declared otherwise, web pages are assumed to have a type of WebPage. The choice of type is important as it dictates which properties you can "legally" use, so this section will help you find a more specific match for your purposes.

Browse the schema.org type hierarchy

The schema.org types are arranged in a top-down hierarchy. Starting at the top level of the type hierarchy, browse through the CreativeWork type hierarchy. Notice how each type inherits the properties from its parent (beginning with Thing), offers its own more specific definition for its raison d'etre, and may add its own properties to enable you to describe it more completely.

Add the type declaration for the document

To declare an RDFa type for an HTML document, add the @typeof attribute to the <body> element and set the value of the attribute to Book.

Check your markup

<!DOCTYPE html>
<html>
  <head>
    <meta charset="UTF-8">
    <title>Data: a collection of problems from many fields for the student and research worker</title>
  </head>
  <body vocab="http://schema.org/" typeof="Book">
...

Checkpoint: Your HTML page should now look like step2/check_a.html

Add a name property for the type

Every schema.org type has a name property available to it, because the property is declared on the Thing type from which every other type inherits. In the case of a Book, the title of the book is mapped to its name. Go ahead and add a @property="name" attribute to the <h1> element to assert that the content of that element is the name of the book.

Note: You might be tempted to add the attribute to the <title> element of the HTML document, but this would fall outside of the scope of your @typeof attribute. And while a search engine would likely make a best guess that, if the content of the <title> and <h1> for a given web page match then that's likely the title, your explicit assertion of that property is stronger than an inference.

Note: As the title is split between an <h1> element for the main title and an <h2> element for the subtitle, you can simply wrap both elements in a new <div> element to designate the combination of the two as the value of the name property.

Check your markup

<!DOCTYPE html>
...
<body vocab="http://schema.org/" typeof="Book">
...
    <div property="name">
      <h1>Data</h1>
      <h2>a collection of problems from many fields for the student and research worker </h2>
    </div>
...

Add an author property for the type

This book has an author, and if you check the documentation for Book you will find that there is indeed an author property. Notice that the expected type of the author property is either a Person or Organization type. For now, go ahead and add the @property="author" attribute to the <a> element for each of the authors' names.

Note: You might be tempted to add the attribute to the <tr> element of the HTML document, but the scope of the <tr> element includes more than just the name of the author, so you would be asserting (falsely!) that the author was "Autor: Herzberg, Agnes M".

Check your markup

<!DOCTYPE html>
...
<body vocab="http://schema.org/" typeof="Book">
...
    <tr valign="top">
      <th>Autor:</th>
      <td class="recordAuthor">
        <a href="/Author?lookfor=%22Herzberg%2C+Agnes+M%22" property="author">Herzberg, Agnes M</a>
      </td>
    </tr>
    <tr valign="top">
      <th>Weitere Autoren:</th>
      <td class="recordSecAuthor">
        <a href="/Author?lookfor=%22Andrews%2C+David+F%22" property="author">Andrews, David F</a>
      </td>
    </tr>
...

Improve the author property

Check the results from various structured data parsers. Do they match your expectations? Look closely at the author value; you probably did not expect the value of the author property to be a URL. This is one of the subtleties of RDFa: <a> elements are special, in that the href attribute value is used as the RDFa property value rather than the content of the <a> element.
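The behaviour described above can be sketched side by side. The wrapping <div> and the <span> variant are illustrative assumptions, not part of the exercise files:

```html
<div vocab="http://schema.org/" typeof="Book">
  <!-- @property on an element that has @href: the property value is the
       URL from href, not the visible text "Herzberg, Agnes M" -->
  <a href="/Author?lookfor=%22Herzberg%2C+Agnes+M%22" property="author">Herzberg, Agnes M</a>

  <!-- @property on an element with no href, src, resource, or content
       attribute: the property value is the element's text content -->
  <span property="author">Herzberg, Agnes M</span>
</div>
```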

Let's fix that: move the @property="author" attribute to the <td> element that surrounds the <a> element. Run your structured data parsers again to ensure that you're getting the results that you expect.

Check your markup

<!DOCTYPE html>
...
<body vocab="http://schema.org/" typeof="Book">
...
    <tr valign="top">
      <th>Autor:</th>
      <td class="recordAuthor" property="author">
        <a href="/Author?lookfor=%22Herzberg%2C+Agnes+M%22">Herzberg, Agnes M</a>
      </td>
    </tr>
    <tr valign="top">
      <th>Weitere Autoren:</th>
      <td class="recordSecAuthor" property="author">
        <a href="/Author?lookfor=%22Andrews%2C+David+F%22">Andrews, David F</a>
      </td>
    </tr>
...
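As noted earlier, the author property expects a Person or Organization rather than plain text. One possible refinement, shown here only as a sketch and not as a step in this exercise, is to add @typeof="Person" alongside the property and mark up the visible text as that person's name. The inner <span> is an assumption introduced so that the name does not sit on an element with an href:

```html
<tr valign="top">
  <th>Autor:</th>
  <!-- @property together with @typeof makes the author a Person entity;
       properties inside the <td> then describe that Person, not the Book -->
  <td class="recordAuthor" property="author" typeof="Person">
    <a href="/Author?lookfor=%22Herzberg%2C+Agnes+M%22"><span property="name">Herzberg, Agnes M</span></a>
  </td>
</tr>
```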

Add a datePublished property for the type

Right now a date of publication is visible on the page, but as the data just lives inside an undifferentiated string of text, it would be difficult for a machine to know what the data means. To remove this uncertainty, wrap the date in a <time> tag and add the @property="datePublished" attribute.

Check your markup

<!DOCTYPE html>
...
<body vocab="http://schema.org/" typeof="Book">
...
    <tr valign="top">
      <th>Veröffentlicht:</th>
      <td>New York [u.a.] : Springer, <time property="datePublished">1985</time></td>
    </tr>
...

Checkpoint: Your HTML page should now look like step2/check_b.html

Add an image property for the Book type

Every type in schema.org can have an image property. One potential use: when a page contains multiple images, the image property tells a search engine which one to choose for a more visually attractive search result. Your catalogue page contains an image. Add the @property="image" attribute to the <img> element.

Check your markup

<!DOCTYPE html>
...
<body vocab="http://schema.org/" typeof="Book">
  <div style="float:right;">
    <img alt="Cover" class="recordcover" property="image" src="../cover.png">
  </div>
...

Add book-specific properties to the Book entity

When you look at the documentation for the schema.org Book type, one of the properties that is specific to the Book type is the bookEdition property, and our sample book says that it is a second edition, which just might be of interest to researchers. Add the @property="bookEdition" attribute to the corresponding <td> element.

Repeat for the isbn and numberOfPages properties.

Note: schema.org processors in particular understand that this level of granularity is not always possible in practice, and will do the best they can with the data they receive. So if the best you can do is mark the value of an ISBN in your web page as 9780545522458 (hbk.) : $12.99 instead of just the actual ISBN itself, processors may still be able to work out the actual value.
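The two levels of granularity in that note might look like the following sketch. The table rows are illustrative, and the ISBN string is the one quoted in the note rather than a value from this record:

```html
<table vocab="http://schema.org/" typeof="Book">
  <!-- Coarse: the whole cell, including binding and price, becomes the isbn value;
       a processor has to work out the actual ISBN on its own -->
  <tr>
    <th>ISBN:</th>
    <td property="isbn">9780545522458 (hbk.) : $12.99</td>
  </tr>

  <!-- Finer: only the ISBN itself is wrapped, so no guessing is needed -->
  <tr>
    <th>ISBN:</th>
    <td><span property="isbn">9780545522458</span> (hbk.) : $12.99</td>
  </tr>
</table>
```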

Check your markup

<!DOCTYPE html>
...
<body vocab="http://schema.org/" typeof="Book">
...
    <tr valign="top">
      <th>Sprache:</th>
      <td class="recordLanguage" property="inLanguage">German</td>
    </tr>
    <tr valign="top">
      <th>Ausgabe:</th>
      <td property="bookEdition">2. ed.</td>
    </tr>
    <tr valign="top">
      <th>Veröffentlicht:</th>
      <td>New York [u.a.] : Springer, <time property="datePublished">1985</time></td>
    </tr>
    <tr valign="top">
      <th>Umfang:</th>
      <td>XX, <span property="numberOfPages">442</span> S. : graph. Darst.</td>
    </tr>
    <tr valign="top">
      <th>ISBN:</th>
      <td><span property="isbn">3-540-96125-9</span><br>
          <span property="isbn">0-387-96125-9</span><br>
      </td>
    </tr>
...

Add a description property

You might have noticed that some of the RDFa parsers generate a rich snippet that shows you what your page might look like as a search result. You may also have noticed that the rich snippet did not contain much content of your page other than its title. To help search engines generate a better rich snippet, you should include a @property="description" attribute in your web page.

This record does not currently have a good description, so let's assume that you have flagged this record for enrichment and either you or your cataloguers have added an abstract. Use the publisher's description to create a new Abstract row, and add the @property="description" attribute to the appropriate <td> element.

Check your markup

<!DOCTYPE html>
...
<body vocab="http://schema.org/" typeof="Book">
...
    <tr valign="top">
      <th>Abstrakt</th>
      <td class="recordDesc" property="description">
        Statistics provides tools and strategies for the analysis of data. While much
        has been written about the methodology, sometimes without reference to data,
        little has been said about the data. In this volume we present sets of data
        obtained from many situations without any direct reference to a particular type
        of analysis. Our view of the usefulness of bringing together a broad collection
        of sets of data has been shared by many friends and contributors.
      </td>
    </tr>
...

Add an implicit property

If a book is part of a series, it can be helpful to provide some information about that series. One way of indicating that relationship is to use the isPartOf property, which was included for this purpose on the CreativeWork type and its children. However, while your page lists the series ("Springer series in statistics"), the link simply launches a series keyword search.

When you realize that a vocabulary has pointed out a possible deficiency in your work, you could revisit the web page and add a "Series (publisher link)" field that you could then use to classify all of your work. In this step, assume that you are working with a strict designer who forbids you from altering the look or content of the page. In that situation, your only option is to use a <link> element to define the property value for the machines.

Go ahead and add <link property="isPartOf" href="http://www.springer.com/series/692"> anywhere within the scope of the Book. The solutions add the element directly under the <body> element.

For properties that take a plain text value instead of a URL, you can use <meta property="propertyName" content="value"> to provide that implicit information.
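A minimal sketch of the two elements side by side follows. RDFa reads the text value of a <meta> element from its @content attribute. The publisher property is chosen here only for illustration, using the publisher name visible in the record; schema.org documents Organization or Person as the expected type for publisher, so treating it as plain text is a simplification.

```html
<body vocab="http://schema.org/" typeof="Book">
  <!-- URL value: <link> with @href -->
  <link property="isPartOf" href="http://www.springer.com/series/692">
  <!-- Plain text value: <meta> with @content (illustrative property choice) -->
  <meta property="publisher" content="Springer">
  <!-- the visible content of the page follows here -->
</body>
```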

Note: Do not use this approach as a license to stuff your web page full of lascivious keywords that have no connection to your content in the hopes of drawing a larger audience to your site. The search engines learned about this "spiderfood" tactic back in the 1990s and will punish your site mercilessly with low relevancy ranking if they determine that you have been trying to game their systems. The generally accepted best practice is to add machine-readable markup only to the same content that humans can see. Reserve <link> and <meta> elements for the most important purposes.

Check your markup

<!DOCTYPE html>
...
<body vocab="http://schema.org/" typeof="Book">
  <link property="isPartOf" href="http://www.springer.com/series/692">
...

Checkpoint: Your HTML page should now look like step2/check_c.html

Lessons learned

In this exercise, you learned:

The basics of structured data vocabularies: types, properties, and type inheritance
How to navigate the schema.org vocabulary documentation
How to define the default vocabulary for a Web page using RDFa
How to define types and properties in a Web page using RDFa
How to use common structured data validators to check your work
How to use the @content attribute to supply machine-readable versions of human-oriented data
How to use the <link> and <meta> elements to supply properties that would not otherwise be part of the content
