In this codelab, you're going to take a catalogue page that describes a book and enhance it so that it contains structured data. You will use the schema.org vocabulary and express it via RDFa attributes.
Audience: Beginner
Prerequisites: To complete this codelab, you will need a basic familiarity with HTML. The exercises can be found in codelab.zip, with the solutions found in the rdfa_exercises subdirectory. There are frequent checkpoints throughout the codelab, so if you get stuck at any point, you can use the checkpoint file to resume and work through this codelab at your own pace.
In this exercise, you will learn the basic steps required to add simple RDFa structured data to an existing library catalogue page for a book.
Open step1/rdfa_book.html
in an HTML-friendly text editor such as vim, emacs, nano, the Chrome
Dev Editor... anything but Notepad! You should see something like the following
HTML source for the web page:
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>Data: a collection of problems from many fields for the student and research worker</title>
</head>
<body>
<div id="record">
<div style="float:right;">
<img alt="Cover" class="recordcover" src="../cover.png">
</div>
<h1>Data</h1>
<h2>a collection of problems from many fields for the student and research worker </h2>
<h4>D. F. Andrews ; A. M. Herzberg</h4>
<table class="citation" summary="Bibliographische Detailangaben">
<tr valign="top">
<th>Autor:</th>
<td class="recordAuthor"><a href="/Summon/Search?lookfor=%22Herzberg%2C+Agnes+M%22&type=Author">Herzberg, Agnes M</a></td>
</tr>
<tr valign="top">
<th>Weitere Autoren:</th>
<td class="recordSecAuthor"><a href="/Summon/Search?lookfor=%22Andrews%2C+David+F%22&type=Author">Andrews, David F</a></td>
</tr>
<tr valign="top">
<th>Format:</th>
<td class="recordFormat"><span class="iconlabel book">Buch</span></td>
</tr>
<tr valign="top">
<th>Sprache:</th>
<td class="recordLanguage">German</td>
</tr>
<tr valign="top">
<th>Veröffentlicht:</th>
<td>New York [u.a.] : Springer, 1985</td>
</tr>
<tr valign="top">
<th>Umfang:</th>
<td>XX, 442 S. : graph. Darst.</td>
</tr>
<tr valign="top">
<th>ISBN:</th>
<td>3-540-96125-9<br>
0-387-96125-9<br>
</td>
</tr>
</table>
</div>
</body>
</html>
Note: In a pinch, you can use the browser development tools to view and edit the source of the web page (Ctrl+Shift+I in Chrome or Firefox, in the Elements or Inspector tab respectively).
There are a number of RDFa parsers, both online and locally installable, that can help you check the results of your work. Copy and paste the HTML source into each of the following online structured data extraction tools:
The results should (not surprisingly!) show that the page currently contains no structured data.
RDFa (Resource Description Framework in Attributes) enables us to embed descriptions of things (types) and their properties within HTML documents using just a handful of HTML attributes.
To avoid a tower of Babel situation where one person uses the type name "author" to refer to the same concept that someone else calls a "writer", collections of types and their properties are typically standardized and published as a vocabulary (also known as an ontology).
Each type and property is expected to have a dereferenceable URI so that
you (or more realistically the machines) can look up the definition of
the vocabulary element and determine its relationship (if any) to other
vocabulary elements. For example, you can look up http://schema.org/Book and learn that it is a more specific type of CreativeWork, which is in turn a more specific type of Thing.
You could use the full URI for each vocabulary element, but that would
be extremely verbose - especially given vocabularies that publish URIs
like http://rdaregistry.info/Elements/a/countryAssociatedWithThePerson.
Therefore, RDFa offers the @vocab attribute; if you add a vocab="http://<path/for/vocab>" attribute to an HTML element, the specified value is automatically prepended to the values of any RDFa @typeof and @property attributes within its scope.
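As a rough sketch of what @vocab saves you, the two fragments below should yield the same type and property in an RDFa 1.1 parser. The <div> and <span> wrappers are illustrative only and are not part of the exercise files.

```html
<!-- Without @vocab: every type and property is spelled out as a full URI -->
<div typeof="http://schema.org/Book">
  <span property="http://schema.org/name">Data</span>
</div>

<!-- With @vocab: the short terms are resolved against http://schema.org/ -->
<div vocab="http://schema.org/" typeof="Book">
  <span property="name">Data</span>
</div>
```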
We're going to use the schema.org vocabulary for our exercise, as it
includes types and properties that enable us to describe many things of
general interest without having to mix and match multiple vocabularies.
Declare the default vocabulary for the HTML document
as http://schema.org/
on the <body>
element.
Note: Do not forget the trailing slash (/
)!
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>Data: a collection of problems from many fields for the student and research worker</title>
</head>
<body vocab="http://schema.org/">
...
Checkpoint: Your HTML page should now look like step1/check.html
Many vocabularies focus on a particular domain; for example:
In practice, documents often end up using types and properties from several different vocabularies. While vocabulary description languages like RDF Schema (RDFS) and the Web Ontology Language (OWL) offer ways to express equivalence between types and properties of different vocabularies, it can still be extremely complex to publish and consume mixed-vocabulary documents.
schema.org, on the other hand, tries to provide a vocabulary that can describe almost everything, albeit in many cases with less granularity than more specialized vocabularies.
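For comparison, a mixed-vocabulary fragment might look like the following sketch, which uses the RDFa @prefix attribute (not otherwise covered in this codelab) to pull in the Dublin Core terms vocabulary alongside schema.org. The choice of Dublin Core and the wrapper elements are assumptions made for illustration.

```html
<!-- Two vocabularies in one description: schema.org as the default,
     Dublin Core terms through the "dc:" prefix -->
<div vocab="http://schema.org/"
     prefix="dc: http://purl.org/dc/terms/"
     typeof="Book">
  <!-- The same text is asserted as both the schema.org name and dc:title -->
  <span property="name dc:title">Data</span>
</div>
```

A consumer that only understands one of the two vocabularies has to decide what to do with the other property, which hints at why the rest of this codelab sticks to schema.org alone.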
Unless declared otherwise, web pages are assumed to have a type of WebPage. The choice of type is important as it dictates which properties you can "legally" use, so this section will help you find a more specific match for your purposes.
The schema.org types are arranged in a top-down hierarchy. Starting at
the top level of the type
hierarchy, browse through the CreativeWork
type hierarchy. Notice how each type inherits the properties of its parent (beginning with Thing), offers its own more specific definition of its raison d'être, and may add its own properties to enable you to describe it more completely.
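To make the inheritance concrete, here is a small sketch of a Book description that draws on properties declared at three levels of the hierarchy. The wrapper elements are illustrative, and the values are taken from the sample record.

```html
<div vocab="http://schema.org/" typeof="Book">
  <!-- name is declared on Thing, so every type can use it -->
  <span property="name">Data</span>
  <!-- author is declared on CreativeWork, the parent of Book -->
  <span property="author">Herzberg, Agnes M</span>
  <!-- isbn is declared on Book itself -->
  <span property="isbn">0-387-96125-9</span>
</div>
```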
To declare an RDFa type for an HTML document, add the
@typeof
attribute to the <body>
element
and set the value of the attribute to Book
.
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>Data: a collection of problems from many fields for the student and research worker</title>
</head>
<body vocab="http://schema.org/" typeof="Book">
...
Checkpoint: Your HTML page should now look like step2/check_a.html
Every schema.org type has a name
property available to it, because the
property is declared on the Thing
type from which every other type
inherits. In the case of a Book, the title of the book is mapped to its name. Go ahead and add a @property="name" attribute to the <h1> element to assert that the content of that element is the name of the book.
Note: You might be tempted to add the attribute to the <title> element of the HTML document, but this would fall outside of the scope of your @typeof attribute. And while a search engine would likely guess that matching <title> and <h1> content on a given web page is the title of the work, your explicit assertion of that property is stronger than an inference.
Note: As the title is split between an <h1>
element
for the main title and an <h2>
element for the subtitle, you
can simply wrap both elements in a new <div>
element to designate
the combination of the two as the value of the name
property.
<!DOCTYPE html>
...
<body vocab="http://schema.org/" typeof="Book">
...
<div property="name">
<h1>Data</h1>
<h2>a collection of problems from many fields for the student and research worker </h2>
</div>
...
This book has an author, and if you check the documentation for Book you will find that
there is indeed an author
property. Notice that the expected
type of the author
property is either a Person
or Organization
type. For now, go ahead and add the
@property="author"
attribute to the <a>
element for each of the authors' names.
Note: You might be tempted to add the attribute to the <tr> element of the HTML document, but the scope of the <tr> element includes more than just the name of the author, so you would be asserting (falsely!) that the author was "Autor: Herzberg, Agnes M".
<!DOCTYPE html>
...
<body vocab="http://schema.org/" typeof="Book">
...
<tr valign="top">
<th>Autor:</th>
<td class="recordAuthor">
<a href="/Author?lookfor=%22Herzberg%2C+Agnes+M%22" property="author">Herzberg, Agnes M</a>
</td>
</tr>
<tr valign="top">
<th>Weitere Autoren:</th>
<td class="recordSecAuthor">
<a href="/Author?lookfor=%22Andrews%2C+David+F%22" property="author">Andrews, David F</a>
</td>
</tr>
...
Check the results from various structured data parsers. Do they match
your expectations? Look closely at the author
value; you
probably did not expect the value of the author
property
to be a URL. This is one of the subtleties of RDFa: <a> elements are special, in that the href attribute value is used as the RDFa property value rather than the content of the <a> element.
Let's fix that: move the @property="author" attribute to the <td> element that surrounds the <a> element. Run your structured data parsers again to ensure that you're getting the results that you expect.
<!DOCTYPE html>
...
<body vocab="http://schema.org/" typeof="Book">
...
<tr valign="top">
<th>Autor:</th>
<td class="recordAuthor" property="author">
<a href="/Author?lookfor=%22Herzberg%2C+Agnes+M%22">Herzberg, Agnes M</a>
</td>
</tr>
<tr valign="top">
<th>Weitere Autoren:</th>
<td class="recordSecAuthor" property="author">
<a href="/Author?lookfor=%22Andrews%2C+David+F%22">Andrews, David F</a>
</td>
</tr>
...
Right now a date of publication is visible on the page, but as the data just lives inside an undifferentiated string of text, it would be difficult for a machine to know what the data means. To remove this uncertainty, wrap the date in a <time> element and add the @property="datePublished" attribute.
<!DOCTYPE html>
...
<body vocab="http://schema.org/" typeof="Book">
...
<tr valign="top">
<th>Veröffentlicht:</th>
<td>New York [u.a.] : Springer, <time property="datePublished">1985</time></td>
</tr>
...
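If the visible date were not already machine-friendly, the datetime attribute of the <time> element could carry the machine-readable form, which HTML-aware RDFa parsers should prefer over the element's text. The display text "c1985" below is a hypothetical variation on the sample record and is not what the checkpoint file contains.

```html
<tr valign="top">
  <th>Veröffentlicht:</th>
  <!-- Hypothetical: the catalogue displays "c1985", while datetime
       supplies the plain year for machines -->
  <td>New York [u.a.] : Springer,
    <time property="datePublished" datetime="1985">c1985</time>
  </td>
</tr>
```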
Checkpoint: Your HTML page should now look like step2/check_b.html
Every type in schema.org can have an image property. Search engines can use the image property to choose the appropriate image from a page that might contain multiple images, and so provide a more visually attractive search result. Your catalogue page contains an image. Add the @property="image" attribute to the <img> element.
<!DOCTYPE html>
...
<body vocab="http://schema.org/" typeof="Book">
<div style="float:right;">
<img alt="Cover" class="recordcover" property="image" src="../cover.png">
</div>
...
When you look at the documentation for the schema.org Book type, one of the properties that is specific to the Book type is the bookEdition property, and our sample book says that it is a second edition, which just might be of interest to researchers. Add the @property="bookEdition" attribute to the corresponding <td> element. Repeat for the inLanguage, isbn, and numberOfPages properties, wrapping the value in a <span> element where the table cell contains other text as well.
Note:
schema.org processors in particular understand that this level of granularity
is not always possible in practice, and will do the best they can with the data
they receive. So if the best you can do is mark the value of an ISBN in your
web page as 9780545522458 (hbk.) : $12.99
instead of just the
actual ISBN itself, processors may still be able to work out the actual value.
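When the displayed text cannot be split up cleanly, one option is the RDFa @content attribute, which supplies the machine-readable value while leaving the visible text alone. A minimal sketch, reusing the ISBN string from the note above (it is not part of this codelab's sample record):

```html
<tr valign="top">
  <th>ISBN:</th>
  <!-- Parsers take the value from @content; people still see the full text -->
  <td property="isbn" content="9780545522458">9780545522458 (hbk.) : $12.99</td>
</tr>
```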
<!DOCTYPE html>
...
<body vocab="http://schema.org/" typeof="Book">
...
<tr valign="top">
<th>Sprache:</th>
<td class="recordLanguage" property="inLanguage">German</td>
</tr>
<tr valign="top">
<th>Ausgabe:</th>
<td property="bookEdition">2. ed.</td>
</tr>
<tr valign="top">
<th>Veröffentlicht:</th>
<td>New York [u.a.] : Springer, <time property="datePublished">1985</time></td>
</tr>
<tr valign="top">
<th>Umfang:</th>
<td>XX, <span property="numberOfPages">442</span> S. : graph. Darst.</td>
</tr>
<tr valign="top">
<th>ISBN:</th>
<td><span property="isbn">3-540-96125-9</span><br>
<span property="isbn">0-387-96125-9</span><br>
</td>
</tr>
...
You might have noticed that some of the RDFa parsers generate a rich
snippet that shows you what your page might look
like as a search result. You may also have noticed that the rich snippet
did not contain much content of your page other than its title. To help
search engines generate a better rich snippet, you should include a
@property="description"
attribute in your web page.
This record does not currently have a good description, so let's assume
that you have flagged this record for enrichment and either you or your
cataloguers have added an abstract. Use the publisher's description to create a new Abstract row, and add the @property="description" attribute to the appropriate <td> element.
<!DOCTYPE html>
...
<body vocab="http://schema.org/" typeof="Book">
...
<tr valign="top">
<th>Abstrakt</th>
<td class="recordDesc" property="description">
Statistics provides tools and strategies for the analysis of data. While much
has been written about the methodology, sometimes without reference to data,
little has been said about the data. In this volume we present sets of data
obtained from many situations without any direct reference to a particular type
of analysis. Our view of the usefulness of bringing together a broad collection
of sets of data has been shared by many friends and contributors.
</td>
</tr>
...
If a book is part of a series, it can be helpful to provide some information
about that series. One way of indicating that relationship is to use the
isPartOf
property, which was included for this purpose on the
CreativeWork
type and its children. However, while your page
lists the series ("Springer series in statistics"), the link simply launches
a series keyword search.
When you realize that a vocabulary has pointed out a possible deficiency in your work, you could revisit the web page and add a "Series (publisher link)" field that you could then use to classify all of your work. In this step, assume that you are working with a strict designer who forbids you from altering the look or content of the page. In that situation, your only option is to use a <link> element to define the property value for the machines.
Go ahead and add <link property="isPartOf" href="http://www.springer.com/series/692">
anywhere within the scope of the Book
. The solutions
add the element directly under the <body>
element.
For properties that take a plain text value instead of a URL, you can use <meta property="propertyName" content="value"> to provide that implicit information.
Note: Do not use this approach as a license to stuff your web page full of lascivious keywords that have no connection to your content in the hopes of drawing a larger audience to your site. The search engines learned about this "spiderfood" tactic back in the 1990s and will punish your site mercilessly with a low relevancy ranking if they determine that you have been trying to game their systems. The generally accepted best practice is to add machine-readable markup only to the same content that humans can see. Reserve <meta> elements for the most important purposes.
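As an illustration of the <meta> form described above, the sketch below supplies the publisher name as a plain-text value next to the <link> element used for the series. The choice of the publisher property is an assumption for illustration: schema.org expects an Organization or Person there, so plain text is a simplification, and the solution files may not include this element.

```html
<body vocab="http://schema.org/" typeof="Book">
  <!-- Plain-text value: <meta> with @content ("Springer" comes from the
       "Veröffentlicht" row of the sample record) -->
  <meta property="publisher" content="Springer">
  <!-- URL value: <link> with @href -->
  <link property="isPartOf" href="http://www.springer.com/series/692">
  ...
</body>
```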
<!DOCTYPE html>
...
<body vocab="http://schema.org/" typeof="Book">
<link property="isPartOf" href="http://www.springer.com/series/692">
...
Checkpoint: Your HTML page should now look like step2/check_c.html
In this exercise, you learned:
How to use the @content attribute to supply machine-readable versions of human-oriented data
How to use the <meta> element to supply properties that would not otherwise be part of the content
Next codelab: Book - embedded types
Dan Scott is a systems librarian at Laurentian University.
This work
is licensed under a Creative
Commons Attribution-ShareAlike 4.0 International License.