Structured data with schema.org codelab

By Dan Scott,

About this codelab

In this codelab, you're going to take a simple web page that contains a codelab and enhance it so that it contains structured data. You will use the schema.org vocabulary and express it via RDFa attributes.

Audience: Beginner

Prerequisites: To complete this codelab, you will need a basic familiarity with HTML. The exercises can be found in schema_org_codelab.zip, with the solutions in the exercises subdirectory. There are frequent checkpoints throughout the codelab, so if you get stuck at any point, you can use the checkpoint file to resume and work through this codelab at your own pace.

Exercise 1: From basic HTML to RDFa: first steps

In this exercise, you will learn the basic steps required to add simple RDFa structured data to an existing web page.

Step 1(a): View the page source HTML

Open exercise1.html in a text editor. You should see something like the following HTML source for the web page:

<!DOCTYPE html>
<html>
<head>
    <meta charset="utf-8">
    <title>Structured data with schema.org codelab</title>
</head>
<body>
<h1>Structured data with schema.org codelab</h1>
    <img style="float:right" src="squares.png" />
    <p class="byline">
        By <a href="http://example.com/AuthorName">Author Name</a>,
        January 29, 2014
    </p>
<h2>About this codelab</h2>
<h2>Exercise 1: From basic HTML to RDFa: first steps</h2>
<h2>Exercise 2: Embedded types</h2>
<h2>Exercise 3: From strings to things</h2>
</body>
</html>

Note: In a pinch, you can use the browser development tools to view and edit the source of the web page (Ctrl+Shift+I in Chrome or Firefox, in the Elements or Inspector tab respectively).

Step 1(b): Extract and view structured data in HTML

There are a number of RDFa parsers, both online and locally installable, that can help you check the results of your work. Copy and paste the HTML source into each of the following online structured data extraction tools:

The results should (not surprisingly!) show that the page currently contains no structured data.

Step 1(c): Add the RDFa vocabulary declaration

Preamble

RDFa (Resource Description Framework in Attributes) enables us to embed descriptions of things (types) and their properties within HTML documents using just a handful of HTML attributes.

To avoid a tower of Babel situation where one person uses the type name "author" to refer to the same concept that someone else calls a "writer", collections of types and their properties are typically standardized and published as a vocabulary (also known as an ontology).

Each type and property is expected to have a dereferenceable URI so that you (or more realistically the machines) can look up the definition of the vocabulary element and determine its relationship (if any) to other vocabulary elements. For example, you can look up http://schema.org/TechArticle and learn that it is a subclass of the Thing / CreativeWork / Article hierarchy.

You could use the full URI for each vocabulary element, but that would be extremely verbose - especially given vocabularies that publish URIs like http://rdaregistry.info/Elements/a/countryAssociatedWithThePerson. Therefore, RDFa offers the @vocab attribute; if you add a vocab="http://<path/for/vocab>" attribute to an HTML element, the specified value is automatically prepended to the values of any RDFa @typeof and @property attributes within its scope.
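To make the effect of @vocab concrete, here is a minimal sketch that expresses the same two statements with and without it. The <div> and <span> wrappers are illustrative assumptions and are not part of the exercise files.

```html
<!-- Without @vocab: every type and property is spelled out as a full URI. -->
<div typeof="http://schema.org/TechArticle">
    <span property="http://schema.org/name">Structured data with schema.org codelab</span>
</div>

<!-- With @vocab: the short names are resolved against http://schema.org/. -->
<div vocab="http://schema.org/" typeof="TechArticle">
    <span property="name">Structured data with schema.org codelab</span>
</div>
```

Both fragments should produce equivalent results in an RDFa 1.1 parser; the second is simply easier to read and maintain.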

Declare the schema.org vocabulary as your default

We're going to use the schema.org vocabulary for our exercise, as it includes types and properties that enable us to describe many things of general interest without having to mix and match multiple vocabularies. Declare the default vocabulary for the HTML document as http://schema.org/ on the <body> element. Note: Do not forget the trailing slash (/)!
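As a sketch of this step, only the opening <body> tag of exercise1.html needs to change. The comment stands in for the rest of the page, which stays as it was.

```html
<body vocab="http://schema.org/">
<h1>Structured data with schema.org codelab</h1>
<!-- The rest of the page is unchanged in this step. -->
</body>
```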

Checkpoint: Your HTML page should now look like check_1c.html

Specialized versus general vocabularies

Many vocabularies focus on a particular domain; for example:

In practice, documents often end up using types and properties from several different vocabularies. While vocabulary description languages like RDF Schema (RDFS) and the Web Ontology Language (OWL) offer ways to express equivalence between types and properties of different vocabularies, it can still be extremely complex to publish and consume mixed-vocabulary documents.

schema.org, on the other hand, tries to provide a vocabulary that can describe almost everything, albeit in many cases with less granularity than more specialized vocabularies.

Step 1(d): Add the type and associated properties for your page

Preamble

Unless declared otherwise, web pages are assumed to have a type of WebPage. The choice of type is important as it dictates which properties you can "legally" use, so this section will help you find a more specific match for your purposes.

Browse the schema.org type hierarchy

The schema.org types are arranged in a top-down hierarchy. Starting at the top level of the type hierarchy, browse through the CreativeWork -> Article -> TechArticle type hierarchy. Notice how each type inherits the properties from its parent (beginning with Thing), offers its own more specific definition for its raison d'etre, and may add its own properties to enable you to describe it more completely.

Add the type declaration for the document

To declare an RDFa type for an HTML document, add the @typeof attribute to the <body> element and set the value of the attribute to TechArticle.
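Assuming the vocabulary declaration from the previous step is already in place, the opening <body> tag might now look like the following sketch.

```html
<!-- typeof="TechArticle" is resolved against @vocab to http://schema.org/TechArticle. -->
<body vocab="http://schema.org/" typeof="TechArticle">
<h1>Structured data with schema.org codelab</h1>
<!-- The rest of the page is unchanged in this step. -->
</body>
```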

Checkpoint: Your HTML page should now look like check_1d.html

Add a name property for the type

Every schema.org type has a name property available to it, because the property is declared on the Thing type from which every other type inherits. In the case of a TechArticle, the title of the article is mapped to its name. Go ahead and add a @property="name" attribute to the <h1> element to assert that the content of that element is the name of the technical article.

Note: You might be tempted to add the attribute to the <title> element of the HTML document, but this would fall outside of the scope of your @typeof attribute. And while a search engine would likely make a best guess that, if the content of the <title> and <h1> for a given web page match then that's likely the title, your explicit assertion of that property is stronger than an inference.
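A sketch of the resulting heading, shown inside the <body> element so that the scope of the type declaration is visible:

```html
<body vocab="http://schema.org/" typeof="TechArticle">
<!-- The text content of the <h1> becomes the value of the name property. -->
<h1 property="name">Structured data with schema.org codelab</h1>
<!-- The rest of the page is unchanged in this step. -->
</body>
```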

Add an author property for the type

This article has an author, and if you check the documentation for TechArticle you will find that there is indeed an author property. Notice that the expected type of the author property is either a Person or Organization type. For now, go ahead and add the @property="author" attribute to the <a> element for the author's name.

Note: You might be tempted to add the attribute to the <p class="byline"> element of the HTML document, but the scope of the <p> element includes more than just the name of the author, so you would be asserting (falsely!) that the author was "By Author Name, January 29, 2014".
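As a first attempt, the byline might look like the following sketch. It is an intermediate form that is revisited under "Improve the author property".

```html
<p class="byline">
    <!-- The property sits on the <a> element only, not on the whole byline. -->
    By <a property="author" href="http://example.com/AuthorName">Author Name</a>,
    January 29, 2014
</p>
```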

Add a datePublished property for the type

Right now a date of publication is visible on the page, but as the data just lives inside an undifferentiated string of text, it would be difficult for a machine to know what the data means. To remove this uncertainty, wrap the date in a <time> tag and add the @property="datePublished" attribute.
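A sketch of the byline after wrapping the date, assuming the author markup from the previous step:

```html
<p class="byline">
    By <a property="author" href="http://example.com/AuthorName">Author Name</a>,
    <!-- The <time> element isolates the date from the surrounding text. -->
    <time property="datePublished">January 29, 2014</time>
</p>
```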

Checkpoint: Your HTML page should now look like check_1d_i.html

Improve the author property

Check the results from various structured data parsers. Do they match your expectations? Look closely at the author value; you probably did not expect the value of the author property to be a URL. This is one of the subtleties of RDFa: when an element has an href attribute, the href value is used as the property value rather than the content of the <a> element.

Let's fix that: add a new span element that wraps the <a> tag, and move the @property="author" attribute to the new span tag. Run your structured data parsers again to ensure that you're getting the results that you expect.
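The change might be sketched as follows. The wrapping <span> has no href attribute of its own, so a parser should fall back to its text content for the property value.

```html
<p class="byline">
    <!-- author is now the literal text "Author Name" rather than the link target. -->
    By <span property="author"><a href="http://example.com/AuthorName">Author Name</a></span>,
    <time property="datePublished">January 29, 2014</time>
</p>
```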

Improve the datePublished property

If you check the datePublished documentation, you will find that the expected value of the property is Date, which in turn is defined as a date value in ISO 8601 format. For a date, then, the expectation is a value in either YYYYMMDD or YYYY-MM-DD format.

You could change the human-visible content to match the ISO 8601 standard, but, while that will make the machines happier, it might confuse some of the poor humans in the audience who do not understand perfectly logical date formats. Let's make them both happy by supplying an inline, properly formatted value using the @content attribute. Go ahead and add @content="2014-01-29" to the <time> element.
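A sketch of the <time> element with both the machine-readable and the human-readable form of the date:

```html
<!-- Parsers read the @content value; people read the element text. -->
<time property="datePublished" content="2014-01-29">January 29, 2014</time>
```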

Add an image property for the TechArticle type

Every type in schema.org can have an image property. One potential use case for search engines is to use image entities to provide a more visually attractive search result. Your technical article contains an image (let's not discuss whether it is visually attractive!). Add the @property="image" attribute to the <img> element.
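As a sketch, the image element might become the following. Much like href on a link, the src attribute supplies the property value, so a parser should report the resolved URL of squares.png.

```html
<img style="float:right" src="squares.png" property="image" />
```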

Add an articleBody property for the TechArticle type

You can help processors of your article find the actual substance of the article. Assume that all of the relevant content can be found in the sections under the <h2> elements... so once again, you need to add a new element strictly to support structured data. In this case, add a <div> element that wraps all of the <h2> elements. Include the @property="articleBody" attribute on the new element.
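A sketch of the new wrapper around the section headings:

```html
<!-- The <div> exists only to carry the articleBody property. -->
<div property="articleBody">
    <h2>About this codelab</h2>
    <h2>Exercise 1: From basic HTML to RDFa: first steps</h2>
    <h2>Exercise 2: Embedded types</h2>
    <h2>Exercise 3: From strings to things</h2>
</div>
```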

Checkpoint: Your HTML page should now look like check_1d_ii.html

Step 1(e): Add a description property

You might have noticed that some of the RDFa parsers generate a rich snippet that shows you what your page might look like as a search result. You may also have noticed that the rich snippet did not contain much content of your page other than its title. To help search engines generate a better rich snippet, you should include a @property="description" attribute in your web page.

Move the <div property="articleBody"> element down so that it wraps only the "Exercise" <h2> elements. Then add a new <div property="description"> element that wraps only the "About this codelab" <h2> element.
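After this rearrangement, the two wrappers might look like the following sketch:

```html
<!-- description now covers only the introductory section. -->
<div property="description">
    <h2>About this codelab</h2>
</div>
<!-- articleBody covers only the exercises. -->
<div property="articleBody">
    <h2>Exercise 1: From basic HTML to RDFa: first steps</h2>
    <h2>Exercise 2: Embedded types</h2>
    <h2>Exercise 3: From strings to things</h2>
</div>
```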

Step 1(f): Add an implicit property

It can be useful for technical articles to include an indication of their intended use. For example, this is a codelab, intended for hands-on learning that reinforces concepts that are introduced gradually throughout the codelab. Fortunately, schema.org offers the educationalUse property for this purpose on the CreativeWork type and its children. However, your page does not include an obvious place to attach this markup.

When you realize that a vocabulary has pointed out a possible deficiency in your work, you could revisit the web page and add an "Intended use" field that you could then use to classify all of your work. In this step, assume that you are working with a strict designer who forbids you from altering the look or content of the page. In that situation, your only option is to use a <meta> element to define the property value for the machines.

Go ahead and add <meta property="educationalUse" content="codelab"> anywhere within the scope of the TechArticle. The solutions add the element directly under the <h1> element.

Note: Do not use this approach as a license to stuff your web page full of lascivious keywords that have no connection to your content in the hopes of drawing a larger audience to your site. The search engines learned about this "spiderfood" tactic back in the 1990s and will punish your site mercilessly with a low relevancy ranking if they determine that you have been trying to game their systems. The generally accepted best practice is to add machine-readable markup only to the same content that humans can see. Reserve <meta> elements for only the most important purposes.

Checkpoint: Your HTML page should now look like check_1f.html

Bonus step

Go back to the TechArticle page and view the source for the Properties from CreativeWork table. Notice that the type and property hierarchies are themselves defined in RDFa, using the rdfs:subClassOf property from the RDF Schema vocabulary (a vocabulary for defining vocabularies) together with schema.org's own domainIncludes property.

Lessons learned

In this exercise, you learned:

Exercise 2: Embedded types

So far you have described the page using a single type and a handful of properties. However, when you added the @property="author" attribute, the expected value for the property (the range) was not a simple text string; it was supposed to be either a Person or Organization type.

In this exercise, you will add several embedded types to the page to conform to the vocabulary definition and make your structured data even more useful.

Continue working with the HTML file that you have been editing so far, or for a fresh start, copy check_1f.html into a new file. As a refresher, your HTML should now look something like the following:

<!DOCTYPE html>
<html>
<head>
    <meta charset="utf-8">
    <title>Structured data with schema.org codelab</title>
</head>
<body vocab="http://schema.org/" typeof="TechArticle">
<h1 property="name">Structured data with schema.org codelab</h1>
    <img style="float:right" src="squares.png" property="image" />
    <meta property="educationalUse" content="codelab">
    <p class="byline">
        By <span property="author"><a href="http://example.com/AuthorName">Author Name</a></span>,
        <time property="datePublished" content="2014-01-29">January 29, 2014</time>
    </p>
<div property="description">
    <h2>About this codelab</h2>
</div>
<div property="articleBody">
    <h2>Exercise 1: From basic HTML to RDFa: first steps</h2>
    <h2>Exercise 2: Embedded types</h2>
    <h2>Exercise 3: From strings to things</h2>
</div>
</body>
</html>

Step 2(a): Define the Person type

Your @property="author" attribute needs to define a Person type to satisfy the expected value of author. Simply add the @typeof="Person" attribute to the same HTML element so that you are, in one step, defining the author property for the overall TechArticle type, while simultaneously starting a new Person type scope.
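A sketch of the byline with the embedded type added:

```html
<p class="byline">
    <!-- property attaches the Person to the TechArticle; typeof starts the Person scope. -->
    By <span property="author" typeof="Person"><a href="http://example.com/AuthorName">Author Name</a></span>,
    <time property="datePublished" content="2014-01-29">January 29, 2014</time>
</p>
```

At this point the Person has a type but no properties of its own yet, so parsers will typically report it as an anonymous (blank) node.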

Step 2(a): Define basic properties of the Person type

Now that you have defined a Person type, you can define specific properties for it.

Declare that the <a> element is the sameAs property of the Person. While url might be tempting, it is usually reserved for linking to a URL where the thing that is described is available, whereas sameAs is used to link to a description of the thing.

Declare that the person's name is the name property of the Person type. For bonus points, nest the givenName and familyName properties inside of the name property.

Tip: Remember that you might need to add <span> tags to create a new scope for the properties that you want to add.
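One way to lay out these properties is sketched below, including the bonus givenName and familyName markup. The name "Dan Scott" stands in for the placeholder "Author Name" so that the name can be split into two parts.

```html
<p class="byline">
    By <span property="author" typeof="Person">
        <!-- sameAs takes its value from @href; the nested properties still describe the Person. -->
        <a href="http://example.com/AuthorName" property="sameAs"><span
            property="name"><span property="givenName">Dan</span>
            <span property="familyName">Scott</span></span></a></span>,
    <time property="datePublished" content="2014-01-29">January 29, 2014</time>
</p>
```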

Checkpoint: Your HTML page should now look like check_2a.html

Step 2(b): Declare that the same Person is both the author and copyrightHolder

Copyright is an important subject for both creators and organizations and individuals seeking to reuse or republish work, so naturally schema.org includes a copyrightHolder property that you can apply. In this case, however, the author and the copyrightHolder are one and the same, and you have already used the @property attribute.

To assert multiple properties with a single @property attribute, simply list the property names separated by whitespace. In this case, edit the HTML to declare @property="author copyrightHolder" and check your work in one or more structured data validators.
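As a sketch, the opening tag of the Person scope might become the following. The inner markup is reduced to a single name property to keep the example short.

```html
<!-- Both properties point at the same embedded Person. -->
<span property="author copyrightHolder" typeof="Person">
    <a href="http://example.com/AuthorName" property="sameAs"><span
        property="name">Author Name</span></a></span>
```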

Note: These are still relatively early days for structured data validators, and their output varies for more esoteric cases like multi-valued attributes. For example, the Structured Data Linter recognizes the second value for copyrightHolder but generates a "blank node" identifier for it, whereas Google's Structured Data Testing Tool only recognizes the last value of the multi-valued attribute. To complicate matters further, the search engines recognize that their tools have bugs that differ from what their actual production parser understands... so don't be overly alarmed if it seems like your markup is not being recognized by the testing tool.

Step 2(c): Use the @resource attribute to group assertions for a type

Sometimes your HTML document does not group all of the content in such a way that you can cleanly keep all of the attributes for a given instance of a type within a single scope. In these cases, you may be able to use the @resource attribute to logically group the properties for that instance.

For example, assume that you have been asked to add the following author biography to the bottom of your technical article: <p>Dan Scott is a systems librarian at Laurentian University.</p> Now you have the perfect description property for your author--but it is separated from your existing Person instance by all of the content in the middle.

To resolve the problem, simply add a @resource attribute to your existing Person declaration. The value of the new attribute should be unique on this page; use "#authorName" for the sake of simplicity.

Add a wrapping <div> element around the biography section, including a @resource attribute with a value of "#authorName" to match what you added above. This creates a new scope for the existing type instance, such that any properties declared within this new scope will be added to the existing type instance.

Now add a @property="description" attribute to the <p> element for the biography, and check your work in the RDFa parser tools.
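The two separated pieces of markup might be sketched as follows. The inner byline markup is reduced to a single name property, and the comment in the middle stands in for the rest of the article.

```html
<p class="byline">
    <!-- @resource gives the Person an identifier that can be reused later on the page. -->
    By <span property="author" typeof="Person" resource="#authorName">
        <a href="http://example.com/AuthorName" property="sameAs"><span
            property="name">Author Name</span></a></span>
</p>

<!-- The description and article body sit here. -->

<!-- The same identifier reopens the scope of the existing Person. -->
<div resource="#authorName">
    <p property="description">Dan Scott is a systems librarian at Laurentian University.</p>
</div>
```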

Checkpoint: Your HTML page should now look like check_2c.html

Bonus exercise: Add more structured data for the Person type

Given the new author biography, there are several other structured data assertions you can now make on the Person type:

Lessons learned

In this exercise, you learned:

Exercise 3: From strings to things

So far you have described the page using types and properties that are inside the page itself. But if you have to update some information that is common to many of your pages, that could be painful to roll out... and even if you have an automated process for updating that information across all of your pages, there is no guarantee that anything extracting data from your site will extract all of the updates at one time.

Fortunately, the problem of providing one copy of information on the web was solved at the same time the web was created: via the simple power of the link! And structured data is no different; in fact, linked data is a term that has emerged over the past few years marking a more pragmatic approach to building a web of structured data than the somewhat classically academic semantic web.

The following principles of linked data were first articulated by Tim Berners-Lee in a 2006 design note:

  1. Use URIs as names for things.
  2. Use HTTP URIs so that people can look up those names.
  3. When someone looks up a URI, provide useful information, using the standards (RDF*, SPARQL).
  4. Include links to other URIs, so that they can discover more things.

Keep these principles in mind as you work through the following steps!

Continue working with the HTML file that you have been editing so far, or for a fresh start, copy check_2c.html into a new file. As a refresher, your HTML should now look something like the following:

<!DOCTYPE html>
<html>
<head>
    <meta charset="utf-8">
    <title>Structured data with schema.org codelab</title>
</head>
<body vocab="http://schema.org/" typeof="TechArticle">
<h1 property="name">Structured data with schema.org codelab</h1>
    <img style="float:right" src="squares.png" property="image" />
    <meta property="educationalUse" content="codelab">
    <p class="byline">
        By <span property="author" typeof="Person" resource="#authorName">
            <a href="http://example.com/AuthorName" property="sameAs"><span
                property="name"><span property="givenName">Dan</span>
                <span property="familyName">Scott</span></span></a></span>,
        <time property="datePublished" content="2014-01-29">January 29, 2014</time>
    </p>
<div property="description">
    <h2>About this codelab</h2>
</div>
<div property="articleBody">
    <h2>Exercise 1: From basic HTML to RDFa: first steps</h2>
    <h2>Exercise 2: Embedded types</h2>
    <h2>Exercise 3: From strings to things</h2>
</div>
<div resource="#authorName">
    <p property="description">
        Author Name is a systems librarian at Laurentian University.
    </p>
</div>
</body>
</html>

Step 3(a): Create a separate page for a core type

Take a look at how the page has developed over time; there is now a lot of HTML markup just to describe the author. It's a perfect candidate for refactoring; you can move the bulk of the markup to a separate page about the author. Once it is a separate page, then you can simply link to it from this page... as well as from any other pages that want to provide information about this author.

Create a new file named authorName.html in your text editor, and copy the @resource="#authorName" markup into the file.

As the new file describes a single type, you can move the declaration of the type into the <body> element, and you can remove the @resource attributes from the markup that you pasted into the file. Don't forget the @vocab declaration! Use your existing page as a template.
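A sketch of what authorName.html might contain after this refactoring is shown below. The <title> text and the choice of <h1> and <p> elements are assumptions made for illustration; the checkpoint file may arrange the same properties differently.

```html
<!DOCTYPE html>
<html>
<head>
    <meta charset="utf-8">
    <title>About the author</title>
</head>
<!-- The type moves to <body>, so the page itself now describes the Person. -->
<body vocab="http://schema.org/" typeof="Person">
<h1 property="name"><span property="givenName">Dan</span>
    <span property="familyName">Scott</span></h1>
<p><a href="http://example.com/AuthorName" property="sameAs">http://example.com/AuthorName</a></p>
<p property="description">
    Author Name is a systems librarian at Laurentian University.
</p>
</body>
</html>
```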

Use the RDFa parsers to ensure that the markup in the new file expresses the same information as it did in the original file.

Step 3(b): Link to the author page

Now, replace the inline markup in the original page with a simple link to your new file. You still want to state that "Author Name" is the author of the technical article using the @property="author" assertion, but now you can simply add that property directly to the <a> element that links to your new file. This is a signal to any RDFa parser that the linked resource contains the data for the named property.

Note: "when the element contains the href (or src) attribute, @property is automatically associated with the value of this attribute rather than the textual content of the <a> element" (Adida, Ben; Birbeck, Mark; Herman, Ivan; Sporny, Manu. RDFa 1.1 Primer - Second edition). Using a @property attribute on the same element as a @resource attribute works in a similar fashion; the target of the @resource attribute is used as the value of the @property attribute.
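As a sketch, the byline in the original page might shrink to the following, assuming authorName.html sits in the same directory as the article.

```html
<p class="byline">
    <!-- The value of author is now the URL of the author page, taken from @href. -->
    By <a property="author" href="authorName.html">Author Name</a>,
    <time property="datePublished" content="2014-01-29">January 29, 2014</time>
</p>
```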

Checkpoint: Your original HTML page should now look like check_3b.html and your new author HTML page should look like exercises/check_3b_authorName.html.

Step 3(c): Augment the author page

Now that you have created an entirely separate author page, you can add much more information about the author; for example, you can include an email address, links to their personal web sites and social media accounts, a list of their publications and previous talks... far more information than you would have wanted to publish inline in the article itself.

Following the principles of linked data can lead not only to more efficient maintenance of information and (potentially) more useful results in search engines and other aggregators of data, but also to a better information design and experience for your users.

Use the Person properties to flesh out the "about this author" page with properties such as address, birthDate, email, follows, and telephone. Be adventurous, and remember to try to use nested types and ranges appropriately!
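One possible starting point is sketched below. Every value in it (the email address, telephone number, birth date, address, and the followed person) is a made-up placeholder, and the layout is only an assumption about how such a page might be arranged.

```html
<!DOCTYPE html>
<html>
<head>
    <meta charset="utf-8">
    <title>About the author</title>
</head>
<body vocab="http://schema.org/" typeof="Person">
<h1 property="name">Author Name</h1>
<!-- email and telephone expect plain text values. -->
<p>Email: <span property="email">author.name@example.com</span></p>
<p>Telephone: <span property="telephone">+1-555-0100</span></p>
<!-- birthDate expects a Date, so supply an ISO 8601 value via @content. -->
<p>Born <time property="birthDate" content="1970-01-01">January 1, 1970</time></p>
<!-- address can take a PostalAddress, so start an embedded type. -->
<p property="address" typeof="PostalAddress">
    <span property="addressLocality">Example City</span>,
    <span property="addressRegion">Example Region</span>
</p>
<!-- follows expects a Person; @href supplies an identifier for that person. -->
<p>Follows <a property="follows" typeof="Person"
    href="http://example.com/AnotherPerson"><span
    property="name">Another Person</span></a></p>
</body>
</html>
```

The nested name property belongs to the followed person rather than to the author, because the typeof attribute on the <a> element starts a new scope.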

Lessons learned

In this exercise, you learned:

About the author

Dan Scott is a systems librarian at Laurentian University.

Informational resources

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.