Build an RDFa with schema.org crawler codelab: Python

By Dan Scott

About this codelab

In this codelab, you will explore open source Python code that uses sitemaps to crawl websites and extract structured data. Our particular focus will be on RDFa and the schema.org vocabulary, but the tools can be used for any vocabulary and most forms of structured data, including RDFa and microdata.

Audience: Intermediate

Prerequisites: To complete this codelab, you will need a working Python installation with the RDFLib module available (see the troubleshooting notes later in this codelab), a terminal from which to run commands, and a text editor.

About sitemaps

A sitemap that follows the sitemaps.org specification lists all of the URLs that would be of interest to a search engine or other agent that wants to index the content of a site. Sitemaps offer the ability to record the date that each file was last modified, so that agents can minimize their impact on the target website by retrieving only those URLs that have changed since their last crawl.

Sitemaps are built from two basic components: a sitemap index file and individual sitemap files. A site with fewer than 50,000 URLs of interest can forgo the index file and just supply a single sitemap file, but as we are dealing with library systems, we will provide a brief overview of both components.

Sitemap files

A sitemap file lists each URL of interest in a simple XML format that contains the following elements:

<urlset>
The root element of the XML file.
<url>
The element that defines each URL.
<loc>
The element that contains the address of a given URL.
<lastmod>
(Optional) The date when the contents of a given URL were last modified.

For example, the first few lines of a sitemap might look like the following:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://example.org/eg/opac/record/100000?locg=105</loc>
    <lastmod>2009-05-03</lastmod>
  </url>
  <url>
    <loc>http://example.org/eg/opac/record/100001?locg=105</loc>
    <lastmod>2009-05-03</lastmod>
  </url>
  <url>
    <loc>http://example.org/eg/opac/record/100002?locg=105</loc>
    <lastmod>2009-05-03</lastmod>
  </url>
</urlset>
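
To give a sense of how simple these files are to process, here is a minimal sketch (not part of the schema_union script) that uses only Python's standard library to extract the <loc> and <lastmod> values from a sitemap file; the sitemap URL is a placeholder:

import urllib.request
import xml.etree.ElementTree as ET

SITEMAP_NS = '{http://www.sitemaps.org/schemas/sitemap/0.9}'

def parse_sitemap(sitemap_url):
    """Yield (loc, lastmod) pairs for each <url> entry in a sitemap file."""
    with urllib.request.urlopen(sitemap_url) as response:
        tree = ET.parse(response)
    for url in tree.iter(SITEMAP_NS + 'url'):
        loc = url.findtext(SITEMAP_NS + 'loc')
        lastmod = url.findtext(SITEMAP_NS + 'lastmod')  # None if the element is absent
        yield loc, lastmod

for loc, lastmod in parse_sitemap('http://example.org/sitemap.xml'):
    print(loc, lastmod)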

Sitemap index files

Per the sitemaps.org specification, a single sitemap file should include no more than 50,000 URLs. Once a site grows beyond this limit, you should create multiple sitemap files, and then create a sitemap index file that lists each of the sitemap files. The sitemap index file format looks like the following:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>http://example.org/sitemap1.xml</loc></sitemap>
  <sitemap><loc>http://example.org/sitemap2.xml</loc></sitemap>
  <sitemap><loc>http://example.org/sitemap3.xml</loc></sitemap>
</sitemapindex>

Sitemap index files may also contain <lastmod> elements to indicate the last time that particular sitemap changed. This is a further means of reducing network and processing demands on crawlers and websites.
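
For example, a crawler that remembers when it last visited a site can compare each <lastmod> value against that date and skip anything that has not changed. A minimal sketch, assuming the date of the previous crawl has been stored from an earlier run:

from datetime import date

last_crawl = date(2009, 5, 1)  # hypothetical date of the previous crawl

def needs_refresh(lastmod_text):
    """Return True if an entry changed since the last crawl, or carries no <lastmod>."""
    if not lastmod_text:
        return True  # no <lastmod>: we cannot rule out a change
    # <lastmod> may be a full W3C datetime; the first ten characters are the date
    return date.fromisoformat(lastmod_text[:10]) > last_crawl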

New approaches

Although sitemaps are effective on their own for surfacing unstructured content, combining them with open structured data enables more dedicated search applications that have the potential to compete with general search engines and to replace current dependencies on more laborious practices.

For example, the current approach for creating and maintaining union catalogues is to periodically transfer MARC records in bulk to a central location, which then rebuilds a local index based on the aggregation of those MARC records. The process is generally the same whether the union catalogue is a local creation or something as complex as OCLC, and due to the vagaries of the MARC Format for Holdings Data, additional information such as the meanings of various strings representing locations and statuses must be communicated outside of the system.

The Python script schema_union represents a step towards a possible new approach to union catalogues. Earlier exercises showed how to mark up bibliographic records with RDFa and schema.org, including a simple but effective approach to representing holdings as schema.org Offer entities, and how libraries can surface their own location, hours of operation, and branch relationship information. Combine that markup with sitemaps, and you can offer a much simpler process for generating and maintaining union catalogues. The schema_union script offers a proof of concept for union catalogues that leverage the general discovery advantages of RDFa with schema.org markup, combined with the freshness of automatically generated sitemaps, to disintermediate traditional aggregators of library holdings information and democratize access to it.

Open source library system participants

The following open source library systems are known to offer support for sitemaps:

Running the Python crawler script

In this exercise, you will obtain a local copy of the Python crawler script, ensure that you can run it in your environment, and learn how to customize it to reflect your requirements.

Getting the script

There are two ways you can get a copy of the script to work with:

  1. git: If you are comfortable using the git version control system, you can simply clone the git repository:
    git clone https://github.com/dbs/schema-unioncat.git
  2. Download: You can also download the latest version of the script directly.

Run the script

By default, the script will only try to extract structured data from a single HTML page, which means that you can test running the script without having to worry about using too much bandwidth (either your own, or that of the target site). To run the script, you will need to open a terminal window and run the following command from the directory in which the script is located:

python schema_union.py

You should see the following output:

usage: schema_union.py [-h] -o OUTPUT [-s SITEMAP] [-p PARSER] [-t SERIALIZER]
schema_union.py: error: argument -o/--output is required

This tells you that only a single argument is required: -o OUTPUT. That argument tells the script in which file the output should be stored.

If you use the -h argument, the script will respond with a more complete set of help as follows:

usage: schema_union.py [-h] -o OUTPUT [-s SITEMAP] [-p PARSER] [-t SERIALIZER]

Crawl a sitemap.xml and extract RDFa from the documents

optional arguments:
  -h, --help            show this help message and exit
  -o OUTPUT, --output OUTPUT
                        Path / filename for the output
  -s SITEMAP, --sitemap SITEMAP
                        Location of the sitemap to parse
  -p PARSER, --parser PARSER
                        Parser to use for the input format ("rdfa",
                        "microdata", etc)
  -t SERIALIZER, --serializer SERIALIZER
                        Serializer to use for the output format ("n3", "nt",
                        "turtle", "xml", etc)

Troubleshooting ImportError: No module named 'rdflib'

This means that the prerequisite RDFLib module could not be found on your system. Try running pip install rdflib or easy_install rdflib to set up your environment.

Troubleshooting URLError: <urlopen error [Errno -2] Name or service not known>

This means that either your internet connection is not working, or the target URL is no longer responding. Try the URL from your browser to see if it responds.

Checking the output of the script

Run the script once, providing a file location for the -o argument such as /tmp/blah. By default, the script parses a single URL of a VuFind site.
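
For example:

python schema_union.py -o /tmp/blah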

Open the output file in a text editor. The file contains content like the following:

@prefix ns1: <http://schema.org/> .
@prefix ns2: <http://www.w3.org/1999/xhtml/vocab#> .
@prefix ns3: <http://www.w3.org/ns/rdfa#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xml: <http://www.w3.org/XML/1998/namespace> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<https://find.senatehouselibrary.ac.uk/Record/.b24804241> ns3:usesVocabulary ns1: .

<https://find.senatehouselibrary.ac.uk/Record/.b24804241#record> a ns1:Book,
        ns1:Product ;
    ns1:author "Medici, House of."@en-gb ;
    ns1:contributor "Richards, Gertrude R. B. 1886-"@en-gb,
        "Selfridge, Harry Gordon, 1858-"@en-gb ;
    ns1:creator "Harvard University. Graduate School of Business Administration."@en-gb ;
    ns1:name "Florentine merchants in the age of the Medici : letters and documents from the Selfridge collection of Medici manuscripts"@en-gb ;
    ns1:publicationDate "1932"@en-gb ;
    ns1:publisher [ a ns1:Organization ;
            ns1:location "Cambridge, Mass. :"@en-gb ;
            ns1:name "Harvard University Press,"@en-gb ] ;
    ns1:subjectLine "HNH - Action > Cultural History > Italian History > Sources. General"@en-gb .

[] a ns1:Offer ;
    ns1:availability ns1:InStock ;
    ns1:businessFunction <http://purl.org/goodrelations/v1#LeaseOut> ;
    ns1:itemOffered <https://find.senatehouselibrary.ac.uk/Record/.b24804241#record> ;
    ns1:seller "WARBURG Main"@en-gb ;
    ns1:serialNumber "1234567890123"@en-gb ;
    ns1:sku "HNH 1816"@en-gb .

This is the n3 serialization format of RDF data. You should experiment with other formats using the -t SERIALIZER argument to find a type that meets your needs. The nt format, for example, can be easily parsed for input into a relational database.
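
You do not need to re-crawl the site just to change formats: RDFLib, which the script already requires, can re-serialize an existing output file. A minimal sketch, assuming the crawler wrote its default n3 output to /tmp/blah:

from rdflib import Graph

g = Graph()
g.parse('/tmp/blah', format='n3')                     # read the crawler's output
g.serialize(destination='/tmp/blah.nt', format='nt')  # write N-Triples, one triple per line
print(len(g), 'triples converted')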

Lessons learned

In this exercise, you learned how to run the Python crawler script, including some basic troubleshooting steps, and how to control the input and output options from the command line.

Modifying the script for your site(s)

The script, as-is, is suitable for a proof of concept but requires some customization to be a more useful starting point for your own purposes. In this section, we examine the contents of the script and make a few modifications as examples of what you can do.

Enable the script to parse a complete sitemap

Until now, the script has ignored any -s SITEMAP argument that you may have supplied on the command line. This is because one of the constants in the script, SHORT_CIRCUIT, is set to True. While that constant is True, the script ignores the sitemap argument and simply uses the URLs in the list defined by the SAMPLE_URLS constant.

To enable the script to parse actual sitemaps:

  1. Open schema_union.py in an editor.
  2. Find the line SHORT_CIRCUIT=True around line 73 and change True to False.
  3. Change the SITEMAP_URL constant to point at a sitemap you are interested in crawling. If you leave it as-is, it will parse the entire set of sitemaps for a single site, resulting in the download of megabytes of sitemap files and many thousands of subsequent HTTP requests for the pages listed in the sitemap... which is overkill for this exercise.
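
Conceptually, the logic you have just toggled looks something like the following sketch. SHORT_CIRCUIT, SAMPLE_URLS, and SITEMAP_URL are the constants named above; the rest of the code is illustrative rather than a copy of schema_union.py:

import urllib.request
import xml.etree.ElementTree as ET

SHORT_CIRCUIT = True   # set to False to crawl the sitemap instead of the samples
SAMPLE_URLS = ['https://find.senatehouselibrary.ac.uk/Record/.b24804241']  # the sample URL crawled earlier
SITEMAP_URL = 'http://example.org/sitemap.xml'  # placeholder; point this at your own site

def urls_to_crawl(sitemap_url=SITEMAP_URL):
    """Return the page URLs that should be fetched and parsed for structured data."""
    if SHORT_CIRCUIT:
        # Any -s/--sitemap argument is ignored; only the sample URLs are processed
        return SAMPLE_URLS
    ns = '{http://www.sitemaps.org/schemas/sitemap/0.9}'
    with urllib.request.urlopen(sitemap_url) as response:
        tree = ET.parse(response)
    return [loc.text for loc in tree.iter(ns + 'loc')]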

Other projects

The LDSpider project is much more mature and powerful than the simple script I created. If you are considering a serious linked data crawling application, LDSpider should be one of your first options, as it provides both a command line interface and an API. That said, it can be challenging to work with as it is quite complex!

Lessons learned

In this section, you learned how sitemaps work and how to modify the simple schema_union script to build a structured data crawler that could serve as the basis of a union catalogue built upon schema.org structured data and sitemaps.

About the author

Dan Scott is a systems librarian at Laurentian University.

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.