In this codelab, you will explore open source Python code that uses sitemaps to crawl websites and extract structured data. Our particular focus will be on RDFa and the schema.org vocabulary, but the tools can be used for any vocabulary and most forms of structured data, including RDFa and microdata.
Audience: Intermediate
Prerequisites: To complete this codelab, you will need to have:
A sitemap that follows the sitemaps.org specification lists all of the URLs that would be of interest to a search engine or other agent that wants to index the content of a site. Sitemaps offer the ability to record the date that each file was last modified, so that agents can minimize their impact on the target website by retrieving only those URLs that have changed since their last crawl.
Sitemaps are built from two basic components: a sitemap index file and individual sitemap files. A site with fewer than 50,000 URLs of interest can forgo the index file and supply a single sitemap file, but as we are dealing with library systems, which often exceed that limit, we will provide a brief overview of both components.
A sitemap file lists each URL of interest in a simple XML format that contains the following elements:
<urlset> (the root element that encloses the entire sitemap)
<url> (the parent element for each URL entry)
<loc> (the URL itself)
<lastmod> (the date the resource was last modified)
For example, the first few lines of a sitemap might look like the following:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://example.org/eg/opac/record/100000?locg=105</loc>
    <lastmod>2009-05-03</lastmod>
  </url>
  <url>
    <loc>http://example.org/eg/opac/record/100001?locg=105</loc>
    <lastmod>2009-05-03</lastmod>
  </url>
  <url>
    <loc>http://example.org/eg/opac/record/100002?locg=105</loc>
    <lastmod>2009-05-03</lastmod>
  </url>
</urlset>
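As a minimal illustration of how an agent might consume this format (this is not part of the codelab's script), the following Python sketch parses a local sitemap file with the standard library and prints each URL along with its last-modified date; the file name sitemap.xml is a placeholder:

import xml.etree.ElementTree as ET

# Prefix mapping for the sitemaps.org namespace declared on <urlset>
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

root = ET.parse("sitemap.xml").getroot()  # placeholder file name
for url in root.findall("sm:url", SITEMAP_NS):
    loc = url.find("sm:loc", SITEMAP_NS).text
    lastmod = url.find("sm:lastmod", SITEMAP_NS)
    print(loc, lastmod.text if lastmod is not None else "(no lastmod)")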
Per the sitemaps.org specification, a single sitemap file should include no more than 50,000 URLs. Once a site grows beyond this limit, you should create multiple sitemap files, and then create a sitemap index file that lists each of the sitemap files. The sitemap index file format looks like the following:
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>http://example.org/sitemap1.xml</loc></sitemap>
  <sitemap><loc>http://example.org/sitemap2.xml</loc></sitemap>
  <sitemap><loc>http://example.org/sitemap3.xml</loc></sitemap>
</sitemapindex>
Sitemap index files may also contain <lastmod> elements to indicate the last time that a particular sitemap changed. This is a further means of reducing network and processing demands on crawlers and websites.
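For example, a crawler that remembers when it last visited a site can read the sitemap index and re-fetch only the sitemaps whose <lastmod> dates are newer than that visit. The sketch below is again not part of the codelab's script, and the file name and crawl date are placeholders:

import xml.etree.ElementTree as ET
from datetime import date

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
LAST_CRAWL = date(2009, 5, 1)  # hypothetical date of the previous crawl

root = ET.parse("sitemap_index.xml").getroot()  # placeholder file name
for sitemap in root.findall("sm:sitemap", SITEMAP_NS):
    loc = sitemap.find("sm:loc", SITEMAP_NS).text
    lastmod = sitemap.find("sm:lastmod", SITEMAP_NS)
    # Fetch the sitemap if it changed since the last crawl, or if it
    # supplies no <lastmod> date at all.
    if lastmod is None or date.fromisoformat(lastmod.text) > LAST_CRAWL:
        print("fetch", loc)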
Although sitemaps are effective on their own for surfacing unstructured data, combining them with open structured data enables more dedicated search applications that have the potential to compete with general search engines and to replace current dependencies on more laborious practices.
For example, the current approach for creating and maintaining union catalogues is to periodically transfer MARC records in bulk to a central location, which then rebuilds a local index based on the aggregation of those MARC records. The process is generally the same whether the union catalogue is a local creation or something as complex as OCLC, and due to the vagaries of the MARC Format for Holdings Data, additional information such as the meanings of various strings representing locations and statuses must be communicated outside of the system.
The Python script schema_union represents a step towards a possible new approach to union catalogues. You have already worked through the exercises for marking up bibliographic records with RDFa and schema.org, including a simple but effective approach for representing holdings as schema.org Offer entities, and you have seen how libraries can surface their own location, hours of operation, and branch relationship information. Combine that structured data with sitemaps and you can offer a much simpler process for generating and maintaining union catalogues. The schema_union script offers a proof of concept for union catalogues that leverages the general discovery advantages of RDFa with schema.org markup, combined with the freshness of automatically generated sitemaps, to disintermediate traditional aggregators of library holdings information and to democratize access to this information.
The following open source library systems are known to offer support for sitemaps:
In this exercise, you will obtain a local copy of the Python crawler script, ensure that you can run it in your environment, and learn how to customize it to reflect your requirements.
There are two ways you can get a copy of the script to work with:
git clone https://github.com/dbs/schema-unioncat.git
By default, the script will only try to extract structured data from a single HTML page, which means that you can test running the script without having to worry about using too much bandwidth (either your own, or that of the target site). To run the script, you will need to open a terminal window and run the following command from the directory in which the script is located:
python schema_union.py
You should see the following output:
usage: schema_union.py [-h] -o OUTPUT [-s SITEMAP] [-p PARSER] [-t SERIALIZER]
schema_union.py: error: argument -o/--output is required
This tells you that only a single argument is required: -o OUTPUT. That argument tells the script in which file the output should be stored.
If you use the -h argument, the script will respond with a more complete set of help as follows:
usage: schema_union.py [-h] -o OUTPUT [-s SITEMAP] [-p PARSER] [-t SERIALIZER]

Crawl a sitemap.xml and extract RDFa from the documents

optional arguments:
  -h, --help            show this help message and exit
  -o OUTPUT, --output OUTPUT
                        Path / filename for the output
  -s SITEMAP, --sitemap SITEMAP
                        Location of the sitemap to parse
  -p PARSER, --parser PARSER
                        Parser to use for the input format ("rdfa", "microdata", etc)
  -t SERIALIZER, --serializer SERIALIZER
                        Serializer to use for the output format ("n3", "nt", "turtle", "xml", etc)
ImportError: No module named 'rdflib'
This means that the prerequisite RDFLib module could not be found on your system. Try running pip install rdflib or easy_install rdflib to set up your environment.
URLError: <urlopen error [Errno -2] Name or service not known>
This means that either your internet connection is not working, or the target URL is no longer responding. Try the URL from your browser to see if it responds.
Run the script once, providing a file location for the -o argument, such as /tmp/blah. By default, the script parses a single URL of a VuFind site.
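For example, assuming the script is in your current directory and you choose /tmp/blah as the output file, the command would look like the following:
python schema_union.py -o /tmp/blah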
Open the output file in a text editor. The file contains content like the following:
@prefix ns1: <http://schema.org/> .
@prefix ns2: <http://www.w3.org/1999/xhtml/vocab#> .
@prefix ns3: <http://www.w3.org/ns/rdfa#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xml: <http://www.w3.org/XML/1998/namespace> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
<https://find.senatehouselibrary.ac.uk/Record/.b24804241> ns3:usesVocabulary ns1: .
<https://find.senatehouselibrary.ac.uk/Record/.b24804241#record> a ns1:Book,
ns1:Product ;
ns1:author "Medici, House of."@en-gb ;
ns1:contributor "Richards, Gertrude R. B. 1886-"@en-gb,
"Selfridge, Harry Gordon, 1858-"@en-gb ;
ns1:creator "Harvard University. Graduate School of Business Administration."@en-gb ;
ns1:name "Florentine merchants in the age of the Medici : letters and documents from the Selfridge collection of Medici manuscripts"@en-gb ;
ns1:publicationDate "1932"@en-gb ;
ns1:publisher [ a ns1:Organization ;
ns1:location "Cambridge, Mass. :"@en-gb ;
ns1:name "Harvard University Press,"@en-gb ] ;
ns1:subjectLine "HNH - Action > Cultural History > Italian History > Sources. General"@en-gb .
[] a ns1:Offer ;
ns1:availability ns1:InStock ;
ns1:businessFunction <http://purl.org/goodrelations/v1#LeaseOut> ;
ns1:itemOffered <https://find.senatehouselibrary.ac.uk/Record/.b24804241#record> ;
ns1:seller "WARBURG Main"@en-gb ;
ns1:serialNumber "1234567890123"@en-gb ;
ns1:sku "HNH 1816"@en-gb .
This is the n3 serialization format of RDF data. You should experiment with other formats using the -t SERIALIZER argument to find a type that meets your needs. The nt format, for example, can be easily parsed for input into a relational database.
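Because the output is RDF, you can also process it locally with RDFLib rather than re-running the crawl. The following minimal sketch (assuming the /tmp/blah output file from the previous step) loads the n3 output, lists each schema.org Offer with its seller and SKU, and re-serializes the whole graph as nt:

from rdflib import Graph, Namespace
from rdflib.namespace import RDF

SCHEMA = Namespace("http://schema.org/")

g = Graph()
g.parse("/tmp/blah", format="n3")  # the output file generated above

# List each holding (schema.org Offer) with its seller and SKU
for offer in g.subjects(RDF.type, SCHEMA.Offer):
    print(g.value(offer, SCHEMA.itemOffered),
          g.value(offer, SCHEMA.seller),
          g.value(offer, SCHEMA.sku))

# Convert the same data to the nt (N-Triples) format without re-crawling
g.serialize(destination="/tmp/blah.nt", format="nt")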
In this exercise, you learned how to run the Python crawler script, including some basic troubleshooting steps, and how to control the input and output options from the command line.
The script, as-is, is suitable for a proof of concept but requires some customization to be a more useful starting point for your own purposes. In this section, we examine the contents of the script and make a few modifications as examples of what you can do.
Until now, the script has ignored any -s SITEMAP argument that you may have set at the command line. This is because one of the constants in the script, SHORT_CIRCUIT, is set to True. While that constant is set to True, the script ignores any sitemap argument and simply uses the URLs contained in the list defined by the SAMPLE_URLS constant.
To enable the script to parse actual sitemaps:
1. Open schema_union.py in an editor.
2. Find SHORT_CIRCUIT=True around line 73 and change True to False.
3. Change the SITEMAP_URL constant to point at a sitemap you are interested in crawling. If you leave it as-is, it will parse the entire set of sitemaps for a single site, resulting in the download of megabytes of sitemap files and many thousands of subsequent HTTP requests for the pages listed in those sitemaps, which is overkill for this exercise.
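To make the relationship between these constants concrete, here is a hypothetical sketch of the kind of logic the steps above toggle. The constant names come from the text of this codelab, but the actual schema_union.py is likely organized differently, so treat this only as an illustration:

import urllib.request
import xml.etree.ElementTree as ET

SHORT_CIRCUIT = False  # step 2 above: changed from True to False
SITEMAP_URL = "http://example.org/sitemap.xml"  # step 3 above: placeholder
SAMPLE_URLS = ["http://example.org/eg/opac/record/100000?locg=105"]

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def urls_to_crawl():
    """Return the sample URLs, or every <loc> listed in SITEMAP_URL."""
    if SHORT_CIRCUIT:
        return SAMPLE_URLS
    with urllib.request.urlopen(SITEMAP_URL) as response:
        root = ET.parse(response).getroot()
    return [loc.text for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)]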
The LDSpider project is much more mature and powerful than the simple script I created. If you are considering a serious linked data crawling application, LDSpider should be one of your first options, as it provides both a command line interface and an API. That said, it can be challenging to work with as it is quite complex!
In this section, you learned how sitemaps work and how to modify the simple schema_union script to build a structured data crawler that could be used as the basis of a union catalogue built upon schema.org, structured data, and sitemaps.
Dan Scott is a systems librarian at Laurentian University.
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.