Doing useful things with periodical holdings, part 2: comparing with print holdings in Evergreen

Posted on Tue 17 November 2009 in Libraries

Doing interesting things with Evergreen serials data

Update: 2010-05-31 Running through the process again, I found a few typos in the pg_dump commands, so I fixed those up.

I'm working on a project to compare our electronic journal holdings with our print journal holdings. This is probably a task that most academic libraries have been working on over the past few years, as collection space dwindles, the duplication of holdings in electronic and print formats increases, and electronic delivery and 24/7 access become the default expectation of our patrons.

In my previous post, I worked through the hoops required to get our SFX holdings into a usable database for query purposes. In this post, I'll walk through the steps required to get the serials holdings from Evergreen into the same database so that we can generate reports based on the authoritative sources for both our electronic and print holdings.

We'll start by dumping the schema for the biblio.record_entry and serial.record_entry tables from our Evergreen database. In the previous post, we could have added the tables from the SFX export to the Evergreen database, but I don't like mixing these more experimental projects with our production system - so we'll work with a separate database named periodicals instead.

pg_dump --no-owner --schema-only --table biblio.record_entry \
    --table serial.record_entry evergreen > bre_sre_schema.dump

Next, we have to munge the schema so that it does not create the indexes on the tables, which should lead to faster loads. Also, remove any triggers that point at functions or tables that don't exist in this limited subset of data. Then create the schema in our periodicals holdings database:

psql -f bre_sre_schema.dump -d periodicals
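The culling step can be scripted rather than done by hand; here's a rough sketch that filters the dump with grep, using a stand-in sample in place of the real bre_sre_schema.dump (real pg_dump output can spread a statement across several lines, so check the result by eye):

```shell
# Stand-in sample for the real bre_sre_schema.dump
cat > bre_sre_schema.dump <<'EOF'
CREATE TABLE biblio.record_entry (id bigint NOT NULL, marc text NOT NULL);
CREATE INDEX biblio_record_entry_create_date_idx ON biblio.record_entry (create_date);
CREATE TRIGGER a_marcxml_is_well_formed BEFORE INSERT OR UPDATE ON biblio.record_entry FOR EACH ROW EXECUTE PROCEDURE biblio.check_marcxml();
EOF

# Save the culled CREATE INDEX lines so we can replay them after the data load
grep -E '^CREATE (UNIQUE )?INDEX' bre_sre_schema.dump > bre_sre_indexes.sql

# Strip the CREATE INDEX and CREATE TRIGGER statements from the schema
grep -vE '^CREATE ((UNIQUE )?INDEX|TRIGGER)' bre_sre_schema.dump > bre_sre_schema_noindex.sql
```

The file names here are illustrative; the point is to keep the culled index definitions around so you don't have to reconstruct them later.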

Now dump the data for those tables from the Evergreen database. If you have a large set of bibliographic records like we do, make sure you have a few gigabytes of space available in the output location.

pg_dump --no-owner --data-only --table biblio.record_entry \
    --table serial.record_entry evergreen \
    > bre_sre_data.dump

Okay, now you can load the data into your serials holdings database:

psql -f bre_sre_data.dump -d periodicals

And now we add the indexes that we previously culled from the schema. You can be more selective in the indexes you create, if you know what you're doing.
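If you didn't keep the culled definitions around, a minimal hand-rolled set covering the join used in the queries that follow might look like this (the index names are illustrative, not the ones from the Evergreen schema):

```sql
-- Illustrative indexes for the joins below; the authoritative
-- definitions are the ones culled from the pg_dump schema.
CREATE INDEX sre_record_idx ON serial.record_entry (record);
CREATE INDEX sre_owning_lib_idx ON serial.record_entry (owning_lib);
```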

For some reason, I opted to play with PostgreSQL's support for XML as a native column type and converted the plain text marc column into an XML column:

ALTER TABLE biblio.record_entry
    ALTER COLUMN marc
    SET DATA TYPE XML USING marc::XML;

Now we add the Evergreen holdings to the holdings.conifer table. We use the xpath() function to retrieve the desired values from the MARC XML in biblio.record_entry, and wrap the results in the unnest() function to expand the returned array of nodes into individual plain text values, rather than inserting the array itself. The WHERE clause restricts the holdings to those owned by the library in which I am interested.

CREATE TABLE holdings.conifer (
    record BIGINT,
    issn TEXT,
    coverage TEXT,
    call_number TEXT
);

INSERT INTO holdings.conifer (record, issn)
    SELECT bre.id, UNNEST(XPATH('//*[local-name()="datafield"][@tag="022"]' ||
            '/*[local-name()="subfield"][@code="a"]/text()', bre.marc))
        FROM biblio.record_entry bre INNER JOIN serial.record_entry sre
            ON sre.record = bre.id
        WHERE sre.owning_lib = 103;

We'll populate the call number based on the 852 field in the serial record. We could pull this from the asset.call_number table, but this will be good enough for the first pass.

UPDATE holdings.conifer
    SET call_number = UNNEST(
        XPATH(
            '//*[local-name()="datafield"][@tag="852"]/' ||
                '*[local-name()="subfield"][@code="h"]/text()',
            (
                SELECT sre.marc::xml
                FROM serial.record_entry sre
                    INNER JOIN holdings.conifer hc ON hc.record = sre.record
                WHERE hc.record = holdings.conifer.record
                LIMIT 1
            )
        )
    );

Now we need to generate usable holdings statements for the print. Evergreen includes a great MFHD parsing library written in Perl, and PostgreSQL thankfully enables you to create functions written in Perl. However, to get the following to work on a non-Evergreen machine, I had to copy Open-ILS/src/perlmods/OpenILS/Utils/MFHD/* to /usr/local/share/perl/5.10.0 and edit the occurrences of OpenILS::Utils::MFHD::* to MFHD::*.
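That copy-and-rename step can be scripted with sed. This sketch demonstrates the rename on a stand-in module file rather than the real Evergreen source tree (module contents and paths here are illustrative):

```shell
# Stand-in for one of the Open-ILS/src/perlmods/OpenILS/Utils/MFHD/ modules
mkdir -p MFHD
cat > MFHD/Caption.pm <<'EOF'
package OpenILS::Utils::MFHD::Caption;
use OpenILS::Utils::MFHD::Date;
EOF

# Rewrite the OpenILS::Utils::MFHD::* package references to MFHD::*
sed -i -e 's/OpenILS::Utils::MFHD/MFHD/g' MFHD/*.pm
```

On the real tree you would run the same sed expression over the copied MFHD.pm and MFHD/*.pm files in /usr/local/share/perl/5.10.0.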

CREATE OR REPLACE FUNCTION holdings.parse_mfhd ( xml TEXT ) RETURNS TEXT AS $_$
    use MARC::Record;
    use MARC::File::XML;
    use MFHD;

    my $xml = shift;
    # Initialize to empty strings to avoid undef concatenation warnings
    my $text = '';
    my $captions = '';

    my $marc = MARC::Record->new_from_xml( $xml );
    my $mfhd = MFHD->new($marc);

    foreach my $field ($marc->field('866')) {
        my $holdings = $field->subfield('a');
        if ($holdings) {
            my $public_note = $field->subfield('z');
            if ($public_note) {
                $text .= "$holdings - $public_note";
            } else {
                $text .= "$holdings";
            }
        }
    }

    foreach my $cap_id ($mfhd->captions('853')) {
        my @curr_holdings = $mfhd->holdings('863', $cap_id);
        next unless scalar @curr_holdings;
        foreach (@curr_holdings) {
            if ($captions) {
                $captions .= ', ';
            }
            $captions .= $_->format();
        }
    }

    if ($text and $captions) {
        $text = "$text / $captions";
    } else {
        $text = "$text$captions";
    }

    return $text;
$_$ LANGUAGE PLPERLU;

And update the table:

UPDATE holdings.conifer SET coverage = (
    SELECT holdings.parse_mfhd(marc)
    FROM serial.record_entry
    WHERE serial.record_entry.record = holdings.conifer.record
    LIMIT 1
);

That almost works, but it only retrieves the coverage from a single serial holdings record for a given bibliographic record, even though there might be multiple serial holdings records. To amend that, we'll create a PL/pgSQL function that concatenates all of the coverage statements from all of the pertinent serial holdings records for a given bibliographic record:

CREATE OR REPLACE FUNCTION holdings.print_coverage(marc_record BIGINT)
    RETURNS TEXT AS $$
DECLARE
    r RECORD;
    coverage TEXT;
BEGIN
    -- If coverage is NULL to begin with, then concatenating to it results in NULL
    coverage := '';
    -- RAISE NOTICE 'marc_record = %', marc_record;

    -- Loop over the serial records attached to the targeted bib record
    FOR r IN SELECT marc FROM serial.record_entry
        WHERE record = marc_record
        ORDER BY id
    LOOP
        coverage := coverage || holdings.parse_mfhd(r.marc);
        -- RAISE NOTICE 'r.marc = %', r.marc;
    END LOOP;

    -- RAISE NOTICE 'coverage = %', coverage;
    RETURN coverage;
END
$$ LANGUAGE 'plpgsql';

And we'll use this fancy new function to update the print holdings statements again with the more complete coverage:

UPDATE holdings.conifer SET coverage = (
    SELECT holdings.print_coverage(record)
        FROM serial.record_entry
        WHERE serial.record_entry.record = holdings.conifer.record
        LIMIT 1
    );

Now the payoff: generating a list of matching ISSNs from the electronic holdings and our print holdings, with the coverage statements for each, for a subset of the SFX collections to which we have access:

-- Set the display to expanded format for easy reading
\x

-- Basic report for perusal
SELECT hsfx.issn AS "ISSN", hsfx.title AS "Title",
        hsfx.collection AS "SFX Collection",
        hsfx.coverage AS "Electronic Coverage",
        hc.coverage AS "Print Coverage", hc.call_number AS "Call Number"
    FROM holdings.sfx hsfx
        INNER JOIN holdings.conifer hc ON hsfx.issn = hc.issn
    WHERE (hsfx.collection ILIKE '%JStor%' OR hsfx.collection LIKE '%Scholars%')
        AND hc.coverage > ''
    LIMIT 5;

That results in:

-[ RECORD 1 ]-------+--------------------------------------------------------------------------
ISSN                | 0142-2774
Title               | Journal of Occupational Behavior
SFX Collection      | JSTOR Arts and Sciences 4
Electronic Coverage | Available from 1980 until 1987.
Print Coverage      | Vol. 1 No.  - Vol. 8 No. 4 (1980-1987)
Call Number         | DESM-PER
-[ RECORD 2 ]-------+--------------------------------------------------------------------------
ISSN                | 0741-6261
Title               | The Rand Journal of Economics
SFX Collection      | JSTOR Arts and Sciences 2
Electronic Coverage | Available from 1984 until 2006.
Print Coverage      | V.17 (1986) - v.23 (1992)
Call Number         | DESM-PER
-[ RECORD 3 ]-------+--------------------------------------------------------------------------
ISSN                | 0002-8614
Title               | Journal of the American Geriatrics Society
SFX Collection      | Scholars Portal
Electronic Coverage | Available from 2001 volume: 49 issue: 1 until 2009 volume: 57 issue: 10.
Print Coverage      | Vol. 1 - 37 (1953-1989)
Call Number         | DESM-PER
-[ RECORD 4 ]-------+--------------------------------------------------------------------------
ISSN                | 0023-7639
Title               | Land Economics
SFX Collection      | JSTOR Arts and Sciences 7
Electronic Coverage | Available from 1948 until 2005.
Print Coverage      | v.62 (1986) - v.68 (1992)
Call Number         | DESM-PER
-[ RECORD 5 ]-------+--------------------------------------------------------------------------
ISSN                | 0090-2616
Title               | Organizational dynamics
SFX Collection      | Scholars Portal
Electronic Coverage | Available from 1995 volume: 23 issue: 3 until 2009 volume: 38 issue: 3.
Print Coverage      | Vol. 15 No.  - Vol. 23 No. 5 (Summer 1986-Spring 1995)
Call Number         | DESM-PER

Looks pretty good to these eyes. Okay, now we'll get serious and dump the output to a tab-delimited file so we can easily open it in OpenOffice.org Calc or another spreadsheet:

-- Set delimiter to TAB (CTRL-V )
\f '^V'

-- Set the output to being unaligned
\a

-- Dump the output to a file
\o /tmp/periodicals.tsv

-- Generate URLs for quick catalogue lookups
SELECT 'http://laurentian.concat.ca/opac/en-CA/skin/lul/xml/rdetail.xml?r='
            || hc.record || '&l=105&d=1' AS "URL",
        hsfx.issn AS "ISSN", hsfx.title AS "Title",
        hsfx.collection AS "SFX Collection",
        hsfx.coverage AS "Electronic Coverage", hc.coverage AS "Print Coverage",
        hc.call_number AS "Call Number"
    FROM holdings.sfx_complete hsfx
        INNER JOIN holdings.conifer hc ON hsfx.issn = hc.issn
    WHERE (hsfx.collection ILIKE '%JStor%' OR hsfx.collection LIKE '%Scholars%')
        AND hc.coverage > '';

And that's it. It might seem complex, but I've found that investing the effort into learning how to lean on PostgreSQL to do the hard work pays plenty of dividends. This exploration should help me contribute more functionality to Evergreen core; for example, I hope to use my experiments with the pl/Perl function to start populating the serial.bib_summary tables using an INSERT/UPDATE/DELETE trigger on serial.record_entry so that we don't have to generate the summaries for every item details request in the catalogue.