More granular identifier indexes for your Evergreen SRU / Z39.50 servers
Posted on Wed 10 March 2010 in Libraries
In June of 2009 I was moaning about how “Evergreen, by default, has no identifier index for limiting searches by ISBN / ISSN / LCCN / OCLCnum” and that “if [fixing this problem] requires work from me, it will probably be 2010 before any of it happens”. Due to some of the tools our consortium relies on, we really needed a solution for identifier searches in Z39.50 that was better than just a general keyword search: we were returning too many false positives that cause extra work and frustration for everyone.
Well, here it is, 2010, and as of today Conifer's Evergreen server now has a very handy identifier index. Most of the required pieces were already there, in one form or another, but they all needed to be brought together. This blog post is going to try to do that (and serve as documentation for my ever-decaying brain, too). At the time of this post, we're running a 1.6.0.4-ish Evergreen system; you'll need to be running 1.6.0.4 to get ISSN searching to work properly, too.
First, we need to create the identifier index. Evergreen comes with the following indexes out of the box:
- author
- title
- series
- subject
- keyword
Pretty standard. With the exception of keyword, each of these indexes is composed of more granular indexes. For example, the title index is composed of the following specific indexes; each entry lists the granular index name, the XML format that the MARCXML is converted to, and the XPath expression that extracts the text from that XML format:
- abbreviated - MODS32 - //mods32:mods/mods32:titleInfo[mods32:title and (@type='abbreviated')]
- translated - MODS32 - //mods32:mods/mods32:titleInfo[mods32:title and (@type='translated')]
- alternative - MODS32 - //mods32:mods/mods32:titleInfo[mods32:title and (@type='alternative')]
- uniform - MODS32 - //mods32:mods/mods32:titleInfo[mods32:title and (@type='uniform')]
- proper - MODS32 - //mods32:mods/mods32:titleInfo[mods32:title and (@type='proper')]
Aside: You can search against these more granular indexes in the Evergreen OPAC, by the way, by appending the granular index name to the index class name with a | as a delimiter. For example, a search query of title|uniform: canada will search only the uniform titles for the term "canada". Okay, sorry for that detour, but I bet you weren't aware of that - we haven't done a good job of exposing some of the magic that has been there for a long time in Evergreen in the OPAC interface.
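The class|field prefix is simple enough to assemble programmatically. Here is a minimal sketch in Python; build_query is a hypothetical helper of my own (nothing in Evergreen provides it) that just formats the same query strings you would type into the OPAC search box:

```python
def build_query(index_class, term, field=None):
    """Format an OPAC search string such as 'title|uniform: canada'.

    Hypothetical helper for illustration only: it formats the
    class|field prefix syntax and does no searching itself.
    """
    prefix = index_class if field is None else f"{index_class}|{field}"
    return f"{prefix}: {term}"


print(build_query("title", "canada", field="uniform"))  # title|uniform: canada
print(build_query("title", "canada"))                   # title: canada
```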
Back to understanding the configuration - as you can see above, the conversion to MODS does the heavy lifting in pulling out the fields of interest to us from the MARCXML. The full set of indexed fields and their definitions is visible in the database via the query:
SELECT * FROM config.metabib_field;
For our purposes, we're interested in pulling the raw subfield a values of the 010 (LCCN), 020 (ISBN), and 022 (ISSN) fields directly from the MARCXML source. Our first step is to add an entry to the config.metabib_field table defining our new index. We'll create a new granular index under the "keyword" index class and call it "identifier", because that's what it is, right? That's as easy as:
INSERT INTO config.metabib_field (field_class, name, xpath, weight, format, search_field, facet_field) VALUES ('keyword', 'identifier', '//marcxml:datafield[@tag="010" or @tag="020" or @tag="022"]/marcxml:subfield[@code="a"]', 1, 'marcxml', true, false);
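If you want to see what that XPath expression will pick out of a record before committing to it, here is a rough sketch in Python using only the standard library. The sample record is made up, and extract_identifiers is a hypothetical name. ElementTree's XPath subset has no "or" operator, so the tag test is done in plain Python; the intent is to select the same nodes as the expression above, not to reproduce how Evergreen's ingest evaluates it.

```python
import xml.etree.ElementTree as ET

MARC_NS = "http://www.loc.gov/MARC21/slim"
IDENTIFIER_TAGS = {"010", "020", "022"}

# Invented sample record, trimmed to a few fields for illustration.
SAMPLE = f"""
<record xmlns="{MARC_NS}">
  <datafield tag="020" ind1=" " ind2=" ">
    <subfield code="a">0306406152</subfield>
  </datafield>
  <datafield tag="022" ind1=" " ind2=" ">
    <subfield code="a">0028-0836</subfield>
    <subfield code="y">1234-5678</subfield>
  </datafield>
  <datafield tag="245" ind1="0" ind2="0">
    <subfield code="a">A title, not an identifier</subfield>
  </datafield>
</record>
"""


def extract_identifiers(marcxml):
    """Return the subfield a values of the 010, 020, and 022 datafields."""
    root = ET.fromstring(marcxml)
    values = []
    for field in root.iter(f"{{{MARC_NS}}}datafield"):
        if field.get("tag") not in IDENTIFIER_TAGS:
            continue  # skip everything that is not an identifier field
        for sub in field.findall(f"{{{MARC_NS}}}subfield[@code='a']"):
            values.append((sub.text or "").strip())
    return values


print(extract_identifiers(SAMPLE))  # ['0306406152', '0028-0836']
```

Note that the 022 subfield y and the 245 title are both left out, which is exactly the point of the narrower index.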
Next, we need to restart the open-ils.storage and open-ils.ingest services to make them aware of this new entry. Go ahead, I'll wait while you run osrf_ctl.sh -a restart_perl or use opensrf-perl.pl to restart the services individually. Done? Good.
We have to make up for lost time, now, as all of the bibliographic records in your system didn't have this definition in place when they were first ingested. The easiest thing to do is to just pull the pertinent data directly from the metabib.full_rec view (which is a shredded version of the source MARCXML from your bibliographic records, with one tag/subfield value per row). Ergo:
-- Get the ID from the row that you just inserted for the new index;
-- we'll use this in the INSERT statement
SELECT id FROM config.metabib_field
  WHERE field_class = 'keyword' AND name = 'identifier';

-- Let's say the ID was 18; we'll use that to identify the index in the SELECT statement
INSERT INTO metabib.keyword_field_entry (field, source, value)
  SELECT 18, record, agg_text(value)
    FROM metabib.full_rec
    WHERE tag IN ('010', '020', '022') AND subfield = 'a'
    GROUP BY 1, 2;
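To make the shape of that backfill concrete, here is a small Python sketch of the same filter-and-group step run against a handful of invented rows standing in for metabib.full_rec. It assumes that agg_text joins the grouped values with a single space; I have not checked that against the aggregate's definition, so treat the separator as a guess. The function name backfill_rows is hypothetical, and 18 is just the example ID from above.

```python
from collections import defaultdict

IDENTIFIER_TAGS = {"010", "020", "022"}

# (record, tag, subfield, value) rows standing in for metabib.full_rec.
# All of the values are invented sample data.
FULL_REC = [
    (1, "020", "a", "0306406152"),
    (1, "020", "a", "9780306406157"),
    (1, "245", "a", "some title"),
    (2, "022", "a", "0028-0836"),
    (2, "022", "y", "1234-5678"),
]


def backfill_rows(full_rec, field_id):
    """Build (field, source, value) rows, one per bibliographic record."""
    grouped = defaultdict(list)
    for record, tag, subfield, value in full_rec:
        if tag in IDENTIFIER_TAGS and subfield == "a":
            grouped[record].append(value)
    # Assumption: agg_text behaves like a space-separated concatenation.
    return [
        (field_id, record, " ".join(values))
        for record, values in sorted(grouped.items())
    ]


for row in backfill_rows(FULL_REC, 18):
    print(row)
# (18, 1, '0306406152 9780306406157')
# (18, 2, '0028-0836')
```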
All right! Now you can run some test searches in the OPAC for ISSNs, ISBNs, and LCCNs using the keyword|identifier: some_identifier prefix. Cool. So that's part one, mostly lifted from the "magic spell" in the Evergreen wiki.
Part two is configuring SRU to use the new identifier index. The bulk of the Evergreen SRU implementation is contained in the Perl module OpenILS::WWW::SuperCat (located in your install directory at /openils/lib/perl5/OpenILS/WWW/SuperCat.pm). Get out your patch tool or open up the Perl module in a text editor; we're going to make a few changes. The pertinent diff follows:
--- old/OpenILS/WWW/SuperCat.pm 2010-03-09 17:26:20.000000000 -0500
+++ new/OpenILS/WWW/SuperCat.pm 2010-03-10 00:11:58.000000000 -0500
@@ -1410,6 +1410,7 @@
  'bib.titlealternative' => 'title',
  'bib.titleseries' => 'series',
  'eg.series' => 'title',
+ 'eg.identifier' => 'keyword|identifier',

  # Author/Name class:
  'eg.author' => 'author',
@@ -1438,7 +1439,7 @@
  'srw.serverchoice' => 'keyword',

  # Identifiers:
- 'dc.identifier' => 'keyword',
+ 'dc.identifier' => 'keyword|identifier',

  # Dates:
  'bib.dateissued' => undef,
@@ -1497,6 +1498,7 @@
  subject => ['subject'],
  keyword => ['keyword'],
  series => ['series'],
+ identifier => ['keyword|identifier'],
  },
  dc => {
  title => ['title'],
@@ -1504,7 +1506,7 @@
  contributor => ['author'],
  publisher => ['keyword'],
  subject => ['subject'],
- identifier => ['keyword'],
+ identifier => ['keyword|identifier'],
  type => [undef],
  format => [undef],
  language => ['lang'],
Essentially, we've defined a new qualifier (eg.identifier) and pointed both it and the existing dc.identifier qualifier at the new, more specific keyword|identifier index. Once the updated file is in place, reload your Apache configuration (/etc/init.d/apache reload) and SRU requests using those qualifiers will search the identifier index. FABULOUS.
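A quick way to check the new qualifier is to hand-build an SRU request and paste it into a browser. The sketch below, in Python, assembles a searchRetrieve URL with the standard SRU parameters (version, operation, query). The base URL is an assumption on my part: localhost and the FOOBAR org unit short name are placeholders, and the /opac/extras/sru path should be adjusted to match your own server. The sru_url function is a hypothetical helper, not part of Evergreen.

```python
from urllib.parse import urlencode

# Placeholder endpoint: host, path, and FOOBAR are assumptions to adjust.
SRU_BASE = "http://localhost/opac/extras/sru/FOOBAR/holdings"


def sru_url(base, identifier):
    """Build an SRU searchRetrieve URL that queries eg.identifier."""
    params = {
        "version": "1.1",
        "operation": "searchRetrieve",
        # CQL: qualifier, relation, quoted term
        "query": f'eg.identifier="{identifier}"',
    }
    return f"{base}?{urlencode(params)}"


print(sru_url(SRU_BASE, "0028-0836"))
```

Swapping eg.identifier for dc.identifier in the query should hit the same index once the patch is in place.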
Our last step is to teach our simple2zoom-based Z39.50 configuration about the new index by mapping the corresponding BIB-1 attributes to the new eg.identifier qualifier, like so:
<database name="FOOBAR">
  <zurl>http://localhost/opac/extras/sru/FOOBAR/holdings</zurl>
  <option name="sru">get</option>
  <charset>marc-8</charset>
  <search>
    <querytype>cql</querytype>
    <map use="4"><index>eg.title</index></map>
    <map use="7"><index>eg.identifier</index></map>
    <map use="8"><index>eg.identifier</index></map>
    <map use="9"><index>eg.identifier</index></map>
    <map use="21"><index>eg.subject</index></map>
    <map use="1003"><index>eg.creator</index></map>
    <map use="1018"><index>eg.publisher</index></map>
    <map use="1035"><index>eg.keyword</index></map>
    <map use="1016"><index>eg.keyword</index></map>
  </search>
</database>
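In BIB-1, use attributes 7, 8, and 9 are ISBN, ISSN, and LC card number respectively, which is why all three point at eg.identifier. To illustrate what the mapping accomplishes, here is a Python sketch that reads the map elements from a trimmed copy of the configuration and turns a use attribute plus a term into a CQL query. This is only my mental model of the translation, with hypothetical function names; it is not how simple2zoom is implemented internally.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the <database> configuration, for illustration only.
CONFIG = """
<database name="FOOBAR">
  <search>
    <querytype>cql</querytype>
    <map use="4"><index>eg.title</index></map>
    <map use="7"><index>eg.identifier</index></map>
    <map use="8"><index>eg.identifier</index></map>
    <map use="9"><index>eg.identifier</index></map>
  </search>
</database>
"""


def load_use_map(config_xml):
    """Map each BIB-1 use attribute number to its CQL index name."""
    root = ET.fromstring(config_xml)
    return {int(m.get("use")): m.findtext("index") for m in root.iter("map")}


def to_cql(use_map, use_attr, term):
    """Translate a (use attribute, term) pair into a CQL query string."""
    return f'{use_map[use_attr]}="{term}"'


use_map = load_use_map(CONFIG)
print(to_cql(use_map, 8, "0028-0836"))   # eg.identifier="0028-0836"
print(to_cql(use_map, 4, "canada"))      # eg.title="canada"
```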
Kill your simple2zoom processes and restart simple2zoom and you should be in heaven - farewell, false positive matches! Oh, and about that SFX target parser for Evergreen; now you can remove all of the gimmickry around exact searches and worrying about ISSNs that contain an 'X' and just point at the identifier index. For example:
if (defined($ISSN)) {
    $searchString .= "keyword|identifier: $ISSN";
} elsif (defined($ISBN)) {
    $ISBN =~ s/-//g; # Most of our ISBNs are normalized to no hyphens
    $searchString .= "keyword|identifier: $ISBN";
}
Things still aren't perfect in Evergreen identifier-land: we still need to do some work to normalize hyphenation of our ISBNs, for example, and ensure we have 10-digit & 13-digit ISBN equivalents. But we're a lot closer to perfection now - and with the work that Mike Rylander is doing in trunk, normalization of that kind should be relatively straightforward to implement on both the indexing and query-parsing side.
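For the curious, deriving the 13-digit equivalent of a 10-digit ISBN is mechanical: drop the old check digit, prefix 978, and compute a new check digit using alternating weights of 1 and 3. Here is a sketch in Python of that standard calculation. It is not anything Evergreen does today, the function name is hypothetical, and it does not validate the incoming ISBN-10 check digit.

```python
def isbn10_to_isbn13(isbn10):
    """Convert a 10-digit ISBN to its 978-prefixed 13-digit equivalent."""
    chars = isbn10.replace("-", "").replace(" ", "")
    if len(chars) != 10:
        raise ValueError(f"expected 10 characters, got {len(chars)}")
    # The old check digit (possibly 'X') is discarded, not converted.
    core = "978" + chars[:9]
    if not core.isdigit():
        raise ValueError("first nine characters must be digits")
    # Weights alternate 1, 3, 1, 3, ... across the first twelve digits.
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(core))
    check = (10 - total % 10) % 10
    return core + str(check)


print(isbn10_to_isbn13("0-306-40615-2"))  # 9780306406157
print(isbn10_to_isbn13("080442957X"))     # 9780804429573
```

Indexing both forms for every record would let a search on either one match, whichever form the cataloguer happened to enter.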