Doing useful things with the TXT dump of SFX holdings, part 1: database

Posted on Tue 10 November 2009 in Libraries

There must be other people who have much more intelligent things than me with the TXT dump of SFX holdings that you can generate via the Web administration interface, but as I've gone through this process at least twice and rediscovered it each time, perhaps I'll save myself an hour or two by writing this down. Maybe it will save you a bit of time, too. Or give you an idea of a day in the life of a systems librarian. In part 1, I turn the TXT dump into something that I can load back into a local database. Maybe I want to massage the data, write arbitrary SQL queries, join it against data that I load into other tables in the database... I'll talk about that in a subsequent post. For now, let's get munging!

  1. First, I load the raw TXT file into a text editor (gvim, of course) and munge all of the lines so that they are ready to bulk load into my local database. That local database is PostgreSQL, so these commands will be vim-and-PostgreSQL specific, but you should be able to adapt this to whatever your target may be. Note the duplication of the first command is intentional!


  2. Then I prepend the DDL required to create the holdings table and copy the data into the newly created holdings table to the top of the TXT file:

    -- I don't want a huge scrolling list of errors if there's something wrong in the dataSET ON_ERROR_STOP = 1;-- I typically get something wrong the first time, so this avoids having to manually-- drop and re-rcreate the schema the second and subsequent timesDROP SCHEMA IF EXISTS holdings CASCADE;-- This is PostgreSQL, so we can use schemas. Yay!CREATE SCHEMA holdings;-- Note that I have no idea what many of these fields are, and frankly don't careCREATE TABLE holdings.sfx (title_key TEXT, title TEXT, misc1 TEXT, issn TEXT,     ejournal_id TEXT, collection TEXT, coverage TEXT, eissn TEXT, complete_title TEXT,     target TEXT, misc2 TEXT, misc3 TEXT, misc4 TEXT, misc5 TEXT, misc6 TEXT,     misc7 TEXT, misc8 TEXT, target_parser TEXT, target_id BIGINT, service_id BIGINT,     misc_id BIGINT, category TEXT, misc9 TEXT, misc10 TEXT, misc11 TEXT,     publisher TEXT, pub_place TEXT, pub_date TEXT, item_type TEXT, active TEXT,     misc12 TEXT, misc13 TEXT, misc14 TEXT, language TEXT, title2 TEXT, misc15 TEXT);-- The COPY command bulk-loads the data from the subsequent lines into the tableCOPY holdings.sfx (title_key, title, misc1, issn, ejournal_id, collection, coverage, eissn,     complete_title, target, misc2, misc3, misc4, misc5, misc6, misc7, misc8, target_parser,     target_id, service_id, misc_id, category, misc9, misc10, misc11, publisher, pub_place,     pub_date, item_type, active, misc12, misc13, misc14, language, title2, misc15) FROM STDIN;-- All of the TXT data follows

  3. Now I can load this into my database, where "datafile.txt" is the munged TXT dump, and "holdings" is the name of the database:

    psql -f datafile.txt -d holdings

  4. Assuming that all went well, I can connect to the database and create a few indexes that will probably be useful while doing queries.

    -- Create indexes on the ISSN and E-ISSNCREATE INDEX holdings_sfx_issn_idx ON holdings.sfx(issn);CREATE INDEX holdings_sfx_eissn_idx ON holdings.sfx(eissn);-- Create index on the collection, because we'll limit some queries on this basisCREATE INDEX holdings_sfx_collection_idx ON holdings.sfx(collection);-- Update the statistics on the tableANALYZE;

Now I'm ready to load data from another source - say, from my ILS that includes print holdings - so that I can start matching on ISSN and identify candidates for pruning our print collection based on criteria such as the electronic collection. Perhaps I'll talk about that in part 2...