Unicorn to Evergreen migration: rough notes

Posted on Sun 08 February 2009 in Libraries

Updated 2009-02-25 00:29 EST: Corrected setuptools installation step.

Updated 2009-02-08 23:39 EST: Trimmed width of some of the <pre> code sections for better formatting. Created bzr repository for unicorn2evergreen scripts at http://bzr.coffeecode.net/unicorn2evergreen

I did this once a long time ago for the Robertson Library at the University of Prince Edward Island. For our own migration to Evergreen, I have to load a representative sample of records from our Unicorn system onto one of our test servers. This has been a good refresher of the process... and a reminder to myself to post the other part of the Unicorn to Evergreen migration scripts in a publicly available location. Okay, they're posted to the unicorn2evergreen bzr repository mentioned above.

  1. Export bibliographic records from Unicorn using Unicorn's catalog key (basic sequential accession number) as the unique identifier (I plopped the catalog key into the 935a field/subfield combo). I use the catalog key because the "flexkey" is not guaranteed to be unique within a single Unicorn instance - and because the catalog key makes it easy for us to match call numbers and copies.
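
    Before going further, it's worth a quick sanity check that every exported record really carries a 935a and that the values are unique. This is purely an illustrative sketch using pymarc (the "bibs.mrc" file name is made up):

      # check_935a.py - illustrative only; assumes the Unicorn export is in bibs.mrc
      from pymarc import MARCReader

      seen = set()
      for record in MARCReader(open('bibs.mrc', 'rb')):
          field = record['935']
          if field is None or field['a'] is None:
              print 'Missing 935a: %s' % record.title()
          elif field['a'] in seen:
              print 'Duplicate catalog key: %s' % field['a']
          else:
              seen.add(field['a'])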

  2. For each item, export call number / barcode / owning library / current location / home location / item type using the catalog key as the identifier.
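
    For reference, each item comes out as one pipe-delimited line; the values and field order below are invented just to show the shape of the data:

      123456|PS8576 A84 Z63 1998|30007000000000|MAIN|STACKS|STACKS|BOOKS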

  3. Set up the organization unit hierarchy on your Evergreen system. You can dump it from an existing Evergreen system into a file named "orgunits.sql" like so:

    pg_dump -U evergreen --data-only --table actor.org_unit_type \
        --table actor.org_unit > orgunits.sql
    

    Then drop all of the existing org_units and org_unit_types and load your custom data in a psql session:

    BEGIN;
    SET CONSTRAINTS ALL DEFERRED;
    DELETE FROM actor.org_unit;
    DELETE FROM actor.org_unit_type;
    \i orgunits.sql
    COMMIT;
    
  4. Import bibliographic records using the standard marc2bre.pl / direct_ingest.pl / pg_loader.pl process. Point the --idfield / --idsubfield and --tcnfield / --tcnsubfield options for marc2bre.pl at 935a (yes, this sucks for title control numbers, but as noted above they are not guaranteed to be unique in Unicorn and we need uniqueness in Evergreen). We need the bibliographic record entry ID field to be the catalog key to set up subsequent call number/barcode matches.
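
    Roughly, the command sequence looks like this - file names are placeholders, marc2bre.pl may also need database connection options, and pg_loader.pl usually wants the object-type flags appropriate to your Evergreen version, so treat it as a sketch rather than a recipe:

      perl marc2bre.pl --idfield 935 --idsubfield a \
          --tcnfield 935 --tcnsubfield a bibs.mrc > bibs.bre
      perl direct_ingest.pl < bibs.bre > bibs.ingest
      perl pg_loader.pl < bibs.ingest > bibs.sql
      psql -f bibs.sql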

  5. Enable the subsequent addition of new bibliographic records: bump the sequence objects past the imported records so that new IDs and TCN values don't conflict, by issuing the following SQL statements:

    SELECT setval('biblio.autogen_tcn_value_seq',
        (select max(id) from biblio.record_entry) + 100);
    SELECT setval('biblio.record_entry_id_seq',
        (select max(id) from biblio.record_entry) + 100);
    

  6. Process holdings records.

    1. Call numbers might have MARC8-encoded characters, so we need to convert them to UTF-8. In theory, "yaz-iconv -f MARC-8 -t UTF-8 < holdings.lst > holdings_utf8.lst" should do it, but instead it eats linefeeds and creates an unusable file. Ugh. We use a little Python script instead; it requires pymarc, which in turn requires a newer version of setuptools (0.6c5) than Debian Etch's packaged version (0.6c3). So:

      wget http://pypi.python.org/packages/2.4/s/setuptools/setuptools-0.6c9-py2.4.egg
      sudo sh setuptools-0.6c9-py2.4.egg
      sudo easy_install pymarc
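
      The conversion filter itself is tiny. The real marc8_to_utf8.py lives in the unicorn2evergreen repository; it boils down to something like this sketch built on pymarc's marc8_to_unicode() helper:

        # marc8_to_utf8.py (sketch) - read MARC8 lines on stdin, emit UTF-8 on stdout
        import sys
        from pymarc import marc8_to_unicode

        for line in sys.stdin:
            sys.stdout.write(marc8_to_unicode(line.rstrip('\n')).encode('utf-8') + '\n')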
      
    2. Now actually generate the 'holdings_utf8.lst' file.

      cat holdings.lst | python marc8_to_utf8.py > holdings_utf8.lst
      
    3. Adjust parse_unicorn.py to match up the holdings fields (added flexkey to the start). Then parse the holdings_utf8.lst to generate an SQL file (holdings_eg.sql) that we can load into the import staging table.

      python parse_unicorn.py
      

      Note that the holdings data for the item with barcode 30007007751786 didn't process cleanly and won't load. Weird - possibly a corrupt character in the item data? Augh, no - there are flexkeys and call numbers that contain '|' characters (16 occurrences of "|z", 37 of "|b"), and '|' is of course also the character we're using as our delimiter. ARGH. I deleted those lines for now with:

      grep -v '|z' holdings_utf8.lst > holdings_clean.lst
      grep -v '|b' holdings_clean.lst > holdings_clean.lst2
      mv holdings_clean.lst2 holdings_clean.lst
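
      If you want to see how widespread the problem is before deleting anything, a quick count of the matching lines does the trick:

        grep -c '|z' holdings_utf8.lst
        grep -c '|b' holdings_utf8.lst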
      

      Adjust parse_unicorn.py to match the new input name and generate a new holdings_eg.sql.

  7. Create the import staging table:

    psql -f Open-ILS/src/extras/import/import_staging_table.sql
    
  8. Load the items into the import staging table:

    psql -f holdings_eg_clean.sql
    

    We discover that some more of our data sucks - for example, one item ("Research in autism spectrum disorders", HIRC PER-WEB) has a create date of '0', which is not a valid date format, because its "barcode" is actually "1750-9467|21" - yet another embedded '|' throwing the fields out of alignment. For now, grep it out as above and reload.
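
    Something along these lines works (adjust the file names to whatever you used above):

      grep -v '1750-9467' holdings_clean.lst > holdings_clean.lst2
      mv holdings_clean.lst2 holdings_clean.lst
      python parse_unicorn.py
      psql -f holdings_eg_clean.sql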

  9. Investigate possibilities of collapsing unnecessary duplicate item types:

    SELECT item_type, COUNT(item_type) AS item_count
    FROM staging_items
    GROUP BY item_type
    ORDER BY item_type;

     item_type  | item_count
    ------------+------------
     ATLAS      |        162
     AUDIO      |        792
     AUD_VISUAL |       1790
     AV         |         69
     AV-EQUIP   |        182
     BOOK       |        996
     BOOKS      |     581592
     BOOK_ART   |          1
     BOOK_RARE  |       4949
     BOOK_SHWK  |          5
     BOOK_WEB   |      49163
     COMPUTER   |         33
    ...
    (40 rows)
    

    How about locations?

    SELECT location, COUNT(location)
    FROM staging_items
    GROUP BY location
    ORDER BY location;

      location  | count
    ------------+--------
     ALGO-ACH   |     13
     ALGO-ATLAS |    148
     ALGO-AV    |   1837
    ...
    (212 rows)
    

    Now we can collapse categories pretty simply inside the staging table. For example, if we want to collapse all of the BOOK types into a single type of BOOK:

    UPDATE staging_items
    SET item_type = 'BOOK'
    WHERE item_type IN ('BOOKS', 'BOOK_ART', 'BOOK_RARE',
        'BOOK_SHWK', 'BOOK_WEB', 'REF-BOOK');
    

  10. Update legacy library names to the new Evergreen library shortnames (we're using OCLC codes where possible). Some are straightforward old-name-to-new-name mappings. Others require a little more logic based on location plus legacy library name; we're splitting the DESMARAIS collection into multiple org units (Music Resource Centre, Hearst locations, hospital locations, etc.).

    -- Laurentian Music Resource Centre
    UPDATE staging_items
    SET owning_lib = 'LUMUSIC'
    WHERE location = 'DESM-MRC';

    -- Hearst - Kapuskasing location
    UPDATE staging_items
    SET owning_lib = 'KAP'
    WHERE location LIKE 'HRSTK%';

    -- Hearst - Timmins location
    UPDATE staging_items
    SET owning_lib = 'TIMMINS'
    WHERE location LIKE 'HRSTT%';
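
    A quick way to catch any legacy names that haven't been mapped yet is to look for owning_lib values with no matching Evergreen org unit shortname:

      SELECT DISTINCT owning_lib
      FROM staging_items
      WHERE owning_lib NOT IN
          (SELECT shortname FROM actor.org_unit);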
    
  11. Generate the copies in the system:

    psql -f generate_copies.sql
    
  12. Make the metarecords:

    psql -f quick_metarecord_map.sql
    

Ah - I realize that any electronic resources (which don't have associated copies) won't appear in search results. Check for 856 40 and change the bre source to a transcendent one, mayhaps?

-- Create a new transcendant resource;
-- this autogenerates an ID of 4 in a default, untouched system
INSERT INTO config.bib_source (quality, source, transcendant)
VALUES (10, 'Electronic resource', 't');

-- Make the electronic full text resources (856 40) transcendant
-- by setting their bib record source to the new bib_source value of 4
UPDATE biblio.record_entry
SET source = 4
WHERE id IN (
    SELECT DISTINCT(record)
    FROM metabib.full_rec
    WHERE tag = '856' AND ind1 = '4' AND ind2 = '0'
);
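
A quick way to check whether the update actually touched anything (4 being the bib_source ID created above):

SELECT COUNT(*) FROM biblio.record_entry WHERE source = 4;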

And no transcendence. Hmm. Oh well, worry about that later.