Progress with Project Conifer

Posted on Thu 27 March 2008 in Libraries

Project Conifer is the effort by McMaster University, University of Windsor, and Laurentian University to put together a consortial instance of Evergreen. A few weeks back, we agreed that May 2009 would be our go-live date. So the clock is ticking quite loudly in my ears.

Today I got an Evergreen test server up and running, loaded with the records from Laurentian and its consortial partners. I hit a few bumps in the road, but eventually loaded about 800,000 bibliographic records and about 500,000 items. I also turned on the Syndetics enrichment data, so some items offer cover images, tables of contents, reviews, and author information. The response time is pretty snappy (it's running on a 4-core server with 12GB of RAM).

Things that made my task harder than it probably should have been:

  • yaz-marcdump generated invalid XML when I converted our MARC records from MARC21 to MARC21XML format. Maybe this problem is fixed in later versions of yaz-marcdump (I was using the stable Debian Etch version, 2.1.56, which is crazy old), or I could have tried marc4j or MarcEdit instead for better results, but I didn't, and the invalid XML cascaded into problems with...

  • Dumping all of the holdings as part of the bibliographic records threw things off: some records had so many holdings attached (think of a weekly periodical that a library circulates, so that each issue gets its own barcode) that they spilled over MARC's record-length limit, resulting in multiple MARC records just to hold the holdings - which breaks the basic import process. I eventually punted on trying to parse the MARC21XML for holdings and just dumped the data I needed directly from Unicorn in pipe-delimited format.

  • Not tuning PostgreSQL before starting to load data into the database was just plain stupid. The defaults for PostgreSQL are incredibly conservative and must be modified to handle large transactions and to perform well. Here are the tweaks I made for our 12GB machine, starting with the Linux kernel shared-memory settings:

    # -- in /etc/sysctl.conf --
    # Set SHMMAX to 8GB for PostgreSQL
    kernel.shmmax=8589934592
    

    # -- in /etc/postgresql/8.1/main/postgresql.conf --
    # Crank up shared_buffers and work_mem
    shared_buffers = 10000
    work_mem = 8388608 # 8 GB, equal to our kernel.shmmax
    max_fsm_pages = 200000
    
  • Evergreen depends on accurate fixed fields to determine the format of an item. Unfortunately, many of our electronic resources appear not to have been coded as such... so we have some data clean-up to do.
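
One cheap guard against the yaz-marcdump problem above is to check that each converted record is at least well-formed XML before handing it to the import scripts. A minimal sketch in Python - the sample records here are invented for illustration, and real MARC21XML carries much more structure:

```python
import xml.etree.ElementTree as ET

def well_formed(xml_bytes):
    """Return True if the byte string parses as well-formed XML."""
    try:
        ET.fromstring(xml_bytes)
        return True
    except ET.ParseError:
        return False

# Invented samples: a clean record, and one containing a raw MARC
# record terminator (0x1D), a control character that is illegal in
# XML 1.0 and will break any downstream parser.
good = b"<record><leader>00000nam a2200000 a 4500</leader></record>"
bad = b"<record><leader>00000nam a2200000 a 4500</leader>\x1d</record>"
```

Running a filter like this over each conversion run would flag the bad records before the batch load starts, instead of partway through.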
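
Once the holdings come out of Unicorn as pipe-delimited text, turning each line into a usable record is pleasantly boring. A sketch in Python, with an invented column layout - the fields in a real Unicorn export will differ:

```python
import csv
from io import StringIO

# Hypothetical column layout for the item dump; adjust to match
# the fields actually exported from Unicorn.
FIELDS = ["catalog_key", "barcode", "call_number", "location"]

def parse_items(text):
    """Parse a pipe-delimited item dump into a list of dicts."""
    reader = csv.reader(StringIO(text), delimiter="|")
    return [dict(zip(FIELDS, row)) for row in reader]

sample = "12345|39999000123456|PS8576 .U57 2008|STACKS\n"
items = parse_items(sample)
```

Compare that to wrestling oversized holdings records out of MARC21XML, and punting starts to look like the right call.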
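
As a sanity check on the numbers in those config snippets - assuming PostgreSQL 8.1's conventions, where shared_buffers is a count of 8 KB buffers and work_mem is in kilobytes, while kernel.shmmax is in bytes:

```python
KB = 1024

# Values from the config snippets above.
shared_buffers = 10000   # count of 8 KB buffers
work_mem = 8388608       # kilobytes
shmmax = 8589934592      # bytes

shared_buffers_bytes = shared_buffers * 8 * KB  # about 80 MB of shared cache
work_mem_bytes = work_mem * KB                  # 8 GB per sort/hash operation

assert work_mem_bytes == shmmax                 # matches kernel.shmmax exactly
```

Keeping the units straight matters here: the three settings use three different units, and mixing them up is an easy way to ask for a thousand times more (or less) memory than intended.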

Ah well: as Jerry Pournelle used to say in his Chaos Manor column, "I do these things so that you don't have to." Hopefully it makes a smoother path for others to get to Evergreen.