I'm the proud parent of two new releases over the past couple of days: one official PEAR release for linked list fans, and another revision of the File_MARC proposal for library geeks.
On the library geek front, I pushed out File_MARC 0.0.9 via the PEAR Proposal process today. This new release repairs another embarrassing problem, the one I originally blamed for the code breaking down during our Hackfest work. You see, I hadn't touched emilda.org's php-marc core routine for parsing MARC files, and it happened to call file() to read the entire target MARC file into memory as an array of lines before letting you start parsing the individual MARC records. That's nifty if you just want to count all of the MARC records in a given file, but it doesn't scale up very well when you've brought, oh, a single file with a half-million MARC records to parse. In fact, PHP gets very upset with you.
The solution, as Dan Chudnov suggested on the fly during my Hackfest interview, was to go with streams. It turns out that stream_get_line() was perfectly suited to the task: given a file pointer, it sucks in the contents of that file until it reaches a maximum length or a given string, then waits until you ask it to suck in the next chunk of data. It was a breeze to convert the code to the following approach:
const END_OF_RECORD = "\x1D";
const MAX_RECORD_LENGTH = 99999;
...
$this->source = fopen($source, 'rb');
...
// Read from the opened stream handle, not from the original file name
$record = stream_get_line($this->source, self::MAX_RECORD_LENGTH, self::END_OF_RECORD);
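If you want to try the streaming approach outside of File_MARC, here is a minimal standalone sketch. It is not the package's actual code: the function name readMarcRecords() and the generator-based design are my own assumptions for illustration, and it needs PHP 7 or later. It only shows the idea of pulling one record at a time from the stream rather than loading the whole file.

```php
<?php
// Standalone sketch, NOT the File_MARC implementation.
// readMarcRecords() is a hypothetical helper name used for illustration.

const END_OF_RECORD = "\x1D";     // MARC record terminator
const MAX_RECORD_LENGTH = 99999;  // largest length a MARC leader can declare

/**
 * Yield raw MARC records one at a time, so that only the current
 * record is held in memory instead of the whole file.
 */
function readMarcRecords(string $path): Generator
{
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        throw new RuntimeException("Unable to open $path");
    }

    try {
        // stream_get_line() returns the data up to (not including) the
        // delimiter, and false once there is nothing left to read.
        while (($record = stream_get_line($handle, MAX_RECORD_LENGTH, END_OF_RECORD)) !== false) {
            if ($record !== '') {
                yield $record;
            }
        }
    } finally {
        fclose($handle);
    }
}
```

Looping with foreach (readMarcRecords('big.mrc') as $raw) { ... } then handles each raw record as it arrives; 'big.mrc' is just a placeholder file name. Note that the terminator itself is stripped from each record, so you would need to append it again before handing the data to a parser that expects it.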
That change solved the "big file" problem, but as File_MARC represents MARC records as linked lists (fields) containing linked lists (subfields), the big file issue was just covering up a slightly more twisted memory management issue in the Structures_LinkedList library. However, after those two changes, testing out the same code I had hastily written at Hackfest shows that the script to parse a 512M MARC file now never takes more than 0.8% of my system memory.
So, library geeks -- this is a last call for significant comments on the File_MARC API. In a couple of days, I plan to put this proposal up for a vote to become an official PEAR package. Of course, if you want to test it out right now, I have high confidence in the code: you can grab it from marc.coffeecode.net. And yes, if you visit that site, I am grasping for the worst throwback HTML design award ever, thank you very much!
Update 2006-10-19: Correct XHTML syntax errors. Heh.