Double-barreled PHP releases
Posted on Wed 18 October 2006 in PHP
I'm the proud parent of two new releases over the past couple of days: one official PEAR release for linked list fans, and another revision of the File_MARC proposal for library geeks.
Structures_LinkedList
A few days ago marked the first official PEAR release of Structures_LinkedList. Yes, it's only at 0.1.0-alpha, but I'm pretty damned happy with the code at this stage, and unless something drastic happens the only significant change I foresee between now and 1.0-stable is the addition of some user-oriented documentation. This code got a severe workout at the Access 2006 Hackfest, where I ran headlong into some significant limitations in parsing huge files.
A few days later, after misdirecting some precious #php.pecl brainpower (sorry, sorry, sorry Wez, Ilia, and Tony) on the wrong problem, I discovered why writing your own __destruct() methods can be very, very necessary. If you don't clean up variables that PHP doesn't know how to deal with--say, nodes in a doubly-linked list that look like circular reference hell to PHP--then you're going to be in for a world of hurt for anything but the smallest of test scenarios. This particular problem has had a stake put through its heart in Structures_LinkedList as of the 0.1.0-alpha release. Go forth and create linked lists!
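To make the problem concrete, here's a minimal sketch (illustrative class names, not the actual Structures_LinkedList code) of a doubly-linked list whose __destruct() breaks the node-to-node links so that PHP 5's reference counting can actually reclaim the nodes:

    <?php
    // Each node points at its neighbours; $prev and $next keep each
    // other's reference counts above zero, so PHP 5 never frees the
    // nodes unless the cycle is broken explicitly.
    class ListNode {
        public $data;
        public $prev = null;
        public $next = null;
        function __construct($data) { $this->data = $data; }
    }

    class LinkedList {
        protected $head = null;
        protected $tail = null;

        function append($data) {
            $node = new ListNode($data);
            if ($this->tail) {
                $this->tail->next = $node;
                $node->prev = $this->tail;
            } else {
                $this->head = $node;
            }
            $this->tail = $node;
        }

        // Walk the list and null out the links so every node's
        // reference count can drop to zero when the list is destroyed.
        function __destruct() {
            $node = $this->head;
            while ($node) {
                $next = $node->next;
                $node->prev = null;
                $node->next = null;
                $node = $next;
            }
            $this->head = $this->tail = null;
        }
    }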
File_MARC
On the library geek front, I pushed out File_MARC 0.0.9 via the PEAR Proposal process today. This new release repairs another embarrassing problem that I originally blamed for the breakdowns during our Hackfest work. You see, I hadn't touched emilda.org's php-marc core routine for parsing MARC files, and it happened to call file() to read the entire target MARC file into memory as an array of lines before enabling you to start parsing the individual MARC records. That's nifty if you just want to count all of the MARC records in a given file, but it doesn't scale up very well when you've brought, oh, a single file with a half-million MARC records to parse. In fact, PHP kind of gets very upset with you.
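In rough outline (this is an illustration of the approach, not the original php-marc code, and the file name is made up), the old style of reading looked like this:

    <?php
    // Memory-hungry approach: file() slurps the entire MARC file into
    // an array of lines before a single record gets parsed. With a
    // half-million records, that array alone can blow past memory_limit.
    $lines = file('records.mrc');
    $raw = implode('', $lines);
    // MARC records end with the 0x1D record terminator
    foreach (explode("\x1D", $raw) as $record) {
        // ... parse $record ...
    }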
The solution, as Dan Chudnov suggested on the fly during my Hackfest interview, was to go with streams. It turns out that stream_get_line() was perfectly suited to the task: given a file pointer, it sucks in the contents of that file until it reaches a maximum length or a given string, then waits until you ask it to suck in the next chunk of data. It was a breeze to convert the code to the following approach:
    // class constants: MARC record terminator and maximum record length
    const END_OF_RECORD = "\x1D";
    const MAX_RECORD_LENGTH = 99999;
    ...
    // open the source MARC file as a binary stream
    $this->source = fopen($source, 'rb');
    ...
    // read one record at a time, up to the record terminator
    $record = stream_get_line($this->source, self::MAX_RECORD_LENGTH, self::END_OF_RECORD);
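Put to work in a loop (again a sketch with made-up names rather than the real File_MARC internals), reading a file one record at a time looks something like this, and only ever holds a single record in memory:

    <?php
    // Record-at-a-time reading with stream_get_line(): at most one
    // MARC record (up to 99,999 bytes) is in memory at any moment.
    define('END_OF_RECORD', "\x1D");
    define('MAX_RECORD_LENGTH', 99999);

    $fp = fopen('records.mrc', 'rb');
    while (($raw = stream_get_line($fp, MAX_RECORD_LENGTH, END_OF_RECORD)) !== false) {
        // ... hand $raw off to the MARC record parser ...
    }
    fclose($fp);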
That change solved the "big file" problem, but as File_MARC represents MARC records as linked lists (fields) containing linked lists (subfields), the big file issue had just been covering up the slightly more twisted memory management issue in the Structures_LinkedList library. After those two changes, though, testing the same code I had hastily written at Hackfest shows that the script parsing a 512 MB MARC file now never takes more than 0.8% of my system memory.
So, library geeks -- this is a last call for significant comments on the File_MARC API. In a couple of days, I plan to put this proposal up for a vote to become an official PEAR package. Of course, if you want to test it out right now, I have high confidence in the code: you can grab it from marc.coffeecode.net. And yes, if you visit that site, I am grasping for the worst throwback HTML design award ever, thank you very much!
Update 2006-10-19: Correct XHTML syntax errors. Heh.