First Go program: converting Google Scholar XML holdings to EBSCO Discovery Service holdings
Posted on Mon 11 June 2012 in Libraries
Update 2012-06-19: And here's how to implement stream-oriented XML parsing
Many academic libraries are already generating electronic resource holdings summaries in the Google Scholar XML holdings format, and it seems to provide most of the metadata you would need for a discovery layer summary in a nice, granular format... but unfortunately EBSCO doesn't want that for their Discovery Service. They want a tab-delimited file with just a few fields, and they don't want to fetch the Google Scholar XML holdings file and parse it themselves--even though that would seem to be a nice way to avoid having to teach each potential library client how to export holdings in EBSCO's own desired input format. Lots of libraries don't expose their Google Scholar XML holdings publicly for some reason; I don't get why not, but I have to investigate locally...
That gave me the excuse and opportunity to go off and invest some time in learning something new. I could have whipped up a script in Perl or Python or PHP, or written an XSL transform, but I opted to try out Go for this task. I've been introduced to Go twice in the past two years and was impressed by the language, but there's only so much you absorb in a one-hour workshop, and unless you need to get real work done, you never really learn a programming language.
Thus, I present to you my first crappy Go program: eds_holdings.go. As I wrote this, I noted:
- I appreciate the clear reference documentation at http://golang.org but it would really benefit from more inline examples. I ended up writing the XML parsing portion of this using xml.Unmarshal primarily because there was an example for it. Of course, Unmarshal parses the entire document into structures in memory at once... I knew that wasn't what I wanted, but for whatever reason I didn't find any mention of "SAX" or "event" or "pull" that would lead me to a stream-oriented, low-memory parsing option on the page.
- However, the #go-nuts IRC channel on Freenode gave me a reply within minutes, pointing me at xml.Decoder for that purpose. Which is really great - a supportive community is critical. My only problem is that without a simple example to follow, I didn't want to dive into rewriting my quasi-working code, so I ploughed onward.
- My current approach to I/O is far from optimal. Not only am I parsing the entire XML structure in memory, I'm also reading the entire XML file into memory to begin with (via ioutil.ReadFile, a natural outgrowth of my "begin by parsing a hard-coded string" approach). Don't follow my example! Please point me at a better example.
- The standards and consistency wonk in me likes that Go delegates whitespace wars to go fmt <filename>. Problem solved and time-wasting arguments averted for everyone.
- Go is (subjectively) fast and for all of my worrying about in-memory work, it was pretty lean -- at its maximum, consuming less than 200 MB of RAM while iterating over a 32 MB XML file. For comparison, Firefox was consuming around 750 MB the entire time.
In summary: I enjoyed writing in Go, and hope to find an excuse to do it again. Also, I have much to learn!