Running libraries on PostgreSQL: PGCon 2012 talk

Posted on Sun 20 May 2012 in Libraries

On Friday, May 18th I gave a talk at the PGCon 2012 conference on the use of PostgreSQL by the Evergreen project. My talk fell in the case study track, which meant that I had been asked to describe to PostgreSQL developers what Evergreen was, why it was a project they might want to care about, enumerate the advantages that Evergreen gets from using PostgreSQL, and where our project has some difficulties with PostgreSQL.

I have given a lot of talks before, but I’m used to being on the developer side of the discussion. In this case, the tables were turned; with noted PostgreSQL contributors like Josh Berkus, Chris Brown, Simon Riggs, and Robert Treat in the audience, I was a user talking to the developers of something that I was very much dependent on and which I understood at a much more basic level than they did. This was both liberating and humbling; it definitely adds some perspective to my experiences as a developer in the Evergreen project.

Along with my slides, the whole talk has been professionally recorded - both video and audio - thanks to Heroku’s sponsorship, so you will be able to relive each and every word if you really want to. I’ll summarize the main points that I wanted to convey to the PostgreSQL developers:

I was quite candid that most libraries can’t afford dedicated database administrators, and that therefore the more that PostgreSQL can provide reasonable out-of-the-box configuration settings, the better. For example, results from the survey that I sent out at the last minute (THANK YOU to the nine sites that responded!) showed many sites running with a default statistics target of 50, whereas the default had been increased to 100 back in PostgreSQL 8.1 and much higher settings are often recommended to help the planner make its decisions. That said, my survey didn’t ask for table-level statistics settings (did you know that you could change the statistics for particular tables?), so perhaps some sites are using higher statistics levels for particular tables and a lower default threshold.
It was probably hokey, but I noted that as libraries are often called the heart of their community, that PostgreSQL was effectively the heart of Evergreen — and I invited the PostgreSQL community to help our heart beat faster. With the Evergreen Oversight Board contemplating a strategic investment fund for initiatives that will have a long-term benefit to Evergreen, this might be an avenue for getting PostgreSQL experts to assist us on areas that represent particular bottlenecks (beyond helping us out of the goodness of their own hearts). As well, I invited the PostgreSQL community to join in advocacy efforts to get their local libraries to consider adopting Evergreen.
I described, at a high-level, many of the PostgreSQL features that Evergreen relies on (full-text search, stored procedures, Hstore, inheritance) and tried to convey why our schema takes up 355 tables (and counting) to deal with what, from outside a library perspective, must seem like a relatively simple problem to deal with. And of course I gave most of the credit for Evergreen’s PostgreSQL-savviness on multiple levels to Mike Rylander.

The talk was well-received, based on a number of people who approached me afterward to continue the discussion. Josh called it one of the first times he had seen a presentation designed to solicit assistance directly from the developers in attendance (I probably overplayed the "help us poor harried library system administrators" hand) and thought that it hit the mark for a case study; similarly, Simon was quite interested in Evergreen’s adoption patterns with (I suspect) an eye towards offering possible consulting in administration and optimization efforts.

On the "immediate takeaways" from that talk:

For straightforward connection pooling, pgbouncer is the current recommendation over the more flexible but more complicated pgpool-II.
Recent versions of Slony have lifted limitations that bit us in the past, like the inability to replicate a TRUNCATE command.
Solr, as a potential alternative to PostgreSQL’s full-text search, is seen as fast but brittle to manage, and adds in overhead to maintain consistency with the contents of the database. (I’m not so sure about the brittleness, given Hathitrust’s ability to run a massive Solr index, but it is worth following up on…)
Streaming replication in 9.1 has improved significantly over 9.0, although you’ll still want to have WAL archiving in case of disaster.

I have a lot more to say about the intersection of the PostgreSQL and Evergreen communities in general, but on the whole I think that a closer relationship has been long overdue. I was delighted that Ben Shum and Robin Isard were both able to attend the conference, and I firmly believe that building more PostgreSQL development and administration expertise within the Evergreen community is critical to our long-term success. While I have long been an advocate of pointing community members to the documentation of the underlying infrastructure components for specific administration recommendations, I believe that effective PostgreSQL tuning and administration is so critical to the successful implementation of a production Evergreen site that we should add a section to the Evergreen documentation containing a small set of considerations and/or processes for going into production—and I hope to start that relatively soon.