Sunday, August 11. 2013
On August 10, 2013, I gave the following talk at the PyCon Canada 2013 conference:
I’m a systems librarian at Laurentian University.
For the past six years, my day job and research have enabled me to contribute pretty heavily to Evergreen, an open source library system written largely in Perl and built on PostgreSQL.
But when I have the opportunity to create a project from scratch, for work or play, Python is my go-to language.
Last summer, the Laurentian University Library digitized 50 years of student newspapers - over 1,000 issues.
We posted them all to the Internet Archive and got OCR’d text as a result.
But finding things within a particular collection on the Internet Archive can be difficult for most people, so we felt the need to create a local search solution.
We were already using PostgreSQL to track all of the individual issues, with attributes like newspaper name, volume, edition, publication date.
This gave us the ability to filter through the issues by year and issue number, which was a good start. But we also wanted to be able to search the full text for strings like “Pierre Trudeau” or “Mike Harris”.
A common approach is to feed the data into a dedicated search engine like Solr or Sphinx, and build a search interface on top of that.
But PostgreSQL has featured full-text search support since the 8.3 release in 2008. So I opted to keep the moving parts to a minimum and reuse my existing database as my search platform.
Our example table contains a “doc” column of type TEXT; that’s where we store the text that we want to search and display. We also have a “tsv” column of type TSVECTOR to store the normalized text.
The TSVECTOR column is typically maintained by a trigger that fires whenever the corresponding row is created or updated. So... you just insert TEXT into the doc column, and the trigger maintains the tsv column for you.
PostgreSQL includes tsvector_update_trigger() for your convenience, as well as tsvector_update_trigger_column(), which chooses the language-oriented normalization based on the value of a specified column. Naturally, you can also define your own trigger that invokes the to_tsvector() function directly.
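Put together, the table and trigger described above might look something like this sketch (the table name "docs" and the 'english' configuration are my assumptions for illustration; the "doc" and "tsv" columns match the example):

```
CREATE TABLE docs (
    id   SERIAL PRIMARY KEY,
    doc  TEXT,
    tsv  TSVECTOR
);

-- Keep tsv in sync with doc on every INSERT or UPDATE
CREATE TRIGGER docs_tsv_update
    BEFORE INSERT OR UPDATE ON docs
    FOR EACH ROW
    EXECUTE PROCEDURE tsvector_update_trigger(tsv, 'pg_catalog.english', doc);
```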
To provide good performance, as with any relational table, you need to define appropriate indexes. In the case of full-text search, you want to define a GIN (or GiST) index on the TSVECTOR column.
Note: GIN indexes take longer to build, but provide faster lookup than GiST.
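As a sketch, assuming a table named "docs" with a TSVECTOR column named "tsv", the index definition is a one-liner:

```
-- GIN: slower to build, but faster to search than GiST
CREATE INDEX docs_tsv_idx ON docs USING GIN (tsv);
```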
Finally, we INSERT the text documents into the database. The ID and TSVECTOR columns are automatically generated for us.
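With the trigger in place, the INSERT only needs to supply the text itself (again assuming a hypothetical "docs" table):

```
-- Only doc is supplied; id comes from the sequence,
-- and the trigger populates tsv automatically
INSERT INTO docs (doc)
    VALUES ('sketching the trees');
```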
PostgreSQL bundles a number of language-specific dictionaries to support different stemming algorithms and default sets of stopwords.
In this example, we can see that PostgreSQL has stemmed “sketching” to remove the verb suffix, removed “the” altogether as a stop word, and stemmed “trees” to its singular form.
You can also see that the TSVECTOR stores the position where each token occurs, to support relevancy ranking algorithms that boost the relevancy of terms that are located closely together.
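You can see the normalization described above directly by calling to_tsvector() yourself:

```
SELECT to_tsvector('english', 'sketching the trees');
-- Result: 'sketch':1 'tree':3
```

The stop word "the" vanishes entirely, while "sketching" and "trees" are stemmed and keep their positions (1 and 3).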
But it's a doozy of a query!
Yes, there is a lot going on here.
First, this is just a regular SQL statement that happens to use the WITH clause to define named subqueries (“q” and “ranked”).
The to_tsquery() function takes an incoming full-text search query and converts that into a parsed, normalized query.
The ts_rank_cd() function compares the TSVECTOR column against the query to determine its relevancy score.
We need to restrict it to rows that match our query, so we use the @@ operator (PostgreSQL allows data-type specific operators like this) and then take the top ten.
Note: the query, limit, and offset are hardcoded here for illustrative purposes, but in the actual application these are supplied as parameters to the prepared SQL statement.
Finally, we use the ts_headline() function to give us a highlighted snippet of the results.
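Here is a sketch of a query along the lines described above; the "docs" table and its "doc" and "tsv" columns are assumptions carried over from the example table, and the search terms, limit, and offset are hardcoded as noted:

```
WITH q AS (
    SELECT to_tsquery('english', 'pierre & trudeau') AS query
), ranked AS (
    SELECT docs.id, docs.doc, ts_rank_cd(docs.tsv, q.query) AS rank
      FROM docs, q
     WHERE docs.tsv @@ q.query          -- keep only matching rows
  ORDER BY rank DESC
     LIMIT 10 OFFSET 0                  -- take the top ten
)
SELECT id,
       ts_headline(doc, q.query) AS highlighted,
       rank
  FROM ranked, q
 ORDER BY rank DESC;
```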
The harvester is a Python 3 application that uses the ‘postgresql’ module to connect to the database.
The REST and Web applications in the IA Harvester application are Python 2, largely because they use Flask (which, at the time, was Python 2 only). But in the demo application, I’ve converted them to Python 3.
While I could have simply written the Web and REST applications as a single Flask Web app that talks directly to the database, I opted to couple them via a JSON interface for a few reasons:
I can change from Flask to another Web application framework on either piece.
I can separate the hosts if I need to throw more hardware at the service, and/or virtualize it on something like Google App Engine.
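To sketch what that JSON hand-off looks like (the payload shape here is entirely hypothetical, not the application's actual schema), the REST layer serializes its search results and the Web front end decodes them on the other side:

```python
import json

# Hypothetical search-result payload produced by the REST layer
results = {
    "query": "pierre trudeau",
    "total": 2,
    "hits": [
        {"id": 101, "snippet": "... <b>Pierre</b> <b>Trudeau</b> visited campus ..."},
        {"id": 245, "snippet": "... an interview with <b>Trudeau</b> ..."},
    ],
}

# The REST app serializes the results to JSON over HTTP...
payload = json.dumps(results)

# ...and the Web app, possibly running on another host or
# framework entirely, decodes it again
decoded = json.loads(payload)
```

Because the only contract between the two pieces is this JSON document, either side can be swapped out or moved to different hardware without touching the other.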
With so many Python web frameworks available, why Flask?
At the time I opted to try out Flask, the project’s public stance towards Python 3 was not warm. However, with the 0.10 release in June 2013, all that has (thankfully!) changed; Python 3.3 is supported.
We’ve already seen the decorator-route pattern before, and of course we need to quote, encode, and decode our search URL and results.
The Flask-specific parts are helper methods for getting GET params from the query string, and rendering the template (in this case, via Jinja2), by passing in values for the template placeholders.
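The quoting and decoding itself is plain standard-library work; here is a minimal sketch using urllib.parse (the parameter names "q" and "page" are illustrative, not the application's actual parameters):

```python
from urllib.parse import quote_plus, urlencode, parse_qs

# Build the query string the search UI sends
params = {"q": "pierre trudeau", "page": "2"}
query_string = urlencode(params)   # 'q=pierre+trudeau&page=2'

# On the server side, decode the GET parameters back out;
# parse_qs returns a list of values for each parameter name
decoded = parse_qs(query_string)

# quote_plus() handles a single value, e.g. when embedding
# the search term in a link
encoded_term = quote_plus("pierre trudeau")   # 'pierre+trudeau'
```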
At this point, the UI is functional but spartan. I’m a database / code guy, not a designer. Luckily, I have a student working on improving the front end (hi Emily!)
Friday, July 5. 2013
On August 10, 2013, I'll be giving a twenty-minute talk at PyCon Canada on A Flask of full-text search with PostgreSQL. I'm very excited to be talking about Python, at a Python conference, and to be giving the Python audience a peek at PostgreSQL's full-text search capabilities. With a twenty minute slot, I'll be leaning on my code4lib experience to compress the right amount of technical information into an entertaining package.
Setting aside my talk, the line-up for PyCon Canada looks fantastic; the keynote speakers are Karen Brennan, Hilary Mason, and Jacob Kaplan-Moss, and there are a ton of great talks. Did I mention that I'm really looking forward to this conference?
Update 2013-07-11: Now that the schedule is official, the presentation URL needed to be updated. Also, the impetus for this proposal came straight from PGCon 2013, where the PostgreSQL community was urged to get the good word out about PostgreSQL to other communities. Et voilà!
Friday, May 28. 2010
I'm one of those people who actually keeps different passwords for every site and service I use. So far I'm up to over 400 passwords, so I'm dependent on a password manager. For a long, long time I have used Figaro's Password Manager (FPM) (and KedPM and most recently FPM2 as continuations of FPM), but now that I have an Android smartphone on which I can browse without wanting to die, I've been itching to get access to my passwords on that. I noticed that KeePassDroid was available, and that KeePassX would work on my desktop. I just had to get from FPM's password export format to one of KeePass's import formats. It turns out that nobody had made that particular leap before (or hadn't shared their conversion script).
Thus... I bring you the FPM to KeePass converter. A straightforward Python script licensed under the GPL v3 that does a passable job of converting an FPM XML export to a KeePass 1.x or 2.x XML import file. It worked for me, and that's all that I needed; but maybe it will work for you, too.
Thursday, March 15. 2007
For the past couple of days, I've been playing with Casey Durfee's code that uses Solr and Django to offer a faceted catalogue. My challenge? Turn a dense set of code focused on Dewey and Horizon ILS into a catalogue that speaks LC and Unicorn. Additionally, I want it to serve as both a proof of several technologies (Solr for faceted searching and Django as a Web application framework) to my colleagues and as a reasonable backup catalogue for when our main catalogue fails (as it all too often does).
I emailed Casey today to tell him that I had a number of patches to contribute as a result of my experiments. It turns out that he's not really interested in pursuing this particular project much further, so he gave me his blessing to take his throwaway code and do whatever I want with it. Thus, the emergence of the FacBackOPAC project on code.google.com. If there's a grant out there for worst project name ever, this project's in the running... Anyways, I have contorted Casey's code so that it supports both Dewey and LC, and with a bit more torture it should be flexible enough to support both Horizon and Unicorn.
Right now I've twisted it all the way to meet my Unicorn needs and consequently have broken Horizon support, but it won't take much to make it support Horizon again - or any other ILS, for that matter. The main requirement is that you have to be able to get your MARC records and holdings out of your ILS. A secondary requirement is to know how to create links to detailed item views in your current catalogue, because this thing does not yet have any current awareness about item status.
There. My itch has been scratched for the time being. Go play with the FacBackOPAC project -- I even have (very) rough documentation on how to get the pieces installed and the MARC records indexed, although you'll have to dig through the source in the Django catalog tree to overcome some hardcoded strings and URLs for the time being. Don't worry, pulling that hardcoded stuff out of the templates is high on the list of priorities.
So, a huge thank you to Casey for freeing this code and making this possible. For something he considers throwaway code, I've learned a lot from walking through it and making it start to meet my needs. I hope it helps you, too!
Edited links to point to the FacBackOPAC project page, rather than the wiki (which is subject to change, and which did change -- breaking the dang links in the original version of this story. Argh!)
This work is licensed under a Creative Commons Attribution-Share Alike 2.5 Canada License.