Dear database vendor: defending against sci-hub.org scraping is going to be very difficult
Posted on Wed 10 December 2014 in Libraries
Our library regularly receives formal communications from content and database vendors about "serious intellectual property infringement" that urge us to "pay particular attention to proxy security". Here is part of the response I sent to the most recent such request:
We use the UsageLimit directives that OCLC's EZProxy solution offers to block users who exceed certain thresholds. However, those directives are too coarse to be very useful. For example, you can set a limit on the number of transfers in a given time period, but you can't set different thresholds for different content types (such as CSS, JavaScript, HTML, images, or PDFs). Because an ordinary page view generates many transfers, the general threshold has to be fairly generous. Whoever was using the compromised account had gathered a set of URLs that let them request a series of PDFs directly, thus staying below that general threshold. If EZProxy offered a "transfer threshold by MIME type" directive, we could easily block users who tried to download more than, say, 100 PDFs in an hour.
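To make the idea concrete, here is a minimal Python sketch of what a "transfer threshold by MIME type" check could look like. It is not an EZProxy feature or API; the class name, the limits table, and the one-hour sliding window are all assumptions chosen for illustration.

```python
from collections import defaultdict, deque


class MimeTypeLimiter:
    """Hypothetical per-user, per-MIME-type sliding-window limiter."""

    def __init__(self, limits, window_seconds=3600):
        # limits maps a MIME type to the maximum number of transfers
        # allowed per window; unlisted types are not limited here.
        self.limits = limits
        self.window = window_seconds
        self.events = defaultdict(deque)  # (user, mime_type) -> timestamps

    def allow(self, user, mime_type, now):
        """Return True if this transfer is within the limit, else False."""
        limit = self.limits.get(mime_type)
        if limit is None:
            return True
        recent = self.events[(user, mime_type)]
        # Forget transfers that have aged out of the window.
        while recent and now - recent[0] >= self.window:
            recent.popleft()
        if len(recent) >= limit:
            return False
        recent.append(now)
        return True


# Example: at most 100 PDFs per user per hour, everything else unlimited.
limiter = MimeTypeLimiter({"application/pdf": 100})
```

With a check like this, CSS, JavaScript, and image transfers would not count against the PDF limit, so the PDF threshold could be set much lower than a general transfer threshold.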
We also set UsageLimit directives for total bandwidth consumed. Again, however, we are limited by the coarseness of the directives available to us in EZProxy, as well as by the growing variety of content that electronic resources offer these days. With individual PDFs varying in size from 0.25 MB to 2.5 MB, not to mention streaming audio and video services, finding a threshold that does not lock out legitimate users is quite challenging.
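A quick back-of-the-envelope calculation shows why a single bandwidth cap is such a blunt instrument. The 250 MB cap below is a made-up figure for illustration, not our actual setting; the PDF sizes are the ones quoted above.

```python
def pdfs_under_cap(cap_mb, pdf_size_mb):
    """How many PDFs of a given size fit under a bandwidth cap."""
    return int(cap_mb // pdf_size_mb)


CAP_MB = 250  # hypothetical cap, chosen only for illustration

large = pdfs_under_cap(CAP_MB, 2.5)   # PDFs at the large end of the range
small = pdfs_under_cap(CAP_MB, 0.25)  # PDFs at the small end of the range
print(large, small)
```

The same cap admits ten times as many small PDFs as large ones, so a limit tight enough to stop bulk harvesting of small files would also stop a legitimate user who downloads a moderate number of large ones.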
I therefore urge you to contact OCLC directly and demand that they add finer-grained UsageLimit throttling directives to EZProxy. As EZProxy is by far the most common proxy solution deployed by libraries worldwide, this would enable many of your customers to benefit from the enhancement. While OCLC's customers have been requesting functionality like this for years via the EZProxy mailing list, OCLC has been slow to react (having taken months to update EZProxy to address recent SSL vulnerabilities, for example). Perhaps OCLC will listen to an enterprise partner.
For our part, at Laurentian, I have asked our IT Services department (which controls our proxy server) to write a simple script that parses the EZProxy event logs and emails us when a user is blocked for exceeding a threshold. This would have helped us catch the compromised account much earlier, and it should be a basic feature of EZProxy. Right now, every library has to implement its own solution for this basic requirement, and many do not.
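A script along those lines could be as small as the following Python sketch. The log format here is an assumption, not EZProxy's documented format: I am pretending each event is a tab-separated line of timestamp, event name, and username, and that a block is recorded with the event name "blocked". The function names and the SMTP host are likewise placeholders to adapt to your own setup.

```python
import smtplib
from email.message import EmailMessage

BLOCK_EVENT = "blocked"  # hypothetical event name; check your own logs


def find_blocks(lines, block_event=BLOCK_EVENT):
    """Return (timestamp, user) pairs for lines that record a block.

    Assumes tab-separated fields: timestamp, event name, username.
    """
    blocks = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3 and fields[1] == block_event:
            blocks.append((fields[0], fields[2]))
    return blocks


def build_alert(blocks, sender, recipient):
    """Build an email message summarizing the blocked accounts."""
    msg = EmailMessage()
    msg["Subject"] = f"EZProxy: {len(blocks)} account(s) blocked"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content("\n".join(f"{ts}  {user}" for ts, user in blocks))
    return msg


def send_alert(msg, host="localhost"):
    """Send the message through an SMTP server (host is an assumption)."""
    with smtplib.SMTP(host) as smtp:
        smtp.send_message(msg)
```

Run from cron against the current event log, it would call find_blocks on the file's lines and, if anything comes back, pass the result through build_alert and send_alert.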
All that said, even with finer-grained threshold directives and active monitoring of account blocking events, I have to note that a savvy attacker intent on harvesting your content will adapt. Once they have compromised an account, they will simply slow their requests down to a rate that emulates normal human activity, spread the requests across all of the accounts they have compromised, and introduce enough randomness that the requests no longer form detectable patterns (such as linear requests for only PDFs). No system is going to offer a perfect defence against those efforts.
I'm sympathetic to the content vendors' concerns, but even if OCLC does add some of these features to its core EZProxy offering, the content-scraping approaches will simply increase in sophistication. Removing proxy access isn't a real option for our users, even though cutting off proxy access is what the content vendors do. This is a game that nobody is going to win.