Occasionally on the #OpenILS-Evergreen IRC channel, a question comes up about what kind of hardware a site should buy if it is getting serious about trying out Evergreen. I had exactly the same chat with Mike Rylander back in December, so I thought it might be useful to share the strategy we developed in case other organizations are interested in piggy-backing on our research. We came up with three different scenarios, depending on the funding available to the organization and how serious the organization is about testing, developing, and deploying Evergreen.
You can also look at the scenarios as stages, as the scenarios enable
progressively more realistic testing. An organization can always
start with a single server and add more servers over time; if you can
swing a significant discount for buying in bulk, however, it might
make sense to bite the bullet early.
Some pertinent facts about our requirements: we will eventually be loading around 5 million bibliographic records onto the system. We're an academic organization, so concurrent searching and circulation loads will be low relative to public libraries.
Scenario 1: A single bargain-basement testing server
In this scenario, the organization purchases a single server for the short
term, and configures it to run the entire Evergreen + OpenSRF stack:
- database
- Web server
- Jabber messaging
- memcached
- OpenSRF applications
This server needs to have powerful CPUs, large amounts of RAM, and many fast (10K RPM or higher) hard drives in a
striped RAID configuration (the latter because database performance
typically gets knee-capped by disk access). A "higher education" quote online from a reputable big-name vendor for a rack-mounted 2U database server with 2x4-core
CPU, 16GB RAM, 6x73GB RAID 5 drives comes in at approximately $7000.
This scenario is fine for development and testing with a limited
number of users, but if you intend to do any sort of stress testing
with this server or throw it open to the public, performance will
likely grind to a halt. Note: This is close to the system that we're currently running at http://biblio-dev.laurentian.ca - 12 GB of RAM, 2 dual-core CPUs - with 800K bibliographic records and pretty snappy search performance. It's certainly nothing to sneeze at.
Scenario 2: one database server, one network server
In this scenario, you purchase a database server and a network server.
We'll use the same specs from scenario 1 for the database server, and
a CPU + RAM-oriented server for the network server (disk access isn't
a factor for the network apps, so you just buy two small mirrored
drives). The stock higher education quote for a rack-mounted 1U
network server with 2x4-core CPU, 16GB RAM, 2x73GB RAID 1 drives is
approximately $5250.
This scenario will support development and testing, as well as enable
you to perform relatively representative stress testing runs with a
significant number of simultaneous users.
Scenario 3: two database servers, two or three network servers
In this scenario, you purchase two database servers so that you can test
database replication and split database loads between search and reporting, and two or three network servers so that you can test
different distributions of the caching and network apps across the servers to
determine the configuration that best meets your expected demands. The cost of the five servers adds up to less than $30,000 - less than a single traditional proprietary UNIX server - and would be less if you can negotiate a bulk discount.
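The arithmetic behind that figure can be sketched in a few lines of Python. This is only a back-of-the-envelope check: it assumes the approximate per-server quotes from scenarios 1 and 2, and the function and variable names are made up for illustration.

```python
# Approximate "higher education" quotes from scenarios 1 and 2 (USD).
DB_SERVER_COST = 7000
NETWORK_SERVER_COST = 5250


def cluster_cost(db_servers: int, network_servers: int) -> int:
    """Return the total list price of a cluster, before any bulk discount."""
    return db_servers * DB_SERVER_COST + network_servers * NETWORK_SERVER_COST


# Server counts for each scenario described in this post.
scenarios = {
    "Scenario 1 (1 combined server)": cluster_cost(1, 0),
    "Scenario 2 (1 database + 1 network)": cluster_cost(1, 1),
    "Scenario 3 (2 database + 2 network)": cluster_cost(2, 2),
    "Scenario 3 (2 database + 3 network)": cluster_cost(2, 3),
}

for name, cost in scenarios.items():
    print(f"{name}: ${cost:,}")
```

The largest configuration (two database servers plus three network servers) comes to $29,750 at list price, which is where the "less than $30,000" figure comes from.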
The third scenario supports development and testing, and will give you
practical experience with a configuration that would approximate your
production deployment of servers. When you go live, you could move one of the database servers
and all but one of the network servers over to the production cluster, and revert to scenario one for your ongoing test and development environment.
The Conifer approach
We opted to go with the third scenario to build a serious test cluster for our consortium. However, the "scenarios as stages" approach ended up being our strategy as our original choice of Dell servers came with RAID controllers that do not work well under Debian. After returning the servers to Dell, we were forced to press one of our backup servers into service as a scenario-one style server while waiting for our new order from HP to arrive.
Let me add one suggestion: your recommendation for fast disk drives for the DB server is absolutely correct, but that advantage is lost if you use RAID 5 (or even striped RAID 5; often called RAID 50). You should never run a DB on any RAID level that requires reading multiple drives for each bit of data (in current practice that means RAID 5, 50, 6 or, if such there be, 60). If you do, you're basically taking all the advantage gained by having fast disks and throwing it out the window. Remember that to read any data from a RAID 5 array (same applies to RAID 50), the RAID subsystem has to read from 3 disk drives and perform calculations on the RAID card to assemble the data. Presumably the RAID card can perform the calculations fast enough that it adds minimal overhead, but your seek times are off the charts (in a bad way).
For a DB, never even think about using anything other than RAID 1 or RAID 0 + 1 (both of which actually improve read performance--far and away the more common operation on a library DB, particularly in an academic setting).
In general, for a DB server, to maximize price-performance, you want to get the fastest disk subsystem you can get (as of now, that would be, for instance, Seagate Savvio 15K rpm drives or similar SCSI U320 drives). But then be sure you're not crippling your disk performance by using an inappropriate RAID level. RAID 5 (or the now emerging RAID 6--uses 4 spindles for each byte, rather than 3, for higher fault tolerance) is great for storing backups; it's not what you want for a database. A DB's throughput during heavy use can easily double or triple on a mirrored disk subsystem as compared to any split-the-bytes form of RAID setup.
John
John
I do think your statement "For a DB, never even think about using anything other than RAID 1 or RAID 0+1" is questionable. See http://www.tpc.org/tpcc/results/tpcc_result_detail.asp?id=107012201 for a DB2 result published by IBM in 2007, for example; the full disclosure shows that they're running on RAID 5. See also http://www.tpc.org/tpcc/results/tpcc_result_detail.asp?id=107012201 for a result published by Oracle in 2008; they're also running on RAID 5. I doubt these companies are willingly giving up performance.
I know that different databases can interact with filesystems in different ways (DB2 liked bypassing filesystem caching, for example), so perhaps PostgreSQL performs better with RAID 1+0 than RAID 5. Again, there are certainly claims out on the 'net that it does. But then, I'm firmly of the "try it yourself" camp, and that's why we have two database servers - we can always bring one down, set up a RAID 1+0 configuration, and run benchmarks against otherwise identically configured servers to determine which RAID configuration suits Evergreen on PostgreSQL.
I'll post an update once we've done that.
--my reading is that RAID 5 is a requirement of the benchmark. Historically, RAID 5 no doubt made sense, and now, when it's of much less practical use, they may just want to keep things consistent with older test results (back in the day of 2GB drives, you needed RAID 5 to get a big array).
But, just looking at raw performance, a simple measure of disk subsystem performance (using say IOMeter's File Server simulator) with RAID 1 vs. RAID 5 will definitely show that the disk subsystem performs substantially better with RAID 1 (particularly with a fair number of queued IOs and a high read/write ratio--and a library system is a fairly extreme case of reads to writes). Note that the raw performance of the disk subsystem is independent of how the OS interacts with the drives vs. how the DB does: there's no getting around the fact that a RAID 5 system will have higher seek times (even with good command queuing, positioning heads on 1 drive will beat positioning them on 3--although it won't be a simple 1 to 3 ratio, of course) and the XORs to recreate the actual data read from the RAID 5 array can't be free, no matter how efficient they are.
I'll be interested to see your test results, if you have time to do them (without using something like IOMeter, it may be moderately difficult to simulate an appropriate load at which the performance would really diverge--but more on that below), but the only reason for using RAID 5 over RAID 1 or RAID 0 + 1 is that it's more efficient from a cost-per-GB point of view than RAID 1 or 0+1. (The fact that you can have a hot spare drive configured for either setup means the fault tolerance is equivalent: either survives a single-drive failure.) This cost-per-GB fact was important in years past, perhaps, and for truly huge databases it might still be a consideration. But, for a library DB (all but the absolute biggest ones I've encountered in 17 years in the business are less than the size of one 73GB drive--most a LOT less), with current hard drive technology, the cost difference of more spindles (taking system cost as a whole into consideration) is insignificant. There's also the issue that RAID 5 performance varies tremendously from RAID controller to controller--but since RAID 1 is so much simpler, if you don't happen to have the best RAID card available, you've got a better chance of getting good performance from whatever you've got if you take the XORs and more complex head-positioning out of the mix by not using RAID 5.
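To put rough numbers on the cost-per-GB trade-off, here is a small Python sketch of usable capacity under the standard definitions of each RAID level (RAID 5 gives up one drive's worth of space to parity; RAID 1 and RAID 10 give up half to mirroring). The function name is made up for illustration, and the 6x73GB example simply reuses the drive count from the database server quote in the post.

```python
def usable_gb(level: str, drives: int, drive_gb: int) -> int:
    """Return usable capacity in GB for a few common RAID levels.

    Hot spares are not counted: pass only the drives that are in the array.
    """
    if level == "raid1":
        if drives != 2:
            raise ValueError("this sketch treats RAID 1 as a single mirrored pair")
        return drive_gb
    if level == "raid5":
        if drives < 3:
            raise ValueError("RAID 5 needs at least 3 drives")
        return (drives - 1) * drive_gb  # one drive's worth of parity
    if level == "raid10":
        if drives < 4 or drives % 2:
            raise ValueError("RAID 10 needs an even number of drives, 4 or more")
        return (drives // 2) * drive_gb  # every block is stored twice
    raise ValueError(f"unsupported RAID level: {level}")


# The six 73GB drives from the scenario 1 database server quote.
for level in ("raid5", "raid10"):
    print(f"{level}: {usable_gb(level, 6, 73)} GB usable")
```

With six 73GB drives, RAID 5 yields 365 GB and RAID 10 yields 219 GB. Whether the extra 146 GB matters depends on the size of the database, which is the point being argued above.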
I certainly wouldn't recommend anyone ever assume that they're getting the best performance from a DB server if they host the DB on a RAID 5 setup. We have a test server at one of our clients' sites; it has copies of the live DB on both RAID 5 and just individual disks (no RAID--which is slower for reads than RAID 1, but faster for writes). The individual disks are definitely faster (and not just a little faster--I mean at least 3 to 5 times faster). The RAID 5 performance is acceptable (especially at low load levels), but we would never use that arrangement for a production system. (We have the one copy on the RAID 5 array because there's extra space there in addition to the backups we keep on the array.) It is true that the RAID 5 performance of the Adaptec controller in that box isn't stellar.
I agree with the "try it yourself" philosophy too, and my actual experience with the same DB server hitting different DBs on different RAID arrangements has been crystal clear: RAID 1 or 0 + 1 gives substantially better performance. At low loads you might not be able to see the difference, but in the case of Evergreen's keyword searching, for instance, which is highly random in terms of disk accesses, my guess is that doing cache-cold searches for common terms would immediately show a perceptible (not just a measurable) difference.
My real-world experience, plus the fact that disk space (even employing the fastest available drives) is a relatively inexpensive part of a system, leads me to conclude that there's no reason to use RAID 5 to store DB data. And if I were spec'ing a system now, rather than a RAID 5 array for backups, I'd consider using mirrored big and inexpensive SATA drives for that too. For databases of the size typically encountered in the library world, RAID 5 has no practical benefits.
The reason for this is that once you add more than 3 drives the data can be read from multiple sets of 3-at-a-time, bringing performance back toward the RAID10 level. Of course, if you're going to put 6 or 12 or 24 drives in one RAID set then you might as well use RAID10, because you're obviously not concerned about disk cost. Unless your controller is better at RAID5 parity calculation than it is at concurrent read on RAID10. Such animals are said to exist.
With storage costs being what they are these days, the biggest consideration with RAID5 is that you need to have a really good controller with fast and sane hardware parity calculation. Of course, you need a good controller for RAID10 too, but RAID10 is easier to implement well than is RAID5.
One other thing, there's a big difference between RAID 1+0 and RAID 0+1 ... the former is Good(tm), but the latter will cause you to lose an entire "side" of your mirror if one drive goes bad -- decidedly ungood. The RAID 0+1 is usually only an option with software RAID, though, for this very reason.
Did that answer any questions or clear anything up? Nope. It really just means this: test with your real workload, and when all else fails add more RAM.
And with that hardware nerdiness out of the way, I bid you all adieu!
--miker
RE RAID 1+0 vs. 0+1, though, I want to suggest that the naming may not match the patterns as exactly as Mike implies. I've not encountered the type of combination of stripes and mirroring that Mike refers to in which a one drive failure kills the whole array, but I have encountered different terms for the same conceptual beast among the different vendors. And also different ways of setting up the combination of striping and mirroring (which can make the terminology all that much more confusing--especially if you have in mind that one way provides fault-tolerance and one does not).
Depending upon the controller manufacturer's terminology, RAID levels 0 + 1, 1 + 0, and 10 are somewhat interchangeable. And, although the way the setup actually happens suggests that it should be called one thing or the other, the end result is the same--and is fault-tolerant. (3Ware and Adaptec, for instance, use opposite setup approaches, as I recall--befuddled me no end initially, because I was concerned about just the sort of issue Mike raised with regard to fault-tolerance). It turns out that the end result is the same: striped and mirrored with fault-tolerance of a single-disk failure.
With one of these vendors' cards, you have to set up your 4 drives (the minimum for either of these approaches) as RAID 0 (2 sets of 2 striped drives each) and then mirror the one striped pair onto the other striped pair: disks 0 & 1 are a striped pair; 2 & 3 are a striped pair; and then 2 & 3 are set up to mirror 0 & 1. Most logically, IMO, this might be referred to as 0 + 1: RAID 0 [striped] followed by the addition of RAID 1 [mirrored]. With the other manufacturer, you set up 2 mirrored pairs and then stripe across the 2 mirrored pairs: 0 & 1 are mirrors; 2 & 3 are mirrors; and 2 & 3 stripe 0 & 1. Now, for naming, I'd choose 1 + 0 for this. But, the thing is, even though the sequence is different between 3Ware and Adaptec, the terminology isn't consistent with what my brain thinks it should be: so they may call it the same thing even though the levels get applied in a different order. (Is the terminology a FIFO: first applied level listed first; or a FILO: last applied level first? And if you have that straight, doesn't it seem that you'd attend to which order you apply the levels? Well, apparently not.)
The important point is that whatever your RAID controller manufacturer calls this RAID setup combining striping and mirroring (which is generally as good as you can get for a DB, in terms of general-purpose performance--the advantage of the striping being relatively minimal in most cases), you do not lose the whole array with a 1-drive failure. Each drive in such a setup contains a copy of the stripes for half the array. Failure of one drive means you have to rely only on the other half of the mirror for that set of stripes (until your hot spare comes on line--hot spares are generally worth the cost, I think). So, regardless of whether you apply the stripes and then mirror, or the other way around (which would possibly imply the different naming mentioned by Mike), you do have fault tolerance for single-drive losses.
I'm not disagreeing with Mike that there are non-fault-tolerant ways to set things up, but, in general, anything you can do via a RAID controller in terms of combining drives in any kind of configuration that includes a 1 (mirroring): 1 + 0; 0 + 1; or 10, will, based on my experience with 3 different manufacturers' controllers, give you the ability to survive a single disk failure--regardless of what they choose to call it and regardless of what order the setup happens in.
If it's a software RAID setup, I don't have enough experience to offer an opinion. Given Mike's comments, it sounds like some care would be in order in that case. But if you're configuring a server for a serious production system, you wouldn't want software RAID for anything except perhaps the OS and application installations anyway (RAID 1). And I would not recommend using any type of RAID 0 (striping) in a software RAID situation anyway: the only way it's any good is in hardware with mirroring. storagereview.com is an excellent site for many things disk related. Their discussion of RAID 0 is enlightening (they basically show that the received wisdom about striping is not correct).
Happy RAIDing.
I'll attempt some ASCII art of a 4 drive RAID set to illustrate (if it doesn't come out I can put some images up):
0+1 (stripe, then mirror)
|-----m-----|
-s- -s-
|1| |3|
|2| |4|
--- ---
|-------------|
So, you can see that the mirroring step essentially sees 2 drives, the two stripe sets [s]. If physical drive 3 fails, as far as the mirror [m] is concerned, the entire right "side" of the mirror is bad, and only drives 1 and 2 stay in use. This leads to two problems. First, the entire right side must be rebuilt when drive 3 is replaced, and second, before drive 3 is replaced you can't lose either 1 or 2, or the entire RAID set dies.
1+0 (mirror, then stripe)
|-----s------|
-m- -m-
|1| |3|
|2| |4|
--- ---
|-------------|
Here, if we stripe mirrors, we can lose physical drive 3 and only the data residing on the right side of the stripe set acts in a degraded fashion with regard to read performance. What's more, we can still lose either of physical drive 1 or 2 and not lose any data.
Again, I've had this happen to me in the past with a Dell OEM controller (I think it was a Mylex variant, but you never can tell with Dell...). At least I can say that I inherited that setup.
John is 100% correct that both setups will survive a single-disk failure, but there is a distinct difference in the worst-case for each, and degraded performance under 1+0 will be better than under 0+1. That is the point that I was attempting to illuminate ... badly, it would seem.
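The worst-case difference can be checked by brute force. The Python sketch below enumerates every possible drive failure for the 4-drive layouts in the diagrams above. It assumes the behavior Mike describes, where a 0+1 controller takes a whole stripe set offline as soon as one of its drives fails; as John notes, individual controllers may behave differently. All function and variable names are made up for illustration.

```python
from itertools import combinations

DRIVES = [1, 2, 3, 4]
# Drives 1 & 2 form the left group and 3 & 4 the right group, as in the diagrams.
GROUPS = [{1, 2}, {3, 4}]


def survives_0_plus_1(failed: set) -> bool:
    """Stripe, then mirror: at least one stripe set must be fully intact."""
    return any(not (group & failed) for group in GROUPS)


def survives_1_plus_0(failed: set) -> bool:
    """Mirror, then stripe: every mirrored pair needs one surviving drive."""
    return all(group - failed for group in GROUPS)


def survivable(survives, n_failed: int) -> list:
    """List every combination of n_failed dead drives that the array survives."""
    return [
        set(combo)
        for combo in combinations(DRIVES, n_failed)
        if survives(set(combo))
    ]


for n_failed in (1, 2):
    total = len(list(combinations(DRIVES, n_failed)))
    print(f"{n_failed} drive(s) failed, {total} possible combinations:")
    print("  0+1 survives:", survivable(survives_0_plus_1, n_failed))
    print("  1+0 survives:", survivable(survives_1_plus_0, n_failed))
```

Under these assumptions both layouts survive all four single-drive failures, but of the six possible two-drive failures, 0+1 survives only two (both failures on the same side) while 1+0 survives four (any pair that does not take out both halves of one mirror).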
As for software RAID, I don't think anyone would ever recommend that for a production system except, as John pointed out, if it is backed by hardware mirroring. In that case it can be very useful, as expanding the RAID set by adding more mirrored sets to the stripe can be simpler using an OS-supplied storage virtualization scheme than a controller-based scheme -- think Veritas storage virtualization or Linux LVM.
Man ... I haven't geeked out over hardware like this in ages. Thanks, Dan!
--miker
-- Jason
(But seriously, congrats on a very clear explanation.)
If I might suggest a moral to this story: if you want to use any kind of RAID 0 & 1 combination, before you buy it, check the documentation of a candidate controller and see how it suggests/requires combining mirroring and striping: if it does not appear to allow the more optimal mirror-then-stripe approach, buy a different controller (or just skip the striping step altogether; see below). Some controllers, at least, will only let you choose the less optimal stripe-then-mirror approach.
You might also try leaving the striping out of the equation and just using mirroring (if you have no choice of controller [Bueller?... Dell? ... Dell?]). RAID 10's advantage over straight mirroring (for Evergreen DBs) is probably a ways down into the hard-to-measure range anyway (striping is most useful for big sequential reads [think video streaming]--which isn't characteristic of much of the info in Evergreen's DB--or Postgres, in general, for that matter). Striping is also less useful with current disk technology (better caching, lower rotational latency, and faster seeks) than it once was (see the storagereview.com article for lots of techie bits).
And Mike, Geek On, bro!
John
This is a non-technical musing-type question regarding your comment "This scenario is fine for development and testing with a limited number of users, but if you intend to do any sort of stress testing with this server or throw it open to the public, performance will likely grind to a halt."
We run our current Unicorn system on a single HP Proliant DL380R03 (4GB RAM). With roughly 800,000 holdings and 6,500 users, performance has yet to "grind to a halt", or even slow down.
Do Evergreen's hardware requirements so far exceed those of Unicorn?
Gord
Well, apples to apples and all that. Without an independent benchmark test suite to run, with control over the number of concurrent searches / updates / reports, or asterisks beside numbers for immediate updates vs. updates that take place overnight due to required reindexing, the number of bibs or potential users is fairly meaningless.
I think it would be great if there were an independent benchmark organization a la TPC (http://tpc.org) in the database world to set benchmark tests and monitor results in the library world, so that we could actually have meaningful quantitative comparisons. But I suspect the library world is too small a market to justify or sustain such an organization.
So my comments were thinking more of the public library consortium world (dozens or hundreds of concurrent staff client sessions updating the data while hundreds or thousands of simultaneous user search sessions were going on, with various reports running in the background), rather than the academic library world (maybe a dozen concurrent staff client sessions and a few dozen concurrent searches, with the odd report running occasionally).
And of course, even that is guesswork and needs to be benchmarked - one of the goals we have as part of Project Conifer is to run realistic stress testing scenarios with representative workloads to get a better idea of what our real hardware needs will be.