Hello, Anthony here.
My last couple of blogs have been about Zabbix, the enterprise network monitoring system that we recently installed. As I mentioned in my last piece, after a few teething problems it is now working fine, and I think we all sleep easier at night knowing that any problems will be picked up before they become critical. And if the worst does happen, we'll know about it immediately.
So this month I'd like to discuss something that Zabbix indirectly highlighted and that has taken up quite a bit of my time over the last week or so: Netapp.
Now Asial maintains installations in a couple of data-centers, but we don't go there more often than we have to. For a start, almost everything that needs doing can be done remotely. Secondly, they're inconvenient to get to, and finally, they are extraordinarily inconvenient to actually get into. Security is rigorous: you have to book in advance, take several forms of ID, and be escorted into and out of the building. To put it bluntly, the data-centers do everything they can to discourage visitors. So while I was aware we had a Netapp server, I'd had very little to do with it.
For the uninitiated, a Netapp server can best be described as a large noisy box. It doesn't look particularly impressive, at least from the front, and anyone randomly wandering in off the street would guess it was fairly low down the digital pecking order. It certainly doesn't have all the flashing lights that, say, a router has. And it has lots of clumsy chunks of plastic poking out of it, nothing like the sleek lines of our HP Proliant servers. However, the fact is our Netapp server is critically important to us, probably the single most important item on the rack, because it provides high-availability, high-redundancy storage.
All our servers have a certain storage capacity of course, but all the really important stuff goes on Netapp. This means that while the OS runs on the server, in many cases the data itself will be on Netapp. Why? Well, behind each one of those chunks of plastic sits a hard drive, 14 of them in Asial's case, providing not just several terabytes of storage but, more importantly, speed and reliability. Each hard drive works with all the others to distribute load and maintain service. Netapp can also take snapshots of data at regular intervals, so that recovering from an accidental deletion is trivial. Indeed, the snapshot process is so fast and efficient that in many ways it renders traditional backups obsolete. Finally, in addition to those 14 active hard drives there are a number of spares. If Netapp detects a problem with one of the active drives, it will bring one of the spares online and deactivate the defective disk. In other words, if a disk dies, service is unaffected.
My problem, therefore, is how to know when a disk has failed. There are no signs, no interruption in service, no complaints from users, just a seamless switch from one disk to another. Getting hold of this information, along with various other important details, is what I've been up to recently. I've had to do quite a bit of reading around the Netapp OS and checking the settings currently in use. Even if I don't know what they all mean yet, they're still a useful benchmark against which to spot anything that changes in the future.
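To make the benchmark idea a little more concrete, here is a minimal Python sketch. It is only an illustration, not what we actually run: the disk names, the state labels ("active", "spare", "failed") and both function names are made up for the example, and it deliberately says nothing about how the states would be collected from Netapp in the first place. It simply compares a recorded baseline against a later reading and reports what changed.

```python
def detect_disk_changes(baseline, current):
    """Compare two {disk_name: state} readings and describe what changed."""
    changes = []
    for disk, old_state in sorted(baseline.items()):
        new_state = current.get(disk)
        if new_state is None:
            changes.append(f"{disk}: missing (was {old_state})")
        elif new_state != old_state:
            changes.append(f"{disk}: {old_state} -> {new_state}")
    # Disks that appear in the new reading but were not in the baseline.
    for disk in sorted(set(current) - set(baseline)):
        changes.append(f"{disk}: new disk ({current[disk]})")
    return changes


def spare_count(reading):
    """Count how many disks are currently sitting idle as spares."""
    return sum(1 for state in reading.values() if state == "spare")


# Hypothetical data: one active disk has failed and a spare has taken over.
baseline = {"disk01": "active", "disk02": "active",
            "spare01": "spare", "spare02": "spare"}
current = {"disk01": "active", "disk02": "failed",
           "spare01": "active", "spare02": "spare"}

for change in detect_disk_changes(baseline, current):
    print(change)

if spare_count(current) < spare_count(baseline):
    print("WARNING: the spare pool has shrunk, a disk has probably failed")
```

The point of watching the spare count as well as the individual states is that a shrinking spare pool is exactly the silent symptom described above: service carries on as normal, but there is one less disk in reserve.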
So what happens if a major earthquake hits Tokyo and takes out the entire Netapp installation? Hmm... perhaps that's a topic for next month.