14 August 2012

Further adventures with Zabbix

Following on from my last blog about installing Zabbix, I thought I'd go into it in a bit more depth this time because, as it turns out, getting it installed and running is really just the beginning.

The problem is that all the servers are doing different jobs and have subtle differences in the way they're configured. So while you can start getting feedback from Zabbix very quickly, I've had to spend a fair bit of time tweaking it for our environment.

The main issue is that the templates supplied by Zabbix are very detailed and the alerts have low trigger thresholds. This is exactly what you need to get started but it doesn't take long to start collecting a large number of alerts, most of which will be false. Getting a red alert that a server was down was alarming until I realised it was for a news server, something we don't actually run. Clearly some template editing was called for.

This can be quite formidable at first sight, but fortunately, because the supplied templates are so detailed, it's mostly a case of taking a hatchet to everything you don't need, at least until you're comfortable with Zabbix. So from the (literally) thousands of things you can monitor, in almost all cases the important ones will be:

Disks and filesystems
CPU load
Memory
Services

Disk metrics are really concerned with two things: available space and I/O performance. It's always good to know there's enough free disk space on your partitions. I find it more helpful to show this as a percentage of available space than as an absolute figure in MB. You will also want to monitor reads and writes per second. Actual values are a bit geeky in themselves, but over time they'll build up into useful historical trends.
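To show why a percentage is easier to read than a raw figure, here is a minimal Python sketch. It is only an illustration and not how Zabbix collects the value (the agent has its own filesystem items for that). The function name percent_free is my own.

```python
import shutil


def percent_free(path="/"):
    """Return free space on the filesystem holding `path` as a percentage."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (in bytes)
    return 100.0 * usage.free / usage.total
```

Calling percent_free("/var") gives one number that means the same thing on a 20 GB partition as on a 2 TB one, which is exactly why it makes a better trigger threshold than a figure in MB.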

CPU performance? Well, clearly you need to know how hard the processors are working, so keep an eye on CPU load average and idle time. Load average is normally expressed as values over 1, 5 and 15 minutes. On a single-core machine a value of 0.7 (meaning the processor is at roughly 70% capacity) or below is good, and occasional peaks as high as 3 are probably OK too; anything higher than that, especially if it's sustained, spells trouble. On multi-core machines these figures scale with the number of cores. In the default configuration these metrics returned a blizzard of alerts but are now more or less under control (more about that in a moment).
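Because the raw load average depends on how many cores a machine has, it can help to divide it by the core count before comparing it with a threshold. The sketch below is a rough illustration in Python, assuming a Unix-like system. The helper names and the 0.7 default are mine, taken from the rule of thumb above rather than from any Zabbix template.

```python
import os


def normalise(load, cores):
    """Scale a raw load average by the number of CPU cores."""
    return load / cores


def load_per_core():
    """Return the 1, 5 and 15 minute load averages per core (Unix only)."""
    cores = os.cpu_count() or 1  # cpu_count() can return None
    return tuple(normalise(load, cores) for load in os.getloadavg())


def is_busy(per_core_load, threshold=0.7):
    """True when the per-core load is above the chosen threshold."""
    return per_core_load > threshold
```

So a load average of 2.8 on a four-core server works out at 0.7 per core, which is comfortable, whereas the same 2.8 on a single-core box is not.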

Memory covers both physical RAM and virtual memory. Generally, what doesn't fit into RAM is swapped out to disk, so you should keep an eye out for high swap rates.
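To make "swap rate" a bit more concrete: on Linux the kernel keeps running counters of pages swapped in and out (pswpin and pswpout in /proc/vmstat), and the rate is just the difference between two samples divided by the time between them. Here is a minimal Python sketch of that calculation. The function names are my own, and this is an illustration rather than what the Zabbix agent does internally.

```python
def parse_vmstat(text):
    """Parse /proc/vmstat-style text ("name value" per line) into a dict."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def swap_rate(before, after, seconds):
    """Return (pages swapped in, pages swapped out) per second between samples."""
    swapped_in = (after["pswpin"] - before["pswpin"]) / seconds
    swapped_out = (after["pswpout"] - before["pswpout"]) / seconds
    return swapped_in, swapped_out
```

In practice you would read /proc/vmstat twice, a few seconds apart, and pass both parsed results to swap_rate. A rate that stays well above zero suggests the machine is short of RAM.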

Services will depend on what function your server is performing, but Zabbix can ping your HTTPD or MySQL service regularly to make sure it's still there.
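At its simplest, a check like this just tries to open a TCP connection to the port the service listens on. The Python sketch below shows the idea. It is not Zabbix's own implementation, the function name is mine, and the ports in the usage note are only the conventional defaults.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, unreachable, name not resolved...
        return False
```

For example, port_is_open("localhost", 80) would cover a web server on the usual port and port_is_open("localhost", 3306) a MySQL server on its default one. Bear in mind that an open port only tells you something is listening, not that it is answering sensibly.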

Once everything seemed to be under control I was pretty alarmed to discover that load on the Zabbix server itself had gone through the roof. My next job therefore was to reduce the load on the server.

This screen shows what happened when I deployed a fairly basic monitoring template, based on the supplied one for Linux, across the servers.

As you can see, the Zabbix server struggled to keep up for a while before gradually losing the battle. Well, after a bit of research I found I only had to do two things.

First of all, a bit of tinkering with MySQL's configuration file (/etc/my.cnf).
Adding these two lines (they belong in the [mysqld] section) reduced CPU utilization by 50%!

innodb_buffer_pool_size=256M
innodb_flush_method=O_DIRECT

The next step was to reduce the polling frequency for the monitored items dramatically. The default interval for many is every 30 seconds. Multiply this by 40 different metrics on 50 servers and it's not hard to see why the server was struggling to keep up.

I throttled the polling interval back to once per minute on many values and considerably longer than that on others. You really don't need to check free disk space more than once every 15 minutes or so. By doing this I was able to reduce CPU utilization by another 50%.
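The arithmetic behind this is worth spelling out. Here is a small Python sketch using the figures above (50 servers, 40 metrics each). The function name is mine, and it assumes every item shares the same interval, which is a simplification.

```python
def values_per_second(servers, items_per_server, interval_seconds):
    """Average number of new values the server must process each second."""
    return servers * items_per_server / interval_seconds


default_rate = values_per_second(50, 40, 30)  # everything every 30 seconds
relaxed_rate = values_per_second(50, 40, 60)  # everything once a minute

print(f"30s interval: {default_rate:.1f} values/sec")
print(f"60s interval: {relaxed_rate:.1f} values/sec")
```

At the default interval that is roughly 67 values arriving every second, each of which has to be written to the database. Doubling the interval halves that, and pushing slow-moving items such as free disk space out to 15 minutes trims it further.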

So between them the two steps cut load on the server by roughly three quarters (two successive 50% reductions compound to 75%, not 100%). Here is a screenshot of the result. Right now MySQL is taking up 2% of the CPU resources, against about 130% last week!
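For anyone who wants to check the sums, here is a short Python sketch of how successive percentage cuts combine, plus the relative drop in the MySQL figures quoted above. The function names are my own.

```python
def combined_reduction(*reductions):
    """Combine successive fractional reductions (0.5 means a 50% cut)."""
    remaining = 1.0
    for reduction in reductions:
        remaining *= 1.0 - reduction
    return 1.0 - remaining


def relative_drop(before, after):
    """Fractional drop from `before` to `after`."""
    return (before - after) / before


print(f"Two 50% cuts: {combined_reduction(0.5, 0.5):.0%} overall")
print(f"MySQL CPU 130% -> 2%: {relative_drop(130, 2):.1%} drop")
```

Each cut applies to whatever load is left after the previous one, so the percentages multiply rather than add.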