Asial Blog

A blog where the people at Asial write about technology and whatever else is on their minds

Further adventures with Zabbix

Following on from my last blog about installing Zabbix, I thought I'd go into it in a bit more depth this time because, as it turns out, getting it installed and running is really just the beginning.

The problem is that all the servers are doing different jobs and have subtle differences in the way they're configured. So while you can start getting feedback from Zabbix very quickly, I've had to spend a fair bit of time tweaking it for our environment.

The main issue is that the templates supplied by Zabbix are very detailed and the alerts have low trigger thresholds. This is exactly what you need to get started, but it doesn't take long to start collecting a large number of alerts, most of which will be false alarms. Getting a red alert that a server was down was alarming until I realised it was for a news server, something we don't actually run. Clearly some template editing was called for.

This can be quite formidable at first sight, but fortunately, because the supplied templates are so detailed, it's mostly a case of taking a hatchet to everything you don't need, at least until you're comfortable with Zabbix. So from the (literally) thousands of things you can monitor, in almost all cases the important ones will be:

Disks and filesystems
CPU load
Memory
Services

Disk metrics are really concerned with two things: available space and I/O performance. It's always good to know there's enough free disk space on your partitions, and I find it more helpful to show this as a percentage of available space than as an absolute figure in MB. You will also want to monitor reads and writes per second. The actual values are a bit geeky in themselves, but over time they'll build up into useful historical trends.
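To make the "percentage rather than megabytes" point concrete, here is a minimal Python sketch that reports free space as a percentage. It is only an illustration of the calculation, not how Zabbix itself collects the value; the function name `percent_free` is made up for this example.

```python
import shutil


def percent_free(path="/"):
    """Return free space on the filesystem holding `path` as a percentage."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (in bytes)
    return 100.0 * usage.free / usage.total


if __name__ == "__main__":
    # A percentage reads the same on a 20 GB disk and a 2 TB one,
    # which is why it is easier to alert on than a raw byte count.
    print(f"/ has {percent_free('/'):.1f}% free")
```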

CPU performance? Well, clearly you need to know how hard the processors are working, so keep an eye on CPU load average and idle time. Load average is normally reported as three values, averaged over the last 1, 5 and 15 minutes. On a single-core machine a value of 0.7 (meaning the processor is at roughly 70% capacity) or below is good, and occasional peaks as high as 3 are probably OK too; anything higher than that, especially if it's sustained, spells trouble. In the default configuration these metrics returned a blizzard of alerts, but they are now more or less under control (more about that in a moment).
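As a rough illustration of reading those three numbers, the Python sketch below divides each load average by the number of CPU cores and applies the 0.7 and 3 rules of thumb from the paragraph above. Dividing by the core count is an assumption added here (a load of 2 means something different on one core than on four), and the names `load_per_core` and `describe` are hypothetical.

```python
import os


def load_per_core(load, cores):
    """Normalise a load average by the number of CPU cores (assumption)."""
    return load / max(cores, 1)


def describe(load, cores):
    """Rough reading of a load average using the 0.7 and 3 rules of thumb."""
    per_core = load_per_core(load, cores)
    if per_core <= 0.7:
        return "good"
    if per_core <= 3.0:
        return "probably OK if it is only a peak"
    return "trouble"


if __name__ == "__main__":
    cores = os.cpu_count() or 1
    try:
        samples = os.getloadavg()  # (1 min, 5 min, 15 min); Unix only
    except (AttributeError, OSError):
        samples = ()
    for minutes, load in zip((1, 5, 15), samples):
        print(f"{minutes:>2} min: {load:.2f} -> {describe(load, cores)}")
```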

Memory covers both physical RAM and virtual memory. Generally, what doesn't fit into RAM is swapped out to disk, so you should keep an eye out for high swap rates.
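For a feel of what a "swap rate" is, here is a small Python sketch that works out pages swapped in and out per second from two samples of Linux's /proc/vmstat counters (`pswpin` and `pswpout`). It assumes a Linux host and is not how the Zabbix agent does it; `parse_vmstat` and `swap_rates` are names invented for the example.

```python
def parse_vmstat(text):
    """Parse /proc/vmstat-style 'name value' lines into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def swap_rates(before, after, seconds):
    """Return (pages swapped in per second, pages swapped out per second)."""
    swapped_in = (after["pswpin"] - before["pswpin"]) / seconds
    swapped_out = (after["pswpout"] - before["pswpout"]) / seconds
    return swapped_in, swapped_out
```

In practice you would read /proc/vmstat twice, a known number of seconds apart, and pass both parsed samples to `swap_rates`. Rates that stay well above zero suggest the machine is short of RAM.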

The services to watch will depend on what function your server is performing, but Zabbix can ping your HTTPD or MySQL service regularly to make sure it's still there.
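The simplest form of such a check is just "can I open a TCP connection to the port?". The Python sketch below shows that idea only; it is not Zabbix's own check, and the function name `service_is_up` is hypothetical. Ports 80 and 3306 are the usual defaults for HTTP and MySQL, so adjust them if your services listen elsewhere.

```python
import socket


def service_is_up(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, timed out, name not resolved...
        return False


if __name__ == "__main__":
    for name, port in (("httpd", 80), ("mysql", 3306)):
        state = "up" if service_is_up("localhost", port) else "DOWN"
        print(f"{name} on port {port}: {state}")
```

Note that this only proves something is listening on the port. It doesn't prove the service is answering requests sensibly.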

Once everything seemed to be under control I was pretty alarmed to discover that load on the Zabbix server itself had gone through the roof. My next job therefore was to reduce the load on the server.

This screen shows what happened when I deployed a fairly basic monitoring template, based on the supplied one for Linux, across the servers:



As you can see, the Zabbix server struggled to keep up for a while before gradually losing the battle. Well, after a bit of research I found I only had to do two things.

First of all, a bit of tinkering with MySQL's configuration file (/etc/my.cnf).
Adding these two lines reduced CPU utilization by 50%!

innodb_buffer_pool_size=256M
innodb_flush_method=O_DIRECT

The next step was to dramatically reduce how often the monitored items are polled. The default for many is every 30 seconds. Multiply this by 40 different metrics on 50 servers and it's not hard to see why the server was struggling to keep up.

I throttled the polling interval back to once per minute for many values, and to considerably longer than that for others. You really don't need to check free disk space more than once every 15 minutes or so. By doing this I was able to reduce CPU utilization by another 50%.
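A quick back-of-envelope calculation shows why the interval matters so much. The Python sketch below uses the figures above (50 servers, 40 metrics each) and makes the simplifying assumption that every item is polled at the same interval; `checks_per_second` is just a name for this example.

```python
def checks_per_second(servers, metrics_per_server, interval_seconds):
    """Average number of values arriving at the Zabbix server each second."""
    return servers * metrics_per_server / interval_seconds


if __name__ == "__main__":
    for label, interval in (("30 s", 30), ("1 min", 60), ("15 min", 15 * 60)):
        rate = checks_per_second(50, 40, interval)
        print(f"every {label:>6}: {rate:5.1f} values per second")
```

That works out to roughly 67 values a second at the 30-second default, about 33 at one minute, and only a couple per second at 15 minutes. Each of those values also has to be written to the database.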

Together these two steps cut the load on the server dramatically. Here is a screenshot of the result. Right now MySQL is taking up 2% of the CPU resources, against about 130% last week!
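As a sanity check on those percentages: successive reductions multiply rather than add, so two cuts of 50% leave a quarter of the original load, which is a 75% reduction overall. The Python sketch below does that arithmetic and also works out the drop in MySQL's CPU share from the two figures quoted above. The function names are made up for the example.

```python
def combined_reduction(*fractions):
    """Overall reduction after applying several fractional reductions in turn."""
    remaining = 1.0
    for fraction in fractions:
        remaining *= 1.0 - fraction
    return 1.0 - remaining


def reduction_between(before, after):
    """Fractional reduction going from `before` to `after`."""
    return (before - after) / before


if __name__ == "__main__":
    print(f"two 50% cuts: {combined_reduction(0.5, 0.5):.0%} overall")
    print(f"MySQL 130% -> 2%: {reduction_between(130, 2):.1%} lower")
```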



Zabbix to the rescue!

Hello, Anthony here, from infrastructure (& England).

Well, it's with great relief that I've been allowed to write this in English, so welcome to my first, and AFAIK Asial's first, English-language blog.

The big news here in infrastructure is that we're getting towards the end of a rollout of the Zabbix network management system. This has been quite a protracted process, as well as a steep learning curve for me. But first let me give you a bit of background.

Well, as you might expect, Asial has dozens of servers. Some of them, like the mail server or the main website, are pretty high profile. Others, located in dark and far-flung corners of the organisation, are much less obvious, but the fact is they're all important to someone. Our problem is how to keep an eye on them all, make sure they're working properly, have advance warning if they're about to go wrong and immediate notification if they do.

Most of this information is available in the log files of course, but who wants to spend all day trawling through those? Enter Zabbix. First of all you build your Zabbix server, then you install the Zabbix agent on all your other servers. The agent runs in the background collecting information about processor load, free disk space, running services and so on, and periodically sends this info back to your Zabbix server. And it's not just servers either: Zabbix can collect just about any information you want from just about any network device you can think of. Printers, routers, disk arrays, no problem.

The basic installation was pretty straightforward and the supplied templates are enough to get you started, but with all this information-gathering potential, configuration has taken a while. In fact I expect to be tweaking this for a good while yet.

Anyway, the good news is that it works, better than I ever thought. I can get an up-to-the-minute overview of the server status any time, as well as detailed information on specific servers, times of spikes in load, long-term historical trends and goodness knows what else. And I get an email or text message if anything goes horribly wrong. Blimey, this is great stuff.



So how much does Zabbix cost and where can I get it? Well, thanks to the hard work of Alexei Vladishev and the Open Source community, it costs nothing. It's free! You can download it from here

What did we do before Zabbix? I don't remember but it wasn't as pretty as this.

Thanks for reading, and I hope to have more for you next month.
読んでくれてありがとう、また来月

Anthony