Sunday, October 14, 2012

Monitoring tool - pnp4nagios special template


When we use Icinga or Nagios with pnp4nagios to graph hardware or middleware resources, we often want to see a resource from several services combined in one graph, such as the load average of all servers or the HTTP response time of every web server, so that we can compare them and examine the differences.
pnp4nagios lets us do this with our own templates, called special templates.

"Special templates (starting with PNP 0.6.5) are used to combine data from arbitrary hosts and services and thus are not connected directly to a host or service," says the pnp4nagios documentation about special templates, here.

So what does a special template look like? Here is an example which combines the load average of all hosts into one graph.
sample_load.php
<?php
$this->MACRO['TITLE']   = "LOADAVERAGE";
$this->MACRO['COMMENT'] = "For All Servers";
$services = $this->tplGetServices("","LOADAVERAGE$");
# The Datasource Name for Graph 0
$ds_name[0] = "LOADAVERAGE";
$opt[0]     = "--title \"LOADAVERAGE\"";
$def[0]     = "";
# Iterate through the list of hosts
foreach($services as $key=>$val){
  $data = $this->tplGetData($val['host'],$val['service']);
  #throw new Kohana_exception(print_r($data,TRUE));
  $hostname   = rrd::cut($data['MACRO']['HOSTNAME']);
  $def[0]    .= rrd::def("var$key" , $data['DS'][0]['RRDFILE'], $data['DS'][0]['DS'] );
  $def[0]    .= rrd::line1("var$key", rrd::color($key), $hostname);
  $def[0]    .= rrd::gprint("var$key", array("MAX", "AVERAGE"));
}
?> 


The sample template is on my github. Please see the official reference for more detailed information about how to define special templates.

Next, I would like to demonstrate how to create your own template for the number of HTTP accesses, step by step, including setting up the nagios plugin, a pnp4nagios custom template and a special template.
If you need to install Icinga or pnp4nagios, please see the past articles, here.

This is the graph which sample_apache_access.php generates.


  • setup nagios plugin (/usr/local/icinga/libexec)
# for modules in LWP::UserAgent Time::HiRes Digest::MD5 ; do cpan -if $modules ; done
# wget  http://blog.spreendigital.de/wp-content/uploads/2009/07/check_apachestatus_auto.tgz -O- | tar zx
# ./check_apachestatus_auto.pl -H 127.0.0.1
APACHE OK - 0.050 sec. response time, Busy/Idle 1/9, open 246/256, ReqPerSec 0.4, BytesPerReq 17, BytesPerSec 5|Idle=9 Busy=1 OpenSlots=246 Slots=256 Starting=0 Reading=0 Sending=1 Keepalive=0 DNS=0 Closing=0 Logging=0 Finishing=0 ReqPerSec=0.350877 BytesPerReq=17 BytesPerSec=5.988304 Accesses=60
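Everything after the `|` in that output is the performance data pnp4nagios graphs. As a sketch of how such a line splits into metrics (the string below is a shortened copy of the output above, and the variable names are just for illustration):

```shell
# Split the plugin output at the first '|', then word-split the performance
# data into label=value pairs and pull out one metric (Accesses).
out='APACHE OK - 0.050 sec. response time|Idle=9 Busy=1 ReqPerSec=0.350877 Accesses=60'
perf=${out#*|}                 # everything after the first '|'
for kv in $perf; do
  case $kv in
    Accesses=*) accesses=${kv#*=} ;;
  esac
done
echo "Accesses=$accesses"
```

Each of those label=value pairs becomes one data source (DS) in the RRD file, which is what `$data['DS'][0]` refers to in the special template above.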

  • enable the mod_status module if it's not enabled (httpd.conf or an included configuration)
ExtendedStatus On
<VirtualHost *:80>
  ServerName 127.0.0.1
  <Location /server-status>
    SetHandler server-status
    Order deny,allow
    Allow from 192.168.0.0/24
  </Location>
</VirtualHost>

  • define command and service configuration for Icinga/Nagios
define command{
        command_name    check_apache_performance
        command_line    $USER1$/check_apachestatus_auto.pl -H $HOSTADDRESS$ -t $ARG1$
}
define  service{
        use                    generic-service
        host_name               ha-mgr02, eco-web01, eco-web02
        service_description     Apache:Performance
        check_command           check_apache_performance!60
}
※Make sure that the hosts are defined in hosts.cfg.
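When the check runs, Icinga substitutes the macros in command_line before executing the plugin. A rough illustration of that expansion (the plugin path matches the setup above; the host address is made up):

```shell
# Hypothetical macro expansion of check_apache_performance!60 for one host.
USER1=/usr/local/icinga/libexec        # $USER1$ from resource.cfg
HOSTADDRESS=192.168.0.11               # $HOSTADDRESS$ (made-up address)
ARG1=60                                # $ARG1$ from "check_apache_performance!60"
command_line="$USER1/check_apachestatus_auto.pl -H $HOSTADDRESS -t $ARG1"
echo "$command_line"
```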

  • setup custom template
    put check_apache_performance.php on /usr/local/pnp4nagios/share/template/
  • setup special template
    put sample_apache_access.php on /usr/local/pnp4nagios/share/templates.special/
  • Take a look at http://<your icinga server>/pnp4nagios/special?tpl=sample_apache_access

Lastly, I'm going to show you some examples.

Let's enjoy creating your own template and saving time to look around all of the graphs.

Sunday, August 26, 2012

Backup/Restore software - Amanda

Implementing a backup system is tiresome work, even with Open Source Software or Enterprise Backup Software, yet it is absolutely necessary in order to restore data after loss or corruption.

"Backups have two distinct purposes. The primary purpose is to recover data after its loss, be it by data deletion or corruption. Data loss can be a common experience of computer users. The secondary purpose of backups is to recover data from an earlier time, according to a user-defined data retention policy, typically configured within a backup application for how long copies of data are required," says Wikipedia about backup.

I believe that Amanda Open Source Backup (Community Edition) is relatively easy and quick to set up, and well documented on its wiki regarding installation, parameter tuning and troubleshooting. It can back up multiple hosts over the network to tape changers, disks, optical media or AWS S3.

I would like to show how to set up the Amanda server and client and walk through a backup/restore cycle.
  • The relation between amanda server and client 
  • Prepare Amanda server and client in common
・/etc/hosts
As Amanda uses /etc/hosts to resolve hostnames, both the Amanda server and the client need their hostnames in /etc/hosts.
# cat > /etc/hosts <<EOF
192.168.0.192 amanda_client
192.168.0.193 amanda_server
EOF

・install xinetd and the related libraries
# yum -y install xinetd.x86_64 gnupg.x86_64 sharutils.x86_64

・start xinetd
# /etc/init.d/xinetd start

・activate xinetd
# chkconfig xinetd on
 
・allow amanda backup services with iptables if necessary
# vi /etc/sysconfig/iptables
...
-A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
-A INPUT -s 192.168.0.0/24 -p tcp --dport 10080 -j ACCEPT
...
COMMIT
# /etc/init.d/iptables restart

  • Set up Amanda Client
・install amanda client
# wget http://www.zmanda.com/downloads/community/Amanda/3.3.2/Redhat_Enterprise_6.0/amanda-backup_client-3.3.2-1.rhel6.x86_64.rpm
# rpm -ivh amanda-backup_client-3.3.2-1.rhel6.x86_64.rpm
# rpm -qa | egrep 'amanda-backup'
amanda-backup_client-3.3.2-1.rhel6.x86_64

・create amandahosts file
# cat > ~amandabackup/.amandahosts << EOF
amanda_server amandabackup amdump
EOF
# chmod 700 ~amandabackup/.amandahosts
・create amanda-client.conf(Amanda client configuration file)
# mkdir /etc/amanda/default
# cat > /etc/amanda/default/amanda-client.conf <<EOF
conf "default"               
index_server "amanda_server"
tape_server  "amanda_server"
tapedev      "file:/var/lib/amanda/vtapes"
auth "bsdtcp"
ssh_keys "/var/lib/amanda/.ssh/id_rsa_amrecover"
EOF
・create amandates file
※The file has to exist whether it is currently used or not, as it was once used to calculate file sizes.
http://wiki.zmanda.com/index.php/FAQ:What_is_the_'amandates'_file_for%3F
# touch /var/amanda/amandates


・setup the directories' owner
# chown -fR amandabackup:disk /var/*/amanda /etc/amanda

  • Set up Amanda server
・install amanda server
# wget http://www.zmanda.com/downloads/community/Amanda/3.3.2/Redhat_Enterprise_6.0/amanda-backup_server-3.3.2-1.rhel6.x86_64.rpm
# rpm -ivh amanda-backup_server-3.3.2-1.rhel6.x86_64.rpm
# rpm -qa | egrep 'amanda-backup'
amanda-backup_server-3.3.2-1.rhel6.x86_64


・create amanda.conf(Amanda server configuration file)
※There are sample and template files under /var/lib/amanda/example and /var/lib/amanda/template.d.
$ cat > /etc/amanda/default/amanda.conf <<EOF
org      "bk01"
send-amreport-on all
dumpuser "amandabackup"
inparallel 4            
dumporder "sssS"        
taperalgo first         
displayunit "k"         
netusage  100000 Kbps   
dumpcycle 1 weeks       
runspercycle 7        
tapecycle 16 tapes      
bumpsize 20 Mb          
bumppercent 20          
bumpdays 1              
bumpmult 4              
ctimeout 120
etimeout 1800
dtimeout 300
connect-tries 3
req-tries 5
device_output_buffer_size 1280k
usetimestamps yes
flush-threshold-dumped 0
flush-threshold-scheduled 0
taperflush 0
autoflush no
runtapes 1                      
maxdumpsize -1          
labelstr "^default-[0-9][0-9]*$"
amrecover_changer "changer"     
holdingdisk hd1 {
    comment "main holding disk"
    directory "/var/lib/amanda/holding"
    use 3 Gb                    
    chunksize 1Gb       
    }
infofile "/etc/amanda/default/state/curinfo"     
logdir   "/etc/amanda/default/state/log"         
indexdir "/etc/amanda/default/state/index"               
tpchanger "chg-disk"
tapedev "file:/var/lib/amanda/vtapes"
tapetype HARDDISK
define tapetype global {
    part_size 1G
    part_cache_type none
}
define tapetype HARDDISK {
 length 3072 mbytes
}
define dumptype global {
    comment "Global definitions"
    index yes
    auth "bsdtcp"
}
define dumptype root-tar {
    global
    program "GNUTAR"
    comment "root partitions dumped with tar"
    compress none
    index
    priority low
}
define dumptype user-tar {
    root-tar
    comment "user partitions dumped with tar"
    priority medium
}
define dumptype comp-user-tar {
    user-tar
    compress client fast
    estimate calcsize
}
define taperscan taper_lexical {
    comment "lexical"
    plugin "lexical"
}
taperscan "taper_lexical"
 
・create the directories
# mkdir -p /var/lib/amanda/holding /etc/amanda/default/state/{curinfo,log,index}
 
・setup the directories' owner
# chown -fR amandabackup:disk /var/*/amanda /etc/amanda
 
・create the virtual tape drive
# su - amandabackup
$ for slot_num in `seq 1 25` ; do
  mkdir -p /var/lib/amanda/vtapes/slot${slot_num}
done
 
・set up the virtual tape drive
$ ln -s /var/lib/amanda/vtapes/slot1 /var/lib/amanda/vtapes/data
 
・label the volume in the slot
$ for i in `seq 1 9`; do
  amlabel default default-0${i} slot ${i}
done

Reading label...
Found an empty tape.
Writing label 'default-01'...
Checking label...
Success!
...
Reading label...
Found an empty tape.
Writing label 'default-09'...
Checking label...
Success!
 
・show the contents of all slots
$ amtape default show
slot   9: date X              label default-09
slot  10: unlabeled volume
slot  11: unlabeled volume
slot  12: unlabeled volume
slot  13: unlabeled volume
slot  14: unlabeled volume
slot  15: unlabeled volume
slot  16: unlabeled volume
slot   1: date X              label default-01
slot   2: date X              label default-02
slot   3: date X              label default-03
slot   4: date X              label default-04
slot   5: date X              label default-05
slot   6: date X              label default-06
slot   7: date X              label default-07
slot   8: date X              label default-08
 
・reset the tape changer
$ amtape default reset
changer is reset
 
・create disklist(the directories to be archived)
$ cat > /etc/amanda/default/disklist <<EOF
amanda_client /var/www comp-user-tar
EOF
 
・run the self-check on both the amanda tape server and amanda client hosts
$ amcheck default
Amanda Tape Server Host Check
-----------------------------
WARNING: holding disk /var/lib/amanda/holding: only 3076096 KB available (3145728 KB requested)
found in slot 1: volume 'default-01'
slot 1: volume 'default-01' is still active and cannot be overwritten
found in slot 2: volume 'default-02'
slot 2: volume 'default-02'
Will write to volume 'default-02' in slot 2.
NOTE: skipping tape-writable test
NOTE: info dir /etc/amanda/default/state/curinfo/amanda_client/_var_www does not exist
NOTE: it will be created on the next run.
NOTE: index dir /etc/amanda/default/state/index/amanda_client/_var_www does not exist
NOTE: it will be created on the next run.
Server check took 1.683 seconds

Amanda Backup Client Hosts Check
--------------------------------
Client check: 2 hosts checked in 2.160 seconds.  0 problems found.

(brought to you by Amanda 3.3.2)
  • verify backups
[Amanda server]
・backup the disk
$ amdump default

・show the archived data
$ amadmin default find
date                host          disk     lv tape or file      file part status
2012-08-15 18:21:59 amanda_client /var/www  0 default-02    1  1/1 OK

[Amanda client]
Amanda offers two ways to restore archived data: amrecover and amrestore.
I will use amrecover in an interactive manner.
・connect to the amanda server
# amrecover -s amanda_server -t amanda_server -C default
AMRECOVER Version 3.3.2. Contacting server on amanda_server ...
220 magento AMANDA index server (3.3.2) ready.
Setting restore date to today (2012-08-15)
200 Working date set to 2012-08-15.
200 Config set to default.
200 Dump host set to amanda_client.
Use the setdisk command to choose dump disk to recover
 
・list all the disk names on the amanda client host
amrecover> listdisk
200- List of disk for host amanda_client
201- /var/www
200 List of disk for host amanda_client
 
・specify which disk to restore
amrecover> setdisk /var/www
200 Disk set to /var/www.
 
・specify the working directory to which the archived data is restored
amrecover> lcd /tmp/var
amrecover> lpwd
/tmp/var
 
・add the files to be restored (here, all of the data, specified with a wildcard)
amrecover> add *
Added dir /icons/ at date 2012-08-15-18-21-59
Added dir /html/ at date 2012-08-15-18-21-59
Added dir /error/ at date 2012-08-15-18-21-59
Added dir /cgi-bin/ at date 2012-08-15-18-21-59


・restore
amrecover> extract
...
./icons/small/unknown.png
./icons/small/uu.gif
./icons/small/uu.png
amrecover> exit
 
・verify the difference between the archived data and the restored data
# diff -r /var/www/ /tmp/var/


・verify the size of the archived data and the restored data
# du -cks /var/www/ /tmp/var/
1176    /var/www/
1176    /tmp/var/
2352    total
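The diff and du checks above can also be complemented with a checksum comparison; a small sketch (verify_restore is my own helper name, not part of Amanda):

```shell
# Compare per-file MD5 checksums of two directory trees; succeeds only when
# both trees contain files with identical paths and contents.
verify_restore() {
  src=$1; dst=$2
  a=$(cd "$src" && find . -type f -exec md5sum {} + | sort)
  b=$(cd "$dst" && find . -type f -exec md5sum {} + | sort)
  [ "$a" = "$b" ]
}
# usage: verify_restore /var/www /tmp/var && echo "restore verified"
```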

That's all!

Monday, July 16, 2012

Monitoring tool - pnp4nagios custom template

I previously introduced how to set up icinga and icinga-web, and icinga-web with pnp4nagios, to build a monitoring server with icinga and pnp4nagios. Now I'm going to show you pnp4nagios custom templates, which influence the appearance of the RRD graphs.

Why is it necessary to create custom templates?


I believe the reason is that we sometimes need to look at graphs of specific hardware resources or performance data when analyzing or investigating network devices, servers or middleware performance: for example, how much CPU or memory is utilized, how much disk space is left, how much traffic is transferred, and so on.

If you need further information about custom templates for pnp4nagios, please see the official reference.

I'll give you an example of a custom template, based on one of the default templates ($pnp4nagios_prefix/share/templates.dist/interger.php), for traffic graphs with the nagios plugin check_tcptraffic.

  • check_tcptraffic
# for module in \
Carp \
English \
Nagios::Plugin \
Readonly
do cpan -i $module ; done
# wget https://trac.id.ethz.ch/projects/nagios_plugins/downloads/check_tcptraffic-2.2.4.tar.gz
# tar zxf check_tcptraffic-2.2.4.tar.gz
# cd check_tcptraffic-2.2.4
# ./check_tcptraffic -i eth0 -s 100 -w 10 -c 20
TCPTRAFFIC CRITICAL - eth0 182216 bytes/s | TOTAL=182216Byte;10;20 IN=180221Byte;; OUT=1995Byte;; TIME=204852Byte;; 
  • commands.cfg
define command{
        command_name    check_traffic
        command_line    $USER1$/check_tcptraffic -t $ARG1$ -s 1000 -w $ARG2$ -c $ARG3$ -i $ARG4$
        }
  • services.cfg 
define  service{
        use                     generic-service
        host_name               <hostname>
        service_description     TRAFFIC:eth0
        check_command           check_traffic!60!10000000!20000000!eth0
}
  • check_traffic.php (custom template for pnp4nagios)
    ※template_dirs=/usr/local/pnp4nagios/share/templates
<?php
$ds_name[1] = "$NAGIOS_AUTH_SERVICEDESC"; 
$opt[1] = "--vertical-label \"$UNIT[1]\" --title \"$hostname / $servicedesc\" ";
$def[1]  = rrd::def("var1", $RRDFILE[1], $DS[1], "AVERAGE");
$def[1] .= rrd::def("var2", $RRDFILE[2], $DS[2], "AVERAGE");
$def[1] .= rrd::def("var3", $RRDFILE[3], $DS[3], "AVERAGE");

if ($WARN[1] != "") {
    $def[1] .= "HRULE:$WARN[1]#FFFF00 ";
}
if ($CRIT[1] != "") {
    $def[1] .= "HRULE:$CRIT[1]#FF0000 ";       
}
$def[1] .= rrd::line1("var1", "#000000", "$NAME[1]") ;
$def[1] .= rrd::gprint("var1", array("LAST", "AVERAGE", "MAX"), "%6.2lf");
$def[1] .= rrd::area("var2", "#00ff00", "$NAME[2]") ;
$def[1] .= rrd::gprint("var2", array("LAST", "AVERAGE", "MAX"), "%6.2lf");
$def[1] .= rrd::line1("var3", "#0000ff", "$NAME[3]") ;
$def[1] .= rrd::gprint("var3", array("LAST", "AVERAGE", "MAX"), "%6.2lf");
?>

check_traffic.php generates the graphs below.









Other custom templates are publicly available on my github.
This is the list of custom templates and sample graphs.

  • check_apache_performance.php




  • check_connections.php


  • check_cpu.php
 
  • check_disk.php 
 
  • check_diskio.php
 
  • check_http.php

  • check_load.php
 
  • check_mem.php
 
  • check_mysql_health.php
 
  • check_nagios_latency_service.php
 
  • check_traffic.php

Sunday, May 20, 2012

HA Monitoring - MySQL replication

It is necessary to build a redundant system with High Availability to keep a service running at all times or to reduce downtime.
"High availability is a system design approach and associated service implementation that ensures a prearranged level of operational performance will be met during a contractual measurement period." Wikipedia, High Availability
It is also important to verify that a system with high availability is running in its intended state, Master/Slave or Primary/Secondary.
I am going to introduce how to monitor such systems with nagios in the following examples; let's look at MySQL replication first.
  • MySQL Replication
  • PostgreSQL Replication
  • HA Cluster with DRBD & Pacemaker

MySQL Replication

It is important to monitor that the master's binlog dump thread is running, that the slave's I/O and SQL threads are running, and the slave lag (seconds behind master). The MySQL documentation introduces the details of the replication implementation, here. I would like to show how to monitor the status of the slave server (I/O and SQL threads) with a nagios plug-in called check_mysql_health, released by ConSol Labs.
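Under the hood these checks read the output of SHOW SLAVE STATUS on the slave; the mapping below is my own summary of which fields the plugin modes correspond to (field names as of MySQL 5.5):

```sql
-- Run on the slave:
--   Slave_IO_Running      -> --mode slave-io-running
--   Slave_SQL_Running     -> --mode slave-sql-running
--   Seconds_Behind_Master -> --mode slave-lag
SHOW SLAVE STATUS\G
```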

This plug-in, by the way, is absolutely useful because it can check various mysql parameters, such as the number of connections, the query cache hit rate, or the number of slow queries, in addition to the health of mysql replication.

System Structure


OS CentOS-5.8
Kernel 2.6.18-274.el5
DB mysql-5.5.24
Scripting Language perl-5.14.2
Nagios Plugin check_mysql_health-2.1.5.1
icinga core icinga-1.6.1

Install check_mysql_health

  •  compile & install
# wget http://labs.consol.de/wp-content/uploads/2011/04/check_mysql_health-2.1.5.1.tar.gz
# tar zxf check_mysql_health-2.1.5.1.tar.gz
# cd check_mysql_health-2.1.5.1
# ./configure \
--with-nagios-user=nagios \
--with-nagios-group=nagios \
--with-mymodules-dir=/usr/lib64/nagios/plugins
# make
# make install
# cp -p plugins-scripts/check_mysql_health /usr/local/nagios/libexec
  • install cpan modules
# for modules in \
DBI \
DBD::mysql \
Time::HiRes \
IO::File \
File::Copy \
File::Temp \
Data::Dumper \
File::Basename \
Getopt::Long
 do cpan -i $modules
done
  • grant privileges for mysql user
# mysql -uroot -p mysql -e "GRANT SELECT, SUPER,REPLICATION CLIENT ON *.* TO nagios@'localhost' IDENTIFIED BY 'nagios'; FLUSH PRIVILEGES ;" 
# mysql -uroot -p mysql -e "SELECT * FROM user WHERE User = 'nagios'\G;"
*************************** 1. row ***************************
                  Host: localhost
                  User: nagios
              Password: *82802C50A7A5CDFDEA2653A1503FC4B8939C4047
           Select_priv: Y
           Insert_priv: N
           Update_priv: N
           Delete_priv: N
           Create_priv: N
             Drop_priv: N
           Reload_priv: N
         Shutdown_priv: N
          Process_priv: N
             File_priv: N
            Grant_priv: N
       References_priv: N
            Index_priv: N
            Alter_priv: N
          Show_db_priv: N
            Super_priv: Y
 Create_tmp_table_priv: N
      Lock_tables_priv: N
          Execute_priv: N
       Repl_slave_priv: N
      Repl_client_priv: Y
      Create_view_priv: N
        Show_view_priv: N
   Create_routine_priv: N
    Alter_routine_priv: N
      Create_user_priv: N
            Event_priv: N
          Trigger_priv: N
Create_tablespace_priv: N
              ssl_type: 
            ssl_cipher: 
           x509_issuer: 
          x509_subject: 
         max_questions: 0
           max_updates: 0
       max_connections: 0
  max_user_connections: 0
                plugin: 
 authentication_string: NULL
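A more compact way to confirm the same privileges (a standard MySQL statement, not part of the plugin):

```sql
SHOW GRANTS FOR 'nagios'@'localhost';
-- expect: GRANT SELECT, SUPER, REPLICATION CLIENT ON *.* TO 'nagios'@'localhost'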
  • fix the "parentheses deprecated" warning
    Please revise the lines below if the warning is detected.
# check_mysql_health --hostname localhost --username root --mode uptime
Use of qw(...) as parentheses is deprecated at check_mysql_health line 1247.
Use of qw(...) as parentheses is deprecated at check_mysql_health line 2596.
Use of qw(...) as parentheses is deprecated at check_mysql_health line 3473.
OK - database is up since 2677 minutes | uptime=160628s
# cp -p check_mysql_health{,.bak}
# vi check_mysql_health
...
# diff -u check_mysql_health.bak check_mysql_health
--- check_mysql_health.bak    2011-07-15 17:46:28.000000000 +0900
+++ check_mysql_health        2011-07-17 14:04:45.000000000 +0900
@@ -1244,7 +1244,7 @@
   my $message = shift;
   push(@{$self->{nagios}->{messages}->{$level}}, $message);
   # recalc current level
-  foreach my $llevel qw(CRITICAL WARNING UNKNOWN OK) {
+  foreach my $llevel (qw(CRITICAL WARNING UNKNOWN OK)) {
     if (scalar(@{$self->{nagios}->{messages}->{$ERRORS{$llevel}}})) {
       $self->{nagios_level} = $ERRORS{$llevel};
     }
@@ -2593,7 +2593,7 @@
   my $message = shift;
   push(@{$self->{nagios}->{messages}->{$level}}, $message);
   # recalc current level
-  foreach my $llevel qw(CRITICAL WARNING UNKNOWN OK) {
+  foreach my $llevel (qw(CRITICAL WARNING UNKNOWN OK)) {
     if (scalar(@{$self->{nagios}->{messages}->{$ERRORS{$llevel}}})) {
       $self->{nagios_level} = $ERRORS{$llevel};
     }
@@ -3469,8 +3469,8 @@
   $needs_restart = 1;
   # if the calling script has a path for shared libs and there is no --environment
   # parameter then the called script surely needs the variable too.
-  foreach my $important_env qw(LD_LIBRARY_PATH SHLIB_PATH 
-      ORACLE_HOME TNS_ADMIN ORA_NLS ORA_NLS33 ORA_NLS10) {
+  foreach my $important_env (qw(LD_LIBRARY_PATH SHLIB_PATH 
+      ORACLE_HOME TNS_ADMIN ORA_NLS ORA_NLS33 ORA_NLS10)) {
     if ($ENV{$important_env} && ! scalar(grep { /^$important_env=/ } 
         keys %{$commandline{environment}})) {
       $commandline{environment}->{$important_env} = $ENV{$important_env};
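If you prefer not to patch by hand, the single-line occurrences can be fixed with sed; a sketch (the multi-line qw(...) list in the last hunk still needs manual editing, and the pattern assumes qw(...) only appears in these foreach lines):

```shell
# Wrap a bare qw(...) list in parentheses:
#   foreach my $llevel qw(A B) {   ->   foreach my $llevel (qw(A B)) {
fix_qw() { sed 's/ qw(\([^)]*\))/ (qw(\1))/g'; }
echo 'foreach my $llevel qw(CRITICAL WARNING UNKNOWN OK) {' | fix_qw
```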

Verification

I am going to verify the mysql replication status (slave lag, I/O thread and SQL thread) under the following conditions, supposing that mysql replication is already running.
Please see the official documentation for how to set up mysql replication.
  1. Both I/O thread and SQL thread running
  2. I/O thread stopped, SQL thread running
  3. I/O thread running, SQL thread stopped
  • Both I/O thread and SQL thread running
# check_mysql_health --hostname localhost --username nagios --password nagios --warning 5 --critical 10 --mode slave-lag
OK - Slave is 0 seconds behind master | slave_lag=0;5;1
# check_mysql_health --hostname localhost --username nagios --password nagios --warning 5 --critical 10 --mode slave-io-running
OK - Slave io is running
# check_mysql_health --hostname localhost --username nagios --password nagios --warning 5 --critical 10 --mode slave-sql-running
OK - Slave sql is running
  • I/O thread stopped, SQL thread running
# mysql -uroot -p mysql -e "STOP SLAVE IO_THREAD;"
# check_mysql_health --hostname localhost --username nagios --password nagios --warning 5 --critical 10 --mode slave-lag
CRITICAL - unable to get slave lag, because io thread is not running
# check_mysql_health --hostname localhost --username nagios --password nagios --warning 5 --critical 10 --mode slave-io-running
CRITICAL - Slave io is not running
# check_mysql_health --hostname localhost --username nagios --password nagios --warning 5 --critical 10 --mode slave-sql-running
OK - Slave sql is running
  • I/O thread running, SQL thread stopped
# mysql -uroot -p mysql -e "STOP SLAVE SQL_THREAD;"
# check_mysql_health --hostname localhost --username nagios --password nagios --warning 5 --critical 10 --mode slave-lag
CRITICAL - unable to get replication info
# check_mysql_health --hostname localhost --username nagios --password nagios --warning 5 --critical 10 --mode slave-io-running
OK - Slave io is running
# check_mysql_health --hostname localhost --username nagios --password nagios --warning 5 --critical 10 --mode slave-sql-running
CRITICAL - Slave sql is not running


Let's see how to monitor PostgreSQL streaming replication, next.

Sunday, May 13, 2012

Monitoring tool - monitor icinga/nagios with monit

Having introduced how to install monit before, I am now explaining how to monitor icinga/nagios with monit. Monit has several kinds of tests, defined here. I adopt PPID testing, which tests the parent process identification number (ppid) of a process for changes, to check the icinga daemon.
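For reference, the ppid that PPID testing watches is the same value ps reports; a quick illustration:

```shell
# Read the parent PID of the current shell, the value PPID testing tracks.
ppid=$(ps -o ppid= -p $$ | tr -d ' ')
echo "parent pid: $ppid"
```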

The configurations for service entry statement are released on my github.

Configuration

  • setup pidfile of icinga.cfg(nagios.cfg)
    Though the directive is named "lock_file", it actually writes the process id number to the file. The Nagios documentation says, here:
    "This option specifies the location of the lock file that Nagios should create when it runs as a daemon (when started with the -d command line argument). This file contains the process id (PID) number of the running Nagios process."
# grep '^lock_file' icinga.cfg 
lock_file=/var/run/icinga.pid
# /etc/init.d/icinga reload
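With the pidfile in place, monit's liveness check boils down to reading the PID and probing the process; a sketch (check_pidfile is my own helper name, not a monit command):

```shell
# Succeeds when the pidfile exists and the recorded process is alive
# (signal 0 probes a process without actually sending a signal).
check_pidfile() {
  pidfile=$1
  [ -f "$pidfile" ] && kill -0 "$(cat "$pidfile")" 2>/dev/null
}
# usage: check_pidfile /var/run/icinga.pid && echo "icinga is running"
```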
  • setup service entry statement of icinga
# cat > /etc/monit.d/icinga.conf << EOF
check process icinga
      with pidfile "/var/run/icinga.pid"
      start program = "/etc/init.d/icinga start"
      stop program = "/etc/init.d/icinga stop"
      if 3 restarts within 3 cycles then alert

EOF

Start up

  • begin monitoring
# monit monitor icinga
# monit start icinga
  • see the summary
# monit summary | grep 'icinga'
Process 'icinga'                    Running
  • see the monit log file
# tail -f /var/log/monit/monit.log
[JST May 13 14:35:48] info     : 'icinga' monitor on user request
[JST May 13 14:35:48] info     : monit daemon at 13661 awakened
[JST May 13 14:35:48] info     : Awakened by User defined signal 1
[JST May 13 14:35:48] info     : 'icinga' monitor action done
[JST May 13 14:37:07] error    : monit: invalid argument -- staus  (-h will show valid arguments)
[JST May 13 14:37:39] info     : 'icinga' start on user request
[JST May 13 14:37:39] info     : monit daemon at 13661 awakened
[JST May 13 14:37:39] info     : Awakened by User defined signal 1
[JST May 13 14:37:39] info     : 'icinga' start action done

Verification 

  • verify that the icinga daemon is restarted if its process is stopped
# /etc/init.d/icinga status
icinga (pid  31107) is running...
# kill `pgrep icinga`
  • see in the log file that monit restarts icinga
# cat /var/log/monit/monit.log
[JST May 13 14:37:39] info     : 'icinga' start on user request
[JST May 13 14:37:39] info     : monit daemon at 13661 awakened
[JST May 13 14:37:39] info     : Awakened by User defined signal 1
[JST May 13 14:37:39] info     : 'icinga' start action done
[JST May 13 14:45:40] error    : 'icinga' process is not running
[JST May 13 14:45:40] info     : 'icinga' trying to restart
[JST May 13 14:45:40] info     : 'icinga' start: /etc/init.d/icinga
  • check icinga is running.
# /etc/init.d/icinga status
icinga (pid  21093) is running...

Configuration examples(ido2db, npcd)

  • setup pidfile of ido2db.cfg (ndo2db)
# grep '^lock_file' ido2db.cfg 
lock_file=/var/run/ido2db.pid
  • setup service entry statement of ido2db
# cat > /etc/monit.d/ido2db.monit << EOF
check process ido2db
      with pidfile "/var/run/ido2db.pid"
      start program = "/etc/init.d/ido2db start"
      stop program = "/etc/init.d/ido2db stop"
      if 3 restarts within 3 cycles then alert

EOF
  • begin monitoring
# monit monitor ido2db
# monit start ido2db
  • setup pidfile of npcd.cfg (pnp4nagios)
# grep '^pid_file' npcd.cfg 
pid_file=/var/run/npcd.pid
  • setup service entry statement of npcd
# cat > /etc/monit.d/npcd.monit << EOF
check process npcd
      with pidfile "/var/run/npcd.pid"
      start program = "/etc/init.d/npcd start"
      stop program = "/etc/init.d/npcd stop"
      if 3 restarts within 3 cycles then alert

EOF
  • begin monitoring
# monit monitor npcd
# monit start npcd


Monitoring tool - install monit

Icinga, nagios and other monitoring tools can monitor whether a specified daemon or process is running. But while they can watch other daemons, what happens if the icinga or nagios daemon itself stops?
Monit is capable of monitoring a daemon by checking that a specified process or port is alive, and of restarting the daemon or even stopping it.
"Monit is a free open source utility for managing and monitoring, processes, programs, files, directories and filesystems on a UNIX system. Monit conducts automatic maintenance and repair and can execute meaningful causal actions in error situations." MONIT Official
I'd like to introduce installing monit first, and then how to monitor icinga with monit.
The configurations are released on my github, here.

Reference

Install monit

  •  setup rpmforge repository
# rpm -ivh http://pkgs.repoforge.org/rpmforge-release/rpmforge-release-0.5.2-2.el6.rf.x86_64.rpm
# sed -i 's/enabled = 1/enabled = 0/' /etc/yum.repos.d/rpmforge.repo
  • install monit
# yum -y --enablerepo=rpmforge install monit
  • verify installation
# monit -V
This is Monit version 5.3.2
Copyright (C) 2000-2011 Tildeslash Ltd. All Rights Reserved.

Configuration

  •  /etc/monitrc (monit control file)
    Please see the official documentation if you need further information about the monit control file.
    The set alert directive below makes monit send an alert for every event except those listed, from checksum to timestamp.
# cat > /etc/monitrc << EOF
set daemon 120 with start delay 30
set logfile /var/log/monit/monit.log
## Sending E-mail, put off the comment below
set mailserver localhost
set alert username@domain not {
checksum
content
data
exec
gid
icmp
invalid
fsflags
permission
pid
ppid
size
timestamp
#action
#nonexist
#timeout
}
mail-format {
from: monit@$HOST
subject: Monit Alert -- $SERVICE $EVENT --
message:
Hostname:       $HOST
Service:        $SERVICE
Action:         $ACTION
Date/Time:      $DATE
Info:           $DESCRIPTION
}
set idfile /var/monit/id
set statefile /var/monit/state
set eventqueue
    basedir /var/monit  
    slots 100           
set httpd port 2812 and
    allow localhost 
    allow 192.168.0.0/24
    allow admin:monit      
include /etc/monit.d/*.conf
EOF
  • setup logging 
# mkdir /var/log/monit
# cat > /etc/logrotate.d/monit <<EOF
/var/log/monit/*.log {
  missingok
  notifempty
  rotate 12
  weekly
  compress
  postrotate
    /usr/bin/monit quit  
  endscript
}
EOF 
  • setup include file (service entry statement)
    The following is an example of monitoring ntpd.
# cat > /etc/monit.d/ntpd.conf << EOF
check process ntpd
        with pidfile "/var/run/ntpd.pid"
        start program = "/etc/init.d/ntpd start"
        stop program = "/etc/init.d/ntpd stop"
        if 3 restarts within 3 cycles then alert

EOF
  •  verify syntax
# monit -t
Control file syntax OK

Start up

  • run monit from init
    Monit can be run from an init script, but I want to make certain that a running Monit daemon is always present on the system, so I start it from /etc/inittab with respawn.
# cat >> /etc/inittab <<EOF
mo:2345:respawn:/usr/bin/monit -Ic /etc/monitrc
EOF
  • re-examine /etc/inittab 
# telinit q
# tail -f /var/log/messages
May 13 12:34:35 ha-mgr02 init: Re-reading inittab
  • check monit running
# ps awuxc | grep 'monit'
root      1431  0.0  0.0  57432  1876 ?        Ssl  11:38   0:00 monit 
  • stop monit process and check that init begins monit
# kill `pgrep monit` ; ps cawux | grep 'monit'
root     13661  0.0  0.0  57432  1780 ?        Ssl  13:31   0:00 monit

  • show status and summary
# monit status
Process 'ntpd'
  status                            Running
  monitoring status                 Monitored
  pid                               32307
  parent pid                        1
  uptime                            12d 17h 44m 
  children                          0
  memory kilobytes                  5040
  memory kilobytes total            5040
  memory percent                    0.2%
  memory percent total              0.2%
  cpu percent                       0.0%
  cpu percent total                 0.0%
  data collected                    Sun, 13 May 2012 12:34:35

System 'system_ha-mgr02.forschooner.net'
  status                            Running
  monitoring status                 Monitored
  load average                      [0.09] [0.20] [0.14]
  cpu                               1.6%us 3.2%sy 0.3%wa
  memory usage                      672540 kB [32.6%]
  swap usage                        120 kB [0.0%]
  data collected                    Sun, 13 May 2012 12:32:35
  • show summary 
# monit summary
The Monit daemon 5.3.2 uptime: 58m 

Process 'sshd'                      Running
Process 'ntpd'                      Running
System 'system_ha-mgr02.forschooner.net' Running

Start up from upstart

Since RHEL 6.x and CentOS 6.x adopt upstart, it is necessary to use upstart instead of init on those distributions.
  • setup /etc/init/monit.conf
# monit_bin=$(which monit)
# cat > /etc/init/monit.conf << EOF
# monit respawn
description     "Monit"

start on runlevel [2345]
stop on runlevel [!2345]
 
respawn
exec $monit_bin -Ic /etc/monit.conf
EOF 
  • show a list of the known jobs and instances
# initctl list
 rc stop/waiting
 tty (/dev/tty3) start/running, process 1249
 ...
 monit stop/waiting
 serial (hvc0) start/running, process 1239
 rcS-sulogin stop/waiting
  • begin monit
# initctl start monit
 monit start/running, process 6873
  • see the status of the job(monit)
 # initctl status monit
 monit start/running, process 6873
  • stop monit process
# kill `pgrep monit`
  • check that upstart begins monit
# ps cawux | grep monit
 root      7140  0.0  0.1   7004  1840 ?        Ss   21:42   0:00 monit
  • see the log file that monit is respawning
# tail -1 /var/log/messages
 Oct 20 12:42:41 ip-10-171-47-212 init: monit main process ended, respawning

Verification

  • access the Monit web interface (http://<IP address>:2812)

  • check that Monit restarts the ntp daemon after it stops
# /etc/init.d/ntpd status
ntpd (pid  32307) is running...
# /etc/init.d/ntpd stop  
Shutting down ntpd:                                        [  OK  ]
  • see the log file that monit starts ntpd 
# cat /var/log/monit/monit.log
[JST May 13 12:52:24] error    : 'ntpd' process is not running
[JST May 13 12:52:24] info     : 'ntpd' trying to restart
[JST May 13 12:52:24] info     : 'ntpd' start: /etc/init.d/ntpd
  • check ntpd is running
# /etc/init.d/ntpd status
ntpd (pid  9475) is running...

Mail sample format

The following are examples of the alert mails Monit sends.
  • notifying that the daemon is stopped
<Subject>
Monit Alert -- ntpd Does not exist --
<Body>
Hostname:       ha-mgr02.forschooner.net
Service:        ntpd
Action:         restart
Date/Time:      Sun, 13 May 2012 12:52:24
Info:           process is not running 
  • notifying that the daemon starts
<Subject>
Monit Alert -- ntpd Action done --
<Body>
Hostname:       ha-mgr02.forschooner.net
Service:        ntpd
Action:         alert
Date/Time:      Sun, 13 May 2012 12:54:15
Info:           start action done 
  • notifying that the daemon is running again
<Subject>
Monit Alert -- ntpd Exists --
<Body>
Hostname:       ha-mgr02.forschooner.net
Service:        ntpd
Action:         alert
Date/Time:      Sun, 13 May 2012 12:54:15
Info:           process is running with pid 9475

Friday, May 4, 2012

Key Value Store - monitor cassandra and multinode cluster

Following up on installing Cassandra and creating a multinode cluster, this post introduces how to monitor a Cassandra node and a multinode cluster with my own Nagios plugin.

Monitor cassandra node (check_by_ssh + cassandra-cli)

There are several ways to monitor a Cassandra node with Nagios or Icinga, such as JMX or check_jmx. Though they are fairly effective, they take some time to prepare. Using check_by_ssh together with cassandra-cli is simpler and needs no libraries other than Cassandra itself.
  • commands.cfg
define command{
        command_name    check_by_ssh
        command_line    $USER1$/check_by_ssh -l nagios -i /home/nagios/.ssh/id_rsa -H $HOSTADDRESS$ -p $ARG1$ -t $ARG2$ -C '$ARG3$'
}

  • services.cfg
define service{
      use                     generic-service
      host_name               cassandra
      service_description     Cassandra Node
      check_command           check_by_ssh!22!60!"/usr/local/apache-cassandra/bin/cassandra-cli -h localhost --jmxport 9160 -f /tmp/cassandra_load.txt"
}
  • setup the file to load statements
    setup the statement file on the cassandra node to be monitored.
    "show cluster name;" shows the cluster name.
# cat > /tmp/cassandra_load.txt << EOF
show cluster name;
EOF
  • plugin status when cassandra is running (service status is OK)
# su - nagios
$ check_by_ssh -l nagios -i /home/nagios/.ssh/id_rsa -H 192.168.213.91 -p 22 -t 10 -C "/usr/local/apache-cassandra/bin/cassandra-cli -h 192.168.213.91 --jmxport 9160 -f /tmp/cassandra_load.txt"
Connected to: "Test Cluster" on 192.168.213.91/9160
Test Cluster
  • plugin status when cassandra is stopped (service status is CRITICAL)
# su - nagios
$ check_by_ssh -l nagios -i /home/nagios/.ssh/id_rsa -H 192.168.213.91 -p 22 -t 10 -C "/usr/local/apache-cassandra/bin/cassandra-cli -h 192.168.213.91 --jmxport 9160 -f /tmp/cassandra_load.txt"
Remote command execution failed: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection refused
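The check works because check_by_ssh returns the exit status of the remote command, and Nagios maps exit codes to service states (0=OK, 1=WARNING, 2=CRITICAL, anything else UNKNOWN). A standalone sketch of that mapping; `run_check` is a hypothetical stand-in for the remote call, not part of the plugin:

```shell
# Run a command and print the Nagios service state implied by its exit code.
run_check() {
  "$@" >/dev/null 2>&1
  case $? in
    0) echo OK ;;
    1) echo WARNING ;;
    2) echo CRITICAL ;;
    *) echo UNKNOWN ;;
  esac
}

run_check true              # exit 0 -> OK
run_check false             # exit 1 -> WARNING
run_check sh -c 'exit 2'    # exit 2 -> CRITICAL
```

When cassandra-cli cannot connect, it exits non-zero, which Nagios then reports as a non-OK state (CRITICAL in the example above).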

Monitor multinode cluster (check_cassandra_cluster.sh)

The plugin has been released on Nagios Exchange; please see the details there.
  • overview
    checks whether the number of live nodes belonging to the multinode cluster falls below the specified thresholds.
    the thresholds can be specified with the options -w <warning> and -c <critical>.
    reports the number of live nodes, their status, and performance data.
  • software requirements
    cassandra(using nodetool command)
  • command help
# check_cassandra_cluster.sh -h
Usage: ./check_cassandra_cluster.sh -H <host> -P <port> -w <warning> -c <critical>

 -H <host> IP address or hostname of the cassandra node to connect, localhost by default.
 -P <port> JMX port, 7199 by default.
 -w <warning> alert warning state, if the number of live nodes is less than <warning>.
 -c <critical> alert critical state, if the number of live nodes is less than <critical>.
 -h show command option
 -V show command version 
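At its core, such a check just counts the nodes reported as "Up" by nodetool. A minimal sketch of that logic; the ring output below is fabricated to mimic the nodetool format, and `count_live_nodes` is a hypothetical helper, not the released plugin. The listings below suggest the comparison is inclusive (live <= threshold raises the alert), so the sketch uses -le:

```shell
# Fabricated text mimicking `nodetool ring` output; the real check would
# capture `nodetool -h "$HOST" -p "$PORT" ring` instead.
sample_ring='Address         Status State   Load      Owns
192.168.213.91  Up     Normal  73.76 KB  13.05%
192.168.213.92  Up     Normal  65.2 KB   86.95%
192.168.213.93  Down   Normal  60.14 KB  0.00%'

# Count rows whose Status column is "Up".
count_live_nodes() {
  printf '%s\n' "$1" | awk '$2 == "Up" { n++ } END { print n + 0 }'
}

live=$(count_live_nodes "$sample_ring")
warning=1
critical=0

if [ "$live" -le "$critical" ]; then
  echo "CRITICAL - Live Node:$live"
elif [ "$live" -le "$warning" ]; then
  echo "WARNING - Live Node:$live"
else
  echo "OK - Live Node:$live"      # prints "OK - Live Node:2" here
fi
```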
  •  when service status is OK
# check_cassandra_cluster.sh -H 192.168.213.91 -P 7199 -w 1 -c 0
OK - Live Node:2 - 192.168.213.92:Up,Normal,65.2KB,86.95% 192.168.213.91:Up,Normal,73.76KB,13.05% | Load_192.168.213.92=65.2KB Owns_192.168.213.92=86.95% Load_192.168.213.91=60.14KB Owns_192.168.213.91=13.05%
  •  when service status is WARNING
# check_cassandra_cluster.sh -H 192.168.213.91 -P 7199 -w 2 -c 0
WARNING - Live Node:2 - 192.168.213.92:Up,Normal,65.2KB,86.95% 192.168.213.91:Up,Normal,73.76KB,13.05% | Load_192.168.213.92=65.2KB Owns_192.168.213.92=86.95% Load_192.168.213.91=60.14KB Owns_192.168.213.91=13.05% 
  • when service status is CRITICAL
# check_cassandra_cluster.sh -H 192.168.213.91 -P 7199 -w 3 -c 2
CRITICAL - Live Node:2 - 192.168.213.92:Up,Normal,65.2KB,86.95% 192.168.213.91:Up,Normal,73.76KB,13.05% | Load_192.168.213.92=65.2KB Owns_192.168.213.92=86.95% Load_192.168.213.91=60.14KB Owns_192.168.213.91=13.05%
  • when the warning threshold is less than the critical one
# check_cassandra_cluster.sh -H 192.168.213.91 -P 7199 -w 3 -c 4
-w <warning> 3 must be less than -c <critical> 4.