Sunday, May 20, 2012

HA Monitoring - MySQL replication

A service that has to be available at all times, or at least with minimal downtime, needs a redundant system designed for high availability.
"High availability is a system design approach and associated service implementation that ensures a prearranged level of operational performance will be met during a contractual measurement period." (Wikipedia, High Availability)
It is just as important to confirm that such a system is actually running in its intended state, for example Master/Slave or Primary/Secondary.
In the following posts I will show how to monitor such systems with nagios, starting with MySQL replication.
  • MySQL Replication
  • PostgreSQL Replication
  • HA Cluster with DRBD & Pacemaker

MySQL Replication

It is important to monitor that the binary log dump thread is running on the master, that the I/O and SQL threads are running on the slave, and how far the slave lags (seconds behind master). The official MySQL documentation describes the replication implementation in detail, here. I will show how to monitor the status of the slave server (I/O thread, SQL thread, and lag) with a nagios plug-in called check_mysql_health, released by ConSol Labs.

This plug-in is very useful because it can check many MySQL parameters, such as the number of connections, the query cache hit rate, and the number of slow queries, in addition to the health of MySQL replication.
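
As a rough illustration of what the slave-side checks look at, the sketch below reads the text of SHOW SLAVE STATUS\G on standard input and reports whether both replication threads are running. The function name slave_thread_state and its OK message are made up for this example. It is not how check_mysql_health is implemented, only a minimal shell equivalent of the idea.

```shell
#!/bin/sh
# Sketch only: classify replication health from the text printed by
#   mysql -e 'SHOW SLAVE STATUS\G'
# read on standard input. The function name is hypothetical.
slave_thread_state() {
    awk -F': ' '
        # Strip the leading padding from the field name before comparing.
        { gsub(/^[ \t]+/, "", $1) }
        $1 == "Slave_IO_Running"      { io  = $2 }
        $1 == "Slave_SQL_Running"     { sql = $2 }
        $1 == "Seconds_Behind_Master" { lag = $2 }
        END {
            # Empty input (not a slave at all) also ends up here.
            if (io != "Yes")  { print "CRITICAL - Slave io is not running";  exit 2 }
            if (sql != "Yes") { print "CRITICAL - Slave sql is not running"; exit 2 }
            print "OK - io and sql threads running, " lag " seconds behind master"
        }'
}
```

It could be fed with something like `mysql -unagios -p -e 'SHOW SLAVE STATUS\G' | slave_thread_state`. The exit status follows the Nagios convention (0 for OK, 2 for CRITICAL).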

System Structure


Component            Version
OS                   CentOS-5.8
Kernel               2.6.18-274.el5
DB                   mysql-5.5.24
Scripting Language   perl-5.14.2
Nagios Plugin        check_mysql_health-2.1.5.1
icinga core          icinga-1.6.1

Install check_mysql_health

  •  compile & install
# wget http://labs.consol.de/wp-content/uploads/2011/04/check_mysql_health-2.1.5.1.tar.gz
# tar zxf check_mysql_health-2.1.5.1.tar.gz
# cd check_mysql_health-2.1.5.1
# ./configure \
--with-nagios-user=nagios \
--with-nagios-group=nagios \
--with-mymodules-dir=/usr/lib64/nagios/plugins
# make
# make install
# cp -p plugins-scripts/check_mysql_health /usr/local/nagios/libexec
  • install cpan modules
# for module in \
DBI \
DBD::mysql \
Time::HiRes \
IO::File \
File::Copy \
File::Temp \
Data::Dumper \
File::Basename \
Getopt::Long
do
  cpan -i "$module"
done
  • grant privileges for mysql user
# mysql -uroot -p mysql -e "GRANT SELECT, SUPER,REPLICATION CLIENT ON *.* TO nagios@'localhost' IDENTIFIED BY 'nagios'; FLUSH PRIVILEGES ;" 
# mysql -uroot -p mysql -e "SELECT * FROM user WHERE User = 'nagios'\G;"
*************************** 1. row ***************************
                  Host: localhost
                  User: nagios
              Password: *82802C50A7A5CDFDEA2653A1503FC4B8939C4047
           Select_priv: Y
           Insert_priv: N
           Update_priv: N
           Delete_priv: N
           Create_priv: N
             Drop_priv: N
           Reload_priv: N
         Shutdown_priv: N
          Process_priv: N
             File_priv: N
            Grant_priv: N
       References_priv: N
            Index_priv: N
            Alter_priv: N
          Show_db_priv: N
            Super_priv: Y
 Create_tmp_table_priv: N
      Lock_tables_priv: N
          Execute_priv: N
       Repl_slave_priv: N
      Repl_client_priv: Y
      Create_view_priv: N
        Show_view_priv: N
   Create_routine_priv: N
    Alter_routine_priv: N
      Create_user_priv: N
            Event_priv: N
          Trigger_priv: N
Create_tablespace_priv: N
              ssl_type: 
            ssl_cipher: 
           x509_issuer: 
          x509_subject: 
         max_questions: 0
           max_updates: 0
       max_connections: 0
  max_user_connections: 0
                plugin: 
 authentication_string: NULL
  • fix the "qw(...) as parentheses is deprecated" warning
    If the plug-in prints the warning below, wrap each qw(...) list in parentheses as shown in the diff.
# check_mysql_health --hostname localhost --username root --mode uptime
Use of qw(...) as parentheses is deprecated at check_mysql_health line 1247.
Use of qw(...) as parentheses is deprecated at check_mysql_health line 2596.
Use of qw(...) as parentheses is deprecated at check_mysql_health line 3473.
OK - database is up since 2677 minutes | uptime=160628s
# cp -p check_mysql_health{,.bak}
# vi check_mysql_health
...
# diff -u check_mysql_health.bak check_mysql_health
--- check_mysql_health.bak    2011-07-15 17:46:28.000000000 +0900
+++ check_mysql_health        2011-07-17 14:04:45.000000000 +0900
@@ -1244,7 +1244,7 @@
   my $message = shift;
   push(@{$self->{nagios}->{messages}->{$level}}, $message);
   # recalc current level
-  foreach my $llevel qw(CRITICAL WARNING UNKNOWN OK) {
+  foreach my $llevel (qw(CRITICAL WARNING UNKNOWN OK)) {
     if (scalar(@{$self->{nagios}->{messages}->{$ERRORS{$llevel}}})) {
       $self->{nagios_level} = $ERRORS{$llevel};
     }
@@ -2593,7 +2593,7 @@
   my $message = shift;
   push(@{$self->{nagios}->{messages}->{$level}}, $message);
   # recalc current level
-  foreach my $llevel qw(CRITICAL WARNING UNKNOWN OK) {
+  foreach my $llevel (qw(CRITICAL WARNING UNKNOWN OK)) {
     if (scalar(@{$self->{nagios}->{messages}->{$ERRORS{$llevel}}})) {
       $self->{nagios_level} = $ERRORS{$llevel};
     }
@@ -3469,8 +3469,8 @@
   $needs_restart = 1;
   # if the calling script has a path for shared libs and there is no --environment
   # parameter then the called script surely needs the variable too.
-  foreach my $important_env qw(LD_LIBRARY_PATH SHLIB_PATH 
-      ORACLE_HOME TNS_ADMIN ORA_NLS ORA_NLS33 ORA_NLS10) {
+  foreach my $important_env (qw(LD_LIBRARY_PATH SHLIB_PATH 
+      ORACLE_HOME TNS_ADMIN ORA_NLS ORA_NLS33 ORA_NLS10)) {
     if ($ENV{$important_env} && ! scalar(grep { /^$important_env=/ } 
         keys %{$commandline{environment}})) {
       $commandline{environment}->{$important_env} = $ENV{$important_env};
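
The two single-line loops in the diff above follow the same pattern, so that part of the edit can be sketched with sed. This is only an illustration under stated assumptions: the function name fix_qw_foreach is made up, it reads the Perl source on standard input, and it does not touch the third loop, whose qw(...) list spans two lines and still has to be edited by hand.

```shell
#!/bin/sh
# Sketch: rewrite single-line "foreach my $x qw(...) {" loops so that the
# qw() list is wrapped in parentheses. Reads Perl source on stdin and
# writes the result to stdout. Lines that are already fixed do not match.
# Note: older GNU sed releases spell the -E option as -r.
fix_qw_foreach() {
    sed -E 's/(foreach my \$[A-Za-z_]+) qw\(([^()]*)\) \{/\1 (qw(\2)) {/'
}
```

For example, `fix_qw_foreach < check_mysql_health.bak > check_mysql_health` would apply it to the backup copy made earlier.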

Verification

Assuming that MySQL replication is already running, I will check the slave lag, the I/O thread, and the SQL thread under the following three conditions.
Please see the official documentation for how to set up MySQL replication.
  1. Both I/O thread and SQL thread running
  2. I/O thread stopped, SQL thread running
  3. I/O thread running, SQL thread stopped
  • Both I/O thread and SQL thread running
# check_mysql_health --hostname localhost --username nagios --password nagios --warning 5 --critical 10 --mode slave-lag
OK - Slave is 0 seconds behind master | slave_lag=0;5;1
# check_mysql_health --hostname localhost --username nagios --password nagios --warning 5 --critical 10 --mode slave-io-running
OK - Slave io is running
# check_mysql_health --hostname localhost --username nagios --password nagios --warning 5 --critical 10 --mode slave-sql-running
OK - Slave sql is running
  • I/O thread stopped, SQL thread running
# mysql -uroot -p mysql -e "STOP SLAVE IO_THREAD;"
# check_mysql_health --hostname localhost --username nagios --password nagios --warning 5 --critical 10 --mode slave-lag
CRITICAL - unable to get slave lag, because io thread is not running
# check_mysql_health --hostname localhost --username nagios --password nagios --warning 5 --critical 10 --mode slave-io-running
CRITICAL - Slave io is not running
# check_mysql_health --hostname localhost --username nagios --password nagios --warning 5 --critical 10 --mode slave-sql-running
OK - Slave sql is running
  • I/O thread running, SQL thread stopped
# mysql -uroot -p mysql -e "START SLAVE IO_THREAD; STOP SLAVE SQL_THREAD;"
# check_mysql_health --hostname localhost --username nagios --password nagios --warning 5 --critical 10 --mode slave-lag
CRITICAL - unable to get replication inf
# check_mysql_health --hostname localhost --username nagios --password nagios --warning 5 --critical 10 --mode slave-io-running
OK - Slave io is running
# check_mysql_health --hostname localhost --username nagios --password nagios --warning 5 --critical 10 --mode slave-sql-running
CRITICAL - Slave sql is not running
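
The --warning 5 --critical 10 options used above set the thresholds for the slave lag in seconds. A minimal sketch of that kind of threshold mapping follows. The function name lag_state is hypothetical, and the comparison is an assumption: a value strictly greater than a threshold is treated as a breach, which may differ from the comparison check_mysql_health actually performs. A non-numeric lag (MySQL reports NULL when a thread is stopped) is mapped to CRITICAL, as in the plug-in output shown above.

```shell
#!/bin/sh
# Sketch: map a lag value in seconds to a Nagios state.
# Usage: lag_state <lag> <warning> <critical>
# Assumption: a value strictly greater than a threshold breaches it.
lag_state() {
    lag=$1 warn=$2 crit=$3
    case $lag in
        # Empty or non-numeric (for example NULL): lag cannot be determined.
        ''|*[!0-9]*) echo "CRITICAL - unable to get slave lag"; return 2 ;;
    esac
    if [ "$lag" -gt "$crit" ]; then
        echo "CRITICAL - Slave is $lag seconds behind master"; return 2
    elif [ "$lag" -gt "$warn" ]; then
        echo "WARNING - Slave is $lag seconds behind master"; return 1
    else
        echo "OK - Slave is $lag seconds behind master"
    fi
}
```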


Let's see how to monitor PostgreSQL streaming replication, next.

Sunday, May 13, 2012

Monitoring tool - monitor icinga/nagios with monit

Having shown how to install monit in the previous post, I will now explain how to monitor icinga/nagios with monit. Monit offers several kinds of tests, defined here. To check the icinga daemon I use the PPID test, which watches the parent process identification number (ppid) of a process for changes.

The service entry configurations are published on my github.

Configuration

  • set the pidfile in icinga.cfg (nagios.cfg)
    Although the directive is named "lock_file", the daemon writes its process ID to that file. The official Nagios documentation says:
    "This option specifies the location of the lock file that Nagios should create when it runs as a daemon (when started with the -d command line argument). This file contains the process id (PID) number of the running Nagios process."
# grep '^lock_file' icinga.cfg 
lock_file=/var/run/icinga.pid
# /etc/init.d/icinga reload
  • setup service entry statement of icinga
# cat > /etc/monit.d/icinga.conf << EOF
check process icinga
      with pidfile "/var/run/icinga.pid"
      start program = "/etc/init.d/icinga start"
      stop program = "/etc/init.d/icinga stop"
      if 3 restarts within 3 cycles then alert

EOF
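
The service entry above tells monit which pidfile identifies the icinga process. To make the idea concrete, here is a minimal sketch of a pidfile liveness test in shell. It is only an illustration, not monit's implementation, and the function name pidfile_alive is made up.

```shell
#!/bin/sh
# Sketch: succeed if the PID recorded in a pidfile belongs to a live process.
# Usage: pidfile_alive /var/run/icinga.pid
pidfile_alive() {
    pidfile=$1
    [ -r "$pidfile" ] || return 1
    # read fails on a file without a trailing newline but still sets pid.
    read -r pid < "$pidfile" || [ -n "$pid" ] || return 1
    case $pid in
        ''|*[!0-9]*) return 1 ;;   # not a plain number
    esac
    # kill -0 sends no signal; it only checks that the process exists
    # and that the caller is allowed to signal it.
    kill -0 "$pid" 2>/dev/null
}
```

Run as root, `pidfile_alive /var/run/icinga.pid && echo running` gives a quick manual check of the same file that monit watches.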

Start up

  • begin monitoring
# monit monitor icinga
# monit start icinga
  • see the summary
# monit summary | grep 'icinga'
Process 'icinga'                    Running
  • see the monit log file
# tail -f /var/log/monit/monit.log
[JST May 13 14:35:48] info     : 'icinga' monitor on user request
[JST May 13 14:35:48] info     : monit daemon at 13661 awakened
[JST May 13 14:35:48] info     : Awakened by User defined signal 1
[JST May 13 14:35:48] info     : 'icinga' monitor action done
[JST May 13 14:37:07] error    : monit: invalid argument -- staus  (-h will show valid arguments)
[JST May 13 14:37:39] info     : 'icinga' start on user request
[JST May 13 14:37:39] info     : monit daemon at 13661 awakened
[JST May 13 14:37:39] info     : Awakened by User defined signal 1
[JST May 13 14:37:39] info     : 'icinga' start action done

Verification 

  • verify that monit restarts the icinga daemon when its process is killed
# /etc/init.d/icinga status
icinga (pid  31107) is running...
# kill `pgrep icinga`
  • check in the log file that monit restarts icinga
# cat /var/log/monit/monit.log
[JST May 13 14:37:39] info     : 'icinga' start on user request
[JST May 13 14:37:39] info     : monit daemon at 13661 awakened
[JST May 13 14:37:39] info     : Awakened by User defined signal 1
[JST May 13 14:37:39] info     : 'icinga' start action done
[JST May 13 14:45:40] error    : 'icinga' process is not running
[JST May 13 14:45:40] info     : 'icinga' trying to restart
[JST May 13 14:45:40] info     : 'icinga' start: /etc/init.d/icinga
  • check icinga is running.
# /etc/init.d/icinga status
icinga (pid  21093) is running...

Configuration examples(ido2db, npcd)

  • setup pidfile of ido2db.cfg (ndo2db)
# grep '^lock_file' ido2db.cfg 
lock_file=/var/run/ido2db.pid
  • setup service entry statement of ido2db
# cat > /etc/monit.d/ido2db.conf << EOF
check process ido2db
      with pidfile "/var/run/ido2db.pid"
      start program = "/etc/init.d/ido2db start"
      stop program = "/etc/init.d/ido2db stop"
      if 3 restarts within 3 cycles then alert

EOF
  • begin monitoring
# monit monitor ido2db
# monit start ido2db
  • setup pidfile of npcd.cfg (pnp4nagios)
# grep '^pid_file' npcd.cfg 
pid_file=/var/run/npcd.pid
  • setup service entry statement of npcd
# cat > /etc/monit.d/npcd.conf << EOF
check process npcd
      with pidfile "/var/run/npcd.pid"
      start program = "/etc/init.d/npcd start"
      stop program = "/etc/init.d/npcd stop"
      if 3 restarts within 3 cycles then alert

EOF
  • begin monitoring
# monit monitor npcd
# monit start npcd


Monitoring tool - install monit

Icinga, nagios, and other monitoring tools can check whether a specified daemon or process is running. They can even check the icinga or nagios daemon itself, but what happens if that daemon stops?
Monit monitors a daemon by checking a specified process or port, and it can restart the daemon or even stop it.
"Monit is a free open source utility for managing and monitoring, processes, programs, files, directories and filesystems on a UNIX system. Monit conducts automatic maintenance and repair and can execute meaningful causal actions in error situations." MONIT Official
First I will show how to install monit, and then how to monitor icinga with it.
The configurations are published on my github, here.

Reference

Install monit

  •  setup rpmforge repository
# rpm -ivh http://pkgs.repoforge.org/rpmforge-release/rpmforge-release-0.5.2-2.el6.rf.x86_64.rpm
# sed -i 's/enabled = 1/enabled = 0/' /etc/yum.repos.d/rpmforge.repo
  • install monit
# yum -y --enablerepo=rpmforge install monit
  • verify installation
# monit -V
This is Monit version 5.3.2
Copyright (C) 2000-2011 Tildeslash Ltd. All Rights Reserved.

Configuration

  •  /etc/monitrc (monit control file)
    Please see the official documentation for further information about the monit control file.
    The set alert directive below makes monit send alerts for every event type except the ones listed (checksum through timestamp).
# cat > /etc/monitrc << 'EOF'
set daemon 120 with start delay 30
set logfile /var/log/monit/monit.log
## E-mail settings for alerts
set mailserver localhost
set alert username@domain not {
checksum
content
data
exec
gid
icmp
invalid
fsflags
permission
pid
ppid
size
timestamp
#action
#nonexist
#timeout
}
mail-format {
from: monit@$HOST
subject: Monit Alert -- $SERVICE $EVENT --
message:
Hostname:       $HOST
Service:        $SERVICE
Action:         $ACTION
Date/Time:      $DATE
Info:           $DESCRIPTION
}
set idfile /var/monit/id
set statefile /var/monit/state
set eventqueue
    basedir /var/monit  
    slots 100           
set httpd port 2812 and
    allow localhost 
    allow 192.168.0.0/24
    allow admin:monit      
include /etc/monit.d/*.conf
EOF
  • setup logging 
# mkdir /var/log/monit
# cat > /etc/logrotate.d/monit <<EOF
/var/log/monit/*.log {
  missingok
  notifempty
  rotate 12
  weekly
  compress
  postrotate
    /usr/bin/monit quit  
  endscript
}
EOF
  • setup include file (service entry statement)
    The following is an example of monitoring ntpd.
# cat > /etc/monit.d/ntpd.conf << EOF
check process ntpd
        with pidfile "/var/run/ntpd.pid"
        start program = "/etc/init.d/ntpd start"
        stop program = "/etc/init.d/ntpd stop"
        if 3 restarts within 3 cycles then alert

EOF
  •  verify syntax
# monit -t
Control file syntax OK
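
One thing to watch when a control file is generated from a here-document: the mail-format block contains variables such as $HOST, $SERVICE, and $EVENT that monit itself is meant to expand. With an unquoted delimiter the shell expands them first, usually to empty strings, while a quoted delimiter leaves them untouched. The sketch below shows the difference on a single line of the template. The two function names are made up for this example.

```shell
#!/bin/sh
# Sketch: the same template line written through an unquoted and a quoted
# here-document delimiter.

# Unquoted delimiter: the shell substitutes $SERVICE and $EVENT itself.
render_unquoted() {
    cat << EOF
subject: Monit Alert -- $SERVICE $EVENT --
EOF
}

# Quoted delimiter: the text is passed through literally.
render_quoted() {
    cat << 'EOF'
subject: Monit Alert -- $SERVICE $EVENT --
EOF
}
```

After writing the file, `grep 'subject:' /etc/monitrc` is a quick way to confirm that the variables survived.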

Start up

  • run monit from init
    Monit can be started from an init script, but I run it from init with respawn so that a Monit daemon is always running on the system.
# cat >> /etc/inittab <<EOF
mo:2345:respawn:/usr/bin/monit -Ic /etc/monitrc
EOF
  • re-examine /etc/inittab 
# telinit q
# tail -f /var/log/messages
May 13 12:34:35 ha-mgr02 init: Re-reading inittab
  • check monit running
# ps awuxc | grep 'monit'
root      1431  0.0  0.0  57432  1876 ?        Ssl  11:38   0:00 monit 
  • stop monit process and check that init begins monit
# kill `pgrep monit` ; ps cawux | grep 'monit'
root     13661  0.0  0.0  57432  1780 ?        Ssl  13:31   0:00 monit

  • show status
# monit status
Process 'ntpd'
  status                            Running
  monitoring status                 Monitored
  pid                               32307
  parent pid                        1
  uptime                            12d 17h 44m 
  children                          0
  memory kilobytes                  5040
  memory kilobytes total            5040
  memory percent                    0.2%
  memory percent total              0.2%
  cpu percent                       0.0%
  cpu percent total                 0.0%
  data collected                    Sun, 13 May 2012 12:34:35

System 'system_ha-mgr02.forschooner.net'
  status                            Running
  monitoring status                 Monitored
  load average                      [0.09] [0.20] [0.14]
  cpu                               1.6%us 3.2%sy 0.3%wa
  memory usage                      672540 kB [32.6%]
  swap usage                        120 kB [0.0%]
  data collected                    Sun, 13 May 2012 12:32:35
  • show summary 
# monit summary
The Monit daemon 5.3.2 uptime: 58m 

Process 'sshd'                      Running
Process 'ntpd'                      Running
System 'system_ha-mgr02.forschooner.net' Running

Start up from upstart

RHEL-6.x and CentOS-6.x use upstart instead of SysV init, so on those OSes monit has to be respawned by an upstart job rather than from /etc/inittab.
  • setup /etc/init/monit.conf
# monit_bin=$(which monit)
# cat > /etc/init/monit.conf << EOF
# monit respawn
description     "Monit"

start on runlevel [2345]
stop on runlevel [!2345]
 
respawn
exec $monit_bin -Ic /etc/monit.conf
EOF
  • show a list of the known jobs and instances
# initctl list
 rc stop/waiting
 tty (/dev/tty3) start/running, process 1249
 ...
 monit stop/waiting
 serial (hvc0) start/running, process 1239
 rcS-sulogin stop/waiting
  • begin monit
# initctl start monit
 monit start/running, process 6873
  • see the status of the job(monit)
# initctl status monit
 monit start/running, process 6873
  • stop monit process
# kill `pgrep monit`
  • check that upstart begins monit
# ps cawux | grep monit
 root      7140  0.0  0.1   7004  1840 ?        Ss   21:42   0:00 monit
  • see the log file that monit is respawning
# tail -1 /var/log/messages
 Oct 20 12:42:41 ip-10-171-47-212 init: monit main process ended, respawning

Verification

  • access to the monit service manager (http://IP Address:2812)

  • check ntp daemon starts if it stops 
# /etc/init.d/ntpd status
ntpd (pid  32307) is running...
# /etc/init.d/ntpd stop  
Shutting down ntpd:                                        [  OK  ]
  • see the log file that monit starts ntpd 
# cat /var/log/monit/monit.log
[JST May 13 12:52:24] error    : 'ntpd' process is not running
[JST May 13 12:52:24] info     : 'ntpd' trying to restart
[JST May 13 12:52:24] info     : 'ntpd' start: /etc/init.d/ntpd
  • check ntpd is running
# /etc/init.d/ntpd status
ntpd (pid  9475) is running...

Mail sample format

The following are examples of the alert mails that monit sends.
  • notifying that the daemon is stopped
<Subject>
Monit Alert -- ntpd Does not exist --
<Body>
Hostname:       ha-mgr02.forschooner.net
Service:        ntpd
Action:         restart
Date/Time:      Sun, 13 May 2012 12:52:24
Info:           process is not running 
  • notifying that the daemon starts
<Subject>
Monit Alert -- ntpd Action done --
<Body>
Hostname:       ha-mgr02.forschooner.net
Service:        ntpd
Action:         alert
Date/Time:      Sun, 13 May 2012 12:54:15
Info:           start action done 
  • notifying that the daemon is running
<Subject>
Monit Alert -- ntpd Exists --
<Body>
Hostname:       ha-mgr02.forschooner.net
Service:        ntpd
Action:         alert
Date/Time:      Sun, 13 May 2012 12:54:15
Info:           process is running with pid 9475









Friday, May 4, 2012

Key Value Store - monitor cassandra and multinode cluster

Having installed cassandra and created a multinode cluster, I will now show how to monitor a cassandra node and the multinode cluster with my own nagios plugin.

Monitor cassandra node(check_by_ssh+cassandra-cli)

There are several ways to monitor a cassandra node with Nagios or Icinga, such as JMX with check_jmx. They are effective, but they take some time to prepare. Using check_by_ssh with cassandra-cli is simpler and needs no libraries other than cassandra itself.
  • commands.cfg
define command{
        command_name    check_by_ssh
        command_line    $USER1$/check_by_ssh -l nagios -i /home/nagios/.ssh/id_rsa -H $HOSTADDRESS$ -p $ARG1$ -t $ARG2$ -C '$ARG3$'
}

  • services.cfg
define service{
      use                     generic-service
      host_name               cassandra
      service_description     Cassandra Node
      check_command           check_by_ssh!22!60!"/usr/local/apache-cassandra/bin/cassandra-cli -h localhost --jmxport 9160 -f /tmp/cassandra_load.txt"
}
  • set up the statement file
    Create the statement file on the cassandra node to be monitored.
    "show cluster name;" prints the cluster name.
# cat > /tmp/cassandra_load.txt << EOF
show cluster name;
EOF
  • plugin status when cassandra is running(service status is OK)
# su - nagios
$ check_by_ssh -l nagios -i /home/nagios/.ssh/id_rsa -H 192.168.213.91 -p 22 -t 10 -C "/usr/local/apache-cassandra/bin/cassandra-cli -h 192.168.213.91 --jmxport 9160 -f /tmp/cassandra_load.txt"
Connected to: "Test Cluster" on 192.168.213.91/9160
Test Cluster
  • plugin status when cassandra is stopped(service status is CRITICAL)
# su - nagios
$ check_by_ssh -l nagios -i /home/nagios/.ssh/id_rsa -H 192.168.213.91 -l root -p 22 -t 10 -C "/usr/local/apache-cassandra/bin/cassandra-cli -h 192.168.213.91 --jmxport 9160 -f /tmp/cassandra_load.txt"
Remote command execution failed: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection refused
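
The check above relies on whatever cassandra-cli prints and returns. If a more explicit Nagios status line is wanted, a small wrapper on the monitored node could classify the output itself. The sketch below is one way to do that, not part of the setup described here: the function name cli_output_state and its messages are hypothetical, and it assumes that a successful session always prints a "Connected to:" line, as in the output shown above.

```shell
#!/bin/sh
# Sketch: turn cassandra-cli output (read on stdin) into a Nagios-style
# status line and exit code. Assumes a successful session contains the
# text "Connected to:".
cli_output_state() {
    output=$(cat)
    first_line=$(printf '%s\n' "$output" | head -n 1)
    case $output in
        *'Connected to:'*)
            echo "OK - $first_line"
            return 0 ;;
        *)
            echo "CRITICAL - $first_line"
            return 2 ;;
    esac
}
```

It would be used as `cassandra-cli ... 2>&1 | cli_output_state` inside a wrapper script that check_by_ssh calls.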

Monitor multinode cluster(check_cassandra_cluster.sh)

The plugin is published on Nagios Exchange; please see the details there.
  • overview
    Checks whether the number of live nodes in the multinode cluster has dropped below the specified threshold.
    The thresholds are specified with the options -w <warning> and -c <critical>.
    Reports the number of live nodes, their status, and performance data.
  • software requirements
    cassandra(using nodetool command)
  • command help
# check_cassandra_cluster.sh -h
Usage: ./check_cassandra_cluster.sh -H <host> -P <port> -w <warning> -c <critical>

 -H <host> IP address or hostname of the cassandra node to connect, localhost by default.
 -P <port> JMX port, 7199 by default.
 -w <warning> alert warning state, if the number of live nodes is less than <warning>.
 -c <critical> alert critical state, if the number of live nodes is less than <critical>.
 -h show command option
 -V show command version 
  •  when service status is OK
# check_cassandra_cluster.sh -H 192.168.213.91 -P 7199 -w 1 -c 0
OK - Live Node:2 - 192.168.213.92:Up,Normal,65.2KB,86.95% 192.168.213.91:Up,Normal,73.76KB,13.05% | Load_192.168.213.92=65.2KB Owns_192.168.213.92=86.95% Load_192.168.213.91=60.14KB Owns_192.168.213.91=13.05%
  •  when service status is WARNING
# check_cassandra_cluster.sh -H 192.168.213.91 -P 7199 -w 2 -c 0
WARNING - Live Node:2 - 192.168.213.92:Up,Normal,65.2KB,86.95% 192.168.213.91:Up,Normal,73.76KB,13.05% | Load_192.168.213.92=65.2KB Owns_192.168.213.92=86.95% Load_192.168.213.91=60.14KB Owns_192.168.213.91=13.05% 
  •  when status is CRITICAL
# check_cassandra_cluster.sh -H 192.168.213.91 -P 7199 -w 3 -c 2
CRITICAL - Live Node:2 - 192.168.213.92:Up,Normal,65.2KB,86.95% 192.168.213.91:Up,Normal,73.76KB,13.05% | Load_192.168.213.92=65.2KB Owns_192.168.213.92=86.95% Load_192.168.213.91=60.14KB Owns_192.168.213.91=13.05%
  •  when the threshold of warning is less than the one of critical
# check_cassandra_cluster.sh -H 192.168.213.91 -P 7199 -w 3 -c 4
-w <warning> 3 must be less than -c <critical> 4.
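
The core of such a check can be sketched in a few lines of shell. This is not the published plugin, only a minimal illustration with a made-up function name, live_node_state. It reads nodetool ring output on standard input and assumes the column layout of cassandra-1.0 (Status in the fourth column). To reproduce the three example results above, it treats a live-node count equal to a threshold as a breach, which is an inference from those examples.

```shell
#!/bin/sh
# Sketch: count nodes reported as "Up" in `nodetool ring` output (stdin)
# and compare the count with the warning and critical thresholds.
# Usage: nodetool -h <host> ring | live_node_state <warning> <critical>
# Assumption: a count less than or equal to a threshold breaches it.
live_node_state() {
    warn=$1 crit=$2
    # Column 4 is "Status" in the cassandra-1.0 ring listing.
    live=$(awk '$4 == "Up" { n++ } END { print n + 0 }')
    if [ "$live" -le "$crit" ]; then
        echo "CRITICAL - Live Node:$live"; return 2
    elif [ "$live" -le "$warn" ]; then
        echo "WARNING - Live Node:$live"; return 1
    else
        echo "OK - Live Node:$live"
    fi
}
```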

Key Value Store - create cassandra multinode cluster

Having installed cassandra in the previous post, I will now explain how to create a cassandra multinode cluster.

Reference

Create Multinode cluster

  • cassandra nodes
_images/cassandra_node.png
  • Configuring Multinode Cluster 1st node (kvs01)
The default cassandra.yaml is set up for a single node, so the configuration has to be changed to create a multinode cluster.
# cd /usr/local/apache-cassandra/conf/
# vi cassandra.yaml
auto_bootstrap : false
- seeds: "192.168.213.91"
listen_address: 192.168.213.91
rpc_address: 192.168.213.91
  • difference between the unrevised cassandra.yaml  and revised one
# diff -u cassandra.yaml.bak cassandra.yaml
--- cassandra.yaml.bak 2012-02-22 23:21:44.000000000 +0900
+++ cassandra.yaml     2012-05-04 07:51:31.000000000 +0900
@@ -8,6 +8,7 @@
 # The name of the cluster. This is mainly used to prevent machines in
 # one logical cluster from joining another.
 cluster_name: 'Test Cluster'
+auto_bootstrap : false

 # You should always specify InitialToken when setting up a production
 # cluster for the first time, and often when adding capacity later.
@@ -95,7 +96,7 @@
       parameters:
           # seeds is actually a comma-delimited list of addresses.
           # Ex: "<ip1>,<ip2>,<ip3>"
-          - seeds: "127.0.0.1"
+          - seeds: "192.168.213.91"

 # emergency pressure valve: each time heap usage after a full (CMS)
 # garbage collection is above this fraction of the max, Cassandra will
@@ -178,7 +179,7 @@
 # address associated with the hostname (it might not be).
 #
 # Setting this to 0.0.0.0 is always wrong.
-listen_address: localhost
+listen_address: 192.168.213.91

 # Address to broadcast to other Cassandra nodes
 # Leaving this blank will set it to the same value as listen_address
@@ -190,7 +191,7 @@
 #
 # Leaving this blank has the same effect it does for ListenAddress,
 # (i.e. it will be based on the configured hostname of the node).
-rpc_address: localhost
+rpc_address: 192.168.213.91
 # port for Thrift to listen for clients on
 rpc_port: 9160
  • restart daemon
# pgrep -f cassandra | xargs kill -9
# /usr/local/apache-cassandra/bin/cassandra
  • Configuring Multinode Cluster other node (kvs02,kvs03)
listen_address and rpc_address are replaced with each server's own address.
There is no need to enable auto_bootstrap explicitly, as it is enabled by default in cassandra-1.x.
# cd /usr/local/apache-cassandra/conf/
# vi cassandra.yaml
- seeds: "192.168.213.91"
listen_address: 192.168.213.92
rpc_address: 192.168.213.92
  • difference between the unrevised cassandra.yaml  and revised one
# diff -u cassandra.yaml.bak cassandra.yaml
--- cassandra.yaml.bak 2012-03-23 04:00:43.000000000 +0900
+++ cassandra.yaml     2012-05-04 08:44:14.000000000 +0900
@@ -8,6 +8,7 @@
 # The name of the cluster. This is mainly used to prevent machines in
 # one logical cluster from joining another.
 cluster_name: 'Test Cluster'
+auto_bootstrap: true

 # You should always specify InitialToken when setting up a production
 # cluster for the first time, and often when adding capacity later.
@@ -95,7 +96,7 @@
       parameters:
           # seeds is actually a comma-delimited list of addresses.
           # Ex: "<ip1>,<ip2>,<ip3>"
-          - seeds: "localhost"
+          - seeds: "192.168.213.91"

 # emergency pressure valve: each time heap usage after a full (CMS)
 # garbage collection is above this fraction of the max, Cassandra will
@@ -178,7 +179,7 @@
 # address associated with the hostname (it might not be).
 #
 # Setting this to 0.0.0.0 is always wrong.
-listen_address: localhost
+listen_address: 192.168.213.92

 # Address to broadcast to other Cassandra nodes
 # Leaving this blank will set it to the same value as listen_address
@@ -190,7 +191,7 @@
 #
 # Leaving this blank has the same effect it does for ListenAddress,
 # (i.e. it will be based on the configured hostname of the node).
-rpc_address: localhost
+rpc_address: 192.168.213.92
 # port for Thrift to listen for clients on
 rpc_port: 9160
  • restart daemon
# pgrep -f cassandra | xargs kill -9
# /usr/local/apache-cassandra/bin/cassandra
  • Verify ring status
# nodetool -h localhost ring
Address         DC          Rack        Status State   Load            Owns    Token
                                                                               100438156989107092060814573762535799562
192.168.213.92  datacenter1 rack1       Up     Normal  53.6 KB         93.47%  89332387546649365392870509741689618961
192.168.213.93  datacenter1 rack1       Up     Normal  49.19 KB        3.26%   94885272267878228726842541752112709261
192.168.213.91  datacenter1 rack1       Up     Normal  55.71 KB        3.26%   100438156989107092060814573762535
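
The address changes made to cassandra.yaml on each node can also be scripted instead of edited by hand. The following is a minimal sketch, assuming the stock single-node file in which listen_address and rpc_address start at the beginning of a line and the seed list is a single "- seeds:" entry. The function name set_node_addresses is made up, and auto_bootstrap is deliberately left alone.

```shell
#!/bin/sh
# Sketch: apply the per-node address settings to cassandra.yaml, keeping
# the original as <file>.bak.
# Usage: set_node_addresses <yaml file> <node ip> <seed ip>
set_node_addresses() {
    file=$1 node_ip=$2 seed_ip=$3
    cp -p "$file" "$file.bak" || return 1
    # Rewrite the two address lines and the seed list; everything else
    # is copied through unchanged.
    sed -e "s/^listen_address:.*/listen_address: $node_ip/" \
        -e "s/^rpc_address:.*/rpc_address: $node_ip/" \
        -e "s/^\([[:space:]]*- seeds:\).*/\1 \"$seed_ip\"/" \
        "$file.bak" > "$file"
}
```

For the second node in this post the call would be `set_node_addresses /usr/local/apache-cassandra/conf/cassandra.yaml 192.168.213.92 192.168.213.91`, followed by `diff -u cassandra.yaml.bak cassandra.yaml` to review the result.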
Next, I will cover the monitoring phase.

Key Value Store - install cassandra

I recently got an opportunity to monitor a Key-Value Store, cassandra. Though I know how to monitor an RDBMS such as MySQL or PostgreSQL, I know little about cassandra. I'm going to install cassandra and show how to monitor it with cassandra-cli. In addition, as I need to monitor a cassandra multinode cluster, I'll show how to create one and monitor it with a nagios plugin that I wrote.

Installation

  • install java(JDK)
    Get the binary file here and transfer it to the server.
# sh jdk-6u31-linux-x64-rpm.bin
  • install cassandra
# wget http://ftp.jaist.ac.jp/pub/apache//cassandra/1.0.8/apache-cassandra-1.0.8-bin.tar.gz
# tar -C /usr/local/ -zxf apache-cassandra-1.0.8-bin.tar.gz
# ln -s /usr/local/apache-cassandra-1.0.8 /usr/local/apache-cassandra
  • setup PATH
# vi /etc/profile
...
if [ "$EUID" = "0" ]; then
      pathmunge /usr/local/apache-cassandra/bin       # add this line
...
fi
# . /etc/profile
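
For reference, pathmunge is the shell function that the stock /etc/profile on RHEL/CentOS defines to add a directory to PATH without duplicating it. The sketch below shows an equivalent helper under a made-up name, path_prepend. It is an illustration of the idea rather than a copy of the distribution's code.

```shell
#!/bin/sh
# Sketch of a pathmunge-style helper (hypothetical name): prepend a
# directory to PATH only if it is not already present.
path_prepend() {
    case ":$PATH:" in
        *":$1:"*) ;;                # already present, leave PATH alone
        *) PATH="$1:$PATH" ;;
    esac
}
```

Calling `path_prepend /usr/local/apache-cassandra/bin` twice leaves only one copy of the directory in PATH.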

Verification

  • start the cassandra daemon in the background
# cassandra
  • connect to cassandra using cassandra-cli
# cassandra-cli -h 127.0.0.1 -p 9160
Connected to: "Test Cluster" on 127.0.0.1/9160
Welcome to Cassandra CLI version 1.0.8

Type 'help;' or '?' for help.
Type 'quit;' or 'exit;' to quit.
[default@unknown]
  • verify open port
# netstat -lnpt | grep java
tcp        0      0 127.0.0.1:9160              0.0.0.0:*                   LISTEN      2124/java
tcp        0      0 0.0.0.0:34742               0.0.0.0:*                   LISTEN      2124/java
tcp        0      0 127.0.0.1:7000              0.0.0.0:*                   LISTEN      2124/java
tcp        0      0 0.0.0.0:47484               0.0.0.0:*                   LISTEN      2124/java
tcp        0      0 0.0.0.0:7199                0.0.0.0:*                   LISTEN      2124/java
  • Create keyspace
[default@unknown] create keyspace DEMO;
2bbaee00-7442-11e1-0000-242d50cf1fbc
Waiting for schema agreement...
... schemas agree across the cluster

[default@unknown] use DEMO;
Authenticated to keyspace: DEMO

[default@DEMO] create column family Users;
327382c0-7442-11e1-0000-242d50cf1fbc
Waiting for schema agreement...
... schemas agree across the cluster

[default@DEMO] set Users[utf8('1234')][utf8('name')] = utf8('scott');
Value inserted.
Elapsed time: 33 msec(s).

[default@DEMO] set Users[utf8('1234')][utf8('password')] = utf8('tiger');
Value inserted.
Elapsed time: 4 msec(s).

[default@DEMO] get Users[utf8('1234')];
=> (column=6e616d65, value=scott, timestamp=1332436350273000)
=> (column=70617373776f7264, value=tiger, timestamp=1332436354369000)
Returned 2 results.
Elapsed time: 36 msec(s).

[default@DEMO] assume Users keys as utf8;
Assumption for column family 'Users' added successfully.

[default@DEMO] assume Users comparator as utf8;
Assumption for column family 'Users' added successfully.

[default@DEMO] assume Users validator as utf8;
Assumption for column family 'Users' added successfully.

[default@DEMO] get Users['1234'];
=> (column=name, value=scott, timestamp=1332436350273000)
=> (column=password, value=tiger, timestamp=1332436354369000)
Returned 2 results.
Elapsed time: 2 msec(s).

Let's create a cassandra multinode cluster next time.

Monitoring tool - init script for icinga, ido2db(idoutils), and npcd(pnp4nagios)

Now that icinga, icinga-web, and pnp4nagios are installed, init scripts are needed to start and stop the daemons. Each source distribution includes one, of course, but I prefer the typical format used by RPM packages. So I modified init scripts taken from RPM packages.

I will introduce each init script and verify how it works.
They are published on my github.
  • daemon and init script
Daemon      Based on                       Init script
Icinga      Nagios RPM package             /etc/init.d/icinga
IDOUtils    NDOUtils RPM package           /etc/init.d/ido2db
PNP4nagios  Nagios RPM package (loosely)   /etc/init.d/npcd

icinga

  • create init script based on nagios RPM package
    The patch file is stored here.
# yumdownloader --enablerepo=rpmforge nagios
# mkdir work
# cd work
# rpm2cpio ../nagios-3.2.3-3.el5.rf.x86_64.rpm | cpio -id ./etc/rc.d/init.d/nagios
# cp etc/rc.d/init.d/nagios ./icinga
# cp icinga{,_diff}
...
# diff -c icinga icinga_diff > icinga.patch
# patch -p0 < icinga.patch
# cp icinga /etc/init.d/icinga
  • start daemon
# /etc/init.d/icinga start
Starting icinga:                                           [  OK  ]
  • stop daemon
# /etc/init.d/icinga stop
Stopping icinga:                                           [  OK  ]
  • restart daemon
# /etc/init.d/icinga restart
Stopping icinga:                                           [  OK  ]
Starting icinga:                                           [  OK  ]
  • condrestart daemon
# /etc/init.d/icinga condrestart
Stopping icinga:                                           [  OK  ]
Starting icinga:                                           [  OK  ]
  • reload daemon
# /etc/init.d/icinga reload
icinga (pid  17359) is running...
Reloading icinga:                                          [  OK  ]
  • check if daemon is running
# /etc/init.d/icinga status
icinga (pid  17359) is running...
  • difference between nagios (RPM package) and icinga
# diff -u nagios icinga_diff
--- nagios     2012-05-01 23:34:15.000000000 +0900
+++ icinga_diff        2012-05-03 20:52:17.000000000 +0900
@@ -1,36 +1,38 @@
 #!/bin/sh
 # $Id$
-# Nagios      Startup script for the Nagios monitoring daemon
+# Icinga      Startup script for the Nagios monitoring daemon
 #
 # chkconfig:  - 85 15
-# description:        Nagios is a service monitoring system
-# processname: nagios
-# config: /etc/nagios/nagios.cfg
-# pidfile: /var/nagios/nagios.pid
+# description:        Icinga is a service monitoring system
+# processname: icinga
+# config: /usr/local/icinga/etc/icinga.cfg
+# pidfile: /var/run/icinga.pid
 #
 ### BEGIN INIT INFO
-# Provides:           nagios
+# Provides:           icinga
 # Required-Start:     $local_fs $syslog $network
 # Required-Stop:      $local_fs $syslog $network
-# Short-Description:    start and stop Nagios monitoring server
-# Description:                Nagios is is a service monitoring system
+# Short-Description:    start and stop Icinga monitoring server
+# Description:                Icinga is is a service monitoring system
 ### END INIT INFO

 # Source function library.
 . /etc/rc.d/init.d/functions

-prefix="/usr"
-exec_prefix="/usr"
-exec="/usr/bin/nagios"
-prog="nagios"
-config="/etc/nagios/nagios.cfg"
-pidfile="/var/nagios/nagios.pid"
-user="nagios"
+user="icinga"
+prog="icinga"
+prefix="/usr/local/$prog"
+exec_prefix="${prefix}"
+exec="${prefix}/bin/$prog"
+config="${prefix}/etc/$prog.cfg"
+piddir="/var/run"
+lockdir="/var/lock/subsys"
+pidfile="$piddir/$prog.pid"
+lockfile="${lockdir}/$prog"

+[ -d "$piddir" ] || mkdir -p "$piddir" && chown $prog:$prog $piddir
 [ -e /etc/sysconfig/$prog ] && . /etc/sysconfig/$prog

-lockfile=/var/lock/subsys/$prog
-
 start() {
     [ -x $exec ] || exit 5
     [ -f $config ] || exit 6
@@ -47,7 +49,7 @@
     killproc -d 10 $exec
     retval=$?
     echo
-    [ $retval -eq 0 ] && rm -f $lockfile
+    [ $retval -eq 0 ] && rm -f $lockfile $pidfile
     return $retval
 }

@@ -60,7 +62,7 @@
 reload() {
     echo -n $"Reloading $prog: "
     killproc $exec -HUP
-    RETVAL=$?
+    retval=$?
     echo
 }

@@ -70,8 +72,8 @@

 check_config() {
         $nice runuser -s /bin/bash - $user -c "$corelimit >/dev/null 2>&1 ; $exec -v $config > /dev/null 2>&1"
-        RETVAL=$?
-        if [ $RETVAL -ne 0 ] ; then
+        retval=$?
+        if [ $retval -ne 0 ] ; then
                 echo -n $"Configuration validation failed"
                 failure
                 echo
  • about the pidfile and the lockfile path
    icinga.cfg (like nagios.cfg) uses the lock_file directive for the path of the pidfile.
    I'm not sure why it is named that way, but I think the pidfile and the lockfile should be separate.
    I defined the paths of the pidfile and the lockfile in the init script and icinga.cfg.
# grep '^lock_file' icinga.cfg
lock_file=/var/run/icinga.pid
# egrep '^(pid|lock)' /etc/init.d/icinga 
piddir="/var/run"
lockdir="/var/lock/subsys"
pidfile="$piddir/$prog.pid"
lockfile="${lockdir}/$prog"
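
Because the pidfile path now lives in two places (lock_file in the config and pidfile in the init script), it may be worth checking that they agree. The following is a minimal sketch under that assumption. The helper names cfg_lock_file and pidfile_matches are made up for illustration, and the demo runs against a throwaway file rather than the real icinga.cfg.

```shell
# Print the value of the lock_file directive from an Icinga-style config file.
# (hypothetical helper; only uncommented "lock_file=" lines are considered)
cfg_lock_file() {
    sed -n 's/^lock_file=//p' "$1"
}

# Return 0 when the config's lock_file equals the pidfile the init script uses.
pidfile_matches() {
    cfg=$1
    expected=$2
    [ "$(cfg_lock_file "$cfg")" = "$expected" ]
}

# Demo against a temporary config so nothing on the system is touched
tmpcfg=$(mktemp)
printf 'lock_file=/var/run/icinga.pid\n' > "$tmpcfg"
if pidfile_matches "$tmpcfg" /var/run/icinga.pid; then
    echo "pidfile settings agree"
else
    echo "pidfile settings differ"
fi
rm -f "$tmpcfg"
```

On a real host the first argument would be the path to icinga.cfg and the second the pidfile value from the init script.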

ido2db

  • create init script for ido2db based on the ndoutils RPM package
    The patch file is stored here.
# yumdownloader --enablerepo=rpmforge ndoutils
# mkdir work
# cd work
# rpm2cpio ../ndoutils-1.4-0.beta7.3.el5.rf.x86_64.rpm | cpio -id ./etc/init.d/ndoutils
# cp etc/init.d/ndoutils ./ido2db
# cp ido2db{,_diff}
# vi ido2db_diff
...
# diff -c ido2db ido2db_diff > ido2db.patch
# patch -p0 < ido2db.patch
# cp ido2db /etc/init.d/ido2db
  • start daemon
# /etc/init.d/ido2db start
Starting ido2db:                                           [  OK  ]
  • stop daemon
# /etc/init.d/ido2db stop
Stopping ido2db:                                           [  OK  ]
  • restart daemon
# /etc/init.d/ido2db restart
Stopping ido2db:                                           [  OK  ]
Starting ido2db:                                           [  OK  ]
  • condrestart daemon
# /etc/init.d/ido2db condrestart
Stopping ido2db:                                           [  OK  ]
Starting ido2db:                                           [  OK  ]
  • difference between ndoutils (RPM package) and ido2db
# diff -u ndoutils ido2db_diff

@@ -1,37 +1,42 @@
 #!/bin/sh
-# Startup script for ndo-daemon
+# Startup script for ido2db-daemon
 #
 # chkconfig: 2345 95 05
-# description: Nagios Database Objects daemon
+# description: Icinga Database Objects daemon

 # Source function library.
 . /etc/rc.d/init.d/functions

-
-BINARY=ndo2db-3x
-DAEMON=/usr/sbin/$BINARY
-CONFIG=/etc/nagios/ndo2db.cfg
-
-[ -f $DAEMON ] || exit 0
-
-prog="ndo2db"
+prog=ido2db
+user=icinga
+prefix=/usr/local/icinga
+exec=$prefix/bin/$prog
+config=$prefix/etc/ido2db.cfg
+piddir="/var/run"
+lockdir="/var/lock/subsys"
+pidfile="$piddir/$prog.pid"
+lockfile="${lockdir}/$prog"

 start() {
+    [ -x $exec ] || exit 5
+    [ -f $config ] || exit 6
     echo -n $"Starting $prog: "
-    daemon --user nagios $DAEMON -c $CONFIG
-    RETVAL=$?
+    daemon --user $user $exec -c $config
+    retval=$?
+    [ $retval -eq 0 ] && touch $lockfile
     echo
-    return $RETVAL
+    return $retval
 }

 stop() {
-    if test "x`pidof $BINARY`" != x; then
+    if test "x`pidof $prog`" != x; then
         echo -n $"Stopping $prog: "
-        killproc ndo2db-3x
+        killproc $prog
         echo
     fi
-    RETVAL=$?
-    return $RETVAL
+    retval=$?
+    [ $retval -eq 0 ] && rm -f $lockfile $pidfile
+    return $retval
 }

 case "$1" in
@@ -44,14 +49,14 @@
             ;;

         status)
-            status $BINARY
+            status $prog
             ;;
         restart)
             stop
             start
             ;;
         condrestart)
-            if test "x`pidof $BINARY`" != x; then
+            if test "x`pidof $prog`" != x; then
                 stop
                 start
             fi
@@ -63,5 +68,5 @@

 esac

-exit $RETVAL
+exit $retval
  • about the pidfile and the lockfile path
    ido2db.cfg (like icinga.cfg and nagios.cfg) uses the lock_file directive for the path of the pidfile.
    I'm not sure why it is named that way, but I think the pidfile and the lockfile should be separate.
    I defined the paths of the pidfile and the lockfile in the init script and ido2db.cfg.
# grep '^lock_file' ido2db.cfg
lock_file=/var/run/ido2db.pid
# egrep '^(pid|lock)' /etc/init.d/ido2db
piddir="/var/run"
lockdir="/var/lock/subsys"
pidfile="$piddir/$prog.pid"
lockfile="${lockdir}/$prog"
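
One detail in the ido2db diff above: stop() reads $? after the closing fi, so retval holds the status of the whole if block (in practice the trailing echo) rather than that of killproc. The sketch below illustrates the difference. fake_killproc is a stand-in made up for this example that always fails with status 3; it is not the real killproc from /etc/rc.d/init.d/functions.

```shell
# Stand-in for killproc that always fails with status 3 (hypothetical).
fake_killproc() { return 3; }

# Pattern used in the diff: $? is read after "fi", so the failure is lost.
stop_status_after_fi() {
    if true; then
        fake_killproc
        echo "stopping" >/dev/null
    fi
    retval=$?
    echo "$retval"
}

# Alternative: save the status immediately after the command of interest.
stop_status_saved() {
    retval=0
    if true; then
        fake_killproc
        retval=$?
        echo "stopping" >/dev/null
    fi
    echo "$retval"
}

echo "read after fi:  $(stop_status_after_fi)"
echo "saved directly: $(stop_status_saved)"
```

The first function reports 0 even though the stand-in failed, while the second reports 3. Saving the status right after killproc would let the lockfile cleanup depend on the real result.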


npcd

  • create init script for npcd based on nagios RPM package
    The patch file is stored here.
# yumdownloader --enablerepo=rpmforge nagios
# mkdir work
# cd work
# rpm2cpio ../nagios-3.2.3-3.el5.rf.x86_64.rpm | cpio -id ./etc/rc.d/init.d/nagios
# cp etc/rc.d/init.d/nagios ./npcd
# cp npcd{,_diff}
...
# diff -c npcd npcd_diff > npcd.patch
# patch -p0 < npcd.patch
# cp npcd /etc/init.d/npcd
  • start daemon
# /etc/init.d/npcd start
npcd is stopped
Starting npcd:                                             [  OK  ]
  • stop daemon
# /etc/init.d/npcd stop
npcd (pid  14128) is running...
Stopping npcd:                                             [  OK  ]
  • restart daemon
# /etc/init.d/npcd restart
Stopping npcd:                                             [  OK  ]
Starting npcd:                                             [  OK  ]
  • condrestart daemon
# /etc/init.d/npcd condrestart
npcd (pid  14216) is running...
Stopping npcd:                                             [  OK  ]
Starting npcd:                                             [  OK  ]
  • reload daemon
# /etc/init.d/npcd reload
npcd (pid  14233) is running...
Reloading npcd:                                            [  OK  ]
  • check if daemon is running
# /etc/init.d/npcd status
 npcd (pid 14233) is running...
  • difference between nagios (RPM package) and npcd
# diff -u npcd npcd_diff
--- npcd       2012-05-04 10:47:11.000000000 +0900
+++ npcd_diff  2012-05-03 22:45:28.000000000 +0900
@@ -1,41 +1,40 @@
 #!/bin/sh
-# $Id$
-# Nagios      Startup script for the Nagios monitoring daemon
-#
-# chkconfig:  - 85 15
-# description:        Nagios is a service monitoring system
-# processname: nagios
-# config: /etc/nagios/nagios.cfg
-# pidfile: /var/nagios/nagios.pid
 #
 ### BEGIN INIT INFO
-# Provides:           nagios
-# Required-Start:     $local_fs $syslog $network
-# Required-Stop:      $local_fs $syslog $network
-# Short-Description:    start and stop Nagios monitoring server
-# Description:                Nagios is is a service monitoring system
+# Short-Description: pnp4nagios NPCD Daemon Version 0.6.16
+# Description: Nagios Performance Data C Daemon
+# chkconfig: 345 99 01
+# processname: npcd
+# config: /usr/local/pnp4nagios/etc/npcd.cfg
+# pidfile: /var/run/npcd.pid
+# Provides:          npcd
+# Required-Start:
+# Required-Stop:
+# Default-Start:     2 3 4 5
+# Default-Stop:      0 1 6
 ### END INIT INFO

 # Source function library.
 . /etc/rc.d/init.d/functions

-prefix="/usr"
-exec_prefix="/usr"
-exec="/usr/bin/nagios"
-prog="nagios"
-config="/etc/nagios/nagios.cfg"
-pidfile="/var/nagios/nagios.pid"
-user="nagios"
+user="icinga"
+prog="npcd"
+prefix="/usr/local/pnp4nagios"
+exec_prefix="${prefix}"
+exec="${prefix}/bin/$prog"
+config="${prefix}/etc/$prog.cfg"
+piddir="/var/run"
+lockdir="/var/lock/subsys"
+pidfile="/var/run/$prog.pid"
+lockfile="${lockdir}/$prog"

 [ -e /etc/sysconfig/$prog ] && . /etc/sysconfig/$prog

-lockfile=/var/lock/subsys/$prog
-
 start() {
     [ -x $exec ] || exit 5
     [ -f $config ] || exit 6
     echo -n $"Starting $prog: "
-    daemon --user=$user $exec -d $config
+    daemon --user=$user $exec -d -f $config
     retval=$?
     echo
     [ $retval -eq 0 ] && touch $lockfile
@@ -47,7 +46,7 @@
     killproc -d 10 $exec
     retval=$?
     echo
-    [ $retval -eq 0 ] && rm -f $lockfile
+    [ $retval -eq 0 ] && rm -f $lockfile $pidfile
     return $retval
 }

@@ -60,31 +59,14 @@
 reload() {
     echo -n $"Reloading $prog: "
     killproc $exec -HUP
-    RETVAL=$?
+    retval=$?
     echo
 }

-force_reload() {
-    restart
-}
-
-check_config() {
-        $nice runuser -s /bin/bash - $user -c "$corelimit >/dev/null 2>&1 ; $exec -v $config > /dev/null 2>&1"
-        RETVAL=$?
-        if [ $RETVAL -ne 0 ] ; then
-                echo -n $"Configuration validation failed"
-                failure
-                echo
-                exit 1
-
-        fi
-}
-

 case "$1" in
     start)
         status $prog && exit 0
-      check_config
         $1
         ;;
     stop)
@@ -92,33 +74,21 @@
         $1
         ;;
     restart)
-      check_config
         $1
         ;;
     reload)
         status $prog || exit 7
-      check_config
         $1
         ;;
-    force-reload)
-      check_config
-        force_reload
-        ;;
     status)
         status $prog
         ;;
-    condrestart|try-restart)
+    condrestart)
         status $prog|| exit 0
-      check_config
         restart
         ;;
-    configtest)
-        echo -n  $"Checking config for $prog: "
-        check_config && success
-        echo
-      ;;
     *)
-        echo $"Usage: $0 {start|stop|status|restart|condrestart|try-restart|reload|force-reload|configtest}"
+        echo $"Usage: $0 {start|stop|status|restart|condrestart|reload}"
         exit 2
 esac
 exit $?
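
The manual start, stop, and status checks shown for each daemon could also be scripted. The following is a rough sketch, not something from my repository: run_actions is a hypothetical helper that calls a command once per action and reports the exit status, and demo_init is a stand-in that mimics the exit codes of the init scripts (0 on success, 2 for an unknown action).

```shell
# Run a command once per action and report each exit status.
# run_actions is a hypothetical helper written for this example.
run_actions() {
    script=$1
    shift
    for action in "$@"; do
        if "$script" "$action" >/dev/null 2>&1; then
            echo "$action: ok"
        else
            # In the else branch, $? is still the status of the failed command
            echo "$action: failed ($?)"
        fi
    done
}

# Stand-in for an init script: known actions succeed, anything else exits 2.
demo_init() {
    case "$1" in
        start|stop|status|restart|condrestart|reload) return 0 ;;
        *) return 2 ;;
    esac
}

run_actions demo_init start status bogus
# On a real host this could be, for example:
#   run_actions /etc/init.d/npcd status reload condrestart
```

Keeping the stand-in separate makes it possible to try the helper without touching a running daemon.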


I will list the other configurations for icinga, idoutils, and pnp4nagios next time.