Nagios - Host Down

Started by Mark, September 12, 2011, 01:25:47 PM

Mark

This is really bugging me (and hopefully this is the right board for it).

I have a server with a handful of IP addresses set up in Nagios, with each IP address as an individual "host".  One IP address went unused for a while, so I disabled the service and host checks on it.  Now I am using it again (~45 days later). Service checks are enabled, all checks are enabled on the host, every other IP address on this server is reported as up, and pinging this IP gives the same results as pinging any of the other IP addresses....

Nagios still reports this host as being down.

Manual ping results:
root@Nagios:/usr/local/nagios# ping 1.2.3.4 -c 5
PING 1.2.3.4 (1.2.3.4) 56(84) bytes of data.
From 10.x.x.x: icmp_seq=1 Redirect Network(New nexthop: 10.x.x.y)
64 bytes from 1.2.3.4: icmp_seq=1 ttl=56 time=62.3 ms
64 bytes from 1.2.3.4: icmp_seq=2 ttl=56 time=60.4 ms
64 bytes from 1.2.3.4: icmp_seq=3 ttl=56 time=79.0 ms
64 bytes from 1.2.3.4: icmp_seq=4 ttl=56 time=60.4 ms
64 bytes from 1.2.3.4: icmp_seq=5 ttl=56 time=60.3 ms

--- 1.2.3.4 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4005ms
rtt min/avg/max/mdev = 60.397/64.528/79.034/7.291 ms
root@Nagios:/usr/local/nagios#


Same results if I ping a different IP on that same host.
Mark Piontek, MBA
Director of Information Systems
BS in Information Systems Security

Bloody Jack Kidd

the problem we sometimes have is a bit different: the OS gets an ICMP redirect when a router goes down, then doesn't update the route table once the route is available again.  Hence a blocking outage remains active in Nagios even though it's no longer the case.
Sysadmin - Parallel42

Mark

OK, so I should check the routing table on the Nagios box? Or how do you usually recover from this?

Bloody Jack Kidd

yeah, perhaps something like netstat -r or a route query to see if that's poisoned (but I think that's my issue, not yours per se, since you can ping the device from the Nagios box)

my first check might be

# /usr/local/etc/rc.d/nagios reload
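
For the stale-redirect case described above, a rough sketch of a route check on a Linux box with iproute2 might look like this. The function names are made up for illustration, and the assumption that a redirected route shows a "redirected" flag in the `ip route get` output may not hold on every kernel. On a BSD box, `route -n get 1.2.3.4` is the closer equivalent.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not part of Nagios.

# Extract the gateway (the address after "via") from `ip route get` output on stdin.
# Prints nothing if the destination is directly connected (no "via" field).
parse_next_hop() {
    awk '{ for (i = 1; i < NF; i++) if ($i == "via") { print $(i + 1); exit } }'
}

# Show the next hop the kernel currently uses for a destination and note
# whether the route appears to have been learned from an ICMP redirect.
check_route() {
    out=$(ip route get "$1") || return 1
    printf 'next hop: %s\n' "$(printf '%s\n' "$out" | parse_next_hop)"
    case $out in
        *redirected*) echo "route looks like it came from an ICMP redirect" ;;
    esac
}

# Usage (as root on the Nagios box):
#   check_route 1.2.3.4
#   ip route flush cache    # should drop cached/redirected entries; then re-test
```

If the next hop printed here is the router that went away, flushing the cache and re-running the check is a cheap first step before touching Nagios itself.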

Mark

Also, running check_ping manually with exactly the same syntax that commands.cfg uses returns:

./check_ping -H 1.2.3.4  -w 3000.0,80% -c 5000.0,100% -p 5
PING OK - Packet loss = 0%, RTA = 60.45 ms


I don't have any route issues as far as I can tell either.

Very odd.  I changed the host's IP and reloaded. Now the service check should fail, because the port in question is not open on that IP, and the host should show as up... but the service still shows up and the host still shows down.

Removing and re-adding is next.
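
Before removing and re-adding a host, it can help to confirm which address Nagios actually loaded for it, since a manual check_ping only tests the address typed on the command line. A minimal sketch, assuming the object cache is in the usual `define host { ... }` format; the function name and the host name `web2` are hypothetical, and `/usr/local/nagios/var/objects.cache` is only the default location for source installs (the real path is whatever `object_cache_file` points to in nagios.cfg).

```shell
#!/bin/sh
# Hypothetical helper for illustration; not part of Nagios.

# Print the address Nagios loaded for one host, read from an
# objects.cache-style file.
# Usage: loaded_host_address HOST_NAME CACHE_FILE
loaded_host_address() {
    awk -v want="$1" '
        # Start of a host block: reset what we know about it.
        /^[ \t]*define[ \t]+host[ \t]*\{/ { in_host = 1; name = ""; addr = ""; next }
        in_host && $1 == "host_name" { name = $2 }
        in_host && $1 == "address"   { addr = $2 }
        # End of the block: print the address if this was the host we wanted.
        in_host && /^[ \t]*\}/ { if (name == want) print addr; in_host = 0 }
    ' "$2"
}

# Usage:
#   loaded_host_address web2 /usr/local/nagios/var/objects.cache
# Then compare the result with the -H value used in the manual check_ping run.
```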

Jeff Golas

My question is - is it saying it's down because of a ping, or is it doing a different check? If I remember correctly, don't you tell it what type of device it is in the config file? Maybe you have it checking via SNMP?

Jeff
Jeff Golas
Johnson, Kendall & Johnson, Inc. :: Newtown, PA
Epic Online w/CSR24
http://www.jkj.com

Mark

Quote from: Jeff Golas on September 13, 2011, 11:00:15 AM
My question is - is it saying it's down because of a ping, or is it doing a different check? If I remember correctly, don't you tell it what type of device it is in the config file? Maybe you have it checking via SNMP?

Jeff

Nah, it's using check_ping.

Mark

GAH!  USER ERROR!

I was just deleting some of the notifications and realized that I had the IP wrong in the host definition. I enter the IP manually in the service check, and I typed it correctly there.  It is x.22.26.x and I had 26.22!  You'd think after seeing hundreds of these notifications I would have realized this already!

What an idiot.   :-\
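
A mismatch like this one (one address in the host definition, a different hand-typed one in the service check) can be caught mechanically. Here is a rough sketch, assuming standard `define host{ ... }` blocks and literal IPv4 addresses on `check_command` lines; the function names are made up for illustration, and using `$HOSTADDRESS$` in the service check instead of a typed IP would avoid the problem altogether.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not part of Nagios.

# Print every "address" value found inside define host blocks.
host_addresses() {
    awk '
        /^[ \t]*define[ \t]+host[ \t]*\{/ { in_host = 1; next }
        /^[ \t]*\}/ { in_host = 0 }
        in_host && $1 == "address" { print $2 }
    ' "$@" | sort -u
}

# Print every literal IPv4 address that appears on a check_command line.
hardcoded_check_ips() {
    grep -h 'check_command' "$@" |
        grep -Eo '([0-9]{1,3}\.){3}[0-9]{1,3}' | sort -u
}

# List IPs that are hard-coded in check commands but match no host address.
mismatched_ips() {
    hosts=$(mktemp) && checks=$(mktemp) || return 1
    host_addresses "$@" > "$hosts"
    hardcoded_check_ips "$@" > "$checks"
    comm -13 "$hosts" "$checks"    # lines that appear only in the second list
    rm -f "$hosts" "$checks"
}

# Usage:
#   mismatched_ips /usr/local/nagios/etc/objects/*.cfg
```

Any address this prints is one that a service check pings directly but that no host definition points at, which is exactly the transposed-octet situation above.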

Bloody Jack Kidd

it can happen... (not to me)... but it can happen

;)