
Nagios - Host Down


This is really bugging me (and hopefully this is the right board for it).

I have a server with a handful of IP addresses set up in Nagios, with each IP address as an individual "host".  I had one IP address that I wasn't using for a while, so I disabled the service and host checks on it.  Now I am using it again (~45 days later).  Service checks are enabled, all checks are enabled on the host, all other IP addresses on this server are reported as up, and pinging this IP gives the same results as pinging any of the other IP addresses.

Nagios still reports this host as being down.

Manual ping results:

--- Code: ---root@Nagios:/usr/local/nagios# ping -c 5
PING ( 56(84) bytes of data.
From 10.x.x.x: icmp_seq=1 Redirect Network(New nexthop: 10.x.x.y)
64 bytes from icmp_seq=1 ttl=56 time=62.3 ms
64 bytes from icmp_seq=2 ttl=56 time=60.4 ms
64 bytes from icmp_seq=3 ttl=56 time=79.0 ms
64 bytes from icmp_seq=4 ttl=56 time=60.4 ms
64 bytes from icmp_seq=5 ttl=56 time=60.3 ms

--- ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4005ms
rtt min/avg/max/mdev = 60.397/64.528/79.034/7.291 ms

--- End code ---

Same results if I ping a different IP on that same host.

Bloody Jack Kidd:
The problem we sometimes have is a bit different: the OS gets an ICMP redirect when a router goes down, then doesn't update the route table once the route is available again.  Hence a blocking outage remains active in Nagios even though it's no longer the case.

Ok, so I should check the routing table on the Nagios box?  Or how do you usually recover from this?

Bloody Jack Kidd:
Yeah, perhaps something like a netstat -r or a route query to see if that's poisoned (but I think that's my issue, not yours per se, since you can ping the device from the Nagios box).

my first check might be

--- Code: ---# /usr/local/etc/rc.d/nagios reload
--- End code ---
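
For the route-table check mentioned above, a rough sketch of one way to spot redirect-installed routes. It assumes `netstat -rn` prints a header row with a `Flags` column, where `D` marks a route created dynamically (by a redirect or routing daemon) and `M` marks one modified by a redirect. The function name `redirect_routes` is made up for illustration, and newer Linux kernels may not list redirect-learned routes in this table at all, so an empty result is not proof the route is clean.

```shell
#!/bin/sh
# Print routes whose Flags column contains D (dynamically created, e.g. by
# an ICMP redirect) or M (modified by a redirect).
# Reads `netstat -rn` style output on stdin. Hypothetical helper name.
redirect_routes() {
    awk '
        # Find the Flags column from the header row; its position differs
        # between Linux and the BSDs. Lines before the header are skipped.
        flagcol == 0 {
            for (i = 1; i <= NF; i++) if ($i == "Flags") flagcol = i
            next
        }
        # After the header, print any route with D or M among its flags.
        $flagcol ~ /[DM]/ { print }
    '
}

# Usage: netstat -rn | redirect_routes
```

If this prints a host route for the monitored address, that entry is the one to look at more closely.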

Also, running check_ping manually with the same exact syntax as the commands.cfg uses returns:

--- Code: ---./check_ping -H  -w 3000.0,80% -c 5000.0,100% -p 5
PING OK - Packet loss = 0%, RTA = 60.45 ms
--- End code ---
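
When running a plugin by hand like this, it can help to look at the exit status as well as the text, since Nagios derives the state from the plugin's return code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). A minimal sketch of a wrapper that prints both. The function names `plugin_state` and `run_plugin` and the `UNDEFINED` label are made up for illustration, not part of Nagios.

```shell
#!/bin/sh
# Map a plugin exit status to the state name Nagios associates with it.
# UNDEFINED is this sketch's own label for anything outside 0-3.
plugin_state() {
    case "$1" in
        0) echo OK ;;
        1) echo WARNING ;;
        2) echo CRITICAL ;;
        3) echo UNKNOWN ;;
        *) echo UNDEFINED ;;
    esac
}

# Run any plugin command line, then print its output, exit status and state.
run_plugin() {
    output=$("$@")
    status=$?
    printf '%s (exit %s, %s)\n' "$output" "$status" "$(plugin_state "$status")"
}

# Usage ($host is a placeholder for the monitored address):
#   run_plugin ./check_ping -H "$host" -w 3000.0,80% -c 5000.0,100% -p 5
```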

I don't have any route issues as far as I can tell either.

Very odd.  I changed the host's IP and reloaded.  Now the service check should fail, because the port in question is not open on that IP, and the host should show as up... but the service still shows up and the host still shows down.

Removing and re-adding is next.

