Nagios

Monitoring Routers and Switches


See Also: Monitoring Publicly Available Services

Introduction

This document describes how you can monitor the status of network switches and routers. Some cheaper "unmanaged" switches and hubs don't have IP addresses and are essentially invisible on your network, so there is no way to monitor them. More expensive switches and routers have addresses assigned to them and can be monitored by pinging them or by using SNMP to query status information.

I'll describe how you can monitor the following things on managed switches, hubs, and routers: packet loss and round trip average (RTA), SNMP status information, and bandwidth / traffic rate.

Creating Required Definitions

You'll need to create some object definitions in order to monitor a new switch. These definitions can be placed in their own file or added to an already existing object configuration file.

First, it's best practice to create a new template for each different type of host you'll be monitoring. Let's create a new template for switches.

define host{
	name			generic-switch	; The name of this host template
	use			generic-host	; Inherit default values from the generic-host template
	check_period		24x7		; By default, switches are monitored round the clock
	check_interval		5		; Switches are checked every 5 minutes
	retry_interval		1		; Schedule host check retries at 1 minute intervals
	max_check_attempts		10		; Check each switch 10 times (max)
	check_command		check-host-alive	; Default command to check if routers are "alive"
	notification_period	24x7		; Send notifications at any time
	notification_interval	30		; Resend notifications every 30 minutes
	notification_options	d,r		; Only send notifications for specific host states
	contact_groups		admins		; Notifications get sent to the admins by default
	register			0		; DONT REGISTER THIS - ITS JUST A TEMPLATE
	}

Notice that the switch template definition is inheriting default values from the generic-host template, which is defined in the sample localhost.cfg file.

Next, define a new host for the switch that references the newly created generic-switch host template.

define host{
	use		generic-switch		; Inherit default values from a template
	host_name		linksys-srw224p		; The name we're giving to this switch
	alias		Linksys SRW224P Switch	; A longer name associated with the switch
	address		192.168.1.253		; IP address of the switch
	hostgroups	allhosts			; Host groups this switch is associated with
	}

Add an optional hostgroup for switches. This is useful if you create additional switches in the future and want to view them together in the CGIs. It can also be useful for object definition tricks that you can use to manage larger configurations later on.

define hostgroup{
	hostgroup_name	switches		; The name of the hostgroup
	alias		Network Switches	; Long name of the group
	members		linksys-srw224p	; Comma separated list of hosts that belong to this group
	}

The linksys-srw224p host will be a member of two hostgroups - allhosts (which is referenced in the host definition and defined in localhost.cfg) and switches (which is defined above).
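
Hostgroup membership can also be declared from the host side, which is handy as more switches are added. The following is only a sketch: the host name, alias, and address are hypothetical placeholders and do not appear in the sample configuration. It assumes the generic-switch template and the switches hostgroup defined above.

```cfg
define host{
	use		generic-switch		; Inherit default values from the template
	host_name	example-switch-2	; Hypothetical name for a second switch
	alias		Example Second Switch	; Hypothetical alias
	address		192.168.1.254		; Hypothetical IP address
	hostgroups	allhosts,switches	; Join both hostgroups from the host definition
	}
```

With this approach the members directive of the switches hostgroup does not need to be edited each time a switch is added.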

Monitoring Packet Loss and RTA

Now it's time to define some services that should be associated with the switch. First off, we should monitor packet loss and round trip average between the Nagios host and the switch. This can be accomplished by using the check_ping plugin. A command definition for using the check_ping plugin has been defined in the commands.cfg file. That command definition looks like this...

define command{
	command_name	check_ping
	command_line	$USER1$/check_ping -H $HOSTADDRESS$ -w $ARG1$ -c $ARG2$ -p 5
	}

Let's create a service called PING as follows...

define service{
	use			generic-service	; Inherit values from a template
	host_name			linksys-srw224p	; The name of the host the service is associated with
	service_description	PING		; The service description
	check_command		check_ping!200.0,20%!600.0,60%	; The command used to monitor the service
	normal_check_interval	5	; Check the service every 5 minutes under normal conditions
	retry_check_interval	1	; Re-check the service every minute until its final/hard state is determined
	}

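Notice that the check_command directive is passing "200.0,20%" and "600.0,60%" to the check_ping command, where they are substituted for the $ARG1$ and $ARG2$ macros, respectively. This means that the PING service will be CRITICAL if the round trip average (RTA) is greater than 600 milliseconds or the packet loss is 60% or more, WARNING if the RTA is greater than 200 milliseconds or the packet loss is 20% or more, and OK otherwise.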
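
If you end up with several switches in the switches hostgroup, one service definition can cover all of them. Here is a minimal sketch, assuming every member should use the same thresholds. It uses hostgroup_name in place of host_name and is meant to replace the per-host PING service above rather than sit alongside it, since two PING services on the same host would be duplicates.

```cfg
define service{
	use			generic-service	; Inherit values from a template
	hostgroup_name		switches	; Apply to every host in the switches hostgroup
	service_description	PING		; The service description
	check_command		check_ping!200.0,20%!600.0,60%	; Same thresholds as the per-host example
	}
```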

Monitoring SNMP Status Information

If your switch or router supports SNMP, you can monitor a lot of information by using the check_snmp plugin. A command definition for using the check_snmp plugin has been defined in the commands.cfg file. That command definition looks like this...

define command{
	command_name	check_snmp
	command_line	$USER1$/check_snmp -H $HOSTADDRESS$ $ARG1$
	}

Monitoring the uptime of a switch is fairly common. A service definition that would accomplish that looks like this...

define service{
	use			generic-service	; Inherit values from a template
	host_name			linksys-srw224p
	service_description	Uptime	
	check_command		check_snmp!-C public -o sysUpTime.0
	}

The check_command directive will pass the "-C public -o sysUpTime.0" options to the $ARG1$ macro in the check_snmp command definition. The "-C public" tells the plugin that the SNMP community name is "public" and the "-o sysUpTime.0" is the OID that we want to check.

If you want to ensure that a specific port/interface on the switch is in an up state, you could create a service definition like this:

define service{
	use			generic-service	; Inherit values from a template
	host_name			linksys-srw224p
	service_description	Port 1 Link Status
	check_command		check_snmp!-C public -o ifOperStatus.1 -r 1 -m RFC1213-MIB
	}

In the example above, the "-o ifOperStatus.1" refers to the OID for the operational status of port 1 on the switch. The "-r 1" option tells the check_snmp plugin to return an OK state if "1" is found in the SNMP result (1 indicates an "up" state on the port) and CRITICAL if it isn't found. The "-m RFC1213-MIB" is optional and tells the check_snmp plugin to only load the "RFC1213-MIB" instead of every single MIB that's installed on your system, which can help speed things up.
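
The same pattern extends to other ports by changing the interface index at the end of the OID. A sketch for a second port, assuming its interface index is 2 (indexes vary between devices, so check the ifOperStatus entries your own switch reports before relying on this):

```cfg
define service{
	use			generic-service	; Inherit values from a template
	host_name		linksys-srw224p
	service_description	Port 2 Link Status
	check_command		check_snmp!-C public -o ifOperStatus.2 -r 1 -m RFC1213-MIB	; Assumes port 2 has interface index 2
	}
```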

That's it for the SNMP monitoring example. There are a million things that can be monitored via SNMP, so it's up to you to decide what you need and want to monitor. Good luck!

Tip: You can usually find the OIDs that can be monitored on a switch by running the following command (replace 192.168.1.253 with the IP address of the switch): snmpwalk -v1 -c public 192.168.1.253 -m ALL .1

Monitoring Bandwidth / Traffic Rate

If you're monitoring bandwidth usage on your switches or routers using MRTG, you can have Nagios alert you when traffic rates exceed thresholds you specify. The check_mrtgtraf plugin (which is included in the Nagios plugins distribution) allows you to do this.

A sample check_local_mrtgtraf command that uses the check_mrtgtraf plugin has been defined in the commands.cfg file. It looks like this...

define command{
	command_name	check_local_mrtgtraf
	command_line	$USER1$/check_mrtgtraf -F $ARG1$ -a $ARG2$ -w $ARG3$ -c $ARG4$ -e $ARG5$
	}

You need to let the check_mrtgtraf plugin know what log file the MRTG data is being stored in, along with thresholds, etc. In my example, I'm monitoring one of the ports on the Linksys switch. The MRTG log file is stored in /var/lib/mrtg/192.168.1.253_1.log. Here's the service definition I use to monitor the bandwidth data that's stored in the log file...

define service{
	use			generic-service	; Inherit values from a template
	host_name			linksys-srw224p
	service_description	Port 1 Bandwidth Usage
	check_command		check_local_mrtgtraf!/var/lib/mrtg/192.168.1.253_1.log!AVG!1000000,1000000!5000000,5000000!10
	}

The values in the check_command directive will be substituted for the $ARGx$ variables in the check_local_mrtgtraf command line when the service is checked. Here's what the substituted command line looks like:

/usr/local/nagios/libexec/check_mrtgtraf -F /var/lib/mrtg/192.168.1.253_1.log -a AVG -w 1000000,1000000 -c 5000000,5000000 -e 10

The "-F /var/lib/mrtg/192.168.1.253_1.log" option tells the plugin which MRTG log file to read from. The "-a AVG" option tells it that it should use average bandwidth statistics. The "-w 1000000,1000000 -c 5000000,5000000" options are warning and critical threshold values (in bytes) for both incoming and outgoing traffic rates. The "-e 10" option causes the plugin to return a CRITICAL state if the MRTG log file is older than 10 minutes (it should be updated every 5 minutes).
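
The check_mrtgtraf plugin also accepts MAX for the "-a" option, which tests maximum rather than average values. As a sketch only, a companion service that alerts on peaks could look like the following. The service description is a hypothetical name, and the thresholds are copied from the example above for illustration rather than tuned for peak traffic.

```cfg
define service{
	use			generic-service	; Inherit values from a template
	host_name		linksys-srw224p
	service_description	Port 1 Peak Bandwidth Usage	; Hypothetical name for this variant
	check_command		check_local_mrtgtraf!/var/lib/mrtg/192.168.1.253_1.log!MAX!1000000,1000000!5000000,5000000!10
	}
```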

Those are the basics on monitoring switches and routers. As a reminder, if you modify your configuration files, make sure you verify your configuration and restart Nagios.