Service Assurance

This section describes how Horizon tests the availability of a service or device and measures its latency. Refer to the service assurance topic in the Quick Start section for additional information on determining service availability.

In Horizon, a service monitor framework fulfills these tasks. The main component, pollerd, provides the following functionality:

  • Tracks the status of a management resource or an application for availability calculations.

  • Measures response times for service quality.

  • Correlates node and interface outages based on a critical service.

The following image shows the model and representation of availability and response time:

01 node model

This information is based on service monitors which pollerd stores and runs. A service can have any arbitrary name and is associated with a service monitor. For example, we can define two services with the names HTTP and HTTP-8080. Both are associated with the HTTP Service Monitor, but each uses a different TCP port configuration parameter. The following figure shows how pollerd interacts with other components in Horizon and applications or agents to be monitored.

The availability is calculated over the last 24 hours and is shown in the Surveillance Views, SLA Categories, and the Node Detail pages. Response times are displayed as resource graphs of the IP interface on the Node Detail page. Click the service name on the Node Detail page to see the configuration parameters of the service monitor. The status of a service can be up or down.

The Service page also includes timestamps that indicate the last time the service was polled and found to to be up (last good) or down (last fail). These fields help to validate that pollerd is polling the services as expected.

When a service monitor detects an outage, pollerd sends an event that is used to create an alarm. You can also use events to generate notifications for on-call network or server administrators. The following image shows how pollerd interacts with other Horizon components:

02 service assurance

Pollerd can generate the following events in Horizon:

Event name Description

Critical services are still up, just this service is lost.

Service came back up.

Critical service on an IP interface is down or all services are down.

Critical service on that interface came back up.

All critical services on all IP interfaces are down from node. The whole host is unreachable over the network.

Some critical services came back online.

The behavior to generate interfaceDown and nodeDown events is described in the Critical Service section.

This assumes that node-outage processing is enabled.