Service Assurance
This section covers the basics of how Horizon tests whether a service or device is available and measures its latency.
In Horizon, this task is handled by the Service Monitor framework. The main component is Pollerd, which provides the following functionality:
- Track the status of a management resource or an application for availability calculations
- Measure response times for service quality
- Correlate node and interface outages based on a Critical Service
The following image shows the model and representation of availability and response time.
This information is based on Service Monitors, which are scheduled and executed by Pollerd. A Service can have any arbitrary name and is associated with a Service Monitor. For example, we can define two Services named HTTP and HTTP-8080; both are associated with the HTTP Service Monitor, but each uses a different value for the TCP port configuration parameter. The following figure shows how Pollerd interacts with other components in OpenNMS and with the applications or agents to be monitored.
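The relationship between Service names and Service Monitors can be sketched as a small data model. This is an illustration only, not Horizon's configuration format: the `ServiceDefinition` class, the `services_by_monitor` helper, and the port value 80 for the HTTP Service are assumptions made for this example (8080 is inferred from the Service name).

```python
from dataclasses import dataclass, field


@dataclass
class ServiceDefinition:
    """Hypothetical model: an arbitrary Service name bound to one Service Monitor."""
    name: str
    monitor: str
    parameters: dict = field(default_factory=dict)


# Two Services that share one monitor and differ only in the port parameter.
# Port 80 is an assumed value; the text only says the ports differ.
services = [
    ServiceDefinition("HTTP", "HTTP Service Monitor", {"port": 80}),
    ServiceDefinition("HTTP-8080", "HTTP Service Monitor", {"port": 8080}),
]


def services_by_monitor(definitions):
    """Group Service names by the monitor that polls them."""
    grouped = {}
    for definition in definitions:
        grouped.setdefault(definition.monitor, []).append(definition.name)
    return grouped


print(services_by_monitor(services))
```

The point of the sketch is that the Service name is only a label: the monitor supplies the test logic, and the parameters decide what exactly is tested.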
The availability is calculated over the last 24 hours and is shown in the Surveillance Views, SLA Categories and the Node Detail Page. Response times are displayed as Resource Graphs of the IP Interface on the Node Detail Page. Configuration parameters of the Service Monitor can be seen in the Service Page by clicking on the Service Name on the Node Detail Page. The status of a Service can be Up or Down.
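To make the 24-hour availability figure concrete, here is a minimal sketch of a time-based calculation. It assumes availability is the share of the window during which the service was not in an outage, and that outages do not overlap; the function name and the outage representation are hypothetical, and Horizon's internal calculation may differ in detail.

```python
from datetime import datetime, timedelta


def availability_percent(outages, now, window=timedelta(hours=24)):
    """Return the percentage of `window` (ending at `now`) without an outage.

    `outages` is a list of (lost, regained) datetime pairs; `regained` is
    None while the outage is still open. Outages are assumed not to overlap.
    """
    window_start = now - window
    downtime = timedelta(0)
    for lost, regained in outages:
        # Clip each outage to the window so older downtime is not counted.
        start = max(lost, window_start)
        end = min(regained or now, now)
        if end > start:
            downtime += end - start
    return 100.0 * (1 - downtime / window)


now = datetime(2024, 1, 2, 12, 0)
# One outage that started 30 hours ago and ended 18 hours ago:
# only 6 of its 12 hours fall inside the 24-hour window.
outages = [(now - timedelta(hours=30), now - timedelta(hours=18))]
print(availability_percent(outages, now))
```

Clipping to the window matters: an outage that began before the window only counts for the part that falls within the last 24 hours.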
The Service Page also includes timestamps indicating the last time at which the service was polled and found to be Up (Last Good) or Down (Last Fail). These fields can be used to validate that Pollerd is polling the services as expected.
When a Service Monitor detects an outage, Pollerd sends an Event, which is used to create an Alarm. Events can also be used to generate Notifications for on-call network or server administrators. The following image shows the interaction of Pollerd in Horizon.
Pollerd can generate the following Events in Horizon:
Event name | Description |
---|---|
uei.opennms.org/nodes/nodeLostService | Critical Services are still up; only this service is lost. |
uei.opennms.org/nodes/nodeRegainedService | The service came back up. |
uei.opennms.org/nodes/interfaceDown | The Critical Service on an IP interface is down, or all services are down. |
uei.opennms.org/nodes/interfaceUp | The Critical Service on that interface came back up. |
uei.opennms.org/nodes/nodeDown | All Critical Services on all IP interfaces of the node are down. The whole host is unreachable over the network. |
uei.opennms.org/nodes/nodeUp | Some of the Critical Services came back online. |
The behavior to generate interfaceDown and nodeDown events is described in the Critical Service section.
This assumes that node-outage processing is enabled.
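The event descriptions in the table can be read as a simple decision rule. The following sketch encodes only what the table states, and it assumes that node-outage processing is enabled; the authoritative rules are in the Critical Service section. The function names, the dictionary layout, the IP addresses, and the choice of ICMP as the Critical Service are assumptions made for this example.

```python
UEI_PREFIX = "uei.opennms.org/nodes/"


def interface_is_down(services, critical_service):
    """An interface counts as down if its Critical Service is down
    or if every service on it is down (per the table above)."""
    critical_down = services.get(critical_service) is False
    all_down = not any(services.values())
    return critical_down or all_down


def outage_event(node, interface, critical_service):
    """Pick the outage event after a service on `interface` was found Down.

    `node` maps an IP address to {service name: True if Up, False if Down}.
    """
    if all(interface_is_down(s, critical_service) for s in node.values()):
        return UEI_PREFIX + "nodeDown"
    if interface_is_down(node[interface], critical_service):
        return UEI_PREFIX + "interfaceDown"
    return UEI_PREFIX + "nodeLostService"


# Hypothetical node with two interfaces; ICMP is assumed to be the Critical Service.
node = {
    "10.0.0.1": {"ICMP": True, "HTTP": False},
    "10.0.0.2": {"ICMP": True},
}
print(outage_event(node, "10.0.0.1", "ICMP"))
```

In this example only HTTP is lost while the assumed Critical Service is still up on both interfaces, so the sketch selects nodeLostService rather than interfaceDown or nodeDown.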