Nagios Conference 2014 - Andy Brist - Nagios XI Failover and HA Solutions
Andy Brist's presentation on High Availability and Failover Solutions for Nagios XI. The presentation was given during the Nagios World Conference North America held Oct 13th - Oct 16th, 2014 in Saint Paul, MN. For more information on the conference (including photos and videos), visit: http://go.nagios.com/conference
Published on: Mar 3, 2016
Failover and High Availability Solutions for Nagios XI
• Who am I?
• Nagios Support Team Manager
• Team Lead for Nagios-Plugins
• Every environment is different
• Failover/HA, by nature, is a customized solution
• My case studies are not your production environment
• I know Nagios/XI, not your SLA
• Test in a lab. First.
● Short overview of the different failback/failover solutions
● Nagios XI Data Locations and other files/services relevant to HA/failover
● Observations, Considerations
Snapback
● Restore a VM snapshot, or spin up a new instance and restore a snapshot
● Most common implementation
● Easiest of all options
● Most potential downtime of all the scenarios
● Maximum historical and configuration data lost = the snapshot interval
● Requires manual intervention
Automated XI Backups
● XI provides a method for scheduled backups through the "Scheduled Backups" component
– local fs
● Useful for remote backups or manual failback
Failback
● The secondary is periodically updated from an XI backup.
● The nagios process is started by hand when the master has an issue.
● A cronjob on the secondary restores the newest backup once a day (see the sketch after this list).
● If unconcerned with historical data and mrtg performance data, just push/restore the object configs and SQL dumps (if not offloaded).
● Not to be confused with snapback, as this is a separate, different instance/image, not just a previous state of the failed instance.
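A minimal sketch of that nightly restore cronjob, assuming the XI backups land in /store/backups/nagiosxi and that the stock restore script lives in the standard XI scripts directory (both paths, and the script name, are assumptions; adjust for your layout):

# /etc/cron.d/xi-failback -- on the SECONDARY only: restore the newest backup nightly
30 3 * * * root /usr/local/nagiosxi/scripts/restore_xi.sh "$(ls -t /store/backups/nagiosxi/*.tar.gz | head -1)"

Keep the restore window outside business hours; a restore in progress is the worst moment for the primary to fail (see the downtime caveat below).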
● Easy to implement with the "Scheduled Backups" XI component
● Agents must maintain 2+ allowed hosts (see the nrpe.cfg sketch below)
● SNMP traps must be configured to push to 2+ hosts
● May experience substantial downtime if the backup is large and the primary fails during a data restore on the secondary
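For NRPE-based agents, the "2+ allowed hosts" requirement is a one-line change on each monitored host; a minimal sketch with placeholder addresses for the primary and secondary:

# nrpe.cfg -- allow BOTH Nagios servers to run checks
# (192.168.1.10 / 192.168.1.11 are placeholder addresses)
allowed_hosts=127.0.0.1,192.168.1.10,192.168.1.11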
Failover
● Difficult to get right
● Demanding on I/O resources and network speed
● Very little to no loss of historical data
● Minimal downtime
● Fully automated
● Can provide minimal clustering for XI services (see "High Availability" below)
Nagios XI - Data
● Object Configuration
● Check Status
● Object State
● Program State
● Historical State Data
● Performance Data
Nagios XI - Services
nagios – Monitoring engine
mysql – Object configuration and ndo historical data
ndo2db – Writes historical data to mysql database
postgresql – Nagios XI settings/user database
npcd – Performance data daemon
crond – Task scheduler
httpd – Web server
XI Data and Redundancy
Absolute minimum redundant data required for any failover solution (a capture sketch follows the list):
● (Working) Object configuration
● MySQL 'nagiosql' database
● PostgreSQL 'nagiosxi' database
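Capturing that minimum by hand takes only a few commands; a minimal sketch assuming default XI paths, local (non-offloaded) databases, and placeholder destination paths and credentials:

# grab the three must-have pieces (paths and credentials are placeholders)
mysqldump nagiosql > /backup/nagiosql.sql                     # object config DB (add -u/-p as needed)
su - postgres -c "pg_dump nagiosxi" > /backup/nagiosxi.sql    # XI settings/user DB
tar czf /backup/nagios-etc.tgz /usr/local/nagios/etc          # working object configuration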
Full Check Redundancy
Additional requirements for full check redundancy:
● mrtg config and RRDs (for bandwidth checks)
● nagios libexec folder (plugins)
Any additional dependencies for plugins:
● VMware SDK
● Oracle Perl library
● Java JRE
Runtime State Redundancy
Additional requirements for runtime state redundancy:
● retention.dat (state, runtime options, acknowledgments, comments, downtimes)
● NDO MySQL database "nagios"
Full Historical Redundancy
Additional data required for complete historical redundancy:
● nagios.log and archives directory
● perfdata RRDs
● mrtg config and RRDs
● NDO MySQL database "nagios"
XI Data Summary
High Availability
1. Elimination of single points of failure.
2. Reliable crossover/failover.
3. Detection of failures as they occur.
Why would you need it?
● Least amount of downtime
● (limited) Service clustering
● Shared volumes solve the issues with syncing historical data in real time
● Shared storage
● Virtual IP
● Management applications/scripts
● DRBD – Block-level replication, part of the Linux kernel, well supported and well understood. Works well for all XI data types (including RRDs/DBs).
● NFS – A fine option; just make sure the NFS share does not have an I/O latency issue, or your checks WILL get behind. Do not mount the volume on more than one server at a time, to avoid multiple servers writing check results during a partial failover.
● Replicated DBs – A fine solution; clusters well. Use DNS or virtual IPs to control access to the databases.
● rsync – Not immediate replication, but close. Easy to implement (see the sketch after this list).
● GlusterFS – More problematic to set up, but good for offloaded mrtg/RRDs.
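A minimal rsync sketch for near-real-time copies, assuming the standard XI perfdata RRD location, an assumed /var/lib/mrtg data directory, a standby reachable over passwordless SSH, and a five-minute window of acceptable loss:

# /etc/cron.d/xi-rsync -- push RRD data to the standby every 5 minutes
*/5 * * * * root rsync -a --delete /usr/local/nagios/share/perfdata/ standby:/usr/local/nagios/share/perfdata/
*/5 * * * * root rsync -a --delete /var/lib/mrtg/ standby:/var/lib/mrtg/

Never run the reverse job at the same time; one direction, one writer.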
● Active/passive suggested
● Low latency storage
● The active mount should move with the VIP
● Refer to Jeremy Rust's presentation notes for more details
Virtual IP
● Pacemaker VIP script
● Custom ifconfig/ip shell scripts (a sketch follows this list)
● uCarp scripts
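A minimal sketch of the hand-rolled variety using the modern ip tool plus gratuitous ARP; the address, prefix, and interface are assumptions:

#!/bin/bash
# vip.sh -- bring the shared address up or down on this node
VIP=192.168.1.50     # assumed virtual IP
PREFIX=24            # assumed prefix length
IFACE=eth0           # assumed interface
case "$1" in
  up)
    ip addr add "$VIP/$PREFIX" dev "$IFACE" label "$IFACE:vip"
    arping -c 3 -U -I "$IFACE" "$VIP"   # gratuitous ARP so peers update their caches
    ;;
  down)
    ip addr del "$VIP/$PREFIX" dev "$IFACE"
    ;;
  *)
    echo "usage: $0 {up|down}" >&2
    exit 1
    ;;
esac

uCarp and Pacemaker do the same dance for you, plus the monitoring to decide when.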
HA Failover Management
● Pacemaker/Heartbeat (the HA stack)
● uCarp scripts
● keepalived scripts
● nagios itself – event handler driven
● cron – a job that checks the master for connectivity; reuse the check_icmp or check_http plugins for this purpose (see the sketch below)
● DRBD/Shared Storage
● High Latency HA
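A minimal sketch of the cron-driven approach on the standby, reusing check_icmp; the master address, the thresholds, and the vip.sh helper (from the earlier sketch) are assumptions:

#!/bin/bash
# check_master.sh -- run every minute from cron on the STANDBY
MASTER=192.168.1.10    # assumed non-VIP address of the primary
/usr/local/nagios/libexec/check_icmp -H "$MASTER" -w 500.0,40% -c 1000.0,60% >/dev/null
if [ $? -ge 2 ]; then
    /usr/local/bin/vip.sh up    # take over the virtual IP (hypothetical helper)
    service nagios start        # start the local monitoring engine
fi

Add hysteresis (e.g., require several consecutive failures) before trusting this in production; a single dropped ping should not trigger a takeover.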
Stonith (shoot the other node in the head)
● Mechanism by which a failing server is guaranteed to be removed from the cluster
● Not required, but advised
● Hardware (including UPS) and software (VMware stonith "device" and shell scripts)
● Only failing over when the primary is unreachable is safest
● Beware of overzealous failover conditions, as they can lead to a deathmatch . . .
No, really. Stonith gives your servers the ability to
KILL THEMSELVES and FRIENDS
● Beware of services whose init actions/failures should not cause failover
● Any actions requiring a shared volume in active/passive mode should not immediately cause failover, due to potential latency during volume mounts
● Test, test, test the disaster scenarios in a LAB first, or the fragfest may include your job!
Clustering
● A number of portions of Nagios Core and Nagios XI are clusterable. Processes that can potentially be clustered:
– offloaded postgresql
– offloaded mysql/ndo2db
– offloaded mrtg
● Services that are dependent on the core monitoring engine and filesystem, and should not be clustered:
– nagios, npcd, cronjobs
– snmptrapd, snmptt
Dual DRBD Primary
● Disconnecting from the master before mounting the shared volume during failover is no longer needed.
● Careful implementation allows multiple servers to concurrently access the shared volume. Potentially useful for ambitious clusters and shared historical records.
● Slower, as the "secondary" can lock blocks.
● More prone to "split-brain".
● Usually requires a clustered file system (a config sketch follows).
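A minimal drbd.conf resource sketch for dual primary (DRBD 8.x syntax; the hostnames, devices, and addresses are assumptions, and the split-brain policies shown are just one conservative choice):

resource r0 {
  protocol C;                             # synchronous replication
  net {
    allow-two-primaries;                  # the dual-primary switch
    after-sb-0pri discard-zero-changes;   # assumed split-brain policies
    after-sb-1pri discard-secondary;
    after-sb-2pri disconnect;
  }
  on nagios-a {                           # assumed hostname
    device    /dev/drbd0;
    disk      /dev/sdb1;
    address   10.0.0.1:7788;
    meta-disk internal;
  }
  on nagios-b {
    device    /dev/drbd0;
    disk      /dev/sdb1;
    address   10.0.0.2:7788;
    meta-disk internal;
  }
}

Remember the last bullet above: two primaries writing a plain ext4 volume will eat your data; put a clustered filesystem (GFS2/OCFS2 or similar) on top.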
High Latency HA
● Problematic if the HA solution was not designed for potential latency
● Will potentially cause I/O wait issues
● It may be better to push checks to a central server(s) with
NRDP/outbound checks/etc, keeping HA solutions local, or to
pay for a faster pipe.
● DRBD Proxy – A good solution if high latency HA is a must –
uses an asynchronous buffer for block writes to the secondary
volumes (does not support dual primary)
Databases
● Enforce single NDO instance access to MySQL
● If multiple NDO processes connecting to a single NDO db are required, consider using NDO db instances
● You can control NDO's access to the MySQL server through iptables and the VIP (see the sketch at the end of this section)
● Offload ndo2db to the offloaded MySQL server
● Configure ndomod to connect through a TCP socket; this can potentially decrease load on the Nagios server (see the config sketch below)
● Initiating failover due to crashed DBs may cause a deathmatch, as all nodes will fail (due to their shared nature)
● Offload both the PostgreSQL and MySQL databases. Requires a virtual IP or careful management of DNS.
● XI has scripts to repair the databases; use them!
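A minimal sketch of the TCP-socket NDO setup plus the iptables fence, using stock ndoutils directives; the addresses (DB server and VIP) and the default port are placeholders:

# ndomod.cfg on the monitoring node: ship events over TCP, not the local socket
output_type=tcpsocket
output=192.168.1.20    # assumed address of the offloaded mysql/ndo2db server
tcp_port=5668

# ndo2db.cfg on the database server: listen on TCP
socket_type=tcp
tcp_port=5668

# iptables on the database server: only the current VIP holder may reach ndo2db
# (assumes the active node sources its connections from the VIP)
iptables -A INPUT -p tcp --dport 5668 -s 192.168.1.50 -j ACCEPT
iptables -A INPUT -p tcp --dport 5668 -j DROP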
Recovering from Failover
● Degraded ex-primaries should not be added back to the cluster automatically. Doing so may cause split-brains.
● Split-brains REQUIRE manual intervention if preservation of historical data is desired.
● Stonith deathmatches – have a primary image/instance without stonith enabled, for recovery.
● Maintain an ultimate disaster recovery server instance/image outside of the cluster pool, for when all else has failed.
A Plea from Nagios Support
● Failover/HA != backups
● Test, test, TEST! Use your lab please.
● Document. Everything. The biggest barriers and hurdles for support are unknown, undocumented, non-standard configurations. Failover/HA deployments definitely qualify.
Summary
● Snapback: Easy. Slow recovery. Requires manual intervention. Highest potential historical loss.
● Failback: Intermediate. Moderate recovery. Can be automated. Less potential historical loss.
● Failover: Difficult. Fast recovery. Fully automated. Nearly no historical loss.
● High Availability: Difficult. Fast recovery. Automated. Redundancy across WAN links. Limited clustering. Least potential downtime. Multiple potential issues with split-brain, stonith/deathmatches, and latency, so care should be given, and scenarios tested.
Food for thought . . . .
● HA in a federated model . . . . . . . .
Final Questions For You
● How much of Nagios XI, or Core, can truly be
set up to be "HA"? Do you care? :P
● Do you need HA/failover, or will backups suffice?
● Is the time trade-off worth it in your environment?
Questions for Me?
(common/critical answers noted below for the sake of efficiency)
● 11 meters/sec (unladen European swallow)
● The Prime Directive
● 3 Times
● The Categorical Imperative/Pragmatism (choose 1)
● Evasive Subjunctive
● . . . Yes?