Nagios Conference 2014 - Bryan Heden - 10,000 Services Across The State of Ohio
Bryan Heden's presentation on 10,000 Services Across The State of Ohio.
The presentation was given during the Nagios World Conference North America held Oct 13th - Oct 16th, 2014 in Saint Paul, MN. For more information on the conference (including photos and videos), visit: http://go.nagios.com/conference
Published on: Mar 3, 2016
Lead Solutions Provider
Introduction & Agenda
• What we do and what we needed
• Customized and configured
• IO issues...
• Offload all the things!
• What did we learn?
• What’s next?
Agile Networks: Who We Are
• We Engineer and Operate The Agile Network, a general purpose
backhaul network with Last-Mile Agility™
• Who We Serve
• Public Sector … particularly Public Safety
• Underserved Communities
What We Needed
General insight into network health
• Ability to maintain SLAs with customers
• To react to network downtime as fast as possible
• The Government doesn’t like to wait
• To monitor traffic across the network
What We Did
We chose Nagios XI
• Easy to use and understand interface
• No more text based configurations to manage
(Haha, just kidding!)
• Built on top of something we were already comfortable with
How'd That Go?
It worked, but not exactly how we wanted it to
• “WHAT DO YOU MEAN IT DOESN'T AUTOMATICALLY TRANSLATE OIDS
INTO HUMAN UNDERSTANDABLE ENGLISH?”
• “YOU MEAN TO TELL ME THAT OUR EQUIPMENT DOESN'T COME
STANDARD WITH NAGIOS PLUGINS OR THAT NAGIOS DOESN'T
PRODUCE ONE FOR EACH TYPE OF DEVICE WE USE?!”
• Ping worked just fine
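As a rough illustration of the gap described above, here is a minimal sketch of what a hand-written check has to do: map a raw OID reading to a readable label and report it with the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The OID, the label table, and the `evaluate` function are hypothetical and not taken from the talk; a real plugin would fetch the value over SNMP and use the vendor's MIB.

```python
"""Sketch of a Nagios-style check that turns a raw OID reading into readable output."""

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATUS_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}

# Hypothetical OID-to-label table (assumption); real entries come from the vendor MIB.
OID_LABELS = {
    "1.3.6.1.4.1.99999.1.1": ("temp_c", "Enclosure temperature"),
}


def evaluate(oid, value, warn, crit):
    """Return (exit_code, output_line) for a reading where higher is worse."""
    if oid not in OID_LABELS:
        return UNKNOWN, f"UNKNOWN - no label for OID {oid}"
    key, label = OID_LABELS[oid]
    if value > crit:
        code = CRITICAL
    elif value > warn:
        code = WARNING
    else:
        code = OK
    # Text before the pipe is for humans; text after it is performance data.
    line = f"{STATUS_NAMES[code]} - {label} is {value} C | {key}={value};{warn};{crit}"
    return code, line


# Example reading: 45 C against a warning threshold of 40 and a critical threshold of 50.
code, line = evaluate("1.3.6.1.4.1.99999.1.1", 45, 40, 50)
print(line)
```

When run as an actual plugin, the script would print the line and exit with `code`, which is how Nagios decides the service state.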
If You Build It..
..The Network Engineers will use it
• We wrote our own configuration wizards for each different type of
device (PTP, PTMP, Routers, Power, GPS)
• We made some maps
• Executives love maps!
• One map tracked health of devices/links between sites along with
• Another map tracked the operating frequencies of active devices
Finally, Some Pictures!
The NOC Overview MAP provides
our teams insight into the health of
every node and their connections on
Our Network Engineers
can see, from a central
source, the health
and operating frequencies
of our equipment.
And More Pictures
My custom built
configuration wizards keep
our teams working on what
they need to work on and
allow me to be hands off
with system additions.
Stress Testing in Production
We reached maximum occupancy
• Our existing server setup wasn't meant for active checks for this
many hosts and services
• We introduced ModGearman
• We offloaded MySQL
• Things got better, but we still had some problems...
IO is a Major Factor
Lots of writes, not enough throughput
• There were suddenly more host and service checks than we could
handle with our setup
• Running on a VM on an ESX Host with 2x10K drives in a RAID1
• Bandwidth was only graphing once every 10 to 20 minutes
• Upgraded the ESX Host drives to 6x15K RAID10
• Okay, okay! We upgraded some other stuff on the ESX Host, too
• This was the single most important decision we had made
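A back-of-the-envelope estimate suggests why the drive upgrade mattered for a write-heavy workload. The per-drive IOPS figures below are common rules of thumb, not measurements from the talk, and the formula ignores controller cache and the other host upgrades mentioned above.

```python
def raid_write_iops(drives, iops_per_drive, write_penalty):
    """Rough random-write IOPS: raw array IOPS divided by the RAID write penalty."""
    return drives * iops_per_drive / write_penalty


# Assumed per-drive figures (rules of thumb, not numbers from the presentation).
IOPS_10K = 140
IOPS_15K = 180
MIRROR_WRITE_PENALTY = 2  # RAID1 and RAID10 write every block to two drives

before = raid_write_iops(2, IOPS_10K, MIRROR_WRITE_PENALTY)
after = raid_write_iops(6, IOPS_15K, MIRROR_WRITE_PENALTY)

print(f"2x10K RAID1:  ~{before:.0f} write IOPS")
print(f"6x15K RAID10: ~{after:.0f} write IOPS ({after / before:.1f}x)")
```

Under these assumptions the new array sustains several times the random-write rate of the old one, which fits the "lots of writes, not enough throughput" diagnosis.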
But We Didn't Stop There!
We offloaded MRTG
• Set up NFS Share for /var/lib/mrtg so that Nagios could read from it
• Set up NFS Share for /etc/mrtg so that Nagios could write to it, in
order to add host configuration files
• Put both Virtual Machines on the same Host (17 Gb/sec network
MRTG had some issues of its own…
• We had to split the cron job into separate processes
• This keeps any single MRTG run from taking so long to complete its
checks that it prevents the next run from starting
• (Remember the 10 to 20 minute graphing issue a few slides back?)
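One way to spot the overrun problem described above is to time each run against its scheduling interval. This is a minimal sketch, not the author's tooling: `timed_run` is a hypothetical helper, the 300-second interval is an assumption (the slides do not state the schedule), and a stand-in command replaces the real MRTG invocation so the example runs anywhere.

```python
import subprocess
import sys
import time


def timed_run(command, interval_seconds):
    """Run command and return (exit_code, elapsed_seconds, overran_interval)."""
    start = time.monotonic()
    result = subprocess.run(command, capture_output=True, text=True)
    elapsed = time.monotonic() - start
    return result.returncode, elapsed, elapsed > interval_seconds


# Stand-in for a polling run; swap in the real command on the MRTG server.
stand_in = [sys.executable, "-c", "print('poll done')"]
code, elapsed, overran = timed_run(stand_in, interval_seconds=300)
print(f"exit={code} elapsed={elapsed:.2f}s overran={overran}")
```

If `overran` is regularly true for a given process, that process has too many targets for one interval and is a candidate for further splitting.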
Pictures of Text
Here is what MRTG’s cron file looks like after we’ve made our changes:
MRTG Process Splitting
How we did it
• Split the configuration files into logical chunks by size and created
separate cron entries for each
• /etc/mrtg/conf.d/ has multiple subdirectories (1/, 2/, 3/, etc.)
• Each corresponding process in cron loads the configuration files
present in those directories (Include: /etc/mrtg/conf.d/X/*.cfg)
• We measure each process separately (run time, errors, standard
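The splitting described above can be sketched as a greedy balancing step followed by one cron entry per group. This is an illustration under assumptions: "size" is taken to mean file size in bytes, and the schedule, the `root` user field, the `/usr/bin/mrtg` path, and the `/etc/mrtg/mrtg-N.cfg` file names are hypothetical. Only the `Include: /etc/mrtg/conf.d/X/*.cfg` pattern comes from the slides.

```python
import heapq


def split_by_size(config_sizes, groups):
    """Greedily assign config files to groups so total bytes per group stay balanced."""
    heap = [(0, index) for index in range(1, groups + 1)]  # (bytes so far, group number)
    heapq.heapify(heap)
    assignment = {index: [] for index in range(1, groups + 1)}
    # Place the largest files first, always into the currently lightest group.
    for name, size in sorted(config_sizes.items(), key=lambda item: (-item[1], item[0])):
        total, index = heapq.heappop(heap)
        assignment[index].append(name)
        heapq.heappush(heap, (total + size, index))
    return assignment


def include_line(group):
    """Include directive for the top-level config of one group (pattern from the slides)."""
    return f"Include: /etc/mrtg/conf.d/{group}/*.cfg"


def cron_line(group, schedule="*/5 * * * *"):
    """Hypothetical cron.d entry that runs MRTG against one group's top-level config."""
    return f"{schedule} root /usr/bin/mrtg /etc/mrtg/mrtg-{group}.cfg"


# Example sizes in bytes (made up for illustration).
sizes = {"a.cfg": 50, "b.cfg": 30, "c.cfg": 20, "d.cfg": 10}
plan = split_by_size(sizes, groups=2)
for group, files in plan.items():
    print(group, files)
    print(include_line(group))
    print(cron_line(group))
```

Each file in a group would then be moved into the matching `/etc/mrtg/conf.d/X/` subdirectory, so that each cron entry only polls its own share of the targets.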
We did some other things, too…
• We installed and offloaded SmokePing
• We created a SmokePing Nagios XI component to increase visibility
of our graphs in our NOC
• We built a portal to SmokePing for a particular client to login and
check device health
• We created a ModGearman Nagios XI component to manage our
servers from a central location
This component keeps
our gateway graphs up
at all times so we can
keep an eye on them,
and then rotates graphs
from other hosts in
each zone so we can
We use a portal that parses the config file for SmokePing hosts, pings
them, and shows current status. It also allows the portal user to ping
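The config-parsing step of such a portal might look roughly like the sketch below. It is not the author's code: it assumes the usual SmokePing Targets layout, where `+` and `++` lines open nested sections and `host = ...` names the address to probe. The sample config and the `smokeping_hosts` function are made up for illustration, and the ping step itself is left out.

```python
import re

SECTION = re.compile(r"^(\++)\s*(\S+)")   # "+ Zone" or "++ Device"
HOST = re.compile(r"^host\s*=\s*(\S+)")   # takes the first token after "host ="


def smokeping_hosts(config_text):
    """Return {section/path: host} for every host entry in a Targets-style config."""
    path = []
    hosts = {}
    for raw in config_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        section = SECTION.match(line)
        if section:
            depth = len(section.group(1))
            # Drop deeper levels left over from the previous section.
            path = path[:depth - 1] + [section.group(2)]
            continue
        host = HOST.match(line)
        if host and path:
            hosts["/".join(path)] = host.group(1)
    return hosts


# Made-up sample using documentation-range addresses.
SAMPLE = """\
*** Targets ***
probe = FPing
+ ZoneA
menu = Zone A
++ Gateway
host = 192.0.2.1
++ Tower1
host = 192.0.2.10
+ ZoneB
++ Gateway
host = 192.0.2.129
"""

hosts = smokeping_hosts(SAMPLE)
for name, address in hosts.items():
    print(name, address)
```

A portal could then ping each address in `hosts` and render the result next to the section path.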
I was tired of having to repeatedly log
in to each ModGearman instance to
tweak something when we were still
getting everything set! So I wrote this
to make my life a little bit easier.
What did we learn?
• Nagios XI can be extended far beyond the default behavior
• Custom Configuration Wizards, Plugins and Components
• Custom MRTG installations and scans used in the Config Wizards
• IO will become an issue, and should be planned for
• How to build a process for creating customizations
• Offload what you can!
What’s next?
• Automating the MRTG Process Splitting
• Releasing a generic and well-documented Configuration Wizard
• Continuing to grow and expand our current installation
Questions?
• Do you have any?