Nagios Conference 2012 - Todd Groten - Monitoring Call of Duty: Elite
Todd Groten's presentation on using Nagios in a dynamically scaling environment
The presentation was given during the Nagios World Conference North America held Sept 25-28th, 2012 in Saint Paul, MN. For more information on the conference (including photos and videos), visit: http://go.nagios.com/nwcna
Published on: Mar 3, 2016
Transcripts - Nagios Conference 2012 - Todd Groten - Monitoring Call of Duty: Elite
Case Study: Monitoring Call of Duty: Elite...or How To Dynamically Scale Monitoring in the Cloud Todd Groten Senior Network Operations Engineer Activision Blizzard / Beachhead Studio [ firstname.lastname@example.org ]
What is CoD:Elite? Call of Duty: Elite is an online service created by the Activision subsidiary Beachhead Studio for the multiplayer portion of the first-person shooter video game Call of Duty: Modern Warfare 3. The service features lifetime statistics across multiple games as well as a multitude of social-networking options. As of August 2, 2012, 12 million players have signed up for the service, 2.3 million of whom are premium paid members. The FPS franchise has over 40 million monthly active users, with players logging a total of more than 1.6 billion hours of online gameplay in Modern Warfare 3.
[The Problem] Dynamically scaling the monitoring we have on our servers in the cloud without trashing the monitoring infrastructure.
The Problem Since July of last year, we have been almost entirely cloud based, and we found the need to automatically scale the environment to meet the load needs of our web service. With the immediate and exponential growth, we encountered a number of issues, one of which was that we had no way of automatically adding servers to our monitoring environment and tagging them with their correct purpose as we scaled up by more than 100 servers at a time.
The Problem (cont’d) Another issue was the polar opposite: we had multitudes of stale server records in monitoring whenever we scaled down by more than 50 servers at a time. With this constant up/down motion, at one point our monitoring environment had more than 900 stale records in the database, all for servers that had already been terminated in the cloud.
[The Analysis] Dynamically scaling the monitoring we have on our servers in the cloud without trashing the monitoring infrastructure.
The Analysis We tried out several monitoring solutions that claimed they could handle cloud-based deployments and exhibited some form of cloud awareness. After deploying most of them on a trial basis, we came to the conclusion that they didn’t have what we needed at the time to solve this extremely difficult server health monitoring / system maintenance issue.
The Analysis (cont’d) Nagios was extensible, but didn’t have an adequate mechanism for the dynamic scalability of the cloud. It was great at monitoring hosts and services once they were added, but not at auto-discovery for public cloud architecture / subnetting. By the end of our analysis, we found that the other monitoring solutions we tried might have had better scale-up ability, but failed when it came to resource assignment and scale-down ability.
[The Solution] Dynamically scaling the monitoring we have on our servers in the cloud without trashing the monitoring infrastructure.
The Solution We found Nagios to be the correct solution and, thanks to its inherent extensibility, modified some of its functionality to meet our needs. Initially, we wrote some scripts to enumerate the IPs and hostnames into a comma-delimited list that we could read into Nagios via the bulk import wizard. We ran this task manually after every forced scale-up, once we had all the boxes accounted for and could grab their IPs.
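As a rough illustration of that first approach, the enumeration step could look something like the Python sketch below. The server names, the addresses, and the hostname-then-address column order are assumptions made for the example; the slides do not show the original scripts or the exact layout the bulk import wizard expected.

```python
import csv
import io


def build_import_csv(servers):
    """Return comma-delimited 'hostname,address' rows, sorted by hostname."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for name, address in sorted(servers.items()):
        writer.writerow([name, address])
    return buf.getvalue()


# Hypothetical inventory; in practice this would come from the cloud provider.
servers = {"web-002": "10.0.1.12", "web-001": "10.0.1.11"}
csv_text = build_import_csv(servers)
print(csv_text, end="")
```

The resulting text would then be pasted or uploaded into the import wizard by hand, which is the manual step that stopped scaling once auto scaling was switched on.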
The Solution (cont’d) This approach worked for a while, but fell short when we turned on the “auto scaling” features we had with our cloud provider. We started to get behind on setting up monitoring and, even worse, we got behind when the arrays scaled down and Nagios alerts were set off for boxes that could no longer be reached because they had been automatically terminated. Another approach we used, which ended up being the correct one, was to create templates of the config files for certain types of servers.
The Solution (cont’d) We then had our deployment architecture go out to the repository, pull down the correct host and service templates for that particular type of server, and fill in all of the variables we had parameterized. The final step was to ship the result over to the Nagios server via SCP, into the Import folder, and then remotely run the reconfigure nagios script to import the templates into XI. The only problem we faced with this method was possibly overrunning the reconfigure step if we had 100+ servers all spinning up at the same time.
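The template step might be sketched as follows, assuming a single host definition per server. The `define host` block uses standard Nagios object syntax, but the variable names, the `<role>-host` template name, and the `<role>-servers` hostgroup name are hypothetical; the actual templates are not shown in the slides.

```python
from string import Template

# Hypothetical parameterized host template; $-prefixed names are filled in
# by the deployment tooling when a server spins up.
HOST_TEMPLATE = Template("""define host {
    use         $host_template
    host_name   $host_name
    address     $address
    hostgroups  $hostgroup
}
""")


def render_host_config(host_name, address, role):
    """Fill in the template for one server of the given role."""
    return HOST_TEMPLATE.substitute(
        host_template=f"{role}-host",
        host_name=host_name,
        address=address,
        hostgroup=f"{role}-servers",
    )


config_text = render_host_config("web-001", "10.0.1.11", "web")
print(config_text, end="")
```

`Template.substitute` raises `KeyError` when a variable is left unfilled, so a half-rendered config would fail on the new server instead of reaching the Import folder.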
The Solution (cont’d) So, we did what any good development house would do: we wrote our own poller/parser. The methodology was simple. Whenever a server wanted to do a reconfigure, we had it check a folder for tag files, each carrying an individual server’s name and, if the process was complete for that file (or for many files, if they were queued to be imported), an extension of “.done”. The poller would check the folder every X minutes, run the reconfigure itself if it found files waiting, and then mark them as “done” once it had imported them successfully.
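One way to picture a single pass of such a poller is the sketch below. It assumes each server drops a `<name>.cfg` file into the import folder and that a sibling `<name>.done` file marks it as imported; the file extensions and the `reconfigure` callable are stand-ins, since the original poller/parser is not shown.

```python
import tempfile
from pathlib import Path


def pending_configs(import_dir):
    """Config files that have no matching '<name>.done' tag file yet."""
    return sorted(
        cfg for cfg in Path(import_dir).glob("*.cfg")
        if not cfg.with_suffix(".done").exists()
    )


def poll_once(import_dir, reconfigure):
    """Run one reconfigure for the whole waiting batch, then tag it as done."""
    pending = pending_configs(import_dir)
    if not pending:
        return []
    if not reconfigure():  # stand-in for the real reconfigure step
        return []  # leave the files untagged so the next pass retries them
    for cfg in pending:
        cfg.with_suffix(".done").touch()
    return [cfg.name for cfg in pending]


calls = []


def fake_reconfigure():
    calls.append("reconfigure")
    return True


with tempfile.TemporaryDirectory() as tmp:
    for name in ("web-001.cfg", "web-002.cfg"):
        (Path(tmp) / name).write_text("define host { }\n")
    first = poll_once(tmp, fake_reconfigure)
    second = poll_once(tmp, fake_reconfigure)

print(first, second, len(calls))
```

A real poller would call `poll_once` from a loop or cron job every X minutes. The point of the design is that 100+ servers arriving together trigger one reconfigure instead of 100+ overlapping ones.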
The Solution (cont’d) For server spin down, the process became a bit more complex, as we found there wasn’t really an easily exposed method for database manipulation to remove the old entries. After crawling through all of the code, I found some scripts in the XI scripts folder that were apparently the scripts Nagios used to do exactly what we wanted, but only on an individual basis.
The Solution (cont’d) So, again, we coded our own “patch” to handle multiple service deletions from a server decommissioning script. After many trials to get it completely right, we eventually took the approach of grabbing the server’s name and running a remote mysql query against the nagiosql database to grab all of the service and host IDs for that particular server. We then had the decommissioned server itself call out to the XI server and run the delete services script, looping through the snagged IDs, then finally run the delete hosts script and perform the reconfigure action to set everything in stone.
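The order of operations in that decommissioning flow can be sketched as below. The four callables are hypothetical stand-ins for the remote mysql lookup, the XI delete services and delete hosts scripts, and the reconfigure action; the in-memory `fake_db` only exists so the example runs without a Nagios XI server.

```python
def decommission(host_name, find_ids, delete_service, delete_host, reconfigure):
    """Remove every service for a host, then the host, then reconfigure once."""
    host_id, service_ids = find_ids(host_name)
    for service_id in service_ids:
        delete_service(service_id)  # services go first, one ID at a time
    delete_host(host_id)            # the host is removed after its services
    reconfigure()                   # a single reconfigure applies the changes


# Hypothetical stand-in for the nagiosql lookup: host name -> (host ID, service IDs).
fake_db = {"web-001": (7, [101, 102])}
log = []

decommission(
    "web-001",
    find_ids=lambda name: fake_db[name],
    delete_service=lambda sid: log.append(f"delete service {sid}"),
    delete_host=lambda hid: log.append(f"delete host {hid}"),
    reconfigure=lambda: log.append("reconfigure"),
)
print(log)
```

Passing the actions in as callables keeps the ordering logic testable on its own; in a real script they would wrap the remote query and the calls out to the XI server.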
[The Last Words] Dynamically scaling the monitoring we have on our servers in the cloud without trashing the monitoring infrastructure.
The Last Words After creating all of this custom functionality, I ran these methods and snippets by Mike Guthrie, just for a gut-check to make sure we wouldn’t corrupt anything. Once we got his blessing, we scaled this up to production level and started to spin servers up and down in the hundreds without failure, as our load demanded.
The Last Words Eventually, Mike wrote me back and told me that the functions we created were going to be part of the next patch of NagiosXI, after a bit of streamlining. So, if you’re looking for a reason to upgrade to one of the latest patch levels for 2011XI, and you need better cloud server management, you’ll find this new auto-deployment function in NagiosXI 2011R3.2.
Present day architecture