Monitoring an MMO
I’ve been working on a free-to-play MMO which has been “officially” live since last April, and things have been going well: steady player growth, a well-received game, and all the important graphs “up and to the right.” Part of my job involves detecting problems before they become serious, and fixing them when they inevitably do. So, there are two questions: “Is there a problem in the game?” and “What is causing the problem?”
When trying to debug something on our development and test clusters, typically you can tail log files. We have a Python script that can monitor the communication between various parts of the game and pretty-print it, with color to highlight “this is a problem!” Attaching a debugger to a running process is also not uncommon. However, looking at logs and bus traffic in real time on a production environment gives you this neat “Matrix-y” experience. Attaching a debugger to a production process (assuming you could, which you can’t) would get you smacked with a rolled-up newspaper. “Bad Developer! No treat!” So, what can you do?
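A minimal sketch of the kind of colorizing pretty-printer described above (the actual script’s message format and highlighting rules are assumptions here):

```python
import re

# ANSI escape codes for terminal colors.
RED = "\x1b[31m"
YELLOW = "\x1b[33m"
RESET = "\x1b[0m"

def colorize(line):
    """Highlight log lines that look like problems."""
    if re.search(r"\b(ERROR|FATAL)\b", line):
        return f"{RED}{line}{RESET}"      # "this is a problem!"
    if re.search(r"\bWARN(ING)?\b", line):
        return f"{YELLOW}{line}{RESET}"
    return line

if __name__ == "__main__":
    # In practice you'd pipe a tailed log through stdin:
    #   tail -f game.log | python colorize.py
    for line in ["INFO player joined", "ERROR zone server unreachable"]:
        print(colorize(line))
```

Pointing this at a dev-cluster log is fine; pointing it at production is where the “Matrix-y” firehose problem kicks in.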
When you’ve got clusters full of machines, using Nagios to monitor things is an obvious solution. Beyond making sure the power is on and other sysadmin things, we’ve written other checks to see if the login process is working, the parts are working together, and to automate typical in-game functions. For example, if Nagios can’t successfully log into the game and do basic game activity, then alerts happen.
Metrics for EVERYTHING
Anything that happens in game has metrics reporting tied to it, generating piles of data constantly. We use Cacti to visualize game activity. An example metric is concurrent users, or CCU. We graph how many people are in the game over time, which when things are healthy should be a nice smooth curve climbing to peak game hours, then descending nicely through the night.
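Cacti typically gathers numbers like this through a “data input method”: a script whose stdout is one or more space-separated `name:value` pairs. A sketch of a CCU poller in that shape, where `get_ccu()` is a hypothetical stand-in for however the game reports its concurrent-user count:

```python
def get_ccu():
    """Hypothetical: query the game for the current concurrent-user count."""
    return 1234  # placeholder value

def format_cacti_output(fields):
    """Render metric fields as Cacti's space-separated name:value pairs."""
    return " ".join(f"{name}:{value}" for name, value in fields.items())

if __name__ == "__main__":
    # Cacti polls this script on a schedule and graphs the values over time.
    print(format_cacti_output({"ccu": get_ccu()}))
```

Once the samples are flowing, the smooth day/night curve described above falls out of the graphing for free.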
We can tell by sight if the game looks healthy or not – if the CCU graph is jaggy, has a sudden drop or spike, or drops to zero, then we know that something is wrong. Typically Nagios alerts accompany the graphs, giving more data points on where to look. But this has also pointed out areas where a Nagios check was missing or wasn’t working as intended.
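The eyeball test can be roughly automated. A crude sketch, assuming a simple threshold on the fractional change between consecutive CCU samples (the threshold value is an assumption, not our real tuning):

```python
def find_anomalies(samples, threshold=0.25):
    """Flag sample indices where CCU jumps or drops by more than
    `threshold` (as a fraction) versus the previous sample - a rough
    stand-in for noticing a jaggy or cliff-shaped graph by eye.
    """
    anomalies = []
    for i in range(1, len(samples)):
        prev, cur = samples[i - 1], samples[i]
        if prev == 0:
            if cur > 0:
                anomalies.append(i)  # coming back from zero is also notable
            continue
        if abs(cur - prev) / prev > threshold:
            anomalies.append(i)
    return anomalies

# A healthy evening curve passes quietly; a crash-and-recover pattern
# gets flagged at the drop and at the rebound:
find_anomalies([1000, 1020, 990, 400, 950])  # → [3, 4]
```

In practice this kind of rule would feed a Nagios check rather than replace the graphs, which still catch shapes no threshold anticipates.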
When a player gets an error in game, the error dialog box gives them the opportunity to submit the error details back to us. If we see a spike in user-reported errors through this or other customer service means, we know we have something of interest to look for.
The game server components make use of log4j and similar logging frameworks. Anything that you’d want to watch happening in game needs to be aggressively logged. All components are configured so that operations can change the log level on the fly. That’s still quite a bit of data across many machines, though, so all that information is run through Splunk to be indexed and made searchable. This gives us a great tool for searching through log data, examining trends, or watching selected activity in real time. Unfortunately it is very expensive, so we are selective about the data that passes through it.
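The servers themselves use log4j, but the on-the-fly level change is easy to illustrate with Python’s standard logging module, which works the same way: loggers are addressed by name and their level can be changed at runtime without a restart. The component name here is hypothetical:

```python
import logging

def set_log_level(name, level_name):
    """Change one component's log level at runtime - the Python-logging
    analogue of the on-the-fly log4j level changes described above."""
    logging.getLogger(name).setLevel(getattr(logging, level_name.upper()))

# "game.combat" is a made-up component name for illustration.
set_log_level("game.combat", "WARNING")  # normal production verbosity
set_log_level("game.combat", "DEBUG")    # ops digs into a live issue
```

The payoff is exactly the workflow described above: crank one component up to DEBUG while chasing a problem, then drop it back down so Splunk ingestion costs stay sane.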