It seems that most mmorpg games have some regular server maintenance, some every day, some once a week. What is it that they actually have to do, and why is it necessary ?
If you start with such a project what can you do to avoid this ?
I suspect that they're deploying the latest version of their code, which requires restarting the application (and, hopefully, running some tests before re-enabling access). From that point of view, it's more of a StackOverflow problem and less of a ServerFault one.
I think it's possible to create a hot-patching system, but it would necessarily be incredibly complicated. From what I understand, an MMO server "application" consists of several different components --
Login server -- Handles authentication and acts as a "hub" between gameplay servers. Once a client is in-game they no longer interact with the login server. In such a system you could apply patches and restart the login server without interfering with gameplay (though you'll have a period of time where people won't be able to log in).
Gameplay servers -- Clusters of machines grouped into logical independent units ("worlds", etc). It's assumed that each gameplay cluster uses some kind of internal communication protocol to keep state consistent across its machines, so you're probably going to have to patch each cluster all at once. One possible way to do this is to patch a warm failover; you'd then need to be able to both transfer live state to the failover and redirect clients to it.
Database servers -- Some kind of persistent datastore, like an RDBMS. Hopefully you're not making changes to the datastore that often. Presumably each gameplay server/cluster has an independent datastore. You might be able to use the same trick with a warm failover (and tell the gameplay servers to disconnect, wait for the old and failover databases to sync, then reconnect to the failover) but that seems pretty risky to me.
All of the above cases add an incredible amount of complexity to an already complex system and introduce a bunch of places where a code failure can cause data loss or corruption.
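The warm-failover handoff sketched above might look something like the following. This is a toy illustration only; the `Cluster` class, its fields, and the three-step sequence are invented for the example, not taken from any real MMO codebase.

```python
import copy

class Cluster:
    """Stand-in for a gameplay cluster (hypothetical)."""
    def __init__(self, version):
        self.version = version
        self.accepting_logins = True
        self.world_state = {}

    def drain(self):
        # Stop accepting new sessions; existing players keep playing.
        self.accepting_logins = False

def failover_patch(live, patched_standby):
    """Move traffic from the live cluster to an already-patched warm standby."""
    live.drain()                                            # 1. no new logins on the old cluster
    patched_standby.world_state = copy.deepcopy(live.world_state)  # 2. sync state to the standby
    patched_standby.accepting_logins = True                 # 3. open the standby for traffic
    return patched_standby

live = Cluster(version="1.0")
live.world_state = {"player_42": {"zone": "ironforge"}}
standby = Cluster(version="1.1")   # patched ahead of time, warm, no players yet

active = failover_patch(live, standby)
```

Even in this simplified form you can see where the risk lives: step 2 is a point-in-time copy, and anything that mutates `live.world_state` between the copy and the cutover is lost.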
Another solution is to use a language which is designed for 100% uptime and has built-in capabilities for hotpatching running code. Erlang is a good choice (hotpatching example), and Java has similar functionality.
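Erlang's hot code loading swaps a module inside a running VM; Python's `importlib.reload` is a much weaker analogue, but it illustrates the idea: patch the code on disk, reload it, and callers pick up the new behaviour without a process restart. The module name `combat` and its contents are made up for the demo.

```python
import importlib
import os
import sys
import tempfile

# Write version 1 of a hypothetical game-logic module to disk.
moddir = tempfile.mkdtemp()
modpath = os.path.join(moddir, "combat.py")
with open(modpath, "w") as f:
    f.write("def damage(base):\n    return base * 2\n")

sys.path.insert(0, moddir)
import combat

old = combat.damage(10)          # behaviour under the old rules

# "Deploy" the patched module in place.
with open(modpath, "w") as f:
    f.write("def damage(base):\n    return base * 3\n")
st = os.stat(modpath)
os.utime(modpath, (st.st_atime, st.st_mtime + 10))  # ensure the loader sees a newer file

combat = importlib.reload(combat)
new = combat.damage(10)          # behaviour under the new rules, no restart
```

The hard part in a real server isn't the reload itself, it's the live objects created by the old code: Erlang's upgrade machinery has explicit support for migrating process state across versions, which `reload` does nothing about.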
No one else has experience actually running something like this? Huh.
There are several reasons that bridge both code and systems. First, remember that most of the current 'big' MMO engines were programmed several years ago, and despite graphics and technology upgrades since, still depend on the way many of these systems were written around 2000. EVE Online, for instance, still runs on one huge Microsoft SQL Server instance, which is why they're always trying to pull more out of it by upgrading hardware.
An example of an improvement since WoW and EVE got started is the work done on distributed data-processing and key/value systems like Google's MapReduce (and its open-source implementation, Hadoop), fast queue services with acknowledged delivery (Amazon SQS), and other "cloud"-oriented technologies.
I have the most experience with EVE (I'm more of a lasers guy than a battleaxes guy), so some of these examples are more EVE-oriented.
As far as Systems reasons go:
As far as Software reasons go:
Running an economy with both closed and open loops is one problem for MMO operators -- if you don't believe me, read some of the academic papers that have been written about game economies and some of the studies of older games like Ultima Online that had relatively primitive economies. The analysis needed to replenish the open loops and to identify cheating and other negative economic activity has to happen offline, against a snapshot of the data, and sometimes that snapshot can only be taken while the database is entirely locked.
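A toy illustration of the kind of offline analysis described above: compare two consistent snapshots of player balances and flag implausible growth between them. The threshold, player names, and numbers are all invented for the example.

```python
def flag_suspicious(snapshot_t0, snapshot_t1, max_gain=100_000):
    """Return players whose currency grew by more than max_gain between snapshots."""
    flagged = []
    for player, before in snapshot_t0.items():
        after = snapshot_t1.get(player, before)
        if after - before > max_gain:
            flagged.append(player)
    return flagged

# Two snapshots taken during consecutive maintenance windows.
t0 = {"alice": 5_000, "bob": 12_000, "mallory": 1_000}
t1 = {"alice": 7_500, "bob": 15_000, "mallory": 900_000}  # duplication exploit?

suspects = flag_suspicious(t0, t1)
```

The point of using snapshots is that both sides of the comparison are internally consistent; running the same query against a live, constantly-mutating database gives you numbers from slightly different moments in time.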
If you'll note, EVE's maintenance happens when it's noon in England, where the primary datacenter is.
I suspect that the total time Blizzard (I'm inferring Blizzard, given that you're posting your question on a Tuesday morning) quotes for maintenance is for the entire cluster; not every server takes that long to perform work on.
While it might be possible to bring individual servers back up more quickly, that would elicit cries of favouritism towards players whose realms happened to fall earlier in the schedule. As such, they keep everything down until all the work is done; with hundreds of realms to work on, they probably do much of the work in parallel, but still serialize a final check before bringing things back online. If you're doing a hardware upgrade, this is probably serialized across as many data centres as they have.
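The schedule described above -- parallel per-realm work, then one serialized gate before anything comes back online -- can be sketched as follows. Realm names and the patch step are placeholders; this only shows the shape of the coordination.

```python
from concurrent.futures import ThreadPoolExecutor

realms = ["stormrage", "tichondrius", "argent-dawn", "ragnaros"]
patched = {}

def patch(realm):
    # ... apply updates, run per-realm smoke tests (elided) ...
    patched[realm] = True
    return realm

# Phase 1: maintenance runs on all realms in parallel.
with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(patch, realms))

# Phase 2: the final check is serialized -- no realm is re-enabled
# until every realm has passed, avoiding "my realm is still down" complaints.
all_ok = all(patched.get(r) for r in realms)
online = realms if all_ok else []
```

The design choice here is fairness over availability: a single failed realm holds the whole fleet offline, which is exactly the trade-off the answer describes.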
As to why they perform the maintenance, some of it might just be a performance reboot. While it would be great if such reboots weren't required, the cost of doing so vs the impact of not doing so may be directing their choice here.
As to why they can't cluster the processes and perform rolling maintenance: what little people know of the WoW infrastructure suggests that while multiple machines provide service for each realm (i.e. one for the world, one for instances and raids, one for battlegrounds, etc.), they don't use a state-shared active-active process setup. There is no sharing of live state, only of persistent data via a database.
In the end, the mechanics of providing a stateful online service to that large a subscriber base challenges some of the best practices that we might espouse when talking about a website or other traditional internet-based service.
Some of the more recent extended downtimes in EVE Online have been about installing new hardware, like a faster SAN. While one can technically move the bulk of the data by creating a new filegroup on the new drive and then emptying the main one, that would have resulted in an extended period of reduced performance due to constant I/O. So they opted to detach the 1.1TB database and move it in one go.
The answer to this question also depends on the specific application. For example, a server handling a specific star system cannot be hotswapped without disrupting gameplay, so downtime is used to reassign more powerful servers to potential hotspots. In addition, the ownership calculations (sovereignty) of star systems are performed. These depend on tens of different variables, all of which can change depending on player actions. Needless to say, doing that live can cause excessive locking and/or other concurrency issues. But addressing those is best left to stackoverflow.
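One way to see why such calculations are pushed into downtime: work from a frozen snapshot so the many inputs can't change mid-calculation. The variables and weighting below are entirely made up; only the snapshot-then-compute pattern is the point.

```python
import copy

# Live, constantly-mutating game state (illustrative fields only).
live_state = {
    "system_jita": {"structures": 3, "upgrades": 2, "contested": False},
}

# Freeze a consistent view; the live dict could keep changing underneath.
snapshot = copy.deepcopy(live_state)

def sovereignty_index(system):
    """Toy scoring over the frozen snapshot -- no locks on live data needed."""
    s = snapshot[system]
    return s["structures"] * 10 + s["upgrades"] * 5 - (50 if s["contested"] else 0)

idx = sovereignty_index("system_jita")   # 3*10 + 2*5 - 0
```

Computing against the snapshot trades freshness for consistency, which is acceptable when the results only need to be published once per day anyway.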
Presumably something you couldn't deal with via clustering/load-balancing, such as major DB schema changes.
In a recent topic, How often should I reboot linux servers, another good point was mentioned: verifying that everything starts up properly on a reboot or after any (major) configuration change.
A simple upgrade of hardware (or hardware replacement) is also presented as "server maintenance" by MMORPG games. So trivial we often forget about it.
I have implemented an MMO architecture in Erlang which supports hot code upgrades and distribution. For example, one "GamePlay Server" can run across an arbitrary number of machines; if one needs a hardware upgrade, its objects can be transferred (in realtime) to the other machines. This enables upgrades to software and hardware without any downtime.
You can check out my site at http://www.next-gen.cc.
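The realtime object transfer described in this answer might be sketched as below. The real system is Erlang; this Python version, with its `Node` class and pickle-as-network-transport, is purely illustrative.

```python
import pickle

class Node:
    """Stand-in for one machine hosting part of a gameplay server."""
    def __init__(self, name):
        self.name = name
        self.objects = {}

def migrate(obj_id, src, dst):
    """Serialize an entity on the draining node and rehydrate it elsewhere."""
    blob = pickle.dumps(src.objects.pop(obj_id))  # stand-in for a network send
    dst.objects[obj_id] = pickle.loads(blob)

a, b = Node("a"), Node("b")
a.objects["npc_7"] = {"hp": 120, "pos": (10, 4)}

migrate("npc_7", a, b)   # node `a` can now be taken down for its upgrade
```

In a real system the tricky parts are the ones elided here: pausing the object so it isn't mutated mid-transfer, and repointing every client and inter-server reference at its new home.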
I'm led to believe the maintenance window also allows for routine hardware replacement to ensure components don't fail.