We all know that rebooting the home computer, router, backup device, DVR, or iPhone often solves mystery problems. (Have you noticed how frequently you’re rebooting your once-was-reliable iPhone?)
This works in large distributed systems too. If you’ve got buggy code, a memory leak, or a shaky operating system, a reboot often clears the problem there as well. I’ve seen this in practice: periodic, scheduled reboots of boxes to reduce memory use, reduce CPU load, or simply return each machine to a known state.
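A "periodic, scheduled reboot" policy can be sketched in a few lines. This is a minimal illustration, not a production tool: the thirty-day threshold, the five-minute grace period, and the `shutdown` invocation are all assumptions for the sketch, and a real deployment would run something like this from a scheduler and stagger reboots across the pool so the service never loses too much capacity at once.

```python
import subprocess

# Hypothetical policy: no box should run longer than thirty days.
MAX_UPTIME_SECONDS = 30 * 24 * 3600


def should_reboot(uptime_seconds: float, max_uptime: float = MAX_UPTIME_SECONDS) -> bool:
    """Return True once the box has been up longer than the policy allows."""
    return uptime_seconds > max_uptime


def check_and_reboot() -> None:
    """Read this machine's uptime (Linux-specific) and schedule a reboot if needed."""
    with open("/proc/uptime") as f:
        uptime = float(f.read().split()[0])
    if should_reboot(uptime):
        # Requires root; the five-minute delay gives the box time to drain traffic.
        subprocess.run(["/sbin/shutdown", "-r", "+5", "scheduled maintenance reboot"])
```

The point of the threshold check, rather than a fixed calendar date, is that a freshly replaced or recently rebooted box isn’t cycled again needlessly.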
Indeed, I’ve seen plenty of problems occur when this doesn’t happen. A system remains untouched for a while, and things go south. After a problem has occurred, I’ve heard quite a few folks say “we hadn’t rolled out code to that pool for a while” or “that box hadn’t been rebooted for a few months”. In many cases, the issue was the gentle creep of increasing CPU or memory use.
Perhaps it’s good practice to ensure boxes are rebooted periodically. It’s probably wise when the machines are out-of-sight and out-of-mind: those less critical, less monitored, sometimes unowned services. And it’s not necessarily a bad thing: one of the wonderful properties of web services is that they don’t have to be perfect, since you’re in control and the software’s running on your choice of hardware (unless you’re on someone’s virtual machine in some opaque cloud).