Rebooting: a trick to avoid bugs

We all know that rebooting the home computer, router, backup device, DVR, or iPhone often solves mystery problems. (Have you noticed how frequently you’re rebooting your once-was-reliable iPhone?)

This works in large, distributed systems too. If you’ve got buggy code, a memory leak, or a shaky operating system, rebooting the machines can paper over the problem. I’ve seen this in practice: periodic, scheduled reboots of boxes to reduce memory use, reduce CPU load, or generally return the system to a known state.

Indeed, I’ve seen plenty of problems that occur when this doesn’t happen. A system remains untouched for a while, and things go south. After a problem has occurred, I’ve heard quite a few folks say “we hadn’t rolled out code to that pool for a while” or “that box wasn’t rebooted for a few months”. In many cases, the issue was the gentle creep of increasing CPU or memory use.
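To make the "gentle creep" concrete, here is a minimal sketch of how you might spot it from monitoring data. It assumes you already collect (hours, memory in MB) samples per box; the function names and the threshold are hypothetical, and the trend is a plain least-squares slope rather than anything a particular monitoring product does.

```python
def growth_per_hour(samples):
    """Least-squares slope of (hours, value) samples, in value units per hour."""
    n = len(samples)
    if n < 2:
        return 0.0  # not enough data to estimate a trend
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    return cov / var if var else 0.0


def is_creeping(samples, threshold_per_hour):
    """True when the fitted growth rate exceeds the (assumed) threshold."""
    return growth_per_hour(samples) > threshold_per_hour


# Hypothetical readings: memory in MB sampled once an hour.
leaky = [(0, 100), (1, 102), (2, 104), (3, 106)]
steady = [(0, 100), (1, 100), (2, 100), (3, 100)]
print(is_creeping(leaky, threshold_per_hour=1.0))
print(is_creeping(steady, threshold_per_hour=1.0))
```

A slope over a long window is deliberately crude: it ignores daily traffic cycles, so in practice you would probably compare like-for-like periods before flagging a box.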

Perhaps it’s good practice to ensure boxes are rebooted periodically. It’s probably wise when the machines are out-of-sight and out-of-mind: those less critical, less monitored, sometimes unowned services. It’s perhaps not even a bad thing: one of the wonderful properties of web services is that they don’t have to be perfect, since you’re in control and the software’s running on your choice of hardware (unless you’re on someone’s virtual machine in some opaque cloud).
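If you did want to make periodic reboots routine, one way to think about it is as a small selection problem: which boxes have been up too long, and how many can you afford to take out at once? The sketch below is only an illustration. The host names, the 30-day limit, and the cap on the share of the pool rebooted per pass are all assumptions, and actually issuing the reboot is left out.

```python
from datetime import datetime, timedelta


def hosts_due_for_reboot(last_boot, now, max_uptime, max_fraction=0.1):
    """Pick the hosts to reboot in this pass, longest-running first.

    last_boot:    dict mapping host name -> datetime of its last boot
    max_uptime:   timedelta after which a host counts as stale
    max_fraction: cap on the share of the pool rebooted in one pass,
                  so the pool keeps most of its capacity
    """
    stale = [h for h, booted in last_boot.items() if now - booted > max_uptime]
    stale.sort(key=lambda h: (last_boot[h], h))  # oldest boot first, name breaks ties
    limit = max(1, int(len(last_boot) * max_fraction))
    return stale[:limit]


# Hypothetical pool of four web servers.
now = datetime(2024, 1, 31)
pool = {
    "web-01": datetime(2024, 1, 30),
    "web-02": datetime(2023, 10, 1),
    "web-03": datetime(2023, 12, 1),
    "web-04": datetime(2024, 1, 29),
}
print(hosts_due_for_reboot(pool, now, timedelta(days=30), max_fraction=0.25))
```

Capping each pass means a long-neglected pool gets refreshed over several runs rather than all at once, which matters most for exactly the unowned services nobody is watching.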

6 thoughts on “Rebooting: a trick to avoid bugs”

  1. Subbu Allamaraju (@sallamar)

    Good point on slow failures. Failures like memory leaks will always be there, and process recycling is a reasonable preventative measure. However, IMO, getting the owners of those processes prepared for such failures is the right first step. What would the owner of an app answer if you asked “what happens if I inject a memory leak into your app and let it run for two weeks”?

  2. Ravi Aringunram

    Rebooting is a perfectly viable option as long as we have other systems that capture state, so we can figure out the issues that creep into the system. That way we also get to identify root causes and fix them. Relying on periodic reboots to solve some of the issues described is going to lead to a sloppy and crappy code base, and crappy code bases will eventually cause us more harm in the long term. As you say, running web services gives us more control. Having systems that monitor health and maybe “auto-file bugs”, augmented with periodic reboots, is a good mix.

  3. Hugh E. Williams Post author

    Couldn’t agree with you more, Ravi. But sometimes that isn’t easy, particularly when the issues are in operating systems or hardware drivers or other parts of the system that aren’t built by the engineering team that writes the software that runs on the platforms. But, in general, couldn’t agree more: I’m suggesting this as a backstop, not the way of “running the business”.

  4. Stephen Green

    You’re right of course, and as others have already pointed out, many other ‘issues’ come to mind. I recently had to repair a couple of ‘donated’ PCs for a school. Needless to say, these kept me busy for a while. It turns out fresh installs with all the updates, plus RAM upgrades, were the only real way to get some use from them. After that I contacted school board members and asked them to stop accepting used PCs.

  5. Beth Cherkowsky

    I’ve always been told that operating systems and various software “degrade” over time and that I should reboot periodically “just for the fun of it”. It does work, so now I’m curious: how often does eBay “reboot” its machines?
