Whither pagers?

Earlier this week we had a service outage. The proper chain of events would have been:

  • 00:00:00 Server problem
  • 00:00:03 Monitor processes notice problem, send page to admin’s phone
  • 00:00:10 Phone rings with new message
  • 00:00:30 Admin logs in to server, fixes problem
  • 00:01:00 Problem resolved

But what happened was:

  • 00:00:00 Server problem
  • 00:00:03 Monitor processes notice problem, send page to admin’s phone
  • 00:00:10 T-Mobile doesn’t deliver the message

This started around 3am California time, which is why none of the PBwiki team noticed it independently of the SMS alert mechanism. What should have been an isolated transient, simple to resolve and not user-visible, turned into a cascade of unpleasant timeouts that caused the service to slow and eventually halt. We’ve done an extensive internal examination of what happened, and we’re changing some technology, adding some automated checks, and doing a few procedural things more intelligently.

The main process change is something that is probably old hat for old-school ops people: the absence of a page alert is not an indication of systemwide health. We’ve deployed a lot of new infrastructure in the last few weeks, and I’d been getting occasional pages for a while, but none for the prior day or two. I’ve set up the daily equivalent of the Tuesday-at-noon air raid siren test, in which the absence of a message every morning is itself a problem. We’ve also got independent Nextel phones for on-call ops folks, so there are now several routes for the alarm pages to take, plus that funny push-to-talk thing so we can annoy one another at all times.
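
For readers who want to try the same two ideas, here is a minimal sketch in Python. It is not the actual PBwiki setup: the function names, the 25-hour grace window, and the idea of a route being a callable that returns True on delivery are all assumptions made for illustration. The first function treats a missing daily test page as a problem in its own right; the second sends an alert down every available route so that one carrier failing does not silence the page.

```python
from datetime import datetime, timedelta
from typing import Callable, Iterable, Optional


def heartbeat_overdue(last_seen: Optional[datetime],
                      now: datetime,
                      max_gap: timedelta = timedelta(hours=25)) -> bool:
    """Return True when the expected daily test page has not arrived in time.

    max_gap is a hypothetical grace window: one day plus an hour of slack.
    """
    if last_seen is None:
        # No test page has ever been recorded, which is itself a problem.
        return True
    return now - last_seen > max_gap


def send_via_routes(message: str,
                    routes: Iterable[Callable[[str], bool]]) -> int:
    """Send the message down every route; return how many reported delivery.

    Each route is a hypothetical callable (one per carrier or device) that
    returns True when it believes the page was delivered.
    """
    delivered = 0
    for route in routes:
        try:
            if route(message):
                delivered += 1
        except Exception:
            # A failing carrier must not stop the remaining routes.
            continue
    return delivered


if __name__ == "__main__":
    now = datetime(2000, 1, 2, 9, 0)  # arbitrary example timestamp
    last_test_page = now - timedelta(days=2)
    if heartbeat_overdue(last_test_page, now):
        print("No test page since", last_test_page, "- check the paging path")
```

A return value of 0 from send_via_routes means no route claimed delivery, which is the case worth escalating by some other means, such as a phone call.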

Published by pbwikinathan

I'm the CTO of PBworks, Inc. We help organizations work better as teams with their clients and partners.

4 thoughts on “Whither pagers?”

  1. 7/26, 11:15 EST I keep getting the “slow down” message for robots – at this time, I simply can’t access our wiki at all. Is there an “event” at pb wiki?

  2. I’ve replied over email as well but here’s some data for reference —

    Your browser sends us this User-Agent string, which our software classifies as being a likely robot:

    “Mozilla/5.0 (000000000; 0; 00000 000 00 0; 00000; 0000000000) 00000000000000 000000000000000”

    Do you have any idea why that would be sent instead of something more common such as:

    “Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en-US; rv: Gecko/20070515 Firefox/”

  3. Howdy! This is my first comment here so I just wanted to give a
    quick shout out and tell you I really enjoy reading your articles.
    Can you recommend any other blogs/websites/forums that deal with
    the same subjects? Thanks!
