Monday, March 21, 2011

Domino is down, not again I said

Funny how luck does find you and rarely the other way around.

I was driving out to a client site to literally plug in an external USB drive (don't ask; I tried to have the people in the office do it, but the server was not having it), so off I went.

5 minutes away I get a call that mail and Sametime are down. Odd, I thought, staff has been there an hour and they only call now? I said I was on my way already and would see them in a few minutes. They must think I am a mind reader.

Naturally I get to the office and the server console is flying by with emails and such. BUT Traveler is showing connection errors, as if the very visibly running server were down.

Internet access was down. Sounds like a T-1 issue, or perhaps the Domain Controller died? Checking around, I saw the firewall had its test light on; good old Sonicwall was letting me know something had happened. When the test light is on, it means it went into SAFE MODE, which is a euphemism for "nothing is going to work now, go get lunch and leave me alone".

OK, turning it off and on didn't change anything. Cutting its power, still nothing. The only way to resolve this is to kill power to the Sonicwall and to the routers cascaded off it. It seems the Sonicwall does not appreciate repeated attempts to flood it with invalid packets. Who knew :-)

Starting the Sonicwall back up first, then the 2 routers, cleared the issue.

In the meantime, I had to reboot a bunch of servers that had invalid DNS entries.

So the problem we can now resolve, but what caused it to fail?

Best guess so far is that a power surge or failure affected either the Sonicwall unit or the Domain Controller and DNS server. No matter how many UPS units we use or what is plugged in where, some of the servers seem to be highly sensitive to fluctuations in power. I didn't want to go to a heavy-duty UPS but may be forced to. For the few times a year this happens, though, they can live with it.

Domino gets blamed when mail, Sametime or Traveler is not working, but as the client said, if they can't get to anything from the outside, they know it's the firewall or the T-1 that is down, and that makes them MUCH happier than thinking Domino is down.
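The client's rule of thumb can be roughed out as a quick triage script. This is only a sketch of the idea, not what I actually ran that day: the function names, the hostname `mail.example.com` and the outside address `8.8.8.8` are all made up for illustration, and 1352 is simply the standard Notes/Domino client port. It checks the outside world first, then DNS, and only then points a finger at Domino.

```python
import socket


def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections, timeouts and lookup failures
        return False


def dns_resolves(name: str) -> bool:
    """Return True if the local resolver can turn the name into an address."""
    try:
        socket.gethostbyname(name)
        return True
    except OSError:
        return False


def classify_outage(outside_ok: bool, dns_ok: bool, domino_ok: bool) -> str:
    """Blame the outermost broken layer first, Domino last."""
    if not outside_ok:
        return "firewall or T-1"
    if not dns_ok:
        return "DNS or domain controller"
    if not domino_ok:
        return "Domino"
    return "all clear"


if __name__ == "__main__":
    # Hypothetical values; substitute your own mail server and outside host.
    mail_host = "mail.example.com"
    outside_ok = tcp_reachable("8.8.8.8", 53)   # IP literal, so no DNS needed
    dns_ok = dns_resolves(mail_host)
    domino_ok = tcp_reachable(mail_host, 1352)  # Notes/Domino client port
    print(classify_outage(outside_ok, dns_ok, domino_ok))
```

Checking from the outside in is deliberate: if the firewall is in safe mode, the DNS and Domino checks will fail too, and their results mean nothing.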

In fact, aside from these outages, Domino is usually up for about 40-60 days before I pull it down for OS maintenance. Still trying to train the users it's not Domino, but at least the boss gets it.

2 comments:

  1. In my experience the problem usually isn't a UPS, but rather a piece of trunk-level network hardware (internal or possibly external) that has a bad - or no - UPS itself.

  2. Erik, yep, that's what we saw also, but the hard part is finding the bad one. All are on one form of UPS or another. Thinking of getting some small units just for the DC itself, nothing else, as that is the main one that breaks everyone.
    One long weekend we will need to rewire the UPS units; maybe we have a balance issue there.
