On Jan. 15, 1990, around 60,000 AT&T long-distance customers tried to place long-distance calls as usual — and got nothing. Behind the scenes, the company’s 4ESS long-distance switches, all 114 of them, kept rebooting in sequence. AT&T assumed it was being hacked, and for nine hours, the company and law enforcement tried to work out what was happening. In the end, AT&T uncovered the culprit: an obscure fault in its new software.

Here’s how the switches were supposed to work: If one switch gets congested, it sends a “do not disturb” message to the next switch, which picks up its traffic. The second switch resets itself to keep from disturbing the first switch. Switch 2 checks back on Switch 1, and if it detects activity, it does another reset to reflect that Switch 1 is back online. So far, so simple.

The month before the crash, AT&T tweaked the code to speed up the process. The trouble was, things were too fast. The first switch to overload sent two messages, one of which hit the second switch just as it was resetting. The second switch assumed that there was a fault in its CCS7 internal logic and reset itself. It put up its own “do not disturb” sign and passed the problem on to a third switch.

The third switch also got overwhelmed and reset itself, and so the problem cascaded through the whole system. All 114 switches in the system kept resetting themselves, until engineers reduced the message load on the whole system and the wave of resets finally broke.
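To make the timing problem concrete, here is a toy simulation of the cascade described above. It is only a sketch and not AT&T's actual code: the ring layout, the tick-based clock, and the names `simulate`, `gap_ticks`, and `update_ticks` are all assumptions made up for illustration. A recovering switch sends two messages to its neighbor; if the second one lands while the neighbor is still processing the first, the neighbor treats it as a fault and resets too.

```python
import heapq

def simulate(num_switches, gap_ticks, update_ticks=2, max_ticks=20):
    """Toy model of the reset cascade; returns the reset count per switch.

    gap_ticks    -- hypothetical spacing between a recovering switch's two messages
    update_ticks -- hypothetical time a neighbor needs to process one message
    """
    busy_until = [-1] * num_switches  # tick at which each switch finishes its update
    resets = [0] * num_switches
    events = []  # min-heap of (arrival_tick, target_switch)

    def send_pair(sender, now):
        # A recovering switch sends two messages to its neighbor in the ring.
        neighbor = (sender + 1) % num_switches
        heapq.heappush(events, (now + 1, neighbor))
        heapq.heappush(events, (now + 1 + gap_ticks, neighbor))

    # Switch 0 overloads, resets once, and announces its recovery.
    resets[0] += 1
    send_pair(0, 0)

    while events:
        tick, target = heapq.heappop(events)
        if tick > max_ticks:
            break
        if tick < busy_until[target]:
            # Message arrived mid-update: assume an internal fault and reset.
            resets[target] += 1
            busy_until[target] = -1
            send_pair(target, tick)
        else:
            # Normal case: start processing the message.
            busy_until[target] = tick + update_ticks

    return resets

if __name__ == "__main__":
    print("messages too close together:", simulate(5, gap_ticks=1))
    print("messages spaced out:        ", simulate(5, gap_ticks=3))
```

In this model, when the gap between the two messages is shorter than the time the neighbor needs to process one, every switch in the ring ends up resetting over and over until the clock runs out. Space the messages out and only the original switch resets.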

In the meantime, AT&T lost an estimated $60 million in long-distance charges from calls that didn’t go through. The company took a further financial hit a few weeks later when it knocked a third off its regular long-distance rates on Valentine’s Day to make amends with customers.

Posted by Nilesh Kumar
January 31st, 2012
