January 10, 2003

NSIT evaluates December e-mail disruptions

Networking Services and Information Technologies (NSIT) is still working with Sun Microsystems to understand the e-mail disruption that affected over 18,000 users last month. During the first week of December, students and faculty faced a challenge that had nothing to do with preparing for exams, as NSIT's central mail service experienced massive slowdowns and interruptions, the largest such event in University history.

"The whole thing would have been really funny if it hadn't been during finals week, pretty much the one time of the quarter that the undergraduates absolutely require having stable e-mail," said Erika Williams-Tully, a fourth-year in the College. "I'd really love to know why they thought changing servers during the busiest time of the quarter was a good idea."

Damian Rasch, a researcher and technical coordinator for the departments of medicine and psychology, was also upset by the disruption. "This is not something that they planned to have happen," Rasch said. "On the other hand, the way the matter was dealt with, and more particularly the timeframe in which the problem was resolved are nothing short of a complete and utter embarrassment to our University. Logistically, it was a nightmare. Interdepartmental and interdivisional collaborations came to a virtual standstill."

According to Gregory Jackson, vice president and chief information officer for the University, the causes of the e-mail disruption "were partly exogenous, partly our own willingness to embrace unmanageable diversity, and certain hardware, software, and support failures."

The volume of e-mail and the size of individual messages have been growing rapidly in the last few years. Because the annual increase in NSIT funding has not matched this growth, NSIT is limited in its ability to install redundancies that would minimize the chance of system failure.

"The University is trying to hold costs level. To fully implement a truly high availability system of this size is extraordinarily expensive," said Bob Bartlett, director of Enterprise Network Servers & Security at NSIT. "It is very hard to predict exactly how the volume curve will unfold and therefore to have the right server capacity in place at the right time."

The day after Thanksgiving break usually presents a heavy load for NSIT's servers. Many individuals do not check their e-mail over break, leading to four or five days of accumulation. In addition, the amount of e-mail sent tends to multiply because of students' travel plans and communication with their families.

The resulting load on the server is enormous, as students, faculty, and staff check their e-mail after returning from break. This year, the number of messages and the demand on the server led to a system slowdown on Monday, December 2.

NSIT assumed that the backlog arose because its Sun 4500 server was not adequate. After NSIT rebooted the system on Tuesday, however, the backlog persisted. NSIT then brought a new server, the Sun 6800, online Tuesday night.

While the 6800, with a capacity twenty times that of the 4500, was fast enough to handle the increased traffic, another problem became apparent. To get the 6800 operating promptly, NSIT transferred the T3 disk arrays, which store incoming e-mail, from the 4500 machine to the 6800. NSIT felt it did not have time to copy the T3 arrays, so it transferred the physical disks instead. Yet one of these T3s had been physically degenerating for some time and, once it was installed in the 6800, was not able to keep up with the demand.

The nature of the T3 array failure meant that no errors were reported by the system. NSIT called to consult with Sun Microsystems on Wednesday of that week, but neither organization was able to pinpoint the failure. Sun engineers arrived on Thursday and spent the day unsuccessfully trying to uncover the problem. Students and faculty began to complain in large numbers, even as NSIT worked to solve the problem.

According to Jackson, the administration was highly supportive of NSIT's efforts and systems administrators around campus attempted to help with the problem.

"[In retrospect] I believe we should have been more skeptical about suggestions from Sun's support," Bartlett said. "They made several suggestions that we questioned but implemented. Those settings may have extended the downtime by an additional day."

By Friday afternoon, NSIT and Sun concluded that the problem with the T3 needed to be bypassed. All the T3 arrays were attached to the new server, and the mail spools that contained unread mail were divided among the various disk arrays. The server had to be taken offline in order to accomplish this.

On Saturday, a Sun senior design engineer was dispatched to work with NSIT. This engineer corrected many of the problems caused by previous Sun technicians, and system performance improved after each problem was solved.

The expected performance level from the T3 disk arrays, however, has still not been met. Jackson expects that Sun will "want to compensate the University for our difficulties," adding that the relationship with Sun now has stresses it did not contain before.

Eventually, NSIT may have to adopt the system used by other universities, separating users into groups supported by separate servers. This would reduce the single-point-of-failure risk that affected the system in December. However, this would require extra technology, more money, and some inconvenience to users.

"Basically, this problem needs to be dealt with from the top-down and has nothing to do with infrastructure and everything to do with personnel," Rasch said. "The University of Michigan's servers are so stable that they even provide fully active accounts to any of their alumni who want them. If we can't even get our own e-mail server back online after 9 straight days, four of which fell on weekends when no one was at work or in school, well, you draw your own conclusions."