CHPC Outage on Monday, August 8th
Date Posted: August 11th, 2016
At about 3 p.m. on Monday, August 8th, CHPC experienced service interruptions that were difficult to diagnose and whose symptoms were not consistent with any obvious problem. Users reported issues with file server access and cluster access, and some virtual machines were also affected.
On Monday evening, after we rebooted one of our file servers, most reported problems appeared to be resolved. However, we were less than confident that we had found the root cause. We had no explanation for either the cause or the apparent resolution.
CHPC staff continued to diagnose and dig through logs, and finally found an elusive network problem. One thing staff had noticed was strange traffic patterns (high CPU utilization) on several switches, which began at a suspicious time: about 3 p.m. on Monday. After much work and diagnosis, late in the afternoon of the 9th, they were able to isolate the offending traffic to a particular VLAN, and they ultimately found a physical connection problem that had caused a network loop. We have identified some changes that will reduce our vulnerability to this kind of problem going forward and, after careful analysis, will most likely implement them in the near future. Note: please refrain from plugging in unknown switches, and always feel free to ask us for help.
Once the connection problem was corrected, CPU utilization on all switches returned to normal, and we believe that we have found the root cause.
CHPC thanks our great users for their useful reports of errors and access problems, and for their patience while we worked toward a resolution.