Update 3:22 pm
CHPC has completed the vendor-recommended changes and access has been restored. The
issues of the last two days, caused by inconsistencies in the two redundant circuits (details
below), have been resolved.
Note that we are still working with vendors regarding issues seen over the holidays. They have
identified bugs in the firmware and are working with us to provide new software. We are in
a stable workaround state at the moment and have been trying to bring services
(such as the redundant circuit) back online while we wait for the fix.
We thank you for your patience as we continue to troubleshoot the new network equipment
and ask that you report any access issues to helpdesk@chpc.utah.edu. The CHPC team will
continue to perform quality assurance tests on all of the services that experienced problems.
Details: Two days ago, we brought up a redundant circuit that we had taken down to isolate
problems during troubleshooting. With that circuit back on-line, some services started to have
intermittent and random connectivity issues. Therefore, we brought the circuit back
down this morning to help mitigate the issues. Even though we had previously tested bringing
the redundant circuits up and down without any issues, when
we brought the circuit down this morning, random services started dropping or experiencing
timeouts across the entire infrastructure, not just on a few services. We brought
the circuit back up and it helped the majority of the services, but not all. We kept
digging and contacted the vendor. They were able to help us isolate the problems
to two inconsistencies between the routers. We were able to repair these inconsistencies
by removing part of the configuration on one unit and then reinstating the same
config. We also added some additional configuration. These changes have reset and
normalized the respective configuration databases.

CHPC has experienced two unplanned network interruptions this morning.
The first was from approximately 7:00 to 8:45 am, as a result of removing a redundant link
in order to address issues being observed on several isolated systems. As this change
was not expected to cause an outage, we did not schedule a time or announce the change.
The second was at 11:45 am and lasted a couple of minutes; it was due to a configuration change
made while troubleshooting the first outage.
The CHPC networking team is in contact with the vendors of the networking hardware and
is planning to make a vendor-recommended change between 1 and 2 pm today. This change may
temporarily break all connectivity.
We will send a follow-up message once we are done with the changes.