DBA ring stack loop (Status)


[#631] DBA ring stack loop (Status)

Posted: 2018-04-04 13:36

Start: 2018-04-04 12:07:00
End : 2018-04-04 12:45:00

Affects: DBA layer 3 routing

At 12.07 one of the links between the ring network and the local routers at datacenter DBA experienced major packet loss. This caused the BGP peers on the DBA routers to flap, resulting in layer 3 routing issues.

At 12.15 one of the ring switches, a member of a stack, reloaded. During the reload, BGP and traffic stabilized and recovered. When the device became an active member of the ring stack again, the same issues recurred. The reload was caused by a traffic loop on the backplane of the ring stack, so we suspected that the device was faulty or that a bug was present.

At 12.30 we determined that the issue was not with the device (member) of the ring stack; the device itself was stable after the reload. We debugged all four 100G ports connecting the ring stack to the DBA routers and found that one port, on the ring stack side, was generating 'fake' traffic.

Debugging showed that communication between the device and the transceiver was working fine, but that traffic was looping back within the transceiver to the device, or at least that is what our monitoring showed. This seems to have caused a flood of partial packets, and as the traffic kept looping on the device (generating more and more), the device eventually reloaded. The reload itself most likely happened because management was unable to communicate with its linecards/switch fabrics due to this loop. The device did not turn off the defective port because the transceiver communication was not affected and the transceiver itself did not report any errors.
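As an illustration of the failure mode described above, a port can look healthy to its transceiver diagnostics while still receiving a flood of partial packets. The following is a minimal sketch of a check that looks at traffic counters instead of transceiver status. It is not the monitoring we run; the `PortSample` fields, the port names, the sample numbers and the thresholds are all hypothetical and only serve to show the idea.

```python
from dataclasses import dataclass


@dataclass
class PortSample:
    """One polling interval for a port. All fields are hypothetical."""
    name: str
    rx_frames: int         # frames received during the interval
    rx_error_frames: int   # partial/errored frames received during the interval
    transceiver_ok: bool   # what the transceiver diagnostics report


def suspect_ports(samples, max_error_ratio=0.01, min_frames=1000):
    """Return names of ports with a high error ratio even though the
    transceiver reports itself as healthy."""
    suspects = []
    for s in samples:
        if s.rx_frames < min_frames:
            continue  # too little traffic in this interval to judge
        ratio = s.rx_error_frames / s.rx_frames
        if ratio > max_error_ratio and s.transceiver_ok:
            suspects.append(s.name)
    return suspects


# Made-up sample data for four uplink ports; not taken from the incident.
samples = [
    PortSample("port-1", 500_000, 12, True),
    PortSample("port-2", 480_000, 9, True),
    PortSample("port-3", 900_000, 310_000, True),
    PortSample("port-4", 510_000, 15, True),
]
print(suspect_ports(samples))
```

With this made-up data only `port-3` is flagged. The point of the sketch is that the decision is based on what arrives on the port, not on what the transceiver reports about itself.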

At 12.40 the problematic port was disabled and the transceiver was physically removed from the ring stack.

At 12.45 the network was stabilizing and all BGP sessions were recovering.

At around 13.00 the transceiver was replaced with a spare; no issues have been observed on the replacement transceiver.