#649 Router linecard crashes, vendor investigation (ongoing)

Posted: 2018-05-11 10:37

Start: 2018-04-01 00:00:00
End : 0000-00-00 00:00:00

Affects: Routing, primarily DBA

For transparency purposes we opened this case to keep you informed about an issue that causes linecards in the routers to crash.

During February and March we worked with our vendor Brocade/Extreme to resolve packet loss on a 20x 10G linecard, which occurred when using more than 8 ports in a 160 Gbps LAG (bundle). A work-around and fixes were implemented for this issue, and to date these have been working. This was acceptable as these are the last linecards with 10G ports we use on our routing platform; they are planned to be replaced by 100G linecards. All other routers and linecards in our hosting facilities DBC/NZS have already been replaced with 100G models. The replacement for DBA was planned for last month and everything has been prepared and connected; however, due to the new ongoing issue we have postponed this work.

Since approximately the 1st of April we have noticed crashes of linecards and management cards on Brocade/Extreme routers. The first incidents were immediately reported to our vendor and a TAC case was created. During April we counted 5 incidents at DBC and one incident at EQX-AM7. During April the TAC case was escalated to the highest level possible with the vendor; we are part of their critical customers program. During May the incidents increased further, with less time between them. So far we have counted 19 incidents on DBA during May. Our vendor has been made aware that failure to resolve this will force us to break a 10-year partnership with them. The vendor has assured us that this case is being handled with the highest priority and that all possible escalations have already been made. It is currently in the hands of the engineers responsible for developing the code for these devices; if they cannot get this fixed, then there is no one who can.

Our network is set up 100% redundantly in an active/active manner. This means redundant physical fiber paths and redundant network chassis (routers and VPLS), and within these chassis we also place everything redundantly: double (or more) the required number of PSUs, redundant management cards, and the load of each LAG spread over multiple linecards. The latter should have limited the impact of these incidents to a minimum (a few milliseconds). Unfortunately the linecard seems to drop packets just before it crashes. If it were to crash immediately the impact would have been milliseconds; now the impact is seconds. As the traffic comes in balanced over two routers and also goes out balanced over two routers, the impact is reduced to roughly 1/4th of the connections. The incidents recover automatically without any work from our NOC: the card is automatically removed from the system when it fails and automatically rejoins when it comes back, so no human interaction is required to recover during incidents.
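To illustrate why spreading a LAG over multiple linecards matters, here is a minimal sketch. The port layout, the linecard names (lc1, lc2) and the function name are hypothetical and do not reflect our actual configuration; it only shows how much LAG capacity remains when a single linecard disappears.

```python
def surviving_capacity_gbps(member_ports, failed_linecard, port_speed_gbps=10):
    """Return the LAG capacity left once every port on failed_linecard is down.

    member_ports is a list of (linecard, port) pairs; all names are hypothetical.
    """
    alive = [port for port in member_ports if port[0] != failed_linecard]
    return len(alive) * port_speed_gbps


# Hypothetical 16 x 10G LAG, placed entirely on one linecard...
single_card = [("lc1", p) for p in range(16)]
# ...versus the same LAG spread evenly over two linecards.
spread = [("lc1", p) for p in range(8)] + [("lc2", p) for p in range(8)]

print(surviving_capacity_gbps(single_card, "lc1"))  # 0: the whole LAG is gone
print(surviving_capacity_gbps(spread, "lc1"))       # 80: half the LAG keeps forwarding
```

In this toy model the spread layout keeps forwarding on the surviving linecard, which is why a clean, immediate crash would normally be barely noticeable.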

This week we patched the routers on DBA (based on the crash reports provided to Brocade); this seems to have resolved the management card crashes. However, the most pressing issue, the linecard crashes, has not yet been fixed (or not in all cases).

So far it seems they have narrowed it down to their IPv6 code/routines. They are modifying the lab setup to send different packets in different scenarios in order to replicate the crashes; once they are able to replicate them, they can build a fix.

We hope that we can soon say with confidence that the layer 3 routing platform in DBA is stable again, and that everyone can enjoy the stability they were used to from us before these incidents. Our apologies for the situation at hand; we can assure you that this situation is our highest priority, as well as our vendor's.

Update 2018-05-16 18.15:
We have implemented temporary partial fixes, including an ACL blocking certain IPv6 packets at DBA. Since the initial report of this post (2018-05-11) no more crashes have been observed at DBA.

Last night at 00.50 there was a new crash at DBC R1: two linecards carrying 400 Gbps crashed within minutes of each other. We have applied the same ACL as on DBA to all routers in our network.

#647 VPLS Outage EQX-AM7

Posted: 2018-05-06 23:32

Start: 2018-05-06 23:10:00
End : 2018-05-07 00:15:00

Affects: Fiber to the business Customers located on EQX-AM7, VPLS Customers EQX-AM7, Peering traffic R1 EQX-AM7

Currently there is an outage ongoing on the VPLS ring of EQX-AM7 where one of the switches in the stack rebooted itself.

The stack itself is set up redundantly, and the standby took over its place.

However, for an as yet unknown reason a part of the VPLS ring went down even though the ports are still up.

We are currently investigating the issue to find the cause, to resolve it and to collect data for our vendor.

Update 2018/05/07 00.05:
As our various attempts to bring the remaining parts back online did not succeed, we decided to do a clean restart of the whole stack after collecting enough debug logs for our vendor.

After both switches were rebooted the remainder of the ports also came back online.

Because peering connections were still working on the EQX-AM7 router but had no way of being sent over to the correct datacenters, higher latency or random packet loss may also have occurred.

At this moment the switches are stable again and we'll keep monitoring them while discussing this issue with our vendor.

#638 Packetloss traffic over NTT

Posted: 2018-04-28 21:10

Start: 2018-04-28 18:40:00
End : 2018-04-28 18:50:00

Affects: Routing NTT

We have received notification from NTT reporting issues with one of their routers causing packet loss for traffic through NTT.

Their notification: " experienced an issue where it's routing process crashed and did not gracefully fallback. This upstream device caused a brief outage while routing was reloaded and latency due to convergence. We are investigating with our vendor."

Their reference: VNOC-1-1698497805

For transparency purposes we decided to publish this notification.

#631 DBA ring stack loop

Posted: 2018-04-04 13:36

Start: 2018-04-04 12:07:00
End : 2018-04-04 12:45:00

Affects: DBA layer 3 routing

At 12.07 one of the links between the ring network and the routers locally at datacenter DBA experienced major packet loss. This caused the BGP peers on the DBA routers to flap, causing layer 3 routing issues.

At 12.15 one of the ring switches, part of a stack, reloaded. During the reload BGP and traffic stabilized and recovered, but when the device became an active member of the ring stack again the same issues occurred. The reload was caused by a traffic loop on the backplane of the ring stack. Hence we suspected the device to be faulty, or a bug to be present.

At 12.30 we established that the issue was not with the device (member) of the ring stack; the device itself was stable after the reload. We debugged all four 100G ports connecting the ring stack to the DBA routers and found that one port was generating 'fake' traffic on the side of the ring stack.

Debugging shows that the communication between the device and the transceiver was working fine; however, traffic was looping back within the transceiver to the device, or at least this is what our monitoring showed. This seems to have caused a flood of partial packets, and as it kept looping (generating more and more) on the device, the device at some point reloaded. The reload itself was most likely due to management being unable to communicate with its linecards/switch fabrics because of this loop. The device did not turn off the defective port because the transceiver communication was not affected and the transceiver itself did not report any errors.

At 12.40 the problematic port was disabled and the transceiver was physically removed from the ring stack.

At 12.45 the network was stabilizing and all BGP sessions were recovering.

At around 13.00 the transceiver was replaced by a spare; no issues were seen on the new transceiver.

#627 Seabone outage GSA

Posted: 2018-03-19 12:48

Start: 2018-03-19 12:22:00
End : 2018-03-19 20:57:00

Affects: Routing Seabone GSA

Currently our connection to Seabone located at Globalswitch Amsterdam has gone down for an unknown reason.

We've sent a trouble ticket to Seabone to request more information about this outage.

All Seabone traffic is automatically re-routed over our other connection at Equinix Amsterdam 5.

Update 21.05:
Please be Informed that there was a Faulty card on transmissive equipment replaced,