[#649] Router linecard crashes, vendor investigation (ongoing) (Status)

Posted: 2018-05-11 10:37

Start: 2018-04-01 00:00:00
End: 2018-07-06 00:00:00

Affects: Routing, primarily DBA

For transparency purposes we have opened this case to keep you informed about an issue that causes linecards in our routers to crash.

During February and March we worked with our vendor Brocade/Extreme to resolve packet loss on a 20x 10G linecard, which occurred when using more than 8 ports in a 160 Gbps LAG (bundle). A work-around and fixes were implemented for this issue and to date these have been working. This was acceptable, as these are the last linecards with 10G ports we use on our routing platform; they are marked/planned to be replaced by 100G linecards. All other routers and linecards in our hosting facilities DBC/NZS have already been replaced with 100G models. The replacement for DBA was actually planned for last month, and everything had been prepared and connected; however, due to the current ongoing issue we have postponed this work.

Since roughly the 1st of April we have noticed crashes of linecards and management cards on Brocade/Extreme routers. The first incidents were immediately reported to our vendor and a TAC case was created. During April we noticed 5 incidents at DBC and one incident at EQX-AM7, and the TAC case was escalated to the highest level possible with the vendor; we are part of their critical customers program. During May the incidents increased further, with less time between them: so far we have counted 19 incidents at DBA during May. Our vendor has been made aware that failure to resolve this will force us to end a 10-year partnership with them. The vendor has assured us this case is being handled with the highest priority and that all possible escalations have already been made. It is currently in the hands of the engineers responsible for developing the code for these devices; if they cannot get this fixed, then no one can.

Our network is set up 100% redundantly in an active/active manner. This means redundant physical fiber paths, redundant network chassis (routers and VPLS), and redundancy within each chassis as well: double (or more) the required number of PSUs, redundant management cards, and the load of each LAG spread over multiple linecards. The latter (spreading the load of each LAG over multiple linecards) should have limited these incidents to minimal impact (a few milliseconds). Unfortunately, a linecard appears to drop packets just before it crashes; if it crashed immediately the impact would have been milliseconds, but now the impact is seconds. As traffic comes in balanced over two routers, and goes out balanced over two routers, the impact is reduced to roughly a quarter of connections. The incidents recover automatically without any action from our NOC: the card is removed from the system when it fails and auto-joins when it comes back, so no human intervention is required to recover during incidents.
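
As a rough illustration of where the ~1/4 figure comes from (this is a back-of-the-envelope reading on our side, not a vendor statement), the Python sketch below simply multiplies the two 50/50 splits, assuming traffic hashes evenly over the two routers in each direction and a connection is only disrupted when both of its legs land on the router with the affected linecard:

    # Back-of-the-envelope sketch only, not an exact model of the platform.
    # Assumption: inbound and outbound traffic each split evenly over the two
    # routers, and a connection is only disrupted when both its legs happen
    # to land on the router carrying the misbehaving linecard.
    p_inbound_on_affected_router = 0.5
    p_outbound_on_affected_router = 0.5

    p_connection_affected = p_inbound_on_affected_router * p_outbound_on_affected_router
    print(f"Approx. fraction of connections affected: {p_connection_affected:.2f}")  # ~0.25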

This week we patched the routers at DBA (based on the crash reports provided to Brocade), which seems to have resolved the management card crashes. However, the most pressing issue, the linecard crashes, is not yet (or not entirely) fixed.

So far it seems they have narrowed it down to their IPv6 code/routines. They are modifying their lab setup to send different packets in different scenarios in order to replicate the crashes; once they are able to replicate them they can build a fix.
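
To give an idea of the kind of lab traffic generation this involves (purely illustrative; the actual trigger packets, tooling and topology are known only to the vendor), a Python/Scapy sketch might vary IPv6 packets along these lines, with the documentation prefix 2001:db8::/32 standing in for a lab target:

    # Illustrative sketch only: the real packets that trigger the crashes are
    # not public. This just shows how a lab could generate IPv6 probes with
    # and without extension headers to exercise different code paths.
    from scapy.all import IPv6, IPv6ExtHdrHopByHop, ICMPv6EchoRequest, send

    target = "2001:db8::1"  # placeholder lab address (documentation prefix)

    probes = [
        IPv6(dst=target) / ICMPv6EchoRequest(),                          # plain echo
        IPv6(dst=target) / IPv6ExtHdrHopByHop() / ICMPv6EchoRequest(),   # with hop-by-hop header
    ]

    for pkt in probes:
        send(pkt, verbose=False)  # requires sufficient privileges to send raw packets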

We hope that soon we can say with confidence that the layer 3 routing platform at DBA is stable again, and that everyone can enjoy the stability they were used to from us before these incidents. Our apologies for the situation at hand; we can assure you that this situation has the highest priority, both for us and for our vendor.


Update 2018-05-16 18.15:
We have implemented temporary partial fixes, including an ACL blocking certain IPv6 packets at DBA. Since the initial report of this post (2018-05-11) no more crashes have been observed at DBA.

Last night at 00.50 there was a new incident at DBC R1: two linecards carrying 400 Gbps crashed within minutes of each other. We have now applied the same ACL used at DBA on all routers in our network.
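
As a conceptual illustration of what such an ACL does (the exact match criteria on our routers are not published here), the small Python sketch below drops IPv6 packets whose next-header value is on a hypothetical block list and forwards everything else:

    # Conceptual sketch only: the real ACL lives on the routers and its match
    # criteria are not published. The blocked next-header value below (0 =
    # Hop-by-Hop Options) is a hypothetical placeholder.
    BLOCKED_IPV6_NEXT_HEADERS = {0}

    def acl_permits(next_header: int) -> bool:
        """Return True if the packet may be forwarded, False if the ACL drops it."""
        return next_header not in BLOCKED_IPV6_NEXT_HEADERS

    print(acl_permits(6))   # 6 = TCP -> True (forwarded)
    print(acl_permits(0))   # 0 = Hop-by-Hop Options -> False (dropped)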


Update 2018-07-06 17.23:
Since the end of May we have not seen any further crashes. Closing this case.


Update 2018-07-16 10.40:
For archiving purposes, the following defects were reported by NFOrce to the vendor:
DEFECT000660088 [ifsr issue] (fixed in 6200b)
DEFECT000642455 [mgmt crash] (fixed in 6200b)
DEFECT000661713 [lp crash] (to be fixed in next release)
DEFECT000661585 [lp crash]
DEFECT000661951 [lp crash]
DEFECT000661952 [lp crash]