#685 Darkfiber outage NZS - EQX-AM7

Posted: 2018-07-25 11:14

Start: 2018-07-25 10:30:00
End : 2018-07-25 13:25:00

Affects: Redundancy NZS

At 10:30 CEST a darkfiber between NZS (Nedzone Steenbergen) and EQX-AM7 (Equinix AM7), carrying 200 Gbps of traffic, went down.

Currently all traffic is sent over the alternative path on our ring, namely NZS - GSA.

We have contacted the darkfiber provider Eurofiber and were informed that one of the fiber paths is damaged and that crews are locating the exact break point.

Update 13:25:

The darkfiber is online again. The root cause of this issue is still unknown, but as the line has been stable for the last two hours, we will be closing this case.

#673 Darkfiber outage NZS - EQX-AM7

Posted: 2018-06-29 12:44

Start: 2018-06-29 12:19:00
End : 2018-06-30 04:15:00

Affects: Redundancy NZS

At 12:19 CEST a darkfiber between NZS (Nedzone Steenbergen) and EQX-AM7 (Equinix AM7), carrying 200 Gbps of traffic, went down.

Currently all traffic is sent over the alternative path on our ring, namely NZS - GSA.

We have contacted the darkfiber provider Eurofiber and were informed that one of the fiber paths is damaged and that crews are locating the exact break point.

Update 17:11:
The fiber damage has been traced to construction work in Etten-Leur; repair crews are on their way to repair the fibers.

Update 20:38:
The damage to the fibers and ducts is so severe that the decision has been made to lay new ducts and pull new fibers through them. While Eurofiber does not want to give a firm ETA, the estimate is that they will start placing the new fibers at 22:00 CEST.

Update 2018-06-30 04:15:
The splicing appears to be finished and the connection has been re-established.

#663 FTTB outage Stolwijk / Bergambacht / Nieuwerkerk aan den IJssel

Posted: 2018-06-03 11:46

Start: 2018-06-03 10:22:00
End : 2018-06-03 13:31:00

Affects: Stolwijk / Bergambacht / Nieuwerkerk

*This does not affect any Hosting or Transit Customers and is only valid for Customers located in one of the affected areas.*

Due to a power outage in Waddinxveen the PoP of our transport provider Eurofiber is also without power.

Because of this, our connections to Stolwijk, Bergambacht and Nieuwerkerk aan den IJssel are currently down.

According to Eurofiber, the current ETA for resolving the power disruption is 4 PM CEST.
They have also sent engineers with a power generator to feed the PoP; the ETA for that is 1 PM CEST.

Our apologies for the inconvenience caused.

#649 Router linecard crashes, vendor investigation (ongoing)

Posted: 2018-05-11 10:37

Start: 2018-04-01 00:00:00
End : 2018-07-06 00:00:00

Affects: Routing, primarily DBA

For transparency purposes we opened this case to keep you informed about an issue that causes linecards in our routers to crash.

During February and March we worked with our vendor Brocade/Extreme to resolve packet loss on a 20x 10G linecard, which occurred when using more than 8 ports in a 160 Gbps LAG (bundle). A work-around and fixes were implemented for this issue and to date they have been working. This was acceptable, as these are the last linecards with 10G ports we use on our routing platform; they are planned to be replaced by 100G linecards. All other routers and linecards in our hosting facilities DBC/NZS have already been upgraded to 100G models. The replacement for DBA was actually planned for last month, and everything had been prepared and connected; however, due to the current new ongoing issue we have postponed this work.

Since approximately 1 April we have observed crashes of linecards and management cards on our Brocade/Extreme routers. The first incidents were immediately reported to our vendor and a TAC case was created. During April we observed 5 incidents at DBC and one at EQX-AM7, and the TAC case was escalated to the highest level possible with the vendor; we are part of their critical-customers program. During May the incidents increased further, with less time between them; so far we have counted 19 incidents at DBA during May. We have made our vendor aware that failure to resolve this will force us to break a 10-year partnership with them. The vendor has assured us the case is being handled with the highest priority and that all possible escalations have already been done. It is currently in the hands of the engineers responsible for developing the code for these devices; if they cannot fix it, no one can.

Our network is set up 100% redundantly in an active/active manner: redundant physical fiber paths, redundant network chassis (routers and VPLS), and within each chassis everything is also redundant, i.e. double (or more) the required number of PSUs, redundant management cards, and the load of each LAG spread over multiple linecards. The latter (spreading each LAG over multiple linecards) should have reduced the impact of an incident to a few milliseconds. Unfortunately, a linecard appears to drop packets just before it crashes; if it crashed immediately the impact would have been milliseconds, but as it is, the impact is seconds. As traffic comes in balanced over two routers and also goes out balanced over two routers, the impact is reduced to roughly a quarter of connections. The incidents recover automatically without any intervention from our NOC: a failing card is removed from the system and rejoins automatically when it comes back, so no human interaction is required to recover during an incident.
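The rough one-quarter figure can be sanity-checked with a small simulation. This is a simplified, hypothetical model of hash-based balancing (the function name and the coin-flip hashing are our illustration, not the routers' actual hashing algorithm): a flow only traverses the failing device if both its ingress and its egress lookup land on it, i.e. about 1/2 × 1/2 = 1/4 of connections.

```python
import random

# Hypothetical model (not the actual hardware hashing): each flow is
# balanced independently onto one of two ingress routers and one of
# two egress routers (active/active). Router 0 is the failing one; a
# flow is affected only if both lookups pick router 0.
def affected_fraction(num_flows=200_000, seed=1):
    rng = random.Random(seed)
    hit = sum(
        1
        for _ in range(num_flows)
        if rng.randrange(2) == 0 and rng.randrange(2) == 0
    )
    return hit / num_flows

print(affected_fraction())  # close to 0.25
```

With a deterministic seed the simulated fraction stays within about a percent of the analytic 1/4.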

This week we patched the routers at DBA (based on the crash reports provided to Brocade), which seems to have resolved the management card crash. However, the most pressing issue, the linecard crashes, is not yet fully fixed.

So far they seem to have narrowed it down to their IPv6 code/routines. They are modifying their lab setup to send different packets in different scenarios to replicate the crashes; once they can replicate them, they can build a fix.

We hope that soon we can say with confidence that the layer 3 routing platform in DBA is stable again, so that everyone can enjoy the stability they were used to from us before these incidents. Our apologies for the situation at hand; we can assure you that it has our highest priority, as it does with our vendor.

Update 2018-05-16 18:15:
We have implemented temporary partial fixes, including an ACL blocking certain IPv6 packets at DBA. Since the initial report of this post (2018-05-11) no more crashes have been observed at DBA.

Last night at 00:50 there was a new crash at DBC R1: two linecards carrying 400 Gbps crashed within minutes of each other. We have applied the same ACL used at DBA to all routers in our network.

Update 2018-07-06 17:23:
Since the end of May we have not seen any further crashes. Closing the case.

Update 2018-07-16 10:40:
For archiving purposes, the following defects were reported by NFOrce to the vendor:
DEFECT000660088 [ifsr issue] (fixed in 6200b)
DEFECT000642455 [mgmt crash] (fixed in 6200b)
DEFECT000661713 [lp crash] (to be fixed in next rls)
DEFECT000661585 [lp crash]
DEFECT000661951 [lp crash]
DEFECT000661952 [lp crash]

#647 VPLS Outage EQX-AM7

Posted: 2018-05-06 23:32

Start: 2018-05-06 23:10:00
End : 2018-05-07 00:15:00

Affects: Fiber to the business Customers located on EQX-AM7, VPLS Customers EQX-AM7, Peering traffic R1 EQX-AM7

There is currently an ongoing outage on the VPLS ring at EQX-AM7, where one of the switches in the stack rebooted itself.

The stack itself is set up redundantly, and the standby switch took over its place.

However, for a yet-unknown reason, part of the VPLS ring went down even though the ports remained up.

We are currently investigating the issue to find the cause, resolve it, and collect data for our vendor.

Update 2018/05/07 00:05:
As various attempts to bring the remaining parts back online did not succeed, we decided to try a clean restart of the whole stack after collecting enough debug logs for our vendor.

After both switches were rebooted the remainder of the ports also came back online.

Because peering connections on the EQX-AM7 router were still working but had no way of being sent on to the correct datacenters, higher latency or random packet loss may also have occurred.

At this moment the switches are stable again and we'll keep monitoring them while discussing this issue with our vendor.