Network issues DBA (Status)

[#625] Network issues DBA (Status)

Posted: 2018-03-15 12:25

Start: 2018-03-12 00:00:00
End: 2018-03-15 23:59:59

Affects: Routing DBA

Currently there is an issue in the DBA datacenter where both R1 and R2 stopped responding to their linecards, effectively cutting off all communication with the outside world.

The cards are now back online and are syncing routes with our transit providers. Our vendor is debugging along with us to find the source of these issues.

Update 12.45:
The issue was not with the linecards themselves but with the management handling ARP incorrectly.

At the beginning of this week, the core routers at our DBA server hosting facility were facing issues. Shortly after these incidents started, we implemented a workaround ourselves to prevent the routers from having to deal with the problematic traffic causing this issue.

The vendor has been working on debugging these issues since that time, using tech logs and captures as well as live sessions. Today the vendor proposed a solution. It was implemented, and for a short period it did seem to improve the situation. Shortly afterwards the situation worsened, and BGP sessions, including the ones to our IP transits, started to flap. We immediately removed their 'fix' and implemented our workaround again.

We have filed an official complaint with the vendor of the routers (Extreme, fka Brocade). This complaint means they have to escalate the case to the highest level of support as well as management for investigation.

Apologies for the routing issues in this facility over the past days. Please rest assured we are doing everything in our power to have the vendor come up with a fix themselves. However, while awaiting their investigation, we will leave our own workaround, which has proven to be stable, in place.

Update 19.00:
The issue seems to be back to a lesser degree. Brocade has, however, already been working on the routers in a live session since 17.30. We have disabled some capacity to regain stability. When we know more, we will post an update.

Update 19.48:
Additionally, we have located a MAC address causing issues in VLANs 210/211/212/215.

Update 2018-03-16 04.00:
Since the last update, Brocade engineers have been debugging live and will discuss the gathered information with their team.

Update 14.40:
Brocade is still debugging in their lab / discussing information with their team. So far everything is still stable but not resolved.

Update 2018-03-17 15.45:
We have analysed several changes recommended by the vendor. We are applying some of them to the configuration and will monitor the effects closely.

Update 16.30:
So far the first changes seem to have improved stability. Before deploying further changes we will keep this as-is for the time being to make sure it is stable.

Update 2018-03-19 11.20:
Everything has been stable with the last changes applied. We will now proceed with additional configuration to further improve ongoing stability.

Update 12.15:
Last week the routing platform at DBA suffered from a series (~4) of relatively short but very impactful packet-forwarding issues. The vendor has been debugging alongside our NOC team to find the cause of and a solution for this issue. A workaround was implemented after the first few incidents. On Saturday a more definitive fix was implemented. This has been tested, and the routing platform in DBA has been stable for ~4 days. We are hereby closing this incident. We will, however, keep working with the vendor to analyse the incident further over the next couple of weeks.