[#248] Power outage DBA (Status)

Posted: 2015-04-30 21:09

Start: 2015-04-30 19:35:00
End: 2015-04-30 19:48:00

Affects: DBA

This notification concerns datacenter DBA; the other datacenters DBC, TC2, TC5 and GSA are not affected.

At ~19.15 part of Amsterdam suffered a power outage and went dark. Datacenter DBA is located in this zone.

The UPS systems took over and the datacenter continued to run on backup power for its primary services, which are the datafloors (everything other than utility power).

At 19.35 the power-grid feed was still disconnected and the systems were still running on backup power. At this time, however, the power-grid feed received a burst/spike of power from the grid towards the system, which physically tripped the system's internal breakers.

At 19.43 the power grid was physically disconnected by hand by an engineer, who had to be protected while doing so; this is why it took several minutes.

At 19.45 the datafloors were back online.

We sincerely apologize for all the issues caused by this disruption. This datacenter has not experienced such an outage for at least 6 years. Please be assured we are doing everything possible from our end to make sure this remains an isolated incident.


Update 21.00:
We have resolved an issue with the network path towards zone 5: one of the core switches in that zone, or several of its ports, appears to be damaged. We have disabled these ports to resolve the packet loss and will debug the issue further.

Update 21.30:
Several PDUs seem to have lost their IP configuration. An engineer is resolving this. It affects remote power control of the connected servers, but not the power itself.

Update 21.45:
Engineers from the power-grid provider are checking their feeds.

Update 00.30:
Power has been restored to normal operation.

Update 00.40:
Core-sw1.z5.dba is an IRF set with 96x 10G ports. 16 of these ports have become unusable. We will replace the primary switch of this IRF set as soon as possible to restore full network redundancy and capacity in this zone.

Update 01.10:
Please be aware that we are still processing over 7,000 emails; responses and ticket creation are taking longer than normal in this situation. We are, however, handling everything as quickly as possible.

Update 05.00:
The backlog of emails has been processed.

The broken switch is being replaced at this time.

Update 05.30:
The switch has been replaced on the fly (hot-swapped) and network redundancy has been restored in zone 5.

Update 2015-05-04 16.15:
Throughout the weekend, engineers from Merlin Gerin / Schneider Electric have been analysing the outage. They have so far concluded that the breaker between the power grid and the datacenter's electrical system is at fault.

The breaker was physically tripped and did not auto-revert. The breaker is normally only operated during maintenance, or when, for example, government emergency services need to cut the power between the power grid and the datacenter. The breaker has a high amperage rating and should never trip on datacenter load: its rating is roughly 3x the datacenter's load. Nevertheless, the breaker physically failed and locked itself, most likely due to an external high-voltage surge, and had to be manually reset.
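As a rough sense check of that 3x figure (using the 2500A / 400 Volt rating of the breaker mentioned further below, and assuming a three-phase feed; the simplifications are ours):

    import math

    # Rough headroom estimate; assumes a three-phase 400 V feed and uses the
    # 2500 A Masterpact rating quoted later in this report.
    breaker_amps = 2500
    line_voltage = 400
    breaker_capacity_kva = math.sqrt(3) * line_voltage * breaker_amps / 1000  # ~1732 kVA
    implied_load_kva = breaker_capacity_kva / 3  # report states roughly 3x headroom
    print(f"breaker ~{breaker_capacity_kva:.0f} kVA, implied datacenter load ~{implied_load_kva:.0f} kVA")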

The breaker also has the ability to start the generators, which it did. This is not strictly required by design, since the breaker should not trip, and in cases where it does the power grid would normally be down for a longer time. It was implemented anyway to ease maintenance and as an extra fail-safe to make sure the generators are started.

The Netwatcher system, which passively monitors and controls the power sources, measures whether there is power on the grid. The Netwatcher starts the generators upon detecting a power-grid failure. If power is restored, it synchronises back to the grid and shuts down the generators.

The Netwatcher system measures directly on the power grid, i.e. before the master breaker. Because the grid itself was back online according to the Netwatcher, it in turn shut down the generators, even though the tripped breaker meant no utility power was reaching the datacenter.
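To make the failure mode concrete, here is a minimal sketch of the decision logic as we understand it (names and structure are our own simplification; the real Netwatcher is a PLC-based system, not Python):

    def netwatcher_cycle(grid_has_power, generators_running):
        """One control decision, based on a single measurement taken on the
        grid side, i.e. before the master breaker."""
        if not grid_has_power and not generators_running:
            return "start generators"
        if grid_has_power and generators_running:
            return "synchronise to grid, then stop generators"
        return "no action"

    # During this incident the grid itself was live while the breaker behind
    # the measuring point was tripped, so the Netwatcher saw grid_has_power
    # True and stopped the generators although the datafloors had no utility feed.
    print(netwatcher_cycle(grid_has_power=True, generators_running=True))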

One solution would be to have the Netwatcher measure at another stage in the power path, somewhere behind the master breaker. However, this might cause difficulties for planned maintenance on the generators or other related equipment. Another solution would be to have the generators ignore a shutdown command from the Netwatcher, or to remove the shutdown function from the Netwatcher entirely. In that case an engineer would always have to shut down the generators manually (physically) on-site after every power failure.
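Purely as an illustration of the first option, the same decision with an extra measurement behind the master breaker could look like this (again our own simplification, not the actual implementation):

    def netwatcher_cycle_dual(grid_has_power, power_after_breaker, generators_running):
        """Variant that also measures behind the master breaker, so a tripped
        breaker no longer looks like a healthy utility feed."""
        utility_feed_ok = grid_has_power and power_after_breaker
        if not utility_feed_ok and not generators_running:
            return "start generators"
        if utility_feed_ok and generators_running:
            return "synchronise to grid, then stop generators"
        return "no action"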

There are two generators, each of which can carry the whole current load of the datacenter on its own. The generators share a local fuel supply of over 20,000 litres, which means the datacenter can run on fuel for over 10 days before needing to refuel.
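As a back-of-the-envelope check of the "over 10 days" figure (the burn rate below is an assumption on our part, not a measured value):

    # Rough autonomy estimate; the burn rate is an assumed round number.
    fuel_litres = 20_000                 # combined local fuel supply
    assumed_burn_litres_per_hour = 80    # one generator carrying the full load
    days = fuel_litres / (assumed_burn_litres_per_hour * 24)
    print(f"~{days:.1f} days of runtime")  # ~10.4 days, consistent with "over 10 days"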

The blackbox connected to the power system does not show any signs of failure in other systems such as the generators and the UPS. It confirms that the UPS took over during the outage but never received power from the generators.

At this moment the damaged Merlin Gerin / Schneider Masterpact 2500A 400 Volt breaker is being replaced and the PLCs are being reprogrammed. The new configuration is currently being tested on a spare system the datacenter has on-site.

The engineers have finished testing and are now rolling out the changes. This means:
- Replacing all components of the breaker itself
- Replacing the external sensor(s) in the breaker
- Replacing the cabling between the Netwatcher and the breaker
- Adding additional (cabling) circuits between the Netwatcher and the system, so that it measures both before and after the breaker
- Removing the functionality from the Netwatcher which turns off the generators (not confirmed, awaiting update)

While they are replacing the breaker, they will need to shut down the power-grid feed multiple times, during which the UPS system will run on batteries. For your information, the UPS system already runs on batteries for several minutes every week during its automatic self-test and battery preservation process. While the power-grid breaker is shut down, they will be able to verify that the generators indeed power on and that the Netwatcher system does not turn them off again. This test acts out the exact same scenario as what happened last Thursday.

We have not yet received information on when they expect to finish this work. Once it is done, the status of the system will be moved from at-risk back to fully operational.

Update 2015-05-05 12.15:
The manufacturer's team in Switzerland is still analysing the data from the systems, but has not found any issues beyond those already reported.

At this moment the manufacturer's engineers are finishing up their work, after which they will proceed with final system testing to confirm proper operation. This could take all day depending on the results and may be a slow process, as they will need to replicate the possible scenarios that caused the outage. Once this has been finished, the system will be marked as fully operational again.

Update 2015-05-06 16.00:
The manufacturer's engineers have finished their work. They performed a full black-building test twice today, in which the relevant scenarios were tested. We had one of our own engineers present during the black-building tests; we requested this with the datacenter because we wanted to see the backup system in action with our own eyes.

We hereby mark the system as fully operational again.