Thursday, March 19, 2009

Post Incident Update - Facility Power

    I have attached below the post-incident report from our data
center facility provider. As one of our most important vendors, we
stay engaged with American Internet Services' (AIS) plans, upgrades,
and operational changes. Naturally, with such a critical supplier, we
scrutinize their people's actions, technology, and facility,
especially when it comes to service-impacting events. It is our job
to provide the best service possible to our customers, and that
requires us to be a tough and determined customer of AIS, on your
behalf.

While power has been restored and the facility has resumed normal
operating status, this event is not "over" for us. M5 Hosting is
evaluating its own response to the event, and how our processes,
systems, and technologies can be improved to mitigate the impact of
another service-affecting event. AIS will certainly be doing the
same... and we will follow their actions closely and remain engaged
and involved with them.

That said, and bearing in mind that we are talking about an
unplanned loss of power in an Internet data center, I am pleased with
AIS's response to the incident once it happened. All of M5 Hosting's
technical staff and I were on site for up to 18 hours after the event
and observed their response to it firsthand. It is clear that AIS's
response was well directed and well planned.

Please find their Post Incident Report attached below. A diagram
of the AIS power infrastructure for the affected facility can be found
at:
http://www.m5hosting.com/AIS_SDTC_Power_Diagram_l.jpg

As always, I'd like to hear from you about this email, and the
events and actions described in it... or anything at all.

Sincerely,

Michael J. McCafferty
Principal Engineer
M5 Hosting
mike[at]m5hosting.com
877-344-4678 x501

[quote]
Dear Valued Customer:

As a follow-up to the power event which occurred on the morning of
March 18th, 2009 at the 9725 Scranton Data Center (SDTC), American
Internet Services has compiled the following post-incident report for
our customer base. As always, our Account Relations and Management
team members are available to discuss specific customer issues or
concerns, while this report is intended to provide a comprehensive
overview of the event itself.

At approximately 08:15 AM PDT on March 18th, the SDTC datacenter
suffered a complete power failure for approximately 30 seconds while
conducting routine maintenance on the critical datacenter systems.
The work being performed is part of AIS's Standard Operating
Procedure. This procedure is in alignment with industry guidelines
and our commitment to provide customers with the highest availability
in data center solutions. As we have informed our customers in the
past, all critical systems are tested bi-monthly by our team of
mechanical engineers in conjunction with our outside contractors
under service agreements. Standard maintenance is performed during
normal business hours and is carefully planned to incorporate the
strictest test procedures to ensure the success of the work
performed. Our SOP incorporates escalation processes and back-out
procedures in the unlikely event of an alert or anomaly during the
standard maintenance.

Regretfully, during our maintenance yesterday, we encountered a
mechanical failure. The Powerware 9515 UPS plant failed during the
transition of building load from street power to generator power.
Approximately 30 seconds after the failure of the UPS plant, our CTO,
Richard Sears, who was present for the maintenance, restored power to
the data center by manually moving the building to generator power;
he quickly isolated the failure to the UPS plant, reset all four UPS
modules, and brought them back online. He then moved the UPS plant
from bypass mode to normal operational mode.
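
For readers who want to picture the order of operations described
above, a rough sketch follows. The class names, fields, and states
are hypothetical illustrations and are not AIS's actual control
systems or logic.

# Hypothetical sketch of the recovery order described above; not AIS's
# actual control logic. Names, states, and fields are illustrative only.

from dataclasses import dataclass, field

@dataclass
class UpsPlant:
    modules_online: int = 0      # the SDTC plant has four UPS modules
    mode: str = "failed"         # "failed" -> "bypass" -> "normal"

@dataclass
class Building:
    power_source: str = "none"   # "street", "generator", or "none" during the outage
    ups: UpsPlant = field(default_factory=UpsPlant)

def manual_recovery(building: Building) -> None:
    # 1. Manually move the building load to generator to restore power.
    building.power_source = "generator"
    # 2. With the fault isolated to the UPS plant, reset all four modules
    #    and bring them back online (the plant is still carrying load in bypass).
    building.ups.modules_online = 4
    building.ups.mode = "bypass"
    # 3. Finally, move the UPS plant from bypass back to normal operation.
    building.ups.mode = "normal"

site = Building()
manual_recovery(site)
assert site.power_source == "generator" and site.ups.mode == "normal"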

At that time, senior management made the call to initiate the
Emergency Response Plan (ERP) and made a decision not to move the
data center back to street power until our mechanical engineers and
external contractors had an opportunity to perform diagnostics of all
datacenter systems to determine what caused the failure of the UPS
plant, as well as to test the general state of health of all critical
systems.

Within approximately 15 minutes of initiating the ERP, we had
mobilized 18 Customer Service Engineers, 5 Networking Engineers, and
our Facilities and HVAC teams to the datacenter to assist our
customers with recovery. We also had UPS, battery, and power experts
from Eaton Powerware, CPD, and Emerson on site to assist in the
investigation of the issue. As part of our emergency communication
plan, all customers were proactively contacted and informed of the
situation and were provided multiple progress updates throughout the
day.

Upon review of the findings, it was determined that one of our
battery strings had failed, leaving it unable to hold the system load
once the UPS plant went fully to battery. This caused a critically
low battery voltage condition across the entire UPS plant, and the
plant protected itself by bypassing its system load to the main bus.
This occurred while the building was being transferred from street
power to generator power, so both main buses were dead. In order to
prevent a dead-head of the generator and utility systems, the SEL
electrical system has a failsafe that prevents the main breakers from
closing after the emergency breakers have been commanded to close and
there is power on the emergency bus. This condition prevented us from
closing the main breakers, while we were still able to close the
emergency breakers.
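
Stated another way, the failsafe can be read as a simple interlock
condition on breaker state. The sketch below is a hypothetical
simplification for illustration only; it is not the actual SEL relay
programming.

# Hypothetical simplification of the interlock described above; not the
# actual SEL relay programming, only an illustration of the failsafe.

def main_breakers_may_close(emergency_breakers_commanded_closed: bool,
                            emergency_bus_energized: bool) -> bool:
    # The failsafe blocks the main (utility) breakers whenever the
    # emergency (generator) breakers have been commanded closed and the
    # emergency bus is live, so the generator and utility are never tied
    # together.
    return not (emergency_breakers_commanded_closed and emergency_bus_energized)

# During the incident the emergency breakers were commanded closed and the
# generator was carrying the emergency bus, so the mains stayed locked out:
print(main_breakers_may_close(True, True))    # False -> main breakers cannot close
print(main_breakers_may_close(False, False))  # True  -> normal street-power state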

As with all of our critical datacenter systems, we have external
contractors under maintenance agreements to provide system
maintenance. JT Packard is responsible for system maintenance on our
entire UPS and battery plant at the SDTC datacenter. We rely on our
vendor to test each battery at specific intervals to determine if and
when our batteries are approaching the threshold that requires
replacement. JT Packard has been performing this system maintenance
on a regular basis for several years now and most recently reported
100% system health.

The result of the investigation, in the opinion of both Eaton and
CPD, who conducted their investigations independently, is that the
battery string in question failed due to bad batteries that were not
identified during the latest battery tests by JT Packard.
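
Because the batteries in a UPS string are connected in series, a
single weak unit can pull an otherwise healthy string below the
voltage needed to carry load, which is why per-battery testing
against a replacement threshold matters. The following sketch is
hypothetical; the voltages and threshold are invented for
illustration and do not come from AIS or JT Packard.

# Hypothetical illustration only: the voltages and threshold below are
# invented and do not reflect AIS's or JT Packard's actual test data or
# procedures.

REPLACE_THRESHOLD_V = 12.4   # assumed per-battery replacement threshold under load

def weak_batteries(loaded_voltages):
    # Flag every battery whose voltage under load is below the replacement
    # threshold; in a series string, one weak unit degrades the whole string.
    return [i for i, v in enumerate(loaded_voltages) if v < REPLACE_THRESHOLD_V]

# Example: one sagging battery in an otherwise healthy string.
string = [12.7, 12.6, 12.7, 11.9, 12.6]
print(weak_batteries(string))   # [3] -> this battery would be flagged for replacement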

Upon validation of the findings, we mobilized 160 replacement
batteries from Orange County to our datacenter and scheduled a
three-hour emergency maintenance window starting at 7:15 PM PDT in
order to replace the batteries and perform the load transfer back to
street power. The evening's emergency maintenance window was
completed successfully at approximately 10:00 PM PDT, and all
critical systems were again checked and confirmed to be operating at
100%.

We sincerely apologize for the inconvenience yesterday's event caused
you. We want to assure you that we spare no expense when it comes to
designing, deploying and maintaining our datacenter systems in order
to meet the industry's highest levels of reliability which our
customers have come to expect. If you would like to receive any more
detailed information regarding this matter, or would like a detailed
layout of our power infrastructure, please let us know. We are here to
be of assistance.

We want to thank all of our customers for their continued support
while we worked together to mitigate this critical event.

Sincerely,

Alessandra M. Carrasco
Chief Executive Officer
American Internet Services

[/quote]

