Thursday, March 19, 2009

Network Maintenance Window Tonight

    We will be returning our network systems to normal, fully redundant
operating status tonight. Once this is done, we will test fail-over and
fault tolerance. This testing may cause brief (1 to 15 second) pauses on
the network. The purpose is to ensure that all systems perform as
intended.

After the power incident yesterday, the M5Hosting "PremiumNet"
network systems performed in an unexpected manner. All systems had
been tested previously and performed fail-over flawlessly. Tonight's
testing is to confirm that this is still the case. Yesterday's
unexpected behavior occurred between our border routers and the AIS
routers facing us when power was restored. Testing the fail-over
systems tonight will help determine whether the network issues were
caused by the effects of the abrupt power interruption on the network
devices, or by the configuration of those devices. Both M5 and AIS
network engineers will be on hand for this work.

The M5Hosting "ValueNet" was not affected once power was restored,
is running in normal operational status, and should not be affected
by tonight's testing.
The vast majority of our customers are on "PremiumNet". If you do
not know which network you are on, you are most likely on PremiumNet.

Window Start: Friday March 20, 2009 02:00 US/Pacific (GMT -7)
Window End: Friday March 20, 2009 04:00 US/Pacific (GMT -7)
Impact: Customers on M5Hosting's "PremiumNet" may notice a few very
brief interruptions, each a few seconds in length.
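
If you would like to watch for these pauses yourself during the
window, a simple once-per-second probe from a machine outside our
network is enough to catch interruptions in the 1 to 15 second range.
The sketch below is a minimal, hypothetical example (Python 3,
standard library only; it is not an M5 Hosting tool), and the address
and port are placeholders you would replace with one of your own
servers.

[code]
#!/usr/bin/env python3
"""Log brief connectivity pauses by probing a host once per second."""
import socket
import time
from datetime import datetime

HOST = "203.0.113.10"   # placeholder address; use one of your own servers
PORT = 80               # any TCP port that host answers on
TIMEOUT = 2.0           # seconds before a probe is considered failed

def probe() -> bool:
    """Return True if a TCP connection opens within TIMEOUT seconds."""
    try:
        with socket.create_connection((HOST, PORT), timeout=TIMEOUT):
            return True
    except OSError:
        return False

def main() -> None:
    outage_start = None
    while True:
        now = datetime.now()
        if not probe():
            if outage_start is None:
                outage_start = now
                print(f"{now:%H:%M:%S} probe failed; possible pause starting")
        elif outage_start is not None:
            gap = (now - outage_start).total_seconds()
            print(f"{now:%H:%M:%S} recovered after a pause of ~{gap:.0f}s")
            outage_start = None
        time.sleep(1)

if __name__ == "__main__":
    main()
[/code]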

As always, I want to hear from you regarding this email, and how
we are doing... or anything at all.

Post Incident Update - Facility Power

    I have attached the post-incident report from the data center
facility provider below. As one of our most important vendors, we stay
engaged with the plans, upgrades, and operational changes of American
Internet Services (AIS). Naturally, as such a critical supplier, we
scrutinize their people's actions, technology, and facility, especially
when it comes to service-impacting events. It is our job to provide the
best service possible to our customers, and that requires us to be a
tough and determined customer of AIS, on your behalf.

While power has been restored and the facility has resumed normal
operating status, this event is not "over" for us. M5Hosting is
evaluating our own response to the event, and how our processes,
systems, and technologies can be improved to mitigate the impact of
another service-affecting event. AIS will certainly be doing the
same... and we will follow their actions closely and remain engaged
and involved with them.

With that said, and acknowledging that we are talking about an
unplanned loss of power in an Internet data center, I am pleased with
AIS's response to the incident once it happened. All of M5Hosting's
technical staff and I were on site for up to 18 hours after the event
and observed their response to it first hand. It is clear that AIS's
response was well planned and well directed.

Please find their Post Incident Report attached below. A diagram
of the AIS power infrastructure for the affected facility can be found
at:
http://www.m5hosting.com/AIS_SDTC_Power_Diagram_l.jpg

As always, I'd like to hear from you about this email, and the
events and actions described in it... or anything at all.

Sincerely,

Michael J. McCafferty
Principal Engineer
M5 Hosting
mike[at]m5hosting.com
877-344-4678 x501

[quote]
Dear Valued Customer:

As a follow up to the power event which occurred on the morning of
March 18th, 2009 at the 9725 Scranton Data Center (SDTC), American
Internet Services has compiled the following post incident report for
our customer base. As always, our Account Relations and Management
team members are available to discuss specific customer issues or
concerns, while this report is intended to provide a comprehensive
overview of the event itself.

At approximately 08:15 AM PDT on March 18th, the SDTC datacenter suffered
a complete power failure for approximately 30 seconds while conducting
routine maintenance on the critical datacenter systems. The work that
was being performed is part of AIS' Standard Operating Procedure. This
procedure is in alignment with industry guidelines, and our commitment
to provide customers with the highest availability in data center
solutions. As we have informed our customers in the past, all critical
systems are tested bi-monthly by our team of mechanical engineers in
conjunction with our outside contractors under service agreements.
Standard maintenance is performed during normal business hours and is
carefully planned to incorporate the strictest test procedures to
ensure the success of the work performed. Our SOP incorporates
escalation processes and back out procedures in the unlikely event of
an alert or anomaly during the standard maintenance.

Regretfully, during our maintenance yesterday, we encountered a
mechanical failure. The Powerware 9515 UPS plant failed during the
transition of building load from street power to generator power.
Approximately 30 seconds after the failure of the UPS plant, our CTO,
Richard Sears, who was present for the maintenance, restored power to
the data center by manually moving the building to generator, quickly
isolated the failure to the UPS plant, reset all four UPS modules, and
brought all four UPS modules back online. He then moved the UPS
plant from bypass mode to normal operational mode.

At that time, senior management called to initiate the Emergency
Response Plan (ERP) and made a decision not to move the data center
back to street power until our mechanical engineers and external
contractors had an opportunity to perform diagnostics of all
datacenter systems to determine what caused the failure of the UPS
plant, as well as to test the general state of health of all critical
systems.

Within approximately 15 minutes of initiating the ERP, we had mobilized 18
Customer Service Engineers, 5 Networking Engineers, and Facilities and
HVAC teams to the datacenter, in an effort to assist our customers
with recovery. We also had UPS, battery and power experts from Eaton
Powerware, CPD and Emerson there to assist in the investigation of the
issue. As part of our emergency communication plan, all customers were
proactively contacted and informed of the situation and were provided
multiple progress updates throughout the day.

Upon review of the findings, it was determined that one of our
battery strings had failed, leaving the plant unable to hold the
system load once the UPS plant went fully to battery. This caused a
critically low battery voltage condition across the entire UPS plant,
and the plant protected itself by bypassing its system load to the
main bus. This occurred while the building was being transferred from
street power to generator power, so both main busses were dead. In
order to prevent a dead-head of the generator and utility systems, the
SEL electrical system has a failsafe that prevents the main breakers
from closing after the emergency breakers have been commanded to
close and there is power on the emergency bus. This condition
prevented us from closing the main breakers, although we were still
able to close the emergency breakers.

As with all of our critical datacenter systems, we have external
contractors under maintenance agreements to provide system
maintenance. JT Packard is responsible for system maintenance on our
entire UPS and battery plant at the SDTC datacenter. We rely on our
vendor to test each battery at specific intervals to determine if and
when our batteries are approaching the threshold that requires
replacement. JT Packard has been performing this system maintenance on
a regular basis for several years, and its most recent report
indicated 100% system health.

The result of the investigation, in the opinion of both Eaton and CPD,
who conducted their investigations independently, is that the battery
string in question failed due to bad batteries that were not
identified during the latest battery tests by JT Packard.

Upon validation of the findings, we mobilized 160 replacement
batteries from Orange County to our datacenter and proceeded to
schedule a three-hour Emergency Maintenance window starting at 7:15 PM
PDT in order to replace the batteries and perform the load transfer
back to street power. The evening's emergency maintenance window was
completed successfully at approximately 10:00 PM PDT, and all critical
systems were again checked and found to be operating at 100%.

We sincerely apologize for the inconvenience yesterday's event caused
you. We want to assure you that we spare no expense when it comes to
designing, deploying and maintaining our datacenter systems in order
to meet the industry's highest levels of reliability which our
customers have come to expect. If you would like to receive any more
detailed information regarding this matter, or would like a detailed
layout of our power infrastructure, please let us know. We are here to
be of assistance.

We want to thank all of our customers for their continued support
while we worked together to mitigate this critical event.

Sincerely,

Alessandra M. Carrasco
Chief Executive Officer
American Internet Services

[/quote]
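
For a rough sense of why a single failed battery string matters, the
sketch below is a purely illustrative calculation. Every figure in it
is a hypothetical placeholder rather than data from the AIS plant; it
only shows that when one of several parallel strings drops out, each
surviving string must carry more current, so a critically low battery
voltage is reached sooner.

[code]
# Illustrative only: every number is a hypothetical placeholder, not a
# figure from the SDTC UPS plant or its batteries.
LOAD_KW = 400.0               # assumed critical load on the UPS plant
STRING_VOLTAGE = 480.0        # assumed nominal DC voltage of one string
AMP_HOURS_PER_STRING = 200.0  # assumed usable capacity of each string

def amps_per_string(strings_available: int) -> float:
    """DC current each healthy string supplies for the assumed load."""
    total_amps = (LOAD_KW * 1000.0) / STRING_VOLTAGE
    return total_amps / strings_available

def minutes_on_battery(strings_available: int) -> float:
    """Very rough hold-up estimate: usable capacity / current draw."""
    return AMP_HOURS_PER_STRING / amps_per_string(strings_available) * 60.0

for n in (4, 3):  # all four strings healthy vs. one string failed
    print(f"{n} strings: {amps_per_string(n):6.0f} A per string, "
          f"~{minutes_on_battery(n):4.0f} min of hold-up")
[/code]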


Wednesday, March 18, 2009

Emergency Maintenance and Initial Network/Power Event Overview

Dear M5Hosting Customer,
As you may be aware, your server was rebooted today due to a momentary power loss at the data center facility (American Internet Services) from which M5Hosting provides your service. The root cause was a failed power system component during a scheduled generator test.

After power was restored, M5Hosting also had some network issues due to redundant network components behaving in an unexpected manner.

The failed power system components have been replaced, and power will be taken off generator and returned to "street" power tonight during an emergency maintenance window declared by the data center facility. This maintenance window will begin at 8:30pm US/Pacific Time. While the window has been declared to be 2 hours, the actual cutover will be quite short; the rest of the window allows for testing. This work is not expected to impact service at all.

I have pasted the most recent announcement from the data center facility below. We will also forward any subsequent announcements from them regarding this incident.

While the network and facility are fully functional and stable at this time, there is some work to do to return all systems to full redundancy and normal operating status. The maintenance discussed below is the first step; it will return the power system to normal. Our network work will take place during a late-night maintenance window to be announced later, scheduled to mitigate any risk of an impact to service.

If you are still experiencing any issues with your server or other services which we provide for you, please contact us at 877-344-4678 or submit a support ticket so we can take care of your needs immediately.

Sincerely,
Mike

The most recent Data Center facility announcement from AIS:

To Our Valued Customers,

Following this morning's facility issue at the 9725 Scranton Data Center, AIS Facilities and Engineering teams are announcing a 2 hour emergency maintenance window which will begin at 17:15 and conclude at or before 20:15.

During this window, main power for the facility will be returned from generator to normal operating mode. Throughout the day, engineering and mechanical teams have taken every precaution and performed considerable diagnostics to ensure that all systems are once again operating at 100% effectiveness. Even with these precautions, all maintenance windows involving critical power systems still carry the risk of causing a power event during the maintenance operation, and thus we are bringing this important issue to your attention at this time. While maintaining a cautious posture, we at AIS do not anticipate any issues during this window.

During the maintenance window, the facility will remain fully staffed by 18 customer service engineers, 3 different mechanical engineering contractors, and AIS senior management personnel.

We are in the process of producing a detailed post-incident report, which will be made available during business hours tomorrow, Thursday March 19th.

A follow-up communication will be sent upon completion of the maintenance window. We apologize for any inconvenience this issue may have caused.

AIS Support Services

There was a short (less than a minute) power interruption at the data facility which houses the majority of M5Hosting customer servers. This caused all of the servers to reboot. It is too early to know the root cause. The majority of customer servers came back up right away, as expected; however, some did not. We have our entire technical staff at the data center working on the issues as quickly as possible. A complete post-mortem will be available as soon as possible.

Thank you for your patience and trust in us.

Sincerely,
Mike

UPDATE 13:45 PDT: There has been some packet loss, which may be related to the power event this morning.

UPDATE 15:30 PDT: The network has been stable for a few hours. Only a few servers with hardware failures (e.g., failed power supplies or hard disks) remain down. Tech staff are addressing them as quickly as possible.

UPDATE 17:34 PDT: The failed power system components have been replaced. The data center will be transitioning off backup power and back to SDG&E power at about 18:00 PDT. Tech staff remains on site to address any issues that may arise.

Wednesday, March 11, 2009

Packet Loss Event

This morning at 7:35am PDT, some customers reported up to 33% packet loss. The issue was resolved within 18 minutes. The root cause was a Denial of Service attack against a customer. The offending system was removed from the network and the DoS stopped.

Our Network Engineers have identified the resources which were DoS'd and the technical issues which occurred. We will develop a plan to mitigate the risk of the same events happening again.
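
For those curious how this kind of follow-up analysis generally works, here is a minimal, hypothetical sketch (it is not M5's actual tooling). It assumes flow records have been exported to a CSV file with "dst_ip" and "bytes" columns, and simply tallies traffic per destination so the most heavily targeted system stands out.

[code]
#!/usr/bin/env python3
"""Hypothetical example: find the most-targeted destinations in flow records.

Assumes a CSV export with 'dst_ip' and 'bytes' columns; this is an
illustration of the general approach, not M5 Hosting's actual tooling.
"""
import csv
import sys
from collections import Counter

def top_destinations(path: str, limit: int = 5) -> list[tuple[str, int]]:
    """Sum bytes per destination IP and return the busiest destinations."""
    totals: Counter[str] = Counter()
    with open(path, newline="") as fh:
        for row in csv.DictReader(fh):
            totals[row["dst_ip"]] += int(row["bytes"])
    return totals.most_common(limit)

if __name__ == "__main__":
    for ip, nbytes in top_destinations(sys.argv[1]):
        print(f"{ip:15s} {nbytes:>12d} bytes")
[/code]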