M5Hosting System Status: 2009

Thursday, August 27, 2009

Brief Interruptions on PremiumNet

We have received reports of intermittent network connectivity to some hosts on PremiumNet (ValueNet is not affected). We have identified the cause and implemented a fix, and we are now analyzing what triggered the problem.

There are currently no known connectivity issues remaining. A root cause explanation will be sent to all customers who have opened a support ticket regarding this issue.

Thursday, March 19, 2009

Network Maintenance Window Tonight

    We will be returning our network systems to normal fully redundant 
operating status tonight. Once this is done we will test the fail-over 
and fault-tolerance. This may cause brief (1 to 15 second) pauses on 
the network. The purpose is to ensure that all systems perform as 
intended.

   After the power incident yesterday, the M5Hosting "PremiumNet"
network systems performed in an unexpected manner. All systems had
previously been tested and performed fail-over flawlessly. Tonight's
testing is to make sure that this is still the case. The unexpected
behavior occurred when power was restored, between our border routers
and the AIS routers that face us. Testing the fail-over systems
tonight will help determine whether the network issues were due to
the effects of the abrupt power interruption on the network devices
or to the configuration of those devices. Both M5 and AIS Network
Engineers will be on hand for this work.

   The M5Hosting "ValueNet" was not affected once power was restored.
It is running in normal operational status and should not be affected
by tonight's testing.

   The vast majority of our customers are on "PremiumNet". If you do
not know which you are on, you are most likely on PremiumNet.

Window Start: Friday March 20, 2009 02:00 US/Pacific (GMT -7)
Window End: Friday March 20, 2009 04:00 US/Pacific (GMT -7)
Duration: Customers on M5Hosting's "PremiumNet" may notice a few very 
brief interruptions of a few seconds in length.

   As always, I want to hear from you regarding this email, and how 
we are doing... or anything at all.

Post Incident Update - Facility Power

    I have included the post-incident report from the data center
facility provider below. American Internet Services (AIS) is one of
our most important vendors, and we stay engaged with their plans,
upgrades, and operational changes. Because they are such a critical
supplier, we scrutinize their people's actions, technology, and
facility, especially when it comes to service-impacting events. It is
our job to provide the best service possible to our customers, and
that requires us to be a tough and determined customer of AIS, on your
behalf.

   While power has been restored and the facility has resumed normal 
operating status, this event is not "over" for us. M5Hosting is 
evaluating our own response to the event, and how our processes, 
systems and technologies can be improved to mitigate the impact of 
another service-affecting event. AIS will certainly be doing the
same... and we will follow their actions closely and remain engaged 
and involved with them.

   With this said, and with due respect that we are talking about an 
unplanned loss of power in an Internet Data Center, I am pleased with 
AIS's response to the incident once it happened. All of M5Hosting's 
technical staff and I were on site for up to 18 hours after the event
and observed their response firsthand. It is clear that AIS's
response was well directed and well planned.

   Please find their Post Incident Report attached below. A diagram 
of the AIS power infrastructure for the affected facility can be found 
at:
http://www.m5hosting.com/AIS_SDTC_Power_Diagram_l.jpg

   As always, I'd like to hear from you about this email, and the 
events and actions described in it... or anything at all.

Sincerely,

Michael J. McCafferty
Principal Engineer
M5 Hosting
mike[at]m5hosting.com
877-344-4678 x501

[quote]
Dear Valued Customer:

As a follow up to the power event which occurred on the morning of  
March 18th, 2009 at the 9725 Scranton Data Center (SDTC), American  
Internet Services has compiled the following post incident report for  
our customer base.  As always, our Account Relations and Management  
team members are available to discuss specific customer issues or  
concerns, while this report is intended to provide a comprehensive
overview of the event itself.

At approximately 08:15AM PDT, March 18th, the SDTC datacenter suffered  
a complete power failure for approximately 30 seconds while conducting  
routine maintenance to the critical datacenter systems. The work that  
was being performed is part of AIS' Standard Operating Procedure. This
procedure is in alignment with industry guidelines, and our commitment  
to provide customers with the highest availability in data center  
solutions. As we have informed our customers in the past, all critical  
systems are tested bi-monthly by our team of mechanical engineers in  
conjunction with our outside contractors under service agreements.  
Standard maintenance is performed during normal business hours and is  
carefully planned to incorporate the strictest test procedures to  
ensure the success of the work performed. Our SOP incorporates  
escalation processes and back out procedures in the unlikely event of  
an alert or anomaly during the standard maintenance.

Regretfully, during our maintenance yesterday, we encountered a  
mechanical failure. The Powerware 9515 UPS plant failed during the  
transition of building load from street power to generator power.  
Approximately 30 seconds after the UPS plant failed, our CTO,
Richard Sears, who was present for the maintenance, restored power to
the data center by manually moving the building to generator. He
quickly isolated the failure to the UPS plant, reset all four UPS
modules, and brought them back online. He then moved the UPS plant
from bypass mode to normal operational mode.

At that time, senior management initiated the Emergency Response Plan
(ERP) and decided not to move the data center back to street power
until our mechanical engineers and external contractors had an
opportunity to perform diagnostics on all datacenter systems, both to
determine what caused the UPS plant to fail and to test the general
state of health of all critical systems.

Within approximately 15 minutes of initiating ERP, we had mobilized 18  
Customer Service Engineers, 5 Networking Engineers and Facilities and  
HVAC teams to the datacenter, in an effort to assist our customers
with recovery. We also had UPS, battery and power experts from Eaton  
Powerware, CPD and Emerson there to assist in the investigation of the  
issue. As part of our emergency communication plan, all customers were  
proactively contacted and informed of the situation and were provided  
multiple progress updates throughout the day.

Upon review of the findings, it was determined that one of our
battery strings failed, leaving the batteries unable to hold
system load once the UPS plant went fully to battery. This caused a
critically low battery voltage condition to the entire UPS plant and  
the plant protected itself by bypassing its system load to the main  
bus. This was during the time the building was being transferred from  
street power to generator power, so the main busses were both dead. In  
order to prevent a dead-head of the generator and utility systems, the  
SEL electrical system has a failsafe that prevents the main breakers  
from closing after the emergency breakers have been commanded to  
close, and there is power on the emergency bus. This condition  
prevented us from closing the main breakers, while we were still able  
to close the emergency breakers.

As with all of our critical datacenter systems, we have external  
contractors under maintenance agreements to provide system  
maintenance.  JT Packard is responsible for system maintenance on our  
entire UPS and battery plant at SDTC datacenter. We rely on our vendor  
to test each battery at specific intervals to determine if and when  
our batteries are approaching the threshold that requires replacement.  
JT Packard has been performing this system maintenance on a regular
basis for several years now, and its most recent report showed 100%
System Health.

The result of the investigation, in the opinion of both Eaton and CPD,
who conducted their investigations separately, is that the
battery string in question failed due to bad batteries that were not
identified during the latest battery tests by JT Packard.

Upon validation of the findings, we mobilized 160 replacement  
batteries from Orange County to our datacenter and proceeded to  
schedule a three hour Emergency Maintenance window to start at 7:15PM  
PDT in order to replace the batteries and perform the load transfer  
back to street power. The evening's emergency maintenance window was  
completed successfully at approximately 10:00PM PDT and all critical
systems were again checked and diagnosed to be operating at 100%.

We sincerely apologize for the inconvenience yesterday's event caused
you. We want to assure you that we spare no expense when it comes to
designing, deploying and maintaining our datacenter systems in order
to meet the industry's highest levels of reliability which our
customers have come to expect. If you would like to receive any more  
detailed information regarding this matter, or would like a detailed  
layout of our power infrastructure, please let us know. We are here to  
be of assistance.

We want to thank all of our customers for their continued support  
while we worked together to mitigate this critical event.

Sincerely,

Alessandra M. Carrasco
Chief Executive Officer
American Internet Services

[/quote]

Wednesday, March 18, 2009

Emergency Maintenance and Initial Network/Power Event Overview

Dear M5Hosting Customer,
As you may be aware, your server was rebooted today due to a momentary power loss at the data center facility (American Internet Services) from which M5Hosting provides your service. The root cause was a failed power system component during a scheduled generator test.

After the power was restored, M5Hosting also had some network issues due to some redundant network components behaving in an unexpected manner.

The failed power system components have been replaced, and power will be moved off generator and back onto "street" power tonight during an emergency maintenance window declared by the data center facility. This maintenance window will begin at 8:30pm US/Pacific Time. While the window has been declared to be 2 hours, the actual cutover will be quite short; the rest of the window allows for testing. This work is not expected to impact service at all.

I have pasted the most recent announcement from the data center facility below. We will also forward any subsequent announcements from them regarding this incident.

While the network and facility are fully functional and stable at this time, there is some work to do to return all systems to full redundancy and normal operating status. The maintenance discussed below will be the first step and will return the power system to normal operation. Our network work will take place during a late-night maintenance window, to be announced later and scheduled to minimize any risk of service impact.

If you are still experiencing any issues with your server or other services which we provide for you, please contact us at 877-344-4678 or submit a support ticket so we can take care of your needs immediately.

Sincerely,
Mike

The most recent Data Center facility announcement from AIS:

To Our Valued Customers,

Following this morning's facility issue at the 9725 Scranton Data Center, AIS Facilities and Engineering teams are announcing a 2 hour emergency maintenance window which will begin at 17:15 and which will conclude at or before 20:15.

During this window, main power for the facility will be returned from generator to normal operating mode. Throughout the day, engineering and mechanical teams have taken every precaution and performed considerable diagnostics to ensure that all systems are once again operating at 100% effectiveness. Even following these precautions, all maintenance windows involving critical power systems still carry the risk of causing a power event during the maintenance operation; thus, we are bringing this important issue to your attention at this time. While maintaining a cautious posture, we at AIS do not anticipate any issues during this window.

During the maintenance window, the facility will remain fully staffed by 18 customer service engineers, 3 different mechanical engineering contractors, and AIS senior management personnel.

We are in the process of producing a detailed post incident report which will be made available during business hours tomorrow, Thursday March 19th.

A follow up communication will be sent upon completion of the maintenance window. We apologize for any inconvenience this issue may have caused.

AIS Support Services

There was a short (less than a minute) power interruption at the data facility which houses the majority of M5Hosting customer servers. This caused all of the servers to reboot.
It is too early to know the root cause. The majority of customer servers came back up right away, as expected. However, some did not come back up as expected.
We have our entire technical staff at the data center working on the issues as quickly as possible.
There will be a complete post-mortem available as soon as possible.

Thank you for your patience and trust in us.

Sincerely,
Mike

UPDATE 13:45 PDT: There has been some packet loss, which may be related to the power event this morning.

UPDATE 15:30 PDT: The network has been stable for a few hours. Only a few servers with hardware failures (i.e., failed power supplies or hard disks) remain down. Tech staff are addressing them as quickly as possible.

UPDATE 17:34 PDT: The failed power system components have been replaced. The data center will be transitioning off backup power and back to SDG&E power at about 18:00 PDT. Tech staff remains on site to address any issues that may arise.

Wednesday, March 11, 2009

Packet Loss Event

This morning at 7:35am PDT, some customers reported up to 33% packet loss. The issue was resolved within 18 minutes. The root cause was a Denial of Service attack against a customer. The offending system was removed from the network and the DoS stopped.

Our Network Engineers have identified the resources which were DoS'd and the technical issues which occurred. We will develop a plan to mitigate the risk of the same events happening again.
