Sunday, January 10, 2010

PremiumNet issues

We are experiencing packet loss between PremiumNet and some remote networks on the Internet. This involves traffic that ingresses or egresses AmericanIS's network via Level 3 Communications. However, our ValueNet connection to Level 3 does not currently appear to be affected.

We are in constant communication with AmericanIS and will update you as soon as we have additional information.

Thank you for your patience.

Update - 1/14/10 00:52:
The following post-incident follow-up will explain the root cause, what was done and what is planned to prevent this from happening again.

On Sunday, January 10th, between 7:05pm and 9:40pm, some customers had difficulty connecting to certain applications on our PremiumNet network from certain locations. Not all customers, applications, or locations were affected. Our ValueNet network was not affected.

All devices were pingable in all directions from all locations we were able to test from outside of our network. Our ping-based monitoring systems did not show packet loss, latency or jitter. However, some customers had difficulty when both of two conditions were met: 1) The data path either ingressed or egressed AIS via their 10 GigE connection with Level 3. 2) The packets were "full" (i.e., 1500 bytes). Small packets (e.g., VoIP, DNS, games) did not seem to be affected.

Pings are small ICMP packets, and pings with the typical default packet sizes did not reveal the problem. However, all customer trouble reports were from customers who had difficulty only through paths that included Level 3 at the last hop outside of our upstream provider American Internet Services (AIS). Small web pages, DNS lookups, VoIP, games, and other applications which use small packets did not experience trouble. However, video conferencing, file downloads, "heavy" web pages, email (especially with attachments), and other large-packet traffic which transited Level 3 experienced dropped connections and other bad behavior.
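To illustrate why default-size pings missed this, here is a minimal sketch of how a "full" packet test could be constructed. It assumes IPv4 with a 20-byte header (no options), an 8-byte ICMP echo header, and the Linux iputils ping flags -c (count), -s (payload size) and -M do (prohibit fragmentation). The function names and the host name in the usage note are hypothetical, and this is not a description of our actual monitoring.

```python
from typing import List

IPV4_HEADER_BYTES = 20    # assumption: IPv4 header without options
ICMP_HEADER_BYTES = 8     # ICMP echo request header
DEFAULT_PING_PAYLOAD = 56 # default payload size of Linux iputils ping


def packet_bytes_for(payload_bytes: int) -> int:
    """Size of the IPv4 packet produced by an ICMP echo with this payload."""
    return payload_bytes + ICMP_HEADER_BYTES + IPV4_HEADER_BYTES


def icmp_payload_for(packet_bytes: int) -> int:
    """Payload needed so the whole IPv4 packet is packet_bytes long."""
    payload = packet_bytes - IPV4_HEADER_BYTES - ICMP_HEADER_BYTES
    if payload < 0:
        raise ValueError("packet too small to hold the IP and ICMP headers")
    return payload


def full_size_ping_command(host: str, packet_bytes: int = 1500,
                           count: int = 20) -> List[str]:
    """Build a ping command that sends unfragmented packets of packet_bytes."""
    return [
        "ping",
        "-c", str(count),                            # number of echo requests
        "-M", "do",                                  # set DF: do not fragment
        "-s", str(icmp_payload_for(packet_bytes)),   # ICMP payload size
        host,
    ]


if __name__ == "__main__":
    # A default ping is far smaller than the 1500-byte packets that failed.
    print("default ping packet:", packet_bytes_for(DEFAULT_PING_PAYLOAD), "bytes")
    print("payload for a 1500-byte packet:", icmp_payload_for(1500), "bytes")
```

The resulting list, for example full_size_ping_command("host.example"), could be passed to subprocess.run from a monitoring job. Comparing loss between default-size and full-size probes along the same path is what would expose a size-dependent fault like this one.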

We buy our "PremiumNet" bandwidth from American Internet Services (AIS), which also operates our colocation facility. We have two redundant GigE connections to them for PremiumNet. Their network is built to be completely redundant and fault tolerant, just as we would build it. The failure in this case was not in any component on our network, on the AIS network, or in between. According to AIS, the failure was in a relatively new 10 GigE connection between AIS and Level 3. This new 10 GigE connection is an addition to two existing 1 GigE connections between AIS and Level 3 at this facility. Eventually, the two 1 GigE connections are to be replaced with two 10 GigE connections. The AIS network at this site also has multiple other 10 GigE connections for transport to two other AIS facilities and then on to multiple other tier-1 carriers. The transport links to other facilities are *not* provided by Level 3.

According to information from AIS, poor optical performance on the 10 GigE fiber optic connection to Level 3 was the root cause of the issue on Sunday night. I have spoken directly with AIS engineers and have asked my questions both as an engineer and on your behalf as principal of M5Hosting. Most of my questions have been answered. I believe that the post-incident handling of this event has been correct and appropriate. However, I am not altogether happy with the duration of the event, or with the fact that we were unable to work around it for PremiumNet customers.

One of our projects for the new year is to change the way we connect with AIS and still provide top quality, route-optimized bandwidth. While AIS has always provided a solid and reliable network, we feel we need to address two key issues concerning our connectivity through AIS:

1) The cost. The cost of our "pure" AIS bandwidth option (PremiumNet) is far too high to be competitive with other "Premium" solutions. Our growth over the last several years gives us the scale needed to buy better and engineer bigger, as we have done with our ValueNet option. We have fantastic announcements to make within 2 months for our PremiumNet customers.

2) The ability to route around an AIS problem, if needed. We can drop any other network provider on our "ValueNet" offering without any significant effect on service. As great as AIS usually is, we need to be able to drop our peering session with them as well in the event of trouble. Because AIS is our sole route in and out for PremiumNet, we cannot route around AIS at this time. This will change very soon. An announcement will be made when this improvement has been made.

Our ValueNet network is directly connected to Level 3. We were not able to reproduce the same poor behavior on our own Level 3 connection, but we had our finger on "the button". About 20 minutes after the issue began, we were ready to drop our direct Level 3 connection to our ValueNet network in the event that our connection to Level 3 began to act up too. Ultimately, AIS did de-peer Level 3 via the 10 GigE connection to resolve the issue, but not until 2 hours and 40 minutes after the problem began. The additional 2+ hours it took AIS to take this action is the part that I am not satisfied with. If this event had happened after our planned network changes over the next month or so, we would have dropped our connection with AIS within 20 minutes, and this event would have lasted 20 minutes total for our PremiumNet customers. Work has already been planned which will enable us to drop AIS if needed during an event on their network.
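For reference, the arithmetic behind those durations can be sketched as follows. This assumes the 7:00pm start and 9:40pm resolution times from the AIS timeline quoted later in this post, and treats the 20-minute figure as our target response time; the variable names are only illustrative.

```python
from datetime import datetime, timedelta

# Times taken from the AIS timeline for Sunday, 1/10/10 (local time).
problem_began = datetime(2010, 1, 10, 19, 0)    # approximately 7:00pm
link_removed = datetime(2010, 1, 10, 21, 40)    # 9:40pm

# Assumption: the time within which we would have dropped AIS ourselves.
target_response = timedelta(minutes=20)

actual_duration = link_removed - problem_began
additional_time = actual_duration - target_response

if __name__ == "__main__":
    print("actual duration:", actual_duration)   # time until AIS de-peered
    print("additional time:", additional_time)   # time beyond the 20-minute target
```

Using the 7:05pm time at which customers first noticed trouble instead would shorten both figures by five minutes.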

This work will begin later this month. The related hardware purchases were made on 12/31/09, but the hardware has not yet been received. Additional carrier connectivity is on the way as well. We will make more detailed announcements as we have details to report.

The most recent and most technically detailed message from AIS regarding this event follows:

To Our Valued Clients,

In a continued effort to provide our clients with the best state-of-the-art data center infrastructure, this is to notify you that we will open a maintenance window for Level3 re-peering on Wednesday, 1/13/10 at 10:00pm.

Type of Maintenance: Level3 re-peering

Location: All

Purpose: AIS will be reintroducing a Level3 peering link, which was removed by AIS Network Engineers on 1/10/10 (event #50289)

Window Start: Wednesday, 1/13/10 - 10:00pm PST
Window End: Thursday, 1/14/10 - 12:00am PST

Current Status: This scheduled maintenance window is now active.

Service Impact: As with any network maintenance, while highly unlikely, there is a possibility that something unexpected may occur during the maintenance process. Please rest assured that we will immediately address such occurrences, and will send a notification upon the successful completion of this maintenance.

Schedule: The work is scheduled to begin at 10:00pm on Wednesday, January 13th and should be completed within 2 hours time. Should additional time be required, notice will be provided and the maintenance window will be expanded. A notice will be sent at the completion of the maintenance window.

Testing & Planning: All testing and planning being conducted during this window is part of a pre-defined checklist designed by the AIS Network Engineering team.

Regression Planning:
The AIS Network Engineering team will be on-site managing this maintenance window. Should any issues arise, all equipment will be placed back to previous configuration and the maintenance will be postponed until the issue is resolved.

Updated Event #50289 Root Cause Analysis: Between 7:00pm - 9:40pm on Sunday, 1/10/10, some clients may have noticed intermittent high latency and packet loss to certain Level3-connected sites. The ultimate root cause of this event, as provided by Level3, was optical degradation of the fiber circuit as a result of fluctuating light levels at the SDTC end of the 10Gig optical hand-off.

Updated Event #50289 Timeline: At approximately 7:00pm on Sunday, 1/10/10, AIS Client Support began to notice a pattern of network routing anomalies, which introduced intermittent latency for some clients. After testing, this issue was escalated to the Engineering department at 7:43pm, who isolated the fault to Level3. AIS escalated this issue to the Level3 engineering team, who were able to reproduce intermittent packet loss across certain Level3 transit. After additional troubleshooting, the issue was isolated to one 10Gbps fiber hand-off that was deemed physically faulty. At 9:40pm, the unstable Level3 peering link was removed from the bandwidth blend, clearing all issues.

Resolution: Level3 felt it was necessary to put the fiber circuit through the same network testing process that was conducted in the initial provisioning phase of this circuit. On-site engineers cleaned and replaced the fiber connectors, then remeasured and saw that the light levels were now stable. Level3 then connected the optics to a test set and tested throughput on the connection, producing no errors. AIS and Level3 independently tested network connectivity to verify that all latency and mtu-related packet loss has disappeared.