
Xignite DNS Incident – June 12, 2014

On June 12, 2014, many Xignite customers were unable to resolve xignite.com domain names correctly due to a DNS issue, resulting in API request failures even though Xignite's production infrastructure was fully operational. This is my first-hand account of what happened. All times below are Pacific.


 

What happened?

The early morning of Thursday June 12, 2014 started off quiet for the on-duty Xignite operations staff out of our Shanghai office. As of 02:40, all our continuous checks against Xignite APIs were passing successfully. But we began seeing alerts from Pingdom:

From: alert@pingdom.com [mailto:alert@pingdom.com]
Sent: Thursday, June 12, 2014 02:39 AM
To: Matt Purkeypile
Subject: Pingdom Alert: incident #1826 is open for XigniteAnalysts (www.xignite.com)

Hi Matt Purkeypile,

This is a notification sent by Pingdom. Incident 1826, check 956681 XigniteAnalysts (www.xignite.com), is currently open.

Figure 1. First Pingdom alert

That incident quickly resolved itself, meaning Pingdom was once again able to call that API successfully, but it was followed by another alert on a different API (XigniteFundamentals, specifically). These alerts escalated to pages to the on-duty operations staff, since a third-party monitor detecting that an API is down is a potentially serious incident. Thus began a pattern of Pingdom issuing alerts and pages about various API failures, only to close them out shortly thereafter. During this time, Xignite's internal monitoring reported all systems as operating normally.

The on-duty operations staff immediately began investigating but was unable to diagnose the cause. Recognizing the potential customer impact, they issued an "all hands on deck" alert and woke up the senior technical staff, including myself, at 03:22. I quickly logged into the Xignite network but was unable to reproduce any problems. I checked our internal monitoring, and it was reporting all systems as normal. Confused as to why Pingdom was still reporting issues, I verified that our Pingdom monitoring was set up correctly and using valid authentication credentials. I then disconnected from the Xignite network to replicate Pingdom's vantage point (and how a customer would call our APIs from an external location).

Then I was able to reproduce the problem. Attempts to navigate to www.xignite.com in my browser reached a generic web page displaying an error message (which we later learned was a Network Solutions parking page); in fact, every xignite.com request was being redirected. Interestingly, some of our staff who were not connected to the Xignite network were seeing xignite.com resolve correctly and were successfully making API requests. By disconnecting from the Xignite network, we were no longer relying on an internal Xignite DNS server but rather on public DNS servers.

Location 1 (from another employee's home)

C:\>nslookup www.xignite.com ns.above.net
Server: ns.above.net
Address: 207.126.96.162

Name: xignite.com
Address: 64.124.25.100
Aliases: www.xignite.com

Location 2 (from my home)

C:\>nslookup www.xignite.com ns.above.net
Server: ns1432.ztomy.com
Address: 208.91.197.132

Name: www.xignite.com.prodsj.xignite.com
Address: 208.91.197.132

Figure 2. nslookup commands run by Xignite staff at different locations, a few seconds apart.

So it appeared that some public DNS servers were correctly resolving the xignite.com domain while others were not. I immediately filed an urgent ticket with Zayo Group (our DNS provider) at 03:44, which, coincidentally, is one minute before Zayo's formal incident report later said they first began to receive alarms and customer reports of a problem. I issued a company-wide internal alert at 03:48, but since this was a DNS issue, not everyone received it. Because Zayo's response times were often measured in days, I followed up and called Zayo's Network Operations Center (NOC) at 04:02 to escalate the incident. While we waited for a response from Zayo, we continued our investigation. I first sought to confirm with Network Solutions (our domain registrar) that our xignite.com domain name was still active and properly registered.

We had received several e-mails from Network Solutions over the past few weeks, including one at 02:03 that morning, notifying us of an unrelated Xignite domain name that was about to expire. (In our experience, Network Solutions always aggressively notifies domain name owners when a domain is about to expire.) Sure enough, xignite.com was still active and defined to use Zayo's name servers ns.above.net and ns3.above.net. The next step was to evaluate these name servers, which is what uncovered the problem. Both ns.above.net and ns3.above.net were resolving to the same generic error page that www.xignite.com was returning.

In other words, the hosts defined as name servers for xignite.com were resolving to a static web page. (It turns out, coincidentally, that Zayo also uses Network Solutions as their domain registrar, at least for the above.net domain name.) At this point, it became fairly clear what was happening, but not exactly why. Our Xignite domains were pointing to name servers that were not name servers. So DNS servers that had our domain name resolutions cached would continue to resolve xignite.com successfully, but DNS servers that queried these name servers (or the DNS servers that queried those DNS servers) would resolve our domain names incorrectly (and cache those incorrect resolutions).
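For readers who want to see what that kind of check looks like, the sketch below shows the general approach from a command prompt outside our network. The annotations reflect what we observed at the time (208.91.197.132 being the Network Solutions parking address shown in Figure 2); running the same commands today will of course return different results.

C:\>nslookup -type=NS xignite.com
(lists the name servers the domain is delegated to; at the time, ns.above.net and ns3.above.net)

C:\>nslookup ns.above.net
(rather than a working name server, this resolved to 208.91.197.132, the parking page address)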

 

Fixing the problem

We decided to act immediately rather than wait for Zayo because the US markets were opening in about 2 hours, and our prior experience with Zayo tickets included waiting days for a response, DNS entries being updated incorrectly, and even unrelated DNS entries being modified. We chose to move forward with replacing Zayo with Amazon Route 53 for DNS service, something we had been researching, testing, and experimenting with for the past several weeks. I retrieved a zone file from Zayo dated February 2014 and loaded it into Amazon Route 53.

Figure 3. Amazon Route 53 showing several DNS entries, post zone file import.
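For readers who want to script a similar migration, a minimal sketch using the AWS CLI looks roughly like the following, assuming the CLI is installed and configured with appropriate credentials. The zone ID and file names are placeholders, and this is only one way to load records into Route 53.

C:\>aws route53 create-hosted-zone --name xignite.com --caller-reference dns-migration-20140612
(returns the new hosted zone ID and the four Route 53 name servers assigned to it)

C:\>aws route53 change-resource-record-sets --hosted-zone-id <zone-id> --change-batch file://records.json
(records.json is a batch of CREATE changes built from the exported zone file)

C:\>aws route53 list-resource-record-sets --hosted-zone-id <zone-id>
(spot-check that the imported records match the zone file)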

I imported the zone file, and a manual inspection showed that it looked complete and correct. I also verified some of our most heavily used domains, such as www.xignite.com and globalquotes.xignite.com, against these Amazon name servers. By 04:30, with no ETA from Zayo and no details other than "major DNS" problems, we made the decision to switch to Amazon Route 53 in production. Amazon Route 53 also gave us four name servers rather than the two that Zayo provided, for greater redundancy.
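To give a rough idea of what that verification looks like, queries can be pointed directly at the assigned Route 53 name servers before the registrar delegation is switched. The name server below is a placeholder; Route 53 assigns four specific awsdns hosts to each zone.

C:\>nslookup www.xignite.com ns-123.awsdns-45.com
C:\>nslookup globalquotes.xignite.com ns-123.awsdns-45.com
(each answer should match the addresses in the exported zone file before the delegation is changed)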

So through Network Solutions, I redefined our Xignite domain name servers to use Amazon Route 53 rather than the non-functioning Zayo name servers. After rolling out this change, we continued extensive testing to verify that all Xignite domain mappings were correct. At this point we had done all we could to fix the problem; Xignite had addressed the issue within about 2 hours of the problem surfacing.

Unfortunately, the time to live (TTL) was set at 48 hours, which is why Xignite domain name resolutions for some customers began to reflect our changes fairly quickly, while for others it took up to two days. Our Pingdom monitors continued to detect DNS servers in some parts of the world that failed to properly resolve Xignite domains until about 03:00 on June 14, roughly 48 hours after the onset of the incident, at which point Pingdom confirmed that everything was back to normal.
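One way to watch that propagation is to check the remaining TTL a given resolver is still holding for a record. A rough sketch on Windows, using Google's public resolver at 8.8.8.8 purely as an example of one resolver among many:

C:\>nslookup -debug www.xignite.com 8.8.8.8
(the debug output includes a ttl value for each record; a resolver that cached the old answer keeps serving it until that countdown reaches zero)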

Figure 4. DNS problems as detected by Pingdom. Available at http://status.xignite.com/.

 

Following up with Zayo

By 06:19 there was still no update from Zayo, so I called them again, as it had been more than two hours since my last call. I didn't get much of an update other than that they were still working on it. A few minutes later I received an e-mail update saying "Issue was believed corrected but issue still exists."

From: Salesforce Notification [mailto:salesforce@zayo.com]
Sent: Thursday, June 12, 2014 06:27 AM
LATEST CASE COMMENT

Comments
Hello, We are still working on the resolution for the DNS issue. Issue was believed corrected but issue still exists. Updates to follow as soon as they are received. Thank You

Figure 5. Update from Zayo.

At 07:01, I received another email indicating that the issue had been resolved:

From: Salesforce Notification [mailto:salesforce@zayo.com]
Sent: Thursday, June 12, 2014 07:01 AM
LATEST CASE COMMENT

Comments
Dear valued customers, Our DNS admins have identified the problem, and services should be restored at this time. If you are still having an issue, please call the NOC at 866-236-2824.

Figure 6. Update from Zayo.

Of course, because fixing the issue required the corrected records to propagate to all public DNS servers, it was a bit premature to declare that services had been restored at this time. It was fairly simple to confirm that resolution failures were still occurring, and I called Zayo to inform them of this at 07:26. For Xignite customers, however, this didn't matter, as we had already migrated to Amazon Route 53 over two hours earlier.

 

Zayo’s post mortem

I then started pushing Zayo for details on what happened. After all, this had an impact on our entire business and I had been up for hours working on the issue. We had hypothesized among ourselves that perhaps it was DNS hijacking or some sort of attack. What we were told next blew our minds:

From: Stephen Vaca
Sent: Thursday, June 12, 2014 08:46 AM
To: Matt Purkeypile, …
Subject: Re: There has been an update to your case – TTN-0000473048 [ ref:_00D6079Qk._50060avq3w:ref ]

Matt,

To clarify the domain above.net did expire and we renewed this AM once discovered. Initially we were unable to access the account, but have since gained full control of the account with Network Solutions for any future notifications and have full control for any move/add/change activity.
I apologize for the negative impact to your business, but we have limited control to update DNS records globally as you know.  As TTL expires and renews for DNS globally for Above.net, operations will return to normal.

James or Cody,
Please fill in any gaps in my response so Matt has a full understanding of the issue, impact and resolution.

Thank you,

Stephen P. Vaca
Director, IP Operations | Zayo Group

Figure 7. First communication from Zayo about what caused the problem.

It wasn't anything sophisticated: Zayo had acquired AboveNet and let the above.net domain name expire. And as is standard practice, Network Solutions had redirected all requests for the above.net domain name to a generic Network Solutions web page rather than to Zayo's name servers. This was the generic error page we had been seeing earlier. How do you let your domain names expire, especially if you're an ISP?! As for the root cause, the official explanation was a single sentence in Zayo's subsequent incident report:

 

Figure 8. Official root cause from Zayo’s formal incident report.

As previously mentioned, Xignite also uses Network Solutions as a domain registrar, and we received repeated and frequent communication about an unrelated, expiring domain. We even followed their manual steps and called them to stop the flood of emails about that expiring domain, yet the emails kept coming. So Zayo's claim of insufficient communication doesn't match our own experience with Network Solutions. Also of interest was Zayo's timeline of events (note that these times are Eastern):

Figure 9. First entries in Zayo’s incident report to Xignite.

For example, the first event at 6:45 am, receiving alarms and customer reports, is just one minute after we filed our ticket with them, which in turn was over an hour after our first Pingdom alert. The second event at 7:29 am indicates that they were still investigating, and coincides with when we completed migrating off Zayo. The third event at 9:20 am indicates Zayo addressed the issue nearly 4 hours after our first Pingdom alert. I'm proud that Xignite correctly detected, diagnosed, and resolved the problem (by switching off Zayo) in under 2 hours, at a point when Zayo still wasn't sure what the problem was. Based on their own timeline, Zayo was not even aware there was a problem until Xignite filed our ticket. Finally, Zayo reported the duration of the incident as 27 hours in their formal incident report:

 

Figure 10. Outage summary in Zayo’s incident report to Xignite.

It's baffling why Zayo would cite a duration of 27 hours. They were fully aware that the TTL was 48 hours (and communicated as much), so how could it have been resolved any faster? True, Zayo asked some ISPs to flush their DNS caches, but that doesn't fix every resolver on the planet. Or maybe they didn't test very extensively, if at all, given that Pingdom detected a total duration of close to 48 hours.

What other mitigating steps did Xignite take?

Since we had minimal control over how quickly our changes would propagate to all DNS servers, we also communicated workaround instructions to all our customers. These workaround instructions included the following (sketched in more detail after the list):

  • Change to use DNS servers that are correctly resolving xignite.com domains.
  • Map directly to Xignite IP addresses by overriding DNS in a hosts file.
  • Flush your DNS cache, if possible.
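To make those workarounds concrete, here is roughly what they look like on a Windows machine. This is illustrative only: the 64.124.25.100 address is the one xignite.com resolved to at the time (see Figure 2), each additional Xignite hostname a client calls needs its own entry, and whether a given public resolver was affected depended on what it had cached.

Point the machine (or its upstream resolver) at a public DNS service that was still resolving xignite.com correctly, for example Google Public DNS at 8.8.8.8 and 8.8.4.4.

Or override DNS locally by adding entries to C:\Windows\System32\drivers\etc\hosts:
64.124.25.100    xignite.com
64.124.25.100    www.xignite.com

And, where possible, flush the local DNS cache so stale answers are discarded:
C:\>ipconfig /flushdns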

 

Reflections

This was one of those events so rare and improbable that protecting against it admittedly never crossed my mind; I hadn't even imagined the possibility. Can you imagine if Google let youtube.com expire, for example? You just don't anticipate that being a problem. That said, the monitoring we had in place detected a problem immediately. Our on-duty staff investigated first, then escalated when they couldn't resolve the issue. It was only about an hour from the time my phone woke me up until we had migrated to Amazon.

So, short of not using Zayo, I'm pretty happy with Xignite's technical response to this catastrophe. The one area we do plan to improve is getting an alert out to clients sooner. Had those improvements been in place, I believe we would have issued an alert to all our clients around 04:00, and the workarounds to our entire client base not long after. A very frustrating aspect was the lack of accurate updates from Zayo. It would have been nice for them to at least publicly acknowledge the incident. We noticed they were tweeting about the "WomanInTechCO" event later that day, yet there was no mention of a major DNS failure. We received the incident report from Zayo the afternoon of Monday June 16, days after the incident. Xignite, by contrast, made several rounds of communication to clients and issued our formal report the following morning (Friday June 13).

The fact that their incident report was neither accurate nor consistent with what they told us through updates to our tickets is also a disappointment. Where do we stand today? We migrated to Amazon for DNS in the middle of the incident, hours before Zayo could resolve things on their end, so that migration certainly narrowed the window during which the bad DNS was live. We have no intention of reverting to Zayo for DNS after this fiasco. Xignite has been an Amazon customer since 2009, and our experience has shown that they are a reliable and responsive company to work with. Therefore I don't anticipate any DNS problems going forward now that we're using Amazon.

One thing to highlight here is that this is a good example of Xignite's commitment to operations. Many companies treat operations as a necessary evil, something they do only because they have to. My view, and by extension Xignite's view, is that operations and support are what make us a better company, and something we continually strive to make a differentiator for Xignite. This is why we continuously invest in making our detection systems more robust and complete. That is how we were able to tell Zayo about a problem with their own systems before they even knew about it, and were clearly one of the first customers to do so.

 

Matt Purkeypile

Engineering Director, Xignite

 

Background

Here is some additional background on Xignite’s production infrastructure to provide more context:

  • Xignite relied on Zayo (formerly AboveNet) for our public DNS.
    • This stems from our early days of using AboveNet as our ISP in our San Jose, CA data center.
    • AboveNet was acquired by Zayo in July 2012 [1], which is what made Xignite a Zayo customer.
  • Xignite also runs its own internal DNS servers for xignite.com. This is primarily for use within internal systems, including monitoring.
    • For example, our monitoring can watch "serverxyz.xignite.com" instead of a specific IP, where serverxyz.xignite.com may not be publicly accessible.
  • Xignite relies on PagerDuty to notify operations staff of urgent problems.
  • Xignite has evolved our own custom-built monitoring system throughout our company's history.
    • This monitoring system performs tens of thousands of checks per hour to ensure all systems are up, data is flowing as it should, and a sampling of that data is accurate.
    • This is something we are continuously improving. In fact, we update this system twice a week in order to get these changes live as soon as possible.
  • In addition to our own in-house monitoring, we also use Pingdom to monitor every public web service and then some.
    • This is what powers http://status.xignite.com/.
    • Pingdom continually performs these checks from various locations around the world, to give us third party verification that things are truly up and responding correctly.
  • Xignite was targeting moving our DNS to Route 53 this summer.
    • Research and a limited amount of testing had already been performed prior to this incident.
    • This move was in part driven by the poor service we had received from Zayo since the AboveNet acquisition. For example, a simple DNS A record addition would frequently take days and multiple requests before Zayo would add it. We also had incidents where Zayo changed the wrong record, never actually wrote the changes, or deleted other entries when performing these simple changes. DNS maintenance had become a painful experience.

 

Summarized timeline of events

This is a quick summary of the catastrophic DNS failure by Zayo that negatively impacted Xignite. Times given are Pacific time on Thursday June 12, 2014 unless specified otherwise.

  • 02:40 – Xignite monitoring starts picking up DNS problems.
  • 03:22 – Xignite senior technical staff, including myself, are woken up via pages from the on-duty ops staff.
  • 03:44 – Xignite files a ticket with Zayo letting them know there are DNS problems.
  • 03:45 – Zayo's incident report claims the incident started here, based on client-reported problems. (Incorrect; it started more than an hour earlier.)
  • ~04:30 – Xignite has correctly diagnosed the problem and performed an emergency migration of our DNS to Amazon Route 53.
  • 07:00 – Zayo claims the issue is resolved, both in an update to Xignite at the time and in their incident report. (Incorrect.)
  • 07:26 – I let Zayo know that it isn't fixed.
  • Friday June 13, 2014 at 10:41 – Xignite incident report delivered to customers. (Prior communications explained the basics of the problem and that it was resolved apart from DNS propagation.)
  • Friday June 13 at 19:00 – Zayo reports the issue is completely resolved in their incident report. (Incorrect.)
  • Saturday June 14 at 02:49 – Last failure detected by Pingdom.
  • Monday June 16 at 15:55 – Formal incident report delivered to Xignite by Zayo.

__________

For more info on how Xignite's market data cloud can improve your business processes, request a consultation with one of our data experts.

To test drive our market data APIs and experience for yourself how easy they are to integrate into your applications, request a free trial.