Utilizing PagerDuty for Critical Operations Alerts

Xignite’s monitoring system constantly performs health checks to detect problems, and even potential problems. The notifications it generates fall into one of three general buckets:

  1. Informational. Example: CPU usage is higher than normal on a machine.
  2. This needs to be looked at. Example: archiving some data failed.
  3. This is a critical problem someone needs to look at NOW. Example: third-party monitoring is detecting services as down, as in the Zayo DNS incident.
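
As a rough illustration of the three buckets above, the following Python sketch maps each bucket to a handling path. The destination names (`log_only`, `ticket_queue`, `page_on_call`) are hypothetical and are not taken from Xignite's actual monitoring system; only the three-way split comes from the list above.

```python
from enum import Enum


class Severity(Enum):
    INFO = 1      # bucket 1: informational
    REVIEW = 2    # bucket 2: needs to be looked at
    CRITICAL = 3  # bucket 3: someone needs to look at it NOW


def route(severity: Severity) -> str:
    """Return a hypothetical destination for a notification of this severity."""
    if severity is Severity.CRITICAL:
        return "page_on_call"   # hand off to the paging system immediately
    if severity is Severity.REVIEW:
        return "ticket_queue"   # someone picks it up during normal work
    return "log_only"           # recorded, nobody is interrupted
```

Only the third bucket interrupts a person; the other two can wait for someone to get to them.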

It is that third bucket I’d like to talk about here. Even for staff who are on duty, how do you alert them immediately when there is a major incident? What if they’re talking with a co-worker or in a meeting? Do you really want the timeliness of your response to depend on how long it takes them to finish a conversation and get back to their computer?

Xignite’s solution to this, which we’ve had in place for several years, is PagerDuty. With PagerDuty it was easy for us to extend our monitoring systems to immediately notify operations staff through a variety of means, including SMS, email, and phone calls. Personally, I prefer to be emailed and paged first, and then have the system start calling my various numbers (cell, office, and home) until I answer:

[Screenshot: My notification rules]
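
Notification rules of this kind can be thought of as an ordered list of contact methods, each with a delay. The sketch below is a plain Python model, not PagerDuty's API or configuration format; the `Rule` type, the `due_contacts` helper, and the delay values are all assumptions chosen for illustration. Only the order (email and page first, then cell, office, and home) comes from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    delay_minutes: int  # minutes after the alert fires (assumed values)
    method: str         # "email", "page", or "phone"
    target: str         # which address or number to use


# Email and page immediately, then work through the phone numbers.
RULES = [
    Rule(0, "email", "work"),
    Rule(0, "page", "cell"),
    Rule(2, "phone", "cell"),
    Rule(4, "phone", "office"),
    Rule(6, "phone", "home"),
]


def due_contacts(rules, minutes_unacknowledged):
    """Return every rule that should have fired by now for an unacknowledged alert."""
    return [r for r in rules if r.delay_minutes <= minutes_unacknowledged]
```

For example, `due_contacts(RULES, 4)` returns the first four rules, ending with the call to the office number.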

 

Escalation policies can be defined so that if an employee does not acknowledge an alert, it automatically escalates to the next person on the list, and so on. This continues until someone acknowledges that they’re looking into the problem. Alerts can also be routed by problem, so employee A might be alerted for one incident and employee B for another. As you would expect, on-call schedules can also be created. This allows me to be paged for an incident at 10:00 am on a Tuesday, while for the same kind of incident at 3:00 am our operations staff in Shanghai are paged instead of waking me up. For example, here is the current schedule for the first person to be called on our Shanghai operations team:

[Screenshot: China On Call schedule]
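
The escalation and scheduling behavior described above can be sketched in a few lines of Python. This is a simplified model and not PagerDuty's implementation or API: the function names, the team labels, and the UTC hour split between teams are assumptions made for illustration.

```python
from __future__ import annotations

from typing import Callable, Optional


def page_until_acknowledged(
    chain: list[str], acknowledges: Callable[[str], bool]
) -> tuple[Optional[str], list[str]]:
    """Page each person in order until one acknowledges the alert.

    Returns the person who acknowledged (or None) and everyone who was paged.
    """
    notified = []
    for person in chain:
        notified.append(person)
        if acknowledges(person):
            return person, notified
    return None, notified  # nobody acknowledged; the whole chain was paged


def first_responder_team(hour_utc: int) -> str:
    """Pick the team that is paged first, based on the hour of day in UTC.

    Assumed split: the Shanghai team covers 00:00-12:00 UTC, which is
    08:00-20:00 in Shanghai (UTC+8); the other team covers the rest.
    """
    return "shanghai" if 0 <= hour_utc < 12 else "us"
```

For example, `page_until_acknowledged(["alice", "bob", "carol"], lambda p: p == "bob")` pages alice, gets no acknowledgement, escalates to bob, and stops there.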

With PagerDuty, I can confidently say that operations staff will be immediately alerted in the event of a critical problem. The hidden benefit here is that our operations staff is more productive: we don’t need someone constantly watching the incoming stream of notifications. They can work on other things, including continually expanding what we monitor, fine-tuning alerting thresholds, and dealing with the less critical alerts.

Another benefit is that this allows us to quickly get staff on deck in the event of a major problem. Imagine you need to get a handful of senior engineers online at 2:00 am on a Thursday morning. Without PagerDuty, someone has to look up their numbers and start calling to wake them up. We’ve set up PagerDuty so that we simply fire off an email and let the system repeatedly call each team member’s various numbers until they respond. This allows the first responder to “fire and forget” the call for help and immediately get back to dealing with the problem at hand.

All said, I’m extremely happy with PagerDuty. It is one more thing that demonstrates Xignite’s commitment to operational excellence. For Xignite, operations isn’t just a necessity; it is something that sets us apart.
