Notifying with Urgency

PagerDuty

Timeline

January - September 2015

My Role

As a member of a cross-functional product development team, I led UX research and design for PagerDuty Notification Urgencies, producing design deliverables for web and mobile platforms. I partnered closely with a product manager, technical architect, and software developers throughout the process.

Background

PagerDuty aggregates information from IT monitoring tools and facilitates rapid incident response when those tools detect an issue. People who use PagerDuty typically work in IT operations teams responsible for maintaining business-critical infrastructure, services, and applications. A service disruption may result in lost revenue, damage to the service provider's reputation, or, depending on the nature of the business, lost revenue for the company's customers, so these teams use PagerDuty to help them minimize downtime.

The PagerDuty service sits between monitoring tools and IT operations teams, enabling managers to centralize alerting policy while also allowing responders to customize how they get alerted when things break. The PagerDuty web app allows people to integrate their monitoring tools with the PagerDuty service and configure alerting policy. The mobile apps for iOS and Android enable people on-call to receive push-notification alerts, as well as acknowledge and resolve those alerts from their smartphones. In addition to push, PagerDuty can also send notifications via email, SMS, and phone call, providing teams with the flexibility and redundancy to achieve reliable alerting.

The Problem

Prior to September 2015, the notifications that PagerDuty sent to on-call IT staff only had one volume setting: OMG FIRE! The ability to wake people up when necessary has always been part of PagerDuty's core value proposition, but people on call typically deal with both critical and non-critical issues. Because PagerDuty didn't respect this distinction, it had the annoying tendency of repeatedly waking people up for non-critical issues. Some customers responded by routing only their most critical alerts to PagerDuty, which made it difficult to track operational performance. Others disabled alerting after they’d been woken up needlessly so that they could get their sleep, which risked missing critical alerts. We saw an opportunity to not only address an obvious pain point for sleep-deprived engineers but also to make it easier for them to prioritize and act on critical issues immediately.

Approach

After joining the product development team tasked with tackling this problem, I worked with the product manager to align on the core user needs and business goals. I reviewed the documented customer feedback, wrote a design brief, and proposed a research plan to validate assumptions and understand users' workflows. I interviewed PagerDuty customers who had expressed a desire for this capability, asking them questions about their role and responsibilities, how they respond to incidents, and whether they use any kind of incident classification during their response.

Members of my team observed these interviews, shared their takeaways in debriefings, and participated in data synthesis. Since the team was distributed across two geographic locations, we used online tools such as Google Hangouts and Trello to collaborate on an affinity diagram of the observations.

Affinity diagram of research observations

By the end of the discovery phase, the team had a much better understanding of the needs motivating feature requests. As a bonus, most of the team members had never heard directly from a PagerDuty customer before, and everyone came away with a better sense of our customers' goals and context.

Armed with new insights, my team and I collaborated closely in two-week agile sprints to deliver a feature that allowed system administrators to configure the urgency level of incidents flowing into PagerDuty as "high" or "low" and allowed incident responders to define their personal notification methods based on an incident's urgency. Some of the administrators I interviewed had expected to be able to define four or five levels of urgency, mirroring their organization's incident priority scale. For the first cut of the feature, however, I proposed just two levels of urgency because:

  1. urgency and priority are related but not synonymous concepts
  2. the bottom line at 3 AM is, either something's worth waking up for, or it isn't

This design decision simplified the configuration workflows and limited the scope of development while addressing the main problem of differentiating between critical and non-critical alerts.
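To make the two-level model concrete, here is a minimal sketch in Python. It is an illustration only: the names `Urgency`, `NotificationRule`, and `rules_for`, along with the `delay_minutes` field, are hypothetical and do not reflect PagerDuty's actual data model or API.

```python
from dataclasses import dataclass
from enum import Enum


class Urgency(Enum):
    """The two levels proposed for the first cut of the feature."""
    HIGH = "high"
    LOW = "low"


@dataclass(frozen=True)
class NotificationRule:
    """One personal notification method (hypothetical shape)."""
    channel: str         # e.g. "push", "sms", "phone", "email"
    delay_minutes: int   # assumed: wait this long after the incident triggers
    urgency: Urgency     # which incidents this rule applies to


def rules_for(rules, urgency):
    """Return the rules matching an incident's urgency, soonest first."""
    matching = (r for r in rules if r.urgency is urgency)
    return sorted(matching, key=lambda r: r.delay_minutes)


# A responder who wants loud alerts for high urgency and email only for low.
responder_rules = [
    NotificationRule("phone", 5, Urgency.HIGH),
    NotificationRule("push", 0, Urgency.HIGH),
    NotificationRule("email", 0, Urgency.LOW),
]
```

With only two levels, each responder maintains at most two sets of rules, which is part of why the configuration workflow stays small.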

Urgency as a new property of an incident

Diagram of system notification behavior, used for facilitating communication within the product development team

While the developers built our first functional prototype, I created a usability test plan that outlined our learning goals, research questions, target audience, and testing methodology. Once the product manager and I were aligned on the test plan and participants had been recruited, I conducted usability testing of the functional prototype and used the feedback to iterate on the experience and functionality. One area that I spent time refining was the system administrator's configuration experience. Some teams only respond to incidents during defined support hours, which meant they wanted PagerDuty to track incidents 24/7 but only notify them loudly during their support period. I generated a few different concepts for handling the configuration of support hours. I favored a calendar-based UI supporting direct manipulation of time intervals. 

Calendar-based configuration UI

For the sake of expedient implementation, however, I came up with a less flexible but considerably simpler design that covered the vast majority of use cases, and the team ultimately adopted it.
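The support-hours behavior can be sketched in a few lines of Python. This case study does not spell out the adopted design, so the sketch assumes one daily time window applied to a chosen set of weekdays; `SupportHours` and `urgency_at` are hypothetical names, and time zones and windows that cross midnight are deliberately left out.

```python
from dataclasses import dataclass
from datetime import datetime, time


@dataclass(frozen=True)
class SupportHours:
    """Assumed model: a single daily window on selected weekdays."""
    start: time           # inclusive
    end: time             # exclusive
    weekdays: frozenset   # 0 = Monday ... 6 = Sunday

    def contains(self, moment: datetime) -> bool:
        on_support_day = moment.weekday() in self.weekdays
        return on_support_day and self.start <= moment.time() < self.end


def urgency_at(moment: datetime, hours: SupportHours) -> str:
    """Incidents are always tracked, but are only 'high' during support hours."""
    return "high" if hours.contains(moment) else "low"


# Monday to Friday, 09:00 to 17:00.
business_hours = SupportHours(time(9, 0), time(17, 0), frozenset(range(5)))
```

Under these assumptions, an incident triggered at 3 AM is still recorded but is treated as low urgency, so it would not wake anyone up.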

Specification of new service configuration options

Incident responders typically interact with the PagerDuty mobile app, so in addition to proposing distinct push notification sound settings for high- and low-urgency incidents, I also collaborated with our mobile developers to add a visual indicator to low-urgency incidents.

PagerDuty iOS app displaying notification sound settings

PagerDuty iOS app with high- and low-urgency incidents

 

Impact

At launch, customers reacted positively on social media and, more substantively, altered their PagerDuty configurations to take advantage of the new feature. Since then, the feature has prevented millions of unnecessary moments of panic each month and let PagerDuty users sleep better at night. 20% of all incidents in PagerDuty are now classified as low urgency.

"Incidents with a Volume Knob: Introducing Incident Urgencies!"

This has been on my wishlist for a good while; should make things much more workable. Thanks! —Alex Hooper (Sep 10, 2015 from PagerDuty blog)
Yay! I've been scratching my head on how to solve this before, ended up not alerting at all when something wasn't that bad. Great feature, thanks! —KenjiD (Sep 10, 2015 from PagerDuty blog)
This really makes it possible to differentiate monitoring and alerting. A feature we have been wanting for a long time! —Johan Bloemberg (Sep 11, 2015 from PagerDuty blog)