Mobilizing an Incident Response
January - July 2016
I led UX research and delivered interaction and visual design specifications for the PagerDuty Response Mobilizer feature on web, iOS, and Android platforms. As UX lead for the multi-team incident management initiative, I also coordinated regularly with three other UX designers working in parallel on related projects to ensure a holistic user experience.
PagerDuty is a SaaS product that aggregates information from IT monitoring tools and facilitates rapid incident response when those tools detect an issue. People who use PagerDuty typically work in IT operations teams responsible for maintaining business-critical infrastructure, services, and applications. A service disruption may result in lost revenue, damage to the service provider’s reputation, or, depending on the nature of the business, lost revenue for the company’s customers. These teams use PagerDuty to help them minimize downtime.
The PagerDuty service sits between monitoring tools and IT operations teams, enabling managers to centralize alerting policy while also allowing responders to customize how they get alerted when things break. The PagerDuty web app allows people to integrate their monitoring tools with the PagerDuty service and configure alerting policy. The mobile apps for iOS and Android enable people on-call to receive push-notification alerts, as well as acknowledge and resolve those alerts from their smartphones. In addition to push, PagerDuty can also send notifications via email, SMS, and phone call, providing teams with the flexibility and redundancy to achieve reliable alerting.
During the last quarter of 2015, I conducted a series of interviews to learn about our customers’ incident management practices. Rapidly engaging multiple responders for a given incident turned out to be a significant pain point, with coordinators spending as long as 30 to 60 minutes manually looking up contact information and chasing people down via phone. Even in organizations that did use PagerDuty to contact people, coordinators had to create a separate PagerDuty incident for each individual or team they needed to loop in, and then duplicate context and conference bridge information across all those PagerDuty incidents.
Customers needed an easy, fast way to recruit individuals (either specific people or the on-call for a specific functional area) from one or more teams to collaborate on an open issue and track participant status.
PagerDuty did not solve this problem because:
- Although PagerDuty could notify multiple responders when an incident opened, the first acknowledgement stopped PagerDuty from notifying anyone further for that incident.
- Reassigning the incident dropped all people currently assigned.
- Without reassigning, users could not add anyone other than themselves to an incident.
- Creating one PagerDuty incident per responder and duplicating contextual information across all the incidents was a slow and error-prone process.
I proposed a research plan, sketched low-fidelity design concepts, and worked with my team’s product manager to identify and interview customers in our target market. Since we already had some insight into the incident management practices of these customers from the prior quarter’s research, this research sprint focused on the specifics of engaging people for a coordinated response (how they mentally organized the people they needed to reach, what tools they used, how they conveyed context, etc.) and gathered feedback on design concepts. I conducted interviews, synthesized data with my team, and summarized and shared findings with internal stakeholders.
My team iterated on the feature in two-week sprints, starting with capabilities core to the goal of getting people engaged in a coordinated incident response quickly and easily. While the developers worked on implementation and I provided UX reviews as needed, the product manager and I also signed up a small set of customers to provide early feedback on the experience. As the developers shipped functionality, we gathered internal feedback first, then released the changes to preview participants. The team instrumented the feature so that we could track when and how the feature was being used. I used that quantitative data in conjunction with first-time experience usability tests and post-preview user surveys to prioritize and inform design iterations.
As we approached our target launch date for general availability, I considered how this new feature fit into the existing product, how it would change our customers’ workflows, and how it might change how the market values PagerDuty as a product. We introduced the concept of a coordinated response into the product and, in doing so, declared that PagerDuty was no longer just a tool for notifying one person when something broke. Now it could help drive full-scale incident response from detection to resolution. In the run-up to launch, I communicated with product management, customer support, product marketing, and sales to provide a user-centric perspective on messaging for the feature.
The Response Mobilizer feature shipped in July 2016 with functionality on iOS, Android, and web platforms. Users could...
- create incidents in PagerDuty with up to 300 responders, specified either as individual people or as the on-call for a functional team
- add responders to an incident already in progress
- respond to a request that someone else had sent to them to join a coordinated incident response
- view the status of any responder (pending, accepted, or declined)
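To make the capabilities above concrete, here is a minimal sketch of the data model they imply. It is illustrative only: the names `Incident`, `ResponderStatus`, `add_responder`, `respond`, and `with_status` are hypothetical and do not reflect PagerDuty's actual implementation or API. The only details taken from the feature itself are the 300-responder limit and the three statuses.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class ResponderStatus(Enum):
    """The three request states a responder could be in."""
    PENDING = "pending"
    ACCEPTED = "accepted"
    DECLINED = "declined"


MAX_RESPONDERS = 300  # limit stated in the feature list


@dataclass
class Incident:
    """Hypothetical model of one incident with many requested responders."""
    title: str
    # Maps a responder (a person, or the on-call for a team) to their status.
    responders: Dict[str, ResponderStatus] = field(default_factory=dict)

    def add_responder(self, name: str) -> None:
        """Request another responder without dropping anyone already engaged."""
        if name in self.responders:
            return  # already requested; keep their current status
        if len(self.responders) >= MAX_RESPONDERS:
            raise ValueError("responder limit reached")
        self.responders[name] = ResponderStatus.PENDING

    def respond(self, name: str, accept: bool) -> None:
        """Record a responder's answer to a request to join the response."""
        if name not in self.responders:
            raise KeyError(f"{name} was not requested for this incident")
        self.responders[name] = (
            ResponderStatus.ACCEPTED if accept else ResponderStatus.DECLINED
        )

    def with_status(self, status: ResponderStatus) -> List[str]:
        """List responders currently in the given status."""
        return [n for n, s in self.responders.items() if s is status]


# Usage: one incident, several responders, each tracked individually.
incident = Incident("Checkout latency")
incident.add_responder("alice")
incident.add_responder("database on-call")
incident.respond("alice", accept=True)
```

The point of the sketch is the shape of the change: responders are added to a single incident rather than reassigned or duplicated across incidents, and each carries their own status that a coordinator can check.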
Prior to this feature, it was impossible to track the effectiveness of coordinated incident response in any meaningful way using PagerDuty. Customers didn’t cite this as a problem during discovery because they didn’t think of PagerDuty as a tool for measuring or understanding how they were doing at incident response. Strategically, though, PagerDuty wanted to provide this level of insight to operations managers. Not only did Response Mobilizer address a real and significant pain point for current users, it opened up new opportunities for PagerDuty to help organizations level up their operational performance.