Automating Response Coordination
June - December 2017
I led user research and delivered interaction and visual design specifications for the Response Plays feature in the PagerDuty web, iOS, and Android applications.
PagerDuty is a SaaS product that aggregates information from IT monitoring tools and facilitates rapid incident response when those tools detect an issue. People who use PagerDuty typically work in IT operations teams responsible for maintaining business-critical infrastructure, services, and applications. A service disruption may result in lost revenue, damage to the service provider's reputation, or, depending on the nature of the business, lost revenue for the provider's customers, so these teams use PagerDuty to help them minimize downtime.
The PagerDuty service sits between monitoring tools and IT operations teams, enabling managers to centralize alerting policy while also allowing responders to customize how they get alerted when things break. The PagerDuty web app allows people to integrate their monitoring tools with the PagerDuty service and configure alerting policy. The mobile apps for iOS and Android enable people on-call to receive push-notification alerts, as well as acknowledge and resolve those alerts from their smartphones. In addition to push, PagerDuty can also send notifications via email, SMS, and phone call, providing teams with the flexibility and redundancy to achieve reliable alerting.
By mid-2017, PagerDuty could be used to rapidly mobilize an incident response and engage stakeholders as an incident unfolded. These steps, however, still required responders to know or look up which people or teams to engage. This was typically a time-consuming and error-prone process that involved checking documentation outside of PagerDuty and then adding each required individual and team by hand. One customer told us that in the event of an "All Hands on Deck" scenario, they needed to page 80+ employees for a single incident and wished they had a faster way to accomplish this task. Another customer told us that in the event of a business-impacting incident, they wanted to automatically send status updates so that their executives would remain informed while their response team focused on addressing the root cause. Automating these aspects of coordinated incident response would reduce time to respond and, by extension, the cost of downtime.
I proposed a research plan, conducted a competitive analysis of incident response automation features, and worked with my team's product manager to identify and interview customers in our target market. Here are our learning goals, which informed the questions we asked in interviews:
- What does the major incident response process in a customer's organization look like today?
  - Who defines the major incident response process within the organization?
  - Does it differ based on operational model?
  - How satisfied are customers with their current process?
  - What are the biggest pain points?
  - What tool(s) do customers use to manage their major incidents?
  - What role does PagerDuty play in customers' incident response process?
- What could the incident response process look like tomorrow?
  - What aspects of incident response do customers want to automate, and why?
  - How often would automation play a role in incident response? Is it just for certain kinds of incidents?
  - How much control do responders need to have over the automation to feel comfortable using it? Is it a set-it-and-forget-it kind of thing, or do they want to have the ability to invoke or even reconfigure on-the-fly?
  - Any requirements around access control for configuration? For usage?
  - Any requirements around API?
- How can we best communicate the value proposition and drive successful feature adoption?
  - What benefits do customers hope to gain in automating their incident response?
  - Who in the organization needs to buy in?
  - How would customers describe our proposed solution in their own words?
  - What are the most compelling use cases?
  - What are possible objections or concerns around adoption?
During the first few customer interviews, the product manager and I asked open-ended questions to understand each customer's current incident response process and pain points, then showed concept mock-ups I had made to facilitate the discussion. This level of prototype was sufficient to gather information about what capabilities customers were interested in automating and to what extent they were comfortable doing so.
While we conducted these early interviews, the developers on our team were hard at work building a minimum viable version of the automation feature. We had aligned on the scope in advance by using Jeff Patton's user story mapping technique, which helped us visualize when to deliver units of functionality to customers based on our research and business goals. The very first iteration, which we called our "dogfooding" release, was switched on in July for internal use at PagerDuty only.
A couple of weeks later, we had our first closed beta release and began enabling the feature for select customers who had agreed to let us interview and survey them about their experiences. I tracked a steady stream of user feedback in a Google Sheet, which the team referenced during planning and prioritization sessions throughout the project.
Initially, we had expected customers to be most enthusiastic about full automation—paging responders and sending information to business stakeholders the moment a monitoring tool triggered a major incident in PagerDuty. Surprisingly, the concept that resonated most with participants was the ability for a responder, after manually triaging the incident, to initiate a predefined response on demand with a single click—essentially, to run a macro. We hypothesized that people could more easily envision streamlining their current workflows using macros than they could imagine an alternative reality of fully automated response coordination.
Although still convinced that we could minimize downtime with end-to-end automation, we decided to meet our users where they are today and support the "run a macro" use case in addition to supporting fully automated response coordination. Our launch story at PagerDuty Summit 2017 in September centered around this on-demand capability, which our marketing and product teams jointly decided to rename "response play" since "macro" was deemed too old-school. (To this day, I giggle every time I explain the response play feature to a customer only to hear them exclaim, "Oh, it's like a macro!")
We opened the beta to all eligible customers in November and launched the Response Plays feature in December 2017. Throughout the project, I worked closely with developers to provide UX guidance in the form of mock-ups, implementation reviews, and QA testing. I also partnered with the product manager to create documentation that would help users understand what kinds of problems they could solve with response plays and how to configure them.
The team monitored how customers used the Response Plays feature through a combination of back-end and front-end instrumentation. I led our efforts on the front end, using the behavioral analytics tool Pendo for click tracking and path analysis. I also used Pendo to create two in-app guides: one to help first-time users understand the overall value proposition and the consequences of their configuration choices, and another to solicit feedback from repeat users. We saw a steady, modest level of feature usage during the beta period, followed by a healthy spike after the formal launch, validating the steps we had taken to promote feature discovery.
PagerDuty now has a new concept called a response play: a reusable set of actions that can be run on an incident. Actions supported at the time of launch included paging responders with a default or custom message, subscribing stakeholders to the incident, and sending a status update to all subscribed stakeholders. From the web application, PagerDuty users can create, edit, delete, and browse response plays. They can also run response plays on demand on incidents and, to achieve full automation, configure any PagerDuty service to automatically run a response play as soon as an incident opens on that service. From the PagerDuty apps for iOS and Android, responders can run a response play on an incident.
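For readers who think in code, the response play concept can be sketched as a small data model. This is only an illustration of the idea described above, not PagerDuty's implementation or API: the class names, field names, default message text, and the `open_incident` helper are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Incident:
    """Hypothetical stand-in for an incident; not PagerDuty's data model."""
    title: str
    service: str
    pages: List[Tuple[str, str]] = field(default_factory=list)  # (responder, message)
    subscribers: List[str] = field(default_factory=list)
    status_updates: List[str] = field(default_factory=list)


@dataclass
class ResponsePlay:
    """A reusable set of actions that can be run on an incident."""
    name: str
    responders: List[str] = field(default_factory=list)
    subscribers: List[str] = field(default_factory=list)
    custom_message: Optional[str] = None
    status_update: Optional[str] = None

    def run(self, incident: Incident) -> None:
        # Page each responder with the custom message, or an assumed default.
        message = self.custom_message or f"Please help with: {incident.title}"
        for responder in self.responders:
            incident.pages.append((responder, message))
        # Subscribe stakeholders, skipping anyone already subscribed.
        for stakeholder in self.subscribers:
            if stakeholder not in incident.subscribers:
                incident.subscribers.append(stakeholder)
        # Record a status update for subscribed stakeholders, if one is defined.
        if self.status_update is not None:
            incident.status_updates.append(self.status_update)


def open_incident(
    title: str, service: str, automatic_plays: Dict[str, ResponsePlay]
) -> Incident:
    """Open an incident and run the service's automatic play, if configured."""
    incident = Incident(title=title, service=service)
    play = automatic_plays.get(service)
    if play is not None:
        play.run(incident)
    return incident


all_hands = ResponsePlay(
    name="All Hands on Deck",
    responders=["dba-on-call", "network-on-call"],
    subscribers=["cto"],
    status_update="We are investigating.",
)

# On demand: a responder triages first, then runs the play manually.
manual = open_incident("Checkout latency", "checkout", automatic_plays={})
all_hands.run(manual)

# Fully automated: the play runs as soon as the incident opens on the service.
automated = open_incident(
    "Database down", "database", automatic_plays={"database": all_hands}
)
```

The same play object serves both modes. The only difference is who triggers it: a responder after triage, or the service configuration at the moment the incident opens.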
The Response Plays feature enables PagerDuty customers to automate their incident response coordination as much or as little as they want. Usage data from the first month post-launch show that 95% of response plays are run in a fully automated manner, confirming our initial hypothesis about the primary value proposition being end-to-end automation. Feedback from our sales team indicates that it is a popular demo with a clear value proposition. Customers have reacted with enthusiasm and, even more encouraging, detailed requests for enhancements. Perhaps best of all, I have heard from numerous customers that they have finally adopted our other incident response features (Response Mobilizer and Incident Subscription) because response plays automate their usage, making them easier to incorporate into their own processes without having to re-train employees.