Creating Incident Postmortem Reports
PagerDuty
Timeframe
December 2016 - May 2017
My Role
I led user research and delivered interaction and visual design specifications for the Postmortems feature in the PagerDuty web application.
Background
PagerDuty is a SaaS product that aggregates information from IT monitoring tools and facilitates rapid incident response when those tools detect an issue. People who use PagerDuty typically work in IT operations teams responsible for maintaining business-critical infrastructure, services, and applications. A service disruption may result in lost revenue, damage to the service provider's reputation, or (depending on the nature of the business) lost revenue for the company's customers, so these teams use PagerDuty to help them minimize downtime.
The PagerDuty service sits between monitoring tools and IT operations teams, enabling managers to centralize alerting policy while also allowing responders to customize how they get alerted when things break. The PagerDuty web app allows people to integrate their monitoring tools with the PagerDuty service and configure alerting policy. The mobile apps for iOS and Android enable people on-call to receive push-notification alerts, as well as acknowledge and resolve those alerts from their smartphones. In addition to push, PagerDuty can also send notifications via email, SMS, and phone call, providing teams with the flexibility and redundancy to achieve reliable alerting.
The Problem
IT teams spend significant time and energy responding to incidents. The more operationally mature IT teams also take time to reflect upon the effectiveness of their response process, often immediately following the resolution of a business-impacting incident, a practice commonly referred to as a postmortem. The person or technical team that owns the follow-up for a given incident typically creates a postmortem report by manually collating data from multiple data sources, synthesizing that data into a timeline of the incident, providing a written root cause analysis, proposing steps to prevent the incident from occurring in the future, and reflecting on opportunities for process improvement.
Collating data from multiple sources and creating an incident timeline are tedious and time-consuming steps in the postmortem creation process, with a lot of manual copy/paste operations. If the data and response activity span multiple time zones, whoever has the job of assembling the timeline also needs to do a lot of time zone math to put all the data together into a single timeline so that there is a clear picture of what happened and when. The inefficient and tedious nature of the task means that organizations conduct postmortems for only the most critical business-impacting incidents, and so they miss most opportunities to learn and improve from their experiences.
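To make the time zone math concrete, here is a minimal Python sketch of the manual collation step described above. It is an illustration only, not how PagerDuty implements the feature: the entries, source names, and the `build_timeline` helper are all hypothetical, and it assumes each source reports a local timestamp along with an IANA time zone name.

```python
from datetime import datetime
from zoneinfo import ZoneInfo  # standard library, Python 3.9+

# Hypothetical entries copied from three different tools, each recorded
# in its own local time zone: (source, local time, zone name, note).
entries = [
    ("chat", "2017-01-10 09:05", "America/Los_Angeles", "Responder confirms elevated error rate"),
    ("monitoring", "2017-01-10 17:02", "UTC", "Alert triggered: API latency"),
    ("phone bridge", "2017-01-10 18:20", "Europe/London", "Fix deployed"),
]

def build_timeline(entries, display_tz="UTC"):
    """Convert every entry to one display time zone and sort chronologically."""
    target = ZoneInfo(display_tz)
    timeline = []
    for source, local_time, tz_name, note in entries:
        # Parse the naive local timestamp, then attach its source time zone.
        naive = datetime.strptime(local_time, "%Y-%m-%d %H:%M")
        aware = naive.replace(tzinfo=ZoneInfo(tz_name))
        # Normalize to the display time zone so entries are comparable.
        timeline.append((aware.astimezone(target), source, note))
    return sorted(timeline, key=lambda entry: entry[0])

for when, source, note in build_timeline(entries):
    print(when.strftime("%Y-%m-%d %H:%M %Z"), f"[{source}]", note)
```

Even in this three-entry case, the chat message that looks earliest by wall-clock time (09:05) actually falls after the monitoring alert once both are normalized, which is exactly the kind of ordering mistake that is easy to make when assembling a timeline by hand.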
Approach
Back in mid-2016, a PagerDuty devops engineer created a tool to author postmortem reports more efficiently, purely for the benefit of PagerDuty engineers. His simple solution was so compelling, however, that PagerDuty leadership gave him a cross-functional product development team and the month of January 2017 to incorporate his postmortem report creation tool into the PagerDuty application. I was the UX designer assigned to that team.
We did not have time to conduct discovery before starting implementation, so I combed through my research notes on incident management for everything we had learned about our customers' postmortem processes and pain points. During the project kickoff, the team reviewed the existing data and engaged in a story mapping activity to align on the overall vision of the solution as well as immediate next steps.
User story map
We had weekly reviews of implementation, and by the end of January, we released the first version of PagerDuty Postmortems for internal use. During the subsequent "dogfooding" period, a variety of engineers used the feature to create their own postmortem reports and offered us a wealth of candid, constructive feedback. We learned, for instance, that multiple users commonly needed to simultaneously edit a report, and so we devised a few enhancements to support real-time collaboration and auto-saving.
Real-time collaboration and auto-save features
The team continued to make improvements and add functionality to the feature over the next few months, based on user feedback as well as strategic business goals. We announced an open preview for the feature in February and collected additional feedback from early adopters up to the official feature launch date in May 2017.
Affinity diagram of user feedback
Solution
PagerDuty users can create, edit, browse, and export postmortem reports from the web application. Within the report building interface, a user can rapidly assemble an incident timeline from PagerDuty data, chat tools, and/or custom timeline entries.
Create a postmortem report directly from an incident
Browse postmortem reports or create a new one
Pull individual chat messages directly into an incident timeline without having to copy/paste across tools
Edit an incident timeline
Create a custom timeline entry
Compose analysis sections
"Save as PDF" layout
Impact
The Postmortems feature extends PagerDuty's value proposition beyond incident response, helping organizations learn and improve their response processes over time. It enables PagerDuty to tell a more comprehensive incident management story to the market, and it saves users hours of time creating their postmortem reports. Early adoption and usage numbers are encouraging, customers have sent some very appreciative comments, and the PagerDuty sales team reports that this feature demos very well. The product development team has also received a healthy stream of enhancement requests, suggesting that the tool has traction.