Developer Experience Research Ebook
On-Call Practices in Developer Experience - Manageable and Well-Supported Systems
Research-based guide on effective on-call practices for engineering teams. Learn how to create manageable on-call rotations that support developer well-being and system reliability.
On-call practice
The on-call load in my team is manageable and well-supported.
What are effective on-call practices?
Effective on-call practices represent a balanced approach to maintaining system reliability while ensuring developer well-being. These practices include:
- Equitable rotation systems that distribute the on-call burden fairly across team members
- Clear escalation paths that define when and how to involve additional support
- Comprehensive documentation that enables rapid diagnosis and resolution of incidents
- Reasonable response expectations that acknowledge human needs for rest and recovery
- Post-incident learning that improves systems rather than assigns blame
When implemented properly, these practices ensure that critical systems receive necessary attention without exhausting the teams that maintain them.
Why is managing on-call load critical for developer experience?
On-call duties represent one of the most significant sources of developer stress and potential burnout when poorly managed. Teams with unsustainable on-call practices often experience:
- Higher turnover rates as developers seek positions with better work-life balance
- Decreased productivity during regular work hours due to fatigue from night-time incidents
- Reduced innovation as mental energy is consumed by operational concerns
- Deteriorating system reliability as exhausted engineers make more mistakes
"Each team typically has someone who's not only on call for, you know, pages and things overnight, but is also the interruptible person. So they're the ones handling that work stream. Because of that, I don't think we struggle as much with feeling like we're getting interrupted from that work. It's kind of segmented off on one individual. So I think we don't struggle with that pain point as much."
Head of Engineering at Observability Platform
This strategic approach to managing interruptions allows teams to maintain focus while ensuring systems receive necessary attention.
How can organizations measure if their on-call practices are healthy?
Measuring on-call health requires both quantitative and qualitative approaches:
Quantitative metrics:
- Alert frequency per developer per week
- After-hours page rate and distribution
- Mean time to resolution for incidents
- Sleep interruption frequency for on-call personnel
- Incident recurrence rate indicating systemic issues
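Several of the quantitative metrics above can be derived from a simple export of paging data. The following is a minimal sketch, not a definitive implementation: the `Page` record shape, the 09:00-18:00 weekday definition of business hours, and the rule that an alert name firing more than once counts as recurring are all assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass(frozen=True)
class Page:
    """One page sent to an on-call engineer (hypothetical record shape)."""
    engineer: str
    alert_name: str
    fired_at: datetime     # assumed to be in the paged engineer's local time
    resolved_at: datetime


def is_after_hours(ts: datetime, start_hour: int = 9, end_hour: int = 18) -> bool:
    """Assumed definition: any weekend time, or outside 09:00-18:00 on weekdays."""
    return ts.weekday() >= 5 or not (start_hour <= ts.hour < end_hour)


def summarize(pages: list[Page]) -> dict:
    """Compute a few on-call metrics for one period (for example, one week) of pages."""
    if not pages:
        return {"pages_per_engineer": {}, "after_hours_rate": 0.0,
                "mttr_minutes": 0.0, "recurring_alerts": []}
    per_engineer = Counter(p.engineer for p in pages)
    after_hours = sum(is_after_hours(p.fired_at) for p in pages)
    minutes = [(p.resolved_at - p.fired_at).total_seconds() / 60 for p in pages]
    by_alert = Counter(p.alert_name for p in pages)
    return {
        "pages_per_engineer": dict(per_engineer),
        "after_hours_rate": after_hours / len(pages),
        "mttr_minutes": mean(minutes),
        # An alert name that fired more than once is treated as recurring.
        "recurring_alerts": sorted(a for a, n in by_alert.items() if n > 1),
    }
```

Running `summarize` on each week's pages and comparing the results over time gives a rough view of whether load is concentrated on a few engineers or on a few recurring alerts.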
Qualitative assessment:
- Developer feedback through targeted survey questions
- Retention rates of teams with high on-call responsibilities
- Team morale during and after on-call rotations
- Willingness to participate in on-call rotations
The Network Perspective DevEx Survey directly addresses on-call practices with the statement: "The on-call load in my team is manageable and well-supported." This statement measures two critical dimensions: manageability (whether the frequency and intensity of on-call work are reasonable) and support (whether engineers have the tools, documentation, and backup they need).
Low scores on this question should prompt immediate investigation into:
- On-call rotation frequency and duration
- Alert volume and quality
- Documentation completeness
- Escalation processes
- Recovery time allowed after intense on-call periods
By regularly measuring this aspect of developer experience, organizations can identify issues before they lead to burnout and attrition.
While specialized on-call management tools like PagerDuty, OpsGenie, and VictorOps provide operational metrics, the DevEx survey provides the critical human perspective on whether these systems are working effectively for the people who use them.
What are the most common on-call challenges teams face?
1. Excessive alert volume and alert fatigue
When systems generate too many alerts, especially false positives, engineers become desensitized and may miss critical issues.
2. Inadequate documentation and troubleshooting guides
Without proper documentation, on-call engineers waste precious time searching for information during incidents:
"One effective strategy was improving on-call documentation. We had a team that was overwhelmed with support questions, customer data requests, and other interruptions that prevented deep, focused work. By investing time to document answers to common questions, we initially saw an increase in the time spent on on-call duties. However, this was followed by a dramatic decrease as team members could simply direct people to the documentation, immediately resolving many requests."
Software Development Manager
3. Unpredictable workload and unbalanced schedules
Some on-call rotations place excessive burden on certain team members or during particular time periods.
4. Insufficient compensation or recognition
Many organizations fail to adequately compensate or recognize the additional stress and responsibility of on-call work.
How can teams improve their on-call practices?
Implement follow-the-sun rotations when possible
For global teams, structure on-call schedules so that engineers are primarily on-call during their daytime hours.
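As a rough illustration of the follow-the-sun idea, the sketch below maps each UTC hour to a site where it is currently daytime. The `Site` type, the whole-hour UTC offsets, and the 08:00-18:00 daytime window are assumptions; a real schedule would also need to handle daylight saving time, weekends, and handoffs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Site:
    """A team location (hypothetical); utc_offset is in whole hours from UTC."""
    name: str
    utc_offset: int


def local_hour(utc_hour: int, site: Site) -> int:
    """Convert a UTC hour (0-23) to the site's local hour."""
    return (utc_hour + site.utc_offset) % 24


def follow_the_sun(sites: list[Site], day_start: int = 8,
                   day_end: int = 18) -> dict[int, Optional[str]]:
    """Map each UTC hour to the first listed site whose local time is daytime.

    Hours that no site covers during its daytime map to None, so coverage
    gaps stay visible instead of being silently assigned.
    """
    schedule: dict[int, Optional[str]] = {}
    for utc_hour in range(24):
        covering = [s.name for s in sites
                    if day_start <= local_hour(utc_hour, s) < day_end]
        schedule[utc_hour] = covering[0] if covering else None
    return schedule
```

Any hour that maps to `None` is one where someone would be paged outside their daytime, which is where a night-time rotation or an additional site would be needed.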
Establish clear escalation policies
Define when and how to escalate issues, ensuring that no single engineer bears the full burden of complex incidents.
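One way to make an escalation policy explicit is to write it down as data. The sketch below is an assumption-laden example rather than a recommended policy: the roles, the 15- and 30-minute thresholds, and the `EscalationStep` and `who_to_notify` names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class EscalationStep:
    """Who to notify once an incident has gone unacknowledged for `after_minutes`."""
    after_minutes: int
    notify: str


# Hypothetical policy; roles and timings are placeholders, not recommendations.
POLICY = (
    EscalationStep(0, "primary on-call"),
    EscalationStep(15, "secondary on-call"),
    EscalationStep(30, "engineering manager"),
)


def who_to_notify(minutes_unacknowledged: int,
                  policy: Sequence[EscalationStep] = POLICY) -> list[str]:
    """Return everyone who should have been notified by now, in escalation order."""
    steps = sorted(policy, key=lambda s: s.after_minutes)
    return [s.notify for s in steps if minutes_unacknowledged >= s.after_minutes]
```

For example, `who_to_notify(20)` returns the primary and secondary on-call, so an engineer who is stuck for 20 minutes already has backup without having to ask for it.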
Invest in proper tooling
Use specialized incident management tools that provide context and streamline communication.
Dedicate time for system improvements
Allocate engineering time specifically to address issues that cause repeated alerts.
Rotate the "interruptible" role
Some teams designate a rotating "interruptible" engineer who handles on-call duties during regular hours, shielding the rest of the team from context switching.
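A simple round-robin is one possible way to rotate the interruptible role. The sketch below assumes weekly handoffs starting on a given Monday; the function name and the weekly cadence are illustrative choices, not something the source prescribes.

```python
from datetime import date, timedelta


def interruptible_rotation(engineers: list[str], first_monday: date,
                           weeks: int) -> list[tuple[date, str]]:
    """Assign one interruptible engineer per week, cycling through the team in order."""
    if not engineers:
        raise ValueError("need at least one engineer")
    return [
        (first_monday + timedelta(weeks=i), engineers[i % len(engineers)])
        for i in range(weeks)
    ]
```

With three engineers, each person holds the role one week in three, which keeps the burden evenly spread as long as the team list stays current.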
Why is comprehensive monitoring critical for sustainable on-call practices?
Effective monitoring significantly reduces unnecessary on-call burden by:
- Reducing false alarms that wake engineers unnecessarily
- Providing actionable context that speeds resolution
- Identifying recurring issues that should be permanently fixed
- Enabling proactive intervention before situations become critical
- Facilitating post-incident analysis to prevent future occurrences
Modern observability practices go beyond traditional monitoring by providing rich context that helps engineers quickly understand and resolve issues, reducing the cognitive load during stressful on-call situations.
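To find candidates for alert tuning, a team could review which alerts fire often but rarely require action. The sketch below assumes that responders record, after each page, whether it was actionable; the thresholds (at least 3 pages, under 50% actionable) and the function name are hypothetical.

```python
from collections import defaultdict


def noisy_alerts(pages: list[tuple[str, bool]], min_pages: int = 3,
                 max_actionable_ratio: float = 0.5) -> list[str]:
    """Return alert names that fired often but rarely needed human action.

    `pages` holds (alert_name, was_actionable) pairs; the actionable flag is
    assumed to be recorded by the responder after each page.
    """
    fired = defaultdict(int)
    actionable = defaultdict(int)
    for name, was_actionable in pages:
        fired[name] += 1
        actionable[name] += int(was_actionable)
    return sorted(
        name for name, count in fired.items()
        if count >= min_pages and actionable[name] / count < max_actionable_ratio
    )
```

Alerts returned by this check are candidates for a higher threshold, a longer evaluation window, or removal from the paging path, subject to the team's own review.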
How can leadership measure and improve on-call health?
Leadership plays a crucial role in establishing healthy on-call cultures:
Regular assessment using DevEx surveys
The Network Perspective DevEx Survey question "The on-call load in my team is manageable and well-supported" provides direct insight into how developers experience on-call duties.
Tracking key metrics over time
Monitor trends in incident frequency, resolution time, and after-hours pages to identify improvement opportunities.
Creating feedback loops
Establish regular retrospectives specifically focused on on-call experiences to capture improvement ideas.
Leading by example
When leaders participate in on-call rotations, they gain firsthand experience with pain points and demonstrate solidarity.
Providing adequate resources
Ensure teams have the time, tools, and training needed to effectively manage on-call responsibilities.
What benefits do well-designed on-call practices provide?
Organizations that implement thoughtful on-call practices realize numerous benefits:
- Increased retention of experienced engineers who might otherwise leave due to burnout
- Improved system reliability through more thorough incident response and follow-up
- Enhanced team morale when on-call duties feel fair and manageable
- Better work-life balance leading to more sustainable performance
- Knowledge distribution as team members learn from diverse incident responses
Ultimately, well-designed on-call practices represent an investment in both system reliability and human sustainability, creating a virtuous cycle that benefits the organization, its customers, and its engineering teams.