Incident Updates
Learn how incident updates work, their significance in the incident timeline, and how they control incident state progression.
Incident updates (also called comments) are the primary mechanism for communicating incident progress and controlling incident state. Each update creates a timestamped entry in the incident timeline that's visible to your users.
What are Incident Updates?
An incident update is a timestamped message that:
- Describes current status or actions taken
- Changes the incident state
- Appears on the public status page
- Creates an audit trail
- Communicates transparently with users
Updates are stored as "comments" in the database but are displayed as a timeline on the status page.
Update Components
Each update consists of:
Message (Required)
The actual status update text that users will see.
Format: Supports full Markdown formatting including:
- Bold and italic text
- Headers
- Lists (bulleted and numbered)
- Links
- Code blocks
- Tables
Example:
We've identified the root cause as a database connection pool exhaustion.
**Actions Taken:**
- Increased connection pool size from 100 to 500
- Restarted application servers
- Enabled connection pool monitoring
**Next Steps:**
- Monitor error rates for 15 minutes
- Verify connection pool health
State (Required)
The incident state after this update. This is crucial because:
- The incident's current state = the state of the most recent update
- Changing state in an update changes the incident's state
- State progression drives the incident lifecycle
Available States:
title: Incident Updates
description: Quick reference for posting timeline updates on incidents
- IDENTIFIED - Root cause found
- MONITORING - Fix applied, watching for stability
Incident updates are timeline entries used to communicate progress and move incident state.
Quick reference
When posting an update, choose one state:
INVESTIGATINGIDENTIFIEDMONITORINGRESOLVED
Setting RESOLVED closes the incident and sets end time.
- Aligning timeline with actual events
Use concise, user-facing text and include only meaningful changes.
See also
- Creating and Managing Incidents
- Impact on Monitoring
Important State Changes:
Moving to RESOLVED:
- Sets incident end_date_time to update timestamp
- Closes the incident on status page
- Triggers resolution notifications
Moving FROM RESOLVED:
- Clears incident end_date_time
- Reopens the incident
- Shows as ongoing again
Deleting Updates
You can delete updates:
- Find the update in the timeline
- Click the Delete (trash icon) button
- Confirm deletion
Warning: This cannot be undone.
What Happens:
- Update is permanently removed
- Incident state reverts to the previous update's state
- Timeline adjusts
- If you delete the most recent update, the incident takes the state of the second-most-recent update
Special Case: If you delete all updates, the incident retains its original state from creation.
Update Status
Each update has a status field that controls visibility:
ACTIVE (Default)
The update is visible on the public status page and in the incident timeline.
INACTIVE
The update is hidden from public view but retained in the database.
Use Cases:
- Internal notes not meant for users
- Potentially sensitive information
- Draft updates
- Historical record keeping
How to Set:
Currently managed through API calls. Dashboard UI support may be added in the future.
Update Best Practices
Frequency
Critical Incidents:
- Update every 15-30 minutes
- Even if no progress ("Still investigating...")
- Keeps users informed
Major Incidents:
- Update every 30-60 minutes
- When significant progress made
- At state transitions
Minor Incidents:
- Update when state changes
- When fix is applied
- When resolved
Content Guidelines
Be Clear:
- Use simple language
- Avoid jargon and acronyms
- Explain technical terms if necessary
Be Specific:
- What IS affected, not just what you're doing
- What users can expect
- Estimated timelines (if available)
Be Honest:
- Admit uncertainty when present
- Don't promise what you can't deliver
- Update estimates as they change
Good Examples:
## Investigating (Good)
We're seeing 20% of API requests failing with 503 errors.
Our team is investigating the cause. Retries should succeed,
but you may experience delays.
## Identified (Good)
We've identified a database replication lag as the cause.
The secondary database is catching up. ETA for full resolution: 15 minutes.
## Monitoring (Good)
Database replication is back in sync. Error rates have dropped to normal levels.
We're monitoring for the next 20 minutes to ensure stability before resolving.
## Resolved (Good)
All systems are operating normally. Total incident duration: 47 minutes.
**Root Cause:** Database replication lag due to a large batch import.
**Prevention:** We've implemented rate limiting on batch imports and improved monitoring.
Poor Examples:
## Bad - Too Vague
We're working on it.
## Bad - Too Technical
Increased innodb_buffer_pool_size and optimized query Q47392.
## Bad - Unprofessional
Really sorry about this mess! Not sure what happened.
State Progression
Don't Skip States Unnecessarily:
- Helps users understand progress
- Provides detailed timeline
- Better transparency
It's OK to Skip:
- INVESTIGATING → RESOLVED (if quick fix)
- IDENTIFIED → RESOLVED (if no monitoring needed)
States Can Go Backward:
- MONITORING → INVESTIGATING (if issue returns)
- RESOLVED → INVESTIGATING (if reopened)
- This is normal for complex incidents
Markdown Usage
Use Structure:
## Summary
Brief overview of current status
**Impact:**
- What's affected
- Who's affected
**Next Steps:**
- What we're doing
- Expected timeline
Use Lists:
- Easier to scan
- Clearer action items
- Better readability
Use Bold for Emphasis:
- Highlight important information
- Draw attention to key points
- Don't overuse
Link to Resources:
- Status page for related monitors
- Documentation for workarounds
- Support channels for help
Special Use Cases
Backdating Updates
If you're creating updates after the fact:
- Add the update
- Adjust the timestamp to when it actually occurred
- Maintains accurate timeline
- Preserves historical accuracy
Example:
Issue occurred at 10:00 AM, but you're recording it at 2:00 PM. Set update timestamps to 10:00 AM, 10:30 AM, etc.
Multiple Updates at Once
For complex incidents with many developments:
- Add updates in chronological order
- Adjust timestamps to spread them appropriately
- Ensure state progression makes sense
- Most recent update determines current state
Resolution Updates
When resolving an incident, include:
Summary:
- What was fixed
- Verification of resolution
- Confidence level
Root Cause:
- What caused the incident
- Why it happened
- Technical details (optional)
Prevention:
- Steps taken to prevent recurrence
- Monitoring improvements
- Process changes
Example:
All services are fully operational. The incident has been resolved.
**Root Cause:**
A misconfigured load balancer was routing traffic to unhealthy backend servers.
**Resolution:**
- Fixed load balancer health check configuration
- Restarted affected backend servers
- Verified traffic routing correctly
**Prevention:**
- Added health check validation to deployment pipeline
- Implemented automated health check monitoring
- Updated runbooks for faster diagnosis
**Total Duration:** 1 hour 15 minutes
Timeline Display
On the public status page, updates are displayed:
Order: Most recent first (reverse chronological)
Information Shown:
- State badge (color-coded)
- Timestamp
- Message content (rendered Markdown)
Styling:
- Each update in a card
- State-appropriate colors
- Clear visual separation
Update Notifications
When configured with the subscription system:
Users Receive:
- Email notifications for new updates
- Subject includes incident ID and state
- Rendered HTML of update message
- Link to full incident page
Notifications Sent When:
- New update is added
- State changes
- Incident is created
- Incident is resolved
See Subscription documentation for setup.
Next Steps
- Incident Impact on Monitoring - How incident state affects monitor status
- Creating and Managing Incidents - Back to incident management basics
- Auto-Generated Incidents - How alerts create and update incidents automatically