Kener Documentation

Incident updates (also called comments) are the primary mechanism for communicating incident progress and controlling incident state. Each update creates a timestamped entry in the incident timeline that's visible to your users.

What are Incident Updates?

An incident update is a timestamped message that:

Describes current status or actions taken
Changes the incident state
Appears on the public status page
Creates an audit trail
Communicates transparently with users

Updates are stored as "comments" in the database but are displayed as a timeline on the status page.

Update Components

Each update consists of:

Message (Required)

The actual status update text that users will see.

Format: Supports full Markdown formatting including:

Bold and italic text
Headers
Lists (bulleted and numbered)
Links
Code blocks
Tables

Example:

We've identified the root cause as a database connection pool exhaustion.

**Actions Taken:**

- Increased connection pool size from 100 to 500
- Restarted application servers
- Enabled connection pool monitoring

**Next Steps:**

- Monitor error rates for 15 minutes
- Verify connection pool health

State (Required)

The incident state after this update. This is crucial because:

The incident's current state = the state of the most recent update
Changing state in an update changes the incident's state
State progression drives the incident lifecycle

Available States:

INVESTIGATING - Cause is still being investigated
IDENTIFIED - Root cause found
MONITORING - Fix applied, watching for stability
RESOLVED - Incident closed and end time set

Quick reference

Incident updates are timeline entries used to communicate progress and move incident state.

When posting an update, choose one state:

INVESTIGATING
IDENTIFIED
MONITORING
RESOLVED

Setting RESOLVED closes the incident and sets its end time.
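The effect of posting an update on the incident can be sketched as follows. This is a minimal illustration, not Kener's implementation: the `Incident` class, its field names, and the choice to use the update's timestamp as the end time are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

STATES = ("INVESTIGATING", "IDENTIFIED", "MONITORING", "RESOLVED")


@dataclass
class Incident:
    """Hypothetical in-memory model; not Kener's actual schema."""
    state: str
    end_time: Optional[datetime] = None


def post_update(incident: Incident, state: str, posted_at: datetime) -> Incident:
    """Apply an update: the incident takes the update's state."""
    if state not in STATES:
        raise ValueError(f"unknown state: {state}")
    incident.state = state
    if state == "RESOLVED":
        # Assumption: the end time is the timestamp of the resolving update.
        incident.end_time = posted_at
    return incident


incident = Incident(state="INVESTIGATING")
resolved_at = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
post_update(incident, "RESOLVED", resolved_at)
```

After the call, `incident.state` is `RESOLVED` and `incident.end_time` holds the timestamp passed in.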

Use concise, user-facing text and include only meaningful changes. To align the timeline with actual events, adjust update timestamps as described under Backdating Updates.

Deleting Updates

You can delete updates:

Find the update in the timeline
Click the Delete (trash icon) button
Confirm deletion

Warning: This cannot be undone.

What Happens:

Update is permanently removed
Timeline adjusts
If you delete the most recent update, the incident reverts to the state of the second-most-recent update

Special Case: If you delete all updates, the incident retains its original state from creation.
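The deletion rules above amount to recomputing the state from whatever updates remain. The sketch below assumes a hypothetical `Update` tuple with an integer timestamp and a separately stored creation state; the names are illustrative and not taken from Kener's code.

```python
from typing import NamedTuple


class Update(NamedTuple):
    """Hypothetical update record used only for this example."""
    id: int
    timestamp: int  # assumed to be Unix seconds
    state: str


def state_after_delete(updates: list[Update], delete_id: int, creation_state: str) -> str:
    """Return the incident state once the update with delete_id is removed."""
    remaining = [u for u in updates if u.id != delete_id]
    if not remaining:
        # No updates left: the incident keeps its original state from creation.
        return creation_state
    # Otherwise the most recent remaining update determines the state.
    return max(remaining, key=lambda u: u.timestamp).state


timeline = [
    Update(1, 100, "INVESTIGATING"),
    Update(2, 200, "IDENTIFIED"),
    Update(3, 300, "MONITORING"),
]
state = state_after_delete(timeline, 3, "INVESTIGATING")
```

Here `state` is `IDENTIFIED`, because update 2 becomes the most recent entry once update 3 is deleted.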

Update Status

Each update has a status field that controls visibility:

ACTIVE (Default)

The update is visible on the public status page and in the incident timeline.

INACTIVE

The update is hidden from public view but retained in the database.

Use Cases:

Internal notes not meant for users
Potentially sensitive information
Draft updates
Historical record keeping

How to Set:
Update status is currently managed through API calls. Dashboard UI support may be added in the future.
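As an illustration of how the status field controls visibility, the sketch below filters a list of updates down to those shown publicly. The dictionary keys are assumptions for the example, and no real Kener API call is shown because the endpoint details are not covered here.

```python
def public_timeline(updates: list[dict]) -> list[dict]:
    """Keep only updates that should appear on the public status page."""
    # ACTIVE is the default, so an update without a status counts as visible.
    return [u for u in updates if u.get("status", "ACTIVE") == "ACTIVE"]


updates = [
    {"message": "Investigating elevated error rates", "status": "ACTIVE"},
    {"message": "Internal note: suspect deploy 42", "status": "INACTIVE"},
    {"message": "Fix deployed"},  # no status given, treated as ACTIVE
]
visible = public_timeline(updates)
```

The internal note is dropped from `visible` but remains in `updates`, mirroring how an INACTIVE update stays in the database.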

Update Best Practices

Frequency

Critical Incidents:

Update every 15-30 minutes
Even if there is no progress ("Still investigating...")
Keeps users informed

Major Incidents:

Update every 30-60 minutes
When significant progress is made
At state transitions

Minor Incidents:

Update when state changes
When fix is applied
When resolved
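A team could turn these cadence guidelines into a simple reminder check. The sketch below is one hypothetical way to do that: the severity labels and the choice to use the upper bound of each range as the limit are assumptions, not something Kener enforces.

```python
from datetime import datetime, timedelta

# Upper bound of each recommended interval; minor incidents have no timer.
MAX_GAP = {
    "CRITICAL": timedelta(minutes=30),
    "MAJOR": timedelta(minutes=60),
}


def update_overdue(severity: str, last_update_at: datetime, now: datetime) -> bool:
    """Return True when the time since the last update exceeds the guideline."""
    gap = MAX_GAP.get(severity)
    if gap is None:
        # Minor incidents are updated on state changes, not on a schedule.
        return False
    return now - last_update_at > gap


last = datetime(2024, 1, 1, 10, 0)
overdue = update_overdue("CRITICAL", last, last + timedelta(minutes=45))
```

With a critical incident last updated 45 minutes ago, `overdue` is `True`.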

Content Guidelines

Be Clear:

Use simple language
Avoid jargon and acronyms
Explain technical terms if necessary

Be Specific:

What IS affected, not just what you're doing
What users can expect
Estimated timelines (if available)

Be Honest:

Admit uncertainty when present
Don't promise what you can't deliver
Update estimates as they change

Good Examples:

## Investigating (Good)

We're seeing 20% of API requests failing with 503 errors.
Our team is investigating the cause. Retries should succeed,
but you may experience delays.

## Identified (Good)

We've identified a database replication lag as the cause.
The secondary database is catching up. ETA for full resolution: 15 minutes.

## Monitoring (Good)

Database replication is back in sync. Error rates have dropped to normal levels.
We're monitoring for the next 20 minutes to ensure stability before resolving.

## Resolved (Good)

All systems are operating normally. Total incident duration: 47 minutes.

**Root Cause:** Database replication lag due to a large batch import.

**Prevention:** We've implemented rate limiting on batch imports and improved monitoring.

Poor Examples:

## Bad - Too Vague

We're working on it.

## Bad - Too Technical

Increased innodb_buffer_pool_size and optimized query Q47392.

## Bad - Unprofessional

Really sorry about this mess! Not sure what happened.

State Progression

Don't Skip States Unnecessarily:

Helps users understand progress
Provides detailed timeline
Better transparency

It's OK to Skip:

INVESTIGATING → RESOLVED (if quick fix)
IDENTIFIED → RESOLVED (if no monitoring needed)

States Can Go Backward:

MONITORING → INVESTIGATING (if issue returns)
RESOLVED → INVESTIGATING (if reopened)
This is normal for complex incidents
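The progression rules can be made concrete with a small helper that labels a transition as forward, skipping, or backward. This is an illustrative sketch only; all of these transitions are allowed, and the labels are names invented for the example.

```python
ORDER = ["INVESTIGATING", "IDENTIFIED", "MONITORING", "RESOLVED"]


def classify_transition(old: str, new: str) -> str:
    """Label a state change relative to the usual lifecycle order."""
    step = ORDER.index(new) - ORDER.index(old)
    if step == 0:
        return "unchanged"
    if step == 1:
        return "forward"
    if step > 1:
        return "skip"
    return "backward"


quick_fix = classify_transition("INVESTIGATING", "RESOLVED")
reopened = classify_transition("RESOLVED", "INVESTIGATING")
```

`quick_fix` is labeled `skip` and `reopened` is labeled `backward`. A label like this could be used to prompt the author for a short explanation when states are skipped or reversed.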

Markdown Usage

Use Structure:

## Summary

Brief overview of current status

**Impact:**

- What's affected
- Who's affected

**Next Steps:**

- What we're doing
- Expected timeline

Use Lists:

Easier to scan
Clearer action items
Better readability

Use Bold for Emphasis:

Highlight important information
Draw attention to key points
Don't overuse

Link to Resources:

Status page for related monitors
Documentation for workarounds
Support channels for help

Special Use Cases

Backdating Updates

If you're creating updates after the fact:

Add the update
Adjust the timestamp to when the event actually occurred
This keeps the timeline historically accurate

Example:
The issue occurred at 10:00 AM, but you're recording it at 2:00 PM. Set the update timestamps to 10:00 AM, 10:30 AM, and so on.
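When recording several updates after the fact, it can help to compute the backdated timestamps up front. The helper below is a hypothetical convenience and is not part of Kener; the date and the even 30-minute spacing are assumptions chosen to match the example above.

```python
from datetime import datetime, timedelta


def backdated_timestamps(start: datetime, count: int, spacing: timedelta) -> list[datetime]:
    """Return `count` timestamps beginning at `start`, `spacing` apart."""
    return [start + i * spacing for i in range(count)]


# Incident began at 10:00 AM; three updates are being recorded later that day.
stamps = backdated_timestamps(datetime(2024, 1, 1, 10, 0), 3, timedelta(minutes=30))
```

`stamps` holds 10:00, 10:30, and 11:00, which can then be entered as the timestamps of the three updates in chronological order.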

Multiple Updates at Once

For complex incidents with many developments:

Add updates in chronological order
Adjust timestamps to spread them appropriately
Ensure state progression makes sense
Most recent update determines current state

Resolution Updates

When resolving an incident, include:

Summary:

What was fixed
Verification of resolution
Confidence level

Root Cause:

What caused the incident
Why it happened
Technical details (optional)

Prevention:

Steps taken to prevent recurrence
Monitoring improvements
Process changes

Example:

All services are fully operational. The incident has been resolved.

**Root Cause:**
A misconfigured load balancer was routing traffic to unhealthy backend servers.

**Resolution:**

- Fixed load balancer health check configuration
- Restarted affected backend servers
- Verified traffic routing correctly

**Prevention:**

- Added health check validation to deployment pipeline
- Implemented automated health check monitoring
- Updated runbooks for faster diagnosis

**Total Duration:** 1 hour 15 minutes
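If resolution updates are written often, the structure above can be assembled from its parts. The function below is a sketch with hypothetical parameter names; it only builds the Markdown text and does not post anything to Kener.

```python
def resolution_message(summary: str, root_cause: str, resolution: list[str],
                       prevention: list[str], duration: str) -> str:
    """Assemble a Markdown resolution update with the sections shown above."""
    def bullets(items: list[str]) -> list[str]:
        return [f"- {item}" for item in items]

    lines = [summary, "", "**Root Cause:**", root_cause, ""]
    lines += ["**Resolution:**", "", *bullets(resolution), ""]
    lines += ["**Prevention:**", "", *bullets(prevention), ""]
    lines.append(f"**Total Duration:** {duration}")
    return "\n".join(lines)


message = resolution_message(
    summary="All services are fully operational. The incident has been resolved.",
    root_cause="A misconfigured load balancer was routing traffic to unhealthy backend servers.",
    resolution=["Fixed load balancer health check configuration"],
    prevention=["Added health check validation to deployment pipeline"],
    duration="1 hour 15 minutes",
)
```

The resulting `message` string can be pasted into the update's message field, where it renders as Markdown.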

Timeline Display

On the public status page, updates are displayed as follows:

Order: Most recent first (reverse chronological)

Information Shown:

State badge (color-coded)
Timestamp
Message content (rendered Markdown)

Styling:

Each update in a card
State-appropriate colors
Clear visual separation
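The display order can be illustrated with a short sort. The dictionary keys below are assumptions for the example and do not reflect Kener's internal data model.

```python
from datetime import datetime


def timeline_for_display(updates: list[dict]) -> list[dict]:
    """Order updates most recent first, as on the public status page."""
    return sorted(updates, key=lambda u: u["timestamp"], reverse=True)


updates = [
    {"timestamp": datetime(2024, 1, 1, 10, 0), "state": "INVESTIGATING", "message": "Looking into it"},
    {"timestamp": datetime(2024, 1, 1, 10, 45), "state": "MONITORING", "message": "Fix applied"},
    {"timestamp": datetime(2024, 1, 1, 10, 20), "state": "IDENTIFIED", "message": "Root cause found"},
]
ordered = timeline_for_display(updates)
```

In `ordered`, the MONITORING update comes first and the INVESTIGATING update last.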

Update Notifications

When configured with the subscription system:

Users Receive:

Email notifications for new updates
Subject includes incident ID and state
Rendered HTML of update message
Link to full incident page

Notifications Sent When:

New update is added
State changes
Incident is created
Incident is resolved
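To show what a subject line carrying the incident ID and state might look like, here is a small formatting sketch. The exact format Kener uses is not documented here, so the layout and the function name are purely hypothetical.

```python
def notification_subject(incident_id: int, title: str, state: str) -> str:
    """Build an email subject that includes the incident ID and state."""
    # Hypothetical layout; the real subject format may differ.
    return f"[Incident #{incident_id}] {state}: {title}"


subject = notification_subject(42, "API errors", "MONITORING")
```

The resulting `subject` contains both the ID `42` and the state `MONITORING`, so subscribers can tell at a glance which incident changed and how.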

See Subscription documentation for setup.

Next Steps

Incident Impact on Monitoring - How incident state affects monitor status
Creating and Managing Incidents - Back to incident management basics
Auto-Generated Incidents - How alerts create and update incidents automatically

Incident Updates