# Display all open, acknowledged or resolved incidents assigned to a user
# Acknowledge an incident
# Resolve an incident
# Comment on an incident
# Assign/Reassign an incident

== Domain Model ==

This section describes the domain model by establishing the core concepts and ubiquitous language for incident management.

* Incidents
** Incidents are resources that are created when an alarm transitions to the ALARM or UNDETERMINED state.
** Each incident is associated with an alarm.
** Incidents allow users to manage alarms as follows:
*** Assign and query the status of alarms.
*** Track the history of alarm events for an incident.
*** Assign incidents to users, and store and query the history of assignments for an incident.
*** Comment on incidents, and store and query the history of comments for an incident.
** An incident has one of three statuses:
*** OPEN: When a new alarm occurs and an incident is created, the incident is in the OPEN state.
*** ACKNOWLEDGED: When an incident is being worked on, it is ACKNOWLEDGED.
*** RESOLVED: When an incident is closed, it is RESOLVED.

* Alarms
** Alarms are resources in Monasca that are created by the Threshold Engine when new metrics are received that match one or more alarm definitions.
** There are three states of an alarm:
*** OK
*** ALARM
*** UNDETERMINED
** The state of an alarm is controlled by the Threshold Engine unless it is explicitly set using the Monasca API.

* Alarm state transition events
** An alarm state transition event is published by the Threshold Engine to the Message Queue when an alarm transitions state.
* Assignment/Owner
** The user that the incident is assigned to.
* Comment
** Comments are resources that are created when a user comments on an incident.
* Actions
** Similar to actions for alarm definitions in Monasca, incidents can also have actions that occur when an incident is modified.
** Actions can be associated with notification methods.
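The concepts above can be sketched as a small set of types. This is a minimal illustration only: the class and field names (<code>Incident</code>, <code>owner</code>, <code>alarm_states</code>, and so on) are assumptions made for the example and are not defined by Monasca.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class AlarmState(Enum):
    """The three alarm states named in the domain model."""
    OK = "OK"
    ALARM = "ALARM"
    UNDETERMINED = "UNDETERMINED"


class IncidentStatus(Enum):
    """The three incident statuses named in the domain model."""
    OPEN = "OPEN"
    ACKNOWLEDGED = "ACKNOWLEDGED"
    RESOLVED = "RESOLVED"


@dataclass
class Comment:
    user_id: str
    text: str


@dataclass
class Incident:
    alarm_id: str                                 # every incident is tied to one alarm
    status: IncidentStatus = IncidentStatus.OPEN  # new incidents start as OPEN
    owner: Optional[str] = None                   # unassigned until a user is assigned
    alarm_states: List[AlarmState] = field(default_factory=list)
    comments: List[Comment] = field(default_factory=list)


# Hypothetical usage: an alarm fires, then a user comments on the incident.
incident = Incident(alarm_id="alarm-1")
incident.alarm_states.append(AlarmState.ALARM)
incident.comments.append(Comment(user_id="user-1", text="Investigating"))
```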

Note: several of the concepts related to incidents were borrowed from PagerDuty. See https://developer.pagerduty.com/documentation/rest/incidents.

== Incident Lifecycle ==

This section describes the lifecycle of an incident, which includes creating incidents, handling alarm state transitions, updating the status of incidents, assigning incidents and commenting on incidents.

=== Alarm states ===

Alarm state transition events are created by the Threshold Engine and are processed as follows:

# To ALARM
## Open a new incident for the supplied alarm, or add an alarm state transition event to an existing incident.
### If no incident exists for the alarm, or the status of the existing incident is RESOLVED, a new incident is created with a status of OPEN.
### If an incident with a status of OPEN or ACKNOWLEDGED exists for the alarm, the alarm state transition event is added to that incident and its status is not modified.
# To OK
## Add an alarm state transition event to an existing incident.
### If no incident exists for the alarm, or the status of the existing incident is RESOLVED, nothing is done.
### If an incident with a status of OPEN or ACKNOWLEDGED exists for the alarm, the alarm state transition event is added to that incident and its status is not modified.
# To UNDETERMINED
## Open a new incident for the supplied alarm, or add an alarm state transition event to an existing incident.
### If no incident exists for the alarm, or the status of the existing incident is RESOLVED, a new incident is created with a status of OPEN.
### If an incident with a status of OPEN or ACKNOWLEDGED exists for the alarm, the alarm state transition event is added to that incident and its status is not modified.
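The rules above can be sketched as a single function. This is an illustration under stated assumptions, not the Incident Manager's implementation: <code>handle_transition</code>, the <code>Incident</code> class and the <code>latest</code> dictionary (most recent incident per alarm ID) are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

ACTIVE_STATUSES = ("OPEN", "ACKNOWLEDGED")
OPENING_STATES = ("ALARM", "UNDETERMINED")


@dataclass
class Incident:
    alarm_id: str
    status: str = "OPEN"
    transitions: List[str] = field(default_factory=list)


def handle_transition(latest: Dict[str, Incident], alarm_id: str,
                      new_state: str) -> Optional[Incident]:
    """Apply one alarm state transition event.

    `latest` maps each alarm ID to the most recent incident for that alarm.
    Returns the incident that received the event, or None if nothing was done.
    """
    incident = latest.get(alarm_id)
    if incident is not None and incident.status in ACTIVE_STATUSES:
        # OPEN or ACKNOWLEDGED: record the event, leave the status alone.
        incident.transitions.append(new_state)
        return incident
    if new_state in OPENING_STATES:
        # No incident, or the last one is RESOLVED: open a new incident.
        incident = Incident(alarm_id=alarm_id, transitions=[new_state])
        latest[alarm_id] = incident
        return incident
    return None  # transition to OK with no active incident


latest: Dict[str, Incident] = {}
first = handle_transition(latest, "alarm-1", "ALARM")
same = handle_transition(latest, "alarm-1", "OK")
first.status = "RESOLVED"
ignored = handle_transition(latest, "alarm-1", "OK")
second = handle_transition(latest, "alarm-1", "UNDETERMINED")
```

In this sketch, a transition to OK never opens an incident, and a resolved incident is never reopened; a later ALARM or UNDETERMINED transition creates a fresh incident instead.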

=== Incident status ===

The status of an incident is modified via the Incidents API and processed as follows:

# To OPEN
## When an incident is created, its status is set to OPEN.
# To ACKNOWLEDGED
## If an incident is in the OPEN state, its status can be set to ACKNOWLEDGED using the Incidents API.
## An incident status event is published to Kafka and processed by the Notification Engine.
## Once an incident is acknowledged, it does not generate any additional notifications, even if it receives new alarm state transition events.
# To RESOLVED
## If an incident is in the ACKNOWLEDGED state, its status can be set to RESOLVED using the Incidents API.
## An incident status event is published to Kafka and processed by the Notification Engine.
## Once an incident is resolved, it does not generate any additional notifications.

Whenever the status of an incident is modified, the user who modified the incident and the timestamp are recorded.
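A minimal sketch of these status rules follows. It assumes that only the transitions listed above are permitted (OPEN to ACKNOWLEDGED, ACKNOWLEDGED to RESOLVED); the text does not say whether other transitions, such as OPEN directly to RESOLVED, are allowed. The names <code>set_status</code>, <code>should_notify</code> and <code>history</code> are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Tuple

# Assumption: only the transitions described in the text are allowed.
ALLOWED = {
    "OPEN": {"ACKNOWLEDGED"},
    "ACKNOWLEDGED": {"RESOLVED"},
    "RESOLVED": set(),
}


@dataclass
class Incident:
    status: str = "OPEN"
    # Each entry records (new status, user ID, timestamp of the change).
    history: List[Tuple[str, str, datetime]] = field(default_factory=list)


def set_status(incident: Incident, new_status: str, user_id: str) -> None:
    """Change the status and record who changed it and when."""
    if new_status not in ALLOWED[incident.status]:
        raise ValueError(f"cannot go from {incident.status} to {new_status}")
    incident.status = new_status
    incident.history.append((new_status, user_id, datetime.now(timezone.utc)))


def should_notify(incident: Incident) -> bool:
    """Acknowledged and resolved incidents generate no further notifications."""
    return incident.status == "OPEN"


incident = Incident()
set_status(incident, "ACKNOWLEDGED", "user-1")
set_status(incident, "RESOLVED", "user-2")
```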

=== Assign or reassign incident ===

The assignment or reassignment of an incident is done via the Incidents API and is processed as follows:

# When an incident is created, it is initially unassigned. It can be assigned or reassigned later.
# An incident assignment/reassignment event is published to the Message Queue and then processed by the Notification Engine.

=== Comment on incident ===

Comments can be created via the Incidents API and are processed as follows:

# When a comment is added to an incident, the comment is stored.
# An incident comment event is published to the Message Queue and then processed by the Notification Engine.
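The assignment and comment flows share the same shape: store the change, then publish an event. The sketch below illustrates this with an in-memory <code>queue.Queue</code> standing in for the Message Queue. The event type strings, the event fields and the function names are all assumptions made for the example; the text does not define an event format.

```python
import json
import queue
from datetime import datetime, timezone

# In-memory stand-in for the Message Queue (assumption; not a real broker).
message_queue = queue.Queue()
stored_comments = []


def publish_event(event_type, incident_id, user_id, **details):
    """Serialize an incident event and put it on the queue."""
    event = {
        "type": event_type,
        "incident_id": incident_id,
        "user_id": user_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    event.update(details)
    message_queue.put(json.dumps(event))
    return event


def assign_incident(incident, assignee_id, assigned_by):
    """Record the new owner, then publish an assignment event."""
    incident["assigned_to"] = assignee_id
    return publish_event("incident-assignment", incident["id"], assigned_by,
                         assigned_to=assignee_id)


def add_comment(incident, user_id, text):
    """Store the comment, then publish a comment event."""
    stored_comments.append(
        {"incident_id": incident["id"], "user_id": user_id, "comment": text})
    return publish_event("incident-comment", incident["id"], user_id,
                         comment=text)


incident = {"id": "incident-1", "assigned_to": None}  # starts unassigned
assign_incident(incident, "user-2", assigned_by="user-1")
add_comment(incident, "user-2", "Restarting the service")

# A consumer such as the Notification Engine would read the events in order.
published = [json.loads(message_queue.get_nowait()) for _ in range(2)]
```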

== Incidents API ==

* GET /v2.0/incidents/
** Query parameters
*** status
*** alarm_state
*** assigned_to
*** acknowledged_by
*** create_start_time
*** status_update_start_time
* GET /v2.0/incidents/{incident-id}
* PATCH /v2.0/incidents/{incident-id}: Update an incident, such as modifying the status to ACKNOWLEDGED or RESOLVED.
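As an illustration of how a client might use these endpoints, the sketch below builds a filtered list URL and a PATCH body. The host name is a placeholder, and the shape of the PATCH body (a JSON object with a <code>status</code> field) is an assumption, since the request body is not specified here.

```python
import json
from urllib.parse import urlencode

BASE_URL = "http://monasca.example.com"  # placeholder host (assumption)


def incidents_url(**filters):
    """Build the list URL, using the query parameters named above as filters."""
    query = urlencode(sorted(filters.items()))
    return f"{BASE_URL}/v2.0/incidents/" + (f"?{query}" if query else "")


def incident_url(incident_id):
    """Build the URL of a single incident."""
    return f"{BASE_URL}/v2.0/incidents/{incident_id}"


# List the open incidents assigned to one user.
list_url = incidents_url(status="OPEN", assigned_to="user-1")

# Assumed PATCH body for acknowledging an incident.
patch_url = incident_url("incident-1")
patch_body = json.dumps({"status": "ACKNOWLEDGED"})
```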

=== Incident Object ===

* id: The ID of the incident.
* name: The name of the incident.
* description: The description of the incident.
* alarm: {alarm}
* alarm_state_transitions: [{alarm_state_transition}]
* status: OPEN, ACKNOWLEDGED, RESOLVED
* created_timestamp: The timestamp when the incident was created.
* status_updated_timestamp: The timestamp when the incident was last updated.
* comments: [comment-id]: An array of comment IDs for the incident.
* assignments: [{Assignment}]: The user ID and timestamp of each assignment of the incident.
* acknowledgments: [{Acknowledgment}]: The user ID and timestamp of each acknowledgment or resolution of the incident.
* actions: [{notification-method}]: An array of notification method IDs that are invoked when the incident is modified in any way.
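For orientation, an incident object with these fields might look like the following. All values are invented placeholders, and the inner shapes of <code>alarm</code>, <code>alarm_state_transitions</code>, <code>assignments</code> and <code>acknowledgments</code> are assumptions, since only the top-level field names are given above.

```python
import json

# Top-level field names taken from the Incident Object list.
INCIDENT_FIELDS = {
    "id", "name", "description", "alarm", "alarm_state_transitions",
    "status", "created_timestamp", "status_updated_timestamp",
    "comments", "assignments", "acknowledgments", "actions",
}

# Placeholder values only; nested shapes are assumed for illustration.
incident = {
    "id": "incident-1",
    "name": "example incident",
    "description": "placeholder description",
    "alarm": {"id": "alarm-1"},
    "alarm_state_transitions": [{"old_state": "OK", "new_state": "ALARM"}],
    "status": "ACKNOWLEDGED",
    "created_timestamp": "2015-01-01T00:00:00Z",
    "status_updated_timestamp": "2015-01-01T00:05:00Z",
    "comments": ["comment-1"],
    "assignments": [{"user_id": "user-1", "timestamp": "2015-01-01T00:02:00Z"}],
    "acknowledgments": [{"user_id": "user-1", "timestamp": "2015-01-01T00:05:00Z"}],
    "actions": ["notification-method-1"],
}

encoded = json.dumps(incident, indent=2)
```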

== Comments API ==

* GET /v2.0/comments
** Query parameters
*** incident_id (string, optional)
* GET /v2.0/comments/{comment-id}
* POST /v2.0/comments

=== Comment Object ===

* id
* incident_id
* created_timestamp
* comment
* user_id (string, required)
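A request body for POST /v2.0/comments could be built as sketched below. The only constraint stated above is that <code>user_id</code> is a required string; the assumption that <code>id</code> and <code>created_timestamp</code> are assigned by the server, and the helper name <code>build_comment_request</code>, are illustrative only.

```python
import json


def build_comment_request(incident_id, user_id, comment):
    """Return a JSON body for creating a comment.

    Assumes `id` and `created_timestamp` are filled in by the server.
    """
    if not isinstance(user_id, str) or not user_id:
        raise ValueError("user_id is required and must be a non-empty string")
    return json.dumps({
        "incident_id": incident_id,
        "user_id": user_id,
        "comment": comment,
    })


body = build_comment_request("incident-1", "user-1", "Disk replaced")
```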

== Architecture ==

* Monasca Incident Manager
** Provides an API that enables the following:
*** Incidents: Query and update incidents, such as updating the status of incidents.
** Creates incidents in the MySQL database based on the rules listed above.
** Publishes incident events to the incident events topic in Kafka. These events are consumed by the Notification Engine and can result in notifications being sent.

* MySQL
** Schemas
*** Incidents
**** id: The ID of the incident.
**** name: The name of the incident.
**** description: The description of the incident.
**** alarm_id
**** alarm_state_transitions: [{alarm_state_transition}]
**** status: OPEN, ACKNOWLEDGED, RESOLVED
**** created_timestamp: The timestamp when the incident was created.
**** status_updated_timestamp: The timestamp when the incident was last updated.
*** IncidentAcknowledgments
**** id
**** incident_id
**** status
**** user_id
**** timestamp
*** IncidentAssignments
**** id
**** incident_id
**** user_id
**** timestamp
*** Comments
**** id
**** incident_id
**** user_id
**** timestamp
**** comment_text
*** IncidentActions
**** id
**** incident_id
**** action
*** IncidentAlarmHistory
**** ?
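The schemas above could be expressed as tables along the following lines. This sketch uses Python's built-in <code>sqlite3</code> module with an in-memory database as a stand-in for MySQL so that it runs anywhere. The table names, column types and constraints are assumptions. IncidentAlarmHistory is omitted because its columns are still undecided, and <code>alarm_state_transitions</code> is left out of the incidents table for the same reason.

```python
import sqlite3

# Column types and constraints are assumptions for illustration.
SCHEMA = """
CREATE TABLE incidents (
    id TEXT PRIMARY KEY,
    name TEXT,
    description TEXT,
    alarm_id TEXT NOT NULL,
    status TEXT NOT NULL
        CHECK (status IN ('OPEN', 'ACKNOWLEDGED', 'RESOLVED')),
    created_timestamp TEXT NOT NULL,
    status_updated_timestamp TEXT
);
CREATE TABLE incident_acknowledgments (
    id INTEGER PRIMARY KEY,
    incident_id TEXT NOT NULL REFERENCES incidents(id),
    status TEXT NOT NULL,
    user_id TEXT NOT NULL,
    timestamp TEXT NOT NULL
);
CREATE TABLE incident_assignments (
    id INTEGER PRIMARY KEY,
    incident_id TEXT NOT NULL REFERENCES incidents(id),
    user_id TEXT NOT NULL,
    timestamp TEXT NOT NULL
);
CREATE TABLE comments (
    id INTEGER PRIMARY KEY,
    incident_id TEXT NOT NULL REFERENCES incidents(id),
    user_id TEXT NOT NULL,
    timestamp TEXT NOT NULL,
    comment_text TEXT NOT NULL
);
CREATE TABLE incident_actions (
    id INTEGER PRIMARY KEY,
    incident_id TEXT NOT NULL REFERENCES incidents(id),
    action TEXT NOT NULL
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Insert one placeholder incident to show the tables are usable.
conn.execute(
    "INSERT INTO incidents (id, alarm_id, status, created_timestamp) "
    "VALUES (?, ?, ?, ?)",
    ("incident-1", "alarm-1", "OPEN", "2015-01-01T00:00:00Z"),
)
tables = {row[0] for row in
          conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")}
```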

== Issues ==

# How should actions be assigned when a new incident is created?
# Should alarm IDs map to incidents directly, or should there be a level of indirection between an incident ID and an alarm ID? In PagerDuty, you create an incident and get a response that contains the incident ID, which the client should store. On subsequent events, the same incident ID can be provided for the same alarm. If the incident has been resolved, a new incident is created and a new incident ID is returned. If the incident has not been resolved, the event is added to the incident. In PagerDuty, the responsibility is on the client to manage the incident IDs associated with an alarm, so that the incident ID can be provided on subsequent alarm events. What is described here is that the Incident Manager creates a new incident when an alarm event occurs but the incident tracking the alarm has already been resolved.
# Teams and groups. PagerDuty has the ability to assign incidents to teams, groups or individuals, with escalation policies.
# Maintenance schedules.
# Should incidents be unassigned when created, or assigned to a user based on an "escalation" policy?
# Incident status or state: which word is better? Alarms have a state. Incidents have a state too, but status seems more appropriate for incidents than state.