Incident management is the end-to-end business process of identifying, analyzing, and resolving an outage or service disruption. The goal of incident management is to keep services running, or to restore them as quickly as possible, while minimizing the impact on the business.
Why Incident Management Is Important
Service interruption incidents can be extremely costly to your business and its teams. Incidents can disrupt operations, lead to temporary downtime, and contribute to the loss of data and productivity. Incident management provides teams with a reliable method to prioritize incidents, get to resolution faster, and offer better service for users.
Benefits of Incident Management
Some of the benefits of incident management include the following:
Increased productivity and efficiency.
Increased visibility and transparency.
Improved mean time to resolution (MTTR). MTTR is a combination of the average time to detect, diagnose, and mitigate incidents.
Improved customer and employee experience.
Prevention of incidents.
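To make the MTTR definition above concrete, the following is a minimal sketch of how the detect, diagnose, and mitigate averages could be combined. The `Incident` record, its timestamp fields, and the `mttr_breakdown` helper are hypothetical names chosen for illustration; they are not part of any Oracle Cloud Infrastructure API, and your incident tracker may record these phases differently.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass
class Incident:
    """Hypothetical incident record; field names are assumptions."""
    started: datetime    # when the disruption began
    detected: datetime   # when the disruption was detected
    diagnosed: datetime  # when the cause was identified
    mitigated: datetime  # when the impact was mitigated


def mean_minutes(deltas):
    """Average a series of timedeltas and return the result in minutes."""
    return mean(d.total_seconds() for d in deltas) / 60


def mttr_breakdown(incidents):
    """Return average minutes per phase, plus their sum as MTTR."""
    detect = mean_minutes(i.detected - i.started for i in incidents)
    diagnose = mean_minutes(i.diagnosed - i.detected for i in incidents)
    mitigate = mean_minutes(i.mitigated - i.diagnosed for i in incidents)
    return {
        "detect": detect,
        "diagnose": diagnose,
        "mitigate": mitigate,
        "mttr": detect + diagnose + mitigate,
    }


# Illustrative data only; these are not real incidents.
incidents = [
    Incident(datetime(2024, 1, 5, 10, 0), datetime(2024, 1, 5, 10, 5),
             datetime(2024, 1, 5, 10, 20), datetime(2024, 1, 5, 10, 30)),
    Incident(datetime(2024, 1, 9, 14, 0), datetime(2024, 1, 9, 14, 15),
             datetime(2024, 1, 9, 14, 25), datetime(2024, 1, 9, 15, 0)),
]
breakdown = mttr_breakdown(incidents)
```

Tracking the three phases separately, rather than only the total, shows whether detection, diagnosis, or mitigation contributes most to resolution time.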
Oracle Cloud Infrastructure Support
When you use Oracle Cloud Infrastructure, you might need help from the community or from Oracle Support. For information about support options, see Getting Help and Contacting Support.
Recommendations
Design a support and incident management strategy to support your environment and minimize service disruptions.
Proactively define your support and incident management strategy wherever possible, but learn from experience and adjust your practices as needed.
Put controls in place to prepare for and respond to incidents. Recommendations include:
Use a system to determine risks, threats, vulnerabilities, and impacts related to security
Use a security information and event management (SIEM) system
Set up a security operations center (SOC)
Set up an incident response team
Implement incident detection, response, and reporting
Define escalation paths
Build a standard post-mortem mechanism
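As an illustration of the "define escalation paths" recommendation above, an escalation path can be written down as data rather than left as tribal knowledge. The following is a sketch under stated assumptions: the roles, the time limits, and the `contacts_to_notify` helper are all hypothetical and should be replaced with your organization's own policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationLevel:
    contact: str        # role to notify (hypothetical examples below)
    after_minutes: int  # minutes unacknowledged before this level is notified


# Assumed example policy, ordered from first responder upward.
ESCALATION_PATH = [
    EscalationLevel("on-call engineer", 0),
    EscalationLevel("incident response team lead", 15),
    EscalationLevel("SOC manager", 30),
]


def contacts_to_notify(minutes_unacknowledged, path=ESCALATION_PATH):
    """Return every contact whose escalation time limit has been reached."""
    return [
        level.contact
        for level in path
        if minutes_unacknowledged >= level.after_minutes
    ]


notified = contacts_to_notify(20)
```

Keeping the path in one declarative structure makes it easy to review, test, and update after a post-mortem.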
Develop an operations strategy to detect, prevent, respond to, and recover from events. Recommendations include:
Monitor system performance metrics
Document and test a disaster recovery plan
Understand key roles needed for disaster recovery coordination
Plan for interactions with Oracle Cloud Infrastructure support
Respond to incidents
Simulate attacks based on real incidents
Prepare for application failure
Recover from data corruption
Recover from network outage
Recover from a dependent service failure
Recover from a region-wide service disruption
Learn from disaster recovery tests, and improve processes
Expect failure and learn from mistakes
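For the first item in the list above, monitoring system performance metrics, one common detection approach is to raise an alert only when a metric stays above a threshold for several consecutive samples, which reduces noise from brief spikes. The sketch below assumes a plain list of samples; the function name, the threshold, and the sample values are illustrative and do not represent a specific monitoring service.

```python
def breaches_threshold(samples, threshold, consecutive=3):
    """Return True if `consecutive` samples in a row exceed `threshold`."""
    run = 0  # length of the current streak of breaching samples
    for value in samples:
        run = run + 1 if value > threshold else 0
        if run >= consecutive:
            return True
    return False


# Illustrative CPU utilization percentages, not real measurements.
sustained = breaches_threshold([50, 95, 96, 97], threshold=90)
spiky = breaches_threshold([95, 50, 95, 50, 95], threshold=90)
```

In this sketch the sustained series triggers an alert, while the series with isolated spikes does not. The right threshold and streak length depend on the metric and on how quickly you need to detect an event.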
We recommend that you formalize a support contract with Oracle or an approved partner to help keep your organization's systems running at peak performance. Leverage these partnerships when critical events are scheduled, such as migrations or expected increases in demand. Doing so ensures that you can benefit from the right support, best practices, and expertise. It can also ensure a feedback mechanism directly with Oracle engineering for continuous improvement of the platform.