Compute Instance Metrics
You can monitor the health, capacity, and performance of your compute instances by using metrics, alarms, and notifications.
This topic describes the metrics emitted by the metric namespace oci_computeagent (the Compute Instance Monitoring plugin on compute instances).
You can view these metrics for individual compute instances, and for all the instances in an instance pool.
Resources: Monitoring-enabled compute instances.
Overview of Metrics for an Instance and Related Resources
This section gives an overall picture of the different types of metrics available for an instance and its storage and network devices. See the following diagram and table for a summary.
Metric Namespace | Resource ID | Where Measured | Available Metrics |
---|---|---|---|
oci_computeagent | Instance OCID | On the instance. The metrics in this namespace are aggregated across all the related resources on the instance. For example, DiskBytesRead is aggregated across all the instance's attached storage volumes, and NetworkBytesIn is aggregated across all the instance's attached VNICs. | See Available Metrics: oci_computeagent. |
oci_blockstore | Boot or block volume OCID | By the Block Volume service. The metrics are for an individual volume (either boot volume or block volume). | See Block Volume Metrics. |
oci_vcn | VNIC OCID | By the Networking service. The metrics are for an individual VNIC. | See VNIC Metrics. |
Before You Begin
- IAM policies: To monitor resources, you must be granted the required type of access in a policy written by an administrator, whether you're using the Console or the REST API with an SDK, CLI, or other tool. The policy must give you access to the monitoring services as well as the resources being monitored. If you try to perform an action and get a message that you don't have permission or are unauthorized, contact the administrator to find out what type of access you were granted and which compartment you need to work in. For more information about user authorizations for monitoring, see IAM Policies.
- Metrics exist in Monitoring: The resources that you want to monitor must emit metrics to the Monitoring service.
- Compute instances: To emit metrics, the Compute Instance Monitoring plugin must be enabled on the instance, and plugins must be running. The instance must also have either a service gateway or a public IP address to send metrics to the Monitoring service. For more information, see Enabling Monitoring for Compute Instances.
Available Metrics: oci_computeagent
The compute instance metrics help you measure activity level and throughput of compute instances. The metrics listed in the following table are available for any monitoring-enabled compute instance. To get these metrics, enable monitoring on the instance.
The metrics in this namespace are aggregated across all the related resources on the instance. For example, DiskBytesRead is aggregated across all the instance's attached storage volumes, and NetworkBytesIn is aggregated across all the instance's attached VNICs.
For metrics emitted by the metric namespace oci_computeagent, data points are sampled every ten seconds. A batch of six data points is emitted every minute. Therefore, at one-minute granularity, the aggregate count is always six, the aggregate sum is the sum of the six data points, and the aggregate average is the average of the six data points.
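The one-minute aggregation described above can be illustrated with a minimal Python sketch. The function name and the sample values are hypothetical and only show how count, sum, and average relate to a batch of six 10-second samples; this is not how the Monitoring service is implemented.

```python
from statistics import mean


def aggregate_minute(samples):
    """Aggregate one minute of 10-second samples into count, sum, and mean."""
    if len(samples) != 6:
        raise ValueError("expected six 10-second samples per minute")
    return {"count": len(samples), "sum": sum(samples), "mean": mean(samples)}


# Hypothetical CpuUtilization samples (percent), one every ten seconds.
cpu_samples = [12.0, 15.0, 11.0, 14.0, 13.0, 13.0]
result = aggregate_minute(cpu_samples)
print(result)  # {'count': 6, 'sum': 78.0, 'mean': 13.0}
```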
You also can use the Monitoring service to create custom queries.
Each metric includes the following dimensions:
- availabilityDomain: The availability domain where the instance resides.
- faultDomain: The fault domain where the instance resides.
- imageId: The OCID of the image for the instance.
- instancePoolId: The instance pool that the instance belongs to.
- region: The region where the instance resides.
- resourceDisplayName: The friendly name of the instance.
- resourceId: The OCID of the instance.
- shape: The shape of the instance.
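Custom queries typically filter on the dimensions listed above. The following sketch builds a query string in the general shape of a Monitoring Query Language (MQL) expression, `metric[interval]{dimension = "value"}.statistic()`. The helper name and the OCID are hypothetical, and the exact query syntax should be checked against the Monitoring documentation before use.

```python
def build_query(metric, interval, statistic, dimensions=None):
    """Build an MQL-style query string with optional dimension filters."""
    filters = ""
    if dimensions:
        pairs = ", ".join(f'{name} = "{value}"' for name, value in dimensions.items())
        filters = "{" + pairs + "}"
    return f"{metric}[{interval}]{filters}.{statistic}()"


# Hypothetical instance OCID, used only for illustration.
query = build_query(
    "CpuUtilization",
    "1m",
    "mean",
    {"resourceId": "ocid1.instance.oc1..example"},
)
print(query)
```

With the inputs shown, the sketch prints `CpuUtilization[1m]{resourceId = "ocid1.instance.oc1..example"}.mean()`.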
Metric | Metric Display Name | Unit | Description | Dimensions |
---|---|---|---|---|
CpuUtilization | CPU Utilization | percent | Activity level from CPU. Expressed as a percentage of total time. For instance pools, the value is averaged across all instances in the pool. | |
DiskBytesRead ¹ ³ | Disk Read Bytes | bytes | Read throughput. Expressed as bytes read per interval. | |
DiskBytesWritten ¹ ³ | Disk Write Bytes | bytes | Write throughput. Expressed as bytes written per interval. | |
DiskIopsRead ¹ ³ | Disk Read I/O | operations | Activity level from I/O reads. Expressed as reads per interval. | |
DiskIopsWritten ¹ ³ | Disk Write I/O | operations | Activity level from I/O writes. Expressed as writes per interval. | |
LoadAverage | Load Average | number of processes | Average system load calculated over a 1-minute period. | |
MemoryAllocationStalls | Memory Allocation Stalls | number of stalls | Number of times page reclaim was called directly. | |
MemoryUtilization ¹ | Memory Utilization | percent | Space currently in use. Measured by pages. Expressed as a percentage of used pages. For instance pools, the value is averaged across all instances in the pool. | |
NetworksBytesIn ¹ ² | Network Receive Bytes | bytes | Network receipt throughput. Expressed as bytes received. | |
NetworksBytesOut ¹ ² | Network Transmit Bytes | bytes | Network transmission throughput. Expressed as bytes transmitted. | |
¹ This metric is a cumulative counter that shows monotonically increasing behavior for each session of the Oracle Cloud Agent software, resetting when the operating system is restarted.
² The Networking service provides more metrics (in the
³ The Block Volume service provides more metrics (in the
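Because the metrics marked as cumulative counters increase monotonically and reset when the operating system restarts, per-interval values have to be derived from consecutive readings. The sketch below shows one common way to do that, assuming a reset can be recognized by a reading that is lower than the previous one. The function name and readings are hypothetical.

```python
def counter_deltas(readings):
    """Convert cumulative counter readings into per-interval increases."""
    deltas = []
    for previous, current in zip(readings, readings[1:]):
        if current >= previous:
            deltas.append(current - previous)
        else:
            # Assumption: a drop means the counter restarted from zero,
            # so the current reading is the increase since the reset.
            deltas.append(current)
    return deltas


# Hypothetical DiskBytesRead readings; the drop to 300 simulates an OS restart.
readings = [1000, 1500, 2100, 300, 900]
deltas = counter_deltas(readings)
print(deltas)  # [500, 600, 300, 600]
```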
Available Metrics: gpu_infrastructure_health
The compute instance metrics help you measure the activity level and throughput of compute instances. The metrics listed in the following table are available for any monitoring-enabled compute instance. To get these metrics, enable monitoring on the instance.
The metrics in this namespace are aggregated across all the related resources on the instance. For example, DiskBytesRead is aggregated across all the instance's attached storage volumes, and NetworkBytesIn is aggregated across all the instance's attached VNICs.
For metrics emitted by the metric namespace gpu_infrastructure_health, data points are sampled every ten seconds. A batch of six data points is emitted every minute. Therefore, at one-minute granularity, the aggregate count is always six, the aggregate sum is the sum of the six data points, and the aggregate average is the average of the six data points.
You also can use the Monitoring service to create custom queries.
Each metric includes the following dimensions:
- component: GPU or rdma_nic.
- timestamp: UTC time when the payload/heartbeat is emitted.
- version: The payload version number for compatibility.
Metric | Metric Display Name | Unit | Description | Dimensions |
---|---|---|---|---|
GpuUtilization | GPU utilization | percent | Activity level from GPU. Expressed as a percentage of total time. For instance pools, the value is averaged across all instances in the pool. | |
GpuMemoryUtilization | GPU memory utilization | percent | The percentage of the GPU memory resource in use. | |
GpuPowerDraw | GPU power draw | integer | The amount of GPU power used. | |
GpuTemperature | GPU temperature | integer | The GPU temperature reported. | |
GpuEccSingleBitErrors | GPU single-bit errors | integer | The number of GPU single-bit ECC errors reported. | |
GpuEccDoubleBitErrors | GPU double-bit errors | integer | The number of GPU double-bit ECC errors reported. | |
Fault Metrics: gpu_infrastructure_health
Metric | Metric Display Name | Unit | Description | Dimensions |
---|---|---|---|---|
Fault | GPU fault | count | If the value is 0, there are no faults. If the value is 1, faults are detected. | |
Available Metrics: rdma_infrastructure_health
The compute instance metrics help you measure activity level and throughput of compute instances. The metrics listed in the following table are available for any monitoring-enabled compute instance. To get these metrics, enable monitoring on the instance.
The metrics in this namespace are aggregated across all the related resources on the instance. For example, DiskBytesRead is aggregated across all the instance's attached storage volumes, and NetworkBytesIn is aggregated across all the instance's attached VNICs.
For metrics emitted by the metric namespace rdma_infrastructure_health, data points are sampled every ten seconds. A batch of six data points is emitted every minute. Therefore, at one-minute granularity, the aggregate count is always six, the aggregate sum is the sum of the six data points, and the aggregate average is the average of the six data points.
You also can use the Monitoring service to create custom queries.
Each metric includes the following dimensions:
- component: GPU or rdma_nic.
- timestamp: UTC time when the payload/heartbeat is emitted.
- version: The payload version number for compatibility.
Metric | Metric Display Name | Unit | Description | Dimensions |
---|---|---|---|---|
RdmaTxBytes | RDMA aggregate network transmit bytes | bytes | The bytes transmitted on the RDMA interface. | |
RdmaRxBytes | RDMA aggregate network receive bytes | bytes | The bytes received on the RDMA interface. | |
RdmaTxPackets | RDMA aggregate network transmit packets | integer | The number of RDMA interface packets transmitted. | |
RdmaRxPackets | RDMA aggregate network receive packets | integer | The number of RDMA interface packets received. | |
Fault Metrics: rdma_infrastructure_health
Metric | Metric Display Name | Unit | Description | Dimensions |
---|---|---|---|---|
RdmaLinkSpeedFault | Faults | count | Detects if a link speed fault is present. If the value is 0, there are no faults. If the value is 1, faults are detected. | |
RdmaPcieAddressFault | Faults | count | Detects if a PCIe address fault is present. If the value is 0, there are no faults. If the value is 1, faults are detected. | |
RdmaPcieBerCheckFault | Faults | count | Detects if a PCIe BER fault is present. If the value is 0, there are no faults. If the value is 1, faults are detected. | |
RdmaPcieCableFlapFault | Faults | count | Detects if a PCIe cable flap fault is present. If the value is 0, there are no faults. If the value is 1, faults are detected. | |
RdmaPcieCablePlugFault | Faults | count | Detects if a PCIe cable plug fault is present. If the value is 0, there are no faults. If the value is 1, faults are detected. | |
RdmaPcieCableStateFault | Faults | count | Detects if a PCIe cable state fault is present. If the value is 0, there are no faults. If the value is 1, faults are detected. | |
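Each RDMA fault metric reports 0 when no fault is present and 1 when a fault is detected, so a monitoring script can reduce a set of latest values to the names of the active faults. The following is a minimal sketch under that reading. The function name and the sample values are hypothetical, and retrieving the values from the Monitoring service is not shown.

```python
def active_faults(latest_values):
    """Return the names of fault metrics whose latest value is 1."""
    return sorted(name for name, value in latest_values.items() if value == 1)


# Hypothetical latest data points for four of the fault metrics.
latest = {
    "RdmaLinkSpeedFault": 0,
    "RdmaPcieAddressFault": 1,
    "RdmaPcieBerCheckFault": 0,
    "RdmaPcieCableFlapFault": 1,
}
faults = active_faults(latest)
print(faults)  # ['RdmaPcieAddressFault', 'RdmaPcieCableFlapFault']
```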
Using the Console
To view metric charts for a single instance:
- Open the navigation menu and select Compute. Under Compute, select Instances.
- Click the instance that you're interested in.
- Under Resources, click Metrics.
- In the Metric namespace list, select oci_computeagent.
The Metrics page displays a default set of charts for the current instance.
Not seeing any metric charts for the instance? If you don't see any metric charts, the instance might not be emitting metrics. See the following possible causes and resolutions.
Possible cause | How to check | Resolution |
---|---|---|
The Compute Instance Monitoring plugin is disabled on the instance or plugins are stopped. | Review the instance properties. | Enable the Compute Instance Monitoring plugin and start all plugins. |
The instance can't access the Monitoring service because its VCN doesn't use the internet. | Review the instance's IP address. If it's not public, then a service gateway is needed. | Set up a service gateway. |
The instance doesn't use a supported image. | Review the supported images. | Create an instance with a supported image. |
Older images and custom images: No Oracle Cloud Agent software exists on the instance. | Connect to the instance and look for the software. | Install the Oracle Cloud Agent software. |
Something else is wrong with the Oracle Cloud Agent software. | (not applicable) | Follow the troubleshooting steps for Oracle Cloud Agent. |
For more information about monitoring metrics and using alarms, see Overview of Monitoring. For information about notifications for alarms, see Overview of Notifications.
To view metric charts on the Service Metrics page:
- Open the navigation menu and select Observability & Management. Under Monitoring, select Service Metrics.
- Select a compartment.
- For Metric namespace, select oci_computeagent.
The Service Metrics page dynamically updates to show charts for each metric that is emitted by the selected metric namespace.
For more information about monitoring metrics and using alarms, see Overview of Monitoring. For information about notifications for alarms, see Overview of Notifications.
To view metric charts for an instance pool:
- Open the navigation menu and select Compute. Under Compute, select Instance Pools.
- Click the instance pool that you're interested in.
- Under Resources, click Metrics.
- In the Metric namespace list, select oci_computeagent.
The Metrics page displays a default set of charts for the current instance pool.
For more information about monitoring metrics and using alarms, see Overview of Monitoring. For information about notifications for alarms, see Overview of Notifications.
Using the API
For information about using the API and signing requests, see REST API documentation and Security Credentials. For information about SDKs, see SDKs and the CLI.
- Monitoring API for metrics and alarms
- Notifications API for notifications (used with alarms)
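As a rough illustration of querying metrics through the Monitoring API, the sketch below assembles a JSON request body for a metric-data query. The field names (namespace, query, startTime, endTime, resolution) are assumptions based on the Monitoring API's SummarizeMetricsData operation and should be verified against the REST API documentation. The helper name is hypothetical, and request signing and the HTTP call itself are not shown.

```python
import json
from datetime import datetime, timedelta, timezone


def summarize_request_body(query, end, minutes=60, resolution="1m"):
    """Build a request body covering the `minutes` before `end` (UTC)."""
    start = end - timedelta(minutes=minutes)

    def to_rfc3339(moment):
        # Render a UTC datetime with a trailing "Z" instead of "+00:00".
        return moment.isoformat().replace("+00:00", "Z")

    return {
        "namespace": "oci_computeagent",
        "query": query,
        "startTime": to_rfc3339(start),
        "endTime": to_rfc3339(end),
        "resolution": resolution,
    }


# A fixed end time keeps the example reproducible.
end_time = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
body = summarize_request_body("CpuUtilization[1m].mean()", end_time)
print(json.dumps(body, indent=2))
```

In practice, an SDK or the CLI handles request signing and serialization, so a hand-built body like this is mainly useful for understanding what a query request contains.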