High Availability

Compute Cloud@Customer is built to eliminate single points of failure, enabling the system and hosted workloads to remain operational in case of hardware or software faults, and during upgrades and maintenance operations.

Compute Cloud@Customer has redundancy built into its architecture at every level: hardware, controller software, master database, services, and so on. Features such as backup, automated service requests and optional disaster recovery further enhance the system's serviceability and continuity of service.

Hardware Redundancy

The minimum base rack configuration contains redundant networking, storage and server components to ensure that failure of any single element doesn't affect overall system availability.

Data connectivity throughout the system is built on redundant pairs of leaf and spine switches. Link aggregation is configured on all interfaces: switch ports, host NICs and uplinks. The leaf switches interconnect the rack components using cross-cabling to redundant network interfaces in each component. Each leaf switch also has a connection to each of the spine switches, which are also interconnected. The spine switches form the backbone of the network and enable traffic external to the rack. Their uplinks to the data center network consist of two cable pairs, which are cross-connected to two redundant ToR (top-of-rack) switches.

The management cluster, which runs the controller software and system-level services, consists of three fully active management nodes. Inbound requests pass through the virtual IP of the management node cluster, and are distributed across the three nodes by a load balancer. If one of the nodes stops responding and is fenced from the cluster, the load balancer continues to send traffic to the two remaining nodes until the failing node is healthy again and rejoins the cluster.
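
The following is a minimal sketch, not the actual controller code, of how a health-checking load balancer can keep distributing requests across whichever management nodes are currently responsive. The node names and health check are hypothetical.

    import itertools

    # Hypothetical names for the three management nodes behind the cluster's virtual IP.
    MANAGEMENT_NODES = ["mgmt-node-1", "mgmt-node-2", "mgmt-node-3"]
    _rotation = itertools.cycle(MANAGEMENT_NODES)

    def is_healthy(node: str) -> bool:
        # Placeholder health check; a real load balancer probes the node's service
        # endpoint and fences nodes that stop responding.
        return node != "mgmt-node-2"  # simulate one fenced node

    def pick_backend() -> str:
        """Return the next healthy management node in round-robin order."""
        for _ in range(len(MANAGEMENT_NODES)):
            node = next(_rotation)
            if is_healthy(node):
                return node
        raise RuntimeError("no healthy management nodes available")

    if __name__ == "__main__":
        # Traffic keeps flowing to the two remaining nodes while one is fenced.
        print([pick_backend() for _ in range(4)])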

Storage for the system and for the cloud resources in the environment is provided by the internal ZFS Storage Appliance. Its two controllers form an active-active cluster, providing high availability and excellent throughput at the same time. The ZFS pools are built on disks in a mirrored configuration for the best data protection.

System Availability

The software and services layer is deployed on the three-node management cluster, and takes advantage of the high availability that's inherent to the cluster design. The Kubernetes container orchestration environment also uses clustering for both its own controller nodes and the service pods it hosts. Multiple replicas of each microservice are running at any given time. Nodes and pods are distributed across the management nodes, and Kubernetes ensures that failing pods are replaced with new instances to keep all services running in an active/active setup.
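
As a rough illustration of this reconciliation behavior, and not of Oracle's implementation, the sketch below compares a service's desired replica count with the replicas still running and starts replacements for failed pods. The service name and counts are invented.

    from dataclasses import dataclass, field

    @dataclass
    class Service:
        name: str
        desired_replicas: int
        running_replicas: list = field(default_factory=list)

    def reconcile(service: Service) -> None:
        """Start replacement replicas until the running count matches the desired count."""
        while len(service.running_replicas) < service.desired_replicas:
            # In Kubernetes this would be a new pod scheduled onto a healthy node;
            # here a placeholder replica identifier stands in for it.
            replica_id = f"{service.name}-{len(service.running_replicas)}"
            service.running_replicas.append(replica_id)

    if __name__ == "__main__":
        api = Service("compute-api", desired_replicas=3, running_replicas=["compute-api-0"])
        reconcile(api)  # two replicas were lost and are replaced
        print(api.running_replicas)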

All services and components store data in a common, central MySQL database. The MySQL cluster database has instances deployed across the three management nodes. Availability, load balancing, data synchronization and clustering are all controlled by internal components of the MySQL cluster.

A significant part of the system-level infrastructure networking is software-defined. The configuration of virtual switches, routers and gateways isn't stored and managed by the switches, but is distributed across several components of the network architecture. The network controller is deployed as a highly available containerized service.

The upgrade framework leverages the hardware redundancy and the clustered designs to provide rolling upgrades for all components. During the upgrade of one component instance, the remaining instances ensure that there's no downtime. The upgrade is complete when all component instances have been upgraded and returned to normal operation.
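
A rolling upgrade of a clustered component can be pictured with the sketch below; the instance names, upgrade step, and health check are hypothetical stand-ins, not the actual upgrade framework.

    INSTANCES = ["instance-a", "instance-b", "instance-c"]  # hypothetical clustered component

    def upgrade(instance: str) -> None:
        # Stand-in for the actual upgrade work on a single instance.
        print(f"upgrading {instance}; the remaining instances keep serving requests")

    def is_back_in_service(instance: str) -> bool:
        # Stand-in for a post-upgrade health check.
        return True

    def rolling_upgrade(instances: list) -> None:
        """Upgrade one instance at a time, proceeding only when it rejoins healthy."""
        for instance in instances:
            upgrade(instance)
            if not is_back_in_service(instance):
                raise RuntimeError(f"{instance} failed its health check; upgrade halted")
        print("all instances upgraded and returned to normal operation")

    if __name__ == "__main__":
        rolling_upgrade(INSTANCES)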

Compute Instance Availability

For a compute instance, high availability refers to the automated recovery of an instance in case the underlying infrastructure fails. The state of the compute nodes, hypervisors, and compute instances is monitored continually. Each compute node is polled at a 5-minute interval. When a compute instance goes down, the system tries to recover it automatically by default.

By default, the system attempts to restart instances in their selected fault domain, but restarts them in another fault domain if insufficient resources are available in the selected fault domain. The selected fault domain is the fault domain that's specified in the instance configuration.
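
The placement preference can be summarized with the sketch below: the selected fault domain is tried first, and another fault domain is used only when the selected one lacks capacity. The capacity figures and the single "required" amount are simplifications made up for this example.

    FAULT_DOMAINS = ["FAULT-DOMAIN-1", "FAULT-DOMAIN-2", "FAULT-DOMAIN-3"]

    def place_instance(selected_fd: str, free_capacity: dict, required: int) -> str:
        """Prefer the selected fault domain; fall back to any other with enough capacity."""
        candidates = [selected_fd] + [fd for fd in FAULT_DOMAINS if fd != selected_fd]
        for fd in candidates:
            if free_capacity.get(fd, 0) >= required:
                return fd
        raise RuntimeError("no fault domain has sufficient resources")

    if __name__ == "__main__":
        capacity = {"FAULT-DOMAIN-1": 0, "FAULT-DOMAIN-2": 8, "FAULT-DOMAIN-3": 4}
        # The selected fault domain is full, so the instance is restarted elsewhere.
        print(place_instance("FAULT-DOMAIN-1", capacity, required=2))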

If a compute node goes down because of an unplanned reboot, its instances are restarted when the compute node successfully returns to normal operation. At the next polling interval, if instances are found that should be running but are in a different state, the start command is issued again by default. If any instances have stopped and remain in that state, the hypervisor tries to restart them up to 5 times. Instances that weren't running before the compute node became unavailable remain shut down when the compute node is up and running again.
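
The restart behavior after an unplanned reboot can be sketched roughly as follows; the state names and data structure are invented, and only the retry limit of 5 comes from the description above.

    MAX_RESTART_ATTEMPTS = 5  # retry limit described above

    def recover_instances(instances: dict) -> None:
        """instances maps an instance name to its expected and actual state."""
        for name, state in instances.items():
            if state["expected"] != "running":
                continue  # instances that were already stopped stay shut down
            attempts = 0
            while state["actual"] != "running" and attempts < MAX_RESTART_ATTEMPTS:
                attempts += 1
                print(f"issuing start command for {name} (attempt {attempts})")
                state["actual"] = "running"  # assume the restart succeeds in this sketch

    if __name__ == "__main__":
        recover_instances({
            "web-vm": {"expected": "running", "actual": "stopped"},
            "db-vm": {"expected": "stopped", "actual": "stopped"},
        })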

A compute node is considered failing when it has been disconnected from the data network or has been in a powered-off state for about 5 minutes. This 5-minute timeout corresponds to two unsuccessful polling attempts, and is the threshold for placing the compute node in FAIL state and its agent in EVACUATING state. This condition is required before the reboot migration can start.

Reboot migration means that all compute instances from the failing compute node are stopped and restarted on another compute node. When migration is complete, the failing compute node's agent indicates that instances have been evacuated. If the compute node eventually reboots successfully, it must go through a cleanup process that removes all stale instance configurations and associated virtual disks. After cleanup, the compute node can host compute instances again.

During the entire reboot migration, the instances remain in the "moving" configuration state. After migration is complete, the instance configuration state changes to "running". Instances that were stopped before the failure aren't migrated, because they aren't associated with any compute node.
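
Taken together, failure detection and reboot migration form a small state machine. The sketch below models it with the state names used above; the function structure and instance names are illustrative, not the controller's actual logic.

    FAIL_THRESHOLD_POLLS = 2  # unsuccessful polling attempts before the node is marked FAIL

    def compute_node_state(missed_polls: int) -> str:
        """Mark the compute node FAIL once the polling threshold is crossed."""
        return "FAIL" if missed_polls >= FAIL_THRESHOLD_POLLS else "HEALTHY"

    def reboot_migrate(instances: list) -> dict:
        """Stop the instances of the failing node and restart them on another node."""
        states = {instance: "moving" for instance in instances}  # state during migration
        for instance in instances:
            states[instance] = "running"                         # restarted elsewhere
        return states

    if __name__ == "__main__":
        if compute_node_state(missed_polls=2) == "FAIL":
            # The agent goes to EVACUATING and the instances are migrated.
            print(reboot_migrate(["app-vm-1", "app-vm-2"]))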

Continuity of Service

Compute Cloud@Customer offers several features that further enhance high availability. Health monitoring at all levels of the system is a key factor. Diagnostic and performance data is collected from all components, then centrally stored and processed, and made available to Oracle personnel.

To mitigate data loss and support the recovery of system and services configuration in case of failure, consistent and complete backups are made regularly.

Optionally, workloads deployed on Compute Cloud@Customer can be protected against downtime and data loss through the implementation of disaster recovery. To achieve this, two Compute Cloud@Customer infrastructures need to be set up at different sites, and configured to be each other's replica. Resources under disaster recovery control are stored separately on the ZFS Storage Appliances in each system, and replicated between the two. When an incident occurs at one site, the environment is brought up on the replica system with minimal downtime. We recommend that disaster recovery be implemented for all critical production systems.
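
As a hypothetical illustration of the replica relationship, the sketch below pairs two systems and decides where workloads under disaster recovery control should run; the site names are invented, and a real switchover involves considerably more than this.

    # Hypothetical pairing of two Compute Cloud@Customer systems configured as
    # each other's replica.
    REPLICA_OF = {"ccc-site-a": "ccc-site-b", "ccc-site-b": "ccc-site-a"}

    def recovery_site(primary: str, primary_available: bool) -> str:
        """Return the system where DR-protected workloads should run."""
        return primary if primary_available else REPLICA_OF[primary]

    if __name__ == "__main__":
        # An incident at site A brings the protected environment up at site B.
        print(recovery_site("ccc-site-a", primary_available=False))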