Compute Cloud@Customer is built to eliminate single points of failure,
enabling the system and hosted workloads to remain operational in case of hardware or
software faults, and during upgrades and maintenance operations.
Compute Cloud@Customer has redundancy built into its architecture at every
level: hardware, controller software, master database, services, and so on. Features
such as backup, automated service requests and optional disaster recovery further
enhance the system's serviceability and continuity of service.
Hardware Redundancy
The minimum base rack configuration contains redundant networking, storage and server
components to ensure that failure of any single element doesn't affect overall
system availability.
Data connectivity throughout the system is built on redundant pairs of leaf and spine
switches. Link aggregation is configured on all interfaces: switch ports, host NICs
and uplinks. The leaf switches interconnect the rack components using cross-cabling
to redundant network interfaces in each component. Each leaf switch also has a
connection to each of the spine switches, which are also interconnected. The spine
switches form the backbone of the network and enable traffic external to the rack.
Their uplinks to the data center network consist of two cable pairs, which are
cross-connected to two redundant ToR (top-of-rack) switches.
The management cluster, which runs the controller software and system-level services,
consists of three fully active management nodes. Inbound requests pass through the
virtual IP of the management node cluster, and are distributed across the three
nodes by a load balancer. If one of the nodes stops responding and is fenced off from the
cluster, the load balancer continues to send traffic to the two remaining nodes
until the failing node is healthy again and rejoins the cluster.
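As an illustration of this behavior, the sketch below shows a load balancer that forwards requests only to nodes passing a health check. The node names, port, and health test are assumptions made for the example; they aren't the actual controller implementation.

import socket

# Hypothetical addresses; the real management node names and virtual IP are site-specific.
MANAGEMENT_NODES = ["mgmt01.example.local", "mgmt02.example.local", "mgmt03.example.local"]
API_PORT = 443

def healthy(node, timeout=2.0):
    # Treat a node as healthy if its API port accepts a TCP connection.
    try:
        with socket.create_connection((node, API_PORT), timeout=timeout):
            return True
    except OSError:
        return False

def pick_backend(request_id):
    # Round-robin across the nodes that currently pass the health check.
    candidates = [node for node in MANAGEMENT_NODES if healthy(node)]
    if not candidates:
        raise RuntimeError("no healthy management node available")
    return candidates[request_id % len(candidates)]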
Storage for the system and for the cloud resources in the environment is provided by
the internal ZFS Storage Appliance. Its two controllers form an active-active
cluster, providing high availability and excellent throughput at the same time. The
ZFS pools are built on disks in a mirrored configuration for the best data
protection.
System Availability
The software and services layer is deployed on the three-node management cluster,
and takes advantage of the high availability that's inherent to the cluster design.
The Kubernetes container orchestration environment also uses clustering for both its
own controller nodes and the service pods it hosts. Many replicas of the
microservices are running at any particular time. Nodes and pods are distributed
across the management nodes, and Kubernetes ensures that failing pods are replaced
with new instances to keep all services running in an active/active setup.
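The effect of this design can be pictured with a Deployment-style specification: three replicas of a service, spread across distinct nodes, that Kubernetes recreates whenever one fails. The service name, image, and labels below are hypothetical; the appliance's internal manifests aren't exposed.

# Illustrative Deployment spec, expressed as a Python dict, for a hypothetical
# "metadata-api" microservice.
deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "metadata-api"},
    "spec": {
        "replicas": 3,  # one replica per management node for an active/active setup
        "selector": {"matchLabels": {"app": "metadata-api"}},
        "template": {
            "metadata": {"labels": {"app": "metadata-api"}},
            "spec": {
                "affinity": {
                    "podAntiAffinity": {  # spread the replicas across different nodes
                        "requiredDuringSchedulingIgnoredDuringExecution": [{
                            "labelSelector": {"matchLabels": {"app": "metadata-api"}},
                            "topologyKey": "kubernetes.io/hostname",
                        }]
                    }
                },
                "containers": [{"name": "metadata-api",
                                "image": "registry.local/metadata-api:1.0"}],
            },
        },
    },
}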
All services and components store data in a common, central MySQL database. The MySQL
cluster database has instances deployed across the three management nodes.
Availability, load balancing, data synchronization and clustering are all controlled
by internal components of the MySQL cluster.
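From a client's perspective, surviving the loss of a single database instance can be pictured as trying each instance in turn until a connection succeeds. This is purely illustrative: availability and load balancing are actually handled by the cluster's internal components, and the host names below are assumptions.

import mysql.connector  # pip install mysql-connector-python

# Hypothetical instance addresses; the real cluster endpoints are internal.
DB_HOSTS = ["mgmt01.example.local", "mgmt02.example.local", "mgmt03.example.local"]

def connect_to_cluster(user, password, database):
    # Try each MySQL instance in turn and return the first usable connection.
    last_error = None
    for host in DB_HOSTS:
        try:
            return mysql.connector.connect(host=host, user=user, password=password,
                                           database=database, connection_timeout=5)
        except mysql.connector.Error as exc:
            last_error = exc  # instance unreachable; try the next one
    raise RuntimeError("no MySQL cluster instance reachable: {}".format(last_error))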
A significant part of the system-level infrastructure networking is software-defined.
The configuration of virtual switches, routers and gateways isn't stored and managed
by the switches, but is distributed across several components of the network
architecture. The network controller is deployed as a highly available containerized
service.
The upgrade framework leverages the hardware redundancy and the clustered designs to
provide rolling upgrades for all components. During the upgrade of one component
instance, the remaining instances ensure that there's no downtime. The upgrade is
complete when all component instances have been upgraded and returned to normal
operation.
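A rolling upgrade can be summarized as the loop sketched below: upgrade one instance, wait for it to return to normal operation, then move on to the next. The callables and timing parameters are assumptions for the example, not the appliance's actual upgrade framework.

import time

def rolling_upgrade(instances, upgrade, is_healthy, poll_seconds=30, timeout_seconds=1800):
    # Upgrade one instance at a time so the remaining instances keep serving traffic.
    # `upgrade` and `is_healthy` are caller-supplied callbacks; the intervals are
    # illustrative defaults.
    for instance in instances:
        upgrade(instance)
        deadline = time.time() + timeout_seconds
        while not is_healthy(instance):  # wait for the instance to rejoin the cluster
            if time.time() > deadline:
                raise RuntimeError("{} did not return to normal operation".format(instance))
            time.sleep(poll_seconds)
    # The upgrade is complete only when every instance has been upgraded and is healthy.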
Compute Instance Availability
For a compute instance, high availability refers to the automated recovery of an
instance in case the underlying infrastructure fails. The state of the compute
nodes, hypervisors, and compute instances is monitored continually. Each compute
node is polled at a 5-minute interval. When compute instances go down, the system
by default tries to recover them automatically.
By default, the system attempts to restart instances in their selected fault domain
but will restart instances in other fault domains if insufficient resources are
available in the selected fault domain. The selected fault domain is the fault
domain that's specified in the instance configuration.
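The placement preference can be expressed as a small selection function: use the selected fault domain when it has enough free capacity, otherwise fall back to another fault domain. The capacity model is an assumption made for the example; the real placement logic is internal to the system.

def choose_fault_domain(selected_fd, free_capacity_by_fd, required_capacity):
    # Prefer the fault domain specified in the instance configuration.
    if free_capacity_by_fd.get(selected_fd, 0) >= required_capacity:
        return selected_fd
    # Otherwise restart the instance in any other fault domain with room.
    for fd, free in free_capacity_by_fd.items():
        if fd != selected_fd and free >= required_capacity:
            return fd
    return None  # no fault domain can currently host the instance

# Example: the selected fault domain is full, so another one is chosen.
print(choose_fault_domain("FAULT-DOMAIN-1",
                          {"FAULT-DOMAIN-1": 0, "FAULT-DOMAIN-2": 4, "FAULT-DOMAIN-3": 8},
                          required_capacity=2))  # prints FAULT-DOMAIN-2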
If a compute node goes down because of an unplanned reboot, instances are restarted
when the compute node successfully returns to normal operation. At the next polling
interval, if instances are found that should be running but are in a different
state, the start command is issued again by default. If any instances have stopped
and remain in that state, the hypervisor tries to restart them, up to 5 times.
Instances that weren't running before the compute node became unavailable remain
shut down when the compute node is up and running again.
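The restart behavior resembles the reconciliation loop sketched below, which re-issues a start command at each polling interval for instances that should be running, and gives up after five attempts. The data structures and the start_instance callback are assumptions made for the example.

MAX_RESTART_ATTEMPTS = 5     # the hypervisor tries up to 5 restarts, as described above
POLL_INTERVAL_SECONDS = 300  # compute nodes are polled at a 5-minute interval

def reconcile_instances(instances, start_instance):
    # Runs once per polling interval.
    for inst in instances:
        if not inst["should_be_running"]:
            continue                      # was stopped before the outage: leave it down
        if inst["state"] == "running":
            inst["restart_attempts"] = 0  # healthy again, reset the counter
            continue
        if inst["restart_attempts"] < MAX_RESTART_ATTEMPTS:
            start_instance(inst["id"])    # issue the start command again
            inst["restart_attempts"] += 1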
A compute node is considered failing when it has been disconnected from the data network or has been in a powered-off state for about 5 minutes. This 5-minute timeout corresponds to two unsuccessful polling attempts, and is the threshold for placing the compute node in FAIL state and its agent in EVACUATING state. This condition is required before reboot migration can start.
Reboot migration means that all compute instances from the failing compute node are
stopped and restarted on another compute node. When migration is complete, the
failing compute node's agent indicates that instances have been evacuated. If the
compute node eventually reboots successfully, it must go through a cleanup process
that removes all stale instance configurations and associated virtual disks. After
cleanup, the compute node can host compute instances again.
During the entire reboot migration, the instances remain in "moving" configuration
state. After migration is completed, the instance configuration state is changed to
"running". Instances that were stopped before the failure aren't migrated, because
they're not associated with any compute node.
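Taken together, failure detection and reboot migration can be summarized in the simplified sequence below. The node and agent states follow the description above, but the data structures and the restart_on_other_node helper are assumptions made for the example.

def handle_node_failure(node, running_instances, restart_on_other_node):
    # The node missed two polls (about 5 minutes) and is marked as failing.
    node["state"] = "FAIL"
    node["agent_state"] = "EVACUATING"

    # All instances from the failing node are stopped and restarted elsewhere;
    # they stay in "moving" configuration state for the whole migration.
    for inst in running_instances:
        inst["config_state"] = "moving"
        restart_on_other_node(inst)
    for inst in running_instances:
        inst["config_state"] = "running"  # migration is complete

    # The agent now indicates that the instances have been evacuated. If the node
    # later reboots successfully, stale instance configurations and their virtual
    # disks are cleaned up before it can host compute instances again.
    node["needs_cleanup"] = True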
Continuity of Service
Compute Cloud@Customer offers several features that further enhance high
availability. Health monitoring at all levels of the system is a key factor.
Diagnostic and performance data is collected from all components, then centrally
stored and processed, and made available to Oracle personnel.
To mitigate data loss and support the recovery of system and services configuration
in case of failure, consistent and complete backups are made regularly.
Optionally, workloads deployed on Compute Cloud@Customer can be protected
against downtime and data loss through the implementation of disaster recovery. To
achieve this, two Compute Cloud@Customer infrastructures need to be set
up at different sites, and configured to be each other's replica. Resources under
disaster recovery control are stored separately on the ZFS Storage Appliances in
each system, and replicated between the two. When an incident occurs at one site,
the environment is brought up on the replica system with minimal downtime. We
recommend implementing disaster recovery for all critical production systems.