Patching Failures on Oracle Exadata Database Service on
Cloud@Customer Systems
Patching operations can fail for various reasons. Typically, an operation fails
because a database node is down, there is insufficient space on the file
system, or the virtual machine cannot access the object store.
Determining the Problem
In the Console, you can identify a failed patching operation by viewing the patch history of an Oracle Exadata Database Service on
Cloud@Customer system or an individual database.
A patch that was not successfully applied displays a status of
Failed and includes a brief description of the
error that caused the failure. If the error message does not contain enough
information to point you to a solution, you can use the database CLI and log
files to gather more data. Then, refer to the applicable section in this
topic for a solution.
Troubleshooting and Diagnosis
Diagnose the most common issues that can occur during the patching process of any of the Oracle Exadata Database Service on Cloud@Customer components.
One or more of the following conditions on the database server VM can cause patching operations to fail.
Database Server VM Connectivity Problems
Cloud tooling relies on proper networking and connectivity configuration between the virtual machines of a given VM cluster. If the configuration is not set up correctly, any operation that requires cross-node processing can fail; for example, a virtual machine might be unable to download the files required to apply a patch.
In that case, you can perform the following actions:
- Verify that your DNS configuration is correct so that the relevant virtual machine addresses are resolvable within the VM cluster (see the example after this list).
- Refer to the relevant Cloud Tooling logs as instructed in the Obtaining Further Assistance section and contact Oracle Support for further assistance.
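For example, to confirm that another node in the cluster resolves, you might run the following from one of the virtual machines (the hostname shown is hypothetical; substitute a node name from your own cluster):

nslookup nodename2.example.com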
One or more of the following conditions on Oracle Grid Infrastructure can
cause patching operations to fail.
Oracle Grid Infrastructure is Down
Oracle Clusterware enables servers to communicate with each other so that they can function as a collective unit. Oracle Clusterware must be up and running on the VM cluster for patching operations to complete. Occasionally you might need to restart Oracle Clusterware to resolve a patching failure.
In such cases, verify the status of the Oracle Grid Infrastructure as
follows:
./crsctl check cluster
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
If Oracle Grid Infrastructure is down, then restart it by running the following commands:
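For example, as the root user from the Grid_home/bin directory (a sketch; the Grid home path varies by environment):

./crsctl start cluster -all
./crsctl check cluster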
An improper database state can lead to patching failures.
Oracle Database is Down
The database must be active and running on all the active nodes so the patching
operations can be completed successfully across the cluster.
Use the following command to check the state of your database, and ensure that any
problems that might have put the database in an improper state are
resolved:
srvctl status database -d db_unique_name -verbose
The system returns a message including the database instance status. The instance
status must be Open for the patching operation to succeed.
If the database is not running, use the following command to start
it:
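srvctl start database -d db_unique_name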
Obtaining Further Assistance
If you were unable to resolve the problem using the information in this topic, follow the procedures below to collect relevant database and diagnostic information. After you have collected this information, contact Oracle Support.
Collecting Cloud Tooling Logs
Use the relevant log files to assist Oracle Support in investigating and resolving a given issue.
To collect the relevant Oracle diagnostic information and logs, run the dbaascli
diag collect command.
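For example, a minimal invocation, run as the root user on the affected node (additional options are available; see the reference below):

dbaascli diag collect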
For more information about the usage of this utility, see DBAAS Tooling: Using dbaascli to Collect Cloud Tooling Logs and Perform a Cloud
Tooling Health Check.
VM Operating System Update Hangs During Database Connection Drain
Description: This is an intermittent issue. During a virtual machine operating system update with 19c Grid Infrastructure and running databases, dbnodeupdate.sh waits for RHPhelper to drain the connections, which does not progress because of a known bug "DBNODEUPDATE.SH HANGS IN RHPHELPER TO DRAIN SESSIONS AND SHUTDOWN INSTANCE".
Symptoms: There are two possible outcomes due to this bug:
- VM operating system update hangs in rhphelper:
  - Hangs the automation.
  - Some or none of the database connections will have drained, and some or all of the database instances will remain running.
- VM operating system update does not drain database connections because rhphelper crashed:
  - Does not hang the automation.
  - Some or none of the database connection draining completes.
/var/log/cellos/dbnodeupdate.trc will show this as the last line:
(ACTION:) Executing RHPhelper to drain sessions and shutdown instances.
(trace:/u01/app/grid/crsdata/scaqak04dv0201/rhp//executeRHPDrain.150721125206.trc)
Action:
Upgrade the Grid Infrastructure version to 19.11 or above.
(OR)
Disable rhphelper before updating and enable it back after updating. If you disable rhphelper, then there is no database connection draining before database services and instances are shut down on a node for the operating system update.
If you missed disabling RHPhelper and the update is hung because the draining of services is taking a long time, then do the following:
1. Inspect the /var/log/cellos/dbnodeupdate.trc trace file, which contains a paragraph similar to the following:
(ACTION:) Executing RHPhelper to drain sessions and shutdown instances.
(trace: /u01/app/grid/crsdata/<nodename>/rhp//executeRHPDrain.150721125206.trc)
2. Open the /var/log/cellos/dbnodeupdate.trc trace file.
If rhphelper failed, then the trace file contains the following message:
"Failed execution of RHPhelper"
If rhphelper hung, then the trace file contains the following message:
(ACTION:) Executing RHPhelper to drain sessions and shutdown instances.
3. Identify the rhphelper processes running at the operating system level and kill them. There are two processes whose names contain the string "rhphelper": a Bash shell, and the underlying Java program, which is really rhphelper executing. rhphelper runs as root, so the processes must be killed as root (sudo from opc), as shown in the sketch below.
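For example (an illustrative sketch using standard Linux commands; the process IDs reported by ps will differ in your environment):

ps -ef | grep rhphelper
sudo kill -9 <bash_shell_pid> <java_program_pid>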
Adding a VM to a VM Cluster Fails
Description: When adding a VM to a VM cluster, you might encounter the following issue:
[FATAL] [INS-32156] Installer has detected that there are non-readable files in oracle home.
CAUSE: Following files are non-readable, due to insufficient permission oracle.ahf/data/scaqak03dv0104/diag/tfa/tfactl/user_root/tfa_client.trc
ACTION: Ensure the above files are readable by grid.
Cause: The installer has detected a non-readable trace file, oracle.ahf/data/scaqak03dv0104/diag/tfa/tfactl/user_root/tfa_client.trc, created by Autonomous Health Framework (AHF) in the Oracle home, which causes adding a cluster VM to fail. AHF, running as root, created the trc file with root ownership, which the grid user is not able to read.
Action: Ensure that the AHF trace files are readable by the
grid user before you add VMs to a VM cluster. To fix the permission
issue, run the following commands as root on all the existing VM
cluster
VMs:
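For example, a sketch of such a fix (the AHF data directory under the Grid home is inferred from the error message above; verify the path, and whether adjusting permissions or ownership is more appropriate, for your environment):

chmod -R a+r <Grid_home>/oracle.ahf/data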
Nodelist is not Updated for Data Guard-Enabled Databases
Description: Adding a VM to a VM cluster completes successfully,
however, for Data Guard-enabled databases, the new VM is not added to the nodelist in
the /var/opt/oracle/creg/<db>.ini file.
Cause: Data Guard-enabled databases are not extended to the newly added VM. Therefore, the <db>.ini file is also not updated, because the database instance is not configured on the new VM.
Action: To add an instance to the primary and standby databases and to the new VMs (non-Data Guard), and to remove an instance from a Data Guard environment, see My Oracle Support note 2811352.1.
CPU Offline Scaling Fails
Description: CPU offline scaling fails with the following error:
** CPU Scale Update **An error occurred during module execution. Please refer to the log file for more information
Cause: After provisioning a VM cluster, the /var/opt/oracle/cprops/cprops.ini file, which is automatically generated by the database as a service (DBaaS), is not updated with the common_dcs_agent_bindHost and common_dcs_agent_port parameters, and this causes CPU offline scaling to fail.
Action: As the root user, manually add the following
entries in the /var/opt/oracle/cprops/cprops.ini
file.
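The entries take the following form (the values shown are placeholders; substitute the bind host IP address and port that the DBaaS agent uses in your environment):

common_dcs_agent_bindHost=<agent_bind_host_IP>
common_dcs_agent_port=<agent_port>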
Using Custom SCAN Listener Port With Data Guard On Disaster Recovery Network Causes Data Guard Association Verification Failures
Description: If the SCAN listener ports for the client network and the disaster recovery network (DR network) are different, then Data Guard (DG) configuration fails during the verification phase of creating the Data Guard association.
Action: Use the same SCAN listener ports (or default port) on all
networks. To fix the listener port after the cluster has been configured, run the
GI home/bin/srvctl modify listener
-listener listener_name -endpoints endpoints command. For more information, see
srvctl modify listener in the
Oracle Real Application Clusters
Administration and Deployment Guide.
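For example (an illustrative invocation; the Grid home path, listener name, and port are hypothetical):

/u01/app/19.0.0.0/grid/bin/srvctl modify listener -listener LISTENER -endpoints "TCP:1521"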
PDB Creation Fails After Moving Database to a New DB Home (23ai)
Description: After relocating a database to a different DB Home, creating a new Pluggable Database (PDB) fails. The PDB service fails to start, resulting in the following error:
[FATAL] [DBAAS-60022] Command '/u02/app/oracle/product/23.0.0.0/dbhome_3/bin/srvctl 'start' 'service' '-db' 'db23ano' '-service' 'db23ano_PDBJULY242.paas.oracle.com'' has failed on nodes [localnode].
Action: If the Grid Infrastructure version is 23.4.0.24.05, upgrade to version 23.5.0.24.07 to resolve this issue.