Planning and Understanding ODH Clusters
Before creating Big Data Service clusters, plan the cluster layout and make sure you understand instance types, shapes, and cluster profiles.
Planning the Cluster Layout, Shape, and Storage
Before you start the process to create a cluster, you must plan the layout of the cluster, the node shapes, and storage.
Nodes and services are organized differently on clusters, based on whether the cluster is highly available (HA) and secure, or not.
About using HA clusters
Use HA clusters for production environments. They're required for resiliency and to minimize downtime.
In this release, a cluster must be both HA and secure, or neither.
Types of nodes
The types of nodes are as follows:
- Master or utility nodes include the services required for the operation and management of the cluster. These nodes don't store or process data.
- Worker nodes store and process data. The loss of a worker node doesn't affect the operation of the cluster, although it can affect performance.
- Compute only worker nodes process data. The loss of a compute only worker node doesn't affect the operation of the cluster, although it can affect performance.
  Note: Compute only worker nodes aren't supported for CDH clusters.
- Edge nodes extend the cluster and have only clients installed. You can install additional packages and run additional applications on these nodes instead of on worker, compute only worker, or master nodes, to avoid classpath conflicts and resource issues with cluster services.
High availability (HA) cluster layout
A high availability cluster has two master nodes, two utility nodes, three or more worker nodes, and zero or more compute only worker nodes.
Type of node | Services on ODH | Services on CDH |
---|---|---|
First master node | | |
Second master node | | |
First utility node | | |
Second utility node | | |
Worker nodes (3 minimum) | | |
Compute only worker nodes | | NA |
Edge nodes | | NA |
Minimal (non-HA) cluster layout
A non-HA cluster has one master node, one utility node, three or more worker nodes, and zero or more compute only worker nodes.
Type of node | Services on ODH | Services on CDH |
---|---|---|
Master node | | |
Utility node | | |
Worker nodes | | |
Compute only worker nodes | | NA |
Edge nodes | | NA |
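The node counts given for the HA and non-HA layouts can be checked before you request a cluster. The following is a minimal sketch, not part of any Big Data Service API: the `ClusterLayout` type and the `layout_problems` function are hypothetical names introduced here, and the rules encode only the counts stated above (two master and two utility nodes for HA, one of each otherwise, at least three worker nodes, zero or more compute only worker nodes).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterLayout:
    """Planned node counts for a cluster (hypothetical helper type)."""
    master: int
    utility: int
    worker: int
    compute_only_worker: int = 0
    is_ha: bool = False


def layout_problems(layout: ClusterLayout) -> list[str]:
    """Return the rule violations found; an empty list means the counts fit."""
    # HA layouts use two master and two utility nodes, non-HA layouts one of each.
    expected = 2 if layout.is_ha else 1
    problems = []
    if layout.master != expected:
        problems.append(f"expected {expected} master node(s), got {layout.master}")
    if layout.utility != expected:
        problems.append(f"expected {expected} utility node(s), got {layout.utility}")
    if layout.worker < 3:
        problems.append(f"at least 3 worker nodes are required, got {layout.worker}")
    if layout.compute_only_worker < 0:
        problems.append("compute only worker count can't be negative")
    return problems


# Example: an HA plan that is one worker node short.
print(layout_problems(ClusterLayout(master=2, utility=2, worker=2, is_ha=True)))
```

Edge nodes are left out of the check because the layout descriptions don't state a required count for them.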
The node shape describes the compute resources allocated to the node.
The shapes used for master/utility nodes and worker nodes can be different. But all master/utility nodes must be of the same shape and all worker nodes must be of the same shape.
The following table shows which shapes can be used for the different node types. For a list of the resources provided by each shape, see Compute Shapes.
Node Type | Available Shapes | Required Number of Virtual Network Interface Cards (VNICs) |
---|---|---|
Master or utility | VM.Standard2.4, VM.Standard2.8, VM.Standard2.16, VM.Standard2.24, VM.Standard.E5.Flex, VM.Standard.E4.Flex*, VM.Standard3.Flex*, VM.Optimized3.Flex*, VM.DenseIO.E4.Flex*, VM.DenseIO.E5.Flex*, VM.DenseIO2.8, VM.DenseIO2.16, VM.DenseIO2.24, BM.Standard2.52, BM.DenseIO2.52, BM.HPC2.36, BM.Standard3.64*, BM.Optimized3.36*, BM.DenseIO.E4.128*, BM.Standard.E4.128* | 3 minimum. Used for the cluster subnet, the DP access subnet, and the customer's subnet. *You must specify a minimum of 3 OCPU and 32 GB memory. |
Worker | VM.Standard2.1*, VM.Standard2.2*, VM.Standard2.4, VM.Standard2.8, VM.Standard2.16, VM.Standard2.24, VM.Standard.E5.Flex, VM.Standard.E4.Flex*, VM.Standard3.Flex*, VM.Optimized3.Flex*, VM.DenseIO.E4.Flex*, VM.DenseIO.E5.Flex*, VM.DenseIO2.8, VM.DenseIO2.16, VM.DenseIO2.24, BM.Standard2.52, BM.DenseIO2.52, BM.HPC2.36, BM.Standard3.64*, BM.Optimized3.36*, BM.DenseIO.E4.128*, BM.Standard.E4.128* | 2 minimum. Used for the cluster subnet and the customer's subnet. |
Compute only worker | VM.Standard2.1*, VM.Standard2.2*, VM.Standard2.4, VM.Standard2.8, VM.Standard2.16, VM.Standard2.24, VM.Standard.E5.Flex, VM.Standard.E4.Flex*, VM.Standard3.Flex*, VM.Optimized3.Flex*, VM.DenseIO.E4.Flex*, VM.DenseIO.E5.Flex*, VM.DenseIO2.8, VM.DenseIO2.16, VM.DenseIO2.24, BM.Standard2.52, BM.DenseIO2.52, BM.HPC2.36, BM.Standard3.64*, BM.Optimized3.36*, BM.DenseIO.E4.128*, BM.Standard.E4.128* | 2 minimum. Used for the cluster subnet and the customer's subnet. Compute only worker nodes aren't supported for CDH clusters. |
Edge | VM.Standard2.1*, VM.Standard2.2*, VM.Standard2.4, VM.Standard2.8, VM.Standard2.16, VM.Standard2.24, VM.Standard.E5.Flex, VM.Standard.E4.Flex*, VM.Standard3.Flex*, VM.Optimized3.Flex*, VM.DenseIO.E4.Flex*, VM.DenseIO.E5.Flex*, VM.DenseIO2.8, VM.DenseIO2.16, VM.DenseIO2.24, BM.Standard2.52, BM.DenseIO2.52, BM.HPC2.36, BM.Standard3.64*, BM.Optimized3.36*, BM.DenseIO.E4.128*, BM.Standard.E4.128* | 2 minimum. Used for the cluster subnet and the customer's subnet. Note: Because the edge node is specific to client application use cases, choose the shape that the application requires. Edge nodes aren't supported for CDH clusters. |
* Be aware that VM.Standard2.1 and VM.Standard2.2 are small shapes and won't support running large workloads. For VM.Standard.E4.Flex, VM.Standard3.Flex, VM.Standard.E5.Flex, and VM.Optimized3.Flex, you must specify a minimum of 1 OCPU and 16 GB memory.
Note: The following shapes aren't supported for CDH clusters. They're supported for ODH clusters only.
- VM.Standard.E4.Flex
- VM.Standard.E5.Flex
- VM.Standard3.Flex
- VM.Optimized3.Flex
- VM.DenseIO.E4.Flex
- VM.DenseIO.E5.Flex
- BM.Standard3.64
- BM.Optimized3.36
- BM.DenseIO.E4.128
Not all shapes are available by default. To see what shapes are available by default through the Cloud Console, see Finding Tenancy Limits. To submit a request to increase the service limits, see Requesting a Service Limit Increase.
Nodes based on standard VM shapes use network-attached block storage.
Block storage isn't supported for nodes based on DenseIO and HPC shapes.
All nodes have a boot volume of 150 GB.
Option | Limits/Guidelines |
---|---|
Minimum initial block storage | 150 GB |
Default initial block storage * | 150 GB |
Minimum additional block storage | 150 GB |
Default additional block storage * | 1 TB |
Incremental step for (initial and additional) block storage | 50 GB |
Maximum block storage for a single node | 48 TB. The 48 TB total results from 12 volumes of 4 TB each. If you add block storage multiple times, the maximum remains 48 TB, but it might be spread across more than 12 volumes. |
Maximum block volume size | 4 TB. If you specify the maximum 48 TB, 12 drives of 4 TB each are created. If you specify a lower number, enough 4 TB devices for that amount are created, and more devices are created as you add more storage. |
You can't add more block storage to master or utility nodes. Therefore, the following figures show initial sizes only.
Option | Limits/Guidelines |
---|---|
Minimum initial block storage | 150 GB |
Default initial block storage | 1 TB |
Minimum additional block storage | 150 GB |
Default additional block storage | 1 TB |
Incremental step for (initial and additional) block storage | 50 GB |
Maximum block storage for a single node | 32 TB |
Maximum block volume size | 32 TB |
MySQL placement | For utility nodes move /var/lib/mysql to /u01 and create a symbolic link. This prevents filling up the boot volume. |
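The MySQL placement guideline (move the data directory off the boot volume and leave a symbolic link behind) can be sketched as follows. This is an illustration under stated assumptions, not an official procedure: the `relocate_with_symlink` function is a hypothetical helper, and it assumes that the MySQL service has been stopped first and that the script runs with sufficient privileges. Neither assumption comes from the guideline itself.

```python
import os
import shutil
from pathlib import Path


def relocate_with_symlink(src: Path, dest_parent: Path) -> Path:
    """Move directory `src` under `dest_parent` and leave a symlink at `src`.

    Returns the new location of the data. Hypothetical helper for illustration.
    """
    if src.is_symlink():
        # Already relocated on an earlier run; report where the data lives.
        return Path(os.path.realpath(src))
    dest = dest_parent / src.name
    if dest.exists():
        raise FileExistsError(f"{dest} already exists; refusing to overwrite it")
    dest_parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(src), str(dest))                  # move the data directory
    os.symlink(dest, src, target_is_directory=True)   # old path now points to the new one
    return dest


# On a utility node the call would look like this (assumption: MySQL is stopped):
# relocate_with_symlink(Path("/var/lib/mysql"), Path("/u01"))
```

After the move, anything that still refers to /var/lib/mysql follows the link to /u01/mysql, so the data no longer consumes boot volume space. File ownership and any security labels on the moved directory might still need to be checked before restarting the service.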
Option | Guidelines |
---|---|
Default initial block storage | 2 TB |
Minimum initial block storage | 150 GB |
Query server storage is used for temporary table space to perform heavy JOIN and GROUP BY operations. 2 TB is recommended for typical processing. For small environments, for example development, this number can be adjusted down.
For best performance, consider these factors:
- I/O throughput
- Networking between the compute device and block storage device.
See Block Volume Performance in the Oracle Cloud Infrastructure documentation.
The following table describes how Big Data Service allocates block volume storage for nodes of different sizes.
What | Amount |
---|---|
Initial volume allocation for master nodes and utility nodes | 1 large volume |
Volume allocation for additional block storage for master nodes and utility nodes | 1 large volume |
Initial volume allocation for worker nodes | |
Volume allocation for additional block storage for worker nodes | The minimum number of volumes that can accommodate the storage size, with a maximum volume size of 4 TB per volume. (The last volume might be smaller than 4 TB.) |
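The allocation rule for additional worker storage (the fewest volumes that fit the requested size, each at most 4 TB, with a possibly smaller last volume) is simple arithmetic. The sketch below shows one way to compute it. The `additional_volume_sizes` function is a hypothetical name, and the conversion of 1 TB to 1024 GB is an assumption, because the tables don't say which convention the service uses.

```python
# Assumption: 1 TB is treated as 1024 GB. Adjust if the service uses 1000 GB.
MAX_VOLUME_GB = 4 * 1024


def additional_volume_sizes(size_gb: int) -> list[int]:
    """Split a requested storage size into the fewest volumes of at most 4 TB."""
    if size_gb <= 0:
        raise ValueError("storage size must be positive")
    full_volumes, remainder = divmod(size_gb, MAX_VOLUME_GB)
    sizes = [MAX_VOLUME_GB] * full_volumes
    if remainder:
        # The last volume holds whatever is left and can be smaller than 4 TB.
        sizes.append(remainder)
    return sizes


# Example: 10 TB of additional storage becomes two 4 TB volumes and one 2 TB volume.
print(additional_volume_sizes(10 * 1024))
```

With these assumptions, a request for the 48 TB maximum yields 12 volumes of 4 TB each, which matches the limit described for worker nodes.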
We recommend that you use edge nodes for staging.
Understanding Instance Types and Shapes
Big Data Service cluster nodes run in Oracle Cloud Infrastructure compute instances (servers).
When you create a cluster, you choose an instance type, which determines whether the instance runs directly on bare metal hardware or in a virtualized environment. You also choose a shape, which determines the resources assigned to the instance.
- Bare metal: A bare metal compute instance uses a dedicated physical server for the node, for highest performance and strongest isolation.
- Virtual Machine (VM): Through virtualization, a VM compute instance can host multiple, isolated nodes that run on a single physical bare metal machine. VM instances are less expensive than bare metal instances and are useful for building less demanding clusters that don't require the performance and resources (CPU, memory, network bandwidth, storage) of an entire physical machine for each node.
VM instances run on the same hardware as bare metal instances, with the same firmware, software stack, and networking infrastructure.
For more information about compute instances, see Overview of Compute.
The shape determines the number of CPUs, amount of memory, and other resources allocated to the compute instance hosting the cluster node. See Planning the Cluster Layout, Shape, and Storage in the Oracle Cloud Infrastructure documentation for the available shapes.
The shapes of the Big Data Service master nodes and the worker nodes don't have to match. But the shapes of all the master nodes must match each other and the shapes of all the worker nodes must match each other.
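The shape-matching rule can be checked against a planned node list before the cluster is created. The following sketch is illustrative only: `mixed_shape_groups` and `GROUP_OF` are hypothetical names, and treating master and utility nodes as one group follows the statement elsewhere in this topic that all master/utility nodes must be of the same shape.

```python
from collections import defaultdict

# Master and utility nodes are checked together; worker nodes form their own group.
# This grouping is an interpretation of the rule, not an official API.
GROUP_OF = {
    "master": "master/utility",
    "utility": "master/utility",
    "worker": "worker",
}


def mixed_shape_groups(nodes: list[tuple[str, str]]) -> dict[str, set[str]]:
    """Return each group that uses more than one shape, with the shapes found.

    `nodes` is a list of (node_type, shape) pairs. An empty result means the
    plan satisfies the rule.
    """
    shapes_by_group = defaultdict(set)
    for node_type, shape in nodes:
        group = GROUP_OF.get(node_type)
        if group is not None:  # node types outside the rule are ignored
            shapes_by_group[group].add(shape)
    return {group: shapes for group, shapes in shapes_by_group.items() if len(shapes) > 1}


plan = [
    ("master", "VM.Standard2.4"),
    ("utility", "VM.Standard2.4"),
    ("worker", "VM.Standard2.8"),
    ("worker", "VM.Standard2.16"),  # differs from the other worker node
]
print(mixed_shape_groups(plan))
```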
Understanding Cluster Profiles
Cluster profiles enable you to create optimal clusters for a specific workload or technology. After creating a cluster with a specific cluster profile, more Hadoop services can be added to the cluster.
Cluster Profile Types
Oracle Big Data Service enables you to create clusters for numerous cluster profile types.
Cluster profile | Components (secure and highly available) | Components (not secure or highly available) |
---|---|---|
HADOOP_EXTENDED [1] | Hive, Spark, HDFS, Yarn, ZooKeeper, MapReduce2, Ambari Metrics, Ranger, Hue, Oozie, Tez | Hive, Spark, HDFS, Yarn, ZooKeeper, MapReduce2, Ambari Metrics, Hue, Oozie, Tez |
HADOOP | HDFS, Yarn, ZooKeeper, MapReduce2, Ambari Metrics, Ranger, Hue | HDFS, Yarn, ZooKeeper, MapReduce2, Ambari Metrics, Hue |
HIVE | Hive, HDFS, Yarn, ZooKeeper, MapReduce2, Ambari Metrics, Ranger, Hue, Tez | Hive, HDFS, Yarn, ZooKeeper, MapReduce2, Ambari Metrics, Hue, Tez |
SPARK | Spark, Hive [2], HDFS, Yarn, ZooKeeper, MapReduce2, Ambari Metrics, Ranger, Hue | Spark, Hive [2], HDFS, Yarn, ZooKeeper, MapReduce2, Ambari Metrics, Hue |
HBASE | HBase, HDFS, Yarn, ZooKeeper, MapReduce2, Ambari Metrics, Ranger, Hue | HBase, HDFS, Yarn, ZooKeeper, MapReduce2, Ambari Metrics, Hue |
TRINO | Trino, Hive [3], HDFS, ZooKeeper, Ambari Metrics, Ranger, Hue | Trino, Hive [3], HDFS, ZooKeeper, Ambari Metrics, Hue |
KAFKA | Kafka Broker, HDFS, ZooKeeper, Ambari Metrics, Ranger, Hue | Kafka Broker, HDFS, ZooKeeper, Ambari Metrics, Hue |
[1] HADOOP_EXTENDED consists of the components that were included in clusters created before cluster profiles were available.
[2] The Hive metastore component from the Hive service is used for managing the metadata in Spark.
[3] The Hive metastore component from the Hive service is used for managing the Hive metadata entities in Trino.
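In the profile table, the two component columns differ only in that the secure and highly available variant adds Ranger. The sketch below encodes that observation as a lookup. The names `BASE_COMPONENTS` and `components_for` are hypothetical, and the data is copied from the table rather than read from any service API.

```python
# Components of each profile without Ranger, as listed in the table.
BASE_COMPONENTS = {
    "HADOOP_EXTENDED": ["Hive", "Spark", "HDFS", "Yarn", "ZooKeeper", "MapReduce2",
                        "Ambari Metrics", "Hue", "Oozie", "Tez"],
    "HADOOP": ["HDFS", "Yarn", "ZooKeeper", "MapReduce2", "Ambari Metrics", "Hue"],
    "HIVE": ["Hive", "HDFS", "Yarn", "ZooKeeper", "MapReduce2", "Ambari Metrics",
             "Hue", "Tez"],
    "SPARK": ["Spark", "Hive", "HDFS", "Yarn", "ZooKeeper", "MapReduce2",
              "Ambari Metrics", "Hue"],
    "HBASE": ["HBase", "HDFS", "Yarn", "ZooKeeper", "MapReduce2", "Ambari Metrics", "Hue"],
    "TRINO": ["Trino", "Hive", "HDFS", "ZooKeeper", "Ambari Metrics", "Hue"],
    "KAFKA": ["Kafka Broker", "HDFS", "ZooKeeper", "Ambari Metrics", "Hue"],
}


def components_for(profile: str, secure_ha: bool) -> list[str]:
    """Return the components listed for a profile, adding Ranger when secure and HA."""
    components = list(BASE_COMPONENTS[profile])
    if secure_ha:
        # In the table, Ranger always appears directly after Ambari Metrics.
        components.insert(components.index("Ambari Metrics") + 1, "Ranger")
    return components


print(components_for("KAFKA", secure_ha=True))
```

A lookup like this can help when comparing profiles, for example to confirm that a planned workload needs a service that a smaller profile doesn't include by default.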
Apache Hadoop Versions in Cluster Profiles
The following tables list the Hadoop component versions included in each cluster profile, by ODH version.
ODH 1.x
Cluster profile | Version |
---|---|
HADOOP_EXTENDED | HDFS 3.1, Hive 3.1, Spark 3.0.2 |
HADOOP | HDFS 3.1 |
HIVE | Hive 3.1 |
SPARK | Spark 3.0.2 |
HBASE | HBase 2.2 |
TRINO | Trino 360 |
KAFKA | Kafka 2.1.0 |
ODH 2.x
Cluster profile | Version |
---|---|
HADOOP_EXTENDED | HDFS 3.3, Hive 3.1, Spark 3.2 |
HADOOP | HDFS 3.3 |
HIVE | Hive 3.1 |
SPARK | Spark 3.2 |
HBASE | HBase 2.2 |
TRINO | Trino 389 |