Using Apache Trino

Apache Trino lets you run adhoc and ETL queries over multiple big data file systems. By default, Trino is installed in the first master node of the cluster and has system, tpch, and hive connectors installed.

To run the CRUD operation in the Hive metastore, you must add the HDFS policy with resource path as /tmp,/warehouse and permission as R/W/E for the Trino user.

Performance Tuning

Trino is a distributed SQL query engine that's designed to handle large volumes of data across many data sources, including object storage in cloud environments. However, the performance of Trino can be affected by various factors, such as the size and complexity of the data, the available network bandwidth, the cluster configuration, and the query patterns. Therefore, it's important to perform performance tuning to optimize the performance and scalability of Trino for customer specific use cases.

Recommendations While Reading Datasets from Object Storage

fs.oci.io.read.ahead.blocksize = 6291456: This parameter sets the block size (in bytes) used for read-ahead buffering in object storage reads. By default, the block size is set to 262144 (256 KB), but setting it to a larger value such as 6291456 (6 MB) can help improve read performance for large files, especially for sequential reads.
fs.oci.io.read.ahead = true: This parameter controls the read-ahead buffering mechanism for object storage reads in Trino. When read-ahead buffering is enabled, Trino proactively reads additional data beyond the current read request, in anticipation of future read requests. This can help to reduce the impact of network latency and improve the performance of read-heavy workloads.
fs.oci.caching.object.parquet = true: This parameter enables caching of Parquet objects in the local file system to improve query performance when working with Parquet files. When set to true, the OCI storage client caches Parquet objects in the local file system for subsequent reads.
fs.oci.caching.filesystem.enabled=true: This parameter enables caching of data from OCI object storage in the HDFS cache. This can improve performance by reducing the amount of data that needs to be retrieved from object storage, and by caching frequently accessed data.

General Recommendations

Note

Set the following configurations in the Ambari UI.

Access Apache Ambari.
From the side toolbar, under Services select Trino.
Select Configs, and then update/add the configurations properties.


Parameter	Description	Where to s]Set	Recommended Value
`Heap (-Xmx)`	This parameter specifies the maximum amount of heap memory that the Trino worker can use. We recommend you set this parameter to 80% of the physical memory available on the machine.	`jvm.config`	We recommend to setting this parameter to 75-80% of the physical memory available on the machine. Note: If other processes are running on the same node where Trino worker is running, leave some amount of physical memory for those processes. In such cases, set the value of the `-Xmx` parameter to a lower value than 80% of the available physical memory. We recommend you leave at least 10 to 20% of physical memory for other processes, depending on the amount of memory they require. For example, if other processes on the same node require 4 GB of memory, and the total physical memory available on the node is 16 GB, then set the `-Xmx` parameter to 75 to 80% of 12 GB (for example, 7.2 ti 8.4 GB).
`query.max-memory-per-node`	The memory of a single worker node.	`config.properties`	Set this parameter to a value that's less than or equal to the difference between the JVM memory and the value of `memory.heap-headroom-per-node`. Effective Memory = Total JVM Heap - Heap Headroom For example, if there's a concurrent user scenario then set the parameter as Effective Memory / Concurrent Queries.
`query.max-memory`	The memory of all nodes in a cluster.	`config.properties`	Set this parameter to a value that's calculated based on the following formula: Value of `query.max-memory-per-node` × Number of worker nodes.
`query.max-total-memory`	The total amount of memory consumed by a cluster, which must not be less than the value of `query.max-memory`.	`config.properties`	If the concurrency isn't high, set this parameter to the same value as the `query.max-memory` parameter. If the concurrency is high, the value of the `query.max-total-memory` parameter can't exceed the maximum memory capacity of the cluster, which is equal to 70% of JVM memory multiplied by the number of worker nodes. Proportionally decrease the values of `query.max-memory` and `query.max-memory-per-node` based on your requirements. You can decrease the value of `query.max-memory` to the half of `query.max-total-memory`. In this case, you must set the `query.max-memory-per-node` parameter to the value of `query.max-memory` divided by the number of worker nodes.
`memory.heap-headroom-per-node`	The reserved JVM heap memory.	`config.properties`	Effective Memory must be 15% of the JVM memory
`task.concurrency`	Default local concurrency for parallel operators, such as joins and aggregations. Adjust this value up or down based on the query concurrency and worker resource utilization. Lower values are better for clusters that run many queries concurrently, because the cluster is already utilized by all the running queries, so adding more concurrency results in slow downs because of context switching and other overhead. Higher values are better for clusters that only run one or a few queries at a time.	`config.properties`	80% of available CPU in a node/VM

Oracle Cloud Infrastructure Documentation

Using Apache Trino

Performance Tuning