Data Flow applications can be configured to access
data sources hosted within private networks, enabling secure connectivity without
exposing sensitive data to public networks.
Limiting that exposure reduces the risk of data breaches and
unauthorized access, a critical concern for industries handling sensitive information. For
example, organizations subject to regulations such as the General Data Protection Regulation
(GDPR) in the EU can ensure personal data remains protected within controlled environments,
minimizing the risk of noncompliance. Similarly, healthcare providers bound by the Health
Insurance Portability and Accountability Act (HIPAA) in the United States can securely process
protected health information (PHI) through private network configurations, safeguarding
patient confidentiality. This architecture also supports compliance with other regulatory
frameworks, such as SOC 2 for data security and privacy, enabling organizations to meet their
obligations while maintaining high-performance data processing.
Configuring a Data Flow application with private network access provides the following
capabilities:
Access to Private Oracle Cloud Infrastructure (OCI) Data Sources: Connect to OCI data sources accessible only within private
networks.
Integration with On-Premises Data Sources: Access on-premises data connected to an
OCI Virtual Cloud Network (VCN) through
Site-to-Site VPN or FastConnect.
Support for Oracle RAC Databases: Use the SCAN proxy functionality to access Oracle
RAC databases.
The following image depicts a simplified diagram of the Data Flow network configuration. Note that the
serverless applications run inside the Data Flow service
tenancy, so it helps to understand the network components shown in this diagram.
Like any other application running on a secure private network, the Data Flow application cluster connects to the Internet through a
NAT (network address translation) gateway and to the Oracle Services Network (OSN) through the
Service Gateway (SGW), which blocks any access from the Internet to the application cluster. A
Data Flow Private Access Gateway (Data Flow private endpoint) is created, letting a Data Flow application access OCI resources, such as ADB instances, that reside in a
private subnet, and customer on-premises resources connected to OCI with FastConnect or Site-to-Site VPN.
The following use cases show how the network settings and some extra
configuration allow secure access through private subnets:
Connect to an ADB Instance Configured with Private Endpoint Access only
This is the most common use case for Data Flow and
private endpoints.
Two things need consideration:
The ADB instance has a network Access type set to Virtual cloud
network, and a private subnet is selected.
Navigate to the
Autonomous Database details, and under
Network, copy the Private endpoint URL (FQDN), for example:
<podID>.adb.<region>.oraclecloud.com
Because the private endpoint FQDN already resolves within the VCN, it can be carried
forward to the Data Flow configuration. In the Data Flow private endpoint details:
Select
Edit, and under DNS zones to resolve, paste the FQDN of
the ADB instance copied in the previous step.
If the ADB network configuration restricts application access even further by using
Network Security Groups (NSGs) that allow only specific CIDR ranges, OCI services, or NSGs, ensure the same rules are reflected on
both the ADB side and the Data Flow side of the configuration.
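Before pasting the value into the DNS zones field, it can help to check that what was copied is really the private endpoint FQDN and not, say, a private IP address. The following is a minimal sketch, assuming the FQDN follows the `<podID>.adb.<region>.oraclecloud.com` shape shown above; the function name and the character set allowed for a pod ID are assumptions made for illustration.

```python
import re

# Pattern taken from the example shape <podID>.adb.<region>.oraclecloud.com.
# The exact character set of a pod ID is an assumption made for this sketch.
_ADB_FQDN = re.compile(
    r"^(?P<pod>[A-Za-z0-9-]+)\.adb\.(?P<region>[a-z0-9-]+)\.oraclecloud\.com$"
)


def parse_adb_private_fqdn(fqdn: str) -> tuple[str, str]:
    """Return (pod_id, region), or raise ValueError if the shape is unexpected."""
    candidate = fqdn.strip().rstrip(".")  # tolerate whitespace and a trailing dot
    match = _ADB_FQDN.match(candidate)
    if match is None:
        raise ValueError(f"not an ADB private endpoint FQDN: {fqdn!r}")
    return match.group("pod"), match.group("region")
```

For example, a hypothetical value such as `examplepod.adb.us-ashburn-1.oraclecloud.com` parses into its pod ID and region, while an IP address is rejected.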
Secure Access from Allowed IPs and VCNs
This is a variation of the Network settings in the ADB, where the option "Secure access
from allowed IPs and VCNs only" is selected, and an access control list (ACL) is attached to
the configuration. In this scenario, the Data Flow private
endpoint isn't necessary. The network traffic travels through the NAT gateway in the Data Flow service tenancy within the customer-allocated subnet
(tenant OKE cluster subnet in the previous image). For the documented list of IPs to allowlist
in the ACL, see the IP Address Allowlist.
Note
Data Flow private endpoints aren't necessary for
this configuration.
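To reason about whether a given source address would pass such an ACL, the check can be sketched with Python's standard `ipaddress` module. This is only an illustration of CIDR matching, not the database's own evaluation logic; the function name and the sample addresses (taken from the documentation-reserved ranges) are hypothetical.

```python
import ipaddress


def is_allowlisted(source_ip: str, acl_entries: list[str]) -> bool:
    """Return True if source_ip falls inside any ACL entry (single IP or CIDR)."""
    address = ipaddress.ip_address(source_ip)
    for entry in acl_entries:
        # strict=False lets an entry such as 203.0.113.7/24 be read as its network.
        network = ipaddress.ip_network(entry, strict=False)
        if address in network:
            return True
    return False


# Hypothetical ACL: one CIDR block and one single address.
acl = ["203.0.113.0/24", "198.51.100.7"]
print(is_allowlisted("203.0.113.45", acl))
print(is_allowlisted("198.51.100.8", acl))
```

In practice, the entries come from the documented IP Address Allowlist rather than from sample ranges.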
Move a Data Flow Application to Use Private Endpoints and Regional Restrictions on Oracle Cloud Infrastructure Object Storage
A Data Flow application running with the "Internet access"
type might need to access buckets in different regions. For example, an application running
in the IAD region can access objects stored in the PHX region, provided OCI
IAM authorization grants such access.
If this application is later switched to the "Private access" type, it loses access to another
region's public Object Storage service. The Service Gateway maintains communication with the
Object Storage service, as depicted in the previous image, and therefore, regional access is
enforced at domain name (DNS) resolution.
In general, the following Oracle services are limited by using the Data Flow private endpoints:
Object store buckets with segregated IAM access
policy.
Cross-Region access of OCI resources.
Direct use of IP addresses to access private resources in the customer tenancy.
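Before switching an application to private access, it can be useful to list which of its data sources live in another region. A minimal sketch follows, assuming you keep your own mapping from source name to region; the function name, bucket names, and mapping are hypothetical. The region identifiers `us-ashburn-1` and `us-phoenix-1` correspond to IAD and PHX.

```python
def cross_region_sources(run_region: str, source_regions: dict[str, str]) -> list[str]:
    """Return the names of sources whose region differs from the run's region."""
    return sorted(
        name
        for name, region in source_regions.items()
        if region.lower() != run_region.lower()
    )


# Hypothetical inventory of the buckets an application reads from.
sources = {"raw-bucket": "us-ashburn-1", "archive-bucket": "us-phoenix-1"}
print(cross_region_sources("us-ashburn-1", sources))
```

Any name this returns is a source the application would likely stop reaching once it runs with private access.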
Connect to On-Premises Resources
Integrating Data Flow with on-premises data sources
through private endpoints ensures secure and efficient data processing across hybrid
environments.
By using private network connectivity options such as Site-to-Site VPN or FastConnect, you
can connect Data Flow applications to data
repositories hosted in the on-premises infrastructure. This provides robust, low-latency
communication while maintaining strict security boundaries, and suits workloads that require
secure data access and processing across cloud and on-premises environments.
A key element of this setup is the DNS resolution, which now takes place within the Private
Access Gateway. The configuration provides a DNS name (fully qualified domain name, or FQDN)
for the private endpoints, not the private IP address itself. If you've configured your
network setup for DNS, your hosts can access the private endpoint using the FQDN. Data Flow supports network security groups (NSGs) with its resources. You
can request that the Data Flow service set up the private
endpoint in an NSG within your VCN. NSGs let you write security rules to control access to the private
endpoint without knowing the private IP address assigned to the private endpoint.
When connectivity between the OCI customer VCN
and the on-premises network has been established, for the Data Flow application to operate correctly, it's necessary to
associate the private IPs in the on-premises network with a private Domain resolver in the OCI customer VCN.
To illustrate this part, the following example uses two connected networks, as shown in the OCI Network Visualizer. The hdi-dataflow-VCN
private subnet is connected to a disdemo-DRG dynamic routing gateway attachment.
From this point onwards, the relevant part is to decide how the DNS resolution occurs. In
this example, we created two DNS resolvers:
One attached to hdi-dataflow-vcn standard view (industrial.com)
Another in a customer-private-view with the resolution for the domain oraclevcn.com, indicating this is a private subnet
in an OCI VCN.
In the "Private resolver details" of the selected VCN, select Manage private
views. Two options exist:
Protected private view, in our case, creating the hdi-dataflow-vcn private view
(Order 1).
Using Manage private views again, create the
customer-private-view (Order 2).
Navigating to the first private view hdi-dataflow-vcn, under the Private
Zones, we created the industrial.com zone with Primary
Zone type.
Similarly, navigate to the customer-private-view and verify that a Private zone is
already created within the Private subnet of the referred VCN.
Under the industrial.com zone in the Private view, select Manage
records to create a new record for the IP of an external MySQLServer running in
the customer on-premises network. For example:
Name: MySQL_Customer_OnPremises(.industrial.com)
Type: A - IPv4 address
TTL in seconds: 3600
RDATA/Answers
Address: 10.xx.xx.xx (the IP address of the on-premises resource).
Similarly, do the same for private access within another OCI private VCN that has been peered with the
current VCN, such as the one in the previous image, through the local peering gateway (LPG).
So, in the internal zone privatesubnet.dfarchitecture.oraclevcn.com, add a record for the
IP of another MySQL instance running in the private subnet of the peered VCN. For example:
Address: 12.xx.xx.xx (the IP address of the resource in the peered VCN).
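The record fields above map directly onto a conventional zone-file line. The following sketch models an A record and renders it in that notation, which can be handy for reviewing records before entering them in the Console. The class name and the sample address `10.0.10.25` are hypothetical, and the masked addresses in the examples are deliberately not filled in.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class ARecord:
    name: str      # host label, for example MySQL_Customer_OnPremises
    zone: str      # private zone, for example industrial.com
    address: str   # IPv4 address of the target resource
    ttl: int = 3600

    def __post_init__(self) -> None:
        # Reject anything that is not a well-formed IPv4 address.
        ipaddress.IPv4Address(self.address)
        if self.ttl <= 0:
            raise ValueError("TTL must be a positive number of seconds")

    @property
    def fqdn(self) -> str:
        return f"{self.name}.{self.zone}"

    def zone_file_line(self) -> str:
        """Render the record in conventional zone-file notation."""
        return f"{self.fqdn}. {self.ttl} IN A {self.address}"


# Hypothetical address standing in for the on-premises MySQL server.
record = ARecord("MySQL_Customer_OnPremises", "industrial.com", "10.0.10.25")
print(record.zone_file_line())
```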
Again, the DNS resolution is the key factor to be addressed when using Data Flow private endpoints. Resolve them using record types
appropriate to the downstream network. For these types and configurations, follow the Managing DNS zones documentation.
Test the Data Flow Private Endpoint
To test and verify that this configuration is working, the GitHub Dataflow Samples include code that tests the DNS
resolution in the first iteration and, if that succeeds, tries to establish connectivity with the
configured record in the second iteration. For more details, see README.
The output of a successful test shows in the application driver's log. For
example:
FQDN 'MySQL_Private_DB.industrial.com' resolved to IP '255.33.36.2'. Testing connectivity...
Success: Able to connect to MySQL_Private_DB.industrial.com (255.33.36.2) on port 3306.
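The two steps described above (resolve the name, then attempt a connection) can be sketched with the standard `socket` module. This is not the code from the samples repository; it is a simplified illustration, and the function name and message wording are assumptions.

```python
import socket


def check_private_endpoint(fqdn: str, port: int, timeout: float = 5.0) -> str:
    """Resolve fqdn, then attempt a TCP connection; return a one-line report."""
    try:
        ip = socket.gethostbyname(fqdn)  # step 1: DNS resolution
    except socket.gaierror as exc:
        return f"Failure: could not resolve '{fqdn}': {exc}"

    try:
        # Step 2: TCP handshake only; no database login is attempted.
        with socket.create_connection((ip, port), timeout=timeout):
            return f"Success: able to connect to {fqdn} ({ip}) on port {port}."
    except OSError as exc:
        return f"Failure: '{fqdn}' resolved to {ip} but port {port} is unreachable: {exc}"
```

Calling `check_private_endpoint("MySQL_Private_DB.industrial.com", 3306)` from inside a Data Flow run would exercise the same path as the sample output. A resolution failure points at the DNS zone configuration, while a connection failure points at routing or security rules.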
Cross-Region Access with Data Flow Private Endpoints
Data Flow supports private endpoint integration for
seamless cross-regional data access using remote VCN peering through an upgraded Dynamic Routing
Gateway (DRG).
This configuration lets Data Flow applications in one region
securely connect to data sources hosted in another region's VCN, as depicted in the
following image:
You can ensure high-performance, low-latency data transfers while maintaining a robust
security posture by using the upgraded DRG's advanced capabilities, including transitive
routing and centralized connectivity.
Also, with OCI’s multicloud networking
capabilities, you can extend this setup to connect with, for example, Microsoft Azure. By
using OCI-Azure Interconnect or a similar multicloud
connectivity framework, you can enable Data Flow to process
data stored in Azure resources such as Azure Blob Storage or Azure SQL Database, as shown in
the following image:
This architecture supports centralized connectivity, transitive routing, and low-latency data
transfer between OCI and Azure while maintaining
stringent security and compliance standards.
For the Data Flow private endpoint, what matters is how the DNS
names are resolved before the DRG, using private resolvers, as shown in Connect to On-Premises Resources. A tangible benefit is that the instances in the
service VCN can access a consumer-specified workload without
traversing the Internet. Beyond that, the Data Flow private
endpoint can extend private connectivity from instances in the service VCN to the consumer's
on-premises network and other networks accessible through the consumer VCN. From the usability
perspective, you can continue interacting with only the service Console (or API) and don't need another interface to enable
private access. Despite the flexibility of operation using the Data Flow private endpoint, some limitations exist:
The default limit for a Data Flow private endpoint is no
more than five per tenancy per region.
If Internet connectivity is required for a Spark application run with private endpoints
enabled, the corresponding DNS zone (for example, Google's APIs/google.zone) must be
listed in the DNS zones parameter of the private endpoint attached to the Application.
If the zone is allowlisted, the traffic is routed to the customer VCN for resolution. The
customer network is responsible for Internet connectivity after the packet reaches the
consumer gateway VCN. The network traffic is dropped for all other zones not listed in
the DNS zones parameter.
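The routing decision described in the last limitation can be pictured as a zone lookup: a hostname is forwarded only if it falls under one of the listed DNS zones. The sketch below assumes suffix matching on label boundaries, which is how DNS zones conventionally work; the service's exact matching rules aren't stated here, and the function name is hypothetical.

```python
from typing import Optional


def matching_zone(hostname: str, allowed_zones: list[str]) -> Optional[str]:
    """Return the most specific allowlisted zone containing hostname, else None."""
    host = hostname.lower().rstrip(".")
    best = None
    for zone in allowed_zones:
        z = zone.lower().strip(".")
        # Match the zone itself or any name beneath it, on a label boundary.
        if host == z or host.endswith("." + z):
            if best is None or len(z) > len(best):
                best = z
    return best


zones = ["industrial.com", "oraclevcn.com"]
print(matching_zone("MySQL_Private_DB.industrial.com", zones))  # routed to the customer VCN
print(matching_zone("example.org", zones))                      # no zone listed: dropped
```

Matching on label boundaries means a lookalike such as `notindustrial.com` doesn't slip through under the `industrial.com` zone.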
Connect to an Oracle Database Cluster (RAC or Exadata)
Data Flow can connect to an Oracle RAC (Real Application
Clusters) database or an Exadata machine as a client application using SCAN (Single Client Access
Name).
The SCAN is a virtual name similar to those used for virtual IP addresses. However, unlike a
virtual IP, the SCAN is associated with the entire cluster rather than an
individual node, and it resolves to several IP addresses rather than one.
When the SCAN proxy feature is enabled, a reverse connection entry point (RCE) is established
to handle IP-based redirects. As shown in the following image, a private endpoint VNIC
(private endpoint virtual network interface card) is created in the Customer VCN. The RCE
private endpoint VNIC is unique for each Data Flow private
endpoint setup. One important consideration about the TLS connection to a database cluster is
that the database SCAN listeners redirect the network traffic to an FQDN, not the IP address
directly. Only the FQDN redirects from a SCAN listener enable TLS. Therefore, configure the
database cluster to redirect to an FQDN if TLS is a requirement.
The following steps happen behind the scenes to set up the SCAN proxy feature:
The user configures the SCAN proxy in the Data Flow
private endpoint configuration
Data Flow updates the RCE to include the SCAN configuration
(the SCAN listener DNS name and port), which provides a new IP (SCAN proxy IP) in the
service VCN binding to the same SCAN port
Data Flow then uses the SCAN proxy IP to create a DNS
mapping within the service network, using the original SCAN listener DNS name and the SCAN
proxy IP
The previous image shows an example of an Oracle RAC database system within a private subnet
in the customer VCN. The flow in the image shows the sequence used to establish
connectivity:
Data Flow starts a connection to the SCAN Proxy endpoint
within the service VCN using the DNS name. Data Flow defines
a customer RAC connection through the SCAN Proxy by selecting a specific port. RCE SCAN
proxy forwards the request to the underlying private endpoint VNIC listener in the customer
network. It then inspects the SCAN listener response for an IP of the underlying database
cluster instance, creates a Class E NAT IP, and replaces the cluster instance IP with NAT IP
in the SCAN proxy response.
The Private Access Gateway receives the redirect request from the SCAN listener and
automatically translates the local listener IP in the customer's VCN into a mapped IP
address, then returns this information to the Data Flow
components and makes a connection request to one of the local listeners.
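The address rewriting in this flow can be pictured with a toy model: each real listener address in the customer VCN gets one stable NAT address from a Class E range, and redirects are rewritten to use it. This is a conceptual sketch only, not the service's implementation. The class name, the pool CIDR, the sample listener addresses, and the choice to leave the port unchanged are all assumptions.

```python
import ipaddress


class NatMapper:
    """Toy model: hand out one stable NAT address per real listener address."""

    def __init__(self, nat_cidr: str) -> None:
        self._pool = iter(ipaddress.ip_network(nat_cidr).hosts())
        self._by_real: dict[str, str] = {}

    def translate(self, real_ip: str) -> str:
        # Reuse the existing mapping so repeated redirects stay consistent.
        if real_ip not in self._by_real:
            try:
                self._by_real[real_ip] = str(next(self._pool))
            except StopIteration:
                raise RuntimeError("NAT pool exhausted") from None
        return self._by_real[real_ip]

    def rewrite_redirect(self, host: str, port: int) -> tuple[str, int]:
        """Replace the redirect target's address; keep the port unchanged."""
        return self.translate(host), port


mapper = NatMapper("255.33.36.0/24")           # hypothetical Class E pool
print(mapper.rewrite_redirect("10.0.1.11", 1521))  # hypothetical local listener
```

Keeping the mapping stable matters because a client can be redirected to the same local listener more than once during a session.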
In the Data Flow private endpoint configuration, enter the DNS
name of the scan host in the SCAN details section and its associated port number. For
example:
DNS name: oracleDB-scan-sub0911000090.dfarchitecture.oraclevcn.com
Port: 16001
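On the application side, the same SCAN DNS name and port go into the JDBC URL, so that the connection is made by name rather than by IP address. The helper below builds Spark JDBC reader options under those assumptions; the function name, service name, table, and credentials are hypothetical placeholders, and the Oracle JDBC driver is assumed to be available to the application.

```python
def scan_jdbc_options(scan_host: str, port: int, service_name: str,
                      table: str, user: str, password: str) -> dict[str, str]:
    """Build Spark JDBC reader options that target the SCAN listener by name."""
    return {
        # EZConnect-style thin URL: the host is the SCAN DNS name, never an IP.
        "url": f"jdbc:oracle:thin:@//{scan_host}:{port}/{service_name}",
        "dbtable": table,
        "user": user,
        "password": password,
        "driver": "oracle.jdbc.OracleDriver",
    }


opts = scan_jdbc_options(
    "oracleDB-scan-sub0911000090.dfarchitecture.oraclevcn.com", 16001,
    "example_service", "EXAMPLE_SCHEMA.EXAMPLE_TABLE", "example_user", "example_password",
)
print(opts["url"])
```

Inside a Spark application, the options would typically be passed as `spark.read.format("jdbc").options(**opts).load()`. In real code, read the credentials from a secret store rather than hard-coding them.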
Considerations about Network Traffic and Isolation
When running a Spark serverless execution, it's critical to account for network traffic
patterns and isolation to ensure best performance, security, and compliance.
Serverless Spark jobs run within a managed environment where network traffic flows between
your application, data sources, and external services. To minimize latency and control
traffic, ensure data sources are colocated within the same region and Virtual Cloud Network
(VCN) where possible. Here are some more considerations and information about the Data Flow private endpoint setup:
After a Data Flow private endpoint
resource is attached to an Application run, the network traffic to the Internet is routed to your VCN subnet
through the private endpoint infrastructure as long as the DNS zone is allowlisted during
private endpoint creation. It fails if your VCN doesn't have an Internet Gateway attached.
If the DNS zone isn't allowlisted, network traffic is dropped. Network traffic to OCI services, for example, Object Storage in the
Oracle Services Network, is still routed through Data Flow's
VCN.
While the Data Flow run is in progress, the Spark application running on any of the nodes assigned to the
tenant starts a network connection to your private resource with the DNS name (for
example, customer1.instance1.subnet.oraclevcn.com). This involves a DNS lookup of the
DNS Proxy IP assigned to your private resource, created during private endpoint/reverse
connection endpoint setup. In a reverse connection, a server starts the connection to a
client, letting Data Flow access the resource privately by
connecting to a specified endpoint within your network.
Your DNS zones in the private views create a proxy that returns a Class E IP address
(240.0.0.0-255.255.255.255) for customer.instance.subnet.oraclevcn.com,
allocated from a specific CIDR range, for example, 255.33.36.2, as shown in the
test output in Test the Data Flow Private Endpoint. In this example,
Data Flow nodes running your jobs can establish a network
connection to 255.33.36.0/24, and a stateful egress rule is created with that
destination CIDR range. This means that when a Data Flow
instance starts traffic to another host and that traffic is allowed by egress security
rules, any traffic that the instance later receives from that host for a period is
considered response traffic (ingress) and is allowed. A route table rule is also added to
route traffic through the Data Flow private endpoint for the
destination CIDR range 255.33.36.0/24 allocated to that resource.
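When reading driver logs, it can help to confirm that a resolved address is a Class E proxy address and that it sits inside the CIDR range the egress rule covers. A minimal sketch using the standard `ipaddress` module follows; the function name is hypothetical, and the sample values are the ones used in this section.

```python
import ipaddress

CLASS_E = ipaddress.ip_network("240.0.0.0/4")  # 240.0.0.0-255.255.255.255


def describe_proxy_ip(ip: str, egress_cidr: str) -> dict[str, bool]:
    """Report whether ip is Class E and whether the egress CIDR covers it."""
    address = ipaddress.ip_address(ip)
    return {
        "class_e": address in CLASS_E,
        "covered_by_egress_rule": address in ipaddress.ip_network(egress_cidr),
    }


print(describe_proxy_ip("255.33.36.2", "255.33.36.0/24"))
```

An address outside 240.0.0.0/4 in the driver log suggests the name was resolved somewhere other than the private endpoint's DNS proxy.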
Summary
Data Flow's private endpoint capabilities provide
robust and secure connectivity for accessing diverse data sources and environments.
With private endpoint access, you can seamlessly connect to Autonomous Database (ADB)
instances configured for private access only, ensuring secure interactions without exposing
the database to public networks. Similarly, private endpoints enable secure connectivity to
on-premises resources through Site-to-Site VPN or FastConnect, supporting hybrid cloud use cases.
Cross-regional access is supported for distributed workloads using remote VCN peering and
upgraded Dynamic Routing Gateways (DRGs), enabling low-latency data processing across regions.
Also, Data Flow supports connections to Oracle Database
Clusters, such as RAC or Exadata, leveraging SCAN proxy functionality for efficient and
high-availability access. These features are underpinned by stringent network isolation and
traffic management practices, including private IPs, security rules, and DNS configuration, to
ensure best performance, security, and compliance for serverless Spark executions.