Creating a PySpark Data Flow Application
Follow these steps to create a PySpark application in Data Flow.
- Upload your Spark-submit files to Oracle Cloud Infrastructure Object Storage. See Set Up Object Store for details.
- On the Data Flow page, in the left-side menu, select Applications. If you need help finding the Data Flow page, see Listing Applications.
- On the Applications page, select Create application.
- In the Create application panel, enter a name for the application and an optional description that can help you search for it.
- Under Resource configuration, provide the following values. To help calculate the number of resources that you need, see Sizing the Data Flow Application.
- Select the Spark version.
- (Optional) Select a pool.
- For Driver shape, select the type of cluster node to use to host the Spark driver.
- (Optional) If you selected a flexible shape for the driver, customize the number of OCPUs and the amount of memory.
- For Executor shape, select the type of cluster node to use to host each Spark executor.
- (Optional) If you selected a flexible shape for the executor, customize the number of OCPUs and the amount of memory.
- (Optional) To enable use of Spark dynamic allocation (autoscaling), select Enable autoscaling.
- Enter the Number of executors you need. If you enabled autoscaling, enter a minimum and maximum number of executors instead.
- Under Application configuration, provide the following values.
- (Optional) If the application is for Spark streaming, select Spark streaming.
Note
For your streaming application to work, you must first complete the steps in Getting Started with Spark Streaming.
- Don't select Use Spark-Submit options.
- Select Python from the Language options.
- Under Select a file, specify the file URL to the application in one of the following ways:
- Select the file from the Object Storage file name list. Select Change compartment if the bucket is in a different compartment.
- Select Enter the file URL manually and enter the file name and the path to it using this format:
oci://<bucket_name>@<objectstore_namespace>/<file_name>
- Enter the Main class name.
- (Optional) Enter any arguments to use to invoke the main class. There is no limit to their number or their names. For example, in the Arguments field, enter:
${<argument_1>} ${<argument_2>}
You're prompted for the default value of each argument. It's a good idea to enter these now. Each time you add an argument, a parameter is displayed with the name, as entered in the Arguments field, and a text box in which to enter the parameter value.
If Spark streaming is specified, then you must include the checkpoint folder as an argument. See an example from the sample code on GitHub for how to pass a checkpoint as an argument.
Note
Don't include either "$" or "/" characters in the parameter name or value.
- (Optional) If you have an archive.zip file, upload the file to Oracle Cloud Infrastructure Object Storage and then populate Archive URI with the path to it. There are two ways to do this:
- Select the file from the Object Storage file name list. Select Change compartment if the bucket is in a different compartment.
- Select Enter the file path manually and enter the file name and the path to it using this format:
oci://<bucket_name>@<namespace_name>/<file_name>
- Under Application log location, specify where you want to ingest Oracle Cloud Infrastructure Logging in one of the following ways:
- Select the dataflow-logs bucket from the Object Storage file name list. Select Change compartment if the bucket is in a different compartment.
- Select Enter the bucket path manually and enter the bucket path using this format:
oci://dataflow-logs@<namespace_name>
- (Optional) Select the metastore from the list. If the metastore is in a different compartment, select Change compartment. The default managed table location is automatically populated based on the metastore.
- (Optional) In the Tags section, add one or more tags to the application. If you have permissions to create a resource, then you also have permissions to apply free-form tags to that resource. To apply a defined tag, you must have permissions to use the tag namespace. For more information about tagging, see Resource Tags. If you're not sure whether to apply tags, skip this option or ask an administrator. You can apply tags later.
- (Optional) Add advanced configuration options.
- Select Show advanced options.
- (Optional) Select Use resource principal auth for faster startup, or if you expect the Run to last more than 24 hours.
- (Optional) Select Enable Spark Oracle data source to use Spark Oracle Datasource.
- Select a Delta Lake version. The value you select is reflected in the Spark configuration properties Key/Value pair. See Data Flow and Delta Lake for information on Delta Lake.
- In the Logs section, select the log groups and the application logs for Oracle Cloud Infrastructure Logging. You can change the compartment if the log groups are in a different compartment.
- Enter the key of the Spark configuration property and a value.
- If you're using Spark streaming, include a key of spark.sql.streaming.graceful.shutdown.timeout with a value of no more than 30 minutes (in milliseconds).
- If you're using Spark Oracle Datasource, include a key of spark.oracle.datasource.enabled with a value of true.
- Select + Another property to add another configuration property.
- (Optional) Override the default value for the warehouse bucket by populating Warehouse bucket URI in the following format:
oci://<warehouse-name>@<tenancy>
- Select the network access.
- If you're attaching a private endpoint to Data Flow, select Secure access to private subnet. Select the private endpoint from the resulting list.
Note
You can't use an IP address to connect to the private endpoint; you must use the FQDN.
- If you're not using a private endpoint, select Internet access (No subnet).
- (Optional) To enable data lineage collection:
- Select Enable data lineage collection.
- Select Enter data catalog info manually or select a Data Catalog instance from a configurable compartment in the current tenancy.
- (Optional) If you selected Enter data catalog info manually in the previous step, enter the values for Data catalog tenancy OCID, Data catalog compartment OCID, and Data Catalog instance OCID.
- For Max run duration in minutes, enter a value between 60 (1 hour) and 10080 (7 days). If you don't enter a value, the submitted run continues until it succeeds, fails, is canceled, or reaches its default maximum duration (24 hours).
- Select Create to create the application, or select Save as stack to create it later.
To change the values for language, name, and file URL in the future, see Editing an Application. You can change the language only between Java and Scala. You can't change it to Python or SQL.
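Before typing values into the Console, it can help to check the URI formats and the argument rules from the steps above. The following is a minimal Python sketch that uses only the standard library. The helper names (object_uri, check_parameter, resolve_arguments) and the sample bucket, namespace, and parameter values are hypothetical. string.Template is used only because it happens to share the ${name} placeholder style of the Arguments field; it illustrates the idea and isn't Data Flow's own substitution logic.

```python
from string import Template

FORBIDDEN = ("$", "/")  # characters to avoid in parameter names and values


def object_uri(bucket, namespace, file_name=""):
    """Build oci://<bucket_name>@<namespace_name>/<file_name>."""
    uri = f"oci://{bucket}@{namespace}"
    return f"{uri}/{file_name}" if file_name else uri


def check_parameter(name, value):
    """Reject a parameter whose name or value contains "$" or "/"."""
    for text in (name, value):
        if any(ch in text for ch in FORBIDDEN):
            raise ValueError(f"{text!r} contains '$' or '/'")


def resolve_arguments(arguments, parameters):
    """Fill ${name} placeholders in the arguments with parameter values."""
    for name, value in parameters.items():
        check_parameter(name, value)
    # substitute() raises KeyError if a placeholder has no matching parameter.
    return [Template(arg).substitute(parameters) for arg in arguments]


# Hypothetical sample values, for illustration only.
print(object_uri("my-bucket", "mynamespace", "wordcount.py"))
print(object_uri("dataflow-logs", "mynamespace"))
print(resolve_arguments(["${run_date}", "${table}"],
                        {"run_date": "2024-01-31", "table": "sales"}))
```

Running the check locally catches a missing default value or a forbidden character before the application is created.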
Use the create command and required parameters to create an application:
oci data-flow application create [OPTIONS]
For a complete list of flags and variable options for CLI commands, see the CLI Command Reference.
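As a rough illustration of what a filled-in command might look like, the Python sketch below assembles one and prints it for review instead of running it. The flag names are the ones this sketch assumes the OCI CLI accepts for this command; confirm each against the CLI Command Reference before use. Every value is a placeholder or a hypothetical sample.

```python
import shlex

# All values are placeholders or hypothetical samples; replace them with your own.
# The flag names are assumptions to verify against the CLI Command Reference.
options = {
    "--compartment-id": "<compartment_ocid>",
    "--display-name": "my-pyspark-app",
    "--language": "PYTHON",
    "--spark-version": "<spark_version>",
    "--driver-shape": "<driver_shape>",
    "--executor-shape": "<executor_shape>",
    "--num-executors": "2",
    "--file-uri": "oci://<bucket_name>@<namespace_name>/<file_name>",
}

command = ["oci", "data-flow", "application", "create"]
for flag, value in options.items():
    command += [flag, value]

# shlex.join quotes values such as <...> so the printed line is safe to paste
# into a POSIX shell once the placeholders are replaced.
print(shlex.join(command))
```

Keeping the options in a dictionary makes it easy to review or diff the settings before the command is run.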
Run the CreateApplication operation to create an application.
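The request body for CreateApplication is a JSON document. The following Python sketch builds one as a plain dictionary so that it runs without the OCI SDK. The field names are what this sketch assumes the API expects and must be checked against the API reference. The function name, its graceful_shutdown_minutes argument, and all values are hypothetical. The optional argument only converts the streaming graceful shutdown timeout to milliseconds and enforces the 30-minute limit; it doesn't add the checkpoint argument that a streaming application also needs.

```python
import json

MAX_GRACEFUL_SHUTDOWN_MINUTES = 30  # upper bound stated in the Console steps


def build_create_application_body(graceful_shutdown_minutes=None):
    """Return a dict for a CreateApplication request (field names assumed)."""
    body = {
        "compartmentId": "<compartment_ocid>",
        "displayName": "my-pyspark-app",
        "language": "PYTHON",
        "sparkVersion": "<spark_version>",
        "driverShape": "<driver_shape>",
        "executorShape": "<executor_shape>",
        "numExecutors": 2,
        "fileUri": "oci://<bucket_name>@<namespace_name>/<file_name>",
        "arguments": ["${table}"],
        "parameters": [{"name": "table", "value": "sales"}],
        "configuration": {},
    }
    if graceful_shutdown_minutes is not None:
        if not 0 < graceful_shutdown_minutes <= MAX_GRACEFUL_SHUTDOWN_MINUTES:
            raise ValueError("timeout must be between 1 and 30 minutes")
        # Spark configuration values are strings; the timeout is in milliseconds.
        timeout_ms = graceful_shutdown_minutes * 60 * 1000
        body["configuration"][
            "spark.sql.streaming.graceful.shutdown.timeout"
        ] = str(timeout_ms)
    return body


print(json.dumps(build_create_application_body(graceful_shutdown_minutes=30), indent=2))
```

The printed JSON can be compared field by field with the API reference before it's sent with a signed request or through an SDK client.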