You can use spark-submit compatible options to run your applications using Data Flow.
Spark-submit is an industry-standard command
for running applications on Spark clusters. The following spark-submit compatible
options are supported by Data Flow:
--conf
--files
--py-files
--jars
--class
--driver-java-options
--packages
main-application.jar or main-application.py
arguments to main-application. Arguments passed to the main method
of your main class (if any).
The --files option flattens your file hierarchy, so all files are placed
at the same level in the current working directory. To keep the file hierarchy, use
either archive.zip, or --py-files with a JAR, ZIP, or
EGG dependency module.
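For instance, a dependency archive kept intact might be passed like this; the bucket, namespace, and archive name are placeholders:
--py-files oci://<bucket-name>@<tenancy-name>/dependencies.zip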
The --packages option includes any other dependencies by
supplying a comma-delimited list of Maven coordinates in
group:artifact:version format. All transitive dependencies are handled
when using this option.
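For example, to pull in the PostgreSQL JDBC driver, an illustrative coordinate rather than one required by Data Flow, you would pass:
--packages org.postgresql:postgresql:42.6.0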
Note
With the
--packages option, each Run's driver pod needs to download
dependencies dynamically, which relies on network stability and access to Maven
Central or other remote repositories. Use the Data Flow Dependency Packager to generate a
dependency archive for production.
Note
For all spark-submit options on Data Flow, the URIs
must begin with oci://.... URIs starting with
local://... or hdfs://... aren't supported.
Use fully qualified domain names (FQDNs) in the URIs. Upload all files, including
the main application, to Oracle Cloud Infrastructure Object Storage.
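For instance, the Object Storage URI for an uploaded main application takes the following form; the bucket, namespace, and object name are placeholders:
oci://<bucket-name>@<tenancy-name>/main-application.py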
Creating a Spark-Submit Data Flow Application explains how to create an application in the
Console using spark-submit. You can also use
spark-submit with the Java SDK or from the CLI. If you're using the CLI, you don't have to
create a Data Flow Application to run your Spark
application with spark-submit compatible options on Data Flow. This is useful if you already have a working
spark-submit command in a different environment. When you follow the syntax of the
run submit command, an Application is created if one doesn't
already exist at the main-application URI.
Installing Public CLI with the run submit Command 🔗
These steps are needed to install a public CLI with the run submit
command for use with Data Flow:
Create a customized Python environment to use as the destination of the
CLI.
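A minimal sketch of this step, assuming a virtual environment named dataflow-cli and installation of the public CLI from PyPI; the environment name is a placeholder:
python3 -m venv dataflow-cli        # create an isolated Python environment
source dataflow-cli/bin/activate    # activate it
pip install oci-cli                 # install the Oracle Cloud Infrastructure public CLI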
Check that the CLI is installed:
oci --version
Check that the run submit command is available:
oci data-flow run submit
Usage: oci data-flow run submit [OPTIONS]
Error: Missing option(s) --compartment-id, --execute.
The usage text and the missing-options error confirm that the command exists; the error is expected when no options are passed.
Authenticate the session:
oci session authenticate
Select a region from the list.
Enter the name of the profile to create.
Create a token profile:
oci iam region list --config-file /Users/<user-name>/.oci/config --profile <profile_name> --auth security_token
Using Spark-submit in Data Flow 🔗
You can take your spark-submit CLI command and convert it into a compatible CLI
command in Data Flow.
The spark-submit compatible command in Data Flow is the
run submit command. If you already have a working Spark application
in any cluster, you're familiar with the spark-submit syntax.
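For example, an existing spark-submit command might resemble the following sketch; the master URL, file paths, class name, and argument are placeholders rather than values from this guide:
spark-submit \
 --master spark://<IP-address>:<port> \
 --deploy-mode cluster \
 --conf spark.sql.crossJoin.enabled=true \
 --files /path/to/config.json \
 --jars /path/to/dependency.jar \
 --class org.apache.spark.examples.SparkPi \
 /path/to/spark-examples.jar 10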
To get the compatible Data Flow command, follow
these steps:
Upload all the files, including the main application, to Object Storage.
Replace the existing URIs with the corresponding oci://... URI.
Remove any unsupported or reserved spark-submit parameters. For example,
--master and --deploy-mode are
reserved by Data Flow, so you don't need to set them.
Add the --execute parameter and pass in a spark-submit
compatible command string. To build the --execute string,
keep the supported spark-submit parameters, the main application, and its
arguments, in sequence. Put them inside a quoted string (single or double
quotes).
Replace spark-submit with the Oracle Cloud Infrastructure standard command prefix,
oci data-flow run submit.
Add the mandatory Oracle Cloud Infrastructure argument
and parameter pairs for --profile, --auth
security_token, and --compartment-id, as shown in the sketch that follows this list.
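Applying these steps to the earlier spark-submit sketch yields a run submit call similar to the following; the profile name, compartment OCID, bucket, namespace, and object names are placeholders:
oci --profile <profile_name> --auth security_token data-flow run submit \
 --compartment-id <compartment-id> \
 --execute "--conf spark.sql.crossJoin.enabled=true
 --files oci://<bucket-name>@<tenancy-name>/config.json
 --jars oci://<bucket-name>@<tenancy-name>/dependency.jar
 --class org.apache.spark.examples.SparkPi
 oci://<bucket-name>@<tenancy-name>/spark-examples.jar 10"
Note that --master and --deploy-mode have been dropped, and all file references now point to Object Storage.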
The following run submit call demonstrates URL validation of the jars, files, and
pyfiles parameters; because fake.jar doesn't exist in Object Storage, the run is rejected:
oci --profile oci-cli --auth security_token data-flow run submit \
--compartment-id <compartment-id> \
--archive-uri "oci://<bucket-name>@<tenancy-name>/mmlspark_original.zip" \
--execute "--jars oci://<bucket-name>@<tenancy-name>/fake.jar
--conf spark.sql.crossJoin.enabled=true
--class org.apache.spark.examples.SparkPi
oci://<bucket-name>@<tenancy-name>/spark-examples_2.11-2.3.1-SNAPSHOT-jar-with-dependencies.jar 10"
#result
{'opc-request-id': '<opc-request-id>', 'code': 'InvalidParameter',
'message': 'Invalid OCI Object Storage uri. The object was not found or you are not authorized to access it.
{ResultCode: OBJECTSTORAGE_URI_INVALID,
Parameters: [oci://<bucket-name>@<tenancy-name>/fake.jar]}', 'status': 400}
To enable Resource Principal authentication, add the Spark property through the spark-submit
conf mechanism by including the following configuration in the --execute string:
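A minimal sketch, assuming the dataflow.auth property is the intended setting (verify the property name against the current Data Flow documentation):
--execute "--conf dataflow.auth=resource_principal --conf spark.sql.crossJoin.enabled=true --class org.apache.spark.examples.SparkPi oci://<bucket-name>@<tenancy-name>/spark-examples.jar 10"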