Your Java or Scala applications might need extra JAR files that you can't, or don't want to, bundle in a Fat JAR. Or you might want to include native code or other assets that you need to make available within the Spark runtime.
When the spark-submit options don't work, Data Flow lets you provide a ZIP archive (archive.zip) along with your application for bundling third-party dependencies. The ZIP archive can be created using a Docker-based
tool. The archive.zip is installed on all Spark nodes before running the
application. If you construct the archive.zip correctly, the Python libraries
are added to the runtime, and the JAR files are added to the Spark classpath. The libraries
added are isolated to one Run. That means they don't interfere with other concurrent Runs or
later Runs. Only one archive can be provided per Run.
Anything in the archive must be compatible with the Data Flow
runtime. For example, Data Flow runs on Oracle Linux using
particular versions of Java and Python. Binary code compiled for other operating systems, or JAR files compiled for other Java versions, might cause the Run to fail. Data Flow provides tools to help you build archives with compatible software. However, these archives are ordinary ZIP files, so you're free to create
them any way you want. If you use your own tools, you're responsible for ensuring
compatibility.
Dependency archives, like your Spark applications, are uploaded to Data Flow. Your Data Flow Application definition contains a link to this archive, which can be overridden at runtime. When you run your Application, the archive is downloaded and installed before the Spark job runs. The archive is private to the Run. This means, for example, that you can run two different instances of the same Application concurrently, each with different dependencies, without any conflicts. Dependencies don't persist between Runs, so there aren't any problems with conflicting versions for other Spark applications that you might run.
Building a Dependency Archive Using the Data Flow Dependency Packager
For Python dependencies, create a requirements.txt file. For
example, it might look similar to:
numpy==1.18.1
pandas==1.0.3
pyarrow==0.14.0
Note
Don't include pyspark or
py4j. These dependencies are provided by Data Flow, and including them causes Runs to
fail.
The Data Flow Dependency
Packager uses Python's pip tool to install all dependencies. If you have
Python wheels that can't be downloaded from public sources, place them in a
directory beneath where you build the package. Reference them in requirements.txt with a prefix of /opt/dataflow/. For example:
/opt/dataflow/<my-python-wheel.whl>
where <my-python-wheel.whl> represents the name of the Python
wheel. Pip sees it as a local file and installs it normally.
For Java dependencies, create a file called packages.txt. For
example, it might look similar to:
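Entries are typically Maven coordinates (groupId:artifactId:version), one per line; the packages shown here are only illustrative, not required dependencies:
org.apache.commons:commons-lang3:3.12.0
com.google.guava:guava:33.0.0-jre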
The
Data Flow Dependency Packager uses Apache
Maven to download dependency JAR files. If you have JAR files that
can't be downloaded from public sources, place them in a local directory beneath
where you build the package. Any JAR files in any subdirectory where you build
the package are included in the archive.
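For illustration, a build directory containing both public and local dependencies might look like this before packaging (file and directory names are hypothetical):
requirements.txt
packages.txt
my-python-wheel.whl
jars/my-internal-lib.jar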
Use a Docker container to create the archive.
Note
The Python version must be set to 3.11 when using
Spark 3.5.0, 3.8 when using Spark 3.2.1, and
3.6 when using Spark 3.0.2 or Spark 2.4.4. In the
following commands, <python_version>
represents this number.
The working directory is represented by pwd. The -v flag maps the local working directory into the Docker container.
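A representative invocation is sketched below. Mounting the working directory at /opt/dataflow is what lets pip find locally placed wheels at the /opt/dataflow/ paths referenced in requirements.txt. The image path, tag, and -p option follow the pattern commonly documented for the Dependency Packager; confirm the exact image name and options in the current Data Flow documentation for your Spark version:
docker run --rm -v $(pwd):/opt/dataflow -it phx.ocir.io/oracle/dataflow/dependency-packager:latest -p <python_version>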
You can add static content to the archive. For example, you might want to deploy a data file, an ML model file, or an executable that the Spark program calls at runtime. You do this by adding files to archive.zip after you create it in Step 4.
For Java applications:
Unzip archive.zip.
Add the JAR files to the java/ directory only.
Zip the file.
Upload it to Object Storage.
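A minimal sketch of these steps, assuming a hypothetical extra JAR named my-extra.jar and a bucket named my-bucket (adjust names and paths for your tenancy):
# extract the packager-built archive into the current directory
unzip archive.zip
# add the extra JAR to the java/ directory only
cp my-extra.jar java/
# update the archive with the new entry, preserving the java/ path
zip archive.zip java/my-extra.jar
# upload to Object Storage
oci os object put --bucket-name my-bucket --file archive.zip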
For Python applications:
Unzip archive.zip.
Add your local modules to only these three subdirectories of the
python/ directory:
python/lib
python/lib32
python/lib64
Zip the file.
Upload it to Object Storage.
Note
Only these four directories are allowed for storing the Java and
Python dependencies.
When the Data Flow application runs, the static content is
available on any node under the directory where you chose to place it. For example,
if you added files under python/lib/ in the archive, they're
available in the /opt/dataflow/python/lib/ directory on any
node.
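For example, a PySpark job could read a configuration file that was packaged as static content under python/lib/; the file name here is hypothetical:

import json

# Files placed under python/lib/ in the archive are installed on every node
# at /opt/dataflow/python/lib/. "model_config.json" is a hypothetical file
# added to the archive as static content.
with open("/opt/dataflow/python/lib/model_config.json") as f:
    config = json.load(f)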
Dependency archives are ordinary ZIP files. Advanced users might want to build archives
with their own tools rather than using the Data Flow
Dependency Packager. A correctly constructed dependency archive has this general
outline:
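An illustrative layout, based on the directories described earlier (the site-packages path varies with the Python version; names in angle brackets are placeholders):
java/<your JAR files>
python/lib/python<python_version>/site-packages/<your Python libraries>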
To connect to Oracle databases such as ADW, you need to include Oracle JDBC JAR files.
Download and extract the compatible driver JAR files into a directory
under where you build the package. For example, to package the Oracle 18.3 (18c) JDBC
driver, ensure all these JARs are present:
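The 18.3 (18c) driver download typically includes at least the following JARs; exact file names can vary slightly with how the release is packaged:
ojdbc8.jar
ucp.jar
oraclepki.jar
osdt_cert.jar
osdt_core.jar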