Data Flow Concepts
An understanding of these concepts is essential for using Data Flow.
- Data Flow Applications
- An Application is an infinitely reusable Spark application template consisting of a Spark application, its dependencies, default parameters, and a default runtime resource specification. After a developer creates a Data Flow Application, anyone can use it without worrying about the complexities of deploying it, setting it up, or running it.
- Data Flow Library
- The Library is the central repository of Data Flow Applications. Anyone with the correct permissions in the Data Flow system can browse, search, and run applications published to the Library.
- Data Flow Runs
- Every time a Data Flow Application is run, a Run is created. The Data Flow Run captures the Application's output, logs, and statistics, which are automatically and securely stored. Output is saved so that anyone with the correct permissions can view it using the UI or REST API. Runs also give you secure access to the Spark UI for debugging and diagnostics.
- Data Flow Pools
- A Data Flow Pool is a pre-configured group of Compute resources that can be used to run various Spark data and machine learning workloads, including batch, streaming, and interactive. A Data Flow Pool can be shared by many Data Flow batch, streaming, and session workloads run by various users at the same time in the same tenant.
- Elastic Compute
- Every time you run a Data Flow Application, you decide how big you want it to be. Data Flow allocates your VMs, runs your job, securely captures all output, and shuts the cluster down. You don't have anything to maintain in Data Flow. Clusters only run when there's real work to do.
- Elastic Storage
- Data Flow works with the Oracle Cloud Infrastructure Object Storage service. For more information, see the Overview of Object Storage.
- Private Network
- You can configure your Data Flow Application to access data sources hosted in private networks. If a private endpoint doesn't already exist, you must create one for your Application to use.
- Security
- Data Flow is integrated with Oracle Cloud Infrastructure Identity and Access Management (IAM) for authentication and authorization. Your Spark applications run on behalf of the person who launches them. This means that the Spark application has the same privileges the end user has. You don't need to use credentials to access any IAM-capable system. In addition, Data Flow benefits from all the other security attributes of Oracle Cloud Infrastructure including transparent encryption of data at rest and in motion.
- Service Administrator
- See About Service Administrator Roles for more information about administrator roles.
- Account Administrator
- The Account Administrator creates an account for each user who needs access to the service.
- Administrator Controls
- Data Flow lets you set service limits and create administrators who have full control over all applications and runs. You're in control regardless of how many users you have.
- Apache Spark
- Apache Spark is a unified analytics engine for big data processing, with built-in modules for streaming, SQL, machine learning, and graph processing.
- Spark Application
- A Spark Application uses the Spark API to perform distributed data processing tasks. Spark Applications can be written in several languages, including Java and Python. A Spark Application is packaged as one or more files, such as JAR files, that are run within the Spark framework.
- Spark UI
- The Spark UI is included with Apache Spark and is an important tool for debugging and diagnosing Spark applications. You can access the Spark UI for any Data Flow Run, subject to the Run's authorization policies.
- Spark Logs
- Spark generates log files, which are useful for debugging and diagnostics. Each Data Flow Run automatically stores its log files, which you can access by using the UI or API, subject to the Run's authorization policies.
- Enhanced Logs
- Driver and executor logs, both StdOut and StdErr, are provided by Oracle Cloud Infrastructure Logging. Using them is optional.
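
As a rough illustration of how Applications and Runs relate, the following Python sketch models an Application as a reusable template with default parameters and a default size, and creates a Run by optionally overriding those defaults at launch time. It is a conceptual model only: the class names, field names, and the create_run helper are hypothetical and are not part of the Data Flow API or SDK, and the oci:// URIs are placeholders.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass(frozen=True)
class Application:
    """Hypothetical model of a reusable Spark application template."""
    name: str
    file_uri: str  # placeholder location of the Spark artifact (JAR, Python file, ...)
    default_parameters: Dict[str, str] = field(default_factory=dict)
    default_num_executors: int = 1  # stands in for the default resource specification


@dataclass(frozen=True)
class Run:
    """Hypothetical model of one execution of an Application."""
    application_name: str
    parameters: Dict[str, str]
    num_executors: int


def create_run(
    app: Application,
    parameters: Optional[Dict[str, str]] = None,
    num_executors: Optional[int] = None,
) -> Run:
    """Create a Run from an Application, overriding defaults where given."""
    # Start from the template's defaults, then apply any per-run overrides.
    resolved = {**app.default_parameters, **(parameters or {})}
    size = app.default_num_executors if num_executors is None else num_executors
    if size < 1:
        raise ValueError("num_executors must be at least 1")
    return Run(application_name=app.name, parameters=resolved, num_executors=size)


# One template, two Runs of different sizes.
app = Application(
    name="word-count",
    file_uri="oci://bucket@namespace/word_count.py",  # placeholder URI
    default_parameters={"input": "oci://bucket@namespace/in.txt"},
    default_num_executors=2,
)
small_run = create_run(app)
large_run = create_run(app, parameters={"mode": "full"}, num_executors=8)
```

In this sketch the template itself never changes. Each call to create_run produces an independent Run record, which mirrors the idea that the size of a job is decided each time the Application is run.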