Query External Data with Data Catalog

Oracle Cloud Infrastructure Data Catalog is the metadata management service for Oracle Cloud that helps you discover data and support data governance. It provides an inventory of assets, a business glossary, and a common metastore for data lakes.

Autonomous Database can leverage this metadata to dramatically simplify management for access to your data lake's object store. Instead of manually defining external tables to access your data lake, use the external tables that are defined and managed automatically. These tables will be found in Autonomous Database protected schemas that are kept up to date with changes in Data Catalog.

For more information about Data Catalog, please refer to Data Catalog documentation.

About Querying with Data Catalog
By synchronizing with Data Catalog metadata, Autonomous Database automatically creates external tables for each logical entity harvested by Data Catalog. These external tables are defined in database schemas that are fully managed by the metadata synchronization process. Users can immediately query data without having to manually derive the schema (columns and data types) for external data sources and manually create external tables.
Concepts Related to Querying with Data Catalog
An understanding of the following concepts is necessary for querying with Data Catalog.
Synchronization Mapping
The synchronization process creates and updates Autonomous Database schemas and external tables based on Data Catalog data assets, folders, logical entities, attributes and relevant custom overrides.
Typical Workflow with Data Catalog
There is a typical workflow of actions performed by users who want to query with Data Catalog.
Example: MovieStream Scenario
In this scenario, Moviestream is capturing data in a landing zone on object storage. Much of this data, but not necessarily all, is then used to feed an Autonomous Database. Prior to feeding Autonomous Database, the data is transformed, cleansed and subsequently stored in the "gold" area.
Example: Partitioned Data Scenario
This scenario illustrates how to create external tables in Autonomous Database that are based on Data Catalog logical entities harvested from partitioned data in Object Store.

Parent topic: Query External Data with Autonomous Database

About Querying with Data Catalog

By synchronizing with Data Catalog metadata, Autonomous Database automatically creates external tables for each logical entity harvested by Data Catalog. These external tables are defined in database schemas that are fully managed by the metadata synchronization process. Users can immediately query data without having to manually derive the schema (columns and data types) for external data sources and manually create external tables.

Synchronization is dynamic, keeping the Autonomous Database up-do-date with respect to changes to the underlying data, reducing administration cost as it automatically maintains hundreds to thousands of tables. It also allows multiple Autonomous Database instances to share the same Data Catalog, further reducing management costs and providing a common set of business definitions.

The Data Catalog folders/buckets are containers that sync with Autonomous Database schemas. Logical entities within those folders/buckets map to Autonomous Database external tables. These schemas and external tables are automatically generated and maintained through the sync process:

Folders/Buckets map to database schemas that are for organizational purposes only.
The organization is meant to be consistent with the data lake and minimize confusion when accessing data thru different paths.
Data Catalog is the source of truth for the tables contained within schemas. Changes made in the Data Catalog update the schema's tables during a subsequent sync.

To use this capability, a Database Data Catalog Administrator initiates a connection to a Data Catalog instance, selects which data assets and logical entities to synchronize, and runs the sync. The sync process creates schemas and external tables based on the selected Data Catalog harvested data assets and logical entities. As soon as the external tables are created, Data Analysts can start querying their data without having to manually derive the schema for external data sources and create external tables.

Note

The DBMS_DCAT Package is available for performing the tasks required to query Data Catalog object store data assets. See DBMS_DCAT Package.

Parent topic: Query External Data with Data Catalog

Concepts Related to Querying with Data Catalog

An understanding of the following concepts is necessary for querying with Data Catalog.

Data Catalog

Data Catalog harvests data assets that point to the object store data sources you want to query with Autonomous Database. From Data Catalog you can specify how the data is organized during harvesting, supporting different file organization patterns. As part of the Data Catalog harvesting process, you can select the buckets and files you want to manage within the asset. For further information, see Data Catalog Overview.

Object Stores

Object Stores have buckets containing a variety of objects. Some common types of objects found in these buckets include: CSV, parquet, avro, json, and ORC files. Buckets generally have a structure or a design pattern to the objects they contain. There are many different ways to structure data and many different ways of interpreting these patterns.

For example, a typical design pattern uses top-level folders that represent tables. Files within a given folder share the same schema and contain data for that table. Subfolders are often used to represent table partitions (for example, a subfolder for each day). Data Catalog refers to each top-level folder as a logical entity, and this logical entity maps to an Autonomous Database external table.

Connection

A connection is an Autonomous Database connection to a Data Catalog instance. For each Autonomous Database instance there can be connections to multiple Data Catalog instances. The Autonomous Database credential must have rights to access Data Catalog assets that have been harvested from object storage.

Harvest

A Data Catalog process that scans object storage and generates the logical entities from your data sets.

Data Asset

A data asset in Data Catalog represents a data source, which includes databases, Oracle Object Storage, Kafka, and more. Autonomous Database leverages Oracle Object Storage assets for metadata synchronization.

Data Entity

A data entity in Data Catalog is a collection of data such as a database table or view, or a single file and normally has many attributes that describe its data.

Logical Entity

In Data Lakes, numerous files typically comprise a single logical entity. For example, you may have daily clickstream files, and these files share the same schema and file type.

A Data Catalog logical entity is a group of Object Storage files that are derived during harvesting by applying filename patterns that have been created and assigned to a data asset.

Data Object

A data object in Data Catalog refers to data assets and data entities.

Filename Pattern

In a data lake, data may be organized in different ways. Typically, folders capture files of the same schema and type. You must register to Data Catalog how your data is organized. Filename patterns are used to identify how your data is organized. In Data Catalog, you can define filename patterns using regular expressions. When Data Catalog harvests a data asset with an assigned filename pattern, logical entities are created based on the filename pattern. By defining and assigning these patterns to data assets, multiple files can be grouped as logical entities based on the filename pattern.

Synchronize (Sync)

Autonomous Database performs synchronizations with Data Catalog to automatically keep its database up-to-date with respect to changes to the underlying data. Synchronization can be performed manually, or on a schedule.

The sync process creates schemas and external tables based on the Data Catalog data assets and logical entities. These schemas are protected, which means their metadata is managed by Data Catalog. If you want to alter the metadata, you must make the changes in Data Catalog. The Autonomous Database schemas will reflect any changes after the next sync is run. For further details, see Synchronization Mapping.

Parent topic: Query External Data with Data Catalog

Synchronization Mapping

The synchronization process creates and updates Autonomous Database schemas and external tables based on Data Catalog data assets, folders, logical entities, attributes and relevant custom overrides.

Data Catalog Autonomous Database Mapping Description

Data Catalog	Autonomous Database	Mapping Description
Data asset and folder (object storage bucket)	Schema name	Default values: By default, the generated schema name in Autonomous Database has the following format: `DCAT$<dcat-con-id>_<data-asset-name>_<folder-name>` `dcat-con-id` is the unique Data Catalog connection identifier. For details on specifying this identifier, see the `dcat_con_id` parameter in the DBMS_DCAT RUN_SYNC Procedure. `data-asset-name` is the name of the Data Catalog data asset's name. `folder-name` is the Data Catalog folder name. This folder maps to an object storage bucket. Customizations: The default `data-asset-name` and `folder-name` can be customized by defining custom properties, business names and display names to override these default names. `data-asset-name` can be overidden by defining the `oracle-db-schema-prefix` custom property for the data asset in Data Catalog. `folder-name` can be overidden by defining the `oracle-db-schema` custom property for the folder in Data Catalog, a business name or display name. The following attributes are used in order of precedence for generating the `folder-name`: `oracle-db-schema` custom property Business Name Display Name Examples: If the connection id is `DataModels`, the asset name is `ObjectStorage`, the folder name is `HR` and there are no custom property overrides, the derived schema name is: `DCAT$DATAMODELS_OBJECTSTORAGE_HR` If the data asset name is MYASSET, the folder name is MYFOLDER, and there are no custom property overrides, the schema name is: `DCAT$MYASSET_MYFOLDER` If the data asset has `oracle-db-prefix = FIRSTASSET`, and the folder has `oracle-db-schema = FIRSTFOLDER`, then the schema name is: `DCAT$FIRSTASSET_FIRSTFOLDER`
Logical entity	External table	Logical entities are mapped to external tables. If the logical entity has a partitioned attribute, it is mapped to a partitioned external table. The external table name is derived from the corresponding logical entity's Display Name, or Business Name. If `oracle-db-schema` is set, then its value overrides all the names and custom properties of the corresponding folders and data assets. For example, if `oracle-db-schema` for an entity is set to `EntitySchema`, then the table is created in schema `DCAT$ENTITYSCHEMA`.
Logical entity's attributes	External table columns	Column names: The external table column names are derived from the corresponding logical entity's attribute display names, or business names. For logical entities derived from Parquet, Avro, and ORC files, the column name is always the display name of the attribute as it represents the field name derived from the source files. For attributes corresponding to a logical entity derived from CSV files, the following attribute fields are used in order of precedence for generating the column name: `oracle-db-column-name` Business Name Display Name Column type: The `oracle-db-column-type` custom property overrides the default column type that was derived by Data Catalog. For attributes corresponding to a logical entity derived from Avro files with `TIME_MICROS`, `TIME_MILLIS`, `TIMESTAMP_MICROS` or `TIMESTAMP_MILLIS` data types, you must set the `oracle-db-column-type` of the corresponding attribute in Data Catalog. Column length: The `oracle-db-column-length` custom property overrides the default column length for a string field that was derived by Data Catalog. Column precision: The `oracle-db-column-precision` custom property overrides the default precision for a number that was derived by Data Catalog. For attributes corresponding to a logical entity derived from Avro files with `TIME_MICROS`, `TIME_MILLIS`, `TIMESTAMP_MICROS` or `TIMESTAMP_MILLIS` data types, you must set the `oracle-db-column-precision` of the corresponding attribute in Data Catalog. Column scale: The `oracle-db-column-scale` custom property overrides the default scale for a number that was derived by Data Catalog.

Data asset and folder (object storage bucket)

Schema name

Default values:

By default, the generated schema name in Autonomous Database has the following format:

DCAT$<dcat-con-id>_<data-asset-name>_<folder-name>

dcat-con-id is the unique Data Catalog connection identifier. For details on specifying this identifier, see the dcat_con_id parameter in the DBMS_DCAT RUN_SYNC Procedure.
data-asset-name is the name of the Data Catalog data asset's name.
folder-name is the Data Catalog folder name. This folder maps to an object storage bucket.

Customizations:

The default data-asset-name and folder-name can be customized by defining custom properties, business names and display names to override these default names.

data-asset-name can be overidden by defining the oracle-db-schema-prefix custom property for the data asset in Data Catalog.
folder-name can be overidden by defining the oracle-db-schema custom property for the folder in Data Catalog, a business name or display name. The following attributes are used in order of precedence for generating the folder-name:
1. oracle-db-schema custom property
2. Business Name
3. Display Name

Examples:

If the connection id is DataModels, the asset name is ObjectStorage, the folder name is HR and there are no custom property overrides, the derived schema name is: DCAT$DATAMODELS_OBJECTSTORAGE_HR
If the data asset name is MYASSET, the folder name is MYFOLDER, and there are no custom property overrides, the schema name is: DCAT$MYASSET_MYFOLDER
If the data asset has oracle-db-prefix = FIRSTASSET, and the folder has oracle-db-schema = FIRSTFOLDER, then the schema name is: DCAT$FIRSTASSET_FIRSTFOLDER

Logical entity

External table

Logical entities are mapped to external tables. If the logical entity has a partitioned attribute, it is mapped to a partitioned external table.

The external table name is derived from the corresponding logical entity's Display Name, or Business Name.

If oracle-db-schema is set, then its value overrides all the names and custom properties of the corresponding folders and data assets.

For example, if oracle-db-schema for an entity is set to EntitySchema, then the table is created in schema DCAT$ENTITYSCHEMA.

Logical entity's attributes

External table columns

Column names: The external table column names are derived from the corresponding logical entity's attribute display names, or business names.

For logical entities derived from Parquet, Avro, and ORC files, the column name is always the display name of the attribute as it represents the field name derived from the source files.

For attributes corresponding to a logical entity derived from CSV files, the following attribute fields are used in order of precedence for generating the column name:

oracle-db-column-name
Business Name
Display Name

Column type: The oracle-db-column-type custom property overrides the default column type that was derived by Data Catalog.

For attributes corresponding to a logical entity derived from Avro files with TIME_MICROS, TIME_MILLIS, TIMESTAMP_MICROS or TIMESTAMP_MILLIS data types, you must set the oracle-db-column-type of the corresponding attribute in Data Catalog.

Column length: The oracle-db-column-length custom property overrides the default column length for a string field that was derived by Data Catalog.

Column precision: The oracle-db-column-precision custom property overrides the default precision for a number that was derived by Data Catalog.

For attributes corresponding to a logical entity derived from Avro files with TIME_MICROS, TIME_MILLIS, TIMESTAMP_MICROS or TIMESTAMP_MILLIS data types, you must set the oracle-db-column-precision of the corresponding attribute in Data Catalog.

Column scale: The oracle-db-column-scale custom property overrides the default scale for a number that was derived by Data Catalog.

Parent topic: Query External Data with Data Catalog

Typical Workflow with Data Catalog

There is a typical workflow of actions performed by users who want to query with Data Catalog.

The Database Data Catalog Admin creates a connection between the Autonomous Database instance and a Data Catalog instance, then configures and runs a synchronization (sync) between the Data Catalog and Autonomous Database. The sync creates external tables and schemas in the Autonomous Database instance based on the synced Data Catalog contents.

The Database Data Catalog Query Admin or Database Admin grants READ access to the generated external tables so that Data Analysts and other database users can browse and query the external tables.

The table below describes each action in detail. For a description of the different user types included in this table, see Data Catalog Users and Roles.

Note

The DBMS_DCAT Package is available for performing the tasks required to query Data Catalog object store data assets. See DBMS_DCAT Package.

Action	Who is the user	Description
Create policies	Database Data Catalog Administrator	The Autonomous Database resource principal or Autonomous Database user credential must have the appropriate permissions to manage Data Catalog and to read from object storage. More information: Required Credentials and IAM Policies.
Create credentials	Database Data Catalog Administrator	Ensure database credentials are in place to access Data Catalog and to query object store. The user calls `DBMS_CLOUD.CREATE_CREDENTIAL` to create user credentials and/or `DBMS_CLOUD_ADMIN.ENABLE_RESOURCE_PRINCIPAL` to enable resource principals. More information: DBMS_CLOUD CREATE_CREDENTIAL Procedure, Use Resource Principal with DBMS_CLOUD.
Create connections to Data Catalog	Database Data Catalog Administrator	To initiate a connection between an Autonomous Database instance and a Data Catalog instance the user calls `DBMS_DCAT.SET_DATA_CATALOG_CONN` to specify the target Data Catalog instance. Connections from an Autonomous Database instance to multiple Data Catalog instances is supported. The connection to the Data Catalog instance must use a database credential object with sufficient Oracle Cloud Infrastructure (OCI) privileges. For example, the Resource Principal Service Token for the Autonomous Database instance or an OCI user with sufficient privileges can be used. Once the connection has been made, the Data Catalog instance is updated with the `DBMS_DCAT` namespace and custom properties (if they do not already exist). The user can run a query to see the new connection including all current connections: `select * from all_dcat_connections;` More information: SET_DATA_CATALOG_CONN Procedure, UNSET_DATA_CATALOG_CONN Procedure.
Create a selective sync	Database Data Catalog Administrator	Create a sync job by selecting the Data Catalog objects to sync. The user can: Select data assets/folders to sync. Select individual logical entities to sync. Preview the resulting external tables before syncing. Change external tables (for example, the name) by modifying custom properties on Data Catalog. More information: See CREATE_SYNC_JOB Procedure, DROP_SYNC_JOB Procedure, Synchronization Mapping
Sync with Data Catalog	Database Data Catalog Administrator	The user initiates a sync operation. The sync is initiated manually through the `DBMS_DCAT.RUN_SYNC` procedure call, or automatically as part of a scheduled sync job. The sync operation creates, modifies and drops external tables and schemas according to the Data Catalog contents and sync selections. Manual configuration is applied using Data Catalog Custom Properties. More information: See RUN_SYNC Procedure, CREATE_SYNC_JOB Procedure, Synchronization Mapping
Monitor sync and view logs	Database Data Catalog Administrator	The user can view the sync status by querying the `USER_LOAD_OPERATIONS` view. After the sync process has completed, the user can view a log of the sync results, including details about the mappings of logical entities to external tables. More information: Monitoring and Troubleshooting Loads
Grant privileges	Database Data Catalog Query Administrator, Database Administrator	The database Data Catalog Query Administrator or database Administrator must grant READ on generated external tables to data analyst users. This allows the data analysts to query the generated external tables.
Browse and query external tables	Data Analyst	Data analysts are able to query the external tables through any tool or application that supports Oracle SQL. Data Analysts can review the synced schemas and tables in the DCAT$* schemas, and query the tables using Oracle SQL. More information: Synchronization Mapping
Terminate connections to Data Catalog	Database Data Catalog Administrator	To remove an existing Data Catalog association, the user calls the UNSET_DATA_CATALOG_CONN procedure. This action is only done when you no longer plan on using Data Catalog and the external tables that are derived from the catalog. This action deletes Data Catalog metadata, and drops synced external tables from the Autonomous Database instance. The custom properties on Data Catalog and OCI policies are not affected. More information: UNSET_DATA_CATALOG_CONN Procedure

Parent topic: Query External Data with Data Catalog

Example: MovieStream Scenario

In this scenario, Moviestream is capturing data in a landing zone on object storage. Much of this data, but not necessarily all, is then used to feed an Autonomous Database. Prior to feeding Autonomous Database, the data is transformed, cleansed and subsequently stored in the "gold" area.

Data Catalog is used to harvest these sources and then provide a business context to the data. Data Catalog metadata is shared with Autonomous Database, allowing Autonomous Database users to query those data sources using Oracle SQL. This data may be loaded into Autonomous Database or queried dynamically using external tables.

For more information on using Data Catalog, see Data Catalog Documentation.

Object Store - Review buckets, folders and files
1. Review the buckets in your object store.
  For example, below are the landing (moviestream_landing) and gold zone (moviestream_gold) buckets in object storage:
2. Review the folders and files in the object store buckets.
  For example, below are the folders in the landing bucket (moviestream_landing) in object storage:

Data Catalog - Create filename patterns

Inform Data Catalog how your data is organized using filename patterns. These are regular expressions used to categorize files. The filename patterns are used by the Data Catalog harvester to derive logical entities. The following two filename patterns are used to harvest the buckets in the MovieStream example. See Harvesting Object Storage Files as Logical Data Entities for further details on creating filename patterns.

Hive-style	Folder-style
`{bucketName:.}/{logicalEntity:[^/]+}.db/{logicalEntity:[^/]+}/.`	`{bucketName:[\w]+}/{logicalEntity:[^/]+}(?<!.db)/.*$`
Creates logical entities for sources that contain ".db" as the first part of the object name. To ensure uniqueness within the bucket, the resulting name is (db-name).(folder name)	Creates a logical entity based on the folder name off of the root To prevent duplication with Hive, object names that have ".db" in them are skipped.

To create filename patterns, go to the Filename Patterns tab for your Data Catalog and click Create Filename Pattern. For example, the following is the Create Filename Pattern tab for the moviestream Data Catalog:

Data Catalog - Data Asset Creation
1. Create a data asset that is used to harvest data from your object store.
  For example, a data asset named phoenixObjStore is created in the moviestream Data Catalog:
2. Add a connection to your data asset.
  In this example, the data asset connects to the compartment for the moviestream object storage resource.
3. Now, associate your filename patterns with your data asset. Select Assign Filename Patterns, check the patterns you want and click Assign.
  For example, here are the patterns assigned to the phoenixObjStore data asset:
  Description of the illustration assign_filename_patterns_v1.png
Data Catalog - Harvest data from object store
1. Harvest the Data Catalog data asset. Select the object store buckets containing the source data.
  In this example, the moviestream_gold and moviestream_landing buckets from object store are selected for harvesting.
2. After running the job, you see the logical entities. Use the Browse Data Assets to review them.
  In this example, you are looking at the customer-extension logical entity and its attributes.
  If you have a glossary, Data Catalog recommends categories and terms to associate with the entity and its attributes. This provides a business context for the items. Schemas, tables and columns are oftentimes not self-explanatory.
  In our example, we want to differentiate between the different types of buckets and the meaning of their content:
  - what is a landing zone?
  - how accurate is the data?
  - when was it last updated?
  - what is the definition of a logical entity or its attribute

Autonomous Database - Connect to Data Catalog

Connect Autonomous Database to Data Catalog. You need to ensure that the credential used to make that connection is using an OCI principal that is authorized to access the Data Catalog asset. For further information, see Data Catalog Policies and Access Cloud Resources by Configuring Policies and Roles.

Connect to Data Catalog

-- Variables are used to simplify usage later
define oci_credential = 'OCI$RESOURCE_PRINCIPAL'
define dcat_ocid = 'ocid1.datacatalog.oc1.iad.aaaaaaaardp66bg....twiq'
define dcat_region='us-ashburn-1'
define uri_root =    'https://objectstorage.us-ashburn-1.oraclecloud.com/n/mytenancy/b/landing/o'
define uri_private = 'https://objectstorage.us-ashburn-1.oraclecloud.com/n/mytenancy/b/private_data/o'

-- Run as admin
-------
-- Enable resource principal support
-------
exec dbms_cloud_admin.enable_resource_principal();

-- Test to make sure credential was created. Returns a row if it was successful
select * 
from dba_credentials 
where credential_name = 'OCI$RESOURCE_PRINCIPAL' and owner = 'ADMIN'; 

-- Query a private bucket to test the principal and privileges.
select * 
from dbms_cloud.list_objects('&oci_credential', '&uri_private/');

--------
-- Set the credentials to use for object store and data catalog
-- Connect to Data Catalog
-- Review connection
---------
-- Set credentials
exec dbms_dcat.set_data_catalog_credential(credential_name => '&oci_credential');
exec dbms_dcat.set_object_store_credential(credential_name => '&oci_credential');

-- Connect to Data Catalog
begin
   dbms_dcat.set_data_catalog_conn (
     region => '&dcat_region',
     catalog_id => '&dcat_ocid');
end;
/
-- Review the connection
select * from all_dcat_connections;

Sync Data Catalog with Autonomous Database. Here, we'll sync all the object storage assets:


-- Sync Data Catalog with Autonomous Database
---- Let's sync all of the assets.
begin
 dbms_dcat.run_sync('{"asset_list":["*"]}');
end;
/

-- View log
select type, start_time, status, logfile_table from user_load_operations;  -- Logfile_Table will have the name of the table containing the full log.
select * from dbms_dcat$1_log;

-- View the new external tables
select * from dcat_entities;
select * from dcat_attributes;

Autonomous Database - Now start running queries against object store.
```
-- Query the Data !
select *from dcat$phoenixobjstore_moviestream_gold.genre
;
```

Change schemas for objects

The default schema names are rather complicated. Let's simplify them by specifying both the asset and the folder's Oracle-Db-Schema custom attribute in Data Catalog. Change the data asset to PHX and the folders to landing and gold respectively. The schema is a concatenation of the two.
1. From Data Catalog, navigate to the moviestream_landing bucket and change the asset to landing and gold respectively.
  
  Before change:
  
  After change:
2. Run another sync.

Parent topic: Query External Data with Data Catalog

Example: Partitioned Data Scenario

This scenario illustrates how to create external tables in Autonomous Database that are based on Data Catalog logical entities harvested from partitioned data in Object Store.

The following example is based on Example: MovieStream Scenario and has been adapted to demonstrate integrating with partitioned data. Data Catalog is used to harvest these sources and then provide a business context to the data. For further details about this example, see Example: MovieStream Scenario.

For more information on using Data Catalog, see Data Catalog Documentation.

Object Store - Review buckets, folders and files
1. Review the buckets in your object store.
  For example, below are the landing (moviestream_landing) and gold zone (moviestream_gold) buckets in object storage:
2. Review the folders and files in the object store buckets.
  For example, below are the folders in the landing bucket (moviestream_landing) in object storage:
Data Catalog - Create filename patterns
1. Inform Data Catalog how your data is organized using filename patterns. These are folder prefixes or regular expressions used to categorize files. The filename patterns are used by the Data Catalog harvester to derive logical entities. When a folder prefix is specified, the Data Catalog automatically generates logical entities from the specified folder prefix in the object store. The following filename pattern is used to harvest the buckets in the MovieStream example. See Harvesting Object Storage Files as Logical Data Entities for further details on creating filename patterns.
  
  Folder prefix Description
  
  workshop.db/ Creates logical entities for sources that contain "workshop.db" path in the object store.
2. To create filename patterns, go to the Filename Patterns tab for your Data Catalog and click Create Filename Pattern. For example, the following is the Create Filename Pattern tab for the moviestream Data Catalog:
Data Catalog - Data Asset Creation
1. Create a data asset that is used to harvest data from your object store.
  For example, a data asset named amsterdamObjStore is created in the moviestream Data Catalog:
2. Add a connection to your data asset.
  In this example, the data asset connects to the compartment for the moviestream object storage resource.
3. Now, associate your filename patterns with your data asset. Select Assign Filename Patterns, check the patterns you want and click Assign.
  For example, here are the patterns assigned to the amsterdamObjStore data asset:
Data Catalog - Harvest data from object store
1. Harvest the Data Catalog data asset. Select the object store buckets containing the source data.
  In this example, the moviestream_gold and moviestream_landing buckets from object store are selected for harvesting.
2. After running the job, you see the logical entities. Use the Browse Data Assets to review them.
  In this example, you are looking at the sales_sample_parquet logical entity and its attributes. Note that Data Catalog has identified the month attribute as partitioned.

Folder prefix	Description
`workshop.db/`	Creates logical entities for sources that contain "workshop.db" path in the object store.

Autonomous Database - Connect to Data Catalog

Connect to Data Catalog

-- Variables are used to simplify usage later
define oci_credential = 'OCI$RESOURCE_PRINCIPAL'
define dcat_ocid = 'ocid1.datacatalog.oc1.eu-amsterdam-1....leguurn3dmqa'
define dcat_region='eu-amsterdam-1'
define uri_root =    'https://objectstorage.us-ashburn-1.oraclecloud.com/n/mytenancy/b/landing/o'
define uri_private = 'https://objectstorage.us-ashburn-1.oraclecloud.com/n/mytenancy/b/private_data/o'

-- Run as admin
-------
-- Enable resource principal support
-------
exec dbms_cloud_admin.enable_resource_principal();

-- Test to make sure credential was created. Returns a row if it was successful
select * 
from dba_credentials 
where credential_name = 'OCI$RESOURCE_PRINCIPAL' and owner = 'ADMIN'; 

-- Query a private bucket to test the principal and privileges.
select * 
from dbms_cloud.list_objects('&oci_credential', '&uri_private/');

--------
-- Set the credentials to use for object store and data catalog
-- Connect to Data Catalog
-- Review connection
---------
-- Set credentials
exec dbms_dcat.set_data_catalog_credential(credential_name => '&oci_credential');
exec dbms_dcat.set_object_store_credential(credential_name => '&oci_credential');

-- Connect to Data Catalog
begin
   dbms_dcat.set_data_catalog_conn (
     region => '&dcat_region',
     catalog_id => '&dcat_ocid');
end;
/
-- Review the connection
select * from all_dcat_connections;

Sync Data Catalog with Autonomous Database. Here, we'll sync all the object storage assets:


-- Sync Data Catalog with Autonomous Database
---- Let's sync all of the assets.
begin
 dbms_dcat.run_sync('{"asset_list":["*"]}');
end;
/

-- View log
select type, start_time, status, logfile_table from user_load_operations;  -- Logfile_Table will have the name of the table containing the full log.
select * from dbms_dcat$1_log;

-- View the new external tables
select * from dcat_entities;
select * from dcat_attributes;

Autonomous Database - Now start running queries against object store.

-- Query the Data !
select count(*) from DCAT$AMSTERDAMOBJSTORE_MOVIESTREAM_LANDING.SALES_SAMPLE_PARQUET;

-- Examine the generated partitioned table
select dbms_metadata.get_ddl('TABLE','SALES_SAMPLE_PARQUET','DCAT$AMSTERDAMOBJSTORE_MOVIESTREAM_LANDING') from dual;

CREATE TABLE "DCAT$AMSTERDAMOBJSTORE_MOVIESTREAM_LANDING"."SALES_SAMPLE_PARQUET"
(    "MONTH" VARCHAR2(4000) COLLATE "USING_NLS_COMP",
     "DAY_ID" TIMESTAMP (6),
     "GENRE_ID" NUMBER(20,0),
     "MOVIE_ID" NUMBER(20,0),
     "CUST_ID" NUMBER(20,0),
...
)  DEFAULT COLLATION "USING_NLS_COMP"
ORGANIZATION EXTERNAL
 ( TYPE ORACLE_BIGDATA
   ACCESS PARAMETERS
   ( com.oracle.bigdata.fileformat=parquet
     com.oracle.bigdata.filename.columns=["MONTH"]
     com.oracle.bigdata.file_uri_list="https://swiftobjectstorage.eu-amsterdam-1.oraclecloud.com/v1/tenancy/moviestream_landing/workshop.db/sales_sample_parquet/*"
...
   )
 )
REJECT LIMIT 0
PARTITION BY LIST ("MONTH")
(PARTITION "P1"  VALUES (('2019-01'))
   LOCATION ( 'https://swiftobjectstorage.eu-amsterdam-1.oraclecloud.com/v1/tenancy/moviestream_landing/workshop.db/sales_sample_parquet/month=2019-01/*'),
PARTITION "P2"  VALUES (('2019-02'))
   LOCATION ( 'https://swiftobjectstorage.eu-amsterdam-1.oraclecloud.com/v1/tenancy/moviestream_landing/workshop.db/sales_sample_parquet/month=2019-02/*'),
...PARTITION "P24"  VALUES (('2020-12'))
   LOCATION ( 'https://swiftobjectstorage.eu-amsterdam-1.oraclecloud.com/v1/tenancy/moviestream_landing/workshop.db/sales_sample_parquet/month=2020-12/*'))
PARALLEL

Change schemas for objects

The default schema names are rather complicated. Let's simplify them by specifying both the asset and the folder's Oracle-Db-Schema custom attribute in Data Catalog. Change the data asset to PHX and the folders to landing and gold respectively. The schema is a concatenation of the two.
1. From Data Catalog, navigate to the moviestream_landing bucket and change the asset to landing and gold respectively.
  
  Before change:
  
  After change:
2. Run another sync.

Parent topic: Query External Data with Data Catalog

Oracle Cloud Infrastructure Documentation