Query External Data with ORC, Parquet, or Avro Source Files

Autonomous Database makes it easy to access ORC, Parquet, or Avro data stored in object store using external tables. ORC, Parquet, and Avro sources have metadata embedded in them, and the DBMS_CLOUD.CREATE_EXTERNAL_TABLE procedure can use this metadata to simplify the creation of external tables.

You don't need to know the structure of the data; DBMS_CLOUD can examine the file and convert ORC, Parquet, or Avro contents into the equivalent Oracle columns and data types. You only need to know the location of the data in object store, specify its type (ORC, Parquet, or Avro), and have credentials to access the source file on your object store.
The steps to use external tables are very similar for ORC, Parquet, and Avro. These steps show working with a Parquet format source file.

The source file in this example, sales_extended.parquet, contains Parquet format data. To query this file in Autonomous Database, do the following:
- Store your object store credentials using the procedure DBMS_CLOUD.CREATE_CREDENTIAL:

      BEGIN
        DBMS_CLOUD.CREATE_CREDENTIAL(
          credential_name => 'DEF_CRED_NAME',
          username        => 'adb_user@example.com',
          password        => 'password'
        );
      END;
      /

  Creating a credential to access Oracle Cloud Infrastructure Object Store is not required if you enable resource principal credentials. See Use Resource Principal to Access Oracle Cloud Infrastructure Resources for more information.

  This operation stores the credentials in the database in an encrypted format. You can use any name for the credential. Note that this step is required only once, unless your object store credentials change. Once you store the credentials, you can use the same credential name for creating external tables.

  See CREATE_CREDENTIAL Procedure for information about the username and password parameters for different object storage services.

- Create an external table for ORC, Parquet, or Avro on top of your source files using the procedure DBMS_CLOUD.CREATE_EXTERNAL_TABLE.

  The procedure DBMS_CLOUD.CREATE_EXTERNAL_TABLE supports external files in the supported cloud object storage services, including Oracle Cloud Infrastructure Object Storage, Azure Blob Storage or Azure Data Lake Storage, Amazon S3, and Amazon S3-Compatible services (including Oracle Cloud Infrastructure Object Storage, Google Cloud Storage, and Wasabi Hot Cloud Storage). The credential is a table-level property; therefore, the external files must be on the same object store.

  By default, the columns created in the external table automatically map their data types to Oracle data types for the fields found in the source files, and the external table column names match the source field names.
      BEGIN
        DBMS_CLOUD.CREATE_EXTERNAL_TABLE(
          table_name      => 'sales_extended_ext',
          credential_name => 'DEF_CRED_NAME',
          file_uri_list   => 'https://objectstorage.us-phoenix-1.oraclecloud.com/n/namespace-string/b/bucketname/o/sales_extended.parquet',
          format          => '{"type":"parquet", "schema": "first"}'
        );
      END;
      /

  The parameters are:

  - table_name: the external table name.
  - credential_name: the name of the credential created in the previous step. The credential_name parameter must conform to Oracle object naming conventions. See Database Object Naming Rules for more information.
  - file_uri_list: a comma-delimited list of the source files you want to query. The URI format for Dedicated Endpoints is supported in commercial (OC1) realms. See Object Storage Dedicated Endpoints and DBMS_CLOUD URI Formats for more information.
  - format: the options that describe the format of the source file. For a Parquet file, use the format parameter to specify the type parquet. For an Avro file, use the format parameter to specify the type avro. For an ORC file, use the format parameter to specify the type orc.
  In this example, namespace-string is the Oracle Cloud Infrastructure object storage namespace and bucketname is the bucket name. See Understanding Object Storage Namespaces for more information.

  By default, when the format schema parameter is set, the columns and their data types are derived automatically and the fields in the source are matched to the external table columns by name. Source data types are converted to the external table column Oracle data types according to the DBMS_CLOUD mapping for ORC, Parquet, or Avro data types. The valid schema parameter values are:

  - first: Analyze the schema of the first ORC, Parquet, or Avro file that DBMS_CLOUD finds in the specified file_uri_list (first is the default value for schema).
  - all: Analyze all the schemas for all the ORC, Parquet, or Avro files found in the file_uri_list. Because these are simply files captured in an object store, there is no guarantee that each file's metadata is the same. For example, File1 may contain a field called "address", while File2 may be missing that field. Examining each file to derive the columns is more expensive, but may be required if the first file does not contain all the required fields.

  Note: If the column_list parameter is specified, then you provide the column names and data types for the external table, and the schema value, if specified, is ignored. Using column_list you can limit the columns in the external table. If column_list is not specified, then the schema default value is first.
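  As a sketch of the column_list alternative mentioned in the note above, the following call restricts the external table to three fields from sales_extended.parquet. The table name sales_extended_cols_ext is illustrative, and the column definitions here are assumptions matching the field names shown in this example's output:

  ```sql
  BEGIN
    DBMS_CLOUD.CREATE_EXTERNAL_TABLE(
      table_name      => 'sales_extended_cols_ext',
      credential_name => 'DEF_CRED_NAME',
      file_uri_list   => 'https://objectstorage.us-phoenix-1.oraclecloud.com/n/namespace-string/b/bucketname/o/sales_extended.parquet',
      -- column_list takes precedence: the schema format value is ignored
      -- and only these columns appear in the external table.
      column_list     => 'PROD_ID NUMBER(10), CITY VARCHAR2(4000), INCOME_LEVEL VARCHAR2(4000)',
      format          => '{"type":"parquet"}'
    );
  END;
  /
  ```

  Matching here is by name, so the external table column names must correspond to fields in the source file.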
- You can now run queries on the external table you created in the previous step:

      DESC sales_extended_ext;
      Name           Null? Type
      -------------- ----- ---------------
      PROD_ID              NUMBER(10)
      CUST_ID              NUMBER(10)
      TIME_ID              VARCHAR2(32767)
      CHANNEL_ID           NUMBER(10)
      PROMO_ID             NUMBER(10)
      QUANTITY_SOLD        NUMBER(10)
      AMOUNT_SOLD          NUMBER(10,2)
      GENDER               VARCHAR2(32767)
      CITY                 VARCHAR2(32767)
      STATE_PROVINCE       VARCHAR2(32767)
      INCOME_LEVEL         VARCHAR2(32767)

      SELECT prod_id, quantity_sold, gender, city, income_level
        FROM sales_extended_ext
       WHERE ROWNUM < 8;

      PROD_ID QUANTITY_SOLD GENDER CITY           INCOME_LEVEL
           13             1 M      Adelaide       K: 250,000 – 299,999
           13             1 M      Dolores        L: 300,000 and above
           13             1 M      Cayuga         F: 110,000 – 129,999
           13             1 F      Bergen op Zoom C: 50,000 – 69,999
           13             1 F      Neuss          J: 190,000 – 249,999
           13             1 F      Darwin         F: 110,000 – 129,999
           13             1 M      Sabadell       K: 250,000 – 299,999
This query shows values for rows in the external table. If you want to query this data frequently, after examining the data you can load it into a table with DBMS_CLOUD.COPY_DATA.
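A hedged sketch of that follow-up load: the target table name sales_extended is illustrative, and for Parquet sources the target table must already exist before calling DBMS_CLOUD.COPY_DATA.

```sql
-- Create an empty target table with the same shape as the external table
-- (the name sales_extended is an assumption for this sketch).
CREATE TABLE sales_extended AS
  SELECT * FROM sales_extended_ext WHERE 1 = 0;

BEGIN
  DBMS_CLOUD.COPY_DATA(
    table_name      => 'sales_extended',
    credential_name => 'DEF_CRED_NAME',
    file_uri_list   => 'https://objectstorage.us-phoenix-1.oraclecloud.com/n/namespace-string/b/bucketname/o/sales_extended.parquet',
    format          => '{"type":"parquet", "schema": "first"}'
  );
END;
/
```

After the load, queries run against the local table sales_extended instead of fetching from object store each time.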
See CREATE_EXTERNAL_TABLE Procedure for Avro, ORC, or Parquet Files and COPY_DATA Procedure for Avro, ORC, or Parquet Files for more information.
See DBMS_CLOUD URI Formats for information on supported cloud object storage services.
Query External Data with ORC, Parquet, or Avro Source Files and Explicitly Set the Text Column Size
When you access ORC, Parquet, or Avro data stored in object store using external tables on Autonomous Database, you can either automatically or explicitly set the maximum size of text columns.
By default, the text column size is set based on the value of MAX_STRING_SIZE.

The source file in this example, sales_extended.parquet, contains Parquet format data. To query this file in Autonomous Database and set the maximum text column size, do the following:
- Store your object store credentials using the procedure DBMS_CLOUD.CREATE_CREDENTIAL:

      BEGIN
        DBMS_CLOUD.CREATE_CREDENTIAL(
          credential_name => 'DEF_CRED_NAME',
          username        => 'adb_user@example.com',
          password        => 'password'
        );
      END;
      /

  Creating a credential to access Oracle Cloud Infrastructure Object Store is not required if you enable resource principal credentials. See Use Resource Principal to Access Oracle Cloud Infrastructure Resources for more information.

  This operation stores the credentials in the database in an encrypted format. You can use any name for the credential. Note that this step is required only once, unless your object store credentials change. Once you store the credentials, you can use the same credential name for creating external tables.

  See CREATE_CREDENTIAL Procedure for information about the username and password parameters for different object storage services.

- Create an external table for ORC, Parquet, or Avro on top of your source files using the procedure DBMS_CLOUD.CREATE_EXTERNAL_TABLE and specify the maxvarchar format parameter.

  The procedure DBMS_CLOUD.CREATE_EXTERNAL_TABLE supports external files in the supported cloud object storage services, including Oracle Cloud Infrastructure Object Storage, Azure Blob Storage or Azure Data Lake Storage, Amazon S3, and Amazon S3-Compatible services (including Oracle Cloud Infrastructure Object Storage, Google Cloud Storage, and Wasabi Hot Cloud Storage). The credential is a table-level property; therefore, the external files must be on the same object store.

  By default, the columns created in the external table automatically map their data types to Oracle data types for the fields found in the source files, and the external table column names match the source field names.
      BEGIN
        DBMS_CLOUD.CREATE_EXTERNAL_TABLE(
          table_name      => 'sales_extended_ext',
          credential_name => 'DEF_CRED_NAME',
          file_uri_list   => 'https://objectstorage.us-phoenix-1.oraclecloud.com/n/namespace-string/b/bucketname/o/sales_extended.parquet',
          format          => '{"type":"parquet", "schema": "first", "maxvarchar":"extended"}'
        );
      END;
      /

  The parameters are:

  - table_name: the external table name.
  - credential_name: the name of the credential created in the previous step. The credential_name parameter must conform to Oracle object naming conventions. See Database Object Naming Rules for more information.
  - file_uri_list: a comma-delimited list of the source files you want to query. The URI format for Dedicated Endpoints is supported in commercial (OC1) realms. See Object Storage Dedicated Endpoints and DBMS_CLOUD URI Formats for more information.
  - format: the options that describe the format of the source file. For a Parquet file, use the format parameter to specify the type parquet. For an Avro file, use the format parameter to specify the type avro. For an ORC file, use the format parameter to specify the type orc. The option maxvarchar with the value extended specifies that text columns are created as varchar(32767) on an Autonomous Database instance with extended string size. The possible values are standard with varchar(4000), extended with varchar(32767), and auto. The default maxvarchar value is auto; with this value, the maximum text size is based on the value of MAX_STRING_SIZE.
  In this example, namespace-string is the Oracle Cloud Infrastructure object storage namespace and bucketname is the bucket name. See Understanding Object Storage Namespaces for more information.

  By default, when the format schema parameter is set, the columns and their data types are derived automatically and the fields in the source are matched to the external table columns by name. Source data types are converted to the external table column Oracle data types according to the DBMS_CLOUD mapping for ORC, Parquet, or Avro data types. The valid schema parameter values are:

  - first: Analyze the schema of the first ORC, Parquet, or Avro file that DBMS_CLOUD finds in the specified file_uri_list (first is the default value for schema).
  - all: Analyze all the schemas for all the ORC, Parquet, or Avro files found in the file_uri_list. Because these are simply files captured in an object store, there is no guarantee that each file's metadata is the same. For example, File1 may contain a field called "address", while File2 may be missing that field. Examining each file to derive the columns is more expensive, but may be required if the first file does not contain all the required fields.

  Note: If the column_list parameter is specified, then you provide the column names and data types for the external table, and the schema value, if specified, is ignored. Using column_list you can limit the columns in the external table. If column_list is not specified, then the schema default value is first.
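  Because "maxvarchar":"auto" keys off MAX_STRING_SIZE, you can check the instance setting before choosing a value. A minimal sketch, assuming your user is allowed to query V$PARAMETER on your Autonomous Database instance:

  ```sql
  -- Returns EXTENDED (text columns created as VARCHAR2(32767))
  -- or STANDARD (text columns created as VARCHAR2(4000)).
  SELECT value
    FROM v$parameter
   WHERE name = 'max_string_size';
  ```

  If the result is EXTENDED, the default maxvarchar value of auto behaves like extended; if STANDARD, it behaves like standard.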
- You can now run queries on the external table you created in the previous step:

      DESC sales_extended_ext;
      Name           Null? Type
      -------------- ----- ---------------
      PROD_ID              NUMBER(10)
      CUST_ID              NUMBER(10)
      TIME_ID              VARCHAR2(32767)
      CHANNEL_ID           NUMBER(10)
      PROMO_ID             NUMBER(10)
      QUANTITY_SOLD        NUMBER(10)
      AMOUNT_SOLD          NUMBER(10,2)
      GENDER               VARCHAR2(32767)
      CITY                 VARCHAR2(32767)
      STATE_PROVINCE       VARCHAR2(32767)
      INCOME_LEVEL         VARCHAR2(32767)

      SELECT prod_id, quantity_sold, gender, city, income_level
        FROM sales_extended_ext
       WHERE ROWNUM < 8;

      PROD_ID QUANTITY_SOLD GENDER CITY           INCOME_LEVEL
           13             1 M      Adelaide       K: 250,000 – 299,999
           13             1 M      Dolores        L: 300,000 and above
           13             1 M      Cayuga         F: 110,000 – 129,999
           13             1 F      Bergen op Zoom C: 50,000 – 69,999
           13             1 F      Neuss          J: 190,000 – 249,999
           13             1 F      Darwin         F: 110,000 – 129,999
           13             1 M      Sabadell       K: 250,000 – 299,999
This query shows values for rows in the external table. If you want to query this data frequently, after examining the data you can load it into a table with DBMS_CLOUD.COPY_DATA.

If you specify the format option maxvarchar as standard, the varchar2() text columns are created with size 4000. For example:

      BEGIN
        DBMS_CLOUD.CREATE_EXTERNAL_TABLE(
          table_name      => 'sales_extended_ext',
          credential_name => 'DEF_CRED_NAME',
          file_uri_list   => 'https://objectstorage.us-phoenix-1.oraclecloud.com/n/namespace-string/b/bucketname/o/sales_extended.parquet',
          format          => '{"type":"parquet", "schema": "first", "maxvarchar":"standard"}'
        );
      END;
      /

The format parameter with the option maxvarchar set to the value standard specifies that text columns are created as varchar(4000):

      DESC sales_extended_ext;
      Name           Null? Type
      -------------- ----- --------------
      PROD_ID              NUMBER(10)
      CUST_ID              NUMBER(10)
      TIME_ID              VARCHAR2(4000)
      CHANNEL_ID           NUMBER(10)
      PROMO_ID             NUMBER(10)
      QUANTITY_SOLD        NUMBER(10)
      AMOUNT_SOLD          NUMBER(10,2)
      GENDER               VARCHAR2(4000)
      CITY                 VARCHAR2(4000)
      STATE_PROVINCE       VARCHAR2(4000)
      INCOME_LEVEL         VARCHAR2(4000)
See CREATE_EXTERNAL_TABLE Procedure for Avro, ORC, or Parquet Files and COPY_DATA Procedure for Avro, ORC, or Parquet Files for more information.
See DBMS_CLOUD URI Formats for information on supported cloud object storage services.