Query External Data with ORC,
Parquet, or Avro Source Files
Autonomous Database makes it easy to access ORC,
Parquet, or Avro data stored in object store using external tables. ORC, Parquet, and Avro
sources have metadata embedded in them and the DBMS_CLOUD.CREATE_EXTERNAL_TABLE procedure can utilize this metadata
to simplify the creation of external tables.
You don’t need to know the structure of the data, DBMS_CLOUD
can examine the file and convert ORC, Parquet, or Avro contents into the equivalent Oracle
columns and data types. You only need to know the location of the data in object store,
specify its type, ORC, Parquet, or Avro, and have credentials to access the source file on
your object store.
Note
The steps to use external tables are very
similar for ORC, Parquet, and Avro. These steps show working with a Parquet format source
file.
The source file in this example, sales_extended.parquet,
contains Parquet format data. To query this file in Autonomous Database, do the following:
Store your object store credentials, to access the object store, using the
procedure DBMS_CLOUD.CREATE_CREDENTIAL:
This operation stores the credentials in the database in an encrypted
format. You can use any name for the credential name. Note that this step is required
only once unless your object store credentials change. Once you store the credentials
you can then use the same credential name for creating external tables.
See CREATE_CREDENTIAL Procedure
for information about
the username and password parameters for different
object storage services.
Create an external table for ORC, Parquet, or Avro on top of your source files using
the procedure DBMS_CLOUD.CREATE_EXTERNAL_TABLE.
The procedure DBMS_CLOUD.CREATE_EXTERNAL_TABLE supports external files in
the supported cloud object storage services, including: Oracle Cloud
Infrastructure Object Storage, Azure Blob Storage or Azure Data Lake
Storage, Amazon S3, and
Amazon S3-Compatible, including: Oracle Cloud
Infrastructure Object Storage, Google Cloud Storage, and Wasabi Hot Cloud Storage. The credential is a table level property; therefore, the
external files must be on the same object store.
By default, the columns created in the external table automatically map
their data types to Oracle data types for the fields found in the source files and the
external table column names match the source field names.
BEGIN
DBMS_CLOUD.CREATE_EXTERNAL_TABLE(
table_name =>'sales_extended_ext',
credential_name =>'DEF_CRED_NAME',
file_uri_list =>'https://objectstorage.us-phoenix-1.oraclecloud.com/n/namespace-string/b/bucketname/o/sales_extended.parquet',
format => '{"type":"parquet", "schema": "first"}'
);
END;
/
The parameters are:
table_name: is the external table name.
credential_name: is the name of the credential
created in the previous step. The credential_name parameter must
conform to Oracle object naming conventions. See Database Object Naming Rules for more
information.
file_uri_list: is a comma delimited list of the
source files you want to query. The URI format for Dedicated Endpoints is supported
in commercial (OC1) realms. See Object Storage Dedicated
Endpoints, and DBMS_CLOUD URI Formats for more information.
format: defines the options to describe the format of
the source file. For a Parquet file, use the format parameter to
specify the typeparquet. For an Avro file use the format parameter
to specify the typeavro. For an ORC file use the format parameter to
specify the typeorc.
In this example, namespace-string is the Oracle
Cloud Infrastructure object storage namespace and
bucketname is the bucket
name. See Understanding
Object Storage Namespaces for more
information.
By default the formatschema parameter is set and the columns and data types are derived
automatically and the fields in the source match the external table columns by name.
Source data types are converted to the external table column Oracle data types according
to the DBMS_CLOUD mapping for ORC, Parquet, or Avro data types. The
valid schema parameter values are:
first: Analyze the schema of the first ORC, Parquet,
or Avro file that DBMS_CLOUD finds in the specified
file_uri_list (first is the default value for
schema).
all: Analyze all the schemas for all the ORC,
Parquet, or Avro files found in the file_uri_list. Because these
are simply files captured in an object store, there is no guarantee that each file’s
metadata is the same. For example, File1 may contain a field called “address”, while
File2 may be missing that field. Examining each file to derive the columns is a bit
more expensive but may be required if the first file does not contain all the
required fields.
Note
If the column_list
parameter is specified, then you provide the column names and data types for the
external table and the schema value, if specified is ignored. Using
column_list you can limit the columns in the external table. If
column_list is not specified then the schema default
value is first.
You can now run queries on the external table you created in the previous step:
DESC sales_extended_ext;
Name Null? Type
-------------- ----- --------------
PROD_ID NUMBER(10)
CUST_ID NUMBER(10)
TIME_ID VARCHAR2(32767)
CHANNEL_ID NUMBER(10)
PROMO_ID NUMBER(10)
QUANTITY_SOLD NUMBER(10)
AMOUNT_SOLD NUMBER(10,2)
GENDER VARCHAR2(32767)
CITY VARCHAR2(32767)
STATE_PROVINCE VARCHAR2(32767)
INCOME_LEVEL VARCHAR2(32767)
SELECT prod_id, quantity_sold, gender, city, income_level
FROM sales_extended_ext
WHERE ROWNUM < 8;
PROD_ID QUANTITY_SOLD GENDER CITY INCOME_LEVEL
1 13 1 M Adelaide K: 250,000 – 299,999
2 13 1 M Dolores L: 300,000 and above
3 13 1 M Cayuga F: 110,000 – 129,999
4 13 1 F Bergen op Zoom C: 50,000 – 69,999
5 13 1 F Neuss J: 190,000 – 249,999
6 13 1 F Darwin F: 110,000 – 129,999
7 13 1 M Sabadell K:250,000 – 299,999
This query shows values for rows in the external table. If you want to
query this data frequently, after examining the data you can load it into a table with
DBMS_CLOUD.COPY_DATA.
Query External Data with ORC,
Parquet, or Avro Source Files and Explicitly Set the Text Column Size 🔗
When you
access ORC, Parquet, or Avro data stored in object store using external tables on Autonomous Database, you can either automatically or
explicitly set the maximum size of text columns.
By default the text column size is set based on the value of
MAX_STRING_SIZE.
The source file in this example, sales_extended.parquet,
contains Parquet format data. To query this file in Autonomous Database and set the maximum text column size, do the following:
Store your object store credentials, to access the object store, using the
procedure DBMS_CLOUD.CREATE_CREDENTIAL:
This operation stores the credentials in the database in an encrypted
format. You can use any name for the credential name. Note that this step is required
only once unless your object store credentials change. Once you store the credentials
you can then use the same credential name for creating external tables.
See CREATE_CREDENTIAL Procedure
for information about
the username and password parameters for different
object storage services.
Create an external table for ORC, Parquet, or Avro on top of your source files using
the procedure DBMS_CLOUD.CREATE_EXTERNAL_TABLE and specify the
maxvarchar format parameter.
The procedure DBMS_CLOUD.CREATE_EXTERNAL_TABLE supports external files in
the supported cloud object storage services, including: Oracle Cloud
Infrastructure Object Storage, Azure Blob Storage or Azure Data Lake
Storage, Amazon S3, and
Amazon S3-Compatible, including: Oracle Cloud
Infrastructure Object Storage, Google Cloud Storage, and Wasabi Hot Cloud Storage. The credential is a table level property; therefore, the
external files must be on the same object store.
By default, the columns created in the external table automatically map
their data types to Oracle data types for the fields found in the source files and the
external table column names match the source field names.
BEGIN
DBMS_CLOUD.CREATE_EXTERNAL_TABLE(
table_name =>'sales_extended_ext',
credential_name =>'DEF_CRED_NAME',
file_uri_list =>'https://objectstorage.us-phoenix-1.oraclecloud.com/n/namespace-string/b/bucketname/o/sales_extended.parquet',
format => '{"type":"parquet", "schema": "first", "maxvarchar":"extended"}'
);
END;
/
The parameters are:
table_name: is the external table name.
credential_name: is the name of the credential
created in the previous step. The credential_name parameter must
conform to Oracle object naming conventions. See Database Object Naming Rules for more
information.
file_uri_list: is a comma delimited list of the
source files you want to query. The URI format for Dedicated Endpoints is supported
in commercial (OC1) realms. See Object Storage Dedicated
Endpoints, and DBMS_CLOUD URI Formats for more information.
format: defines the options to describe the format of
the source file. For a Parquet file, use the format parameter to
specify the typeparquet. For an Avro file use the format parameter
to specify the typeavro. For an ORC file use the format parameter to
specify the typeorc.
The option maxvarchar with value
extended specifies that text columns are created as
varchar(32767) on an Autonomous Database instance with extended string size. The possible values are
standard with varchar(4000),
extended with varchar(32767), and
auto. The default maxvarchar value is
auto. With this value, the maximum text size is based on the
value of MAX_STRING_SIZE.
In this example, namespace-string is the Oracle
Cloud Infrastructure object storage namespace and
bucketname is the bucket
name. See Understanding
Object Storage Namespaces for more
information.
By default the formatschema parameter is set and the columns and data types are derived
automatically and the fields in the source match the external table columns by name.
Source data types are converted to the external table column Oracle data types according
to the DBMS_CLOUD mapping for ORC, Parquet, or Avro data types. The
valid schema parameter values are:
first: Analyze the schema of the first ORC, Parquet,
or Avro file that DBMS_CLOUD finds in the specified
file_uri_list (first is the default value for
schema).
all: Analyze all the schemas for all the ORC,
Parquet, or Avro files found in the file_uri_list. Because these
are simply files captured in an object store, there is no guarantee that each file’s
metadata is the same. For example, File1 may contain a field called “address”, while
File2 may be missing that field. Examining each file to derive the columns is a bit
more expensive but may be required if the first file does not contain all the
required fields.
Note
If the column_list
parameter is specified, then you provide the column names and data types for the
external table and the schema value, if specified is ignored. Using
column_list you can limit the columns in the external table. If
column_list is not specified then the schema default
value is first.
You can now run queries on the external table you created in the previous step:
DESC sales_extended_ext;
Name Null? Type
-------------- ----- --------------
PROD_ID NUMBER(10)
CUST_ID NUMBER(10)
TIME_ID VARCHAR2(32767)
CHANNEL_ID NUMBER(10)
PROMO_ID NUMBER(10)
QUANTITY_SOLD NUMBER(10)
AMOUNT_SOLD NUMBER(10,2)
GENDER VARCHAR2(32767)
CITY VARCHAR2(32767)
STATE_PROVINCE VARCHAR2(32767)
INCOME_LEVEL VARCHAR2(32767)
SELECT prod_id, quantity_sold, gender, city, income_level
FROM sales_extended_ext
WHERE ROWNUM < 8;
PROD_ID QUANTITY_SOLD GENDER CITY INCOME_LEVEL
1 13 1 M Adelaide K: 250,000 – 299,999
2 13 1 M Dolores L: 300,000 and above
3 13 1 M Cayuga F: 110,000 – 129,999
4 13 1 F Bergen op Zoom C: 50,000 – 69,999
5 13 1 F Neuss J: 190,000 – 249,999
6 13 1 F Darwin F: 110,000 – 129,999
7 13 1 M Sabadell K:250,000 – 299,999
This query shows values for rows in the external table. If you want to
query this data frequently, after examining the data you can load it into a table with
DBMS_CLOUD.COPY_DATA.
If you specify the format option maxvarchar as
standard, the varchar2() text columns are created
with size 4000. For example:
BEGIN
DBMS_CLOUD.CREATE_EXTERNAL_TABLE(
table_name =>'sales_extended_ext',
credential_name =>'DEF_CRED_NAME',
file_uri_list =>'https://objectstorage.us-phoenix-1.oraclecloud.com/n/namespace-string/b/bucketname/o/sales_extended.parquet',
format => '{"type":"parquet", "schema": "first", "maxvarchar":"standard"}'
);
END;
/
The format parameter with the option
maxvarchar set to the value standard specifies that
text columns are created as varchar(4000).
DESC sales_extended_ext;
Name Null? Type
-------------- ----- --------------
PROD_ID NUMBER(10)
CUST_ID NUMBER(10)
TIME_ID VARCHAR2(4000)
CHANNEL_ID NUMBER(10)
PROMO_ID NUMBER(10)
QUANTITY_SOLD NUMBER(10)
AMOUNT_SOLD NUMBER(10,2)
GENDER VARCHAR2(4000)
CITY VARCHAR2(4000)
STATE_PROVINCE VARCHAR2(4000)
INCOME_LEVEL VARCHAR2(4000)