CREATE EXTERNAL ICEBERG TABLE

Syntax

CREATE EXTERNAL ICEBERG TABLE <table_identifier>
   [{ table_options }];

Jump to

Table identifier

Table identifiers are provided in the following format:

<catalog_name>.<schema_name>.<table_name>

Note that only metastore connection types are accepted for the catalog name.

Valid table names match the following identifier format:

identifier = "([^"]|"")*"|[A-Za-z_][A-Za-z0-9_]*;

Table options

[ COMMENT = '<comment>']
[ OPTIMIZE = { TRUE | FALSE } ]
[ RETENTION_DATE_PARTITION = <column_name> ]
[ TABLE_DATA_RETENTION = <integer> DAYS ]

Jump to

COMMENT — editable

Type: text

(Optional) A description or comment regarding this table.

OPTIMIZE — editable

Type: Boolean

Default: false

(Optional) When false, the tables will be not optimized by Upsolver. When true, files are compacted to reduce storage costs and increase query scan performance.

RETENTION_DATE_PARTITION

Type: identifier

Default: The only partition column of type date

(Optional) This configures the partition column to be used to determine whether the retention period has passed for a given record.

This option is required if you have more than one date partition column.

TABLE_DATA_RETENTION — editable

Value: <integer> DAYS

(Optional) When set, data in partitions that have passed the retention period are deleted from the table. Number of days can range between 1 and 9999.

This option is not a deterministic mechanism that deletes data when it immediately surpasses the defined threshold. This mechanism is closer to the lifecycle policies on common blob storage services, such as Amazon S3, and is designed to save storage costs, not to delete data based on a specific time. Therefore when data passes the retention period, it will be deleted at some point in the future, and can no longer be relied on to exist, though Upsolver aims to delete it within a reasonable timeframe.

You should be aware that transformation job that reads from a table with a defined data retention may or may not read data that has surpassed the retention threshold.

For example, if the current time is 2023-02-23 12:30:00 UTC, and you have defined TABLE_DATA_RETENTION = 2 days, you can expect data written during 2023-02-23, 2023-02-22, and 2023-02-21 to exist in the table. The retention threshold truncates data to the nearest day, so when the time changes to 2023-02-24 00:00:00 UTC, you can no longer expect data from 2023-02-21 to be present in the table, although it might be there for a while.

Note that you need at least one date partition column for this option to work.


Examples

Basic table

The example below creates an external Iceberg table that Upsolver manages for you:

CREATE EXTERNAL ICEBERG TABLE
    default_glue_catalog.my_database.my_external_iceberg_table
    OPTIMIZE = TRUE;

Table with data retention

The following example creates an external Iceberg table with data retention. First, we create a table named orders in our Glue catalog. The PARTITIONED BY option specifies the partition column as order_date.

Next, we create an external Iceberg column in Upsolver, with the data retention based on the order_date column. The data will be retained for 2 days:

// CREATE ICEBERG TABLE IN GLUE
CREATE TABLE my_glue_catalog.demo.orders
(
    order_id string,
    order_date date,
    customer_email string,
    net_total bigint, 
    num_items bigint
)
    PARTITIONED BY (order_date)
    LOCATION 's3://upsolver-demo/order_data/'
    TBLPROPERTIES ('table_type' = 'ICEBERG');

// CREATE EXTERNAL ICEBERG TABLE IN UPSOLVER
CREATE EXTERNAL ICEBERG TABLE default_glue_catalog.demo.orders
    RETENTION_DATE_PARTITION = order_date
    TABLE_DATA_RETENTION = 2 days;

Last updated