CREATE ICEBERG TABLE

Syntax

CREATE ICEBERG TABLE <table_identifier>
([ <column_name> <column_type> [, ...]])
    [ { PARTITIONED | PARTITION } BY <column_name> [, ...] ]
    [ PRIMARY KEY <column_name> [, ...] ]
    -- Specify sort order (default is ASC)
    [ ORDER BY <column_name> [ASC | DESC] [, ...] ]
    [{ table_options }];

Table identifier

Table identifiers are provided in the following format:

<catalog_name>.<schema_name>.<table_name>

Note that only metastore connection types are accepted for the catalog name.

Valid table names match the following identifier format:

identifier = "([^"]|"")*"|[A-Za-z_][A-Za-z0-9_]*;
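For instance (the catalog and database names below are illustrative), both unquoted and double-quoted table names match this format:

```sql
-- Unquoted: letters, digits, and underscores, starting with a letter or underscore
CREATE ICEBERG TABLE default_glue_catalog.my_database.orders_2024();

-- Double-quoted: may contain other characters; an embedded quote is written as ""
CREATE ICEBERG TABLE default_glue_catalog.my_database."daily ""raw"" orders"();
```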

Column types

COLUMN_TYPE = { DATE 
              | TIMESTAMP 
              | BIGINT 
              | DOUBLE 
              | STRING 
              | BOOLEAN 
              | DECIMAL(precision, scale) }

Partition clause

{ PARTITIONED | PARTITION } BY <column_name> [, ...]

For each distinct value in the specified columns, a partition is created in the table's underlying metastore.

These partitions allow you to query only the files belonging to an individual partition rather than all of the table's files. This can improve query efficiency when you filter on a specific partition value.

A table's partitions can only be defined when the table is first created. The accepted column types are string, bigint, Boolean, and date.

Note that if this is a staging table that you intend to use to ingest external data with the COPY FROM command, the column names you partition by must exactly match the names of the incoming fields. Additionally, these columns must be in the root of the record, not in a subfield.

If this is a target table that you intend on writing an aggregated output to, the partition column should be one of the aggregation key columns and not an aggregated column.

For more information, see Working with partition indexes.
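As a sketch (the table and column names are illustrative), a target table holding aggregated output is partitioned by one of its aggregation keys:

```sql
CREATE ICEBERG TABLE default_glue_catalog.my_database.daily_state_totals
(
    order_date date,
    customer_state string,
    total_orders bigint
)
    -- order_date is an aggregation key, not an aggregated value
    PARTITIONED BY order_date;
```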

Primary key clause

PRIMARY KEY <column_name> [, ...]

A table's primary key column(s) contains the values that uniquely identify each row.

When using an INSERT job to write to a table, values are inserted or updated based on the primary keys. This means that when writing to a table with no primary keys defined, the job only ever appends the new data that arrives; no existing values are updated.

To use a table in a MERGE job, at least one primary key column must be defined.

Additionally, if this is a staging table that you intend to use to ingest external data with COPY FROM, you should not define any primary key columns.
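For example (names are illustrative), a table intended as the target of an INSERT or MERGE job defines a primary key so that incoming rows update existing ones instead of being appended:

```sql
CREATE ICEBERG TABLE default_glue_catalog.my_database.customers
(
    customer_id string,
    customer_email string,
    last_order_date date
)
    -- Rows arriving with an existing customer_id update that row rather than append
    PRIMARY KEY customer_id;
```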

Order by clause

The data within a partition can be sorted by one or more columns to improve performance. To do this, include the ORDER BY clause, followed by the column name(s).

-- Specify sort order (default is ASC)
ORDER BY <column_name> [ASC | DESC] [, ...] 

For more information, please refer to the Apache Iceberg documentation.

"Order by" allows to sort data within partitions by columns to gain performance.

The sort order is defined by a list of sort fields. The order of the sort fields within the list defines the order in which the sort is applied to the data. Each sort field has a sort direction, that can only be either asc or desc.
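For example (names are illustrative), each partition is sorted by two fields, applied in the order listed:

```sql
CREATE ICEBERG TABLE default_glue_catalog.my_database.events
(
    event_date date,
    event_type string,
    event_count bigint
)
    PARTITIONED BY event_date
    -- Sort first by event_count descending, then by event_type ascending
    ORDER BY event_count DESC, event_type ASC;
```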

Table options

[ COMMENT = '<comment>' ]
[ STORAGE_CONNECTION = <connection_identifier>
  STORAGE_LOCATION = '<storage_location>' ]
[ COMPUTE_CLUSTER = <cluster_identifier> ]
[ DISABLE_COMPACTION = { TRUE | FALSE } ]
[ RETENTION_DATE_PARTITION = <column_name> ]
[ TABLE_DATA_RETENTION = <integer> DAYS ]
[ COLUMN_DATA_RETENTION = ((COLUMN = <column_name>
                            DURATION = <integer> DAYS) [, ...]) ]

COMMENT — editable

Type: text

(Optional) A description or comment regarding this table.

STORAGE_CONNECTION

Type: identifier

Default: Default storage connection configured for the metastore connection this table is created under

(Optional) The storage connection associated with the STORAGE_LOCATION for the table's underlying files.

Only a storage type connection can be used here (e.g. S3, Azure Blob storage, GCS, Oracle Object Storage), and it must match the catalog's metastore. For example, if AWS Glue is used as the metastore, only S3 is allowed as the storage connection.

When set, STORAGE_LOCATION must be configured as well to provide a path to store the data.

STORAGE_LOCATION

Type: text

Default: Default storage location configured for the metastore connection this table is created under

(Optional) The storage location for the table's underlying files.

For S3, it should be provided in the format s3://bucket_name/path_to_data.

This option is required when STORAGE_CONNECTION is set.

When set, STORAGE_CONNECTION must be configured as well to provide a connection with access to write to the specified storage location.
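For example (the connection name and bucket path below are hypothetical), the two options are always set together:

```sql
CREATE ICEBERG TABLE default_glue_catalog.my_database.my_table()
    STORAGE_CONNECTION = my_s3_connection
    STORAGE_LOCATION = 's3://my-bucket/iceberg/my_table';
```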

COMPUTE_CLUSTER — editable

Type: identifier

Default: The sole cluster in your environment

(Optional) The compute cluster that processes the table.

This option can only be omitted when there is just one cluster in your environment. You must specify which one to use when you have more than one compute cluster.

DISABLE_COMPACTION — editable

Type: Boolean

Default: false

(Optional) When true, disables the compaction process.

RETENTION_DATE_PARTITION

Type: identifier

Default: The only partition column of type date

(Optional) This configures the partition column to be used to determine whether the retention period has passed for a given record.

This option is required if you have more than one date partition column.

TABLE_DATA_RETENTION — editable

Value: <integer> DAYS

(Optional) When set, data in partitions that have passed the retention period is deleted from the table. The number of days can range from 1 to 9999.

This option is not a deterministic mechanism that deletes data the moment it surpasses the defined threshold. It is closer to the lifecycle policies of common blob storage services, such as Amazon S3, and is designed to save storage costs, not to delete data at a specific time. Therefore, when data passes the retention period, it will be deleted at some point in the future and can no longer be relied on to exist, though Upsolver aims to delete it within a reasonable timeframe.

Be aware that a transformation job that reads from a table with a defined data retention may or may not read data that has surpassed the retention threshold.

For example, if the current time is 2023-02-23 12:30:00 UTC and you have defined TABLE_DATA_RETENTION = 2 DAYS, you can expect data written during 2023-02-23, 2023-02-22, and 2023-02-21 to exist in the table. The retention threshold truncates to the nearest day, so when the time reaches 2023-02-24 00:00:00 UTC, you can no longer expect data from 2023-02-21 to be present in the table, although it might remain there for a while.

Note that you need at least one date partition column for this option to work.
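For example (names are illustrative), a table with two date partition columns uses RETENTION_DATE_PARTITION to choose which column drives retention:

```sql
CREATE ICEBERG TABLE default_glue_catalog.my_database.clicks
(
    event_date date,
    load_date date,
    click_count bigint
)
    PARTITIONED BY event_date, load_date
    -- Retention is evaluated against event_date, not load_date
    RETENTION_DATE_PARTITION = event_date
    TABLE_DATA_RETENTION = 30 DAYS;
```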

Examples

Minimum example

CREATE ICEBERG TABLE default_glue_catalog.my_database.my_iceberg_table();

Partitioned table example

CREATE ICEBERG TABLE default_glue_catalog.my_database.my_iceberg_table()
    PARTITIONED BY $event_date;

Basic example

CREATE ICEBERG TABLE default_glue_catalog.my_database.orders_data
(
    order_id string,
    order_date date,
    customer_email string,
    net_total bigint, 
    num_items bigint,
    customer_address string,
    customer_city string,
    customer_state string,
    customer_zipcode bigint
)
    PARTITIONED BY order_date
    PRIMARY KEY order_id
    ORDER BY net_total DESC, order_date ASC
    STORAGE_CONNECTION = s3_connection
    STORAGE_LOCATION = 's3://bucket/storage_location'
    COMPUTE_CLUSTER = "my cluster"
    DISABLE_COMPACTION = false
    COMMENT = 'Orders table';
