CREATE ICEBERG TABLE
Syntax
Jump to
Table identifier
Table identifiers are provided in the following format:
Note that only metastore connection types are accepted for the catalog name.
Valid table names match the following identifier format:
Column types
Partition clause
For each value within the column names provided, a partition is created in the table's underlying metastore.
These partitions allow you query only the files belonging to an individual partition rather than all the existing files. This can improve query efficiency when you filter based on a specific partition value.
A table's partitions can only be defined when the table is first created. The accepted column types are string
, bigint
, Boolean
, and date
.
Note that if this is a staging table that you intend on using to ingest external data using the COPY FROM
command, the column names you partition by here should match the names of the field as they arrive exactly. Additionally, these columns should be in the root and not in a subfield.
If this is a target table that you intend on writing an aggregated output to, the partition column should be one of the aggregation key columns and not an aggregated column.
For more information, see Working with partition indexes.
Primary key clause
A table's primary key column(s) contains the values that uniquely identify each row.
When using an INSERT
job to write to a table, values are inserted or updated based on the primary keys. This means that when writing to a table with no primary keys defined, the job only ever appends the new data that arrives; no existing values are updated.
In order to use a table in a MERGE
job, at least one primary key column should be defined.
Additionally, if this is a staging table that you intend on using to ingest external data with COPY FROM
, you should not define any primary key columns.
Order by clause
The data within a partition can be sorted by column(s) in order to gain performance. To do this, include the ORDER KEY
clause, followed by the name of the column(s).
For more information, please refer to the Apache Iceberg documentation.
The ORDER BY
clause allows Upsolver to sort data within partitions by columns to gain performance.
The sort order is defined by a list of sort fields. The order of the sort fields within the list defines the order in which the sort is applied to the data. Each sort field has a sort direction, that can only be either ASC
or DESC
.
Table options
Jump to
COMMENT
— editable
COMMENT
— editableType: text
(Optional) A description or comment regarding this table.
STORAGE_CONNECTION
STORAGE_CONNECTION
Type: identif
ier
Default: Default storage connection configured for the metastore connection this table is created under
(Optional) The storage connection associated with the STORAGE_LOCATION
for the table's underlying files.
Only a storage type connection can be used here (e.g. S3, Blob storage, GCS, Oracle object storage), and it should match the catalog's metastore. For example, if Glue is used as the metastore, only S3 is allowed as a storage connection.
When set, STORAGE_LOCATION
must be configured as well to provide a path to store the data.
STORAGE_LOCATION
STORAGE_LOCATION
Type: text
Default: Default storage location configured for the metastore connection this table is created under
(Optional) The storage location for the table's underlying files.
For S3, it should be provided in the format s3://bucket_name/path_to_data
.
This option is required when STORAGE_CONNECTION
is set.
When set, STORAGE_CONNECTION
must be configured as well to provide a connection with access to write to the specified storage location.
COMPUTE_CLUSTER
— editable
COMPUTE_CLUSTER
— editable Type: identif
ier
Default: The sole cluster in your environment
(Optional) The compute cluster that processes the table.
This option can only be omitted when there is just one cluster in your environment. You must specify which one to use when you have more than one compute cluster.
DISABLE_COMPACTION
— editable
DISABLE_COMPACTION
— editableType: Boolean
Default: false
(Optional) When true
, disables the compaction process.
RETENTION_COLUMN
RETENTION_COLUMN
Type: identifier
(Optional) Data will be deleted from the table based on values in this column.
The column type must be DATE
, TIMESTAMP
, TIMESTAMPTZ
, LONG
, or INT
.
For performance and cost reasons, it is preferable that the RETENTION_COLUMN
is part of the partition columns defined for the table. However, in some use cases, you may want to select a retention column that is not part of your partition columns. In such cases, deletion occurs at the file level. Therefore, data will be deleted only once all rows in the file have passed the RETENTION_DURATION
.
If you select a column of type LONG
or INT
, Upsolver will automatically detect the unit (seconds, milliseconds, microseconds, minutes, days) in which the data is stored.
This option is stored in the Iceberg table properties under the key 'upsolver.retention.column'
.
RETENTION_DURATION
— editable
RETENTION_DURATION
— editableValue: <integer> DAYS
(Optional) Data must be at least this many days old to be deleted. Number of days can range between 1 and 9999.
This option functions similarly to lifecycle policies on common blob storage services, such as Amazon S3, and is designed to save storage costs. While it does not delete data immediately upon surpassing the defined threshold, it ensures that data will be deleted at some point in the future. Once data passes the retention period, it will be scheduled for deletion, and although the exact time may vary, Upsolver aims to complete this process within a reasonable timeframe.
This option is stored in the Iceberg table properties under the key 'upsolver.retention.duration'
.
ICEBERG_TABLE_PROPERTIES
ICEBERG_TABLE_PROPERTIES
You can configure any Iceberg table property as documented in the Apache Iceberg table properties documentation.
iceberg.<some.iceberg.property> = '<property_value>'
Note:
The prefix
iceberg.
should be added before the name of the property.Properties which include the character
"-"
should be written within double quotes. For example:iceberg."read.split.planning-lookback" = '10'
Iceberg table properties are strings, so the
<property_value>
should be enclosed in apostrophes.
Examples
Minimum example
Partitioned table example
Basic example
Last updated