Upsolver SQLake
Search…
⌃K

CREATE TABLE

Create a data lake table, defining the schema and related metadata in the data lake catalog.
Unlike most databases, a table can be created without any columns. SQLake tables support dynamic column lists, meaning that the list of columns can expand based on the data being loaded.
Tables are defined and managed in a metastore, while the physical data is stored as files in a corresponding storage location. Currently, SQLake supports writing and managing metadata in the AWS Glue Data Catalog and storing data lake files in Amazon S3.
By default, the CREATE TABLE command creates a base table in your specified catalog along with a view that is used to query the data.
See also:

Syntax

Jump to

Table identifier

Table identifiers are provided in the following format:
<catalog_name>.<schema_name>.<table_name>
Note that only metastore connection types are accepted for the catalog name.
Valid table names match the following identifier format:
identifier = "([^"]|"")*"|[A-Za-z_][A-Za-z0-9_]*;

Column types

COLUMN_TYPE = { DATE
| TIMESTAMP
| BIGINT
| DOUBLE
| STRING
| BOOLEAN
| DECIMAL(precision, scale) }

Partition clause

{ PARTITIONED | PARTITION } BY <column_name> [, ...]
For each value within the column names provided, a partition is created in the table's underlying metastore.
These partitions allow you query only the files belonging to an individual partition rather than all the existing files. This can improve query efficiency when you filter based on a specific partition value.
A table's partitions can only be defined when the table is first created. The accepted column types are string, bigint, boolean, and date.
Note that if this is a staging table that you intend on using to ingest external data with COPY FROM, the column names you partition by here should match the names of the field as they arrive exactly. Additionally, these columns should be in the root and not in a subfield.
If this is a target table that you intend on writing an aggregated output to, the partition column should be one of the aggregation key columns and not an aggregated column.

Primary key clause

PRIMARY KEY <column_name> [, ...]
A table's primary key column(s) contains the values that uniquely identify each row.
When using an INSERT job to write to a table, values are inserted or updated based on the primary keys. This means that when writing to a table with no primary keys defined, the job only ever appends the new data that arrives; no existing values are updated.
In order to use a table in a MERGE job, at least one primary key column should be defined.
Additionally, if this is a staging table that you intend on using to ingest external data with COPY FROM, you should not define any primary key columns.

Table options

[ GLOBALLY_UNIQUE_KEYS = { TRUE | FALSE } ]
[ STORAGE_CONNECTION = <connection_identifier>
STORAGE_LOCATION = '<storage_location>' ]
[ COMPUTE_CLUSTER = <cluster_identifier> ]
[ COMPRESSION = { SNAPPY | GZIP } ]
[ COMPACTION_PROCESSES = <integer> ]
[ DISABLE_COMPACTION = { TRUE | FALSE } ]
[ RETENTION_DATE_PARTITION = <column_name> ]
[ TABLE_DATA_RETENTION = <integer> DAYS ]
[ COLUMN_DATA_RETENTION = ((COLUMN = <column_name>
DURATION = <integer> DAYS) [, ...]) ]
[ COMMENT = '<comment>' ]

Jump to

GLOBALLY_UNIQUE_KEYS

Type: boolean
Default: false
(Optional) By default, partition keys are implicitly part of the primary key. This means that when upserting, only rows with the same primary key and partition are replaced. This is the more performant and therefore recommended option.
However, some use cases may require that the primary keys be globally unique (in other words, unique across partitions). This means that when upserting, rows with the same primary key should be replaced, even if they belong to different partitions. Note that this also means that rows can "switch" partitions.
If such is the case, you should set this option as true.

STORAGE_CONNECTION

Type: identifier
Default: Default storage connection configured for the metastore connection this table is created under
(Optional) The storage connection associated with the STORAGE_LOCATION for the table's underlying files.
Only a storage type connection can be used here (e.g. S3, Blob storage, GCS, Oracle object storage), and it should match the catalog's metastore. For example, if Glue is used as the metastore, only S3 is allowed as a storage connection.
When set, STORAGE_LOCATION must be configured as well to provide a path to store the data.

STORAGE_LOCATION

Type: text
Default: Default storage location configured for the metastore connection this table is created under
(Optional) The storage location for the table's underlying files.
For S3, it should be provided in the format s3://bucket_name/path_to_data.
This option is required when STORAGE_CONNECTION is set.
When set, STORAGE_CONNECTION must be configured as well to provide a connection with access to write to the specified storage location.

COMPUTE_CLUSTER — editable

Type: identifier
Default: The sole cluster in your environment
(Optional) The compute cluster that processes the table.
This option can only be omitted when there is just one cluster in your environment.
Once you have more than one compute cluster, you are required to provide which one to use through this option.

COMPRESSION — editable

Values: { SNAPPY | GZIP }
Default: SNAPPY
(Optional) Type of compression for the table data.

COMPACTION_PROCESSES — editable

Type: integer
Default: 1
(Optional) This determines the number of compaction processes your table can do in parallel when it periodically compacts your data.

DISABLE_COMPACTION — editable

Type: boolean
Default: false
(Optional) When true, disables the compaction process.

RETENTION_DATE_PARTITION

Type: identifier
Default: The only partition column of type date
(Optional) This configures the partition column to use to determine whether the retention period has passed for a given record.
This option is required if you have more than one date partition column.

TABLE_DATA_RETENTION — editable

Value: <integer> DAYS
(Optional) When set, data in partitions that have passed the retention period are deleted from the table. Number of days can range between 1 and 9999.
Note that you need at least one date partition column for this to work.

COLUMN_DATA_RETENTION — editable

Type: list of (<column_name>, <integer> DAYS) pairs
(Optional) When set, after the duration of a column elapses in a partition, the data is rewritten without the contents of that column. Number of days can range between 1 and 9999.
Note that you need at least one date partition column for this to work.

COMMENT — editable

Type: text
(Optional) A description or comment regarding this table.

Examples

Minimum example

Note that the parentheses at the end can be omitted when creating a table without any columns.

Basic example

Full example

Partitioned staging table example