Apache Iceberg

These job options are used when writing to Upsolver-managed Apache Iceberg tables.

Tables created within Upsolver using your metastore connection are considered Upsolver-managed tables. These tables can still be queried externally. For example, you can create an AWS Glue Data Catalog table within Upsolver and then query it either from Upsolver itself or from your Athena console.

Job options

[ ADD_MISSING_COLUMNS = { TRUE | FALSE } ]
[ COMMENT = '<comment>' ]
[ COMPUTE_CLUSTER = <cluster_identifier> ]
[ END_AT = { NOW | timestamp } ]
[ FLATTEN_PATHS = (<array_path> [, ...]) ]
[ RUN_INTERVAL = <integer> { MINUTE[S] | HOUR[S] | DAY[S] } ]
[ START_FROM = { NOW | BEGINNING | timestamp } ]

ADD_MISSING_COLUMNS

Type: Boolean

Default: false

(Optional) When true, columns that don't exist in the target table are added automatically when encountered.

When false, you cannot use SELECT * in the SELECT statement of your transformation job.

ON_COLUMN_TYPE_MISMATCH

Type: String

Default: None

Possible values: Add Column, None

This option is applicable only if ADD_MISSING_COLUMNS=true. It determines how to handle cases where the datatype of the source data does not match the datatype of the corresponding column in the target table.

If set to Add Column, Upsolver will attempt to cast the incoming data to the original column's datatype. If the cast fails, a new column named in the format <originalColumnName>_<newDataType> will be created, and the mismatched data will be written to this new column. For example, if the CLIENT_ID column is a number in the target table and a VARCHAR value arrives in that column, a new column called CLIENT_ID_VARCHAR will be added for the string data. The original column will continue to be populated whenever the data can be cast successfully.

If set to None, no new columns will be added, and only valid casts will be written to the original column.
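
The two modes described above can be illustrated with a small Python model. This is only a sketch of the documented behavior, not Upsolver's implementation: the function name write_value, the TYPE_NAMES mapping, and the use of Python types to stand in for column datatypes are all assumptions made for the example. It also assumes that a value that fails to cast simply leaves the original column unset for that record.

```python
from typing import Any, Dict

# Hypothetical mapping from Python types to the datatype suffix used in
# the new column name (e.g. CLIENT_ID_VARCHAR). Assumed for illustration.
TYPE_NAMES = {int: "BIGINT", float: "DOUBLE", str: "VARCHAR", bool: "BOOLEAN"}


def write_value(
    schema: Dict[str, type],
    row: Dict[str, Any],
    column: str,
    value: Any,
    on_mismatch: str = "None",
) -> None:
    """Write one value into `row`, modelling ON_COLUMN_TYPE_MISMATCH."""
    target_type = schema[column]
    try:
        # First attempt: cast the incoming value to the original column's type.
        row[column] = target_type(value)
        return
    except (TypeError, ValueError):
        pass  # The cast failed; fall through to the mismatch handling.

    if on_mismatch == "Add Column":
        # Route the mismatched value to <originalColumnName>_<newDataType>.
        new_column = f"{column}_{TYPE_NAMES[type(value)]}"
        schema.setdefault(new_column, type(value))
        row[new_column] = value
    # With "None", nothing is added and the original column stays unset
    # for this record (an assumption of this sketch).
```

With a schema of {"CLIENT_ID": int}, the string "42" casts cleanly and is written to CLIENT_ID, while "abc-42" fails the cast and, under Add Column, lands in a new CLIENT_ID_VARCHAR column.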

METADATA_RETENTION

Type: Integer (days)

Default: Lifetime

The METADATA_RETENTION parameter controls how long the system retains statistical data collected by the job, such as the top values for each column, the percentage of null records, and timestamps marking when a column was first and last seen.

The statistics are available to view via the relevant table on the datasets page.

Note that METADATA_RETENTION applies to statistics according to the time the data was ingested by the job, which may not align with the retention of the table data itself. For instance, data received now may belong to an older partition of the table, but statistics retention treats its ingestion time as "now", reflecting the current processing time. This distinction ensures that the statistics represent the state of the data as it enters the system, even if the data is linked to an earlier partition.
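
The distinction between ingestion time and partition date can be sketched in Python as follows. This is an illustrative model only: the ColumnStat class, the is_retained function, and the inclusive boundary check are hypothetical, and a retention of None stands in for the Lifetime default.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class ColumnStat:
    column: str
    partition_date: datetime  # partition the underlying data belongs to
    ingested_at: datetime     # when the job actually processed the data


def is_retained(stat: ColumnStat, now: datetime, retention_days: Optional[int]) -> bool:
    """Return True if the statistic is still within its retention window."""
    if retention_days is None:
        # Models the Lifetime default: statistics are never expired.
        return True
    # Only the ingestion time matters; partition_date is deliberately ignored.
    return now - stat.ingested_at <= timedelta(days=retention_days)


# Late-arriving data: it belongs to a January 2023 partition
# but was ingested in June 2024.
late = ColumnStat("client_id", datetime(2023, 1, 1), datetime(2024, 6, 1))
```

Under a 30-day retention evaluated on 10 June 2024, the statistic for this late-arriving data is still kept, because it was ingested nine days earlier, even though its partition is well over a year old.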

FLATTEN_PATHS

Type: Array<String>

Default: ()
