Amazon S3

Job options

[ AGGREGATION_PARALLELISM = <integer> ]
[ COMMENT = '<comment>' ]
[ COMPRESSION = { NONE 
                | GZIP 
                | SNAPPY 
                | ZSTD } ]
[ COMPUTE_CLUSTER = <cluster_identifier> ]
[ DATE_PATTERN = '<date_pattern>' ]
[ END_AT = { NOW | timestamp } ]
[ FILE_FORMAT = { CSV 
                | JSON 
                | PARQUET
                | AVRO 
                | TSV } ]
[ OUTPUT_OFFSET = <integer> { MINUTE[S] | HOUR[S] | DAY[S] } ]
[ RUN_INTERVAL = <integer> { MINUTE[S] | HOUR[S] | DAY[S] } ]
[ RUN_PARALLELISM = <integer> ]
[ START_FROM = { NOW | BEGINNING | timestamp } ]

AGGREGATION_PARALLELISM — editable

Type: integer

Default: 1

(Optional) Only supported when the query contains aggregations. Formerly known as "output sharding."

COMPRESSION

Values: { NONE | GZIP | SNAPPY | ZSTD }

Default: NONE

(Optional) The compression for the output files.

DATE_PATTERN

Type: text

Default: 'yyyy/MM/dd/HH/mm'

(Optional) Upsolver uses the date pattern to partition the output in the S3 bucket. Partitioning is supported down to the minute, for example: 'yyyy/MM/dd/HH/mm'. For more options, see Java SimpleDateFormat.
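To make the partitioning concrete, the following Python sketch shows the folder path that the default pattern would produce for a given minute. It is an illustration only: the `JAVA_TO_STRFTIME` table and the `partition_path` helper are hypothetical names, they handle just the five fields that appear in the default pattern, and they are not part of Upsolver.

```python
import re
from datetime import datetime

# Hypothetical mapping from the Java date-pattern fields used in the
# default DATE_PATTERN to Python strftime directives (illustration only).
JAVA_TO_STRFTIME = {"yyyy": "%Y", "MM": "%m", "dd": "%d", "HH": "%H", "mm": "%M"}


def partition_path(java_pattern: str, ts: datetime) -> str:
    """Render a timestamp with a small subset of Java date-pattern fields."""
    strftime_pattern = re.sub(
        r"yyyy|MM|dd|HH|mm",
        lambda m: JAVA_TO_STRFTIME[m.group(0)],
        java_pattern,
    )
    return ts.strftime(strftime_pattern)


print(partition_path("yyyy/MM/dd/HH/mm", datetime(2023, 1, 1, 0, 1)))
# prints 2023/01/01/00/01
```

A coarser pattern such as 'yyyy/MM/dd' would, under the same assumptions, produce one folder per day instead of one per minute.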

FILE_FORMAT

Values: { CSV | TSV | AVRO | PARQUET | JSON }

The file format for the output file. The following options can be configured for CSV and TSV formats:

CSV

FILE_FORMAT = (
    TYPE = CSV
    [ DELIMITER = '<delimiter>' ]
  )

DELIMITER

Type: text

Default: ,

(Optional) The delimiter that separates the values in the output file. For binary targets, use DELIMITER = '\u0001'.

TSV

FILE_FORMAT = (
    TYPE = TSV
    [ HEADERLESS = { TRUE | FALSE } ]
  )

HEADERLESS

Type: Boolean

Default: false

(Optional) When true, the output file is written without a header row. When false, the column names are used as the header row in the output file.

OUTPUT_OFFSET

Value: <integer> { MINUTE[S] | HOUR[S] | DAY[S] }

Default: 0

(Optional) By default, the file 2023/01/01/00/01 contains data for 2023-01-01 00:00 - 2023-01-01 00:00:59.999. OUTPUT_OFFSET shifts the file name relative to that default: setting it to 1 MINUTE moves the file name for the same data forward to 2023/01/01/00/02. To move the file name back, use a negative value.
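The arithmetic can be sketched in Python as follows. This is a simplified model of the behavior described above, not Upsolver's implementation: it assumes one-minute windows and the default date pattern, and `output_file_name` is a hypothetical helper.

```python
from datetime import datetime, timedelta


def output_file_name(window_start: datetime,
                     offset: timedelta = timedelta(0)) -> str:
    """Model the file name for a one-minute window (illustration only).

    With no offset, the window that starts at 00:00 is written to the
    file named after 00:01, i.e. the end of the window. The offset is
    then added on top of that.
    """
    window_end = window_start + timedelta(minutes=1)
    return (window_end + offset).strftime("%Y/%m/%d/%H/%M")


start = datetime(2023, 1, 1, 0, 0)
print(output_file_name(start))                         # 2023/01/01/00/01
print(output_file_name(start, timedelta(minutes=1)))   # 2023/01/01/00/02
print(output_file_name(start, timedelta(minutes=-1)))  # 2023/01/01/00/00
```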

Location Options

LOCATION

Type: text

The target location to write files to, as a full S3 URI. The location URI pattern can include macros that refer to data columns, which allows custom partitioning of the data in the target location.

Supported macros:

Time: {time:<date-pattern>}
This macro is replaced with the job execution time at runtime. The date pattern must use Java's date formatting syntax. Only a single time macro can be used in the location.

Column: {col:<column-name>}
This macro is replaced with the value of the given column. The column must appear in the select statement of the job.

Shard: {shard:<format>}
This macro is replaced with the number of the output shard that writes the current file. It is important to include it in your pattern if you are using RUN_PARALLELISM; otherwise, each shard will overwrite the same file. The supported format is a subset of Java's string format syntax. The supported options are:

1. %0xd: pads the shard number with leading zeros to a width of x digits. For example, %05d results in 00001 for shard number 1.
2. %d: uses the shard number with no padding.

Padding is usually recommended because it keeps the output files in shard order when they are sorted alphabetically.
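The effect of padding on file ordering can be seen with a short Python sketch. Python's `%` operator happens to accept the same two format forms, so it is used here purely as a stand-in; the shard numbers are made up for the illustration.

```python
# Format three example shard numbers with and without zero padding.
padded = ["%05d" % shard for shard in (1, 2, 10)]
unpadded = ["%d" % shard for shard in (1, 2, 10)]

print(padded)            # ['00001', '00002', '00010']
print(sorted(padded))    # ['00001', '00002', '00010']
print(sorted(unpadded))  # ['1', '10', '2']
```

With padding, alphabetical order matches numeric order; without it, shard 10 sorts before shard 2.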

If the location provided ends with a / and contains no date pattern, a default date pattern is added to the end of the path.

Example location URI: s3://my-bucket/some/prefix/{time:yyyy-MM-dd-HH-mm}/{col:country}/output.json
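As a rough illustration of how such a URI might resolve at runtime, the Python sketch below expands the three macro types for one row. It is a simplified model under stated assumptions, not Upsolver's implementation: `expand_location` and `JAVA_TO_STRFTIME` are hypothetical names, only the yyyy, MM, dd, HH and mm date fields are handled, and the row values are invented.

```python
import re
from datetime import datetime

# Hypothetical mapping for a small subset of Java date-pattern fields.
JAVA_TO_STRFTIME = {"yyyy": "%Y", "MM": "%m", "dd": "%d", "HH": "%H", "mm": "%M"}


def expand_location(template, run_time, row, shard=None):
    """Replace {time:...}, {col:...} and {shard:...} macros in a location."""

    def replace(match):
        kind, arg = match.group(1), match.group(2)
        if kind == "time":
            # Translate the Java-style pattern, then format the run time.
            fmt = re.sub(r"yyyy|MM|dd|HH|mm",
                         lambda m: JAVA_TO_STRFTIME[m.group(0)], arg)
            return run_time.strftime(fmt)
        if kind == "col":
            # Look up the column value in the current row.
            return str(row[arg])
        # kind == "shard": arg is a format such as %05d or %d.
        return arg % shard

    return re.sub(r"\{(time|col|shard):([^}]+)\}", replace, template)


uri = "s3://my-bucket/some/prefix/{time:yyyy-MM-dd-HH-mm}/{col:country}/output.json"
print(expand_location(uri, datetime(2023, 1, 1, 0, 1), {"country": "US"}))
# prints s3://my-bucket/some/prefix/2023-01-01-00-01/US/output.json
```

In this model, rows with different country values resolve to different prefixes, which is how a column macro yields custom partitioning.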
