Amazon S3
Use the following options to configure jobs writing to Amazon S3.
[ AGGREGATION_PARALLELISM = <integer> ]
[ COMMENT = '<comment>' ]
[ COMPRESSION = { NONE
| GZIP
| SNAPPY
| ZSTD } ]
[ COMPUTE_CLUSTER = <cluster_identifier> ]
[ DATE_PATTERN = '<date_pattern>' ]
[ END_AT = { NOW | <timestamp> } ]
[ FILE_FORMAT = { CSV
| JSON
| PARQUET
| AVRO
| TSV } ]
[ OUTPUT_OFFSET = <integer> { MINUTE[S] | HOUR[S] | DAY[S] } ]
[ RUN_INTERVAL = <integer> { MINUTE[S] | HOUR[S] | DAY[S] } ]
[ RUN_PARALLELISM = <integer> ]
[ START_FROM = { NOW | BEGINNING | <timestamp> } ]
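For orientation, here is a minimal sketch of a job that combines several of these options. It assumes the CREATE JOB ... AS INSERT INTO <connection> LOCATION = ... form; the connection, bucket, and source table names are hypothetical:
CREATE SYNC JOB write_events_to_s3
    START_FROM = BEGINNING
    COMPRESSION = GZIP
    FILE_FORMAT = (TYPE = CSV)
    DATE_PATTERN = 'yyyy/MM/dd/HH/mm'
    RUN_INTERVAL = 1 MINUTE
    COMMENT = 'Write events to S3 as gzipped CSV'
AS INSERT INTO my_s3_connection LOCATION = 's3://my-bucket/events/'
    SELECT * FROM my_source_table;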
Amazon S3 job options: COMPRESSION, DATE_PATTERN, FILE_FORMAT, OUTPUT_OFFSET
General job options: AGGREGATION_PARALLELISM, COMMENT, COMPUTE_CLUSTER, END_AT, RUN_INTERVAL, RUN_PARALLELISM, START_FROM
COMPRESSION
Values:
{ NONE | GZIP | SNAPPY | ZSTD }
Default:
NONE
(Optional) The compression for the output files.
DATE_PATTERN
Type:
text
Default:
'yyyy/MM/dd/HH/mm'
(Optional) Upsolver uses the date pattern to partition the output on the S3 bucket. Upsolver supports partitioning up to the minute, for example: 'yyyy/MM/dd/HH/mm'. For more options, see: Java SimpleDateFormat.
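For example, to partition hourly instead, only the pattern changes. The sketch below assumes a LOCATION ending in / with no date pattern of its own (see LOCATION below), so the pattern is appended to the path; the bucket and prefix are hypothetical:
DATE_PATTERN = 'yyyy/MM/dd/HH'
-- a file for 2023-01-01 13:00 would land under a key such as
-- s3://my-bucket/output/2023/01/01/13/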
FILE_FORMAT
Values:
{ CSV | TSV | AVRO | PARQUET | JSON }
The file format for the output file. The following options can be configured for the CSV and TSV formats:
FILE_FORMAT = (
TYPE = CSV
[ DELIMITER = '<delimiter>' ]
)
DELIMITER
Type:
text
Default:
,
(Optional) Configures the delimiter that separates the values in the output file. For binary targets, use a Unicode escape:
DELIMITER = '\u0001'
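For example, a sketch of a pipe-delimited CSV output:
FILE_FORMAT = (
TYPE = CSV
DELIMITER = '|'
)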
FILE_FORMAT = (
TYPE = TSV
[ HEADERLESS = { TRUE | FALSE } ]
)
HEADERLESS
Type:
boolean
Default:
false
(Optional) When true, the output file is written without a header row; when false (the default), the column names are used as the header row.
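For example, a sketch of a TSV output written without a header row:
FILE_FORMAT = (
TYPE = TSV
HEADERLESS = TRUE
)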
OUTPUT_OFFSET
Value:
<integer> { MINUTE[S] | HOUR[S] | DAY[S] }
Default:
0
(Optional) By default, the file 2023/01/01/00/01 contains the data for 2023-01-01 00:00:00.000 - 2023-01-01 00:00:59.999. Setting OUTPUT_OFFSET to 1 MINUTE adds one minute to the file name, so the data for that first minute is written to a file ending in 02 instead. Use negative values to move the file name back.
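For example, a sketch that shifts file names forward by one minute, following the example above:
OUTPUT_OFFSET = 1 MINUTE
-- the data for 2023-01-01 00:00:00.000 - 00:00:59.999 is now written to
-- .../2023/01/01/00/02 rather than .../2023/01/01/00/01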
LOCATION
Type:
text
The target location to write files to, as a full S3 URI. The location URI pattern can include macros that refer to data columns, allowing custom partitioning of the data in the target location.
Supported macros:
Time:
{time:<date-pattern>}
This macro is replaced with the job execution time at runtime. The date pattern provided must use Java's date formatting syntax. Only a single time macro can be used in the location.
Column:
{col:<column-name>}
This macro is replaced with the value of the column provided. The column provided must appear in the SELECT statement of the job.
Shard:
{shard:<format>}
This macro is replaced by the number of the output shard writing the current file. It is important to use this macro as part of your pattern if you are using RUN_PARALLELISM; otherwise, each shard will overwrite the same file.
The supported format is a subset of Java's string format syntax. The supported options are either:
1. %0xd - Results in a shard number padded with x-1 leading 0's. For example, %05d results in 00001 for shard number 1.
2. %d - Uses the shard number with no padding.
It is usually recommended to include padding to ensure alphabetical sorting of the output files.
If the location provided ends with a / and contains no date pattern, a default date pattern is added to the end of the path.
Example location URI:
s3://my-bucket/some/prefix/{time:yyyy-MM-dd-HH-mm}/{col:country}/output.json
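And a sketch that combines the time and shard macros so that parallel shards do not overwrite one another; the bucket, prefix, and parallelism value are hypothetical:
LOCATION = 's3://my-bucket/events/{time:yyyy-MM-dd-HH}/part-{shard:%05d}.csv'
-- with RUN_PARALLELISM = 4, shards write part-00001.csv through part-00004.csv
-- into each hourly folder instead of overwriting a single file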