
Amazon S3 options

Syntax

CREATE [SYNC] JOB <job_name>
[{ job_options }]
AS COPY FROM S3
<connection_identifier>
[{ source_options }]
INTO <table_identifier>;

Job options

The following properties can be configured when ingesting data from Amazon S3.
To find data quickly and efficiently in Amazon S3, SQLake needs to list the files in the specified bucket and prefixes. SQLake uses the folder structure and naming, i.e. partitions, to optimize reading data before it is processed. When reading events, SQLake organizes them on a timeline, which it uses to synchronize between jobs and ensure data is joined and aggregated correctly.

Storing raw data in date-formatted partitions such as /bucket/data/2022/11/01/ therefore enables SQLake to optimize and speed up ingestion. To better understand how SQLake reads and processes data using date partitions, see Working with date patterns below.
[ FILE_PATTERN = '<pattern>' ]
[ DELETE_FILES_AFTER_LOAD = { TRUE | FALSE } ]
[ END_AT = { NOW | <timestamp> } ]
[ COMPUTE_CLUSTER = <cluster_identifier> ]
[ RUN_PARALLELISM = <integer> ]
[ CONTENT_TYPE = { AUTO
| CSV
| JSON
| PARQUET
| TSV
| AVRO
| AVRO_SCHEMA_REGISTRY
| FIXED_WIDTH
| REGEX
| SPLIT_LINES
| ORC
| XML } ]
[ COMPRESSION = { AUTO
| GZIP
| SNAPPY
| LZO
| NONE
| SNAPPY_UNFRAMED
| KCL } ]
[ COMMENT = '<comment>' ]
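
For example, here is a hedged sketch of a job that sets several of these options together. The connection name my_s3_connection, the bucket, and the target table are assumptions for illustration; substitute your own identifiers:

CREATE SYNC JOB load_raw_orders
    CONTENT_TYPE = JSON                   -- source files are JSON
    COMPRESSION = GZIP                    -- source files are gzipped
    RUN_PARALLELISM = 5                   -- run 5 parser jobs in parallel per minute
    COMMENT = 'Ingest raw orders from S3'
AS COPY FROM S3 my_s3_connection
    BUCKET = 'my-data-bucket'
    PREFIX = 'orders/'
INTO my_catalog.staging.orders_raw;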

FILE_PATTERN

Type: text
Default: ''
(Optional) Only files that match the provided regex pattern are loaded.
This is a way to filter out data that is not relevant. For example, you could filter by suffix to only keep .parquet files in a folder that may have some additional files that should not be ingested.
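For instance, a hedged one-line illustration that keeps only Parquet files (adjust the regex to your own file naming):

FILE_PATTERN = '.*\.parquet$'

This filters out marker or temporary files, such as _SUCCESS, that some writers leave alongside the data.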

DELETE_FILES_AFTER_LOAD

Type: boolean
Default: false
(Optional) When true, files are deleted from the storage source once they have been ingested into the target location within your metastore.
This allows Upsolver to discover new files immediately, regardless of how many files there are in the source, or what file names and patterns are used.

END_AT — editable

Values: { NOW | <timestamp> }
Default: Never
(Optional) Configures the time to stop ingesting data. Files after the specified time are ignored. Timestamps provided should be based on UTC.

COMPUTE_CLUSTER — editable

Type: identifier
Default: The sole cluster in your environment
(Optional) The compute cluster to run this job.
This option can only be omitted when there is just one cluster in your environment.
Once you have more than one compute cluster, you must specify which one to use through this option.

RUN_PARALLELISM — editable

Type: integer
Default: 1
(Optional) The number of parser jobs to run in parallel per minute.

CONTENT_TYPE — editable

Values: { AUTO | CSV | JSON | PARQUET | TSV | AVRO | AVRO_SCHEMA_REGISTRY | FIXED_WIDTH | REGEX | SPLIT_LINES | ORC | XML }
Default: AUTO
(Optional) The file format of the content being read.
Note that AUTO only works when reading Avro, JSON, or Parquet.
To configure additional options for certain content types, see: Content type options

COMPRESSION

Values: { AUTO | GZIP | SNAPPY | LZO | NONE | SNAPPY_UNFRAMED | KCL }
Default: AUTO
(Optional) The compression of the source.

COMMENT — editable

Type: text
(Optional) A description or comment regarding this job.

Working with date patterns

Before you create the landing zone where raw data will be staged for SQLake to ingest, be aware of a couple of things:
  1. Date values should be lexicographically ordered and include a leading zero when applicable. For example, prefer /2022/02/04 for February 4th, 2022, over /22/2/4.
  2. Avoid including a dynamically changing prefix, like an application ID or job cluster ID, as the first part of your prefix. If you need to include it, add it at the end. For example, prefer /2022/02/04/X34TFA2 over /X34TFA2/2022/02/04.
To ingest date-partitioned S3 data, configure the DATE_PATTERN property of your ingestion job. For Apache Hive style partitions, each value is prefixed with a keyword representing its meaning, for example /year=2022/month=02/day=04/. To support this format, include the prefixes in the DATE_PATTERN property surrounded by two single quotes (not a double quote). Using the above example, the DATE_PATTERN would be ''year=''yyyy''/month=''MM''/day=''dd. As you can see, the string literals in the pattern, like ''year='' and ''month='', are wrapped in two single quote characters, not double quotes.

If there is a single quote in your folder path that needs to be represented in the path literal, surround it with two single quotes. For example, year('0')=2000 would be represented as '''year(''''0'''')=''yyyy/'. This can be confusing, so make sure you are quoting the string literal parts of the pattern and not the date value patterns.

Note that when there is no date pattern defined, you cannot define a time to start ingestion from; all available data is ingested by default.
[ DATE_PATTERN = '<date_pattern>' ]
[ START_FROM = { NOW | BEGINNING | <timestamp> } ]
[ INITIAL_LOAD_PREFIX = '<prefix>' ]
[ INITIAL_LOAD_PATTERN = '<pattern>' ]
[ MAX_DELAY = <integer> { MINUTE[S] | HOUR[S] | DAY[S] } ]
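
Putting the quoting rules above into practice, here is a hedged sketch of a job that reads Hive-style partitions such as /data/year=2022/month=02/day=04/. The connection, bucket, and table names are illustrative:

CREATE SYNC JOB load_hive_partitioned
    DATE_PATTERN = '''year=''yyyy''/month=''MM''/day=''dd'  -- literals wrapped in '' inside the SQL string
    START_FROM = BEGINNING
    MAX_DELAY = 2 HOURS
AS COPY FROM S3 my_s3_connection
    BUCKET = 'my-bucket'
    PREFIX = 'data/'
INTO my_catalog.staging.events;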


DATE_PATTERN

Type: text
(Optional) The date pattern of the partitions on the S3 bucket to read from. SQLake supports reading from buckets partitioned up to the minute.
Example: 'yyyy/MM/dd/HH/mm'
For more options, see: Java SimpleDateFormat
When you set a DATE_PATTERN, SQLake uses the date in the folder path to understand when new files are added. The date in the path is used to process data in order of arrival, as well as set the $source_time and $event_time system columns used to keep jobs synchronized. If files are added into a folder named with a future date, these files will not be ingested until that date becomes the present.
If you don't set a DATE_PATTERN, SQLake will list and ingest files in the ingest job's BUCKET and PREFIX location as soon as they are discovered. Historical data will also be processed as soon as it is added and discovered by SQLake. To discover new files when a DATE_PATTERN is not set, SQLake lists the top-level prefix and performs a diff to detect newly added files. It then lists the paths adjacent to these new files, on the assumption that if a file was added there, others will be as well. This process runs at regular intervals to make sure files are not missed.

For buckets with few files and predictable changes, this works well. However, for buckets with many changes across millions of files and hundreds of prefixes, the scanning and diffing process may result in ingestion and processing delays. To optimize this process, consider setting the COPY FROM job's DELETE_FILES_AFTER_LOAD property to TRUE, which moves ingested files to another staging location, leaving the source folder empty and making it easier and faster for SQLake to discover new files. Be aware that configuring SQLake to move ingested files could impact other systems if they depend on the same raw files.
To troubleshoot jobs that ingest data, you can query the task execution system table and inspect whether 0 bytes of data were read in the "ingest data" stage, or whether SQLake is throwing parse errors in the "parse data" stage. If 0 bytes were read, your job is configured correctly but there is no new data. If you see parse errors, you can narrow the cause down to either a misconfiguration of the job or bad data.
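As a sketch only: assuming your environment exposes task history through a system table (the table and column names below are hypothetical; check the system tables available in your SQLake environment for the actual schema), such a check might look like:

SELECT stage_name,     -- hypothetical column: e.g. 'ingest data', 'parse data'
       bytes_read,     -- hypothetical column: 0 here means no new data arrived
       error_message   -- hypothetical column: populated on parse errors
FROM system_task_executions   -- hypothetical table name
WHERE job_name = 'load_raw_orders'
ORDER BY start_time DESC
LIMIT 10;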

START_FROM

Values: { NOW | BEGINNING | <timestamp> }
Default: BEGINNING
(Optional) Configures the time to start ingesting data from. Files before the specified time are ignored. Timestamps provided should be based on UTC.
When a DATE_PATTERN is not specified, configuring this option is not allowed. By default, all data available is ingested.
If the DATE_PATTERN is not lexicographically ordered, then this option cannot be set to BEGINNING.

INITIAL_LOAD_PREFIX

Type: text
(Optional) Any file matching this prefix is immediately loaded when the job is run.

INITIAL_LOAD_PATTERN

Type: text
(Optional) Any file matching this regex pattern is immediately loaded when the job is run.
This loads data separately from the date pattern and is primarily used in CDC use cases, where you load some initial files named LOAD00001, LOAD00002, etc. After that, all the data has a date pattern in the file name.
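A hedged CDC-style sketch combining an initial load with a date pattern (all names and patterns are illustrative):

CREATE SYNC JOB load_cdc_changes
    DATE_PATTERN = 'yyyy/MM/dd/HH/mm'
    INITIAL_LOAD_PATTERN = 'LOAD[0-9]+.*'  -- full-load files ingested immediately, outside the date pattern
AS COPY FROM S3 my_s3_connection
    BUCKET = 'my-cdc-bucket'
    PREFIX = 'cdc/orders/'
INTO my_catalog.staging.cdc_orders;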

MAX_DELAY

Value: <integer> { MINUTE[S] | HOUR[S] | DAY[S] }
Default: 1 DAY
(Optional) Configures how far back to check for new files. The larger this value, the more list operations are used.

Source options

The source options below configure the location from which to read the data.
If there are multiple folders within the given location, the data from all folders is copied into the staging table specified by the job.
The schema of the final staging table is the union of all the schemas found within the source data.
BUCKET = '<bucket_name>'
[ PREFIX = '<prefix>' ]

BUCKET

Type: text
The bucket to read from.

PREFIX

Type: text
Default: ''
(Optional) The file prefix within the specified bucket to read from.
By default, all data within all folders of the specified BUCKET is read.

Example
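
A minimal end-to-end sketch, assuming an existing S3 connection named my_s3_connection and a staging table my_catalog.staging.raw_events (all identifiers are illustrative; substitute your own):

CREATE SYNC JOB load_raw_events
    CONTENT_TYPE = AUTO                   -- auto-detect Avro, JSON, or Parquet
    DATE_PATTERN = 'yyyy/MM/dd'           -- bucket is date-partitioned, e.g. /raw/events/2022/11/01/
    START_FROM = BEGINNING                -- ingest all historical data
    COMMENT = 'Load date-partitioned raw events from S3'
AS COPY FROM S3 my_s3_connection
    BUCKET = 'my-data-bucket'
    PREFIX = 'raw/events/'
INTO my_catalog.staging.raw_events;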