SELECT

Syntax

SELECT { <expression> [ [AS] <field_name>[::<column_type>] ] 
       | CAST(<expression> AS <column_type>) } [,...]
    FROM { <catalog>.<schema>.<resource>
         | <identifier> } 
              [ [AS] <alias> ]
    [ LET <identifier> = <expression> [, ...] ]
    WHERE time_filter([timeColumn = {$commit_time | $event_time},
                       windowSize = <duration>,
                       windowEndTime = <timestamp>])
    [ AND <where_filters> ]
    [ GROUP BY { <field_name> | <ordinal_position> } [, ...] ]
    [ LET <identifier> = <expression> [, ...] ]
    [ HAVING <having_filters> ]
    [ LIMIT <integer> ]

Jump to

LET

The LET clause allows you to create a calculated field that can be used within your query. You can create multiple fields by separating each expression with a comma (e.g. LET a = b + c, d = e - f).

The first LET clause can only be used for calculations on raw columns while the second LET clause can include calculations based on aggregated columns.

These clauses can be particularly useful for certain use cases when working with arrays. For examples, see: Working with arrays.

Note that since these fields do not exist within the source table, the column types must be provided if you select them as columns within your query.

WHERE

The WHERE clause lets you filter out records based on predicates you can define on the row's values.

In addition to any custom filtering you may want to apply in the WHERE clause, in a job, it is necessary to include a condition using the time_filter() function to filter records based on the data's time range.

The time_filter() essentially acts as a predicate pushdown during job executions, ensuring Upsolver only processes the source data in the range required by the current execution.

The time_filter() function allows flexible filtering based on time-related criteria. The function takes optional parameters:

  • timeColumn: The time column to filter on

    • Only $commit_time and $event_time are supported as arguments to this parameter

    • Sync jobs will use $event_time by default

    • Unsynced jobs will use $commit_time by default

  • windowSize: The data range - in minutes - to scan. This value is subtracted from the windowEndTime to determine the data range.

    • Must be at least 1 minute

    • Defaults to RUN_INTERVAL option of a job

    • Defaults to 1 minute when running a SELECT statement

  • windowEndTime: The end timestamp of the window

    • In a job, this value will be auto-populated based on the current data window being processed. It cannot be set manually in a job.

    • When executing a SELECT statement, the default value can be overridden by passing a specific value to the parameter. Only literal values can be used, calculation will not work.

Including the time_filter() condition is mandatory within jobs, and it ensures that the query processes the relevant data within the specified time window.

Examples

  • WHERE TIME_FILTER()- Default behavior depending on the context.

  • WHERE TIME_FILTER($event_time, interval '5' minute)- Specifies a 5-minute window based on $event_time.

  • WHERE TIME_FILTER($event_time, interval '1' minute, timestamp '2023-04-04 12:20:00') - Queries rows ingested between 12:19:00 and 12:20:00 based on $event_time.

In most situations, the default values should suffice, resulting in job definitions with a simple WHERE condition like this:

CREATE SYNC JOB my_job
AS INSERT INTO <target>
SELECT * 
  FROM <source>
  WHERE time_filter()
  
      

Current Limitations

  • The TIME_FILTER function cannot be used outside the WHERE clause.

  • It must be used as a standalone expression and cannot be used as input in other functions or operators.

Note: Prior to the introduction of the time_filter() function, a job's WHERE clause was required to use the following clause:

$commit_time between run_start_time() [ - INTERVAL '<integer>' <time_unit> ] AND run_end_time() The time_filter() function should be used in new jobs instead. However, you might see the old syntax in templates or older jobs in your account.

Aggregated outputs

When aggregating data using Upsolver, the windowSize parameter in the TIME_FILTER function is significant, determining the range of source data that will be included in each aggregation.

By default aggregated outputs use the RUN_INTERVAL as the windowSize. This means that each execution of the job will aggregate just the data since the last execution. You may want to aggregate larger ranges of data but output them more frequently.

Example

Let's say we want a job to aggregate data from the last 1 hour, but we want it to output that aggregation result every 5 minutes. To achieve this we can set a RUN_INTERVAL of 5 minutes in our job options and a windowSize of 60 minutes in our time_filter():

CREATE SYNC JOB my_job
AS INSERT INTO <target>
SELECT COUNT(*) 
  FROM <source>
  WHERE time_filter($event_time, interval 60 minutes)

GROUP BY

The GROUP BY clause specifies the column(s) used to group records together. It is typically used in conjunction with some aggregate function(s) within the selected columns.

Note that all non-aggregate columns selected must be present in the GROUP BY clause.

Additionally, all partition columns should appear within the GROUP BY clause as mapping an aggregated column to a partition column is not allowed.

HAVING

The HAVING clause can be used to filter out records based on aggregate columns.

LIMIT

The LIMIT clause can be used to limit the amount of records written to the target location.

This can be helpful when previewing the output of your job.

Last updated