SELECT

Syntax

SELECT { <expression> [ [AS] <field_name>[::<column_type>] ] 
       | CAST(<expression> AS <column_type>) } [,...]
    FROM { <catalog>.<schema>.<resource>
         | <identifier> } 
              [ [AS] <alias> ]
    [ LET <identifier> = <expression> [, ...] ]
    WHERE time_filter() [ AND <where_filters> ]
    [ GROUP BY { <field_name> | <ordinal_position> } [, ...] ]
    [ LET <identifier> = <expression> [, ...] ]
    [ HAVING <having_filters> ]
    [ LIMIT <integer> ]
  • WHERE TIME_FILTER()- Default behavior depending on the context.

  • WHERE TIME_FILTER($event_time, interval '5' minute)- Specifies a 5-minute window based on $event_time.

  • WHERE TIME_FILTER($event_time, interval '1' minute, timestamp '2023-04-04 12:20:00') - Queries rows ingested between 12:19:00 and 12:20:00 based on $event_time.

In most situations, the default values should suffice, resulting in job definitions with a simple WHERE condition like this:

CREATE SYNC JOB my_job
AS INSERT INTO <target>
SELECT * 
  FROM <source>
  WHERE time_filter()

Current Limitations

  • The TIME_FILTER function cannot be used outside the WHERE clause.

  • It must be used as a standalone expression and cannot be used as input in other functions or operators.

Note: Prior to the introduction of the time_filter() function, a job's WHERE clause was required to use the following clause:

$commit_time between run_start_time() [ - INTERVAL '<integer>' <time_unit> ] AND run_end_time()

The time_filter() function should be used in new jobs instead. However, you might see the old syntax in templates or older jobs in your account.

Aggregated outputs

When aggregating data using Upsolver, the windowSize parameter in the TIME_FILTER function is significant, determining the range of source data that will be included in each aggregation.

By default aggregated outputs use the RUN_INTERVAL as the windowSize. This means that each execution of the job will aggregate just the data since the last execution. You may want to aggregate larger ranges of data but output them more frequently.

Example

Let's say we want a job to aggregate data from the last 1 hour, but we want it to output that aggregation result every 5 minutes. To achieve this we can set a RUN_INTERVAL of 5 minutes in our job options and a windowSize of 60 minutes in our time_filter():

CREATE SYNC JOB my_job
AS INSERT INTO <target>
SELECT COUNT(*) 
  FROM <source>
  WHERE time_filter($event_time, interval 60 minutes)

GROUP BY

The GROUP BY clause specifies the column(s) used to group records together. It is typically used in conjunction with some aggregate function(s) within the selected columns.

Note that all non-aggregate columns selected must be present in the GROUP BY clause.

Additionally, all partition columns should appear within the GROUP BY clause as mapping an aggregated column to a partition column is not allowed.

HAVING

The HAVING clause can be used to filter out records based on aggregate columns.

LIMIT

The LIMIT clause can be used to limit the amount of records written to the target location.

This can be helpful when previewing the output of your job.

Last updated