Content Types

When ingesting your data, you can configure additional options for the following content types:

CSV

CONTENT_TYPE = (
    TYPE = CSV
    INFER_TYPES = { TRUE | FALSE }
    [ HEADER = ('<col1>', '<col2>', '<col3>',...) ]
    [ HEADER_LINE = '<header>, <header>,...' ]
    [ DELIMITER = '<delimiter>' ]
    [ QUOTE_ESCAPE_CHAR = '<char>' ]
    [ NULL_VALUE = '<null_value>' ]
    [ MAX_COLUMNS = <integer> ]
    [ ALLOW_DUPLICATE_HEADERS = { TRUE | FALSE } ]
)

INFER_TYPES

Type: Boolean

(Optional) When true, each column's data type is inferred as one of the following types: string, integer, double, Boolean.

When false, all data is treated as a string.

HEADER

Type: array

Default: Empty list

(Optional) A comma-separated list of column names.

When the CSV data includes a header as the first row, the HEADER property can be omitted. Omitting this property tells Upsolver that a header row is present in the data, and it will take the following actions:

  1. Use the first row for column names

  2. Skip the first row when processing the data

If the source data does not include a header as the first row, meaning the first row contains actual data, you must include the HEADER property when creating a job. This tells Upsolver to take the following actions:

  1. Use the provided HEADER property for column names

  2. Not skip the first row, since it contains data

If your data does not include a header row and you do not set the HEADER property when creating the job, Upsolver will assume the first row is a header and will not process it as data.

HEADER_LINE

Type: string

Default: Empty string

(Optional) A string containing a comma-separated list of header names. This is an alternative to HEADER.

DELIMITER

Type: text

Default: ,

(Optional) The delimiter used between columns in the CSV file.

QUOTE_ESCAPE_CHAR

Type: text

Default: "

(Optional) Defines the character used for escaping quotes inside an already quoted value.

NULL_VALUE

Type: text

(Optional) Values in the CSV that match the provided value are interpreted as null.

MAX_COLUMNS

Type: integer

(Optional) The number of columns to allocate when reading a row. Note that larger values may perform poorly.

ALLOW_DUPLICATE_HEADERS

Type: Boolean

Default: false

(Optional) When true, repeat headers are allowed. Numeric suffixes are added for disambiguation.
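The options above can be combined in a job definition. The sketch below is illustrative only: the job name, connection, bucket location, and target table are placeholders, and the surrounding CREATE JOB syntax is covered in the Ingestion reference pages.

```sql
-- Illustrative sketch; job, connection, and table names are placeholders.
CREATE JOB load_orders_csv
  CONTENT_TYPE = (
    TYPE = CSV
    INFER_TYPES = TRUE
    HEADER = ('order_id', 'customer_id', 'amount')  -- source files have no header row
    DELIMITER = ','
    NULL_VALUE = 'N/A'                              -- treat the literal N/A as null
  )
AS COPY FROM S3 my_s3_connection
  LOCATION = 's3://my-bucket/orders/'
INTO default_glue_catalog.my_schema.orders_raw;
```

Because HEADER is set, Upsolver uses the provided names and treats the first row of each file as data.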

TSV

CONTENT_TYPE = (
    TYPE = TSV
    INFER_TYPES = { TRUE | FALSE } 
    [ HEADER = ('<col1>', '<col2>', '<col3>',...) ]
    [ HEADER_LINE = '<header>, <header>,...' ]
    [ NULL_VALUE = '<null_value>' ] 
    [ MAX_COLUMNS = <integer> ]
    [ ALLOW_DUPLICATE_HEADERS = { TRUE | FALSE } ]
)

INFER_TYPES

Type: Boolean

(Optional) When true, each column's data type is inferred as one of the following types: string, integer, double, Boolean.

When false, all data is treated as a string.

HEADER

Type: array

Default: Empty list

(Optional) A comma-separated list of column names.

When the TSV data includes a header as the first row, the HEADER property can be omitted. Omitting this property tells Upsolver that a header row is present in the data, and it will take the following actions:

  1. Use the first row for column names

  2. Skip the first row when processing the data

If the source data does not include a header as the first row, meaning the first row contains actual data, you must include the HEADER property when creating a job. This tells Upsolver to take the following actions:

  1. Use the provided HEADER property for column names

  2. Not skip the first row, since it contains data

If your data does not include a header row and you do not set the HEADER property when creating the job, Upsolver will assume the first row is a header and will not process it as data.

HEADER_LINE

Type: string

Default: Empty string

(Optional) A string containing a comma-separated list of header names. This is an alternative to HEADER.

NULL_VALUE

Type: text

(Optional) Values in the TSV that match the provided value are interpreted as null.

MAX_COLUMNS

Type: integer

(Optional) The number of columns to allocate when reading a row. Note that larger values may perform poorly.

ALLOW_DUPLICATE_HEADERS

Type: Boolean

Default: false

(Optional) When true, repeat headers are allowed. Numeric suffixes are added for disambiguation.

JSON

CONTENT_TYPE = (
    TYPE = JSON
    [ SPLIT_ROOT_ARRAY = { TRUE | FALSE } ] 
    [ STORE_JSON_AS_STRING = { TRUE | FALSE } ]
)

SPLIT_ROOT_ARRAY

Type: Boolean

Default: true

(Optional) When true, a root object that is an array is parsed as separate events. When false, it is parsed as a single event that contains only an array.

STORE_JSON_AS_STRING

Type: Boolean

Default: false

(Optional) When true, a copy of the original JSON is stored as a string value in an additional column.
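For example, if each source file holds a root-level JSON array of events, the sketch below (placeholder names throughout) splits that array into individual rows while keeping a raw string copy of each record:

```sql
-- Illustrative sketch; job, connection, and table names are placeholders.
CREATE JOB load_events_json
  CONTENT_TYPE = (
    TYPE = JSON
    SPLIT_ROOT_ARRAY = TRUE       -- each array element becomes its own event
    STORE_JSON_AS_STRING = TRUE   -- keep the original JSON in an additional column
  )
AS COPY FROM S3 my_s3_connection
  LOCATION = 's3://my-bucket/events/'
INTO default_glue_catalog.my_schema.events_raw;
```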

AVRO_SCHEMA_REGISTRY

Note that only Avro schemas are currently supported.

CONTENT_TYPE = (
    TYPE = AVRO_SCHEMA_REGISTRY
    SCHEMA_REGISTRY_URL = '<url>'
)

SCHEMA_REGISTRY_URL

Type: text

The Avro schema registry URL. To support schema evolution, add {id} to the URL; Upsolver will embed the schema ID from the Avro header.

For example, https://schema-registry.service.yourdomain.com/schemas/ids/{id}
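Putting it together, a registry-backed Kafka ingestion might look like the sketch below. The topic, connection, and table names are placeholders; the {id} placeholder in the URL is filled in by Upsolver from each message's Avro header.

```sql
-- Illustrative sketch; job, connection, topic, and table names are placeholders.
CREATE JOB load_avro_topic
  CONTENT_TYPE = (
    TYPE = AVRO_SCHEMA_REGISTRY
    SCHEMA_REGISTRY_URL = 'https://schema-registry.service.yourdomain.com/schemas/ids/{id}'
  )
AS COPY FROM KAFKA my_kafka_connection
  TOPIC = 'avro_events'
INTO default_glue_catalog.my_schema.avro_events_raw;
```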

FIXED_WIDTH

CONTENT_TYPE = (
    TYPE = FIXED_WIDTH
    [ COLUMNS =  ( (COLUMN_NAME = '<column_name>' 
                    START_INDEX = <integer> 
                    END_INDEX = <integer>) [,...] ) ]
    [ INFER_TYPES = { TRUE | FALSE } ]
)    

COLUMNS

Type: list

(Optional) An array of the name, start index, and end index for each column in the file.

INFER_TYPES

Type: Boolean

Default: false

(Optional) When true, each column's data type is inferred. When false, all data is treated as a string.
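For instance, to parse rows where the first 10 characters hold an ID and the next 20 hold a name, the COLUMNS option can be sketched as follows. The column names and index values are illustrative; verify the index conventions (zero- vs one-based, inclusive vs exclusive) against your data.

```sql
CONTENT_TYPE = (
    TYPE = FIXED_WIDTH
    COLUMNS = (
        (COLUMN_NAME = 'id'   START_INDEX = 0  END_INDEX = 10),
        (COLUMN_NAME = 'name' START_INDEX = 10 END_INDEX = 30)
    )
    INFER_TYPES = TRUE
)
```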

REGEX

CONTENT_TYPE = (
    TYPE = REGEX
    [ PATTERN = '<pattern>' ]
    [ MULTILINE = { TRUE | FALSE } ]
    [ INFER_TYPES = { TRUE | FALSE } ]
)

PATTERN

Type: text

(Optional) The pattern to match against the input. Named groups are extracted from the data.

MULTILINE

Type: Boolean

Default: false

(Optional) When true, the pattern is matched against the whole input. When false, it is matched against each line of the input.

INFER_TYPES

Type: Boolean

Default: false

(Optional) When true, each column's data type is inferred. When false, all data is treated as a string.
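Because named groups are extracted as columns, a log line such as `2024-01-01 ERROR disk full` could be parsed with a pattern like the sketch below. The group names are illustrative; patterns follow Java regular expression syntax.

```sql
CONTENT_TYPE = (
    TYPE = REGEX
    PATTERN = '(?<date>\S+) (?<level>\w+) (?<message>.*)'
    MULTILINE = FALSE     -- match the pattern against each line separately
    INFER_TYPES = FALSE
)
```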

SPLIT_LINES

CONTENT_TYPE = (
    TYPE = SPLIT_LINES
    [ PATTERN = '<pattern>' ]
)

PATTERN

Type: text

(Optional) A regular expression pattern to split the data by. If left empty, the data is split by lines.

XML

CONTENT_TYPE = (
    TYPE = XML
    [ STORE_ROOT_AS_STRING = { TRUE | FALSE } ]                 
)

STORE_ROOT_AS_STRING

Type: Boolean

Default: false

(Optional) When true, a copy of the XML is stored as a string in an additional column.

See Java Pattern for more information.
