Content Types

When ingesting your data, you can configure additional options for the following content types:

CSV

CONTENT_TYPE = (
    TYPE = CSV
    INFER_TYPES = { TRUE | FALSE }
    [ HEADER = ('<col1>', '<col2>', '<col3>',...) ]
    [ HEADER_LINE = '<header>, <header>,...' ]
    [ DELIMITER = '<delimiter>' ]
    [ QUOTE_ESCAPE_CHAR = '<char>' ]
    [ NULL_VALUE = '<null_value>' ]
    [ MAX_COLUMNS = <integer> ]
    [ ALLOW_DUPLICATE_HEADERS = { TRUE | FALSE } ]
)

INFER_TYPES

Type: Boolean

(Optional) When true, each column's data type is inferred as one of the following types: string, integer, double, Boolean.

When false, all data is treated as a string.

HEADER

Type: array

Default: Empty list

(Optional) A comma-separated list of column names.

When the CSV data includes a header as the first row, the HEADER property can be omitted. Omitting this property tells Upsolver that a header row is present in the data, and it will take the following actions:

  1. Use the first row for column names

  2. Skip the first row when processing the data

If the source data does not include a header as the first row, meaning the first row contains actual data, you must include the HEADER property when creating a job. This tells Upsolver to take the following actions:

  1. Use the provided HEADER property for column names

  2. Not skip the first row, since it contains data

If your data does not include a header row and you do not set the HEADER property when creating the job, Upsolver will assume the first row is a header and will not process it as data.

HEADER_LINE

Type: string

Default: Empty string

(Optional) A string containing a comma-separated list of header names. This is an alternative to HEADER.

DELIMITER

Type: text

Default: ,

(Optional) The delimiter used between columns in the CSV file.

QUOTE_ESCAPE_CHAR

Type: text

Default: "

(Optional) Defines the character used for escaping quotes inside an already quoted value.

NULL_VALUE

Type: text

(Optional) Values in the CSV that match the provided value are interpreted as null.

MAX_COLUMNS

Type: integer

(Optional) The number of columns to allocate when reading a row. Note that larger values may perform poorly.

ALLOW_DUPLICATE_HEADERS

Type: Boolean

Default: false

(Optional) When true, repeat headers are allowed. Numeric suffixes are added for disambiguation.
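The options above can be combined in a job definition. The sketch below is illustrative only: the job name, connection, bucket location, and target table are placeholders, and the surrounding CREATE JOB syntax is covered in the Ingestion reference pages.

```sql
-- Illustrative sketch; job, connection, and table names are placeholders.
CREATE JOB load_orders_csv
  CONTENT_TYPE = (
    TYPE = CSV
    INFER_TYPES = TRUE
    HEADER = ('order_id', 'customer_id', 'amount')  -- source files have no header row
    DELIMITER = ','
    NULL_VALUE = 'N/A'                              -- treat the literal N/A as null
  )
AS COPY FROM S3 my_s3_connection
  LOCATION = 's3://my-bucket/orders/'
INTO default_glue_catalog.my_schema.orders_raw;
```

Because HEADER is set, Upsolver uses the provided names and treats the first row of each file as data.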

TSV

CONTENT_TYPE = (
    TYPE = TSV
    INFER_TYPES = { TRUE | FALSE } 
    [ HEADER = ('<col1>', '<col2>', '<col3>',...) ]
    [ HEADER_LINE = '<header>, <header>,...' ]
    [ NULL_VALUE = '<null_value>' ] 
    [ MAX_COLUMNS = <integer> ]
    [ ALLOW_DUPLICATE_HEADERS = { TRUE | FALSE } ]
)

INFER_TYPES

Type: Boolean

(Optional) When true, each column's data type is inferred as one of the following types: string, integer, double, Boolean.

When false, all data is treated as a string.

HEADER

Type: array

Default: Empty list

(Optional) A comma-separated list of column names.

When the TSV data includes a header as the first row, the HEADER property can be omitted. Omitting this property tells Upsolver that a header row is present in the data, and it will take the following actions:

  1. Use the first row for column names

  2. Skip the first row when processing the data

If the source data does not include a header as the first row, meaning the first row contains actual data, you must include the HEADER property when creating a job. This tells Upsolver to take the following actions:

  1. Use the provided HEADER property for column names

  2. Not skip the first row, since it contains data

If your data does not include a header row and you do not set the HEADER property when creating the job, Upsolver will assume the first row is a header and will not process it as data.

HEADER_LINE

Type: string

Default: Empty string

(Optional) A string containing a comma-separated list of header names. This is an alternative to HEADER.

NULL_VALUE

Type: text

(Optional) Values in the TSV that match the provided value are interpreted as null.

MAX_COLUMNS

Type: integer

(Optional) The number of columns to allocate when reading a row. Note that larger values may perform poorly.

ALLOW_DUPLICATE_HEADERS

Type: Boolean

Default: false

(Optional) When true, repeat headers are allowed. Numeric suffixes are added for disambiguation.

JSON

CONTENT_TYPE = (
    TYPE = JSON
    [ SPLIT_ROOT_ARRAY = { TRUE | FALSE } ] 
    [ STORE_JSON_AS_STRING = { TRUE | FALSE } ]
)

SPLIT_ROOT_ARRAY

Type: Boolean

Default: true

(Optional) When true, a root object that is an array is parsed as separate events. When false, it is parsed as a single event that contains only an array.

STORE_JSON_AS_STRING

Type: Boolean

Default: false

(Optional) When true, a copy of the original JSON is stored as a string value in an additional column.
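For example, if each source file holds a root-level JSON array of events, the sketch below (placeholder names throughout) splits that array into individual rows while keeping a raw string copy of each record:

```sql
-- Illustrative sketch; job, connection, and table names are placeholders.
CREATE JOB load_events_json
  CONTENT_TYPE = (
    TYPE = JSON
    SPLIT_ROOT_ARRAY = TRUE       -- each array element becomes its own event
    STORE_JSON_AS_STRING = TRUE   -- keep the original JSON in an additional column
  )
AS COPY FROM S3 my_s3_connection
  LOCATION = 's3://my-bucket/events/'
INTO default_glue_catalog.my_schema.events_raw;
```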

AVRO_SCHEMA_REGISTRY

Note that only Avro schemas are currently supported.

CONTENT_TYPE = (
    TYPE = AVRO_SCHEMA_REGISTRY
    SCHEMA_REGISTRY_URL = '<url>'
)

SCHEMA_REGISTRY_URL

Type: text

The Avro schema registry URL. To support schema evolution, add {id} to the URL; Upsolver will embed the schema ID from the Avro header.

For example, https://schema-registry.service.yourdomain.com/schemas/ids/{id}
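Putting it together, a registry-backed Kafka ingestion might look like the sketch below. The topic, connection, and table names are placeholders; the {id} placeholder in the URL is filled in by Upsolver from each message's Avro header.

```sql
-- Illustrative sketch; job, connection, topic, and table names are placeholders.
CREATE JOB load_avro_topic
  CONTENT_TYPE = (
    TYPE = AVRO_SCHEMA_REGISTRY
    SCHEMA_REGISTRY_URL = 'https://schema-registry.service.yourdomain.com/schemas/ids/{id}'
  )
AS COPY FROM KAFKA my_kafka_connection
  TOPIC = 'avro_events'
INTO default_glue_catalog.my_schema.avro_events_raw;
```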

FIXED_WIDTH

CONTENT_TYPE = (
    TYPE = FIXED_WIDTH
    [ COLUMNS =  ( (COLUMN_NAME = '<column_name>' 
                    START_INDEX = <integer> 
                    END_INDEX = <integer>) [,...] ) ]
    [ INFER_TYPES = { TRUE | FALSE } ]
)    

COLUMNS

Type: list

(Optional) An array of the name, start index, and end index for each column in the file.

INFER_TYPES

Type: Boolean

Default: false

(Optional) When true, each column's data type is inferred. When false, all data is treated as a string.
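For instance, to parse rows where the first 10 characters hold an ID and the next 20 hold a name, the COLUMNS option can be sketched as follows. The column names and index values are illustrative; verify the index conventions (zero- vs one-based, inclusive vs exclusive) against your data.

```sql
CONTENT_TYPE = (
    TYPE = FIXED_WIDTH
    COLUMNS = (
        (COLUMN_NAME = 'id'   START_INDEX = 0  END_INDEX = 10),
        (COLUMN_NAME = 'name' START_INDEX = 10 END_INDEX = 30)
    )
    INFER_TYPES = TRUE
)
```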

REGEX

CONTENT_TYPE = (
    TYPE = REGEX
    [ PATTERN = '<pattern>' ]
    [ MULTILINE = { TRUE | FALSE } ]
    [ INFER_TYPES = { TRUE | FALSE } ]
)

PATTERN

Type: text

(Optional) The pattern to match against the input. Named groups are extracted from the data.

MULTILINE

Type: Boolean

Default: false

(Optional) When true, the pattern is matched against the whole input. When false, it is matched against each line of the input.

INFER_TYPES

Type: Boolean

Default: false

(Optional) When true, each column's data type is inferred. When false, all data is treated as a string.
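Because named groups are extracted as columns, a log line such as `2024-01-01 ERROR disk full` could be parsed with a pattern like the sketch below. The group names are illustrative; patterns follow Java regular expression syntax.

```sql
CONTENT_TYPE = (
    TYPE = REGEX
    PATTERN = '(?<date>\S+) (?<level>\w+) (?<message>.*)'
    MULTILINE = FALSE     -- match the pattern against each line separately
    INFER_TYPES = FALSE
)
```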

SPLIT_LINES

CONTENT_TYPE = (
    TYPE = SPLIT_LINES
    [ PATTERN = '<pattern>' ]
)

PATTERN

Type: text

(Optional) A regular expression pattern to split the data by. If left empty, the data is split by lines.

XML

CONTENT_TYPE = (
    TYPE = XML
    [ STORE_ROOT_AS_STRING = { TRUE | FALSE } ]                 
)

STORE_ROOT_AS_STRING

Type: Boolean

Default: false

(Optional) When true, a copy of the XML is stored as a string in an additional column.

See Java Pattern for more information.
