Content Types

When reading in your data, additional options can be configured for the following content types:

CSV

CONTENT_TYPE = (
    TYPE = CSV
    INFER_TYPES = { TRUE | FALSE }
    [ HEADER = ('<col1>', '<col2>', '<col3>',...) ]
    [ HEADER_LINE = '<header>, <header>,...' ]
    [ DELIMITER = '<delimiter>' ]
    [ QUOTE_ESCAPE_CHAR = '<char>' ]
    [ NULL_VALUE = '<null_value>' ]
    [ MAX_COLUMNS = <integer> ]
    [ ALLOW_DUPLICATE_HEADERS = { TRUE | FALSE } ]
)

INFER_TYPES

Type: Boolean

(Optional) When true, each column's data type is inferred as one of the following types: string, integer, double, Boolean.

When false, all data is treated as a string.

Type: array

Default: Empty string

(Optional) An comma-separated list of column names.

When the CSV data include a header as the first row, HEADER property can be omitted. By omitting this property, it tells Upsolver that a header row can be found in the data and it will take the following actions:

  1. Use the first row for column names

  2. Skip the first row when processing the data

If the source data does not include a header as the first row, meaning the first row contains actual data, you must include the HEADER property when creating a JOB. This tells Upsolver to take the following actions:

  1. Use the provided HEADER property for column names

  2. Do not skip the first row since it contains data

If your data does not include a header row and you do not set a HEADER property when creating the job, Upsolver will assume the first row is a header and not process it.

HEADER_LINE

Type: string

Default: Empty string

(Optional) A string containing a comma-separated list of header names. This is an alternative to HEADER.

DELIMITER

Type: text

Default: ,

(Optional) The delimiter used for columns in the CSV file

QUOTE_ESCAPE_CHAR

Type: text

Default: "

(Optional) Defines the character used for escaping quotes inside an already quoted value.

NULL_VALUE

Type: text

(Optional) Values in the CSV that match the provided value are interpreted as null.

MAX_COLUMNS

Type: integer

(Optional) The number of columns to allocate when reading a row. Note that larger values may perform poorly.

ALLOW_DUPLICATE_HEADERS

Type: Boolean

Default: false

(Optional) When true, repeat headers are allowed. Numeric suffixes are added for disambiguation.

TSV

CONTENT_TYPE = (
    TYPE = TSV
    INFER_TYPES = { TRUE | FALSE } 
    [ HEADER = ('<col1>', '<col2>', '<col3>',...) ]
    [ HEADER_LINE = '<header>, <header>,...' ]
    [ NULL_VALUE = '<null_value>' ] 
    [ MAX_COLUMNS = <integer> ]
    [ ALLOW_DUPLICATE_HEADERS = { TRUE | FALSE } ]
)

INFER_TYPES

Type: Boolean

(Optional) When true, each column's data types are inferred as one of the following types: string, integer, double, Boolean.

When false, all data is treated as a string.

HEADER

Type: string

Default: Empty string

(Optional) A string containing a comma separated list of column names.

When the TSV data include a header as the first row, HEADER property can be omitted. By omitting this property, it tells Upsolver that a header row can be found in the data and it will take the following actions:

  1. Use the first row for column names

  2. Skip the first row when processing the data

If the source data does not include a header as the first row, meaning the first row contains actual data, you must include the HEADER property when creating a JOB. This tells Upsolver to take the following actions:

  1. Use the provided HEADER property for column names

  2. Do not skip the first row since it contains data

If your data does not include a header row and you do not set a HEADER property when creating the job, Upsolver will assume the first row is a header and not process it.

HEADER_LINE

Type: string

Default: Empty string

(Optional) A string containing a comma-separated list of header names. This is an alternative to HEADER.

NULL_VALUE

Type: text

(Optional) Values in the TSV that match the provided value are interpreted as null.

MAX_COLUMNS

Type: integer

(Optional) The number of columns to allocate when reading a row. Note that larger values may perform poorly.

ALLOW_DUPLICATE_HEADERS

Type: Boolean

Default: false

(Optional) When true, repeat headers are allowed. Numeric suffixes are added for disambiguation.

JSON

CONTENT_TYPE = (
    TYPE = JSON
    [ SPLIT_ROOT_ARRAY = { TRUE | FALSE } ] 
    [ STORE_JSON_AS_STRING = { TRUE | FALSE } ]
)

SPLIT_ROOT_ARRAY

Type: Boolean

Default: true

(Optional) When true, a root object that is an array is parsed as separate events. When false, it is parsed as a single event that contains only an array.

STORE_JSON_AS_STRING

Type: Boolean

Default: false

(Optional) When true, a copy of the original JSON is stored as a string value in an additional column.

AVRO_SCHEMA_REGISTRY

Note that only Avro schemas are currently supported.

CONTENT_TYPE = (
    TYPE = AVRO_SCHEMA_REGISTRY
    SCHEMA_REGISTRY_URL = '<url>'
)

SCHEMA_REGISTRY_URL

Type: text

Avro schema registry URL. To support schema evolution add {id} to the URL and Upsolver will embed the id from the AVRO header.

For example, https://schema-registry.service.yourdomain.com/schemas/ids/{id}

FIXED_WIDTH

CONTENT_TYPE = (
    TYPE = FIXED_WIDTH
    [ COLUMNS =  ( (COLUMN_NAME = '<column_name>' 
                    START_INDEX = <integer> 
                    END_INDEX = <integer>) [,...] ) ]
    [ INFER_TYPES = { TRUE | FALSE } ]
)    

COLUMNS

Type: list

(Optional) An array of the name, start index, and end index for each column in the file.

INFER_TYPES

Type: Boolean

Default: false

(Optional) When true, each column's data type is inferred. When false, all data is treated as a string.

REGEX

See Java Pattern for more information.

CONTENT_TYPE = (
    TYPE = REGEX
    [ PATTERN = '<pattern>' ]
    [ MULTILINE = { TRUE | FALSE } ]
    [ INFER_TYPES = { TRUE | FALSE } ]
)

PATTERN

Type: text

(Optional) The pattern to match against the input. Named groups are extracted from the data.

MULTILINE

Type: Boolean

Default: false

(Optional) When true, the pattern is matched against the whole input. When false, it is matched against each line of the input.

INFER_TYPES

Type: Boolean

Default: false

(Optional) When true, each column's data types is inferred. When false, all data is treated as a string.

SPLIT_LINES

CONTENT_TYPE = (
    TYPE = SPLIT_LINES
    PATTERN = '<pattern>'
)

PATTERN

Type: text

(Optional) A regular expression pattern to split the data by. If left empty, the data is split by lines.

XML

CONTENT_TYPE = (
    TYPE = XML
    [ STORE_ROOT_AS_STRING = { TRUE | FALSE } ]                 
)

STORE_ROOT_AS_STRING

Type: Boolean

Default: false

(Optional) When true, a copy of the XML is stored as a string in an additional column.