Content types

This page describes the job options for ingesting data from different types of content.

When reading in your data, additional options can be configured for the following content types:

CSV

CONTENT_TYPE = (
    TYPE = CSV
    INFER_TYPES = { TRUE | FALSE }
    [ HEADER = ('<col1>', '<col2>', '<col3>',...) ]
    [ HEADER_LINE = '<header>,<header>,...' ]
    [ DELIMITER = '<delimiter>' ]
    [ QUOTE_ESCAPE_CHAR = '<char>' ]
    [ NULL_VALUE = '<null_value>' ]
    [ MAX_COLUMNS = <integer> ]
    [ ALLOW_DUPLICATE_HEADERS = { TRUE | FALSE } ]
)

INFER_TYPES

Type: Boolean

(Optional) When true, each column's data type is inferred as one of the following types: string, integer, double, Boolean.

When false, all data is treated as a string.

Type: array

Default: Empty string

(Optional) A comma-separated list of column names.

When the CSV data include a header as the first row, the HEADER property can be omitted. By omitting this property, it tells Upsolver that a header row can be found in the data and it will take the following actions:

  1. Use the first row for column names

  2. Skip the first row when processing the data

If the source data does not include a header as the first row, meaning the first row contains actual data, you must include the HEADER property when creating a JOB. This tells Upsolver to take the following actions:

  1. Use the provided HEADER property for column names

  2. Do not skip the first row since it contains data

If your data does not include a header row and you do not set a HEADER property when creating the job, Upsolver will assume the first row is a header and not process it.

HEADER_LINE

Type: string

Default: Empty string

(Optional) A string containing a comma-separated list of header names. This is an alternative to HEADER.

DELIMITER

Type: text

Default: ,

(Optional) The delimiter used for columns in the CSV file

QUOTE_ESCAPE_CHAR

Type: text

Default: "

(Optional) Defines the character used for escaping quotes inside an already quoted value.

NULL_VALUE

Type: text

(Optional) Values in the CSV that match the provided value are interpreted as null.

MAX_COLUMNS

Type: integer

(Optional) The number of columns to allocate when reading a row. Note that larger values may perform poorly.

ALLOW_DUPLICATE_HEADERS

Type: Boolean

Default: false

(Optional) When true, repeat headers are allowed. Numeric suffixes are added for disambiguation.

TSV

CONTENT_TYPE = (
    TYPE = TSV
    INFER_TYPES = { TRUE | FALSE } 
    [ HEADER = ('<col1>', '<col2>', '<col3>',...) ]
    [ HEADER_LINE = '<header>,<header>,...' ]
    [ NULL_VALUE = '<null_value>' ] 
    [ MAX_COLUMNS = <integer> ]
    [ ALLOW_DUPLICATE_HEADERS = { TRUE | FALSE } ]
)

INFER_TYPES

Type: Boolean

(Optional) When true, each column's data types are inferred as one of the following types: string, integer, double, Boolean.

When false, all data is treated as a string.

HEADER

Type: array

Default: Empty string

(Optional) A comma-separated list of column names.

When the TSV data include a header as the first row, HEADER property can be omitted. By omitting this property, it tells Upsolver that a header row can be found in the data and it will take the following actions:

  1. Use the first row for column names

  2. Skip the first row when processing the data

If the source data does not include a header as the first row, meaning the first row contains actual data, you must include the HEADER property when creating a JOB. This tells Upsolver to take the following actions:

  1. Use the provided HEADER property for column names

  2. Do not skip the first row since it contains data

If your data does not include a header row and you do not set a HEADER property when creating the job, Upsolver will assume the first row is a header and not process it.

HEADER_LINE

Type: string

Default: Empty string

(Optional) A string containing a comma-separated list of header names. This is an alternative to HEADER.

NULL_VALUE

Type: text

(Optional) Values in the TSV that match the provided value are interpreted as null.

MAX_COLUMNS

Type: integer

(Optional) The number of columns to allocate when reading a row. Note that larger values may perform poorly.

ALLOW_DUPLICATE_HEADERS

Type: Boolean

Default: false

(Optional) When true, repeat headers are allowed. Numeric suffixes are added for disambiguation.

JSON

CONTENT_TYPE = (
    TYPE = JSON
    [ STORE_JSON_AS_STRING = { TRUE | FALSE } ]
)

STORE_JSON_AS_STRING

Type: Boolean

Default: false

(Optional) When true, a copy of the original JSON is stored as a string value in an additional column.

SCHEMA_REGISTRY

Note that only Avro schemas are currently supported.

CONTENT_TYPE = (
    TYPE = SCHEMA_REGISTRY
    SCHEMA_REGISTRY_URL = '<url>'
)

SCHEMA_REGISTRY_URL

Type: text

Avro schema registry URL. To support schema evolution add {id} to the URL and Upsolver will embed the id from the AVRO header.

For example, https://schema-registry.service.yourdomain.com/schemas/ids/{id}

FIXED_WIDTH

CONTENT_TYPE = (
    TYPE = FIXED_WIDTH
    [ COLUMNS =  ( (COLUMN_NAME = '<column_name>' 
                    START_INDEX = <integer> 
                    END_INDEX = <integer>) [,...] ) ]
    [ INFER_TYPES = { TRUE | FALSE } ]
)    

COLUMNS

Type: list

(Optional) An array of the name, start index, and end index for each column in the file.

INFER_TYPES

Type: Boolean

Default: false

(Optional) When true, each column's data type is inferred. When false, all data is treated as a string.

REGEX

See: Java Pattern

CONTENT_TYPE = (
    TYPE = REGEX
    [ PATTERN = '<pattern>' ]
    [ MULTILINE = { TRUE | FALSE } ]
    [ INFER_TYPES = { TRUE | FALSE } ]
)

PATTERN

Type: text

(Optional) The pattern to match against the input. Named groups are extracted from the data.

MULTILINE

Type: Boolean

Default: false

(Optional) When true, the pattern is matched against the whole input. When false, it is matched against each line of the input.

INFER_TYPES

Type: Boolean

Default: false

(Optional) When true, each column's data types is inferred. When false, all data is treated as a string.

SPLIT_LINES

CONTENT_TYPE = (
    TYPE = SPLIT_LINES
    PATTERN = '<pattern>'
)

PATTERN

Type: text

(Optional) A regular expression pattern to split the data by. If left empty, the data is split by lines.

XML

CONTENT_TYPE = (
    TYPE = XML
    [ STORE_ROOT_AS_STRING = { TRUE | FALSE } ]                 
)

STORE_ROOT_AS_STRING

Type: Boolean

Default: false

(Optional) When true, a copy of the XML is stored as a string in an additional column.

Last updated