Content Types
When reading in your data, additional options can be configured for the following content types:
CSV
CONTENT_TYPE = (
TYPE = CSV
INFER_TYPES = { TRUE | FALSE }
[ HEADER = ('<col1>', '<col2>', '<col3>',...) ]
[ HEADER_LINE = '<header>, <header>,...' ]
[ DELIMITER = '<delimiter>' ]
[ QUOTE_ESCAPE_CHAR = '<char>' ]
[ NULL_VALUE = '<null_value>' ]
[ MAX_COLUMNS = <integer> ]
[ ALLOW_DUPLICATE_HEADERS = { TRUE | FALSE } ]
)
INFER_TYPES
Type: Boolean
(Optional) When true, each column's data type is inferred as one of the following types: string, integer, double, Boolean. When false, all data is treated as a string.
HEADER
Type: array
Default: Empty string
(Optional) A comma-separated list of column names.
When the CSV data includes a header as the first row, the HEADER property can be omitted. Omitting this property tells Upsolver that a header row can be found in the data, and it takes the following actions:
Use the first row for column names
Skip the first row when processing the data
If the source data does not include a header as the first row, meaning the first row contains actual data, you must include the HEADER property when creating a job. This tells Upsolver to take the following actions:
Use the provided HEADER property for column names
Do not skip the first row since it contains data
If your data does not include a header row and you do not set the HEADER property when creating the job, Upsolver assumes the first row is a header and does not process it as data.
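For source data without a header row, a minimal sketch of the setting might look like the following. It only fills in the syntax shown above with sample values, and the column names are hypothetical.

```sql
-- Headerless CSV: the column names are supplied explicitly,
-- so the first row is read as data rather than skipped.
-- The column names below are hypothetical.
CONTENT_TYPE = (
    TYPE = CSV
    INFER_TYPES = TRUE
    HEADER = ('order_id', 'customer_id', 'total')
)
```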
HEADER_LINE
Type: string
Default: Empty string
(Optional) A string containing a comma-separated list of header names. This is an alternative to HEADER.
DELIMITER
Type: text
Default: ,
(Optional) The delimiter used for columns in the CSV file.
QUOTE_ESCAPE_CHAR
Type: text
Default: "
(Optional) Defines the character used for escaping quotes inside an already quoted value.
NULL_VALUE
Type: text
(Optional) Values in the CSV that match the provided value are interpreted as null.
MAX_COLUMNS
Type: integer
(Optional) The number of columns to allocate when reading a row. Note that larger values may reduce performance.
ALLOW_DUPLICATE_HEADERS
Type: Boolean
Default: false
(Optional) When true, repeated headers are allowed. Numeric suffixes are added for disambiguation.
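As a sketch of how several of these options combine, the fragment below describes a pipe-delimited file that has a header row, marks missing values with a placeholder, and may repeat a column name. The delimiter and null marker are assumed sample values, not defaults.

```sql
-- Pipe-delimited CSV with a header row (HEADER is omitted,
-- so the first row supplies the column names).
-- '|' and 'N/A' are illustrative values for a hypothetical file.
CONTENT_TYPE = (
    TYPE = CSV
    INFER_TYPES = FALSE            -- keep every column as a string
    DELIMITER = '|'
    NULL_VALUE = 'N/A'             -- cells equal to N/A are read as null
    ALLOW_DUPLICATE_HEADERS = TRUE -- repeated names get numeric suffixes
)
```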
TSV
CONTENT_TYPE = (
TYPE = TSV
INFER_TYPES = { TRUE | FALSE }
[ HEADER = ('<col1>', '<col2>', '<col3>',...) ]
[ HEADER_LINE = '<header>, <header>,...' ]
[ NULL_VALUE = '<null_value>' ]
[ MAX_COLUMNS = <integer> ]
[ ALLOW_DUPLICATE_HEADERS = { TRUE | FALSE } ]
)
INFER_TYPES
Type: Boolean
(Optional) When true, each column's data type is inferred as one of the following types: string, integer, double, Boolean. When false, all data is treated as a string.
HEADER
Type: string
Default: Empty string
(Optional) A string containing a comma-separated list of column names.
When the TSV data includes a header as the first row, the HEADER property can be omitted. Omitting this property tells Upsolver that a header row can be found in the data, and it takes the following actions:
Use the first row for column names
Skip the first row when processing the data
If the source data does not include a header as the first row, meaning the first row contains actual data, you must include the HEADER property when creating a job. This tells Upsolver to take the following actions:
Use the provided HEADER property for column names
Do not skip the first row since it contains data
If your data does not include a header row and you do not set the HEADER property when creating the job, Upsolver assumes the first row is a header and does not process it as data.
HEADER_LINE
Type: string
Default: Empty string
(Optional) A string containing a comma-separated list of header names. This is an alternative to HEADER.
NULL_VALUE
Type: text
(Optional) Values in the TSV that match the provided value are interpreted as null.
MAX_COLUMNS
Type: integer
(Optional) The number of columns to allocate when reading a row. Note that larger values may reduce performance.
ALLOW_DUPLICATE_HEADERS
Type: Boolean
Default: false
(Optional) When true, repeated headers are allowed. Numeric suffixes are added for disambiguation.
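A TSV source without a header row could also name its columns through HEADER_LINE, which takes a single string in place of the HEADER list. The sketch below assumes hypothetical column names and a hypothetical null marker.

```sql
-- Headerless TSV: column names are given as one comma-separated string.
-- The names and the '-' null marker are illustrative only.
CONTENT_TYPE = (
    TYPE = TSV
    INFER_TYPES = TRUE
    HEADER_LINE = 'event_time, user_id, action'
    NULL_VALUE = '-'
)
```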
JSON
CONTENT_TYPE = (
TYPE = JSON
[ SPLIT_ROOT_ARRAY = { TRUE | FALSE } ]
[ STORE_JSON_AS_STRING = { TRUE | FALSE } ]
)
SPLIT_ROOT_ARRAY
Type: Boolean
Default: true
(Optional) When true, a root object that is an array is parsed as separate events. When false, it is parsed as a single event that contains only an array.
STORE_JSON_AS_STRING
Type: Boolean
Default: false
(Optional) When true, a copy of the original JSON is stored as a string value in an additional column.
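To make the two options concrete, consider a hypothetical input file whose root is the array [{"id": 1}, {"id": 2}]. With the default SPLIT_ROOT_ARRAY = TRUE it is parsed as two events. The sketch below overrides both defaults, so the file is parsed as a single event that contains the array, and the original JSON is also kept as a string.

```sql
-- Keep a root-level array together as one event and retain the raw JSON.
CONTENT_TYPE = (
    TYPE = JSON
    SPLIT_ROOT_ARRAY = FALSE     -- one event containing the whole array
    STORE_JSON_AS_STRING = TRUE  -- original JSON copied into an extra column
)
```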
AVRO_SCHEMA_REGISTRY
CONTENT_TYPE = (
TYPE = AVRO_SCHEMA_REGISTRY
SCHEMA_REGISTRY_URL = '<url>'
)
SCHEMA_REGISTRY_URL
Type: text
The Avro schema registry URL. To support schema evolution, add {id} to the URL and Upsolver will embed the ID from the Avro header.
For example, https://schema-registry.service.yourdomain.com/schemas/ids/{id}
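Placed in the full option block, the example URL above would look like this. The domain is the placeholder from the example and should be replaced with your own registry address.

```sql
-- {id} stays in the URL as written; it is filled in from the Avro header.
CONTENT_TYPE = (
    TYPE = AVRO_SCHEMA_REGISTRY
    SCHEMA_REGISTRY_URL = 'https://schema-registry.service.yourdomain.com/schemas/ids/{id}'
)
```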
FIXED_WIDTH
CONTENT_TYPE = (
TYPE = FIXED_WIDTH
[ COLUMNS = ( (COLUMN_NAME = '<column_name>'
START_INDEX = <integer>
END_INDEX = <integer>) [,...] ) ]
[ INFER_TYPES = { TRUE | FALSE } ]
)
COLUMNS
Type: list
(Optional) An array of the name, start index, and end index for each column in the file.
INFER_TYPES
Type: Boolean
Default: false
(Optional) When true, each column's data type is inferred. When false, all data is treated as a string.
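A sketch of a two-column fixed-width layout follows. The column names and index values are hypothetical, and this section does not state whether indexes are zero-based or whether END_INDEX is inclusive, so verify the offsets against a sample of your data.

```sql
-- Two hypothetical fixed-width columns.
-- Index base and END_INDEX inclusiveness are assumptions; check before use.
CONTENT_TYPE = (
    TYPE = FIXED_WIDTH
    COLUMNS = (
        (COLUMN_NAME = 'account_id' START_INDEX = 0 END_INDEX = 10),
        (COLUMN_NAME = 'balance' START_INDEX = 10 END_INDEX = 20)
    )
    INFER_TYPES = TRUE
)
```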
REGEX
See the Java Pattern documentation for more information on the regular expression syntax.
CONTENT_TYPE = (
TYPE = REGEX
[ PATTERN = '<pattern>' ]
[ MULTILINE = { TRUE | FALSE } ]
[ INFER_TYPES = { TRUE | FALSE } ]
)
PATTERN
Type: text
(Optional) The pattern to match against the input. Named groups are extracted from the data.
MULTILINE
Type: Boolean
Default: false
(Optional) When true, the pattern is matched against the whole input. When false, it is matched against each line of the input.
INFER_TYPES
Type: Boolean
Default: false
(Optional) When true, each column's data type is inferred. When false, all data is treated as a string.
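As an illustration, the sketch below parses hypothetical log lines such as "ERROR disk full" into two fields. It assumes Java's named-group syntax, (?<name>...), and the group names level and message are made up for the example.

```sql
-- Each line is matched separately (MULTILINE = FALSE).
-- The named groups 'level' and 'message' are hypothetical; named groups
-- are what get extracted from the data.
CONTENT_TYPE = (
    TYPE = REGEX
    PATTERN = '(?<level>[A-Z]+) (?<message>.*)'
    MULTILINE = FALSE
    INFER_TYPES = FALSE
)
```

The pattern avoids backslash escapes on purpose, since this section does not describe how backslashes inside the quoted string are handled.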
SPLIT_LINES
CONTENT_TYPE = (
TYPE = SPLIT_LINES
PATTERN = '<pattern>'
)
PATTERN
Type: text
(Optional) A regular expression pattern to split the data by. If left empty, the data is split by lines.
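For instance, a hypothetical source that separates records with semicolons rather than line breaks might be configured as in this sketch. The separator is an assumed sample value.

```sql
-- Split the input wherever a semicolon appears.
-- ';' is an illustrative separator for a hypothetical source.
CONTENT_TYPE = (
    TYPE = SPLIT_LINES
    PATTERN = ';'
)
```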
XML
CONTENT_TYPE = (
TYPE = XML
[ STORE_ROOT_AS_STRING = { TRUE | FALSE } ]
)
STORE_ROOT_AS_STRING
Type: Boolean
Default: false
(Optional) When true, a copy of the XML is stored as a string in an additional column.
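A minimal sketch that turns this option on, for cases where the raw document should be retained alongside the parsed fields:

```sql
-- Parse XML and keep a string copy of the document in an additional column.
CONTENT_TYPE = (
    TYPE = XML
    STORE_ROOT_AS_STRING = TRUE
)
```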