Change log
Change log for Upsolver Classic (app.upsolver.com)
Bug Fixes
- Apache Kafka / Amazon Kinesis output: when editing an Apache Kafka or Amazon Kinesis output or changing the number of shards, the new version now waits until the previous version has completed
Enhancements
- Snowflake Output: changed the intermediate format from Avro to JSON. This change improves performance when writing to Snowflake and fixes an issue when writing to a column of type VARIANT with sub-fields whose names contain special characters
Bug Fixes
- Minor bug fixes
- Performance improvements when writing Parquet files
Enhancements
- PostgreSQL CDC:
- Tables that aren't included in the publication will not be part of the snapshot
- Support added for il-central-1 region. This region is currently only supported with private VPC deployments
- Elasticsearch Jobs:
- Write timestamp and date types as ISO-8601 strings in jobs that write to Elasticsearch
- Reduced the number of Amazon S3 API calls to lower S3 costs
Bug Fixes
- Minor bug fixes
Enhancements
- Write Timestamp and Date types as ISO-8601 strings in Elasticsearch output
- Performance improvement: reduced the number of file operations when coordinating future table operations
Bug Fixes
- Minor bug fixes
Enhancements
- Write Timestamp and Date types as ISO-8601 strings in string outputs, for example: Amazon S3 output with JSON/CSV format
- Write Timestamp and Date types as ISO-8601 strings in the RECORD_TO_JSON function
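To illustrate the serialization change above, here is a minimal sketch; the stream name, field names, and exact values are hypothetical:

```sql
-- Hypothetical illustration of the ISO-8601 serialization change.
-- "my_kafka_stream" and "data.event_time" are placeholder names.
SELECT RECORD_TO_JSON(data) AS payload
FROM my_kafka_stream;

-- A Timestamp field such as data.event_time is now emitted as an ISO-8601 string,
-- e.g. {"event_time": "2023-07-30T14:25:00.000Z"},
-- rather than a non-string representation such as an epoch number.
```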
Bug Fixes
- Performance improvements in CDC data sources
- Minor bug fixes
Enhancements
- Improved the performance of CDC jobs reading from databases with a large number of tables
- Upgraded Avro and Parquet libraries to the latest versions
Bug Fixes
- Fixed the SQL parser to parse the LOG function and the DECAYED_SUM aggregation
- Minor bug fixes
Bug Fixes
- Minor bug fixes
Enhancements
- Cluster version appears in the UI on the clusters page
Bug Fixes
- Minor bug fixes
Enhancements
- Updated the Snowflake JDBC driver to version 3.13.33
Bug Fixes
- Fixed a UI error, "The client couldn't connect to the API cluster."
- Minor bug fixes
Bug Fixes
- Minor bug fixes
Bug Fixes
- Minor bug fixes
Bug Fixes
- Minor bug fixes and improvements
Enhancements
- New UUID() function returns a unique identifier (UUID) string
- Upgraded Debezium version from 2.1.3 to 2.2.1
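A minimal usage sketch of the new function; the stream and field names below are hypothetical:

```sql
-- Hypothetical usage: attach a unique identifier to every output row.
SELECT UUID() AS event_id,   -- e.g. '3f8a6c1e-9b2d-4e7a-8c51-0d2f6b1a9e47'
       data.user_id
FROM my_stream;
```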
Bug Fixes
- Fixed the conversion of float to double to preserve the perceived semantic value in CDC sources and in data sources that ingest Avro or Parquet files
- Minor bug fixes
Enhancements
- Added new headers in Data Sources: parser_shard_number and parser_row_number
Bug Fixes
- Fixed a bug reading Avro and Parquet files that caused fields of type Date to be ignored
- Minor bug fixes
Bug Fixes
- Fixed an issue reading from empty Kafka topics that contain empty partitions
- Fixed a bug reading Avro files that use a named type more than once
- Minor bug fixes
Bug Fixes
- Snowflake Merge Jobs: enforce the ON clause expression to prevent creating an array
- Minor bug fixes
Bug Fixes
- Minor bug fixes
Enhancements
- CDC: PostgreSQL with partitioned tables - expose the data.full_partition_table_name field specifying the name of the event's original partition
- UI performance improvements
Bug Fixes
- CASE WHEN now handles NULL as input and returns the ELSE value
- CDC: Fixed a bug that caused ingested decimal type columns to be converted to binary base64 strings
Enhancements
- [BREAKING CHANGE] GET_SHARD_NUMBER function no longer requires arguments
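A hedged before/after sketch of the call-site change; the argument shown in the old form is an assumption, not the documented previous signature:

```sql
-- Before (assumed previous form, which required an argument):
-- SELECT GET_SHARD_NUMBER(data.some_key) AS shard FROM my_stream;

-- From this release on, the function can be called with no arguments:
SELECT GET_SHARD_NUMBER() AS shard
FROM my_stream;
```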
Bug Fixes
- Minor bug fixes
Enhancements
- Validate that the first parameter in an ARRAY_JOIN call is not a literal
Bug Fixes
- Parquet Files are now distributed more evenly when ingesting data from Amazon S3 with high execution parallelism
Enhancements
- Snowflake: Added query tag to queries executed by Upsolver for easier cost tracking
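On the Snowflake side, queries carrying this tag can be located through the standard QUERY_HISTORY view; the tag value used in the filter below is an assumption, so check an actual Upsolver-issued query first:

```sql
-- Snowflake-side sketch: find queries issued by Upsolver using their query tag.
-- The '%upsolver%' pattern is assumed; adjust it to the tag Upsolver actually sets.
SELECT query_id, query_text, warehouse_name, total_elapsed_time
FROM snowflake.account_usage.query_history
WHERE query_tag ILIKE '%upsolver%'
ORDER BY start_time DESC
LIMIT 100;
```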
Bug Fixes
- Minor bug fixes
Bug Fixes
- Minor bug fixes
- CDC PostgreSQL: Fixed a bug that caused the replication slot not to be deleted when deleting the Data Source
- Athena Output: Filter out rows when the partition field value is an empty string (Partition cannot be an empty string)
Bug Fixes
- Fixed an issue collecting field statistics and metadata for large data files with a large number of unique field names
Enhancements
- Delete intermediate files after copy to Redshift
Bug Fixes
- JDBC Data Source: Fixed a bug that would not close the JDBC connection in some situations when using fullLoadInterval
- AvroRegistry content type: Support URL encoded authentication information
- Snowflake: Support keeping old values on partial updates
- Upgrade Debezium to version 2.1.3
Bug Fixes
- JDBC Outputs: delete intermediate files after being written to the database
- Revert Debezium to version 1.4
Bug Fixes
- JDBC Outputs: delete intermediate files after being written to the database
Bug Fixes
- Fixed Kafka batcher tasks getting stuck when reading with a wildcard topic and deleting all the topics in Kafka
Enhancements
- Upgrade Debezium to V2.1.2
- Add Debezium version header
- Fixed an issue where creating a Kafka Data Source with a glob pattern that doesn't match any topics would cause the API not to respond
- Memory allocation optimizations in Lookup Table Query servers
Bug Fixes
- Fixed memory leak on Elasticsearch outputs
- Minor bug fixes
Bug Fixes
- Fixed a rare issue that can cause duplicate data to be loaded into Redshift after copy failures
- Fixed an issue where discovering a new partition / topic without any messages would cause Kafka / Kinesis Data Sources to hang until a message arrived.
- Fixed an issue where creating a Kafka Data Source over a high number of topics would cause a CPU spike in the API
Enhancements
- Use regional STS endpoints if available
- Bug Fixes
- Minor bug fixes
- Bug Fixes
- Fixed bucket region detection when using an Amazon S3 Private VPC endpoint
- API: Fixed a bug that caused running a new output with a Lookup to fail when using a full history snapshot
- Fixed an edge case that could cause data loss when editing a stopped Athena output
- Enhancements
- Outputs: support window size override in non-aggregated outputs
- Bug Fixes
- Monitoring: Fixed the 'operation_name' of aggregation steps to be the original 'operation_name' instead of "Output Aggregation". This means metrics reported via Monitoring Reports will now show aggregation step information under the correct 'operation_name'
- Unsynchronized data sources no longer fail if they can't construct their consumers
- Bug Fixes
- SQL: Improved error messages and auto completion
- Enhancements
- Performance and memory improvements
- Bug Fixes:
- API: Prevent changing the end execution time for old output versions
- API: Added validation to prevent creating Cloud Storage outputs with a date format that is not refined enough to include the Output Interval
- Improved performance of Python UDF validations when uploading a new UDF
- Fixed slow replay progress for Snowflake and PostgreSQL outputs
- Enhancements
- Added RAND function and added overload to RANDOM function that gets no arguments and returns a value between 0 and 1
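A minimal usage sketch; the stream name is hypothetical:

```sql
-- Hypothetical usage: both calls return a pseudo-random value between 0 and 1.
SELECT RAND()   AS sample_a,
       RANDOM() AS sample_b
FROM my_stream;
```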
- BREAKING CHANGE
- Hive Metastore (Athena) Output: When using SELECT * with partition fields, if there is a field in the source that is mapped to the partition field column, the field won't be written to the Parquet files because this value can't be queried
- Enhancements
- Kafka Data Sources: Support unsynced mode, which allows the stream to continue processing even when there are errors or a backlog from the topic
- Add presto-compliant RANDOM() and RAND() functions
- We now support clusters that mix both Intel-based (e.g. r6i, r6a) and ARM-based types (e.g. r6g) within the same Elastigroup
- Bug Fixes
- Fixed a deadlock between the indexing task and the index entry deletion task, which could end up waiting for each other when modifying an Athena output's data
- When deploying clusters to a region, we now filter out instance types that don't exist in that region
- Hive Metastore (Athena) Output: statistics are no longer calculated for rows that were filtered out due to a missing partition field value. Previously, if a row was filtered out because the partition field value was missing or null, the row was still counted in Output Fields statistics and in Events over Time graphs.
- Improved recovery mechanism when our configuration database is unavailable
- Enhancements
- Avoid out-of-order records per key in Kinesis outputs by sending the same key only once within the same PutRecords request.
- Improved performance of server boot time and memory usage for organizations that use a high number of shards.
- Bug Fixes
- Fixed stack overflow in JDBC Data Source in some cases
- API: Fixed a bug in generating the SQL statement when the SQL is not in sync with the output's definition
- Enhancements
- Minor performance improvements in data processing critical path
- Improved performance of server boot and periodic configuration load; this might improve the reliability and performance of data flow for organizations and clusters that have many processing entities
- Bug Fixes
- Fixed a bug where Compactions would stop working when advancing the "End Execution At" property of the Hive Metastore Output after it has arrived (now > End Execution At).
- API: Added validation to prevent creating connections with empty names
- Minor Performance Enhancements
- Bug Fixes
- API: Fixed an issue that caused the SQL statement to be invalid after changing data source of an output
- Fixed an issue when mapping a numeric field to an upsert column of type string in JDBC outputs (Redshift, Snowflake, ...)
- Fixed a rare bug where an internal metadata index would stop progressing, preventing compactions from occurring.
- Enhancements
- The Elasticsearch client version was upgraded from 6.x to 7.x in order to also support Elasticsearch 7 & 8 as output targets
- Performance enhancements for clusters with a lot of tasks (more to come in the future)
- Snowflake Output: Support writing to Transient Tables
- Kafka Data Sources: Added an option to restart reading partition when the end offset of that partition is larger than the last offset read by the Data Source for the same partition. This should allow users to reset partitions.
- Bug Fixes
- Hive Metastore Output: performance improvements on calculating partition compaction trigger
- Fixed a bug where Outputs with IS_DUPLICATE with big window sizes wouldn't be considered as completed
- Fixed a bug where Outputs that depend on an Upsolver Output would run with a Runtime Delay based on the maximum Runtime Delay of all the versions of the Upsolver Output, the new behaviour will skip completed versions
- Upsolver Query (Table output) was previously visible in the UI. It is now only available via SQLake.
- [BREAKING CHANGE] Simple S3 Data Source: changed the value of the time field to be the beginning of the minute instead of the end of the minute. This change will be applied only on new data sources
- Enhancements
- More informative errors when missing access to S3 resources
- Bug Fixes:
- API: Fixed being able to create a Kafka input with an invalid storage connection
- API: ModifyServerFile changeset now adds file if not exists
- New Features:
- Compression: Add ZStandard
- Redshift Output: Support authentication with IAM
- Redshift Output: Support Super type
- Roles Anywhere - Hide internal access/secret keys for SOC 2
- Enhancements
- Upgraded Kafka Client to Version 3.2.0
- Upgraded Redshift to Version 2.1.0.9
- Improved the reliability of the connection between User Clusters and the Configuration Database
- Performance improvements in the Compaction Coordinator in Athena Outputs
- Improved error messages
- Increased the maximum number of shards, output shards and compaction shards in outputs to 512
Bug Fixes
- Simple Cloud Storage Input: Improvements to file discovery
Enhancements
- Athena Outputs: Enabled partition column types other than string
- Performance improvements
Changes in this Release:
- API: The return value of shards and related fields changed from number to struct. The struct contains executionParallelism which represents the old number. Customers using API endpoints related to data sources, lookup tables or outputs may need to update their code. Please contact our support for details.
Bug Fixes
- SQL
- Compute Cluster: Fixed a bug that would cause the Compute cluster, in rare cases, not
- Monitoring
- API
- API: Fixed a race condition that prevented multiple concurrent requests to
- Snowflake Output: Fixed a bug when writing values to DATE columns
- CDC: Fixed a bug that failed to write data which was larger than 2GB
Enhancements
- Functions
- Python
- CDC
- AWS VPC integration: Validated subnet IDs in Existing AWS VPC integration
- Athena Output: Non-string partition columns now supported
Bug Fixes
- Show scaling policy in the Cluster page.
- Wurfl User Agent: fixed a bug that appeared when there was more than one wurfl file in the organization.
- Fixed a bug that caused the metrics to stop being reported to external monitoring systems (Datadog / Influx).
- Deprecated SPLIT, CONCAT and DATE_DIFF functions and introduced new functions:
- SPLIT: SPLIT_DELIMITER_FIRST & PRESTO_SPLIT
- CONCAT: ARRAY_JOIN & PRESTO_CONCAT
- DATE_DIFF: DATE_DIFF_PRECISE & PRESTO_DATE_DIFF
Enhancements
- Added function LN.
- DATE_DIFF function now supports dynamic units.
- LIKE operation now supports getting another field as a pattern.
Recently Implemented Changes (Currently Enabled)
As part of Upsolver's effort to adopt industry standards, we are gradually changing functions to be more Presto compatible. The functions that changed are CONCAT, SPLIT and DATE_DIFF.
CONCAT, SPLIT and DATE_DIFF are being deprecated. Henceforth, SQL statements that use CONCAT, SPLIT and DATE_DIFF will include a warning message when executed. This behavior is designed to draw attention to the changes. Currently running outputs are NOT affected by these changes.
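As a hedged illustration of the migration, here is a sketch that uses the Presto-compatible replacements; the field references and the exact signatures are assumptions based on Presto's conventions, so verify them against the function reference before updating outputs:

```sql
-- Hedged sketch using the Presto-compatible replacements; names and signatures are assumptions.
SELECT PRESTO_SPLIT(data.csv_line, ',')                         AS parts,         -- replaces SPLIT
       PRESTO_CONCAT(data.first_name, ' ', data.last_name)      AS full_name,     -- replaces CONCAT
       PRESTO_DATE_DIFF('day', data.start_ts, data.end_ts)      AS days_between   -- replaces DATE_DIFF
FROM my_stream;
```

SPLIT_DELIMITER_FIRST and DATE_DIFF_PRECISE are the alternatives listed above that stay closer to the old behavior; their signatures are not shown here.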
The change log summary:
Important: All information in this table, including planned versions and dates, is subject to change; the information is provided only as a guideline for updates you may make in the future.
Enabled by default in February 2022
SQL Changes - Commands & Functions
| Behavior Change | Additional Notes |
| --- | --- |
- Bug Fixes
- MySQL Output: Fixed bug with boolean fields that were not written as expected.
- Redshift Output: Fixed race condition in upsert tables that could cause rows not to get deleted in rare cases.
- SQL:
- Improved SQL editor responsiveness.
- Fixed a bug in SQL parsing.
- Fixed an exception arising when using infix operations.
- Fixed join/match expressions not working correctly with >3 terms.
- API:
- Fixed an issue with distinct data sources that had the same name.
- Prevented "SPLIT TABLE ON" on non-Athena Outputs.
- Fixed name suggestion in hierarchical Athena outputs.
- Enhancements
- Azure Event Hubs: Support more features.
- Streaming Output: Support setting an upsert key.
- ContentTypes:
- Support null values in TSV.
- Support fixed width content type.
- Oracle Object Storage: Various enhancements.
- SQL: Support for WHERE filter in sub-select expressions.
- S3 Data Source: Don't require AWS integration when creating S3 data source.
- S3 Output: Support bucket-level access control.
- UI: Added various annotations to cluster graphs in the monitoring tab.
- Enhancements
- CSV Content Format: allows repeating header names in files.
- Function changes: the CONCAT function was changed to ARRAY_JOIN.
- ARRAY_JOIN - gets an array of strings and a delimiter and concatenates them.
- CONCAT - now gets multiple arguments and concatenates them (like || in SQL).
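A short hedged sketch of the two behaviors described above; the stream and field references are hypothetical, and the argument order follows the descriptions given here:

```sql
-- Hypothetical usage based on the descriptions above.
SELECT ARRAY_JOIN(data.tags[], ', ')                 AS tag_list,   -- array of strings + delimiter
       CONCAT(data.first_name, ' ', data.last_name)  AS full_name   -- multiple arguments, like || in SQL
FROM my_stream;
```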
- Bug Fixes
- Athena Output: fixed a performance issue when deleting files due to retention.
- Clusters: Show "Additional Processing Units for Replay" only in Compute Clusters.
- Redshift Spectrum: fixed boolean casting when running output with SELECT *
- API: Show thrown errors from Hive Metastore.
- SQL: Fixed a bug when joining with a sub-query.
- Enhancements
- Support dynamic position in ELEMENT_AT function.
- Allow updating the boot script in Clusters.
- Support fixed schema in S3 outputs with Avro format.
- Bug Fixes
- Fixed a bug when reading from multiple topics in Kafka Data Source.
- API - Fixed column name suggester when mapping new fields in Athena Output.
- Bug Fixes
- API
- Fixed a bug with Azure Integration not working in some regions
- Fixed validation when updating Columns Retention in Hive Metastore outputs
- Data Source Page: don't show statistics from the preview when querying on a time range without data
- Show output's fields on outputs with SELECT *
- SQL
- Prevent SQL regeneration when updating duplicate handling (APPEND ON DUPLICATE or REPLACE ON DUPLICATE)
- Added some validation errors when trying to create invalid state
- Backend
- Fixed a bug that caused duplicated rows when editing Hive Metastore output with upserts
- Enhancements
- Monitoring Reporters: Support Graphite
- Hive Metastore Output: support splitting the output by schemas/databases in addition to splitting by table names. For example, if the value of the multi table field is "foo.bar", the "foo" will be the schema/database name, and "bar" will be the table name
- Bug Fixes
- S3 Data Sources Advanced: Fixed a bug with Glob File Name pattern
- Hive Metastore Output: save storage by deleting manifest files after their usage
- Enhancements
- Athena output: create Views with Glue API
- Bug Fixes
- Don't show completed dependencies in Lineage tab
- SELECT * in Hive Metastore Output:
- Return the defined fields first
- Removed the multi table column from the view definitions
- Hive Metastore Output: fixed a bug when editing output with upserts
- API: Allow changing the cluster size on Trial plans
- Enhancements
- Added a new modal and new SQL syntax for the Table Name Suffix Field, which allows you to create multiple tables in Hive Metastore with a single output.
- CDC Data Source (MySQL) - added a Destination part that allows replicating the source database to your data lake
- Qubole Metastore: allow changing the time partition column type to String
- Bug Fixes
- Fixed health check parameters in Query clusters
- Don't show data sources that are being deleted on the main page
- Hive Metastore output: added a cache layer in the Partition Manager that prevents redundant calls to the Metastore
- API: Limit the number of running previews. This should fix high CPU usage of the API when many previews are running at the same time.
- Enhancements
- Support Select * in Redshift Spectrum
- API: Support Select * and Upserts on Preview
- Lookup Table: when running Output with a lookup to a Lookup Table, don't calculate the start/end times of the Lookup Table implicitly but use the original times.
- Bug Fixes
- SAML: Don't regenerate group when changing display name in Upsolver
- Athena Output: fixed bug in Columns Retention
- API: Fixed a bug that caused deleted inputs to not work
- Snowflake Output: fixed columns casing
- Removed "errors" outputs from outputs with Parquet format (Athena/S3)
- Enhancements
- CDC ingestion is more stable when scaling the cluster
- Previewing an output now considers its upsert definition
- Compactions are now prioritized by urgency and age in order to prevent starvation
- Support epoch time date pattern with prefixes in Cloud Storage Data Sources
- Bug Fixes
- Fixed database name validation in Microsoft SQL Server Connection
- Enhancements
- HiveMetastoreClient: Better SET LOCATION method
- Enhancements
- Elasticsearch Output: Support Upsert Keys
- CDC: Support Column Exclude List
- Added SHA512 and SHA3_512 functions
- Bug Fixes
- S3 Connection with SQS now works with paths that end with a slash
- Enhancements
- Added FROM_UNIXTIME function
- Qubole Output: added an option to support changing column types
- Hive Metastore Outputs: trigger more than one compaction if there is a backlog
- Upsolver Output: support new field type: JSON. This type will be extracted when used as an Upsolver Data Source
- CSV Content Format: support custom quote escape char
- When duplicating output, copy the workspaces from the previous output
- Bug Fixes
- Fixed memory leak in External Hive Metastore outputs
- Enhancements
- Added External Hive Metastore to the output types list
- Support SELECT * on External Hive Metastore when querying with PrestoDB and SparkSQL
- Reference Data can now be deleted once no output is using it (i.e. the output was deleted, or completed and was edited)
- Reference Data can't be created with the same name as another Reference Data or Lookup Table
- Enhancements
- Kafka Output - Allow ignoring messages that are too large (according to broker and producer settings)
- Streaming Data Sources (Kafka, Kinesis, EventHubs) - Allow deleting offsets metadata files
- API - Performance enhancements when updating Outputs / Lookup Tables
- Bug Fixes
- Hive Metastore: Fixed bug with SELECT *
- Features
- Support MAX/MIN aggregations on more data types
- Support <,<=,>,>= on timestamps
- Features
- Support SELECT * in Hive Metastore Outputs; this will update the table definition every time a new field arrives
- Oracle Object Storage support
- Bug Fixes
- Aggregation calculated fields now work in SQL mode
- Features
- CDC (Change Data Capture) Data Sources
- Dremio and PrestoDB Outputs
- Stop/Start Data Sources
- Enhancements
- Allow setting Lazy Load on Lookup Tables using the Properties tab
- Update base AMI image in AWS to Amazon Linux 2
- Bug Fixes
- Data Lake Output: Filter out partitions that were deleted due to retention compaction
- Features
- Hive Metastore: Allow creating an Output to External Hive Metastore
- Enhancements
- Lower latencies between dependencies in Compute Cluster
- Features
- Ahana Output
- Starburst Output
- Enhancements
- Redshift: Allow inserting 'now' into date / time fields in order to set a column to the insertion time
- Bug Fixes
- Kinesis Stream autocomplete now filters out Upsolver internal streams
- Fixed a bug in S3 IAM policy generation when the path ends with a slash
- Avro Schema Registry: Don't treat HTTP errors as parse errors
- SQL Parser: Don't regenerate the SQL when there is an expression that returns boolean with extra parentheses
- Support Real Time Kafka Output - Support running Kafka Outputs on the Real Time cluster with ms latency
- Hive Metastore Output with Upserts - fixed a bug that caused the compaction process to get stuck after edit
- Hive Metastore Output with Upserts - support number as an upsert key
- Lookup Tables: fixed a bug when using sharded lookup tables in outputs
- API: show the current capacity when clicking Update Capacity button on Clusters page
- API: fixed wrong validation on Kafka Outputs (support numbers on topic names)
- Microsoft SQL Server Output: fixed create statement when primary key is empty
- API: fixed a bug when removing mapping of fields
- S3 Data Source with Parquet Content Format - split files by 200MB
- Lookup Table - support compaction shards on lookup tables with multiple windows
- SQL - fixed a bug generating the SQL when "Is Delete Field" is mapped to a column
- Monitoring: Added three metrics to Hive Metastore Outputs:
- partitions-delay - The delay between now and the last partition time
- data-loading-delay - The delay on loading data to the metastore
- partitions-count - The number of partitions in the table
- IS_DUPLICATE and Lookup from Data Sources: Don't omit key columns for new versions
- Avro: Fixed escaping of [] in array namespaces
- Fixed a bug in Snowflake Output with VARIANT column output with arrays
- Azure: Support billing SaaS offering
- DNS: Ability to sync Route53 records with private IP addresses for customers with their own Spotinst account
- SSO/bugfix: attach endpoints don't have permissions
- Partners: Support exporting logs and monitoring to external domain
- Free Plan: Support upgrading account
- Snowflake Output: Configurable DbDecimal
- CSV Content Type: Don't ignore values starting with #
- SQL: Support unmapped columns in JDBC outputs. New mapped columns will be created when deploying the output