How To Re-process Data

Question: Is there a way to re-process data that's already been processed?

Yes. To re-process data that has already been processed, edit the output and run it from a specific start time until an optional end time. A new version then appears in the version tab and writes the re-processed data. Depending on the output definition, this data either replaces the old data through an upsert or is appended to it.
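The difference between the two behaviors can be illustrated with a minimal Python sketch. This is a generic illustration of append versus upsert semantics, not Upsolver's implementation; the function names, the `id` key column, and the sample rows are all hypothetical.

```python
from typing import Dict, List

Row = Dict[str, object]


def append_rows(existing: List[Row], reprocessed: List[Row]) -> List[Row]:
    """Append: re-processed rows are added next to the old ones (duplicates possible)."""
    return existing + reprocessed


def upsert_rows(existing: List[Row], reprocessed: List[Row], key: str) -> List[Row]:
    """Upsert: a re-processed row replaces any old row that has the same key value."""
    merged = {row[key]: row for row in existing}
    for row in reprocessed:
        merged[row[key]] = row  # overwrite on key match, insert otherwise
    return list(merged.values())


# Hypothetical sample data: row 2 was wrong and is corrected by the re-processing run.
old = [{"id": 1, "total": 10}, {"id": 2, "total": 20}]
new = [{"id": 2, "total": 25}, {"id": 3, "total": 30}]

appended = append_rows(old, new)
upserted = upsert_rows(old, new, key="id")

print(len(appended))  # 4: the old and the corrected row for id 2 both remain
print(upserted)       # id 2 now carries the corrected total of 25
```

With an append-style output, the corrected row coexists with the old one, so downstream queries may need to deduplicate. With an upsert-style output, the corrected row takes the place of the old one.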

Data Lake Outputs (Athena / Hive Metastore / Glue Catalog)

The process differs slightly for Data Lake outputs, depending on how the table is partitioned. Data Lake outputs create a new folder for each edit of the output. This folder corresponds to the version of the output and is part of the table’s partitioning schema (upsolver_schema_version). These outputs fall into two subtypes:

Table is partitioned by processing time (the `time` field in Upsolver or `$commit_time` in SQLake): When an output is edited, Upsolver updates the table’s partition metadata to point to the data of the new output version. However, since Upsolver can only update the pointers at the partition level, the replay boundaries should align with whole partitions. For example, if the table has daily partitions, correcting data from January 1st to January 3rd (inclusive) requires running the correction from January 1st at 12:00 AM to January 3rd at 11:59 PM. If only part of a partition needs correcting, it is still recommended to replay the entire partition to keep things simple. If that is not possible, contact Upsolver support for guidance.
Table is partitioned by event time: In this case, it is up to the user to determine which processing times need to be replayed, since data is distributed to different partitions based on the actual event time present in the data. The “Excluded Partitions” setting in the Output Properties can be used to help reduce duplicate data for partitions that should be mostly or entirely reprocessed by the correction.
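For a table with daily partitions by processing time, the replay boundaries can be worked out as in the following Python sketch. It is only an illustration of the arithmetic, assuming daily partitions; the function name and the year 2023 are hypothetical and do not come from Upsolver.

```python
from datetime import date, datetime, time
from typing import Tuple


def daily_replay_window(first_day: date, last_day: date) -> Tuple[datetime, datetime]:
    """Return (start, end) covering whole daily partitions, inclusive of both days."""
    if last_day < first_day:
        raise ValueError("last_day must not precede first_day")
    start = datetime.combine(first_day, time.min)   # 12:00 AM on the first day
    end = datetime.combine(last_day, time(23, 59))  # 11:59 PM on the last day
    return start, end


def widen_to_daily_partitions(start: datetime, end: datetime) -> Tuple[datetime, datetime]:
    """Expand an arbitrary correction window so it covers entire daily partitions."""
    return daily_replay_window(start.date(), end.date())


# Correct data from January 1st to January 3rd (inclusive); the year is arbitrary.
start, end = daily_replay_window(date(2023, 1, 1), date(2023, 1, 3))
print(start.isoformat(), end.isoformat())
# 2023-01-01T00:00:00 2023-01-03T23:59:00
```

If the correction only needs part of a day, `widen_to_daily_partitions` rounds the window outward to whole days, which matches the recommendation to replay the entire partition.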

If the data is no longer available in Upsolver for re-processing due to retention policies, it must be re-ingested into a new data source, and the edited output must refer to that data source.


Last updated 2 years ago
