This page contains a directory of Upsolver FAQs.
Yes — when you’re buying Upsolver from the AWS marketplace, you’re actually adding Upsolver to your AWS bill.
Through the marketplace, you can choose to purchase Upsolver units on-demand and pay on a monthly or yearly basis.
We also offer reduced pricing for annual contracts. After contacting us, you can purchase these through the marketplace as well, and they are charged as part of your AWS bill.
Upsolver uses a compute-based model where pricing is based on Upsolver’s usage of EC2 servers.
To reduce costs, Upsolver uses EC2 spot instances under the hood. You pay for the price of the spot instance with an additional markup for Upsolver’s software, but you only pay for actual data being processed by Upsolver.
From time to time, you may notice instances in your account with a dry run suffix. These are used by Upsolver to improve the quality of our next release. Customer data is not affected in any way, nor is the customer billed in the form of Upsolver credits for these instances.
You can start a free trial of Upsolver, either on upsolver.com or through the AWS marketplace. It lasts for 14 days and includes extensive support from the Upsolver tech team, with both hands-on training and technical consultation.
We help you define the use cases and how to best implement them, and we make sure that you’re getting something that works on a production scale before you decide to spend a single dollar.
For ongoing support, we are available via in-app chat, Slack, and video calls as needed. We also provide 24/7 phone response for critical issues, based on agreed-upon metrics which we continually monitor.
Upsolver supports additional databases as sources using Amazon’s Data Migration Service (DMS); Upsolver can read the inputs generated by DMS, write them to S3, and create corresponding tables in Athena or use them for other ETL pipelines.
As such, Upsolver can support any of the databases supported by DMS.
Users can define retention policies for every object in Upsolver, whether it’s a data source or an output, and Upsolver will automatically delete data after that period.
Before deleting, Upsolver will also check to see if the files are needed for any ETL process; this minimizes errors in comparison to manually deleting folders on S3.
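As a rough illustration of the retention check described above, the following Python sketch selects for deletion only those files that are past the retention period and not needed by any ETL process. The names `StoredFile`, `needed_by`, and `expired_files` are hypothetical and are not part of Upsolver's API; this is not how Upsolver is implemented internally.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class StoredFile:
    path: str
    created: datetime
    # Hypothetical: names of ETL processes that still read this file.
    needed_by: set = field(default_factory=set)


def expired_files(files, retention: timedelta, now: datetime):
    """Return files older than the retention period that no ETL process still needs."""
    cutoff = now - retention
    return [f for f in files if f.created < cutoff and not f.needed_by]


now = datetime(2024, 1, 31)
files = [
    StoredFile("raw/2024-01-01.avro", datetime(2024, 1, 1)),
    StoredFile("raw/2024-01-02.avro", datetime(2024, 1, 2), {"hourly_output"}),
    StoredFile("raw/2024-01-30.avro", datetime(2024, 1, 30)),
]
to_delete = expired_files(files, timedelta(days=7), now)
print([f.path for f in to_delete])
```

In this example, the second file is old enough to delete but is kept because an ETL process still depends on it.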
Upsolver supports batch data loads but it is built on a streaming-first architecture.
For example, if you’re looking to output data once an hour and then query the aggregated data to reduce costs and latency, Upsolver can run that batch operation. Under the hood, it uses stream and micro-batch processing to process events one at a time and uses indexing to join with historical data.
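To make the idea of an hourly output built from event-at-a-time processing more concrete, here is a minimal Python sketch. It assumes each event carries a `time` and an `amount` field (both hypothetical) and folds events into per-hour totals as they arrive. It illustrates the general technique only, not Upsolver's engine.

```python
from collections import defaultdict
from datetime import datetime


def hourly_totals(events):
    """Fold events one at a time into per-hour running totals."""
    totals = defaultdict(float)
    for event in events:
        # Truncate the event time to the start of its hour.
        hour = event["time"].replace(minute=0, second=0, microsecond=0)
        totals[hour] += event["amount"]
    return dict(totals)


events = [
    {"time": datetime(2024, 1, 1, 9, 5), "amount": 10.0},
    {"time": datetime(2024, 1, 1, 9, 40), "amount": 5.0},
    {"time": datetime(2024, 1, 1, 10, 1), "amount": 2.5},
]
totals = hourly_totals(events)
print(totals)
```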
Upsolver’s native connectors to Kafka and HDFS allow you to ingest data from on-premises deployments into the cloud, where the data processing is currently done.
Upsolver’s S3-to-S3 architecture ensures exactly-once processing, meaning that there will be no duplicate or missing data.
In the Upsolver architecture, storage and compute are decoupled. Upsolver handles increases in message volume by scaling out the compute cluster.
You can choose a scaling strategy to keep consistent low latency.
Yes, Upsolver allows you to alter existing tables in Athena, including adding new columns to a table that’s already in use. These changes will take effect both proactively and retroactively, depending on the timestamp you choose.
Yes, using the Upsolver UI, you can define filters that will apply to that output, based on a field (or fields) from the event stream.
For highly complex parsing, you should use the Upsolver API.
The source will only be read a single time.
Data is written to S3 and then distributed to multiple outputs. We do this because keeping a copy of the raw data on S3 is cheaper than Kinesis/Kafka, allows the data retention to be longer, and leaves no risk that one output will cause slowdowns in other outputs.
Yes. For Snowflake, Upsolver will store the data on Amazon S3 and Snowflake will read from there.
Storage and compute are decoupled. S3 is used for storage and EC2 Spot instances are used for compute. Scaling is linear since local disks are not used at all.
Upsolver manages the cluster remotely, including troubleshooting, version updates, scaling, and monitoring.
Our documentation and Support team are always available to get you started and assist along the way. We also provide an on-demand online training course. For more information, please contact our team.
Upsolver is a cloud-based SaaS; as such, the Upsolver application is updated from the cloud periodically.
The process is gradual and the service remains available during the entire update process with zero downtime.
Yes. Upsolver’s monitoring solution enables sending pre-configured metrics to your existing monitoring system. Supported monitoring systems are Datadog, Amazon CloudWatch, InfluxDB, Elasticsearch, SignalFx, and others per request.
Upsolver runs in the cloud. We currently support Amazon Web Services and we will soon add support for Microsoft Azure.
Yes. Please contact the Upsolver Support team so we can customize a solution for this use-case for you.
Upsolver supports data ingestion from several types of sources such as databases, streaming services (e.g. Amazon Kinesis, Apache Kafka), and object store services (e.g. Amazon S3, Google Cloud Storage).
Upsolver supports sending your data to various types of destinations such as distributed SQL query engines (e.g. Amazon Athena), databases and data warehouses, streaming services (e.g. Amazon Kinesis, Apache Kafka), and object store services (e.g. Amazon S3, Google Cloud Storage).
Upserts add the ability to upsert and delete events in data lakes; they are only supported for some of the data outputs.
Upsolver enables you to configure two types of keys in the output: upsert and delete.
Upsolver uses these keys to perform update and delete operations on the output. They can be defined on existing fields in the data or on new calculated fields added to your data.
Upsert keys are used by Upsolver to keep only the latest event per upsert key, while events with the value `true` in their deletion key field are deleted.
Both upsert and delete operations take place during Upsolver’s compaction process.
Compaction is a process in which small files are merged into one bigger file for improved performance.
1. Upsolver ingests the raw data partitioned to Amazon S3.
2. When compaction takes place, only the last events per upsert key are kept. Events marked for deletion based on the delete key are deleted.
3. The final table in Athena points to a view that “unions” both the insert and update partitions, so it contains the latest data per upsert key and excludes events that have the value `true` in their delete key field. This view is necessary; without it, upsert and delete operations would not be reflected in real time until the compaction process ends.
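The upsert and delete semantics described in the steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Upsolver's implementation: events are assumed to arrive in time order, and the field names `user_id` and `is_deleted` are hypothetical.

```python
def compact(events, upsert_key, delete_key):
    """Keep the latest event per upsert key, then drop events flagged for deletion."""
    latest = {}
    for event in events:  # events are assumed to arrive in time order
        latest[event[upsert_key]] = event
    return [e for e in latest.values() if not e.get(delete_key, False)]


events = [
    {"user_id": 1, "plan": "free", "is_deleted": False},
    {"user_id": 2, "plan": "free", "is_deleted": False},
    {"user_id": 1, "plan": "pro", "is_deleted": False},
    {"user_id": 2, "plan": "free", "is_deleted": True},
]
compacted = compact(events, upsert_key="user_id", delete_key="is_deleted")
print(compacted)
```

User 1 survives with its latest `plan` value, while user 2 is removed because its most recent event has the delete key set to `true`.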
Read more on compaction on this blog post:
Full traceability (Event Sourcing) is built into the platform.
Upsolver’s architecture follows event sourcing principles and is based on an immutable log of all incoming events. These events are then processed with Upsolver ETL to create a queryable copy of the data.
Unlike databases where the state constantly changes (making it difficult to reproduce its original state without configuring a change-log), in Upsolver you can always "go back in time" and retrace your steps to learn about the exact transformation applied on your raw data, down to the event level.
You can fix a bug in your ETL and then run it using the immutable copy of your raw data.
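The following Python sketch shows the general event sourcing pattern behind this: the raw log is never modified, and the queryable output is rebuilt by re-running a corrected transformation over it. The functions `etl_v1`, `etl_v2`, and `replay`, as well as the event fields, are hypothetical names chosen for illustration.

```python
# Immutable log of raw events (a tuple, so entries cannot be added or removed in place).
RAW_LOG = (
    {"order_id": "a", "amount_cents": 1250},
    {"order_id": "b", "amount_cents": 300},
)


def etl_v1(event):
    # Buggy transformation: integer division drops the cents.
    return {"order_id": event["order_id"], "amount": event["amount_cents"] // 100}


def etl_v2(event):
    # Fixed transformation: keeps the fractional part.
    return {"order_id": event["order_id"], "amount": event["amount_cents"] / 100}


def replay(log, etl):
    """Rebuild the queryable output from scratch by re-running the ETL over the log."""
    return [etl(event) for event in log]


before = replay(RAW_LOG, etl_v1)
after = replay(RAW_LOG, etl_v2)
print(before)
print(after)
```

Because the raw log is untouched, the fix only requires a replay with the new transformation.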
Yes. Please contact the Upsolver Support team so we can customize a solution for this use-case for you.
Yes! Each Kinesis Stream will be a data source in Upsolver, and you can join any two data sources together: Kinesis-Kafka, Kinesis-Kinesis, Kinesis-S3, S3-S3. Any combination would work out of the box.
With Upsolver, this would be done using SQL; when you write your query, you define the schema that you’re going to create in Athena. This can also be done via the visual UI which allows you to select fields within your data sources to populate Athena tables.
The historical JSON files are batched together and kept in compressed Avro for higher performance and lower cost of storage. Access to historical data is available via the Replay feature.
Yes. Upsolver stores all metadata in the Glue Data Catalog so that once you’ve created a table in Athena, it can also be immediately accessed in Redshift Spectrum or Presto over EMR, which also read metadata from Glue.
Using Upsolver, your data should be available in Athena within 5 minutes of appearing in Kafka; sometimes it may be even faster and appear in just 2-3 minutes.
Upsolver offers unique end-to-end integration with Amazon Athena. Tables are created via the Glue Data Catalog, and Upsolver will:
- optimize S3 storage for performance
- make data available in Athena in near real-time
- add ability to define updatable tables in Athena (for CDC)
- add option to edit tables
- add historical replay / time-travel
Upsolver continuously optimizes your S3 storage to ensure high query performance in Athena.
We start with 1-minute Parquet files (for latency reasons) and compact the files into bigger files for performance. Upsolver will keep the table data consistent using the Glue Data Catalog.
Yes. Upsolver provides built-in CDC connectors that allow users to stream CDC data from databases to their data lake.
Both! When Upsolver connects to data sources such as Apache Kafka, it serializes all the data from Kafka into an S3 bucket; after performing transformations, every operation is written back to separate storage on S3.
By creating this architecture that leverages two layers of storage on S3, we can guarantee exactly-once processing without data loss or duplication.
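One common way to get exactly-once results from object storage is to make writes idempotent: each output object gets a key derived from its input, so a retry overwrites the same object instead of adding a duplicate. The Python sketch below illustrates that general idea with an in-memory stand-in for S3. It is an assumption about how such a guarantee can be achieved, not a description of Upsolver's internals, and `FakeObjectStore` and `process_batch` are hypothetical names.

```python
class FakeObjectStore:
    """In-memory stand-in for an object store such as S3 (hypothetical helper)."""

    def __init__(self):
        self.objects = {}

    def put(self, key, value):
        self.objects[key] = value  # writing the same key again overwrites

    def keys(self, prefix):
        return sorted(k for k in self.objects if k.startswith(prefix))


def process_batch(store, batch_key):
    """Transform one raw batch and write the result under a key derived from the input key."""
    raw = store.objects[batch_key]
    output_key = batch_key.replace("raw/", "output/", 1)
    store.put(output_key, [value * 2 for value in raw])


store = FakeObjectStore()
store.put("raw/0001", [1, 2, 3])
process_batch(store, "raw/0001")
process_batch(store, "raw/0001")  # simulated retry after a failure
print(store.keys("output/"), store.objects["output/0001"])
```

Even though the batch is processed twice, the output layer contains a single object with no duplicated records.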
Upsolver uses its own data-processing engine, coded entirely in Scala, and leverages a fully decoupled architecture. This enables all the processing to be done on EC2 without using any local storage, using only S3 for storage.
Upsolver’s indexes are lookup tables, which are also stored on Amazon S3.
These indexes are loaded into memory when you’re actually running the ETL. Thanks to Upsolver’s breakthrough compression technology, you can store much larger indexes in RAM without managing NoSQL database clusters.
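Conceptually, joining a stream against an in-memory lookup table looks like the Python sketch below. The table contents, the `enrich` function, and the field names are hypothetical; the sketch does not reflect Upsolver's compression or storage format.

```python
# Lookup table ("index") built from historical data, keyed by user ID.
user_index = {
    "u1": {"country": "US"},
    "u2": {"country": "DE"},
}


def enrich(event, index):
    """Join one streaming event with the in-memory lookup table."""
    match = index.get(event["user_id"], {})
    # Events with no match get None, similar to a left join.
    return {**event, "country": match.get("country")}


stream = [{"user_id": "u1", "clicks": 3}, {"user_id": "u9", "clicks": 1}]
enriched = [enrich(e, user_index) for e in stream]
print(enriched)
```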
Upsolver handles data streams ordering based on the following rules:
- the ordering within each partition or shard is preserved, as long as the number of output shards is equal to or lower than the number of input shards
- data is read based on the Select statement
- if a `Union` operator is used in the statement, the data sources will be processed based on the order in which they appear within the `Union` (standard ANSI SQL behavior)
Upsolver doesn’t store any customer data.
Upsolver stores all the data it processes on an S3 bucket on the customer accounts. When Upsolver is deployed on private VPC, even Upsolver employees don’t have access to the data.
The only data sent to Upsolver is billing information and monitoring information to support your deployments remotely. This means there are no issues around compliance, PCI, or PII when using Upsolver.
Upsolver gives you on-premises level data privacy, in the cloud — even Upsolver employees don’t have access to the data (when in private VPC). Users can also implement masking as part of the ETL process.
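As one possible example of masking during a transformation step, the Python sketch below replaces sensitive fields with a salted SHA-256 digest using the standard `hashlib` module. The function names, the salt, and the event fields are hypothetical, and this particular scheme is an assumption rather than Upsolver's built-in masking.

```python
import hashlib


def mask(value: str, salt: str = "example-salt") -> str:
    """Replace a sensitive value with a salted SHA-256 digest (one possible masking scheme)."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()


def mask_fields(event: dict, sensitive: set) -> dict:
    """Return a copy of the event with the sensitive fields masked."""
    return {k: mask(str(v)) if k in sensitive else v for k, v in event.items()}


event = {"email": "jane@example.com", "page": "/pricing"}
masked = mask_fields(event, {"email"})
print(masked)
```

Because the digest is deterministic for a given salt, masked values can still be used as join or grouping keys downstream.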
Upsolver is as secure as your AWS account — it can be deployed in your private VPC, which means that even Upsolver employees will not have access to the data. Alternatively, you can deploy on Upsolver’s VPC on AWS.
You can define read-only users in Upsolver and grant/deny permissions to every object using a similar model to AWS IAM. You can also create separate workspaces to reduce complexity.
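In AWS IAM, access is denied by default and an explicit deny overrides any allow. The Python sketch below shows that evaluation order for a small set of policies. The policy structure, the user and object names, and the assumption that Upsolver evaluates permissions the same way are all hypothetical.

```python
def is_allowed(policies, user, action, obj):
    """IAM-style evaluation: default deny, and an explicit deny overrides any allow."""
    allowed = False
    for p in policies:
        if p["user"] == user and p["object"] == obj and action in p["actions"]:
            if p["effect"] == "deny":
                return False
            allowed = True
    return allowed


policies = [
    {"user": "analyst", "object": "clicks_output", "actions": {"read"}, "effect": "allow"},
    {"user": "analyst", "object": "billing_source", "actions": {"read"}, "effect": "allow"},
    {"user": "analyst", "object": "billing_source", "actions": {"read"}, "effect": "deny"},
]
print(is_allowed(policies, "analyst", "read", "clicks_output"))
```

Here the analyst behaves as a read-only user on `clicks_output`, cannot write to it, and is blocked from `billing_source` because the deny takes precedence over the allow.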
Upsolver provides complete separation between your development and production environments, which can be applied to all the entities configured in your Upsolver account.
When you are done developing and testing your ETL in your development environment, deploying it to production takes just a few clicks. You use the same ETL you already tested and developed in your dev environment and run it on top of your production data streams.
You can back up your ETL code using Upsolver’s Git integration feature. This gives you all the familiar Git capabilities such as source code version management, collaboration, and code ownership.
Upsolver provides a REST API that enables you to manage all of Upsolver’s infrastructure from your code. Any operation available in Upsolver’s UI can also be performed through the API if you prefer.