Transforming Data

Upsolver supports sending data to a variety of data outputs, including Amazon Athena, Redshift Spectrum, Upsolver itself, and more.

Upsolver

Upsolver is the data output for plain ETLs and is useful when you need to transform data and create a new data source from the transformed data. Typically this includes cleaning and sanitizing data, ordering it, aggregating it, and/or splitting it into multiple streams. This sort of output is often created as a data-preparation step: once the new data source exists, the output can be duplicated and its type switched, for example, to Amazon Athena.

Amazon Athena

Athena is a managed query engine over data in Amazon S3, with the data stored as files in S3 blob storage. Athena provides SQL access to the data via configuration in the Glue Data Catalog, a metastore that holds metadata on the location and format of the files.

Amazon Athena is a core data output that enables you to get your data into a data lake and query it cost-effectively. The data is organized, compacted, and stored in an Athena table on Amazon S3, where you can run fast, cheap queries against it. A queryable data lake is a useful step in the process, as it allows you to inspect your data and run ad-hoc queries (using a SQL interface) to understand it before writing to a database, which is more expensive and requires a database to be provisioned. Data is stored in Apache Parquet, a columnar format, and is partitioned, so you query only the data you need.
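As a minimal illustration, a table managed this way can be queried through Athena's API; the sketch below uses Python and boto3, with hypothetical database, table, and bucket names, and polls for completion before fetching results.

    import time
    import boto3

    # Run a query against a Glue-catalogued table via Athena.
    # Database, table, and bucket names are hypothetical.
    athena = boto3.client("athena")

    query_id = athena.start_query_execution(
        QueryString="""
            SELECT user_id, COUNT(*) AS clicks
            FROM web_events                           -- table in the Glue catalog
            WHERE partition_date >= '2023-01-01'      -- prunes partitions first
            GROUP BY user_id
        """,
        QueryExecutionContext={"Database": "analytics"},
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
    )["QueryExecutionId"]

    # Athena is asynchronous: poll until the query reaches a terminal state.
    while True:
        status = athena.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(1)

    if state == "SUCCEEDED":
        rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]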

There are significant cost and speed benefits to using Athena. Athena charges only for the data it scans, at a flat rate per terabyte. Because Parquet is compressed and columnar, far less is scanned than the raw data size: a 100MB compressed Parquet file may hold a gigabyte of raw data, and if you query only one column of a ten-column file, you are charged for scanning roughly 10MB. You can also partition the data by date. The partition appears as a column that you can add to your WHERE clause; Athena resolves this condition first and scans only the relevant partitions, reducing the cost further. For example, suppose you have generated 50TB of raw data over two years, stored as roughly 5TB of compressed Parquet (about 50GB a week). If you only need to query the past 5 weeks, you scan at most 250GB, and if you only need one of the ten columns, roughly 25GB. This means paying cents for a query instead of tens or hundreds of dollars, a reduction in cost that makes data usable that would previously have simply been too expensive to use.
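The arithmetic behind this example can be sketched as follows. The $5-per-terabyte rate is an assumption based on Athena's published list price and should be verified against current pricing for your region.

    # Back-of-the-envelope Athena cost arithmetic for the example above.
    PRICE_PER_TB = 5.00            # USD per TB scanned; assumed list price, verify for your region

    weekly_gb = 50                 # compressed Parquet ingested per week
    total_gb = weekly_gb * 104     # ~2 years of retention: 5,200 GB (~5 TB)

    scanned_gb = weekly_gb * 5     # WHERE clause keeps only the past 5 weeks: 250 GB
    scanned_gb /= 10               # one of ten columns in a columnar format: 25 GB

    full_scan_cost = total_gb / 1024 * PRICE_PER_TB     # ~ $25.39
    pruned_cost = scanned_gb / 1024 * PRICE_PER_TB      # ~ $0.12
    print(f"full scan: ${full_scan_cost:.2f}, pruned scan: ${pruned_cost:.2f}")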

Athena has a serverless architecture. If you require control over the servers, consider using Qubole instead.

Redshift Spectrum

Redshift Spectrum provides similar functionality to Amazon Athena at a similar cost. The data is likewise stored on Amazon S3 and is connected to a Redshift cluster. The data is partitioned, and any query scans only the relevant partitions. A Redshift Spectrum data output provides speedy, cost-effective access to the data. It is also useful if you want to query and join data in S3 with data in a Redshift database.
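For illustration, such a join between an S3-backed Spectrum table and a local Redshift table might be issued through the Redshift Data API as sketched below; the cluster, database, user, schema, and table names are all hypothetical.

    import boto3

    # Join an external (S3-backed) Spectrum table with a local Redshift table.
    # Cluster, database, user, schema, and table names are hypothetical.
    client = boto3.client("redshift-data")

    client.execute_statement(
        ClusterIdentifier="analytics-cluster",
        Database="prod",
        DbUser="analyst",
        Sql="""
            SELECT u.user_id, u.plan, COUNT(*) AS clicks
            FROM spectrum_schema.web_events e       -- external table over S3
            JOIN public.users u                     -- local Redshift table
              ON u.user_id = e.user_id
            WHERE e.partition_date >= '2023-01-01'  -- prunes S3 partitions
            GROUP BY u.user_id, u.plan
        """,
    )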

Qubole

Upsolver supports Qubole, a cloud-native data platform that provides managed clusters running various open source systems, such as Presto (the engine underlying Athena). With Qubole, you have control over the servers and can spin up additional servers if required. Qubole pricing is based on servers rather than on data scanned, and may be more cost-effective if you have huge volumes of data to query.

The outputs for Athena, Redshift Spectrum, and Qubole are equivalent; that is, the files are stored and managed in the same way.

Apache Kafka and Amazon Kinesis

These data outputs enable you to stream your events to your Apache Kafka topic or Amazon Kinesis stream.
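Downstream consumers then read these events like any other stream records. A minimal consumer sketch in Python follows; the topic, stream, and shard names are hypothetical, and the Kafka part assumes the kafka-python package.

    import boto3
    from kafka import KafkaConsumer  # pip install kafka-python

    # Read events delivered to a Kafka topic; stops after 5s of inactivity.
    consumer = KafkaConsumer(
        "clickstream",
        bootstrap_servers="localhost:9092",
        consumer_timeout_ms=5000,
    )
    for message in consumer:
        print(message.value)

    # Equivalent one-shot read from a Kinesis stream.
    kinesis = boto3.client("kinesis")
    iterator = kinesis.get_shard_iterator(
        StreamName="clickstream",
        ShardId="shardId-000000000000",
        ShardIteratorType="LATEST",
    )["ShardIterator"]
    for record in kinesis.get_records(ShardIterator=iterator)["Records"]:
        print(record["Data"])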

Amazon Redshift, MySQL and Elasticsearch

These are all database data outputs, and are used when projects require a specific database output.

One use case is to write all your data into staging tables, then use periodic queries to aggregate the data and write it to the final data tables (following an Extract-Load-Transform, or ELT, methodology). You can then clear the staging tables periodically, preserving only the transformed data, so that excessive storage is not required and costs stay under control. Given that the cost of a database is hundreds of times the cost of storing data in S3, it is important to plan your database data outputs very carefully. A common scenario for customers moving to Upsolver is to store the raw data in one of the more cost-effective data outputs, and then pre-aggregate the data and write the output to the smaller tables on the database.
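A sketch of that periodic aggregation step, here using the Redshift Data API with hypothetical table names (any SQL client against your database would work the same way):

    import boto3

    # Periodic ELT job: fold the staging table into the final table, then
    # clear the staging table. All names are hypothetical.
    client = boto3.client("redshift-data")

    client.batch_execute_statement(
        ClusterIdentifier="analytics-cluster",
        Database="prod",
        DbUser="etl",
        Sqls=[
            """INSERT INTO daily_clicks (day, user_id, clicks)
               SELECT DATE_TRUNC('day', event_time), user_id, COUNT(*)
               FROM staging_events
               GROUP BY 1, 2""",
            "TRUNCATE TABLE staging_events",  # keep only the transformed data
        ],
    )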

Another use case is storing log analytics in Elasticsearch, which typically results in a fast-growing index with many duplicates. By deduplicating the data, it is possible to make significant savings without losing any fidelity; one common approach is sketched below.
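The sketch derives each Elasticsearch document ID from the event content, so a replayed duplicate overwrites the existing document instead of adding a new one. The index name and event fields are hypothetical; the code assumes the official elasticsearch Python client.

    import hashlib
    import json
    from elasticsearch import Elasticsearch  # pip install elasticsearch

    es = Elasticsearch("http://localhost:9200")

    def index_event(event: dict) -> None:
        # Deterministic ID: identical events hash to the same document,
        # so indexing a duplicate overwrites rather than duplicates.
        doc_id = hashlib.sha1(
            json.dumps(event, sort_keys=True).encode("utf-8")
        ).hexdigest()
        es.index(index="logs", id=doc_id, document=event)

    index_event({"user_id": 42, "action": "click", "ts": "2023-01-01T00:00:00Z"})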

Lookup Tables

Lookup tables are defined as a transformation like any other output, but the output is written to a key-value store that you can query. This enables fast lookups, for example, retrieving all users who clicked on a specific ad in "real time".

Lookup tables are useful for:

Join between streams: By querying a lookup table, it is possible to enrich one stream with data from another stream.
Flattening data: Given that storing data in a data lake is comparatively cheap, it may be useful to flatten relational data. This means that it is not necessary to consider joins when querying data, making it easier to extract the required data.
Real-time aggregations: An Upsolver lookup table replaces ETL code and a serving DB like Redis or Cassandra. When a lookup table is defined as real-time, instead of waiting until the data is written to S3 and to disk, the event's details (the delta) are updated directly in memory and only then stored in S3. This is useful, for example, for supplying data on a user in real time while the user is browsing a website (see the sketch after this list).
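To make the real-time point concrete, here is a toy model of that update path, not Upsolver's actual implementation: per-key aggregates are updated in memory as each event arrives, so lookups are served immediately, and the deltas are persisted to S3 afterwards. The bucket name and flush mechanics are hypothetical.

    import json
    from collections import defaultdict

    import boto3

    # Toy model of a real-time lookup table (not Upsolver's implementation).
    clicks_per_user = defaultdict(int)   # the in-memory key-value state
    pending_deltas = []                  # deltas not yet persisted

    def on_event(event: dict) -> None:
        clicks_per_user[event["user_id"]] += 1   # update memory first...
        pending_deltas.append(event)             # ...and remember the delta

    def lookup(user_id: int) -> int:
        return clicks_per_user[user_id]          # served straight from memory

    def flush_to_s3(bucket: str = "my-lookup-state") -> None:
        # Persist accumulated deltas to S3; bucket name is hypothetical.
        boto3.client("s3").put_object(
            Bucket=bucket,
            Key="deltas/batch.json",
            Body=json.dumps(pending_deltas).encode("utf-8"),
        )
        pending_deltas.clear()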

Amazon S3, HDFS, Google Storage and Microsoft Azure Storage

These data outputs enable you to write data to blob storage and continue processing in those systems, for example, sending data to S3 and then triggering notifications and sending emails.
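As an illustration of that last example, S3 itself can fan out notifications once files land; the sketch below wires a bucket's object-created events to an SNS topic (which can email its subscribers). The bucket and topic are hypothetical and must already exist with appropriate permissions.

    import boto3

    # Notify an SNS topic whenever a new object lands in the output bucket.
    # Bucket name and topic ARN are hypothetical.
    s3 = boto3.client("s3")

    s3.put_bucket_notification_configuration(
        Bucket="my-output-bucket",
        NotificationConfiguration={
            "TopicConfigurations": [
                {
                    "TopicArn": "arn:aws:sns:us-east-1:123456789012:new-data",
                    "Events": ["s3:ObjectCreated:*"],
                }
            ]
        },
    )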