Amazon Kinesis

This page shows you how to ingest data from your Amazon Kinesis stream.

Prerequisites

Ensure that you have an Amazon Kinesis connection with the correct permissions to read from your bucket.

See: Connect to your Kinesis stream

Additionally, if you are ingesting to the data lake, you need a metastore connection that can be used to create a staging table as w, though you can define columns if required:, though you can define columns if required:ell as a corresponding storage connection that can be used to store your table's underlying files.

To learn more about how tables work in Upsolver, see SQL tables.

Create a job that reads from Kinesis

You can create a job to ingest your data from Kinesis into a staging table in the data lake or ingest directly into your target.

Jump to

Ingest to the data lake

After completing the prerequisites, you can create your staging tables. The example below creates a table without defining columns or data types, as these will be inferred automatically by Upsolver, though you can define columns if required:

CREATE TABLE default_glue_catalog.upsolver_samples.orders_raw_data()
    PARTITIONED BY $event_date;

Create a staging table to store the data ingested from Kinesis.

Upsolver recommends partitioning by the system column $event_date or another date column within the data in order to optimize your query performance.

Next, create an ingestion job as follows:

CREATE SYNC JOB stage_kinesis_data
    START_FROM = BEGINNING
    CONTENT_TYPE = JSON
    AS COPY FROM KINESIS your_kinesis_connection
                         STREAM = 'sample-stream'
    INTO default_glue_catalog.kinesis.kinesis_data_stg; 

Create an ingestion job to copy data from Kinesis into a staging table in the data lake.

Note that multiple ingestion jobs can write to the same table, resulting in a final table that contains a UNION ALL of all data copied into that table. This means that any duplicate rows that are written are not removed and the columns list may expand if new columns are detected.

This may not be your intended behavior, so ensure you are writing to the correct table before running your job.

The example above only uses a small subset of all job options available when reading from Kinesis. Depending on your use case, there may be other options you want to configure. For instance, you may want to specify the compression of your source data rather than have it auto-detected.

For the full list of job options, including syntax examples, see Amazon Kinesis.

After your data has been ingested into your staging table, you are ready to move on to the next step of building your data pipeline: transforming your data and writing it to your intended target locations.

Ingest directly to the target

Directly ingesting your data enables you to copy your data straight into the target system, bypassing the need for a staging table. The syntax and job options are identical to ingesting into a staging table, however, the target connection differs:

CREATE SYNC JOB ingest_kinesis_to_snowflake
    COMMENT = 'Ingest orders directly to Snowflake'
    CONTENT_TYPE = JSON
    START_FROM = BEGINNING
    AS COPY FROM KINESIS your_kinesis_connection STREAM = 'sample-stream'
    INTO SNOWFLAKE my_snowflake_connection.demo.orders_transformed;

Create a job to ingest from Kinesis directly into a target Snowflake database.

Job options

Transformations can be applied to your ingestion job to correct issues, exclude columns, or mask data before it lands in the target. Furthermore, you can use expectations to define data quality rules on your data stream and take appropriate action.

For more information, see the Ingestion jobs page, which describes the available job options and includes examples.

Alter a job that reads from Kinesis

Some job options are considered mutable, enabling you to run a SQL command to alter an existing ingestion job rather than create a new job. The job options apply equally to jobs that ingest into the data lake or directly to the target, and the syntax to alter a job is identical.

For example, take the job we created earlier:

CREATE SYNC JOB stage_kinesis_data
    START_FROM = BEGINNING
    CONTENT_TYPE = JSON
    AS COPY FROM KINESIS your_kinesis_connection
                         STREAM = 'sample-stream'
    INTO default_glue_catalog.kinesis.kinesis_data_stg; 

Create an ingestion job to copy data from Kinesis into a staging table in the data lake.

If you want to keep the job as is, but only change the cluster that is running the job, execute the following command:

ALTER JOB my_kinesis_ingestion_job 
    SET COMPUTE_CLUSTER = my_new_cluster;

Note that some options such as COMPRESSION cannot be altered once the connection has been created.

To check which job options are mutable, see Amazon Kinesis.

Drop a job that reads from Kinesis

If you no longer need a job, you can easily drop it using the following SQL command. This applies to jobs that ingest into the data lake and directly into the target:

DROP JOB my_kinesis_ingestion_job;

Last updated