Ingest your data into a staging table

This quickstart describes how to create a job to ingest data from Amazon S3 into a table in your data lake.
An ingestion job enables you to copy your data from a given source into a table created in a metastore connection. This table then serves as your staging table, and you can use it with Upsolver's transformation jobs to write to various target locations.
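For context, a later transformation job reads from this staging table and writes to a target. The following is a rough sketch only, not part of this quickstart: the orders_transformed target table, the job options, and the column names are illustrative.

-- Hypothetical transformation job reading from the staging table created below
CREATE JOB transform_orders_job
    START_FROM = BEGINNING
    ADD_MISSING_COLUMNS = TRUE
    AS INSERT INTO default_glue_catalog.upsolver_samples.orders_transformed
    MAP_COLUMNS_BY_NAME
    SELECT orderid AS order_id,  -- illustrative column name
           customer_email        -- illustrative column name
    FROM default_glue_catalog.upsolver_samples.orders_raw_data
    WHERE $event_time BETWEEN run_start_time() AND run_end_time();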

Create a staging table

You need to create a job that reads from the source S3 bucket and writes to a staging table. Upsolver automatically converts your files into Apache Parquet format and creates a table with the appropriate schema in the AWS Glue Data Catalog.
The first step is to create a staging table to store your data.
Here’s the code to create a staging table:
-- Syntax
CREATE TABLE default_glue_catalog.<DB_NAME>.<STAGING_TABLE_NAME>()
PARTITIONED BY $event_date;
-- Optional: add TABLE_DATA_RETENTION = 30 DAYS to limit how long data is retained

-- Example
CREATE TABLE default_glue_catalog.upsolver_samples.orders_raw_data;
In this example, no columns or data types have been defined, enabling Upsolver to create them automatically based on the source data.
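If you already know part of your schema, you can define columns explicitly and add table options such as the retention period mentioned above. The following is a sketch that assumes a known orderid column; check Upsolver's CREATE TABLE reference for the exact option syntax:

-- Sketch: staging table with an explicit column and a retention option (assumed placement)
CREATE TABLE default_glue_catalog.upsolver_samples.orders_raw_data (
    orderid string
)
PARTITIONED BY $event_date
TABLE_DATA_RETENTION = 30 DAYS;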
The next step is to create a job that copies your data into your staging table.
Here's the code to create an ingestion job:
-- Syntax
CREATE JOB "<STAGING_JOB_NAME>"
    CONTENT_TYPE = JSON
    AS COPY FROM S3 <S3_CONNECTION_NAME> /* the S3 connection you created in step 1 */
        BUCKET = '<BUCKET_NAME>'   /* example: BUCKET = 'upsolver-samples' */
        PREFIX = '<PREFIX_NAME>'   /* example: PREFIX = 'orders/' Note: the trailing slash is required */
    INTO default_glue_catalog.<DB_NAME>.<STAGING_TABLE_NAME>; /* the staging table you created above */

-- Example
CREATE JOB load_orders_raw_data_from_s3
    CONTENT_TYPE = JSON
    AS COPY FROM S3 upsolver_s3_samples
        BUCKET = 'upsolver-samples'
        PREFIX = 'orders/'
    INTO default_glue_catalog.upsolver_samples.orders_raw_data;
Let’s understand what this example job does:
1. Creates an ingestion job named load_orders_raw_data_from_s3.
2. Automatically infers the schema and populates the column names and types in the Glue Data Catalog.
3. Converts the source S3 files to Apache Parquet, which stores the data in a columnar format optimized for faster reads.
4. Copies the orders data from the upsolver-samples bucket into the orders_raw_data table in the data lake.
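Once the job starts processing, you can query the staging table to verify that data is arriving. Note that the initial load may take a few minutes to appear:

SELECT *
FROM default_glue_catalog.upsolver_samples.orders_raw_data
LIMIT 10;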