Ingest to a Staging Table
This quickstart describes how to create a job to ingest data from Amazon S3 into a table in your data lake.
An ingestion job enables you to copy your data from a given source into a table created in a metastore connection. This table then serves as your staging table, and you can use it with Upsolver's transformation jobs to write to various target locations.
Create a staging table
You need to create a job that reads from the source S3 bucket and writes to a staging table. Upsolver automatically converts your files into Apache Parquet format and creates a table with the appropriate schema in the AWS Glue Data Catalog.
The first step is to create a staging table to store your data.
Here’s the code to create a staging table:
// Syntax
CREATE TABLE default_glue_catalog.<DB_NAME>.<STAGING_TABLE_NAME>()
PARTITIONED BY $event_date;
-- TABLE_DATA_RETENTION = 30 days
// Example
CREATE TABLE default_glue_catalog.upsolver_samples.orders_raw_data;
In this example, no columns or data types have been defined, enabling Upsolver to create them automatically based on the source data.
The next step is to create a job that copies your data into your staging table.
Here's the code to create an ingestion job:
// Syntax
CREATE JOB "<STAGING_JOB_NAME>"
CONTENT_TYPE = JSON
AS COPY FROM S3 <S3 CONNECTION FROM STEP 1>
BUCKET = '<BUCKET NAME>'
PREFIX = '<PREFIX NAME>'
INTO default_glue_catalog.<DB_NAME>.<STAGING_TABLE_NAME>;
// Example
CREATE JOB load_orders_raw_data_from_s3
CONTENT_TYPE = JSON
AS COPY FROM S3 upsolver_s3_samples
BUCKET = 'upsolver-samples'
// Note: the trailing slash is required
PREFIX = 'orders/'
// this is the table you created in step 1
INTO default_glue_catalog.upsolver_samples.orders_raw_data;
Let’s understand what this example job does:
Creates an ingestion job named load_orders_raw_data_from_s3.
Automatically infers the schema and populates the column names and types in the Glue Data Catalog.
Converts the source S3 files to Apache Parquet, which stores the data in a columnar format optimized for faster reads.
Copies the orders data from the upsolver-samples bucket into the orders_raw_data table in the data lake.
Last updated