Ingest to a Staging Table

This quickstart describes how to create a job to ingest data from Amazon S3 into a table in your data lake.

An ingestion job enables you to copy your data from a given source into a table created in a metastore connection. This table then serves as your staging table, and you can use it with Upsolver's transformation jobs to write to various target locations.

Create a staging table

You need to create a job that reads from the source S3 bucket and writes to a staging table. Upsolver automatically converts your files into Apache Parquet format and creates a table with the appropriate schema in the AWS Glue Data Catalog.

The first step is to create a staging table to store your data.

Here’s the code to create a staging table:

// Syntax
CREATE TABLE default_glue_catalog.<DB_NAME>.<STAGING_TABLE_NAME>()
    PARTITIONED BY $event_date;
    -- Optional: add TABLE_DATA_RETENTION = 30 days to limit how long ingested data is retained

// Example
CREATE TABLE default_glue_catalog.upsolver_samples.orders_raw_data;

In this example, no columns or data types have been defined, enabling Upsolver to create them automatically based on the source data.
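
If you prefer to define the schema up front instead of relying on inference, you can also declare columns explicitly. The following is an illustrative sketch only: the column names and types are assumptions about your source data rather than part of the sample dataset, and the supported types and options are listed in the CREATE TABLE reference.

// Illustrative sketch: declare columns explicitly instead of letting Upsolver infer them
CREATE TABLE default_glue_catalog.upsolver_samples.orders_raw_data_typed (
    orderid string,
    nettotal double
)
PARTITIONED BY $event_date;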

The next step is to create a job that copies your data into your staging table.
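
The COPY FROM S3 statement references an S3 connection; the example below uses one named upsolver_s3_samples. If you need to connect to your own bucket instead, a minimal sketch might look like the following. The connection name, role ARN, and external ID are placeholders (assumptions, not values from this guide), and the exact options are documented in the Amazon S3 connector reference.

// Minimal sketch, assuming an IAM role that Upsolver can assume (placeholder values)
CREATE S3 CONNECTION my_s3_connection
    AWS_ROLE = 'arn:aws:iam::111111111111:role/<YOUR_UPSOLVER_ROLE>'
    EXTERNAL_ID = '<YOUR_EXTERNAL_ID>';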

Here's the code to create an ingestion job:

// Syntax
CREATE JOB "<STAGING_JOB_NAME>"
    CONTENT_TYPE = JSON
AS COPY FROM S3 <S3_CONNECTION_NAME>
    BUCKET = '<BUCKET_NAME>'
    PREFIX = '<PREFIX_NAME>'
INTO default_glue_catalog.<DB_NAME>.<STAGING_TABLE_NAME>;

// Example
CREATE JOB load_orders_raw_data_from_s3
    CONTENT_TYPE = JSON
AS COPY FROM S3 upsolver_s3_samples
    BUCKET = 'upsolver-samples'
    // Note: the trailing slash in the prefix is required
    PREFIX = 'orders/'
// This is the staging table you created above
INTO default_glue_catalog.upsolver_samples.orders_raw_data;

Let’s understand what this example job does:

  1. Creates an ingestion job named load_orders_raw_data_from_s3.

  2. Automatically infers the schema and populates the column names and types in the Glue Data Catalog.

  3. Converts the source S3 files to Apache Parquet, which stores the data in a columnar format optimized for faster reads.

  4. Copies the orders data from the upsolver-samples bucket into the orders_raw_data table in the data lake.
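
Once the job has completed its first run, you can confirm that data has arrived by querying the staging table directly. A minimal check, assuming the sample data has finished loading:

// Inspect a sample of the ingested rows
SELECT *
FROM default_glue_catalog.upsolver_samples.orders_raw_data
LIMIT 10;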
