Amazon S3

Follow these steps to use Amazon S3 as your source.


Step 1 - Connect to Amazon S3

Create a new connection

Click Create a new connection if it is not already selected. In the Name your connection field, type the name you want to give this connection.

For the Authentication Method, select either the Role-based or the AccessKey/SecretKey option:

Role-based

Upsolver recommends that you use Role-based access.

  • To define the correct permissions for the role, follow the guide to create an IAM policy. A minimal policy sketch appears after this list.

  • If your S3 bucket runs in a different AWS account than the one running Upsolver, you must establish trust between the role and the account running Upsolver. Follow the guide to create a trusted AWS role and find your External ID.
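If you manage policies programmatically, the sketch below shows what a minimal read policy for a source bucket might look like, created with boto3. The bucket and policy names are placeholders, and the exact permission set should come from Upsolver's IAM policy guide rather than this example.

```python
import json
import boto3

# Placeholder bucket name - replace with your own source bucket.
BUCKET = "upsolver-samples"

# A minimal read/list policy sketch; consult Upsolver's IAM policy
# guide for the exact permissions your ingestion jobs require.
policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                f"arn:aws:s3:::{BUCKET}",
                f"arn:aws:s3:::{BUCKET}/*",
            ],
        },
        {
            # Optional: lets the wizard list your buckets for you.
            "Effect": "Allow",
            "Action": "s3:ListAllMyBuckets",
            "Resource": "*",
        },
    ],
}

iam = boto3.client("iam")
iam.create_policy(
    PolicyName="upsolver-s3-source-read",  # placeholder policy name
    PolicyDocument=json.dumps(policy_document),
)
```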

AccessKey/SecretKey

To create your Access key ID and Secret access key, follow the AWS Account and Access Keys guide.

Encryption Key

By default, Upsolver reads files using the default encryption defined on the AWS bucket. Alternatively, you can provide the Base64 text representation of the encryption key to use, or the ARN of an existing AWS KMS key.
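If you use the Base64 option, the value is simply your key's raw bytes encoded as Base64 text. A minimal sketch in Python (the 256-bit key here is randomly generated for illustration only; in practice, use the key your objects are actually encrypted with):

```python
import base64
import secrets

# Random 256-bit key for demonstration; substitute the real key
# that encrypts your S3 objects.
key_bytes = secrets.token_bytes(32)

# This Base64 string is the text you would paste into the wizard.
key_b64 = base64.b64encode(key_bytes).decode("ascii")
print(key_b64)
```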

When you have entered your authentication information, click Test Connection.

Use an existing connection

By default, if you have already created a connection, Upsolver selects Use an existing connection, and your Amazon S3 connection is populated in the list.

For organizations with multiple connections, select the source connection you want to use.

Step 2 - Select a source location to ingest from

When the connection is established, Upsolver attempts to list your buckets, provided the connection's credentials include the s3:ListAllMyBuckets permission. Alternatively, you can specify the name of your bucket, e.g. s3://upsolver-samples.
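If the bucket list stays empty, you can verify the permission outside Upsolver using the same credentials. A minimal sketch with boto3, assuming your credentials are available in the environment:

```python
import boto3

# Picks up credentials from the environment or ~/.aws/credentials.
s3 = boto3.client("s3")

# Requires s3:ListAllMyBuckets - the same permission that lets
# the wizard populate its bucket drop-down.
for bucket in s3.list_buckets()["Buckets"]:
    print(bucket["Name"])
```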

Next, you can optionally Select the location to read the files from. Leave this empty to ingest the entire bucket.

To specify the file types to ingest, choose an option from the Select the file's content type / Parse files using list, e.g. JSON, CSV, Parquet. This list defaults to Automatic.

Advanced options

Select the file name pattern for the files you would like to ingest

Upsolver ingests all files in the selected location by default. To change this, select Ingest files matching a regular expression from the list.
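For example, a regular expression can limit ingestion to a specific file type. The pattern below is purely illustrative, not one taken from the product:

```python
import re

# Ingest only .json.gz files, skipping hidden files that start with
# a dot (illustrative pattern - adjust to your own naming scheme).
pattern = re.compile(r"^(?!\.).*\.json\.gz$")

files = ["events-2024-01-01.json.gz", "upload.tmp", ".hidden.json.gz"]
matching = [f for f in files if pattern.match(f)]
print(matching)  # ['events-2024-01-01.json.gz']
```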

Load files by partition using a date pattern

If your source files are partitioned by a date pattern, Upsolver can load existing and new files using the pattern. This affects the order of files loaded and avoids delays when many changes occur across the bucket.

By default, Upsolver lists and ingests files in the ingestion job's bucket and folder as soon as they are discovered. When you set a date pattern, Upsolver uses the date in the folder path to determine when new files are added, and processes data in order of arrival. If files are added to a folder named with a future date, they will not be ingested until that date arrives.
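To make this concrete, the sketch below extracts the date from a yyyy/MM/dd-partitioned path and flags future-dated folders. The path layout is an assumption for illustration:

```python
from datetime import datetime, timezone

# Assume folders are partitioned as s3://bucket/events/yyyy/MM/dd/...
def partition_date(key: str) -> datetime:
    """Extract the date encoded in a yyyy/MM/dd folder path."""
    parts = key.split("/")
    year, month, day = parts[1], parts[2], parts[3]
    return datetime(int(year), int(month), int(day), tzinfo=timezone.utc)

key = "events/2031/01/15/batch-001.json"
if partition_date(key) > datetime.now(timezone.utc):
    print("Future-dated partition: not ingested until that date arrives.")
```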

Delete the source files following ingestion

To discover new files when a date pattern is not set, Upsolver lists the top-level prefix and performs a diff to detect newly created files. It then lists the paths adjacent to those newly added files, assuming that if a file was added there, others will be as well. This process repeats at regular intervals to ensure files are not missed.

For buckets with few files and predictable changes, this works well. However, for buckets with many changes across millions of files and hundreds of prefixes, the scanning and diffing process may result in ingestion and processing delays.

To optimize this process, consider setting the Delete the source files following ingestion option to TRUE. This moves ingested files to another staging location, leaving the source folder empty and making it easier and faster for Upsolver to discover new files. Be aware that configuring Upsolver to move ingested files could impact other systems if they depend on the same raw files.
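As a rough illustration of the diff-based discovery described above (a simplified model, not Upsolver's actual implementation), each scan re-lists the prefix and compares the result with the keys already seen:

```python
import boto3

s3 = boto3.client("s3")
seen: set[str] = set()

def discover_new_files(bucket: str, prefix: str) -> list[str]:
    """List a prefix and diff against previously seen keys."""
    paginator = s3.get_paginator("list_objects_v2")
    current = set()
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            current.add(obj["Key"])
    new_keys = sorted(current - seen)
    seen.update(new_keys)
    return new_keys

# Every periodic scan re-lists the whole prefix; with millions of
# keys the listing itself becomes the bottleneck, which is what the
# delete option avoids by keeping the source folder near-empty.
print(discover_new_files("upsolver-samples", "orders/"))
```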

Step 3 - Check that files are read successfully

When you select a bucket and folder, Upsolver will attempt to load a sample of the files.

If Upsolver did not load any sample files, try the following:

  1. Verify that the location in your bucket contains files; a quick sketch for this check follows the list.

  2. Select a content type that matches the content of your stream.
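For the first check, a quick way to confirm the location contains objects (bucket and prefix names are placeholders):

```python
import boto3

s3 = boto3.client("s3")

# MaxKeys=1 is enough to prove the prefix is non-empty.
resp = s3.list_objects_v2(
    Bucket="upsolver-samples", Prefix="orders/", MaxKeys=1
)
print("Files found" if resp.get("KeyCount", 0) > 0 else "Location is empty")
```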
