Amazon S3 setup guide

Follow these steps to use Amazon S3 as your source.

Step 1 - Connect to Amazon S3

Select an existing S3 connection, or create a new one.

Create a new Amazon S3 connection

Authentication Method
Role-based access is the recommended method.
  • To define the correct permissions for the role, follow the S3 access configuration guide to create an IAM policy.
  • If your S3 bucket is in a different AWS account than the one running Upsolver, you need to create trust between the role and the account running Upsolver. Follow the Role-based setup guide to create a trusted AWS Role and find your External Id.
If you authenticate with access keys instead, follow the AWS Account and Access Keys guide to create your access key ID and secret access key.
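As a rough illustration of the cross-account trust mentioned above, the following sketch builds an IAM trust policy that lets another AWS account assume a role only when it presents a given external ID. The account ID and external ID are placeholders, and the helper name `build_trust_policy` is hypothetical. Take the real values from the Role-based setup guide rather than from this example.

```python
import json


def build_trust_policy(trusted_account_id: str, external_id: str) -> dict:
    """Return a trust policy that allows the trusted account to assume the role
    only when it supplies the expected external ID."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # The account that is allowed to assume this role.
                "Principal": {"AWS": f"arn:aws:iam::{trusted_account_id}:root"},
                "Action": "sts:AssumeRole",
                # The external ID must match for the call to succeed.
                "Condition": {"StringEquals": {"sts:ExternalId": external_id}},
            }
        ],
    }


# Placeholder values for illustration only.
policy = build_trust_policy("111122223333", "example-external-id")
print(json.dumps(policy, indent=2))
```

The printed JSON can be pasted into the role's trust relationship once the placeholders are replaced.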
Encryption Key
By default, Upsolver uses the default encryption defined in the AWS bucket to read the files. Alternatively, you can provide either the Base64 text representation of the encryption key to use or the ARN of an existing AWS KMS key.
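If you need to produce the Base64 text representation of a key, a minimal sketch could look like the following. It assumes a 256-bit key, which this guide does not state, so confirm the required key length for your bucket's encryption setup. The helper name `generate_base64_key` is hypothetical.

```python
import base64
import secrets


def generate_base64_key(num_bytes: int = 32) -> str:
    """Generate a random key and return its Base64 text representation.

    The 32-byte (256-bit) default is an assumption, not a documented requirement.
    """
    raw_key = secrets.token_bytes(num_bytes)
    return base64.b64encode(raw_key).decode("ascii")


key_text = generate_base64_key()
# Decoding the text recovers the original number of key bytes.
decoded_length = len(base64.b64decode(key_text))
```

Store the generated key securely, because files encrypted with it cannot be read without it.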

Step 2 - Select a source location to ingest from

Select a bucket to get started. Upsolver will attempt to list your buckets if the connection above grants the s3:ListAllMyBuckets permission. Alternatively, you can specify the name of your bucket (e.g. s3://upsolver-samples).
Select a folder to ingest, or leave it empty to ingest the entire bucket.
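To make the bucket and folder distinction concrete, here is a small sketch that splits an s3:// location into its bucket name and folder prefix. The function name `split_s3_location` and the sample folder are hypothetical. An empty folder corresponds to ingesting the entire bucket.

```python
from typing import Tuple


def split_s3_location(location: str) -> Tuple[str, str]:
    """Split 's3://bucket/folder/' into (bucket, folder).

    The folder is an empty string when only a bucket is given.
    """
    scheme = "s3://"
    if not location.startswith(scheme):
        raise ValueError(f"expected a location starting with {scheme!r}: {location!r}")
    bucket, _, folder = location[len(scheme):].partition("/")
    if not bucket:
        raise ValueError(f"missing bucket name in {location!r}")
    return bucket, folder


whole_bucket = split_s3_location("s3://upsolver-samples")
one_folder = split_s3_location("s3://upsolver-samples/orders/")
```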

Advanced options

Upsolver ingests all files in the selected location by default. To ingest only some of them, provide a regular expression that defines which files are ingested.
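Before entering a regular expression, it can help to test it locally against a few sample file paths. The sketch below uses Python's `re` module with made-up paths. Whether Upsolver matches the pattern against the full path or only the file name, and which regex dialect it uses, is not stated here, so treat the matching semantics as an assumption.

```python
import re
from typing import List


def select_files(keys: List[str], pattern: str) -> List[str]:
    """Return the keys that the pattern matches anywhere in the path."""
    regex = re.compile(pattern)
    return [key for key in keys if regex.search(key)]


# Hypothetical object keys in the selected folder.
sample_keys = [
    "orders/2023/01/data.json",
    "orders/2023/01/_SUCCESS",
    "orders/2023/01/data.csv.gz",
]

# Keep only files whose name ends with ".json".
json_only = select_files(sample_keys, r"\.json$")
```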
Date Pattern
If your source files are partitioned by a date pattern, Upsolver can load existing and new files using the pattern. This affects the order of files loaded and avoids delays when many changes occur across the bucket.
By default, Upsolver lists and ingests files in the ingest job’s bucket and folder as soon as they are discovered. When you set a date pattern, Upsolver uses the date in the folder path to determine when new files are added, and processes data in order of that date. If files are added to a folder named with a future date, they are not ingested until that date arrives.
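The ordering and future-date behavior described above can be sketched as follows. This is an illustration of the idea, not Upsolver's implementation. It assumes a year/month/day folder layout expressed as a Python `strptime` format and UTC dates, and the paths and helper names are hypothetical.

```python
from datetime import datetime, timezone
from typing import List


def folder_date(key: str, prefix: str, date_format: str = "%Y/%m/%d") -> datetime:
    """Parse the date embedded in the folder path directly after the prefix."""
    remainder = key[len(prefix):]
    # Take as many path segments as the date format contains.
    segment_count = date_format.count("/") + 1
    date_part = "/".join(remainder.split("/")[:segment_count])
    return datetime.strptime(date_part, date_format).replace(tzinfo=timezone.utc)


def ready_to_ingest(keys: List[str], prefix: str, now: datetime) -> List[str]:
    """Return keys ordered by folder date, skipping folders dated in the future."""
    dated = [(folder_date(key, prefix), key) for key in keys]
    return [key for date, key in sorted(dated) if date <= now]


keys = [
    "events/2024/01/03/c.json",
    "events/2024/01/01/a.json",
    "events/2024/01/02/b.json",
]
now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
ordered = ready_to_ingest(keys, "events/", now)
```

In this example the file under 2024/01/03 is held back because its folder date is later than `now`.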
Delete the source files following ingestion
To discover new files, when a date pattern is not set, Upsolver lists the top-level prefix and performs a diff to detect newly created files. It then lists the paths adjacent to these newly added files and assumes that if a file was added here, others will be as well. This process is performed at regular intervals to ensure files are not missed.
For buckets with few files and predictable changes, this works well. However, for buckets with many changes across millions of files and hundreds of prefixes, the scanning and diffing process may result in ingestion and processing delays.
To optimize this process, consider setting the Delete files option to TRUE. This moves ingested files to another staging location, leaving the source folder empty and making it easier and faster for Upsolver to discover new files. Be aware that configuring Upsolver to move ingested files could impact other systems if they depend on the same raw files.
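The list-and-diff discovery described in this section can be pictured with a small sketch. It compares two listings held in memory and derives the folders where new files appeared. The function name and paths are hypothetical, and the real process runs against S3 listings rather than Python sets.

```python
import posixpath
from typing import Iterable, List, Tuple


def discover_new_files(
    previous_listing: Iterable[str], current_listing: Iterable[str]
) -> Tuple[List[str], List[str]]:
    """Diff two listings and return (new files, folders that received new files)."""
    new_files = set(current_listing) - set(previous_listing)
    # Folders that just received a file are likely to receive more.
    active_folders = {posixpath.dirname(key) + "/" for key in new_files}
    return sorted(new_files), sorted(active_folders)


previous = {"logs/a/1.json", "logs/b/1.json"}
current = previous | {"logs/a/2.json"}
new_files, active_folders = discover_new_files(previous, current)
```

The diff has to consider every key in both listings, which is why a source folder that is emptied after ingestion is cheaper to scan.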

Step 3 - Check that files are read successfully

When you select a bucket and folder, Upsolver will attempt to load a sample of the files.
If Upsolver did not load any sample files, try the following:
  1. Verify that the location on your bucket contains files.
  2. Select a Content type that matches the content type of your stream.