Amazon S3
Follow these steps to use Amazon S3 as your source.
Click Create a new connection if it is not already selected. In the Name your connection field, type a name for this connection.
For the Authentication Method, select either the Role-based or the AccessKey/SecretKey option:
Upsolver recommends that you use Role-based access.
To define the correct permissions for the role, follow the Amazon S3 access configuration guide to create an IAM policy.
If your S3 bucket belongs to a different AWS account than the one running Upsolver, you need to establish a trust relationship between the role and the account running Upsolver. Follow the Role-Based AWS Credentials guide to create a trusted AWS Role and find your External Id.
If you select the AccessKey/SecretKey option instead, follow the AWS Account and Access Keys guide to create your Access key id and Secret access key.
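As a rough illustration of the kind of read-only IAM policy such a role might carry, the sketch below builds a policy document in Python. The three actions are standard S3 IAM actions, but the exact set Upsolver requires is defined in the Amazon S3 access configuration guide, so treat this as an assumption. The function name `build_read_policy` and the bucket name are hypothetical.

```python
import json


def build_read_policy(bucket: str) -> dict:
    """Build an illustrative read-only IAM policy document for one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Lets the connection list bucket names in the account.
                "Effect": "Allow",
                "Action": ["s3:ListAllMyBuckets"],
                "Resource": "*",
            },
            {
                # Lets the connection list the keys inside the bucket.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket}",
            },
            {
                # Lets the connection read the objects themselves.
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": f"arn:aws:s3:::{bucket}/*",
            },
        ],
    }


# Print the policy as JSON, ready to compare against the official guide.
print(json.dumps(build_read_policy("upsolver-samples"), indent=2))
```

Check the generated document against the guide before attaching it to a role, since additional actions may be needed for some ingestion options.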
Encryption Key
By default, Upsolver uses the default encryption defined in the AWS bucket to read the files. Alternatively, you can provide the Base64 text representation of the encryption key to use or an ARN for an existing AWS KMS key.
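If you need to produce the Base64 text representation of an existing key, a minimal Python sketch follows. It assumes a 32-byte (256-bit) key, which is the usual size for AES-256; the length Upsolver expects is not stated here, so verify it before relying on the check. The helper names and the placeholder key are hypothetical.

```python
import base64
import binascii


def encode_key(raw_key: bytes) -> str:
    """Return the Base64 text representation of raw key bytes."""
    return base64.b64encode(raw_key).decode("ascii")


def decodes_to_length(key_text: str, expected_bytes: int = 32) -> bool:
    """Check that key_text is valid Base64 and decodes to the expected size."""
    try:
        raw = base64.b64decode(key_text, validate=True)
    except (binascii.Error, ValueError):
        return False
    return len(raw) == expected_bytes


# Placeholder bytes for illustration only; never use a predictable key.
example_key = bytes(range(32))
key_text = encode_key(example_key)
print(key_text, decodes_to_length(key_text))
```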
When you have entered your authentication information, click Test Connection.
By default, if you have already created a connection, Upsolver selects Use an existing connection, and your Amazon S3 connection is populated in the list.
For organizations with multiple connections, select the source connection you want to use.
When the connection is established, Upsolver attempts to list your buckets, provided the connection above grants the s3:ListAllMyBuckets permission. Alternatively, you can specify the name of your bucket, e.g. s3://upsolver-samples.
Next, you can optionally Select the location to read the files from. Leave this empty to ingest the entire bucket.
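To make the relationship between the bucket and the optional location concrete, the sketch below splits an S3 URI into a bucket name and a key prefix. An empty prefix corresponds to reading the entire bucket. The function name `split_s3_uri` and the `orders/` folder are hypothetical and are not part of Upsolver.

```python
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> tuple[str, str]:
    """Split an s3:// URI into (bucket, prefix); prefix may be empty."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    # Drop the leading slash so the prefix matches how S3 keys are written.
    return parsed.netloc, parsed.path.lstrip("/")


print(split_s3_uri("s3://upsolver-samples"))          # whole bucket
print(split_s3_uri("s3://upsolver-samples/orders/"))  # one folder only
```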
To specify the file types to ingest, choose an option from the Select the file's content type / Parse files using list, e.g. JSON, CSV, Parquet. This list defaults to Automatic.
Select the file name pattern for the files you would like to ingest
Upsolver ingests all files in the selected location by default. To change this option, in the list, select Ingest files matching a regular expression.
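Before entering a regular expression, it can help to test it against a few sample file names. The sketch below uses Python's `re` module and matches anywhere in the object key; Upsolver's regex dialect and whether it matches against the full key or only the file name are not specified here, so treat both as assumptions. The sample keys are hypothetical.

```python
import re


def filter_keys(keys: list[str], pattern: str) -> list[str]:
    """Return the keys in which the pattern is found, preserving order."""
    compiled = re.compile(pattern)
    return [key for key in keys if compiled.search(key)]


sample_keys = [
    "orders/2024/01/15/part-0001.json",
    "orders/2024/01/15/_SUCCESS",
    "orders/2024/01/15/part-0002.json.tmp",
]

# Keep only keys that end in ".json"; marker and temporary files are skipped.
print(filter_keys(sample_keys, r"\.json$"))
```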
If your source files are partitioned by a date pattern, Upsolver can load existing and new files using the pattern. This affects the order of files loaded and avoids delays when many changes occur across the bucket.
By default, Upsolver will list and ingest files in the ingest job’s bucket and folder as soon as they are discovered. When you set a date pattern, Upsolver uses the date in the folder path to understand when new files are added. The date in the path is used to process data in order of arrival. If files are added to a folder named with a future date, these files will not be ingested until that date becomes the present.
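The effect of a date pattern can be sketched as follows: parse the date from each folder path, hold back folders dated in the future, and process the rest in date order. This is an illustration of the behavior described above, not Upsolver's implementation. It uses Python `strptime` codes rather than Upsolver's own pattern syntax, and the function name, prefix, and keys are hypothetical.

```python
from datetime import datetime


def ready_keys(keys, prefix, date_format, now):
    """Return keys whose folder date is not in the future, oldest first."""
    depth = date_format.count("/") + 1  # number of folder levels in the date
    dated = []
    for key in keys:
        if not key.startswith(prefix):
            continue
        parts = key[len(prefix):].split("/")
        if len(parts) <= depth:
            continue  # no file name after the date folders
        try:
            folder_date = datetime.strptime("/".join(parts[:depth]), date_format)
        except ValueError:
            continue  # path does not follow the date pattern
        if folder_date <= now:
            dated.append((folder_date, key))
    return [key for _, key in sorted(dated)]


keys = [
    "events/2024/01/16/a.json",
    "events/2024/01/15/b.json",
    "events/2030/01/01/c.json",  # future-dated folder, held back
]
print(ready_keys(keys, "events/", "%Y/%m/%d", datetime(2024, 1, 16, 12)))
```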
Delete the source files following ingestion
To discover new files, when a date pattern is not set, Upsolver lists the top-level prefix and performs a diff to detect newly created files. It then lists the paths adjacent to these newly added files and assumes that if a file was added here, others will be as well. This process is performed at regular intervals to ensure files are not missed.
For buckets with few files and predictable changes, this works well. However, for buckets with many changes across millions of files and hundreds of prefixes, the scanning and diffing process may result in ingestion and processing delays.
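The list-and-diff discovery described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not Upsolver's actual implementation: it compares two full listings held in memory and derives the folders that received new files, which are the paths worth listing again. The function name and keys are hypothetical.

```python
import posixpath


def discover_new_files(previous: set[str], current: set[str]):
    """Diff two listings; return the new keys and the folders they landed in."""
    new_files = current - previous
    # Folders that just received files are likely to receive more soon.
    active_folders = {posixpath.dirname(key) for key in new_files}
    return new_files, active_folders


previous_listing = {"a/1.json"}
current_listing = {"a/1.json", "a/2.json", "b/c/3.json"}

new_files, active_folders = discover_new_files(previous_listing, current_listing)
print(sorted(new_files), sorted(active_folders))
```

Because every pass compares complete listings, the work grows with the total number of keys in the bucket rather than with the number of new files, which is why large buckets see delays.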
To optimize this process, consider setting the Delete the source files following ingestion option to TRUE. This moves ingested files to another staging location, leaving the source folder empty and making it easier and faster for Upsolver to discover new files. Be aware that configuring Upsolver to move ingested files could impact other systems if they depend on the same raw files.
When you select a bucket and folder, Upsolver will attempt to load a sample of the files.
If Upsolver did not load any sample files, try the following:
Verify that the location on your bucket contains files.
Select a content type that matches the content type of your stream.