Amazon S3

The Amazon S3 Data Source allows data to be read directly from an Amazon S3 bucket into Upsolver. This Data Source is intended for ingesting bulk files containing multiple events per file; it is not recommended for ingesting files that each contain a single event. The Amazon S3 Data Source reads new files every minute from the provided prefix within the S3 bucket. The files must be stored in an alphabetically ordered folder structure based on the time of the file.

Folder Structure and Date Pattern

When defining an Amazon S3 Data Source, make sure the files are written to a well-defined path based on the event time, and provide the Date Pattern used in this folder structure as part of the Data Source definition.

All the data must be placed in a folder structure that ends with a date pattern. The date pattern must contain all date parts - year, month and day, in that order. Hour and minute can optionally be added.
A good example of this folder structure would be:

DATA_BUCKET/yyyy/MM/dd/HH/mm/file1.json.gz

In this example the Date Pattern would be

yyyy/MM/dd/HH/mm

The Date Pattern uses the same syntax as Java's date format patterns (SimpleDateFormat).
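For instance, here is a minimal Java sketch that formats an event time into such a path (the class name, bucket name, and file name are our own, purely illustrative):

import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class DatePathExample {
    public static void main(String[] args) {
        // The same pattern string you would provide in the Data Source definition.
        SimpleDateFormat format = new SimpleDateFormat("yyyy/MM/dd/HH/mm");
        // Date paths represent the time in UTC.
        format.setTimeZone(TimeZone.getTimeZone("UTC"));
        String datePath = format.format(new Date());
        // Prints something like: DATA_BUCKET/2017/01/01/00/30/file1.json.gz
        System.out.println("DATA_BUCKET/" + datePath + "/file1.json.gz");
    }
}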

If a file is written to a date path more than an hour old, that file might not be ingested into the system. The date paths represent the time in UTC.

The processing time of items read from these files is considered to be the last modified time of the file.

Date Pattern Examples

Files Output from AWS Kinesis Firehose

If you're using AWS Kinesis Firehose to ingest data into S3, your files are probably named like this:

  • a/b/events-2016/12/31/22/filename
  • a/b/events-2016/12/31/23/filename
  • a/b/events-2017/01/01/00/filename

The date portion of each path (events-2016/12/31/22 and so on) is what should be included in the date pattern. The a/b/ prefix can be set as the folder name when setting up the S3 Data Source.

The date pattern should be:

'events-'yyyy/MM/dd/HH

Literal values are added to the date pattern by quoting them with single quotes. We've added the events- literal to the first folder name in the date pattern; the rest of the string is a 4 digit year (e.g. 2017), a 2 digit month (01-12), a 2 digit day (01-31) and a 2 digit hour (00-23). Quoting the slash (/) is optional, since it is not a pattern letter and is treated as a literal in date patterns.
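To see how the quoting behaves, here is a small Java sketch (illustrative only) using the Firehose pattern above:

import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class FirehosePatternExample {
    public static void main(String[] args) {
        // Single quotes mark the events- literal; the remaining letters are date parts.
        SimpleDateFormat format = new SimpleDateFormat("'events-'yyyy/MM/dd/HH");
        format.setTimeZone(TimeZone.getTimeZone("UTC"));
        // Prints something like: events-2017/01/01/00
        System.out.println(format.format(new Date()));
    }
}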

Input files for AWS Athena / PrestoDB

Athena (or PrestoDB) requires a specific folder format if you want it to load partitions automatically. The file names should look like the examples below; the date portion of the path (YEAR=... through MINUTE=...) is required, and the folder before the date pattern must be fixed.

  • folder/YEAR=2017/MONTH=02/DAY=28/HOUR=01/MINUTE=20/some.filename
  • folder/YEAR=2017/MONTH=02/DAY=28/HOUR=02/MINUTE=05/some.filename
  • folder/YEAR=2017/MONTH=02/DAY=28/HOUR=02/MINUTE=05/some.folder/another.filename

Note that the full date, down to the minute, appears in every file path.

The date pattern should be:

'YEAR='yyyy/'MONTH='MM/'DAY='dd/'HOUR='HH/'MINUTE='mm

We excluded the root folder name from the date pattern; it should be set as the folder parameter instead.

We use single quotes to mark literals in the date pattern. This date pattern is composed of:

  1. YEAR= literal string
  2. 4 digit year (e.g. 2017)
  3. /MONTH= literal string
  4. 2 digit month (01-12)
  5. /DAY= literal string
  6. 2 digit day (01-31)
  7. /HOUR= literal string
  8. 2 digit hour (00-23)
  9. /MINUTE= literal string
  10. 2 digit minute (00-59)
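As a quick sanity check, the pattern can be round-tripped in Java. This is an illustrative sketch, not part of Upsolver:

import java.text.SimpleDateFormat;
import java.util.TimeZone;

public class AthenaPatternExample {
    public static void main(String[] args) throws Exception {
        SimpleDateFormat format =
            new SimpleDateFormat("'YEAR='yyyy/'MONTH='MM/'DAY='dd/'HOUR='HH/'MINUTE='mm");
        format.setTimeZone(TimeZone.getTimeZone("UTC"));
        // Parsing one of the example paths back into a date verifies the pattern matches.
        System.out.println(format.parse("YEAR=2017/MONTH=02/DAY=28/HOUR=02/MINUTE=05"));
    }
}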

Creating A New Amazon S3 Data Source

  1. On the Data Sources page click Add New Data Source.
  2. Select Amazon S3 from the list.
  3. Fill in a name for your Data Source.
  4. Select the Connection Info used to connect to the Amazon S3 bucket containing the data to be ingested.
    1. If you have not yet configured a Connection with the relevant info, the drop-down allows you to create one.
  5. Fill in the prefix under which the data is stored in the bucket. For example, if your data is stored in 'MY_BUCKET/my-data/yyyy/MM/dd/HH/mm/file.json.gz' you would enter 'my-data' in the Folder text box.
    1. If the data is directly in the root of the bucket, you can leave the Folder text box empty.
  6. Fill in the Date Pattern as described above in Folder Structure and Date Pattern.
  7. If you would like to filter the files in the input folder, fill in the File Name Pattern. By default this is set to All, which reads every file in the provided folder. You can change it to filter based on the start of the file name, the end of the file name, or a custom regular expression (see the sketch after this list).
  8. Select the Content Format of the files from the drop-down list.
  9. Select the Compute Cluster you would like this Data Source to run on. If you have not created a Compute Cluster yet, you can create one from the drop-down menu.
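The three filter modes map to simple string and regular-expression checks. The Java sketch below only illustrates the matching semantics; the file name and patterns are made up, and this is not Upsolver's actual code:

import java.util.regex.Pattern;

public class FileNamePatternExample {
    public static void main(String[] args) {
        String fileName = "events-2017-01-01.json.gz";

        // Filter on the start of the file name
        boolean startsWith = fileName.startsWith("events-");
        // Filter on the end of the file name
        boolean endsWith = fileName.endsWith(".json.gz");
        // Filter with a custom regular expression
        boolean regex = Pattern.matches("events-\\d{4}-\\d{2}-\\d{2}\\.json\\.gz", fileName);

        System.out.printf("startsWith=%b endsWith=%b regex=%b%n", startsWith, endsWith, regex);
    }
}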

On the right you will see a preview of some of the files that will be loaded based on your configuration. Use this list to ensure you configured your input correctly and that the correct files are listed. Once you are confident the right files will be loaded, press the Create button.

Compression

The files ingested by an Amazon S3 Data Source can be compressed; Upsolver will automatically detect the compression type. Supported compression types are Zip, GZip and Snappy.
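Detection of this kind is typically done by inspecting the first few bytes of each file. The sketch below shows one way it could work; this is a hypothetical illustration based on well-known magic bytes, not Upsolver's implementation:

import java.io.IOException;
import java.io.InputStream;

public class CompressionSniffer {
    // Best-effort guess from the file header; illustrative only.
    static String detect(InputStream in) throws IOException {
        byte[] header = new byte[4];
        int read = in.read(header); // may read fewer than 4 bytes on very short files
        if (read >= 2 && (header[0] & 0xFF) == 0x1F && (header[1] & 0xFF) == 0x8B) {
            return "GZip"; // GZip streams start with 0x1F 0x8B
        }
        if (read >= 4 && header[0] == 'P' && header[1] == 'K' && header[2] == 3 && header[3] == 4) {
            return "Zip"; // Zip local file headers start with PK\3\4
        }
        if (read >= 4 && (header[0] & 0xFF) == 0xFF && header[1] == 0x06
                && header[2] == 0x00 && header[3] == 0x00) {
            return "Snappy"; // Snappy framed format starts with a stream identifier chunk
        }
        return "Unknown/uncompressed";
    }
}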

results matching ""

    No results matching ""