Create a data source
This article explains how to create different types of data sources using an API call. All API calls require an API token.
Amazon S3 (Quick)
Connect to your AWS S3 Bucket.
In order for Upsolver to read events directly from your cloud storage, files should be partitioned by date and time (which defines the folder structure in the cloud storage).
A prerequisite for defining a cloud storage data source is providing Upsolver with the appropriate credentials for reading from your cloud storage. See: S3 connection
Fields
Field | Name | Type | Description | Optional |
bucket | Bucket | String | The Amazon S3 bucket to read from. | |
globFilePattern | Glob File Pattern | String | The pattern for files to ingest. | |
datePattern | Date Pattern | String | The date pattern in the file name/folder structure (e.g. | |
contentType | Content Format | ContentType | The format of the messages. Supported formats are: JSON, AVRO, CSV, TSV, ORC, Protobuf and x-www-form-urlencoded. For self-describing formats like JSON, the schema is auto-detected. The message body should contain the message itself, which should not be url-encoded. Messages can be compressed; Upsolver automatically detects the compression type. Supported compression types are: Zip, GZip, Snappy and None. See: Content formats | |
compression | Compression | Compression | The compression in the data (Zip, GZip, Snappy, SnappyUnframed, Tar, or None). | |
displayData.name | Name | String | The data source name. | |
displayData.description | Description | String | The data source description. | |
softRetention | Soft Retention | Boolean | A setting that prevents data deletion when the retention policy in Upsolver activates. When enabled, the metadata is purged but the underlying data (e.g. S3 object) is not deleted. | |
prefix | Folder | String | If the data resides in a sub folder within the defined cloud storage, specify this folder. | + |
startExecutionFrom | Start Ingestion From | String (ISO-8601) | The time from which to ingest the data. Files from before this time (based on the provided date pattern) are ignored. If you leave this field empty, all files are ingested. | + |
workspaces | Workspaces | String[] | The workspaces attached to this data source. | + |
Example 1
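A minimal sketch of the request for the required fields above, shown as curl. The endpoint path, the Authorization header name, and the shape of the ContentType value are illustrative assumptions; only the field names and types come from the table.

```bash
# Hypothetical endpoint path and auth header -- substitute the values for your Upsolver environment.
curl -X POST "https://api.upsolver.com/api/v1/data-sources/s3" \
  -H "Authorization: YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "bucket": "my-events-bucket",
    "globFilePattern": "*.json",
    "datePattern": "yyyy/MM/dd/HH",
    "contentType": { "type": "json" },
    "compression": "None",
    "displayData": { "name": "S3 events", "description": "Raw JSON events from S3" },
    "softRetention": true
  }'
```

The dotted field names in the table (displayData.name, displayData.description) are written here as a nested displayData object.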
Example 2
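A second sketch that also sets the optional fields (prefix, startExecutionFrom, workspaces), under the same assumptions about the endpoint and value shapes:

```bash
curl -X POST "https://api.upsolver.com/api/v1/data-sources/s3" \
  -H "Authorization: YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "bucket": "my-events-bucket",
    "globFilePattern": "*.csv",
    "datePattern": "yyyy/MM/dd",
    "contentType": { "type": "csv" },
    "compression": "GZip",
    "displayData": { "name": "S3 billing exports", "description": "Daily CSV exports" },
    "softRetention": false,
    "prefix": "billing/",
    "startExecutionFrom": "2021-01-01T00:00:00Z",
    "workspaces": ["analytics"]
  }'
```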
Amazon S3 (Advanced)
Connect to your AWS S3 Bucket.
In order for Upsolver to read events directly from your cloud storage, files should be partitioned by date and time (which defines the folder structure in the cloud storage).
A prerequisite for defining a cloud storage data source is providing Upsolver with the appropriate credentials for reading from your cloud storage. See: S3 connection
Fields
Field | Name | Type | Description | Optional |
displayData.name | Name | String | The data source name. | |
displayData.description | Description | String | The data source description. | |
sourceStorage | S3 Connection | String | The cloud storage to ingest files from. | |
datePattern | Date Pattern | String | The date pattern in the file name/folder structure. For example: | |
fileMatchPattern | File Name Pattern | FileNameMatcher | The file name pattern for the files to ingest. If all the files in the specified folders are relevant, specify All. The pattern given is matched against the file path starting from the bucket specified. | |
contentType | Content Format | ContentType | The format of the messages. Supported formats are: JSON, AVRO, CSV, TSV, ORC, Protobuf and x-www-form-urlencoded. For self-describing formats like JSON, the schema is auto-detected. The message body should contain the message itself, which should not be url-encoded. Messages can be compressed; Upsolver automatically detects the compression type. Supported compression types are: Zip, GZip, Snappy and None. See: Content formats | |
computeEnvironment | Compute Cluster | String | The compute cluster to run the calculation on. See Adding a Compute Cluster. | |
destinationStorage | Target Storage | String | The data and metadata files for this data source will be stored in this storage. | |
compression | Compression | Compression | The compression in the data (Zip, GZip, Snappy, SnappyUnframed, Tar, or None). | |
interval | Interval | Int (Minutes) | The sliding interval to wait for data. | |
prefix | Folder | String | If the data resides in a sub folder within the defined cloud storage, specify this folder. | + |
startExecutionFrom | Start Ingestion From | String (ISO-8601) | The time from which to ingest the data. Files from before this time (based on the provided date pattern) are ignored. If you leave this field empty, all files are ingested. | + |
retention | Retention | Int (Minutes) | The retention period for the data. | + |
dataDumpDate | Data Dump Date | String (ISO-8601) | The date that the data starts. | + |
maxDelay | Max Delay | Int (Minutes) | The maximum delay to consider the data, that is, any data that arrives delayed by more than the max delay is filtered out. | + |
Example
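A sketch of the advanced request, with the same caveats: the endpoint path and the shapes of the ContentType and FileNameMatcher values are assumptions, while the field names follow the table above. Connection, cluster, and storage values are placeholder names for entities defined in your account.

```bash
curl -X POST "https://api.upsolver.com/api/v1/data-sources/s3-advanced" \
  -H "Authorization: YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "displayData": { "name": "S3 clickstream", "description": "Clickstream files from S3" },
    "sourceStorage": "my-s3-connection",
    "datePattern": "yyyy/MM/dd/HH",
    "fileMatchPattern": { "type": "all" },
    "contentType": { "type": "json" },
    "computeEnvironment": "my-compute-cluster",
    "destinationStorage": "my-target-storage",
    "compression": "None",
    "interval": 1,
    "startExecutionFrom": "2021-01-01T00:00:00Z"
  }'
```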
Amazon Kinesis Stream (Quick)
Connect to your Amazon Kinesis. Upsolver can read events from your Amazon Kinesis, according to the stream you define.
A prerequisite for defining an Amazon Kinesis stream connection is providing Upsolver with the appropriate credentials for reading from your Amazon Kinesis stream. See: Kinesis connection
Fields
Field | Name | Type | Description | Optional |
region | Region | Region | Your AWS region. | |
streamName | Stream | String | The name of the relevant Kinesis stream. | |
contentType | Content Format | ContentType | The format of the messages. Supported formats are: JSON, AVRO, CSV, TSV, ORC, Protobuf and x-www-form-urlencoded. For self-describing formats like JSON, the schema is auto-detected. The message body should contain the message itself, which should not be url-encoded. Messages can be compressed; Upsolver automatically detects the compression type. Supported compression types are: Zip, GZip, Snappy and None. See: Content formats | |
compression | Compression | Compression | The compression in the data (Zip, GZip, Snappy, SnappyUnframed, Tar, or None). | |
displayData.name | Name | String | The data source name. | |
displayData.description | Description | String | The data source description. | |
softRetention | Soft Retention | Boolean | A setting that prevents data deletion when the retention policy in Upsolver activates. When enabled, the metadata is purged but the underlying data (e.g. S3 object) is not deleted. | |
workspaces | Workspaces | String[] | The workspaces attached to this data source. | + |
Example
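An illustrative request for the quick Kinesis data source; the endpoint path and the ContentType shape are assumptions.

```bash
curl -X POST "https://api.upsolver.com/api/v1/data-sources/kinesis" \
  -H "Authorization: YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "region": "us-east-1",
    "streamName": "my-kinesis-stream",
    "contentType": { "type": "json" },
    "compression": "None",
    "displayData": { "name": "Kinesis events", "description": "Events read from a Kinesis stream" },
    "softRetention": true
  }'
```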
Amazon Kinesis Stream (Advanced)
Connect to your Amazon Kinesis. Upsolver can read events from your Amazon Kinesis, according to the stream you define.
A prerequisite for defining an Amazon Kinesis stream connection is providing Upsolver with the appropriate credentials for reading from your Amazon Kinesis stream. See: Kinesis connection
Fields
Field | Name | Type | Description | Optional |
displayData.name | Name | String | The data source name. | |
displayData.description | Description | String | The data source description. | |
contentType | Content Format | ContentType | The format of the messages. Supported formats are: JSON, AVRO, CSV, TSV, ORC, Protobuf and x-www-form-urlencoded. For self-describing formats like JSON, the schema is auto-detected. The message body should contain the message itself, which should not be url-encoded. Messages can be compressed; Upsolver automatically detects the compression type. Supported compression types are: Zip, GZip, Snappy and None. See: Content formats | |
kinesisConnection | Kinesis Connection | String | The AWS credentials to connect to Kinesis. | |
streamName | Stream | String | The name of the relevant Kinesis stream. | |
readFromStart | Read From Start | String | The time from which to ingest the data. Messages from before this time are ignored. If you leave this field empty, all messages are ingested. | |
computeEnvironment | Compute Cluster | String | The compute cluster to run the calculation on. See: Compute cluster | |
connectionPointer | Target Storage | String | The data and metadata files for this data source will be stored in this storage. | |
isOnline | Real Time Statistics | Boolean | Calculate this data source's statistics in real time directly from the input stream if a real time cluster is deployed. | |
shards | Shards | Int | How many readers to use in parallel to read the stream. As a rule of thumb, increase this by 1 for every 70 MB/s sent to your stream. | |
parallelism | Parallelism | Int | The number of independent shards to parse data, to increase parallelism and reduce latency. This should remain 1 in most cases and be no more than the number of shards used to read the data from the source. | |
compression | Compression | Compression | The compression in the data (Zip, GZip, Snappy, SnappyUnframed, Tar, or None). | |
retention | Retention | Int (Minutes) | A retention period for the data in Upsolver. After this period of time passes, the data is deleted forever. | + |
endExecutionAt | End Read At | String (ISO-8601) | If configured, stop reading after this date. | + |
Example
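An illustrative advanced Kinesis request under the same assumptions; the connection, cluster, and storage values are placeholder names for connections defined in your account.

```bash
curl -X POST "https://api.upsolver.com/api/v1/data-sources/kinesis-advanced" \
  -H "Authorization: YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "displayData": { "name": "Kinesis orders", "description": "Order events from Kinesis" },
    "contentType": { "type": "json" },
    "kinesisConnection": "my-kinesis-connection",
    "streamName": "orders-stream",
    "readFromStart": "2021-01-01T00:00:00Z",
    "computeEnvironment": "my-compute-cluster",
    "connectionPointer": "my-target-storage",
    "isOnline": true,
    "shards": 1,
    "parallelism": 1,
    "compression": "None"
  }'
```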
Amazon S3 over SQS
Connect to your AWS S3 Bucket using SQS Notifications.
You will need to configure SQS notifications from your S3 bucket and grant permissions to read and delete messages from the SQS queue to the same access key and secret key you entered to give Upsolver permission to read from the S3 bucket. See: S3 over SQS connection
Fields
Field | Name | Type | Description | Optional |
displayData.name | Name | String | The data source name. | |
displayData.description | Description | String | The data source description. | |
sourceStorage | Source Storage | String | The cloud storage to ingest files from. | |
contentType | Content Format | ContentType | The format of the messages. Supported formats are: JSON, AVRO, CSV, TSV, ORC, Protobuf and x-www-form-urlencoded. For self-describing formats like JSON, the schema is auto-detected. The message body should contain the message itself, which should not be url-encoded. Messages can be compressed; Upsolver automatically detects the compression type. Supported compression types are: Zip, GZip, Snappy and None. See: Content formats | |
computeEnvironment | Compute Cluster | String | The compute cluster to run the calculation on. See: Compute cluster | |
destinationStorage | Target Storage | String | The data and metadata files for this data source will be stored in this storage. | |
compression | Compression | Compression | The compression in the data (Zip, GZip, Snappy, SnappyUnframed, Tar, or None). | |
softRetention | Soft Retention | Boolean | A setting that prevents data deletion when the retention policy in Upsolver activates. When enabled, the metadata is purged but the underlying data (e.g. S3 object) is not deleted. | |
executionParallelism | Parallelism | Int | The number of independent shards to parse data, to increase parallelism and reduce latency. This should remain 1 in most cases and be no more than the number of shards used to read the data from the source. | |
prefix | Prefix | String | The prefix of the files or directories. To filter a specific directory, add a trailing /. | + |
suffix | Suffix | String | The suffix of the files to read. | + |
startExecutionFrom | Start Ingestion From | String (ISO-8601) | The time from which to ingest the data. Messages from before this time are ignored. If you leave this field empty, all messages are ingested. | + |
endExecutionAt | End Read At | String (ISO-8601) | If configured, stop reading after this date. | + |
retention | Retention | Int (Minutes) | A retention period for the data in Upsolver. After this period of time passes, the data is deleted forever. | + |
Example
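An illustrative request for an S3-over-SQS data source, with the usual caveats about the assumed endpoint path and ContentType shape:

```bash
curl -X POST "https://api.upsolver.com/api/v1/data-sources/s3-sqs" \
  -H "Authorization: YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "displayData": { "name": "S3 over SQS events", "description": "Files announced through SQS notifications" },
    "sourceStorage": "my-s3-over-sqs-connection",
    "contentType": { "type": "json" },
    "computeEnvironment": "my-compute-cluster",
    "destinationStorage": "my-target-storage",
    "compression": "GZip",
    "softRetention": false,
    "executionParallelism": 1,
    "prefix": "events/",
    "suffix": ".json"
  }'
```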
Apache Kafka (Quick)
Connect to any topic on your Kafka Servers. Upsolver can read events from your Kafka cluster from the specified Kafka topic.
A prerequisite for defining a Kafka stream connection is providing Upsolver with the appropriate credentials for reading from your Kafka cluster. See: Kafka connection
Fields
Field | Name | Type | Description | Optional |
kafkaHosts | Kafka Hosts | String | The Kafka hosts separated with commas. For example: foo:9092,bar:9092 | |
topicName | Kafka Topic | String | The Kafka topic to ingest the data from. | |
contentType | Content Format | ContentType | The format of the messages. Supported formats are: JSON, AVRO, CSV, TSV, ORC, Protobuf and x-www-form-urlencoded. For self-describing formats like JSON, the schema is auto-detected. The message body should contain the message itself, which should not be url-encoded. Messages can be compressed; Upsolver automatically detects the compression type. Supported compression types are: Zip, GZip, Snappy and None. See: Content formats | |
compression | Compression | Compression | The compression in the data (Zip, GZip, Snappy, SnappyUnframed, Tar, or None). | |
displayData.name | Name | String | The data source name. | |
displayData.description | Description | String | The data source description. | |
softRetention | Soft Retention | Boolean | A setting that prevents data deletion when the retention policy in Upsolver activates. When enabled, the metadata is purged but the underlying data (e.g. S3 object) is not deleted. | |
maybekafkaVersion | Kafka Version | KafkaVersion | The version of the Kafka Servers. If unsure, use | + |
workspaces | Workspaces | String[] | The workspaces attached to this data source. | + |
Example
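An illustrative quick Kafka request; the endpoint path and ContentType shape are assumptions, and the host list follows the foo:9092,bar:9092 convention used in the fields above.

```bash
curl -X POST "https://api.upsolver.com/api/v1/data-sources/kafka" \
  -H "Authorization: YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "kafkaHosts": "foo:9092,bar:9092",
    "topicName": "clickstream",
    "contentType": { "type": "json" },
    "compression": "None",
    "displayData": { "name": "Kafka clickstream", "description": "Events from the clickstream topic" },
    "softRetention": true
  }'
```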
Apache Kafka (Advanced)
Connect to any topic on your Kafka Servers. Upsolver can read events from your Kafka cluster from the specified Kafka topic.
A prerequisite for defining a Kafka stream connection is providing Upsolver with the appropriate credentials for reading from your Kafka cluster. See: Kafka connection
Fields
Field | Name | Type | Description | Optional |
displayData.name | Name | String | The data source name. | |
displayData.description | Description | String | The data source description. | |
kafkaVersion | Kafka Version | KafkaVersion | The version of the Kafka Servers. If unsure, use | |
kafkaHosts | Kafka Hosts | String | The Kafka hosts separated with commas. For example: foo:9092,bar:9092 | |
topicName | Kafka Topic | String | The Kafka topic to ingest the data from. | |
readFromStart | Read From Start | Boolean | Whether to read the data from the start of the topic or to begin from the end. | |
contentType | Content Format | ContentType | The format of the messages. Supported formats are: JSON, AVRO, CSV, TSV, ORC, Protobuf and x-www-form-urlencoded. For self-describing formats like JSON, the schema is auto-detected. The message body should contain the message itself, which should not be url-encoded. Messages can be compressed; Upsolver automatically detects the compression type. Supported compression types are: Zip, GZip, Snappy and None. See: Content formats | |
computeEnvironment | Compute Cluster | String | The compute cluster to run the calculation on. See: Compute cluster | |
connectionPointer | Target Storage | String | The data and metadata files for this data source will be stored in this storage. | |
softRetention | Soft Retention | Boolean | A setting that prevents data deletion when the retention policy in Upsolver activates. When enabled, the metadata is purged but the underlying data (e.g. S3 object) is not deleted. | |
shards | Shards | Int | How many readers to use in parallel to read the stream. As a rule of thumb, increase this by 1 for every 70 MB/s sent to your topic. | |
executionParallelism | Execution Parallelism | Int | The number of independent shards to parse data, to increase parallelism and reduce latency. This should remain 1 in most cases and be no more than the number of shards used to read the data from the source. | |
isOnline | Real Time Statistics | Boolean | Calculate this data source's statistics in real time directly from the input stream if a real time cluster is deployed. | |
useSsl | Use SSL | Boolean | Set this to true if your connection requires SSL. Contact us to ensure that your SSL certificate is supported. | |
storeRawData | Store Raw Data | Boolean | Store an additional copy of the data in its original format. | |
compression | Compression | Compression | The compression in the data (Zip, GZip, Snappy, SnappyUnframed, Tar, or None). | |
consumerProperties | Kafka Consumer Properties | String | Extra properties for the Kafka consumer. | + |
retention | Retention | Int (Minutes) | A retention period for the data in Upsolver. After this period of time passes, the data is deleted forever. | + |
endExecutionAt | End Read At | String (ISO-8601) | If configured, stop reading after this date. | + |
workspaces | Workspaces | String[] | The workspaces attached to this data source. | + |
Example
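An illustrative advanced Kafka request. In addition to the assumed endpoint path and ContentType shape, the kafkaVersion value here is a placeholder; the accepted values are defined by the KafkaVersion type.

```bash
# "kafkaVersion" and "contentType" use placeholder values; consult the KafkaVersion
# and ContentType definitions for the exact accepted shapes.
curl -X POST "https://api.upsolver.com/api/v1/data-sources/kafka-advanced" \
  -H "Authorization: YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "displayData": { "name": "Kafka clickstream (advanced)", "description": "Clickstream topic with custom settings" },
    "kafkaVersion": "2.x",
    "kafkaHosts": "foo:9092,bar:9092",
    "topicName": "clickstream",
    "readFromStart": true,
    "contentType": { "type": "json" },
    "computeEnvironment": "my-compute-cluster",
    "connectionPointer": "my-target-storage",
    "softRetention": false,
    "shards": 1,
    "executionParallelism": 1,
    "isOnline": true,
    "useSsl": false,
    "storeRawData": true,
    "compression": "None"
  }'
```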
Azure Blob storage
Connect to your Azure Blob storage container.
In order for Upsolver to read events directly from your cloud storage, files should be partitioned by date and time (which defines the folder structure in the cloud storage).
A prerequisite for defining a cloud storage data source is providing Upsolver with the appropriate credentials for reading from your cloud storage. See: Azure Blob storage connection
Fields
Field | Name | Type | Description | Optional |
displayData.name | Name | String | The data source name. | |
displayData.description | Description | String | The data source description. | |
sourceStorage | Azure Blob Storage Connection | String | The cloud storage to ingest files from. | |
datePattern | Date Pattern | String | The date pattern in the file name/folder structure (e.g. | |
fileMatchPattern | File Name Pattern | FileNameMatcher | The file name pattern for the files to ingest. If all the files in the specified folders are relevant, specify All. The pattern given is matched against the file path starting from the storage container specified. | |
contentType | Content Format | ContentType | The format of the messages. Supported formats are: JSON, AVRO, CSV, TSV, ORC, Protobuf and x-www-form-urlencoded. For self-describing formats like JSON, the schema is auto-detected. The message body should contain the message itself, which should not be url-encoded. Messages can be compressed; Upsolver automatically detects the compression type. Supported compression types are: Zip, GZip, Snappy and None. See: Content formats | |
computeEnvironment | Compute Cluster | String | The compute cluster to run the calculation on. See: Compute cluster | |
destinationStorage | Target Storage | String | The data and metadata files for this data source will be stored in this storage. | |
softRetention | Soft Retention | Boolean | A setting that prevents data deletion when the retention policy in Upsolver activates. When enabled, the metadata is purged but the underlying data (e.g. S3 object) is not deleted. | |
compression | Compression | Compression | The compression in the data (Zip, GZip, Snappy, SnappyUnframed, Tar, or None). | |
interval | Interval | Int (Minutes) | The sliding interval to wait for data. | |
prefix | Folder | String | If the data resides in a sub folder within the defined cloud storage, specify this folder. | + |
initialLoadConfiguration | Initial Load Configuration | InitialLoadConfiguration | If you have initial data, enter a prefix and regex pattern to list the relevant data and select the required files. | + |
startExecutionFrom | Start Ingestion From | String (ISO-8601) | The time from which to ingest the data. Files from before this time (based on the provided date pattern) are ignored. If you leave this field empty, all files are ingested. | + |
retention | Retention | Int (Minutes) | A retention period for the data in Upsolver. After this period of time passes, the data is deleted forever. | + |
dataDumpDate | Data Dump Date | String (ISO-8601) | The date that the data starts. | + |
maxDelay | Max Delay | Int (Minutes) | The maximum delay to consider the data, that is, any data that arrives delayed by more than the max delay is filtered out. | + |
workspaces | Workspaces | String[] | The workspaces attached to this data source. | + |
Example
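An illustrative Azure Blob storage request under the same assumptions about the endpoint path and the ContentType and FileNameMatcher shapes:

```bash
curl -X POST "https://api.upsolver.com/api/v1/data-sources/azure-blob" \
  -H "Authorization: YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "displayData": { "name": "Blob storage events", "description": "Events from an Azure Blob storage container" },
    "sourceStorage": "my-azure-blob-connection",
    "datePattern": "yyyy/MM/dd/HH",
    "fileMatchPattern": { "type": "all" },
    "contentType": { "type": "json" },
    "computeEnvironment": "my-compute-cluster",
    "destinationStorage": "my-target-storage",
    "softRetention": true,
    "compression": "None",
    "interval": 1
  }'
```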
Google Cloud Storage
Connect to your Google Storage Bucket.
In order for Upsolver to read events directly from your cloud storage, files should be partitioned by date and time (which defines the folder structure in the cloud storage).
A prerequisite for defining a cloud storage data source is providing Upsolver with the appropriate credentials for reading from your cloud storage. See: Google Storage connection
Fields
Field | Name | Type | Description | Optional |
displayData.name | Name | String | The data source name. | |
displayData.description | Description | String | The data source description. | |
sourceStorage | Google Storage Connection | String | The cloud storage to ingest files from. | |
datePattern | Date Pattern | String | The date pattern in the file name/folder structure (e.g. | |
fileMatchPattern | File Name Pattern | FileNameMatcher | The file name pattern for the files to ingest. If all the files in the specified folders are relevant, specify All. The pattern given is matched against the file path starting from the storage source specified. | |
contentType | Content Format | ContentType | The format of the messages. Supported formats are: JSON, AVRO, CSV, TSV, ORC, Protobuf and x-www-form-urlencoded. For self-describing formats like JSON, the schema is auto-detected. The message body should contain the message itself, which should not be url-encoded. Messages can be compressed; Upsolver automatically detects the compression type. Supported compression types are: Zip, GZip, Snappy and None. See: Content formats | |
computeEnvironment | Compute Cluster | String | The compute cluster to run the calculation on. See: Compute cluster | |
destinationStorage | Target Storage | String | The data and metadata files for this data source will be stored in this storage. | |
softRetention | Soft Retention | Boolean | A setting that prevents data deletion when the retention policy in Upsolver activates. When enabled, the metadata is purged but the underlying data (e.g. S3 object) is not deleted. | |
compression | Compression | Compression | The compression in the data (Zip, GZip, Snappy, SnappyUnframed, Tar, or None). | |
interval | Interval | Int (Minutes) | The sliding interval to wait for data. | |
prefix | Folder | String | If the data resides in a sub folder within the defined cloud storage, specify this folder. | + |
initialLoadConfiguration | Initial Load Configuration | InitialLoadConfiguration | If you have initial data, enter a prefix and regex pattern to list the relevant data and select the required files. | |
startExecutionFrom | Start Ingestion From | String (ISO-8601) | The time from which to ingest the data. Files from before this time (based on the provided date pattern) are ignored. If you leave this field empty, all files are ingested. | + |
retention | Retention | Int (Minutes) | A retention period for the data in Upsolver. After this period of time passes, the data is deleted forever. | + |
dataDumpDate | Data Dump Date | String (ISO-8601) | The date that the data starts. | + |
maxDelay | Max Delay | Int (Minutes) | The maximum delay to consider the data, that is, any data that arrives delayed by more than the max delay is filtered out. | + |
workspaces | Workspaces | String[] | The workspaces attached to this data source. | + |
Example
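An illustrative Google Cloud Storage request under the same assumptions, here with a folder and an ingestion start time set:

```bash
curl -X POST "https://api.upsolver.com/api/v1/data-sources/google-storage" \
  -H "Authorization: YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "displayData": { "name": "GCS events", "description": "Events from a Google Storage bucket" },
    "sourceStorage": "my-google-storage-connection",
    "datePattern": "yyyy/MM/dd/HH",
    "fileMatchPattern": { "type": "all" },
    "contentType": { "type": "json" },
    "computeEnvironment": "my-compute-cluster",
    "destinationStorage": "my-target-storage",
    "softRetention": true,
    "compression": "None",
    "interval": 1,
    "prefix": "events/",
    "startExecutionFrom": "2021-01-01T00:00:00Z"
  }'
```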
Kinesis-backed HTTP
Connect your stream using HTTP requests from any source.
Once you create the connection, you will be provided with an HTTP endpoint. Upsolver receives the data as a POST request with the data in the body, and stores it in a Kinesis stream until it is processed.
Headers sent with the request are also ingested as part of the stream, so metadata can be added to the request header.
Fields
Field | Name | Type | Description | Optional |
displayData.name | Name | String | The data source name. | |
displayData.description | Description | String | The data source description. | |
contentType | Content Format | ContentType | The format of the messages. Supported formats are: JSON, AVRO, CSV, TSV, ORC, Protobuf and x-www-form-urlencoded. For self-describing formats like JSON, the schema is auto-detected. The message body should contain the message itself, which should not be url-encoded. Messages can be compressed; Upsolver automatically detects the compression type. Supported compression types are: Zip, GZip, Snappy and None. See: Content formats | |
computeEnvironment | Compute Cluster | String | The compute cluster to run the calculation on. See: Compute cluster | |
ingestionEnvironment | Ingestion Cluster | String | ||
storageConnection | Target Storage | String | The data and metadata files for this data source will be stored in this storage. | |
kinesisConnection | Kinesis Connection | String | The AWS credentials to connect to Kinesis. | |
softRetention | Soft Retention | Boolean | A setting that prevents data deletion when the retention policy in Upsolver activates. When enabled, the metadata is purged but the underlying data (e.g. S3 object) is not deleted. | |
retention | Retention | Int (Minutes) | A retention period for the data in Upsolver. After this period of time passes, the data is deleted forever. | + |
startExecutionFrom | Start Ingestion From | String (ISO-8601) | The time from which to ingest the data. Files from before this time (based on the provided date pattern) are ignored. If you leave this field empty, all files are ingested. | + |
endExecutionAt | End Read At | String (ISO-8601) | If configured, stop reading after this date. | + |
workspaces | Workspaces | String[] | The workspaces attached to this data source. | + |
Example
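An illustrative request for a Kinesis-backed HTTP data source; the endpoint path and ContentType shape are assumptions, and the cluster, storage, and connection values are placeholder names.

```bash
curl -X POST "https://api.upsolver.com/api/v1/data-sources/kinesis-http" \
  -H "Authorization: YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "displayData": { "name": "HTTP events", "description": "Events posted to the generated HTTP endpoint" },
    "contentType": { "type": "json" },
    "computeEnvironment": "my-compute-cluster",
    "ingestionEnvironment": "my-ingestion-cluster",
    "storageConnection": "my-target-storage",
    "kinesisConnection": "my-kinesis-connection",
    "softRetention": false
  }'
```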
HTTP
Connect your stream using HTTP requests from any source.
Once you create the connection, you will be provided with an HTTP endpoint. Upsolver receives the data as a POST request with the data in the body.
Headers sent with the request are also ingested as part of the stream, so metadata can be added to the request header.
Fields
Field | Name | Type | Description | Optional |
displayData.name | Name | String | The data source name. | |
displayData.description | Description | String | The data source description. | |
contentType | Content Format | ContentType | The format of the messages. Supported formats are: JSON, AVRO, CSV, TSV, ORC, Protobuf and x-www-form-urlencoded. For self-describing formats like JSON, the schema is auto-detected. The message body should contain the message itself, which should not be url-encoded. Messages can be compressed; Upsolver automatically detects the compression type. Supported compression types are: Zip, GZip, Snappy and None. See: Content formats | |
computeEnvironment | Compute Cluster | String | The compute cluster to run the calculation on. See: Compute cluster | |
connectionPointer | Target Storage | String | The data and metadata files for this data source are stored in this storage. | |
softRetention | Soft Retention | Boolean | A setting that prevents data deletion when the retention policy in Upsolver activates. When enabled, the metadata is purged but the underlying data (e.g. S3 object) is not deleted. | |
shards | Shards | Int | How many readers to use in parallel to read the stream. As a rule of thumb, increase this by 1 for every 70 MB/s sent to this data source. | |
retention | Retention | Int (Minutes) | A retention period for the data in Upsolver. After this period of time passes, the data is deleted forever. | + |
endExecutionAt | End Read At | String (ISO-8601) | If configured, stop reading after this date. | + |
workspaces | Workspaces | String[] | The workspaces attached to this data source. | + |
Example
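An illustrative request for a plain HTTP data source, again against an assumed endpoint path:

```bash
curl -X POST "https://api.upsolver.com/api/v1/data-sources/http" \
  -H "Authorization: YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "displayData": { "name": "Webhook events", "description": "Events posted by upstream webhooks" },
    "contentType": { "type": "json" },
    "computeEnvironment": "my-compute-cluster",
    "connectionPointer": "my-target-storage",
    "softRetention": true,
    "shards": 1
  }'
```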