Adding shards in Upsolver splits the workload between multiple tasks that run simultaneously on different machines within the cluster. This lowers latency but also increases CPU usage.
Note that adding shards does not take immediate effect on delayed data. Only data that arrives after the minute in which the change was made is affected.
For example, suppose we added shards at 12:00 PM, but the data is delayed by an hour. This means that data from 11:00 AM is still being processed. The newly added shards take effect only once data from 12:00 PM begins to be processed.
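The timing rule in this example can be sketched in a few lines of Python. This is only an illustration of the rule, not Upsolver's implementation: the function `shards_for_minute` and its parameters are hypothetical, and the sketch assumes that data timestamped at or after the minute of the change uses the new shard count.

```python
from datetime import datetime

def shards_for_minute(data_minute, change_time, old_shards, new_shards):
    """Return the shard count used for the data belonging to data_minute."""
    # Assumption: data timestamped at or after the change uses the new count;
    # older (delayed) data keeps running with the previous count.
    return new_shards if data_minute >= change_time else old_shards

change = datetime(2024, 1, 1, 12, 0)   # shards added at 12:00 PM
backlog = datetime(2024, 1, 1, 11, 0)  # delayed data still being processed
fresh = datetime(2024, 1, 1, 12, 0)    # first data minute after the change

print(shards_for_minute(backlog, change, old_shards=2, new_shards=4))  # 2
print(shards_for_minute(fresh, change, old_shards=2, new_shards=4))    # 4
```

With a one-hour delay, the 11:00 AM backlog still runs on the old two shards, and the four new shards are used only from the 12:00 PM data onward.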
Types of Shards
Data Source Shards
Shards — Defines the number of shards/partitions that the data source will use in the ingestion or reading process.
Note that this option can only be used in a streaming data source (e.g. Kafka or Kinesis), as it scales up to the number of shards/partitions in the stream (or the number of headers in the data source).
Parallelism Type > EXECUTION PARALLELISM (Manual - Shards) — Defines the number of shards/partitions that the data source will use in the parsing or execution phase.
For CSV files, execution parallelism splits the source files by rows (e.g. the first row in each file would go to the first shard, the second row to the second shard, etc.).
For other file formats (in S3, Azure, etc.), the split is based on the files themselves (e.g. the first file in each minute would go to the first shard, the second file to the second shard, etc.).
If the number of execution parallelism shards is the same as the number of shards from ingestion, then each ingestion shard is matched to an execution parallelism shard.
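The row-based and file-based splits described above can be pictured as a round-robin assignment. The sketch below is an illustration only: `assign_round_robin` is a hypothetical helper, and wrapping back to the first shard after the last one is an assumption, since the description only covers the first few items.

```python
from collections import defaultdict

def assign_round_robin(items, num_shards):
    """Map shard index -> items, giving item i to shard i % num_shards."""
    shards = defaultdict(list)
    for i, item in enumerate(items):
        # Assumption: after the last shard, assignment wraps to the first.
        shards[i % num_shards].append(item)
    return dict(shards)

rows = ["row1", "row2", "row3", "row4", "row5"]  # rows of one CSV file
files = ["a.json", "b.json", "c.json"]           # files in one minute

print(assign_round_robin(rows, 2))
# {0: ['row1', 'row3', 'row5'], 1: ['row2', 'row4']}
print(assign_round_robin(files, 2))
# {0: ['a.json', 'c.json'], 1: ['b.json']}
```

The same helper covers both cases; only the unit being distributed (a row or a whole file) changes.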
Shards — Defines the number of shards/partitions that the output uses in the execution of the output (or in the case of an aggregated output, the building of the smallest part: a minute).
If this number is equal to the number of execution parallelism shards in the data source, then each shard in the source is matched to a shard in the output. Otherwise, each shard in the output gets files from all the shards in the data source and splits them by file on its own.
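The matching rule between source shards and output shards can be sketched as follows. This is a simplified model, not Upsolver's actual scheduling logic; `output_shard_inputs` is a hypothetical name, and shards are represented by plain integer indices.

```python
def output_shard_inputs(num_source_shards, num_output_shards):
    """Return, for each output shard, the source shards it reads from."""
    if num_source_shards == num_output_shards:
        # Equal counts: each source shard is matched to one output shard.
        return {i: [i] for i in range(num_output_shards)}
    # Different counts: every output shard receives files from all source
    # shards and then splits them by file on its own.
    return {i: list(range(num_source_shards)) for i in range(num_output_shards)}

print(output_shard_inputs(2, 2))  # {0: [0], 1: [1]}
print(output_shard_inputs(3, 2))  # {0: [0, 1, 2], 1: [0, 1, 2]}
```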
Output Shards — Defines the number of shards that parallelize the aggregation process. If the output is aggregated, then increasing these shards speeds up its formation process.
Note that these shards can also be used in lookup tables.
Compaction Shards — Defines the number of shards used to speed up compactions. They are only used in Athena Outputs.
Limitations of Shards
Shards are a great way to parallelize processes. However, it's important to remember that they aren't a "free" solution: the more shards you run, the higher the computational load and utilization on your cluster.
As a result, increasing the number of shards sometimes yields no performance advantage because the CPU is already at peak usage.
In such a case, you should increase the number of instances for that cluster. Although this increases your costs, distributing the workload means you are no longer limited by the computational power of a single CPU.
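A toy model makes the trade-off concrete. It rests on deliberately crude assumptions that do not come from Upsolver: each shard keeps exactly one core busy, cores are not shared, and the function name and parameters are hypothetical.

```python
def effective_parallelism(shards, instances, cores_per_instance):
    """Number of shards that can actually run at the same time."""
    # Toy assumption: one busy core per shard, no sharing or overhead.
    return min(shards, instances * cores_per_instance)

# One 4-core instance: doubling the shards from 8 to 16 changes nothing.
print(effective_parallelism(shards=8, instances=1, cores_per_instance=4))   # 4
print(effective_parallelism(shards=16, instances=1, cores_per_instance=4))  # 4

# Adding instances is what lifts the ceiling.
print(effective_parallelism(shards=16, instances=4, cores_per_instance=4))  # 16
```

Real clusters behave less neatly, but the shape is the same: once the CPU is saturated, more shards add load without adding throughput.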