Cluster
These metrics provide insight into the performance of the cluster running your job.
The following metrics help you diagnose and troubleshoot performance issues with the cluster on which your job is running.
Metric type: Informational
About this metric: The number of job tasks pending execution in the cluster queue.
Timeframe: Now
Tasks in Queue is the number of job tasks pending execution in the cluster queue. It represents the amount of work that is not currently being processed because the cluster is at high utilization.
This number can be above 0 even when overall cluster utilization is below 100%, because of how work is distributed between servers. Work may be allocated to a specific server that is at 100% utilization. Other servers in the cluster may have free slots, but they will not pick up that particular work, so the Tasks in Queue value can be greater than 0. Upsolver rebalances the work so that this does not persist over time.
The Tasks in Queue value can also be smaller than the total number of outstanding tasks, because the cluster adds tasks to the queue in chunks rather than all at once. For example, if you replay 1,000,000 tasks, the total task count shows 1,000,000, but Tasks in Queue shows only the next chunk of work, e.g. 1,000. Tasks in Queue may therefore show only a subset of the total number of tasks.
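The two effects described above can be illustrated with a small sketch. This is a toy model, not Upsolver's scheduler: the `Server` class, the slot counts, and the chunk size are all hypothetical and chosen only to show how a queue can be non-empty while the cluster as a whole still has free slots.

```python
from dataclasses import dataclass


@dataclass
class Server:
    slots: int     # hypothetical: tasks this server can run at once
    assigned: int  # hypothetical: tasks allocated to this specific server

    @property
    def running(self) -> int:
        # A server runs at most as many tasks as it has slots.
        return min(self.slots, self.assigned)

    @property
    def queued(self) -> int:
        # Anything beyond the slot count waits, even if other servers are idle.
        return max(0, self.assigned - self.slots)


def cluster_utilization(servers: list[Server]) -> float:
    """Percentage of all slots in the cluster that are busy."""
    total_slots = sum(s.slots for s in servers)
    return 100.0 * sum(s.running for s in servers) / total_slots


def tasks_in_queue(servers: list[Server]) -> int:
    """Tasks waiting on the servers they were allocated to."""
    return sum(s.queued for s in servers)


def visible_queue(total_pending: int, chunk_size: int) -> int:
    """Only the next chunk of pending work is placed on the queue."""
    return min(total_pending, chunk_size)


# One server is saturated while the other two have free slots.
servers = [
    Server(slots=4, assigned=7),
    Server(slots=4, assigned=1),
    Server(slots=4, assigned=0),
]

print(cluster_utilization(servers))     # 5 of 12 slots busy, well below 100%
print(tasks_in_queue(servers))          # 3 tasks still wait on the saturated server
print(visible_queue(1_000_000, 1_000))  # the queue shows only the next chunk
```

In this model the queue drains once the waiting tasks are moved to a server with free slots, which is the role rebalancing plays in the description above.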
Metric type: Warning
About this metric: The percentage of time that the server is doing garbage collection rather than working.
Limits: Error when > 10%
Timeframe: Now
The percentage of time that the server spends on garbage collection rather than working should generally be under 10%. If this value shows red while the cluster is not doing a lot of work, or if the cluster is crashing, it often indicates that you need bigger servers with more memory.
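As a rough illustration of the 10% limit, the check can be sketched as follows. The function names and the idea of sampling GC time over a fixed window are assumptions made for this example; they do not describe how Upsolver collects the metric.

```python
def gc_time_percent(gc_seconds: float, elapsed_seconds: float) -> float:
    """Share of a sampling window spent in garbage collection, as a percentage."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return 100.0 * gc_seconds / elapsed_seconds


def gc_status(percent: float, error_above: float = 10.0) -> str:
    """Apply the documented limit: error when the value exceeds 10%."""
    return "error" if percent > error_above else "ok"


# Hypothetical samples over a 60-second window.
print(gc_status(gc_time_percent(3.0, 60.0)))  # 5% of the window spent in GC
print(gc_status(gc_time_percent(9.0, 60.0)))  # 15% of the window spent in GC
```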
Metric type: Warning
About this metric: The percent of bytes re-loaded into memory from disk.
Limits: Error when > 200%; Warn when > 30%
Timeframe: Today (midnight UTC to now)
This metric represents how much extra work is done loading data because there is not enough memory on the server. High values indicate that more memory is required, as many page faults result in slow processing.
Each server has an in-memory cache. The Reload from disk percent compares how many operations are served by the cache with how many have expired from the cache and need to be reloaded. Values above 0% are fine, but if the value is significant and over 200%, the cluster is working too hard due to a lack of memory.
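The limits for this metric can be sketched as a simple classification. The text does not define the denominator of the percentage, so this example assumes it is the number of bytes reloaded relative to the number of bytes loaded the first time, which would explain how the value can exceed 100%. The function names are hypothetical.

```python
def reload_percent(reloaded_bytes: int, first_load_bytes: int) -> float:
    """Assumed definition: bytes reloaded from disk relative to bytes first loaded."""
    if first_load_bytes <= 0:
        raise ValueError("first_load_bytes must be positive")
    return 100.0 * reloaded_bytes / first_load_bytes


def reload_status(percent: float,
                  warn_above: float = 30.0,
                  error_above: float = 200.0) -> str:
    """Apply the documented limits: warn above 30%, error above 200%."""
    if percent > error_above:
        return "error"
    if percent > warn_above:
        return "warn"
    return "ok"


# Hypothetical byte counts for three servers.
print(reload_status(reload_percent(100, 1_000)))    # 10%: within limits
print(reload_status(reload_percent(500, 1_000)))    # 50%: above the warning limit
print(reload_status(reload_percent(2_500, 1_000)))  # 250%: above the error limit
```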