LogoLogo
OverviewQuickstartsHow To GuidesReferenceArticlesSupport
Reference
Reference
  • Reference
  • ✨Learning Paths
    • Amazon Kinesis
    • Amazon S3
    • Apache Kafka
    • Confluent Cloud
    • Microsoft SQL Server
    • MongoDB
    • MySQL
    • PostgreSQL
  • SQL COMMANDS
    • Clusters
      • CREATE CLUSTER
      • ALTER CLUSTER
      • ROLL CLUSTER
      • STOP/START CLUSTER
      • DROP CLUSTER
    • Connections
      • CREATE CONNECTION
        • Amazon Kinesis
        • Amazon Redshift
        • Amazon S3
        • Apache Kafka
        • AWS Glue Data Catalog
        • ClickHouse
        • Confluent Cloud
        • Elasticsearch
        • Hive Metastore
        • Microsoft SQL Server
        • MongoDB
        • MySQL
        • Polaris Catalog
        • Iceberg REST catalogs
        • PostgreSQL
        • Snowflake
      • ALTER CONNECTION
      • DROP CONNECTION
      • CDC Connections with SSL
    • Jobs
      • CREATE JOB
        • Ingestion
          • Amazon Kinesis
          • Amazon S3
          • Apache Kafka
          • Confluent Kafka
          • Content Types
          • Microsoft SQL Server
          • MongoDB
          • MySQL
          • PostgreSQL
        • Replication
          • Microsoft SQL Server
          • MongoDB
          • MySQL
          • PostgreSQL
        • Transformation
          • INSERT
            • MAP_COLUMNS_BY_NAME
            • UNNEST
          • MERGE
          • SELECT
          • Job Options
            • Amazon Redshift
            • Amazon S3
            • Apache Iceberg
            • ClickHouse
            • Data Lake Tables
            • Elasticsearch
            • PostgreSQL
            • Snowflake
        • Monitoring
          • Amazon CloudWatch
          • Datadog
          • Dynatrace
      • ALTER JOB
      • PAUSE / RESUME JOB
      • DROP JOB
    • Materialized Views
      • CREATE MATERIALIZED VIEW
      • ALTER MATERIALIZED VIEW
      • DROP MATERIALIZED VIEW
      • Join with a Materialized View
    • Tables
      • CREATE TABLE
      • ALTER TABLE
      • DROP TABLE
    • Iceberg Tables
      • Upsolver Managed Tables
        • CREATE ICEBERG TABLE
        • ALTER ICEBERG TABLE
        • DROP ICEBERG TABLE
        • OPTIMIZE ICEBERG TABLE
      • External Iceberg Tables
        • CREATE EXTERNAL ICEBERG TABLE
        • DROP EXTERNAL ICEBERG TABLE
      • Mirror Iceberg Tables
        • CREATE MIRROR
        • ALTER MIRROR
        • PAUSE / RESUME MIRROR
        • DROP MIRROR
  • FUNCTIONS & OPERATORS
    • Data Types
    • Functions
      • Aggregate
        • APPROX_COUNT_DISTINCT
        • APPROX_COUNT_DISTINCT_EACH
        • AVG
        • AVG_EACH
        • AVG_TIME_SERIES
        • COLLECT_SET
        • COLLECT_SET_EACH
        • COUNT
        • COUNT(*)
        • COUNT(DISTINCT ...)
        • COUNT_EACH
        • COUNT_IF
        • DYNAMIC_SESSIONS
        • FIRST
        • FIRST_ARRAY
        • FIRST_EACH
        • FIRST_TIME_SERIES
        • LAST
        • LAST_ARRAY
        • LAST_EACH
        • LAST_K
        • LAST_K_EACH
        • LAST_TIME_SERIES
        • MAX
        • MAX_BY
        • MAX_EACH
        • MAX_TIME_SERIES
        • MIN
        • MIN_BY
        • MIN_EACH
        • MIN_TIME_SERIES
        • SESSION_COUNT
        • STD_DEV
        • STD_DEV_EACH
        • STRING_MAX
        • STRING_MAX_EACH
        • STRING_MIN_EACH
        • SUM
        • SUM_EACH
        • SUM_TIME_SERIES
        • WEIGHTED_AVERAGE
      • Array
        • ARRAY_DISTINCT
        • ARRAY_JOIN
        • ARRAY_MAX
        • ARRAY_MIN
        • ARRAY_SORT
        • ARRAY_SORT_DESC
        • ARRAY_SUM
        • COUNT_VALUES_IF
        • COUNT_VALUES
        • ELEMENT_AT
        • FIRST_ELEMENT
        • LAST_ELEMENT
        • VALUE_INDEX_IN_ARRAY
        • VALUE_INDEX_IN_ROW
      • Comparison
        • GREATEST
        • LEAST
      • Conditional
        • COALESCE
        • IF_ELSE
        • NULL_IF
      • Date & Time
        • ADD_TIME_ZONE_OFFSET
        • DATE
        • DATE_ADD
        • DATE_DIFF
        • DATE_TRUNC
        • DAY
        • DAY_OF_WEEK
        • DAY_OF_YEAR
        • EXTRACT_TIMESTAMP
        • EXTRACT
        • FORMAT_DATETIME
        • FROM_ISO8601_DATE
        • FROM_UNIXTIME
        • HOUR
        • MILLISECOND
        • MINUTE
        • MONTH
        • QUARTER
        • RUN_END_TIME
        • RUN_START_TIME
        • SECOND
        • SUBTRACT_TIME_ZONE_OFFSET
        • TO_UNIX_EPOCH_MILLIS
        • TO_UNIX_EPOCH_SECONDS
        • TO_UNIXTIME
        • WEEK
        • YEAR_OF_WEEK
        • YEAR
      • Filter
        • IS_DUPLICATE
        • NOT
      • Interval
        • PARSE_DURATION
      • Mathematical
        • ABS
        • CBRT
        • CEIL
        • CEILING
        • DEGREES
        • EXP
        • FLOOR
        • GET_SHARD_NUMBER
        • LN
        • LOG
        • LOG2
        • LOG10
        • MOD
        • MODULO
        • POW
        • POWER
        • RADIANS
        • RAND
        • RANDOM
        • RECIPROCAL
        • ROUND
        • SIGN
        • SORT_VALUES
        • SQRT
        • TRUNCATE
      • Regular Expressions
        • REGEXP_EXTRACT
        • REGEXP_EXTRACT_ALL
        • REGEXP_LIKE
        • REGEX_MATCH_POSITION
        • REGEX_NAMED_GROUPS
        • REGEXP_REPLACE
      • Spatial
        • ST_DISTANCE
        • ST_WGS84_DISTANCE
        • WKT_SPATIAL_CONTAINS
        • WKT_SPATIAL_INTERSECT
      • String
        • BASE64_DECODE
        • BASE64_TO_HEX
        • BYTES_SUBSTRING
        • CONCAT
        • DATE
        • JOIN_ARRAYS
        • LENGTH
        • LOWER
        • LPAD
        • LTRIM
        • MD5
        • PARSE_DATETIME
        • REPLACE
        • REVERSE
        • RPAD
        • RTRIM
        • SHA1
        • SHA3_512
        • SHA256
        • SHA512
        • SORT_VALUES
        • SPLIT
        • SPLIT_TO_RECORD
        • STRING_FORMAT
        • STRIP_MARGIN
        • STRIP_PREFIX
        • STRIP_SUFFIX
        • STRPOS
        • SUBSTR
        • SUBSTRING
        • TRANSLATE
        • TRIM_CHARS
        • TRIM
        • UPPER
        • UUID_GENERATOR
        • XX_HASH
      • Structural
        • FROM_KEY_VALUE
        • GET_RANGE
        • JOIN_ALL_BY_KEY
        • JSON_PATH
        • JSON_TO_RECORD
        • MAP_WITH_INDEX
        • QUERY_STRING_TO_RECORD
        • RECORD_TO_JSON
        • SORT_BY
        • TO_ARRAY
        • ZIP_WITH_INDEX
        • ZIP
      • Trigonometric
        • COS
        • SIN
        • TAN
        • TANH
      • Type Conversion
        • CAST
        • CHR
        • DECIMAL_TO_HEX
        • HEX_TO_DECIMAL
        • TO_BIGINT
        • TO_DOUBLE
        • TO_STRING
      • URL
        • TOP_PRIVATE_DOMAIN
        • URL_DECODE
        • URL_ENCODE
        • URL_PARSER
    • Operators
      • Comparison
      • Conditional
        • CASE
      • Logical
      • Mathematical
      • String
  • MONITORING
    • Clusters
    • Datasets
      • Ingested Data
        • Column
      • Lineage
      • Data Violations
      • Statistics
      • Maintenance
        • Compactions
        • Expire Snapshots
        • Orphan Files
      • Columns
      • Partitions
      • Properties
    • Job Status
      • Stream & File Sources
        • Monitoring
        • Graphs
        • Lineage
        • Settings
      • CDC Sources
        • Monitoring
        • Replication Settings
        • Job Settings
    • System Catalog
      • Information Schema
        • Clusters
        • Columns
        • Connections
        • Jobs
        • Mirrors
        • Tables
        • Users
        • Views
      • Insights
        • job_output_column_stats
        • dataset_column_stats
      • Monitoring
        • CDC Status
        • Clusters
        • Expectations
        • Jobs
        • Partition Statistics
        • Recent Compactions
        • Running Queries
        • Table Statistics
      • Task Executions Table
  • GENERAL
    • Common SQL Syntax
    • View Entity Syntax
    • Keyboard Shortcuts
Powered by GitBook
On this page
  • Utilization Percent
  • Tasks in Queue
  • Memory Load Index (%GC)
  • Cluster Crashes
  • Reload from disk percent
  1. MONITORING
  2. Job Status
  3. Stream & File Sources
  4. Monitoring (V1)

Cluster

These metrics provide insight into the performance of the cluster running your job.

Last updated 12 months ago

The following metrics will help you diagnose performance issues with the cluster upon which your job is running, and guide you in troubleshooting any issues.

Utilization Percent

Metric type

Warning

About this metric

Represents how much of the server's processing capacity is in use.

Limits

Error when > 90%

Warn when > 70%

Timeframe

Now

More information

Represents how much of the server's processing capacity is in use.

Run the following SQL command in a query window in Upsolver, replacing <cluster_id> with the Id for your cluster. For additional columns, alter this statement and use SELECT *.

SELECT STRING_FORMAT('{0,number,#,###.##}%', utilization_percent * 100) AS utilization_percent, IF_ELSE(utilization_percent > 0.9, 'ERROR', IF_ELSE(utilization_percent > 0.7, 'WARNING', 'OK')) AS utilization_percent_severity 
FROM system.monitoring.clusters 
WHERE cluster_id = '<cluster_id>';

Consider increasing the server limit or splitting moving larger jobs to a separate compute cluster.

Tasks in Queue

Metric type

Informational

About this metric

The number of job tasks pending execution in the cluster queue.

Timeframe

Now

More information

The number of job tasks pending execution in the cluster queue, which represents the amount of work not currently being processed in the cluster because the cluster is at high utilization.

This number can be above 0 even if the is below 100. This is because of the distribution of work between servers. It might be that work is allocated to a specific server, which is at 100% utilization, but the cluster itself is not at 100% utilization and there are other servers that have free slots, but they are not going to be doing this particular work, so the Tasks in Queue value might be greater than 0.

Upsolver will do re-balancing to ensure this doesn't continue over time. The Tasks in Queue value can be smaller than the number of because the cluster only adds tasks to the queue in chunks, it doesn’t add all of the tasks. For example, if you want to replay 1,000,000 tasks, the number of tasks in will show 1,000,000, but Tasks in Queue will only show the next chunk of work, e.g. 1,000. Therefore Tasks in Queue may only show a subset of the number of tasks.

Run the following SQL command in a query window in Upsolver, replacing <cluster_id> with the Id for your cluster. For additional columns, alter this statement and use SELECT *.

SELECT STRING_FORMAT('{0,number,#,###}', tasks_in_queue) AS tasks_in_queue, 'OK' AS tasks_in_queue_severity 
FROM system.monitoring.clusters 
WHERE cluster_id = '<cluster_id>';

Memory Load Index (%GC)

Metric type

Warning

About this metric

The percentage of time that the server is doing garbage collection rather than working.

Limits

Error when > 10%

Timeframe

Now

More information

The percentage of time that the server is doing garbage collection rather than working should generally be under 10%. If this value is showing red and the cluster is not doing a lot of work or the cluster has crashes, it’s often indicative that you need bigger servers with more memory.

Run the following SQL command in a query window in Upsolver, replacing <cluster_id> with the Id for your cluster. For additional columns, alter this statement and use SELECT *.

SELECT STRING_FORMAT('{0,number,#,###.##}%', memory_load_percent * 100) AS memory_load_percent, IF_ELSE(memory_load_percent > 0.1, 'ERROR', 'OK') AS memory_load_percent_severity 
FROM system.monitoring.clusters 
WHERE cluster_id = '<cluster_id>';

Cluster Crashes

Metric type

Warning

About this metric

How many server crashes happened in the job’s cluster today.

Limits

Error when > 0

Timeframe

Today (midnight UTC to now)

More information

How many server crashes happened in the job's cluster today.

Run the following SQL command in a query window in Upsolver, replacing <cluster_id> with the Id for your cluster. For additional columns, alter this statement and use SELECT *.

SELECT STRING_FORMAT('{0,number,#,###}', crashes) AS crashes, IF_ELSE(crashes > 0, 'ERROR', 'OK') AS crashes_severity 
FROM system.monitoring.clusters 
WHERE cluster_id = '<cluster_id>';

Typically crashes are due to memory issues when trying to load large lookup tables. Consider changing the cluster to a type with more RAM.

Reload from disk percent

Metric type

Warning

About this metric

The percent of bytes re-loaded into memory from disk.

Limits

Error when > 200%

Warn when > 30%

Timeframe

Today (midnight UTC to now)

More information

The percent of bytes re-loaded into memory from disk. High values indicate more memory is required as many page faults will result in slow processing.

The metric represents how much extra work needs to be done in loading data because there is not enough memory on the server.

Each server has an in-memory cache, the Reload from disk percent is how many operations are served by the cache versus how many have expired from the cache, and need to be reloaded. Numbers above 0% are fine, but if this is significant and over 200%, the cluster is working too hard due to a lack of memory.

Run the following SQL command in a query window in Upsolver, replacing <cluster_id> with the Id for your cluster. For additional columns, alter this statement and use SELECT *.

SELECT STRING_FORMAT('{0,number,#,###.##}%', reload_from_disk_percent * 100) AS reload_from_disk_percent, IF_ELSE(reload_from_disk_percent > 2, 'ERROR', IF_ELSE(reload_from_disk_percent > 0.3, 'WARNING', 'OK')) AS reload_from_disk_percent_severity 
FROM system.monitoring.clusters 
WHERE cluster_id = '<cluster_id>';

Change the cluster to a type with more RAM.

Utilization Percent
Job executions currently in queue
Job executions currently in queue
Job executions currently in queue