Lineage

The Lineage tab is visible for source and target datasets managed by Upsolver. It provides insight into the data journey by visually displaying where datasets and jobs interact, enabling you to drill into each entity and see how your pipelines and datasets relate and connect.

Data lineage refers to the journey of data from its original source, through ingestion and transformation jobs, to its final destination. Lineage provides a record of how and where data flows through your organization, enabling you to understand the shape of the data that lands in the target.

Knowing the lineage of your data is essential for several reasons:

  • Data Quality Assurance: Understanding where data comes from and how it has been manipulated enables organizations to ensure data quality standards are met. By tracing data lineage, you can identify any errors, inconsistencies, or anomalies that may have occurred during your data's journey.

  • Regulatory Compliance: Many sectors, such as finance and healthcare, are subject to strict regulations regarding data management and privacy. Data lineage helps you demonstrate compliance by providing a clear audit trail of how data is handled and ensuring that it meets regulatory requirements.

  • Change Impact Analysis: When making changes to data structures, systems, or processes, it's essential to understand the potential impact on downstream systems and stakeholders. Data lineage helps you assess the impact of any changes by revealing which data assets and processes are affected or at risk of breaking.

  • Data Governance: Data lineage is a fundamental component of data governance initiatives, which aim to establish policies, procedures, and standards for managing data effectively. By documenting data lineage, you can enforce data governance policies, track data usage, and ensure accountability.

  • Decision-making: Access to accurate and reliable data is essential for making informed business decisions. Data lineage provides valuable insights into the reliability and relevance of data, empowering decision-makers to trust the data they rely on.

Data lineage is essential for ensuring data integrity, regulatory compliance, and informed decision-making, and helps to establish transparency, accountability, and trust in data management processes.


Viewing data lineage

From Datasets, expand the navigation tree nodes to display the table you want to view. Click the table name to display the dataset tabs in the right-hand pane of the UI, then click the Lineage tab. The current dataset is always highlighted in the lineage diagram, and each entity in the diagram provides information to enable you to identify and investigate the object within your organization.

A visual representation of the dataset journey is displayed, showing the data source, the job that ingests or transforms the data, and the destination:

Adjust the view

Use your mouse scroll wheel to zoom in and out of the diagram, or use the zoom control in the bottom left-hand corner of the screen to set the view. Click fit view to fill the screen with the diagram. To move the image, grab and drag the diagram with your mouse.

Data source

Click on the data source icon to display a pop-up with the name of the topic or bucket location where the data is sourced, along with the connection used by the job to copy the data.
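The connection named in the pop-up is the one the job uses to read from the source. As a purely illustrative sketch of how such a connection might be defined for an Amazon S3 source (the connection name, role ARN, and external ID below are placeholders, not values from this example):

```sql
-- Hypothetical S3 connection; substitute the name, role ARN, and
-- external ID from your own environment.
CREATE S3 CONNECTION my_s3_source_connection
    AWS_ROLE = 'arn:aws:iam::111111111111:role/upsolver-access-role'
    EXTERNAL_ID = 'my-external-id'
    READ_ONLY = TRUE;
```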

Job

Data target

Click the highlighted dataset icon to open the pop-up and view the table and schema names, and the connection used by the job to write the data.

Schema

Click Info to open the modal, which provides a schema overview of your data:

Instantly you can see the data within the dataset. Click on a column name to drill into the column data:

SQL

The SQL tab in the modal displays the syntax used to create the table, along with job options and configuration settings. Optionally, click Copy to paste the code into a worksheet.
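For illustration only, the copied statement might resemble the sketch below; the catalog, schema, table name, and columns are hypothetical, and the actual output includes whatever table options were defined for your dataset:

```sql
-- Illustrative only: the SQL tab shows the real statement, including any
-- table options configured for the dataset.
CREATE TABLE default_glue_catalog.analytics.orders (
    order_id BIGINT,
    customer_id BIGINT,
    order_date DATE,
    total DOUBLE
);
```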


Extended lineage

As well as viewing the immediate entities that form the journey of your dataset, you can click Display Extended Lineage to view where the dataset sits in your ecosystem in relation to other entities. Click in the checkbox to extend the lineage diagram:

In the example above, we can see that the data in the source S3 bucket feeding our selected dataset is also ingested into a Tabular table, so if we wanted to make changes to the source data, we could easily discover the downstream impact. In this case, a change to the data in our S3 bucket would affect two jobs and two datasets.

As with the previous view, you can click on all entities in the extended diagram to expose further information and drill into the details.

Clicking on the job icon displays information including the full name of the job, and whether it is a sync or non-sync job. From here, click Info to display the SQL syntax used to create the job, and optionally click Copy to paste the code into a worksheet. Alternatively, click Job Page to navigate to Job monitoring.

For the dataset icons, click Info to display the schema, or Entity Page to open the target in Datasets. This will take you into the main tab for the dataset, where you can find more detailed information about your columns and dataset.
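The SQL displayed for a job follows the same CREATE JOB syntax documented under SQL COMMANDS. As a rough, hypothetical sketch of what a sync ingestion job might look like (the job name, connection, bucket location, and target table are placeholders):

```sql
-- Hypothetical sync ingestion job; names and locations are placeholders.
CREATE SYNC JOB load_orders_raw
    CONTENT_TYPE = JSON
AS COPY FROM S3 my_s3_source_connection
    LOCATION = 's3://my-bucket/orders/'
INTO default_glue_catalog.analytics.orders;
```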
