Lineage

The Lineage tab is available for source and target datasets managed by Upsolver. It provides insight into the data journey by visually displaying where datasets and jobs interact, letting you drill into each entity and see how pipelines and datasets relate and connect.

Data lineage refers to the journey of data from its original source, through ingestion and transformation jobs, before it lands in its final destination. Lineage provides a record of how and where the data flows through your organization, enabling you to understand the shape of your data landing in the target.

Knowing the lineage of your data is essential for several reasons:

  • Data Quality Assurance: Understanding where data comes from and how it has been manipulated enables organizations to ensure data quality standards are met. By tracing data lineage, you can identify any errors, inconsistencies, or anomalies that may have occurred during your data's journey.

  • Regulatory Compliance: Many sectors, such as finance and healthcare, are subject to strict regulations regarding data management and privacy. Data lineage helps you demonstrate compliance by providing a clear audit trail of how data is handled and ensuring that it meets regulatory requirements.

  • Change Impact Analysis: When making changes to data structures, systems, or processes, it's essential to understand the potential impact on downstream systems and stakeholders. Data lineage helps you assess the impact of any changes by revealing which data assets and processes are affected or at risk of breaking.

  • Data Governance: Data lineage is a fundamental component of data governance initiatives, which aim to establish policies, procedures, and standards for managing data effectively. By documenting data lineage, you can enforce data governance policies, track data usage, and ensure accountability.

  • Decision-making: Access to accurate and reliable data is essential for making informed business decisions. Data lineage provides valuable insights into the reliability and relevance of data, empowering decision-makers to trust the data they rely on.

Data lineage is essential for ensuring data integrity, regulatory compliance, and informed decision-making, and helps to establish transparency, accountability, and trust in data management processes.


Viewing data lineage

From Datasets, expand the navigation tree nodes to display the table you want to view. Click the table name to display the dataset tabs in the right-hand pane of the UI, then click the Lineage tab. The current dataset is always highlighted in the lineage diagram, and each entity in the diagram provides information that helps you identify and investigate the object within your organization.

A visual representation of the dataset journey is displayed:

Adjust the view

Use your mouse scroll wheel to zoom in and out of the diagram, or use the zoom control in the bottom left-hand corner of the screen to set the view. Click fit view to fill the screen with the diagram. To move the diagram, click and drag it to a new position on the screen.

Data source

Click on the data source icon to display a pop-up with the name of the topic or bucket location where the data is sourced, along with the connection used by the job to copy the data.

Job

Data target

Click the highlighted dataset icon to open the pop-up and view the table and schema names, and the connection used by the job to write the data.

Schema

Click Info to open the modal, which provides a schema overview of your data:

Instantly you can see the data within the dataset. Click on a column name to drill into the column data:

This takes you to the main Schema tab for the dataset, where you can find more detailed information about your column and dataset.

SQL

The SQL tab in the modal displays the syntax used to create the table, along with job options and configuration settings. Optionally, click Copy to copy the code so that you can paste it into a worksheet.


Extended lineage

As well as viewing the immediate entities that form the journey of your dataset, you can use Display Extended Lineage to view where the dataset sits in your ecosystem in relation to other entities. Select the checkbox to extend the lineage diagram:

In the example above, the data in the source S3 bucket that feeds the selected dataset is also ingested into a Tabular table. If you wanted to change the data in the source, you could easily discover the downstream impact: in this case, a change to the data in the S3 bucket would affect two jobs and two datasets.
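The downstream impact described above can be thought of as a graph traversal. The following is a minimal Python sketch of that idea, not Upsolver's implementation or API: the entity names (s3_bucket, job_a, job_b, selected_dataset, tabular_table) and the downstream function are hypothetical and exist only to illustrate how a change at the source reaches every connected job and dataset.

```python
from collections import deque

# Hypothetical lineage graph: each key feeds the entities listed as its value.
# The names are illustrative only and are not taken from a real pipeline.
LINEAGE = {
    "s3_bucket": ["job_a", "job_b"],
    "job_a": ["selected_dataset"],
    "job_b": ["tabular_table"],
    "selected_dataset": [],
    "tabular_table": [],
}


def downstream(graph, start):
    """Return every entity reachable from start, in breadth-first order."""
    seen = []
    queue = deque(graph.get(start, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue  # already recorded via another path
        seen.append(node)
        queue.extend(graph.get(node, []))
    return seen


# Entities affected by a change to the source bucket.
impacted = downstream(LINEAGE, "s3_bucket")
print(impacted)
```

In this sketch, a change to s3_bucket reaches both jobs and both datasets, which mirrors what the extended lineage diagram shows visually.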

As with the previous view, you can click on all entities in the extended diagram to expose further information and drill into the details.
