Data Ingestion — VPC Flow Logs

This how-to guide shows you how to ingest, retrieve, and view data for your VPC Flow Logs.


VPC Flow Logs is a feature that enables you to capture information about the IP traffic going to and from network interfaces in your customer VPC. You can publish flow log data to Amazon CloudWatch Logs or Amazon S3. After you create a flow log, you can retrieve and view its data in your chosen destination.
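
For reference, a single flow log record in AWS's default format is a space-delimited line like the sample below (the values are illustrative; the default fields are version, account-id, interface-id, srcaddr, dstaddr, srcport, dstport, protocol, packets, bytes, start, end, action, and log-status):

2 123456789010 eni-1235b8ca123456789 172.31.16.139 172.31.16.21 20641 22 6 20 4249 1418530010 1418530070 ACCEPT OK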

VPC flow logs can help you:

  • Diagnose overly restrictive security group rules

  • Monitor traffic that is reaching your instance

  • Determine the direction of traffic to and from network interfaces

Visit the Amazon AWS documentation to learn more about VPC flow logs.

Upsolver helps you ingest various VPC flow logs and perform minor transformations before loading your data into Amazon Athena for analysis. What sets Upsolver apart from other tools is its SQL-only solution and its scalable, robust streaming capabilities.

You ingest your VPC flow logs in Upsolver in five steps:

  1. Connect Upsolver to your Amazon S3 bucket

  2. Connect to your AWS Glue Data Catalog

  3. Create an S3 storage connection

  4. Create a staging table for your VPC flow logs

  5. Ingest data from S3 into your staging table

Step 1

Connect to your Amazon S3 bucket

To transfer your data, you must create an Upsolver connection. This connection gives you the ability to configure the AWS IAM credentials that Upsolver needs to access the data.

When you deploy Upsolver, the cluster already has an IAM role attached, which may or may not have permission to access the S3 path where your data resides. If it lacks those permissions, you can assign a scope-down policy to the S3 connection; this is recommended to prevent failures, misconfigurations, and breaches.

For future reference, you can copy your AWS_ROLE from your user page.

Here's the code:

CREATE S3 CONNECTION my_s3_connection
    AWS_ROLE = 'arn:aws:iam::001234567890:role/upsolver-role-<id>'
    READ_ONLY = TRUE
    --PATH_DISPLAY_FILTERS = ('s3://quickstart-vpclogs-test', 
            --'s3://upsolver-tutorials-vpclogs')
    COMMENT = "my s3 connection";

Step 2

Connect to your AWS Glue Data Catalog

A Glue Catalog connection in Upsolver serves as a metadata store connection. It enables you to create Upsolver-managed tables that also double as Athena tables.

Here's the code:

CREATE GLUE_CATALOG CONNECTION my_glue_catalog
    AWS_ROLE = 'arn:aws:iam::001234567890:role/upsolver-role-<id>'
    --REGION = 'us-east-1'
    --DATABASE_DISPLAY_FILTERS = ('database1', 'database2')
    COMMENT = "my glue catalog connection";

Alternatively, you can provide AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY in place of AWS_ROLE.
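
As a minimal sketch of that key-based alternative (the placeholder values are hypothetical and should be replaced with credentials that can access your Glue Data Catalog):

CREATE GLUE_CATALOG CONNECTION my_glue_catalog
    AWS_ACCESS_KEY_ID = '<your access key id>'
    AWS_SECRET_ACCESS_KEY = '<your secret access key>'
    --REGION = 'us-east-1'
    COMMENT = "my glue catalog connection";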

Step 3

Create an S3 storage connection

The tables you create in Upsolver all have underlying files stored in a specified storage location:

  • If you deploy Upsolver in your customer VPC, there's an Upsolver bucket created during the integration process that serves as a default storage location.

  • If you use Upsolver with Upsolver Cloud, you should create an additional S3 connection that serves as an underlying storage location. This ensures your data stays within your account.

You might notice that the code below looks similar to the code you used to create an S3 connection in Step 1. Both connections are important: the first one gives you access to your account in S3 and permission to take specific actions, while you use this second connection to load your data into a staging table; that is, it provides you with a direct connection to your data.

Here's the code:

CREATE S3 CONNECTION my_s3_storage
    AWS_ROLE = 'arn:aws:iam::001234567890:role/upsolver-role-<id>'
    --PATH_DISPLAY_FILTER = 's3://bucket/storage-folder/'
    READ_ONLY = FALSE;

Step 4

Create a staging table for your VPC flow logs

Before you can transform and output your data, you must ingest it into Upsolver. To do this, copy your data into an Upsolver-managed staging table.

Note that staging tables cannot have primary keys and can only be partitioned on time-based columns. Create your staging table as shown below:

CREATE TABLE my_glue_catalog.database.vpc_flow_logs_staging
STORAGE_CONNECTION = my_s3_storage
STORAGE_LOCATION = 's3://bucket/folder-path/'
--COMPUTE_CLUSTER = 'SecurityDataLake'
--TABLE_DATA_RETENTION = 30 DAYS
COMMENT = "my s3 staging table";

You must set STORAGE_CONNECTION and STORAGE_LOCATION together to configure the storage location of the table's underlying files.

Note that:

  • If you deploy Upsolver in your customer VPC, you can omit these options, as there's an Upsolver bucket created during the integration process that serves as a default storage location.

  • If you use Upsolver with Upsolver Cloud, you must define these two parameters to ensure your data stays within your account.

Step 5

Ingest data from S3 into your staging table

If you're using Upsolver with Upsolver Cloud, create the staging table explicitly to define the storage location for the table's underlying files.

Here's the code:

CREATE JOB stage_log_data
    --START_FROM = '2020-03-06T05:00:00.000Z'
    --COMPUTE_CLUSTER = 'SecurityDataLake'
    --CREATE_TABLE_IF_MISSING = FALSE
    --CONTENT_TYPE = JSON
    --COMMENT = 'copy s3 data into staging table'
AS COPY FROM my_s3_connection 
    LOCATION = 's3://bucket/VPC_logs'
INTO my_glue_catalog.database.vpc_flow_logs_staging;

Using the COPY FROM statement, you copy the data from the S3 connection you created in Step 1 and load it into your staging table by specifying the location in your code. Be sure to note the bucket from which you draw your data to ensure you only process the data you wish to see.

Next, query your table to ensure everything is working properly.

SELECT * FROM my_glue_catalog.database.vpc_flow_logs_staging LIMIT 10;

Conclusion

By adopting and implementing familiar SQL syntax, you can use Upsolver to create data pipelines and organize your data to easily perform analytics and ML.

As your business needs evolve, so can your data. In the future, you can create additional jobs that use the same staging table as their source, opening the door to further creativity and innovation, while your pipelines continuously keep your data fresh.
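
For illustration, a downstream transformation job might look something like the sketch below; the target table name and selected columns are hypothetical (not part of this guide), and the pattern assumes a target table already exists in your Glue catalog:

CREATE JOB transform_vpc_flow_logs
    ADD_MISSING_COLUMNS = TRUE
AS INSERT INTO my_glue_catalog.database.vpc_flow_logs_report MAP_COLUMNS_BY_NAME
    -- hypothetical target table; adjust the SELECT list to the columns of your staged flow logs
    SELECT *
    FROM my_glue_catalog.database.vpc_flow_logs_staging
    WHERE $event_time BETWEEN run_start_time() AND run_end_time();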

At this point, you have a connection to your raw VPC flow log data and have ingested it into a staging table. The next step is to perform data analytics; see the Data Analytics — VPC Flow Logs guide, where we walk you through the various transformations you can apply to your data.