Data Ingestion — VPC Flow Logs

This how-to guide shows you how to ingest, retrieve, and view data for your VPC Flow Logs.


VPC Flow Logs is a feature that enables you to capture information about the IP traffic going to and from network interfaces in your customer VPC. You can publish flow log data to Amazon CloudWatch Logs or Amazon S3. After you create a flow log, you can retrieve and view its data in your chosen destination.
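
For reference, a single flow log record in AWS's default format is a space-delimited line like the sample below (the values are illustrative; the default fields are version, account-id, interface-id, srcaddr, dstaddr, srcport, dstport, protocol, packets, bytes, start, end, action, and log-status):

2 123456789010 eni-1235b8ca123456789 172.31.16.139 172.31.16.21 20641 22 6 20 4249 1418530010 1418530070 ACCEPT OK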

VPC flow logs can help you:

  • Diagnose overly restrictive security group rules

  • Monitor traffic that is reaching your instance

  • Determine the direction of traffic to and from network interfaces

Visit the Amazon AWS documentation to learn more about VPC flow logs.

Upsolver helps you ingest various VPC flow logs and perform minor transformations before loading your data into Amazon Athena for analysis. What sets Upsolver apart from other tools is its SQL-only solution and its scalable, robust streaming capabilities.

You ingest your VPC flow logs in Upsolver in five steps:

  1. Connect Upsolver to your Amazon S3 bucket

  2. Connect to your AWS Glue Data Catalog

  3. Create an S3 storage connection

  4. Create a staging table for your VPC flow logs

  5. Ingest data from S3 into your staging table

Step 1

Connect to your Amazon S3 bucket

To transfer your data, you must create an Upsolver connection. This connection gives you the ability to configure the AWS IAM credentials that Upsolver needs to access the data.

When you deploy Upsolver, the cluster already has an IAM role attached, which may or may not have permission to access the S3 path where your data resides. If it lacks those permissions, you can assign a scope-down policy to the S3 connection; this is recommended to prevent failures, misconfigurations, and breaches.

For future reference, you can copy your AWS_ROLE from your user page.

Here's the code:

CREATE S3 CONNECTION my_s3_connection
    AWS_ROLE = 'arn:aws:iam::001234567890:role/upsolver-role-<id>'
    READ_ONLY = TRUE
    --PATH_DISPLAY_FILTERS = ('s3://quickstart-vpclogs-test', 
            --'s3://upsolver-tutorials-vpclogs')
    COMMENT = "my s3 connection";

Step 2

Connect to your AWS Glue Data Catalog

A Glue Catalog connection in Upsolver serves as a metadata store connection. It enables you to create Upsolver-managed tables that also double as Athena tables.

Here's the code:

CREATE GLUE_CATALOG CONNECTION my_glue_catalog
    AWS_ROLE = 'arn:aws:iam::001234567890:role/upsolver-role-<id>'
    --REGION = 'us-east-1'
    --DATABASE_DISPLAY_FILTERS = ('database1', 'database2')
    COMMENT = "my glue catalog connection";

Alternatively, you can provide AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY in place of AWS_ROLE.
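
As a minimal sketch of that key-based alternative (the placeholder values are hypothetical and should be replaced with credentials that can access your Glue Data Catalog):

CREATE GLUE_CATALOG CONNECTION my_glue_catalog
    AWS_ACCESS_KEY_ID = '<your access key id>'
    AWS_SECRET_ACCESS_KEY = '<your secret access key>'
    --REGION = 'us-east-1'
    COMMENT = "my glue catalog connection";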

Step 3

Create an S3 storage connection

The tables you create in Upsolver all have underlying files stored in a specified storage location:

  • If you deploy Upsolver in your customer VPC, there's an Upsolver bucket created during the integration process that serves as a default storage location.

  • If you use Upsolver with Upsolver Cloud, you should create an additional S3 connection that serves as an underlying storage location. This ensures your data stays within your account.

You might notice that the code below looks similar to the code you used to create an S3 connection in Step 1. Both connections are important: the first one gives you access to your account in S3 and permission to take specific actions, while you use this second connection to load your data into a staging table; that is, it provides you with a direct connection to your data.

Here's the code:

CREATE S3 CONNECTION my_s3_storage
    AWS_ROLE = 'arn:aws:iam::001234567890:role/upsolver-role-<id>'
    --PATH_DISPLAY_FILTER = 's3://bucket/storage-folder/'
    READ_ONLY = FALSE;

Step 4

Create a staging table for your VPC flow logs

Before you can transform and output your data, you must ingest it into Upsolver. To do this, copy your data into an Upsolver-managed staging table.

Note that staging tables cannot have primary keys and can only be partitioned on time-based columns. Create your staging table as shown below:

CREATE TABLE my_glue_catalog.database.vpc_flow_logs_staging
STORAGE_CONNECTION = my_s3_storage
STORAGE_LOCATION = 's3://bucket/folder-path/'
--COMPUTE_CLUSTER = 'SecurityDataLake'
--TABLE_DATA_RETENTION = 30 DAYS
COMMENT = "my s3 staging table";

You must set STORAGE_CONNECTION and STORAGE_LOCATION together to configure the storage location of the table's underlying files.

Note that:

  • If you deploy Upsolver in your customer VPC, you can omit these options, as there's an Upsolver bucket created during the integration process that serves as a default storage location.

  • If you use Upsolver with Upsolver Cloud, you must define these two parameters to ensure your data stays within your account.

Step 5

Ingest data from S3 into your staging table

If you're using Upsolver with Upsolver Cloud, create the staging table explicitly to define the storage location for the table's underlying files.

Here's the code:

CREATE JOB stage_log_data
    --START_FROM = '2020-03-06T05:00:00.000Z'
    --COMPUTE_CLUSTER = 'SecurityDataLake'
    --CREATE_TABLE_IF_MISSING = FALSE
    --CONTENT_TYPE = JSON
    --COMMENT = 'copy s3 data into staging table'
AS COPY FROM my_s3_connection 
    LOCATION = 's3://bucket/VPC_logs'
INTO my_glue_catalog.database.vpc_flow_logs_staging;

Using the COPY FROM statement, you copy the data from the S3 connection you created in Step 1 and load it into your staging table by specifying the location in your code. Be sure to note the bucket from which you draw your data to ensure you only process the data you wish to see.

Next, query your table to ensure everything is working properly.

SELECT * FROM my_glue_catalog.database.vpc_flow_logs_staging LIMIT 10;

Conclusion

By adopting and implementing familiar SQL syntax, you can use Upsolver to create data pipelines and organize your data to easily perform analytics and ML.

As your business needs evolve, so can your data. In the future, you can create additional jobs that use the same staging table as their source, opening the door to further creativity and innovation, while your pipelines continuously keep your data fresh.
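
For illustration, a downstream transformation job might look something like the sketch below; the target table name and selected columns are hypothetical (not part of this guide), and the pattern assumes a target table already exists in your Glue catalog:

CREATE JOB transform_vpc_flow_logs
    ADD_MISSING_COLUMNS = TRUE
AS INSERT INTO my_glue_catalog.database.vpc_flow_logs_report MAP_COLUMNS_BY_NAME
    -- hypothetical target table; adjust the SELECT list to the columns of your staged flow logs
    SELECT *
    FROM my_glue_catalog.database.vpc_flow_logs_staging
    WHERE $event_time BETWEEN run_start_time() AND run_end_time();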

At this point, you have a connection to your raw VPC flow log data and have ingested it into a staging table. The next step is to perform data analytics; see the Data Analytics — VPC Flow Logs guide, where we walk you through the various transformations you can apply to your data.