Data Ingestion — VPC Flow Logs

This how-to guide shows you how to ingest, retrieve, and view data for your VPC Flow Logs.

VPC Flow Logs is a feature that enables you to capture information about the IP traffic going to and from network interfaces in your VPC. You can publish flow log data to Amazon CloudWatch Logs or Amazon S3. After you create a flow log, you can retrieve and view its data in your chosen destination.

VPC flow logs can help you:

  • Diagnose overly restrictive security group rules

  • Monitor traffic that is reaching your instance

  • Determine the direction of traffic to and from network interfaces

Learn more about VPC flow logs.

Upsolver SQLake helps you ingest VPC flow logs and perform minor transformations before making your data available to analyze in Amazon Athena. What sets SQLake apart from other tools is its SQL-only approach and its scalable, robust streaming capabilities.

You ingest your VPC flow logs in SQLake in five steps:

  1. Connect SQLake to your Amazon S3 bucket

  2. Connect to your AWS Glue Data Catalog

  3. Create an S3 storage connection

  4. Create a staging table for your VPC flow logs

  5. Ingest data from S3 into your staging table

Step 1

Connect to your Amazon S3 bucket

To transfer your data, you must create a SQLake connection. This connection lets you configure the AWS IAM credentials that SQLake needs to access the data.

When you deploy SQLake, the cluster already has an IAM role attached that may or may not have permission to access the S3 path where your data resides. If it lacks those permissions, you can assign a scope-down policy to the S3 connection. This is recommended to prevent failures, misconfigurations, and breaches.

Here's the code:

CREATE S3 CONNECTION my_s3_connection
AWS_ROLE = 'arn:aws:iam::001234567890:role/upsolver-role-<id>'
READ_ONLY = TRUE
--PATH_DISPLAY_FILTERS = ('s3://quickstart-vpclogs-test', 's3://upsolver-tutorials-vpclogs')
COMMENT = "my s3 connection";

For future reference, you can copy the AWS_ROLE ARN from the role's page in the AWS IAM console.

Step 2

Connect to your AWS Glue Data Catalog

A Glue Catalog connection in SQLake serves as a metadata store connection. It enables you to create Upsolver-managed tables that also double as Athena tables.

Here's the code:

CREATE GLUE_CATALOG CONNECTION my_glue_catalog
AWS_ROLE = 'arn:aws:iam::001234567890:role/upsolver-role-<id>'
--REGION = 'us-east-1'
--DATABASE_DISPLAY_FILTERS = ('database1', 'database2')
COMMENT = "my glue catalog connection";x

You can provide AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY in place of AWS_ROLE, as sketched below.
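
For example, here is a minimal sketch of the same Glue Data Catalog connection using access keys instead of a role. The values are placeholders that you replace with your own credentials:

CREATE GLUE_CATALOG CONNECTION my_glue_catalog
AWS_ACCESS_KEY_ID = '<your-access-key-id>'
AWS_SECRET_ACCESS_KEY = '<your-secret-access-key>'
COMMENT = "my glue catalog connection";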

Step 3

Create an S3 storage connection

The tables you create in SQLake all have underlying files stored in a specified storage location:

  • If you deploy Upsolver SQLake in your customer VPC, you can find a SQLake bucket created during the integration process that serves as a default storage location.

  • If you use SQLake with Upsolver Cloud, you should create an additional S3 connection that serves as an underlying storage location. This ensures your data stays within your account.

Here's the code:

CREATE S3 CONNECTION my_s3_storage
AWS_ROLE = 'arn:aws:iam::001234567890:role/upsolver-role-<id>'
--PATH_DISPLAY_FILTER = 's3://bucket/storage-folder/'
READ_ONLY = FALSE;

You might notice this code looks similar to the code you used to create an S3 connection in Step 1. Both connections are important: the first is a read-only connection that SQLake uses to read your source data from S3, while this second, writable connection is where SQLake stores the underlying files of the staging table you load your data into.

Step 4

Create a staging table for your VPC flow logs

Before you can transform and output your data, you must ingest it into SQLake. To do this, copy your data into an Upsolver SQLake-managed staging table.

Note that staging tables cannot have primary keys and can only be partitioned on time-based columns. Create your staging table as follows:

CREATE TABLE my_glue_catalog.database.vpc_flow_logs_staging
STORAGE_CONNECTION = my_s3_storage
STORAGE_LOCATION = 's3://bucket/folder-path/'
--COMPUTE_CLUSTER = 'SecurityDataLake'
--TABLE_DATA_RETENTION = 30 DAYS
COMMENT = "my s3 staging table";

You must set STORAGE_CONNECTION and STORAGE_LOCATION together to configure the storage location of the table's underlying files.

Note that:

  • If you deploy Upsolver in your Customer VPC, you can omit these options, as there's a SQLake bucket created during the integration process that serves as a default storage location.

  • If you use SQLake with Upsolver Cloud, you must define these two parameters to ensure your data stays within your account.

Step 5

Ingest data from S3 into your staging table

If you use SQLake with Upsolver Cloud, make sure you created the staging table explicitly (as in Step 4) so that the storage location for the table's underlying files is defined.

Here's the code:

CREATE JOB stage_log_data
--START_FROM = '2020-03-06T05:00:00.000Z'
--COMPUTE_CLUSTER = 'SecurityDataLake'
--CREATE_TABLE_IF_MISSING = FALSE
--CONTENT_TYPE = JSON
--COMMENT = 'copy s3 data into staging table'

AS COPY FROM my_s3_connection LOCATION = 's3://bucket/VPC_logs'
INTO my_glue_catalog.database.vpc_flow_logs_staging;

Using the COPY FROM statement, you copy the data from the S3 connection you created in Step 1 and load it into your staging table by specifying the location in your code. Double-check the bucket and path you read from so that the job processes only the data you intend to ingest.

Next, query your table to ensure everything is working properly.

SELECT * FROM my_glue_catalog.database.vpc_flow_logs_staging LIMIT 10;
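
If the job is running correctly, the query returns rows of parsed flow log records. As an optional, more targeted check, you can select a few individual columns; the column names below (srcaddr, dstaddr, action) are typical VPC flow log fields shown for illustration and may differ depending on your flow log format:

SELECT srcaddr, dstaddr, action
FROM my_glue_catalog.database.vpc_flow_logs_staging
LIMIT 10;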

Conclusion

At this point, you have a connection to your raw VPC flow log data and have ingested it into a staging table. The next step is to perform data analytics by applying transformations to your data, as sketched below.
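
For illustration only, here is a minimal sketch of what such a transformation job could look like. The target table name (vpc_flow_logs_rejected), its storage folder, and the column names (srcaddr, dstaddr, action) are hypothetical placeholders; adjust them, and the job options, to match your environment and the SQLake syntax reference.

-- Hypothetical target table for the transformed data (all names are placeholders)
CREATE TABLE my_glue_catalog.database.vpc_flow_logs_rejected
STORAGE_CONNECTION = my_s3_storage
STORAGE_LOCATION = 's3://bucket/transformed-folder-path/'
COMMENT = "rejected vpc traffic";

-- Hypothetical job sketch: keep only rejected traffic from the staging table
CREATE JOB transform_vpc_flow_logs
AS INSERT INTO my_glue_catalog.database.vpc_flow_logs_rejected MAP_COLUMNS_BY_NAME
SELECT srcaddr, dstaddr, action
FROM my_glue_catalog.database.vpc_flow_logs_staging
WHERE action = 'REJECT';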

Because SQLake uses familiar SQL syntax, you can create data pipelines and organize your data to easily perform analytics and ML.

As your business needs evolve, so can your data. In the future, you can create additional jobs that use the same staging table as their source, while your pipelines keep your data continuously fresh.
