S3 Iceberg Wizard Setup Guide

Step 1: Create an S3 Bucket and Prefix

  1. Log in to the AWS Management Console.

  2. Navigate to the S3 service.

  3. Click Create bucket.

  4. Select us-east-1 as the AWS region.

    Note:

    Select only us-east-1 as the bucket region. The current Iceberg implementation allows catalog creation only for buckets in the us-east-1 region.

  5. Enter a self-explanatory name for the bucket (e.g. databee-iceberg-bucket).

  6. In Block Public Access settings, enable all four options to block public access.

  7. Leave all other options at their default settings.

  8. Click Create bucket.

  9. After creation, navigate into the bucket and create a folder to be used as the prefix for the Iceberg catalog (e.g. databee-iceberg-catalog-prefix).

    1. Leave ‘Server Side encryption’ at its default.

    2. Click Create Folder.
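
For readers who script this step instead of using the console, the settings above can be sketched as plain data. This is a minimal illustration, not part of the DataBee procedure: the helper name `bucket_setup_plan` is hypothetical, and the dictionary keys are only labelled after the boto3 S3 client methods (`create_bucket`, `put_public_access_block`, `put_object`) that the arguments would typically be passed to. The block itself makes no AWS calls.

```python
import re

# Loose check of general-purpose bucket naming: 3-63 characters, lowercase
# letters, digits, dots and hyphens, starting and ending with a letter or digit.
BUCKET_NAME_RE = re.compile(r"[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]")


def bucket_setup_plan(bucket, prefix, region="us-east-1"):
    """Return keyword arguments for the three S3 operations, without calling AWS."""
    if region != "us-east-1":
        raise ValueError("this guide supports catalog buckets in us-east-1 only")
    if not BUCKET_NAME_RE.fullmatch(bucket):
        raise ValueError(f"invalid bucket name: {bucket!r}")
    # A console "folder" is a zero-byte object whose key ends with "/".
    folder_key = prefix.strip("/") + "/"
    return {
        "create_bucket": {"Bucket": bucket},
        "put_public_access_block": {
            "Bucket": bucket,
            # All four options enabled, as in step 6.
            "PublicAccessBlockConfiguration": {
                "BlockPublicAcls": True,
                "IgnorePublicAcls": True,
                "BlockPublicPolicy": True,
                "RestrictPublicBuckets": True,
            },
        },
        "put_object": {"Bucket": bucket, "Key": folder_key},
    }


plan = bucket_setup_plan("databee-iceberg-bucket", "databee-iceberg-catalog-prefix")
for operation, kwargs in plan.items():
    print(operation, kwargs)
```

Keeping the plan as data makes it easy to review the values before anything is created.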

Step 2: Create an IAM Role

  1. Navigate to the IAM service.

  2. Click Roles in the sidebar.

  3. Click Create role.

  4. Select Custom Trust Policy on the Select trusted entity page.

    1. Add the dummy policy below so that the role can be created (this policy will be replaced with the policy provided by DataBee while onboarding the data lake).

    2. Dummy Custom Trust Policy

      {
        "Version": "2012-10-17",
        "Statement": [
          {
            "Sid": "Statement1",
            "Effect": "Allow",
            "Principal": {
              "AWS": "arn:aws:iam::000000000000:role/CustomerManaged/use1/dummy"
            },
            "Action": "sts:AssumeRole"
          }
        ]
      }

    3. Snapshot of adding Dummy Trust Policy

  5. Skip the "Add Permission" page by clicking Next. The permissions will also be set later, using the policy provided by DataBee while onboarding the data lake.

  6. On the "Name, review, and create" page:

    1. Provide a meaningful name for the role (e.g. iceberg-s3catalog-role).

    2. Verify the dummy trust policy applied on the previous page.

    3. Click the Create role button.

    4. Snapshot of the Role Creation

  7. After the role is created, get its ARN.

    1. Click the newly created role to view its ARN.
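
Because a trust policy is easy to damage when it is copied (for example, a line break landing inside the principal ARN), a quick local check before pasting it into the console can help. The sketch below is an assumption-laden illustration: `check_trust_policy` is a hypothetical helper, and the ARN pattern is a loose approximation rather than the full IAM grammar.

```python
import json
import re

# Loose pattern for a role ARN: 12-digit account ID, then "role/" and a path/name.
ROLE_ARN_RE = re.compile(r"arn:aws:iam::\d{12}:role/[\w+=,.@/-]+")


def check_trust_policy(text):
    """Return a list of problems; an empty list means the basic checks passed."""
    try:
        policy = json.loads(text)
    except json.JSONDecodeError as exc:
        # A raw line break inside a JSON string is reported here.
        return [f"not valid JSON: {exc.msg} (line {exc.lineno})"]
    problems = []
    for statement in policy.get("Statement", []):
        if statement.get("Action") != "sts:AssumeRole":
            problems.append("Action is not sts:AssumeRole")
        principal = statement.get("Principal")
        arn = principal.get("AWS") if isinstance(principal, dict) else None
        if not isinstance(arn, str) or not ROLE_ARN_RE.fullmatch(arn):
            problems.append(f"principal is not a well-formed role ARN: {arn!r}")
    return problems


dummy_policy = """
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Statement1",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::000000000000:role/CustomerManaged/use1/dummy"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
"""
print(check_trust_policy(dummy_policy))
```

The check is purely local and does not confirm that AWS will accept the policy.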

Step 3: Raise a Jira Request to Allow the Customer Role ARN in databee-catalog in Prod (done as part of customer onboarding)

  1. Create a Jira ticket with the following details:

    1. AWS account ID

    2. ARN of the Role created by the customer in their AWS account

    3. Cluster and Tenant of the Customer

  2. Reference Ticket - BLUEB-20717
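
The AWS account ID requested in the ticket is embedded in the role ARN, so both values can be derived from one source. The following sketch shows one way to do that; `ticket_details` is a hypothetical helper, and the account ID, cluster, and tenant values in the example are placeholders.

```python
def ticket_details(role_arn, cluster, tenant):
    """Collect the ticket fields, deriving the account ID from the role ARN."""
    # Role ARN layout: arn:aws:iam::<account-id>:role/<path-and-name>
    parts = role_arn.split(":", 5)
    if len(parts) != 6 or parts[:3] != ["arn", "aws", "iam"]:
        raise ValueError(f"not an IAM ARN: {role_arn!r}")
    account_id, resource = parts[4], parts[5]
    if not (account_id.isdigit() and len(account_id) == 12):
        raise ValueError(f"unexpected account ID: {account_id!r}")
    if not resource.startswith("role/"):
        raise ValueError(f"ARN does not identify a role: {resource!r}")
    return {
        "AWS account ID": account_id,
        "Role ARN": role_arn,
        "Cluster": cluster,
        "Tenant": tenant,
    }


# Placeholder values for illustration only.
details = ticket_details(
    "arn:aws:iam::123456789012:role/iceberg-s3catalog-role",
    cluster="example-cluster",
    tenant="example-tenant",
)
for field, value in details.items():
    print(f"{field}: {value}")
```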

Step 4: Onboarding the data lake in the DataBee UI

  1. Log in to the DataBee tenant portal.

  2. Click the Configuration icon (the gear icon to the left of the Profile icon in the top right corner of the page).

  3. Select System from the drop-down.

  4. Select Data Lakes from the left sidebar on the “System” page.

  5. Select Iceberg under the Data Lakes section.

  6. Snapshot highlighting the above configuration steps.

Step 4.1: Configuring Iceberg for AWS S3

Step 4.1.1: Entering Bucket and Role Details in Configuration Page

  1. Under the Iceberg section on the “Data Lakes” page, click the AWS S3 button.

  2. A pop-up appears for configuring Iceberg on AWS S3:

    1. Check the Enabled check box.

    2. Enter the name of the S3 Bucket created earlier.

    3. Enter the bucket's AWS region.

    4. Enter the ARN of the role created earlier.

    5. Skip KMS Encryption Key, because the bucket does not use KMS encryption.

    6. Enter the prefix / folder name created inside the S3 bucket.

    7. Click Next.

    8. Snapshot of configuration

Step 4.1.2: Copying Bucket Storage Policy from DataBee and Creation of the Policy in AWS

  1. After you click Next, the Trust Policy and Storage Policy are displayed in the next section (it takes a few seconds to appear).

  2. Snapshot of the next page load after hitting next button

  3. Follow the instructions to create a storage policy in AWS, then copy the policy shown in the configuration under AWS IAM Policy for Data Catalog over S3.

  4. Creating Storage Policy in AWS

    1. Navigate to the IAM service in AWS, select Policies under the Access Management menu, and click the Create Policy button.

    2. Choose the JSON option in the Policy Editor, paste the policy provided by DataBee, and click the Next button.

      Sample Storage Policy

      {
        "Version": "2012-10-17",
        "Statement": [
          {
            "Sid": "AllowS3IcebergObjects",
            "Effect": "Allow",
            "Action": [
              "s3:PutObject",
              "s3:GetObject",
              "s3:GetObjectVersion",
              "s3:DeleteObject",
              "s3:DeleteObjectVersion"
            ],
            "Resource": [
              "arn:aws:s3:::YOUR_BUCKET_NAME/YOUR_BUCKET_PREFIX/*"
            ]
          },
          {
            "Sid": "AllowS3IcebergBucket",
            "Effect": "Allow",
            "Action": [
              "s3:ListBucket",
              "s3:GetBucketLocation"
            ],
            "Resource": [
              "arn:aws:s3:::YOUR_BUCKET_NAME"
            ],
            "Condition": {
              "StringLike": {
                "s3:prefix": [
                  "YOUR_BUCKET_PREFIX/*"
                ]
              }
            }
          },
          {
            "Sid": "AllowSSLRequestsOnly",
            "Effect": "Deny",
            "Action": ["s3:*"],
            "Resource": [
              "arn:aws:s3:::YOUR_BUCKET_NAME"
            ],
            "Condition": {
              "Bool": {
                "aws:SecureTransport": "false"
              }
            }
          }
        ]
      }

    3. On the next page, review the policy, provide a meaningful name for the storage policy (e.g. iceberg_storage_policy), and click the Create Policy button.

  5. Attach the storage policy to the AWS role

    1. Navigate to the Roles section in the IAM service and select the role to be edited.

    2. Under the Permissions section, open the Add Permissions drop-down menu and select Attach Policies.

    3. In the Attach Policy section, filter the policies by selecting Customer Managed in the Filter By Type drop-down menu, select the checkbox of the policy, and click the Add Permission button.
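
To make the placeholder substitution in the sample storage policy concrete, the sketch below rebuilds that sample with `YOUR_BUCKET_NAME` and `YOUR_BUCKET_PREFIX` filled in. It only mirrors the sample shown in this guide; the policy displayed in the DataBee configuration page remains the one to paste into AWS. The function name `render_storage_policy` is hypothetical.

```python
import json


def render_storage_policy(bucket, prefix):
    """Mirror the sample storage policy with the two placeholders filled in."""
    bucket_arn = f"arn:aws:s3:::{bucket}"
    prefix = prefix.strip("/")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Object-level access, limited to keys under the catalog prefix.
                "Sid": "AllowS3IcebergObjects",
                "Effect": "Allow",
                "Action": [
                    "s3:PutObject",
                    "s3:GetObject",
                    "s3:GetObjectVersion",
                    "s3:DeleteObject",
                    "s3:DeleteObjectVersion",
                ],
                "Resource": [f"{bucket_arn}/{prefix}/*"],
            },
            {
                # Bucket-level actions, conditioned on the same prefix.
                "Sid": "AllowS3IcebergBucket",
                "Effect": "Allow",
                "Action": ["s3:ListBucket", "s3:GetBucketLocation"],
                "Resource": [bucket_arn],
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
            },
            {
                # Deny any request that does not use TLS.
                "Sid": "AllowSSLRequestsOnly",
                "Effect": "Deny",
                "Action": ["s3:*"],
                "Resource": [bucket_arn],
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            },
        ],
    }


policy = render_storage_policy("databee-iceberg-bucket", "databee-iceberg-catalog-prefix")
print(json.dumps(policy, indent=2))
```

Comparing this output with the policy shown by DataBee is a quick way to confirm that the bucket name and prefix were entered as intended.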

Step 4.1.3: Copying Trust Policy from DataBee and attaching to the role in AWS

  1. Follow the instructions under Step 2 of the configuration page for adding the DataBee trust policy.

  2. AWS Snapshots for adding the Trust Policy

    1. Select the role, navigate to the Trust relationships section, and click the Edit trust policy button.

    2. Enter the DataBee-provided trust policy in the trust policy editor box and then click the Update Policy button.

Sample Trust Policy

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::DATABEE-PRODUCTION-AWS-ACCOUNTID:role/CustomerManaged/use1/databee-catalog"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": "YOUR_EXTERNAL_ID"
        }
      }
    }
  ]
}
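
The sample above contains two placeholders, whereas the policy copied from DataBee should carry real values. A small local check such as the following sketch can confirm that the text about to be pasted is valid JSON, has no leftover placeholders, and includes an external ID condition. The helper name `review_trust_policy` is hypothetical, and the placeholder strings are taken from the sample in this guide.

```python
import json

# Placeholder tokens as they appear in the sample trust policy.
PLACEHOLDERS = ("DATABEE-PRODUCTION-AWS-ACCOUNTID", "YOUR_EXTERNAL_ID")


def review_trust_policy(text):
    """Return (external_ids, leftover_placeholders) for a trust policy document."""
    policy = json.loads(text)  # raises ValueError if the text is not valid JSON
    external_ids = []
    for statement in policy.get("Statement", []):
        condition = statement.get("Condition", {})
        value = condition.get("StringEquals", {}).get("sts:ExternalId")
        if value is not None:
            external_ids.append(value)
    leftovers = [token for token in PLACEHOLDERS if token in text]
    return external_ids, leftovers


sample = """
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::DATABEE-PRODUCTION-AWS-ACCOUNTID:role/CustomerManaged/use1/databee-catalog"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {"sts:ExternalId": "YOUR_EXTERNAL_ID"}
      }
    }
  ]
}
"""
ids, leftovers = review_trust_policy(sample)
print("external IDs:", ids)
print("placeholders still present:", leftovers)
```

Run against the unedited sample, both placeholders are reported, which is the signal that the sample rather than the DataBee-provided policy was copied.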

   

Step 4.1.4: Save Data Lake Configuration and Verify creation of Iceberg Metadata

  1. After completing the policy setup in AWS, return to the data lake configuration page in DataBee and click the Save button. A pop-up shows the data lake setup progress.

  2. After a couple of minutes, navigate to the Iceberg S3 bucket and verify that directories have been created for each of the OCSF events and objects, as per the latest alembic version, under the /ocsf directory within the prefix used while configuring the data lake.

  3. Verify the Iceberg metadata by navigating to the /ocsf/<event-table-name>/metadata directory under the bucket prefix and reading the metadata JSON file.
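
The directory check in the steps above can also be done on a list of object keys (for example, from a bucket listing exported to a file). The sketch below groups metadata files by table, assuming the `<prefix>/ocsf/<table>/metadata/` layout described above and assuming that metadata files end in `.metadata.json`; both the helper name and the example keys are hypothetical.

```python
from collections import defaultdict


def tables_with_metadata(keys, prefix):
    """Map each table under <prefix>/ocsf/ to the metadata JSON files found for it."""
    root = prefix.strip("/") + "/ocsf/"
    found = defaultdict(list)
    for key in keys:
        if not key.startswith(root):
            continue
        parts = key[len(root):].split("/")
        # Expected shape: <table>/metadata/<file>.metadata.json (assumed naming)
        if len(parts) == 3 and parts[1] == "metadata" and parts[2].endswith(".metadata.json"):
            found[parts[0]].append(parts[2])
    return dict(found)


# Hypothetical keys for illustration only.
example_keys = [
    "databee-iceberg-catalog-prefix/ocsf/authentication/metadata/00000-example.metadata.json",
    "databee-iceberg-catalog-prefix/ocsf/authentication/data/part-0.parquet",
    "databee-iceberg-catalog-prefix/ocsf/file_activity/metadata/",
    "other-prefix/ocsf/authentication/metadata/00000-other.metadata.json",
]
result = tables_with_metadata(example_keys, "databee-iceberg-catalog-prefix")
print(result)
```

Tables that are expected but missing from the result are the ones to inspect in the console.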

Step 5: Validating Data Ingestion into the Iceberg Catalogs

  1. Onboard an S3 data source (for steps, refer to Integrating S3 buckets into DataBee).

    1. This example onboards a 1Password feed and drops 7 files with 1 record each; the records land in the event tables below.

      1001  file_activity
      3001  account_change
      3002  authentication
      3006  group_management
      5001  inventory_info

  2. Load data into the S3 bucket and verify ingestion

    1. Verifying ingestion in the DataBee UI via the Data Quality Summary page

      1. Sankey diagram showing the data flow, along with data quality.

        Summary showing the number of records successfully ingested.

    2. Verifying ingestion in the S3 Iceberg bucket (Group Management Event)

    3. Verifying ingestion in StarRocks (Group Management Event)

    4. Verifying ingestion in the DataBee search page (Group Management Event)
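
Once per-table record counts have been read from any of the views above, they can be compared against what the test feed should produce. The sketch below uses the five event tables and the total of 7 records stated in this step; the function name `check_ingestion` and the per-table split in the example are hypothetical, since the guide does not state how the 7 records are distributed.

```python
# Event tables expected to receive records, keyed by the numeric ID listed with each.
EVENT_TABLES = {
    1001: "file_activity",
    3001: "account_change",
    3002: "authentication",
    3006: "group_management",
    5001: "inventory_info",
}


def check_ingestion(counts, expected_total=7):
    """Return a list of discrepancies between observed counts and expectations."""
    problems = []
    known = set(EVENT_TABLES.values())
    for table in sorted(set(counts) - known):
        problems.append(f"unexpected table: {table}")
    for table in sorted(known - set(counts)):
        problems.append(f"no records reported for: {table}")
    total = sum(counts.values())
    if total != expected_total:
        problems.append(f"expected {expected_total} records, found {total}")
    return problems


# Hypothetical observed counts; the real split depends on the files dropped.
observed = {
    "file_activity": 1,
    "account_change": 2,
    "authentication": 2,
    "group_management": 1,
    "inventory_info": 1,
}
print(check_ingestion(observed) or "counts match expectations")
```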


Additional: Creating an AWS S3 Catalog with KMS Encryption Enabled

  1. Creation of an S3 Bucket with KMS Encryption Enabled

    1. Creation of Customer Managed KMS Key

      1. Navigate to the KMS service in AWS and click the Create Key button in the top right corner. Keep all default options and proceed to the next page.

      2. Provide an alias for the KMS key, then click Skip to Review, and then click Finish.

    2. Creation of Bucket with KMS Encryption

      1. Choose SSE-KMS under the default encryption section while creating the bucket, and then create the bucket.

      2. Verify and capture the KMS Key ARN of the bucket (this will be used later while onboarding the catalog).

  2. The remaining steps are the same as Step 1.9 through Step 5 above.

    1. The following additional statement appears in the storage policy for KMS-encrypted S3 buckets.

      Additional KMS Statement in Storage Policy

      {
        "Sid": "AllowKMSOperations",
        "Effect": "Allow",
        "Action": [
          "kms:Decrypt",
          "kms:GenerateDataKey"
        ],
        "Resource": [
          "arn:aws:kms:us-east-1:340234701830:key/730d77f3-f7b7-4d28-be4e-c675c9b4e4d9"
        ]
      }
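
To show where this statement sits relative to the rest of the storage policy, the sketch below appends it to an existing policy document. It is illustrative only: DataBee generates the actual policy, `with_kms_statement` is a hypothetical helper, and the key ARN in the example is a placeholder.

```python
import copy


def with_kms_statement(policy, key_arn):
    """Return a copy of the policy with the KMS statement appended."""
    updated = copy.deepcopy(policy)  # leave the caller's policy untouched
    updated["Statement"].append(
        {
            "Sid": "AllowKMSOperations",
            "Effect": "Allow",
            "Action": ["kms:Decrypt", "kms:GenerateDataKey"],
            "Resource": [key_arn],
        }
    )
    return updated


# Minimal stand-in for a storage policy; a real one has more statements.
base_policy = {"Version": "2012-10-17", "Statement": []}
# Placeholder key ARN for illustration only.
kms_policy = with_kms_statement(
    base_policy, "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID"
)
print(kms_policy["Statement"][-1]["Sid"])
```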