Data Quality
  • 06 Jan 2025
  • 11 Minutes to read

The Data Quality features help data engineering teams and administrators assess and maintain data quality per feed. They monitor data flows, surface alerts for quality changes, and provide insights into data performance. Let's explore the key features.

Key Features

Data Ingest Metrics - Displays a wide range of data ingest metrics, such as the amount of data ingested, the amount of data processed, feed mapping percentage, and more.

Real-Time Monitoring - Continuously monitors data flows and processing pipelines, ensuring ETL pipeline transparency is consistently maintained.

Alerting Mechanism - Notifies the data engineering team and other stakeholders of critical data quality changes. The alerting system offers explanations and diagnostics to expedite issue resolution.

User-Friendly Visualizations - Designed to be intuitive for both data engineers and regular users, presenting data quality metrics and insights in an engaging visual format.

Customizable Views - Users can customize their dashboard views, focusing on the metrics and data quality assets relevant to their roles and responsibilities.

Historical Data Analysis - Provides access to historical data quality trends and metrics to identify patterns and make informed decisions about data quality improvements.

Data Sources Page

Click the Data button to open the "Your current data sources" page, which lists all the data sources currently configured in the tool. The page displays a clean, organized view of each data source, including its name, state, the size of raw data, and the data quality score with a graphical representation.


Click on any data source to open a detail panel on the right side of the page. From the data quality state to essentials such as bytes ingested, owner details, and vendor information, the panel provides a comprehensive view. It also shows the configuration timestamp, ingest type, AWS region, active alerts, and more for a thorough understanding of your data source. Click View Data Quality Summary to explore additional details for the selected data source.

Data Quality Page

You are now directed to the "Data Quality" page. This display provides a detailed overview of records, items mapped, and data quality score changes per day, supplemented by a Sankey diagram for a more granular view of how data flows through the DataBee pipelines. The 'Active Alerts' section displays all active warnings and errors in the data source.

The metrics used in our data quality analysis are listed below, along with their descriptions. All metrics are bound by the time interval selected using the dropdown menu in the top right corner.

| Metric | Description |
| --- | --- |
| Bytes Ingested | The amount of data ingested by DataBee, measured in MB/GB |
| Feed Bandwidth | The rate of log processing, measured in logs/sec (or MB/sec) |
| Owner | The owner specified when the data source was configured |
| Records Ingested | The number of individual records ingested by DataBee |
| Feed Mapping Efficiency | The percentage of ingested records that were successfully mapped |
| Records Mapped | The number of records that were mapped to an OCSF event or object table |
| Data Last Ingested | The last date that the feed was successfully run |
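Feed Mapping Efficiency can be derived from two of the other metrics. A minimal sketch of the calculation (the function name is illustrative, not part of DataBee):

```python
def feed_mapping_efficiency(records_mapped: int, records_ingested: int) -> float:
    """Percentage of ingested records that were mapped to an OCSF table."""
    if records_ingested == 0:
        return 0.0  # avoid division by zero when a feed has no data yet
    return 100.0 * records_mapped / records_ingested

# e.g. 9,500 of 10,000 records mapped -> 95.0
```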

Data Quality Alerts

You will receive alert notifications when critical issues arise within your data sources. Real-time alerts let you respond quickly to problems, which is essential for maintaining operational efficiency and ensuring security. Errors and warnings can be viewed by clicking on any data source on the "Your current data sources" page, which opens a side panel where active alerts are displayed. Warnings can be dismissed manually, but they will reappear if the underlying issue has not been resolved. Errors cannot be dismissed and remain visible until the issue is fixed.

Click the gear icon to access the "Data Quality - Alert Settings" page. You can disable the data staleness check by switching the toggle off, or adjust the duration by setting the desired number of hours. When you're done, click Save.
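Conceptually, the staleness check compares the time of the last successful ingest against the configured threshold. A hypothetical sketch (DataBee's actual implementation is not documented here; the function and parameter names are illustrative):

```python
from datetime import datetime, timedelta

def is_feed_stale(last_ingested: datetime, threshold_hours: int, now: datetime) -> bool:
    """A feed is considered stale when no data has been ingested
    within the configured number of hours."""
    return now - last_ingested > timedelta(hours=threshold_hours)
```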

Refer to the table below to get a detailed understanding of the errors.

Ingestion Type: S3

| Actual error | DataBee status code | DataBee error code | Error explanation | Resolution tip |
| --- | --- | --- | --- | --- |
| ExpiredToken: The security token included in the request is expired | 1400 | S3_ExpiredToken | Your authentication token has expired and needs to be refreshed. | Request a new authentication token from your AWS service. Check token expiration handling in your code. |
| AccessDenied | 1403 | S3_AccessDenied | You don't have permission to perform this operation on the S3 resource. | Verify IAM roles and permissions. Check bucket policies and ACLs. Ensure your credentials have the required permissions. |
| BucketNotEmpty | 1409 | S3_BucketNotEmpty | The bucket must be empty before it can be deleted. | Remove all objects from the bucket first, or use the force delete option if available in your SDK. |
| InvalidBucketName | 1401 | S3_InvalidBucketName | The specified bucket name is not valid or follows incorrect naming conventions. | Ensure the bucket name follows S3 naming rules: lowercase letters, numbers, dots, and hyphens only. Must be 3-63 characters long. |
| InvalidObjectState | 1402 | S3_InvalidObjectState | The requested operation cannot be performed on the object in its current state. | Check if the object is in Glacier storage. Verify object lock settings. Ensure the object is not being modified by another operation. |
| NoSuchBucket | 1404 | S3_NoSuchBucket | The specified bucket does not exist. | Verify the bucket name and region. Check if the bucket was deleted or never created. Ensure you're using the correct AWS account. |
| NoSuchKey | 1405 | S3_NoSuchKey | The specified file or object could not be found in the bucket. | Verify the object key path. Check if the file was deleted. Ensure the correct bucket and folder structure. |
| PreconditionFailed | 1412 | S3_PreconditionFailed | One or more preconditions you specified for the operation did not hold. | Check ETag matches and conditional headers. Verify if the object was modified since last retrieval. |
| SlowDown | 1503 | S3_SlowDown | You are sending too many requests and must reduce your request rate. | Implement exponential backoff. Add request rate limiting. Consider using S3 Transfer Acceleration for better performance. |
| UnknownError | 1500 | S3_UnknownError | An unexpected error occurred while processing your S3 request. | Check the AWS service health dashboard. Review CloudWatch logs. Contact AWS support if persistent. |
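The S3 naming rules quoted for InvalidBucketName (lowercase letters, numbers, dots, hyphens; 3-63 characters) can be pre-checked client-side before ingest configuration. A simplified sketch; it covers only the rules stated in the table and omits further AWS restrictions (such as forbidding IP-address-style names):

```python
import re

# Start/end with a letter or digit; dots and hyphens allowed in between; 3-63 chars total.
_BUCKET_RE = re.compile(r"^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$")

def looks_like_valid_bucket_name(name: str) -> bool:
    """Pre-check the basic S3 bucket naming rules from the table above."""
    return bool(_BUCKET_RE.match(name))
```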

Ingestion Type: SQS

| Actual error | DataBee status code | DataBee error code | Error explanation | Resolution tips |
| --- | --- | --- | --- | --- |
| ExpiredToken: The security token included in the request is expired | 2400 | SQS_ExpiredToken | Your authentication token for accessing SQS has expired and needs to be renewed. | Request a new authentication token from AWS and update your application's credentials. |
| AccessDenied | 2403 | SQS_AccessDenied | You don't have the necessary permissions to perform this operation on the SQS queue. | Check IAM roles and policies, and ensure your credentials have the required SQS permissions. |
| InvalidParameterValue | 2401 | SQS_InvalidParameterValue | One or more parameters provided in your SQS request have invalid values. | Review the API documentation for correct parameter formats and validate all input values. |
| MissingParameter | 2402 | SQS_MissingParameter | A required parameter is missing from your SQS request. | Check the API documentation for required parameters and ensure all are included in your request. |
| MessageNotInflight | 2404 | SQS_MessageNotInflight | The message you're trying to process is not currently in flight or being processed. | Verify the message receipt handle is valid and the message hasn't exceeded its visibility timeout. |
| OverLimit | 2405 | SQS_OverLimit | You have exceeded the maximum allowed limit for this SQS operation. | Implement request throttling or contact AWS support to increase your quota limits. |
| QueueDeletedRecently | 2406 | SQS_QueueDeletedRecently | You cannot create a queue with this name because it was recently deleted. | Wait 60 seconds before recreating a queue with the same name, or use a different queue name. |
| NonExistentQueue | 2407 | SQS_NonExistentQueue | The specified SQS queue does not exist. | Verify the queue URL/name and region, and ensure the queue hasn't been deleted. |
| InvalidMessageContents | 2408 | SQS_InvalidMessageContents | The message content contains invalid characters or exceeds size limits. | Check the message format and size, and ensure it meets SQS message requirements. |
| UnknownError | 2500 | SQS_UnknownError | An unexpected error occurred while processing your SQS request. | Check AWS service health, review CloudWatch logs, and contact AWS support if persistent. |
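Several of the resolution tips above (S3 SlowDown, SQS OverLimit) recommend exponential backoff. A generic retry sketch, with a hypothetical `ThrottledError` standing in for the provider-specific throttling exception:

```python
import time

class ThrottledError(Exception):
    """Stand-in for a provider throttling error (e.g. SlowDown, OverLimit)."""

def call_with_backoff(operation, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Retry `operation` with exponentially growing delays on throttling errors."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s, 4s, ...
```

The `sleep` parameter is injected so the delay strategy can be swapped or silenced in tests; jitter is usually added in production to avoid synchronized retries.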

Ingestion Type: Azure Blob

| Actual error | DataBee status code | DataBee error code | Error explanation | Resolution tips |
| --- | --- | --- | --- | --- |
| InvalidAuthenticationInfo | 3400 | BLOB_InvalidAuthenticationInfo | The authentication information provided for accessing the Azure Blob storage is invalid or malformed. | Verify your connection string, access keys, or SAS token are correct and not expired. |
| InvalidBlobOrBlock | 3401 | BLOB_InvalidBlobOrBlock | The blob or block data you're trying to access or modify is invalid or corrupted. | Check the blob name and size limits, and ensure data integrity during upload/download operations. |
| InsufficientAccountPermissions | 3402 | BLOB_InsufficientAccountPermissions | Your account lacks the necessary permissions to perform this operation on the blob storage. | Review and update your Azure role assignments and access policies for the storage account. |
| AuthorizationFailure | 3403 | BLOB_AuthorizationFailure | The request was not authorized to perform this operation on the blob resource. | Check your shared access signature (SAS) permissions and storage account access policies. |
| BlobNotFound | 3404 | BLOB_BlobNotFound | The requested blob could not be found in the specified container. | Verify the blob name and path, and ensure the blob hasn't been deleted or moved. |
| ContainerNotFound | 3405 | BLOB_ContainerNotFound | The specified container does not exist in the storage account. | Check the container name and ensure it exists in the correct storage account. |
| ResourceNotFound | 3406 | BLOB_ResourceNotFound | The requested Azure Blob storage resource could not be found. | Verify the resource path and name, and ensure the storage account is correctly configured. |
| BlobAlreadyExists | 3407 | BLOB_BlobAlreadyExists | A blob with this name already exists in the container. | Use a different blob name or implement logic to handle existing blobs (overwrite/skip). |
| ContainerAlreadyExists | 3408 | BLOB_ContainerAlreadyExists | A container with this name already exists in the storage account. | Choose a different container name or handle existing container scenarios appropriately. |
| InvalidQueryParameterValue | 4409 | BLOB_InvalidQueryParameterValue | One or more query parameters in your blob storage request are invalid. | Review the API documentation and validate that all query parameters meet the required format. |
| QueueNotFound | 4410 | BLOB_QueueNotFound | The specified Azure Storage queue could not be found. | Verify the queue name and ensure it exists in the correct storage account. |
| QueueDisabled | 4411 | BLOB_QueueDisabled | The queue service is currently disabled for this storage account. | Enable the queue service in your storage account settings or use an alternative storage account. |
| Unknown | 4500 | BLOB_UnknownError | An unexpected error occurred while accessing Azure Blob storage. | Check Azure service health, review application logs, and contact Azure support if the issue persists. |
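The DataBee status codes in these tables appear to follow a prefix convention: 1xxx for S3, 2xxx for SQS, 3xxx/4xxx for Azure Blob, and 5xxx for API. Assuming that convention holds (it is inferred from the tables, not stated by DataBee), a small lookup sketch:

```python
def ingestion_type_for_status(code: int) -> str:
    """Map a DataBee status code to its ingestion type, assuming the
    thousands digit identifies the source (as seen in the tables above)."""
    prefixes = {1: "S3", 2: "SQS", 3: "Azure Blob", 4: "Azure Blob", 5: "API"}
    return prefixes.get(code // 1000, "Unknown")
```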

Ingestion Type: API

| HTTP error code | Error string | DataBee status code | DataBee error code | Error explanation | Resolution tip |
| --- | --- | --- | --- | --- | --- |
| Bad Request | Invalid redirection uri | 5400 | API_InvalidRedirectUrl | The redirect URL provided in your request is not valid or properly formatted. | Check the redirect URL format, ensure it matches the allowed URLs in your API settings, and verify it's properly encoded. |
| Bad Request | Redirection URI is required | 5400 | API_NoRedirectUri | The request is missing a required redirect URL parameter. | Add a valid redirect URI to your request parameters as specified in the API documentation. |
| Bad Request | Invalid Authorization Code | 5400 | API_InvalidAuthCode | The authorization code provided has expired or is not valid. | Request a new authorization code and ensure you're using it promptly before it expires. |
| Bad Request | Invalid_refresh_token | 5400 | API_InvalidRefreshToken | The refresh token provided is not valid or has been revoked. | Initiate a new authentication flow to obtain a fresh refresh token. |
| Bad Request | Refresh Token expired | 5400 | API_RefreshTokenExpired | The refresh token has exceeded its lifetime and is no longer valid. | Perform a new authentication flow to obtain new access and refresh tokens. |
| Unauthorized | unauthorized_client | 5400 | API_UnauthorizedClient | The client is not authorized to request an authorization code. | Verify your client credentials and ensure your application has the necessary permissions. |
| Invalid response type | Response type must be | 5400 | API_InvalidResponseType | The response type specified in the authorization request is not supported. | Use one of the supported response types (usually 'code' or 'token') as specified in the API documentation. |
| Invalid grant type | invalid grant type | 5400 | API_UnsupportedGrantType | The grant type specified in the token request is not supported. | Use one of the supported grant types (e.g., 'authorization_code', 'refresh_token') as specified in the API documentation. |
| Invalid request | Invalid request | 5400 | API_InvalidRequest | The request is missing a required parameter or contains an invalid parameter value. | Review the API documentation and ensure all required parameters are included with valid values. |
| UnauthorizedError | Invalid access token | 5401 | API_InvalidResource | The access token provided is not valid or has been revoked. | Obtain a new access token using your refresh token or perform a new authentication flow. |
| UnauthorizedError | Access token expired | 5401 | API_ExpiredAccessToken | The access token has exceeded its lifetime and is no longer valid. | Use your refresh token to obtain a new access token, or perform a new authentication flow if the refresh token is also expired. |
| UnauthorizedError | Access token not approved | 5401 | API_AccessTokenNotApproved | The access token has not been approved or was rejected by the authorization server. | Check if the user has granted all required permissions and initiate a new authentication flow if necessary. |
| ForbiddenError | InsufficientScope | 5403 | API_InsufficientScope | The access token does not have the required permissions to perform this operation. | Request additional scopes during the authentication process or use a token with the necessary permissions. |
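The API error codes above distinguish expired access tokens (recoverable via a refresh token) from expired or invalid refresh tokens (requiring a full re-authentication flow). A hypothetical dispatch sketch; the action strings are illustrative, not DataBee output:

```python
def next_auth_action(databee_error_code: str) -> str:
    """Suggest a recovery step for the API auth errors listed above."""
    if databee_error_code in ("API_ExpiredAccessToken", "API_InvalidResource"):
        return "refresh_access_token"  # the refresh token may still be valid
    if databee_error_code in ("API_RefreshTokenExpired", "API_InvalidRefreshToken"):
        return "reauthenticate"        # full authentication flow required
    return "check_configuration"       # e.g. redirect URI, scopes, grant type
```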

Sankey Diagram

The Sankey diagram in DataBee is a powerful data visualization tool that helps you understand how data flows through different stages. Each flow is depicted as a stream, where the width is proportional to the amount of data it represents. The diagram provides a comprehensive view of data distribution across various categories, allowing for easy identification of successes, failures, and unmapped data.

Data Flow Categories

Success:

  • The OCSF event tables that are powered by the feed.

Failed:

  • Regex Errors: Failures due to issues in regular expression matching.

  • Parsing Errors: Failures that occurred during data parsing.

  • Mapping Errors: Failures related to data mapping inconsistencies.

Unmapped: Data that has not been assigned to any specific category.

Select the time range (last hour, last day, last 7 days, this month, this year, all history) as per your preference. When you hover over any of the boxes in the diagram, a tooltip will appear showing the percentage of data that the box represents.

Clicking on Success (Process Activity, User Inventory) directs you to a query preloaded Search page, where you can view detailed tables of the corresponding data.

Clicking on Failed (Parsing, Mapping, Regex) takes you to the Unprocessed page. Here, the filters will be preloaded according to the selected time range, allowing you to analyze the specific reasons for failure.

Clicking on Unmapped directs you to a query preloaded search page where you can further investigate the unmapped data.
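The tooltip percentages can be reproduced from the flow volumes: each box's share is its record count divided by the feed total. A minimal sketch (the category names are the ones shown in the diagram):

```python
def flow_percentages(flows: dict) -> dict:
    """Given record counts per Sankey category, return each category's
    share of the total as a percentage (what the hover tooltip shows)."""
    total = sum(flows.values())
    if total == 0:
        return {name: 0.0 for name in flows}
    return {name: round(100.0 * count / total, 1) for name, count in flows.items()}
```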

Unprocessed page

The Unprocessed page provides a detailed table that lists the feed names alongside their corresponding issue type, error message, and the date the issue occurred. This page is designed to help you quickly identify and analyze unprocessed data.

You can access the Unprocessed page in two ways:

  • From the Sankey Diagram: Click on any of the error boxes (Failed – Mapping, Parsing, Regex) within a data feed's Sankey diagram.

  • From the Data drop-down on the top navbar.

To streamline your analysis, you can apply various filters:

  • Date Range: Select from predefined options—Last 24 Hours, Last 7 Days, Last Month, or All Time—to focus on a specific timeframe.

  • Error Type: Filter by the type of issue (Parsing, Regex, Mapping) to narrow down the results to specific errors.

  • Feed Selection: Choose specific feeds of interest to view only the relevant unprocessed data.

To explore the raw message and analyze where the failure occurred, click on the magnifying glass to expand the row and view the raw message compared to how DataBee tried to process it.
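Conceptually, the three filters combine as a logical AND over the table rows. A sketch with illustrative record dicts (the field names are assumptions, not DataBee's schema):

```python
from datetime import datetime

def filter_unprocessed(rows, since=None, error_types=None, feeds=None):
    """Keep rows matching every supplied filter; None means 'any'."""
    result = []
    for row in rows:
        if since and row["date"] < since:
            continue  # outside the selected date range
        if error_types and row["error_type"] not in error_types:
            continue  # e.g. only Parsing / Regex / Mapping
        if feeds and row["feed"] not in feeds:
            continue  # only the feeds of interest
        result.append(row)
    return result
```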

