Data Collector
  • 23 May 2024

The Data Collector collects data from diverse sources, applies filters, and adds metadata before securely forwarding it to the DataBee Receiver. It is easily installable on your On-Prem machines. DataBee is a robust centralized platform that tracks multiple Data Collectors. It processes, enriches, and securely stores the incoming data, allowing remote Data Collector configuration and updates.

Syslog is a widely used protocol for message logging in computing systems. TCP logs are logs generated by applications or systems that communicate over the Transmission Control Protocol (TCP). Windows Event Logs are the built-in logging mechanism of Windows operating systems. Devices and applications transmit their critical event messages to a centralized server. These messages often contain key information about system events, errors, warnings, and operational status that is crucial for system analysis and troubleshooting.

An on-premises forwarder acts as the intermediary in this flow. It collects log messages from sources spread across the network infrastructure and forwards them securely and accurately to the central DataBee platform, which keeps data ingestion and analysis reliable and efficient. The DataBee Receiver services then filter and tag the data and forward it to the platform based on the tenant and data source identifier.

High-Level Features

Efficient Data Collection: Users can configure collectors on the DataBee Platform and install them on their on-premises machines. A collector gathers diverse data, applies filters, and adds metadata before sending it to the DataBee Receiver.

Remote Configuration: Users can remotely manage and configure collector(s) on the DataBee Platform.

Centralized Receiver: DataBee Receiver receives and securely stores data received from multiple Collectors.

Compliance and Security: Ensures adherence to compliance standards and robust security measures.

Reliability and Monitoring: Tolerates intermittent network issues and supports monitoring through Datadog dashboards.

This guide will walk you through the step-by-step procedure to set up and establish a robust link between your on-premises system and DataBee.

Understanding Terminologies

Fluent Bit

High-performance On-Prem Collector for logs, metrics, and traces, emphasizing lightweight operation and minimal memory usage.

Configuration Adapter

This service acts as a bridge for Fluent Bit configuration management: it retrieves the latest configurations from the platform, adapts them for Fluent Bit compatibility, and sends acknowledgments back to the platform.

System Monitor

This service periodically checks the collector's health, gathering metrics like CPU usage, records processed per source, storage details, and uptime. The collected data is then transmitted to the DataBee platform's monitoring endpoint.

Encryption and Security Best Practices

  • All the network communication occurs over a secure channel. This ensures that any communication between the services and external systems or services is done using secure protocols, such as HTTPS. For example, the data sent by the collector and collector services communicating with the DataBee platform is encrypted by TLS.

  • No sensitive data is stored in the logs.

Getting Started

Prerequisites

  • The host must have network connectivity to the DataBee platform.

  • Root / Administrator privileges on the system where the data collector is to be installed.

  • On Windows systems, PowerShell 7.3 or later is required.

System Requirements

Recommended System Resources

| Memory | CPU | Disk |
|---|---|---|
| 4GB | 4 | Available space 10 GB |

Note

The storage buffer size for the data collector is configured to a limit of 4GB. In case of network disruptions, the data collector will accumulate the latest data up to 4GB in the configured file storage. Upon resolution of the network disruptions, it will resume transmitting the buffered data from the file storage.

Supported Platforms

| OS | Version | Architecture |
|---|---|---|
| Ubuntu | 22.04 LTS (Jammy Jellyfish) | amd64 (x86_64), arm64 |
| RHEL | 8.8 | amd64 (x86_64), arm64 |
| Windows Server | WS 2022 LTSC (Standard Edition) | x86_64 (64 bit) |

Note

The collector can also run on other Ubuntu/RHEL versions, but the installer will show the warning “OS <current os> is not officially supported. Hence, this might impact installation and cause issues”. Using unsupported versions is therefore not recommended.

For RHEL operating system, make sure to subscribe using:

subscription-manager register --username <username> --password <password> --auto-attach

The table below lists the maximum events per second (EPS) the data collector supports for each log source, given the average message size shown and the recommended system resources.

| Log Source | EPS | Average Message Size |
|---|---|---|
| Syslog | ~16K | 1KB |
| TCP | ~38K | 250B |
| Windows | ~1.6K | 600B (1 Windows Channel) |
| Flat File | ~18K | ~1KB (with 10 Files, total 5GB of static data) |

Note

  • Flat File data source supports up to 5 GB of static data.

  • SSD is recommended for optimal performance.

  • CPU usage will increase in correlation with the number of files to be monitored, and the total data size. Please plan your system resources accordingly.
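As a rough planning aid, the EPS figures above can be converted into sustained throughput and into the time the 4GB storage buffer would take to fill during a network outage. The sketch below is illustrative arithmetic only. It assumes 1KB = 1024 bytes and 4GB = 4 × 1024³ bytes, treats the approximate EPS values as exact, and assumes buffered records occupy the same space as the raw events; none of these assumptions is stated by the product documentation.

```python
# Rough capacity arithmetic based on the EPS table above.
# Assumptions (not from the product documentation): 1KB = 1024 bytes,
# 4GB = 4 * 1024**3 bytes, and buffered data takes the same space as raw events.

BUFFER_BYTES = 4 * 1024**3  # storage buffer limit stated in the System Requirements note

# (events per second, average message size in bytes) per log source
SOURCES = {
    "Syslog": (16_000, 1024),
    "TCP": (38_000, 250),
    "Windows": (1_600, 600),
    "Flat File": (18_000, 1024),
}


def throughput_bytes_per_sec(eps: int, avg_size: int) -> int:
    """Sustained ingest rate in bytes per second."""
    return eps * avg_size


def seconds_to_fill_buffer(eps: int, avg_size: int) -> float:
    """Approximate time until the buffer is full if nothing can be forwarded."""
    return BUFFER_BYTES / throughput_bytes_per_sec(eps, avg_size)


if __name__ == "__main__":
    for name, (eps, size) in SOURCES.items():
        rate_mib = throughput_bytes_per_sec(eps, size) / 1024**2
        minutes = seconds_to_fill_buffer(eps, size) / 60
        print(f"{name}: {rate_mib:.1f} MiB/s, buffer full after ~{minutes:.1f} min")
```

Under these assumptions, a collector running at the peak Syslog rate would fill the buffer in a little over four minutes, so longer outages at that rate would exceed what the buffer can hold.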

Configure Data Collector in DataBee

To configure your data collector in DataBee, follow these steps.

Click on the settings icon at the top right corner of the UI. From the dropdown menu, select System.

data_collector_config_1

From the left sidebar, select Data Collectors. The page displays all the data collectors configured until now. To create a new data collector, scroll to the bottom of the page and click on Add Data Collector.

To set up your data collector, follow the flowchart displayed on the right, for a visual guide. It outlines the step-by-step configuration process.

Step 1: Basic Information

Enter the data collector details in the fields provided.

  • Collector Name: a name for your data collector

  • OS: the operating system used, such as Linux or Windows

If you wish to enable the proxy functionality, check the Enable Proxy checkbox.

  • Proxy URL: HTTP URL or IP used while connecting to DataBee platform

  • Proxy Username: the proxy username to be used for authentication 

  • Password: the password corresponding to the proxy username

Note

When configuring a proxy, make sure the details are accurate. After the proxy is added, the data collector picks up the change automatically and routes subsequent calls through the specified proxy. If the proxy malfunctions, the data collector may not function correctly, and the only way to change the proxy is then a manual update to the on-premises collector configuration. When you change the proxy details, the previous set of values must still be valid so that the collector can fetch the new settings.

Click Next to proceed to the next step.

data_collector_add_basic

Step 2: Installation Steps

Copy the installation command using Copy to clipboard and execute it in a terminal on your host machine. The installer prompts you for details such as the Tenant ID, Collector ID, and Receiver URL; copy each value from this page by clicking its Copy to clipboard button. To view the generated API key, click Show API key, and then copy it using Copy to clipboard.

  • Tenant ID: Unique ID of the tenant

  • Receiver URL: DataBee endpoint to forward the collected data to (Only HTTPS URL is supported)

  • Collector ID: Unique ID of the collector

  • API Key: API key to authenticate to DataBee Platform

Once you have copied the information, click Close.

data_collector_installation_new(1)

To manage your configured data collectors, follow these steps:

Navigate to the “Data Collectors” page. Locate the specific data collector you want to modify, and click on it. You can edit the basic information of the selected data collector.

To disable the data collector connection, simply click on Disable. If you wish to remove the data collector, click on Delete. Make any necessary changes and then click Update to save your modifications.

data_collector_update

Click on Installation Steps to view the installation command, Tenant ID, Receiver URL, Collector ID, and API key.

Click on Data Sources to view all the data sources relying on the selected data collector to ingest data.

Installing Data Collector in your system

| Log Type | Supported Collector Versions |
|---|---|
| Syslog | 0.2-20 and later |
| TCP | 0.3-x and later |
| Windows Event | 0.4-x and later |
| Flat File | 0.5-x |

To install the data collector along with the required dependencies and packages, follow the below steps:

Linux

Run the installation command copied from the DataBee platform in the terminal. On successful installation, you will see the following message on the terminal: Installation completed successfully.

Windows

Run the installation command copied from the DataBee platform in a PowerShell terminal opened as Administrator.

The image below shows sample collector configuration details provided during the installation (Windows):

On successful installation, you will see the following message on the terminal: Installation completed successfully.

TLS/SSL Support

You will be prompted to choose the default Distinguished Name (DN) parameters (displayed on the console) or manually provide the Distinguished Name (DN).

Linux

Generating certificates required for data sources using TLS support... 
NOTE: Self-signed certificates will be generated with following default fields under /opt/comcast-databee-collector/certs directory. 
Country Name: US 
State or Province Name: Colorado 
Locality Name: Centennial 
Organization Name: Comcast 
Organizational Unit Name: IT 
Common Name: ub22-50-2-121  
Email Address: test@gmail.com 

Would you like to generate certificates with above default fields? If no, enter 'n' to provide custom values for certificate fields. (y/N):

Windows

Generating certificates required for data sources using TLS support... 
NOTE: Self-signed certificates will be generated with following default fields under C:\Program Files\Comcast Databee Collector\certs directory. 
Country Name : US 
State or Province Name : Colorado 
Locality Name : Centennial 
Organization Name : Comcast  
Organizational Unit Name : IT 
Common Name : WIN-62M37L27NDE  
Email Address : test@gmail.com 

Would you like to generate certificates with above default fields? If no, enter 'n' to provide custom values for certificate fields. (y/n):

If you want to continue with the default parameters, press y. It will generate self-signed certificates with the above-mentioned Distinguished Name (DN) parameters. Upon successful generation of the certificate, the console will show the status mentioned below.

If you want to manually enter the Distinguished Name (DN) parameters, then you can give relevant values for all parameters.

Note:

When providing the DN parameters, make sure the CA certificate and the server certificate use different Common Names. For example, if the CN for the CA certificate is comcast.com, the CN for the server certificate can be test.comcast.com.
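The distinct Common Name requirement can be checked before the values are typed into the installer prompt. The sketch below is only an illustration: the helper names are hypothetical, the field values are the defaults shown in the console output above, and rendering the DN as an OpenSSL-style subject string is an assumption about tooling rather than something the installer is documented to expose.

```python
# Sketch: render DN fields as an OpenSSL-style subject string and verify that
# the CA and server certificates use different Common Names.
# Helper names are hypothetical; field values are the defaults shown above.

def subject_string(dn: dict) -> str:
    """Render DN fields in the form /C=US/ST=.../CN=..."""
    order = ["C", "ST", "L", "O", "OU", "CN", "emailAddress"]
    return "".join(f"/{key}={dn[key]}" for key in order if key in dn)


def check_distinct_cn(ca_dn: dict, server_dn: dict) -> None:
    """Raise ValueError when both certificates share the same Common Name."""
    if ca_dn["CN"].strip().lower() == server_dn["CN"].strip().lower():
        raise ValueError("CA and server certificates must use different Common Names")


base = {"C": "US", "ST": "Colorado", "L": "Centennial", "O": "Comcast", "OU": "IT"}
ca_dn = {**base, "CN": "comcast.com"}
server_dn = {**base, "CN": "test.comcast.com"}

check_distinct_cn(ca_dn, server_dn)  # passes: the two CNs differ
print(subject_string(server_dn))
```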

After the installation is complete, the default self-signed certificates will be generated at the location mentioned below.

Windows: C:\Program Files\Comcast Databee Collector\certs\

Linux: /opt/comcast-databee-collector/certs/

Note:

If you encounter any issues during the installation process, the script might exit with an error. In such scenarios, when you attempt to install again, you will be given a choice to resume the installation from the previously failed attempt with the following message:

Do you want to resume installation from the previously failed attempt? If not, any previous installation progress will be wiped out and installation will be restarted?

If you provide ‘y’, the installation will resume from the previously failed attempt.

If you provide ‘n’, data from the previously failed attempt will be wiped off and a fresh installation will begin.

Once the installation is completed successfully, the collector.yaml file is updated as per the user-provided details.

Users can configure other parameters in the collector.yaml (under /opt/comcast-databee-collector/conf in case of Linux and C:\Program Files\Comcast Databee Collector\conf in case of Windows) such as polling interval, logging related parameters, etc. as mentioned below:

| Parameter Name | Type | Description | Sample value |
|---|---|---|---|
| configadapter.conf-polling-interval | Integer | Polling interval in seconds for configuration updates. | 60 |
| monitor.metric-push-interval | Integer | Interval in seconds for pushing metrics. | 60 |
| fluentbit.flush | Integer | Time in seconds for Fluent Bit to flush records. | 5 |
| fluentbit.log-level | String | Logging level for Fluent Bit. Options: off, error, warn, info, debug, trace | info |
| fluentbit.port | Integer | Port number for Fluent Bit. | 2020 |
| global.api-key | String | API key for authentication. | f02a2228-ed5f-40db-b4d1-e71bfa2aa542 |
| global.collector-id | String | Identifier for the data collector. | 5d2af5e7-d9cd-4f59-bfce-47f08b6d340c |
| global.tenant-id | String | Identifier for the tenant. | testtenant |
| global.receiver-url | String | URL of receiver endpoint | https://testhost.com |
| log.encoding | String | Encoding format for log messages (e.g., console). | console |
| log.level | String | Log level for application logging (e.g., INFO). | INFO (also WARN, ERROR, DEBUG) |
| log.rotator.maxSize | Integer | Max size in MB before the log is rotated | 100 |
| log.rotator.maxBackups | Integer | Max number of old log files to keep | 10 |
| log.rotator.maxAge | Integer | Max age in days to retain log files | 10 |
| log.rotator.compress | Boolean | Compress/zip old log files | true |

Sample Collector YAML

configadapter:
  conf-polling-interval: 60
monitor:
  metric-push-interval: 60
fluentbit:
  flush: 5
  log-level: info
  port: 2020
global:
  api-key: f02a2228-ed5f-40db-b4d1-e71bfa2aa542
  collector-id: 5d2af5e7-d9cd-4f59-bfce-47f08b6d340c
  tenant-id: testtenant
  receiver-url: https://testhost.com
log:
  encoding: console
  level: INFO
  rotator:
    maxSize: 100
    maxBackups: 10
    maxAge: 10
    compress: true
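Before restarting the services after a manual edit, it can help to sanity-check the values against the parameter table. The sketch below is a hypothetical helper, not part of the collector: it works on a dictionary with the same shape as the sample YAML (loading the file is left to whichever YAML parser you use), and it assumes that interval and port values must be positive integers and that the listed log levels are the only accepted ones.

```python
# Sketch: sanity-check collector.yaml values once they are loaded into a dict.
# The function name and the "positive integer" rule are assumptions; the level
# lists and the HTTPS-only receiver URL come from the documentation above.
from urllib.parse import urlparse

FLUENTBIT_LEVELS = {"off", "error", "warn", "info", "debug", "trace"}
APP_LEVELS = {"INFO", "WARN", "ERROR", "DEBUG"}
POSITIVE_INT_KEYS = [
    ("configadapter", "conf-polling-interval"),
    ("monitor", "metric-push-interval"),
    ("fluentbit", "flush"),
    ("fluentbit", "port"),
]


def validate_collector_config(cfg: dict) -> list:
    """Return a list of problems; an empty list means no issue was found."""
    problems = []
    receiver = urlparse(str(cfg["global"]["receiver-url"]))
    if receiver.scheme != "https" or not receiver.netloc:
        problems.append("global.receiver-url must be an HTTPS URL")
    if cfg["fluentbit"]["log-level"] not in FLUENTBIT_LEVELS:
        problems.append("fluentbit.log-level is not a supported level")
    if cfg["log"]["level"] not in APP_LEVELS:
        problems.append("log.level is not a supported level")
    for section, key in POSITIVE_INT_KEYS:
        value = cfg[section][key]
        if not isinstance(value, int) or value <= 0:
            problems.append(f"{section}.{key} must be a positive integer")
    return problems


# Same values as the sample collector YAML above.
SAMPLE = {
    "configadapter": {"conf-polling-interval": 60},
    "monitor": {"metric-push-interval": 60},
    "fluentbit": {"flush": 5, "log-level": "info", "port": 2020},
    "global": {
        "api-key": "f02a2228-ed5f-40db-b4d1-e71bfa2aa542",
        "collector-id": "5d2af5e7-d9cd-4f59-bfce-47f08b6d340c",
        "tenant-id": "testtenant",
        "receiver-url": "https://testhost.com",
    },
    "log": {"encoding": "console", "level": "INFO"},
}

print(validate_collector_config(SAMPLE))
```

For the sample values the returned list is empty; switching receiver-url to an http:// address would produce one reported problem.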

Configure Data Feed with Data Collector

To configure your data feed with your data collector, follow the steps below.

Click on the Data button and select +Add New Data Source in DataBee UI. Choose your preferred data source from the list of available options. You will now be directed to choose your ingest method. To fetch data from your on-prem data collector, click on Data Collector.

data_feed_configure


You will now be redirected to the "Configure data source" page. Follow the flowchart displayed on the right, for a visual guide, outlining the step-by-step configuration process.

Step 1: Configure Data Source

Enter the data source details in the fields provided and choose a pre-configured data collector of your choice.

  • Data Source Name: a user-friendly name for the data source

  • Owner Name: the name of the point of contact for the data source

  • Owner E-mail: email address of the owner

  • Collector: list of active data collectors available

Once you have entered the required information, click Next to proceed to the next step.

data_collector_1

Step 2: Configure Inputs

Please enter the required data in the input fields provided below.

Syslog

  • Log Source: the type of log source. Select Syslog while configuring the syslog input

  • Format: the incoming Syslog data format, either syslog-rfc5424 or syslog-rfc3164

  • Mode: the server's communication protocol, UDP or TCP

  • Port: the listening TCP/UDP port used for receiving syslog data

  • Tags: the tag value(s) to be appended to the log to help identify the source log. It follows a key-value pair, and you can add multiple tags

data_collector_syslog
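To verify a Syslog input once it is configured, a test message in the syslog-rfc5424 format can be sent to the collector's listening port. The sketch below builds such a message and defines a UDP sender. COLLECTOR_HOST and COLLECTOR_PORT are placeholders for your collector host and the value of the Port field, and the function names are hypothetical.

```python
# Sketch: format an RFC 5424 syslog message and (optionally) send it over UDP.
# COLLECTOR_HOST / COLLECTOR_PORT are placeholders, not product defaults.
import socket
from datetime import datetime, timezone

COLLECTOR_HOST = "127.0.0.1"  # hypothetical collector address
COLLECTOR_PORT = 5140         # hypothetical; use the Port configured for the input


def rfc5424_message(facility, severity, hostname, app, msg, when=None):
    """Build '<PRI>1 TIMESTAMP HOSTNAME APP-NAME PROCID MSGID STRUCTURED-DATA MSG'."""
    pri = facility * 8 + severity  # PRI value as defined by RFC 5424
    timestamp = (when or datetime.now(timezone.utc)).isoformat(timespec="milliseconds")
    # PROCID, MSGID and STRUCTURED-DATA are left as the nil value "-".
    return f"<{pri}>1 {timestamp} {hostname} {app} - - - {msg}"


def send_udp(message, host=COLLECTOR_HOST, port=COLLECTOR_PORT):
    """Send one syslog message as a single UDP datagram."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(message.encode("utf-8"), (host, port))


# facility 16 (local0), severity 6 (informational)
message = rfc5424_message(16, 6, "host1", "demo", "test event from sketch")
print(message)
```

Calling send_udp(message) emits the datagram; this applies when the input's Mode is UDP, and a TCP-mode input would need a stream socket instead.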

TCP

  • Log source: the type of log source. Select TCP while configuring the TCP input

  • Format: the incoming data format, e.g., cef, leef, json, or other

  • Port: the listening TCP port used for receiving data. This port must be opened up on the collector VM manually by the user

  • Enable TLS: enable the toggle for secure TCP communication (optional)

  • Tags: the tag value(s) to be appended to the log to help identify the source log. It follows a key-value pair, and you can add multiple tags

When the Enable TLS toggle is enabled, the fields for the server certificate, server key, and CA certificate are displayed. These fields are auto-populated with the default certificate/key paths based on the Data Collector OS. You can replace these paths if you want to provide your own TLS certificates.

Server Certificate Path

The default server certificate path will be auto-populated in the UI. If you want to configure your own certificate, you can give a server certificate path in this field. The default path will be as mentioned below.

Windows: C:\Program Files\Comcast Databee Collector\certs\server-cert.pem

Linux: /opt/comcast-databee-collector/certs/server-cert.pem

Server Private Key Path

The default server private key path will be auto-populated in the UI. If you want to configure your own certificate, you can give a server private key path in this field. The default path will be as mentioned below.

Windows: C:\Program Files\Comcast Databee Collector\certs\server-key.pem

Linux: /opt/comcast-databee-collector/certs/server-key.pem

CA Certificate Path

The default CA certificate path will be auto-populated in the UI. If you want to configure your own certificate, you can give a CA certificate path in this field. The default path will be as mentioned below.

Windows: C:\Program Files\Comcast Databee Collector\certs\ca-cert.pem

Linux: /opt/comcast-databee-collector/certs/ca-cert.pem

For TLS communication, the following key algorithms and signature hashes are supported for the CA and server certificates:

  1. EC with pkeyopt = ec_paramgen_curve:prime256v1

  2. EC with pkeyopt = ec_paramgen_curve:secp521r1

  3. RSA

  4. Ed25519

  5. SHA 256/384/512

Certificate files with the .pem and .crt extensions are supported with the algorithms listed above.
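On the sending side, a client of a TLS-enabled TCP input needs to trust the CA certificate configured above. The sketch below shows one way a sender could do this with Python's standard library. It is not part of the collector: the function names are hypothetical, newline-delimited JSON framing is an assumption for the json format, and the CA certificate path refers to a copy of ca-cert.pem that you have transferred to the sending machine.

```python
# Sketch: send one JSON event to a TLS-enabled TCP input.
# Function names are hypothetical; newline-delimited framing is an assumption.
import json
import socket
import ssl


def make_tls_context(ca_cert_path=None):
    """Client-side TLS context that verifies the collector's server certificate."""
    # With ca_cert_path=None the system trust store is used instead of a custom CA.
    return ssl.create_default_context(cafile=ca_cert_path)


def encode_event(event: dict) -> bytes:
    """Serialize one event as a single newline-terminated JSON line."""
    return (json.dumps(event, separators=(",", ":")) + "\n").encode("utf-8")


def send_event(host: str, port: int, event: dict, context: ssl.SSLContext) -> None:
    """Open a TLS connection to the collector and send one event."""
    with socket.create_connection((host, port), timeout=5) as raw:
        # server_hostname must match a name in the server certificate.
        with context.wrap_socket(raw, server_hostname=host) as tls:
            tls.sendall(encode_event(event))


print(encode_event({"message": "test event", "source": "sketch"}))
```

A call such as send_event("collector.example.com", 6514, {...}, make_tls_context("ca-cert.pem")) uses a placeholder host and port. Because hostname verification is enabled by default, the name used to connect has to match the server certificate.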

After entering the details, click Next to proceed to the next step in the configuration process.

Windows Event

For Windows Events Collection, please refer to the following guide before installing the data collector: Deployment Guide for Windows Events Collection

  • Log Source: the type of log source. Select Windows Event while configuring the Windows Event input

  • Refresh Interval (seconds): the polling interval for each specified channel (in seconds). The default value is 1 second to achieve optimum performance in terms of EPS. The available options are 1, 5, 10, and 20

  • Channels: names of channels from which the data collector will be fetching events. (Only administrative and operational types of channels are supported.)

    Steps to fetch Channel name from Event Viewer:

    1. Login to your Windows machine and open Event Viewer.

    2. Right-click on the channel and click on Properties.

    3. Copy the value from the field 'Full Name' and paste it on the channels dropdown on DataBee UI.

  • Read Historical Event: enable the Read Historical Events checkbox if all existing events need to be collected. It is disabled by default.

    Enabling this option might result in duplicate data ingestion on the platform.

    Note: Historical event collection from forwarder machines to the central collector machines (domain controller) is not configurable from the DataBee UI.

  • Query: Optionally, you can provide the Query (in XPath or XML format) to filter events based on Event ID, time range, etc.

    You can directly copy the XML query from the Event Viewer using the following steps on the Windows machine:

    1. Open the Channel in the Event Viewer.

    2. On the right-hand side, under the Actions pane, click on Filter Current Log….

    3. You can choose the relevant filters and then click on the XML tab.

    4. Copy the query and paste it into the ‘Query’ field on the DataBee UI.

  • Tags: the tag value(s) to be appended to the log to help identify the source log. It follows a key-value pair, and you can add multiple tags
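For the Query field described above, the XML that Event Viewer produces has a QueryList/Query/Select structure. The sketch below generates such a query for a set of Event IDs, which can be handy when the same filter is needed for several channels. The function name is hypothetical, and the channel and Event IDs are examples only.

```python
# Sketch: generate an Event Viewer style XML query for one channel.
# The function name is hypothetical; channel and Event IDs are examples.
import xml.etree.ElementTree as ET


def build_event_query(channel: str, event_ids: list) -> str:
    """Return a QueryList XML string selecting the given Event IDs from a channel."""
    condition = " or ".join(f"EventID={event_id}" for event_id in event_ids)
    query_list = ET.Element("QueryList")
    query = ET.SubElement(query_list, "Query", Id="0", Path=channel)
    select = ET.SubElement(query, "Select", Path=channel)
    # XPath filter on the System section of each event record.
    select.text = f"*[System[({condition})]]"
    return ET.tostring(query_list, encoding="unicode")


print(build_event_query("Security", [4624, 4625]))
```

It is worth comparing the generated text with what Event Viewer shows on its XML tab for the same filter before pasting it into the DataBee UI.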

Flat File

  • Log Source: it defines the type of log source. Select Flat File

  • Refresh Interval (minutes): the interval (in minutes) of refreshing the list of watched files. The default value is 1. The available options are 1, 5, and 10.

  • Source Files: a list of source files to be monitored. Accepts wildcard patterns.

    Examples:

    • The /dc/logs/t?/*.log pattern monitors all .log files inside directories under /dc/logs whose names consist of ‘t’ followed by exactly one character.

    • The /var/log/*/*.log pattern monitors all files with the .log extension in the subdirectories of /var/log (one nested level).

    The data collector keeps track of monitored files and offsets.

    Note

    • The data collector does not support multiline reading from files. It reads every file matched by the Source Files patterns and ingests one event for each new line, i.e., each piece of text terminated by a newline character (\n). Hence, a JSON record must be contained in a single line for proper ingestion; a JSON body that spans multiple lines is not supported.

    • File rotation is handled properly. Note that the patterns provided in the Source Files field must not match the rotated files; otherwise, a rotated file would be read again, leading to duplicate records. It is recommended to configure Exclusion Files accordingly.

    • If the data contained in a single line exceeds 512k, the file is dropped from the monitoring list and its data is not ingested.

  • Exclusion Files: a list of files to be excluded. Accepts wildcard patterns. For example /*.gz or /*.zip.

  • Tags: the tag value(s) to be appended to the log to help identify the source log. It follows a key-value pair, and you can add multiple tags
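To preview which files a combination of Source Files and Exclusion Files patterns would select, the patterns can be tried against sample paths. The sketch below uses Python's pathlib matching as a stand-in; the collector's own matching may differ in details, the function name is hypothetical, and the exclusion pattern is an example rather than a recommended value.

```python
# Sketch: preview Source Files / Exclusion Files patterns against sample paths.
# pathlib matching is a stand-in for the collector's own wildcard handling.
from pathlib import PurePosixPath

SOURCE_PATTERNS = ["/dc/logs/t?/*.log", "/var/log/*/*.log"]  # examples from above
EXCLUSION_PATTERNS = ["archive/*.log"]                        # hypothetical example


def is_monitored(path: str, sources: list, exclusions: list) -> bool:
    """True when the path matches a source pattern and no exclusion pattern."""
    candidate = PurePosixPath(path)
    included = any(candidate.match(pattern) for pattern in sources)
    excluded = any(candidate.match(pattern) for pattern in exclusions)
    return included and not excluded


for sample in [
    "/dc/logs/t1/app.log",            # 't' plus one character
    "/dc/logs/test/app.log",          # directory name too long for 't?'
    "/var/log/nginx/access.log",      # one nested level
    "/var/log/nginx/old/access.log",  # two nested levels
    "/var/log/archive/old.log",       # matches a source pattern but is excluded
]:
    print(sample, is_monitored(sample, SOURCE_PATTERNS, EXCLUSION_PATTERNS))
```

In this stand-in, a wildcard never crosses a directory separator, which is why the path nested two levels deep is not selected.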

Remote File Log Collection

The data collector does not natively support collecting log files from remote systems. To ingest such logs, transfer the log files from the remote systems to the data collector's host machine.

Please follow these preliminary steps before setting up your data source:

  1. Refer to the Mounting Guide: Remote Log Files Collection on Data Collector for detailed instructions on attaching an external drive to the data collector.

  2. Ensure that the chosen disk has adequate capacity for your log volume needs. For example, select a 1TB drive if you anticipate storing logs from remote machines of that volume.

  3. You can send the logs from remote machines to the newly mounted storage drive on the data collector.

Step 3: Configure Filters

You have the option to filter data based on specific keywords, either through inclusion or exclusion. Here's how you can set it up:

  • Inclusion Filter: from the Filters dropdown list, choose Inclusion. Input the filter value. The collector will include only those records whose message key contains the specified keyword you have entered.

  • Exclusion Filter: from the Filters dropdown list, opt for the filter type Exclusion. Input the filter value. The collector will exclude records with message keys containing the specified keyword you have entered.

  • Multiple Filters: when two or more filters are present, the AND condition applies between Inclusion filters, and the OR condition applies between Exclusion filters.

Note:

Filters will only be applied to the 'message' key in the syslog messages.
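The filter rules described above can be sketched as a small predicate. This is an illustration of the documented AND/OR behaviour, not the collector's implementation: the function name is hypothetical, matching is assumed to be a case-sensitive substring test, and a record is assumed to need to pass both the inclusion and the exclusion checks when both kinds of filter are configured.

```python
# Sketch of the documented filter semantics on the 'message' key.
# Assumptions: case-sensitive substring matching; a record must satisfy the
# inclusion filters AND avoid every exclusion filter to be kept.

def passes_filters(record: dict, inclusion: list, exclusion: list) -> bool:
    """Decide whether a record would be forwarded under the given filters."""
    message = str(record.get("message", ""))
    # Inclusion filters are combined with AND: every keyword must be present.
    if not all(keyword in message for keyword in inclusion):
        return False
    # Exclusion filters are combined with OR: any single keyword drops the record.
    if any(keyword in message for keyword in exclusion):
        return False
    return True


records = [
    {"message": "sshd: login failed for user admin"},
    {"message": "sshd: login succeeded for user admin"},
    {"message": "sshd: debug login failed trace"},
]
for record in records:
    print(passes_filters(record, ["sshd", "failed"], ["debug"]), record["message"])
```

With inclusion keywords sshd and failed and the exclusion keyword debug, only the first record is kept.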

To delete the inclusion/exclusion filters, click on the trash icon.

Click Submit to finalize and complete the configuration process.

data_collector_3_filters

Management of services

The management script helps you manage all the collector-related services, i.e., start, stop, and view the services' status. Use the following commands to manage the services:

Linux

Start the services:

/opt/comcast-databee-collector/collector.sh start

Stop the services:

/opt/comcast-databee-collector/collector.sh stop

Check the status of the services:

/opt/comcast-databee-collector/collector.sh status

Print the collector version:

/opt/comcast-databee-collector/collector.sh version

Generate the self-signed certificates to enable TLS support:

This is only supported by 0.3-x and later collector versions.

/opt/comcast-databee-collector/collector.sh generate_certs

When you run this command and the default certificates are not expired, it will prompt whether you still want to generate self-signed certificates or not. If you press ‘N’, it will not generate new certificates.

If you provide ‘y’, it will ask whether to use default Distinguished Name (DN) parameters. You have to follow the same steps mentioned above in the installation section.

Windows

Change the current directory:

cd "C:\Program Files\Comcast Databee Collector"

Start the services:

.\collector.ps1 start

Stop the services:

.\collector.ps1 stop

Check the status of the services:

.\collector.ps1 status

Print the collector version:

.\collector.ps1 version

Generate the self-signed certificates to enable TLS support:

This is only supported by collector versions > 0.2-20-8601dc8.

.\collector.ps1 generatecerts

When you run this command and the default certificates are not expired, it will prompt whether you still want to generate self-signed certificates or not. If you press ‘N’, it will not generate new certificates.

If you press y, it will ask whether to use default Distinguished Name (DN) parameters. You have to follow the same steps mentioned above in the installation section.

Upgrade

Linux

To upgrade your data collector, open the terminal and run the command that you copied from the DataBee platform. Refer to the sample command below.

Sample command:

bash -c "$(curl -L https://artifacts.us-east-1.databee.buzz/data-collector/HEAD/upgrade.sh)"

After the upgrade, verify the latest version using the command below.

/opt/comcast-databee-collector/collector.sh version

When the data collector is upgraded from version 0.2-20-8601dc8, the script prompts you for certificate generation. Follow the steps described in the TLS/SSL Support section.

Windows

Open your PowerShell terminal as Administrator. To upgrade your data collector, run the command that you copied from the DataBee platform. Refer to the sample command below.

Sample command:

Invoke-WebRequest -Uri "https://artifacts.us-east-1.databee.buzz/data-collector/HEAD/upgrade.ps1" -OutFile "upgrade.ps1" && .\upgrade.ps1

After the upgrade, verify the latest version using the command below.

. "C:\Program Files\Comcast Databee Collector\collector.ps1" version

When the data collector is upgraded from version 0.2-20-8601dc8, the script prompts you for certificate generation. Follow the steps described in the TLS/SSL Support section.

Uninstallation

Follow the steps below to clean up the installation directory, and logs, and to stop and uninstall all the collector services.

Linux

  1. Make sure you are the root user.

  2. Open the terminal.

  3. Grant executable permissions to the uninstaller, if required.

    chmod +x /opt/comcast-databee-collector/uninstall.sh

  4. Run the command below to uninstall the collector:

    /opt/comcast-databee-collector/uninstall.sh

Windows

To uninstall the collector, run the command below on PowerShell as Administrator.

. "C:\Program Files\Comcast Databee Collector\uninstall.ps1"

Note:

Make sure your current working directory is not C:\Program Files\Comcast Databee Collector when you run this command. Otherwise, PowerShell treats the installation directory as in use and does not remove it.

