Splunk Data Volume Reduction Techniques

  • by Helge Klein
  • January 24, 2023

As a product built from day one with Splunk as the primary backend platform, uberAgent is heavily optimized to generate as little Splunk data volume as possible. This optimization keeps data ingest costs in check and allows our customers to collect logs and metrics from hundreds of thousands of Windows and macOS endpoints. In this post, we explain the data volume reduction techniques we’ve developed.

Avoid Chatty Data Formats

As a big data product that is (typically) licensed by ingest data volume, Splunk is naturally flexible in the data formats it can process. In addition to application logs, Splunk happily ingests JSON, key-value (KV) data, and even XML. But those formats should be avoided wherever possible because they’re bloated: every single event carries not only the field data (which you want) but also the field names (which you don’t want to pay for again and again).

Use CSV Instead of JSON, KV, or XML

The format that should be used instead of JSON, KV, or XML is CSV (comma-separated values). CSV data rows only contain actual field data, no field names.
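
To make the difference tangible, the following Python sketch serializes one event as JSON, KV, and CSV and prints the size of each variant. The field names and values are invented for this comparison; they are not uberAgent’s actual fields.

```python
import json

# Hypothetical event; names and values are made up for this comparison.
event = {"SessionUser": "jdoe", "SessionCPUPercent": 12.5, "SessionRAMMB": 2048}

# Compact JSON without any optional whitespace.
as_json = json.dumps(event, separators=(",", ":"))
# Space-separated key=value pairs.
as_kv = " ".join(f"{key}={value}" for key, value in event.items())
# Headerless CSV: values only, in a fixed order.
as_csv = ",".join(str(value) for value in event.values())

for label, text in (("JSON", as_json), ("KV", as_kv), ("CSV", as_csv)):
    print(f"{label:4} {len(text.encode('utf-8')):3} bytes  {text}")
```

The shorter the values are relative to the field names, the larger the savings from dropping the names.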

Get Rid of the CSV Header

Of course, you need to get the field names into Splunk – somehow. With CSV, this is typically accomplished by adding a header row. That works well and is efficient enough if you have many data rows per header, because the header row’s overhead is then spread across all of them. If, on the other hand, you’re mostly sending single-row data to Splunk, every row needs its own header, and CSV is no more efficient than JSON unless you optimize further.

To get rid of the CSV header row, we moved the field definitions from the data to the configuration. The following example demonstrates how to do that with uberAgent’s SessionDetail sourcetype.

Add the following to your app’s props.conf file:

[uberAgent:Session:SessionDetail]
KV_MODE = none
REPORT-class_uberagent_session_sessiondetail = extract_uberagent_session_sessiondetail

Add the following to your app’s transforms.conf file:

[extract_uberagent_session_sessiondetail]
DELIMS = ","
FIELDS = Fieldname1,Fieldname2,Fieldname3,...

With the above configuration, Splunk knows the field names for the uberAgent:Session:SessionDetail sourcetype without ingesting a field name even once.
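
Conceptually, this kind of extraction pairs each value with a field name purely by position. The Python sketch below illustrates the idea only; it is not how Splunk implements the extraction internally, and the field names are hypothetical stand-ins for the FIELDS list.

```python
import csv
import io

# Stand-in for the FIELDS list in transforms.conf; these names are hypothetical.
FIELDS = ["SessionUser", "SessionCPUPercent", "SessionRAMMB"]

def extract_fields(raw_event: str) -> dict:
    """Pair the values of one headerless CSV event with the configured field names."""
    values = next(csv.reader(io.StringIO(raw_event)))
    return dict(zip(FIELDS, values))

print(extract_fields("jdoe,12.5,2048"))
```

Because the mapping is positional, the sender has to emit the values in exactly the order of the configured field list.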

Combine Multiple Similar Events

Application logs typically contain one line per event. A web server, for example, logs every request as an independent log message. Since you’re reading this article, you’ve probably used Splunk to collect such logs with the help of the Universal Forwarder. Don’t. It’s incredibly inefficient.

Many log messages are almost redundant. Consider the web server example: when a browser loads a web page, it requests dozens of resources from the web server. The web server diligently writes one log message per requested resource. These log messages, however, often differ only in the name of the requested resource. If you’re collecting and forwarding the web server’s log messages as they’re created, you’re sending a lot of redundant data to Splunk.

If you don’t need the detail that differs between the individual requests of a page load (e.g., the file name of each requested resource), you can combine many similar events into a single aggregate event. Add a request count field and potentially another field for the total duration of the combined requests, and you have all the data that matters at a fraction of the cost.

With uberAgent, we’re employing this technique for multiple metrics, e.g., browser web app monitoring or DNS query monitoring.
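
As a rough illustration of this kind of aggregation (a sketch, not uberAgent’s implementation), the following Python code collapses per-request records that share host and page into one event with a request count and a total duration. The record layout and sample values are assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical per-request records: (host, page, resource, duration_ms).
requests = [
    ("example.com", "/home", "index.html", 40.0),
    ("example.com", "/home", "style.css", 12.5),
    ("example.com", "/home", "logo.png", 20.0),
    ("example.com", "/about", "about.html", 35.0),
]

def aggregate(records):
    """Collapse requests that share host and page into one event each."""
    totals = defaultdict(lambda: {"count": 0, "duration_ms": 0.0})
    for host, page, _resource, duration_ms in records:
        entry = totals[(host, page)]
        entry["count"] += 1
        entry["duration_ms"] += duration_ms
    return [
        (host, page, entry["count"], entry["duration_ms"])
        for (host, page), entry in totals.items()
    ]

# Emit one headerless CSV row per aggregate event.
for event in aggregate(requests):
    print(",".join(str(value) for value in event))
```

Four request records turn into two aggregate events here; the resource names are the only detail that is lost.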

Replace Enum Values With IDs + Lookup

You’ll often find that you have fields whose values are from a limited range of predefined values. Developers call such data types enums. Instead of adding the enum values to your events over and over again, replace them with IDs, adding back the descriptive names in Splunk with lookups. Consider the following example:

With uberAgent’s browser web app monitoring, we have a field for the browser’s name. Instead of specifying the browser name as a string, however, we use IDs: 1 stands for Google Chrome, 2 for Internet Explorer, 3 for Mozilla Firefox, and so on. This saves a lot of data volume, as a single-digit ID only “costs” a single byte, whereas a name like Internet Explorer takes 17 bytes.

In Splunk, we add the browser’s name to the event with an automatic lookup configured as follows.

props.conf:

[uberAgent:Application:BrowserWebRequests2]
LOOKUP-browser_browsers = lookup_browser_browsers Browser OUTPUTNEW BrowserDisplayName

transforms.conf:

[lookup_browser_browsers]
filename = browser_browsers.csv

Lookup file browser_browsers.csv:

Browser,BrowserDisplayName
1,Chrome
2,Internet Explorer
3,Firefox
4,Edge
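
On the sending side, the same mapping could be kept in an enum so that events carry the ID while the code stays readable. The Python sketch below is a hypothetical illustration: the IDs mirror the lookup file above, but the event layout (browser ID, host, request count) is an assumption made for this example.

```python
from enum import IntEnum

class Browser(IntEnum):
    """IDs mirror the Browser column of browser_browsers.csv."""
    CHROME = 1
    INTERNET_EXPLORER = 2
    FIREFOX = 3
    EDGE = 4

def encode_event(browser: Browser, host: str, request_count: int) -> str:
    """Emit the numeric ID rather than the browser's display name."""
    return f"{int(browser)},{host},{request_count}"

print(encode_event(Browser.FIREFOX, "example.com", 17))
```

New browsers should get new IDs; reusing or renumbering existing IDs would change the meaning of events that are already indexed.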

Floating Point Number Optimizations

If your data contains numeric fields, use only the decimal places you need. For example, if a field contains a duration in milliseconds, a single decimal place might be enough, and all further digits can be stripped off.

Trim Trailing Zeros

A nice little optimization is to strip off trailing zeros after the decimal point. In the above example, numbers ending in .0 can be shortened: remove the decimal point and the zero, saving two bytes. A number like 15.0 (4 bytes) would be shortened to 15 (2 bytes), cutting the data volume in half.
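
Both steps, limiting the decimal places and trimming trailing zeros, fit into one small formatting helper. The following Python function is a minimal sketch; its name and the default of one decimal place are choices made for this example.

```python
def compact_number(value: float, decimals: int = 1) -> str:
    """Round to `decimals` places, then drop trailing zeros and a dangling point."""
    text = f"{value:.{decimals}f}"
    if "." in text:
        # Only strip zeros that follow a decimal point, so 100 stays 100.
        text = text.rstrip("0").rstrip(".")
    # Rounding a small negative value can yield "-0"; normalize it to "0".
    return "0" if text == "-0" else text

for sample in (15.0, 12.34, 100.0):
    print(sample, "->", compact_number(sample))
```

The check for the decimal point matters: without it, stripping zeros would also shorten integers such as 100.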

Use Calculated Fields

Remove fields that contain data that can be calculated from other fields in the event. Consider the following example from uberAgent’s NetworkTargetPerformance metric.

The events generated by uberAgent include one field for the send data volume (as measured by uberAgent on the endpoint – not to be confused with the Splunk ingest data volume) and another for the receive data volume.

To facilitate searching and dashboard creation, we added a combined field for the total (send + receive) data volume in Splunk. In this particular case, the calculated field is defined in one of uberAgent’s Splunk data models as follows:

{
   "outputFields": [
      {
         "fieldName": "NetTargetSendReceiveMB",
         "owner": "uberAgentUXM_Process",
         "type": "number",
         "fieldSearch": "",
         "required": false,
         "multivalue": false,
         "hidden": false,
         "editable": true,
         "displayName": "NetTargetSendReceiveMB",
         "comment": ""
      }
   ],
   "calculationID": "jbzaokyus6y9zfr",
   "owner": "uberAgentUXM_Process",
   "editable": true,
   "comment": "",
   "calculationType": "Eval",
   "expression": "NetTargetSendMB+NetTargetReceiveMB"
}

Calculated fields need not be defined in a data model, however. They can just as well be configured in props.conf, e.g., as follows:

[uberAgent:Process:NetworkTargetPerformance]
EVAL-NetTargetSendReceiveMB = NetTargetSendMB+NetTargetReceiveMB
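
Seen from the sender’s perspective, the idea is simply to leave the derived value out of the event. The Python sketch below is a hypothetical illustration: the two-field event layout and the sample values are assumptions, and the second function merely mimics the addition that the EVAL statement performs in Splunk.

```python
# Hypothetical measurements in MB, named after the fields in the example above.
measurement = {"NetTargetSendMB": 1.5, "NetTargetReceiveMB": 20.25}

def to_event(values: dict) -> str:
    """Serialize only the base fields; the total is left to Splunk's EVAL."""
    return f'{values["NetTargetSendMB"]},{values["NetTargetReceiveMB"]}'

def search_time_total(event: str) -> float:
    """Mimic EVAL-NetTargetSendReceiveMB = NetTargetSendMB+NetTargetReceiveMB."""
    send_mb, receive_mb = (float(part) for part in event.split(","))
    return send_mb + receive_mb

event = to_event(measurement)
print(event, "->", search_time_total(event))
```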

Data Collection Frequencies

Set Collection Intervals Depending on Rate of Change

Not all metrics are created equal. For some types of data, you may need a high resolution, meaning you must collect data frequently. This may be true for application performance data, which is collected by uberAgent every 30 seconds, by default.

Data that doesn’t change as often can be collected at lower frequencies. Machine inventory is a good example. In its default configuration, uberAgent only collects inventory data once per day.

Make Collection Frequencies Configurable

Every customer and use case is different. While some organizations may have requirements to collect data at high resolution, others might prioritize data volume costs and opt for lower collection frequencies.

uberAgent accounts for individual requirements and prioritizations by allowing the data collection intervals to be configured individually for each metric.

To facilitate getting started, we provide a configuration optimized for data volume in addition to the default configuration.
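
How per-metric intervals might be evaluated is sketched in the following Python code. It is not uberAgent’s scheduler: the metric names, the dictionary-based configuration, and the 120-second override are assumptions made for this example, while the 30-second and once-per-day defaults are the values mentioned above.

```python
# Hypothetical interval settings in seconds; a deployment focused on low data
# volume could override individual entries with larger values.
DEFAULT_INTERVALS = {"ApplicationPerformance": 30, "MachineInventory": 86_400}

def due_metrics(intervals: dict, last_run: dict, now: float) -> list:
    """Return the metrics whose interval has elapsed since their last collection."""
    return [
        name
        for name, interval in intervals.items()
        # Metrics that have never run are treated as due immediately.
        if now - last_run.get(name, float("-inf")) >= interval
    ]

last_run = {"ApplicationPerformance": 0.0, "MachineInventory": 0.0}
print(due_metrics(DEFAULT_INTERVALS, last_run, now=45.0))

# Overriding one interval, as a volume-optimized configuration might do.
low_volume = {**DEFAULT_INTERVALS, "ApplicationPerformance": 120}
print(due_metrics(low_volume, last_run, now=45.0))
```

Keeping the intervals in data rather than in code is what makes it possible to ship more than one configuration for the same agent.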

About uberAgent

uberAgent is an innovative Windows and macOS user experience monitoring (UXM) and endpoint security analytics (ESA) product.

uberAgent UXM highlights include detailed information about boot and logon duration, application unresponsiveness detection, network reliability drill-downs, process startup duration, application usage metering, browser performance, web app metrics, and Citrix insights. All these varied aspects of system performance and reliability are smartly brought together in the Experience Score dashboard.

uberAgent ESA excels with a sophisticated activity monitoring engine, the uAQL query language, detection of risky activity, DNS query monitoring, hash calculation, registry monitoring, and Authenticode signature verification. uberAgent ESA comes with Sysmon and Sigma rule converters, a graphical rule editor, and uses a simple yet powerful query language instead of XML.

About vast limits

vast limits GmbH is the company behind uberAgent, the innovative user experience monitoring and endpoint security analytics product. vast limits’ customer list includes organizations from industries like finance, healthcare, professional services, and education, ranging from medium-sized businesses to global enterprises. vast limits’ network of qualified solution partners ensures best-in-class service and support anywhere in the world.
