uberAgent Quality – How We Monitor Our Monitoring Product
Software quality assurance is an iterative process that never ends. This article describes the internal procedures and optimizations through which we continuously improve the quality of uberAgent.
Software quality depends on many factors. The following list attempts to compile some of the more important quality aspects for a monitoring product such as uberAgent.
- Does the product work as documented?
- Is the collected data verified for completeness and validity?
- Does the documentation cover the entire functionality?
- Is the documentation written and structured well?
- Error handling & reliability
- How does the product deal with API errors or invalid data?
- Does the product recover from "things that are not supposed to happen"?
- What is the footprint on the monitored machine?
- Is there a negative impact on the performance or stability of other applications?
- Are more permissions requested than needed?
- Is input treated as insecure and checked before use?
We have developed several tools and procedures to monitor software quality, detect anomalies, and identify bugs.
Asserts are checks built into the source code that detect situations that are not meant to happen but sometimes do anyway. When an assert fails, an exception is thrown and an application dump is created, which helps a developer understand why the situation occurred.
Asserts are only enabled in internal test builds; they’re never active in public releases.
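A minimal sketch of the idea in Python, not uberAgent's actual implementation: the names `make_assert`, `InternalAssertError`, and the `internal_build` flag are hypothetical. The check raises only when the build is flagged as internal; in a release build the same call does nothing.

```python
class InternalAssertError(RuntimeError):
    """Raised when an assert fails in an internal test build."""


def make_assert(internal_build: bool):
    """Return an assert function that only checks in internal builds."""

    def check(condition: bool, message: str) -> None:
        if internal_build and not condition:
            # A real agent would additionally write an application dump here.
            raise InternalAssertError(message)

    return check


# Hypothetical build flavors: internal test build vs. public release.
internal_assert = make_assert(internal_build=True)
release_assert = make_assert(internal_build=False)

release_assert(len([]) > 0, "list must not be empty")  # ignored in a release
try:
    internal_assert(len([]) > 0, "list must not be empty")
except InternalAssertError as exc:
    print(f"assert failed: {exc}")
```

Binding the build flavor once, when the assert function is created, keeps call sites identical in both builds.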
Unit tests are great for testing the correctness of a component, class, or function that has no outside dependencies. Unit tests use a component in any number of correct, "creative," and incorrect ways while verifying that the computed result matches the expected result.
Unit tests differ from integration tests in that they run as part of the build pipeline, i.e., by the IDE/compiler. Unit tests are typically self-contained and don’t have access to external infrastructure (Active Directory, servers, …).
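As an illustration, here is a small self-contained unit test in Python. The function under test, `parse_interval`, is a hypothetical helper invented for this example and not part of uberAgent. The three test methods exercise it in correct, "creative," and incorrect ways.

```python
import re
import unittest

_UNIT_SECONDS = {"ms": 0.001, "s": 1.0, "m": 60.0, "h": 3600.0}


def parse_interval(text: str) -> float:
    """Convert strings such as '30s' or '5m' to seconds (hypothetical helper)."""
    match = re.fullmatch(r"\s*(\d+)\s*(ms|s|m|h)\s*", text.lower())
    if match is None:
        raise ValueError(f"invalid interval: {text!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


class ParseIntervalTest(unittest.TestCase):
    def test_correct_input(self):
        self.assertEqual(parse_interval("30s"), 30.0)
        self.assertEqual(parse_interval("2h"), 7200.0)

    def test_creative_input(self):
        # Unusual but valid: surrounding whitespace, upper case, sub-second unit.
        self.assertEqual(parse_interval("  5 M "), 300.0)
        self.assertAlmostEqual(parse_interval("500ms"), 0.5)

    def test_incorrect_input(self):
        for bad in ("", "abc", "-5s", "5 days"):
            with self.assertRaises(ValueError):
                parse_interval(bad)


# Run the tests in-process, without any external infrastructure.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(ParseIntervalTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```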
Every pull request (PR) must be reviewed by at least one person other than the original developer. Git branch policies ensure that this rule is followed without exception.
Along with each PR, the developer creates a test protocol describing how they tested the new or changed functionality. This helps others better understand the nature of the change and which other components might be affected.
After a PR has been completed, the change is verified by an independent tester who also creates a test protocol.
uberAgent contains several self-monitoring techniques by which it detects responsiveness issues and increases in memory usage, and prevents repeated blue screens.
uberAgent uses various lists and similar data containers to maintain the system state (e.g., one list keeps track of all running processes). To help detect whether a list is growing over time, which might indicate a memory leak, the agent monitors each list’s size and writes the current values to the log file at regular intervals. Agent logs are monitored by our integration tests (see below).
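The following Python sketch shows one way such size tracking could look. It is an illustration under assumptions, not the agent's code: the class `ContainerSizeMonitor`, the logger name, and the "strictly growing over the last five samples" rule are all hypothetical.

```python
import logging

logger = logging.getLogger("agent.selfmonitor")  # hypothetical logger name


class ContainerSizeMonitor:
    """Records the size of registered containers each time sample() is called."""

    def __init__(self):
        self._containers = {}
        self._history = {}

    def register(self, name, container):
        self._containers[name] = container
        self._history[name] = []

    def sample(self):
        """Record and log the current size of every registered container."""
        for name, container in self._containers.items():
            size = len(container)
            self._history[name].append(size)
            logger.info("container=%s size=%d", name, size)

    def growing(self, name, min_samples=5):
        """True if the last min_samples recorded sizes strictly increase."""
        recent = self._history[name][-min_samples:]
        if len(recent) < min_samples:
            return False
        return all(a < b for a, b in zip(recent, recent[1:]))


monitor = ContainerSizeMonitor()
running_processes = {}
leaky_cache = []
monitor.register("running_processes", running_processes)
monitor.register("leaky_cache", leaky_cache)

for tick in range(5):
    running_processes[tick % 2] = "proc"  # bounded: at most two entries
    leaky_cache.append(tick)              # unbounded: grows on every tick
    monitor.sample()                      # a real agent would use a timer

print(monitor.growing("running_processes"), monitor.growing("leaky_cache"))
```

In this sketch the bounded container is not flagged, while the container that gains an entry on every sample is.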
Almost every multi-threaded program uses variables or lists across multiple threads. This can lead to access violations where one thread modifies a variable while another thread is reading from it. To prevent that from happening, variables must be used in a thread-safe manner, which requires a locking mechanism that ensures only one thread can access the variable at any given point in time.
Locks have the drawback of introducing latency, which reduces an application’s responsiveness: while a thread waits for a lock, it is blocked and cannot process anything else.
uberAgent constantly monitors its lock latency. If the time required to acquire a lock exceeds a threshold, the agent writes a message to its log file. Agent logs are monitored by our integration tests – see below.
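Lock latency monitoring can be sketched in Python as a thin wrapper around a lock that measures how long acquisition took. The class name `MonitoredLock`, the 50 ms threshold, and the log message are assumptions made for this example; the article does not state the agent's actual threshold or message format.

```python
import logging
import threading
import time

logger = logging.getLogger("agent.locks")  # hypothetical logger name


class MonitoredLock:
    """A lock that logs a warning when acquiring it takes too long."""

    def __init__(self, name, threshold_s=0.05):
        self._lock = threading.Lock()
        self.name = name
        self.threshold_s = threshold_s
        self.slow_acquisitions = 0

    def __enter__(self):
        start = time.perf_counter()
        self._lock.acquire()
        waited = time.perf_counter() - start
        if waited > self.threshold_s:
            # Safe to update: the lock is held at this point.
            self.slow_acquisitions += 1
            logger.warning("lock=%s waited %.0f ms", self.name, waited * 1000)
        return self

    def __exit__(self, exc_type, exc, tb):
        self._lock.release()
        return False


lock = MonitoredLock("process_list", threshold_s=0.05)
holding = threading.Event()


def hold_briefly():
    with lock:
        holding.set()    # signal that the lock is now held
        time.sleep(0.2)  # simulate slow work while holding the lock


worker = threading.Thread(target=hold_briefly)
worker.start()
holding.wait()
with lock:  # blocks until the worker releases the lock
    pass
worker.join()
print(f"slow acquisitions: {lock.slow_acquisitions}")
```

Measuring inside the wrapper means every call site is covered without extra code.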
On Windows, uberAgent utilizes drivers to collect data directly in the kernel. As kernel modules, drivers operate in "god mode." Errors that cause an application crash in user mode lead to blue screens in the kernel, typically followed by an automatic reboot of the system.
In extremely unfortunate cases, the same error could happen again, causing another blue screen, followed by another reboot, repeating the cycle again and again in a reboot loop. To prevent this from happening, uberAgent’s drivers track whether they finished initialization correctly. If a driver didn’t start correctly during three consecutive boots, it disables itself (see the documentation).
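The logic can be modeled as a persistent counter that is incremented before initialization and reset after it succeeds. The Python sketch below is a simulation under assumptions: the real drivers run in the Windows kernel, and the article does not describe how or where they persist their state. `DriverStartGuard`, `boot`, and the dict used as storage are hypothetical.

```python
MAX_FAILED_STARTS = 3


class DriverStartGuard:
    """Tracks unfinished starts in a dict standing in for persistent storage."""

    def __init__(self, store):
        self._store = store

    def should_start(self):
        return self._store.get("pending_starts", 0) < MAX_FAILED_STARTS

    def begin_start(self):
        # Persist the attempt before doing anything that might crash.
        self._store["pending_starts"] = self._store.get("pending_starts", 0) + 1

    def finish_start(self):
        # Initialization succeeded: clear the failure streak.
        self._store["pending_starts"] = 0


def boot(store, init_crashes):
    """Simulate one boot; returns what happened to the driver."""
    guard = DriverStartGuard(store)
    if not guard.should_start():
        return "disabled"
    guard.begin_start()
    if init_crashes:
        return "crashed"  # simulated blue screen: finish_start never runs
    guard.finish_start()
    return "running"


store = {}
outcomes = [boot(store, init_crashes=True) for _ in range(4)]
print(outcomes)
```

Because the counter is written before initialization starts, a crash leaves it incremented. In this simulation, after three consecutive crashes the fourth boot skips the driver instead of crashing again.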
Our test lab contains machines with all supported OS versions, CPU architectures (x86, x64, and ARM), and infrastructure products (e.g., Citrix CVAD).
Every night, a new daily build is created and deployed to many of the lab machines. This helps catch many issues soon after they are introduced.
Some lab machines are updated at a lower frequency: they get new builds weekly instead of daily. These "weekly" machines help identify problems that take time to manifest, e.g., memory leaks where the RAM usage increases by only a few KB per hour.
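To see why the longer interval matters: a leak of 4 KB per hour adds only about 96 KB in a day but about 672 KB in a week, which is much easier to separate from normal fluctuation. The Python sketch below estimates the growth rate from hourly samples with a least-squares fit. The data is synthetic and the function `slope_per_hour` is hypothetical; the article does not say how leaks are actually quantified.

```python
def slope_per_hour(samples_kb):
    """Least-squares slope of memory usage over equally spaced hourly samples."""
    n = len(samples_kb)
    mean_x = (n - 1) / 2
    mean_y = sum(samples_kb) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples_kb))
    variance = sum((x - mean_x) ** 2 for x in range(n))
    return covariance / variance


# Synthetic data: one week of hourly samples, 4 KB/hour leak plus bounded jitter.
HOURS = 7 * 24
samples = [50_000 + 4 * hour + (hour * 37 % 11 - 5) for hour in range(HOURS)]

leak_rate = slope_per_hour(samples)
print(f"estimated growth: {leak_rate:.2f} KB/hour")
```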
A subset of our lab machines is continuously stress-tested by Login Enterprise, a synthetic load generation tool that performs logons, starts applications, and simulates user activity.
Driver Verifier, AddressSanitizer, and GFlags are tools that help detect issues like memory leaks or buffer overruns. All three are enabled on our lab machines. When one of them catches a problem, it immediately stops the affected process or even the entire machine and generates an application or kernel memory dump, which the developers can then analyze.
All of the above tools in our lab would be useless without our custom extensible test framework, which runs a series of tests against all machines in the lab every day. The framework can easily be extended with additional tests, typically developed in PowerShell.
Over time, we have refined the test framework and increased the number and coverage of its tests. The list of tests includes, but is not limited to:
- check uberAgent's log files for warnings or errors
- check uberAgent's log files for indicators of memory leaks or lock latency
- check if new application dumps or kernel memory dumps were generated
- check Splunk for internal warnings or errors
- verify uberAgent's footprint by checking Splunk for performance metrics uberAgent collected on its own components
- start a random process and verify uberAgent's collected data in Splunk shortly thereafter
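The first check in the list could be sketched as follows. The framework's own tests are typically written in PowerShell; this illustration uses Python instead. The log format (`<timestamp>,<LEVEL>,<message>`), the sample lines, and the function `find_problems` are assumptions made for this example and do not reflect uberAgent's real log layout.

```python
import re

# Hypothetical log format: "<timestamp>,<LEVEL>,<message>".
PROBLEM_LEVELS = re.compile(r",(WARN|ERROR),")


def find_problems(log_lines):
    """Return all log lines whose level field is WARN or ERROR."""
    return [line.strip() for line in log_lines if PROBLEM_LEVELS.search(line)]


sample_log = [
    "2024-05-01 02:00:00,INFO,Agent started",
    "2024-05-01 02:00:05,WARN,Lock acquisition took 120 ms",
    "2024-05-01 02:00:09,INFO,Timer fired",
    "2024-05-01 02:00:12,ERROR,API call failed",
]

problems = find_problems(sample_log)
for line in problems:
    print(line)
```

A test of this kind would fail the nightly run whenever `problems` is non-empty, which turns the agent's self-monitoring log messages into actionable signals.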
All our production machines are running uberAgent, of course. This includes laptops, desktops, VMs, and servers. We use uberAgent's features creatively and extensively, e.g., to run custom scripts to collect business data or query SaaS APIs.
uberWidgetDock visualizes key performance data of uberAgent like CPU usage, RAM usage, or handle count. The tool docks at the top of the screen and is always visible, which makes it easy to detect issues on developer machines while testing new builds.
uberAgent SPI is a powerful combination of ETW provider, log forwarder, and Splunk app. uberAgent SPI helps monitor agent resource utilization and footprint. Its distinctive feature is that it not only logs the current CPU or RAM usage numbers but adds native call stack frames, which help developers identify the source of a memory allocation, for example.
The uberAgent product family offers innovative digital employee experience monitoring and endpoint security analytics for Windows and macOS.
uberAgent UXM highlights include detailed information about boot and logon duration, application unresponsiveness detection, network reliability drill-downs, process startup duration, application usage metering, browser performance, web app metrics, and Citrix insights. All these varied aspects of system performance and reliability are smartly brought together in the Experience Score dashboard.
uberAgent ESA excels with a sophisticated Threat Detection Engine, endpoint security & compliance rating, the uAQL query language, detection of risky activity, DNS query monitoring, hash calculation, registry monitoring, and Authenticode signature verification. uberAgent ESA comes with Sysmon and Sigma rule converters, a graphical rule editor, and uses a simple yet powerful query language instead of XML.
About vast limits
vast limits GmbH is the company behind uberAgent, the innovative digital employee experience monitoring and endpoint security analytics product. vast limits’ customer list includes organizations from industries like finance, healthcare, professional services, and education, ranging from medium-sized businesses to global enterprises. vast limits’ network of qualified solution partners ensures best-in-class service and support anywhere in the world.