The Importance of Distributed Transaction Monitoring

Eric J. Bruno
Monitoring in general may not always be viewed as a differentiator, but transaction monitoring and tracing can become a critical requirement. The world isn't perfect, and even the best development practices cannot prevent every data integrity issue. Being able to reconcile problems with good monitoring and reporting can be the difference between staying in business and going out of it, or even between life and death.

In terms of software, transaction processing is the act of taking individual operations and executing them as a single unit. All of the constituent operations must either succeed or fail as a whole, meaning that if one component fails, the others must be rolled back. Partial completion is never acceptable.

Take, as an example, a purchase made on an e-commerce site. In simplistic terms, the two important components of the transaction are the removal of the item from inventory and the charging of your credit card for payment. If either one of those operations fails, the other must not complete. For example, if the item is not in inventory, your credit card should not be charged. Conversely, if the credit card payment fails, inventory must not be decremented. The transaction succeeds only when all of the individual components succeed; otherwise, all actions are rolled back to maintain data integrity.
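
As a minimal sketch of this pattern, here is what that two-operation transaction might look like in plain JDBC. The connection URL, table names, and column names are hypothetical; the point is that commit and rollback group the two operations into one atomic unit.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    public class CheckoutTransaction {
        // Hypothetical JDBC URL and schema, for illustration only
        public void purchase(String itemId, String cardToken, double amount)
                throws SQLException {
            try (Connection conn =
                     DriverManager.getConnection("jdbc:yourdb://host/shop")) {
                conn.setAutoCommit(false);   // group operations into one transaction
                try (PreparedStatement decrement = conn.prepareStatement(
                         "UPDATE inventory SET quantity = quantity - 1 " +
                         "WHERE item_id = ? AND quantity > 0");
                     PreparedStatement charge = conn.prepareStatement(
                         "INSERT INTO payments (card_token, amount) VALUES (?, ?)")) {

                    decrement.setString(1, itemId);
                    if (decrement.executeUpdate() == 0) {
                        throw new SQLException("Item not in inventory");
                    }
                    charge.setString(1, cardToken);
                    charge.setDouble(2, amount);
                    charge.executeUpdate();

                    conn.commit();           // both operations take effect together...
                } catch (SQLException e) {
                    conn.rollback();         // ...or neither takes effect
                    throw e;
                }
            }
        }
    }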

Transaction processing gets more complicated when multiple, distributed systems are involved, a scenario described as distributed transaction processing. For example, renting a movie on iTunes (or another movie rental service) requires that the movie be available, that proper payment credentials are received, and that enough space exists on the viewer's device for the download. If any one of those steps fails on any of the distributed components, the entire transaction fails (meaning you don't get charged, and the service doesn't report a rental to the owner of the video content).
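
When the participants live in separate systems, a transaction manager must coordinate them. The sketch below assumes a Java EE container that exposes the Java Transaction API (JTA) through JNDI; chargeCustomer() and reportRentalToOwner() are hypothetical helpers standing in for work done against separately enlisted resources.

    import javax.naming.InitialContext;
    import javax.transaction.UserTransaction;

    public class RentalTransaction {
        public void rentMovie(String movieId, String cardToken) throws Exception {
            UserTransaction utx = (UserTransaction)
                    new InitialContext().lookup("java:comp/UserTransaction");
            utx.begin();
            try {
                chargeCustomer(cardToken);        // resource 1: payment system
                reportRentalToOwner(movieId);     // resource 2: content owner's system
                utx.commit();   // the transaction manager coordinates both resources
            } catch (Exception e) {
                utx.rollback(); // any failure undoes work on every participant
                throw e;
            }
        }

        private void chargeCustomer(String cardToken) { /* enlisted resource */ }
        private void reportRentalToOwner(String movieId) { /* enlisted resource */ }
    }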

Unfortunately, the complexities of distributed transaction processing extend to other systems as well, such as your monitoring and reporting implementations. Fortunately, there are strategies and tools to help.

Why Transaction Monitoring is Critical

Monitoring in general may not always be viewed as a differentiator, but monitoring can be a critical requirement for transaction systems. Beyond the need to debug and test transaction software, proper monitoring is required to ensure the following important qualities of production transactional systems:

  • Verifying transaction success or failure
  • Proving system integrity
  • Properly reporting inventory usage to other systems (e.g., content owners)
  • Adhering to service-level agreements (SLAs), such as individual system performance targets
  • Ensuring security isn’t compromised

Transaction monitoring and reporting become critical when system integrity is compromised due to improper transaction processing, whether the cause is an implementation error, a total sub-system failure, or human error. Being able to reconcile with good monitoring and reporting can be the difference between staying in business and going out of it, or even between life and death. Let’s look at the key features of a monitoring and tracing implementation.

Database Transaction Monitoring

Implementing useful distributed transaction tracing for web-based applications can be complex. First, since most transactions center around a database, your distributed monitoring solution needs to work with each database’s transaction and monitoring engine, regardless of the database(s) used. This includes multi-vendor support (e.g., Oracle, DB2, SQL Server, and so on), different types of databases (e.g., relational and NoSQL), as well as the underlying storage systems.

Making matters more complicated, modern systems may also contain a mix of cloud-based and mainframe-based database processing, all of which needs to be monitored and traced.
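
One way to stay vendor-neutral is to instrument at the JDBC layer, which every relational driver implements. Here is a minimal timing sketch; in a real monitor the result would feed a metrics pipeline rather than standard output, and the caller is responsible for closing the returned ResultSet.

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    // Because JDBC abstracts the driver, the same timing wrapper works
    // against Oracle, DB2, SQL Server, and other JDBC-compliant databases.
    public final class TimedQuery {
        public static ResultSet execute(Connection conn, String sql, Object... params)
                throws SQLException {
            PreparedStatement ps = conn.prepareStatement(sql);
            for (int i = 0; i < params.length; i++) {
                ps.setObject(i + 1, params[i]);
            }
            long start = System.nanoTime();
            ResultSet rs = ps.executeQuery();
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            // Placeholder metrics sink; replace with your monitoring pipeline
            System.out.printf("query [%s] took %d ms%n", sql, elapsedMs);
            return rs;
        }
    }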

End-to-End Transaction Tracing

To ensure integrity, your monitoring solution must follow the performance of critical transactions across your entire application environment, whether it’s client-server based, service-oriented, in the public cloud, or all of the above. This includes application servers, web services, and integration with legacy systems (even mainframes).
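
A common building block for end-to-end tracing is a correlation ID generated at the edge and propagated with every hop, so logs and metrics from each tier can be stitched into one trace. Below is a minimal sketch as a servlet filter; the X-Correlation-ID header name is a widely used convention rather than a standard, and SLF4J’s MDC is assumed to be on the classpath for logging.

    import java.io.IOException;
    import java.util.UUID;
    import javax.servlet.Filter;
    import javax.servlet.FilterChain;
    import javax.servlet.FilterConfig;
    import javax.servlet.ServletException;
    import javax.servlet.ServletRequest;
    import javax.servlet.ServletResponse;
    import javax.servlet.http.HttpServletRequest;
    import org.slf4j.MDC;

    public class CorrelationIdFilter implements Filter {
        public void init(FilterConfig config) {}
        public void destroy() {}

        public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
                throws IOException, ServletException {
            String id = ((HttpServletRequest) req).getHeader("X-Correlation-ID");
            if (id == null) {
                id = UUID.randomUUID().toString();  // this tier starts the trace
            }
            MDC.put("correlationId", id);  // visible to all logging on this thread
            try {
                // Pass the ID downstream on every outbound call so the
                // trace spans application servers, services, and legacy tiers
                chain.doFilter(req, res);
            } finally {
                MDC.remove("correlationId");
            }
        }
    }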

Key Product Quality Parameters and SLAs

Your monitoring solution requires you to identify the system’s most critical transactions and measure their response times, system profiling data, and error rates, flagging those that perform poorly. Minding your key product quality parameters (KPQPs) and SLAs is more than a game of alphabet soup. In many industries, it’s needed to meet compliance requirements and avoid penalties, or worse.
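
As a sketch of the idea, the following hypothetical wrapper times a critical transaction and flags SLA breaches. The 500 ms limit is an illustrative threshold, not a recommendation, and the alert sink is a placeholder.

    import java.util.function.Supplier;

    public class SlaMonitor {
        private static final long SLA_MILLIS = 500;  // illustrative SLA threshold

        // Wrap any critical transaction, measure it, and flag slow executions
        public <T> T timed(String txnName, Supplier<T> txn) {
            long start = System.nanoTime();
            try {
                return txn.get();
            } finally {
                long elapsed = (System.nanoTime() - start) / 1_000_000;
                if (elapsed > SLA_MILLIS) {
                    // Placeholder alert sink; replace with your alerting pipeline
                    System.err.printf("SLA breach: %s took %d ms (limit %d ms)%n",
                            txnName, elapsed, SLA_MILLIS);
                }
            }
        }
    }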

Code-Level Tracing

You need the ability to drill down with real-time insight into specific component code execution, SOA service calls, and database queries. System profiling must get down to the level of measuring the most critical method calls and queries to determine their frequency of execution, along with the standard deviation of processing time. This level of data science across your entire code base can be hard to achieve, but the capability pays you back quickly.
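
Computing a running mean and standard deviation without storing every sample is a solved problem: Welford’s online algorithm does it in constant space. A minimal per-method sketch, with class and method names that are illustrative only:

    public class MethodStats {
        private long count;
        private double mean;
        private double m2;   // running sum of squared deviations from the mean

        // Record one timing sample (Welford's online update)
        public synchronized void record(double millis) {
            count++;
            double delta = millis - mean;
            mean += delta / count;
            m2 += delta * (millis - mean);
        }

        public synchronized long count() { return count; }

        public synchronized double mean() { return mean; }

        public synchronized double stdDev() {
            return count > 1 ? Math.sqrt(m2 / (count - 1)) : 0.0;
        }
    }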

Distributed Transaction Processing Visualization

Monitoring your transaction processing is only the first step. Making sense of the data you gather is the second, and perhaps most important, step. Being able to quickly visualize performance metrics for distributed applications, their architecture, and external service status in one dashboard is critical to isolating issues as they occur, or even to identifying potential problems before they impact users or customers. It also helps to identify precisely who needs to get involved (an internal group, external vendor, network provider, or other organization) to help resolve the issue and recover data in a timely manner.

Part of efficient transaction monitoring includes identifying the right people to be involved when questions arise. Alerting the wrong people, or the wrong vendor, can be costly not just in terms of wasted resources, time and effort, but also increased risk due to delays in recovery.


[Figure: Understand and visualize performance spikes in your applications, even when the cause is a single outlying request, by analyzing trends by individual request.]

Identify Trends and Outliers

Application performance monitoring goes beyond servers to include the entire stack, from your multiple layers of code through to data storage. You can then cross-correlate information between multiple components and their sources. For example, identify and trace user actions that trigger code executing in your application that requests information from specific database tables, through to the database engine serving those requests.
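
One way to make that correlation concrete is to tag each SQL statement with the current trace ID in a comment, an approach popularized by projects such as Google’s sqlcommenter, so the database engine’s own query logs can be tied back to the user action that triggered them. A minimal sketch, reusing the correlation ID set by the filter shown earlier:

    import org.slf4j.MDC;

    public final class TracedSql {
        // Append the current trace ID to the SQL as a comment; the engine
        // ignores it, but it appears in the database's query logs and tools
        public static String tag(String sql) {
            String traceId = MDC.get("correlationId");
            if (traceId == null) {
                return sql;
            }
            return sql + " /* correlationId=" + traceId + " */";
        }
    }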

However, transaction monitoring goes beyond the standard dashboard green or red light: systems may be up, but transactions may not always execute properly or on time. Measure overall system performance, including each component and even the network infrastructure, to perform root-cause analysis of issues affecting your users.

Starting from this macro level, you can isolate interesting code and dive into the lowest-level details of your application, including individual web pages and their constituents (see Figure below). This includes a page’s script code as it’s executed, host server activity, network latency, associated database queries, image downloads, and more, all down to the individual line of source.

[Figure: Dive deep into the user experience, monitoring each transaction, looking for outliers, and reporting real user performance.]

Some monitoring solutions focus on specific parts of the stack, but to measure the entire transaction envelope, observe the entire stack and how it affects the user experience.

Reporting and Beyond

Additionally, transaction reporting should be baselined and made available to you from anywhere in the world. This allows you to see how servers in a data center or cloud provider facility are affecting transactions for users in specific regions. This ability to tie end-user transaction performance to specific sets of servers, network infrastructure, or other services helps you resolve issues quickly, perhaps even before users notice.


Continuously monitor trends to be alerted to impending trouble before users are impacted. Distributed transaction monitoring services can be deployed across your data centers, or globally in the cloud, to track your users’ experience accurately, regardless of where they reside. As a result, you can trace transactions across hosts, measure against baselines and transaction acceptance criteria using the Latency data API, and generate automated real-time alerts when thresholds are crossed.
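
Product specifics aside, the underlying alerting logic can be sketched generically: learn a baseline, then flag transactions that cross it. The example below reuses the MethodStats sketch from earlier and treats the mean plus three standard deviations as the threshold; the warm-up count of 100 samples is arbitrary, and sendAlert() is a hypothetical hook into your alerting pipeline.

    public class BaselineAlerter {
        private final MethodStats baseline = new MethodStats();

        // Check each new observation against the learned baseline, then
        // fold it into the baseline for future comparisons
        public void observe(String txnName, double millis) {
            double threshold = baseline.mean() + 3 * baseline.stdDev();
            if (baseline.count() > 100 && millis > threshold) {
                sendAlert(txnName, millis, threshold);  // page the right team
            }
            baseline.record(millis);
        }

        private void sendAlert(String txn, double ms, double limit) {
            // Placeholder alert sink; replace with your real-time alerting
            System.err.printf("ALERT: %s took %.0f ms (baseline limit %.0f ms)%n",
                    txn, ms, limit);
        }
    }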

Conclusion - Monitoring Grows as Your System Grows

Rolling your own system transaction monitoring solution may work for a specific release of your software. However, in the age of agile development and DevOps, iterative development results in systems that constantly change, which can impact your monitoring systems as well. As a result, ensuring your monitoring capabilities keep pace with your rate of release can be a challenge. Additionally, a custom monitoring solution for one application may not be easily applied to other systems, now or in the future.

Some organizations view monitoring as a necessary evil. But with distributed transaction systems, thorough monitoring can be a business differentiator: it extends to all of your components, helps you proactively detect and fix issues before your users experience them, and enables you to recover quickly in times of failure. I often like to draw a comparison between software and the game of golf: even the best golfers occasionally hit a bad shot. It’s what they do to recover that counts most. Effective transaction monitoring can do the same for your business, impressing your customers with your exceptional ability to recover from even the worst of problems. In business, it’s often that final impression that counts.