Tools and techniques for monitoring real-time distributed applications

Date: July 2010

 

Monitoring a real-time distributed system for fault detection and identification is an extremely challenging problem. Faults may manifest themselves at a node other than the one where the actual error occurred, and may also depend on a particular sequence of events and thus not be easily reproducible. In addition, the fault may be the loss of connectivity to some segment of the system, rendering a monitoring agent running in one segment unable to communicate with other segments.

A fault in a real-time distributed system may be signified by the failure of one of the applications in the system, but it may also be the violation of the real-time constraints of the system (such as performance). "Hard" faults such as crashes are relatively easy to detect. "Soft" faults such as a momentary spike in latency between two specific nodes may be much harder to detect, let alone identify and resolve.
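
The distinction between "hard" and "soft" faults can be illustrated with a minimal sketch. It assumes a hypothetical observation record holding the measured latency between two nodes (or no value at all when the peer did not respond) and a hypothetical per-link deadline; none of these names come from the presentation itself.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Sample:
    """One observation of a link (hypothetical record, for illustration only)."""
    node_pair: Tuple[str, str]     # (sender, receiver)
    latency_ms: Optional[float]    # None means the receiver never responded


def classify(sample: Sample, deadline_ms: float) -> str:
    """Return 'hard', 'soft', or 'ok' for a single observation."""
    if sample.latency_ms is None:
        # No response at all: a crash or loss of connectivity.
        return "hard"
    if sample.latency_ms > deadline_ms:
        # The peer responded, but the real-time constraint was violated.
        return "soft"
    return "ok"
```

A single late sample is only a symptom; identifying which node or link actually caused it still requires correlating observations across the system, which is the harder part of the problem.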

A real-time distributed system may fail for a variety of reasons. For example:

 

  • Configuration error (operator error)
  • Software error (bugs)
  • Network or hardware failures
  • Resource contention

 

Traditional techniques for detecting and identifying such faults are not feasible for real-time distributed systems. One cannot stop the execution of a real-time distributed system in order to walk through the execution sequence step by step. Even if such an approach were technically feasible, it is unlikely that it would be desirable. Stopping a real-time distributed system would cause it to fail in its interaction with the external world, which would likely introduce additional failure modes.

SNMP-based protocols and tools, such as HP OpenView, cannot address this problem. These techniques require each monitored component to run an SNMP Agent/Server and the tool to connect point-to-point to each of the SNMP Agents. This works well for monitoring a few "server" nodes in the network infrastructure, but it cannot be deployed, and does not scale, when the requirement is to monitor thousands or tens of thousands of client applications.

This presentation covers early results of SBIR-funded research on monitoring and instrumentation of large-scale real-time distributed systems. Specifically, the talk will cover the following subjects:
  1. Information model required to understand the operational state of a real-time distributed system.
  2. Interception techniques and APIs that allow collection of application and middleware information with minimal impact on their performance.
  3. Visualization techniques that enable an operator to gain a global perspective in a large-scale system, identify trouble spots, and drill down to the details.
  4. Techniques and languages that can be used to define the "normal" operating state and detect significant deviation from "normalcy".
The results presented, specifically the application monitoring meta-model and an instrumentation API, could influence a future OMG monitoring and instrumentation standard.
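
As one illustration of the fourth subject, detecting deviation from a "normal" operating state, the sketch below keeps a rolling baseline of a single metric and flags values that fall far outside it. This is a generic statistical technique, not the method presented in the talk; the class name, window size, and threshold are all hypothetical choices made for the example.

```python
from collections import deque
from statistics import mean, pstdev


class BaselineMonitor:
    """Rolling baseline for one metric (hypothetical helper, for illustration)."""

    def __init__(self, window: int = 50, threshold: float = 3.0, min_samples: int = 10):
        self.samples = deque(maxlen=window)  # most recent "normal" observations
        self.threshold = threshold           # allowed distance, in standard deviations
        self.min_samples = min_samples       # observations needed before judging

    def observe(self, value: float) -> bool:
        """Return True if value deviates significantly from the baseline."""
        deviant = False
        if len(self.samples) >= self.min_samples:
            mu = mean(self.samples)
            sigma = pstdev(self.samples)
            if sigma > 0 and abs(value - mu) > self.threshold * sigma:
                deviant = True
        if not deviant:
            # Only normal values update the baseline, so a spike does not mask the next one.
            self.samples.append(value)
        return deviant
```

Keeping deviant values out of the window is a design choice: it keeps the baseline stable during a burst of anomalies, at the cost of adapting more slowly when the normal state genuinely shifts.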

 

Attachment: Tools_and_techniques_for_monitoring_rt_dist_apps_OMG_RT_Worskshop_2010.pdf (442.02 KB)