Tools and techniques for monitoring real-time distributed applications
Monitoring a real-time distributed system for fault detection and identification is an extremely challenging problem. Faults may manifest themselves at a node other than the one where the actual error occurred, and may depend on a particular sequencing of events, which makes them hard to reproduce. In addition, the fault may be the loss of connectivity to some segment of the system, rendering a monitoring agent running in one segment unable to communicate with the others.
A fault in a real-time distributed system may be signified by the failure of one of the applications in the system, but it may also be the violation of the real-time constraints of the system (such as performance). "Hard" faults such as crashes are relatively easy to detect. "Soft" faults such as a momentary spike in latency between two specific nodes may be much harder to detect, let alone identify and resolve.
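The distinction between "hard" and "soft" faults can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `NodeStatus` record, the `classify` function, and the timeout and deadline values are hypothetical and do not come from any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class NodeStatus:
    """Hypothetical per-node record kept by a monitoring agent."""
    last_heartbeat: float  # time of the last heartbeat, in seconds
    last_latency: float    # most recent measured latency, in seconds


def classify(status, now, heartbeat_timeout=2.0, latency_deadline=0.010):
    """Return "hard", "soft", or "ok" for one node.

    The default thresholds are illustrative assumptions only.
    """
    # Hard fault: the node has been silent longer than the timeout.
    if now - status.last_heartbeat > heartbeat_timeout:
        return "hard"
    # Soft fault: the node is alive but missed its latency deadline.
    if status.last_latency > latency_deadline:
        return "soft"
    return "ok"
```

A crashed node is caught by the silence check alone. A latency spike is only caught if it happens to be the most recent sample, which hints at why momentary soft faults are so much harder to detect.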
A real-time distributed system may fail for a variety of reasons. For example:
- Configuration error (operator error)
- Software error (bugs)
- Network or hardware failures
- Resource contention
Traditional techniques for detecting and identifying such faults are not feasible for real-time distributed systems. One cannot stop the execution of a real-time distributed system in order to step through the execution sequence. Even if such an approach were technically feasible, it is unlikely that it would be desirable: stopping the system would cause it to fail in its interaction with the external world, which would likely introduce additional failure modes.
Monitoring such systems instead calls for:
- The information model required to understand the operational state of a real-time distributed system.
- Interception techniques and APIs that allow collection of application and middleware information with minimal impact on performance.
- Visualization techniques that enable an operator to gain a global perspective in a large-scale system, identify trouble spots, and drill down to the details.
- Techniques and languages that can be used to define the "normal" operating state and detect significant deviations from it.
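As one possible illustration of the last item, the sketch below defines "normal" as a rolling window of recent samples and flags a value that lies more than a few standard deviations from the window mean. This is a generic statistical technique chosen for illustration, not the method described in the presentation. The `NormalcyDetector` class and all of its parameters are hypothetical.

```python
from collections import deque
from statistics import mean, stdev


class NormalcyDetector:
    """Rolling-baseline detector (illustrative sketch, not a product API)."""

    def __init__(self, window=50, threshold=3.0, min_samples=10, min_sigma=1e-9):
        self.samples = deque(maxlen=window)  # recent "normal" samples
        self.threshold = threshold           # allowed deviation, in sigmas
        self.min_samples = min_samples       # samples needed before judging
        self.min_sigma = min_sigma           # floor to avoid a zero sigma

    def observe(self, value):
        """Return True if value deviates significantly from the baseline."""
        anomalous = False
        if len(self.samples) >= self.min_samples:
            mu = mean(self.samples)
            sigma = max(stdev(self.samples), self.min_sigma)
            anomalous = abs(value - mu) > self.threshold * sigma
        # Only normal samples update the baseline, so a spike
        # does not shift the definition of "normal".
        if not anomalous:
            self.samples.append(value)
        return anomalous


detector = NormalcyDetector(window=20, threshold=3.0, min_samples=5)
for latency_ms in [1.0, 1.1, 0.9, 1.0, 1.05, 0.95, 1.0, 1.1]:
    detector.observe(latency_ms)   # builds the baseline
spike_detected = detector.observe(5.0)
```

Excluding flagged samples from the window is a design choice: it keeps the baseline stable during a fault, at the cost of adapting slowly if the operating state legitimately changes.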
| Attachment | Size |
|---|---|
| Tools_and_techniques_for_monitoring_rt_dist_apps_OMG_RT_Worskshop_2010.pdf | 442.02 KB |
