Welcome to DIDO WIKI

This is an old revision of the document!


4.2.2.3 Fault Tolerance

About

Fault Tolerance is the ability of a system (computer, network, cloud cluster, component, etc.) to continue functioning correctly, without interruption, when failures occur. Fault Tolerant systems (or components) prevent disruptions to a system that is considered a Safety-Critical System (SCS), a Life-Critical System, or mission-critical. Usually, this requires an understanding of the single points of failure along the multiple critical execution paths in a running system.

The system characteristics of Fault Tolerance and High Availability are related: to achieve High Availability, a system must address the Fault Tolerance of the components on the system's critical paths.

Fault Tolerant systems use redundant (i.e., spare or backup) components that automatically become available when a component fails, ensuring there is no loss of service or data. Using failover mechanisms to transition quickly, smoothly, and transparently to the redundant or backup systems requires a well-designed system, with contingency plans and special management processes, hardware, or software to ensure the transition. Some failover components can be acquired. For example:

  • Power sources are made fault tolerant by incorporating alternative sources and backups such as Uninterruptible Power Supplies (UPS) and backup generators. A good description of this is provided in the Tactical Microgrid Standard (TMS) use case.
  • Hardware systems are made Fault Tolerant by deploying identical or equivalent systems that can be used either instead of the original system or in conjunction with it. For example, a server can be made fault tolerant by using an identical server running in parallel, with all operations mirrored to the backup server.
  • Networks are made Fault Tolerant by supporting multiple network paths between any two endpoints within the Local Area Network (LAN) or Wide Area Network (WAN). The actual endpoint also needs to be duplicated (e.g., two Network Interface Cards (NICs)). It is also possible to use two different kinds of network, such as wired, Wi-Fi, Bluetooth, or Zigbee.
  • Software systems or components become fault tolerant when multiple instances of the software run in parallel, using operating system threads, containers such as Docker, or orchestration software such as Kubernetes. For example, a database can be continuously replicated to other machines. If the primary database goes down, operations can be automatically redirected to the second database. Another example is the use of orchestration software such as Kubernetes to automatically switch to an alternate application container on the same or a different machine.
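The mirroring and redirect behavior described for servers and databases above can be sketched in a few lines of Python. This is a minimal illustration only: the `Replica` and `FailoverClient` classes are hypothetical stand-ins, not part of any real database or orchestration API, and the sketch ignores resynchronizing a node that comes back after a failure.

```python
class Replica:
    """Hypothetical stand-in for one database node."""

    def __init__(self, name):
        self.name = name
        self.healthy = True
        self.rows = []

    def write(self, row):
        # A failed node refuses operations, like a dropped connection.
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        self.rows.append(row)


class FailoverClient:
    """Mirrors writes to every healthy replica and reads from the first healthy one."""

    def __init__(self, replicas):
        self.replicas = list(replicas)  # order expresses preference: primary first

    def write(self, row):
        written = 0
        for replica in self.replicas:
            try:
                replica.write(row)
                written += 1
            except ConnectionError:
                continue  # skip the failed node; the others keep the data
        if written == 0:
            raise RuntimeError("no replica available")
        return written

    def read_all(self):
        # Redirect transparently to the next node when the primary is down.
        for replica in self.replicas:
            if replica.healthy:
                return replica.name, list(replica.rows)
        raise RuntimeError("no replica available")


primary, backup = Replica("primary"), Replica("backup")
client = FailoverClient([primary, backup])
client.write("batch-1")
primary.healthy = False      # simulate a failure of the primary
client.write("batch-2")      # still succeeds via the backup
print(client.read_all())
```

The caller never names a specific node, which is what makes the transition transparent. In a real deployment the health check, replication, and redirect would typically be provided by the database or the orchestration layer rather than by application code.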

Fault Tolerance needs to be considered in all disaster recovery plans or strategies. For example, Fault Tolerant systems can use the cloud for backups, allowing critical systems to be restored quickly. Although these backups are not true immediate failovers, they can provide fault tolerance over a longer horizon. Note: these backups are often not geographically local, which is particularly important during natural or even human-caused disasters.

DDS Specifics

The Data Distribution Service (DDS) is Message-Oriented Middleware (MOM) software and as such cannot directly help with power sources, hardware, or networks. However, because it is a many-to-many, peer-to-peer (P2P), publish-subscribe middleware, it can be used to help monitor these components and support informed decisions about their proper operation. For example, there can be redundant heat sensors on a chemical mixing tank, both publishing the current temperature of the tank. If one sensor fails, the monitoring software component for the tank can automatically use the backup sensor without human intervention, provided the Quality of Service (QoS) parameters on the topic are configured correctly.
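The page does not say which QoS policies are meant. One plausible reading, offered here as an assumption, is DDS exclusive OWNERSHIP combined with OWNERSHIP_STRENGTH and a DEADLINE: the reader follows the strongest writer that is still publishing on time and falls back to the next one otherwise. The sketch below simulates that selection rule in plain Python; it does not use a DDS library, and `Sample` and `ExclusiveReader` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    writer: str       # which sensor published the value
    strength: int     # plays the role of OWNERSHIP_STRENGTH
    timestamp: float  # publication time in seconds
    value: float      # tank temperature


class ExclusiveReader:
    """Simulates exclusive ownership: follow the strongest writer within its deadline."""

    def __init__(self, deadline):
        self.deadline = deadline  # maximum allowed gap between samples, in seconds
        self.latest = {}          # most recent sample per writer

    def on_sample(self, sample):
        self.latest[sample.writer] = sample

    def current(self, now):
        # Writers whose last sample is older than the deadline are treated as failed.
        alive = [s for s in self.latest.values()
                 if now - s.timestamp <= self.deadline]
        if not alive:
            return None
        return max(alive, key=lambda s: s.strength)


reader = ExclusiveReader(deadline=2.0)
reader.on_sample(Sample("sensor-A", strength=10, timestamp=0.0, value=71.5))
reader.on_sample(Sample("sensor-B", strength=5, timestamp=0.0, value=71.7))
print(reader.current(now=1.0).writer)   # both alive: the stronger sensor-A is used

# sensor-A stops publishing; sensor-B keeps going.
reader.on_sample(Sample("sensor-B", strength=5, timestamp=3.0, value=72.0))
print(reader.current(now=3.5).writer)   # sensor-A missed its deadline: sensor-B takes over
```

In a real DDS deployment the middleware applies this rule itself, so the monitoring component simply reads the topic and sees values from whichever sensor currently owns it.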

  • Note: There are several excellent examples provided in section 03_user.
dido/public/ra/1.4_req/2_nonfunc/14_reliability/04_faulttolerance.1605061859.txt.gz · Last modified: 2020/11/10 21:30 by nick