Fault Tolerance and Standby Schemes in Computer Systems, Study notes of Mobile Computing

These notes cover fault tolerance techniques in computer systems, focusing on fault isolation, containment, and standby schemes. Standby schemes include cold, warm, and hot standby, each with its advantages and disadvantages. Faults are classified by duration, underlying cause, and behavior. Checkpointing is a fault tolerance technique in which recovery cost has to be balanced against system performance.

Typology: Study notes

2010/2011

Uploaded on 09/04/2011

amit-mohta
11/26/14 1
Fault Treatment

  • In this phase, the fault is first isolated and then repaired. The repair procedure depends on the type of fault. Permanent faults require that the failed component be replaced by a non-failed component, which in turn requires a standby component. The standby component has to be integrated into the system, which means that its state has to be synchronized with the state of the rest of the system. There are three general types of standby schemes:
  • Cold standby -- The standby component is not operational, so its state has to be rebuilt in full when the cutover occurs. This may be a very expensive and lengthy operation. For instance, a large database may have to be fully reconstructed (e.g., using a log of transactions) on a standby disk. The advantage of cold standby schemes is that they introduce no overhead during normal operation of the system. The cost is paid instead in fault recovery time.
  • Warm standby -- The standby component keeps the last checkpoint of the operational component that it is backing up. When the principal component fails, backward error recovery can be relatively short. The cost of warm standby schemes is the cost of backward recovery discussed earlier (mainly high overhead).
  • Hot standby -- The standby component is fully active and duplicates the function of the primary component. Thus, if an error occurs, recovery can be practically instantaneous. The problem with this scheme is that it is difficult to keep two components in lock step. In contrast to warm standby schemes, in which synchronization is performed only at checkpoints, here it has to be done continuously. Invariably, this requires communication between the primary and the standby, so the overhead of these schemes is often higher than the overhead of warm standby.
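
The difference between the three schemes can be sketched as the amount of state a standby has to rebuild at cutover. The following is a minimal illustration, not a real replication protocol: the `Primary` class, the `replay` and `simulate` helpers, and the idea that the primary's update log survives the failure are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Primary:
    """Hypothetical primary component: a key-value state plus an update log."""
    state: dict = field(default_factory=dict)
    log: list = field(default_factory=list)  # every (key, value) update so far

    def apply(self, key, value):
        self.state[key] = value
        self.log.append((key, value))


def replay(state, entries):
    """Apply logged updates to `state`; return how many had to be replayed."""
    for key, value in entries:
        state[key] = value
    return len(entries)


def simulate(updates, checkpoint_every):
    """Run the primary, then count the recovery work each standby scheme needs."""
    primary = Primary()
    warm_state, warm_pos = {}, 0  # warm: last checkpoint and the log position it covers
    hot_state = {}                # hot: mirrors every update as it happens

    for i, (key, value) in enumerate(updates, start=1):
        primary.apply(key, value)
        hot_state[key] = value  # constant synchronization (overhead in normal operation)
        if i % checkpoint_every == 0:
            # Periodic synchronization only: copy the state at the checkpoint.
            warm_state, warm_pos = dict(primary.state), len(primary.log)

    # The primary fails here. Assumption: its log is still readable.
    cold_state = {}
    cold_work = replay(cold_state, primary.log)              # rebuild everything
    warm_work = replay(warm_state, primary.log[warm_pos:])   # redo work since checkpoint
    hot_work = 0                                             # already in lock step

    assert cold_state == warm_state == hot_state == primary.state
    return {"cold": cold_work, "warm": warm_work, "hot": hot_work}
```

With ten updates and a checkpoint every four updates, the cold standby replays the whole log, the warm standby replays only the updates made after the last checkpoint, and the hot standby replays nothing. The price of that ordering is paid during normal operation: no synchronization for cold, periodic copies for warm, and a mirrored write on every update for hot.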

Characteristics of Fault Tolerance

  • The basic characteristics of fault tolerance require:
  • No single point of failure
  • No single point of repair
  • Fault isolation to the failing component
  • Fault containment to prevent propagation of the failure
  • Availability of reversion modes

Fault Classifications

  • Based on duration, faults can be classified as transient or permanent. A transient fault will eventually disappear without any apparent intervention, whereas a permanent one will remain unless it is removed by some external agency. While it may seem that permanent faults are more severe, from an engineering perspective they are much easier to diagnose and handle. A particularly problematic type of transient fault is the intermittent fault, which recurs, often unpredictably.
  • A different way to classify faults is by their underlying cause. Design faults are the result of design failures, like our coding example above. While it may appear that in a carefully designed system all such faults should be eliminated through fault prevention, this is usually not realistic in practice. For this reason, many fault-tolerant systems are built on the assumption that design faults are inevitable and that mechanisms need to be put in place to protect the system against them. Operational faults, on the other hand, are faults that occur during the lifetime of the system and are invariably due to physical causes, such as processor failures or disk crashes.
  • Finally, based on how a failed component behaves once it has failed, faults can be classified into the following categories:
  • Crash faults -- the component either completely stops operating or never returns to a valid state;
  • Omission faults -- the component completely fails to perform its service;
  • Timing faults -- the component does not complete its service on time;
  • Byzantine faults -- these are faults of an arbitrary nature.
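
The behavioral categories above can be sketched as a small classifier over what a monitor observes about one request. This is only an illustration under stated assumptions: the `Observation` fields, the `classify` function, and the use of a known expected value are hypothetical, and a wrong reply is just one visible symptom of arbitrary (Byzantine) behavior, which in general cannot be detected by a single observer.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Fault(Enum):
    NONE = "none"
    CRASH = "crash"
    OMISSION = "omission"
    TIMING = "timing"
    BYZANTINE = "byzantine"


@dataclass
class Observation:
    """Hypothetical view a monitor has of one request to a component."""
    alive: bool                 # does the component still respond to heartbeats?
    reply: Optional[int]        # value returned, or None if no reply arrived
    latency_s: Optional[float]  # time taken, or None if no reply arrived


def classify(obs: Observation, expected: int, deadline_s: float) -> Fault:
    """Map one observation onto the behavioral fault categories."""
    if not obs.alive:
        return Fault.CRASH      # component has stopped operating
    if obs.reply is None:
        return Fault.OMISSION   # still running, but the service was not performed
    if obs.reply != expected:
        return Fault.BYZANTINE  # arbitrary behavior, here seen as a wrong value
    if obs.latency_s is not None and obs.latency_s > deadline_s:
        return Fault.TIMING     # correct result, delivered too late
    return Fault.NONE
```

The order of the checks matters in this sketch: a component that is not alive is reported as crashed even though it also omitted its reply, which reflects the usual view that each category is more general than the one before it.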

Fault Tolerance

• Checkpointing is a fault tolerance technique widely used in various types of computer systems. An important issue in checkpointing is how to achieve a good trade-off between the recovery cost and the system performance. Excessive checkpointing results in performance degradation because of the costly I/O operations performed during each checkpoint. Equidistant and equicost are two well-known checkpointing strategies for addressing this issue.
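
The trade-off can be made concrete with a simple first-order cost model for checkpoints taken at a fixed interval. The model, the parameter names `checkpoint_cost_s` and `mtbf_s`, and the square-root rule (commonly attributed to Young) are assumptions brought in for illustration; they are not taken from these notes and say nothing about how the equicost strategy chooses its checkpoints.

```python
import math


def overhead_fraction(interval_s, checkpoint_cost_s, mtbf_s):
    """Approximate fraction of run time lost under fixed-interval checkpointing.

    First term: time spent writing checkpoints.
    Second term: expected rework after a failure, assuming that on average
    half an interval of work is lost per failure.
    """
    return checkpoint_cost_s / interval_s + interval_s / (2.0 * mtbf_s)


def optimal_interval(checkpoint_cost_s, mtbf_s):
    """Interval that minimizes overhead_fraction (first-order approximation)."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)
```

For example, with a checkpoint that costs 2 seconds and a mean time between failures of one hour, the model suggests checkpointing roughly every 120 seconds. Halving or doubling that interval raises the estimated overhead, once because of extra I/O and once because of extra rework, which is the balance the paragraph above describes.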

Common Transaction

• Known database (typically one)

• Bounded duration (compared to long transactions)

• Few or no interactions with other concurrent events

• ACID properties easy to achieve
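
A short transaction against a single known database is the case where ACID behavior comes almost for free from the database engine. The sketch below uses Python's standard `sqlite3` module with an in-memory database; the `accounts` table, the `transfer` function, and the balances are made up for the example.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move `amount` from `src` to `dst` as one atomic transaction."""
    with conn:  # commits on success, rolls back if an exception escapes the block
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
        )
        (balance,) = conn.execute(
            "SELECT balance FROM accounts WHERE name = ?", (src,)
        ).fetchone()
        if balance < 0:
            raise ValueError("insufficient funds")  # triggers the rollback
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
        )


def demo():
    conn = sqlite3.connect(":memory:")  # one known database
    conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
    conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("a", 100), ("b", 0)])
    conn.commit()

    transfer(conn, "a", "b", 30)       # succeeds and is committed
    try:
        transfer(conn, "a", "b", 500)  # fails; the partial debit is undone
    except ValueError:
        pass
    return dict(conn.execute("SELECT name, balance FROM accounts"))
```

The failed transfer leaves no trace: the debit made before the check is rolled back together with the rest of the transaction, so only the first transfer is visible afterwards.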