Fault tolerance is central to distributed system design and covers a range of methodologies. This paper investigates the types of faults and the fault tolerance techniques used in many real time distributed systems. Faults can be detected and recovered from by many techniques, and an appropriate fault detector can avoid losses caused by a system crash or any other kind of failure. The paper provides a framework for detecting faults in a real time system; detected faults are then handled and processed further by a coordinator.
Keywords: Fault Tolerance, Fault Detection, Real Time Distributed System, Processor faults, Coordinator, Actuators
When multiple instances of an application run on several machines and one of the servers goes down, an autonomic fault tolerance technique is needed to handle such faults. Distributed computing systems consist of a variety of hardware and software components. Failure of any of these components can lead to unpredictable system behavior, so that the availability and reliability of critical services can no longer be guaranteed. A failure occurs when a hardware component breaks and needs replacement, when a node or processor halts or is forced to reboot, or when software fails to complete its run. Fault tolerance is the property of a system that continues to work even when faults are present in it.
When a fault in a real time distributed system is not detected and recovered from in time, it results in failure of the system. A task running on a real time distributed system should be feasible, reliable and scalable. Real time distributed systems such as nuclear systems, robotics, air traffic control systems and grids depend heavily on deadlines. A fault in such a system can be detected by applying a reliable fault detector, followed by some recovery technique. These systems must function with high availability even under software and hardware faults. Hardware fault tolerance is achieved by adding extra hardware such as processors, communication links and resources (memory, I/O devices), whereas in software fault tolerance extra tasks and messages are added to the system to deal with faults. The main aim of fault tolerant distributed computing is to respond properly to these faults when they occur and to make the system more dependable by increasing its reliability.
A real-time computer controller is typically connected to sensors that provide readings at periodic intervals, and the computer must respond by sending signals to actuators. Unexpected or irregular events may also occur, and these must receive a response as well. In all cases there is a time bound within which the response must be delivered. The ability of the computer to meet these demands depends on its capacity to perform the necessary computations in the given time. The paper is organized as follows: basic concepts and types of faults, along with the failure behavior of systems, are described as background in section 4; related work is described in section 5; the proposed fault tolerance framework is presented in section 6; the methodology of the framework is described in section 7; and conclusions and future work are given in section 8.
A fault tolerant system is closely related to the concept of a dependable system, and for a system to be dependable it must be available, reliable and secure. The following terminology is closely related to the dependability of a system and its behavior. A fault can be defined as a defect at the lowest level of abstraction, and the faults in a system can be of the following types.
Processor faults (Node Faults): Processor faults occur when the processor behaves in an unexpected manner.
They can be classified into three types:
a) Fail-Stop: Here a processor is either active and participating in distributed protocols, or has totally failed and will never respond. In this case the neighboring processors can detect the failed processor.
b) Slowdown: Here a processor might run in a degraded fashion or might totally fail.
c) Byzantine: Here a processor can fail, run in a degraded fashion for some time, or execute at normal speed but try to make the computation fail.
Network faults: These faults occur in a network due to network partition, packet loss, packet corruption, destination failure, link failure, etc.
Physical faults: These faults occur in hardware, such as faults in CPUs, memory, storage, etc.
Media faults: These faults occur due to media head crashes.
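The remark that neighboring processors can detect a fail-stop processor can be illustrated with a heartbeat timeout. The following is a minimal sketch, not part of the framework described in this paper: the class name `HeartbeatMonitor`, its methods and the timeout value are hypothetical, and the check is only meaningful under the fail-stop assumption with bounded message delay, since a timeout alone cannot distinguish a crashed node from a slow one.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class HeartbeatMonitor:
    """Suspects a neighbour of fail-stop once its heartbeats stop arriving.

    Hypothetical helper for illustration; `timeout` is an assumed bound on
    the silence tolerated from a healthy neighbour, in seconds.
    """
    timeout: float
    last_seen: Dict[str, float] = field(default_factory=dict)

    def record_heartbeat(self, node: str, now: float) -> None:
        # Remember the most recent time a heartbeat from `node` was received.
        self.last_seen[node] = now

    def suspected_failed(self, now: float) -> List[str]:
        # A node is suspected when it has been silent for longer than `timeout`.
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout
        )
```

Passing the current time in explicitly, rather than reading a clock inside the class, keeps the sketch deterministic and easy to exercise with simulated timestamps.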
The occurrence of a fault in a system cannot be predicted, and even a small change or failure in the system can have a tremendous effect. Fault tolerance is therefore essential to make transaction processing more reliable and to achieve correct results even in the presence of faults.
Many authors have proposed important methods for tolerating faults in various systems. According to Alain Girault et al., the Algorithm Architecture Adequation (AAA) method automatically generates static code for real time distributed embedded systems. This method is mainly used for processor failures with fail-stop behavior. Luo et al. mention that TERCOS and DEBUS are the best approaches for exploiting redundancy in fault-tolerant real-time distributed systems. Girault et al. mention that processor and communication link failures can be tolerated by using an offline scheduling technique that generates a fault-tolerant distributed schedule. Job scheduling is one of the methods used in grid computing for scheduling tasks. Faults that occur in loosely coupled job scheduling can be tolerated with a job replication scheme, so that jobs are executed efficiently and reliably. In programming asynchronous multiprocessing systems, the customary approach has been to make process synchronization independent of the execution rates of any components, which means that a synchronous algorithm is required in which one process must wait for another to do something before it proceeds. These time-independent algorithms cannot be fault-tolerant because a process could fail by doing nothing, and such a failure manifests itself only as a reduction of the process's execution rate. Leslie Lamport has made the additional assumption that the clocks are synchronized to keep approximately the same absolute time, in order to show how they can be used to solve the synchronization problems that occur in distributed systems; the use of clocks allows acknowledgement messages to be eliminated. If a distributed system is really a single system, then its processes must be synchronized in some way. Conceptually, the easiest way to synchronize processes is to get them to do the same work at the same time. In that paper they have also implemented a kernel that performs the necessary synchronization, i.e. making sure that two different processes do not try to modify a file at the same time (Figure 1).
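The idea that synchronized clocks can replace acknowledgement messages can be illustrated with a small timing check. The sketch below rests on two assumptions that are not stated in this paper: a known upper bound on message delay (`max_delay`) and a known upper bound on the difference between any two clocks (`max_skew`). Under those assumptions, a message stamped with send time T should have arrived by T + max_delay + max_skew on the receiver's clock, so silence past that point is itself informative. The function names are hypothetical.

```python
def latest_arrival(send_time: float, max_delay: float, max_skew: float) -> float:
    """Latest receiver-clock time at which a message stamped `send_time`
    can still arrive, given the assumed delay and clock-skew bounds."""
    return send_time + max_delay + max_skew


def can_conclude_not_sent(send_time: float, local_now: float, received: bool,
                          max_delay: float, max_skew: float) -> bool:
    """True when the receiver may conclude, without any acknowledgement
    traffic, that no message was sent at `send_time` (or that the sender
    failed before sending it)."""
    if received:
        return False
    return local_now > latest_arrival(send_time, max_delay, max_skew)
```

If either bound is violated in practice, the conclusion drawn by `can_conclude_not_sent` is no longer sound, which is why both are labeled as assumptions here.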
A fault detector is used to detect faults in a system, and it runs on each node in user space. The fault detector monitors all the systems, whose activities are classified into two classes:
a) Normal type and
b) Anomaly type
The fault detector checks whether each activity is of the normal type or of the anomaly type. If an activity is of the anomaly type, the fault detector raises an alarm and the next step is handled by the coordinator.
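The classification step described above can be sketched as follows. This is an illustrative outline only: the names `ActivityClass` and `FaultDetector`, the representation of an activity as a dictionary, and the anomaly predicate supplied by the caller are all assumptions, since the paper does not specify how an activity is judged anomalous.

```python
from enum import Enum
from typing import Callable, Dict


class ActivityClass(Enum):
    NORMAL = "normal"
    ANOMALY = "anomaly"


class FaultDetector:
    """Per-node detector that classifies activities and raises alarms.

    `is_anomalous` and `raise_alarm` are caller-supplied hooks (assumed
    interfaces): the first decides the class of an activity, the second
    forwards an alarm towards the coordinator.
    """

    def __init__(self, node_id: str,
                 is_anomalous: Callable[[Dict], bool],
                 raise_alarm: Callable[[str, Dict], None]) -> None:
        self.node_id = node_id
        self.is_anomalous = is_anomalous
        self.raise_alarm = raise_alarm

    def observe(self, activity: Dict) -> ActivityClass:
        # Anomalous activities trigger an alarm; normal ones pass through.
        if self.is_anomalous(activity):
            self.raise_alarm(self.node_id, activity)
            return ActivityClass.ANOMALY
        return ActivityClass.NORMAL
```

For example, a caller might pass `lambda a: a["cpu_load"] > 0.95` as the predicate and a function that appends to a queue as the alarm hook; both choices are placeholders for whatever policy a real deployment uses.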
After a fault is detected in a system, the coordinator is responsible for carrying out the further tasks. The coordinator receives the alarm from the fault detector and must then be able to take corrective action within a certain period of time. Finally, all faults are recorded in the fault log table [9,10].
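The coordinator's duties of acting within a time bound and recording the outcome can be sketched as below. The names `Coordinator` and `FaultLogEntry`, the fields of a log entry and the `deadline_s` parameter are hypothetical; the paper does not define the layout of the fault log table. Note also that this sketch only measures whether the corrective action met the deadline after the fact; it does not preempt an action that runs too long.

```python
import time
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class FaultLogEntry:
    node_id: str
    description: str
    elapsed_s: float
    handled_in_time: bool


class Coordinator:
    """Receives alarms, runs a corrective action and logs the result."""

    def __init__(self, corrective_action: Callable[[str], None],
                 deadline_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.corrective_action = corrective_action
        self.deadline_s = deadline_s      # assumed time bound for recovery
        self.clock = clock                # injectable for simulation
        self.fault_log: List[FaultLogEntry] = []

    def on_alarm(self, node_id: str, description: str) -> FaultLogEntry:
        start = self.clock()
        self.corrective_action(node_id)
        elapsed = self.clock() - start
        entry = FaultLogEntry(node_id, description, elapsed,
                              elapsed <= self.deadline_s)
        self.fault_log.append(entry)      # every fault ends up in the log
        return entry
```

The clock is passed in as a parameter so that the same code can be driven by simulated time when the framework is evaluated.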
Monitoring is performed by the fault detector, whose task is to take input from the systems. Once input is received, the fault detector checks whether any type of fault is present. If a fault is found, its occurrence is reported to the coordinator, which performs the corrective task; if no fault is present, the system carries on with its further tasks.
This process is activated as soon as a fault is present in a system. Once a fault is detected, the coordinator receives an alarm and becomes responsible for handling the fault. To handle it, the coordinator searches for a new node with sufficient resources. It then requests that node to perform the task and halts the process running on the faulty node.
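The handling steps above (find a node with sufficient resources, hand the task over, halt the process on the faulty node) can be sketched as follows. The data model is an assumption made for illustration: the paper does not say which resources are compared or how a node is chosen among several candidates, so this sketch uses free CPU and memory and prefers the node with the most free capacity.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Node:
    name: str
    free_cpu: float
    free_mem_mb: int
    healthy: bool = True


@dataclass
class Task:
    name: str
    cpu: float
    mem_mb: int


def pick_replacement(nodes: List[Node], task: Task, faulty: str) -> Optional[Node]:
    # Candidates are healthy nodes, other than the faulty one, that can fit the task.
    candidates = [
        n for n in nodes
        if n.healthy and n.name != faulty
        and n.free_cpu >= task.cpu and n.free_mem_mb >= task.mem_mb
    ]
    return max(candidates, key=lambda n: (n.free_cpu, n.free_mem_mb), default=None)


def handle_fault(nodes: List[Node], placement: Dict[str, str],
                 task: Task, faulty: str) -> Optional[str]:
    """Move `task` off the faulty node; returns the new node name or None."""
    target = pick_replacement(nodes, task, faulty)
    if target is None:
        return None                       # no node has sufficient resources
    target.free_cpu -= task.cpu           # reserve resources on the new node
    target.free_mem_mb -= task.mem_mb
    placement[task.name] = target.name    # new node now performs the task
    for n in nodes:
        if n.name == faulty:
            n.healthy = False             # stands in for halting the process
    return target.name
```

When no candidate exists the function returns `None` and leaves the placement unchanged; what the coordinator should do in that case is not covered by the description above.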
Fault tolerance has been an important issue from the beginning. Considerable research has been carried out to solve it, yet it remains a prominent issue in this area. This paper reviews existing approaches to the problem of fault tolerance. To address the issue, a framework has been designed along with its working mechanism. Furthermore, this model still needs to be validated, which can be performed by using the simulation