Framework for Detecting Fault in Real Time Distributed System

When multiple instances of an application are running on several machines and one of the server goes down, there is a need to implement an autonomic fault tolerance technique that can handle these types of faults. Distributed Computing Systems consists of variety of hardware and software components. Failure of any of these components can lead to unpredicted behavior of system which results in failure to guarantee availability and reliability of critical services [1]. A failure occurs when a hardware component is broken and needs replacement or a node/processor is halted or forced to reboot; or software has failed to complete its run. Fault tolerance is the property of a system, where system tends to work even in case of fault present in system.


Introduction
When multiple instances of an application are running on several machines and one of the server goes down, there is a need to implement an autonomic fault tolerance technique that can handle these types of faults. Distributed Computing Systems consists of variety of hardware and software components. Failure of any of these components can lead to unpredicted behavior of system which results in failure to guarantee availability and reliability of critical services [1]. A failure occurs when a hardware component is broken and needs replacement or a node/processor is halted or forced to reboot; or software has failed to complete its run. Fault tolerance is the property of a system, where system tends to work even in case of fault present in system.
When a fault present in real time distributed system is not detected and recovered properly on time then it results into failure of system. A task running on real time distributed system should be feasible, reliable and scalable, real time distributed systems like nuclear systems, robotics, air traffic control systems, grid etc. are highly dependable on deadline. Fault present in system can be detected by applying reliable fault detector followed by some recovery technique. These systems must function with high availability even under software and hardware faults. Hardware fault-tolerance is achieved through applying extra hardware like processors, communication links, resource (memory, I/O device) whereas in software fault tolerance tasks, messages are added into the system to deal with faults. The main aim of fault tolerant distributed computing is to provide proper solutions to these system faults upon their occurrence and make the system more dependable by increasing its reliability.
Real-time computer controller is typically provided with sensors which will provide readings at periodic intervals and the computer must respond by sending signals to actuators. There may be unexpected or irregular events and these must also receive a response. In all cases, there will be a time-bound within which the response should be delivered. The ability of the computer to meet these demands depends on its capacity to perform the necessary computations in the given time. The paper is organized as follow: Basic Concepts and Types of Fault is described in section 4 as background along with behavior of failure systems, related works is described in section 5, The proposed model or framework of fault tolerance in section 6, methodologies of framework is described in section 7 and finally we conclude with conclusion and future works in section 8.

Basic concepts and types of fault
Basically system to be fault tolerant is much more similar with concept of dependable system, and any system to be dependable it must be available, reliable and secure. Following are few terminologies that are very closely related to dependability of system and its behavior. Fault can be termed as defect at the lowest level of abstraction and there can be different failure in a system they can be of following types [2].

Trends in Technical & Scientific Research
a) Fail-Stop: Here a processor can both be active and participate in distribute protocols or is totally failed and will never respond. In this case the neighboring processors can detect the failed processor. b) Slowdown: Here a processor might run in degraded fashion or might totally fail.

c)
Byzantine: Here a processor can fail, run in degraded fashion for some time or execute at normal speed but tries to fail the computation.

Motivation
The occurrence of fault in a system cannot be predicted and even small changes or failure in system can lead to tremendous effect. So, in order to make processing of transaction more reliable for achieving better outcome of result even in presence of fault, need of fault tolerance is essential which can avoid faulty system.

Objectives
To provide safety to the real-time system that is a system must be designed in a way that if it is not working correctly, it will fail in very safe manner.

Trends in Technical & Scientific Research
There are some important methods for tolerating fault in various systems given by many authors in their research. According to Alain Girault et al. [3]the Algorithm Architecture Adequation (AAA) method will generate a static code automatically for real time distributed embedded system. This method basically used for processor failure with fail stop behavior. Luo et al. [4]mention that TERCOS and DEBUS are the best approaches used to exploiting redundancies in Fault-Tolerant and Real-Time Distributed Systems. Girault et al. [3]mention that processor and communication link failure can be tolerated by using offline scheduling technique and generate a fault tolerate distributed schedule. Job scheduling is one of the method in grid computing for scheduling a task. The fault can be occur in loosely coupled job scheduling with job replication scheme such that jobs are efficiently and reliably executed can be tolerated [5]. In programming asynchronous multiprocessing systems, the customary approach has been to make process synchronization independent of the execution rates of any components which means that synchronous algorithm is required in which one process must wait for another to do something before it processed ahead. These time-independent algorithms cannot be fault-tolerant because a process could fail by doing nothing, and such a failure manifests itself only as a reduction of the process's execution rate [6]. Leslie Lamport [7]has made an additional assumption that the clock are synchronized to keep approximately the same absolute time in order to show how they can be used in solving the synchronization problems that occurs in distributed systems and use of clock allows elimination of acknowledgement message. If distributed system is really single system, then the processes must be synchronized in same way. Conceptually, the most easiest way to synchronize any process is just to get them to do the same work at same time. In this paper also they have implemented a kernal that performs a necessary synchronization i.e. making sure that two different process do not try to modify file at same time (Figure 1).

Framework
The proposed system consists of two major components:

Fault detector
Fault detector is used to detect fault in a system and it runs on each node in the user space. Fault Detector monitors all the systems where system activities are classified into two of classes: a) Normal type and b) Anomalies type Here system detector checks whether the activities are of normal type or is of anomalies type. If the activity happens to be of anomalies type then the Fault Detector sets an Alarm and next step is handled by coordinator [8].

Coordinator
After detection of fault in a system, it is coordinator which is responsible to carry out the further tasks. Once the fault is detected in a system, coordinator gets the alarm from fault detector and after getting those alarm, the coordinator must be able to take corrective action within certain period of time. Finally, all records are supposed to be recorded in the fault log table [9,10].

Method Development
Detecting and handling the Fault in the system is carried out in the following 2 phases (Figure 2

Monitoring phase
Monitoring task is performed by fault detector where task of fault detector is to take input from systems. Once input is set in a system then fault detector detects whether there is any types of fault present or not. If fault is found then the system send those occurrence of fault to the coordinator where corrective task is supposed to be performed by coordinator, but if there is not any types of fault present in system then further task is supposed to be carried out by system.

Corrective action phase
This process is activated as soon as there is presence of fault in a system. After detecting any sorts of fault in a system, the coordinator gets alarm from the system once fault is detected. After that coordinator isresponsible for handling those fault. For handling the fault, the coordinator searches for the new node with sufficient resources. It then requests that node to perform the task and halts the process running is the fault node.

Conclusion
Fault tolerance has been an important issue from the beginning of the phase. Several researches have been carried out for solving this issue. Still it is a prominent issue in this area. This paper deals with existing approaches to solve the problem of fault tolerance. To address this issue, a framework has been designed with its working mechanism. Furthermore, this model requires to be validated which can be performed by using the simulation process.