A Problem-Specific Fault-Tolerance Mechanism for Asynchronous, Distributed Systems

Computer Science – Distributed, Parallel, and Cluster Computing

Scientific paper

Details

17 pages, 6 figures

The idle computers on a local-area, campus-area, or even wide-area network represent a significant computational resource, but one that is also unreliable, heterogeneous, and opportunistic. This type of resource has been used effectively for embarrassingly parallel problems but not for more tightly coupled problems. We describe an algorithm that allows branch-and-bound problems to be solved in such environments. In designing this algorithm, we faced two challenges: (1) scalability, to effectively exploit the variably sized pools of resources available, and (2) fault tolerance, to ensure the reliability of services. We achieve scalability through a fully decentralized algorithm that uses a membership protocol to manage dynamically available resources. However, this fully decentralized design makes achieving reliability even more challenging. We guarantee fault tolerance in the sense that the loss of up to all but one resource will not affect the quality of the solution. To propagate information efficiently, we use epidemic communication for both the membership protocol and the fault-tolerance mechanism. We have developed a simulation framework that allows us to evaluate design alternatives. Results obtained in this framework suggest that our techniques can execute scalably and reliably.
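
The abstract does not give the algorithm's details, so the following is only a minimal sketch of the general idea of epidemic (gossip) communication in a decentralized setting, not the paper's actual protocol. It assumes a push-pull exchange in which two nodes merge their best-known solution cost and their membership views. All names here (Node, exchange, gossip_round, run_until_converged, make_ring) are hypothetical and chosen for illustration. The sketch covers only the spread of information; it does not model the recovery of lost branch-and-bound work that the paper's fault-tolerance mechanism addresses.

```python
import random


class Node:
    """One participant holding a best-known bound and a membership view."""

    def __init__(self, node_id, bound, view):
        self.node_id = node_id
        self.best_bound = bound   # lowest solution cost seen so far (assumed minimization)
        self.view = set(view)     # ids of peers this node believes exist
        self.alive = True


def exchange(a, b):
    """Push-pull exchange: both nodes end up with the merged state."""
    best = min(a.best_bound, b.best_bound)
    a.best_bound = b.best_bound = best
    merged = a.view | b.view | {a.node_id, b.node_id}
    a.view = set(merged)
    b.view = set(merged)


def gossip_round(nodes, rng):
    """Each live node contacts one random peer from its own view."""
    for node in nodes.values():
        if not node.alive:
            continue
        peers = sorted(p for p in node.view if p != node.node_id)
        if not peers:
            continue
        peer = nodes[rng.choice(peers)]
        if peer.alive:            # a message to a crashed peer is simply lost
            exchange(node, peer)


def run_until_converged(nodes, rng, max_rounds=1000):
    """Gossip until all live nodes agree on the bound; return the round count."""
    for round_no in range(1, max_rounds + 1):
        gossip_round(nodes, rng)
        live = [n for n in nodes.values() if n.alive]
        if len({n.best_bound for n in live}) == 1:
            return round_no
    return None                   # did not converge within max_rounds


def make_ring(bounds):
    """Start each node knowing only its successor, so views must grow by gossip."""
    n = len(bounds)
    return {i: Node(i, b, view=[(i + 1) % n]) for i, b in enumerate(bounds)}


if __name__ == "__main__":
    rng = random.Random(0)
    nodes = make_ring([42, 57, 63, 71, 88, 90, 95, 99])
    print("rounds to converge:", run_until_converged(nodes, rng))
    print("bounds:", [n.best_bound for n in nodes.values()])
```

Once the best bound has reached every node, any single survivor still holds it, which is the intuition behind tolerating the loss of all but one resource. No node in this sketch has a global coordinator role or a complete initial membership list, which mirrors the fully decentralized design the abstract describes.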
