As HLA federations are deployed in increasingly distributed environments, there is a growing need to operate in a less than perfect world. A number of extensions that add fault tolerance support to HLA have been suggested and accepted as part of HLA Evolved. Two types of faults are introduced: “federate lost” as seen from the federation and “connection lost” as seen from the federate. This paper describes some of the potential and the limitations of this approach to fault tolerance.
Fault tolerance in a federation must be addressed early in the Federation Development Process (FEDEP). During federation agreement, and throughout federation and federate design and implementation, the level of fault tolerance must also be related to the purpose and goal of the federation. For example, it is necessary to understand what constitutes a valid federation when federates are lost, what procedures are used to determine this, and how to recover. Fault tolerance requirements may also vary between training and analysis federations.
The paper presents a number of design patterns for fault tolerance in federations, for example the required federation subset, the optional federation, the fault monitoring federate, the reoccurring federate, the spontaneous federation, the fail-over federate and the fail-over RTI.
For federates, a number of design patterns for different operations can be implemented to support fault tolerance. Some of these are fault tolerant updates, regular reconnection attempts, fault tolerant save and failure monitoring.
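As an illustration of the regular reconnection attempts pattern, the following is a minimal sketch rather than the approach described in the paper. It assumes the federate wraps its connect-and-join logic in a callable that raises `ConnectionError` on failure. The names `reconnect_with_retries`, `connect`, `max_attempts` and `delay_seconds` are hypothetical and are not part of the HLA Evolved API.

```python
import time
from typing import Callable


def reconnect_with_retries(
    connect: Callable[[], None],
    max_attempts: int = 5,
    delay_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call connect() at regular intervals until it succeeds.

    Returns the number of attempts used. Re-raises the last
    ConnectionError if every attempt fails.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            # Hypothetical federate-supplied callable, e.g. connect to
            # the RTI and rejoin the federation execution.
            connect()
            return attempt
        except ConnectionError:
            if attempt == max_attempts:
                raise  # give up and let the caller decide how to recover
            sleep(delay_seconds)  # wait before the next regular attempt
    raise AssertionError("unreachable")
```

A federate could call such a helper from its handler for the “connection lost” fault. The `sleep` parameter is injectable so that the retry interval can be replaced in tests or tied to a scheduler instead of blocking the calling thread.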
The resynchronization of a federation is an important issue, and some aspects of rejoining a federation are discussed. The approach to resynchronization needs to consider both technical and scenario management aspects so that the federation can resynchronize at a relevant and convenient time.
Authors: Björn Möller, Mikael Karlsson, Björn Löfstrand