FORM 4.3
Macros
#define CACHED_SNAPSHOT
#define CACHE_SIZE 4096
#define R_FREE(ARG) if ( ARG ) M_free(ARG, #ARG);
#define R_FREE_NAMETREE(ARG)
#define R_FREE_STREAM(ARG)
#define R_SET(VAR, TYPE) VAR = *((TYPE*)p); p = (unsigned char*)p + sizeof(TYPE);
#define R_COPY_B(VAR, SIZE, CAST)
#define S_WRITE_B(BUF, LEN) if ( fwrite_cached(BUF, 1, LEN, fd) != (size_t)(LEN) ) return(__LINE__);
#define S_FLUSH_B if ( flush_cache(fd) != 1 ) return(__LINE__);
#define R_COPY_S(VAR, CAST)
#define S_WRITE_S(STR)
#define R_COPY_LIST(ARG)
#define S_WRITE_LIST(LST)
#define R_COPY_NAMETREE(ARG)
#define S_WRITE_NAMETREE(ARG)
#define S_WRITE_DOLLAR(ARG)
#define ANNOUNCE(str)
Functions
int CheckRecoveryFile ()
void DeleteRecoveryFile ()
char * RecoveryFilename ()
void InitRecovery ()
size_t fwrite_cached (const void *ptr, size_t size, size_t nmemb, FILE *fd)
size_t flush_cache (FILE *fd)
int DoRecovery (int *moduletype)
void DoCheckpoint (int moduletype)
Variables
unsigned char cache_buffer [CACHE_SIZE]
size_t cache_fill = 0
Contains all functions that deal with the recovery mechanism controlled and activated by the On Checkpoint switch.
The main functions are DoCheckpoint, DoRecovery, and DoSnapshot. If checkpoints are activated, DoCheckpoint is called every time a module has finished executing. If the conditions for creating a recovery snapshot are met, DoCheckpoint calls DoSnapshot. DoRecovery is called once when FORM starts up with the command line argument -R. Most of the other code consists of debugging facilities that are only compiled if the macro PRINTDEBUG is defined.
The recovery mechanism is atomic: the final recovery file is created (overwriting the older one) in a single step (copying), and only if everything went well. If an error occurs, a warning is issued and the program continues without having created a new recovery file. The only situation in which creating the recovery data terminates the running program is when not enough disk or memory space is left.
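The all-or-nothing behaviour can be illustrated with a small sketch. It is not the code in checkpoint.c: the helper name `write_snapshot_atomically` and its file name parameters are hypothetical, and the final step uses `rename()` where the description above speaks of copying.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not part of checkpoint.c.
 * Writes len bytes to tmpname and replaces finalname only if every
 * earlier step succeeded. Returns 0 on success, non-zero on failure;
 * on failure the previous finalname (if any) is left untouched. */
int write_snapshot_atomically(const char *tmpname, const char *finalname,
                              const void *data, size_t len)
{
    FILE *fd;
    assert(tmpname != NULL && finalname != NULL && data != NULL);

    fd = fopen(tmpname, "wb");
    if ( fd == NULL ) return 1;

    if ( fwrite(data, 1, len, fd) != len ) {
        fclose(fd);
        remove(tmpname);          /* discard the incomplete snapshot */
        return 2;
    }
    if ( fclose(fd) != 0 ) {      /* fclose flushes; it can fail on a full disk */
        remove(tmpname);
        return 3;
    }
    /* Single final step. Assumption: POSIX semantics, where rename()
     * replaces an existing target; other platforms may refuse to
     * overwrite and would need a different last step. */
    if ( rename(tmpname, finalname) != 0 ) {
        remove(tmpname);
        return 4;
    }
    return 0;
}
```

A caller that receives a non-zero value can issue a warning and carry on, since the older recovery file is still in place.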
In ParFORM, each slave creates its own recovery file, sends it to the master, and then deletes its local copy. The master stores all the recovery files and, on recovery, feeds them back to the slaves. Because recovering after an MPI fault is nearly impossible, ParFORM terminates on any recovery failure.
DoRecovery and DoSnapshot do the loading and saving of the recovery data, respectively. Every change in one function needs to be accompanied by the corresponding change in the other. The structure of both functions is quite similar: they handle the relevant global structs one after the other and then take care of copying the hide and scratch files.
The names of the recovery, scratch and hide files are hard-coded in the variables in fold "filenames and system commands".
If the global structs AM,AP,AC,AR are changed, DoRecovery and DoSnapshot usually also have to be changed. Some structs are read/written as a whole (AP,AC), some are read/written only partly as a selection of their individual elements (AM,AR). If AM or AR have been changed by adding or removing an element that is important for the runtime status, then the reading/writing statements have to be added to or removed from DoRecovery and DoSnapshot. If AP or AC are changed, then for non-pointer variables (in the case of a struct it also means that none of its elements is a pointer) nothing has to be changed in the functions here. If pointers are involved, extra code has to be added (or removed). See the comments of DoRecovery and DoSnapshot.
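The difference between pointer-free data and pointer members can be sketched with a toy save/load pair. Everything here is hypothetical (the struct `demo_state` and the functions `demo_save` and `demo_load` are not FORM code); the sketch only shows why a pointer member needs matching extra code on both sides.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a global struct that owns a string. */
typedef struct {
    int   level;   /* plain value: covered by writing the struct as a whole */
    char *name;    /* pointer member: needs extra code when saving and loading */
} demo_state;

/* Saving side: the struct as a whole, then the data the pointer refers to. */
int demo_save(const demo_state *s, FILE *fd)
{
    size_t len;
    assert(s != NULL && s->name != NULL);
    len = strlen(s->name) + 1;
    if ( fwrite(s, sizeof *s, 1, fd) != 1 ) return 1;
    if ( fwrite(&len, sizeof len, 1, fd) != 1 ) return 2;
    if ( fwrite(s->name, 1, len, fd) != len ) return 3;
    return 0;
}

/* Loading side: mirrors demo_save step by step. */
int demo_load(demo_state *s, FILE *fd)
{
    size_t len;
    if ( fread(s, sizeof *s, 1, fd) != 1 ) return 1;
    s->name = NULL;               /* the stored pointer value is meaningless now */
    if ( fread(&len, sizeof len, 1, fd) != 1 ) return 2;
    s->name = malloc(len);
    if ( s->name == NULL ) return 3;
    if ( fread(s->name, 1, len, fd) != len ) {
        free(s->name);
        s->name = NULL;
        return 4;
    }
    return 0;
}
```

Adding a second pointer member to `demo_state` would require one more write in `demo_save` and one more allocate-and-read in `demo_load`, while adding another `int` would require no change at all.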
Definition in file checkpoint.c.
#define CACHED_SNAPSHOT
Definition at line 1213 of file checkpoint.c.
#define CACHE_SIZE 4096
Definition at line 1215 of file checkpoint.c.
#define R_FREE(ARG) if ( ARG ) M_free(ARG, #ARG);
Definition at line 1280 of file checkpoint.c.
#define R_FREE_NAMETREE(ARG)
Definition at line 1283 of file checkpoint.c.
#define R_FREE_STREAM(ARG)
Definition at line 1288 of file checkpoint.c.
#define R_SET(VAR, TYPE) VAR = *((TYPE*)p); p = (unsigned char*)p + sizeof(TYPE);
Definition at line 1295 of file checkpoint.c.
#define R_COPY_B(VAR, SIZE, CAST)
Definition at line 1300 of file checkpoint.c.
#define S_WRITE_B(BUF, LEN) if ( fwrite_cached(BUF, 1, LEN, fd) != (size_t)(LEN) ) return(__LINE__);
Definition at line 1304 of file checkpoint.c.
#define S_FLUSH_B if ( flush_cache(fd) != 1 ) return(__LINE__);
Definition at line 1307 of file checkpoint.c.
#define R_COPY_S(VAR, CAST)
Definition at line 1312 of file checkpoint.c.
#define S_WRITE_S(STR)
Definition at line 1318 of file checkpoint.c.
#define R_COPY_LIST(ARG)
Definition at line 1326 of file checkpoint.c.
#define S_WRITE_LIST(LST)
Definition at line 1331 of file checkpoint.c.
#define R_COPY_NAMETREE(ARG)
Definition at line 1338 of file checkpoint.c.
#define S_WRITE_NAMETREE(ARG)
Definition at line 1347 of file checkpoint.c.
#define S_WRITE_DOLLAR(ARG)
Definition at line 1358 of file checkpoint.c.
#define ANNOUNCE(str)
Definition at line 1369 of file checkpoint.c.
int CheckRecoveryFile ()
Checks whether a snapshot/recovery file exists. Returns 1 if it exists, 0 otherwise.
Definition at line 278 of file checkpoint.c.
References RecoveryFilename().
void DeleteRecoveryFile ()
Deletes the recovery files. It is called by CleanUp() in the case of a successful completion.
Definition at line 333 of file checkpoint.c.
char * RecoveryFilename ()
Returns pointer to recovery filename.
Definition at line 364 of file checkpoint.c.
Referenced by CheckRecoveryFile(), and DoRecovery().
void InitRecovery ()
Sets up the strings for the filenames of the recovery files. This function should be called only once, to avoid memory leaks, and only after AM.TempDir has been initialized.
Definition at line 399 of file checkpoint.c.
size_t fwrite_cached (const void *ptr, size_t size, size_t nmemb, FILE *fd)
Definition at line 1221 of file checkpoint.c.
size_t flush_cache (FILE *fd)
Definition at line 1246 of file checkpoint.c.
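Neither function carries a description here, so the following is only a guess at how a small write cache with these signatures might behave, inferred from the checks in S_WRITE_B and S_FLUSH_B. The names `demo_fwrite_cached`, `demo_flush_cache`, `demo_cache`, `demo_fill` and `DEMO_CACHE_SIZE` are hypothetical, and the actual code in checkpoint.c may differ.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define DEMO_CACHE_SIZE 4096   /* same value as CACHE_SIZE, chosen for illustration */

static unsigned char demo_cache[DEMO_CACHE_SIZE];
static size_t demo_fill = 0;   /* number of bytes currently buffered */

/* Writes any buffered bytes to fd.
 * Returns 1 on success (the value S_FLUSH_B tests for), 0 on failure. */
size_t demo_flush_cache(FILE *fd)
{
    size_t n = demo_fill;
    assert(fd != NULL);
    if ( n == 0 ) return 1;
    demo_fill = 0;
    return ( fwrite(demo_cache, 1, n, fd) == n ) ? 1 : 0;
}

/* Same calling convention as fwrite(): returns nmemb on success. */
size_t demo_fwrite_cached(const void *ptr, size_t size, size_t nmemb, FILE *fd)
{
    size_t bytes = size * nmemb;
    if ( bytes == 0 ) return 0;

    /* Not enough room left: empty the cache first to keep the byte order. */
    if ( bytes > DEMO_CACHE_SIZE - demo_fill ) {
        if ( demo_flush_cache(fd) != 1 ) return 0;
    }
    /* Larger than the whole cache: write straight through. */
    if ( bytes > DEMO_CACHE_SIZE ) {
        return fwrite(ptr, size, nmemb, fd);
    }
    memcpy(demo_cache + demo_fill, ptr, bytes);
    demo_fill += bytes;
    return nmemb;
}
```

With such a cache, many small writes reach the file in blocks of up to 4096 bytes, and a final flush is needed before the file is closed.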
int DoRecovery (int *moduletype)
Reads from the recovery file and restores all necessary variables and states in FORM, so that the execution can recommence in preprocessor() as if no restart of FORM had occurred.
The recovery file is read into memory as a whole. The pointer p then points into this memory at the next non-processed data. The macros by which variables are restored, like R_SET, automatically increase p appropriately.
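A minimal sketch of this reading pattern, using the R_SET body exactly as listed on this page. The function `r_set_demo` and its variables are hypothetical, and the buffer only stands in for a recovery file that has been read into memory.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Macro body as listed in this file's documentation. */
#define R_SET(VAR, TYPE) VAR = *((TYPE*)p); p = (unsigned char*)p + sizeof(TYPE);

/* Hypothetical demo: stores two values in a buffer and restores them
 * through p. Returns 0 if both values and the final position of p match. */
int r_set_demo(void)
{
    long saved_size  = 123456L;
    int  saved_count = 42;
    long size;
    int  count;
    int  ok;
    void *p;
    size_t total = sizeof(long) + sizeof(int);

    /* Stand-in for the in-memory copy of the recovery file.
     * The long is stored first so that both reads stay aligned. */
    unsigned char *buffer = malloc(total);
    if ( buffer == NULL ) return 1;
    memcpy(buffer, &saved_size, sizeof(long));
    memcpy(buffer + sizeof(long), &saved_count, sizeof(int));

    p = buffer;               /* p points at the next unprocessed data */
    R_SET(size, long)         /* reads a long, advances p by sizeof(long) */
    R_SET(count, int)         /* reads an int, advances p by sizeof(int) */

    ok = ( size == saved_size && count == saved_count
           && (unsigned char *)p == buffer + total );
    free(buffer);
    return ok ? 0 : 2;
}
```

Because each macro advances p by exactly the size it read, the order of the R_SET calls has to match the order in which the values were written.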
If something goes wrong, the function returns with a non-zero value.
Allocated memory that would be lost when overwriting the global structs with data from the file is freed first. A major part of the code deals with the restoration of pointers. The idiom we use is to memorize the original pointer value (org), allocate new memory and copy the data from the file into this memory, calculate the offset between the old pointer value and the new allocated memory position (ofs), and then correct all affected pointers (+=ofs).
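The org/ofs idiom can be sketched on a toy struct. The type `demo_buf` and both functions are hypothetical and not taken from checkpoint.c; the offset is computed on `uintptr_t` values, which is an assumption made here to keep the arithmetic well defined for pointers into different memory blocks.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical struct: top points into the block that buffer refers to. */
typedef struct {
    long  *buffer;
    long  *top;
    size_t size;      /* number of longs in the block */
} demo_buf;

/* b holds pointer values from snapshot time; data holds the saved contents.
 * Returns 0 on success, 1 if no memory is available. */
int relocate_demo_buf(demo_buf *b, const long *data)
{
    long *org, *fresh;
    uintptr_t ofs;
    assert(b != NULL && data != NULL);

    org = b->buffer;                              /* memorize the original pointer */
    fresh = malloc(b->size * sizeof(long));       /* allocate new memory */
    if ( fresh == NULL ) return 1;
    memcpy(fresh, data, b->size * sizeof(long));  /* copy the saved data into it */

    ofs = (uintptr_t)fresh - (uintptr_t)org;      /* offset between old and new block */
    b->buffer = fresh;
    b->top = (long *)((uintptr_t)b->top + ofs);   /* correct every affected pointer */
    return 0;
}

/* Hypothetical usage: returns 0 if top still refers to the same element. */
int relocation_demo(void)
{
    long old_block[4] = { 10, 20, 30, 40 };
    demo_buf b;
    int ok;
    b.buffer = old_block;
    b.top    = old_block + 3;
    b.size   = 4;
    if ( relocate_demo_buf(&b, old_block) != 0 ) return 1;
    ok = ( b.buffer != old_block && b.top == b.buffer + 3 && *b.top == 40 );
    free(b.buffer);
    return ok ? 0 : 2;
}
```

Every pointer that referred into the old block needs the same correction; forgetting one leaves it pointing at memory that no longer belongs to the struct.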
We rely on the fact that several variables (especially in AM) are already assigned the correct values by the startup functions. That means, in principle, that a change in the setup files between snapshot creation and recovery will be noticed.
Definition at line 1401 of file checkpoint.c.
References TaBlEs::argtail, TaBlEs::boomlijst, BrAcKeTiNfO::bracketbuffer, TaBlEs::buffers, TaBlEs::bufferssize, CopyFile(), TaBlEs::flags, ReNuMbEr::func, ReNuMbEr::funnum, VaRrEnUm::hi, BrAcKeTiNfO::indexbuffer, ReNuMbEr::indi, ReNuMbEr::indnum, VaRrEnUm::lo, TaBlEs::MaxTreeSize, TaBlEs::mm, TaBlEs::numind, TaBlEs::pattern, TaBlEs::prototype, TaBlEs::prototypeSize, RecoveryFilename(), ExPrEsSiOn::renumlists, TaBlEs::reserved, TaBlEs::spare, TaBlEs::sparse, VaRrEnUm::start, ReNuMbEr::symb, ReNuMbEr::symnum, TaBlEs::tablepointers, TimeWallClock(), TaBlEs::totind, ReNuMbEr::vecnum, and ReNuMbEr::vect.
void DoCheckpoint (int moduletype)
Checks whether a snapshot should be done. Calls DoSnapshot() to create the snapshot.
Definition at line 3108 of file checkpoint.c.
References PF_BroadcastNumber(), PF_RecvFile(), PF_SendFile(), and TimeWallClock().
unsigned char cache_buffer[CACHE_SIZE]
Definition at line 1218 of file checkpoint.c.
size_t cache_fill = 0 |
Definition at line 1219 of file checkpoint.c.