Monitoring the behavior of parallel programs: how to be scalable?
Author :
Peterschmitt, Jean-Yves; Tourancheau, Bernard; Vigouroux, Xavier-Francois (Laboratoire de l'informatique du parallélisme)
Abstract :
(eng) It is easy to find errors and inefficient parts of a sequential program by using a standard debugger or profiler, but no such tool exists for a parallel environment. The only way to study the race conditions of a parallel program is to execute it and collect data about its execution. The programmer can then use the generated trace files and specialized tuning tools to visualize and improve the behavior of the program: idle processors, communications, etc. The problem in large parallel systems is that these tools have to deal with an enormous amount of data. The classical approach to monitoring and trace analysis (i.e. sequential, event-driven, post-mortem monitoring) is no longer realistic. To avoid this bottleneck, we introduced PIMSY (Parallel Implementation of a Monitoring System). The main idea of PIMSY is to leave the trace data distributed across the parallel storage and to distribute the program (or programs) that process the trace data.
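
The idea of leaving trace data where it was produced and moving the analysis to it can be illustrated with a minimal sketch. This is not PIMSY's implementation: the event format, the `SHARDS` data, and the function names `summarize_shard` and `merge_summaries` are all hypothetical, and the shards are processed in a plain loop here, whereas in a real system each summary would presumably be computed on the node that stores the shard.

```python
from collections import Counter

# Hypothetical event format: (processor_id, event_kind, duration_in_ms).
# Each inner list stands for the trace data held in one node's local storage.
SHARDS = [
    [(0, "compute", 40), (0, "idle", 10), (0, "send", 5)],
    [(1, "compute", 30), (1, "idle", 25)],
    [(2, "idle", 5), (2, "compute", 50), (2, "idle", 5)],
]


def summarize_shard(events):
    """Reduce one local trace shard to the total time spent per event kind."""
    totals = Counter()
    for _processor, kind, duration in events:
        totals[kind] += duration
    return totals


def merge_summaries(summaries):
    """Combine the small per-shard summaries into one global summary."""
    merged = Counter()
    for summary in summaries:
        merged.update(summary)  # Counter.update adds the counts together
    return merged


# Only the compact summaries are gathered, never the raw events.
local_summaries = [summarize_shard(shard) for shard in SHARDS]
global_summary = merge_summaries(local_summaries)
```

The point of the sketch is that the data gathered centrally grows with the number of event kinds rather than with the number of recorded events, which is what a sequential post-mortem tool would have to read in full.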