BigJob is a SAGA-based pilot job implementation in Python. The Simple API for Grid Applications (SAGA) is a high-level, easy-to-use API for accessing distributed resources. Unlike other common pilot job systems, SAGA BigJob (i) natively supports MPI jobs and (ii) works on a variety of back-end systems, which reflects the general advantage of a SAGA-based approach. The following figure gives an overview of the SAGA BigJob architecture.
SAGA BigJob comprises three components: (i) the BigJob Manager, which provides the pilot job abstraction and manages the orchestration and scheduling of BigJobs (which in turn allows the management of both bigjob objects and sub-jobs); (ii) the BigJob Agent, which represents the pilot job and thus acts as the application-level resource manager on the respective resource; and (iii) the advert service, which is used for communication between the BigJob Manager and the BigJob Agent.
Before running regular jobs, an application must initialize a bigjob object. The BigJob Manager then queues a pilot job, which runs a BigJob Agent on the respective resource. A specified number of resources is requested for this agent. Subsequently, sub-jobs can be submitted through the BigJob Manager using the jobID of the BigJob as a reference. Based on the specified jobID, the BigJob Manager ensures that the sub-jobs are launched on the correct resource with the right number of processes. Communication between the BigJob Agent and the BigJob Manager is carried out using the SAGA advert service, a central key/value store. For each new job, the BigJob Manager creates an advert entry. The agent periodically polls for new jobs. If a new job is found and resources are available, the job is dispatched; otherwise, it is queued.
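The interaction described above can be illustrated with a small, self-contained simulation. This is only a sketch of the protocol, not the BigJob or SAGA API: the class names (AdvertStore, Manager, Agent), the key layout, the state strings, and the dispatch policy for waiting jobs are all assumptions made for illustration, and a plain in-memory dictionary stands in for the central advert service.

```python
from collections import deque


class AdvertStore:
    """Hypothetical in-memory stand-in for the central key/value advert service."""

    def __init__(self):
        self._entries = {}

    def put(self, key, value):
        self._entries[key] = value

    def get(self, key):
        return self._entries[key]

    def keys(self, prefix):
        # Zero-padded sub-job numbers make sorted order equal submission order.
        return sorted(k for k in self._entries if k.startswith(prefix))


class Manager:
    """Creates one advert entry per submitted sub-job (assumed key layout)."""

    def __init__(self, store):
        self.store = store
        self._count = 0

    def submit_subjob(self, pilot_id, executable, processes):
        self._count += 1
        key = f"{pilot_id}/subjob-{self._count:04d}"
        self.store.put(key, {"executable": executable,
                             "processes": processes,
                             "state": "New"})
        return key


class Agent:
    """Polls the store for its pilot's sub-jobs and dispatches what fits."""

    def __init__(self, store, pilot_id, total_processes):
        self.store = store
        self.pilot_id = pilot_id
        self.free = total_processes
        self.waiting = deque()

    def poll(self):
        # Step 1: pick up entries the manager has added since the last poll.
        for key in self.store.keys(self.pilot_id + "/"):
            job = self.store.get(key)
            if job["state"] == "New":
                job["state"] = "Queued"  # entry is a shared dict, updated in place
                self.waiting.append(key)
        # Step 2: dispatch every waiting job that fits; keep the rest queued.
        still_waiting = deque()
        while self.waiting:
            key = self.waiting.popleft()
            job = self.store.get(key)
            if job["processes"] <= self.free:
                self.free -= job["processes"]
                job["state"] = "Running"  # a real agent would launch the job here
            else:
                still_waiting.append(key)
        self.waiting = still_waiting

    def finish(self, key):
        # Release the processes held by a completed sub-job.
        job = self.store.get(key)
        job["state"] = "Done"
        self.free += job["processes"]


store = AdvertStore()
manager = Manager(store)
agent = Agent(store, "bigjob-1", total_processes=4)

a = manager.submit_subjob("bigjob-1", "/bin/date", processes=2)
b = manager.submit_subjob("bigjob-1", "/bin/date", processes=4)

agent.poll()     # a fits and runs; b needs 4 processes and stays queued
agent.finish(a)  # a completes and frees its 2 processes
agent.poll()     # b now fits and is dispatched
```

In this sketch the manager and agent never call each other directly; they only read and write entries in the shared store, which mirrors the role the advert service plays between the BigJob Manager and the BigJob Agent.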