The DAGonCAPIO project (“Directed Acyclic Graph on Cross-Application Programmable I/O”) has four general objectives:
- Formalize, design, and implement an engine for scientific workflows capable of maximizing the overlap between I/O and computation;
- Integrate the management of the scratch directory in the workflow engine itself through an ad hoc file system;
- Develop an industrial tool leveraging HPC-Cloud-AI convergence, thanks to the synergy with the companies making up the partnership;
- Demonstrate the functionality of DAGonCAPIO through an application that leverages DNSH objectives.
Formalize, design, and implement
To achieve the first objective, it is necessary to introduce the concept of I/O streaming into the definition of the data flow represented by a DAG. After defining the output of a single iteration of the data producer as a complete data unit, we must define the producer/consumer relationship as a discrete temporal dependency aimed at processing each data unit individually.
Consider two tasks, A and B, where A, through n iterations, produces n data units by writing n files, and B, through n iterations, consumes the n units produced by A. We need to define the per-unit relation B(A_i) → B_i, with i ranging over [0..n[. The proposed formalization is therefore applicable whenever the cardinality of the data units produced by A depends on an iterative cycle whose number of iterations may not be known a priori. Cases in which A executes all n iterations before producing a single output file will not be immediate beneficiaries of the proposed approach.
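The discrete dependency B(A_i) → B_i can be sketched with a minimal, purely illustrative Python pipeline (the task names and the generator-based composition are assumptions for the sake of the example, not DAGonCAPIO code): B starts processing unit i as soon as A has produced it, instead of waiting for all n iterations of A to complete.

```python
def task_a(n):
    """Producer A: emits one data unit per iteration (simulating n files A_0..A_{n-1})."""
    for i in range(n):
        yield f"unit-{i}"            # in practice: write file A_i to the scratch directory

def task_b(units):
    """Consumer B: processes each unit as soon as it becomes available."""
    for i, unit in enumerate(units):
        yield f"B({unit}) -> B_{i}"  # in practice: read A_i, write B_i

# Streaming composition: B's iterations overlap with A's, realizing B(A_i) -> B_i.
results = list(task_b(task_a(3)))
```

A batch formulation would instead materialize the whole output of `task_a` before invoking `task_b`, which is exactly the case the proposal identifies as a non-beneficiary.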
Integrate in the workflow
Once the dependency on data produced in a “progressive” way is formalized, it will be necessary to design the software component managing the proposed pattern using the workflow:// scheme, which groups the tasks of a workflow so that they are represented as a virtual directory tree whose root is the name of the workflow. With the workflow:// scheme it is already possible to define a parallel data dependency: for each data unit produced at the end of the computation, a number of consumer tasks equal to the cardinality n is instantiated. Following an approach similar to the one used to formalize the parallel pattern, the progressive dependency relationship between two tasks will be designed.
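The naming convention described above can be illustrated with a small parsing sketch: a workflow:// reference is treated as a path in a virtual directory tree rooted at the workflow name. The helper function and the example URI below are hypothetical, shown only to make the scheme concrete.

```python
from urllib.parse import urlparse

def parse_workflow_uri(uri):
    """Split workflow://<workflow>/<task>/<output> into its components."""
    parts = urlparse(uri)
    if parts.scheme != "workflow":
        raise ValueError(f"not a workflow:// reference: {uri}")
    # The authority component is the workflow name (the root of the virtual tree);
    # the remaining path addresses a task and, optionally, one of its outputs.
    return parts.netloc, parts.path.lstrip("/").split("/")

wf, path = parse_workflow_uri("workflow://forecast/wrf/output_0.nc")
# wf is the workflow name, path the task/output components under its virtual root
```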
Develop an industrial tool leveraging
The DAGonCAPIO workflow engine will be built on DAGonStar, preserving its main architectural features but redesigning the execution core to take advantage of a completely container-oriented design. In this way, the ability to execute a single task on different computational resources is enhanced. The progressive dependency between two tasks will be implemented through integration with CAPIO. DAGonCAPIO will be equipped with a software component capable of processing workflow:// references and automatically generating what the CAPIO I/O coordination language needs to annotate workflow data dependencies with synchronization semantics. By doing so, at runtime, the user-space middleware component of CAPIO automatically and transparently transforms the data dependency relationship into an I/O streaming execution.
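The translation step can be sketched as follows. This is an assumption-laden illustration, not the actual CAPIO coordination language: the rule structure, the `"on_close"` commit semantics, and the function name are hypothetical stand-ins for whatever the real component will emit.

```python
def dependencies_to_rules(deps):
    """Hypothetical translator: given workflow data dependencies as a mapping
    {produced_file: consumer_task}, emit one per-file synchronization rule
    stating that the file becomes readable as soon as the producer closes it,
    so the consumer can start streaming instead of waiting for the whole task."""
    return [
        {"file": f, "committed": "on_close", "consumer": c}
        for f, c in deps.items()
    ]

# Two files produced progressively by task A, both consumed by task B.
rules = dependencies_to_rules({"A_0.dat": "B", "A_1.dat": "B"})
```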
Demonstrate the functionality of DAGonCAPIO
As mentioned previously, most workflow engines work in situ, forcing the use of a file system shared among all computational nodes or, in a more relaxed way, using data transfer protocols for stage-in and stage-out operations to the computing resources where execution occurs. Some workflow engines, such as DAGonStar, mitigate the centralized shared-file-system approach by leveraging only backward references and limiting staging to transfers from the producer task to the consumer task. By integrating the features of CAPIO and DAGonStar into DAGonCAPIO, it will be possible to manage staging operations asynchronously. This enables a form of in-transit data processing, with the scratch directory exposed as a resource of a user-space file system. Thanks to the partner companies, the objective of this project proposal is to make DAGonCAPIO a tool that can be adopted in the industrial sector as an orchestrator for business applications in which HPC-AI convergence is realized through High-Performance Cloud Computing. The hypothesized demonstrator, an interdisciplinary environmental modelling application, will highlight how the synergy between research institutions and businesses addresses most of the six objectives envisaged by the DNSH principle.
Expected impact of the research program
In an experimental comparison of DAGonStar with other workflow engines, the use of this workflow engine on multiple infrastructure types was demonstrated, with a 50.19% improvement in execution time when using parallel patterns [10]. Meanwhile, CAPIO was tested on synthetic benchmarks simulating typical workflow I/O patterns and on two real-world workflows. Experiments show that CAPIO reduces execution time by 10% to 66% for data-intensive workflows that use the file system as the communication medium.
Since DAGonCAPIO is designed to combine the best of the two design solutions described above, a substantial improvement in the execution times of workflow-based applications that use files as the data exchange medium between tasks is expected. In particular, as will be demonstrated through the use case, applications for the simulation and prediction of environmental phenomena (detailed atmospheric forecasts, transport and diffusion of pollutants in the sea and the air, wave motion prediction, scenario analysis for climate change mitigation, etc.), in which one of the primary requirements is the speed with which results are produced and made available, will benefit from a continuously produced output. For example, consider a detailed weather forecast in which the data for the next 24 hours is produced at a cost of approximately 5 minutes per forecast hour for running the weather model and approximately 1 minute per forecast hour for post-processing: with a traditional workflow engine such as DAGonStar, the first useful output is available after 144 minutes, whereas with DAGonCAPIO it is estimated that the first useful output will be available 6 minutes after the start of the computation.
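The figures above can be checked with a quick back-of-the-envelope calculation (the per-hour timings are those quoted in the text; the variable names are ours):

```python
MODEL_MIN, POST_MIN, HOURS = 5, 1, 24   # minutes per forecast hour, forecast horizon

# Batch execution (traditional engine): post-processing depends on the complete
# model output, so the first useful result appears only once all work is done.
first_output_batch = HOURS * (MODEL_MIN + POST_MIN)

# Streaming execution (DAGonCAPIO): forecast hour 0 is post-processed as soon as
# the model emits it, so the first useful result costs only one hour's work.
first_output_streaming = MODEL_MIN + POST_MIN
```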
References
- Martinelli, Alberto Riccardo, Massimo Torquati, Marco Aldinucci, Iacopo Colonnelli, and Barbara Cantalupo. “CAPIO: a Middleware for Transparent I/O Streaming in Data-Intensive Workflows.” In 2023 IEEE 30th International Conference on High-Performance Computing, Data, and Analytics (HiPC), pp. 153-163. IEEE, 2023.
- Sánchez-Gallegos, Dante Domizzi, Diana Di Luccio, Sokol Kosta, J. L. Gonzalez-Compean, and Raffaele Montella. “An efficient pattern-based approach for workflow supporting large-scale science: The DagOnStar experience.” Future Generation Computer Systems 122 (2021): 187-203.