PARALLEL LINEAR SOLVERS FOR RESERVOIR SIMULATIONS
MAVLIUTOV, ARTEM
2026
Abstract
Efficient and scalable linear solvers are critical for implicit reservoir simulation, where the linear solver can account for up to 90% of total runtime. In this work, we present an integrated software and algorithmic framework that (i) replaces the inherently sequential ILU stage of the popular CPR preconditioner with a highly parallel FSAI preconditioner enhanced by an augmented decoupling mechanism, and (ii) accelerates the AMG preconditioner with a multi-GPU SpGEMM library. On the CPR preconditioning side, the default ILU relaxation of the global system is replaced by the FSAI preconditioner, while the AMG-based local pressure correction is retained. To improve FSAI in the presence of strong transport-induced couplings, a local block-diagonal decoupling is applied on small cell blocks. The resulting fully local decoupling approach combines quasi-IMPES scaling to reduce pressure–saturation couplings, a dynamic row summation to ensure solvability by AMG, and constrained pressure decoupling to improve the effectiveness of the FSAI preconditioner. Together, these yield a highly effective and scalable CPR preconditioning framework. We implemented the preconditioned solver suite in C++/MPI and evaluated it with the OPM simulator on the Norne, SPE11C, and Sleipner benchmarks. The resulting framework, deco, matches or improves upon the default OPM solvers (DUNE and AMGCL) in a sequential setting (1 MPI rank) and delivers 2–4× speedups in strong-scaling tests with up to 16 MPI ranks. On the SpGEMM side, we developed a C++/MPI/CUDA library called nsp that builds on the single-GPU nsparse kernel and provides a multi-GPU extension employing a 1D row-wise partitioning to minimize inter-GPU communication and avoid host-mediated transfers. Each GPU executes kernels independently (task parallelism across GPUs) while exploiting data parallelism internally.
The multi-GPU framework demonstrated strong scalability on the Leonardo supercomputer in tests employing up to 512 concurrent GPUs, processing input matrices with up to ∼15 billion nonzeros and producing outputs with up to ∼52 billion nonzeros. Starting from the nsparse single-GPU kernel, we introduced several kernel-level improvements: a dynamic sparse accumulator with an optimized hash search and update, improved workload balancing for irregular sparsity, and specialized kernels for long rows. These yield several-fold speedups both in square product tests (A²) and in coarse-level operator (RAP) tests, which correspond to a double SpGEMM product. The proposed nsp library showed up to 2× speedup over the original nsparse library and up to 6× speedup over the cuSPARSE library.
| File | License | Size | Format |
|---|---|---|---|
| final_thesis_Artem_Mavliutov.pdf (embargo until 03/02/2027) | All rights reserved | 2.93 MB | Adobe PDF |
Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/20.500.14242/356943
URN:NBN:IT:UNIPD-356943