
PARALLEL LINEAR SOLVERS FOR RESERVOIR SIMULATIONS

MAVLIUTOV, ARTEM
2026

Abstract

Efficient and scalable linear solvers are critical for implicit reservoir simulation, where the linear solver can account for up to 90% of total runtime. In this work, we present an integrated software and algorithmic framework that (i) replaces the inherently sequential ILU stage of the popular CPR preconditioner with a highly parallel FSAI preconditioner enhanced by an augmented decoupling mechanism, and (ii) accelerates the AMG preconditioner with an SpGEMM library. On the CPR preconditioning side, the default global-system relaxation with ILU is replaced by the FSAI preconditioner, while AMG is retained for the local pressure correction. To improve FSAI in the presence of strong transport-induced couplings, a local block-diagonal decoupling is applied on small cell blocks. The resulting fully local decoupling approach combines quasi-IMPES scaling to reduce pressure–saturation couplings, a dynamic row summation to ensure solvability by AMG, and constrained pressure decoupling to improve the effectiveness of the FSAI preconditioner. This yields a highly effective and scalable CPR preconditioning framework. We implemented the preconditioned solver suite in C++/MPI and evaluated it with the OPM simulator on the Norne, SPE11C, and Sleipner benchmarks. The resulting framework, deco, matches or improves upon the default OPM solvers (DUNE and AMGCL) in a sequential setting (1 MPI rank) and delivers 2–4× speedups in strong-scaling tests with up to 16 MPI ranks. On the SpGEMM side, we developed a C++/MPI/CUDA library called nsp that builds on top of the single-GPU nsparse kernel and provides a multi-GPU extension employing a 1D row-wise partitioning to minimize inter-GPU communication and avoid host-mediated transfers. Each GPU executes its kernels independently (task parallelism across GPUs) while exploiting data parallelism internally.
The multi-GPU framework demonstrated strong scalability on the Leonardo supercomputer in tests employing up to 512 concurrent GPUs, processing matrices with up to ∼15 billion nonzeros and producing outputs with up to ∼52 billion nonzeros. Starting from the nsparse single-GPU kernel, we introduced several kernel-level improvements, namely a dynamic sparse accumulator with an optimized hash search and update, improved workload balancing for irregular sparsity, and specialized kernels for long rows, yielding several-fold speedups in square-product tests (A²) and coarse-level operator (RAP) tests, the latter corresponding to a double SpGEMM product. The proposed nsp library showed up to a 2× speedup with respect to the original nsparse library and up to a 6× speedup with respect to the cuSPARSE library.
Date: 3 February 2026
Language: English
Supervisor: JANNA, CARLO
Università degli studi di Padova
File: final_thesis_Artem_Mavliutov.pdf (2.93 MB, Adobe PDF)
Under embargo until 03/02/2027
License: All rights reserved

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14242/356943
The NBN code of this thesis is URN:NBN:IT:UNIPD-356943