
Advances on iterative methods for large-scale optimization, accuracy benchmarking methods and applications on informed learning

PIERMARINI, CHRISTIAN
2025

Abstract

In recent years, industrial and information systems have improved significantly, and the ability to handle substantial amounts of information has become essential to remain competitive in the modern world. Computers and other technological tools have become extremely powerful, but handling and managing large datasets efficiently and effectively remains an important challenge. Many real-world problems involve thousands of decision variables and, from an optimization point of view, fall within the class of large-scale optimization problems. Such problems arise very frequently when modeling systems with a very complex structure. Generic state-of-the-art algorithms may turn out to be inefficient when the dimension of the optimization problem becomes excessively high. Therefore, in most cases it is inappropriate to tackle problems with a large number of variables by using one of the many existing algorithms designed for the small-scale case, relying only on the ever-improving performance of modern machines.

Given these premises, it is clear that, in large-scale settings, optimization methods which employ iterative procedures should be preferred to direct optimization algorithms. The latter class of methods can be extremely expensive when dealing with large-scale optimization problems, owing to the possible storage of high-dimensional matrices and the use of factorization techniques. Iterative methods, on the other hand, generate a sequence of improving approximate solutions according to a suitable rule in which each approximation is derived from the previous ones. In this way, the computational cost of the algorithm can be reduced to a degree that can be set by the user, balancing the trade-off between the accuracy of the solutions and the maximum computational cost allowed.

These considerations motivated our work, which focuses on the use of Truncated Newton methods to solve large-scale optimization problems. These algorithms employ descent directions computed by an iterative method as approximate solutions of Newton's system of equations. This feature makes Truncated Newton methods well suited to high-dimensional optimization problems in which explicitly forming and factorizing the Hessian matrix is impracticable. In particular, the challenge we faced was to improve the quality of the solutions achieved by the optimization procedure through the use of negative curvature directions, allowing the algorithm to better exploit second-order information. It was essential to compute these directions as a by-product of the original SYMMBK algorithm, minimizing any matrix-related storage and computation. It is therefore important to point out that, even though our Truncated Newton methods use second-order information on the objective function, they are "matrix-free" (Hessian-free): the Hessian matrix is never stored and is always accessed by means of a routine which provides the product of the Hessian with a vector. For quadratic functions, Hessian-vector products can be computed at a cost which is a small multiple of the expense of a gradient evaluation (see, e.g., [8]). This feature makes these methods very attractive also in recently emerged machine learning applications, including deep neural network training.
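As a minimal illustration of this Hessian-free access (not taken from the thesis), the Hessian-vector products required by a Truncated Newton inner solver can be approximated with a single extra gradient evaluation via a forward finite difference; the function names below are hypothetical.

```python
import numpy as np

def hessian_vector_product(grad, x, v, eps=1e-6):
    """Approximate H(x) @ v by a forward difference of gradients:
    H(x) v ~ (grad(x + eps*v) - grad(x)) / eps.
    The Hessian is never formed or stored; only gradients are evaluated."""
    return (grad(x + eps * v) - grad(x)) / eps

# Check on a quadratic f(x) = 0.5 * x^T A x, whose Hessian is exactly A.
A = np.array([[4.0, 1.0], [1.0, 3.0]])
grad = lambda x: A @ x                     # gradient of the quadratic
x = np.array([1.0, -2.0])
v = np.array([0.5, 1.0])
print(hessian_vector_product(grad, x, v))  # ~ A @ v = [3.0, 3.5]
```

For a quadratic objective the difference is exact, consistent with the cost estimate cited above; for general smooth functions, automatic differentiation provides an alternative with a comparable cost.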
Such problems arise, for instance, in training deep neural networks, in low-rank subspace clustering [62] and in many other applications. Moreover, in several contexts (see, e.g., statistical physics, random matrix theory and the training of multilayer perceptron neural networks [3, 9, 17, 26]) negative curvature directions are a useful tool for an algorithm to escape from saddle points. As clearly pointed out in the recent paper by Curtis and Robinson [24], there is a strong need for new methods capable of efficiently solving nonconvex problems by exploiting negative curvature directions, both in deterministic and in stochastic optimization. These authors also observe that few algorithms converging to second-order critical points have been developed so far. The main reasons are the excessive computational burden associated with the evaluation of negative curvature directions, together with insufficient evidence of the benefits deriving from incorporating such directions.

To provide a broad yet thorough overview of our method and its theoretical properties, in Chapter 1 we recall the main results about iterative methods for solving large symmetric linear systems. In particular, we introduce both the Conjugate Gradient method and the Lanczos process [68]. Finally, we present the SYMMBK method, a state-of-the-art framework that we used as a basis to build our optimization algorithm. In Chapter 2 we deal with large-scale unconstrained optimization problems, recalling quasi-Newton methods with a focus on limited-memory quasi-Newton algorithms, and the Nonlinear Conjugate Gradient method, describing the restarting strategy and the possibility of using preconditioners to improve the efficiency of the algorithms. Chapter 3 presents Truncated Newton methods in depth, focusing on linesearch-based methods, their structure and their convergence properties.

The main research contribution is given by Chapters 3, 4 and 5. In Chapter 3, we explain how to compute and employ negative curvature directions within the SYMMBK procedure in order to improve the quality of the achieved solutions. Of course, the proposed modifications of the SYMMBK algorithm take the large-scale optimization framework into due consideration. The distinguishing feature of our proposal is that the computation of negative curvature directions is essentially carried out as a by-product of vectors already available within the SYMMBK procedure, storing no more than two additional vectors. Hence, no explicit matrix factorization or matrix storage is required. We performed extensive numerical experimentation, which has shown that, with our modified version of the SYMMBK method, the use of negative curvature directions allows much better solutions to be reached with only a slight increase in computational time. In order to compare the results achieved with all the solvers on the entire test set, a formal benchmarking method was needed. To the best of our knowledge, however, no benchmarking procedure in the literature correctly fitted our needs, since the state-of-the-art methods (namely, performance profiles and data profiles) focus on gauging the efficiency of the algorithms. Therefore, in Chapter 4 we describe our new benchmarking method for smooth optimization problems, namely Quality profiles.
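The thesis extracts negative curvature directions inside the SYMMBK/Lanczos machinery itself; as a purely illustrative stand-in (not the SYMMBK scheme), the sketch below shows how a standard conjugate-gradient inner solver for the Newton system H d = -g can detect negative curvature using only Hessian-vector products. All function names are hypothetical.

```python
import numpy as np

def cg_with_negative_curvature(hess_vec, g, max_iter=100, tol=1e-8):
    """Solve H d = -g by conjugate gradients, using only Hessian-vector
    products. If a search direction p with p^T H p <= 0 is met, stop and
    return it as a (non-positive) curvature direction; this is the classical
    truncated-Newton safeguard, not the SYMMBK scheme used in the thesis."""
    d = np.zeros_like(g)
    r = -g.copy()                  # residual of H d = -g at d = 0
    p = r.copy()
    for _ in range(max_iter):
        Hp = hess_vec(p)
        curv = p @ Hp
        if curv <= 0:
            return d, p            # p has non-positive curvature
        alpha = (r @ r) / curv
        d = d + alpha * p
        r_new = r - alpha * Hp
        if np.linalg.norm(r_new) < tol:
            break
        beta = (r_new @ r_new) / (r @ r)
        p = r_new + beta * p
        r = r_new
    return d, None                 # no negative curvature encountered

# Example with an indefinite Hessian: negative curvature is detected.
H = np.diag([2.0, -1.0, 3.0])
g = np.array([1.0, 1.0, 1.0])
d, neg_dir = cg_with_negative_curvature(lambda v: H @ v, g)
print(neg_dir)                     # non-None: a direction with p^T H p <= 0
```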
Our aim is to formulate a comparative procedure which mainly focuses on the accuracy of the achieved solutions, rather than on some performance measure associated with the computational burden. Moreover, our tool has been developed to provide a complete overview of the capability of different procedures to reach accurate solutions, independently of the optimization framework and test set. In the dissertation, we thoroughly describe the theoretical properties that quality profiles enjoy, followed by applications of this tool to two different numerical experiments. In the first case, we consider four distinct optimization procedures and collect the results achieved on the same test set, summarizing them through the quality profiles. In the second case, we employ the quality profiles to compare different variants of the Truncated Newton procedure described in the previous chapter. At the end, conclusions are given.

Finally, in Chapter 5, a real-world application of iterative methods in a large-scale optimization setting is given. Specifically, a methodology for informed learning is presented and its effectiveness is demonstrated through numerical experiments with physics-informed learning problems. In this context, we tackle the problem of training machine learning models in both unconstrained and constrained settings, highlighting that, in both instances, an indefinite linear system of equations must be solved. Moreover, given the high complexity of the problem tackled, an iterative optimization method is applied to compute a satisfactory approximate solution. The methodology has three main distinguishing features. Firstly, prior information is introduced into the training problem through hard constraints rather than by means of the now-typical practice of using soft constraints (i.e., regularization terms). Secondly, the methodology does not employ penalty-based (e.g., augmented Lagrangian) methods, since the use of such methods results in an overall methodology that is similar to a soft-constrained approach. Rather, the methodology is based on a recently proposed stochastic-gradient-based algorithm that maintains computational efficiency while handling constraints with a Newton-based technique. Thirdly, a new projection-based variant of the well-known Adam optimization methodology is proposed for settings with hard constraints (a generic sketch of such a projected update is given below). Numerical experiments on a set of physics-informed learning problems show that, when compared with a soft-constraint approach, the proposed methodology can be easier to tune, can lead to accurate predictions more quickly, and can lead to better final prediction accuracy.

The research activities presented in Chapters 3 and 4 have been carried out in collaboration with Professor Massimo Roma (thesis advisor) of the Department of Computer, Control and Management Engineering of 'La Sapienza' University of Rome, and with Professor Giovanni Fasano from 'Ca' Foscari' University of Venice. The research activity discussed in Chapter 5 has been conducted in collaboration with Professor Frank E. Curtis during a visiting research period (January-May 2024) at Lehigh University.
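As a hedged sketch of what a projection-based Adam update for hard constraints can look like (a generic construction, not the specific variant proposed in Chapter 5), each Adam step is followed by a projection onto the feasible set; the projection operator and the box constraint below are hypothetical.

```python
import numpy as np

def projected_adam_step(w, grad, state, project, lr=1e-3,
                        beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update followed by a projection onto the feasible set.
    `project` maps a point onto the (assumed) hard-constraint set; this is a
    generic projected-Adam illustration, not the thesis' algorithm."""
    m, v, t = state
    t += 1
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)             # bias-corrected first moment
    v_hat = v / (1 - beta2**t)             # bias-corrected second moment
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    w = project(w)                         # enforce the hard constraints
    return w, (m, v, t)

# Usage: keep the weights inside the box [-1, 1]^n (hypothetical constraint).
w = np.array([0.5, -0.2])
state = (np.zeros_like(w), np.zeros_like(w), 0)
grad = np.array([0.1, -0.3])               # stochastic gradient estimate
w, state = projected_adam_step(w, grad, state,
                               project=lambda x: np.clip(x, -1.0, 1.0))
```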
21 January 2025
English
ROMA, Massimo
PALAGI, Laura
Università degli Studi di Roma "La Sapienza"
Files in this item:
Tesi_dottorato_Piermarini.pdf (open access, Adobe PDF, 2.27 MB)

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14242/189675
The NBN code of this thesis is URN:NBN:IT:UNIROMA1-189675