ACML 2020 Tutorial: Optimization Methods for Machine Learning

Description
Machine learning algorithms often rely on optimization: when a model is constructed to fit data, it is usually trained by solving an underlying optimization problem, in which the model parameters are learned by minimizing a loss function, possibly together with a regularization term. During model selection and validation, this optimization problem may be solved many times. This entwining of machine learning and optimization makes it possible for researchers to use advances in mathematical programming to study the speed, accuracy and robustness of machine learning algorithms. In this tutorial, we will investigate how popular machine learning algorithms can be posed as unconstrained optimization problems and solved using well-known techniques from the literature, including line search methods, Newton and quasi-Newton methods, and conjugate gradient and projection methods. Implementations of the algorithms and illustrative examples in the R programming language will be presented.
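To make the connection concrete, below is a minimal sketch, not part of the tutorial materials, of one such formulation: regularized logistic regression posed as an unconstrained problem and trained with batch gradient descent in R. The synthetic data, the function names (sigmoid, loss, grad) and the parameter choices (step length, regularization weight) are illustrative assumptions.

    # Minimal sketch: logistic regression as unconstrained optimization,
    # trained with batch gradient descent (illustrative, not the tutorial's code).
    set.seed(1)

    # Synthetic data: n observations, p features, binary labels in {0, 1}.
    n <- 200; p <- 2
    X <- cbind(1, matrix(rnorm(n * p), n, p))   # prepend an intercept column
    beta_true <- c(-0.5, 2, -1)
    y <- rbinom(n, 1, plogis(X %*% beta_true))

    sigmoid <- function(z) 1 / (1 + exp(-z))

    # Objective: average logistic loss plus an L2 regularization term.
    loss <- function(beta, lambda = 0.01) {
      z <- X %*% beta
      mean(log(1 + exp(z)) - y * z) + lambda * sum(beta^2) / 2
    }

    # Gradient of the objective with respect to beta.
    grad <- function(beta, lambda = 0.01) {
      as.vector(t(X) %*% (sigmoid(X %*% beta) - y)) / n + lambda * beta
    }

    # Batch gradient descent with a fixed step length.
    beta <- rep(0, p + 1)
    step <- 0.5
    for (iter in 1:500) {
      beta <- beta - step * grad(beta)
    }
    round(beta, 3)   # learned coefficients
    loss(beta)       # objective value at the learned parameters

Posing the model this way makes it straightforward to swap in other solvers covered in the tutorial, such as a line search or a quasi-Newton update, without changing the objective or its gradient.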
Relevance and Significance
The topic is relevant to the machine learning community since unconstrained optimization algorithms can be used to solve the problems underlying many popular machine learning methods, including but not limited to empirical risk minimization and supervised, semi-supervised, unsupervised, relational and deep learning techniques.
Prerequisites
This tutorial will assume prior background in linear algebra and probability theory. It will also assume familiarity with the R programming language, since some of the algorithms will be implemented in R and the code made available to the audience.
Tutorial Structure
The following topics will be discussed in the tutorial:
  • Motivation -- Case Studies
  • Introduction and Background
    • Why Optimize?
    • Optimization Formulation -- Modeling, Objective Function, Constraints
    • Local versus Global, Stochastic versus Deterministic, Constrained versus Unconstrained Optimization
    • Step Length Selection Methods
    • Armijo Condition (illustrated in the R sketch following this outline)
    • Wolfe and Goldstein Conditions
  • Iterative Descent Methods
  • First Order Methods
    • Batch Gradient Descent
    • Stochastic Gradient Descent
    • Mini-Batch Methods
    • Scaling
    • Momentum Methods -- Heavy Ball and Nesterov
    • Stochastic Variance Reduced Gradient (SVRG) Descent
    • Optimal Learning Rate Schedules
  • Second Order Methods
    • Newton's Method
    • Quasi-Newton Method
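As an illustration of how the step length material fits with the descent methods listed above, here is a minimal R sketch, not taken from the tutorial's materials, of a backtracking line search that enforces the Armijo (sufficient decrease) condition inside steepest descent. The function names (backtrack, grad_descent), the constants (c1, the shrink factor) and the quadratic test problem are illustrative choices.

    # Minimal sketch: backtracking line search enforcing the Armijo condition,
    # used to choose the step length inside steepest descent (illustrative only).

    # Shrink the trial step t until f(x + t * d) <= f(x) + c1 * t * <grad f(x), d>.
    backtrack <- function(f, gx, x, d, t0 = 1, c1 = 1e-4, shrink = 0.5) {
      t <- t0
      while (f(x + t * d) > f(x) + c1 * t * sum(gx * d)) {
        t <- shrink * t
      }
      t
    }

    # Steepest descent with the Armijo step length at every iteration.
    grad_descent <- function(f, grad, x0, tol = 1e-8, max_iter = 1000) {
      x <- x0
      for (k in 1:max_iter) {
        gx <- grad(x)
        if (sqrt(sum(gx^2)) < tol) break   # stop when the gradient is small
        d <- -gx                           # steepest descent direction
        t <- backtrack(f, gx, x, d)
        x <- x + t * d
      }
      list(x = x, iterations = k, value = f(x))
    }

    # Test problem: minimize f(x) = 0.5 * x'Ax - b'x for a positive definite A.
    A <- matrix(c(3, 1, 1, 2), 2, 2)
    b <- c(1, -1)
    f <- function(x) 0.5 * sum(x * (A %*% x)) - sum(b * x)
    g <- function(x) as.vector(A %*% x) - b

    grad_descent(f, g, x0 = c(5, 5))   # converges to A^{-1} b = (0.6, -0.8)

The Wolfe conditions listed in the outline augment this sufficient decrease test with a curvature condition, and the same skeleton can be adapted to enforce them.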
GitHub
Code and slides for the tutorial are available here.
Presenter
Haimonti Dutta is an Assistant Professor in the Department of Management Science and Systems (MSS), School of Management, State University of New York (SUNY) at Buffalo, NY. She is also a faculty member of the Computational and Data-Enabled Science and Engineering (CDSE) Program at UB. Prior to her current appointment, she served as an Associate Research Scientist at the Center for Computational Learning Systems (CCLS) at Columbia University, NY. Her research broadly focuses on machine learning, distributed optimization, and large-scale distributed and parallel mining, with a primary focus on a relatively new sub-field of machine learning called Consensus-Based Machine Learning. Her research has been generously supported by the US Federal Government (including the National Science Foundation and the National Endowment for the Humanities), several industry partners (including Amazon Web Services, EMC, MathWorks Inc., the Epilepsy Research Foundation and the Consolidated Edison Company of New York), and the School of Management and the College of Arts and Sciences at UB.
References
  • Numerical Optimization. J. Nocedal and S. J. Wright. Springer.
  • Convex Optimization Algorithms. D. P. Bertsekas. Athena Scientific.
  • Convex Optimization. S. Boyd and L. Vandenberghe. Cambridge University Press.
  • Optimization Methods for Large-Scale Machine Learning. L. Bottou, F. E. Curtis, and J. Nocedal. SIAM Review.
  • Optimization for Machine Learning. S. Sra, S. Nowozin, and S. J. Wright. MIT Press.
  • The Interplay of Optimization and Machine Learning Research. K. P. Bennett and E. Parrado-Hernández. Journal of Machine Learning Research.