Demystifying the inner workings of BFGS optimization

Introduction

Anyone who has dabbled in machine learning is surely familiar with gradient descent, and possibly even its close counterpart, stochastic gradient descent. If you have more than dabbled, then you are likely also aware of fancier extensions such as gradient descent with momentum and Adam optimization.

Perhaps less well-known is a class of optimization algorithms known as quasi-Newton methods. Though these methods are less fervently advertised in popular accounts of machine learning, they hold an important place in the arsenal of machine learning practitioners.

The goal of this article is to provide an introduction to the mathematical formulation of BFGS optimization, by far the most widely used quasi-Newton method. As such, the focus will be on the mathematical derivation of results, rather than the application of BFGS in code. It is my hope that by the end of the article, you will have gained an appreciation for what BFGS is, how it works, and why it was developed. A top priority of mine when crafting this article was to make it as accessible as possible. To this end, any non-trivial derivation will be explicitly shown, and the dreaded phrase “it is straightforward to show” will never make an appearance. …
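Although the focus here is on the mathematics rather than code, it may help to see where BFGS shows up in practice before diving in. The sketch below uses SciPy's `minimize` with `method="BFGS"` on the Rosenbrock function, a standard test problem chosen purely for illustration (it is not part of this article's derivations):

```python
import numpy as np
from scipy.optimize import minimize

# Rosenbrock function: a classic non-convex test problem whose minimum is at (1, 1).
def rosenbrock(x):
    return (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2

# Analytic gradient; BFGS uses successive gradients to build
# its approximation to the (inverse) Hessian.
def rosenbrock_grad(x):
    return np.array([
        -2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0]**2),
        200 * (x[1] - x[0]**2),
    ])

result = minimize(rosenbrock, x0=np.array([-1.2, 1.0]),
                  jac=rosenbrock_grad, method="BFGS")
print(result.x)  # converges to approximately [1.0, 1.0]
```

Note that BFGS never forms the true Hessian; it maintains a running estimate updated from gradient evaluations alone, which is precisely the machinery this article unpacks.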