Demystifying the inner workings of BFGS optimization

Anyone who has dabbled in machine learning is surely familiar with gradient descent, and possibly even its close counterpart, stochastic gradient descent. If you have more than dabbled, then you’re probably also aware of fancier extensions like gradient descent with momentum and Adam optimization.

Perhaps less well known is a class of optimization algorithms called quasi-Newton methods. Though these methods are less fervently advertised in popular accounts of machine learning, they hold an important place in the arsenal of machine learning practitioners.

The goal of this article is to provide an introduction to the mathematical formulation of BFGS…

