JuliaCon 2020

Taylor-Mode for Higher Order Automatic Differentiation

Higher-order AD can be achieved by nesting calls to first-order AD, but this approach has a cost that is exponential in the order of differentiation. Taylor-mode AD is a known alternative for computing higher derivatives efficiently. I describe this method, which is suitable for machine learning and capable of computing higher derivatives of large neural networks. I also describe improvements that make the algorithm for computing higher derivatives of ODE solutions matrix-free and scalable to high-dimensional systems.


Higher-order AD can be readily achieved by composing first-order AD, differentiating the code emitted at the lower orders. However, naively nesting first-order AD can require a number of evaluations that is exponential in the order of differentiation. This is because the contributions of lower-order derivatives to the higher orders are complicated; they are given by a general formula, the Faa di Bruno formula, which generalizes the chain rule to higher orders. Further, naive nesting of first-order differentiation may not efficiently share results computed at the lower orders. Many expressions that appear in the higher derivatives were already required to compute the lower orders, and recomputing them results in a time cost of O(exp K) in the differentiation order K.
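
For reference, a standard statement of the Faa di Bruno formula (written here in LaTeX; the description itself does not reproduce it) is

    \frac{d^n}{dx^n} f(g(x)) \;=\; \sum
      \frac{n!}{m_1!\, 1!^{m_1}\, m_2!\, 2!^{m_2} \cdots m_n!\, n!^{m_n}}
      \, f^{(m_1 + \cdots + m_n)}(g(x))
      \, \prod_{j=1}^{n} \bigl( g^{(j)}(x) \bigr)^{m_j}

where the sum ranges over all nonnegative integers m_1, ..., m_n with m_1 + 2 m_2 + ... + n m_n = n. The number of terms grows rapidly with n, which is the combinatorial source of the cost described above.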

Fortunately, there is a known solution in the AD literature, which we are calling Taylor-mode. It recognizes that the terms produced by higher-order differentiation are exactly the coefficients of a truncated Taylor polynomial (up to a constant factor, the factorial of the order). The desired higher derivatives of a composite function can be computed by propagating Taylor polynomials through its primitive functions.
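
As an illustration of this idea (a minimal sketch, not the implementation discussed in the talk; the Taylor type and the taylor_derivatives helper are hypothetical names), a truncated Taylor polynomial can be represented as a coefficient vector, propagated through a function built from + and *, and the higher derivatives read off by scaling with factorials:

    struct Taylor
        coeffs::Vector{Float64}   # coeffs[k+1] holds the coefficient of t^k
    end

    order(a::Taylor) = length(a.coeffs) - 1

    Base.:+(a::Taylor, b::Taylor) = Taylor(a.coeffs .+ b.coeffs)
    Base.:*(c::Real, a::Taylor)   = Taylor(c .* a.coeffs)

    function Base.:*(a::Taylor, b::Taylor)
        K = order(a)
        c = zeros(K + 1)
        for k in 0:K, j in 0:k
            c[k+1] += a.coeffs[j+1] * b.coeffs[k-j+1]   # truncated Cauchy product
        end
        Taylor(c)
    end

    # Seed the input as x0 + 1*t; propagating it through f gives the Taylor
    # coefficients of f at x0, so f^(k)(x0) = k! * coeffs[k+1].
    function taylor_derivatives(f, x0, K)
        x = Taylor([x0; 1.0; zeros(K - 1)])
        y = f(x)
        [factorial(k) * y.coeffs[k+1] for k in 0:K]
    end

    f(x) = x * x * x + 2.0 * x        # f(x) = x^3 + 2x
    taylor_derivatives(f, 1.5, 3)     # ≈ [6.375, 8.75, 9.0, 6.0]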

The implementation details are a direct generalization of first-order AD via "dual numbers". We define propagation rules for primitive operations, not just for the first-order sensitivity but for higher-order sensitivities. This allows us to propagate Taylor polynomials through functions composed of primitive operations. In general, these rules follow the Faa di Bruno formula and require a sub-exponential number of evaluations, O(exp sqrt(K)). However, many of the primitive operations we care about in practice have special forms, and their propagation rules can scale like O(K log K).
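
As one concrete case (a hedged sketch; exp_taylor is an illustrative name, not an API from the talk), the primitive exp has such a special form: y = exp(u) satisfies y' = u'·y, so its Taylor coefficients obey a simple recurrence costing O(K^2) rather than the general Faa di Bruno sum, and fast polynomial arithmetic can push rules of this kind toward the O(K log K) scaling mentioned above:

    # u and the returned vector hold Taylor coefficients:
    # u[k+1] is the coefficient of t^k in the input series.
    function exp_taylor(u::Vector{Float64})
        K = length(u) - 1
        y = zeros(K + 1)
        y[1] = exp(u[1])                          # y_0 = exp(u_0)
        for k in 1:K
            acc = 0.0
            for j in 1:k
                acc += j * u[j+1] * y[k-j+1]      # sum_{j=1}^{k} j*u_j*y_{k-j}
            end
            y[k+1] = acc / k                      # y_k = (1/k) * that sum
        end
        return y
    end

    # Example: exp around x0 = 0.3, seeded with u(t) = x0 + t.
    # The k-th derivative is k! * y[k+1], which here is always exp(0.3).
    y = exp_taylor([0.3, 1.0, 0.0, 0.0, 0.0])
    [factorial(k) * y[k+1] for k in 0:4]          # all ≈ exp(0.3)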

First-order AD is a fundamental tool in machine learning for gradient-based optimization, and it has modern, scalable implementations that integrate well with popular tools. However, higher-order AD has not found wide use, and the existing implementations do not integrate with machine learning research tools.

There is a special relationship between Taylor polynomials and ODEs, sketched in the example below. To exploit this relationship with Neural ODEs, we need AD tools that work with numerical integration and higher-order Taylor derivatives, and that scale to the many parameters found in machine learning models. Further, we can use these tools to improve known algorithms for computing higher derivatives of the solutions to ODEs.
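
To sketch that relationship in the scalar case (a toy example; the ODE x' = x^2 and the helper names are chosen here for illustration, not taken from the talk), the Taylor coefficients of an ODE solution can be built one order at a time by propagating the current truncation through the dynamics f and dividing by the order:

    # k-th coefficient of the truncated product of two coefficient vectors.
    prod_coeff(a, b, k) = sum(a[j+1] * b[k-j+1] for j in 0:k)

    # Taylor coefficients of the solution of x'(t) = x(t)^2, x(0) = x0,
    # using x_{k+1} = [f(x)]_k / (k + 1) with f(x) = x^2.
    function ode_taylor_coeffs(x0, K)
        x = zeros(K + 1)
        x[1] = x0
        for k in 0:K-1
            fk = prod_coeff(x, x, k)      # k-th coefficient of f(x) = x^2
            x[k+2] = fk / (k + 1)         # next solution coefficient
        end
        return x
    end

    # Exact solution: x(t) = x0 / (1 - x0*t), with coefficients x_k = x0^(k+1).
    ode_taylor_coeffs(0.5, 4)             # ≈ [0.5, 0.25, 0.125, 0.0625, 0.03125]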